December Adventure: March 17-18
Concurrency, Linking, and Self-Hosted Daemons
Continuing my Psion connectivity themed March December Adventure, I persisted with the process of addressing legacy threading issues in plptools—it’s important that we have a stable foundation on which we can build future functionality, and it’s still my intuition it’s better to build on top of what we have than start again.
Foundational Work
The plptools architecture centralises all the complexity in ncpd, the daemon responsible for implementing the Psion Link Protocol (PLP) and exposing different Psion-side server end-points to PC-side TCP clients. This makes the clients relatively simple at the cost of some gnarly internal code which is, unsurprisingly, where all our problems lie.
An overly simplified overview of that looks something like this:
---
config:
  class:
    hideEmptyMembersBox: true
---
classDiagram
    direction TB
    class NCPSession
    class NCP
    class Link
    class DataLink
    class LinkChannel
    class SocketChannel
    NCPSession "1" --> "1" NCP : ncp_
    NCP "1" --> "1" Link : link_
    NCP "1" --> "n" SocketChannel : channelPtr
    NCP "1" --> "1" LinkChannel : lChan
    Link "1" --> "1" DataLink : dataLink_
At first blush, this seems complex, but there’s quite a bit that needs to happen and the functionality is relatively well compartmentalized:
NCPSession: provides APIs to manage the full daemon life cycle
NCP: multiplexes channels to Psion-side servers, pairing them to connected PC-side TCP clients
LinkChannel: a special channel for communicating with the Psion to manage the overall connection
SocketChannel: PC-side TCP client end-point
Link: manages connection establishment and packet sequencing, transmission, and retransmission
DataLink: frames and un-frames messages, and writes and reads the serial port
Most of my effort over the last few days has been focused on DataLink: the goal is to ensure we have a robust core before focusing on other aspects of the architecture. Looking at the existing code, my theory is that this was originally written to be single-threaded and threading was added after the fact, with no locking at all. This has given me a wonderful opportunity to refresh my memory of C++’s take on mutexes, locks, and condition variables.
The introduction of locks has a significant impact on how we shut down ncpd: the various worker threads were relying on PTHREAD_CANCEL_ASYNCHRONOUS, which can stop them at any point during their execution, potentially leaving locks held and resulting in deadlock. With locks, we have to explicitly manage thread cancellation.
My PR for these changes has evolved over the past few days, and I think it’s nearly there. Thanks must go to Alex and Fabrice, the other plptools maintainers, for stoically reviewing and testing my many changes, and keeping me honest during the process.
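To make the cancellation point concrete, here is a minimal sketch of a worker thread that is asked to stop rather than being cancelled asynchronously. This is not the plptools code: Worker, stopRequested_, iterations_, and the 10 ms timeout are all names and values I've assumed for illustration. The idea is simply that the thread only ever exits at a point where it chooses to, so it can never be torn down while holding the mutex.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical worker (not a plptools class): loops until asked to stop.
class Worker {
public:
    Worker() : thread_([this] { run(); }) {}
    ~Worker() { stop(); }

    // Request shutdown and wait for the thread to exit; safe to call twice.
    void stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopRequested_ = true;
        }
        wake_.notify_all();
        if (thread_.joinable()) {
            thread_.join();
        }
    }

    int iterations() {
        std::lock_guard<std::mutex> lock(mutex_);
        return iterations_;
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        while (!stopRequested_) {
            ++iterations_;  // stand-in for real work done under the lock
            // Sleep until the timeout expires or stop() wakes us.
            // The mutex is released for the duration of the wait.
            wake_.wait_for(lock, std::chrono::milliseconds(10),
                           [this] { return stopRequested_; });
        }
        // unique_lock releases the mutex on the way out, so nothing is left held.
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    bool stopRequested_ = false;
    int iterations_ = 0;
    std::thread thread_;  // declared last so the members above exist before run() starts
};

// Usage sketch: let the worker run at least once, then shut it down cleanly.
void demo() {
    Worker worker;
    while (worker.iterations() < 1) {
        std::this_thread::yield();
    }
    worker.stop();
    assert(worker.iterations() >= 1);
}
```

The trade-off is latency: shutdown is only as prompt as the worker's next check of the flag, which is why the sketch pairs the flag with a condition variable instead of a plain sleep.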
Self-Hosted Daemons
With the various thread-safety improvements in place, I took another crack at adding a self-hosted daemon to plpftp to allow it to be run without a separate ncpd instance. This involves moving the daemon classes into ‘libplp’ to allow all the plptools commands to share them and, unfortunately, doing so still resulted in a near-immediate crash that looked like a double-free in BufferStore, our buffer convenience wrapper.
Fairly sure I wasn’t seeing a multi-threading issue this time around, I kept digging and, after altogether too long, realized the crash was a result of the free function itself being NULL. Unsurprisingly, when runtime fundamentals like this are missing, you get some incredibly strange and misleading crashes and call stacks. Thankfully the fix was easy: I had failed to link libgnu, a failure that macOS silently ignores, leaving all function pointers NULL. 🤦
With everything finally working as expected, I was able to make the change I’d intended three days prior, adding a simple call to conditionally start a daemon (NCPSession) instance if the serial port was passed as an argument to plpftp:
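// Signalled from the session's callback below.
Semaphore *sem = new Semaphore();
NCPSession *session = nullptr;
if (serialDevice) {
    // Only self-host a daemon when a serial port was passed to plpftp.
    session = new NCPSession(
        sockNum,
        115200,
        host,
        serialDevice,
        false,
        0,
        [](void *context, bool connected, int version) {
            static_cast<Semaphore *>(context)->signal();
        },
        sem);
    session->start();
    // Block until the callback above has fired.
    sem->wait();
}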
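For anyone unfamiliar with the wait/signal handshake in that snippet, here is a minimal, self-contained sketch of the pattern. It rests on assumptions: I've written Semaphore as a plain counting semaphore on top of std::mutex and std::condition_variable (the real plptools class may well differ), and StatusCallback, StartupState, and startAndWait are hypothetical names standing in for the daemon side, with a background thread playing the part of NCPSession.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Assumed shape of a counting semaphore; not the plptools implementation.
class Semaphore {
public:
    void signal() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++count_;
        }
        available_.notify_one();
    }

    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        available_.wait(lock, [this] { return count_ > 0; });
        assert(count_ > 0);  // guaranteed by the predicate above
        --count_;
    }

private:
    std::mutex mutex_;
    std::condition_variable available_;
    int count_ = 0;
};

// Same callback shape as the snippet above: an opaque context plus status values.
using StatusCallback = void (*)(void *context, bool connected, int version);

// Hypothetical context object carrying the semaphore and the reported status.
struct StartupState {
    Semaphore ready;
    bool connected = false;
};

// A background thread stands in for the daemon and invokes the callback once.
bool startAndWait() {
    StartupState state;
    StatusCallback callback = [](void *context, bool connected, int /*version*/) {
        auto *s = static_cast<StartupState *>(context);
        s->connected = connected;
        s->ready.signal();
    };
    std::thread daemon([&] { callback(&state, true, 0); });
    state.ready.wait();  // blocks until the callback has signalled
    daemon.join();
    return state.connected;
}
```

Because signal() increments a count instead of just notifying, the handshake still works if the callback fires before the caller reaches wait().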
As is always the way in software, this ‘feature’ work proved the easiest bit and, with the foundations in place, this worked first time: