Multithreaded application design has been a significant challenge since I first understood the problem in the 1980s. It turns out the ØMQ team also has a thing or two to say about this that introduces the Actor Model, and it is worth repeating for those new to the subject. You do not need to be a developer with experience in multithreading, but the more you've suffered from multithreaded code design, the more you'll enjoy this article.

Why Write Multithreaded Code?

Before we look at how to write multithreaded code using ØMQ, it's worth asking why we want to do this at all. In my experience, there are two reasons why people write multithreaded code.
One example from ØMQ experience: in 2005 we wrote our AMQP messaging server, OpenAMQ. The first versions used the same virtual threading engine as Xitami, and could push 50,000 messages per second in and out (100K in total). These are largish, complex messages, with nasty stateful processing on each message, so crunching 50K in a second was a good result, 5x more than the software we were replacing. But our goal was to process 100K per second. So after some thought and discussion with our client, who liked multithreading, we chose the "random insanity" option and built a real multithreaded version of our engine from scratch. The result was two years of late nights tracking down weird errors and making the code more and more complex until it was finally robust. OpenAMQ is extremely solid nowadays, and this pain is apparently "normal" in multithreaded applications, but coming from years of doing painless pseudo-threading, it was a shock. Worse, OpenAMQ was slower on one core and did not scale linearly: it did 35K per second when running on one core, and 120K when running on four. So the exercise was worth it, in terms of meeting our goals, but it was horribly expensive.

The Failure of Traditional Multithreading

The core problem with conventional concurrent programming techniques is that they all make one major assumption that just happens to be insane. The assumption perhaps comes from the Dijkstran theory that "software = data structures + algorithms", which is flawed. Software consisting of object classes and methods is no better; it just merges the data structures and algorithms into a single pot, with multiple levels of abstraction tangled together like rice noodles. Before we expound a less insane model of software, let's see why placing data structures (or objects) in such a central position fails to scale across multiple cores.
In the Dijkstran model of software, its object-oriented successors, and even the IBM-gifted relational database, data is the golden honeycomb that the busy little bees of algorithms work on. It's all about the data structures, the relations, the indexes, the sets, and the algorithms we use to compute on these sets. We can call this the "Data + Compute" model of software. When two busy algorithmic bees try to modify the same data, they naturally negotiate with each other over who will go first. It's what you would do in the real world when you 'negotiate' with a smaller, weaker car for that one remaining parking place right next to the Starbucks. First one in wins; the other has to wait or go somewhere else. And so the entire mainstream science of concurrent programming has focused on indicator lights, hand signals, horns, bumpers, accident reporting forms, traffic courts, bail procedures, custodial sentences, bitter divorces, alcoholic deaths, generational trauma, and all the other consequences of letting two withdrawn caffeine addicts fight over the same utterly essential parking spot. Here, with less visual impact and more substance, is what really happens:
These techniques (mutexes, spinlocks, semaphores, condition variables, and their kin) all try to achieve the same thing, namely safe access to shared data. What they all end up doing is serializing access to that data: while one thread uses it, every other thread must stop and wait its turn, which destroys the very concurrency we were after.
The only way to scale beyond single-digit concurrency is to share less data. If you share less data, or use black-magic techniques like flipped data structures (you lock and modify a copy of a structure, then use an atomic compare-and-swap to flip a pointer from the live copy to your new version), you can scale further. (That last technique serializes writers only, and lets readers work on a safe copy.)

Lots of Chatty Boxes

Reducing the conflicts over data is the way to scale concurrency. The more you reduce the conflicts, the more you can scale. So if there was a way to write real software with zero shared data, that would scale infinitely. There is no catch. As with many things in technology, this is not a new idea; it's simply an old one that never made the mainstream. Many fantastically great ideas are like this: they don't build on existing (mediocre) work, so they are never adopted by the market. Historically, the Data + Compute theory of software stems from the earliest days of commercial computers, when IBM estimated the global market at 5,000 computers. The original model of computing is basically "huge big thing that does stuff". As hardware models became software models, the Big Iron model became Data + Compute. But in a world where every movable object will eventually have a computer embedded in it, Data + Compute turns into the shared-state dog pit where algorithms fight it out over memory that is so expensive it has to be shared. There are, however, alternative models of concurrent computing besides the shared-state one that most mainstream languages and manufacturers have adopted. The relevant alternate reality from the early 1970s is "computing = lots of boxes that send each other messages". As Wikipedia says of the Actor model of computing: "Unlike previous models of computation, the Actor model was inspired by physical laws...
Its development was "motivated by the prospect of highly parallel computing machines consisting of dozens, hundreds or even thousands of independent microprocessors, each with its own local memory and communications processor, communicating via a high-performance communications network." Since that time, the advent of massive concurrency through multi-core computer architectures has rekindled interest in the Actor model." Just as Von Neumann's Big Iron view of the world translated into software, the Actor model of lots of chatty boxes translates into a new model for software. Instead of boxes, think "tasks". *Software = tasks + messages*. It turns out that this works a lot better than focusing on pretty bijoux data structures (and who doesn't enjoy crafting an elegant doubly-linked list or super-efficient hash table?). So here are some interesting things about the Actor model, apart from "how on earth did mainstream computer science ignore such a powerfully accurate view of software for so long":
To explain again how broken the shared-state model is, imagine there was just one mobile phone directory in the world. Forget the questions of privacy and who gets to choose which number goes with "Mom&Dad". Just consider the pain if access had to be serialized. Luckily, every mobile phone has its own directory, and phones communicate with each other by sending messages. Mobile phones have demonstrated that this works well, and scales to the point where we have over 3 billion mobile phones on the planet with no contention over access to a SIM card. And the actor model has further advantages over shared-state concurrency.
So, eliminate shared state, turn your application into tasks that communicate only by sending each other messages, and those tasks can run without ever locking or waiting for other tasks to make way for them. It's kind of like discovering that hey, there are other Starbucks, other parking spaces, and frankly it's easy enough to give everyone their own private city. Of course you need some good connectivity between your tasks.

How to use ØMQ for Multithreading

ØMQ is not RFC 1149. No bird seed, no mops. Just a small library you link into your applications. Let's look at how to send a message from one thread to another. This program has a main thread and a child thread. The main thread wants to know when the child thread has finished doing its work:

```c
//  Show inter-thread signalling using ØMQ sockets
#include "zhelpers.h"
#include <pthread.h>

static void *
child_thread (void *context)
{
    void *socket = zmq_socket (context, ZMQ_PAIR);
    assert (zmq_connect (socket, "inproc://sink") == 0);
    s_send (socket, "happy");
    s_send (socket, "sad");
    s_send (socket, "done");
    zmq_close (socket);
    return (NULL);
}

int main ()
{
    s_version ();
    //  Threads communicate via shared context
    void *context = zmq_init (1);

    //  Create sink socket, bind to inproc endpoint
    void *socket = zmq_socket (context, ZMQ_PAIR);
    assert (zmq_bind (socket, "inproc://sink") == 0);

    //  Start child thread
    pthread_t thread;
    pthread_create (&thread, NULL, child_thread, context);

    //  Get messages from child thread
    while (1) {
        char *mood = s_recv (socket);
        printf ("You're %s\n", mood);
        int done = (strcmp (mood, "done") == 0);
        free (mood);
        if (done)
            break;
    }
    zmq_close (socket);
    zmq_term (context);
    return 0;
}
```

There is no direct mapping from traditional multithreaded code to ØMQ code. Whereas shared-state threads interact in many indirect and subtle ways, ØMQ threads interact only by sending and receiving messages. You should follow some rules to write happy multithreaded code with ØMQ:
- Isolate data privately within its thread and never share data between threads. The only exception is the ØMQ context object, which is threadsafe.
- Stay away from classic concurrency mechanisms like mutexes, critical sections, and semaphores. These are an anti-pattern in ØMQ applications.
- Create one ØMQ context at the start of your process, and pass that to every thread you want to connect via inproc sockets.
- Don't share ØMQ sockets between threads; a socket belongs to the thread that created it.

If you follow these rules, you can quite easily split threads into separate processes when you need to. Application logic can sit in threads, processes, or boxes: whatever your scale needs. ØMQ uses native OS threads rather than virtual "green" threads. The advantage is that you don't need to learn any new threading API, and that ØMQ threads map cleanly to your operating system. You can use standard tools like Intel's Thread Checker to see what your application is doing. The disadvantages are that your code, when it for instance starts new threads, won't be portable, and that if you have a huge number of threads (thousands), some operating systems will get stressed.

Let's see how this works in practice. We'll turn our old Hello World server into something more capable. The original server ran in a single thread. If the work per request is low, that's fine: one ØMQ thread can run at full speed on a CPU core, with no waits, doing an awful lot of work. But realistic servers have to do nontrivial work per request, and a single core may not be enough when 10,000 clients hit the server all at once. So a realistic server starts multiple worker threads. It then accepts requests as fast as it can and distributes them to its worker threads. The worker threads grind through the work and eventually send their replies back. You can of course do all this using a queue device and external worker processes, but often it's easier to start one process that gobbles up sixteen cores than sixteen processes each gobbling up one core. Further, running workers as threads cuts out a network hop, latency, and network traffic. The MT version of the Hello World service basically collapses the queue device and workers into a single process:
```c
//  Multithreaded Hello World server
#include "zhelpers.h"
#include <pthread.h>

static void *
worker_routine (void *context)
{
    //  Socket to talk to dispatcher
    void *receiver = zmq_socket (context, ZMQ_REP);
    zmq_connect (receiver, "inproc://workers");

    while (1) {
        char *string = s_recv (receiver);
        printf ("Received request: [%s]\n", string);
        free (string);
        //  Do some 'work'
        sleep (1);
        //  Send reply back to client
        s_send (receiver, "World");
    }
    zmq_close (receiver);
    return NULL;
}

int main (void)
{
    void *context = zmq_init (1);

    //  Socket to talk to clients
    void *clients = zmq_socket (context, ZMQ_ROUTER);
    zmq_bind (clients, "tcp://*:5555");

    //  Socket to talk to workers
    void *workers = zmq_socket (context, ZMQ_DEALER);
    zmq_bind (workers, "inproc://workers");

    //  Launch pool of worker threads
    int thread_nbr;
    for (thread_nbr = 0; thread_nbr < 5; thread_nbr++) {
        pthread_t worker;
        pthread_create (&worker, NULL, worker_routine, context);
    }
    //  Connect work threads to client threads via a queue
    zmq_device (ZMQ_QUEUE, clients, workers);

    //  We never get here but clean up anyhow
    zmq_close (clients);
    zmq_close (workers);
    zmq_term (context);
    return 0;
}
```

How it works:

- The server starts a set of worker threads. Each worker thread creates a REP socket and then processes requests on this socket. Worker threads are just like single-threaded servers; the only differences are the transport (inproc instead of tcp) and the bind-connect direction.
- The server creates a ROUTER socket to talk to clients, and binds this to its external interface (over tcp).
- The server creates a DEALER socket to talk to the workers, and binds this to its internal interface (over inproc).
- The server uses zmq_device to start a queue device that connects the two sockets. The queue device keeps a single queue of incoming requests and distributes them out to the workers, and routes the replies back to their clients.
Here the 'work' is just a one-second pause. We could do anything in the workers, including talking to other nodes. In terms of ØMQ sockets and nodes, the request-reply chain through the MT server is REQ-ROUTER-queue-DEALER-REP.

Signalling

When you start making multithreaded applications with ØMQ, you'll hit the question of how to coordinate your threads. Though you might be tempted to insert 'sleep' statements, or to use multithreading techniques such as semaphores or mutexes, the only mechanism you should use is ØMQ messages. Remember the story of The Drunkards and the Beer Bottle. Here is a simple example showing three threads that signal each other when they are ready:
```c
//  Multithreaded relay
#include "zhelpers.h"
#include <pthread.h>

static void *
step1 (void *context)
{
    //  Connect to step2 and tell it we're ready
    void *xmitter = zmq_socket (context, ZMQ_PAIR);
    zmq_connect (xmitter, "inproc://step2");
    s_send (xmitter, "READY");
    zmq_close (xmitter);
    return NULL;
}

static void *
step2 (void *context)
{
    //  Bind inproc socket before starting step1
    void *receiver = zmq_socket (context, ZMQ_PAIR);
    zmq_bind (receiver, "inproc://step2");
    pthread_t thread;
    pthread_create (&thread, NULL, step1, context);

    //  Wait for signal and pass it on
    char *string = s_recv (receiver);
    free (string);
    zmq_close (receiver);

    //  Connect to step3 and tell it we're ready
    void *xmitter = zmq_socket (context, ZMQ_PAIR);
    zmq_connect (xmitter, "inproc://step3");
    s_send (xmitter, "READY");
    zmq_close (xmitter);
    return NULL;
}

int main (void)
{
    void *context = zmq_init (1);

    //  Bind inproc socket before starting step2
    void *receiver = zmq_socket (context, ZMQ_PAIR);
    zmq_bind (receiver, "inproc://step3");
    pthread_t thread;
    pthread_create (&thread, NULL, step2, context);

    //  Wait for signal
    char *string = s_recv (receiver);
    free (string);
    zmq_close (receiver);

    printf ("Test successful!\n");
    zmq_term (context);
    return 0;
}
```

This is a classic pattern for multithreading with ØMQ:

- Two threads communicate over inproc, using a shared context.
- The parent thread creates one socket, binds it to an inproc endpoint, and then starts the child thread, passing the context to it.
- The child thread creates the second socket, connects it to that inproc endpoint, and then signals to the parent thread that it's ready.
Note that multithreaded code using this pattern is not scalable out to processes. If you use inproc and socket pairs, you are building a tightly-bound application, i.e., one where the threads are structurally interdependent. Do this when low latency is really vital. For all normal apps, use one context per thread, and ipc or tcp. Then you can easily break your threads out to separate processes, or boxes, as needed.

This is the first time we've shown an example using PAIR sockets. Why use PAIR? Other socket combinations might seem to work, but they all have side-effects that could interfere with signalling:

- You can use PUSH for the sender and PULL for the receiver. This looks simple and will work, but remember that PUSH will distribute messages to all available receivers. If you by accident start two receivers (e.g., you already have one running and you start a second), you'll "lose" half of your signals. PAIR has the advantage of refusing more than one connection; the pair is exclusive.
- You can use DEALER for the sender and ROUTER for the receiver. ROUTER, however, wraps your message in an "envelope", so your zero-size signal turns into a multipart message. If you don't care about the data and treat anything as a valid signal, and if you don't read more than once from the socket, that won't matter. If, however, you decide to send real data, you will suddenly find ROUTER providing you with "wrong" messages. DEALER also distributes outgoing messages, giving the same risk as PUSH does.
- You can use PUB for the sender and SUB for the receiver. This will deliver your messages exactly as you sent them, and PUB does not distribute as PUSH or DEALER do. However, you need to configure the subscriber with an empty subscription, which is annoying, and the signal can be lost if the SUB socket is still connecting when the PUB socket sends it.
For these reasons, PAIR makes the best choice for coordination between pairs of threads.