IOCP


  HTTP Server design with I/O Completion Ports encompases:
  • I/O Completion port queue
  • Thread Pools
  • Windows Sockets (WinSock)
  • Asynchronous socket read/write
  • Asynchronous file read/write 
  • Buffer Memory Management


Introduction:


If you have no idea what an IOCP is, take a look at the Introduction to IOCP and then come back here. The basic idea of an I/O Completion Port is that it is a built-in queuing mechanism that can be coupled to I/O transactions. It is an awesome inter thread communication mechanism with an infinite number of uses.

Under Windows, asynchronous I/O is associated with the terms "Overlapped I/O" and IOCP or "I/O Completion Port". The basic idea of an IOCP is that it is a built-in queuing mechanism that can be coupled to I/O transactions like say HTTP requests coming in via TCP from a web browser. (The term "overlapped" in this context means that the time spent performing the I/O request overlaps the time your thread spends performing other tasks.)

IOCP is a system kernel object that allows an application to use a pool of threads that are created to process asynchronous I/O requests. The system maintains the operation of the IOCP, on which all notifications of completed I/O operations are posted. IOCP is essentially just a thread synchronization object.

The IOCP is associated with worker threads (NOT I/O threads) and these worker threads process the notifications. Notifications are removed from the IOCP queue in a first-in first-out order.

In an ideal implementation, there is ONE thread per processor as threads consume resources as they have their own memory stack. Switching between threads (context switch) is also an expensive operation.

These worker threads process the notifications one at a time. However, If one of the I/O worker threads calls
  • Sleep()
  • WaitForSingleObject()
  • WaitForMultipleObjects()
  • SignalObjectAndWait()
  • Synchronous I/O call
or any function that would cause the thread not to stop and wait, IOCP would detect this and wake another thread immediately thus keeping the CPU saturated with work. This means that the thread pool needs to be larger than the number of processor available.  A rule of thumb is twice as large.


As you might not expect, threads are awakened in a last-in first-out (LIFO) fashion. This means that a thread that is running will be given the next notifiction rather than waking up the first thread and performing a context switch.                     

By using the IOCP, you can make your application's threads achieve phenomenal throughput by reading and writing to devices without waiting for the devices to respond. The devices like TCP, Hard Drives in fact anything that can be read(), then go about the operation of reading while other notifications are processed. When the read() or write(0 is completed, a notification is posted to the IOCP. Now your application logic takes over and processes the data. A thread which is handling the I/O request can be written very simply. You just need a loop that does a small number of things, then loops back and waits for the next notification.

Note: Use of IOCP-capable functions in Winsock 2 (such as WSASend and WSARecv) is preferred over using IOCPs in WriteFile and ReadFile when using sockets.

IOCP is a well designed, battle tested inter-thread communication mechanism with an infinite number of uses. Its power becomes very apparent when utilized to manage a web server that has to scale, because spawning a new thread for each request will rapidly consume any web servers resources.

Web servers also benefit from using IOCP because of its tight integration with WinSock, the windows sockets API.

For Windows, using IOCP is not just a performance requirement, but a stability one. There is no need to write complex WaitForMultipleObjects() to handle multiple concurrent I/O operations, or manage arrays of handles, or auxiliary arrays that correlate the handles to the context of the I/O operation. There is no need to manage the creation, release, and caching of Event objects. There is no reason to poll for completion, or deal with what happens to the initiating thread if the data is not available. This is what IOCP encapsulates.

One of the biggest benefits of IOCP is that it allows you to manage multiple reads and multiple writes on the same handle. If you are wondering why you would want to post more than one I/O operation at a time on a socket. The answer is scalability. Multiple-processor machines with a worker thread per processor can utilize several processors to send and receive data on a socket at the same time. There is some byte re-ordering to be done and memory constraints to watch out for (see common issues an solutions), but the performance gain is significant of a single thread performing serial reads or writes.

In this way, IOCP allows you to write powerful multi threaded code without having to do all the hard work.



A Quick Guide for the Impatient:

To use IOCP, you must first create a socket that has the overlapped flag set. After the socket is bound to a local interface, overlapped I/O calls can be made to any of the following Winsock functions (the WSAOVERLAPPED structure is an argument in the call)

All these functions complete immediately, regardless of the socket's mode because they rely on the WSAOVERLAPPED structure to manage the completion of an I/O request.

Sockets, Pipes and Files are all opened with an overlapped (read async) flag. This instructs the system to post an IOCP event to the queue. This event is signaled and sent to the Completion port when the red/write operation completes.

For a hard disk file, this is when the last byte is read, for a socket, it is when the buffer is emptied. In the case of a file, there is no more data to read, but in the case of a socket there may be more data arriving. In this case, there must be some length parameter contained within the data so your code will know when to stop reading from the socket.

  • WSASend()
  • WSASendTo()
  • WSARecv()
  • WSARecvFrom()
  • WSAIoctl()
  • WSARecvMsg()
  • AcceptEx()
  • ConnectEx()
  • TransmitFile()
  • TransmitPackets()
  • DisconnectEx()
  • WSANSPIoctl()

IOCP is a port (one per thread) through which completion notices arrive. Tasks like network I/O and Hard Disk reads are initiated, and when they complete, the first worker threads is notified. If it is busy the next worker thread in line is notified. In this way, ICOP manages a thread pool.

There are a finite number of threads, typically two per CPU. To maintain performance we need to make sure that the threads in the IOCP thread pool do not block because if they all block, then no socket I/O will occur until they unblock. For this reason, worker threads typically do simple processing and then schedule further work.

The powerful feature of IOCP is that a worker thread can schedule socket writing or hard disk reading that generates a completion notification that arrives back in that same thread through the completion port. The completion port is essentially a queue of these notifications.

Things get a little more complex when other processing like database queries must be completed. To avoid blocking the ICOP worker threads, this work must be offloaded to a secondary thread pool that ideally would expand or contract to meet demand.

Using AcceptEx API call for connection requests that deliver small amounts of data allows your application to service an accept request and retrieve data through a single API call. This can give a small performance gain over making two calls.


Operational Layout



Main Thread :
  - WSAStartup()

  - CreateNewCompletionPort(()
  - CREATE IOCPWorkerThread() 
  - CREATE IOCPWorkerThread()
  - CREATE IOCPWorkerThread()
  - ...

  - socket( AF_INET, SOCK_STREAM, IPPROTO_TCP ) 
  - bind()
  - listen()
  - CREATE SocketListenThread() 

  - CREATE Dialog for the user

    


ListenThread :
  - aaccept() - blocks until a TCP connection arrives (HTTP request)
  - ioctlsocket()
  - AssociateDeviceWithCompletionPort() - wrapper for CreateIoCompletionPort()
  - AllocMem() - for data buffer and OVERLAPPED structure
  - WSARecv() 

        
Worker Thread 1:
IO_SOCKET_READ
  - GetQueuedCompletionStatus() - blocks until WinSock Recv has data
  - CreateFile() - Map requested .html file
  - GetFileSize()
  - AllocMem() - buffer
  - AssociateDeviceWithCompletionPort() - wrapper for CreateIoCompletionPort()
  - ReadFile() - asynchronous
   
IO_FILE_READ
  - GetQueuedCompletionStatus() - blocks until file read notifiv
  - AllocMem() - buffer
  - WSASend()  - asynchronous .html file send (HTTP response)
  
IO_SOCKET_WRITE
  - GetQueuedCompletionStatus() - blocks until WSASend is complete
  - CloseSocket()
  - ReleaseMem() - data buffer and OVERLAPPED structure
  
IO_FILE_WRITE
  - GetQueuedCompletionStatus() - blocks until Write is complete
  - CloseHandle()
  - ReleaseMem() - data buffer and OVERLAPPED structure


Worker Thread n:
........
........


Code Example Design

The example is a simple as possible. No attempt has been made to use advanced memory management for example. When a connection is received, a new instance of the overlapped structure is created and a small (specified in SockBuffLen) buffer for the data is assigned with malloc(). This buffer is then doubled each time more memory is needed.

Once the request has been received the overlapped structure and the buffer is reused for the response. There are many better ways to manage memory but this is simple to follow in the example. We reassign the completion port from the socket to the memory mapped HTML file we are reading from the hard rive and returning - as described in MSDN, so we do not need to create a new completion port.
ExistingCompletionPort [in, optional] - A handle to an existing I/O completion port or NULL.

If this parameter specifies an existing I/O completion port, the function associates it with the handle specified by the FileHandle parameter. The function returns the handle of the existing I/O completion port if successful; it does not create a new I/O completion port.

As mentioned in the quick guide, reading from a hard drive is like reading from a socket in that the system posts an IOCP even if it could not read the entire file immediately - meaning Disk I/O is exhausted at the time of the call, or the supplied buffer is not big enough to hold the entire file. The only difference it that ReadFile() returns ERROR_IO_PENDING. Either way, the notification will occur when the buffer is filled. This means that we do not need to process this "error".

If you are serving up large files consistently, a better design is to limit the buffer to a reasonable size. A notification will then post when the buffer has filled. At this point, we return the buffer to the client (freeing the buffer) before reading the next "slice" of the disk file. The code example does just this.

If you create a small "index.html" file, then it will be read into the buffer immediately. ReadFile() will return with no error and the file will be sent. If you create a large "index.html" file, then it will be read into the buffer until the buffer is full. ReadFile() will return with  "error" ERROR_IO_PENDING. This "slice" is returned to the client. We then check to see if the accumulated total equals the file size. If it does not we change the state back to IO_FILE_READ and call ReadFile() again.

The system updates the WSABUF structure for us so it knows where it left off reading with this Overlapped operation. Reading contiunes until the buffer is full or the end of the file is reached and that slice is then returned. This can be seen clearly in the debug spew lower down in the section titled Code Method.

The code is laid out procedurally for readability. Two simple structures are passed to functions by reference as needed. OOP is generally a better way to implement this. To test the compiled server, just run it locally, and point your browser to: http://127.0.0.1:8080/YourFile.html


Notification Structure


The notification message returns three items:
  1. The Number of new bytes copied to your memory
  2. A user defined key
  3. A Pointer to a notification structure
The lpCompletionKey parameter contains what may be called: per-handle data because the data is related to a socket handle when a socket is associated with the completion port. This is the data that is passed as the CompletionKey parameter of the CreateIoCompletionPort API call. The lpOverlapped parameter contains an OVERLAPPED structure followed by what we call per-I/O operation data, which is anything that the worker thread will need to know when processing a completion.

While the key can be used as a pointer to a structure of your design, this is mostly redundant as all IOCP calls take a notification structure of any size as long as the first item in the structure is the OVERLAPPED or WSAOVERLAPPED structure (they are identical). After that you can have as many variables of any type you like.

Typically you will want to keep track of the amount of memory available so that if you fill it all, you can reallocate some more space for additional data. That means keeping track of the total data in the buffer, and a pointer to the buffer too.You will probably also need the Client Socket handle so you can write data back to the client, but possibly the most important variable would an enumeration of all the possible operations that you might get a notification from.

An example would be:
  • IO_THREAD_SHUTDOWN
  • IO_SOCKET_READ   
  • IO_SOCKET_WRITE  
  • IO_FILE_READ     
  • IO_FILE_WRITE 
An instance of this structure is created for every item in the queue, so variables should be chosen with some care.
 


Coding the Worker threads

As explained your efforts will typically be focussed in coding the worker threads. There will need to be a number of conditional statements to determine what has just completed and then what to do with the buffer.

The greatest problem you typically end up with in dealing with the code is the potential concurrency with other threads.  For example, if you need to update a disk file based
on the completion, you have to synchronize access to the file so that no two threads are trying to write at the same time. But don't think of LockFileEx as the solution here! 

Instead, create a "logging thread" that logs information to a file, and make it the only thread that logs information to a file. Then simply enqueue requests to the logging thread from each of your handler threads, and trying to debug a complex synchronization problem is solved.

A thread which is handling the I/O request can be written very simply: it is a loop that does a very small number of things, then goes back and waits again.  There is no need to do a complex WaitForMultipleObjects, perhaps with a timeout, to handle multiple concurrent I/O operations, or manage arrays of handles, or auxiliary arrays that correlate the handles to the context of the I/O operation.  There is no need to manage the creation, release, and possible caching of Event objects.  There is no reason to poll for completion, or deal with what happens to the initiating thread if the data is not available.

The greatest problem you typically end up with in dealing with the code is the potential concurrency with other threads.  For example, if you need to update a disk file based on the completion, you have to synchronize access to the file so that no two threads are trying to write at the same time.  But don't think of LockFileEx() as the solution here!

Instead, create a "logging thread" that logs information to a file, and make it the only thread that logs information to a file.  Then simply enqueue requests to the logging thread from each of your handler threads, and trying to debug a complex synchronization problem is solved.

Note: If one of the I/O worker threads calls Sleep(), WaitForSingleObject(), WaitForMultipleObjects(), SignalObjectAndWait(), a synchronous I/O call, or any function that would cause the thread not to be runnable, the I/O completion would detect this and wake a third thread immediately to keep the CPUs saturated with work.

Thread handles are closed right away, because we will not need them and the worker threads will continue to execute.


The Four Deadly Horsemen of Performance

Before covering these, it should be mentioned that the biggest mistake developers make is assuming they know where the bottlenecks are without doing application level profiling. There are often surprises. That being said, as mentioned in an article by Jeff Darcy, the common performance killers are:
  1. Data copies
  2. Context switches
  3. Lock contention
  4. Memory allocation


Data copies:
If you really want to get rid of data copies, you'll need to track down a lot of things that really are data copies but don't advertise themselves as such, like TCP for example.

As mentioned in this MSDN article, If you use the SO_RCVBUF and SO_SNDBUF option to set zero TCP stack receive and send buffer, you basically instruct the TCP stack to directly perform I/O using the buffer provided in your I/O call. Therefore, in addition to the nonblocking advantage of the overlapped socket I/O, the other advantage is better performance because you save a buffer copy between the TCP stack buffer and the user buffer for each I/O call. But you have to make sure you don't access the user buffer once it's submitted for overlapped operation and before the overlapped operation completes.

The tried and true method for avoiding data copies is to use indirection, and pass buffer descriptors (or chains of buffer descriptors) around instead of mere buffer pointers. Each descriptor typically consists of the following:

  • A pointer and length for the whole buffer.
  • A pointer and length, or offset and length, for the part of the buffer that's actually filled.
  • Forward and back pointers to other buffer descriptors in a list.
  • A reference count.

Context switches:
Context switches are often behind more total "meltdowns" at high load, than data copies; the system starts spending more time going from one thread to another than it actually spends within any thread doing useful work. The amazing thing is that, at one level, it's totally obvious what causes excessive context switching - mostly having more active threads than you have processors.

As the ratio of active threads to processors increases, the number of context switches also increases - linearly if you're lucky, but often exponentially. This very simple fact explains why multi-threaded designs that have one thread per connection scale very poorly. The only realistic alternative for a scalable system is to limit the number of active threads so it's (usually) less than or equal to the number of processors.

Lock contention:
Efficient locking schemes are notoriously hard to design and often sit between two extremes:
  • overly simplistic and/or coarse-grained, serializing activities that can or should proceed in parallel and thus sacrificing performance and scalability
  • overly complex or fine-grained locking, with space for locks and time for lock operations again sapping performance.
In between, there's supposedly a narrow channel that represents locking which is both efficient and correct. Since locking tends to be deeply tied to program logic, it's often impossible to design a good locking scheme without fundamentally changing how the program works. This is why people hate locking, and try to rationalize their use of non-scalable single-threaded approaches.


Memory allocation:

Allocating and freeing memory is one of the most common operations in many applications and can be costly because two threads may try to allocate memory at the same time causing a contention of the available memory. Lock contention is often the biggest cost in allocating memory, even when lookaside lists are in use.

The simple approach is preallocation. We all know that static allocation is bad when it imposes artificial limits on program functionality, but there are many other forms of preallocation that can be quite beneficial. Usually the reason comes down to the fact that one trip through the system memory allocator is better than several, even when some memory is "wasted" in the process.

Thus, if it's possible to assert that no more than N items could ever be in use at once, preallocation at program startup might be a valid choice. Even when that's not the case, preallocating everything that a request handler might need right at the beginning might be preferable to allocating each piece as it's needed; aside from the possibility of allocating multiple items contiguously in one trip through the system allocator, this often greatly simplifies error-recovery code. If memory is very tight then preallocation might not be an option, but in all but the most extreme circumstances it generally turns out to be a net win.

Another approach is a smart multi-threaded allocator like hoard or  more advanced techniques like Scalable Lock-Free Dynamic Memory Allocation. There are however some general truths that are worth considering:
  1. SetFileIoOverlappedRange is new to Vista/2008 and it addresses use of the overlapped struct, this optimizes one set of allocations.
  2. Regardless of what you do, send and recv (WSA included) will always copy to/from an internal device buffers. Memory alignment requirements for that buffer are not going to be known by the program, some cases maybe not even by the kernel (only the driver), so the best solution is to either buy server class hardware that can handle whatever you throw at it, or make all send/recv buffers page aligned (4KB in most cases). The one way to do this is to suck up a whole page via VirtualAlloc and shove your buffer at the beginning, but this does not scale well and may cause locality problems depending on that hardware you are using. You are causing more problems then you are solving.
  3. Allocating memory in such a way that you can address its locality of reference will get you a much bigger bang for the buck then guessing about memory alignment issues that you simply cant know about in advance or that changes depending on minor hardware.
The penalty for misaligned buffers is not that significant for light loads. Your Application will scale correctly without code changes if you follow the IOCP interface so don't worry.


Windows Sockets
A little on sockets. IOCP is build upon Microsoft's implementation of Sockets known as WinSock, as distinct from BSD Sockets. In this context, a sockets is an opaque data type. It's like a handle. If a socket state changes, you will be told about it, and the socket isn't actually closed until you close it. If the remote side of the connection closes their side then you'll likely be notified if you have a read pending, or you try to write to it.

While we are on the subject, a few notes on using AcceptEx() vs Accept() for improving performance when a server has to handle many short lived client connections.Creating a socket is a relatively slow and expensive operation.  AcceptEx() uses a different model. First you create your sockets, then you "post" accept requests onto the listening socket. When these requests complete you recieve IO completion packets on the associated IO Completion Port which allows us to multi thread the work that we need to do in relation to establishing a connection.

This creates the problem of how many sockets to create initially. Len Holgate has devoted an entire article to this subject along with source code.

When calling AcceptEx() you must always pass a buffer to store the local and remote addresses of the connection. For connections that receive data first you can include space in this buffer for the initial data that is read from the connection. For server's that don't receive data before returning data, specify a data buffer size of zero so no read occurs. In this case, the AcceptEx() returns as soon as the connection is established.

The final difference is in the way the socket state is managed. WSAAccept() updates the socket automatically when a connection is accepted. When we use AcceptEx(), we must do it ourselves by calling setsocketopt() with SO_UPDATE_ACCEPT_CONTEXT to set the state of the socket. We must also keep track of this so we do not try to use that socket to accept another connection until it is released.

There's no difference in the resource limits between AcceptEx() and Accept(). Both face the same non-paged pool limits. Tests using Accept and AcceptEx based servers show both can achieve 64,000+ concurent connections. Keep in mind WSAAccept() is a synchronous call, so failures are reported directly, but with AcceptEx() the failure is reported via the IOCP that is assigned to the socket that you're calling AcceptEx on.


Shutdown

The main thing to avoid is freeing an OVERLAPPED structure when an overlapped I/O operation is in progress, so to do this properly, you need to make sure all the operations have completed and responses returned. Typically its best to stop accepting new requests for a little while and allow the queue to complete. When the last item has completed you can issue a PostQueuedCompletionStatus() call for every thread you created and use some flag (mine is called IO_THREAD_SHUTDOWN) that when processed exits the thread. After that is done you may call CloseHandle() for the CompletionPort handle. Finally the listen socket must be closed and WSACleanup() called.

Closing down a server can be an important operation if you do not have a payload structure that supports atomic data. Something will get lost if requests are not allowed to complete, and this can cause database problems for clients. When you start adding complex business logic memory leaks and other "features" are likely, especially with high performance multi threaded programming. Restarting a server can cause a lot of trouble if the shutdown sequence is short circuited.

To break out of the accept() loop you call closesocket() for the underlying listen socket. This does not affect the connections in place or the IOCP. You need to track either open sockets of the number of completion notifications pending so that you know when all work has been completed, then you can proceed to close the worker threads.

These are closed by posting a series of PostQueuedCompletionStatus() calls, one for each thread and using WaitForMultipleObjects() for the handles to signal the event. I like to set a finite amount of time I will give an IOCP worker thread to exit just in case it hangs because of a bug while developing.



HTTP

GET requests have no body and POST requests have the body size in bytes specified in the headers 'Content-Length' parameter. The end of the header is signaled by a double CRLF. At some point, the buffer must be traversed to find something (like one of these strings). Keeping in mind the first deadly horseman, you will need to use a pointer based method to find the end of the header (the beginning of a POST body) for example. Doing that quickly usually involves pointers or ASM.





BASIC Example:

A function using pointers to find the end of an HTTP request header by looking for the first double CRLF in the data.


FUNCTION PosBody( BYVAL pIn AS BYTE PTR, LenInput AS LONG ) AS LONG

                
  LOCAL ePos, k, Match, LenDelim AS LONG
  LOCAL sDelim AS STRING
  LOCAL pDelim, b, d AS BYTE PTR
                   
    IF LenInput < 1 THEN EXIT FUNCTION ' FUNCTION = 0
     
    sDelim = $CRLF+$CRLF '// HEX: 0D 0A 0D 0A
    pDelim = STRPTR(sDelim)
    LenDelim = 4
   
    FOR ePos = 0 TO LenInput-1 '// ITERATE through the STRING
      b = pIn + ePos
      Match = 1 ' DEFAULT = Match found
      FOR k = 0 TO 3 ' BYTE values
        d = pDelim + k
        IF ePos + k > LenInput THEN EXIT FOR
        IF @b <> @d THEN Match = 0 : EXIT FOR ' Match failed
        INCR b
      NEXT
      IF Match THEN FUNCTION = epos : EXIT FUNCTION
    NEXT

  FUNCTION = 0 ' // return 0
   
END FUNCTION



Code Method

The code example handles multiple requests, partial read and write for both the request/response data, and for the .html file read operation. Specifically:
  1. Partial socket reads -> several WSARecv() to read a complete request message.
  2. Partial socket writes -> several WSASend() to write a complete response message.
  3. Partial file reads -> several ReadFile() to read a complete .html file.
Since we are not uploading data, partial file writes are not coded.

Memory management, outstanding IOCP completion tracking, a secondary thread pool for business logic like database interaction and other sophisticated designs are not implemented to keep the code as small and digestible as possible.
There are two main structures:
  1. SERVER_CONFIG - configuration settings like ThreadsPerCPU and ServerPort
  2. OVERLAPPED_EX - An encapsulation of OVERLAPPED with additional members like WSABUF and AcumBytes

There are four source files in addition to the declarations in the header file:
1. Main
  • ErrMsg()
  • UserMsg()
  • MAIN()
2. Sockets
  • SocketCreate()
  • SocketListenThread()
  • SocketRead()
  • SocketWrite()
3. IOCP
  • CompletionPortAssociateDevice()
  • CompletionPortCreate()
  • CompletionPortThread()
4. Server
  • GetLocalAddress()
  • FileRead()
  • MemMapFile()

The CompletionPortThread() accounts for the majority of the application logic. Asynchronous read/write operations are begun with the OVERLAPPED_EX member IOOperation set to one of the enumerated values:
  • IO_SOCKET_READ   
  • IO_SOCKET_WRITE  
  • IO_FILE_READ     
  • IO_FILE_WRITE    

If a connection is received then a IO_SOCKET_READ operation is begun.
This completes when TCP has returned the bytes received or the buffer is full.
(If the buffer is full, more space is needed and the size is doubled)

If the request is not complete and there is more data to read then another IO_SOCKET_READ operation is begun.
This is repeated until the request is received.

With a buffer size of 48 bytes, a request will progress as follows:
Starting Server:

 3416------------ IO_SOCKET_READ Bytes= 48
++++++++++++++++++++++++++++++++++, BuffLen= 48
GET /anypage.html HTTP/1.1
Host: 127.0.0.1:8080
++++++++++++++++++++++++++++++++++

 3944------------ IO_SOCKET_READ Bytes= 48
++++++++++++++++++++++++++++++++++, BuffLen= 96
GET /anypage.html HTTP/1.1
Host: 127.0.0.1:8080
User-Agent: Mozilla/5.0 (Windows; U; Windows N
++++++++++++++++++++++++++++++++++

 3416------------ IO_SOCKET_READ Bytes= 96
++++++++++++++++++++++++++++++++++, BuffLen= 192
GET /anypage.html HTTP/1.1
Host: 127.0.0.1:8080
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.8) Gecko/20071008 Firefox/2.0.0.8 (.NET CLR 3.5.30729)
Accept: text/xml,
++++++++++++++++++++++++++++++++++

 3944------------ IO_SOCKET_READ Bytes= 192
++++++++++++++++++++++++++++++++++, BuffLen= 384
GET /anypage.html HTTP/1.1
Host: 127.0.0.1:8080
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.8) Gecko/20071008 Firefox/2.0.0.8 (.NET CLR 3.5.30729)
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0
++++++++++++++++++++++++++++++++++

 3416------------ IO_SOCKET_READ Bytes= 55
++++++++++++++++++++++++++++++++++, BuffLen= 768
GET /anypage.html HTTP/1.1
Host: 127.0.0.1:8080
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.8) Gecko/20071008 Firefox/2.0.0.8 (.NET CLR 3.5.30729)
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive


++++++++++++++++++++++++++++++++++

Notice that two threads (ID 3416 and ID 3944) worked on this series of completion notifications because the data came in so fast that the first thread was busy when the next completion was enqueued.


Next we return the file index.html by doing a progressive, asynchronous IO_FILE_READ from the memory mapped file
at the same time we will be doing an IO_SOCKET_WRITE  for each slice of the file

To begin we call MemMapFile() using the same OVERLAPPED_EX structure that was used for reading the request. The data buffer eas increased several times and is now 768 bytes. We could request a new buffer size, but for simplicity we'll just use it as is.

The first slice of the file must be preceded by an HTTP header. Rather than send the header separately, or increase the buffer size to accomodate it, we just shrink the first file slice enough to lay it in first with MoveMemory().

When ReadFile() completes (because the specified number of bytes have been read) we issue a WSASend() using the OVERLAPPED_EX structure set to IO_SOCKET_WRITE.

This returns the buffer to the client and then enqueues a completion notification which initiates the next asynchronous IO_FILE_READ slice to be read.



 3944------------ IO_FILE_READ Bytes= 678
bytesToWrite= 678, AcumBytes= 678 of 189853

 3416------------ IO_SOCKET_WRITE Bytes= 678
Read more of the file...

 3416------------ IO_FILE_READ Bytes= 768
bytesToWrite= 768, AcumBytes= 1446 of 189853

 3416------------ IO_SOCKET_WRITE Bytes= 768
Read more of the file...

 3416------------ IO_FILE_READ Bytes= 768
bytesToWrite= 768, AcumBytes= 2214 of 189853

 3416------------ IO_SOCKET_WRITE Bytes= 768
Read more of the file...

 3416------------ IO_FILE_READ Bytes= 768
bytesToWrite= 768, AcumBytes= 2982 of 189853

 3416------------ IO_SOCKET_WRITE Bytes= 768
Read more of the file...

 3416------------ IO_FILE_READ Bytes= 768
bytesToWrite= 768, AcumBytes= 3750 of 189853

 3416------------ IO_SOCKET_WRITE Bytes= 768
Read more of the file...

 3416------------ IO_FILE_READ Bytes= 768
bytesToWrite= 768, AcumBytes= 4518 of 189853

.
.
.
.
.
 3416------------ IO_FILE_READ Bytes= 768
bytesToWrite= 768, AcumBytes= 188838 of 189853

 3416------------ IO_SOCKET_WRITE Bytes= 768
Read more of the file...

 3416------------ IO_FILE_READ Bytes= 768
bytesToWrite= 768, AcumBytes= 189606 of 189853

 3416------------ IO_SOCKET_WRITE Bytes= 768
Read more of the file...

 3416------------ IO_FILE_READ Bytes= 247
bytesToWrite= 247, AcumBytes= 189853 of 189853

 3416------------ IO_SOCKET_WRITE Bytes= 247
Last slice of the file has been sent


Now that we have fully sliced up the work, the server has become fully scalable. Since we have a thread pool to take advantage of the processors on the machine, we have a multi threaded, multiprocessor solution that can take advantage of serious hardware.



Common Issues and solutions


Re-Ordering

This problem exists because there is no relationship between threads and communication; meaning operations like socket read and write will always be scheduled in the order they were submitted, but they may not complete in that order due to all the overhead associated with threads and tcp/ip. In fact it is likely that if you submit two slices of an HTTP response for sending (by two different threads simultaneously), they will be sent out of order.

As you can see in the above debug spew, the incoming HTTP request was processed by two threads. There is no guarantee that these two threads will complete in the right order resulting in slices that may need re-ordering.

Since the minimum number of threads will match the number CPUs, it is possible to call read() or write() socket operations multiple times from within a thread because, while the order of the send/receive operations is guaranteed to follow the invocation order in the IOCP queue, the order that their completion is reported is not guaranteed. This is because GetQueuedCompletionStatus() is called by multiple threads upon completion in those threads.

One way to deal with this is to assign a globally unique incremented integer to each slice. The slices can then be re-ordered based on this tracking number rather than the order the completion notification arrives.

A detailed solution to this problem is propsed in Len Holgate's excellent article devoted to this subject complete with source code. The dsign is described as:
" For each distinct IO operation we need to keep track of the next sequence number that we can process. When a call to GetQueuedCompletionStatus() returns we need to compare the sequence number in the request with the next sequence number that we can process. If these numbers match then we can process the request. If they don't then the request cannot be processed at this time. If an IO operation cannot be processed it should be stored for later processing. The storage of the out of sequence request needs to be keyed on the sequence number. When an IO thread finds that it can't process the current request it should add the current request to the store and see if there's a request in the store that can be processed. When a request is processed the last thing that the IO thread should do is atomically increment the value representing the next sequence number to process and check to see if there's an IO request in the store that can be processed."


WSAENOBUFS

This problem is not easy to detect and may quack like a memory leak at first blush. If you stress test your server and it hangs, you may have run out of memory. Debugging may lead you to the WSAENOBUFS error... or not.

This occurs because every overlapped send or receive operation may have its associated data buffer (that you are responsible for managing) locked or pinned. When memory is locked, it cannot be paged out of physical memory. When you run out of memory, the overlapped operations will fail with the WSAENOBUFS error. The limit can be reached by having many connections or issuing multiple pending reads for each connection.

This is a significant problem with managed code as the pinning affects garbage collection and can cause serious memory fragmentation and early out of memory issues. A solution would be to allocate all of your buffer space as one big block in advance, and then allocate from that buffer for each operation.

 You would need to allocate the memory pointed to in your WSABUF via VirtualAlloc() to ensure efficient 4K page locking. That's the amount of memory locked with overlapped I/O and the locks are page size (4K) wide. You want your buffers to not span more pages than necessary, though you may have multiple buffers in a single page (e.g. you may want to subdivide a 4K page into two 2K buffers for example). The reason for this is to reduce the address range for page locking needed for overlapped I/O.

Note: how you allocate your WSABUF structure itself is immaterial as is the WSAOVERLAPPED structure. It is the data whose page is locked, not the pointer to the data.



In unmanaged code it's similar but less serious. There's a 'locked pages limit', but this should not cause problems. If this turned out to be an issue you could allocate buffers on page boundaries (in multiples of page size) to limit the number of pages locked, or use the "zero byte read" trick (below).

Specifically the two limits most likely to be encountered are the number of locked pages and non-paged pool usage. The locked pages limitation is less serious and more easily avoided than running out of the non-paged pool. The non-paged pool limit is a much more serious error and is difficult to recover from because the non-paged pool is the portion of memory that is always resident in physical memory and can never be paged out. Kernel-mode operating system components (like drivers) typically use the non-paged pool. Examples are Winsock and the protocol drivers such as tcpip.sys.

Each socket created consumes a small portion of non-paged pool that is used to maintain socket state information. When the socket is bound to an address, the TCP/IP stack allocates additional non-paged pool for the local address information. When a socket is then connected, a remote address structure is also allocated by the TCP/IP stack. In all, a connected socket consumes about 2 KB of non-paged pool and a socket returned from accept or AcceptEx() uses about 1.5 KB of non-paged pool (because an accepted socket needs only to store the remote address). In addition, each overlapped operation issued on a socket requires an I/O request packet to be allocated, which uses approximately 500 non-paged pool bytes totaling about 4k.

If a 32bit server has 1GB of physical memory, there will be 256 MB set aside for the non-paged pool which is enough to handle 50,000 or more connections so long as the number of overlapped operations queued for accepting new connections and receiving on existing connections is limited.

There is no difference in the resource limits applied to WSAAccept() or AcceptEx(). With WSAAccept the failure occurs when you call the function because it's a synchronous call, with AcceptEx() the failure is reported via the IOCP that is assigned to the socket that you're calling AcceptEx on. Both face the same non-paged pool limits. Tests using Accept and AcceptEx based servers show both can achieve 64,000+ concurent connections.

If the system does run out of non-paged pool, either the Winsock calls will fail with WSAENOBUFS or the system crashes with a terminal error... you have been warned.

Zero Byte Read Trick
One solution is to set the socket options SO_SNDBUF and SO_RCVBUF to zero instructs Winsock will use your applications buffer directly in an overlapped I/O call to transmit data to and from the protocol stack, thereby eliminating a copy step. This will allow you to have a large number of sockets open at one time because the internal buffers are set to zero. You application must pass buffers for the system to fill.

When you disable send buffering, the socket is prevented from ever filling the send pipeline. This can lead to packets being sent that are
not full, meaning the overhead of the TCP/IP headers is significant compared to the payload. Disabling the receive buffer can have more serious performance repercussions since if no receive is posted and no receive buffers, the TCP stack will set the window size to zero and the client will  no longer be allowed to send data.

To set the socket option:
    int nZero = 0;

    socket = accept();

    setsockopt(socket, SOL_SOCKET, SO_SNDBUF, (char *)&nZero, sizeof(nZero));

With this done, you post a single zero byte WSARead() upon receiving each connection. With this approach, the per-socket receive buffer should be left intact because once the zero-byte receive operation is completed, the server can simply perform additional non-blocking receives to retrieve all the data buffered in the socket's receive buffer by calling WSARead() with a buffer size set to MAXIMUMPACKAGESIZE. This solution locks physical memory only when data arrives, and solves the WSAENOBUFS problem. But this solution decreases the throughput of the server because it's always faster to have a receive pending when the data actually arrives, than posting a receive after the data has arrived.

If there is no more data pending when the non-blocking receive is called, it will fail with WSAWOULDBLOCK. This design favors maximum possible concurrent connections while sacrificing per connection data throughput. If you know that the client sends data in bursts, then once the zero-byte receive is completed, it may post one or more overlapped receives to accommodate a substantial amount of data (greater than the per-socket receive buffer that is 8 KB by default). 

Your application would probably want to have a good memory management implementation at its core as described above.

Generally (and your mileage may vary depending on many hardware variables) response speed is proportional to the number of overlapped write calls. One developers experience: "twelve was the magic number to max a 100mbps LAN. For a web server, I liked an overlapped count of at least 25 pending AcceptEx() calls."

One developer comments: "I liked a pending count (and limit) of one WSARecv(). I tried a mode I called burst detection, and it did work, where I'd recurse on WSARecv() after an immediate return until I got IO_PENDING or hit the pending limit. But regarding average throughput in the long term, it made no difference. Maybe there is a difference regarding immediate need vs. supply, but averaged, I saw no difference. Hence, I stay with only one pending WSARecv() now (zero-byte for the HTTP mode, and buffered for the FTP mode)"

Access Violation

If a client connection is lost then the I/O call returns with an error. In the parameter CompletionKey, we pass a pointer to a structure ClientContext that contains client specific data. What happens if we free the memory occupied by this ClientContext structure, and some other I/O call performed by the same client returns with an error code, and we transform the parameter CompletionKey variable of DWORD to a pointer to ClientContext, and try to access or delete it? An access violation occurs!

The solution to this problem is to add a number to the structures that contain the number of pending I/O calls (m_nNumberOfPendlingIO), and we delete the structure when we know that there are no more pending I/O calls.


Performance Design

If high throughput is preferred over a large number of connections, then using a payload structure like a packet (where the first n bytes represents a header of some sort with the packet_length as a member) is a very common choice. The server can immediately read packet_length from the header and know how much more data is expected. If the client is sending the data in chunks due to the way it is buffering and processing the data, then it would make sense to have multiple pending reads to increase the throughput of the server. As mentioned, this utilizes multiple threads (each on its own core) to speed up the reading.

In this case the TCP/IP packages will be written directly into the passed buffer instead of to the TCP/IP stack, eliminating a memory copy. The penalty is that every pending receive operation causes the kernel to lock the receive buffers into the non-paged pool (the amount of non-paged pool is typically set to 1/4 of the physical memory). Depending on resources, this may generate a WSAENOBUFFS error if the physical memory is full due to many connections etc. In standard configuration, each connection will require 4k out of the available 2Gig (Xp) or 3Gig (Vista/7) limit.

Other techniques can also improve overall I/O performance. Experimenting with socket buffer sizes to increase I/O performance is always worthwhile. If your application uses one large buffer with only one WSARecv() request instead of three small buffers with three WSARecv() requests, your application will not scale well to multiprocessor machines. This is because a single buffer can be processed by only one thread at a time.

Another performance gain worth considering results from using the AcceptEx() API call for connection requests that deliver small amounts of data. This allows your application to service an accept request and retrieve data through a single API call as explained earlier in the Sockets section.

Finally, avoid using the ReadFile() and WriteFile() Win32 functions for processing I/O on a completion port in Winsock. The WSARecv() and WSASend() functions are better optimized for processing I/O in Winsock 2. ReadFile() and WriteFile() involve making many more unnecessary kernel/user mode procedure call transitions, thread context switches, and parameter marshaling, resulting in a significant performance penalty.


Some general observations culled from the internet:
1) SetFileIoOverlappedRange is new to Vista/2008 and it addresses use of the overlapped struct, this optimizes one set of allocations.

2) Regardless of what you do, send and recv (WSA included) will always copy to/from an internal device buffers. Memory alignment requirements for that buffer are not going to be known by the program, some cases maybe not even by the kernel (only the driver), so the best solution is to either buy server class hardware that can handle whatever you throw at it, or make all send/recv buffers page aligned (4KB in most cases). The one way to do this is to suck up a whole page via VirtualAlloc and shove your buffer at the beginning, but this does not scale well and may cause locality problems depending on that hardware you are using. You are causing more problems then you are solving.

3) Allocating memory in such a way that you can address its locality of reference will get you a much bigger bang for the buck then guessing about memory alignment issues that you simply can't know about in advance or that changes depending on minor hardware.

A serious server will be run on serious hardware. A simple server should be run on modest hardware. The penalty for misaligned buffers is not that significant when only dealing with a few hundred connections or light loads. The point here is that it will scale correctly without code changes if you follow the IOCP interface so don't worry about it. Worry about how to alloc your buffers to address other performance issues.

I can assure you - the *last* place that needs tweaking is the kernel/OS/hardware when it comes to performance. If you follow good programming practices and adhear to the API you will have fewer bugs and better performance overall. And your program wont crash or perform very badly when you change hardware or get an OS update...




Implementing Buisness logic

Sockets are not the only I/O that is IOCP enabled. Reading and writing to the hards disk using memory mapped files (see examples) can take advantage of overlapped I/O. But for non IOCP enabled I/O (such as database connections) it's probably best to create an expandable thread pool rather than spawning a thread for each query.

Another approach is to utilize the queuing capabilities of a messaging layer like ZMQ.


Further Reading

A lot of the information in this article has been distilled from long conversations with other developers and the excellent "Network Programming for Microsoft Windows".

Some articles well worth reading are here and here and here.

https://blog.torproject.org/blog/some-notes-progress-iocp-and-libevent
http://msdn.microsoft.com/en-us/library/aa365198(VS.85).aspx
http://www.codeproject.com/KB/IP/reusablesocketserver4.aspx
http://www.codeproject.com/KB/IP/IOCP_Server_Framework.aspx 
http://www.gershnik.com/faq/iocp.asp 
http://support.microsoft.com/kb/181611 
http://www.codeproject.com/KB/IP/iocp.aspx
http://social.msdn.microsoft.com/Forums/en/wsk/thread/21122de8-864d-440c-bb82-68dda8064d15
http://pl.atyp.us/content/tech/servers.html
http://www.kegel.com/c10k.html
http://www.usenix.org/events/hotos03/tech/full_papers/vonbehren/vonbehren_html/index.html
http://blogs.technet.com/markrussinovich/archive/2009/07/08/3261309.aspx http://sysinternals.kompjoefriek.nl/rip/www.sysinternals.com/Information/IoCompletionPorts.html
http://msdn.microsoft.com/en-us/library/ms737547(v=VS.85).aspx    
http://www.codeproject.com/KB/IP/IOCPScalableServer.aspx
http://www.codeproject.com/KB/IP/chatserverByVerifier.aspx
http://www.codeproject.com/KB/IP/IOCP_how_to_cook.aspx
http://support.microsoft.com/default.aspx?scid=kb;en-us;Q192800
http://www.winsocketdotnetworkprogramming.com/winsock2programming/winsock2advancedscalableapp6b.html
http://msdn.microsoft.com/en-us/library/aa365198(v=VS.85).aspx
http://www.lenholgate.com/archives/000563.html
http://www.lenholgate.com/archives/000306.html
 
Memory:
http://www.drdobbs.com/high-performance-computing/210604448;jsessionid=DHN1OWZGSC4CPQE1GHRSKHWATMY32JVN
http://home.comcast.net/~lang.dennis/code/ring.html   
http://www.thedelphigeek.com/2010/02/dynamic-lock-free-queue-doing-it-right.html <------------
http://www.hoard.org/

Locks:
http://msdn.microsoft.com/en-us/library/ms684122(VS.85).aspx
http://home.comcast.net/~lang.dennis/code/locks.html
http://queue.acm.org/detail.cfm?id=1454462
http://www.audiomulch.com/~rossb/code/lockfree/
http://www.drdobbs.com/high-performance-computing/210604448;jsessionid=SBTLTDMCDVDC3QE1GHRSKHWATMY32JVN?pgno=1



 



















Comments