April 19, 2009

Dive into the I/O subsystems

I decided to write this article to give a short intro that will hopefully help others dive into understanding, and more importantly tweaking, the I/O scheduler for better computing performance.

So what is the I/O scheduler…

Talking about the I/O scheduler means first addressing how I/O requests get executed on a system. Each action made by an application, whether a simple read/write or a memory allocation, ultimately creates an I/O request to the filesystem/virtual memory manager, which in turn passes the request to the scheduler, which hands it down to the low-level device drivers…

The I/O scheduler therefore “resides” between the generic block layer and the low-level device driver… here is how it goes

the file system or the virtual memory manager submits I/O requests to the I/O scheduler… the I/O scheduler takes it from there and forwards them to the low-level device drivers, which in turn forward them to the device controller using device-specific protocols. The device controller then “applies” the request and performs the action.
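
By the way, the elevator sitting in that spot is chosen per block device, and on a 2.6 kernel with sysfs you can see (and change) which one is active for a given disk. Here is a quick sketch that just prints it; the device name “sda” is only an assumption, adjust it to your setup:

    #include <stdio.h>

    int main(void)
    {
        char line[128];
        FILE *f = fopen("/sys/block/sda/queue/scheduler", "r");

        if (!f) { perror("fopen"); return 1; }
        if (fgets(line, sizeof(line), f))
            /* prints something like: noop anticipatory [deadline] cfq */
            printf("schedulers for sda: %s", line);
        fclose(f);
        return 0;
    }

(echoing another elevator name into that same sysfs file is how you switch schedulers on the fly, by the way)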

Now let’s go one step back… Remember we said the I/O requests are submitted from the generic block layer… well, in reality threads, whether in kernel space or user space, initiate those I/O requests (for example the kswapd thread from the Virtual Memory Manager of the kernel). Those I/O requests are referred to as raw I/O requests and, like we said earlier, are handled by the block layer, then submitted to the I/O scheduler… which in turn queues them into internal I/O queues (the 2.6 deadline scheduler, for instance, maintains 5 such queues)… here is how it goes…

the thread/(generic block layer) initiates the I/O request and calls the __make_request() function of the kernel, which in turn calls an I/O scheduler function such as elevator_merge_fn(). Those functions pipe, merge and schedule the I/O request into internal I/O queues. So an I/O request aimed at a given block device gets piped through several internal I/O queues, which together represent one logical queue associated with that block device. It is this logical queue that is handed to the low-level device driver.
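
You can actually watch that merging happen from userspace: /proc/diskstats keeps, per device, counters of how many read and write requests got merged before reaching the driver. A small hedged sketch (again, “sda” is just an assumption):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/diskstats", "r");
        char line[256], name[32];
        unsigned long s[11];            /* the 11 per-disk counters */
        unsigned int major, minor;

        if (!f) { perror("fopen"); return 1; }

        while (fgets(line, sizeof(line), f)) {
            int n = sscanf(line, "%u %u %31s %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu",
                           &major, &minor, name,
                           &s[0], &s[1], &s[2], &s[3], &s[4], &s[5],
                           &s[6], &s[7], &s[8], &s[9], &s[10]);
            if (n == 14 && strcmp(name, "sda") == 0) {
                printf("reads merged:  %lu\n", s[1]);   /* field 5 of diskstats */
                printf("writes merged: %lu\n", s[5]);   /* field 9 of diskstats */
            }
        }
        fclose(f);
        return 0;
    }

Run it, generate some sequential I/O, run it again, and you should see the merge counters climb.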

Still following?

When the low-level device driver is handed the logical queue, it calls, at various points in time, the elv_next_request() function, which returns the next request in the logical queue. The low-level device driver, as we said, takes it from there by converting the I/O request into a device-specific protocol command for the controller, which then “executes” the request.
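
Conceptually, the driver side is just a pull loop. Here is a toy model of it in plain userspace C; it is not kernel code, and next_request() below is only a made-up stand-in for what elv_next_request() does for a real driver:

    #include <stdio.h>

    struct request { int write; unsigned long sector; unsigned int nr_sectors; };

    static struct request queue[] = {      /* pretend this is the logical queue */
        { 0, 1024, 8 }, { 1, 2048, 16 }, { 0, 1032, 8 },
    };
    static unsigned int head;

    static struct request *next_request(void)  /* stand-in for elv_next_request() */
    {
        if (head >= sizeof(queue) / sizeof(queue[0]))
            return NULL;
        return &queue[head++];
    }

    int main(void)
    {
        struct request *rq;
        while ((rq = next_request()) != NULL)
            printf("issuing %s command: sector %lu, %u sectors\n",
                   rq->write ? "WRITE" : "READ", rq->sector, rq->nr_sectors);
        return 0;
    }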

So now that we’ve got a “basic” glimpse of what happens under the hood…

Let’s picture it… say you run an application that, every sleep(50), does a read and then a write…
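
Something like this, purely for illustration (the file name and buffer size are assumptions):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>

    int main(void)
    {
        char buf[4096];
        int fd = open("/tmp/io-sched-demo.dat", O_CREAT | O_RDWR, 0644);

        if (fd < 0) { perror("open"); return 1; }
        memset(buf, 'x', sizeof(buf));

        for (;;) {
            lseek(fd, 0, SEEK_SET);
            if (read(fd, buf, sizeof(buf)) < 0)   /* the read the scheduler favours */
                perror("read");
            if (write(fd, buf, sizeof(buf)) < 0)  /* the write, which can wait      */
                perror("write");
            fsync(fd);                            /* push it down to the scheduler  */
            sleep(50);
        }
    }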

Well, in the 2.6 deadline I/O scheduler, read I/O requests are prioritized over write I/O requests. The way the scheduler achieves that is by assigning each request a deadline. Remember we mentioned that I/O requests get piped into internal I/O queues, which form a single logical queue?… well, while “queuing” the I/O requests, the scheduler assigns a deadline to each request and keeps the queues organized both by starting logical block number (the sorted queues) and by deadline tag (the FIFO lists). The FIFO batch then decides how many requests get moved onto the 5th I/O queue, the dispatch queue, which feeds the low-level device driver.

Requests therefore get moved through the FIFO lists so as to ensure that no request starves and that each one meets its effective deadline. To that end, the scheduler maintains 2 FIFO lists, one for read and one for write operations.

If there are no requests in the dispatch queue, the scheduler moves the head request of one of the other 4 queues to the dispatch queue. If there are pending read and write requests and no write request has been dispatched for a while, write requests from the write FIFO list get placed on the dispatch queue. If requests on the read FIFO list expire, they are automatically placed on the dispatch queue, and so on…
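
To make that decision logic a bit more tangible, here is a toy simulation in C; it is definitely not the kernel’s code and every number in it is invented, but it follows the rules above: reads are preferred, writes are rescued once they have been skipped too many times, and a request whose deadline has expired jumps ahead of the sector-sorted order:

    #include <stdio.h>

    struct req { long sector; long deadline; int done; };

    static struct req reads[]  = { {900, 30}, {100, 40}, {300, 50} };
    static struct req writes[] = { {400, 200}, {200, 210} };
    static int starved, writes_starved = 2;   /* max times writes may be skipped */

    /* Pick the next request from one direction: the oldest one if its deadline
     * has expired, otherwise the lowest sector (the "sorted queue" order). */
    static struct req *pick(struct req *q, int n, long now)
    {
        struct req *oldest = NULL, *lowest = NULL;

        for (int i = 0; i < n; i++) {
            if (q[i].done)
                continue;
            if (!oldest)
                oldest = &q[i];              /* array order = arrival (FIFO) order */
            if (!lowest || q[i].sector < lowest->sector)
                lowest = &q[i];
        }
        if (!oldest)
            return NULL;
        return (now > oldest->deadline) ? oldest : lowest;
    }

    int main(void)
    {
        long now = 0;

        for (;;) {
            struct req *r = pick(reads, 3, now);
            struct req *w = pick(writes, 2, now);
            struct req *rq;
            int is_write;

            if (!r && !w)
                break;

            /* Prefer reads, unless writes have already been skipped too often. */
            if (r && (!w || starved < writes_starved)) {
                starved += (w != NULL);
                rq = r;
                is_write = 0;
            } else {
                starved = 0;
                rq = w;
                is_write = 1;
            }
            rq->done = 1;
            printf("t=%3ld  dispatch %-5s sector %ld (deadline %ld)\n",
                   now, is_write ? "WRITE" : "READ", rq->sector, rq->deadline);
            now += 25;       /* pretend each request takes 25 "ticks" to service */
        }
        return 0;
    }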

There is more to discuss regarding the I/O scheduler, but it would probably take > 400 pages to dissect all the ins and outs of the scheduler and understand how to improve overall system performance.
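
That said, you don’t need 400 pages to start experimenting: the deadline elevator exposes its knobs under /sys/block/<dev>/queue/iosched/. read_expire and write_expire are the per-request deadlines (reads typically get a much shorter one by default, on the order of half a second versus several seconds for writes), fifo_batch is the number of requests moved to the dispatch queue per batch, and writes_starved is how many times reads may be preferred before pending writes must be serviced. A hedged sketch that just prints them (“sda” is an assumption, and the iosched/ entries only exist while deadline is the active elevator for that queue):

    #include <stdio.h>

    static void show(const char *path)
    {
        char line[128];
        FILE *f = fopen(path, "r");

        if (!f) { perror(path); return; }
        if (fgets(line, sizeof(line), f))
            printf("%-50s %s", path, line);
        fclose(f);
    }

    int main(void)
    {
        show("/sys/block/sda/queue/iosched/read_expire");    /* read deadline, in ms      */
        show("/sys/block/sda/queue/iosched/write_expire");   /* write deadline, in ms     */
        show("/sys/block/sda/queue/iosched/fifo_batch");     /* requests moved per batch  */
        show("/sys/block/sda/queue/iosched/writes_starved"); /* read batches before write */
        show("/sys/block/sda/queue/iosched/front_merges");   /* allow front merging (0/1) */
        return 0;
    }

Writing new values into those same files (as root) is all it takes to experiment with the trade-off between read latency and write throughput.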

Cheers,

Ali