Dive into the I/O subsystems

I wrote this article as a short introduction, in the hope that it helps others understand and, more importantly, tune the I/O scheduler for better performance.

So what is the I/O scheduler…

Talking about the I/O scheduler means first looking at how I/O is carried out on a system. Almost every action an application performs, whether a simple read/write or a memory allocation that ends up paging, creates an I/O request to the filesystem or the virtual memory manager. These in turn pass the request to the scheduler, which hands it to the low-level device drivers…

The I/O scheduler therefore “resides” between the generic block layer and the low-level device driver… here is how it goes:

The file system and the virtual memory manager submit I/O requests to the I/O scheduler… the I/O scheduler takes them from there and forwards them to the low-level device drivers, which in turn forward them to the device controller using a device-specific protocol. The device controller then “applies” the request and performs the action.

Now let’s go one step back… Remember we said the I/O requests are submitted through the generic block layer… well, in reality it is threads, in kernel space or user space, that initiate those I/O requests (for example the kswapd thread of the kernel’s virtual memory manager). Those requests are referred to as raw I/O requests and, as we said earlier, are handled by the block layer and then submitted to the I/O scheduler… which in turn queues them in internal I/O queues (the 2.6 deadline scheduler maintains 5 queues)… here is how it goes…

The thread (through the generic block layer) initiates the I/O request and calls the kernel’s __make_request() function, which in turn calls an I/O scheduler function such as elevator_merge_fn(). Those functions pipe, merge and schedule the I/O request into internal I/O queues. So ONE I/O request destined for a block device gets piped through several internal I/O queues, which together represent one logical queue associated with the block device the request targets. It is this logical queue that is handed to the low-level device driver.
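To make the “merge” part a bit more concrete, here is a toy model in Python. It is only a sketch of the idea, not kernel code: the Request class and the submit() function are names I made up for illustration, and the real elevator works on kernel structures with many more checks.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """A toy block I/O request: `count` sectors starting at `sector`."""
    sector: int
    count: int

    @property
    def end(self) -> int:
        return self.sector + self.count


def submit(queue: List[Request], new: Request) -> str:
    """Merge `new` into an adjacent queued request, or insert it sorted by sector."""
    for req in queue:
        if req.end == new.sector:        # back merge: new starts where req ends
            req.count += new.count
            return "back-merge"
        if new.end == req.sector:        # front merge: new ends where req starts
            req.sector = new.sector
            req.count += new.count
            return "front-merge"
    queue.append(new)
    queue.sort(key=lambda r: r.sector)   # keep the queue ordered by start sector
    return "insert"


queue: List[Request] = []
results = [
    submit(queue, Request(100, 8)),   # empty queue, plain insert
    submit(queue, Request(108, 8)),   # starts where the first request ends
    submit(queue, Request(92, 8)),    # ends where the merged request starts
    submit(queue, Request(500, 8)),   # not adjacent to anything
]
print(results)
print(queue)
```

Three of the four requests collapse into a single larger one, which is the whole point of merging: fewer, bigger requests reach the driver.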

Still Following?

When the low-level device driver is ready for more work, it calls the elv_next_request() function, which returns the next request in the logical queue. The driver then takes it from there, converting the I/O request into a device-specific command for the controller, which “executes” the request.

So now that we got some “basic” glimpse at what happens under the hood…

Let’s picture it… say you run an application that does a read and then a write after every sleep(50)…

Well, in the 2.6 I/O scheduler, read I/O requests are prioritized over write I/O requests. The scheduler achieves that by assigning each request a deadline. Remember we mentioned that I/O requests get piped into internal I/O queues, which form a single logical queue?… Well, while queuing the I/O requests, the scheduler assigns a deadline to each of them and organizes the queues either by starting logical block number (the sorted queues) or by deadline (the FIFO lists). Requests are then moved in batches onto the 5th I/O queue, the dispatch queue that feeds the low-level device driver.

Requests are moved through the FIFO lists to ensure that no request starves and that each one meets its deadline. To that end the scheduler maintains 2 FIFO lists, one for read and one for write operations.

If there are no requests in the dispatch queue, the scheduler moves the head request of one of the 4 other queues to the dispatch queue. If there are pending read and write requests and no write request has been dispatched for a while, write requests from the write FIFO list get placed on the dispatch queue. If requests in the read FIFO list expire, they are placed on the dispatch queue, and so on.
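The rules above can be sketched as a small Python simulation. Treat it as a toy under stated assumptions: the ToyDeadline class, its method names and the three constants are mine, chosen for this sketch, and the model ignores batching and everything else the real scheduler does.

```python
from collections import deque

READ_EXPIRE = 0.5     # seconds; values chosen for this sketch
WRITE_EXPIRE = 5.0
WRITES_STARVED = 2    # reads allowed to jump ahead of a waiting write


class ToyDeadline:
    def __init__(self):
        self.fifo = {"R": deque(), "W": deque()}   # arrival order, carries deadlines
        self.sorted = {"R": [], "W": []}           # ordered by start sector
        self.dispatch = deque()                    # the fifth queue
        self.starved = 0                           # reads picked while writes waited

    def add(self, op, sector, now):
        expire = READ_EXPIRE if op == "R" else WRITE_EXPIRE
        req = (op, sector, now + expire)           # (direction, sector, deadline)
        self.fifo[op].append(req)
        self.sorted[op].append(req)
        self.sorted[op].sort(key=lambda r: r[1])

    def dispatch_one(self, now):
        reads, writes = self.fifo["R"], self.fifo["W"]
        if not reads and not writes:
            return None
        # Reads go first unless a pending write has been passed over too often.
        if reads and not (writes and self.starved >= WRITES_STARVED):
            op = "R"
            if writes:
                self.starved += 1
        else:
            op = "W"
            self.starved = 0
        head = self.fifo[op][0]
        # An expired FIFO head wins; otherwise take the lowest sector.
        req = head if head[2] <= now else self.sorted[op][0]
        self.fifo[op].remove(req)
        self.sorted[op].remove(req)
        self.dispatch.append(req)
        return req


sched = ToyDeadline()
sched.add("W", 900, now=0.0)
sched.add("R", 300, now=0.0)
sched.add("R", 100, now=0.0)
sched.add("R", 200, now=0.0)
order = [sched.dispatch_one(now=0.1)[:2] for _ in range(4)]
print(order)

late = ToyDeadline()
late.add("R", 800, now=0.0)
late.add("R", 10, now=0.4)
first = late.dispatch_one(now=0.6)[:2]   # sector 800 has expired by now
print(first)
```

In the first run two reads are served in sector order, then the starved write, then the remaining read. In the second run the expired read at sector 800 is dispatched ahead of the lower sector 10.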

There is more to discuss regarding the I/O scheduler, but it would probably take more than 400 pages to dissect all the ins and outs of the scheduler and understand how to improve overall system performance.

Cheers,

Ali

Strace – Reverse Engineering – System Calls

If there is one recurring problem that I often see clogging the forums, it is “library missing”, or often “installed libraries which a program doesn’t find”.

I decided to share a simple debugging technique which could save the day or even the hours… Google might not be the right choice all the time, when you have got strace at your fingertips.

1. System Calls

To understand strace, you first need to understand what a system call is. So what is a system call? A system call is simply a kernel function that executes in kernel mode and serves as the interface between user code and the kernel.

Whenever you call the function open() in a C program, you are in fact calling a C library function “open”, which in turn switches from user mode to kernel mode and runs the kernel’s “open” system call.

So the concept of switching is central to our understanding of system calls and functions. The switch is usually triggered by a software interrupt, a gate or a trap instruction.
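The same wrapper idea can be seen from Python, whose os module exposes thin wrappers around these calls. A minimal sketch, assuming a scratch file created just for the demonstration:

```python
import os
import tempfile

# Create a scratch file so there is something to read back.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello\n")
    path = tmp.name

fd = os.open(path, os.O_RDONLY)   # open(2): returns a small integer descriptor
data = os.read(fd, 512)           # read(2): ask for up to 512 bytes
os.close(fd)                      # close(2)
os.unlink(path)                   # remove the scratch file

print(fd, data)
```

Each of these lines ends up as one system call, which is exactly what strace prints. On recent systems the first one may show up as openat rather than open.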

2. Reverse Engineering with Strace

First, let’s break our system to set up the exercise. Start by looking at what ls depends on:


[root@web01 ~]# ldd /bin/ls
linux-gate.so.1 =>  (0x009c2000)
librt.so.1 => /lib/librt.so.1 (0x003d0000)
libacl.so.1 => /lib/libacl.so.1 (0x003b3000)
libselinux.so.1 => /lib/libselinux.so.1 (0x00b20000)
libc.so.6 => /lib/libc.so.6 (0x00243000)
libpthread.so.0 => /lib/libpthread.so.0 (0x0038e000)
/lib/ld-linux.so.2 (0x00225000)
libattr.so.1 => /lib/libattr.so.1 (0x003ac000)
libdl.so.2 => /lib/libdl.so.2 (0x00388000)
libsepol.so.1 => /lib/libsepol.so.1 (0x00110000)

As we can see, those are the libraries “ls” must load before it can execute its system calls and give us the usual pretty output.

Let’s move librt.so.1 out of /lib to our backup folder in /root/libBackup

Execute ls at the command line


[root@web01 ~]# ls
ls: error while loading shared libraries: librt.so.1:
cannot open shared object file: No such file or directory

Of course, the error message here is pretty obvious… ls needs “librt.so.1” to run and, as good systems administrators, we all know where to look for shared libraries, right?

Anyway, for the sake of this exercise, let’s assume we have no clue that librt.so.1 is supposed to be in /lib…

(now for the fun of it, google the ls error above and be amazed at how many people reported this error on forums)

So let’s use our strace magic here and see how we can fix the problem.


[root@web01 ~]# strace /bin/ls
execve("/bin/ls", ["/bin/ls"], [/* 21 vars */]) = 0
brk(0)                                  = 0x88d2000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=22984, ...}) = 0
mmap2(NULL, 22984, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7f88000
close(3)                                = 0
open("/lib/librt.so.1", O_RDONLY)       = -1 ENOENT (No such file or directory)
open("/lib/tls/i686/sse2/librt.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)

As we can see, “ls” tries to open librt.so.1 at “open("/lib/librt.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)”… reading on through the output, we see the dynamic loader looking for the library file in the other directories of its search path (the default library directories and their hardware-specific subdirectories such as /lib/tls/i686/sse2, plus anything set in LD_LIBRARY_PATH).

The solution would therefore be to “MOVE” our librt.so.1 file back to our /lib folder and resolve our headaches.

(I emphasize MOVE, since cp relies on this library too… so copy would be broken as well at this point.)
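When the strace output runs to hundreds of lines, a few lines of Python can pull out the failed lookups for you. This is only a sketch: the missing_files() helper and its regular expression are my own, written against the output format shown above, and other strace versions may format lines differently.

```python
import re

# Matches lines such as: open("/lib/librt.so.1", O_RDONLY) = -1 ENOENT (...)
FAILED_OPEN = re.compile(r'^open(?:at)?\(.*?"([^"]+)".*\)\s+=\s+-1\s+ENOENT')


def missing_files(strace_output: str) -> list:
    """Return the paths that open() failed to find, in order of appearance."""
    found = []
    for line in strace_output.splitlines():
        match = FAILED_OPEN.match(line.strip())
        if match:
            found.append(match.group(1))
    return found


sample = '''\
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib/librt.so.1", O_RDONLY)       = -1 ENOENT (No such file or directory)
open("/lib/tls/i686/sse2/librt.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
'''
paths = missing_files(sample)
print(paths)
```

The first path in the list is usually the one to restore; the others are just the loader working through its fallbacks.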

—–

Now, let us spice things up: let’s erase the content of librt.so.1 (make sure to back up the original).


[root@web01 ~]# echo "" > /lib/librt.so.1

let’s try… and…


ls: error while loading shared libraries: /lib/librt.so.1: file too short

Now things are getting interesting… you may wonder what in the world “file too short” could possibly mean.

The error gives you a path under “/lib”, so we know the file is there, since it doesn’t complain that it can’t find it. So let’s strace it and see what is really happening.


[root@web01 ~]# strace /bin/ls
execve("/bin/ls", ["/bin/ls"], [/* 21 vars */]) = 0
brk(0)                                  = 0x8147000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=22984, ...}) = 0
mmap2(NULL, 22984, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7f13000
close(3)                                = 0
open("/lib/librt.so.1", O_RDONLY)       = 3
read(3, "\n", 512)                      = 1
close(3)                                = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f12000
writev(2, [{"/bin/ls", 7}, {": ", 2}, {"error while loading shared libra"..., 36}, {": ", 2}, {"/lib/librt.so.1", 15}, {": ", 2}, {"file too short", 14}, {"", 0}, {"", 0}, {"\n", 1}], 10/bin/ls: error while loading shared libraries: /lib/librt.so.1: file too short
) = 79

Now.. thanks again to Strace, we got our answer..

To make sense of all this gibberish, let’s go through the most essential calls.

1. execve executes the program named by its filename argument, passing it the argv array and the environment
2. brk(0) – brk called with the argument 0 just returns the current program break (the end of the heap); malloc and free build their memory management on top of it
3. mmap2 maps a file into memory; here it maps /etc/ld.so.cache (file descriptor 3) at 0xb7f13000

Then comes the open call we saw earlier, which in turn is followed by a “read(3, "\n", 512) = 1”.

Now… let’s break here and go back to our error “file too short”….

read() – ssize_t read(int fd, void *buf, size_t count) – reads up to “count” bytes from the file descriptor fd into the buffer buf and returns the number of bytes actually read. In our result here, we see 2 things: the buffer contains just “\n” and the number of bytes read is 1… whereas the loader asked for 512 bytes, enough to cover the header of a real library, which is far larger than that.

A shared library is also supposed to start with an ELF header… which in this case it doesn’t (of course it doesn’t, we erased that lib’s content a few lines earlier :) )

A valid library’s header would therefore show up as read(3, "\177ELF……", 512)

The “\n” in the buffer and the 1 byte read therefore mean that our file is effectively empty :)

- Problem solved -
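The same check can be done by hand without strace. Here is a small Python sketch that reads the first 512 bytes the way the loader did and looks for the ELF magic. The inspect_library() and make_file() helpers are names I made up, and the two test files are throwaway stand-ins, not real libraries.

```python
import os
import tempfile

ELF_MAGIC = b"\x7fELF"   # the same bytes strace prints as "\177ELF"


def inspect_library(path):
    """Read up to 512 bytes and return (bytes_read, looks_like_elf)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        header = os.read(fd, 512)
    finally:
        os.close(fd)
    return len(header), header.startswith(ELF_MAGIC)


def make_file(content):
    """Write `content` to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(content)
        return tmp.name


broken = make_file(b"\n")                     # what echo "" > file leaves behind
fake_ok = make_file(ELF_MAGIC + bytes(600))   # magic followed by padding only
broken_result = inspect_library(broken)
fake_result = inspect_library(fake_ok)
os.unlink(broken)
os.unlink(fake_ok)

print(broken_result, fake_result)
```

A file that returns fewer than 512 bytes or does not start with the magic is not a usable shared library, whatever its name says. Passing the check does not prove the library is valid, it only rules out the obvious cases.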

3. Other cases where Strace can help

Feeling like some programs run slow? Run strace and look at the paths tried for each library… the failed lookups tell you how the loader’s search path (including LD_LIBRARY_PATH) is being walked and what could be optimized.

Another common case is a hanging system call: the last line of the strace output shows a call that never returns, which tells you where to continue debugging with other tools.

I hope that was useful :)