# Project Dns-check 0.1.0 Released

I have just released dns-check, a command line tool to randomly query different nameservers spread across the world or per location (country or city).

To install dns-check, simply run

Have fun,

# Shor’s Algorithm and Quantum Fourier Transform

This post assumes you have a sound understanding of Quantum Mechanics, more precisely eigenstates, superpositions and entanglements

Shor’s algorithm allows us to find a factor of any composite number $N$ in $O((log\ N)^2 (\ log^k\ (log\ N)^2))$ for any random $k$, which is important in this scope to understand, as cryptography’s security baseline is founded on the fact that factoring cannot happen in polynomial time - I have talked on this subject in my post Yes Diffie-Hellman is secure, quoting myself

Because the discrete logarithm is not NP-complete, it is not possible to compute our $\log_ba$ in polynomial time and therefore generating a big prime number to be used in generating a cryptographic key renders exhaustive search attack completely useless. - Ali Abbas

Since Shor’s algorithm runs in polynomial time to the length $log\ N$, it solves the discrete logarithm problem and hence breaks the conjecture on which public key cryptography such as RSA are based upon.

Shor works because the factoring problem can be reduced in finding the period  $r$ ( $% $) which validates the cyclic function $f(n) = n^r = 1\ mod\ N$, in other words, if the $gcd(n, N) = 1$ (n and N are coprime integers), then given $r$ is even, $n^{r/2} - 1$ and $n^{r/2} + 1$ are not multiples of $N$, but $\exists x \in \left \{1,..,N-1\right \} : (n^{r/2} + 1)(n^{r/2} - 1) = x N$. This gives us, $gcd(n, N) \Rightarrow gcd((n^{r/2} + 1), N), gcd((n^{r/2} - 1), N)$, since as we have just seen, both $n^{r/2} - 1$ and $n^{r/2} + 1$ share a common ‘non-trivial’ factor of $N$. Finding $r$ is done by transforming the periodic sequence using a Quantum Fourier Transform.

### Quantum Fourier Transform

QFT is simply a Discrete Fourier transformation applied to a quantum amplitude using a unitary operator, meaning, there exists $\hat{U}$ such that $|\tilde{\psi}\rangle = \hat{U}|\psi\rangle$, in other words, assuming a basis state, there exists a vector $\tilde{\psi}$ which is a superposition of all computational basis, defining the orthonormal basis states - it is unitary because the amplitude of the vector never changes as part of the transform.

For example, taking a set of complex numbers, $\left \{n_0,..,n_{N-1}\right \}$, the orthonormal basis state acted upon is: $0\rangle\cdots|N-1\rangle$ defined by $|j\rangle \rightarrow \frac{1}{\sqrt{N}}\sum_{k=0}^{N-1}e^{2\pi ijk/N}|k\rangle$.

This means that the Fourier transform of an amplitude $n_j$ is a mapping of an arbitary state $\sum_{j = 0}^{N-1} n_{j} |j\rangle$ to $\sum_{k = 0}^{N-1} {n}'_k |k\rangle$ since ${n}'_{k} = \frac{1}{\sqrt{N}} \sum_{j = 0}^{N-1} n_{j} e^{\frac{2\pi j k}{N}}$ per the Discrete Fourier equation, QFT can then be modelled as follow: $F = \frac{1}{\sqrt{N}}\sum_{j,k}^{N-1} e^{\frac{2\pi j k}{N}} |k\rangle \langle j|$.

Having said that, and not wanting of course to go into a full lecture on QFT - what is most important to understand here is that the speed at which we can apply QFT - by properly aligning the Qubits in a proper Quantum circuit, QFT is computed in order of $O(N^{2})$ gates (assuming of course that all factors of N are $O(logN)$) - an equivalent amplitude of  $2^N$ would require $O(N2^N)$ simple gates.

To find the period $r$ of a random $x$ such as $f(x) = f(x + r)$ and $f(x) = n^x mod\ N$, we create 2 entangled registers (input and output) of size $logN$, both initialized to $\frac{1}{\sqrt{N}}\sum_{n}^{N-1}|n\rangle|0\rangle$. We then compute our periodic function f(x) to each number (remember, register 1 is loaded with an equally weighted superposition of integers 0 to N-1) in our input register and store the value in the second register - due to the superposition state and quantum computation parallelism, this operation is done in one step.  Once the quantum computation is done, we observe the second register for some random value $k$, such that register 1 is collapsed into an equal superposition of each values ranging from $0$ to $N-1$ for each $k$ value; In other words, we have a set values of possible $x$ such as $f(x) = k$, which means we observe a lower integer $c$, such as $n^c\ mod\ N = k$. Our superposition in register 1 being $|c\rangle,|r+c\rangle,|2r+c\rangle,\cdots,|n-r+c\rangle$, we apply a Discrete Fourier Transform (see earlier) on the superposition, this will generate an amplitude of integers (spectrum) which are multiples of  $n/r$; Since we no longer have an equal superposition of states and our superposition in register 1 was periodic in nature, we can observe a state, let’s call it $a$, which has a high probability of being a multiple of $n/r$. Having $a$ and $n$, we can easily compute $r$ and verify whether $f(x) = f(x + r)$. If our verification works, we have found our periodic value and can resume to calculate $gcd((x^{r/2} \stackrel{+}{-} 1), N)$. If not, then we pretty much repeat the process with a random value $x$.

Of course, Shor’s algorithm can be more elaborated into steps and in soft and hard cases whether $r$ divides $q$ or not and if you have followed the core logic of the algorithm which we had quickly undergo, you can only come to realize that Shor’s algorithm is highly probabilistic in nature due to the nature of DFT, hence the more Shor is run, the more it becomes accurate. Quantum computing parallelism provides a significant gains since all states of all values of $n$ are computed in one step. Hence the problem of solving the discrete logarithm problem is in finding the period $r$, which in case of Shor’s is shown to be trivial and faster than classical computing using a Discrete Fourier transform.

# Patching Vmware Vmnet Module for Linux 3.2.*

When installing vmware player 4.0.4, I had the nice surprise of finding out that the vmnet kernel module would not properly compile while starting the vmware service.

Manually directly compiling the module, we see a couple of incompatibility with the kernel header definitions

I have created a small repo at my github account @alouche, containing some code changes to fix these issues - the repo will continuously updated if necessary against different kernel versions (only from 3.2 up). To fix it, simply replace vmnet.tar in /usr/lib/vmware/modules/source with my modified version

Start the vmware service and the vmnet module will automatically successfully compile.

Enjoy,

# Git :: Removing Files From All Commits

Alright… this is just a tiny hint on the process I used to nuke some committed files from all commit history

And here comes the explanation:

git filter-branch –index-filter “git rm -rf –cached –ignore-unmatch my_files” HEAD

Our action is to rewrite our branch, hence we need to use the top level “filter-branch” command.

–index-filter: there is no need to checkout the current branch, so we can move faster and simply filter in which we issue the “git rm -rf …”

git-rm: quiet obvious

–cached: Only match the paths in the index - leaving modified matching files untouched –ignore-unmatch: result 0 status in any case if no match

HEAD: obviously, we are working on our last commit

rm -rf .git/refs/original/

Even with our branch rewrite from earlier, we still have a backup in refs/original, so we need to delete it

git reflog expire –all

Here is where it gets interesting, you see, each action performed inside git is “backed up” in the reflog. Think of it as a safety net, which is an inventory hash of all the points you been at for each commit. So it is possible to restore the files commit from the reflog, hence “expire –all”.

git gc –aggressive –prune

Oh my… gc what? “garbage collector?”, well actually we have rewritten the branch, purged the reflog and we are left with a lot of unused objects, so time to save some disk space and clean up

git push origin +master

Well… I don’t want to merge and then push, that would be deafeating the whole purpose of my previous actions and since no one has yet pulled from this repo, so we need to force the non-fast-forward since we are pretty much breaking the objects inheritance, hence the “+master”

# The So-called Skype SDK IP Leaks

For the last few days, there has been a buzzing news in the community, following the recent discovery of a so-called information leak in the skype SDK. zhovner@github, published a python code sample “exploiting this vulnerability” https://github.com/zhovner/Skype-iplookup/ using a de-obfuscated SDK and published a demo site @. More related information on the skype-open-source project can be found @ http://skype-open-source.blogspot.de/

So to sump-up, the “so-called leak” takes place by: 1. having “debug logging enabled” in the hi-jacked SDK 2. triggering a lookup information on a user who happens to be online, such as seeing his vcard and blam! both private and public ip addresses of this user are shown in the log.

Now, I say “so-called” leak, because truth be told, this isn’t leak nor a bug and here is why:

The skype protocol is merely at its core a P2P protocol - while it has never standardized and is fully proprietary, a minimum understanding on how P2P architecture operates clearly explain why the skype client passes such information, after all, there aren’t truly such things as relay servers in a P2P network,  clients in this case are exchanging the VoIP traffic directly and doing all the processing. Skype’s communication architecture constitutes of 3 types of nodes, “login servers”, “supernodes” and “standard nodes”.

To put it as simple as possible, “login servers” are the servers you connect to authenticate, “standard nodes” are clients (skype clients) sitting behind a NAT firewall and “supernodes” (most interesting for this article) are simply the opposite of “standard nodes”, in other words, they are directly assigned a public IP address and no firewall rules are blocking a direct connection to the skype random port client.

For a call to take place between 2 “standard nodes”, they must have a direct non NATed connection, since “standard nodes” sit behind a NAT/firewall, a direct connection between the 2 hosts is therefore not possible, to overcome this skype uses a technique called “UDP hole-punching” and this is where “supernodes” come into play.

### UDP hole-punching “simplified”

UDP hole-punching is a simple technique which persists in somewhat convincing the firewall that the incoming UDP packets are responses of an already established connection. Now I say “established”, but remember, UDP is not connection oriented, so we basically have no concept of sessions and all that nasty-SYN-overhead :-P, but we have a NAT recorded session which we could exploit.

To clarify: assume we have 2 hosts A, B each with a NAT/firewall and a random server called X. UDP hole-punching works as follow:

1. A and B connect to X (whether that be UDP or TCP)
2. Through the connection, X is aware of A and B’s public NAT and source ports
3. X communicate to A, B’s public NAT and source port and vice-versa to B
4. A sends a UDP packet with its previous source port to B’s public NAT
5. Naturally B’s firewall will reject the UDP packet, but as A sent that packet to B’s NAT IP, A’s firewall recorded the NAT session with the source port used by A and here is the punchole
6. B is aware of A’s source port from the exchange with server X, so B sends a packet A’s NAT IP with the destination source as A’s source port, since the NAT session was recorded in step 4, blam! the firewall forwards B’s UDP packet to A.

Now that you have a basic understanding of UDP hole-punching, keep in mind that a supernode will act as a relay in the case that A and B are still unable to communicate in cases where the firewalls are randomizing the source IP of the clients when the NAT process takes place.

“Ok, I get it! A needs to know B’s public IP and vice-versa, but why the private IP addresses, as seen in the so-called skype SDK IP leak”.

Well, you are right, why is Skype communicating the private IP address? Well, it is like answering, why would 2 hosts communicate over their NAT IPs if they could directly communicate through their privately assigned IP addresses? in other words, if 2 clients are located on 2 routeable LANs, it would be faster for them to exchange internally than externally and hence the reasons why during the hole-punching process, both A and B reported not only their public IPs but also their private IPs which server X exchanged between the 2 clients, both ending up knowing the public and private IPs as well as source ports of the opposite peer.

So… there is no leak, just some guys who figured out how to enable “printf” in an SDK which hooks to your client :)

Needless to say, if you are curious as the kind of information which circumvent over skype  - read more @ http://www.scribd.com/doc/69593950/Skype

# Mercedes Museum

Here are a few snaps from my visit at the Mercedes Museum this weekend in Stuttgart - for those visiting Germany, it is definitely worth the stop. It was both cultivating and a lot of fun :) [flickr-gallery mode=”photoset” photoset=”72157629512062396”]

# Lower Initial TCP RTO - Redhat Kernel Patch

I have recently back-ported the rfc2988bis changes (initRTO=1 and fallack) to the redhat 2.6.32 kernel - find the patch on my github account at @ https://github.com/alouche/redhat-2.6.32-kernel-patches/blob/master/rfc2988bis.patch

On short lived connections with a lot of 3WHS, a lower initial RTO will improve 3WHS latency by 22000msX% (X% being the average of packet drops of a specific route). For further technical details, refer to https://www.ietf.org/proceedings/77/slides/tcpm-1.pdf

# Linux CFS Algorithm and Virtual Runtime

Since the 2.6.23 kernel, the Linux kernel process scheduler previously O(1) was replaced by CFS - a Completely Fair Scheduler.

CFS uses a red-black tree as data-structure and unlike previous Unix process scheduler does not account a traditional time slice of process execution but accounts what is referred as the process virtual runtime, expressed in nanoseconds (as opposed to Hz or jiffies). The usage of a self-balanced tree as the red-black tree allows for a lookup of $O(\log\ n)$ time per the height of the tree, but more on this later.

### Virtual Runtime

The Virtual Runtime, declaration vruntime is a variable part of the process inherited C structure as defined in linux/sched.h which accounts for the time a process run in relation to the number of running processes. Each process holds a process descriptor “task_struct” dynamically allocated via the slab allocator.

While the other variables also play a role in CFQ decision’s algorithm, vruntime is by far the core variable which needs more attention as to understand the scheduling decision process. As stated earlier, CFQ does not account time slice as did previous schedulers, the vruntime is evaluated

The vruntime for each process is calculated as followed:

1. Compute the time spent by the process on the CPU
2. Weight the computed running time against the number of runnable processes

The update_curr function is responsible to calculate the running time spent by the process, which stores the value into an unsigned long variable delta_exec which is computed as followed:

delta_exec is passed unto __update_curr

calc_delta_fair will return the weighted value of the process’s calculated runtime delta_exec in relation to number of runnable processes. The sub function used for that calculation is calc_delta_mine.

To keep in mind:

• unsigned long delta_exec, is the computed running time of the process
• unsigned long weight, is the nice value of the process
• struct load_weight *lw, composed of 2 unsigned long “weight” and  “inv_weigh” (lw->weight and lw->inv_weight)

Once the vruntime is computed, it is stored into the inherited sched_entity structure of the process by calling

in the previously seen __update_curr function, we also notice the call

What this does is to compute the smallest vruntime of all runnable processes and store it at the leftmode node of the red-black tree.

The CFS selection algorithm for process to be run is extremely simple, it will run its red black tree and search for the smallest vruntime value pointing to the next runnable process. In other words, “run the process which is represented by the leftmost node in the tree”, since the leftmost node constains the smallest vruntime, referred in the source code as min_vruntime.

To conclude this post, it is important to note that CFS does not walk the whole tree since min_vruntime is referenced by rb_leftmost in the cfs_rq struct (kernel/sched.c)

as seen in this construct (kernel/sched_fair.c)

# DCB 101 - Priority-based Flow Control

DCB - Data Center Bridging is set of standard which defines 4 set of independent technologies/concepts to pretty much make Ethernet lossless, hence to support storage traffic. We will not go into a debate over FCoE, whether you should consider a single fabric for both storage and “standard/Ethernet” traffic in your data center design strategy or go a more traditional way.

As we said earlier, DCB is a set of standards, actually a set of 4 standards, which we will depict over the “DCB 101” posts series. DCB is compromised of:

• Priority-based Flow Control - 802.1Qbb
• Enhanced Transmission Selection - 802.1Qaz
• DCB Exchange - - 802.1Qaz

While these standards are independent, they do correlate as dependent layers - remove one and the nice theory of maintaining a lossless transport fabric will start falling apart. Alright! I said no debate on FCoE.

## Priority-base Flow Control

You may sometimes come across some technical documents which refer to PFC as Per-Priority Pause - for now, keep this in mind as it will shortly become clear as you read on.

Who says “Priority flow”, says “multiple flows/segments/divided lanes with different priorities” and at the end that’s naively what PFC is about. PFC’s goal is to merely segment traffic over the Ethernet fabric/medium and protocol into streamed marked priorities and define specific “guidance/action” for each of these streams. In other words, think of a highway which has a unique lane - PFC is virtually creating a secondary, third, fourth lane so that certain car types will be assigned to a specific lane - some cars can move faster, while others can be stuck in a traffic jam.

### 3bit Priority Code Point

Moving away from the highway simplistic illustration - the PCP value in the 802.1Q Tag Control Info 2bytes header is used by PFC to determine the priority of the traffic. It is important to keep in mind that PCP is not defined by DCB but by the 802.1p task committee; another great example is QoS which makes use of PCP for traffic classification.

Because PCP is only 3bits, we can only define 8 priorities. While the priority positive integer is ascendant in terms of higher priority, keep in mind that, a PCP value of 1 is lower than a PCP value of 0, this is simply due to the Network Priority translation value, such as PCP-0 = NP-1 and PCP-1 = NP-0. I will not detail why this is as it is now, but perhaps in a future post.

### 802.3x Pause Frames

We step away shortly from PCP and PFC to review some small concepts about Pause frames which will lay down the path to not only understand the importance of PCP but what PFC introduces to make Ethernet lossless.

Who talks about congestion, talks about the inability to not forward nor process traffic - whether that be frames or simply segments. Like on a highway, one way of getting rid of a traffic jam is to stop a specific flow of traffic or the whole traffic to dismantle the traffic jam. Pause Frames are like the highway patrol and yes they can also shoot you down ;-)… for more insights on this analogy, you can refer to my following posts:

• https://alouche.net/2009/09/23/ethernet-flow-control-and-igmp-snooping
• https://alouche.net/2010/04/25/ethernet-flow-control-dos-in-stp-environment

Without going into too much details, a Pause Frame is simply a MAC Frame which carries an OPCODE of 0x0001 in its Mac Control field and a quanta based time unit field.

The opcode value of 0x0001 simply says that this MAC Frame is a Pause Frame and the quanta time field expressed as a 2bytes unsigned integer refers to how many bytes should the sending of frames be stopped. Calculating in clock-time, how long the sender will stop sending frames is relative to the transmission rate; For example, one quanta is 512bytes, on a 10Ge link, that means that one quanta is roughly 51.2ns (512/10⁻⁸).

While Pause frames are essential to prevent congestion (at least that’s the goal), 802.3x frames cannot differentiate priority based traffic, in other words, assuming both LAN and storage traffic are processed by the same frabric, a Pause frame issued due to the beginning of a congestion on the LAN traffic would result as well in an interruption of the storage traffic.

## Per-Priority Pause

Because of the limitation of current Pause Frames, PFC defines a new Pause frame standard, which is a re-use of the existing pause frame standard but with priority capabilities. As with the 802.3x frame, the Per-Priority Pause carries the same OPCODE but at the difference of the following:

• The OPCODE is 0x0101 to differentiate between 802.3x and PFC Pause frames
• The 802.3x frame time field is removed and replaced by a 16bit vector array consisting of 2 fields, a “priority enable” field and a “time field”. The “priority enable” vector field refers whether the quanta time referred in the “time field” should be evaluated, it simply acts as an off/on switch.

Once PPP is negotiated, 802.3x frames are no longer validated on the ports received PPP frames, hence it is not possible to use 802.3x on top of PPP.

### Summary

In summary, PFC makes use of PPC as defined in 802.1p, introduces a new pause frame format designed to reduces the “probability” of packets drop, treating defined priority traffic as separate queues. Having said that, PFC isn’t without its downfall - without a Congestion protocol such 802.1Qau, PFC will results in a head-of-line blocking scenario.

Finally it is important to note when using PFC, it is important to carefully calculate the buffer size to prevent frame lost on the receiving size while a PFC pause frame is sent - another post will soon clarify the methodology to use to calculate the buffer size.

# Yes Diffie-Hellman Is Secure

When it comes to the Diffie-Hellman algorithm, there seem to be many confusions from newbies in Cryptography as to whether an attacker could easily recompute the shared key by intercepting the prime numbers + public keys. While the answer is “no”, understanding why, requires an understanding the discrete logarithm problem. So here we go.

### Discrete Logarithm Problem

The discrete logarithm problem can be summarized as follow:

Given $a, b \in \mathbb{Z}/p\mathbb{Z} \wedge P$ a prime, compute $\log_b a$

In other words, $b$ is defined to be a primitive root of the finite group $( \mathbb{Z}/p\mathbb{Z} ) ^ {*}, a \in ( \mathbb{Z}/p\mathbb{Z} ) ^ {*}$ of order $p$ and we need to calculate the exponential $n$ such as $% $ resolving $a=b^{n}$.

The logarithm is then calculated modulo the order of $b$ in the finite group such that $\log_b a \equiv x$ if $n = \left | b \right |$.

Because the discrete logarithm is not NP-complete, it is not possible to compute our $\log_ba$ in polynomial time and therefore generating a big prime number to be used in generating a cryptographic key renders exhaustive search attack completely useless.

### DH Overview

Before we dive further, here is a very short reminder of the DH algorithm:

User1 and User2 agree on a prime number $p$ and a generator $g$ such as there exist an exponent $x$ so that $% $.

Both users secretly generate a random key, let’s call it $k$ ( $k_1$ for User1 and  $k_2$ for User2); from there on,  $p$ and  $g$ are used by both users to generate their public keys.

The public key equation is:  $g ^{k} \mod {p}$, let’s call it $l$

After both users compute  $l$ ( $l_1$ for User1 and  $l_2$ for User2), they exchange their public keys. The shared secret key can then be generated from the opposite user’s public key.

The secret key equation is: $l ^{k} \mod{p}$

1. User1: $l_2 ^{k_1} \mod{p} \Rightarrow g ^{k_2*k_1} \mod{p}$
2. User2: $g ^{k_1*k_2} \mod{p}$

As you can see, to recompute the shared key, the attacker would need to resolve the discrete logarithm problem as discussed earlier, that is, detect  $k_1$ and  $k_2$ from  $l_1$ and $l_2$, since it is not possible to directly compute $g ^{k_1*k_2} \mod{p}$; after all $g ^{k_1} * g ^{k_2} = g ^{k_1+k_2}$ and not $g ^{k_1*k_2} \mod{p}$.

There are of course many algorithms that attempt to resolve the discrete logarithm, such as the “squarre root attack”, but this is the subject for another post.

I hope this clarifies to new comers why DH is secure enough to be used in key exchange.