Archive for September, 2009

Ethernet flow control and IGMP snooping

September 23rd, 2009 Ali Abbas No comments

It is important to note that TCP flow control and Ethernet flow control are two completely different mechanisms. They strive to achieve the same goal but, when in use, are completely unaware of each other.

As a matter of fact, Ethernet flow control can fully alienate your network if not planned and used carefully :)

So What is TCP flow control?

Flow control is a mechanism implemented in the TCP stack which enables a receiver endpoint to notify a sender that it can no longer receive data in its buffer. The free buffer space is what is simply referred to as the TCP Window Size, and it is transmitted in each ACK. The receiver can therefore let the sender know how many bytes it is able to accept at once.

[ let's assume the receiver machine can only hold 4K in its buffer ]

(sender) <------ ACK 1022 WIN 4096 <------ (receiver)
(sender) ------> 4K | SEQ 1022 ------> (receiver)

[ assuming that the buffer of the receiver is now full with these 4K ]

(sender) <------ ACK 5118 WIN 0 <------ (receiver)

[ the sender is now "blocked" from sending more data until the receiver announces a non-zero window again ]

(sender) <------ ACK 5118 WIN 4096 <------ (receiver)
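
The exchange above can be replayed with a toy model. This is only a sketch: the `Receiver` class and its method names are made up for illustration and ignore everything a real TCP stack does besides window accounting. It shows that the acknowledgment number advances by the number of bytes accepted (1022 + 4096 = 5118) and that the advertised window is simply the free space left in the buffer.

```python
class Receiver:
    """Toy receiver with a fixed-size buffer (hypothetical model, not a real TCP stack)."""

    def __init__(self, buffer_size, next_expected):
        self.buffer_size = buffer_size
        self.buffered = 0
        self.next_expected = next_expected

    def window(self):
        # Advertised window = free space left in the receive buffer.
        return self.buffer_size - self.buffered

    def receive(self, seq, length):
        # Accept in-order data that fits in the window, then answer with (ACK, WIN).
        if seq == self.next_expected and length <= self.window():
            self.buffered += length
            self.next_expected += length
        return self.next_expected, self.window()

    def application_reads(self, length):
        # The application drains the buffer, which reopens the window.
        self.buffered -= min(length, self.buffered)
        return self.next_expected, self.window()


rx = Receiver(buffer_size=4096, next_expected=1022)
print(rx.next_expected, rx.window())      # initial ACK and WIN
print(rx.receive(seq=1022, length=4096))  # buffer full: window drops to 0
print(rx.application_reads(4096))         # buffer drained: window reopens
```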

Ok so now, what is Ethernet flow control?

From layer 4 (TCP flow control), we jump now to layer 2 (Ethernet flow control).

Ethernet flow control is different from TCP flow control as it makes use of a MAC control frame, the “pause frame”, to ask the device at the other end of the link to stop sending frames. It is important to keep in mind that the sender of the pause frame sets a 2-byte pause time, counted in quanta of 512 bit times, which defines how long the other end must wait before it resumes transmitting frames. Also keep in mind that pause frames are not forwarded: a MAC control frame is only meaningful on the link it was sent on, so the device that receives it will not pass it on through a trunk port or to any other device.
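
To get a feel for how long a pause can last, here is a small sketch. It relies on the IEEE 802.3x definition of the pause time as a 16-bit count of quanta, one quantum being 512 bit times; the function and constant names are my own.

```python
PAUSE_QUANTUM_BITS = 512   # one pause quantum = 512 bit times
MAX_PAUSE_QUANTA = 0xFFFF  # the pause time is a 16-bit field


def pause_duration_seconds(quanta, link_bps):
    """Time a port stays silent for a given pause time on a link of link_bps bits/s."""
    if not 0 <= quanta <= MAX_PAUSE_QUANTA:
        raise ValueError("pause time must fit in 16 bits")
    return quanta * PAUSE_QUANTUM_BITS / link_bps


for name, speed in (("100 Mbps", 100e6), ("1 Gbps", 1e9)):
    ms = pause_duration_seconds(MAX_PAUSE_QUANTA, speed) * 1000
    print(f"{name}: the longest single pause is {ms:.2f} ms")
```

A pause time of 0 cancels a pause that is still running, which is how a switch tells the sender it may resume early.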

What is the problem when using Ethernet flow control?

If you have read this far, you can start guessing what may occur if you have “ethernet flow control” enabled on your switch. When a receiver cannot keep up and the switch buffers fill, the switch does not drop the packets; instead it generates its own pause frame and sends it to the sending host. Now keep in mind that a pause frame stops all transmission at the data link layer... that is to say, if meanwhile PCX was getting a file from PCB, that transfer would be “paused” as well. Because pause frames only work on layer 2 (data link), all communication going through the targeted switch port ceases completely for the duration of the pause.

But what happens meanwhile with the TCP flow control?

As said earlier, TCP flow control isn’t aware of Ethernet flow control... TCP throttles the amount of data it sends based on the receiver’s window and on packet loss. Because the switch no longer drops packets when “ethernet flow control” is enabled, TCP does not notice that it is sending more data than the endpoint can receive, and it keeps increasing the amount of data it sends... The result is an overloaded receiver and a switch which keeps generating pause frames, until TCP finally detects congestion and readjusts its sending window.

And what happens when you have IGMP snooping off?

Imagine a multicast scenario where you have a server and a workstation on two 1Gb ports and another workstation on a 100Mb port. If the server starts sending multicast packets at 1Gbps (in the absolute ;-) ), Ethernet flow control will immediately throttle the server down to the speed of the slowest port receiving the traffic. Remember we are talking multicast here: because the packets are also delivered to the 100Mb port, Ethernet flow control on the switch forces the server to send at only 100Mbps. While this may be acceptable when that workstation really wants the stream, remember that without IGMP snooping the switch sends all multicast packets to all switch ports, including endpoints which never joined the multicast group. Those unsolicited deliveries trigger Ethernet flow control and cause bad, slow performance.

Conclusion

IGMP snooping has always been a problem in VRRP setups (aka. Checkpoint HA), causing fluctuations of the interface state (referred to as flapping interfaces).

While it is possible to disable IGMP snooping per VLAN, I would recommend disabling IGMP snooping per multicast MAC address (i.e. 01:00:5e:xx:yy:zz)
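
To find the multicast MAC address that belongs to a given group, recall that an IPv4 multicast address maps to 01:00:5e followed by the low 23 bits of the group address. A minimal sketch (the function name is my own), using the VRRP group 224.0.0.18 as an example:

```python
import ipaddress


def multicast_mac(group):
    """Map an IPv4 multicast group to its Ethernet multicast MAC address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low23 = int(addr) & 0x7FFFFF       # only the low 23 bits are copied
    mac = (0x01005E << 24) | low23     # fixed 01:00:5e prefix
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


print(multicast_mac("224.0.0.18"))  # group used by VRRP advertisements
```

Because only 23 bits are copied, 32 different groups share each MAC address, so a per-MAC exception also covers the other groups that map to it.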

Categories: Networking, TCP/IP

Filter networks with BGP

September 20th, 2009 Ali Abbas No comments

There are 3 easy ways to filter which networks are exchanged through BGP with a remote/adjacent AS (Autonomous System).

Those 3 simple ways are: prefix-list | Extended Access-list + Route-map | Extended Access-list + Distribute-list

Note: before we go on, I need to point out that an extended access list used with BGP (route-map, distribute-list) works much like a prefix-list. In that context the access list no longer matches a source and a destination address, but the address prefix (in the source field) and its netmask (in the destination field).

Let's assume in all 3 examples that we do not want to add the network 192.168.4.0/24 to our routing table when it is advertised by our one eBGP peer, AS 64515.

* in this example, we are of course using a private ASN

1. Prefix-list

First we jump into global configuration mode and create a prefix-list filter named “DENY-PREFIX”

border1#conf t
border1(config)#ip prefix-list DENY-PREFIX seq 10 deny 192.168.4.0/24
border1(config)#ip prefix-list DENY-PREFIX seq 20 permit 0.0.0.0/0 le 32
border1(config)#router bgp 64514
border1(config-router)#neighbor 192.168.10.1 remote-as 64515
border1(config-router)#neighbor 192.168.10.1 prefix-list DENY-PREFIX in
border1(config-router)#do wr
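
To see how the two entries interact, here is a rough Python model of how a prefix-list is evaluated: entries are checked in sequence order, the first match wins, and anything unmatched is denied. It is a simplified sketch (no "ge" support, made-up function and variable names) that assumes the goal stated above of denying exactly 192.168.4.0/24.

```python
import ipaddress

# (action, prefix, le) entries in sequence order, mirroring DENY-PREFIX.
DENY_PREFIX = [
    ("deny", "192.168.4.0/24", None),  # no "le": the prefix length must match exactly
    ("permit", "0.0.0.0/0", 32),       # "le 32": any prefix length from 0 to 32
]


def prefix_list_action(entries, route):
    """Return the action of the first entry matching the route, or the implicit deny."""
    net = ipaddress.ip_network(route)
    for action, prefix, le in entries:
        base = ipaddress.ip_network(prefix)
        max_len = base.prefixlen if le is None else le
        if net.subnet_of(base) and base.prefixlen <= net.prefixlen <= max_len:
            return action
    return "deny"  # implicit deny at the end of every prefix-list


for route in ("192.168.4.0/24", "192.168.4.0/25", "10.0.0.0/8"):
    print(route, "->", prefix_list_action(DENY_PREFIX, route))
```

Note that the more specific 192.168.4.0/25 is not caught by the first entry; to block the subnets of the /24 as well, that entry would need an "le 32" of its own.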

2. Extended access-list / Route-map

First, we create an extended access list in global config mode

border1#conf t
border1(config)#access-list 101 deny ip host 192.168.4.0 host 255.255.255.0
border1(config)#access-list 101 permit ip any any

We then proceed to create a route map (still in global config mode)

border1(config)#route-map NET-FILTER permit 20
border1(config-route-map)#match ip address 101

We add a second clause that denies everything the first clause did not permit, then apply the route-map to the neighbor

border1(config)#route-map NET-FILTER deny 30
border1(config-route-map)#exit
border1(config)#router bgp 64514
border1(config-router)#neighbor 192.168.10.1 remote-as 64515
border1(config-router)#neighbor 192.168.10.1 route-map NET-FILTER in
border1(config-router)#do wr
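
The interplay between the access list and the route-map can be sketched in Python. This is a simplified model with names of my own choosing: a route that the access list denies does not match the "permit 20" clause, so it falls through to "deny 30".

```python
# ACL 101 as (action, network, netmask); None stands for "any".
ACL_101 = [
    ("deny", "192.168.4.0", "255.255.255.0"),
    ("permit", None, None),
]

# NET-FILTER clauses in sequence order: (sequence, action, matched ACL or None).
NET_FILTER = [
    (20, "permit", ACL_101),
    (30, "deny", None),  # no match statement: matches every remaining route
]


def acl_permits(acl, network, netmask):
    """First matching line wins; an unmatched route hits the implicit deny."""
    for action, acl_net, acl_mask in acl:
        if acl_net in (None, network) and acl_mask in (None, netmask):
            return action == "permit"
    return False


def route_map_action(route_map, network, netmask):
    """Return the action of the first clause whose match condition is met."""
    for _seq, action, acl in route_map:
        if acl is None or acl_permits(acl, network, netmask):
            return action
    return "deny"  # implicit deny at the end of every route-map


print(route_map_action(NET_FILTER, "192.168.4.0", "255.255.255.0"))
print(route_map_action(NET_FILTER, "10.1.0.0", "255.255.0.0"))
```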

3. Distribute-list

Similar to the route-map, we will be using an extended access list to accomplish the filtering.

We will be using the same access list we defined earlier for the route-map, which is access-list 101

border1(config)#router bgp 64514
border1(config-router)#neighbor 192.168.10.1 remote-as 64515
border1(config-router)#neighbor 192.168.10.1 distribute-list 101 in
border1(config-router)#do wr

- Last but not least

Remember that for inbound updates, the order of preference is

  • first route-map

  • filter-list

  • prefix-list/distribute-list

and for outbound updates

  • prefix-list/distribute-list

  • filter-list

  • route-map

Categories: BGP, Networking

Cisco Switch Ether-Channel – misconceptions

September 19th, 2009 Ali Abbas 1 comment

I decided to write this post to address some of the many misconceptions out there on many mailing lists, forums etc…

1. More bandwidth

That is one of the most common misconceptions I read or hear: if I “trunk” (Solaris term actually) many ports together, the bandwidth of the ether-channel equals the number of ports multiplied by the bandwidth of each port.

In other words... 8 * 100Mbps ports ether-channeled would create a single logical port with 800Mbps... Now while this is somewhat the expectation, it isn’t really true. You see, only one physical link is used for a single connection. If I connect to a workstation over the ether-channel, my packets for this single connection use one link and not multiple links, thus the bandwidth for this “conversation” is still 100Mbps.

2. All links are load balanced

It is also common to think that all links are equally shared, that is to say, that no link would be more loaded than the others.

Now while this is somewhat achievable, it isn’t by default. You see, in the setup described here Cisco’s ether-channel algorithm hashes the destination MAC address of each packet and assigns it a number ranging from 0 to 7 (the default hashing method varies between platforms, so check yours with “show etherchannel load-balance”). This value is then assigned to a port belonging to the logical link. Because the maximum number of active links¹ which can be channeled together is 8¹, with 8 links each of them carries exactly one value (the hashed number from the destination mac address + session). If you happen for example to have 5 links, 3 links would be assigned 2 hashed values each, while the 2 remaining links would get 1 hashed value each. If you only had 2 links, then each would be assigned 4 values.

Because the algorithm uses the destination MAC address, all packets destined to one server use the same link.
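
The distribution of the 8 hash values over the links can be sketched as follows. The hash used here (the low 3 bits of the last MAC byte) and the round-robin assignment of values to links are assumptions made for illustration only; the real hash is platform specific.

```python
def buckets_per_link(active_links, buckets=8):
    """How many of the 8 hash values each link carries."""
    if not 1 <= active_links <= buckets:
        raise ValueError("an ether-channel has between 1 and 8 active links")
    base, extra = divmod(buckets, active_links)
    return [base + 1 if i < extra else base for i in range(active_links)]


def link_for_dst_mac(dst_mac, active_links):
    """Pick a link index for a frame, based only on its destination MAC."""
    # Hypothetical hash: the low 3 bits of the last MAC byte give a value 0..7.
    bucket = int(dst_mac.split(":")[-1], 16) & 0b111
    # Hypothetical mapping: hash values are dealt out to the links round-robin.
    return bucket % active_links


for links in (8, 5, 2):
    print(links, "links ->", buckets_per_link(links))

# Every frame sent to the same server lands on the same link.
print(link_for_dst_mac("00:11:22:33:44:55", 5))
```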

Imagine this simple scenario

[many workstations] —> switchA <==========> switchB —-> Server1

When PC1, PC2 and PC3, which connect to switchA, try to access a file on Server1, all connections go through the same ether-channel link, while the other links (assuming there is no other traffic) stay completely unused. The packets returned from Server1 to PC1, PC2 and PC3 can nevertheless use different links, because we now have 3 destination MAC addresses. So one way a link is overloaded, while the other way it isn’t.

Because each switch only chooses the link for the frames it sends itself, ether-channel load balancing is decided per direction... it is therefore possible to change the load balancing algorithm on switchA from Destination Mac Address to Source Mac Address, while leaving switchB on Destination Mac Address, which results in “somewhat” load balanced ether-channel links.

Having said that, Cisco’s algorithm offers about 9 methods to determine the link to use: the Destination or Source Mac Address, the Destination or Source IP Address, the Source or Destination Port, the Destination AND Source IP Address, the Source AND Destination Port and finally the Destination AND Source Mac Address.

(wow that was a long sentence :) )

Whichever method you decide to use, there will be some drawbacks depending on your network topology. That is to say, Ether-Channel isn’t the primary means to resolve bandwidth/throughput issues; it is nevertheless useful and, when necessary, an important feature in network topology design.

¹ [ LACP allows a maximum of 16 links with only 8 active ]

iBGP route reflectors

September 18th, 2009 Ali Abbas No comments

By default, all BGP peers within the same autonomous system must peer with each other to form a full mesh, because a route learned from one iBGP peer is not advertised on to another iBGP peer.

“Disclaimer: BGP confederations will not be tackled in this post”

For example

If routerB learns a new route from routerA, it isn’t able to advertise the learned route to routerC, and routerC can only learn the route from routerA directly. Now imagine your network isn’t fully meshed. Well, I am sure you guessed right! Depending on your network infrastructure, the subnet advertised by routerA will be unreachable from routerC and you will have a big convergence problem.

What if the peers cannot be meshed together?

It is possible when using standard iBGP to “force” a router to “reflect” the routes it learned to another adjacent peer. In simple words, routerB learned the route from routerA, routerB “reflects” that route to routerC.

Route Reflectors

As stated already, route reflectors eliminate the need for a full mesh setup and thus allow scalability. A route reflector also reduces the data exchanged between peers by only reflecting the best path. When setting up an RR, you define what is generally referred to as a cluster (RR + client peers). In our example below, the RR is routerB and the client peers are routerA and routerC; together they form the cluster.

It is also important to understand how RR works.

As we said earlier, the RR selects the best path when receiving a route from an iBGP peer. If the route was learned from a non-client iBGP peer (imagine a routerD peering with the RR without being one of its clients), it is reflected only to the route reflector clients (routerA and routerC for example); this is why all non-clients still need to be fully meshed with each other. If the route instead comes from a client (routerA or routerC), it is reflected to both the non-clients and the other clients.
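
These two rules can be written down as a short sketch. The peer table is hypothetical (routerD and routerE stand for non-client peers of the RR, which are not part of the topology used below), and the model ignores best path selection and loop detection.

```python
# Roles of the RR's iBGP peers; routerD and routerE are hypothetical non-clients.
PEERS = {
    "routerA": "client",
    "routerC": "client",
    "routerD": "non-client",
    "routerE": "non-client",
}


def reflection_targets(learned_from, peers):
    """Peers to which the RR passes on a route learned from learned_from."""
    source_role = peers[learned_from]
    targets = []
    for name, role in peers.items():
        if name == learned_from:
            continue  # simplification: nothing is sent back to the sender
        # From a client: reflect to everyone. From a non-client: clients only.
        if source_role == "client" or role == "client":
            targets.append(name)
    return sorted(targets)


print("from routerA:", reflection_targets("routerA", PEERS))
print("from routerD:", reflection_targets("routerD", PEERS))
```

In this model routerE never hears routerD's routes from the RR, which is exactly why non-clients must still peer with each other directly.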

Now let’s see how we can set up a simple route reflector…

Configuration using a private ASN

Simple topology: [routerA 192.168.1.1] —– [ routerB 192.168.2.1] —– [ routerC 192.168.3.1]

routerB(config)#router bgp 64514
routerB(config-router)#neighbor 192.168.1.1 remote-as 64514
routerB(config-router)#neighbor 192.168.1.1 route-reflector-client
routerB(config-router)#neighbor 192.168.3.1 remote-as 64514
routerB(config-router)#neighbor 192.168.3.1 route-reflector-client

Now routerB will be advertising routes learned from routerA to routerC and from routerC to routerA.

What more?

I am not going to reiterate what RFC 2796 (since replaced by RFC 4456) addresses, so I suggest reading http://www.ietf.org/rfc/rfc2796.txt to learn more about RR loop detection and avoidance.

Categories: BGP, Networking

Mathis Equation and TCP performance

September 16th, 2009 Ali Abbas 2 comments

Put as simply as possible, the Mathis equation goes as follows

Rate <= (MSS/RTT) * (1 / sqrt(p))

MSS

This is the Maximum Segment Size, which is the MTU minus the TCP/IP headers.

MSS = MTU - TCP/IP headers, for example 1460 bytes with an MTU of 1500 (20 bytes of IP header and 20 bytes of TCP header)

RTT

RTT is the Round Trip Time as measured by TCP. The round trip is the time it would take a packet to travel from endpoint A to B and from endpoint B to A.

On average, RTT = (Physical Distance in km * 20ms) / 1609, that is to say, for each 1609 km (1000 miles) you should expect an RTT of 20ms

p

p is the probability of packet loss on the link, expressed as a fraction (0.001 means 1 packet lost in 1000). For comparison, a fiber BER would typically be around 10⁻¹³.

Before we go on, it is important to understand how TCP evaluates packet loss. Put simply, TCP infers loss from acknowledgments that are missing or late: the more segments go unacknowledged in time, the higher the loss rate it sees.

Let’s get more serious

As explained earlier, the Mathis equation allows us to estimate the rate, or so to say the throughput, we can achieve based on the MSS, the RTT and the probability of packet loss on the link.

Imagine we have an E3 link. For those new to WAN technology, an E3 link uses an M3 signaling type as opposed to an E1 which uses a ZM signaling type. Getting back to the line speed, an E3 is the equivalent of 16 * E1 ~= 34.368 Mbps (including management overhead)

1. Line is E3 with a bw of 34.368 Mbps
2. Our endpoint is roughly 3000 km from us
3. We are using a default MSS of 1460
4. We assume the E3 has a packet loss probability of 10⁻³ = 0.1 % (1 packet lost per 1000 packets)

Based on 3000 km, we can assume that the average RTT would be 37.29 ms = 0.03729 s

Mathis Eq : (1460 / 0.03729) * (1 / sqrt(0.001)) ~= 1.24 MBytes/s ~= 9.9 Mbps

Now if we had no packet loss, our throughput would have been limited only by the TCP window

Throughput = TCPWindow / RTT

(65535 / 0.03729) * 8 ~= 14 Mbps

So on a 34 Mbps line over 3000 km, a single connection with a 64K window tops out at about 14 Mbps without loss, and at about 9.9 Mbps once one packet in every 1000 is lost.
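
For readers who want to play with the numbers, here is a small sketch of the same arithmetic. It uses the square-root form of the Mathis equation with the constant taken as 1 (an assumption; the paper derives a constant slightly above 1) and converts bytes to bits, so the results come out in Mbps. The function names are my own.

```python
import math


def estimated_rtt_s(distance_km):
    """Rule of thumb from above: 20 ms of RTT per 1609 km."""
    return distance_km * 0.020 / 1609


def mathis_rate_bps(mss_bytes, rtt_s, loss):
    """Loss-limited upper bound: (MSS / RTT) * (1 / sqrt(p)), in bits per second."""
    return (mss_bytes * 8 / rtt_s) * (1 / math.sqrt(loss))


def window_limited_bps(window_bytes, rtt_s):
    """Window-limited upper bound: one full window per round trip."""
    return window_bytes * 8 / rtt_s


rtt = estimated_rtt_s(3000)
print(f"RTT          : {rtt * 1000:.2f} ms")
print(f"Mathis bound : {mathis_rate_bps(1460, rtt, 1e-3) / 1e6:.1f} Mbps")
print(f"Window bound : {window_limited_bps(65535, rtt) / 1e6:.1f} Mbps")
```

Changing the loss argument shows how quickly the bound collapses: because of the square root, the loss rate has to drop by a factor of 100 to gain a factor of 10 in rate.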

How do you increase the rate?

In a perfect world, you would of course need to improve each variable of the equation: decrease the RTT, decrease the loss probability and increase the MSS (which btw you generally cannot do on the Internet, where paths are almost always limited to an MTU of 1500)

I hope that was informative on how packet loss can affect throughput.

Reference

The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm (1997) http://citeseer.ist.psu.edu/old/mathis97macroscopic.html

Categories: Networking, TCP/IP