Until now, routing of public traffic to our front servers was done via some particular servers, called
load-balancers(LB) in our infrastructure. These servers were built with
nginx for SSL-offloading,
varnish for the reverse-proxy part, and
ExaBGP for the communication with our BGP routers.
Our infrastructure’s move to
rkt has lead us to re-consider the cooperation of these components.
For the current design, I was inspired by Vincent Bernat’s excellent post: High availability with ExaBGP
When applied to our own infra, it looked like this:
Each of the
load-balancers is running ExaBGP, which is talking with BGP routers. Config example of one of them:
ExaBGP, we also use its
healthcheck plugin, which is continuously testing the resource and, if available, annoucning some
healthcheck-apex.conf in detail:
LB’s config looks the same, except the VIPs order:
Why this difference ?
It is connected with the way
healthcheck works. For each specified VIP,
healthcheck makes an announce to
ExaBGP, increasing the
med attribute by 1. Example on the first
announce route 18.104.22.168/32 next-hop self med 100 announce route 22.214.171.124/32 next-hop self med 101
It means that each
LB is announcing the same VIP set, but with a different weight (using
med). As a consequence, BGP routers only route traffic for a particular VIP to a single
Just publish these 2 VIPs into DNS for a particular service (apex here), and you’ll obtain a highly available service.
In case of
LB failure, it won’t announce its VIPs anymore. Fortunately, BGP routers know at least another route, towards the second LB, so the service keeps working.
Behaviour with containers
In our new datacenter, we are using
rkt containers exclusively, as the basic run unit. Every servers are the same, and they run only a single operating system:
CoreOS. This infrastructure is massively automated.
From a network point of view, we used this huge change as an opportunity to switch from a
L2 to a
L3 model, since we don’t need VLANs anymore. Network isolation is reached using ACLs that are pre-generated and deployed onto CoreOS (the same principle as AWS security groups).
Therefore, we setup BGP on each level, from edge routers to top-of-rack (ToR) switches. It looks like this:
Then, in each rack, a single
CoreOS server is dedicated to load-balancing. It runs a set of containers (called a
pod), and this pod is in charge of running ExaBGP, and announce the VIPs handled by this LB.
ExaBGP config sample on rack #1:
When running our first tests, we encountered a problem pretty fast: for a particular service, whatever the number of LB instances was, only one LB was receiving all of the traffic.
After troubleshooting, it appears that the behaviour was as expected. Each LB was announcing the same VIPs, with a different
med, and this
med was taken into account by the correspondig ToR.
However, the VIP, when re-announced by the ToR to the upper level (aggregation switches), was loosing its
med attribute. This behaviour is by design in BGP when ASN changes: med is a non-transitive attribute
Edge routers were knowing different routes, but with different as-paths (non-eligble to
bgp-multipath, thus). So they chose the older route, which is online with the behaviour seen: only a single LB (the first started) receives all of the traffic.
We though about several workarounds:
- bgp multihop
- route server
- BGP community values
First idea we had: connect the LBs directly to aggregation switches, and not to ToR. We had to use multihop mode of BGP:
router bgp 65002 neighbor a.b.c.d remote-as 651xx ebgp-multihop 2
Unfortunately, using this solution, next-hop attribute for the received routes was unknown in the IGP of the aggregation layer. Thus, the route was not inserted into RIB.
Next idea was to use a route server. Instead talking to ToR, the LBs would have to talk to aggregation layer in route-server mode. Here again, we were luckless: our aggregation switches are Cisco Nexus 3064, and their implementation of BGP does not include support for the
Another idea: rather than using a non-transitive attribute like
med is, what about using a transitive one: community. Despite this idea was pretty sexy, we dropped it because:
- high complexity of
route-mapsthat we would have to implement on the aggregation layer, to handle this community values properly.
- need to modify the source-code of
healthcheck, in order to use
Since the best idea we had implied to modify
healthcheck’s source code, we though it would be worth writing our own plugin, and to connect it directly to our service directory.
Indeed, in our container oriented infrastructure, orchestration of the containers (instancing, quantity, deployment …) is highly relying on
It appeared to us that
ExaBGP should speak directly with
ZooKeeper to determine what to announce to BGP routers. Accordingly, we chose to code this plugin: ExaZK
- regarding a particular service, say apex, we run 3 LB pods, on 3 different CoreOS servers, located in 3 different racks
- each of these pods is running
- each ExaZK instance announces only one VIP, for whose this instance is authoritative (
- each ExaZK instance writes down to ZK its health state, using an ephemeral node
- when an instance dies, its ephemeral node in ZK goes with it. The other instances catch this event
- the other instances then announce the VIP of the faulty instance, until it comes back online.
A special mode, called
maintenance, can be used to completely put a service offline: should an operator make this special node exists in ZK, the LBs would withdraw all the VIPs.
Last, a local check (same principle as used in
healthcheck) allows every instance of
ExaZK to check itself.
ExaZKis still under testing. If it’s of some interest to you, and you want to modify it to your needs, feel free to fork it on github !