How much traffic does canary receive?
The question of how much of our traffic targets canary just came up in engineering allocations. I was under the impression that it is 5%, but looking at our actual haproxy config, that is not exactly the case.
Backend
Here's the ssh backend (since engineering allocations was talking about the gitlab-sshd rollout, which targets this backend):
fe-01-lb-gprd.c.gitlab-production.internal:~$ sudo view /etc/haproxy/haproxy.cfg
backend ssh
mode tcp
balance roundrobin
option splice-auto
timeout server-fin 5s
timeout server 2h
option tcp-check
server shell-gke-us-east1-b 10.221.13.40:2222 weight 100 check inter 3s fastinter 1s downinter 5s fall 3 backup
server shell-gke-us-east1-c 10.221.14.5:2222 weight 100 check inter 3s fastinter 1s downinter 5s fall 3
server shell-gke-us-east1-d 10.221.15.5:2222 weight 100 check inter 3s fastinter 1s downinter 5s fall 3 backup
server gke-cny-ssh 10.216.8.61:2222 weight 5 check inter 3s fastinter 1s downinter 5s fall 3
How do weights work?
We have a weight of 5, but how is that weight actually applied? According to haproxy docs:
weight <weight>
The "weight" parameter is used to adjust the server's weight relative to
other servers. All servers will receive a load proportional to their weight
relative to the sum of all weights, so the higher the weight, the higher the
load.
Pitfall: Backup servers
So we do this, right? 5 / (100+100+100+5) = 1.6%. Wrong!
Each of our haproxies treats the zone it is in as its target (in this case c), and the servers for all other zones are marked backup. This means the other zones will only be used if both the current zone and canary are unavailable.
Surprisingly, I suspect it also means that if one zonal cluster is having trouble, we will start sending all of that zone's traffic to canary.
I believe this was previously discovered by @skarbek and @msmiley.
Prediction
So with this knowledge, if both the zonal cluster and canary are available, canary should receive: 5 / (100+5) = 4.7%.
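The difference between the naive and the backup-aware calculation can be sketched in Python. This is a simplified model, not haproxy's actual scheduler: the Server class and the expected_shares function are hypothetical names. The model assumes that backup servers receive nothing while any non-backup server is up, and that otherwise only the first available backup is used (haproxy's default when option allbackups is not set).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    weight: int
    backup: bool = False
    up: bool = True


def expected_shares(servers):
    """Return each server's expected traffic share under a simplified model."""
    # Non-backup servers that are up share the load in proportion to weight.
    pool = [s for s in servers if s.up and not s.backup]
    if not pool:
        # Assumption: with every non-backup server down, only the first
        # available backup takes traffic (no `option allbackups`).
        pool = [s for s in servers if s.up and s.backup][:1]
    total = sum(s.weight for s in pool)
    in_pool = {s.name for s in pool}
    return {
        s.name: (s.weight / total if s.name in in_pool and total else 0.0)
        for s in servers
    }


# The ssh backend as seen from a load balancer in zone c.
ssh = [
    Server("shell-gke-us-east1-b", 100, backup=True),
    Server("shell-gke-us-east1-c", 100),
    Server("shell-gke-us-east1-d", 100, backup=True),
    Server("gke-cny-ssh", 5),
]

naive = 5 / (100 + 100 + 100 + 5)  # ignores the backup flag
shares = expected_shares(ssh)
print(f"naive: {naive:.2%}, backup-aware: {shares['gke-cny-ssh']:.2%}")
```

In this model, marking shell-gke-us-east1-c as down (up=False) leaves gke-cny-ssh as the only non-backup server, so it would take the full share. That is consistent with the suspicion above about a failing zonal cluster, though the model itself does not prove it.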
Conclusion
We can get per-server statistics from the haproxy admin socket:
iwiedler@fe-01-lb-gprd.c.gitlab-production.internal:~$ echo show stat | sudo socat stdio /run/haproxy/admin.sock | awk -F',' '$1 == "ssh" { print $2, $8 }'
FRONTEND 20421182
sock-1 20421183
shell-gke-us-east1-b 0
shell-gke-us-east1-c 19450195
shell-gke-us-east1-d 0
gke-cny-ssh 972598
BACKEND 20421180
Let's calculate: 972598 / 20421180 = 4.7%. This checks out, and it is close enough to 5%.
Case closed!
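The same arithmetic can be reproduced from the awk output. A minimal Python sketch, assuming the two-column "name count" format printed above, where the second column is field 8 of the stats CSV (the cumulative session total). The helper name session_totals is hypothetical, and the sample is copied from the output above.

```python
def session_totals(awk_output):
    """Parse lines of 'svname stot' into a dict of name -> session total."""
    totals = {}
    for line in awk_output.strip().splitlines():
        name, count = line.split()
        totals[name] = int(count)
    return totals


sample = """
FRONTEND 20421182
sock-1 20421183
shell-gke-us-east1-b 0
shell-gke-us-east1-c 19450195
shell-gke-us-east1-d 0
gke-cny-ssh 972598
BACKEND 20421180
"""

totals = session_totals(sample)
# Share of all backend sessions that were handled by the canary server.
canary_share = totals["gke-cny-ssh"] / totals["BACKEND"]
print(f"canary share of ssh sessions: {canary_share:.2%}")
```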
One last mystery
While investigating this, I looked at metrics for api, and was surprised to see a canary traffic share of ~6% (between 5.3% and 6.4% depending on time of day) (source).
sum(rate(gitlab_workhorse_http_request_duration_seconds_count{env="gprd",type="api",stage="cny"}[1m]))
/
sum(rate(gitlab_workhorse_http_request_duration_seconds_count{env="gprd",type="api"}[1m]))
* 100
This is quite a bit higher than the 4.7% we predicted and measured for ssh.
It turns out that API has an additional canary_api backend, in addition to the api one. Requests are routed to it when the canary cookie is set or when the request path starts with one of the prefixes listed in /etc/haproxy/canary-request-paths.lst, which includes:
/charts
/groups/gitlab-com
/groups/gitlab-org
/groups/gitlab-data
/groups/meltano
/gitlab-com
/gitlab-org
/gitlab-org/www-gitlab-com
/gitlab-data
/meltano
/api/v4/groups/gitlab-com
/api/v4/groups/gitlab-org
/api/v4/projects/gitlab-com
/api/v4/projects/gitlab-org
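As an illustration of this routing rule, here is a minimal Python sketch. It assumes plain string-prefix matching on the request path and represents the cookie check as a boolean. The actual haproxy ACLs are not shown in this write-up, so routes_to_canary_api and its parameters are hypothetical.

```python
# Prefixes copied from the list above.
CANARY_PATH_PREFIXES = [
    "/charts",
    "/groups/gitlab-com",
    "/groups/gitlab-org",
    "/groups/gitlab-data",
    "/groups/meltano",
    "/gitlab-com",
    "/gitlab-org",
    "/gitlab-org/www-gitlab-com",
    "/gitlab-data",
    "/meltano",
    "/api/v4/groups/gitlab-com",
    "/api/v4/groups/gitlab-org",
    "/api/v4/projects/gitlab-com",
    "/api/v4/projects/gitlab-org",
]


def routes_to_canary_api(path, has_canary_cookie=False, prefixes=CANARY_PATH_PREFIXES):
    """Return True if the request would go to canary_api in this simplified model."""
    # Assumption: the cookie check wins regardless of the path.
    if has_canary_cookie:
        return True
    # Assumption: plain string-prefix match against each listed prefix.
    return any(path.startswith(prefix) for prefix in prefixes)


print(routes_to_canary_api("/api/v4/projects/gitlab-org%2Fgitlab"))
print(routes_to_canary_api("/api/v4/projects/278964"))
print(routes_to_canary_api("/api/v4/projects/278964", has_canary_cookie=True))
```

Requests that match neither condition stay on the api backend, where canary only receives its weighted share.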
We can look at how much traffic is sent to the api backend vs canary_api:
iwiedler@fe-01-lb-gprd.c.gitlab-production.internal:~$ echo show stat | sudo socat stdio /run/haproxy/admin.sock | awk -F',' '$1 == "api" || $1 == "canary_api" { print $1, $2, $8 }'
api api-gke-us-east1-b 0
api api-gke-us-east1-c 134459246
api api-gke-us-east1-d 0
api gke-cny-api 6725536
api BACKEND 141140656
canary_api gke-cny-api 1312736
canary_api BACKEND 1311934
For api, the canary share is 6725536 / 141140656 = 4.7%. This confirms that the additional traffic must come from canary_api.
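Combining both backends gives a rough cross-check against the ~6% seen in the workhorse metrics. A minimal Python sketch using the counters above, with one caveat: these are cumulative session counters from a single load balancer, whereas the PromQL query measures request rates across gprd, so the two numbers are only loosely comparable.

```python
# Session totals copied from the `show stat` output above.
api_canary = 6725536        # gke-cny-api within the api backend
api_total = 141140656       # api BACKEND
canary_api_total = 1311934  # canary_api BACKEND, all of which goes to canary

# Canary share from weights alone (api backend only).
weighted_share = api_canary / api_total

# Canary share once the dedicated canary_api backend is included.
combined_share = (api_canary + canary_api_total) / (api_total + canary_api_total)

print(f"api backend only: {weighted_share:.2%}")
print(f"api + canary_api: {combined_share:.2%}")
```

The combined figure comes out at roughly 5.6%, which falls inside the 5.3% to 6.4% range observed in the metrics.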