How much traffic does canary receive?

In engineering allocations, the question just came up of how much of our traffic targets canary. I was under the impression that it was 5%. Looking at our actual haproxy config, though, that is not exactly the case.

Backend

Here's the ssh backend (since engineering allocations was talking about the gitlab-sshd rollout, which targets this backend):

fe-01-lb-gprd.c.gitlab-production.internal:~$ sudo view /etc/haproxy/haproxy.cfg

backend ssh
    mode tcp
    balance roundrobin
    option splice-auto
    timeout server-fin 5s
    timeout server 2h
    option tcp-check
    server shell-gke-us-east1-b 10.221.13.40:2222 weight 100 check inter 3s fastinter 1s downinter 5s fall 3 backup
    server shell-gke-us-east1-c 10.221.14.5:2222 weight 100 check inter 3s fastinter 1s downinter 5s fall 3
    server shell-gke-us-east1-d 10.221.15.5:2222 weight 100 check inter 3s fastinter 1s downinter 5s fall 3 backup
    server gke-cny-ssh 10.216.8.61:2222 weight 5 check inter 3s fastinter 1s downinter 5s fall 3

How do weights work?

The canary server has a weight of 5, but how is that weight actually applied? According to the haproxy docs:

weight <weight>
 The "weight" parameter is used to adjust the server's weight relative to
 other servers. All servers will receive a load proportional to their weight
 relative to the sum of all weights, so the higher the weight, the higher the
 load.

Pitfall: Backup servers

So we do this, right? 5 / (100+100+100+5) = 1.6%. Wrong!

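Each of our haproxies treats the zone it is in (in this case c) as its primary target and marks the servers of all other zones as backup. haproxy only sends traffic to backup servers when no non-backup server is available, so the other zones will only be used if both the current zone and canary are unavailable.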

Surprisingly, I suspect it also means that if one zonal cluster is having trouble, we will start sending all of that zone's traffic to canary.

I believe this was previously discovered by @skarbek and @msmiley.

Prediction

So with this knowledge, if both the zonal cluster and canary are available, only their two weights count, and canary should receive: 5 / (100+5) ≈ 4.76%.
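The prediction, and the suspected failure case from the previous section, can be sketched as a small model. This is a simplification under stated assumptions, not haproxy's actual scheduler: non-backup servers that are up split traffic by weight, and backups are only considered when no non-backup server is up (in which case only the first available backup is used, as is the case without `option allbackups`). The `Server` class and the function names are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    weight: int
    backup: bool = False
    up: bool = True


def traffic_shares(servers):
    """Return {server name: share of traffic} under the simplified model."""
    active = [s for s in servers if s.up and not s.backup]
    backups = [s for s in servers if s.up and s.backup]
    # Backups only come into play when no non-backup server is up.
    pool = active if active else backups[:1]
    total = sum(s.weight for s in pool)
    pool_names = {s.name for s in pool}
    return {
        s.name: (s.weight / total if s.name in pool_names else 0.0)
        for s in servers
    }


def ssh_backend(zone_up=True, canary_up=True):
    """The ssh backend as seen from a haproxy node in zone c."""
    return [
        Server("shell-gke-us-east1-b", 100, backup=True),
        Server("shell-gke-us-east1-c", 100, up=zone_up),
        Server("shell-gke-us-east1-d", 100, backup=True),
        Server("gke-cny-ssh", 5, up=canary_up),
    ]


normal = traffic_shares(ssh_backend())
zone_down = traffic_shares(ssh_backend(zone_up=False))

print(f"normal:    canary gets {normal['gke-cny-ssh']:.2%}")     # 4.76%
print(f"zone down: canary gets {zone_down['gke-cny-ssh']:.2%}")  # 100.00%
```

In this model, losing the zonal cluster while canary stays healthy sends all of that node's ssh traffic to canary, which matches the suspicion above.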

Conclusion

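We can get per-server statistics from the haproxy admin socket. The awk filter below prints column 2 (the server name) and column 8 (stot, the cumulative number of sessions) for the ssh backend: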

iwiedler@fe-01-lb-gprd.c.gitlab-production.internal:~$ echo show stat | sudo socat stdio /run/haproxy/admin.sock | awk -F',' '$1 == "ssh" { print $2, $8 }'

FRONTEND 20421182
sock-1 20421183
shell-gke-us-east1-b 0
shell-gke-us-east1-c 19450195
shell-gke-us-east1-d 0
gke-cny-ssh 972598
BACKEND 20421180

Let's calculate: 972598 / 20421180 ≈ 4.76%. This matches the prediction, and it is close enough to 5%.
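The same calculation can be done programmatically. The sketch below parses the two-column output shown above, which is pasted in as a string rather than read from the admin socket; `parse_counts` and `canary_share` are hypothetical helper names.

```python
SAMPLE = """\
FRONTEND 20421182
sock-1 20421183
shell-gke-us-east1-b 0
shell-gke-us-east1-c 19450195
shell-gke-us-east1-d 0
gke-cny-ssh 972598
BACKEND 20421180
"""


def parse_counts(text):
    """Map each server name to its cumulative session count."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            counts[parts[0]] = int(parts[1])
    return counts


def canary_share(counts, canary_name):
    """Canary sessions as a fraction of all sessions on the backend."""
    return counts[canary_name] / counts["BACKEND"]


counts = parse_counts(SAMPLE)
share = canary_share(counts, "gke-cny-ssh")
print(f"{share:.2%}")  # 4.76%
```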

Case closed!

One last mystery

While investigating this, I looked at metrics for api and was surprised to see a canary traffic share of ~6%, ranging from 5.3% to 6.4% depending on the time of day (source).

sum(rate(gitlab_workhorse_http_request_duration_seconds_count{env="gprd",type="api",stage="cny"}[1m]))
/
sum(rate(gitlab_workhorse_http_request_duration_seconds_count{env="gprd",type="api"}[1m]))
* 100

This is quite a bit higher than the 4.7% we predicted and measured for ssh.

It turns out that API has an additional canary_api backend alongside the api one. Requests are routed to it when the canary cookie is set or when the request path starts with one of the prefixes in /etc/haproxy/canary-request-paths.lst, which includes:

/charts
/groups/gitlab-com
/groups/gitlab-org
/groups/gitlab-data
/groups/meltano
/gitlab-com
/gitlab-org
/gitlab-org/www-gitlab-com
/gitlab-data
/meltano
/api/v4/groups/gitlab-com
/api/v4/groups/gitlab-org
/api/v4/projects/gitlab-com
/api/v4/projects/gitlab-org
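To make the routing rule concrete, here is a minimal sketch of the decision. The actual haproxy ACL is not shown in this post, so the sketch assumes plain string-prefix matching (similar in spirit to haproxy's `path_beg`) and takes the cookie check as a boolean; `routes_to_canary` is a hypothetical helper, not part of our config.

```python
CANARY_PATH_PREFIXES = [
    "/charts",
    "/groups/gitlab-com",
    "/groups/gitlab-org",
    "/groups/gitlab-data",
    "/groups/meltano",
    "/gitlab-com",
    "/gitlab-org",
    "/gitlab-org/www-gitlab-com",
    "/gitlab-data",
    "/meltano",
    "/api/v4/groups/gitlab-com",
    "/api/v4/groups/gitlab-org",
    "/api/v4/projects/gitlab-com",
    "/api/v4/projects/gitlab-org",
]


def routes_to_canary(path, canary_cookie_set=False,
                     prefixes=CANARY_PATH_PREFIXES):
    """Return True if the request would be sent to the canary backend.

    Assumption: a request goes to canary when the canary cookie is set
    or when its path starts with one of the listed prefixes.
    """
    if canary_cookie_set:
        return True
    return any(path.startswith(prefix) for prefix in prefixes)


print(routes_to_canary("/api/v4/projects/gitlab-org%2Fgitlab"))  # True
print(routes_to_canary("/api/v4/projects/123"))                  # False
print(routes_to_canary("/api/v4/projects/123", canary_cookie_set=True))  # True
```

Under this assumption, a request is sent to canary regardless of the backend weights as soon as the cookie or one of the prefixes matches.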

We can look at how much traffic is sent to the api backend vs canary_api:

iwiedler@fe-01-lb-gprd.c.gitlab-production.internal:~$ echo show stat | sudo socat stdio /run/haproxy/admin.sock | awk -F',' '$1 == "api" || $1 == "canary_api" { print $1, $2, $8 }'

api api-gke-us-east1-b 0
api api-gke-us-east1-c 134459246
api api-gke-us-east1-d 0
api gke-cny-api 6725536
api BACKEND 141140656
canary_api gke-cny-api 1312736
canary_api BACKEND 1311934

For api, the canary share is 6725536 / 141140656 ≈ 4.77%, the same weight-based share as for ssh. The additional canary traffic seen in the metrics must therefore come from canary_api.
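As a rough cross-check, the two backends can be combined into an overall canary share. This is only a sketch: the haproxy counters are cumulative session counts from a single load balancer node, while the Prometheus query above measures workhorse request rates, so the two numbers are not expected to match exactly.

```python
# Session counts taken from the "show stat" output above.
api_canary = 6725536        # api backend, gke-cny-api
api_total = 141140656       # api backend, BACKEND
canary_api_total = 1312736  # canary_api backend, gke-cny-api

# Share from weights alone, within the api backend.
weighted_share = api_canary / api_total

# Overall share once cookie- and path-routed traffic is included.
combined_share = (api_canary + canary_api_total) / (api_total + canary_api_total)

print(f"weight-based share: {weighted_share:.2%}")
print(f"combined share:     {combined_share:.2%}")
```

The combined share comes out at roughly 5.6%, which falls inside the 5.3% to 6.4% range seen in the metrics.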

Edited Jun 01, 2022 by Igor