Increase the default latency threshold for the apdex portion of the Error Budget

Background

Error Budgets are made up of two components: apdex and error.

For the apdex portion, we use a latency threshold to determine whether a request is fast enough. The threshold is currently 1s.

Not all endpoints have the same latency requirements. We are building the ability for each endpoint to define its own threshold in project &525.

Because this flexibility does not exist yet, stage groups that are being introduced to Error Budgets have asked for a higher default latency threshold, so that endpoints that are permitted to be slower do not count against their budget spend. As improvements are made in each stage group, we can then tighten (lower) the default latency threshold again.

Proposal

When we record latency measurements, we do not record the exact duration of each request. Instead, each duration is counted in a bucket (for monitoring performance and storage reasons). The bucket upper bounds, in seconds, are [0.1, 0.25, 0.5, 1.0, 2.5, 5.0] (found here).
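To illustrate why the candidate thresholds are limited to bucket values, here is a minimal Python sketch of a bucket-based apdex calculation. It is a simplification, not the actual recording rules: the function names `bucket_for` and `apdex_ratio` are hypothetical, the sample durations are invented, and it assumes a request counts as "fast enough" when the bucket it falls into is at or below the threshold.

```python
from bisect import bisect_left

# Bucket upper bounds in seconds, as listed in the proposal.
BUCKETS = [0.1, 0.25, 0.5, 1.0, 2.5, 5.0]


def bucket_for(duration_s):
    """Return the smallest bucket bound >= duration_s, or None if it exceeds every bucket."""
    i = bisect_left(BUCKETS, duration_s)
    return BUCKETS[i] if i < len(BUCKETS) else None


def apdex_ratio(durations_s, threshold_s):
    """Fraction of requests whose bucket is at or below threshold_s (hypothetical helper)."""
    if threshold_s not in BUCKETS:
        # Only bucket bounds are recorded, so other thresholds cannot be evaluated.
        raise ValueError("threshold must be one of the bucket bounds")
    if not durations_s:
        raise ValueError("no requests to evaluate")
    fast = 0
    for d in durations_s:
        bound = bucket_for(d)
        if bound is not None and bound <= threshold_s:
            fast += 1
    return fast / len(durations_s)


# Invented sample durations, for illustration only.
sample = [0.05, 0.3, 0.9, 1.2, 3.0, 6.0]
ratio_at_1s = apdex_ratio(sample, 1.0)  # 3 of 6 requests are within 1s
ratio_at_5s = apdex_ratio(sample, 5.0)  # 5 of 6 requests are within 5s
```

Under these assumptions, a threshold such as 3s could not be evaluated, because the recorded data cannot distinguish a 2.6s request from a 4.9s one.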

We could use 2.5s or 5s as the default latency threshold. The table below shows the effect of using 5s as the default. (We used a 7-day calculation to make the data gathering easier.)

| Stage Group | Current availability (using 7 days) at 1s | Availability (using 7 days) at 5s | Improvement |
|---|---|---|---|
| source_code | 99.9060% 🔴 | 99.9966% | 0.09% |
| access | 99.9330% 🔴 | 99.9743% | 0.04% |
| code_review | 99.5516% | 99.9699% | 0.42% |
| project_management | 99.0660% 🔴 | 99.9421% 🔴 | 0.88% |
| global_search | 97.8226% 🔴 | 99.3450% 🔴 | 1.52% |

(A detailed view of this data can be found in issue #1244 (closed).)
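The Improvement column appears to be the difference between the two availability figures, expressed in percentage points and rounded to two decimals. The short Python sketch below reproduces it from the table values; the names `AVAILABILITY` and `improvement_points` are hypothetical and used only for illustration.

```python
from decimal import Decimal

# (availability at 1s, availability at 5s) in percent, copied from the table.
AVAILABILITY = {
    "source_code": ("99.9060", "99.9966"),
    "access": ("99.9330", "99.9743"),
    "code_review": ("99.5516", "99.9699"),
    "project_management": ("99.0660", "99.9421"),
    "global_search": ("97.8226", "99.3450"),
}


def improvement_points(at_1s, at_5s):
    """Difference in percentage points, rounded to two decimal places."""
    return (Decimal(at_5s) - Decimal(at_1s)).quantize(Decimal("0.01"))


improvements = {
    group: improvement_points(at_1s, at_5s)
    for group, (at_1s, at_5s) in AVAILABILITY.items()
}

for group, points in improvements.items():
    print(f"{group}: +{points} percentage points")
```

Decimal is used instead of float so that the rounding matches the two-decimal figures in the table exactly.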

If this approach is chosen, we recommend using the 5s bucket for the largest impact.

Edited by Rachel Nienaber