
Enabling horizontal autoscaling with co-operative distributed rate limiting

Monzo runs almost 3,000 microservices to power everything from card payments to peer-to-peer transaction emoji reactions. A core part of being able to run all these systems and features safely is rate limiting.

We use rate limits in countless places at Monzo. Whether it’s third parties using our Open Banking APIs, controlling how quickly people can Get Paid Early, processing orders for Monzo Investments or paying out accrued interest for Savings Pots, we couldn’t do any of it safely without rate limits. This means that how we configure and maintain these rate limits over time really matters. In this blog post we describe how that led us to what we call “distributed rate limiting.”

We started with per-replica rate limits

All our services are orchestrated on Kubernetes. All services run as a replicated set with a minimum of three replicas (but normally more) for high availability. This means that any configured rate limit needs to be divided across all the replicas of a given service. In the early days we deliberately started with a simple solution to this problem.

We set rate limits on a per-replica basis. For example, in our Kafka library, engineers could set a rate limit on queue consumption by setting a configuration key named eventsPerSecondPerReplica for a topic+consumer group combination. This is easy and convenient, but it has three key downsides that become more problematic over time. 
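To make that concrete, a per-replica limit boils down to each replica running its own independent in-memory limiter with no knowledge of its siblings. Here’s a minimal sketch of that model using golang.org/x/time/rate (the configuration key name is from the example above; everything else is illustrative):

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

func main() {
	// eventsPerSecondPerReplica, as read from configuration. Every replica
	// applies the same value independently, so the effective total limit is
	// this value multiplied by however many replicas are running.
	eventsPerSecondPerReplica := 50.0

	limiter := rate.NewLimiter(rate.Limit(eventsPerSecondPerReplica), 1)

	// consume is a stand-in for the Kafka consumer loop for a topic+consumer group.
	for event := range consume() {
		if err := limiter.Wait(context.Background()); err != nil {
			return
		}
		fmt.Println("processing", event)
	}
}

func consume() <-chan string {
	ch := make(chan string, 3)
	ch <- "event-1"
	ch <- "event-2"
	ch <- "event-3"
	close(ch)
	return ch
}
```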

  1. You can’t vary the number of replicas, because each new replica gets the same per-replica allocation and so increases your total rate limit. We like our services to autoscale horizontally by default, so any service that needed a rate limit had to opt out of this. This isn’t ideal.

  2. There is no concept of ramping up. If you are under-utilizing your rate limit, you don’t deal well with abrupt increases in usage because the rate limiter will just let you spike to the top limit immediately. Autoscaling systems and instant surges don’t typically go well together. Consider a batch job starting, for example: usage goes from 0 to 100% of the limit basically immediately.

  3. A rate limit is set at a fixed point in time, but Monzo grows. A limit that was appropriate for 5m users may not be appropriate for 10m users (and a platform that will also have scaled up in that time). Because these limits are manually updated, there is no systematic way to spot that. We tend to notice when someone gets paged because some process has become too slow. Not ideal.

As a first improvement we decided to tackle just (1). We knew that if we took an extensible approach we could come back to address (2) and (3) later. The idea being that if you have some intermediary rate limiting system, it can become more sophisticated over time.

We focused on low cardinality, high throughput rate limiting

Being able to vary the number of replicas for all services at Monzo is a good goal to aim for. It means we can scale down to save money and, more importantly, scale up to meet demand. To eliminate the class of services opting out because they used rate limiters, we had to nail down what kind of rate limiting problem we had to solve. In general, we found that we can split rate limiting problems at Monzo into two cardinality buckets: high and low. High cardinality rate limiting problems are those where we have a high number of unique rate limiters, for example rate limiting based on unique IP addresses. Low cardinality is the exact opposite. A good example of this is rate limiting queue consumption: you need just a single rate limiter for a given topic+consumer combination.

We then also identified that most high cardinality rate limiters are relatively low throughput, whereas most low cardinality rate limiters are high throughput in comparison. A queue processing thousands of events per second is common. The low cardinality, high throughput combination is the one we see most often at Monzo. The eventsPerSecondPerReplica example from above is exactly this. As a result, we decided to focus our efforts there.

We settled on a co-operative rate limiting architecture

With the problem defined we immediately considered some of the classic out-of-process rate limiting solutions. Using Redis counter-based rate limiters comes to mind in particular. This quickly felt like it might not be a good fit though. Two things in particular stood out:

  1. Using centralized rate limiters is expensive for what we’re trying to do. A round trip to Redis for each rate limiter check adds up quickly. The classic way to make this more efficient is batching, but that is not a natural model for most of the places we want to rate limit at Monzo. Since high throughput is important for us, this didn’t feel ideal.

  2. We’re not operating in an adversarial environment. This could have been a reason to go with a centralized limiter despite the cost. But for us there is no reason a server needs to make the rate limiting decision. Maybe trusting our clients can unlock a different, more scalable approach.

As a side note, the Redis approach probably would work really well for the high cardinality, low throughput use case, but we don’t go into that further here.
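For illustration only, the classic centralized approach we decided against looks roughly like this: each check is an atomic counter increment in Redis keyed by limiter name and time window, so every rate limiting decision costs a network round trip. A rough fixed-window sketch (not code we run):

```go
package ratelimit

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// allow is a fixed-window counter: one Redis round trip per check, which is
// exactly the per-event cost that makes this a poor fit for high throughput
// queue consumption.
func allow(ctx context.Context, rdb *redis.Client, limiter string, limit int64) (bool, error) {
	window := time.Now().Unix() // one-second fixed windows
	key := fmt.Sprintf("ratelimit:%s:%d", limiter, window)

	count, err := rdb.Incr(ctx, key).Result()
	if err != nil {
		return false, err
	}
	if count == 1 {
		// First request in this window: expire the key so old windows clean up.
		rdb.Expire(ctx, key, 2*time.Second)
	}
	return count <= limit, nil
}
```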

With this conclusion in mind, we searched more widely for other approaches. We quickly found Doorman and while it didn’t entirely meet our requirements (in how the limits are configured and its dependencies) it served as inspiration for the approach we ultimately settled on. 

Doorman trusts its clients. This means that it helps divide up the “capacity” for a given rate limiter centrally (server-side) but then trusts the clients to actually respect the rate limits they are given. This idea works really well for Monzo precisely because we wanted to focus on high throughput. It eliminates the network round trip on the hot path of asking the rate limiter if a process can proceed. We came to think of this client-centric approach as “co-operative rate limiting”: the client co-operates with the server that divides up the capacity by actually respecting what it is given.

service.distrate intermediates capacity between clients

So clients will enforce the rate limit they are given. This leaves the big open question: how will they know what their rate limit actually is? This is where service.distrate comes in. It is the server whose job it is to know about all clients and to distribute capacity between them in a way that never exceeds the configured limit. Exactly how it does this is quite involved, so we won’t cover every detail here but we'll illustrate it at a high level. 

Fun fact: Go’s standard library has a package called sync containing its synchronization primitives, so our distributed version of that is called distsync. We decided to do the same with rate, getting us to distrate.

The core idea is that clients are expected to poll service.distrate every so often to update the capacity for the rate limiters they have configured. Through this polling, service.distrate eventually discovers all the clients that care about a certain limit and distributes the total capacity accordingly. While it is still discovering clients it does its best, giving some too much (and others too little), but never going over the total capacity. Once all clients have polled and are known, this quickly settles into an equilibrium. The clients then use this capacity to update a local, in-memory rate limiter (we use golang.org/x/time/rate), which is what gets called on the hot path to rate limit. The capacity itself is given in the form of a lease, so service.distrate can deal with clients that disappear (and more), and includes a refresh interval that tells the client on what cadence to poll.
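In Go terms, the client side amounts to a background loop that refreshes a local token bucket with whatever service.distrate last granted, while the hot path only ever touches the in-memory limiter. A simplified sketch (the poll function and the grant fields are illustrative stand-ins, not our actual RPC interface):

```go
package distrate

import (
	"context"
	"time"

	"golang.org/x/time/rate"
)

// grant is an illustrative stand-in for what service.distrate hands back: a
// per-second capacity, how long the lease is valid for, and how often to poll.
type grant struct {
	PerSecond       float64
	Lease           time.Duration
	RefreshInterval time.Duration
}

type limiter struct {
	local *rate.Limiter
	poll  func(ctx context.Context) (grant, error) // hypothetical call to service.distrate
}

// run is the background refresh loop: poll, apply the granted capacity to the
// local limiter, then sleep until the server-suggested refresh interval.
func (l *limiter) run(ctx context.Context) {
	interval := 10 * time.Second // default cadence until the server tells us otherwise
	for {
		if g, err := l.poll(ctx); err == nil {
			l.local.SetLimit(rate.Limit(g.PerSecond))
			interval = g.RefreshInterval
			// In the real system the lease expiry and the 'safe capacity'
			// fallback also matter here; omitted to keep the sketch short.
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(interval):
		}
	}
}

// Wait is the hot path: purely in-memory, no network round trip.
func (l *limiter) Wait(ctx context.Context) error {
	return l.local.Wait(ctx)
}
```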

For convenience all the client-side interaction, including the local rate limiters themselves, is wrapped in a library, libraries/distrate. This library is all someone needs to get a fully working distributed rate limiter.

The interaction between these two components can be roughly summarized as the following polling process:

A service (service.balance) asks service.distrate for capacity on a rate limit configured at 100 in total. service.distrate responds: you get 25 per second, with a lease of 5 minutes, and check in every 10 seconds.

Without going into too much detail, as that would likely be closer to a whitepaper, a few additional things are interesting to call out:

  • This solution doesn’t couple us to the Kubernetes API. For a while we considered using that to discover the number of clients out there, but it felt like an uncomfortable coupling and something that is hard to keep consistent.

  • Similar to Doorman, this system aggressively prioritizes availability. In addition to lease length and refresh interval, responses also include a ‘safe capacity’ which the client is expected to fall back to if it can’t talk to service.distrate before its lease expires, rather than stopping entirely.

  • The algorithm for actually splitting up capacity is called EvenShare. It simply takes the total capacity and divides it by the number of clients. If that much capacity isn’t left unallocated, it allocates zero, expecting that by the next polling interval other clients will have made space available. Otherwise it allocates the even share. It’s very naive, but a nice place to start, and we can do more here (there’s a small sketch of this after the list).

  • Clients pass up their own total capacity, as we wanted the ownership of that data to live with service owners rather than in our central service. We use the time at which the value was read from our configuration system as a fencing token, so replicas that disagree can’t take the value backwards in time.

  • service.distrate stores the entire state of the rate limiting world in memory. It’s a small dataset and this makes it incredibly cheap to operate. For this to work we do need leadership election, so one replica can make all the decisions knowing it has all the information. Luckily this was relatively straightforward using our distributed locking system. A fresh leader needs to build up its picture of the world by waiting for clients to poll. This was also inspired by Doorman; you can read more about “learning mode” and how this works there.
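To make the EvenShare description above concrete, here’s a minimal sketch of that division logic (illustrative, not the production implementation):

```go
package distrate

// evenShare divides the total capacity evenly between the known clients. If
// that even share isn't currently left unallocated, it hands out zero,
// trusting that other clients will have freed up capacity by the next poll.
func evenShare(totalCapacity, alreadyAllocated float64, numClients int) float64 {
	if numClients == 0 {
		return 0
	}
	share := totalCapacity / float64(numClients)
	if totalCapacity-alreadyAllocated < share {
		return 0
	}
	return share
}
```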

This gives us the platform to further improve how we rate limit

The beauty of a centralized service, written by us, is that we can make the business logic more sophisticated to work better for Monzo over time. For example, we can solve problems (2) and (3) should we decide the time is right. The algorithm that does the capacity allocation could be updated to include a “rate ramp up” so we never increase the allocated capacity above a certain rate, removing the immediate ramp-up issue.
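As a purely hypothetical illustration (nothing we’ve built yet), such a ramp would cap how much a client’s granted capacity may grow between polls:

```go
package distrate

// rampedAllocation is a hypothetical sketch of a "rate ramp up": never let a
// client's granted capacity grow by more than maxIncreasePerPoll between
// polling intervals, even if its even share is much higher. Shrinking the
// capacity still takes effect immediately.
func rampedAllocation(previous, target, maxIncreasePerPoll float64) float64 {
	if target <= previous {
		return target
	}
	ramped := previous + maxIncreasePerPoll
	if ramped > target {
		return target
	}
	return ramped
}
```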

Similarly, the allocation logic could scale limits with the number of customers Monzo has, so configured limits don’t age. Other things we’ve thought about include improving the algorithm to redistribute capacity from clients that aren’t using theirs to clients that are using all of theirs, and varying the refresh interval so limiters whose total capacity is mid-change are asked to poll faster and thus settle on their end state sooner.

We’ll probably look at distributing capacity based on actual usage next, as the “even share” algorithm is the thing causing the most frustration and confusion for engineers today. But by and large we’ve found this system delightful to use. It’s nice to define a reason to opt out of horizontal autoscaling out of existence.