Humans who can RPC: securing staff access to 2000 microservices


At Monzo, we run over 2000 services in a microservices architecture. These services talk to each other through remote procedure calls (RPCs). Occasionally, engineers at Monzo need to use RPCs directly.

With over 6000 RPCs that are constantly changing, we need to manage access to them in a way that’s simple to configure and scales well with our growth.

In the Security Platform team, we help engineers across Monzo ship products that are secure by default, so this falls within our remit! We’ve recently tackled this problem by letting engineers define permissions for RPCs directly in proto files.

Our microservices architecture

Our microservices are self-contained units of software that can be independently created, changed and deployed.

Each microservice at Monzo has a number of remote procedure calls (RPCs). RPCs are the different things you can ask a microservice to do. At the code level, we use protobuf files in every service to describe the RPCs that are available.

For example, our account service has an RPC for creating a new account. It returns information about the newly created account, including a unique ID.

Here’s an extract from the protobuf definition for the account service:

An extract from the protobuf definition for the account service showing a few RPCs and their definitions.

Monzo engineers sometimes need to use RPCs directly

At Monzo, our applications typically talk to our platform through API services. These applications include internal tooling, customer-facing web apps and our mobile apps. API services typically make one or more RPCs to the core microservices in our platform, which may themselves call other services.

A diagram showing the Monzo mobile app requesting a list of feed items from service.api.feed, which in turn makes RPCs to core services like service.pot, service.feed, and service.transaction.

RPCs in our platform are primarily used for service-to-service communication, so microservice A may call an RPC on microservice B to help complete a task.

Whilst RPCs are typically used for service-to-service communication, engineers across Monzo occasionally need to call them directly, rather than going through APIs. Engineers typically make these calls ad hoc when building out new features or debugging technical issues.

Our challenge is managing and scaling RPC permissions

As we’ve grown our engineering organisation, it’s become increasingly difficult to manage engineering access to the ~6000 RPCs in our services. The challenge is to design an access control system that supports our use cases and scales well with our growth.

A scalable solution means two things for us. Firstly, from a security perspective, we need to be confident that the RPC permissions configuration is accurate. No single engineer should have the ability to abuse our systems or data, regardless of the RPCs they need day to day.

Secondly, engineers should be able to use the RPCs they need day to day with as little friction as possible. The management of permissions should be sufficiently easy to understand and configure that any blockers can be resolved easily.

Defining permissions alongside RPC definitions

Previously, the security team maintained and configured lists of RPCs deemed safe for direct usage, along with the level of access required to run them. This was a simple system in itself. However, over time it became ineffective at accurately managing access because:

  • there was poor visibility on access configuration - it was easy for engineers to forget to properly configure access for new RPCs

  • configuring the access added friction to engineers’ workflow; you’d need to change a separate system outside the context of your pull request

Our solution to these problems is centred on defining permissions for the RPCs alongside their definitions. We’ve put permissions directly in proto files by introducing a custom field called humans_who_can_rpc. When engineers are writing new RPC handlers, they configure permissions right alongside them.

For example:

A code snippet showing a sample RPC with an additional option to configure the permissions at the same time.

In this case, the ReadFoo RPC is configured with the CONTRIBUTORS_WITH_APPROVAL option. This means that any engineer that’s an owner of the service can RPC it, with approval from another owner. We’ll dive more into what this means later, along with the other possible configuration options.

Putting RPC permissions directly in our proto files gives us several key benefits:

  • Reduces risk of misconfiguration: defining permissions alongside the RPCs themselves keeps the two in sync

  • Service owners are empowered to make their own security decisions: teams can change permissions with a simple pull request that doesn’t require a review from the security team

  • Fewer bottlenecks: the availability of security engineers to review RPC permission changes doesn’t hold up work for service authors

Standard approaches encourage keeping permissions configuration separate from the resources it describes. We deviate from this here by embedding this information in the proto file of each service. This is not necessarily a bad thing! In our case, it's worth paying the cost of coupling the two.

As an alternative, we also considered doubling down on defining permissions in a separate central system. This would mean the configuration of permissions is fundamentally separated from the workflow of an engineer that is creating or modifying RPC handlers. Similarly, permissions in a separate system are less visible to engineers by default.

Pre-baked authorisation modes

When working on our services, engineers choose the humans_who_can_rpc option based on several pre-baked authorisation modes, like the CONTRIBUTORS_WITH_APPROVAL mode above.

We expect engineers to regularly tweak these when adding or changing RPC handlers. It’s important the configuration choices are as simple as possible.

In the end, we settled on five high-level modes that engineers can choose from when configuring RPCs:

  • ENGINEERS_WITHOUT_APPROVAL: all engineers at Monzo can call the RPC outright (no approval needed)

  • ENGINEERS_WITH_APPROVAL: all engineers at Monzo can call the RPC with multi-party authorisation approval from any other engineer

  • CONTRIBUTORS_WITHOUT_APPROVAL: only contributors to the service can call the RPC outright

  • CONTRIBUTORS_WITH_APPROVAL: only contributors to the service can call the RPC, and only with multi-party authorisation approval from another contributor

  • BREAK_GLASS_ONLY: most engineers cannot call the RPC at all because it’s locked down. Security on-call engineers can call the RPC with multi-party authorisation in the context of an incident

Each of these is a rule that dynamically describes who has access, rather than a static list of teams or people. These modes strike a good balance between making the developer experience intuitive, while also encouraging least-privileged access.
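To make the shape of these rules concrete, here is a minimal Go sketch that models the five modes as an enumeration with two derived properties: who the mode applies to, and whether approval is needed. The type and method names (Mode, Audience, RequiresApproval) are hypothetical and chosen for illustration; they are not taken from Monzo's codebase.

```go
package main

import "fmt"

// Mode mirrors the five pre-baked humans_who_can_rpc modes listed above.
// The Go identifiers are illustrative assumptions.
type Mode int

const (
	EngineersWithoutApproval Mode = iota
	EngineersWithApproval
	ContributorsWithoutApproval
	ContributorsWithApproval
	BreakGlassOnly
)

// Audience is a hypothetical label for the group a mode applies to.
type Audience string

const (
	AllEngineers   Audience = "all engineers"
	Contributors   Audience = "service contributors"
	SecurityOnCall Audience = "security on-call"
)

// Audience returns the group that may call an RPC under this mode.
func (m Mode) Audience() Audience {
	switch m {
	case EngineersWithoutApproval, EngineersWithApproval:
		return AllEngineers
	case ContributorsWithoutApproval, ContributorsWithApproval:
		return Contributors
	default:
		// Unknown values are treated like the most restrictive mode.
		return SecurityOnCall
	}
}

// RequiresApproval reports whether multi-party authorisation is needed.
func (m Mode) RequiresApproval() bool {
	switch m {
	case EngineersWithoutApproval, ContributorsWithoutApproval:
		return false
	default:
		return true
	}
}

func main() {
	for m := EngineersWithoutApproval; m <= BreakGlassOnly; m++ {
		fmt.Printf("mode %d: audience=%s approval=%v\n", m, m.Audience(), m.RequiresApproval())
	}
}
```

Because each mode is a rule rather than a list of people, nothing in this sketch needs to change when engineers join, leave or move teams.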

Requiring approval for more sensitive RPCs

RPCs are typically for service-to-service communication. It’s common for them to expose behaviour that’s safe and expected when used by another piece of software, but is risky for an engineer to use directly.

Sometimes, engineers need to call RPCs like this. We don’t want any engineer to be able to use them outright, but it should be possible with some level of approval. This is where multi-party authorisation comes in.

Multi-party authorisation for RPC permissions

Our multi-party authorisation system lets staff perform actions with approval from another suitable staff member. We already use multi-party authorisation widely across Monzo in contexts other than RPCs.

For example, staff that need to credit customers money in specific contexts can do so with approval from another appropriate member of staff.

Multi-party authorisation is a good fit for RPCs because it lets engineers use somewhat sensitive RPCs day to day, without blocking them completely. The _WITH_APPROVAL options let engineers require multi-party authorisation approval before an RPC can be used.
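The core idea of multi-party authorisation can be sketched in a few lines of Go. This is an illustrative model only: the ApprovalRequest type, its fields and the error values are hypothetical, and the suitable callback stands in for whatever rule decides who may approve a given action.

```go
package main

import (
	"errors"
	"fmt"
)

// ApprovalRequest is a hypothetical multi-party authorisation request.
type ApprovalRequest struct {
	Requester string
	Action    string // for example, the RPC the requester wants to call
	Approver  string // empty until the request is approved
}

var (
	ErrSelfApproval    = errors.New("requester cannot approve their own request")
	ErrUnsuitable      = errors.New("approver is not suitable for this action")
	ErrAlreadyApproved = errors.New("request is already approved")
)

// Approve records approval from a second person. The approver must be
// someone other than the requester and must pass the suitable check.
func (r *ApprovalRequest) Approve(approver string, suitable func(string) bool) error {
	switch {
	case r.Approver != "":
		return ErrAlreadyApproved
	case approver == r.Requester:
		return ErrSelfApproval
	case !suitable(approver):
		return ErrUnsuitable
	}
	r.Approver = approver
	return nil
}

// Approved reports whether a second person has signed off.
func (r *ApprovalRequest) Approved() bool { return r.Approver != "" }

func main() {
	owners := map[string]bool{"alice": true, "bob": true}
	isOwner := func(user string) bool { return owners[user] }

	req := &ApprovalRequest{Requester: "alice", Action: "service.foo/ReadFoo"}
	fmt.Println(req.Approve("alice", isOwner)) // rejected: self-approval
	fmt.Println(req.Approve("bob", isOwner))   // accepted: returns nil
	fmt.Println(req.Approved())
}
```

Passing the suitability rule in as a function keeps the request type the same whether the approver must be any engineer or a fellow contributor.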

Restricting access to teams using the contributors modes

The CONTRIBUTORS_* modes let us configure RPCs that should be callable by anyone who is responsible for developing the service. A “contributor” is any engineer who is a member of the service’s owning team.

Tying RPC permissions to service ownership is especially useful for us because being an owner of a service is a strong signal for needing to RPC it. If you're an owner of a service, then it's likely that:

  • you have context about the service

  • you're responsible for building and changing the service over time

We’ve found the CONTRIBUTORS_* modes to be widely applicable across our services. These modes also scale well with our growth. The permissions remain accurate even as service ownership changes over time.

Pre-baked modes cater for most of our needs, while keeping things simple

The range of pre-baked modes has catered well for most of our RPCs, while also being simple and intuitive for engineers to understand.

The pre-baked modes mean we likely over-provision permissions in some cases. For example, an RPC set to ENGINEERS_WITH_APPROVAL might not be used by every engineer across Monzo.

This is an acceptable trade-off for the simplicity we get in return. Our approach is centred on the belief that for access control, simplicity is a dominating factor in how well access is managed in practice. An approach that forces low-level fine-grained provisioning as the default can result in people choosing to over-provision access anyway.

Technical implementation

Our RPC authorisation system is responsible for making authorisation decisions for engineer-initiated RPCs. It’s called by our gateway, the internal entry point into our platform that allows engineers to call RPCs through our developer tooling.

Like most of our platform, the RPC authorisation system is made of standard microservices. Significantly, it builds on top of our core authorisation system. Our core authorisation system lets us grant users roles and provides support for multi-party authorisation.

At a high level, the RPC authorisation system uses a reflection endpoint on the target service to read out the humans_who_can_rpc option for the RPC being run. It then asks our core authorisation system whether the user is authorised to call the RPC, given that option. There are three possible outcomes:

  • the RPC is authorised and the engineer can call the RPC. The gateway should run it and return the result to the end user. The authorisation can be either outright, or through an approved multi-party authorisation request

  • the RPC is unauthorised and the engineer cannot call the RPC in any circumstances

  • the RPC is unauthorised, but it can be called by the engineer with multi-party authorisation. The engineer is then prompted to create a multi-party authorisation request, and seek approval from another suitable engineer. This is used for all RPCs that are configured with one of the _WITH_APPROVAL modes
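The three outcomes above can be sketched as a single decision function. This is a simplified Go model under stated assumptions, not the real system: the Caller fields are hypothetical stand-ins for facts the core authorisation system would hold, every caller is assumed to be an engineer, and BREAK_GLASS_ONLY is treated as approval-gated for security on-call engineers (the incident-context check is left out). Failing closed on an unknown mode is also an assumption.

```go
package main

import "fmt"

type Mode string

const (
	EngineersWithoutApproval    Mode = "ENGINEERS_WITHOUT_APPROVAL"
	EngineersWithApproval       Mode = "ENGINEERS_WITH_APPROVAL"
	ContributorsWithoutApproval Mode = "CONTRIBUTORS_WITHOUT_APPROVAL"
	ContributorsWithApproval    Mode = "CONTRIBUTORS_WITH_APPROVAL"
	BreakGlassOnly              Mode = "BREAK_GLASS_ONLY"
)

type Outcome string

const (
	Authorised    Outcome = "authorised"
	Denied        Outcome = "denied"
	NeedsApproval Outcome = "needs multi-party approval"
)

// Caller holds hypothetical facts about the engineer making the call.
type Caller struct {
	IsContributor      bool // member of the target service's owning team
	IsSecurityOnCall   bool
	HasApprovedRequest bool // an approved multi-party request exists for this call
}

// Decide maps a mode and a caller onto one of the three outcomes.
func Decide(mode Mode, c Caller) Outcome {
	var eligible, needsApproval bool
	switch mode {
	case EngineersWithoutApproval:
		eligible = true
	case EngineersWithApproval:
		eligible, needsApproval = true, true
	case ContributorsWithoutApproval:
		eligible = c.IsContributor
	case ContributorsWithApproval:
		eligible, needsApproval = c.IsContributor, true
	case BreakGlassOnly:
		eligible, needsApproval = c.IsSecurityOnCall, true
	default:
		return Denied // unknown or missing mode: fail closed (assumption)
	}
	switch {
	case !eligible:
		return Denied
	case needsApproval && !c.HasApprovedRequest:
		return NeedsApproval
	default:
		return Authorised
	}
}

func main() {
	owner := Caller{IsContributor: true}
	fmt.Println(Decide(ContributorsWithApproval, owner))    // prompts for approval
	fmt.Println(Decide(ContributorsWithApproval, Caller{})) // not a contributor
	owner.HasApprovedRequest = true
	fmt.Println(Decide(ContributorsWithApproval, owner)) // approved request exists
}
```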

Proto reflection

The proto reflection component lets the RPC authorisation service determine what humans_who_can_rpc option is set for the RPC being authorised. We’ve implemented this as an endpoint hosted on each of our microservices. The reflection endpoint returns a binary representation of the proto file that can be used by the caller to analyse the contents.

For us it’s important that we get a reliable and up-to-date view of the proto file. This is why we consult the reflection endpoint at runtime, rather than syncing permissions through an asynchronous process. This means we deploy RPC permissions changes just like any other changes. Similarly, we can roll them back just as we roll back deployments.
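The effect of looking permissions up at runtime can be illustrated with a small Go sketch. The Reflector interface below is a hypothetical stand-in for calling a service's reflection endpoint and decoding the proto description it returns; the in-memory fake simulates deployed services so the example runs on its own. Returning an error when no option is set is an assumption about how a missing option might be handled.

```go
package main

import (
	"errors"
	"fmt"
)

// Reflector is a hypothetical abstraction over the reflection endpoint.
// Options returns RPC name -> humans_who_can_rpc value for the version
// of the service that is currently deployed.
type Reflector interface {
	Options(service string) (map[string]string, error)
}

var ErrNoMode = errors.New("no humans_who_can_rpc option set")

// ModeFor asks the reflector on every call instead of caching, so the
// answer always follows the latest deployment or rollback.
func ModeFor(r Reflector, service, rpc string) (string, error) {
	opts, err := r.Options(service)
	if err != nil {
		return "", fmt.Errorf("reflecting %s: %w", service, err)
	}
	mode, ok := opts[rpc]
	if !ok {
		return "", fmt.Errorf("%s/%s: %w", service, rpc, ErrNoMode)
	}
	return mode, nil
}

// fakeReflector simulates deployed services in memory.
type fakeReflector map[string]map[string]string

func (f fakeReflector) Options(service string) (map[string]string, error) {
	opts, ok := f[service]
	if !ok {
		return nil, errors.New("service not found")
	}
	return opts, nil
}

func main() {
	deployed := fakeReflector{
		"service.foo": {"ReadFoo": "CONTRIBUTORS_WITH_APPROVAL"},
	}
	fmt.Println(ModeFor(deployed, "service.foo", "ReadFoo"))

	// Simulate deploying a proto change: the next lookup sees it at once.
	deployed["service.foo"]["ReadFoo"] = "ENGINEERS_WITH_APPROVAL"
	fmt.Println(ModeFor(deployed, "service.foo", "ReadFoo"))
}
```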

Software ownership

The software ownership system lets us support the CONTRIBUTORS_* authorisation modes. It regularly evaluates which engineers are owners of each service and then pushes the information into our authorisation system in the form of parameterised role grants. The process runs asynchronously so there can be a small delay (< 1 hour) before ownership changes are reflected in engineering permissions.
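As a rough illustration of what a parameterised role grant might look like, the Go sketch below derives one grant per engineer and owned service from two inputs: which team owns each service, and who is on each team. The Grant type, the "contributor" role name and the team and user names are all hypothetical; a real sync job would also need to revoke grants that no longer apply.

```go
package main

import (
	"fmt"
	"sort"
)

// Grant is a hypothetical parameterised role grant: the role is generic
// and the service it applies to is the parameter.
type Grant struct {
	User    string
	Role    string
	Service string
}

// ContributorGrants derives one grant per (team member, owned service)
// pair. The result is sorted so that successive runs are easy to diff.
func ContributorGrants(serviceOwner map[string]string, teamMembers map[string][]string) []Grant {
	var grants []Grant
	for service, team := range serviceOwner {
		for _, user := range teamMembers[team] {
			grants = append(grants, Grant{User: user, Role: "contributor", Service: service})
		}
	}
	sort.Slice(grants, func(i, j int) bool {
		if grants[i].Service != grants[j].Service {
			return grants[i].Service < grants[j].Service
		}
		return grants[i].User < grants[j].User
	})
	return grants
}

func main() {
	owners := map[string]string{
		"service.account": "accounts-team",
		"service.pot":     "savings-team",
	}
	members := map[string][]string{
		"accounts-team": {"alice", "bob"},
		"savings-team":  {"carol"},
	}
	for _, g := range ContributorGrants(owners, members) {
		fmt.Printf("%s: %s(service=%s)\n", g.User, g.Role, g.Service)
	}
}
```

Re-running a derivation like this on a schedule is what keeps contributor permissions aligned with team membership, at the cost of the short delay described above.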

Migrating our services

The final piece of the puzzle was migrating all our services to include a suitable humans_who_can_rpc option for each of their RPCs.

We used this opportunity to ask all engineering teams to re-consider permissions configuration for each of their RPCs, using our guidance on how to choose a mode.

Manually editing all our proto files would have been an expensive and laborious ask of teams, so we invested heavily in tooling to make the migration as simple as possible. We shipped a command line tool that asks for the name of a service, then walks you through each RPC and asks you to select a suitable humans_who_can_rpc mode. As part of this, the tool surfaces recent RPC usage data to help engineers gauge the impact of any permissions restrictions. Finally, the tool saves the changes locally, leaving engineers to raise a pull request and get it approved by another team member.

A screenshot of a message to an engineer helping them choose a suitable mode.

Overall the migration was a huge success and the majority of teams found the process quick and pain-free. Every eligible Monzo service was migrated within 10 weeks of starting 🎉

Engineers now proactively configure permissions for services they own

Our system for configuring and authorising RPCs has been running in production since September 2021.

Since then, we’ve found that engineers are proactively thinking about and configuring permissions. Rather than permissions being an afterthought, they’re front and centre when developing services.

When engineering teams are blocked by RPC permissions, they have the autonomy to change them as they see fit.

Additionally, pre-baked authorisation modes have helped us tie down access to our RPCs in general:

  • 50% of our RPCs require some level of peer approval

  • ~25% are locked down completely, and can only be called in a break glass scenario

  • ~25% are now intentionally open to all engineers, as they support a core part of their role

In the future we want to go further by automatically helping engineers choose a suitable permissions mode for their RPCs. We can do this by aggregating information on what the RPC does, historic usage and the kinds of data it processes.

If you want to help us build systems like this, come join us! We’re on the lookout for: