How we improved our staff VPN

We use OpenVPN to protect ourselves from security threats. Virtual private networks – or VPNs – make using the public internet safer by encrypting the traffic we send, letting us share information securely. With a VPN, we can:

1. Connect to a private network seamlessly. We can connect the software on our laptops to parts of our platform that aren’t open to the public internet without having to configure them specially.

2. Segment traffic. We know that employee traffic comes to our platform via the VPN, which means we can write network rules to control that type of traffic. It’s one of a small number of entry points that an attacker could use, so it’s important we can shut it down if we need to.

3. Improve monitoring. We can see all employee sessions in a central place, and disconnect them if we see something suspicious. To be clear, we can’t see what employees are looking at (and don’t want to), but we can see who’s connected and when.

OpenVPN is great, but we wanted to make some changes

We’ve used OpenVPN at Monzo for a while now. It’s a tried and tested solution for the actual inner workings of a VPN. But we didn’t want to use an enterprise version because of its costly and rigid licences, and the community edition lacked some features we were using. So we started a project to upgrade our OpenVPN infrastructure.

The main things we wanted to improve were authentication, logging employee sessions and deployment.

Authentication

Our old setup authenticated users with a username and password, plus a single-use password generated by Google Authenticator. The user’s laptop passed this code to OpenVPN using its ‘static challenge’ protocol, which asks the user for an extra piece of information after they provide a username and password, and a proprietary OpenVPN application verified it. This was slow, and a frustrating experience for our employees, who had to enter these codes several times a day. It meant having your phone at hand, having the Google Authenticator app installed and remembering your laptop password (you need your laptop because your VPN password is stored in its keychain).

Logging employee sessions

We used to have only the OpenVPN logs to tell us who was logged in at any time, and they were nearly impossible to analyse. We wanted to treat VPN sessions like we treat anything else in our platform: assign an identifier and keep track of its state in Cassandra, Monzo’s primary database.

Deployment

We wanted to fundamentally change the way the VPN is deployed. It was previously a ‘snowflake’ in our platform, with a whole EC2 box to itself. We wanted to move it to Kubernetes, with several concurrent pods, to make it more like any of our other backend services. As time goes on, we’re trying to move more and more of our platform to Kubernetes, which we're making a serious bet on as the future of software deployments. We also wanted to move away from a few components we were using with OpenVPN, and instead use the community edition with our own code to add features.

Authentication

We decided we wanted to move to notification-based login. So whenever you want to log in, all you have to do is approve the notification we send to you. You still need to have your phone with you and remember your laptop password. There’s a certificate stored on your laptop which logs you in. These certificates are generated on the backend and are scoped to a specific user, so we know which account to ping when a login attempt is made. Here’s a diagram to show how it works:

Diagram showing four units: OpenVPN, which is connected to an OpenVPN sidecar, which is itself connected to an authentication provider, which is connected to a user device.
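To make the certificate part concrete, here’s a minimal sketch of how a backend can mint a client certificate whose common name is an employee’s user ID, using Go’s standard library. It’s an illustration rather than our actual code, and the key type and one-year validity are assumptions:

```go
package vpnauth

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"math/big"
	"time"
)

// issueClientCert mints a client certificate whose CommonName is the
// employee's user ID, signed by an internal CA. How certificates are
// actually issued and distributed to laptops isn't covered here.
func issueClientCert(caCert *x509.Certificate, caKey *ecdsa.PrivateKey, userID string) (certDER []byte, key *ecdsa.PrivateKey, err error) {
	key, err = ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}

	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()), // simplified; use random serials in practice
		Subject:      pkix.Name{CommonName: userID},     // the user ID the sidecar later reads back
		NotBefore:    time.Now(),
		NotAfter:     time.Now().AddDate(1, 0, 0), // assumed one-year validity
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}

	certDER, err = x509.CreateCertificate(rand.Reader, tmpl, caCert, &key.PublicKey, caKey)
	if err != nil {
		return nil, nil, err
	}
	return certDER, key, nil
}
```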

We have over 700 employees now, so we have to deal with quite a few edge cases, including employees who don’t have a smartphone (either temporarily or permanently). For cases like this, we use small, low-powered devices which generate single-use passwords in the same way as Google Authenticator. If our backend says we need to use single-use password login, we request one from the user’s laptop using OpenVPN’s ‘dynamic challenge’ protocol, which instructs the laptop to ask the user for an extra piece of information.
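For reference, the single-use passwords that Google Authenticator and these devices produce are standard TOTP codes (RFC 6238). Here’s a minimal Go sketch of generating and checking one, assuming the usual parameters (6 digits, a 30-second step and HMAC-SHA1); it’s an illustration, not our production code:

```go
package totp

import (
	"crypto/hmac"
	"crypto/sha1"
	"encoding/base32"
	"encoding/binary"
	"fmt"
	"time"
)

// code computes an RFC 6238 one-time password: 6 digits, 30-second step,
// HMAC-SHA1. The shared secret is the usual base32-encoded string.
func code(base32Secret string, t time.Time) (string, error) {
	secret, err := base32.StdEncoding.WithPadding(base32.NoPadding).DecodeString(base32Secret)
	if err != nil {
		return "", err
	}

	// Number of 30-second intervals since the Unix epoch.
	var msg [8]byte
	binary.BigEndian.PutUint64(msg[:], uint64(t.Unix()/30))

	mac := hmac.New(sha1.New, secret)
	mac.Write(msg[:])
	sum := mac.Sum(nil)

	// Dynamic truncation, as described in RFC 4226.
	offset := sum[len(sum)-1] & 0x0f
	value := binary.BigEndian.Uint32(sum[offset:offset+4]) & 0x7fffffff
	return fmt.Sprintf("%06d", value%1000000), nil
}

// verify accepts the code for the current step and its neighbours,
// to allow for a little clock drift between the token and the server.
func verify(base32Secret, candidate string, now time.Time) bool {
	for _, skew := range []time.Duration{-30 * time.Second, 0, 30 * time.Second} {
		if c, err := code(base32Secret, now.Add(skew)); err == nil && c == candidate {
			return true
		}
	}
	return false
}
```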

Architecture

To manage this authentication flow, we developed a Go service which runs as a sidecar to OpenVPN. It sits as another container in the same Kubernetes pod. This service controls OpenVPN through its management socket: a file on the OpenVPN server through which you can send commands to accept and reject connections. OpenVPN sends us a request through this socket for every connection attempt, with a bunch of information about the session. We get the common name of the certificate from that request, which corresponds to an employee’s user ID. We then look up that employee to check if they’re allowed to open a connection, and to fetch an identifier so we can send a login notification.
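Here’s a simplified sketch of that flow. It assumes OpenVPN is running with its management interface enabled (the --management and --management-client-auth options), and the authorize function stands in for the employee lookup and push notification described above; our real sidecar does quite a bit more than this:

```go
package sidecar

import (
	"bufio"
	"fmt"
	"net"
	"strings"
)

// authorize stands in for the real lookup: map the certificate common name
// to an employee, check they're allowed to connect, and wait for them to
// approve the push notification.
func authorize(commonName string) bool { return false }

// run connects to OpenVPN's management socket and answers each
// CLIENT:CONNECT event with an accept or a reject.
func run(socketPath string) error {
	conn, err := net.Dial("unix", socketPath)
	if err != nil {
		return err
	}
	defer conn.Close()
	r := bufio.NewReader(conn)

	var cid, kid, commonName string
	for {
		line, err := r.ReadString('\n')
		if err != nil {
			return err
		}
		line = strings.TrimSpace(line)

		switch {
		case strings.HasPrefix(line, ">CLIENT:CONNECT,"):
			// ">CLIENT:CONNECT,<cid>,<kid>" - identifiers we echo back in our reply.
			if parts := strings.Split(strings.TrimPrefix(line, ">CLIENT:CONNECT,"), ","); len(parts) >= 2 {
				cid, kid = parts[0], parts[1]
			}
		case strings.HasPrefix(line, ">CLIENT:ENV,common_name="):
			commonName = strings.TrimPrefix(line, ">CLIENT:ENV,common_name=")
		case line == ">CLIENT:ENV,END":
			// All the session information has arrived: decide and reply.
			if authorize(commonName) {
				fmt.Fprintf(conn, "client-auth-nt %s %s\n", cid, kid)
			} else {
				fmt.Fprintf(conn, "client-deny %s %s \"not authorised\"\n", cid, kid)
			}
		}
	}
}
```

Replying with client-auth-nt (rather than client-auth) tells OpenVPN we have no extra configuration to push to the client.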

Diagram showing our platform architecture: at the entrance an AWS NLB, which talks to an Ingress controller inside Kubernetes, which then connects to an OpenVPN pod containing OpenVPN, the OpenVPN sidecar, and an iptables sidecar.

Session management

To keep track of user sessions, we created a new microservice, service.vpn-session. When someone opens a connection, the sidecar asks this service whether that person can connect, and the service then keeps track of the connection. When that person ends the connection, the service is notified, so we know who was connected to the VPN and when. That’s useful information for investigations.
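To show the shape of this, here’s a rough sketch of the two calls the sidecar makes. The type and method names are invented for the example, and the real service keeps its state in Cassandra rather than behind this toy interface:

```go
package vpnsession

import "time"

// Session is an illustrative record of a VPN connection.
type Session struct {
	ID          string
	UserID      string
	ConnectedAt time.Time
	EndedAt     *time.Time
}

// Store is whatever persists sessions (Cassandra, in our case).
type Store interface {
	Create(s Session) error
	End(id string, at time.Time) error
}

// Service sketches the two calls the sidecar makes: one when a connection
// is opened, and one when it's closed.
type Service struct{ store Store }

// Connect decides whether userID may open a session and records it.
func (s *Service) Connect(userID, sessionID string) (allowed bool, err error) {
	// A real implementation would also check employment status and
	// whether the account has been suspended.
	err = s.store.Create(Session{ID: sessionID, UserID: userID, ConnectedAt: time.Now()})
	return err == nil, err
}

// Disconnect marks the session as ended, keeping an audit trail of who
// was connected to the VPN and when.
func (s *Service) Disconnect(sessionID string) error {
	return s.store.End(sessionID, time.Now())
}
```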

The VPN sidecar exchanges regular heartbeats with the VPN session service. If we choose to terminate a VPN session (for example during offboarding), this is picked up at the next heartbeat, at which point the sidecar tells OpenVPN to kill the connection. We used a ‘pull’ approach, instead of just firing a synchronous command at the sidecar, because we have multiple VPN pods. With multiple pods we would have to look up their IPs in Kubernetes and fire this command at each one, which isn’t very neat and introduces race conditions when new pods are starting.
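A sketch of that heartbeat loop on the sidecar might look like this. The interval and function names are illustrative; the management interface’s ‘kill’ command (by certificate common name) is what the sketch uses to close a connection:

```go
package sidecar

import (
	"fmt"
	"net"
	"time"
)

// sessionsToKill stands in for the heartbeat call to service.vpn-session:
// given the sessions this pod currently holds, it returns the ones the
// service wants terminated (for example because someone is being offboarded).
func sessionsToKill(active []string) ([]string, error) {
	return nil, nil // in the real system, an RPC over mutual TLS
}

// heartbeat reports this pod's active sessions at a regular interval and
// kills any connection the session service has flagged.
func heartbeat(mgmt net.Conn, activeCommonNames func() []string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		doomed, err := sessionsToKill(activeCommonNames())
		if err != nil {
			continue // try again on the next beat
		}
		for _, cn := range doomed {
			// Ask OpenVPN to drop the client with this certificate common name.
			fmt.Fprintf(mgmt, "kill %s\n", cn)
		}
	}
}
```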

The sidecar and service.vpn-session communicate over mutual TLS (an encrypted connection which authenticates both parties). We do this because the service can’t tell from the source IP alone whether incoming traffic is coming from the VPN sidecar or from the VPN itself (an employee session). We check a client certificate on every request to make sure that employees can’t impersonate the sidecar and make requests to service.vpn-session. Fortunately, this is easy for us to do in Typhon, our microservice framework.
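In Typhon this is mostly configuration for us, but the underlying idea looks something like this generic crypto/tls sketch, where the server only accepts requests that present a client certificate signed by our internal CA. The file paths and port are placeholders:

```go
package mtls

import (
	"crypto/tls"
	"crypto/x509"
	"net/http"
	"os"
)

// newServer returns an HTTPS server that requires and verifies a client
// certificate on every connection, so an employee on the VPN can't
// impersonate the sidecar just by being able to reach the service.
func newServer(caFile string, handler http.Handler) (*http.Server, error) {
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	return &http.Server{
		Addr:    ":8443", // placeholder port
		Handler: handler,
		TLSConfig: &tls.Config{
			ClientCAs: pool,
			// Reject any connection that doesn't present a valid client cert.
			ClientAuth: tls.RequireAndVerifyClientCert,
		},
	}, nil
}
```

The server’s own certificate and key are supplied when calling ListenAndServeTLS.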

Deployment

Moving to Kubernetes presented a few challenges that didn’t apply to the previous ‘snowflake’ model, with its dedicated EC2 instance for the VPN.

Network rules

Previously, we’d controlled network traffic from the VPN using AWS (Amazon Web Services) security groups: we simply whitelisted traffic from the VPN’s subnet to other subnets. With the new approach, we wouldn’t be able to distinguish the VPN at a network level from any other traffic coming from the Kubernetes cluster, which meant we couldn’t prevent the VPN from accessing a subnet without blocking every other pod at the same time.

To apply network rules, we instead had to use our network overlay, Calico. Calico creates a virtual network on top of the AWS network, giving each Kubernetes pod an IP and keeping track of which IPs correspond to which pods. As a result, it’s able to apply network rules that limit which pods can talk to each other. This is a key part of our platform security. In the case of the VPN, we wanted to limit network connectivity to sensitive areas of our platform that non-technical employees would never need to access.

Connection draining

Because we now had a Go sidecar, we expected to deploy the VPN regularly for core library changes and feature improvements. And as we were now running the VPN in Kubernetes, and not as a Kubernetes ‘StatefulSet’ (which tends to stay in place), it could be politely asked to move at any time. We needed to run the VPN in a way that allowed us to shut it down and replace it without disconnecting hundreds of employees.

The key to this was connection draining. When we deployed, instead of killing the pod, we wanted to prevent new connections and then wait for all users to naturally disconnect. At that point, we’d terminate the pod. Fortunately, Kubernetes stops routing traffic to pods that are in a terminating state by design, so no new connections would be created once a deployment started.

When the pod starts to terminate, Kubernetes sends a ‘terminate’ signal (SIGTERM, essentially ‘please shut down soon’) to every container. We prevent this signal from reaching OpenVPN (it would cause an immediate shutdown), and start watching the number of connections in the Go sidecar. When this drops to zero, the sidecar instructs OpenVPN to shut down, and then exits. Our final requirement was to increase the Kubernetes shutdown grace period from its default of 30 seconds to 8 hours, which means Kubernetes waits 8 hours before sending a ‘kill’ signal (SIGKILL, essentially ‘shut down right now’) and forcibly killing the pod.
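Here’s a simplified sketch of that drain logic. The polling interval and function names are illustrative, the 8-hour window comes from the pod’s terminationGracePeriodSeconds, and ‘signal SIGTERM’ is the management-interface command that asks OpenVPN to shut down:

```go
package sidecar

import (
	"fmt"
	"net"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// drainAndExit is the shutdown path: the sidecar catches SIGTERM itself
// (OpenVPN never sees it), waits for the connection count to reach zero,
// then asks OpenVPN to stop via the management socket. Kubernetes gives
// us up to terminationGracePeriodSeconds (8 hours in our case) before
// it sends SIGKILL.
func drainAndExit(mgmt net.Conn, connectionCount func() int) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	<-sigs // Kubernetes has asked the pod to go away.

	// New connections are no longer routed to a terminating pod, so we
	// just wait for existing users to disconnect naturally.
	for connectionCount() > 0 {
		time.Sleep(30 * time.Second) // illustrative polling interval
	}

	// Everyone's gone: tell OpenVPN to shut down, then exit ourselves.
	fmt.Fprint(mgmt, "signal SIGTERM\n")
	os.Exit(0)
}
```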

Find this interesting? You might find monzo.com/careers interesting too!