Shipping Kubernetes-native applications with confidence

November 14, 2019 | in Engineering
| By Amir Moualem

A few months ago, our team began developing a new product.

This new product has a few properties that differentiate it from other software projects we have developed so far:

  1. It’s native to Kubernetes, meaning it’s tightly coupled to the Kubernetes API and requires that specific API in order to run (or even to be tested… more on that later).
  2. It requires certain configurations to be applied in order for it to run properly. This means we have to ship this image with the Kubernetes .yaml configuration files so our users may install it smoothly. This is further complicated by our desire to simplify the onboarding for that product as much as possible, leading us to support installation via both Helm charts and “vanilla” .yaml files, both of which have to be tested and published.
  3. It’s a client-side application; we relinquish control over it the second it is published, and there’s no taking it back. Publishing hotfixes to our microservices can be done in minutes, but repeatedly asking our users to upgrade their installation will only complicate their onboarding at best.

This prompted us to invest a bit more time in planning a CI & CD pipeline that would allow us to ship such a product with confidence, without slowing ourselves down (too much).

Snyk’s existing microservices CI/CD pipeline revolves around our developers’ engagement with GitHub:

  1. Opening pull requests installs, runs and tests the service in a testing environment.
  2. Merging pull requests begins the same way, followed by a semantic release for the service, building a docker image, running it, and deploying it into the Kubernetes cluster in our production environment.

We were looking for similar interactions, so that the process would feel as natural as possible.

So, how did we approach this project?

Kubernetes integration testing with Kind

The first question mark for our inexperienced minds was how to test software that relies so heavily on the Kubernetes API. Mocking all Kubernetes API-related flows would be incredibly hard to write and maintain, and would probably miss some of the critical flows that should be tested. A naive approach would utilize a real Kubernetes cluster, either by setting one up for every testing phase or by keeping a long-running environment. While such an approach would definitely work, it would bring more complications than we cared to handle at this early point in the product’s development:

  • Spinning up new Kubernetes environments every time would take a while to develop and would significantly increase the setup time for the tests.
  • Having a long-running environment means having to take care of cleanups and may introduce a lot of flakiness to the tests.

Thankfully, we were pointed towards the Kind project. Kind is a Docker-based tool for running local Kubernetes clusters that conform to the Kubernetes API. It fit our needs perfectly, as it combines the advantages of having a clean environment for every test, with the advantage of a very fast setup.

So now our testing phase, whether it runs in our CI environment or locally, simply creates a Kind cluster, loads our newly created image into it, and asserts that our container does its designated work.
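In practice, that setup boils down to a few commands. The cluster name, image name, and manifest path below are illustrative; the assertions themselves come from whatever test runner the project already uses:

```shell
# Spin up a throwaway cluster for this test run (name is illustrative)
kind create cluster --name integration-test

# Build the image under test and make it available to the cluster's nodes
docker build -t our-app:under-test .
kind load docker-image our-app:under-test --name integration-test

# Deploy the app, run the integration test suite against the cluster,
# then discard the whole environment
kubectl apply -f deploy/
# ... run integration tests ...
kind delete cluster --name integration-test
```

Because each run starts from `kind create cluster` and ends with `kind delete cluster`, there is no long-lived state to clean up between runs.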

Test what you deliver; deliver what you test

The next challenge was making sure that we don’t only test our code, but also test the complete product that we deliver.

The code (and service) is packaged in a Docker image. The Docker image may be deployed into Kubernetes clusters with either Helm charts or “vanilla” .yaml files, and those Kubernetes clusters may support different versions of the Kubernetes APIs.

We aim to test and support this wide range of scenarios.

From the moment we build an image, we tag it so we’ll be able to identify that image through its testing stages. That image gets installed with both Helm charts and “vanilla” .yaml files, on Kubernetes clusters with different API versions, prior to being tagged as “approved”.
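As a sketch, that test matrix looks roughly like this. The Kubernetes version list, image names, and paths are assumptions, and the Helm invocation uses Helm 3 syntax:

```shell
# Test the candidate image against several Kubernetes API versions,
# installing it both via the Helm chart and via the plain .yaml files.
for version in v1.14.10 v1.15.7 v1.16.4; do
  kind create cluster --name "test-${version}" --image "kindest/node:${version}"
  kind load docker-image our-app:candidate --name "test-${version}"

  # Install path 1: the Helm chart
  helm install our-app ./charts/our-app --set image.tag=candidate
  # ... assert the app works, then remove the release ...

  # Install path 2: the "vanilla" manifests
  kubectl apply -f deploy/kubernetes/
  # ... assert the app works ...

  kind delete cluster --name "test-${version}"
done
```

The same image, identified by its candidate tag, flows through every cell of the matrix, so the bits we eventually approve are the bits we actually exercised.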

That’s also the stage where we utilize semantic-release to version our images. Rather than keeping track of our products through their Git SHAs, or through arbitrary versions assigned in manual steps, we elected to use semantic versioning. That way, we provide simple, understandable versions, with release notes generated automatically and published on GitHub. The end result is an image with a versioned tag, such as 1.2.3.
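A minimal semantic-release configuration for this kind of flow might look like the following; the file name and plugin set are assumptions, and the exact keys vary between semantic-release versions:

```yaml
# .releaserc.yml (illustrative): derive the next 1.2.3-style version
# from commit messages and publish the release notes to GitHub.
branches:
  - master
plugins:
  - "@semantic-release/commit-analyzer"         # decide patch/minor/major
  - "@semantic-release/release-notes-generator" # draft the release notes
  - "@semantic-release/github"                  # create the GitHub release
```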

All we need to do at this stage is point our deployment .yaml files to that new version…

Publishing made easy with Helm charts and GitHub Pages

Helm is the package manager for Kubernetes. A bit of research revealed two main approaches to publishing Helm charts:

  1. Hosting the Helm chart on the public curated Helm charts repository.
  2. “Self-hosting” them with GitHub Pages on our repository.

The second approach made more sense to us, at least initially, as it’s self-contained and doesn’t add reliance on another party. As our product and CI & CD pipeline mature, we might choose to contribute our chart to the public repository as well, at some point.
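Publishing through GitHub Pages then reduces to packaging the chart and regenerating the repository index. The paths, organization, and URL below are placeholders, and the install command uses Helm 3 syntax:

```shell
# Package the chart and (re)build the index served by GitHub Pages
helm package charts/our-app --destination docs/
helm repo index docs/ --url https://our-org.github.io/our-repo

# Users then consume it like any other chart repository
helm repo add our-repo https://our-org.github.io/our-repo
helm install our-app our-repo/our-app
```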

Wrapping it all up with GitHub

It appears we’ve tackled most of our problems: we’re running integration tests with Kind to simulate real Kubernetes environments in different scenarios, and we’re using semantic versioning to tag our products, which we publish on GitHub Pages for simplicity.

So how does all of that fit in with our development process?

We’ve agreed on two primary branches in our Git repository:

  1. staging, our default branch, is the branch-off and merging point for all changes to the product (features, bug fixes, chores, etc.).
  2. master is the decision point for publishing our tested product.

There are a total of four stages in the CI & CD pipeline, spanning these two branches and the two main interactions our developers have with GitHub (opening a pull request and merging one). Each stage is meant to increase our confidence in the product we’ve built and bring us closer to shipping it to our users.

It may be worth mentioning that the orchestration itself is done by Travis CI (a migration to CircleCI is in progress), but, for all intents and purposes, it could have been a short in-house Python script.
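For illustration, a Travis-style configuration expressing these stages could be shaped like this; the stage names and script paths are made up:

```yaml
# Illustrative .travis.yml skeleton; each script encapsulates one stage.
jobs:
  include:
    - stage: test
      if: type = pull_request
      script: ./scripts/run-kind-tests.sh
    - stage: release-candidate
      if: type = push AND branch = staging
      script: ./scripts/build-test-and-approve.sh
    - stage: publish
      if: type = push AND branch = master
      script: ./scripts/retag-and-publish.sh
```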

The following table shows a summary of these four stages, and what we accomplish with each one.

| Trigger | Actions | Goal |
| --- | --- | --- |
| Opening a pull request from a feature branch to staging. | Lint, compile, run unit tests. Build a (discardable) Docker image. Use it in the Kind integration tests. | Test in a clean environment. |
| Merging a pull request from a feature branch to staging. | Build a Docker image. Tag it as a candidate. Run it through our integration tests. Invoke semantic-release to tag the tested image as 1.2.3-approved. | Make sure the merged branches do not conflict. Test and build in a clean environment. Build the product we plan on publishing exactly once. |
| Opening a pull request from staging to master. | Nothing. This is a stopping point for any manual steps we wish to perform before publishing the product, such as dogfooding. | Last-minute chance for feedback. A reminder to manually inspect certain test environments, if needed. |
| Merging a pull request from staging to master. | Retag the image to its public form, 1.2.3. Modify the tag reference in our .yaml files and Helm charts to point to the new version. Release a new gh-pages branch with all the changes so the product may be consumed. | Publish the product that has already been built and tested; we know what we are publishing. |
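The final retag-and-publish step itself is small; the registry and file paths here are illustrative:

```shell
# Promote the approved candidate to its public tag...
docker pull registry.example.com/our-app:1.2.3-approved
docker tag registry.example.com/our-app:1.2.3-approved registry.example.com/our-app:1.2.3
docker push registry.example.com/our-app:1.2.3

# ...and point the shipped manifests at the public tag
sed -i 's|our-app:.*|our-app:1.2.3|' deploy/kubernetes/deployment.yaml
```

Because the image bytes never change between 1.2.3-approved and 1.2.3, what our users pull is exactly what passed the pipeline.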

Not another boring, everyday summary here

Just kidding, it’s completely boring. But I’ll keep it short.

We’ve built an excellent CI & CD pipeline, and we’ve learned a lot along the way. We’re balancing automation with just the right amount of manual clicking that feels natural to us. Each step helps us feel more confident with what we’re shipping, with the last step actually making our delivery public.

Like any other software project, there’s still room for improvement here. We could start publishing to a public Helm chart repository. We could extend our testing infrastructure to include more long-living environments to test upgradability. When we have enough confidence in our tests, we might even skip some manual steps before releasing – who knows?