Choosing a CI that grows at the same pace as a scale-up

Johan Lorenzo
Published in malt-engineering
11 min read · Aug 6, 2021


What did we do when our usage did not fit into a regular SaaS plan?

TL;DR

Malt has gotten bigger since it last revisited its CI needs, and plans to keep growing exponentially for the next couple of years. We looked at what the market has to offer in 2021 and got hands-on with a handful of candidates. Spoiler alert: Today, we are proud to announce Malt is migrating to GitLab CI and we have enough objective data to be comfortable with this decision.

What made Malt decide it was the right time?

Malt expects to grow from 50 employees on the product team to 100, a year and a half from now. Many aspects of the company have to scale up. Our Continuous Integration (CI) system is one of them.

Malt first started with Jenkins, then switched to Bamboo in 2018. Over the past few months, we reached the limits of our Bamboo plan. We have 5 agents, which causes delays when activity peaks. We could move to the next plan (10 agents). However, we also hit a few issues related to spawning new agents, keeping packages up to date, and having machines running outside of business hours. Hence, Malt seized this opportunity to revisit its needs in terms of CI.

“New phone, who dis?”

I have not had the chance to blog here, so let me introduce myself. I am Johan Lorenzo, a Lead Software Engineer at Malt. I joined the company in February and one of my first missions is to make Malt’s CI grow alongside the company. I have had the pleasure of working on CI-related topics for many years, at various scales, and I am thrilled about this problem we are solving!

As a newcomer at Malt, I had some leeway to first understand what the current and future needs are, then try out some of the solutions out there.

⚠️ Disclaimer

The conclusion of this article reflects what Malt requires at its current stage of growth. It may not be applicable in your case. Also, we may have overlooked some of the possibilities for the sake of time. That said, even though the choice is made on our end, I am happy to discuss what your point of view is, in the comments.

What is the market like in 2021?

There are dozens of solutions out there. We identified 23 as of March 2021. They can be classified into 3 categories, plus hybrids that combine them.

Self-hosted solutions

This is roughly the first generation of CI systems. In this model, you are expected to host and maintain your own main nodes and worker pools.

SaaS solutions

The second type of system is the one provided as Software as a Service (SaaS). Here, you just have to describe what your job does and it will be executed on worker pools maintained by your provider.

Cloud-based solutions

The third type is somewhat similar to the second one. You describe what a job does, but it runs on the cloud provider of your choice instead. These solutions spin up worker pools on one or more cloud providers.

Hybrid solutions

Some solutions take a hybrid approach: they offer SaaS while also allowing you to run workers in the cloud.

In one chart

Above is a Venn diagram of the different solutions. Not all of them are displayed, for the sake of clarity. At the top, we have the self-hosted solutions (like Jenkins and Bamboo). On the left, we have the SaaS ones (like CircleCI). On the right, the cloud-based ones (like Jenkins X and Argo Workflows). Right in the center, we have the full hybrid solutions: GitHub Actions and GitLab CI. Then, between self-hosted and cloud-based, we have GoCD.

23 is a lot! How did we choose?

What do users think of the current CI?

In order to limit biases in the next step, I conducted 10 interviews. We tried to cover different profiles: frontend developers, backend developers, ops, the data team, security-savvy people, and the CTO. Each of these interviews lasted about an hour. They enabled us to uncover important features and drawbacks of the current CI, and led to a survey which was sent to a broader audience.

How do we measure what is important?

We wanted to measure how important things are. We collected data with a good old survey, with the following results.

Existing features

Pain points

First pass: Remove solutions which are missing a critical feature

23 solutions is a lot. We cannot look at them all thoroughly. So, let us quickly set aside those that are missing something too important to be ignored. We are not going to consider every must-have feature (like Slack notifications) or pain point (like timeouts) because they may not be a strong differentiator between solutions[1]. Instead, we are going to focus on the ones that can quickly disqualify candidates.

Critical feature #1: Elastic number of workers

Within the same day, we want to be able to scale up at a given time and scale down once the load is lower. This criterion means we may discard:

  • Any self-hosted solution. We could describe them in an IT automation tool (like Ansible), but:
      • Our current Bamboo instances are only partially described in Ansible. The cost to reach 100% is not negligible and someone at Malt has to maintain it. Cloud-based solutions handle that for us.
      • Paid solutions (like Bamboo or TeamCity) charge a fee that depends on the number of running instances.
  • Any SaaS solution that limits how much you can use it:
      • Travis-CI’s pricing allows up to 5 concurrent builds.
      • CircleCI supports up to 80 concurrent builds. That may be enough in the medium term. That said, we may eventually hit that number if we grow the number of applications or if we break down jobs into smaller ones.
      • Similarly, Bitbucket Pipelines offers no more than 3500 minutes of CI a month.
      • CloudBees CodeShip allows up to 20 concurrent builds.
      • Buddy.works allows up to 10 concurrent pipelines.
      • CodeMagic supports up to 3 concurrent builds.
      • JFrog Pipelines’ Enterprise plan offers 25k minutes per month.
      • TeamCity’s Cloud plan offers 24k minutes a month for 30 developers.

As for hybrid solutions, we will likely set aside their SaaS part:

  • GitHub Actions (GitHub-hosted runners) provides up to 50k minutes per month. That’s roughly 40 hours of CI/CD each business day. Today, Malt already uses more than 40 hours of CI per business day. We’re above the limit and GitHub doesn’t offer any larger plan. They do allow us to buy additional time, though.
  • GitLab CI (GitLab-hosted runners) provides at most 50k minutes each month, but additional minutes can be purchased.
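
As a rough sanity check of the "50k minutes is roughly 40 hours per business day" figure, here is a minimal sketch that converts the monthly quotas quoted above into hours per business day. The value of 21 business days per month is my assumption, not a number from the providers, and the function name is purely illustrative.

```python
# Monthly minute quotas as quoted in the lists above.
MONTHLY_MINUTES = {
    "GitHub Actions (hosted)": 50_000,
    "GitLab CI (hosted)": 50_000,
    "JFrog Pipelines Enterprise": 25_000,
    "TeamCity Cloud": 24_000,
    "Bitbucket Pipelines": 3_500,
}

# ASSUMPTION: about 21 business days in a month.
BUSINESS_DAYS_PER_MONTH = 21


def hours_per_business_day(monthly_minutes: int,
                           business_days: int = BUSINESS_DAYS_PER_MONTH) -> float:
    """Convert a monthly quota in minutes into CI hours available per business day."""
    return monthly_minutes / 60 / business_days


if __name__ == "__main__":
    for plan, minutes in MONTHLY_MINUTES.items():
        print(f"{plan}: {hours_per_business_day(minutes):.1f} h per business day")
```

With that assumption, a 50k-minute plan lands just under 40 hours per business day, which matches the estimate above.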

We are now down to 11 solutions: the cloud-based ones, the hybrid ones, then Semaphore and Harness CI/CD.

Critical feature #2: Clean and explicit job dependencies

The Malt tech stack revolves around many microservices that depend on many in-house libraries, all of them stored in a monorepo. We need a way to define dependencies between them and reuse artifacts from an upstream dependency. This will, for instance, make builds more reproducible because they won’t pick up the latest SNAPSHOT build. Thus, we can dispose of:

We are down to 7:

  • Jenkins X
  • Argo Workflows
  • Mozilla’s Taskcluster
  • GoCD
  • Concourse CI
  • GitHub Actions
  • GitLab CI
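
To make critical feature #2 more concrete, here is a minimal sketch of what "explicit job dependencies" means in a monorepo: the modules form a directed acyclic graph, a valid build order is a topological sort of it, and a change only triggers the modules downstream of it. The module names are hypothetical and this is not how any of the listed CI systems implement it; it only illustrates the idea, using Python's standard `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# HYPOTHETICAL modules: each key depends on the modules listed in its set.
DEPENDENCIES = {
    "lib-common": set(),
    "lib-billing": {"lib-common"},
    "service-invoices": {"lib-billing", "lib-common"},
    "service-search": {"lib-common"},
}


def build_order(dependencies):
    """Return the modules so that each one comes after its upstream dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def downstream_of(changed, dependencies):
    """Return every module to rebuild when `changed` changes (itself included)."""
    impacted = {changed}
    # Walking in build order guarantees upstream modules are classified first,
    # so transitive dependents are picked up in a single pass.
    for module in build_order(dependencies):
        if dependencies.get(module, set()) & impacted:
            impacted.add(module)
    return impacted


if __name__ == "__main__":
    print(build_order(DEPENDENCIES))
    print(sorted(downstream_of("lib-billing", DEPENDENCIES)))
```

In this toy graph, a change in `lib-billing` only rebuilds `lib-billing` and `service-invoices`, each reusing the artifact produced upstream rather than whatever snapshot happens to be the latest.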

Critical feature #3: Apparent popularity

Malt would like a product that is popular enough that we could benefit from other customers or communities. Moreover, the tool has to be actively maintained. Measuring those items is not easy, although statistics on GitHub provide some insights. Let’s dive into the numbers:

To be transparent, I personally used Mozilla’s Taskcluster between 2015 and 2020. I love this project and it is one of the most mature solutions out there. Unfortunately, I know there are not so many people around anymore to improve and maintain Taskcluster. It is heartbreaking to me, but I crossed it out at this stage. To be honest, I kept it in mind and mentally compared the remaining solutions to it, in case none of them looked mature enough. The rest of the study showed me I did not need to.

GoCD has similar numbers. It looked safer to put it aside too. If you are a GoCD user, I would be happy to have a chat and hear how this solution works for you.

It is hard to get a grasp of how popular Jenkins X, GitHub Actions, and GitLab CI are. At first glance, they seem to be used, but we don’t have any data to show it. Let’s not remove them at this stage.

So, we are now down to 5 solutions:

  • Concourse CI
  • Jenkins X
  • Argo Workflows
  • GitHub Actions
  • GitLab CI

Second pass: Proofs of Concept

You may notice Concourse CI is missing from the rest of the article. In the end, we didn’t have enough time to try it out. GitHub Actions and GitLab CI were very compelling solutions. More details below.

Jenkins X

The experiment on Jenkins X turned out to be very short. We didn’t manage to install it due to the lack of documentation, at the time.

Edit 2021-08-10: I should have provided a little bit more context. Jenkins X was the first POC we put up. It was in March. Back then Jenkins X v3 hadn’t reached General Availability. James Rawlings told me they “have vastly improved documentation on the website”. I confirm it looks much more detailed than what I remember from 5 months ago 🙂. He’s happy to be pinged on their Slack instance. I’d be glad to know if any of you have tried it since then!

Argo Workflows

Argo Workflows (together with Argo Events) was very promising, especially because it could have given us a single CI/CD solution: Malt already uses ArgoCD for the CD part. That said, the setup cost is much higher than with GitHub Actions or GitLab CI. The overall results are in the benchmark table below.

GitHub Actions

For us, GitHub Actions was much easier to work with than Argo Workflows. We could achieve in one day what took us a week on Argo Workflows. Problems came up when we wanted to run it on Kubernetes. We chatted with a solutions engineer at GitHub and he confirmed they don’t support this setup. GitHub Actions’ business model is mainly focused on the SaaS part.

GitLab CI

Just like with GitHub Actions, we could achieve much more with GitLab CI than with Argo Workflows. From an objective point of view, it also matches almost all the criteria of the survey. More specifically:

  • 12/12 existing features were kept
  • 13/14 pain points fixed
  • 11/13 additional points

Additional points were added during the benchmark, whenever we found that a solution looked better on a given aspect.
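
For readers who prefer a single number, the three GitLab CI scores above can be tallied as follows. This is only an illustrative sketch: weighting every criterion equally is my assumption, and the survey itself may have weighted them differently.

```python
# GitLab CI scores as reported above: (criteria met, criteria total).
GITLAB_CI_SCORES = {
    "existing features kept": (12, 12),
    "pain points fixed": (13, 14),
    "additional points": (11, 13),
}


def overall_coverage(scores):
    """Sum met and total criteria across categories.

    ASSUMPTION: every criterion carries the same weight.
    """
    met = sum(m for m, _ in scores.values())
    total = sum(t for _, t in scores.values())
    return met, total, met / total


if __name__ == "__main__":
    met, total, ratio = overall_coverage(GITLAB_CI_SCORES)
    print(f"{met}/{total} criteria met ({ratio:.0%})")
```

Under that equal-weight assumption, GitLab CI meets 36 of the 39 criteria.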

For more details about what’s missing, see the benchmark just below.

Functional Benchmark

(I apologize for the tables being images. I couldn’t find a way to put tables on Medium. This blog post will be on my personal blog too, where you will be able to find regular tables.)

Deciding was easier than initially envisioned

When we first started the PoC, we envisioned the call would be harder to make. At first glance, each of the 5 solutions looked as mature as any other. We believed we’d have to make a full-featured PoC for each of them and that an audience would have to tell which one they preferred (hoping a champion would rise).

With more data today, we are able to tell without a doubt: Malt’s next CI is going to be GitLab CI.

What’s next?

We know we want to migrate over to GitLab CI. From a functional point of view, everything has been green-lit. However, there are a few things left to test out: performance, stability, and costs, for instance. Regarding the latter, we actually have some forecasts.

Show me the money!

Current expenditures

Expected costs

For about the same monthly price, we predict we will use machines more efficiently: enough of them during peak hours and none outside of business hours. We will see if this model still stands after the migration.
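
The intuition behind this forecast can be sketched by comparing machine-hours for an always-on fixed pool against an elastic pool that follows demand. The hourly demand profile below is entirely made up for illustration (it is not Malt's data), and the model ignores spin-up time, per-minute billing, and differences in machine prices.

```python
# HYPOTHETICAL demand: number of workers needed during each hour of a business day.
DEMAND_BY_HOUR = {9: 3, 10: 8, 11: 12, 12: 4, 13: 2,
                  14: 9, 15: 12, 16: 10, 17: 6, 18: 2}


def fixed_pool_machine_hours(pool_size, hours_per_day=24):
    """Machine-hours paid for a pool that stays up all day, busy or not."""
    return pool_size * hours_per_day


def elastic_machine_hours(demand_by_hour):
    """Machine-hours paid when workers exist only while they are needed."""
    return sum(demand_by_hour.values())


def queued_worker_hours(pool_size, demand_by_hour):
    """Worker-hours of demand that a fixed pool cannot serve immediately."""
    return sum(max(0, need - pool_size) for need in demand_by_hour.values())


if __name__ == "__main__":
    agents = 5  # the size of the Bamboo pool mentioned earlier in the article
    print("fixed pool  :", fixed_pool_machine_hours(agents), "machine-hours/day")
    print("elastic pool:", elastic_machine_hours(DEMAND_BY_HOUR), "machine-hours/day")
    print("queued work :", queued_worker_hours(agents, DEMAND_BY_HOUR), "worker-hours/day")
```

With this made-up profile, 5 always-on agents cost 120 machine-hours a day while still leaving 27 worker-hours of jobs waiting at peak, whereas the elastic pool uses 68 machine-hours and queues nothing. Real numbers will differ, which is why the model still needs to be checked after the migration.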

Would you like to know more?

There will be another post once the migration hits a significant milestone 😉 In the meantime, if you are interested in helping us with our new CI, Malt is hiring!

Special thanks

I definitely did not do all of this by myself. This was a team effort! I would like to thank:

  • first and foremost, Thierry Sallé and Julien Aubert, who were instrumental in setting up the infra for the proofs of concept and who taught me many things about Google Cloud and Kubernetes,
  • the 10 people who agreed to be interviewed and who provided a lot of context,
  • the folks who answered the survey,
  • the reviewers of this post.

Footnotes

[1] As a matter of fact, Slack is supported everywhere.

[2] Jenkins X is split across many repositories and none of them stands out: https://github.com/jenkins-x/. Edit 2021–08–10: James Rawlings pointed me to https://github.com/jenkins-x/jx.

[3] Same as Jenkins X: https://gitlab.com/GitLab-org


Staff Release Engineer at Mozilla. Formerly at Malt. I love automating builds and publishing them at scale!