
Running a CI/CD platform using your own runners can be daunting, not from an implementation perspective, but from a scaling one.

You may be asking, “How will our platform create new runners/workers in response to a pipeline being run?”

Scaling

If scaling is not appropriately managed, pipeline jobs may be held up in a queue, which delays their execution and results in non-productive developer time. Scaling can be a difficult task: an install guide will get you a working runner, but the decision of when and how to scale out is largely left to you.

Depending on the CI/CD platform, the choices are typically:

  • Using the SaaS option
  • Installing an Agent on a server
  • Pulling a container from a container registry (DockerHub/ECR) with the Agent already set up

Any of these options can be scaled; the difficulty is usually deciding which metric to scale on. When using CI/CD, you may have the option to use a Webhook as a trigger to scale out runners, and although it is a perfectly capable option, you will need to build custom logic behind the scaling.

An ingress point will need to be present for the Webhook, which will then need to be passed to a Webhook server or to a worker/runner that starts up based on the invocation.

Within AWS, the solution might look something like this:

At first glance this seems simple:

  1. API Gateway for ingress.
  2. Lambda function to launch a runner via the EC2 API.
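To make that wiring concrete, here is a minimal sketch of the ingress half as an AWS SAM template. It is an illustration under assumptions, not the actual deployment: the handler, runtime and IAM policy are hypothetical.

```yaml
# Sketch: API Gateway ingress invoking a Lambda that launches a runner.
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  LaunchRunnerFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.handler          # hypothetical handler calling ec2:RunInstances
      Runtime: python3.12
      Policies:
        - Statement:
            - Effect: Allow
              Action: ec2:RunInstances   # a real deployment needs more EC2/IAM actions
              Resource: '*'
      Events:
        Webhook:
          Type: Api                 # API Gateway endpoint the CI/CD webhook posts to
          Properties:
            Path: /webhook
            Method: post
```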

However, when dealing with EC2, costs can add up if runners are orphaned or left running.

To deal with these orphaned runners, another Lambda function was added: CleanUpRunnerLambda. To deal with long-running “stuck” runners, TimedOutStaleRunnerLambda was created. Plus, three additional Lambdas were made to deal with potentially costly long-running instances and general runner failure. The overall upkeep is starting to add up.

For as long as this approach is used, Lambda runtimes will need to be updated, and the knowledge to maintain them will need to be kept in-house.

EC2 Instances

With Agent-based EC2 instances, the AMI will need to be updated regularly, both to avoid security issues and to keep compatibility with the Agent's installation requirements.

The reality is that in-house solutions for this sort of scaling rarely keep pace with, and are often outpaced by, community and open-source solutions.

Take, for example, GitHub Actions runners, from the CI/CD platform that will be the main focus of this blog.

For autoscaling your own runners, GitHub provides links to two repositories:

The first scales runners using the previously mentioned webhook approach: the webhook is fed into a queue, the queue triggers a Lambda function that spins up runners and passes the webhook payload to them, and another Lambda function later brings the runners back down.
It is a functional option; however, when used at scale, the instance costs can add up.
Each job you add to your workflow creates a webhook event that triggers another instance to start up. For more complicated pipelines, this can mean 20-30 instances just for pipeline tasks.
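As a hedged sketch of that flow (resource names and handlers are illustrative, and this is a simplification of what the real repository deploys), the queue decoupling might be expressed in SAM as:

```yaml
# Sketch: webhook events buffered in SQS; one Lambda scales up per event,
# another sweeps runners back down on a schedule.
Resources:
  WebhookQueue:
    Type: AWS::SQS::Queue

  ScaleUpFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: scale_up.handler     # hypothetical: launches one runner per queued job
      Runtime: python3.12
      Events:
        JobQueued:
          Type: SQS
          Properties:
            Queue: !GetAtt WebhookQueue.Arn
            BatchSize: 1            # one runner per webhook event

  ScaleDownFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: scale_down.handler   # hypothetical: terminates runners that have finished
      Runtime: python3.12
      Events:
        Sweep:
          Type: Schedule
          Properties:
            Schedule: rate(5 minutes)
```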

But Wait, There’s More

This solution could be modernised, as the EC2 instances can be swapped for ECS Fargate containers.

Fargate provides a managed service for containerised applications that requires little user input to run Docker workloads. Making that swap, however, would mean effectively forking the phillips-lab solution and updating the Lambdas that handle runner spin-up.

All the extra effort here would consume dev/engineering time, and although it can be achieved, consider what Martin Fowler says here: ‘I can only think of so many good ideas in a week. Having other people contribute makes my life easier.’

A forked in-house solution can rarely compete with the likes of the open-source community or with, say, Amazon. This raises an important question: how could we leverage Fargate without reinventing the wheel?

AWS EKS Integration

Actions-runner-controller using EKS Fargate is the answer. As a quick summary, Kubernetes is an open-source container orchestration system for automating software deployment, scaling, and management. Google originally designed Kubernetes, but the Cloud Native Computing Foundation now maintains the project.

EKS has been great for running Kubernetes in the cloud. The managed control plane eliminates the need to worry about scaling and maintaining Kubernetes components such as etcd, CoreDNS and the kube-apiserver.

The tight integration with AWS services means you can utilise services like IAM, API Gateway and ALB using either AWS Controllers for Kubernetes (ACK) or the AWS Load Balancer Controller.

This gives you control of your AWS services through Kubernetes-esque objects, i.e. deployment YAML files and the Kube API.
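For example, with the AWS Load Balancer Controller installed, an ALB can be provisioned and configured purely from a Kubernetes Ingress object (the service name below is illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing   # provision a public ALB
    alb.ingress.kubernetes.io/target-type: ip           # route straight to pod IPs
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-service
                port:
                  number: 80
```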

More on EKS can be found here; the main feature we will be focusing on is EKS Fargate.

The ARC

The Actions-runner-controller (ARC) is a Kubernetes controller that creates self-hosted runners on your Kubernetes cluster. With a few commands, you can set up self-hosted runners that scale up and down based on demand, and since the runners are ephemeral and container-based, new instances can be brought up rapidly and cleanly.

The typical setup for ARC is as follows:

  1. RunnerDeployment: similar to the standard Deployment kind, but a custom resource specific to ARC.
  2. HorizontalRunnerAutoscaler: much like the HorizontalPodAutoscaler, it defines how the runners scale; what is unique is that you can scale your runners on GitHub workflow-specific metrics, i.e. TotalNumberOfQueuedAndInProgressWorkflowRuns and PercentageRunnersBusy.

Fig 6: Example of RunnerDeployment and Autoscaler.
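The original figure does not survive in this version of the post, but based on the upstream ARC documentation, the pair looks roughly like this (the repository and resource names are illustrative):

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  template:
    spec:
      repository: example-org/example-repo   # runners register against this repo
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runner-autoscaler
spec:
  scaleTargetRef:
    name: example-runnerdeploy
  minReplicas: 0                             # scale to zero when nothing is queued
  maxReplicas: 10
  metrics:
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - example-org/example-repo
```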

Running a Kubernetes cluster just for CI/CD runners may seem like overkill at first, especially if it means looking after nodes that run 24/7; however, EKS Fargate removes node management and leaves it to AWS.

EKS + Fargate Concepts

EKS Fargate introduces a few concepts that effectively make this solution somewhat serverless(ish). The key concept is the Fargate Profile, which declares which pods will run on Fargate. This is done by allocating a namespace, or a namespace plus labels, as a selector; any pods that match the selector will be scheduled onto a Fargate “node”.
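As a sketch using eksctl (the cluster name, region and namespace are illustrative), a Fargate profile for a runner namespace might look like:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ci-cluster                # illustrative cluster name
  region: ap-southeast-2
fargateProfiles:
  - name: runners
    selectors:
      # Any pod created in this namespace is scheduled onto Fargate;
      # a "labels:" map can be added to the selector to narrow the match.
      - namespace: actions-runners
```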

Just as with ECS Fargate, EKS Fargate will spin up a Fargate node if there is not one available to run our pods.

In the case of ARC, to let our runners run on Fargate, a RunnerDeployment (RD) just needs to target the Fargate-profiled namespace. On applying this RD, no actual runners are spun up until the HorizontalRunnerAutoscaler calls for them. There are therefore no nodes running constantly, and here’s the real benefit: unlike in the EC2 solution, these pods are spun up and scaled down as needed, and we are charged only for the pods themselves, not for multiple EC2 instances.
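Tying it together, a hedged sketch of such an RD (reusing the illustrative namespace from the Fargate profile above):

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: fargate-runnerdeploy
  namespace: actions-runners      # matches the Fargate profile selector,
                                  # so runner pods land on Fargate
spec:
  template:
    spec:
      repository: example-org/example-repo
```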

From a security standpoint, we have reduced our attack surface by removing the public endpoint from the solution, and from a reliability perspective, our chain of events no longer relies on several Lambda functions to spin runners up and down.

Conclusion

To sum up, EKS Fargate enhances an existing community solution for a CI/CD platform: it simplifies management, reduces overall costs, and reduces tech debt, since we are not reinventing the wheel.

For help tackling these challenges and much more, CMD Solutions is here to help address your ever-evolving business requirements through an AWS application modernisation journey.

We have a proven track record of providing modernisation assessments, advisory services, training and advanced development to our customers.

Ready to get started? Visit us here and discover how we can partner with you to build, run and scale Amazon Web Services.