Published October 10, 2023
Written by Praveen Kumar Patidar

Introduction

Greetings! This is the second part of our GitHub Actions Runner on EKS series. In Part 1, we covered the steps required to configure the VPC and EKS for GitHub runners. This post centres on setting up Karpenter and the GitHub Actions Runner Controller, and on the tests to carry out on Karpenter-provisioned nodes.

GitHub Personal Access Token

To integrate GitHub and EKS, a Personal Access Token (PAT) is required. With the fine-grained tokens feature, we can issue a token limited to only the permissions it needs, as opposed to classic tokens, which carry the broad permissions of the token owner.

To create a fine-grained personal access token with the resource owner set to your GitHub ID or organisation, go to GitHub → Settings → Developer settings → Personal access tokens → Fine-grained tokens.

Grant the token the permissions below (if the resource owner is an organisation) –

Repository Permissions

  • Actions: Read and write

  • Administration: Read and write

  • Commit statuses: Read and write

  • Deployments: Read and write

  • Discussions: Read-only

  • Environments: Read and write

  • Issues: Read-only

  • Merge queues: Read-only

  • Metadata: Read-only

  • Pull requests: Read and write

  • Secrets: Read and write

  • Webhooks: Read and write

  • Workflows: Read and write

Organisation Permissions

  • Plan: Read-only

  • Projects: Read-only

  • Secrets: Read and write

  • Self-hosted runners: Read and write

  • Variables: Read-only

  • Webhooks: Read-only

Copy the token and store it securely.
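One optional way to sanity-check the token before wiring it into the cluster is to call a GitHub API endpoint the runner controller relies on, for example listing the organisation's self-hosted runners (fill in the placeholders with your own values; the call needs the Self-hosted runners permission granted above):

# Optional check: list the organisation's self-hosted runners using the new PAT.
curl -s \
  -H "Authorization: Bearer <YOUR_FINE_GRAINED_PAT>" \
  -H "Accept: application/vnd.github+json" \
  https://api.github.com/orgs/<ORGANIZATION NAME>/actions/runners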

Core Apps

Let’s revisit the repository, aws-eks-terraform-demos, and proceed with deploying the tf-apps-core module. But first, let’s take a moment to review the contents of the module.

tf-apps-core/kubernetes.tf The file configures the connection to EKS via the Kubernetes and Helm providers. The 3 Musketeers pattern uses containers to run the automation, so the container image includes the kubectl installation and the AWS CLI that the providers need (both providers authenticate via aws eks get-token).

.
.
.
provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority.0.data)
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    args        = ["eks", "get-token", "--cluster-name", data.aws_eks_cluster.cluster.name]
    command     = "aws"
  }
}

provider "helm" {
  kubernetes {
    host                   = data.aws_eks_cluster.cluster.endpoint
    cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority.0.data)
    exec {
      api_version = "client.authentication.k8s.io/v1beta1"
      args        = ["eks", "get-token", "--cluster-name", data.aws_eks_cluster.cluster.name]
      command     = "aws"
    }
  }
}

tf-apps-core/helm_release.tf This file organises application deployment using Helm release resources in Terraform. It includes various sample applications as well as the deployment of Karpenter. One essential application needed by the GitHub controller is cert-manager, which is also defined in this file.

.
.
.
.
resource "helm_release" "karpenter" {
  name       = "karpenter"
  repository = "https://charts.karpenter.sh/"
  chart      = "karpenter"
  namespace  = "platform"
  values = [
    file("${path.module}/values/karpenter.yaml")
  ]
  set {
    name  = "clusterName"
    value = local.workspace.cluster_name
  }
  set {
    name  = "clusterEndpoint"
    value = data.aws_eks_cluster.cluster.endpoint
  }
  set {
    name  = "aws.defaultInstanceProfile"
    value = "eks-${local.workspace.cluster_name}-karpenter-instance-profile"
  }
  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/eks-${local.workspace.cluster_name}-karpenter-irsa"
  }
  depends_on = [kubernetes_namespace_v1.platform, helm_release.cert-manager]
}

resource "helm_release" "cert-manager" {
  name       = "cert-manager"
  repository = "https://charts.jetstack.io"
  chart      = "cert-manager"
  namespace  = "platform"
  version    = "1.11.2"
  values = [
    file("${path.module}/values/cert-manager.yaml")
  ]
  depends_on = [kubernetes_namespace_v1.platform, helm_release.alb]
}

tf-apps-core/actions-runner-controller.tf The file contains the GitHub Actions Runner Controller application specification. The idea is to prepare the secret first and then launch the application that consumes it.
Note: In the real world the secret could be kept in a more secure place, e.g. AWS Secrets Manager or an AWS SSM Parameter Store SecureString; however, to reduce complexity, this solution passes it as a sensitive Terraform variable, github_token.

.
.
.
resource "helm_release" "actions-runner-controller" {
  name       = "actions-runner-controller"
  repository = "https://actions-runner-controller.github.io/actions-runner-controller"
  chart      = "actions-runner-controller"
  namespace  = "actions-runner-system"
  values = [
    file("${path.module}/values/actions-runner-controller.yaml")
  ]
  depends_on = [
    kubernetes_namespace_v1.actions-runner-system,
    helm_release.karpenter,
    helm_release.cert-manager,
  ]
}
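The github_token value itself is consumed as a sensitive Terraform variable. The exact wiring is not shown above; a minimal sketch of the pattern, assuming the controller's default controller-manager secret is used (the resource name here is illustrative, not the repo's exact code), could look like this:

variable "github_token" {
  description = "github_token supposed to be in environment variable"
  type        = string
  sensitive   = true
}

# Sketch only: the Actions Runner Controller's default authentication secret is a
# Kubernetes secret named "controller-manager" in the actions-runner-system namespace
# with a "github_token" key.
resource "kubernetes_secret_v1" "controller_manager" {
  metadata {
    name      = "controller-manager"
    namespace = "actions-runner-system"
  }
  data = {
    github_token = var.github_token
  }
}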

Deploy Core Apps

Before deploying the core apps module, you need to provide the github_token Terraform variable. Because we are using the 3 Musketeers pattern, the container run prompts for the value when you execute the deployment command.
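If you prefer not to be prompted interactively, Terraform also reads the value from a TF_VAR_github_token environment variable, provided your docker-compose/envvars setup passes it through to the container (an assumption about this repo's configuration):

# Optional: export the PAT so terraform apply does not prompt for it
export TF_VAR_github_token=<YOUR_FINE_GRAINED_PAT>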

Now, run the below command to install the application –

TERRAFORM_ROOT_MODULE=tf-apps-core TERRAFORM_WORKSPACE=demo  make applyAuto

You're now on a new, empty workspace. Workspaces isolate their state,
so if you run "terraform plan" Terraform will not see any existing state
for this configuration.
docker-compose run --rm envvars ensure --tags terraform
docker-compose run --rm devops-utils sh -c 'cd tf-apps-core; terraform apply -auto-approve'

var.github_token

  github_token supposed to be in environment variable
Enter a value:

As the variable is marked sensitive, you won't see any characters when you paste the PAT value here. Just paste it and hit Enter. The output should look like the below –

helm_release.dashboard: Still creating... [20s elapsed]
helm_release.actions-runner-controller: Still creating... [10s elapsed]
helm_release.dashboard: Creation complete after 24s [id=dashboard]
helm_release.actions-runner-controller: Still creating... [20s elapsed]
helm_release.actions-runner-controller: Still creating... [30s elapsed]
helm_release.actions-runner-controller: Still creating... [40s elapsed]
helm_release.actions-runner-controller: Creation complete after 48s [id=actions-runner-controller]

Apply complete! Resources: 9 added, 0 changed, 0 destroyed.

At the end of this step, the core apps, including the GitHub Actions Runner Controller and Karpenter, are installed. The setup is now ready for creating Karpenter Provisioners along with GitHub runners.

Let's test the deployment –

>  k get pods -A
NAMESPACE               NAME                                                READY   STATUS    RESTARTS   AGE
actions-runner-system   actions-runner-controller-67c7b77884-mvpbm          2/2     Running   0          10m
kube-system             aws-node-mv7l4                                      1/1     Running   0          24h
kube-system             aws-node-t6dhh                                      1/1     Running   0          9h
kube-system             coredns-754bc5455d-d4zgg                            1/1     Running   0          9h
kube-system             coredns-754bc5455d-m4lq7                            1/1     Running   0          21h
kube-system             eks-aws-load-balancer-controller-68979c8b74-559vj   1/1     Running   0          13m
kube-system             eks-aws-load-balancer-controller-68979c8b74-w2dvj   1/1     Running   0          13m
kube-system             kube-proxy-2v7bb                                    1/1     Running   0          24h
kube-system             kube-proxy-8sfqj                                    1/1     Running   0          9h
platform                bitnami-metrics-server-6997d6597c-mrzgr             1/1     Running   0          12m
platform                cert-manager-74dc665dfc-x5b5n                       1/1     Running   0          12m
platform                cert-manager-cainjector-6ddfdc5565-nm8nt            1/1     Running   0          12m
platform                cert-manager-webhook-7ccc87556c-5nbdw               1/1     Running   0          12m
platform                dashboard-kubernetes-dashboard-67c5c7b565-gl4tr     0/1     Pending   0          10m
platform                karpenter-59cc9bb69-tn594                           2/2     Running   0          11m
platform                karpenter-59cc9bb69-wwrw7                           2/2     Running   0          11m


> kubectl get nodes
NAME                                             STATUS   ROLES    AGE   VERSION
ip-10-0-14-205.ap-southeast-2.compute.internal   Ready    <none>   24h   v1.25.9-eks-0a21954
ip-10-0-23-87.ap-southeast-2.compute.internal    Ready    <none>   9h    v1.25.9-eks-0a21954

Note: Currently, the dashboard app is in a Pending state due to its nodeSelector configuration. It is waiting for a node launched by Karpenter to become available. However, since there is no Karpenter Provisioner yet, the pod will remain Pending.
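If you want to confirm why the pod is Pending, describe it; the events should show that no node currently satisfies its nodeSelector and tolerations (pod name taken from the listing above):

> kubectl describe pod dashboard-kubernetes-dashboard-67c5c7b565-gl4tr -n platform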

GHA Apps (Karpenter Provisioners and GHA Runners)

We have reached the final stage of the solution. Here, we will configure the Karpenter Provisioners, which specify node properties much like EKS node groups do, and the GitHub RunnerDeployment, which configures the runners.

Let’s understand the core components in terms of deployment –

Karpenter Provisioners

In Karpenter, nodes are defined using Provisioner resources. This means you can manage instance properties, such as the capacity type and instance type, through the Provisioner specification. Additionally, you can configure EKS node properties such as labels and taints. One notable feature of Karpenter is that it watches pending workloads and matches their tolerations against the taint specification of the Provisioners. Karpenter doesn't spin up a node unless there is a workload requesting it; monitoring and maintaining nodes is Karpenter's responsibility. If the workload requiring a specific toleration configuration is removed, the node is cleaned up according to the TTL configuration in the Provisioner.

In the repo, we use two Provisioners simply to demonstrate more than one. The only difference between them is the taint configuration –

tf-apps-gha/values/karpenter/default_provisioner.tftpl

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: ${ name }
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["t3.xlarge"]
  kubeletConfiguration:
    clusterDNS: ["172.20.0.10"]
    containerRuntime: containerd
  limits:
    resources:
      cpu: 1000
  providerRef:
    name: ${ provider_ref }
  ttlSecondsAfterEmpty: 60
  ttlSecondsUntilExpired: 907200
  labels:
    creator: karpenter
    dedicated: ${ taint_value }
  taints:
    - key: dedicated
      value: ${ taint_value }
      effect: NoSchedule

tf-apps-gha/karpenter_provisioner.tf This file creates the Provisioners using the template above –

.
.
resource "kubernetes_manifest" "gha_provisioner" {
  computed_fields = ["spec.limits.resources", "spec.requirements"]
  manifest = yamldecode(
    templatefile("${path.module}/values/karpenter/default_provisioner.tftpl", {
      name         = "github-runner",
      provider_ref = "default-provider",
      taint_value  = "github-runner"
    })
  )
}

resource "kubernetes_manifest" "cluster_core_provisioner" {
  computed_fields = ["spec.limits.resources", "spec.requirements"]
  manifest = yamldecode(
    templatefile("${path.module}/values/karpenter/default_provisioner.tftpl", {
      name         = "cluster-core",
      provider_ref = "default-provider",
      taint_value  = "cluster-core"
    })
  )
}

Karpenter AWSNodeTemplate

Provisioners utilize AWSNodeTemplate resources to manage various properties of their instances, including volumes, AWS Tags, Instance profiles, Networking, and more. It’s possible for multiple provisioners to use the same AWSNodeTemplate, and you can customize the node template to fit your specific workload needs – for instance, by creating nodes specifically for Java workloads or batch jobs.

tf-apps-gha/values/karpenter/default_nodetemplate.tftpl The template file for AWSNodeTemplate. In this demo, we will use one AWSNodeTemplate.

---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: ${ name }

spec:
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        deleteOnTermination: true
        encrypted: true
  tags:
    karpenter.sh/discovery: ${ cluster_name }

  instanceProfile: ${ instance_profile_name }

  securityGroupSelector:
    # aws-ids: ["${ security_groups }"]
    Name: ${ security_groups_filter }
  subnetSelector:
    Name: ${ subnet_selecter }

tf-apps-gha/karpenter_provisioner.tf The file also creates the AWSNodeTemplate using the template above –

.
.
resource "kubernetes_manifest" "gha_nodetemplate" {
  manifest = yamldecode(
    templatefile("${path.module}/values/karpenter/default_nodetemplate.tftpl", {
      name                   = "default-provider"
      cluster_name           = local.workspace.cluster_name
      subnet_selecter        = "${local.workspace.vpc_name}-private*",
      security_groups        = "${join(",", data.aws_security_groups.node.ids)}",
      security_groups_filter = "eks-cluster-sg-${local.workspace.cluster_name}-*"
      instance_profile_name  = "eks-${local.workspace.cluster_name}-karpenter-instance-profile"
    })
  )
}
.
.

GitHub Runner Deployment

We already have the GitHub Actions Runner Controller application running. To configure the runner, we will utilise the GitHub RunnerDeployment resource. This resource will allow us to specify the ServiceAccountName, Labels, Runner Group, and the Organisation or Repository name for which the runner will be accessible. For this demonstration, we will use organisation-level runners that will be registered to the Default runner group and will be accessible to all repositories within the organisation.

Other key settings in the RunnerDeployment are the tolerations and nodeSelector. As noted in the Karpenter Provisioner section, we used a Kubernetes taint configuration there; this combination results in a Karpenter node being launched when a runner is deployed. You can see that in the current cluster only two nodes are visible, and they are part of the EKS node group.

tf-apps-gha/values/github-runners/default-runner.tftpl The template file for RunnerDeployment.

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: ${ name }
  namespace: actions-runner-system
spec:
  template:
    spec:
      organization: ${ organization }
      group: ${ group }
      serviceAccountName: ${ service_account_name }
      labels:
        - actions_runner_${ environment }
        - default
        - ${ cluster_name }
      resources:
        requests:
          cpu: 200m
          memory: "500Mi"
      dnsPolicy: ClusterFirst
      nodeSelector:
        dedicated: github-runner
      tolerations:
        - key: dedicated
          value: github-runner
          effect: NoSchedule

GitHub HorizontalRunnerAutoscaler

HorizontalRunnerAutoscaler is a custom resource type that comes with the GitHub Actions Runner Controller application. The resource is used to define fine-grained scaling options. More reading is available at:

https://github.com/actions/actions-runner-controller/blob/master/docs/automatically-scaling-runners.md

tf-apps-gha/values/github-runners/default-runner-hra.tftpl This is the template file for HorizontalRunnerAutoscaler. You can define scaleTargetRef for RunnerDeployment.

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: ${ name }-hra
  namespace: actions-runner-system
spec:
  scaleDownDelaySecondsAfterScaleOut: 600
  scaleTargetRef:
    kind: RunnerDeployment
    name: ${ name }
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.75'    # The percentage of busy runners at which the number of desired runners are re-evaluated to scale up
    scaleDownThreshold: '0.3'   # The percentage of busy runners at which the number of desired runners are re-evaluated to scale down
    scaleUpFactor: '1.4'        # The scale up multiplier factor applied to desired count
    scaleDownFactor: '0.7'      # The scale down multiplier factor applied to desired count

Here is the Terraform resource file that deploys both the RunnerDeployment and the HorizontalRunnerAutoscaler –

tf-apps-gha/github-runners.tf

resource "kubernetes_manifest" "default_runner_test" {
  computed_fields = ["spec.limits.resources", "spec.replicas"]
  manifest = yamldecode(
    templatefile("${path.module}/values/github-runners/default-runner.tftpl", {
      name                 = "default-github-runner",
      cluster_name         = local.workspace.cluster_name,
      service_account_name = "actions-runner-controller",
      environment          = terraform.workspace,
      group                = local.gh_default_runner_group,
      organization         = local.gh_organization
    })
  )
}

resource "kubernetes_manifest" "default_runner_test_hra" {
  manifest = yamldecode(
    templatefile("${path.module}/values/github-runners/default-runner-hra.tftpl", {
      name = "default-github-runner",
    })
  )
}

Deployment

Time to deploy! Run the below command to deploy the Karpenter Provisioners and GitHub runners together –

NOTE: Change the organisation name in local.tf to the name of your organisation.

TERRAFORM_ROOT_MODULE=tf-apps-gha TERRAFORM_WORKSPACE=demo  make applyAuto

kubernetes_manifest.gha_nodetemplate: Creating...
kubernetes_manifest.cluster_core_provisioner: Creating...
kubernetes_manifest.default_runner_test_hra: Creating...
kubernetes_manifest.gha_provisioner: Creating...
kubernetes_manifest.gha_nodetemplate: Creation complete after 3s
kubernetes_manifest.default_runner_test_hra: Creation complete after 3s
kubernetes_manifest.gha_provisioner: Creation complete after 3s
kubernetes_manifest.cluster_core_provisioner: Creation complete after 3s
kubernetes_manifest.default_runner_test: Creating...
kubernetes_manifest.default_runner_test: Creation complete after 8s

Apply complete! Resources: 5 added, 0 changed, 0 destroyed.

Testing

Once you have reached this stage, the deployment process should be complete. It is time to verify the deployment.

Dashboard Application

The dashboard app that was Pending earlier is now up and running. The app is configured to run on nodes labelled cluster-core, which are defined by a Karpenter Provisioner; once the Provisioner was deployed, Karpenter fulfilled the request by launching a node. You can also see that nodes associated with the Provisioners have now been launched.
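For context, the scheduling constraints that kept the dashboard Pending earlier are the same nodeSelector/tolerations pair used for the runners, just pointed at the cluster-core Provisioner. The dashboard's Helm values would carry something along these lines (a sketch – the exact value keys depend on the chart):

nodeSelector:
  dedicated: cluster-core
tolerations:
  - key: dedicated
    value: cluster-core
    effect: NoSchedule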

> k get provisioners
NAME            AGE
cluster-core    4m46s
github-runner   4m46s

> k get nodes
NAME                                             STATUS   ROLES    AGE     VERSION
ip-10-0-13-6.ap-southeast-2.compute.internal     Ready    <none>   5m48s   v1.25.9-eks-0a21954
ip-10-0-21-160.ap-southeast-2.compute.internal   Ready    <none>   5m42s   v1.25.9-eks-0a21954
ip-10-0-23-87.ap-southeast-2.compute.internal    Ready    <none>   32h     v1.25.9-eks-0a21954
ip-10-0-8-43.ap-southeast-2.compute.internal     Ready    <none>   8h      v1.25.9-eks-0a21954

> k get pods -n platform
NAME                                              READY   STATUS    RESTARTS   AGE
bitnami-metrics-server-6997d6597c-mrzgr           1/1     Running   0          23h
cert-manager-74dc665dfc-x5b5n                     1/1     Running   0          23h
cert-manager-cainjector-6ddfdc5565-mplh6          1/1     Running   0          19h
cert-manager-webhook-7ccc87556c-9w9t4             1/1     Running   0          19h
dashboard-kubernetes-dashboard-55d79fdd77-nfmn2   1/1     Running   0          72m
karpenter-59cc9bb69-fd45p                         2/2     Running   0          8h
karpenter-59cc9bb69-wwrw7                         2/2     Running   0          23h

Now delete the dashboard deployment and wait for about a minute. You will notice that the node labelled cluster-core has been removed. This happens because the dashboard was the only app running on that node; once the app is removed, the node becomes empty and Karpenter automatically deletes it based on the Provisioner's one-minute ttlSecondsAfterEmpty configuration.

> k delete deploy dashboard-kubernetes-dashboard  -n platform
deployment.apps "dashboard-kubernetes-dashboard" deleted
> # Wait for a minute
> k get nodes
NAME                                             STATUS   ROLES    AGE    VERSION
ip-10-0-21-160.ap-southeast-2.compute.internal   Ready    <none>   8m8s   v1.25.9-eks-0a21954
ip-10-0-23-87.ap-southeast-2.compute.internal    Ready    <none>   32h    v1.25.9-eks-0a21954
ip-10-0-8-43.ap-southeast-2.compute.internal     Ready    <none>   8h     v1.25.9-eks-0a21954

GitHub Runners

To test the runners, first check the default GitHub runner pod running in the actions-runner-system namespace. You should see the actions-runner-controller pod along with the default GitHub runner. A single runner is running, as per the minimum count specified in the HorizontalRunnerAutoscaler.

> k get po -n actions-runner-system
NAME                                         READY   STATUS    RESTARTS   AGE
actions-runner-controller-67c7b77884-9mpfg   2/2     Running   0          82m
default-github-runner-sn7kg-pxpt8            2/2     Running   0          5h19m

Let’s see the runner in the GitHub Console –

Note: The URLs below may not be accessible to you, as access to the organisation is restricted.

As we registered the runner at the organisation level, in the Default runner group, you can see the runner listed in the organisation settings –

https://github.com/organizations/<ORGANIZATION NAME>/settings/actions/runners

[Image: registered runners listed in the organisation settings]

As the Default runner group is configured with access to all repositories in the GitHub organisation, the runner should be available in every repository. Let's check the test-workflow repo in the organisation –

https://github.com/<ORGANIZATION NAME>/<REPO NAME>/settings/actions/runners

Now you are ready to run a workflow using the runner's labels, as listed in the image above, along with self-hosted – e.g.

name: sample-workflow
on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]
  workflow_dispatch:
jobs:
  build:
    runs-on: [ self-hosted , actions_runner_demo, lab-demo-cluster ]
    steps:
      - uses: actions/checkout@v3
      - name: Run a one-line script
        run: echo Hello, world!

Considerations

After running the demo, you should have gained insight into the integration points between EKS, GitHub, and Karpenter. There are several options available to diversify the nature of the runners. Here are some pointers to consider –

  • Different kinds of runners can be used for different workflow types. e.g. java, DevOps tools, NodeJS etc.

  • EKS service accounts can be used with AWS IAM (IRSA) to achieve pod-level security (see the sketch after this list).

  • Use runner groups to organise the runners effectively; this also helps define different levels of security.

  • Use separate AWSNodeTemplates to achieve instance-level security for different runner types, e.g. prod nodes vs non-prod nodes.
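As an illustration of the pod-level security pointer above, runner pods can assume an AWS IAM role through IRSA by annotating the service account referenced in the RunnerDeployment (a sketch – the names are illustrative):

# Sketch: a runner service account annotated for IRSA so runner pods assume an IAM role.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: github-runner
  namespace: actions-runner-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/<RUNNER_POD_ROLE>

The RunnerDeployment's serviceAccountName would then reference this service account instead of actions-runner-controller.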

Learnings

From our implementation experience, we've gained valuable insights, one of which is the recommendation to avoid Karpenter's consolidation feature here. With consolidation, runner pods are susceptible to being evicted mid-job, whereas our solution requires pods to follow the lifecycle managed by the GitHub Runner Controller.
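In the v1alpha5 Provisioner API used here, consolidation and ttlSecondsAfterEmpty are mutually exclusive, so the Provisioners above simply stick with the empty-node TTL and never enable consolidation. A sketch of the relevant portion of the spec:

spec:
  # consolidation:
  #   enabled: true          # deliberately left disabled for runner nodes
  ttlSecondsAfterEmpty: 60   # remove a node one minute after its last runner pod finishes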

Clean up

To clean up the environment, here are the commands to be run in order.

Note: You have to type yes at each prompt after running the command.

TERRAFORM_ROOT_MODULE=tf-apps-gha TERRAFORM_WORKSPACE=demo  make destroy
# Make sure the Karpenter nodes are destroyed before running the next command (usually ~1 min).

TERRAFORM_ROOT_MODULE=tf-apps-core TERRAFORM_WORKSPACE=demo  make destroy
# The command asks for the github_token variable; you can type anything and press Enter
# to proceed. Once done, you again need to type 'yes' before the destroy.

TERRAFORM_ROOT_MODULE=tf-eks TERRAFORM_WORKSPACE=demo  make destroy

TERRAFORM_ROOT_MODULE=tf-vpc TERRAFORM_WORKSPACE=demo  make destroy

Conclusion

With that, you are finally able to run GitHub Actions runners on EKS using Karpenter. There are many options available to diversify the nature of the runners and their scalability. Combining the capabilities of Karpenter with GitHub runners on EKS is a powerful addition to any complex DevOps platform.