
When undergoing Application Transformation, choosing a container orchestration solution is a decision almost every organisation using AWS will face at some point. Having implemented a few Elastic Container Service and Elastic Kubernetes Service solutions (including at the client I am currently consulting with), I have gained insight into what questions to ask when choosing between the two, and some considerations that may not be immediately obvious.

Considerations for picking EKS

Cultural and organisational

Kubernetes needs a platform team. It is understandable that at a glance, both ECS and EKS look like managed options, but make no mistake – Amazon EKS offloads the bare minimum of operational burden to AWS. Unlike ECS, the Kubernetes that EKS provides cannot realistically support an organisation’s requirements out of the box without significant investment. It is a building block upon which you can develop a platform that is tailored to the requirements of your organisation.

‘Kubernetes is a platform for building platforms. It’s a better place to start; not the endgame’.

– Kelsey Hightower (@kelseyhightower) November 27, 2017

In contrast, ECS Fargate can be run without the support of a platform team. The non-functional requirements to run an application (metrics, logging, load balancing, autoscaling) can be fulfilled out of the box with native AWS services.

At my current client, this was a tough choice. The lean infrastructure team almost made this discussion a non-starter, but a few design choices in the platform to reduce operational burden make it a workable solution.

Here are a few questions to keep in mind:

  • How well understood are the underlying technologies of these platforms?
  • How intimately do software engineers understand AWS and containerisation?
  • Are they comfortable writing their own Infrastructure as Code and Dockerfiles, or do they need the assistance of a centralised infrastructure team?
  • Do the members of the proposed platform team have any experience in Kubernetes?

Whilst you can leverage the knowledge of third parties to set up your platform (via pre-built Kubernetes distributions or consulting services) and upskill existing staff in order to build a platform team, you will probably want some level of prior experience in managing container platforms in the team.

Teams that don’t – and often, infrastructure teams that also have other responsibilities – will struggle to keep their Kubernetes clusters up to date. Kubernetes updates are rapid (now ~3 times a year) and often have significant deprecations and changes. Leaving your EKS clusters on old versions indefinitely is not an option either, as EKS will automatically upgrade clusters that are on unsupported versions.
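For what it is worth, triggering the upgrade itself is only a couple of API calls – the real work is in chasing down deprecated APIs beforehand. A rough sketch, with hypothetical cluster and node group names:

# Upgrade the EKS control plane one minor version at a time.
aws eks update-cluster-version \
  --name my-cluster \
  --kubernetes-version 1.29

# Managed node groups are upgraded separately, after the control plane.
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup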

In addition to Kubernetes expertise, a deep understanding of Linux, AWS services and containers is always useful for platform engineers in debugging issues.

At my client, DevOps culture is wholly embraced. Software engineering teams have very strong AWS and containerisation capabilities, which almost makes ECS the better choice – but that expertise certainly does not hurt for Kubernetes either.

Mindshare in the industry. Kubernetes is popular, there’s no doubt about it. Now, the bigger risk here is probably jumping on the Kubernetes bandwagon without careful consideration, but it is still useful to understand which option is more popular in the wider industry.

This can influence the ability to hire talent, the amount of upskilling required, the amount and quality of tooling available and the longevity of the platform.

Now, for some of the more technical considerations…

Compute

Workload size should not disqualify ECS. A common misconception I see is that ECS “cannot scale”. The implication here is that applications deployed to it will fall over or fail to meet performance requirements at a certain workload size. Limits apply, of course, but for the vast majority of workloads, ECS is perfectly equipped to handle scale.

Fundamentally, there is no difference in application performance between containers created by Kubernetes and containers created by ECS; ECS and EKS are just the control plane and do not directly influence application performance. ECS Fargate has a soft limit of 1000 concurrent containers on brand-new accounts, which can be increased if required.
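If you do approach that limit, raising it is a Service Quotas request. A minimal sketch – the quota code below is a placeholder, so list the Fargate quotas first to find the right one, and the desired value here is arbitrary:

# Find the relevant Fargate quota and its quota code.
aws service-quotas list-service-quotas --service-code fargate

# Request an increase using the quota code found above (placeholder shown).
aws service-quotas request-service-quota-increase \
  --service-code fargate \
  --quota-code L-XXXXXXXX \
  --desired-value 2000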

With that said, there is one major exception to this, which I will get to later on.

What are your requirements for autoscaling latency?
ECS Fargate and EKS have two distinct scaling models:

  1. EKS will scale pods very responsively, but launching EC2 instances when more cluster capacity is required is slower
  2. ECS Fargate is slower to launch tasks (the ECS equivalent of pods) but does not require you to autoscale underlying infrastructure

It is rather difficult to measure scaling latency, as AWS does not expose the timings for scaling events in an easily digestible format (if at all), but my impression is that ECS Fargate is not significantly faster than EC2. Combined with the fact that scaling pods within existing cluster capacity is incomparably faster, this makes EKS the winner for me.

Another factor is that ECS’s Application Auto Scaling has a minimum resolution of 1 minute per metric point, whereas Kubernetes re-evaluates metrics every 15 seconds by default.

The scaling requirements for my client’s applications (highly bursty) make a “serverless” option like Fargate very appealing, but this too was deemed workable with cluster-autoscaler.

Do you need flexibility on which metrics you’re scaling on? Scaling ECS Services on custom metrics is possible, but it is enough of a hassle that it is not usually done.
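To illustrate the hassle: scaling an ECS service on a custom CloudWatch metric means registering a scalable target and attaching a policy by hand, roughly as below. The cluster, service and policy names are hypothetical, and your application must already publish the metric to CloudWatch:

# Register the ECS service as a scalable target.
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/my-cluster/my-service \
  --min-capacity 2 \
  --max-capacity 50

# Attach a target-tracking policy; policy.json holds a
# CustomizedMetricSpecification pointing at your custom metric.
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/my-cluster/my-service \
  --policy-name queue-depth-scaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration file://policy.json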

On the other hand, if you are already using Kubernetes and Prometheus then the metrics are already in the right place – you just need the Prometheus metrics API adapter or KEDA to wire them up.
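As a sketch of the KEDA route – the deployment name, Prometheus address, query and threshold below are all hypothetical:

kubectl apply -f - <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-scaler
spec:
  scaleTargetRef:
    name: example-deployment          # the Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(rate(http_requests_total{app="example"}[2m]))
        threshold: "100"              # target value per replica
EOF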

Monitoring & Observability

What monitoring and observability tools are you using? Speaking of Prometheus, one thing to know about Kubernetes is that almost all tooling for it integrates with Prometheus.

  • If you are already using Prometheus, this is fantastic – you will easily be able to integrate your new platform with existing metrics.
  • If you do not have an existing monitoring solution or your current one is not working that well for you, Prometheus is a good “sane default” as a monitoring tool for a lot of organisations.
  • If you have an existing monitoring tool, there are often solutions for deploying it to Kubernetes or using adapters for Prometheus metrics.

Keep in mind that Prometheus is yet another thing that needs to be maintained – and this means worrying about state, because Prometheus is effectively a database.

Tracing is usually well-supported in Kubernetes components, enabling you to use something like Jaeger.

In comparison, ECS Fargate integrates with CloudWatch for some very basic metrics (CPU and memory utilisation) without any setup, and you have AWS X-Ray for tracing. There is no obvious out-of-the-box option for application performance monitoring (APM). It is possible to integrate Prometheus with ECS, but not by default – you need a custom service discovery mechanism.

In either scenario, if you have an existing APM solution that is instrumented directly in your application code, there is not much to worry about – it will continue to work on either platform.

At my current client, software engineers are already quite familiar with Prometheus. Having everything integrate out of the box was a nice bonus with going to EKS.

What centralised logging solution are you using? As with metrics, ECS Fargate integrates with CloudWatch Logs without much effort. If you do not already have a centralised logging solution, this is a great option to get by until you’re ready to invest in one.

CloudWatch Logs does not have the greatest UX, and it can get expensive, so you are likely going to want to look at implementing something like Elastic Cloud or Sumo Logic at some point. In this case, Kubernetes makes this a lot easier, with packages available to set up e.g. fluentd for Elasticsearch log shipping. To do this in ECS Fargate, you’ll need to run a sidecar for every task, or “proxy” all your logs through CloudWatch Logs (again, expensive).
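For example, a node-level log shipper like fluent-bit (a lighter cousin of fluentd) is a couple of Helm commands on EKS; the output configuration pointing at your Elasticsearch endpoint still needs to be supplied via chart values:

# Install fluent-bit as a DaemonSet so every node ships container logs.
helm repo add fluent https://fluent.github.io/helm-charts
helm install fluent-bit fluent/fluent-bit \
  --namespace logging --create-namespace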

Not needing to run sidecars on every task was another small bonus for using EKS for my client.

Networking

Do you have any unusual networking requirements? Kubernetes offers extreme flexibility in networking. Some novel examples include a CNI written in bash, a CNI using VPN software (WireGuard) and a CNI using eBPF to also provide observability.

Chances are you are going to be using the VPC CNI. Functionally, this is almost exactly the same as ECS Fargate networking: each pod (task) gets its own IP from an ENI in your private subnets. In practice, it is pretty common for EKS clusters to exhaust available private IP addresses, particularly in larger organisations where almost all private IP space has already been assigned and undersized VPCs are provisioned. The workaround for this is not so bad, though.
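For reference, the usual workaround is VPC CNI custom networking: attach a secondary CIDR (commonly from 100.64.0.0/10) to the VPC and tell the CNI to place pod ENIs in subnets carved from it. A sketch with placeholder subnet and security group IDs:

# Tell the VPC CNI to use per-AZ ENIConfigs for pod networking.
kubectl set env daemonset aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true

# One ENIConfig per availability zone, pointing at a secondary-CIDR subnet.
kubectl apply -f - <<'EOF'
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: ap-southeast-2a               # must match the AZ name
spec:
  subnet: subnet-0123456789abcdef0
  securityGroups:
    - sg-0123456789abcdef0
EOF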

What do you need from your ingress? If you are planning to run EKS, you will need to set up your own Ingress controller. Whether that is the AWS Load Balancer Controller or something like Traefik – it is something that needs to be installed and configured. This usually includes something like external-dns to provision Route53 records automatically.
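Once the AWS Load Balancer Controller and external-dns are in place, exposing a service looks something like the following sketch (the hostname and service name are hypothetical):

kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb               # provisions an ALB via the controller
  rules:
    - host: example.mycompany.com     # external-dns creates the Route53 record
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-svc
                port:
                  number: 80
EOF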

If you are using ECS Fargate, all you need is to provision an ALB/NLB, Target Group and Route53 with whatever infrastructure-as-code tool you are using.

At my client, standard networking options and the AWS Load Balancer Controller were fine for typical HTTP and gRPC-based microservices. However, it is quite likely that future applications would require the flexibility that EKS provides, due to the unique nature of my client’s applications.

Do you need service mesh? Probably not. Chances are, there are some more fundamental requirements to be fulfilled before service mesh becomes a worthwhile tradeoff.

But when you do need one, Kubernetes is a lot better supported in this area. There is the heavyweight option of Istio, but also a number of other worthwhile options like Linkerd.

In comparison, ECS has AWS App Mesh. Supposedly some of the other service mesh solutions do also work with ECS, but I suspect this would not be very common.

Security Groups and Network Policies

Chances are you need network segmentation for your workloads. In EKS, there are two main ways you can do this – Security Groups or NetworkPolicies.

Neither of the two is great. Security Groups can be used with the VPC CNI, but with a few caveats: it requires certain (Nitro-based) instance types, SecurityGroupPolicies are not re-evaluated for already-running pods, and it requires a “manual” change to the VPC CNI to enable.
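For the curious, the “manual” change is an environment variable on the CNI DaemonSet, and pods are then matched to security groups with a SecurityGroupPolicy. A sketch with a placeholder group ID:

# Enable pod ENIs on the VPC CNI (requires Nitro-based instance types).
kubectl set env daemonset aws-node -n kube-system ENABLE_POD_ENI=true

kubectl apply -f - <<'EOF'
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: example
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: example                    # only applies to pods created after this
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0
EOF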

If you are not using the VPC CNI, you can use Kubernetes NetworkPolicies. In the past, I saw some severe performance penalties from using these, and the customer had to turn them off (!) – but hopefully that is fixed by now.
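For comparison, a NetworkPolicy restricting ingress to a single upstream app looks like this – the labels and port are hypothetical, and your CNI must actually enforce NetworkPolicies for it to have any effect:

kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-only
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: example                    # the pods being protected
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend           # the only allowed caller
      ports:
        - port: 8080
EOF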

Security

There is not much to mention here for ECS Fargate. Tasks always have full isolation from each other thanks to Firecracker. Permissions to the control plane (AWS APIs) are managed through AWS IAM.

Kubernetes is another story. Security in Kubernetes will be another dimension that you will need to manage. This includes:

  • IAM as well as RBAC for resources in the Kubernetes cluster;
  • Integrating the two with IRSA for applications and the aws-auth ConfigMap for users (see the sketch after this list);
  • Designing access policies for your users and applications;
  • Secrets management and access, often combined with namespaces for logical isolation of workloads;
  • Integrating Parameter Store or Secrets Manager, if you are using them;
  • Ensuring isolation of processes on shared VMs;
  • Possibly using an alternative sandbox for your containers; and
  • Finally, an increased attack surface with the Kubernetes API and node credentials – another reason to make sure you have the resources to keep your cluster up to date.
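To make IRSA concrete, the application side is just an annotated ServiceAccount. A minimal sketch, assuming an IAM role that already trusts the cluster's OIDC provider (the account ID and role name are placeholders):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: example-app
  namespace: default
  annotations:
    # Pods using this ServiceAccount get temporary credentials for the role.
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/example-app
EOF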

Deployments and developer experience

How important is development velocity to you? Deployments and developer experience are (in my opinion) the greatest differentiators between Kubernetes and ECS.

Deployment speed

Note: the following was written before Amazon announced “faster scaling of applications” in Fargate.

ECS (EC2) deployments are slow. This is a common complaint with ECS. ECS Fargate deployments are even slower – so slow that the hardcoded 10-minute timeout was often not enough to deploy 2 tasks.[1]

It takes less than 2 minutes to deploy 1000 pods to Kubernetes.

time make deploy
kubectl apply -f deployment.yml
service/example-svc unchanged
deployment.apps/example-deployment configured
kubectl rollout status deployment/example-deployment --namespace=test
Waiting for deployment "example-deployment" rollout to finish: 0 out of 1000 new replicas have been updated...
Waiting for deployment "example-deployment" rollout to finish: 0 out of 1000 new replicas have been updated...
Waiting for deployment "example-deployment" rollout to finish: 250 out of 1000 new replicas have been updated...
Waiting for deployment "example-deployment" rollout to finish: 500 out of 1000 new replicas have been updated...
Waiting for deployment "example-deployment" rollout to finish: 500 out of 1000 new replicas have been updated...
Waiting for deployment "example-deployment" rollout to finish: 500 out of 1000 new replicas have been updated...
Waiting for deployment "example-deployment" rollout to finish: 500 out of 1000 new replicas have been updated...
Waiting for deployment "example-deployment" rollout to finish: 500 out of 1000 new replicas have been updated...
Waiting for deployment "example-deployment" rollout to finish: 500 out of 1000 new replicas have been updated...
Waiting for deployment "example-deployment" rollout to finish: 500 out of 1000 new replicas have been updated...
Waiting for deployment "example-deployment" rollout to finish: 500 out of 1000 new replicas have been updated...
Waiting for deployment "example-deployment" rollout to finish: 671 out of 1000 new replicas have been updated...
Waiting for deployment "example-deployment" rollout to finish: 671 out of 1000 new replicas have been updated...
Waiting for deployment "example-deployment" rollout to finish: 671 out of 1000 new replicas have been updated...
Waiting for deployment "example-deployment" rollout to finish: 671 out of 1000 new replicas have been updated...
Waiting for deployment "example-deployment" rollout to finish: 928 out of 1000 new replicas have been updated...
Waiting for deployment "example-deployment" rollout to finish: 161 old replicas are pending termination...
Waiting for deployment "example-deployment" rollout to finish: 161 old replicas are pending termination...
Waiting for deployment "example-deployment" rollout to finish: 161 old replicas are pending termination...
deployment "example-deployment" successfully rolled out

real    1m44.285s
user    0m1.337s
sys     0m0.324s

Trying to do the same thing in ECS Fargate took 45 minutes. This is the caveat I mentioned around large workloads. The runtime is fine, but the DX is not pleasant.

This became almost the single defining reason to switch to EKS. The existing EC2-based application deployment times were already problematic due to some technical restrictions around when deployments were possible. This extends to rollbacks: if your deploy times are long, then your rollbacks are too.

Deployment tooling

There is a lot of tooling for AWS services that makes it easy to get started, but the rest is up to you to implement. In contrast, Kubernetes has a robust deployment system built in that does pretty much everything you would want in a rolling deployment – and it is all server-side, so it does not matter what deployment tool you are using.
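For example, rollout behaviour is declared on the Deployment itself and enforced server-side. A sketch with hypothetical names:

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%                   # extra pods allowed during the rollout
      maxUnavailable: 0               # never drop below desired capacity
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: app
          image: example:1.2.3
          readinessProbe:             # gates the rollout on healthy pods
            httpGet:
              path: /healthz
              port: 8080
EOF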

ECS lacks good deployment tooling. Every organisation I have worked for that uses ECS has used a different tool, often written in-house and of dubious quality. Sometimes it is just a bash script around the AWS CLI, which inevitably fails to handle failure scenarios sufficiently.

Amazon offers the AWS Copilot CLI, which claims to be a deployment tool but is really more for setting up infrastructure and CodePipeline pipelines.

Every community tool on GitHub that I came across either fell short of the level of quality I would want for running in a pipeline or had strange quirks that made it a non-starter.

Infrastructure as Code tools have never been great for deploying applications. CloudFormation and Terraform have hardcoded timeouts on deploying ECS Services, and managing task definitions in Terraform is quite awkward too.

To be clear, I am not claiming this is a death sentence for ECS. The reality is that most organisations are not pushing for Continuous Deployment pipelines and will benefit immensely from automating even a small part of their SDLC.

45-minute deploy times are long in the context of multiple deploys a day, but not so much in the context of improving once-a-month deployment schedules.

Advanced deployment techniques

A lot of organisations that are at the stage of looking at container platforms are not yet at the stage of needing advanced deployment techniques. Chances are, a rock-solid rolling update via a CI/CD pipeline is going to be leaps and bounds better than whatever was there previously.

You may find that once your rolling update deployment is in, it is “good enough”.

While it would be nice to get canary deployments out of the box, the reality is that they are not widely used, they are complex and unless you have the maturity to support them, they will not be a net benefit for mitigating risk on deployments.

However, having the option for later on is good!

For EKS, there are a number of controllers available – for example, Flagger. While I have not used it, it claims to do everything you would want it to.

For ECS, Amazon claims that CodeDeploy can do canary deployments. However, it doesn’t take any metrics into account, and it does not even automatically roll back, so it is not really a canary deployment at all.

Cost

Infrastructure cost is not a huge consideration here, as I feel it is probably dwarfed by harder-to-quantify costs such as the opportunity cost of making it (or not making it) to market with your application, the cost of hiring talent with Kubernetes skills, attracting and retaining talent with better Developer Experience (DX), time spent waiting for deployments to complete, etc.

But there are two main considerations for cost when comparing ECS and EKS:

  1. ECS Fargate Spot vs. EC2 Spot: Fargate is slightly more expensive (on paper, anyway).
  2. EKS cluster standby cost: $200/month (remember to multiply this by the number of environments and regions).

Other thoughts

What is the main motivation for moving to a container platform? I have found that often the initial motivations for a container platform are not clear or do not hold up under scrutiny.

  • Kubernetes will not “make it just work all the time”
  • You do not need Kubernetes to do CI/CD
  • You do not need Kubernetes to optimise cloud costs
  • You do not need Kubernetes to use Docker as a packaging mechanism
  • You do not need Kubernetes to embrace DevOps culture
  • Kubernetes will not make you “cloud-agnostic”

There’s a lot you can do with just EC2 ASGs. There are often process changes – particularly with CI/CD – that can make significant, immediate improvements to SDLC without needing to migrate your applications to containers.

I’ll also add that if you do decide to go with Kubernetes, it does not mean that you need to put everything from that point onward into Kubernetes. On the contrary, avoid it if you can. Suitable for Serverless? Throw it in Lambda. Databases? Do not worry about StatefulSets; just keep using RDS. Logging cluster? Please spare yourself the headache of trying to run Elasticsearch on Kubernetes.

Summary

  • Can you afford to dedicate a team to maintaining Kubernetes clusters?
  • Do you have the capacity to keep up with constant EKS cluster upgrades?
  • Do you want to maintain your own Prometheus, Ingress controllers, log aggregators and cluster autoscaling?
  • Where is your organisation at in its SDLC journey? Too large of a jump will bring problems. ECS might be the right amount of complexity in this regard.

For help tackling these questions and much more, CMD Solutions is here to help address your ever-evolving business requirements through an AWS application modernisation journey.

We have a proven track record of providing modernisation assessments, advisory services, training and advanced development to our customers.

Ready to get started? Visit us here and discover how we can partner with you to build, run and scale Amazon Web Services.

[1] We experimented with healthcheck/draining timings, but nothing we did would significantly improve it. We ended up turning off healthcheck validation on deploys until the hardcoded (?!) limit was increased to 15 minutes.