Streamlining the CI/CD process and cloud workload design to deploy a highly available and secure technology infrastructure in AWS.
INDUSTRY
Insurance Services
Technology
SERVICES
Security uplift and cloud workload design for AWS
Uplifting of DevOps processes
Company Overview
The client is an embedded finance company that builds and manages infrastructure for the global insurance industry. It is based in Sydney, Australia and Auckland, New Zealand.
The company primarily offers embedded insurance products that are managed through application programming interfaces (APIs). These products are embedded in the offerings of telecommunication, energy, travel, banking and retail companies and sold to their customers.
The Problem
The client's cloud workload was a monolithic deployment of infrastructure and applications, carried out mostly through manual jobs with minimal pipeline deployment activity. This created a single point of failure and made it difficult to turn around and test new features quickly across all environments. Mantel Group was engaged to:
- Design a CI/CD approach in conjunction with a cloud workload design that adheres to security and compliance requirements
- Enable multiple customers and developers to build and test new capabilities in a secure environment by applying Zero Trust methodologies within an Agile DevOps cycle
The Solution
Mantel Group worked with the client to identify key areas of improvement, define a corresponding solution for each area, and deliver the relevant items in the technical solution:
- Architecture Diagram (mandatory)
- Network Patterns
- DNS and URI Structures
- Secure Data Transfer
- Security and Monitoring
- Metrics Reporting
- Availability
Create a repeatable developer experience through to production to allow for a DevOps style of working
Repeatable patterns were enabled via GitLab CI/CD with AWS CDK in Python to deploy new and repeatable environments across Dev, Staging and Production. The same code base is replicated to a branch per environment, so that each developer creating product features or updates has a 1:1 replica of production in which to validate changes.
The CDK and Python code was also written as reusable patterns, structured so that it can be reused correctly for new environments and white-labelled instances of the workload.
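A minimal sketch of how a branch-per-environment layout might select deployment settings from one code base. The branch names, the `EnvConfig` fields, the account aliases and the stack-naming scheme are assumptions for illustration and are not taken from the client's repository; `CI_COMMIT_BRANCH` is the predefined GitLab CI/CD variable that holds the branch name.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvConfig:
    """Identity of one environment (hypothetical fields)."""
    name: str
    account_alias: str  # hypothetical alias of the AWS account for this environment


# Hypothetical mapping: one long-lived branch per environment.
# Only the identity differs, so every environment stays a 1:1 replica of production.
BRANCH_ENVIRONMENTS = {
    "dev": EnvConfig(name="dev", account_alias="workload-dev"),
    "staging": EnvConfig(name="staging", account_alias="workload-staging"),
    "main": EnvConfig(name="prod", account_alias="workload-prod"),
}


def config_for_branch(branch: str) -> EnvConfig:
    """Return the environment for a branch, failing loudly on unmapped branches."""
    try:
        return BRANCH_ENVIRONMENTS[branch]
    except KeyError:
        raise ValueError(f"no environment is mapped to branch {branch!r}") from None


def stack_name(workload: str, config: EnvConfig, tenant: str = "core") -> str:
    """Build a stack name; `tenant` lets a white-labelled instance reuse the same code."""
    return f"{workload}-{tenant}-{config.name}"


if __name__ == "__main__":
    # Outside a pipeline the variable is unset, so fall back to the dev branch.
    branch = os.environ.get("CI_COMMIT_BRANCH", "dev")
    cfg = config_for_branch(branch)
    print(stack_name("api", cfg), cfg.account_alias)
```

In a CDK app, the returned values would typically be passed to the stack constructors, so the stack code itself never branches on the environment.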
Removing the single point of failure for the Application and Web Layers
Monolithic EC2 instances were broken down into multiple AWS services to create redundancy and disaster recovery (DR) with multiple failover paths, using ECS, CloudFront, S3, WAFv2, RDS Aurora, ACM and more.
ECS and Docker were implemented for the application stack to remove the single point of failure in the existing workload design, in which the API and Administrator portal were hosted on a single EC2 instance.
Build pipelines were rewritten to run docker build and docker push against the local repository available within GitLab cloud. This makes all workloads available via GitLab repos and allows artefacts to be pulled into the CDK infrastructure deployment by the ECS and S3 deploy modules. The website was moved to CloudFront to restrict customer access to edge locations and to host it across multiple locations for better resiliency. DMS was also used to ensure that an instance is available in another location.
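As a rough illustration of the build step, the sketch below assembles the `docker build` and `docker push` commands for an image tagged with the commit SHA. `CI_REGISTRY_IMAGE` and `CI_COMMIT_SHORT_SHA` are predefined GitLab CI/CD variables; the fallback values, the helper names and the tagging scheme are assumptions, not the client's actual pipeline.

```python
import os
import subprocess


def image_reference(registry_image: str, tag: str) -> str:
    """Combine a registry image path and a tag into a full image reference."""
    return f"{registry_image}:{tag}"


def build_and_push_commands(registry_image: str, tag: str, context: str = "."):
    """Return the docker commands (as argument lists) for one build-and-push cycle."""
    ref = image_reference(registry_image, tag)
    return [
        ["docker", "build", "--tag", ref, context],
        ["docker", "push", ref],
    ]


def run(commands, dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the pipeline on the first failing command.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder fallbacks so the sketch also runs outside a GitLab job.
    image = os.environ.get("CI_REGISTRY_IMAGE", "registry.example.com/group/api")
    sha = os.environ.get("CI_COMMIT_SHORT_SHA", "local")
    run(build_and_push_commands(image, sha))
```

Tagging by commit SHA means the infrastructure deployment can pin the exact image that a given pipeline produced.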
Security of the environments
Enhancing security required creating redundancy within the application, securing and storing the data within Australia, and enabling encryption from the client to the service.
To achieve this, the AWS account model was shifted to multiple accounts, one for each workload: Dev, Staging, Prod, Logging and Security. This decentralises data and hosting, as all workloads previously lived within a single account. The new accounts are now managed via AWS Control Tower.
Next, the environments were locked down using Route 53 internally, so that they could only be accessed through tunnels and forwarders, following the Zero Trust model. CORS validation against the API was also enabled to ensure that each request header came from the domain of the initial request, which is the internal Route 53 Resolver. Customers were added via an IdP to Cloudflare so that they could authenticate against an application, following the Zero Trust model. This application resolved to a private white-labelled instance for a specific customer.
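The origin check described above can be sketched as an allow-list comparison on the request's `Origin` header. This is an illustration under assumptions: the allowed origin `https://portal.internal.example.com` and the helper names are hypothetical, and in practice the check may be configured on the API layer rather than written by hand.

```python
from typing import Mapping, Optional
from urllib.parse import urlsplit

# Hypothetical allow-list; a real deployment would load this from configuration.
ALLOWED_ORIGINS = frozenset({"https://portal.internal.example.com"})


def normalise_origin(origin: str) -> Optional[str]:
    """Return scheme://host[:port] in canonical form, or None if the value is not a valid origin."""
    try:
        parts = urlsplit(origin.strip())
        port = parts.port  # raises ValueError for a malformed port
    except ValueError:
        return None
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    # An Origin header never carries a path, query or fragment.
    if parts.path or parts.query or parts.fragment:
        return None
    default_port = 443 if parts.scheme == "https" else 80
    host = parts.hostname
    if port is not None and port != default_port:
        host = f"{host}:{port}"
    return f"{parts.scheme}://{host}"


def cors_headers(request_headers: Mapping[str, str],
                 allowed: frozenset = ALLOWED_ORIGINS) -> dict:
    """Return CORS response headers for an allowed origin, or an empty dict otherwise."""
    origin = normalise_origin(request_headers.get("Origin", ""))
    if origin is None or origin not in allowed:
        return {}
    # Echo the single matching origin rather than "*", and tell caches the response varies.
    return {"Access-Control-Allow-Origin": origin, "Vary": "Origin"}


if __name__ == "__main__":
    print(cors_headers({"Origin": "https://portal.internal.example.com"}))
    print(cors_headers({"Origin": "https://unknown.example.net"}))
```

Returning no CORS headers for an unknown origin leaves the browser to block the cross-origin response, which matches a deny-by-default posture.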
Key products or services we used
AWS Services
- Amazon Virtual Private Cloud (VPC) segregated into network tiers for hosting component services within subnets – DMZ, Application, and Data.
- AWS Transit Gateway controlling all inter-VPC traffic routing and egress traffic.
- Amazon API Gateway for exposing external APIs and facilitating throttling and security controls.
- Amazon Elastic Container Service (ECS) for containerised compute nodes supporting most APIs developed by a 3rd party payments provider.
- Amazon Elastic Compute Cloud (EC2) for non-containerised compute supporting a subset of APIs developed in Java.
- AWS Lambda primarily for event-driven, asynchronous, serverless compute.
- Amazon Relational Database Service (RDS) for managed service data persistence.
- Amazon ElastiCache for Redis for in-memory data store providing sub-millisecond latency to the application tier.
- Amazon CloudFront and S3 for hosting and caching static web content.
- AWS Transfer Family and Amazon S3 for simplified SFTP hosting.
- AWS WAFv2 and AWS Config for providing security in depth.
- AWS Key Management Service (KMS) for key management controlled by a 3rd party payments provider through customer managed keys.
- AWS Certificate Manager for provisioning and managing SSL/TLS certificates used by AWS services.
- AWS CloudTrail, Amazon CloudWatch, and AWS X-Ray for monitoring, logging, and tracing native AWS components and deployed applications.
- AWS Control Tower for account management capabilities and governance guardrails.
- AWS Database Migration Service (DMS) for providing resiliency for the application and supporting BI processing on data.
Third party applications or solutions used
- Okta for IdP and user access controls.
- Cloudflare Zero Trust for access controls to environments.
- Datadog for monitoring, logging and synthetic testing.
Outcomes
All of these changes were successfully implemented and are working as intended. Flow logs and routing checks have shown that the network-related changes function as expected, and manual and automated tests have verified them. Encryption using KMS keys is enabled for 100% of data in transit and at rest.
Lessons Learned
A number of lessons were learned. Two are worth mentioning:
- Extended Validation (EV) certificates cannot be issued via Certificate Manager, but they can be imported. Using pipelines and Boto3, we can monitor expiry and update Application Load Balancers.
- For ECS, handling task version updates and recycling tasks with Lambda functions and Boto3 in CI/CD can assist deployment, allow a quick response to ramifications and speed up the pipeline.
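Both lessons can be sketched with a few Boto3 calls. The functions below take already-constructed clients (for example `boto3.client("acm")`, `boto3.client("elbv2")` and `boto3.client("ecs")`) so that they can run in a pipeline job or a Lambda function. The function names, the 30-day threshold and the overall flow are assumptions rather than the project's actual code; the client methods used are `list_certificates`, `describe_certificate`, `modify_listener` and `update_service`.

```python
from datetime import datetime, timedelta, timezone


def expiring_certificates(acm_client, within_days=30, now=None):
    """Return (arn, not_after) pairs for ACM certificates that expire within `within_days`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now + timedelta(days=within_days)
    expiring = []
    paginator = acm_client.get_paginator("list_certificates")
    for page in paginator.paginate():
        for summary in page["CertificateSummaryList"]:
            arn = summary["CertificateArn"]
            detail = acm_client.describe_certificate(CertificateArn=arn)["Certificate"]
            not_after = detail.get("NotAfter")  # absent until a certificate is issued
            if not_after is not None and not_after <= cutoff:
                expiring.append((arn, not_after))
    return expiring


def attach_certificate(elbv2_client, listener_arn, certificate_arn):
    """Set the default certificate of an HTTPS listener on an Application Load Balancer."""
    elbv2_client.modify_listener(
        ListenerArn=listener_arn,
        Certificates=[{"CertificateArn": certificate_arn}],
    )


def force_new_deployment(ecs_client, cluster, service):
    """Recycle a service's tasks and return the id of the new primary deployment."""
    response = ecs_client.update_service(
        cluster=cluster, service=service, forceNewDeployment=True
    )
    deployments = response["service"]["deployments"]
    return next(d["id"] for d in deployments if d["status"] == "PRIMARY")
```

Passing the clients in keeps the functions easy to exercise with stubs, and a scheduled pipeline job could call `expiring_certificates` and raise an alert or start a re-import when the list is not empty.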