Migrating 40 Microservices to EKS in 8 Weeks Without Downtime
The Situation
A fintech company had 40 microservices running on EC2 instances managed by a patchwork of Ansible playbooks and shell scripts. Their CTO had committed to investors that the platform would be fully containerised on Kubernetes before the next funding round closed — 10 weeks away. Two weeks of internal prep had been done. They needed the remaining 8 weeks executed.
40 services migrated · 8 weeks total · 0 min downtime · 6 teams involved
The Migration Strategy
The critical architectural decision was a traffic-based cutover using weighted DNS routing rather than a hard switch. For each service:
- deploy the service to Kubernetes
- use Route 53 weighted routing to shift traffic 5% → 25% → 50% → 100% over 48 hours
- monitor error rates and p99 latency at each step
- decommission EC2 only after 72 hours of stable production traffic at 100%
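The weight-shifting step can be sketched as a Route 53 change batch. This is a minimal illustration, not the client's actual tooling: the record names and values are hypothetical, and the hosted-zone submission is shown only as a comment.

```python
# Sketch of one weighted-DNS cutover step. Assumes two weighted records for
# the service (names below are hypothetical): one pointing at the EC2 fleet,
# one at the EKS ingress.

def weighted_cutover_batch(record_name, ec2_value, eks_value, eks_weight):
    """Build a Route 53 change batch shifting eks_weight% of traffic to EKS.

    Route 53 weights are relative, so keeping the two weights summing to 100
    lets them read directly as percentages.
    """
    def upsert(set_id, value, weight):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_id,
                "Weight": weight,
                "TTL": 60,  # short TTL so weight changes take effect quickly
                "ResourceRecords": [{"Value": value}],
            },
        }

    return {
        "Comment": f"Shift {eks_weight}% of traffic to EKS",
        "Changes": [
            upsert("ec2", ec2_value, 100 - eks_weight),
            upsert("eks", eks_value, eks_weight),
        ],
    }

# The 5 -> 25 -> 50 -> 100 ramp from the migration plan:
for step in (5, 25, 50, 100):
    batch = weighted_cutover_batch(
        "payments.example.internal.",       # hypothetical service record
        "payments-ec2.example.internal.",
        "payments-eks.example.internal.",
        step,
    )
    # In practice each batch would be submitted via
    # boto3.client("route53").change_resource_record_sets(
    #     HostedZoneId="Z...", ChangeBatch=batch)
```

Keeping both records in place at every step is what makes rollback cheap: reverting is just another weight change, not a redeploy.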
Week-by-Week Execution
Weeks 1–2: EKS cluster, VPC CNI networking, security groups, observability stack (Prometheus, Grafana, Loki), and a reference Helm chart all teams would use as a starting point. The shared chart accelerated adoption dramatically.
Weeks 3–5: Three migration waves of 12–15 services each, ordered by risk, stateless services first. Waves overlapped: each new wave began while the previous wave's services stabilised in production under the weighted cutover.
Weeks 6–7: Stateful services — Redis clusters and two Postgres-backed services using Kubernetes operators. Highest-risk work, most conservative cutover timeline.
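The wave ordering above can be sketched as a simple risk sort. Service names and risk scores here are purely illustrative, not the client's inventory:

```python
# Illustrative sketch of grouping services into migration waves by risk:
# stateless services first, stateful services in the final waves.

def plan_waves(services, wave_size):
    """Sort services lowest-risk first, then chunk them into waves."""
    ranked = sorted(services, key=lambda s: s["risk"])
    return [ranked[i:i + wave_size] for i in range(0, len(ranked), wave_size)]

# Hypothetical inventory; risk 0 = stateless, higher = more state to move.
SERVICES = [
    {"name": "pricing-api", "risk": 0},    # stateless HTTP service
    {"name": "notifications", "risk": 0},  # stateless worker
    {"name": "session-cache", "risk": 1},  # Redis-backed
    {"name": "ledger", "risk": 2},         # Postgres-backed, stateful
]

waves = plan_waves(SERVICES, wave_size=2)
# Stateless services land in the earliest wave; the Postgres-backed
# ledger ends up in the last, most conservative wave.
```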
Week 8: EC2 decommission, cost validation (32% reduction from eliminated EC2 overhead), runbook handover, and team training.
What Actually Went Wrong
Resource requests and limits were missing on every service; nothing in the EC2 world had ever required them. Three services caused node pressure events in week 3 before we enforced LimitRange defaults cluster-wide. Enforcing those defaults is now the first thing we do on every engagement.
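A LimitRange default of this kind can be sketched as below. The CPU and memory values are illustrative placeholders, not the figures used on this engagement:

```python
# Sketch of a cluster-wide LimitRange giving every container default
# requests and limits. Values are illustrative, not the client's numbers.

def default_limit_range(namespace, name="container-defaults"):
    """Build a LimitRange manifest for one namespace.

    `defaultRequest` fills in missing resource requests so the scheduler can
    bin-pack correctly; `default` fills in missing limits so a single
    runaway container cannot starve its node.
    """
    return {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "limits": [{
                "type": "Container",
                "defaultRequest": {"cpu": "100m", "memory": "128Mi"},
                "default": {"cpu": "500m", "memory": "512Mi"},
            }],
        },
    }

# One manifest per service namespace ("payments" is a hypothetical example);
# serialised to YAML, this is what `kubectl apply -f` would consume.
manifest = default_limit_range("payments")
```

A LimitRange only fills in values a pod spec omits, so teams that do set explicit requests and limits are unaffected.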
Key Learnings
- Observability infrastructure must come before migration, not after — you cannot debug a migration blind
- A shared Helm chart template accelerates team adoption; services that deviated from it caused the most friction
- Resource requests and limits are the most common gap when migrating from VM-based infrastructure — enforce them cluster-wide from day one
- Weighted DNS cutover is far safer than a hard switch; real production traffic at 5% surfaces issues that staging never reveals
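The promotion decision behind the weighted cutover can be sketched as a simple gate comparing the EKS deployment against its EC2 baseline. The thresholds below are illustrative assumptions, and in practice the inputs came from the Prometheus stack rather than function arguments:

```python
# Sketch of the gate checked before each traffic-weight increase.
# Thresholds are illustrative; real values would be tuned per service.

def safe_to_promote(error_rate, p99_ms,
                    baseline_error_rate, baseline_p99_ms,
                    max_error_ratio=1.5, max_latency_ratio=1.2):
    """Return True if the EKS deployment may take more traffic.

    Blocks promotion if the error rate grew more than 50% over the EC2
    baseline, or p99 latency grew more than 20%.
    """
    if baseline_error_rate > 0 and error_rate / baseline_error_rate > max_error_ratio:
        return False
    if p99_ms / baseline_p99_ms > max_latency_ratio:
        return False
    return True

# Healthy: errors flat, p99 within 5% of baseline -> promote.
print(safe_to_promote(0.002, 240, 0.002, 230))
# Degraded: errors up 5x on the new deployment -> hold the weight.
print(safe_to_promote(0.010, 240, 0.002, 230))
```

Ratios against a live EC2 baseline, rather than absolute thresholds, keep the gate meaningful during traffic spikes that affect both deployments equally.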
Ready to put this into production?
Our engineers have deployed these architectures across 100+ client engagements — from AWS migrations to Kubernetes clusters to AI infrastructure. We turn complex cloud challenges into measurable outcomes.