Migrating 40 Microservices to EKS in 8 Weeks Without Downtime
The Situation
A fintech company had 40 microservices running on EC2 instances managed by a patchwork of Ansible playbooks and shell scripts. Their CTO had committed to investors that the platform would be fully containerised on Kubernetes before the next funding round closed — 10 weeks away. Two weeks of internal prep had been done. They needed the remaining 8 weeks executed.
40 services migrated · 8 weeks total · 0 min downtime · 6 teams involved
The Migration Strategy
The critical architectural decision was a traffic-based cutover using weighted DNS routing rather than a hard switch. For each service:
- deploy the service to Kubernetes
- use Route 53 weighted routing to shift traffic 5% → 25% → 50% → 100% over 48 hours
- monitor error rates and p99 latency at each step
- decommission EC2 only after 72 hours of stable production traffic at 100%
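The weight-shifting step can be sketched as a Route 53 change batch. This is a minimal illustration, not the client's actual tooling: the record names and values are hypothetical, and the hosted-zone submission is shown only as a comment.

```python
# Sketch of one weighted-DNS cutover step. Assumes two weighted records for
# the service (names below are hypothetical): one pointing at the EC2 fleet,
# one at the EKS ingress.

def weighted_cutover_batch(record_name, ec2_value, eks_value, eks_weight):
    """Build a Route 53 change batch shifting eks_weight% of traffic to EKS.

    Route 53 weights are relative, so keeping the two weights summing to 100
    lets them read directly as percentages.
    """
    def upsert(set_id, value, weight):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_id,
                "Weight": weight,
                "TTL": 60,  # short TTL so weight changes take effect quickly
                "ResourceRecords": [{"Value": value}],
            },
        }

    return {
        "Comment": f"Shift {eks_weight}% of traffic to EKS",
        "Changes": [
            upsert("ec2", ec2_value, 100 - eks_weight),
            upsert("eks", eks_value, eks_weight),
        ],
    }

# The 5 -> 25 -> 50 -> 100 ramp from the migration plan:
for step in (5, 25, 50, 100):
    batch = weighted_cutover_batch(
        "payments.example.internal.",       # hypothetical service record
        "payments-ec2.example.internal.",
        "payments-eks.example.internal.",
        step,
    )
    # In practice each batch would be submitted via
    # boto3.client("route53").change_resource_record_sets(
    #     HostedZoneId="Z...", ChangeBatch=batch)
```

Keeping both records in place at every step is what makes rollback cheap: reverting is just another weight change, not a redeploy.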
Week-by-Week Execution
Weeks 1–2: EKS cluster, VPC CNI networking, security groups, observability stack (Prometheus, Grafana, Loki), and a reference Helm chart all teams would use as a starting point. The shared chart accelerated adoption dramatically.
Weeks 3–5: Three migration waves of 12–15 services each, ordered by risk, stateless services first. Waves overlapped: each new wave began while the previous wave's services stabilised in production under the weighted cutover.
Weeks 6–7: Stateful services — Redis clusters and two Postgres-backed services using Kubernetes operators. Highest-risk work, most conservative cutover timeline.
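The wave ordering above can be sketched as a simple risk sort. Service names and risk scores here are purely illustrative, not the client's inventory:

```python
# Illustrative sketch of grouping services into migration waves by risk:
# stateless services first, stateful services in the final waves.

def plan_waves(services, wave_size):
    """Sort services lowest-risk first, then chunk them into waves."""
    ranked = sorted(services, key=lambda s: s["risk"])
    return [ranked[i:i + wave_size] for i in range(0, len(ranked), wave_size)]

# Hypothetical inventory; risk 0 = stateless, higher = more state to move.
SERVICES = [
    {"name": "pricing-api", "risk": 0},    # stateless HTTP service
    {"name": "notifications", "risk": 0},  # stateless worker
    {"name": "session-cache", "risk": 1},  # Redis-backed
    {"name": "ledger", "risk": 2},         # Postgres-backed, stateful
]

waves = plan_waves(SERVICES, wave_size=2)
# Stateless services land in the earliest wave; the Postgres-backed
# ledger ends up in the last, most conservative wave.
```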
Week 8: EC2 decommission, cost validation (32% reduction from eliminated EC2 overhead), runbook handover, and team training.
What Actually Went Wrong
Resource requests and limits were missing on every service; nothing in the EC2 world had ever required them. Three services caused node pressure events in week 3 before we enforced LimitRange defaults cluster-wide. Enforcing those defaults is now the first thing we do on every engagement.
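A LimitRange default of this kind can be sketched as below. The CPU and memory values are illustrative placeholders, not the figures used on this engagement:

```python
# Sketch of a cluster-wide LimitRange giving every container default
# requests and limits. Values are illustrative, not the client's numbers.

def default_limit_range(namespace, name="container-defaults"):
    """Build a LimitRange manifest for one namespace.

    `defaultRequest` fills in missing resource requests so the scheduler can
    bin-pack correctly; `default` fills in missing limits so a single
    runaway container cannot starve its node.
    """
    return {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "limits": [{
                "type": "Container",
                "defaultRequest": {"cpu": "100m", "memory": "128Mi"},
                "default": {"cpu": "500m", "memory": "512Mi"},
            }],
        },
    }

# One manifest per service namespace ("payments" is a hypothetical example);
# serialised to YAML, this is what `kubectl apply -f` would consume.
manifest = default_limit_range("payments")
```

A LimitRange only fills in values a pod spec omits, so teams that do set explicit requests and limits are unaffected.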
Key Learnings
- Observability infrastructure must come before migration, not after — you cannot debug a migration blind
- A shared Helm chart template accelerates team adoption; services that deviated from it caused the most friction
- Resource requests and limits are the most common gap when migrating from VM-based infrastructure — enforce them cluster-wide from day one
- Weighted DNS cutover is far safer than a hard switch; real production traffic at 5% surfaces issues that staging never reveals
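The promotion decision behind the weighted cutover can be sketched as a simple gate comparing the EKS deployment against its EC2 baseline. The thresholds below are illustrative assumptions, and in practice the inputs came from the Prometheus stack rather than function arguments:

```python
# Sketch of the gate checked before each traffic-weight increase.
# Thresholds are illustrative; real values would be tuned per service.

def safe_to_promote(error_rate, p99_ms,
                    baseline_error_rate, baseline_p99_ms,
                    max_error_ratio=1.5, max_latency_ratio=1.2):
    """Return True if the EKS deployment may take more traffic.

    Blocks promotion if the error rate grew more than 50% over the EC2
    baseline, or p99 latency grew more than 20%.
    """
    if baseline_error_rate > 0 and error_rate / baseline_error_rate > max_error_ratio:
        return False
    if p99_ms / baseline_p99_ms > max_latency_ratio:
        return False
    return True

# Healthy: errors flat, p99 within 5% of baseline -> promote.
print(safe_to_promote(0.002, 240, 0.002, 230))
# Degraded: errors up 5x on the new deployment -> hold the weight.
print(safe_to_promote(0.010, 240, 0.002, 230))
```

Ratios against a live EC2 baseline, rather than absolute thresholds, keep the gate meaningful during traffic spikes that affect both deployments equally.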
Ready to put this into production?
Our engineers have deployed these architectures across 100+ client engagements — from AWS migrations to Kubernetes clusters to AI infrastructure. We turn complex cloud challenges into measurable outcomes.