The AWS VPC design we use for every new client
Of all the decisions you make when setting up a new AWS account, the VPC design is the one that punishes you hardest if you get it wrong. Security groups can be tightened later. IAM policies can be rewritten. But your CIDR block allocation, your subnet layout, your routing tables -- those are the bones of the network, and restructuring bones means downtime.
This is the pattern we start with on every engagement -- from single-cluster startups to multi-account enterprises -- and the reasoning behind each decision.
The three-tier subnet layout
Every VPC we build has three tiers of subnets, each with a distinct purpose and a distinct level of internet access:
- Public subnets -- NAT gateways and load balancers only. Nothing else gets a public IP. No EC2 instances, no EKS nodes, no databases.
- Private subnets -- Where compute lives. EKS nodes, ECS tasks, application servers. These route outbound through NAT (for image pulls, external APIs) but have no public IPs. Inbound traffic arrives only through a load balancer in the public tier.
- Intra subnets -- The most locked-down tier. RDS, ElastiCache, internal services with no route to a NAT gateway. Zero outbound internet. If a database is compromised, it cannot phone home. This is the tier most teams skip, and it matters the most.
This separation is enforced at the route table level. Each tier gets its own route table. The intra route table has one entry: the local VPC CIDR. There is no path out.
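As a sketch, here's how the three tiers can be expressed with the community terraform-aws-modules/vpc module (our choice of module and the `prod` name are assumptions; the same layout can be built from raw `aws_subnet` and `aws_route_table` resources):

```hcl
# Sketch: three-tier layout via the community terraform-aws-modules/vpc
# module. The module gives each tier its own route table(s); intra
# subnets get only the local VPC route, so there is no path out.
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "prod" # hypothetical name
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  public_subnets  = ["10.0.0.0/20", "10.0.16.0/20", "10.0.32.0/20"]
  private_subnets = ["10.0.48.0/20", "10.0.64.0/20", "10.0.80.0/20"]
  intra_subnets   = ["10.0.96.0/20", "10.0.112.0/20", "10.0.128.0/20"]
}
```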
CIDR planning: give yourself room
We allocate a /16 for the VPC -- 65,536 addresses. That sounds generous until you're running 200 EKS pods across three AZs and each pod consumes a subnet IP (as they do with the AWS VPC CNI plugin).
Within the /16, we carve /20 subnets. Each /20 spans 4,096 addresses, of which 4,091 are usable (AWS reserves five per subnet). Here's the layout for a VPC with CIDR 10.0.0.0/16:
| Subnet | AZ | CIDR | Usable IPs |
|---|---|---|---|
| Public | AZ-1 | 10.0.0.0/20 | 4,091 |
| Public | AZ-2 | 10.0.16.0/20 | 4,091 |
| Public | AZ-3 | 10.0.32.0/20 | 4,091 |
| Private | AZ-1 | 10.0.48.0/20 | 4,091 |
| Private | AZ-2 | 10.0.64.0/20 | 4,091 |
| Private | AZ-3 | 10.0.80.0/20 | 4,091 |
| Intra | AZ-1 | 10.0.96.0/20 | 4,091 |
| Intra | AZ-2 | 10.0.112.0/20 | 4,091 |
| Intra | AZ-3 | 10.0.128.0/20 | 4,091 |
Nine /20 blocks used out of a possible sixteen. The remaining seven (10.0.144.0/20 through 10.0.240.0/20) are free for future subnet tiers, VPN client ranges, or workload clusters. Plan for what you don't need yet -- adding secondary CIDRs later works but introduces routing complexity you'd rather avoid.
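Rather than hardcoding nine CIDRs, the carving can be computed with Terraform's `cidrsubnet()` function (a sketch; the netnum assignments match the table above):

```hcl
locals {
  vpc_cidr = "10.0.0.0/16"

  # cidrsubnet(prefix, newbits, netnum): 16 + 4 new bits = a /20.
  # Sixteen possible netnums (0-15); we use 0-8 and keep 9-15 in reserve.
  public_subnets  = [for i in range(3) : cidrsubnet(local.vpc_cidr, 4, i)]     # 10.0.0.0/20 ...
  private_subnets = [for i in range(3) : cidrsubnet(local.vpc_cidr, 4, i + 3)] # 10.0.48.0/20 ...
  intra_subnets   = [for i in range(3) : cidrsubnet(local.vpc_cidr, 4, i + 6)] # 10.0.96.0/20 ...
}
```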
Three availability zones, always
We deploy across three AZs even when two feels sufficient. The cost difference is near zero -- you pay for resources, not subnets -- and the resilience difference is significant. Lose one of two AZs and you've lost 50% capacity. Lose one of three and you've lost 33%.
More importantly, AWS services like RDS Multi-AZ, ElastiCache, and the EKS control plane distribute across three AZs internally. If your subnets only span two, you're leaving resilience on the table and may hit placement constraints during scaling events. There is no good reason to use two AZs in any region that offers three.
NAT gateway strategy
Each NAT gateway costs roughly $32/month plus $0.045/GB processed. In production, we deploy one NAT gateway per AZ. If the NAT in AZ-1 fails, private subnets in AZ-1 lose outbound access rather than failing over cross-AZ -- which would introduce latency and data transfer charges.
In dev and staging, we run a single NAT gateway. If it goes down, all private subnets lose outbound access. Acceptable for non-production, and it saves roughly $64/month.
```hcl
# Production: one NAT per AZ
nat_gateway_count  = 3
single_nat_gateway = false

# Dev/Staging: single NAT
nat_gateway_count  = 1
single_nat_gateway = true
```
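If you use the community terraform-aws-modules/vpc module rather than your own variables (an assumption on our part), the same two profiles map onto its built-in flags:

```hcl
# Production profile (terraform-aws-modules/vpc flags):
# one NAT gateway created in each AZ.
enable_nat_gateway     = true
single_nat_gateway     = false
one_nat_gateway_per_az = true

# Dev/Staging profile: one shared NAT gateway for the whole VPC.
# enable_nat_gateway = true
# single_nat_gateway = true
```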
VPC endpoints: cut your NAT bill in half
Without VPC endpoints, every AWS API call from a private subnet -- pulling ECR images, writing CloudWatch logs, assuming IAM roles via STS -- routes through NAT. You pay $0.045/GB on all of it. We deploy four endpoints as a baseline:
- S3 (Gateway endpoint) -- Free. Terraform state, application assets, log archival all stay on the AWS backbone.
- ECR (`ecr.api` + `ecr.dkr`) (Interface endpoints) -- Every container image pull bypasses NAT. On a busy cluster, this is the biggest single savings.
- CloudWatch Logs (Interface endpoint) -- The second-biggest source of NAT traffic if you're shipping logs from every pod.
- STS (Interface endpoint) -- Every IRSA token exchange hits STS. Low volume, high frequency -- removes NAT dependency for IAM operations.
Interface endpoints cost about $7.20/month each per AZ -- roughly $86/month for four across three AZs. On a busy cluster, the NAT savings cover that in the first week. We've seen NAT bills drop from $200/month to under $40 after adding endpoints.
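The baseline can be sketched with the vpc-endpoints submodule of terraform-aws-modules/vpc (an assumption; raw `aws_vpc_endpoint` resources work just as well). Interface endpoints also need a security group allowing HTTPS (443) from the VPC CIDR, omitted here for brevity:

```hcl
# Sketch: the four baseline endpoints. Gateway endpoints attach to route
# tables and are free; interface endpoints create ENIs in the private
# subnets and bill per AZ-hour.
module "vpc_endpoints" {
  source  = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints"
  version = "~> 5.0"

  vpc_id = module.vpc.vpc_id

  endpoints = {
    s3 = {
      service      = "s3"
      service_type = "Gateway"
      route_table_ids = concat(
        module.vpc.private_route_table_ids,
        module.vpc.intra_route_table_ids,
      )
    }
    ecr_api = {
      service             = "ecr.api"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
    }
    ecr_dkr = {
      service             = "ecr.dkr"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
    }
    logs = {
      service             = "logs"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
    }
    sts = {
      service             = "sts"
      private_dns_enabled = true
      subnet_ids          = module.vpc.private_subnets
    }
  }
}
```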
Where it gets interesting
Everything above is the foundation -- a single VPC in a single account. The real complexity starts when you add VPC peering for shared services, Transit Gateway for multi-account connectivity, PrivateLink for cross-account service exposure, or hub-and-spoke topologies with centralized egress.
But those decisions are meaningless if the underlying VPC is poorly designed. You can't peer a VPC whose CIDR overlaps with another account. You can't attach Transit Gateway to subnets with no room for ENIs. You can't retrofit an intra tier onto a two-layer VPC without rebuilding.
Get the foundation right and everything on top is easier. Get it wrong and you'll spend a weekend migrating databases between subnets -- or worse, rebuilding the VPC from scratch.
If you're planning a new AWS environment or inheriting one that feels brittle, we'd be happy to take a look. The VPC review is usually the first thing we do.
Vishwaraja Pathi
Cloud & DevOps specialist with 13+ years of experience. Founder of Adiyogi Technologies. Previously at Roku, Rocket Lawyer, and BetterPlace.