The EKS production checklist we use for every client
Every EKS cluster we set up — whether it's a cybersecurity startup running three environments or a streaming platform handling millions of daily requests — starts with the same checklist. Not because every cluster is identical, but because the foundational decisions are always the same, and getting them wrong early is expensive to fix later.
This is the checklist we've refined over years of running production Kubernetes on AWS. It's opinionated. It skips the basics that every tutorial covers and focuses on the things we've seen teams get wrong — or skip entirely — and regret six months later.
1. Network architecture: private by default
The single most impactful decision you'll make is whether your EKS API endpoint is public or private. Our default: private endpoint only.
resource "aws_eks_cluster" "main" {
name = "${var.project}-${var.environment}"
version = "1.29"
role_arn = aws_iam_role.cluster.arn
vpc_config {
subnet_ids = var.private_subnet_ids
endpoint_private_access = true
endpoint_public_access = false
security_group_ids = [aws_security_group.cluster.id]
}
}
Public endpoints mean your Kubernetes API is accessible from the internet. Even with RBAC and authentication, that's unnecessary attack surface. If your developers need access, set up a VPN or use AWS SSM Session Manager to proxy through a bastion. We typically use a WireGuard VPN on a t3.micro — costs under $12/month and locks down the entire cluster.
Subnet placement matters too. Worker nodes go in private subnets. Always. The only things in public subnets should be NAT gateways and load balancers. We use a three-AZ layout with dedicated subnet tiers (a Terraform sketch of the private tier follows the list):
- Public subnets — NAT gateways, ALB/NLB
- Private subnets — EKS nodes, application pods
- Intra subnets — RDS, ElastiCache, internal services (no NAT route)
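As a minimal sketch of the private tier, assuming an aws_vpc.main resource and an illustrative AZ list. The EKS-specific detail is the discovery tag, which the AWS Load Balancer Controller uses to find subnets for internal load balancers; public subnets get kubernetes.io/role/elb instead:

locals {
  azs = ["us-east-1a", "us-east-1b", "us-east-1c"] # illustrative AZs
}

resource "aws_subnet" "private" {
  count             = length(local.azs)
  vpc_id            = aws_vpc.main.id
  availability_zone = local.azs[count.index]
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 4, count.index)

  tags = {
    Tier = "private"
    # Discovery tag for internal load balancers; public subnets
    # carry "kubernetes.io/role/elb" = "1" instead.
    "kubernetes.io/role/internal-elb" = "1"
  }
}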
2. Node groups: managed, with Graviton
Use managed node groups. Self-managed node groups give you more control, but the operational overhead isn't worth it for 90% of workloads: you give up automatic AMI updates, automatic draining during upgrades, and EKS console integration.
Unless you have a specific reason not to, run Graviton (ARM) instances. We default to m7g or c7g instances depending on the workload profile. They're 20-30% cheaper than equivalent x86 instances and most container images support multi-arch now.
resource "aws_eks_node_group" "main" {
cluster_name = aws_eks_cluster.main.name
node_group_name = "${var.project}-${var.environment}-main"
node_role_arn = aws_iam_role.node.arn
subnet_ids = var.private_subnet_ids
instance_types = ["m7g.large"]
ami_type = "AL2023_ARM_64_STANDARD"
scaling_config {
desired_size = 2
min_size = 2
max_size = 10
}
update_config {
max_unavailable = 1
}
}
Set max_unavailable = 1 in the update config so a rolling node group upgrade never takes out more than one node at a time. Anything higher is too aggressive for small clusters.
3. IAM: IRSA everywhere, cluster roles locked down
This is where most teams take shortcuts they later regret. Every pod that talks to an AWS service should use IAM Roles for Service Accounts (IRSA), not node-level instance profiles.
Node-level IAM roles mean every pod on that node has the same AWS permissions. If your logging pod needs CloudWatch access and it's on the same node as your app pod, the app pod can also write to CloudWatch — and whatever else the node role allows. IRSA scopes credentials to the service account a pod runs as.
resource "aws_iam_role" "app_pod" {
name = "${var.project}-${var.environment}-app-pod"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = {
Federated = aws_iam_openid_connect_provider.eks.arn
}
Action = "sts:AssumeRoleWithWebIdentity"
Condition = {
StringEquals = {
"${replace(aws_eks_cluster.main.identity[0].oidc[0].issuer, "https://", "")}:sub" =
"system:serviceaccount:${var.namespace}:${var.service_account_name}"
}
}
}]
})
}
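The other half of the binding is the Kubernetes service account, which references the role through an annotation. A minimal example with placeholder names and account ID, matching the variables above:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app                 # var.service_account_name
  namespace: production     # var.namespace
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/myproject-production-app-pod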
For the cluster IAM role itself, keep it minimal: the AWS-managed AmazonEKSClusterPolicy is all it needs. Don't attach extra policies "just in case."
4. Access management: ditch aws-auth, use access entries
If you're still managing cluster access through the aws-auth ConfigMap, stop. AWS introduced EKS Access Entries and they're significantly better. The aws-auth ConfigMap is fragile — one bad edit and you can lock yourself out of the cluster entirely.
resource "aws_eks_access_entry" "admin" {
cluster_name = aws_eks_cluster.main.name
principal_arn = "arn:aws:iam::${var.account_id}:role/AdminRole"
type = "STANDARD"
}
resource "aws_eks_access_policy_association" "admin" {
cluster_name = aws_eks_cluster.main.name
principal_arn = "arn:aws:iam::${var.account_id}:role/AdminRole"
policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"
access_scope {
type = "cluster"
}
}
Access entries are managed through the AWS API, which means they're Terraform-friendly, they can't be accidentally deleted by a kubectl apply, and you get proper IAM audit trails in CloudTrail.
5. Secrets management: external, not Kubernetes native
Kubernetes Secrets are base64-encoded, not encrypted at rest (unless you set up envelope encryption with a KMS key). Even with KMS encryption enabled, the developer experience of managing secrets via kubectl is poor.
We use the AWS Secrets Manager CSI driver to mount secrets directly from AWS Secrets Manager or Parameter Store into pods. Secrets are fetched at pod startup and mounted as files — no environment variables, no base64 encoding, and you get automatic rotation through AWS.
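As a sketch of what the mapping looks like with the Secrets Store CSI driver and its AWS provider installed (the secret name below is a placeholder); pods then mount it through a csi volume referencing this class:

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: app-secrets
  namespace: production
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "myproject/production/db-password"  # Secrets Manager secret name
        objectType: "secretsmanager"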
At minimum, enable envelope encryption on the cluster:
encryption_config {
  provider {
    key_arn = aws_kms_key.eks.arn
  }
  resources = ["secrets"]
}
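The referenced key is a plain KMS key; a sketch with rotation enabled:

resource "aws_kms_key" "eks" {
  description             = "Envelope encryption for EKS secrets"
  enable_key_rotation     = true
  deletion_window_in_days = 30
}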
6. Logging and observability from day one
Don't wait until something breaks to set up logging. On every cluster, we deploy:
- Fluent Bit — ships container logs to CloudWatch Log Groups. Lightweight, runs as a DaemonSet, low resource footprint.
- Control plane logging — enable api, audit, authenticator, controllerManager, and scheduler logs. Yes, all of them. It costs a few dollars a month and saves you hours during incident response.
- Container Insights — gives you cluster-level metrics (CPU, memory, network by pod/node/namespace) without deploying Prometheus on day one.
enabled_cluster_log_types = [
  "api",
  "audit",
  "authenticator",
  "controllerManager",
  "scheduler"
]
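Control plane logs land in a CloudWatch log group named /aws/eks/&lt;cluster-name&gt;/cluster. Pre-creating that group in Terraform is a common way to set a retention policy instead of the never-delete default (the 90-day value here is illustrative):

resource "aws_cloudwatch_log_group" "eks_control_plane" {
  # EKS writes control plane logs to this exact group name.
  name              = "/aws/eks/${var.project}-${var.environment}/cluster"
  retention_in_days = 90
}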
The audit log alone has saved us multiple times — when a misconfigured deployment was deleting resources, or when we needed to trace exactly which IAM role made a specific API call.
7. Ingress: Kong over ALB Ingress Controller
We've written separately about this, but in short: the AWS Load Balancer Controller (the successor to the ALB Ingress Controller) works for simple HTTP routing. The moment you need rate limiting, request transformation, mTLS termination, or API key authentication, you'll hit its limits.
We run Kong Ingress Controller on every production cluster. It's deployed as a Helm chart, backed by a Network Load Balancer, and gives us a full-featured API gateway that we can configure through Kubernetes CRDs. One gateway, consistent config across all services.
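A sketch of that deployment as a Terraform helm_release, assuming the Helm provider is configured against the cluster; the chart values shown are illustrative, so check the Kong chart's documentation before relying on them:

resource "helm_release" "kong" {
  name             = "kong"
  repository       = "https://charts.konghq.com"
  chart            = "kong"
  namespace        = "kong"
  create_namespace = true

  values = [yamlencode({
    ingressController = { enabled = true }
    proxy = {
      type = "LoadBalancer"
      annotations = {
        # Ask for an NLB instead of a Classic ELB.
        "service.beta.kubernetes.io/aws-load-balancer-type" = "nlb"
      }
    }
  })]
}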
8. Resource quotas and limit ranges
Every namespace gets a ResourceQuota and a LimitRange. No exceptions. Without these, a single misbehaving deployment can consume all node resources and starve everything else.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      type: Container
The defaultRequest is what the scheduler uses for bin-packing decisions. If you don't set it, pods with no resource requests get scheduled anywhere and you lose all predictability in your cluster utilization.
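The companion ResourceQuota caps aggregate usage for the namespace. A minimal sketch; the numbers are placeholders to size per team:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: default-quota
  namespace: production
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"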
9. Network policies
By default, every pod in a Kubernetes cluster can talk to every other pod. That's a flat network with zero segmentation — fine for development, unacceptable in production.
We deploy a default-deny ingress policy in every namespace and then explicitly allow the traffic patterns that should exist. This is defense in depth: even if an attacker compromises one pod, lateral movement is restricted.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
Then add allow rules for specific service-to-service communication. Yes, it's more work upfront. But when you're running multiple tenants or handling sensitive data, it's non-negotiable.
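As an example of such an allow rule, letting only the frontend reach the backend on its service port (the labels and port are hypothetical):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080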
10. Cluster upgrades: plan for them from day one
EKS versions have a roughly 14-month support window. If you're not upgrading, you're accumulating security debt and eventually you'll hit forced deprecation.
Our approach:
- Upgrade dev first, let it soak for a week
- Upgrade staging, run integration tests
- Upgrade production with a maintenance window
- Always upgrade the control plane first, then node groups
- Check the Kubernetes deprecation guide for removed APIs before each upgrade
We've seen teams skip two minor versions and hit breaking API changes that required rewriting Helm charts. Stay current. One version behind at most.
11. Backup and disaster recovery
Your application data is in RDS or S3 — but your cluster state is also valuable. Namespace configs, RBAC bindings, CRDs, ConfigMaps. If you had to recreate the cluster from scratch, could you?
We use Velero for cluster state backups. It snapshots Kubernetes resources and (optionally) persistent volumes to S3. Scheduled daily, retained for 30 days. We've needed it exactly twice in four years, and both times it saved a full day of manual reconstruction.
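That cadence expressed as a Velero Schedule resource, assuming Velero is installed in the velero namespace (the 720h TTL is the 30-day retention):

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 3 * * *"   # 03:00 UTC daily
  template:
    ttl: 720h0m0s         # retain backups for 30 days
    includedNamespaces:
      - "*"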
12. Cost guardrails
The last item on the checklist, but one of the most important. EKS clusters accumulate cost in places you don't expect:
- NAT gateway data processing — $0.045/GB. If your pods pull large images or make heavy outbound API calls, this adds up fast. Use VPC endpoints for ECR, S3, and CloudWatch (see the Terraform sketch after this list).
- Idle node capacity — set up cluster autoscaler or Karpenter to scale down when load drops.
- CloudWatch log storage — set retention policies. Default is "never delete." We set 30 days for dev, 90 for staging, 365 for production.
- Load balancer proliferation — one ALB per Ingress resource adds up. Use a shared ingress controller (Kong or nginx) behind a single NLB.
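A sketch of the endpoints that usually pay for themselves first, assuming aws_vpc.main plus region and route table variables. Note that ECR needs the S3 gateway endpoint too, because image layers are served from S3:

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}

resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}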
We set up AWS Budgets and Cost Anomaly Detection on every account. A $5/day spike caught on day one is a quick fix. The same spike noticed at the end of the month is a $150 surprise on the invoice.
The full checklist
Here's the condensed version you can copy into your own process:
- Private API endpoint — disable public access, use VPN for kubectl
- Three-AZ subnet layout — public, private, and intra tiers
- Managed node groups — Graviton instances, rolling update config
- IRSA for every pod — no node-level IAM shortcuts
- Access entries over aws-auth — Terraform-managed, CloudTrail audited
- KMS envelope encryption for Kubernetes secrets
- All five control plane log types enabled
- Fluent Bit DaemonSet for application log shipping
- Kong or nginx ingress behind a single NLB
- ResourceQuotas + LimitRanges on every namespace
- Default-deny network policies with explicit allow rules
- VPC endpoints for ECR, S3, CloudWatch, STS
None of this is cutting-edge. That's the point. Production infrastructure should be boring. Every item on this list exists because we've either seen it missing and watched teams scramble, or we've skipped it ourselves and paid the price.
If you're setting up a new EKS cluster and want a second opinion on your architecture, reach out. We're happy to do a quick review.
Vishwaraja Pathi
Cloud & DevOps specialist with 13+ years of experience. Founder of Adiyogi Technologies. Previously at Roku, Rocket Lawyer, and BetterPlace.