
The EKS production checklist we use for every client

Vishwaraja Pathi · March 28, 2026 · 8 min read

Every EKS cluster we set up — whether it's a cybersecurity startup running three environments or a streaming platform handling millions of daily requests — starts with the same checklist. Not because every cluster is identical, but because the foundational decisions are always the same, and getting them wrong early is expensive to fix later.

This is the checklist we've refined over years of running production Kubernetes on AWS. It's opinionated. It skips the basics that every tutorial covers and focuses on the things we've seen teams get wrong — or skip entirely — and regret six months later.

1. Network architecture: private by default

The single most impactful decision you'll make is whether your EKS API endpoint is public or private. Our default: private endpoint only.

resource "aws_eks_cluster" "main" {
  name     = "${var.project}-${var.environment}"
  version  = "1.29"
  role_arn = aws_iam_role.cluster.arn

  vpc_config {
    subnet_ids              = var.private_subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = false
    security_group_ids      = [aws_security_group.cluster.id]
  }
}

Public endpoints mean your Kubernetes API is accessible from the internet. Even with RBAC and authentication, that's unnecessary attack surface. If your developers need access, set up a VPN or use AWS SSM Session Manager to proxy through a bastion. We typically use a WireGuard VPN on a t3.micro — costs under $12/month and locks down the entire cluster.
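As a sketch of the bastion route, SSM Session Manager can port-forward kubectl traffic to the private endpoint without opening any inbound ports. The instance ID and endpoint host below are placeholders, not values from a real cluster:

```shell
# Port-forward local 8443 to the private EKS API endpoint via a bastion
# instance managed by SSM. No SSH keys, no inbound security group rules.
aws ssm start-session \
  --target i-0123456789abcdef0 \
  --document-name AWS-StartPortForwardingSessionToRemoteHost \
  --parameters '{"host":["<private-eks-endpoint>"],"portNumber":["443"],"localPortNumber":["8443"]}'
```

Point your kubeconfig at localhost:8443 for the duration of the session.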

Subnet placement matters too. Worker nodes go in private subnets. Always. The only things in public subnets should be NAT gateways and load balancers. We use a three-AZ layout with dedicated subnet tiers:

  • Public subnets — NAT gateways, ALB/NLB
  • Private subnets — EKS nodes, application pods
  • Intra subnets — RDS, ElastiCache, internal services (no NAT route)
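One detail that bites teams later: EKS discovers subnets for load balancer placement through tags. A minimal sketch, with illustrative resource names and the other subnet arguments elided:

```hcl
# Public subnets: where EKS provisions internet-facing ALB/NLB.
resource "aws_subnet" "public" {
  # ... vpc_id, cidr_block, availability_zone ...
  tags = {
    "kubernetes.io/role/elb" = "1"
  }
}

# Private subnets: where internal load balancers land.
resource "aws_subnet" "private" {
  # ... vpc_id, cidr_block, availability_zone ...
  tags = {
    "kubernetes.io/role/internal-elb" = "1"
  }
}
```

Miss these tags and load balancer provisioning fails with errors that don't obviously point at subnet tagging.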

2. Node groups: managed, with Graviton

Use managed node groups. Self-managed node groups give you more control, but the operational overhead isn't worth it for 90% of workloads: with self-managed groups you lose automatic AMI updates, automatic draining during upgrades, and integration with the EKS console.

Unless you have a specific reason not to, run Graviton (ARM) instances. We default to m7g or c7g instances depending on the workload profile. They're 20-30% cheaper than equivalent x86 instances and most container images support multi-arch now.

resource "aws_eks_node_group" "main" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "${var.project}-${var.environment}-main"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids

  instance_types = ["m7g.large"]
  ami_type       = "AL2023_ARM_64_STANDARD"

  scaling_config {
    desired_size = 2
    min_size     = 2
    max_size     = 10
  }

  update_config {
    max_unavailable = 1
  }
}

Set max_unavailable = 1 in the update config so rolling node group upgrades never take down more than one node at a time. That's the safe choice for small clusters; larger clusters can raise it (or use max_unavailable_percentage) to speed up upgrades.

3. IAM: IRSA everywhere, cluster roles locked down

This is where most teams take shortcuts they later regret. Every pod that talks to an AWS service should use IAM Roles for Service Accounts (IRSA), not node-level instance profiles.

Node-level IAM roles mean every pod on that node has the same AWS permissions. If your logging pod needs CloudWatch access and it's on the same node as your app pod, the app pod can also write to CloudWatch — and whatever else the node role allows. IRSA gives you per-pod identity.

resource "aws_iam_role" "app_pod" {
  name = "${var.project}-${var.environment}-app-pod"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = aws_iam_openid_connect_provider.eks.arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          # HCL doesn't allow a line break after "=", so these stay on one line.
          # Scoping :sub to one service account and :aud to STS prevents any
          # other pod (or token audience) from assuming this role.
          "${replace(aws_eks_cluster.main.identity[0].oidc[0].issuer, "https://", "")}:sub" = "system:serviceaccount:${var.namespace}:${var.service_account_name}"
          "${replace(aws_eks_cluster.main.identity[0].oidc[0].issuer, "https://", "")}:aud" = "sts.amazonaws.com"
        }
      }
    }]
  })
}

For the cluster IAM role itself, keep it minimal. It needs only the AWS-managed AmazonEKSClusterPolicy attached. Don't attach extra policies "just in case."

4. Access management: ditch aws-auth, use access entries

If you're still managing cluster access through the aws-auth ConfigMap, stop. AWS introduced EKS Access Entries and they're significantly better. The aws-auth ConfigMap is fragile — one bad edit and you can lock yourself out of the cluster entirely.

resource "aws_eks_access_entry" "admin" {
  cluster_name  = aws_eks_cluster.main.name
  principal_arn = "arn:aws:iam::${var.account_id}:role/AdminRole"
  type          = "STANDARD"
}

resource "aws_eks_access_policy_association" "admin" {
  cluster_name  = aws_eks_cluster.main.name
  principal_arn = "arn:aws:iam::${var.account_id}:role/AdminRole"
  policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"

  access_scope {
    type = "cluster"
  }
}

Access entries are managed through the AWS API, which means they're Terraform-friendly, they can't be accidentally deleted by a kubectl apply, and you get proper IAM audit trails in CloudTrail.

5. Secrets management: external, not Kubernetes native

Kubernetes Secrets are base64-encoded, not encrypted at rest (unless you set up envelope encryption with a KMS key). Even with KMS encryption enabled, the developer experience of managing secrets via kubectl is poor.

We use the AWS Secrets Manager CSI driver to mount secrets directly from AWS Secrets Manager or Parameter Store into pods. Secrets are fetched at pod startup and mounted as files — no environment variables, no base64 encoding, and you get automatic rotation through AWS.
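A minimal SecretProviderClass sketch for the CSI driver. The secret name and namespace here are illustrative, not from a real setup:

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: app-secrets
  namespace: production
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "prod/app/database-url"  # hypothetical Secrets Manager entry
        objectType: "secretsmanager"
```

The pod references this via a csi volume using the secrets-store.csi.k8s.io driver, and the secret shows up as a file in the mounted path.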

At minimum, enable envelope encryption on the cluster. This block goes inside the aws_eks_cluster resource:

encryption_config {
  provider {
    key_arn = aws_kms_key.eks.arn
  }
  resources = ["secrets"]
}

6. Logging and observability from day one

Don't wait until something breaks to set up logging. On every cluster, we deploy:

  • Fluent Bit — ships container logs to CloudWatch Log Groups. Lightweight, runs as a DaemonSet, low resource footprint.
  • Control plane logging — enable api, audit, authenticator, controllerManager, and scheduler logs. Yes, all of them. It costs a few dollars a month and saves you hours during incident response.
  • Container Insights — gives you cluster-level metrics (CPU, memory, network by pod/node/namespace) without deploying Prometheus on day one.

In Terraform, control plane logging is a single argument on the aws_eks_cluster resource:

enabled_cluster_log_types = [
  "api",
  "audit",
  "authenticator",
  "controllerManager",
  "scheduler"
]

The audit log alone has saved us multiple times — when a misconfigured deployment was deleting resources, or when we needed to trace exactly which IAM role made a specific API call.

7. Ingress: Kong over ALB Ingress Controller

We've written separately about this, but in short: the AWS ALB Ingress Controller works for simple HTTP routing. The moment you need rate limiting, request transformation, mTLS termination, or API key authentication, you'll hit its limits.

We run Kong Ingress Controller on every production cluster. It's deployed as a Helm chart, backed by a Network Load Balancer, and gives us a full-featured API gateway that we can configure through Kubernetes CRDs. One gateway, consistent config across all services.
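As one example of what the ALB controller can't express, here's a sketch of rate limiting as a Kong CRD. The plugin name and config values are illustrative:

```yaml
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: rate-limit
  namespace: production
plugin: rate-limiting
config:
  minute: 60      # at most 60 requests per minute per client
  policy: local   # counters kept per Kong node, no shared datastore
```

Attach it to an Ingress or Service with the konghq.com/plugins: rate-limit annotation.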

8. Resource quotas and limit ranges

Every namespace gets a ResourceQuota and a LimitRange. No exceptions. Without these, a single misbehaving deployment can consume all node resources and starve everything else.

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      type: Container

The defaultRequest is what the scheduler uses for bin-packing decisions. If you don't set it, pods with no resource requests get scheduled anywhere and you lose all predictability in your cluster utilization.
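The LimitRange sets per-container defaults; the namespace-wide cap comes from the ResourceQuota. A sketch with illustrative numbers; size these to your actual node capacity:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "8"       # sum of all pod CPU requests in the namespace
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
```

Once both exist, a runaway deployment hits the quota wall instead of starving its neighbors.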

9. Network policies

By default, every pod in a Kubernetes cluster can talk to every other pod. That's a flat network with zero segmentation — fine for development, unacceptable in production.

We deploy a default-deny ingress policy in every namespace and then explicitly allow the traffic patterns that should exist. This is defense in depth: even if an attacker compromises one pod, lateral movement is restricted.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress

Then add allow rules for specific service-to-service communication. Yes, it's more work upfront. But when you're running multiple tenants or handling sensitive data, it's non-negotiable.
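An allow rule then looks like this. The app labels and port are hypothetical; match them to your own deployments:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api          # pods this policy protects
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend   # only frontend pods may connect
      ports:
        - protocol: TCP
          port: 8080
```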

10. Cluster upgrades: plan for them from day one

EKS versions have a roughly 14-month support window. If you're not upgrading, you're accumulating security debt and eventually you'll hit forced deprecation.

Our approach:

  • Upgrade dev first, let it soak for a week
  • Upgrade staging, run integration tests
  • Upgrade production with a maintenance window
  • Always upgrade the control plane first, then node groups
  • Check the Kubernetes deprecation guide for removed APIs before each upgrade
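One way to check for deprecated API usage before an upgrade: the API server publishes a counter metric for every call that hits a deprecated API. This requires kubectl access to the live cluster:

```shell
# Non-empty output means some client is still calling a deprecated API;
# the metric labels identify the group, version, and resource involved.
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
```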

We've seen teams skip two minor versions and hit breaking API changes that required rewriting Helm charts. Stay current. One version behind at most.

11. Backup and disaster recovery

Your application data is in RDS or S3 — but your cluster state is also valuable. Namespace configs, RBAC bindings, CRDs, ConfigMaps. If you had to recreate the cluster from scratch, could you?

We use Velero for cluster state backups. It snapshots Kubernetes resources and (optionally) persistent volumes to S3. Scheduled daily, retained for 30 days. We've needed it exactly twice in four years, and both times it saved a full day of manual reconstruction.
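The schedule itself is one CLI call. The schedule name is ours; adjust the cron expression to your own window:

```shell
# Daily backup at 03:00 UTC, retained for 30 days (720h).
velero schedule create daily-cluster-backup \
  --schedule="0 3 * * *" \
  --ttl 720h0m0s
```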

12. Cost guardrails

The last item on the checklist, but one of the most important. EKS clusters accumulate cost in places you don't expect:

  • NAT gateway data processing — $0.045/GB. If your pods pull large images or make heavy outbound API calls, this adds up fast. Use VPC endpoints for ECR, S3, and CloudWatch.
  • Idle node capacity — set up cluster autoscaler or Karpenter to scale down when load drops.
  • CloudWatch log storage — set retention policies. Default is "never delete." We set 30 days for dev, 90 for staging, 365 for production.
  • Load balancer proliferation — one ALB per Ingress resource adds up. Use a shared ingress controller (Kong or nginx) behind a single NLB.
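A sketch of the VPC endpoints from the first item, with illustrative variable and resource names. Note that ECR also needs a matching ecr.api interface endpoint alongside ecr.dkr:

```hcl
# Gateway endpoint: S3 traffic bypasses the NAT gateway, at no hourly cost.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}

# Interface endpoint: image layer pulls from ECR stay inside the VPC.
resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
```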

We set up AWS Budgets and Cost Anomaly Detection on every account. A $5/day spike caught on day one is a quick fix. The same spike noticed at the end of the month is a $150 surprise on the invoice.

The full checklist

Here's the condensed version you can copy into your own process:

  • Private API endpoint — disable public access, use VPN for kubectl
  • Three-AZ subnet layout — public, private, and intra tiers
  • Managed node groups — Graviton instances, rolling update config
  • IRSA for every pod — no node-level IAM shortcuts
  • Access entries over aws-auth — Terraform-managed, CloudTrail audited
  • KMS envelope encryption for Kubernetes secrets
  • All five control plane log types enabled
  • Fluent Bit DaemonSet for application log shipping
  • Kong or nginx ingress behind a single NLB
  • ResourceQuotas + LimitRanges on every namespace
  • Default-deny network policies with explicit allow rules
  • Staged upgrades — dev, then staging, then production; never more than one version behind
  • Velero scheduled backups to S3
  • VPC endpoints for ECR, S3, CloudWatch, STS
  • AWS Budgets and Cost Anomaly Detection on every account

None of this is cutting-edge. That's the point. Production infrastructure should be boring. Every item on this list exists because we've either seen it missing and watched teams scramble, or we've skipped it ourselves and paid the price.

If you're setting up a new EKS cluster and want a second opinion on your architecture, reach out. We're happy to do a quick review.


Vishwaraja Pathi

Cloud & DevOps specialist with 13+ years of experience. Founder of Adiyogi Technologies. Previously at Roku, Rocket Lawyer, and BetterPlace.
