IaC-First: Why I Never Touch the AWS Console in Production

“Never touch the AWS console in production” sounds like an extreme rule. It is not. It is the most important operational discipline in a cloud-native team, and the cost of violating it accumulates silently until it causes a major incident.

This post explains why, and how to enforce IaC-first development in a real team.

The State Drift Problem

Terraform (and OpenTofu) maintains a state file that records the infrastructure it manages. When you plan or apply, Terraform refreshes that state against the real resources, compares the result with your configuration, and makes the minimum set of changes to bring reality in line with the configuration.

When you click around in the AWS console and create or modify resources, you are changing reality without changing the state file.

Now one of three things happens. If you modified a resource that Terraform manages, the next plan shows drift: Terraform wants to revert your change, or destroy and recreate the resource, because the configuration still describes the old settings.

Or worse: you changed something Terraform does not track, or created a resource it knows nothing about. plan shows “no changes” even though production no longer matches what the code describes.

Or: someone adds a resource in the console, it works, gets depended upon by other resources, and then a later apply collides with it (“resource already exists”), or nobody can safely change or rebuild it because it exists in neither the configuration nor the state.

The console is not the source of truth. Your IaC is. The console is a lie.

The Import Workflow: The Correct Response to “It Already Exists”

When a resource exists in AWS but not in your Terraform configuration (created by someone in the console, created manually via CLI, or migrated from another tool), the correct action is to import it:

# Import an existing resource into Terraform state
./deploy.sh --infra --import aws_s3_bucket.service existing-bucket-name
./deploy.sh --infra --import aws_dynamodb_table.service_files serviceFiles
./deploy.sh --infra --import aws_lambda_function.service_api myapp-service-api

# After importing, run plan:
./deploy.sh --infra

The plan shows the diff between what exists in AWS and what your configuration says. Fix the configuration until the plan shows no unexpected changes. Then apply to bring the state file in sync.

Never delete a resource from the console and re-create it via Terraform. This destroys data and breaks dependencies. Import → reconcile → apply.
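The post does not show what --import does inside deploy.sh, so here is a minimal sketch of what such a function might look like. It assumes run_import simply wraps the standard import ADDRESS ID subcommand and then plans; TF_BIN and VAR_FILE are hypothetical variable names, not part of the original script.

```shell
#!/bin/bash
# Hypothetical sketch of run_import. TF_BIN and VAR_FILE are assumed names.
TF_BIN="${TF_BIN:-terraform}"                   # set to "tofu" for OpenTofu
VAR_FILE="${VAR_FILE:-vars/production.tfvars}"

run_import() {
    local address="${1:-}" id="${2:-}"
    if [ -z "$address" ] || [ -z "$id" ]; then
        echo "usage: run_import RESOURCE_ADDRESS RESOURCE_ID" >&2
        return 2
    fi
    # Import only writes to the state file; it does not change the resource in AWS.
    "$TF_BIN" import -var-file="$VAR_FILE" "$address" "$id" || return 1
    # Plan right away so the gap between config and reality is visible.
    "$TF_BIN" plan -var-file="$VAR_FILE"
}
```

Running the plan inside the same function is a design choice: it makes the reconcile step hard to forget, because the diff is on screen the moment the import succeeds.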

The Two Legitimate Console Uses

The console is appropriate for:

  • Read-only exploration. CloudWatch logs, CloudWatch metrics, Lambda invocation history, DynamoDB item inspection.
  • Emergency operations with immediate import. If production is down and the fix requires a console change, make the change, but document it and bring it into Terraform (import it or update the configuration) within the same working session.

That is it. Everything that creates, modifies, or deletes resources belongs in IaC.

The Deploy Script as the Single Entry Point

Raw terraform apply or tofu apply commands run directly on the host are dangerous because they bypass your team’s standard deploy workflow (validation, lint, plan review). Create a deploy script that wraps all Terraform operations:

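#!/bin/bash
# deploy.sh - the only way to interact with infrastructure
# run_apply, run_plan, run_validate, run_tflint, and run_import are
# shell functions defined earlier in this script (omitted here).
set -euo pipefail

case "${1:-}" in
    --infra)
        case "${2:-}" in
            --apply)    run_apply ;;
            --plan)     run_plan ;;
            --validate) run_validate ;;
            --tflint)   run_tflint ;;
            --import)   run_import "${3:?resource address required}" "${4:?resource ID required}" ;;
            *)          run_plan ;;  # default: plan only (safe)
        esac
        ;;
    *)
        echo "usage: $0 --infra [--apply|--plan|--validate|--tflint|--import ADDR ID]" >&2
        exit 1
        ;;
esac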

Key properties of a good deploy script:

  • Default to plan. Running ./deploy.sh --infra with no second argument should show the plan, never apply. Applying requires explicit intent (--apply).
  • Always validate before apply. Run terraform validate and tflint before any apply. Reject the deploy if either fails.
  • Always show the plan before applying. Even for --apply, show the plan output and require confirmation in interactive mode.
  • Consistent var-file. Always include the same var-file (vars/production.tfvars). Never apply without it.
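The four properties above can be sketched as a single run_apply function. This is one possible shape, not the script from my repository: TF_BIN, VAR_FILE, and PLAN_FILE are hypothetical names, and the sketch assumes tflint is installed and on the PATH.

```shell
#!/bin/bash
# Hypothetical sketch of run_apply. TF_BIN, VAR_FILE, PLAN_FILE are assumed names.
TF_BIN="${TF_BIN:-terraform}"                   # set to "tofu" for OpenTofu
VAR_FILE="${VAR_FILE:-vars/production.tfvars}"
PLAN_FILE="${PLAN_FILE:-deploy.tfplan}"

run_apply() {
    local answer
    # Gate 1: static checks. Any failure rejects the deploy.
    "$TF_BIN" validate || { echo "validate failed; aborting" >&2; return 1; }
    tflint || { echo "tflint failed; aborting" >&2; return 1; }

    # Gate 2: always plan with the same var-file and save the result.
    "$TF_BIN" plan -var-file="$VAR_FILE" -out="$PLAN_FILE" || return 1

    # Gate 3: when stdin is a terminal, require explicit confirmation.
    if [ -t 0 ]; then
        printf 'Apply this plan? Type "yes" to continue: '
        read -r answer
        [ "$answer" = "yes" ] || { echo "aborted" >&2; return 1; }
    fi

    # Apply exactly the plan that was shown, not a freshly computed one.
    "$TF_BIN" apply "$PLAN_FILE"
}
```

Saving the plan with -out and applying that file means what you confirmed is what runs, even if someone else changed something between the plan and the apply.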

Plan Review as a Gate

Before any infrastructure PR merges, the plan output should be visible in the PR:
## Terraform Plan

Terraform will perform the following actions:

  # module.service_api.aws_lambda_function.service_api will be updated in-place
  ~ resource "aws_lambda_function" "service_api" {
      ~ timeout = 30 -> 60
    }

Plan: 0 to add, 1 to change, 0 to destroy.

This accomplishes two things:

  1. The reviewer can see exactly what will change in production before approving.
  2. The plan output is a historical record of what was intended at merge time.

A plan that shows unexpected destroys (N to destroy) should block the PR until the author explains why.
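To make that gate mechanical rather than a matter of reviewer attention, CI can read the summary line of the plan. The sketch below is an assumption about how you might do it, not something from my pipeline: count_destroys and check_destroys are hypothetical helper names, and the parsing relies on the plain-text “N to destroy” summary, so run the plan with -no-color.

```shell
#!/bin/bash
# Hypothetical helpers: parse the "Plan: ..." summary line from plan text on stdin.

# Print the number of resources the plan would destroy (0 if there is no summary).
count_destroys() {
    local n
    n=$(sed -n '/Plan:/s/.*[^0-9]\([0-9][0-9]*\) to destroy.*/\1/p' | tail -n 1)
    echo "${n:-0}"
}

# Fail when the plan destroys more resources than the PR author declared.
# Usage: terraform plan -no-color ... | check_destroys ALLOWED_COUNT
check_destroys() {
    local allowed="${1:-0}" found
    found=$(count_destroys)
    if [ "$found" -gt "$allowed" ]; then
        echo "plan destroys $found resource(s), only $allowed expected" >&2
        return 1
    fi
}
```

With an allowed count of 0 as the default, a PR that really does need to destroy something has to say so explicitly, which is exactly the explanation the reviewer is asking for.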

The No-Exclude Rule

A common mistake when an apply fails: exclude the failing resource and apply anyway.

# WRONG
tofu apply -exclude='aws_bedrockagent_knowledge_base.kb'

This creates a partial apply: some resources were updated, some were not. The state file now reflects a partially-applied configuration. The excluded resource is out of sync with everything that depends on it.

The correct response to a failing resource:

  1. Let the apply finish. Do not interrupt it.
  2. Understand why the resource failed.
  3. Fix the configuration or import the existing resource.
  4. Run apply again until zero errors.

If the error is “resource already exists”: import it.
If the error is a config mismatch: fix the configuration to match the existing resource.
If the error is a dependency ordering problem: fix depends_on.

Never: exclude, comment out, or otherwise skip resources. The IaC configuration must always reflect reality.
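If the deploy script is the single entry point, it can also enforce this rule. A minimal sketch, assuming the script forwards extra arguments to tofu or terraform; reject_exclude is a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical guard: refuse any argument that requests a partial apply via -exclude.
reject_exclude() {
    local arg
    for arg in "$@"; do
        case "$arg" in
            -exclude|-exclude=*|--exclude|--exclude=*)
                echo "refusing to run: '$arg' would produce a partial apply" >&2
                return 1
                ;;
        esac
    done
    return 0
}
```

Call it at the top of the script (reject_exclude "$@" || exit 1) so the check runs before any plan or apply is attempted.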

The Forced Unlock

Terraform uses a lock table (DynamoDB) to prevent concurrent applies. If a lock is stuck (Lambda timed out mid-apply, process was killed), you will see:

Error: Error locking state: Error acquiring the state lock

The tempting fix: terraform force-unlock <LOCK_ID>. This is dangerous if another apply is actually running: force-unlocking a live apply lets two writers touch the state at once and can corrupt it.

The correct response:

  1. Check if any apply is actually running (check CloudWatch for the deploy Lambda, check with teammates).
  2. If no apply is running and you are certain the lock is stale, then force-unlock.
  3. Document that you forced a lock – it is a signal to investigate why the previous apply did not clean up properly.

Never force-unlock as a first response. Always verify there is no live apply first.
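A wrapper cannot verify that no apply is running, but it can slow you down enough to check, and it can leave the record that step 3 asks for. The sketch below is an assumption about how such a wrapper might look; safe_force_unlock, TF_BIN, and UNLOCK_LOG are hypothetical names.

```shell
#!/bin/bash
# Hypothetical wrapper around force-unlock. TF_BIN and UNLOCK_LOG are assumed names.
TF_BIN="${TF_BIN:-terraform}"                   # set to "tofu" for OpenTofu
UNLOCK_LOG="${UNLOCK_LOG:-force-unlock.log}"

safe_force_unlock() {
    local lock_id="${1:-}" answer
    if [ -z "$lock_id" ]; then
        echo "usage: safe_force_unlock LOCK_ID" >&2
        return 2
    fi
    echo "Before unlocking $lock_id, confirm that no apply is running:"
    echo "  - deploy logs checked"
    echo "  - teammates asked"
    printf 'Type the lock ID again to confirm the lock is stale: '
    read -r answer
    if [ "$answer" != "$lock_id" ]; then
        echo "confirmation did not match; lock left in place" >&2
        return 1
    fi
    # Leave a record so the stale lock gets investigated later.
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) force-unlock $lock_id by ${USER:-unknown}" >> "$UNLOCK_LOG"
    # -force skips the CLI's own prompt; the confirmation above replaces it.
    "$TF_BIN" force-unlock -force "$lock_id"
}
```

Retyping the lock ID is deliberate friction: it is hard to do on autopilot, and the ID comes from the error message of the failed run, so you have to read that message first.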

Drift Detection

Even with strict IaC discipline, drift accumulates. AWS may auto-modify resources (e.g., Cognito updating a Lambda trigger association), or a team member may make an emergency console change and forget to import it.

Schedule a regular plan run (weekly or after significant activity periods) with no apply:

# In CI/CD: weekly scheduled plan
./deploy.sh --infra --plan 2>&1 | tee plan-output.txt
# Alert if plan shows unexpected changes

If the plan shows changes that should not exist, investigate before the next apply reverts or destroys them.
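For the alerting part, the plan command's -detailed-exitcode flag is more reliable than grepping the output: it exits 0 when there is nothing to change, 2 when the plan contains changes, and 1 on error. A minimal sketch, assuming the scheduled job runs on a branch where everything has already been applied (so any change is drift); detect_drift, TF_BIN, VAR_FILE, and PLAN_OUTPUT are hypothetical names.

```shell
#!/bin/bash
# Hypothetical drift check. TF_BIN, VAR_FILE, PLAN_OUTPUT are assumed names.
TF_BIN="${TF_BIN:-terraform}"                   # set to "tofu" for OpenTofu
VAR_FILE="${VAR_FILE:-vars/production.tfvars}"
PLAN_OUTPUT="${PLAN_OUTPUT:-plan-output.txt}"

# Run a read-only plan and report the result through the exit status:
# 0 = in sync, 2 = drift detected, 1 = the plan itself failed.
detect_drift() {
    local rc=0
    "$TF_BIN" plan -detailed-exitcode -no-color -var-file="$VAR_FILE" \
        > "$PLAN_OUTPUT" 2>&1 || rc=$?
    case "$rc" in
        0) echo "no drift" ;;
        2) echo "DRIFT: review $PLAN_OUTPUT before the next apply" ;;
        *) echo "plan failed (exit $rc); see $PLAN_OUTPUT" >&2; rc=1 ;;
    esac
    return "$rc"
}
```

A CI job can then branch on the status: 2 sends the alert with the saved plan attached, 1 pages whoever owns the pipeline.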

Key Takeaways

  • The AWS console is for reading, not writing. Every infrastructure change belongs in IaC.
  • When a resource exists in AWS but not in Terraform: import it, reconcile the config, then apply.
  • Never use -exclude to work around a failing resource. Fix the root cause.
  • Wrap all Terraform operations in a deploy script. Default to plan; require explicit intent to apply.
  • Put the plan output in every infrastructure PR. Unexpected destroys should block the PR.
  • Never force-unlock without verifying no live apply is running.
  • Run scheduled plan checks to detect drift early.