
Infrastructure as Code Pattern: GitOps, Reusable Modules, and Policy Guardrails

Manage cloud infrastructure declaratively with reviewable diffs, drift control, and policy checks.

Abstract Algorithms · 13 min read

TLDR: Infrastructure as code is useful because it makes infrastructure changes reviewable, repeatable, and testable. It becomes production-grade only when module boundaries, state locking, GitOps flow, and policy checks are treated as operational controls rather than tooling preferences.

TLDR: The SRE question is not “do we use Terraform?” It is “can we explain how infra changes are planned, approved, applied, rolled back, and audited without relying on someone remembering a terminal command from last quarter?”

Operator note: Incident reviews usually show that the outage was not caused by “IaC” itself. It was caused by a manual hotfix that drifted from code, a module that owned too much blast radius, or a plan that passed review without policy gates for network, IAM, or deletion risk.

In 2017, a GitLab engineer accidentally deleted a production database with a single terminal command run on the wrong host. Recovery took 18 hours partly because the environment had been configured manually — there was no code to re-apply, no reviewed diff, and no reproducible restore path. If that environment had been managed with IaC, any engineer on the team could have reapplied the desired state from version control and validated the rebuild against policy checks.

If your team operates cloud infrastructure, IaC and GitOps are the difference between “we think it looks like this” and “we can prove it, reproduce it, and audit every change that got us here.”

Worked example — an OPA policy that blocks public S3 exposure before terraform apply runs:

package main

deny[msg] {
  r := input.resource_changes[_]
  r.type == "aws_s3_bucket_acl"
  r.change.after.acl == "public-read"
  msg := sprintf("BLOCKED: '%v' must not use public-read ACL", [r.address])
}

This rule runs against the Terraform plan in CI. A misconfiguration is caught before any human reviews the diff — apply is never reached.

📖 When IaC and GitOps Actually Help

IaC is strongest when infrastructure changes are frequent enough to need guardrails and consistent enough to encode declaratively.

Use this pattern when:

  • multiple environments must stay aligned,
  • teams need audited change history,
  • infra drift creates repeated surprises,
  • security and reliability rules should block unsafe changes before apply.

| Operational problem | Why IaC or GitOps helps |
| --- | --- |
| Manual network changes drift across environments | Desired state in code becomes the source of truth |
| Teams cannot review infrastructure risk before change windows | Plans and policy checks expose blast radius early |
| Emergency changes leave no paper trail | Pull requests and controller history make actions auditable |
| Platform standards vary across teams | Modules and policy-as-code standardize the baseline |

🔍 When Not to Over-Apply IaC

IaC needs judgement: encoding one-off decisions into broad shared modules can make change harder, not safer.

Avoid IaC when:

  • module boundaries are so broad that one change affects unrelated systems,
  • teams have not yet agreed on ownership of shared state or environments,
  • GitOps controllers would hide the operational intent more than clarify it,
  • emergency break-glass procedures do not exist for controller or state-store failure.

| Constraint | Better first move |
| --- | --- |
| Shared module is a monolith no one wants to touch | Split module boundaries first |
| Teams still hand-edit cloud resources daily | Start with drift visibility and one managed environment |
| Policy standards are not yet defined | Write the policy contract before automating enforcement |

⚙️ How IaC Works in Production

A practical flow is simple and disciplined:

  1. Engineers propose infra changes in version control.
  2. CI validates syntax, module inputs, and generated plan.
  3. Policy checks block unsafe changes such as public exposure, missing tags, or unrestricted IAM.
  4. Reviewers inspect the plan, not just the HCL/YAML diff.
  5. Apply happens through one controlled path.
  6. Drift detection and controller reconciliation reveal out-of-band changes.
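
Step 5, the single controlled apply path, can be sketched as a GitHub Actions job gated on a protected environment. This is a minimal illustration, not this post's actual pipeline; the job names, artifact name, and paths are assumptions:

```yaml
# Sketch: the only route to `terraform apply` is this job, which GitHub
# pauses until a configured reviewer approves the protected environment.
jobs:
  apply:
    runs-on: ubuntu-latest
    environment: production   # protected environment: requires manual approval
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4   # the plan file uploaded by the CI plan job
        with:
          name: tfplan
          path: infra/prod
      - run: terraform -chdir=infra/prod init -input=false
      - run: terraform -chdir=infra/prod apply -input=false tfplan
```

Applying the saved plan file, rather than re-planning at apply time, means reviewers approved exactly the change that runs.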

| Control point | What operators care about | Why it matters |
| --- | --- | --- |
| Module boundary | Blast radius of one change | Prevents “change one thing, surprise ten systems” |
| Remote state and locking | Safe concurrency during plan/apply | Prevents corrupt or racing state |
| Policy gates | Blocking dangerous changes pre-apply | Moves review from intuition to enforcement |
| Apply path | One approved execution path | Reduces shadow automation and manual drift |
| Drift detection | Visibility into manual change | Keeps code and reality converged |
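
The remote state and locking control point corresponds to a backend block like the following, a minimal sketch assuming AWS with invented bucket and table names:

```hcl
# Hypothetical remote-state backend: state in S3, a DynamoDB table for locking.
# One state file per service and environment keeps lock contention low.
terraform {
  backend "s3" {
    bucket         = "example-tf-state"                    # assumed bucket name
    key            = "payments-api/prod/terraform.tfstate" # narrow state scope
    region         = "us-east-1"
    dynamodb_table = "example-tf-locks"                    # blocks racing applies
    encrypt        = true
  }
}
```

Two engineers running plan or apply concurrently against the same key serialize on the DynamoDB lock instead of corrupting state.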

🧠 Deep Dive: What Incident Reviews Usually Reveal First

| Failure mode | Early symptom | Root cause | First mitigation |
| --- | --- | --- | --- |
| Plan looked small, impact was huge | One module change touched many unrelated resources | Module boundary was too broad | Refactor modules around ownership and lifecycle |
| Safe rollback was impossible | Reverting code did not restore runtime safely | Change affected stateful or destructive resource lifecycle | Define forward-fix and immutable replacement rules |
| Drift keeps reappearing | Controllers “fix” manual changes repeatedly | Break-glass changes bypassed normal path | Track manual changes and formalize exception workflow |
| Policy gates are ignored in emergencies | Teams disable checks to move faster | Policies were too noisy or not risk-ranked | Separate hard-blockers from advisory checks |
| State lock contention delays incidents | Emergency apply waits on stale lock or long plan | State layout is too coarse | Split state by ownership and environment |

Field note: one of the highest-signal metrics in mature IaC programs is not “apply succeeded.” It is “how often did someone bypass the path entirely?” Manual change count is often the best predictor of the next surprise.
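
One way to surface that signal continuously is a scheduled drift check: `terraform plan -detailed-exitcode` exits 2 when live state no longer matches code. A hedged sketch, with schedule and paths as assumptions:

```yaml
name: drift-check
on:
  schedule:
    - cron: "0 */6 * * *"   # every six hours
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: terraform -chdir=infra/prod init -input=false
      # -detailed-exitcode: 0 = clean, 2 = drift (plan has changes), 1 = error
      - id: plan
        run: terraform -chdir=infra/prod plan -detailed-exitcode -input=false
        continue-on-error: true
      - if: steps.plan.outcome == 'failure'
        run: echo "::error::infra/prod has drifted from code (or the plan errored)"
```

Routing that error annotation to the owning team turns manual change count from folklore into a measured signal.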

Internals: How Shopify, Datadog, and HashiCorp Build IaC Programs

Shopify manages Terraform at scale by coupling module ownership to team boundaries. Each platform team owns a versioned module set published via a private Terraform registry. Any change to a shared "golden module" triggers CI on downstream consumers automatically — not just the module's own tests. This surfaces blast-radius early, before engineers open a plan.

Datadog's GitOps pipeline uses a controller pattern similar to Argo CD: infrastructure desired state lives in Git, and a reconciliation controller continuously compares live cloud state against that spec. Drift alerts fire within minutes of a manual change, not days later during the next deployment cycle. Teams see exactly which resource drifted and who owns it.

HashiCorp's own policy-as-code practice treats OPA (Open Policy Agent) rules as first-class engineering artifacts: policies are version-controlled, CI-tested against known-bad plan fixtures, and reviewed in PRs just like application code. A policy that advisory-warns on staging but hard-blocks on production is a common and effective pattern — it teaches teams the rule before enforcing it.
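
That staged pattern is straightforward to express in Rego, since Conftest treats `warn` rules as advisories and `deny` rules as failures. A sketch in which the owner-tag rule and the `environment` plan variable are assumptions:

```rego
package main

# Shared condition: an S3 bucket with no owner tag in the planned state
missing_owner_tag(r) {
  r.type == "aws_s3_bucket"
  not r.change.after.tags.owner
}

# Advisory everywhere: surfaces in CI output but does not fail the job
warn[msg] {
  r := input.resource_changes[_]
  missing_owner_tag(r)
  msg := sprintf("ADVISORY: '%v' has no owner tag", [r.address])
}

# Hard block only when the plan's environment variable says prod
deny[msg] {
  input.variables.environment.value == "prod"
  r := input.resource_changes[_]
  missing_owner_tag(r)
  msg := sprintf("BLOCKED: prod resource '%v' has no owner tag", [r.address])
}
```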

Failure scenario: at a fintech running Terraform with a broad shared networking module, an engineer added a security group rule and the generated plan silently touched 12 RDS instances. A downstream firewall change then blocked database traffic for one environment — the issue was invisible in the source diff. An OPA rule checking security group modifications would have caught it before apply. The fix: split the module and add a hard-blocking policy.
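
For scenarios like this one, a complementary guard is a blast-radius cap on the plan as a whole rather than on any single resource type. A sketch with an invented threshold:

```rego
package main

max_changed_resources := 25   # assumed team threshold, tune per module

# Every resource the plan actually touches (anything that is not a no-op)
changed_resources := [r | r := input.resource_changes[_]; r.change.actions != ["no-op"]]

deny[msg] {
  count(changed_resources) > max_changed_resources
  msg := sprintf("BLOCKED: plan touches %v resources (limit %v); split the change", [count(changed_resources), max_changed_resources])
}
```

A 12-resource surprise from a one-line diff fails loudly instead of hiding in a long plan.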

Performance and Operational Metrics That Signal IaC Health

IaC programs don't have p99 latency in the traditional sense, but they do have measurable signals that predict the next surprise:

| Metric | Why it matters | Threshold to investigate |
| --- | --- | --- |
| Manual change count per week | Predicts hidden divergence; bypasses are the leading indicator of the next outage | > 3 manual changes outside the normal path |
| Drift events per environment | Shows how often live state diverges from declared state | Any drift outside a maintenance window |
| Failed policy checks (hard-block) | Shows which guardrails are catching real risk | Same rule blocking repeatedly → fix the pattern, not the check |
| Apply duration (p95) | Oversized modules make applies slow and risky | > 10 minutes suggests a module split is overdue |
| State lock wait time during incidents | Reveals team coupling and emergency friction | > 2 minutes → split state by ownership |
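
The blast-radius side of these metrics can be computed straight from the plan JSON. A sketch using jq against an inline sample plan; in practice the input comes from `terraform show -json tfplan`:

```shell
# Inline sample plan JSON; real input comes from `terraform show -json tfplan`
cat <<'EOF' > tfplan.json
{"resource_changes": [
  {"address": "aws_s3_bucket.logs",       "change": {"actions": ["no-op"]}},
  {"address": "aws_security_group.api",   "change": {"actions": ["update"]}},
  {"address": "aws_db_instance.payments", "change": {"actions": ["delete", "create"]}}
]}
EOF

# Blast-radius size: resources the plan actually touches (non-no-op)
jq '[.resource_changes[] | select(.change.actions != ["no-op"])] | length' tfplan.json
# prints 2
```

Tracking this number per PR makes the "apply duration" and "module split" thresholds above concrete rather than anecdotal.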

📊 IaC and GitOps Control Flow

flowchart TD
    A[Pull request with infra change] --> B[Validate syntax and module inputs]
    B --> C[Generate plan or desired-state diff]
    C --> D[Run policy checks]
    D --> E{Checks pass?}
    E -->|No| F[Reject change before apply]
    E -->|Yes| G[Human review of blast radius]
    G --> H[Controlled apply or GitOps merge]
    H --> I[Runtime reconciliation]
    I --> J[Drift detection and alerts]

🧪 Concrete Config Example: CI Gate for Terraform and Policy Checks

name: infra-plan
on:
  pull_request:
    paths:
      - infra/**

jobs:
  plan:
    runs-on: ubuntu-latest
    # Note: init and plan need cloud credentials (e.g. via OIDC) configured for the repo.
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false   # wrapper would pollute the JSON plan output below
      - name: Install conftest
        run: |
          # Release version is illustrative; pin whichever your team has vetted.
          curl -sSL https://github.com/open-policy-agent/conftest/releases/download/v0.45.0/conftest_0.45.0_Linux_x86_64.tar.gz | tar -xz conftest
          sudo mv conftest /usr/local/bin/
      - run: terraform -chdir=infra/prod init -input=false
      - run: terraform -chdir=infra/prod validate
      - run: terraform -chdir=infra/prod plan -out=tfplan -input=false
      - run: terraform -chdir=infra/prod show -json tfplan > tfplan.json
      - run: conftest test tfplan.json --policy policy/

An example OPA (Rego) policy that blocks public S3 exposure and unrestricted security group rules:

# policy/deny_public_exposure.rego
package main

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_s3_bucket_acl"
  resource.change.after.acl == "public-read"
  msg := sprintf("BLOCKED: S3 bucket '%v' must not use public-read ACL", [resource.address])
}

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_security_group_rule"
  resource.change.after.cidr_blocks[_] == "0.0.0.0/0"
  resource.change.after.from_port == 0
  msg := sprintf("BLOCKED: '%v' opens all ports to the internet", [resource.address])
}

Conftest evaluates this against the Terraform JSON plan before any human review happens. A denial fails the CI job and blocks the PR before apply is ever reached. The policy lives in policy/ alongside the Terraform code — same PR review, same audit trail, same version history.

Why this matters:

  • the plan is produced in CI, not on one engineer’s laptop,
  • policy checks inspect the generated plan, not just source text,
  • reviewers can reason about blast radius before apply is allowed.
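
The policies themselves deserve tests in the same spirit as application code: `opa test policy/` runs Rego test rules against inline plan fixtures alongside the policy files. A sketch with invented fixture addresses:

```rego
# policy/deny_public_exposure_test.rego
package main

test_public_acl_is_denied {
  deny[_] with input as {"resource_changes": [{
    "address": "aws_s3_bucket_acl.bad",
    "type": "aws_s3_bucket_acl",
    "change": {"after": {"acl": "public-read"}}
  }]}
}

test_private_acl_is_allowed {
  count(deny) == 0 with input as {"resource_changes": [{
    "address": "aws_s3_bucket_acl.ok",
    "type": "aws_s3_bucket_acl",
    "change": {"after": {"acl": "private"}}
  }]}
}
```

Known-bad fixtures like these catch a silently weakened policy before it reaches a real plan.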

🌍 Real-World Applications: What to Instrument and What to Audit

| Signal | Why it matters | Typical alert or review |
| --- | --- | --- |
| Drift count by environment | Shows when reality no longer matches code | Drift detected outside maintenance window |
| Manual change count | Predicts hidden divergence | Manual changes exceed threshold |
| Failed policy checks | Shows which rules are catching risky behavior | High-risk policy violations appear repeatedly |
| Apply duration and failure rate | Indicates unhealthy state layout or controller issues | Apply time spikes or repeated partial failures |
| State lock contention | Reveals team coupling and emergency friction | Lock wait time too high during incidents |

⚖️ Trade-offs & Failure Modes: Pros, Cons, and Alternatives

| Category | Practical impact | Mitigation |
| --- | --- | --- |
| Pros | Reviewable and repeatable infrastructure changes | Make plan review mandatory |
| Pros | Easier policy enforcement and auditability | Encode hard-blocking reliability rules |
| Cons | More process around small changes | Use modules and templates to reduce toil |
| Cons | Controller or state-store failure can slow response | Keep break-glass process explicit |
| Risk | Teams trust green CI more than real blast radius | Review generated plan and ownership impact |
| Risk | GitOps hides drift until reconciliation fights production | Alert on drift, not just sync failure |

🧭 Decision Guide for SRE Teams

| Situation | Recommendation |
| --- | --- |
| Multi-environment platform with recurring drift | Adopt IaC plus drift detection |
| Need auditable cluster or network changes | Add GitOps or controlled apply path |
| Shared module creates oversized blast radius | Refactor modules before scaling automation |
| Stateful destructive changes dominate | Treat rollback as forward-fix design, not revert fantasy |

If the plan review cannot clearly answer “what might be deleted or replaced?”, the change is not ready.
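
That question has a mechanical answer: extract every delete or replace action from the plan JSON before review. A jq sketch against an inline sample plan; real input comes from `terraform show -json tfplan`:

```shell
# Inline sample plan; real input comes from `terraform show -json tfplan`
cat <<'EOF' > tfplan.json
{"resource_changes": [
  {"address": "aws_db_instance.payments", "change": {"actions": ["delete", "create"]}},
  {"address": "aws_s3_bucket.logs",       "change": {"actions": ["update"]}}
]}
EOF

# Every resource this plan would delete or replace, with its action sequence
jq -r '.resource_changes[]
  | select(any(.change.actions[]; . == "delete"))
  | "\(.address): \(.change.actions | join(" then "))"' tfplan.json
# prints: aws_db_instance.payments: delete then create
```

Posting this list as a PR comment forces the delete/replace conversation before approval, not after apply.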

🛠️ Terraform, Pulumi, and AWS CDK: IaC Tooling for Production Infrastructure

Terraform (HashiCorp) is the most widely adopted declarative IaC tool, using HCL to define infrastructure resources with a plan/apply workflow, remote state, and a vast provider ecosystem. Pulumi offers the same capabilities using general-purpose languages (TypeScript, Python, Java, Go) instead of HCL. AWS CDK is Amazon's IaC framework using TypeScript or Python that compiles to CloudFormation.

These tools solve the IaC problem by making infrastructure changes reviewable, diffable, and policy-checkable before any resource is modified. The plan (terraform plan, pulumi preview, cdk diff) produces a human-readable change set; OPA or Conftest policies evaluate that plan before apply is allowed.

The OPA policy and CI gate from the 🧪 Concrete Config Example section above show policy enforcement in action. Here is the foundational Terraform module structure and variable pattern that keeps blast radius bounded:

# terraform/modules/api-service/variables.tf
variable "service_name" {
  type        = string
  description = "Unique name for this service instance — scopes all resources."
}

variable "environment" {
  type        = string
  description = "Deployment environment: dev | staging | prod"
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod"
  }
}

variable "min_instances" {
  type    = number
  default = 2
}

variable "max_instances" {
  type    = number
  default = 10
}

# terraform/modules/api-service/main.tf
resource "aws_autoscaling_group" "api" {
  name             = "${var.service_name}-${var.environment}"
  min_size         = var.min_instances
  max_size         = var.max_instances
  # ...
  tag {
    key                 = "service"
    value               = var.service_name
    propagate_at_launch = true
  }
  tag {
    key                 = "environment"
    value               = var.environment
    propagate_at_launch = true
  }
}

# Callers instantiate the module per environment — one change touches one env
module "payments_api_prod" {
  source        = "./modules/api-service"
  service_name  = "payments-api"
  environment   = "prod"
  min_instances = 4
  max_instances = 20
}

Pulumi achieves the same module-per-service boundary using TypeScript ComponentResource classes, giving teams full IDE support, type checking, and unit tests for infrastructure. AWS CDK's Stack and Construct primitives map to the same concept in the CloudFormation world.

For a full deep-dive on Terraform modules, Pulumi, and AWS CDK production patterns, a dedicated follow-up post is planned.

📚 Interactive Review: Infra Change Drill

Before approving an infra PR, ask:

  1. Which owner is accountable if this module change affects a second team unexpectedly?
  2. What resources can be deleted, replaced, or interrupted by this plan?
  3. If the apply fails halfway through, is the path forward a rollback or a forward-fix?
  4. Which policy rule would have caught the last real outage of this type?
  5. How would operators detect and reconcile a manual break-glass change afterward?

Scenario question: a networking PR changes one shared module and the generated plan touches 180 resources across three environments. Do you keep the change, split the module, or stage the rollout? Why?

📌 TLDR: Summary & Key Takeaways

  • IaC improves reliability only when ownership, state, and policy boundaries are explicit.
  • Review the generated plan or desired-state diff, not just the source change.
  • Module blast radius and rollback reality matter more than tool brand.
  • Drift detection and manual-change auditing are core operational signals.
  • GitOps is strongest when it makes desired state clearer, not when it hides how change happens.

📝 Practice Quiz

  1. What is the clearest operational advantage of IaC?

A) It removes the need for human review
B) It makes infrastructure changes repeatable, reviewable, and auditable
C) It guarantees safe rollback for every change

Correct Answer: B

  2. Which issue most often creates oversized infra blast radius?

A) Small, well-owned modules
B) Broad shared modules that couple unrelated resources
C) Running policy checks in CI

Correct Answer: B

  3. What should reviewers examine most closely before approving a change?

A) Only the code diff formatting
B) The generated plan or desired-state diff and its delete/replace impact
C) The number of comments on the PR

Correct Answer: B

  4. Open-ended challenge: your team keeps using manual hotfixes during incidents because the state lock is too slow. How would you redesign state layout and break-glass workflow without giving up auditability?

Written by

Abstract Algorithms

@abstractalgorithms