Infrastructure as Code Pattern: GitOps, Reusable Modules, and Policy Guardrails
Manage cloud infrastructure declaratively with reviewable diffs, drift control, and policy checks.
TLDR: Infrastructure as code is useful because it makes infrastructure changes reviewable, repeatable, and testable. It becomes production-grade only when module boundaries, state locking, GitOps flow, and policy checks are treated as operational controls rather than tooling preferences.
TLDR: The SRE question is not “do we use Terraform?” It is “can we explain how infra changes are planned, approved, applied, rolled back, and audited without relying on someone remembering a terminal command from last quarter?”
Operator note: Incident reviews usually show that the outage was not caused by “IaC” itself. It was caused by a manual hotfix that drifted from code, a module that owned too much blast radius, or a plan that passed review without policy gates for network, IAM, or deletion risk.
In 2017, a GitLab engineer accidentally deleted a production database with a single terminal command run on the wrong host. Recovery took 18 hours partly because the environment had been configured manually — there was no code to re-apply, no reviewed diff, and no reproducible restore path. If that environment had been managed with IaC, any engineer on the team could have reapplied the desired state from version control and validated the rebuild against policy checks.
If your team operates cloud infrastructure, IaC and GitOps are the difference between “we think it looks like this” and “we can prove it, reproduce it, and audit every change that got us here.”
Worked example — an OPA policy that blocks public S3 exposure before terraform apply runs:
deny[msg] {
  r := input.resource_changes[_]
  r.type == "aws_s3_bucket_acl"
  r.change.after.acl == "public-read"
  msg := sprintf("BLOCKED: '%v' must not use public-read ACL", [r.address])
}
This rule runs against the Terraform plan in CI. A misconfiguration is caught before any human reviews the diff — apply is never reached.
📖 When IaC and GitOps Actually Help
IaC is strongest when infrastructure changes are frequent enough to need guardrails and consistent enough to encode declaratively.
Use this pattern when:
- multiple environments must stay aligned,
- teams need audited change history,
- infra drift creates repeated surprises,
- security and reliability rules should block unsafe changes before apply.
| Operational problem | Why IaC or GitOps helps |
| --- | --- |
| Manual network changes drift across environments | Desired state in code becomes the source of truth |
| Teams cannot review infrastructure risk before change windows | Plans and policy checks expose blast radius early |
| Emergency changes leave no paper trail | Pull requests and controller history make actions auditable |
| Platform standards vary across teams | Modules and policy-as-code standardize the baseline |
🔍 When Not to Over-Apply IaC
IaC needs judgement: encoding one-off decisions into broad shared modules can make change harder, not safer.
Avoid IaC when:
- module boundaries are so broad that one change affects unrelated systems,
- teams have not yet agreed on ownership of shared state or environments,
- GitOps controllers would hide the operational intent more than clarify it,
- emergency break-glass procedures do not exist for controller or state-store failure.
| Constraint | Better first move |
| --- | --- |
| Shared module is a monolith no one wants to touch | Split module boundaries first |
| Teams still hand-edit cloud resources daily | Start with drift visibility and one managed environment |
| Policy standards are not yet defined | Write the policy contract before automating enforcement |
⚙️ How IaC Works in Production
A practical flow is simple and disciplined:
- Engineers propose infra changes in version control.
- CI validates syntax, module inputs, and generated plan.
- Policy checks block unsafe changes such as public exposure, missing tags, or unrestricted IAM.
- Reviewers inspect the plan, not just the HCL/YAML diff.
- Apply happens through one controlled path.
- Drift detection and controller reconciliation reveal out-of-band changes.
| Control point | What operators care about | Why it matters |
| --- | --- | --- |
| Module boundary | Blast radius of one change | Prevents “change one thing, surprise ten systems” |
| Remote state and locking | Safe concurrency during plan/apply | Prevents corrupt or racing state |
| Policy gates | Blocking dangerous changes pre-apply | Moves review from intuition to enforcement |
| Apply path | One approved execution path | Reduces shadow automation and manual drift |
| Drift detection | Visibility into manual change | Keeps code and reality converged |
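The remote-state-and-locking row above is usually implemented with a versioned object store plus a lock. A minimal sketch using Terraform's S3 backend with DynamoDB locking; the bucket, key, and table names below are hypothetical placeholders, not values from this article:

```hcl
# Hypothetical backend config; replace names with your own resources.
terraform {
  backend "s3" {
    bucket         = "example-tf-state"                # versioned state bucket (placeholder)
    key            = "payments/prod/terraform.tfstate" # one key per state, split by ownership
    region         = "us-east-1"
    dynamodb_table = "example-tf-locks"                # lock table: prevents racing applies
    encrypt        = true
  }
}
```

Splitting the `key` per team and environment is what keeps lock contention low during incidents.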
🧠 Deep Dive: What Incident Reviews Usually Reveal First
| Failure mode | Early symptom | Root cause | First mitigation |
| --- | --- | --- | --- |
| Plan looked small, impact was huge | One module change touched many unrelated resources | Module boundary was too broad | Refactor modules around ownership and lifecycle |
| Safe rollback was impossible | Reverting code did not restore runtime safely | Change affected stateful or destructive resource lifecycle | Define forward-fix and immutable replacement rules |
| Drift keeps reappearing | Controllers “fix” manual changes repeatedly | Break-glass changes bypassed normal path | Track manual changes and formalize exception workflow |
| Policy gates are ignored in emergencies | Teams disable checks to move faster | Policies were too noisy or not risk-ranked | Separate hard-blockers from advisory checks |
| State lock contention delays incidents | Emergency apply waits on stale lock or long plan | State layout is too coarse | Split state by ownership and environment |
Field note: one of the highest-signal metrics in mature IaC programs is not “apply succeeded.” It is “how often did someone bypass the path entirely?” Manual change count is often the best predictor of the next surprise.
Internals: How Shopify, Datadog, and HashiCorp Build IaC Programs
Shopify manages Terraform at scale by coupling module ownership to team boundaries. Each platform team owns a versioned module set published via a private Terraform registry. Any change to a shared "golden module" triggers CI on downstream consumers automatically — not just the module's own tests. This surfaces blast-radius early, before engineers open a plan.
Datadog's GitOps pipeline uses a controller pattern similar to Argo CD: infrastructure desired state lives in Git, and a reconciliation controller continuously compares live cloud state against that spec. Drift alerts fire within minutes of a manual change, not days later during the next deployment cycle. Teams see exactly which resource drifted and who owns it.
HashiCorp's own policy-as-code practice treats OPA (Open Policy Agent) rules as first-class engineering artifacts: policies are version-controlled, CI-tested against known-bad plan fixtures, and reviewed in PRs just like application code. A policy that advisory-warns on staging but hard-blocks on production is a common and effective pattern — it teaches teams the rule before enforcing it.
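That fixture-testing practice can be sketched with OPA's built-in test runner (`opa test`). The plan fixture below is hypothetical and deliberately minimal; it assumes the `deny` rule shown in the config example later in this post lives in the same `main` package:

```rego
# policy/deny_public_exposure_test.rego — run with: opa test policy/
package main

# Hypothetical known-bad plan fixture: a bucket ACL set to public-read
bad_plan := {"resource_changes": [{
  "address": "aws_s3_bucket_acl.logs",
  "type": "aws_s3_bucket_acl",
  "change": {"after": {"acl": "public-read"}},
}]}

test_public_acl_is_denied {
  count(deny) > 0 with input as bad_plan
}
```

A policy change that silently stops catching the known-bad fixture fails CI, just like a regression in application code.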
Failure scenario: at a fintech running Terraform with a broad shared networking module, an engineer added a security group rule and the generated plan silently touched 12 RDS instances. A downstream firewall change then blocked database traffic for one environment — the issue was invisible in the source diff. An OPA rule checking security group modifications would have caught it before apply. The fix: split the module and add a hard-blocking policy.
Performance and Operational Metrics That Signal IaC Health
IaC programs don't have p99 latency in the traditional sense, but they do have measurable signals that predict the next surprise:
| Metric | Why it matters | Threshold to investigate |
| --- | --- | --- |
| Manual change count per week | Predicts hidden divergence; bypasses are the leading indicator of the next outage | > 3 manual changes outside the normal path |
| Drift events per environment | Shows how often live state diverges from declared state | Any drift outside a maintenance window |
| Failed policy checks (hard-block) | Shows which guardrails are catching real risk | Same rule blocking repeatedly → fix the pattern, not the check |
| Apply duration (p95) | Oversized modules make applies slow and risky | > 10 minutes suggests a module split is overdue |
| State lock wait time during incidents | Reveals team coupling and emergency friction | > 2 minutes → split state by ownership |
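The drift-events metric above can be fed by a scheduled CI job: `terraform plan -detailed-exitcode` exits 0 when live state matches code, 1 on error, and 2 when a non-empty plan exists, i.e. drift or an unapplied change. A sketch, assuming the same `infra/prod` layout as the CI example below; the schedule is an assumption:

```yaml
name: drift-check
on:
  schedule:
    - cron: "0 */6 * * *"   # every six hours; tune to your maintenance windows
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: terraform -chdir=infra/prod init -input=false
      # exit code 2 fails this step, which is the drift alert
      - run: terraform -chdir=infra/prod plan -detailed-exitcode -input=false
```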
📊 IaC and GitOps Control Flow
flowchart TD
A[Pull request with infra change] --> B[Validate syntax and module inputs]
B --> C[Generate plan or desired-state diff]
C --> D[Run policy checks]
D --> E{Checks pass?}
E -->|No| F[Reject change before apply]
E -->|Yes| G[Human review of blast radius]
G --> H[Controlled apply or GitOps merge]
H --> I[Runtime reconciliation]
I --> J[Drift detection and alerts]
🧪 Concrete Config Example: CI Gate for Terraform and Policy Checks
name: infra-plan
on:
  pull_request:
    paths:
      - infra/**
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: terraform -chdir=infra/prod init -input=false
      - run: terraform -chdir=infra/prod validate
      - run: terraform -chdir=infra/prod plan -out=tfplan -input=false
      - run: terraform -chdir=infra/prod show -json tfplan > tfplan.json
      - run: conftest test tfplan.json --policy policy/
An example OPA (Rego) policy that blocks public S3 exposure and unrestricted security group rules:
# policy/deny_public_exposure.rego
package main

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_s3_bucket_acl"
  resource.change.after.acl == "public-read"
  msg := sprintf("BLOCKED: S3 bucket '%v' must not use public-read ACL", [resource.address])
}

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_security_group_rule"
  resource.change.after.cidr_blocks[_] == "0.0.0.0/0"
  resource.change.after.from_port == 0
  msg := sprintf("BLOCKED: '%v' opens all ports to the internet", [resource.address])
}
Conftest evaluates this against the Terraform JSON plan before any human review happens. A denial fails the CI job and blocks the PR before apply is ever reached. The policy lives in policy/ alongside the Terraform code — same PR review, same audit trail, same version history.
Why this matters:
- the plan is produced in CI, not on one engineer’s laptop,
- policy checks inspect the generated plan, not just source text,
- reviewers can reason about blast radius before apply is allowed.
🌍 Real-World Applications: What to Instrument and What to Audit
| Signal | Why it matters | Typical alert or review |
| --- | --- | --- |
| Drift count by environment | Shows when reality no longer matches code | Drift detected outside maintenance window |
| Manual change count | Predicts hidden divergence | Manual changes exceed threshold |
| Failed policy checks | Shows which rules are catching risky behavior | High-risk policy violations appear repeatedly |
| Apply duration and failure rate | Indicates unhealthy state layout or controller issues | Apply time spikes or repeated partial failures |
| State lock contention | Reveals team coupling and emergency friction | Lock wait time too high during incidents |
⚖️ Trade-offs & Failure Modes: Pros, Cons, and Alternatives
| Category | Practical impact | Mitigation |
| --- | --- | --- |
| Pros | Reviewable and repeatable infrastructure changes | Make plan review mandatory |
| Pros | Easier policy enforcement and auditability | Encode hard-blocking reliability rules |
| Cons | More process around small changes | Use modules and templates to reduce toil |
| Cons | Controller or state-store failure can slow response | Keep break-glass process explicit |
| Risk | Teams trust green CI more than real blast radius | Review generated plan and ownership impact |
| Risk | GitOps hides drift until reconciliation fights production | Alert on drift, not just sync failure |
🧭 Decision Guide for SRE Teams
| Situation | Recommendation |
| --- | --- |
| Multi-environment platform with recurring drift | Adopt IaC plus drift detection |
| Need auditable cluster or network changes | Add GitOps or controlled apply path |
| Shared module creates oversized blast radius | Refactor modules before scaling automation |
| Stateful destructive changes dominate | Treat rollback as forward-fix design, not revert fantasy |
If the plan review cannot clearly answer “what might be deleted or replaced?”, the change is not ready.
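One way to make "what might be deleted or replaced?" mechanical is an advisory Conftest rule over the plan JSON. A sketch: `warn` rules surface findings without failing the job, so reviewers see every destructive action listed in the PR checks:

```rego
# policy/warn_destructive.rego — advisory, not hard-blocking
package main

warn[msg] {
  rc := input.resource_changes[_]
  rc.change.actions[_] == "delete"   # matches plain deletes and delete-and-recreate replacements
  msg := sprintf("DESTRUCTIVE: plan deletes or replaces '%v'", [rc.address])
}
```

Promoting this from `warn` to `deny` for production state is a reasonable next step once teams trust the signal.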
🛠️ Terraform, Pulumi, and AWS CDK: IaC Tooling for Production Infrastructure
Terraform (HashiCorp) is the most widely adopted declarative IaC tool, using HCL to define infrastructure resources with a plan/apply workflow, remote state, and a vast provider ecosystem. Pulumi offers the same capabilities using general-purpose languages (TypeScript, Python, Java, Go) instead of HCL. AWS CDK is Amazon's IaC framework using TypeScript or Python that compiles to CloudFormation.
These tools solve the IaC problem by making infrastructure changes reviewable, diffable, and policy-checkable before any resource is modified. The plan (terraform plan, pulumi preview, cdk diff) produces a human-readable change set; OPA or Conftest policies evaluate that plan before apply is allowed.
The OPA policy and CI gate from the 🧪 Concrete Config Example section above show policy enforcement in action. Here is the foundational Terraform module structure and variable pattern that keeps blast radius bounded:
# terraform/modules/api-service/variables.tf
variable "service_name" {
  type        = string
  description = "Unique name for this service instance; scopes all resources."
}

variable "environment" {
  type        = string
  description = "Deployment environment: dev | staging | prod"
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod"
  }
}

variable "min_instances" {
  type    = number
  default = 2
}

variable "max_instances" {
  type    = number
  default = 10
}

# terraform/modules/api-service/main.tf
resource "aws_autoscaling_group" "api" {
  name     = "${var.service_name}-${var.environment}"
  min_size = var.min_instances
  max_size = var.max_instances
  # ...
  tag {
    key                 = "service"
    value               = var.service_name
    propagate_at_launch = true
  }
  tag {
    key                 = "environment"
    value               = var.environment
    propagate_at_launch = true
  }
}

# Callers instantiate the module per environment; one change touches one env
module "payments_api_prod" {
  source        = "./modules/api-service"
  service_name  = "payments-api"
  environment   = "prod"
  min_instances = 4
  max_instances = 20
}
Pulumi achieves the same module-per-service boundary using TypeScript ComponentResource classes, giving teams full IDE support, type checking, and unit tests for infrastructure. AWS CDK's Stack and Construct primitives map to the same concept in the CloudFormation world.
For a full deep-dive on Terraform modules, Pulumi, and AWS CDK production patterns, a dedicated follow-up post is planned.
📚 Interactive Review: Infra Change Drill
Before approving an infra PR, ask:
- Which owner is accountable if this module change affects a second team unexpectedly?
- What resources can be deleted, replaced, or interrupted by this plan?
- If the apply fails halfway through, is the path forward a rollback or a forward-fix?
- Which policy rule would have caught the last real outage of this type?
- How would operators detect and reconcile a manual break-glass change afterward?
Scenario question: a networking PR changes one shared module and the generated plan touches 180 resources across three environments. Do you keep the change, split the module, or stage the rollout? Why?
📌 TLDR: Summary & Key Takeaways
- IaC improves reliability only when ownership, state, and policy boundaries are explicit.
- Review the generated plan or desired-state diff, not just the source change.
- Module blast radius and rollback reality matter more than tool brand.
- Drift detection and manual-change auditing are core operational signals.
- GitOps is strongest when it makes desired state clearer, not when it hides how change happens.
📝 Practice Quiz
- What is the clearest operational advantage of IaC?
A) It removes the need for human review
B) It makes infrastructure changes repeatable, reviewable, and auditable
C) It guarantees safe rollback for every change
Correct Answer: B
- Which issue most often creates oversized infra blast radius?
A) Small, well-owned modules
B) Broad shared modules that couple unrelated resources
C) Running policy checks in CI
Correct Answer: B
- What should reviewers examine most closely before approving a change?
A) Only the code diff formatting
B) The generated plan or desired-state diff and its delete/replace impact
C) The number of comments on the PR
Correct Answer: B
- Open-ended challenge: your team keeps using manual hotfixes during incidents because the state lock is too slow. How would you redesign state layout and break-glass workflow without giving up auditability?
Written by
Abstract Algorithms
@abstractalgorithms