Types of LLM Quantization: By Timing, Scope, and Mapping
PTQ, QAT, INT8, INT4, and NF4 explained through timing, scope, and mapping choices.
Abstract Algorithms
TLDR: There is no single "best" LLM quantization. You classify and choose quantization along three axes: when you quantize (timing), what you quantize (scope), and how values are encoded (mapping). In practice, most teams start with weight quantization, then add activation quantization when needed.
Quantization Is a Design Space, Not One Switch
Deploying LLaMA-3-70B at full fp16 precision requires approximately 140GB of VRAM: two A100 80GB GPUs at roughly $8–12K/month in cloud GPU rental. At Q4_K_M quantization (roughly 4.5 effective bits per weight), the same model needs about 35–42GB and fits on a single 48GB GPU, or across two 24GB consumer cards. The measured quality difference on most standard benchmarks? Often under 2%.
That is the practical upside, but "just quantize it to 4-bit" is not a strategy; it is a gamble. Teams that apply the wrong quantization type to the wrong model component regularly discover quality regressions in production that never appeared in offline evals.
Here is a concrete picture of the memory trade-off:
| Model | Precision | VRAM Required | Approximate Hardware |
| --- | --- | --- | --- |
| LLaMA-3-70B | fp16 | ~140 GB | 2× A100 80GB |
| LLaMA-3-70B | INT8 | ~70 GB | 1× A100 80GB |
| LLaMA-3-70B | Q4_K_M | ~35–42 GB | 1× 48GB GPU (e.g., A6000) or 2× RTX 3090 |
| LLaMA-3-8B | Q4_K_M | ~4.5 GB | Consumer laptop GPU |
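The table above follows from a simple rule of thumb: weight memory is roughly parameter count times bits per weight, divided by 8, plus a little metadata. A minimal sketch (the function name and the 5% overhead factor are illustrative assumptions, not a library API):

```python
def estimate_weight_memory_gb(n_params: float, bits_per_weight: float,
                              overhead: float = 1.05) -> float:
    """Rough VRAM estimate for weights only (no KV cache, no activations).

    `overhead` loosely accounts for quantization metadata such as scales.
    """
    bytes_total = n_params * bits_per_weight / 8 * overhead
    return bytes_total / 1e9

# 70B parameters at fp16, INT8, and ~4.5 effective bits (Q4_K_M-like)
for label, bits in [("fp16", 16), ("int8", 8), ("q4_k_m-ish", 4.5)]:
    print(label, round(estimate_weight_memory_gb(70e9, bits), 1), "GB")
```

Real deployments also need headroom for the KV cache and activations, which is why the hardware column above is approximate.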
The right question is not "should I quantize?" but rather: "Which quantization type fits my latency, memory, and quality budget?"
This post uses a taxonomy approach. Instead of memorizing tool names, classify every method by:
- Timing: when low precision enters the pipeline.
- Scope: which tensors or components are quantized.
- Mapping: how float values are mapped to low-bit representations.
| If your main pain is... | Your first axis to optimize |
| --- | --- |
| Model does not fit GPU/edge memory | Scope (start with weights) |
| Cost per token is too high | Scope + Mapping |
| Accuracy regression after PTQ | Timing (move toward QAT) |
| Latency remains high despite smaller model | Mapping + kernel compatibility |
Three Classification Axes You Can Apply to Any Quantized LLM
Before selecting a library or hardware backend, use this compact classifier:
| Axis | Core question | Common options | Typical impact |
| --- | --- | --- | --- |
| Timing | When do we apply quantization? | PTQ, QAT | Accuracy retention vs implementation effort |
| Scope | What model parts are quantized? | Weights-only, weights+activations, KV cache | Memory and throughput |
| Mapping | How are floats represented in low bits? | Symmetric, asymmetric, non-uniform (NF4) | Error profile and hardware efficiency |
Use timing for lifecycle decisions, scope for memory/bandwidth impact, and mapping for error behavior.
By Timing: PTQ vs QAT
Timing answers when quantization appears during the model lifecycle.
Post-Training Quantization (PTQ)
PTQ quantizes an already trained model. You do not retrain from scratch.
- Fastest path to deployment.
- Good first step for most LLM serving workloads.
PTQ can be static (activation scales fixed once using a calibration set) or dynamic (activation scales recomputed at runtime, typically per batch).
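The static case can be made concrete with a toy calibrator. This sketch (function names are illustrative, not a library API) derives one fixed symmetric INT8 scale from calibration batches; a dynamic scheme would instead recompute the scale at inference time:

```python
def calibrate_scale(samples, n_bits=8):
    """Static PTQ: derive one symmetric scale from calibration activations."""
    absmax = max(abs(v) for batch in samples for v in batch)
    qmax = 2 ** (n_bits - 1) - 1          # 127 for INT8
    return absmax / qmax if absmax else 1.0

def quantize(values, scale, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1
    qmin = -qmax - 1                      # -128 for INT8
    return [max(qmin, min(qmax, round(v / scale))) for v in values]

# Calibration batches stand in for "representative traffic".
calib = [[-0.9, 0.3, 1.2], [0.7, -1.1, 0.05]]
s = calibrate_scale(calib)        # fixed at calibration time (static PTQ)
print(quantize([1.2, -0.6], s))   # dynamic PTQ would recompute s per batch
```

If production traffic exceeds the calibrated range, values clip to qmin/qmax, which is one reason calibration data must resemble real traffic.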
Quantization-Aware Training (QAT)
QAT simulates low-precision behavior during fine-tuning/training so weights adapt to quantization noise.
- Better quality retention when PTQ degrades important tasks.
- Requires a cleaner data and eval pipeline.
| Timing type | Best when | Main risk | Typical owner |
| --- | --- | --- | --- |
| PTQ | You need speed and lower infra cost now | Quality drops on sensitive tasks | Inference/platform team |
| QAT | PTQ quality is below product threshold | Extra tuning cycles and GPU cost | Model + platform collaboration |
For most teams: PTQ first, QAT only when validation says PTQ is not enough.
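The mechanism QAT adds during training is "fake quantization": the forward pass rounds values to the low-bit grid and immediately dequantizes them, so the loss sees quantization noise while weights stay in float (the backward pass typically treats the rounding as identity, the straight-through estimator). A toy sketch with illustrative names and a per-tensor scale assumed:

```python
def fake_quantize(x, scale, n_bits=4):
    """Quantize-dequantize in one step: the forward pass a QAT layer sees.

    Training against x_hat (not x) lets weights adapt to quantization noise.
    Gradients flow through as if round() were the identity function.
    """
    qmax = 2 ** (n_bits - 1) - 1
    qmin = -qmax - 1
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale   # x_hat: still a float, but only 2**n_bits distinct levels

print(fake_quantize(0.33, scale=0.1))   # snaps to the nearest 4-bit level
```

Everything PTQ does at deploy time, QAT simulates inside the training loop so the optimizer can compensate.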
By Scope: Which Parts of the LLM Get Quantized
Scope determines where quantization is applied in the model and runtime path.
| Scope option | What is quantized | Memory gain | Accuracy risk | Notes |
| --- | --- | --- | --- | --- |
| Weights-only | Model parameters | High | Low to medium | Most common first step |
| Weights + activations | Parameters + runtime activations | Higher | Medium | Better throughput potential |
| Weights + activations + KV cache | Adds cache compression | Very high | Medium to high | Long-context quality needs careful testing |
| Selective/mixed scope | Some layers kept high precision | Medium to high | Lower | Practical compromise |
Common pattern: quantize most linear layer weights first, keep sensitive heads in higher precision, then add activations only after quality baselines pass.
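This selective-scope pattern is easy to express as a plan over module names. A sketch, assuming a LLaMA-style module naming scheme; the choice of sensitive patterns here is illustrative, not a recommendation:

```python
# Assumption: embeddings, the output head, and the first block are "sensitive".
SENSITIVE_PATTERNS = ("embed", "lm_head", "layers.0.")

def plan_scope(module_names):
    """Assign a precision per module: a selective/mixed-scope plan."""
    plan = {}
    for name in module_names:
        if any(p in name for p in SENSITIVE_PATTERNS):
            plan[name] = "bf16"   # sensitive: keep high precision
        else:
            plan[name] = "int4"   # bulk linear weights: quantize
    return plan

modules = ["model.embed_tokens", "model.layers.0.self_attn.q_proj",
           "model.layers.12.mlp.gate_proj", "lm_head"]
for name, prec in plan_scope(modules).items():
    print(f"{prec:5s} {name}")
```

Real tools expose the same idea through options such as skip lists or per-module overrides; the point is that scope is a per-module decision, not a global flag.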
By Mapping: How Float Values Become Low-Bit Values
Mapping defines the numeric transformation from float tensors to low-bit formats.
Symmetric mapping
Values are centered around zero, typically with a single scale.
- Simpler and often faster.
- Works well when tensor distributions are roughly zero-centered.
Asymmetric mapping
Uses scale plus zero-point, allowing shifted ranges.
- Better fit for non-zero-centered distributions.
- Slightly more metadata/handling complexity.
Non-uniform mapping (example: NF4)
Not all quantization levels are equally spaced.
- Better alignment with weight distributions in some LLMs.
- Common in 4-bit weight quantization pipelines.
| Mapping type | Formula style | Hardware friendliness | Typical use |
| --- | --- | --- | --- |
| Symmetric | q = round(x / s) | High | INT8 weight or activation paths |
| Asymmetric | q = round(x / s) + z | High | INT8 with offset-friendly runtimes |
| Non-uniform | Codebook/learned bins | Medium | 4-bit LLM weight quantization |
If two methods use the same bit width but different mapping, quality can differ significantly.
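The difference the zero-point makes shows up directly in reconstruction error. This sketch (toy per-tensor scales, illustrative function names) quantizes a value from a non-negative range with both mappings:

```python
def quant_symmetric(x, s, n_bits=8):
    """q = round(x/s), clipped; returns the dequantized value."""
    qmax = 2 ** (n_bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(x / s)))
    return q * s

def quant_asymmetric(x, s, z, n_bits=8):
    """q = round(x/s) + z, clipped to [0, 2^n - 1]; returns dequantized value."""
    q = max(0, min(2 ** n_bits - 1, round(x / s) + z))
    return s * (q - z)

# A shifted range like [0, 6] wastes half the symmetric grid on negative
# levels that never occur; the asymmetric zero-point uses all 256 levels.
lo, hi = 0.0, 6.0
s_sym = max(abs(lo), abs(hi)) / 127
s_asym = (hi - lo) / 255
z = round(-lo / s_asym)

x = 3.14
print("sym error :", abs(x - quant_symmetric(x, s_sym)))
print("asym error:", abs(x - quant_asymmetric(x, s_asym, z)))
```

Same 8 bits, different mapping, measurably different error: that is the point of the table above.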
Popular Approaches: Weight Quantization and Activation Quantization
These are the two most widely discussed practical approaches.
Weight quantization
Weight quantization compresses model parameters (usually linear layers).
Why it is popular:
- Big memory savings with manageable quality impact.
- Often enough to move from "cannot deploy" to "production feasible."
Typical setup:
- 8-bit (safer) or 4-bit (more aggressive) weights.
- Per-channel or group-wise scales.
- Optional selective high precision for sensitive layers.
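Group-wise scaling, used by formats such as Q4_K_M and GPTQ group quantization, keeps one scale per small block of weights instead of one per tensor, so a single outlier only distorts its own group. A toy sketch (illustrative names, symmetric mapping assumed):

```python
def quantize_groupwise(weights, group_size=4, n_bits=4):
    """Group-wise symmetric quantization: one scale per `group_size` weights.

    Smaller groups track local ranges better at the cost of more scale
    metadata (which is where the "before metadata overhead" caveat comes from).
    """
    qmax = 2 ** (n_bits - 1) - 1
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        absmax = max(abs(w) for w in group)
        scale = absmax / qmax if absmax else 1.0
        qs = [max(-qmax - 1, min(qmax, round(w / scale))) for w in group]
        out.append((qs, scale))
    return out

# Two very different local ranges: one global scale would crush the first group.
w = [0.01, -0.02, 0.015, 0.005, 0.9, -1.1, 0.8, 0.2]
for qs, s in quantize_groupwise(w):
    print(qs, round(s, 4))
```

With a single per-tensor scale, the small weights in the first group would all round to zero; per-group scales preserve them.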
Activation quantization
Activation quantization compresses intermediate runtime tensors produced during inference.
Why teams use it:
- Further reduces bandwidth and memory traffic.
Why teams delay it:
- More sensitive to input distribution shifts.
- Requires strong eval coverage (long context, tool calling, domain prompts).
| Approach | Biggest benefit | Biggest challenge | Good default order |
| --- | --- | --- | --- |
| Weight quantization | Large memory reduction | Layer sensitivity at low bits | Start here |
| Activation quantization | Extra speed and memory gains | Runtime distribution sensitivity | Add second |
In short: weight quantization is the baseline optimization; activation quantization is the scaling optimization.
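Static activation quantization boils down to watching representative traffic and recording value ranges per tensor. A minimal observer sketch (the class name is illustrative; real backends attach equivalent logic via forward hooks):

```python
class ActivationObserver:
    """Running min/max tracker: the core of static activation calibration."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, batch):
        self.lo = min(self.lo, min(batch))
        self.hi = max(self.hi, max(batch))

    def asymmetric_params(self, n_bits=8):
        # Extend the range to include 0 so zero activations stay exact.
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)
        scale = (hi - lo) / (2 ** n_bits - 1)
        zero_point = round(-lo / scale) if scale else 0
        return scale, zero_point

obs = ActivationObserver()
for batch in [[0.1, 4.2, 0.0], [2.7, 9.6, 0.4]]:  # stand-ins for calibration traffic
    obs.observe(batch)
print(obs.asymmetric_params())
```

The fragility is visible in the code: if production activations exceed the observed max, they clip, which is exactly the distribution-shift risk listed above.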
Deep Dive: Why Timing, Scope, and Mapping Interact Inside the Runtime
The internals
At inference time, quantized LLMs run through three hidden mechanisms:
- Tensor representation changes: weights/activations stored in low-bit format with scales.
- Kernel path changes: runtime chooses quantized GEMM kernels if available.
- Rescaling/dequantization points: outputs are rescaled at specific boundaries.
Small scope or mapping changes can push execution to a different kernel path, so "smaller model" does not always mean lower latency.
Mathematical model (lightweight)
A common affine quantization mapping is:
$$ q = \text{clip}\left(\text{round}\left(\frac{x}{s}\right) + z, q_{\min}, q_{\max}\right) $$
$$ \hat{x} = s \cdot (q - z) $$
The reconstruction error is:
$$ e = x - \hat{x} $$
Your deployment goal is to keep $e$ small enough that task-level metrics stay within budget.
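Plugging numbers into the affine formulas above makes the error tangible: for values inside the clipping range, round-to-nearest guarantees $|e| \le s/2$. A quick sketch (illustrative function names):

```python
def affine_quantize(x, s, z, qmin=-128, qmax=127):
    """q = clip(round(x/s) + z, qmin, qmax)"""
    return max(qmin, min(qmax, round(x / s) + z))

def affine_dequantize(q, s, z):
    """x_hat = s * (q - z)"""
    return s * (q - z)

s, z = 0.05, 3
for x in [1.0, 1.013, -6.4]:
    q = affine_quantize(x, s, z)
    x_hat = affine_dequantize(q, s, z)
    print(f"x={x:+.3f}  q={q:+4d}  x_hat={x_hat:+.3f}  e={x - x_hat:+.4f}")
```

The per-value bound $s/2$ is why scale granularity (per-tensor vs per-channel vs per-group) matters: smaller scales over tighter value ranges mean smaller worst-case error.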
Performance analysis
For decoder-only LLMs, per-layer compute class stays similar to the unquantized path, but constants change:
- Time complexity trend: the operation class stays the same, but lower precision often improves throughput because less data moves between memory and compute.
- Space complexity trend: parameter memory roughly scales with bit width (FP16 to INT8 is about 2x smaller; FP16 to 4-bit is about 4x smaller before metadata overhead).
- Bottlenecks: memory bandwidth, unsupported kernels, and dequantization overhead can limit gains.
| Change | Usually improves | Can regress when |
| --- | --- | --- |
| Lower weight precision | Memory footprint, model fit | Kernel path is not optimized |
| Activation quantization | Throughput, memory traffic | Calibration misses production distribution |
| More aggressive mapping | Compression ratio | Quantization error hurts key tasks |
Visualizing a Quantization Strategy Flow
```mermaid
flowchart TD
    A[Start with FP16 or BF16 LLM] --> B[Choose Timing Axis]
    B --> C{PTQ or QAT?}
    C -->|PTQ| D[Select Scope: Weights Only]
    C -->|QAT| E[Train with Quantization Simulation]
    D --> F[Select Mapping: Symmetric / Asymmetric / NF4]
    E --> F
    F --> G[Benchmark Memory, Latency, Quality]
    G --> H{Targets met?}
    H -- No --> I[Expand Scope or Adjust Mapping]
    I --> G
    H -- Yes --> J[Canary Deploy + Fallback]
    J --> K[Production Rollout]
```
Real-World Applications: Input, Process, Output
Case study 1: Customer support assistant on shared GPUs
| Stage | Details |
| --- | --- |
| Input | Multilingual support prompts, medium context, strict p95 latency |
| Process | PTQ + INT8 weights first, then selective activation quantization on stable layers |
| Output | Lower memory usage, better concurrency, acceptable quality drift |
Case study 2: On-prem legal drafting assistant with long context
| Stage | Details |
| --- | --- |
| Input | Long-context legal prompts with domain terms |
| Process | Weight-only 4-bit in most layers, output head and selected attention blocks kept BF16 |
| Output | Model fits target hardware, but long-context eval required extra iteration |
Both cases succeed by sequencing decisions across timing, scope, and mapping instead of maximizing compression immediately.
Trade-offs and Failure Modes You Should Expect
| Trade-off or failure mode | What it looks like in production | Mitigation |
| --- | --- | --- |
| Memory saved, quality drops | Answers remain fluent but become less accurate | Add task-specific eval thresholds before rollout |
| Low-bit model, no latency gain | Smaller model but unchanged p95 | Validate backend kernel support early |
| Activation drift | Good offline metrics, bad real traffic performance | Use representative calibration and shadow traffic |
| Over-quantized sensitive layers | Hallucinations or format breakage in structured tasks | Keep selective layers in higher precision |
| Aggressive scope change | Improvements on average, poor long-tail reliability | Canary release with automated rollback |
Intermediate-level rule: do not ship quantization based on memory metrics alone. Always include task-quality and tail-latency checks.
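That rule can be enforced mechanically as a pre-rollout gate. A sketch, with illustrative metric names and thresholds you would replace with your own SLA:

```python
def rollout_gate(metrics, budget):
    """Block a quantized-model rollout unless every budgeted metric passes.

    Thresholds and metric names here are illustrative placeholders.
    """
    failures = []
    if metrics["task_accuracy"] < budget["min_task_accuracy"]:
        failures.append("task accuracy below threshold")
    if metrics["p95_latency_ms"] > budget["max_p95_latency_ms"]:
        failures.append("tail latency over budget")
    if metrics["json_validity_rate"] < budget["min_json_validity_rate"]:
        failures.append("structured-output regressions")
    return (len(failures) == 0, failures)

budget = {"min_task_accuracy": 0.97, "max_p95_latency_ms": 850,
          "min_json_validity_rate": 0.995}
ok, why = rollout_gate({"task_accuracy": 0.981, "p95_latency_ms": 910,
                        "json_validity_rate": 0.999}, budget)
print("ship!" if ok else f"blocked: {why}")
```

Note the gate fails this example on tail latency even though accuracy and JSON validity pass: memory savings alone never clear the bar.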
Decision Guide: Choosing by Constraint
| Situation | Recommendation |
| --- | --- |
| Use when | Start with PTQ + weight quantization (INT8 or safe 4-bit) when memory and cost are immediate problems. |
| Avoid when | Avoid activation quantization as the first move if you do not have production-like calibration/eval data. |
| Alternative | Use mixed precision: quantize most layers, keep sensitive modules higher precision. |
| Edge cases | For long context, tool use, or strict JSON output, run dedicated eval suites before full rollout. |
If deployment is blocked by memory, optimize scope first. If quality fails after PTQ, revisit timing (QAT). If two same-bit methods differ, inspect mapping and kernel support.
Practical Examples: Weight-First and Activation-Aware Paths
Example 1: Weight quantization with 4-bit loading
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_cfg,
    device_map="auto",
)

prompt = "List three trade-offs of LLM quantization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
What this demonstrates: a weight-first quantization strategy that is widely used for fast prototyping and production pilots.
Activation quantization in full LLM stacks is backend-dependent. Apply it after a weight-only baseline passes, then validate with production-like prompts, long-context tests, and structured-output checks.
AutoGPTQ, AutoAWQ, and bitsandbytes: Quantization Libraries in Practice
bitsandbytes (the bnb library by Tim Dettmers) integrates directly with the HuggingFace transformers from_pretrained() API to load models in INT8 or NF4 (4-bit) precision without a separate quantization step; it is the fastest path from a HuggingFace model card to a quantized inference session.
AutoGPTQ implements the GPTQ algorithm (layer-wise weight quantization guided by approximate second-order, Hessian-based information) for aggressive 4-bit quantization with better quality retention than naive round-to-nearest. AutoAWQ implements the AWQ (Activation-Aware Weight Quantization) algorithm, which identifies the small fraction (roughly 1%) of weight channels most important to output quality and protects them via per-channel scaling before quantization.
```python
# ── bitsandbytes: NF4 (4-bit NormalFloat) via HuggingFace BitsAndBytesConfig ──
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # non-uniform 4-bit: better weight distribution fit
    bnb_4bit_use_double_quant=True,        # quantize the scale factors too (~0.4-bit savings)
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_bnb = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_cfg,
    device_map="auto",
)

# ── AutoGPTQ: GPTQ 4-bit from a pre-quantized model checkpoint ────────────────
from auto_gptq import AutoGPTQForCausalLM

model_gptq = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
    model_basename="model",
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,                      # set True for faster inference on supported GPUs
)

# ── AutoAWQ: AWQ 4-bit from a pre-quantized model checkpoint ──────────────────
from awq import AutoAWQForCausalLM

model_awq = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    fuse_layers=True,                      # fuse attention layers for ~20% throughput improvement
    trust_remote_code=False,
    safetensors=True,
)

# ── Comparison: same prompt, three quantization backends ──────────────────────
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
prompt = "Explain the difference between PTQ and QAT in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

for name, model in [("bnb-nf4", model_bnb), ("gptq", model_gptq), ("awq", model_awq)]:
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=60)
    print(f"\n[{name}]", tokenizer.decode(out[0], skip_special_tokens=True))
```
| Library | Algorithm | Timing | Best for | Quality vs. NF4 |
| --- | --- | --- | --- | --- |
| bitsandbytes | NF4 / INT8 | PTQ (load-time) | Fast prototyping, LoRA fine-tuning | Baseline |
| AutoGPTQ | GPTQ | PTQ (offline, calibration) | Inference-only deployments needing small model size | +0.5–1% on coding/math |
| AutoAWQ | AWQ | PTQ (offline, activation-aware) | Low-bit deployment with quality-critical tasks | +0.5–1.5% on reasoning |
fuse_layers=True in AutoAWQ fuses attention and MLP modules into optimized CUDA kernels. Note that this fusion is a runtime optimization layered on top of AWQ's activation-aware scale selection: the fusion buys throughput, while the activation-aware quantization itself is what protects quality.
For a full deep-dive on AutoGPTQ calibration pipelines, AWQ activation channel analysis, and bitsandbytes QLoRA fine-tuning workflows, a dedicated follow-up post is planned.
Lessons Learned from Quantization Projects
- Classify decisions by timing, scope, and mapping before choosing tools.
- Weight quantization is usually the highest-ROI first step.
- Activation quantization can unlock additional speed, but calibration quality becomes critical.
- Same bit width does not mean same quality; mapping and granularity matter.
- Kernel compatibility can dominate real latency outcomes.
- Selective precision is often better than aggressive all-layer quantization.
TLDR: Summary & Key Takeaways
- LLM quantization is best understood as a 3-axis design space: timing, scope, and mapping.
- By Timing: PTQ is fast and practical; QAT helps recover quality when PTQ is insufficient.
- By Scope: start with weights, then add activations if needed.
- By Mapping: symmetric, asymmetric, and non-uniform mappings create different error behavior.
- Weight quantization is the most common production entry point.
- Activation quantization is powerful but requires stronger evaluation discipline.
- Production success depends on joint optimization of memory, latency, and task-level quality.
One-liner: The best quantization strategy is the one that meets your product SLA with the smallest quality compromise, not the one with the lowest bit count.
Practice Quiz
Which axis answers "when do we introduce low precision into the model lifecycle"?
- A) Scope
- B) Timing
- C) Mapping
Correct Answer: B
You reduced a model to 4-bit weights but p95 latency did not improve. What is the most likely next check?
Correct Answer: Validate kernel/backend support and dequantization overhead on the target hardware.
In most teams, which approach is the safer first step before activation quantization?
Correct Answer: PTQ with weight quantization (often INT8 or conservative 4-bit).
Open-ended: Design an evaluation plan for adding activation quantization to an LLM that must generate valid JSON and handle long-context prompts.