
Types of LLM Quantization: By Timing, Scope, and Mapping

PTQ, QAT, INT8, INT4, and NF4 explained through timing, scope, and mapping choices.

Abstract Algorithms · 14 min read

TLDR: There is no single "best" LLM quantization. You classify and choose quantization along three axes: when you quantize (timing), what you quantize (scope), and how values are encoded (mapping). In practice, most teams start with weight quantization, then add activation quantization when needed.


📖 Quantization Is a Design Space, Not One Switch

Deploying LLaMA-3-70B at full fp16 precision requires approximately 140GB of VRAM: two A100 80GB GPUs at roughly $8–12K/month in cloud GPU rental. At Q4_K_M quantization (4-bit weights), the same model needs roughly 35GB, fitting on a single 48GB GPU or a pair of 24GB consumer cards. The measured quality difference on most standard benchmarks? Typically under 2%.

That is the practical upside, but "just quantize it to 4-bit" is not a strategy; it is a gamble. Teams that apply the wrong quantization type to the wrong model component regularly discover quality regressions in production that never appeared in offline evals.

Here is a concrete picture of the memory trade-off:

| Model | Precision | VRAM Required | Approximate Hardware |
|---|---|---|---|
| LLaMA-3-70B | fp16 | ~140 GB | 2× A100 80GB |
| LLaMA-3-70B | INT8 | ~70 GB | 1× A100 80GB |
| LLaMA-3-70B | Q4_K_M | ~35 GB | 1× A6000 48GB or 2× RTX 3090 |
| LLaMA-3-8B | Q4_K_M | ~4.5 GB | Consumer laptop GPU |

The right question is not "should I quantize?" but rather: "Which quantization type fits my latency, memory, and quality budget?"

This post uses a taxonomy approach. Instead of memorizing tool names, classify every method by:

  • By Timing: when low precision enters the pipeline.
  • By Scope: which tensors or components are quantized.
  • By Mapping: how float values are mapped to low-bit representations.

| If your main pain is... | Your first axis to optimize |
|---|---|
| Model does not fit GPU/edge memory | Scope (start with weights) |
| Cost per token is too high | Scope + Mapping |
| Accuracy regression after PTQ | Timing (move toward QAT) |
| Latency remains high despite smaller model | Mapping + kernel compatibility |

🔍 Three Classification Axes You Can Apply to Any Quantized LLM

Before selecting a library or hardware backend, use this compact classifier:

| Axis | Core question | Common options | Typical impact |
|---|---|---|---|
| Timing | When do we apply quantization? | PTQ, QAT | Accuracy retention vs implementation effort |
| Scope | What model parts are quantized? | Weights-only, weights+activations, KV cache | Memory and throughput |
| Mapping | How are floats represented in low bits? | Symmetric, asymmetric, non-uniform (NF4) | Error profile and hardware efficiency |

Use timing for lifecycle decisions, scope for memory/bandwidth impact, and mapping for error behavior.


โš™๏ธ By Timing: PTQ vs QAT

Timing answers when quantization appears during the model lifecycle.

Post-Training Quantization (PTQ)

PTQ quantizes an already trained model. You do not retrain from scratch.

  • Fastest path to deployment.
  • Good first step for most LLM serving workloads.

PTQ can be static (scales calibrated once from a representative batch, then fixed) or dynamic (activation scales computed on the fly at runtime).
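The static-versus-dynamic distinction comes down to when the activation scale is computed. A pure-Python sketch with made-up numbers (real toolkits do this per tensor, on the GPU):

```python
def absmax_scale(values, bits=8):
    """Symmetric scale so the largest |value| maps to the top of the signed range."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8
    return max(abs(v) for v in values) / qmax

# Static PTQ: scale computed once from a calibration batch, then frozen.
calibration_batch = [0.8, -1.2, 0.3, 2.0, -0.5]
static_scale = absmax_scale(calibration_batch)

# Dynamic PTQ: scale recomputed from each incoming activation tensor.
runtime_activations = [0.1, -0.2, 0.05]
dynamic_scale = absmax_scale(runtime_activations)

# The small runtime tensor gets a finer quantization grid dynamically
# than the frozen calibration scale would give it.
print(static_scale, dynamic_scale)
```

Dynamic scales track each input better but add per-request work; static scales are free at runtime but can mismatch production traffic.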

Quantization-Aware Training (QAT)

QAT simulates low-precision behavior during fine-tuning/training so weights adapt to quantization noise.

  • Better quality retention when PTQ degrades important tasks.
  • Requires a cleaner data and eval pipeline.
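The core QAT mechanic is a "fake quantize" step in the forward pass: weights stay in float, but are snapped to the low-bit grid so training sees the rounding noise. A minimal scalar sketch (real frameworks apply this per tensor and pair it with a straight-through estimator for gradients):

```python
def fake_quantize(x, s, qmin=-128, qmax=127):
    """Quantize then immediately dequantize: the output is still a float,
    but it sits exactly on the low-bit grid, so the quantization error
    becomes visible to the training loss."""
    q = max(qmin, min(qmax, round(x / s)))
    return q * s

s = 0.1       # illustrative scale
w = 0.537     # a weight value
w_fq = fake_quantize(w, s)
print(w_fq)   # 0.5: snapped to the grid; the 0.037 gap is what QAT learns around
```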

| Timing type | Best when | Main risk | Typical owner |
|---|---|---|---|
| PTQ | You need speed and lower infra cost now | Quality drops on sensitive tasks | Inference/platform team |
| QAT | PTQ quality is below product threshold | Extra tuning cycles and GPU cost | Model + platform collaboration |

For most teams: PTQ first, QAT only when validation says PTQ is not enough.


โš™๏ธ By Scope: Which Parts of the LLM Get Quantized

Scope determines where quantization is applied in the model and runtime path.

| Scope option | What is quantized | Memory gain | Accuracy risk | Notes |
|---|---|---|---|---|
| Weights-only | Model parameters | High | Low to medium | Most common first step |
| Weights + activations | Parameters + runtime activations | Higher | Medium | Better throughput potential |
| Weights + activations + KV cache | Adds cache compression | Very high | Medium to high | Long-context quality needs careful testing |
| Selective/mixed scope | Some layers kept high precision | Medium to high | Lower | Practical compromise |

Common pattern: quantize most linear layer weights first, keep sensitive heads in higher precision, then add activations only after quality baselines pass.
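That pattern can be written down as a per-layer precision plan. A sketch with hypothetical layer names and a hypothetical sensitive list; any real plan comes from sensitivity analysis on your own model:

```python
# Hypothetical modules found to be quantization-sensitive.
SENSITIVE_PREFIXES = ("lm_head", "model.layers.0.self_attn")

def precision_plan(layer_names, default_bits=4, sensitive_bits=16):
    """Assign low-bit precision to most layers, keep sensitive ones high."""
    return {
        name: sensitive_bits if name.startswith(SENSITIVE_PREFIXES) else default_bits
        for name in layer_names
    }

layers = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.1.self_attn.q_proj",
    "lm_head",
]
plan = precision_plan(layers)
print(plan)  # lm_head and layer-0 attention stay at 16-bit, the rest drop to 4
```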


โš™๏ธ By Mapping: How Float Values Become Low-Bit Values

Mapping defines the numeric transformation from float tensors to low-bit formats.

Symmetric mapping

Values are centered around zero, typically with a single scale.

  • Simpler and often faster.
  • Works well when tensor distributions are roughly zero-centered.

Asymmetric mapping

Uses scale plus zero-point, allowing shifted ranges.

  • Better fit for non-zero-centered distributions.
  • Slightly more metadata/handling complexity.

Non-uniform mapping (example: NF4)

Not all quantization levels are equally spaced.

  • Better alignment with weight distributions in some LLMs.
  • Common in 4-bit weight quantization pipelines.

| Mapping type | Formula style | Hardware friendliness | Typical use |
|---|---|---|---|
| Symmetric | q = round(x / s) | High | INT8 weight or activation paths |
| Asymmetric | q = round(x / s) + z | High | INT8 with offset-friendly runtimes |
| Non-uniform | Codebook/learned bins | Medium | 4-bit LLM weight quantization |

If two methods use the same bit width but different mapping, quality can differ significantly.
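To see this concretely, here is a sketch quantizing the same shifted tensor with an 8-bit symmetric and an 8-bit asymmetric mapping (toy values; real implementations work per channel or per group):

```python
def quantize_sym(xs, bits=8):
    """Symmetric: one scale, zero maps to zero."""
    qmax = 2 ** (bits - 1) - 1
    s = max(abs(x) for x in xs) / qmax
    return [round(x / s) for x in xs], s

def quantize_asym(xs, bits=8):
    """Asymmetric: scale plus zero-point covers exactly [min, max]."""
    lo, hi = min(xs), max(xs)
    s = (hi - lo) / (2 ** bits - 1)
    z = round(-lo / s)
    return [round(x / s) + z for x in xs], s, z

# A shifted, non-zero-centered distribution: all values in [4.0, 6.0].
xs = [4.0, 4.5, 5.0, 5.5, 6.0]

q_sym, s_sym = quantize_sym(xs)
err_sym = max(abs(x - q * s_sym) for x, q in zip(xs, q_sym))

q_asym, s_asym, z = quantize_asym(xs)
err_asym = max(abs(x - (q - z) * s_asym) for x, q in zip(xs, q_asym))

print(err_sym, err_asym)  # asymmetric error is several times smaller here
```

Same bit budget, very different error: the zero-point lets the asymmetric grid spend all 256 levels on the occupied range [4, 6], while the symmetric grid wastes half its levels on negative values that never occur.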


⚙️ Weight Quantization vs Activation Quantization

Weight and activation quantization are the two most widely discussed practical approaches.

Weight quantization

Weight quantization compresses model parameters (usually linear layers).

Why it is popular:

  • Big memory savings with manageable quality impact.
  • Often enough to move from "cannot deploy" to "production feasible."

Typical setup:

  • 8-bit (safer) or 4-bit (more aggressive) weights.
  • Per-channel or group-wise scales.
  • Optional selective high precision for sensitive layers.
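Group-wise scaling, the middle bullet, can be sketched in a few lines (illustrative only; production kernels pack two signed 4-bit values per byte and store the scales separately):

```python
def quantize_groupwise(weights, group_size=4, bits=4):
    """One scale per group of weights instead of one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed int4
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    result = []
    for g in groups:
        scale = max(abs(w) for w in g) / qmax or 1.0  # guard all-zero groups
        q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in g]
        result.append((q, scale))
    return result

w = [0.01, -0.021, 0.015, 0.005,   # small-magnitude group: fine-grained scale
     1.5, -2.0, 0.8, 1.1]          # large-magnitude group: coarse scale
quantized = quantize_groupwise(w)
for q, scale in quantized:
    print(q, round(scale, 4))
```

The small-magnitude group gets its own fine-grained scale instead of being crushed by the large values in the other group, which is exactly why group-wise beats per-tensor scaling at 4-bit.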

Activation quantization

Activation quantization compresses intermediate runtime tensors produced during inference.

Why teams use it:

  • Further reduces bandwidth and memory traffic.

Why teams delay it:

  • More sensitive to input distribution shifts.
  • Requires strong eval coverage (long context, tool calling, domain prompts).

| Approach | Biggest benefit | Biggest challenge | Good default order |
|---|---|---|---|
| Weight quantization | Large memory reduction | Layer sensitivity at low bits | Start here |
| Activation quantization | Extra speed and memory gains | Runtime distribution sensitivity | Add second |

In short: weight quantization is the baseline optimization; activation quantization is the scaling optimization.


🧠 Deep Dive: Why Timing, Scope, and Mapping Interact at Runtime

The internals

At inference time, quantized LLMs run through three hidden mechanisms:

  1. Tensor representation changes: weights/activations stored in low-bit format with scales.
  2. Kernel path changes: runtime chooses quantized GEMM kernels if available.
  3. Rescaling/dequantization points: outputs are rescaled at specific boundaries.

Small scope or mapping changes can push execution to a different kernel path, so "smaller model" does not always mean lower latency.

Mathematical model (lightweight)

A common affine quantization mapping is:

$$ q = \text{clip}\left(\text{round}\left(\frac{x}{s}\right) + z, q_{\min}, q_{\max}\right) $$

$$ \hat{x} = s \cdot (q - z) $$

The reconstruction error is:

$$ e = x - \hat{x} $$

Your deployment goal is to keep e small enough that task-level metrics stay within budget.
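The two equations above translate directly into code. A scalar sketch (s and z here are illustrative; real tensors carry one scale per channel or group):

```python
def quantize(x, s, z, qmin=-128, qmax=127):
    # q = clip(round(x / s) + z, qmin, qmax)
    return max(qmin, min(qmax, round(x / s) + z))

def dequantize(q, s, z):
    # x_hat = s * (q - z)
    return s * (q - z)

s, z = 0.05, 10
for x in [0.73, -1.2, 6.5]:   # 6.5 lies outside the representable range
    q = quantize(x, s, z)
    x_hat = dequantize(q, s, z)
    print(f"x={x:+.2f}  q={q:4d}  x_hat={x_hat:+.3f}  e={x - x_hat:+.3f}")
```

In-range values round-trip with |e| bounded by s/2; the out-of-range value is clipped to qmax and picks up a much larger error, which is why scale selection (and hence calibration) matters.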

Performance analysis

For decoder-only LLMs, per-layer compute class stays similar to the unquantized path, but constants change:

  • Time complexity trend: operation class stays similar, but lower precision often improves throughput through lower memory transfer.
  • Space complexity trend: parameter memory roughly scales with bit width (FP16 to INT8 is about 2x smaller; FP16 to 4-bit is about 4x smaller before metadata overhead).
  • Bottlenecks: memory bandwidth, unsupported kernels, and dequantization overhead can limit gains.
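The space trend in the bullets above is simple arithmetic (parameter memory only; scales, zero-points, and the KV cache add on top):

```python
def param_memory_gb(n_params, bits):
    """Raw parameter storage: n_params values at `bits` bits each."""
    return n_params * bits / 8 / 1e9

n = 70e9  # roughly LLaMA-3-70B
for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{param_memory_gb(n, bits):.0f} GB")
# fp16: ~140 GB, int8: ~70 GB, 4-bit: ~35 GB, matching the VRAM table earlier
```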

| Change | Usually improves | Can regress when |
|---|---|---|
| Lower weight precision | Memory footprint, model fit | Kernel path is not optimized |
| Activation quantization | Throughput, memory traffic | Calibration misses production distribution |
| More aggressive mapping | Compression ratio | Quantization error hurts key tasks |

📊 Visualizing a Quantization Strategy Flow

```mermaid
flowchart TD
    A[Start with FP16 or BF16 LLM] --> B[Choose Timing Axis]
    B --> C{PTQ or QAT?}
    C -->|PTQ| D[Select Scope: Weights Only]
    C -->|QAT| E[Train with Quantization Simulation]
    D --> F[Select Mapping: Symmetric Asymmetric NF4]
    E --> F
    F --> G[Benchmark Memory Latency Quality]
    G --> H{Targets met?}
    H -- No --> I[Expand Scope or Adjust Mapping]
    I --> G
    H -- Yes --> J[Canary Deploy + Fallback]
    J --> K[Production Rollout]
```

๐ŸŒ Real-World Applications: Input, Process, Output

Case study 1: Customer support assistant on shared GPUs

| Stage | Details |
|---|---|
| Input | Multilingual support prompts, medium context, strict p95 latency |
| Process | PTQ + INT8 weights first, then selective activation quantization on stable layers |
| Output | Lower memory usage, better concurrency, acceptable quality drift |

Case study 2: Long-context legal assistant

| Stage | Details |
|---|---|
| Input | Long-context legal prompts with domain terms |
| Process | Weight-only 4-bit in most layers, output head and selected attention blocks kept BF16 |
| Output | Model fits target hardware, but long-context eval required extra iteration |

Both cases succeed by sequencing decisions across timing, scope, and mapping instead of maximizing compression immediately.


โš–๏ธ Trade-offs & Failure Modes: Trade-offs and Failure Modes You Should Expect

| Trade-off or failure mode | What it looks like in production | Mitigation |
|---|---|---|
| Memory saved, quality drops | Answers remain fluent but become less accurate | Add task-specific eval thresholds before rollout |
| Low-bit model, no latency gain | Smaller model but unchanged p95 | Validate backend kernel support early |
| Activation drift | Good offline metrics, bad real traffic performance | Use representative calibration and shadow traffic |
| Over-quantized sensitive layers | Hallucinations or format breakage in structured tasks | Keep selective layers in higher precision |
| Aggressive scope change | Improvements on average, poor long-tail reliability | Canary release with automated rollback |

Intermediate-level rule: do not ship quantization based on memory metrics alone. Always include task-quality and tail-latency checks.


🧭 Decision Guide: Choosing by Constraint

| Situation | Recommendation |
|---|---|
| Use when | Start with PTQ + weight quantization (INT8 or safe 4-bit) when memory and cost are immediate problems. |
| Avoid when | Avoid activation quantization as the first move if you do not have production-like calibration/eval data. |
| Alternative | Use mixed precision: quantize most layers, keep sensitive modules higher precision. |
| Edge cases | For long context, tool use, or strict JSON output, run dedicated eval suites before full rollout. |

If deployment is blocked by memory, optimize scope first. If quality fails after PTQ, revisit timing (QAT). If two same-bit methods differ, inspect mapping and kernel support.


🧪 Practical Examples: Weight-First and Activation-Aware Paths

Example 1: Weight quantization with 4-bit loading

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_cfg,
    device_map="auto",
)

prompt = "List three trade-offs of LLM quantization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=80)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

What this demonstrates: a weight-first quantization strategy that is widely used for fast prototyping and production pilots.

Activation quantization in full LLM stacks is backend-dependent. Apply it after a weight-only baseline passes, then validate with production-like prompts, long-context tests, and structured-output checks.


๐Ÿ› ๏ธ AutoGPTQ, AutoAWQ, and bitsandbytes: Quantization Libraries in Practice

bitsandbytes (the bnb library by Tim Dettmers) integrates directly with the HuggingFace transformers from_pretrained() API to load models in INT8 or NF4 (4-bit) precision without a separate quantization step; it is the fastest path from a HuggingFace model card to a quantized inference session.

AutoGPTQ implements the GPTQ algorithm (layer-wise weight quantization guided by approximate second-order Hessian information) for aggressive 4-bit quantization with better quality retention than naive round-to-nearest. AutoAWQ implements AWQ (Activation-Aware Weight Quantization), which uses activation statistics to identify and preserve the roughly 1% of weight channels most important to output quality.

```python
# ── bitsandbytes: NF4 (4-bit NormalFloat) via HuggingFace BitsAndBytesConfig ──
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # non-uniform 4-bit: better weight distribution fit
    bnb_4bit_use_double_quant=True,      # quantize the scale factors too (~0.4-bit savings)
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_bnb = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_cfg,
    device_map="auto",
)

# ── AutoGPTQ: GPTQ 4-bit from a pre-quantized model checkpoint ────────────────
from auto_gptq import AutoGPTQForCausalLM

model_gptq = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
    model_basename="model",
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,    # set True for faster inference on supported GPUs
)

# ── AutoAWQ: AWQ 4-bit from a pre-quantized model checkpoint ──────────────────
from awq import AutoAWQForCausalLM

model_awq = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    fuse_layers=True,    # fuse attention layers for ~20% throughput improvement
    trust_remote_code=False,
    safetensors=True,
)

# ── Comparison: same prompt, three quantization backends ──────────────────────
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
prompt = "Explain the difference between PTQ and QAT in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

for name, model in [("bnb-nf4", model_bnb), ("gptq", model_gptq), ("awq", model_awq)]:
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=60)
    print(f"\n[{name}]", tokenizer.decode(out[0], skip_special_tokens=True))
```

| Library | Algorithm | Timing | Best for | Quality vs. NF4 |
|---|---|---|---|---|
| bitsandbytes | NF4 / INT8 | PTQ (load-time) | Fast prototyping, LoRA fine-tuning | Baseline |
| AutoGPTQ | GPTQ | PTQ (offline, calibration) | Inference-only deployments needing small model size | +0.5–1% on coding/math |
| AutoAWQ | AWQ | PTQ (offline, activation-aware) | Low-bit deployment with quality-critical tasks | +0.5–1.5% on reasoning |

fuse_layers=True in AWQ merges attention and MLP sub-modules into fused CUDA kernels. This is a runtime optimization layered on top of the activation-aware quantization, so AWQ's benefit shows up as throughput as well as reduced model size.

For a full deep-dive on AutoGPTQ calibration pipelines, AWQ activation channel analysis, and bitsandbytes QLoRA fine-tuning workflows, a dedicated follow-up post is planned.

📚 Lessons Learned from Quantization Projects

  • Classify decisions by timing, scope, and mapping before choosing tools.
  • Weight quantization is usually the highest-ROI first step.
  • Activation quantization can unlock additional speed, but calibration quality becomes critical.
  • Same bit width does not mean same quality; mapping and granularity matter.
  • Kernel compatibility can dominate real latency outcomes.
  • Selective precision is often better than aggressive all-layer quantization.

📌 TLDR: Summary & Key Takeaways

  • LLM quantization is best understood as a 3-axis design space: timing, scope, and mapping.
  • By Timing: PTQ is fast and practical; QAT helps recover quality when PTQ is insufficient.
  • By Scope: start with weights, then add activations if needed.
  • By Mapping: symmetric, asymmetric, and non-uniform mappings create different error behavior.
  • Weight quantization is the most common production entry point.
  • Activation quantization is powerful but requires stronger evaluation discipline.
  • Production success depends on joint optimization of memory, latency, and task-level quality.

One-liner: The best quantization strategy is the one that meets your product SLA with the smallest quality compromise, not the one with the lowest bit count.


๐Ÿ“ Practice Quiz

  1. Which axis answers "when do we introduce low precision into the model lifecycle"?

    • A) Scope
    • B) Timing
    • C) Mapping

    Correct Answer: B

  2. You reduced a model to 4-bit weights but p95 latency did not improve. What is the most likely next check?

    Correct Answer: Validate kernel/backend support and dequantization overhead on the target hardware.

  3. In most teams, which approach is the safer first step before activation quantization?

    Correct Answer: PTQ with weight quantization (often INT8 or conservative 4-bit).

  4. Open-ended: Design an evaluation plan for adding activation quantization to an LLM that must generate valid JSON and handle long-context prompts.


Written by Abstract Algorithms (@abstractalgorithms)