Types of LLM Quantization: By Timing, Scope, and Mapping
PTQ, QAT, INT8, INT4, and NF4 explained through timing, scope, and mapping choices.
Abstract Algorithms
TLDR: There is no single "best" LLM quantization. You classify and choose quantization along three axes: when you quantize (timing), what you quantize (scope), and how values are encoded (mapping). In practice, most teams start with weight quantization, then add activation quantization when needed.
Quantization Is a Design Space, Not One Switch
Deploying LLaMA-3-70B at full fp16 precision requires approximately 140GB of VRAM: two A100 80GB GPUs at roughly $8–12K/month in cloud GPU rental. At Q4_K_M quantization (roughly 4.5 effective bits per weight), the same model needs about 35–42GB and fits on a single 48GB GPU, or across two 24GB consumer cards. The measured quality difference on most standard benchmarks? Often under 2%.
That is the practical upside, but "just quantize it to 4-bit" is not a strategy; it is a gamble. Teams that apply the wrong quantization type to the wrong model component regularly discover quality regressions in production that never appeared in offline evals.
Here is a concrete picture of the memory trade-off:
| Model | Precision | VRAM Required | Approximate Hardware |
| --- | --- | --- | --- |
| LLaMA-3-70B | fp16 | ~140 GB | 2× A100 80GB |
| LLaMA-3-70B | INT8 | ~70 GB | 1× A100 80GB |
| LLaMA-3-70B | Q4_K_M | ~35–42 GB | 1× 48GB GPU (e.g., A6000) or 2× RTX 3090 |
| LLaMA-3-8B | Q4_K_M | ~4.5 GB | Consumer laptop GPU |
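The table above follows from a simple rule of thumb: weight memory is roughly parameter count times bits per weight, divided by 8, plus a little metadata. A minimal sketch (the function name and the 5% overhead factor are illustrative assumptions, not a library API):

```python
def estimate_weight_memory_gb(n_params: float, bits_per_weight: float,
                              overhead: float = 1.05) -> float:
    """Rough VRAM estimate for weights only (no KV cache, no activations).

    `overhead` loosely accounts for quantization metadata such as scales.
    """
    bytes_total = n_params * bits_per_weight / 8 * overhead
    return bytes_total / 1e9

# 70B parameters at fp16, INT8, and ~4.5 effective bits (Q4_K_M-like)
for label, bits in [("fp16", 16), ("int8", 8), ("q4_k_m-ish", 4.5)]:
    print(label, round(estimate_weight_memory_gb(70e9, bits), 1), "GB")
```

Real deployments also need headroom for the KV cache and activations, which is why the hardware column above is approximate.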
The right question is not "should I quantize?" but rather: "Which quantization type fits my latency, memory, and quality budget?"
This post uses a taxonomy approach. Instead of memorizing tool names, classify every method by:
- Timing: when low precision enters the pipeline.
- Scope: which tensors or components are quantized.
- Mapping: how float values are mapped to low-bit representations.
| If your main pain is... | Your first axis to optimize |
| --- | --- |
| Model does not fit GPU/edge memory | Scope (start with weights) |
| Cost per token is too high | Scope + Mapping |
| Accuracy regression after PTQ | Timing (move toward QAT) |
| Latency remains high despite smaller model | Mapping + kernel compatibility |
Three Classification Axes You Can Apply to Any Quantized LLM
Before selecting a library or hardware backend, use this compact classifier:
| Axis | Core question | Common options | Typical impact |
| --- | --- | --- | --- |
| Timing | When do we apply quantization? | PTQ, QAT | Accuracy retention vs implementation effort |
| Scope | What model parts are quantized? | Weights-only, weights+activations, KV cache | Memory and throughput |
| Mapping | How are floats represented in low bits? | Symmetric, asymmetric, non-uniform (NF4) | Error profile and hardware efficiency |
Use timing for lifecycle decisions, scope for memory/bandwidth impact, and mapping for error behavior.
By Timing: PTQ vs QAT
Timing answers when quantization appears during the model lifecycle.
Post-Training Quantization (PTQ)
PTQ quantizes an already trained model. You do not retrain from scratch.
- Fastest path to deployment.
- Good first step for most LLM serving workloads.
PTQ can be static (activation scales fixed once using a calibration set) or dynamic (activation scales recomputed at runtime, typically per batch).
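The static case can be made concrete with a toy calibrator. This sketch (function names are illustrative, not a library API) derives one fixed symmetric INT8 scale from calibration batches; a dynamic scheme would instead recompute the scale at inference time:

```python
def calibrate_scale(samples, n_bits=8):
    """Static PTQ: derive one symmetric scale from calibration activations."""
    absmax = max(abs(v) for batch in samples for v in batch)
    qmax = 2 ** (n_bits - 1) - 1          # 127 for INT8
    return absmax / qmax if absmax else 1.0

def quantize(values, scale, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1
    qmin = -qmax - 1                      # -128 for INT8
    return [max(qmin, min(qmax, round(v / scale))) for v in values]

# Calibration batches stand in for "representative traffic".
calib = [[-0.9, 0.3, 1.2], [0.7, -1.1, 0.05]]
s = calibrate_scale(calib)        # fixed at calibration time (static PTQ)
print(quantize([1.2, -0.6], s))   # dynamic PTQ would recompute s per batch
```

If production traffic exceeds the calibrated range, values clip to qmin/qmax, which is one reason calibration data must resemble real traffic.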
Quantization-Aware Training (QAT)
QAT simulates low-precision behavior during fine-tuning/training so weights adapt to quantization noise.
- Better quality retention when PTQ degrades important tasks.
- Requires a cleaner data and eval pipeline.
| Timing type | Best when | Main risk | Typical owner |
| --- | --- | --- | --- |
| PTQ | You need speed and lower infra cost now | Quality drops on sensitive tasks | Inference/platform team |
| QAT | PTQ quality is below product threshold | Extra tuning cycles and GPU cost | Model + platform collaboration |
For most teams: PTQ first, QAT only when validation says PTQ is not enough.
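The mechanism QAT adds during training is "fake quantization": the forward pass rounds values to the low-bit grid and immediately dequantizes them, so the loss sees quantization noise while weights stay in float (the backward pass typically treats the rounding as identity, the straight-through estimator). A toy sketch with illustrative names and a per-tensor scale assumed:

```python
def fake_quantize(x, scale, n_bits=4):
    """Quantize-dequantize in one step: the forward pass a QAT layer sees.

    Training against x_hat (not x) lets weights adapt to quantization noise.
    Gradients flow through as if round() were the identity function.
    """
    qmax = 2 ** (n_bits - 1) - 1
    qmin = -qmax - 1
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale   # x_hat: still a float, but only 2**n_bits distinct levels

print(fake_quantize(0.33, scale=0.1))   # snaps to the nearest 4-bit level
```

Everything PTQ does at deploy time, QAT simulates inside the training loop so the optimizer can compensate.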
By Scope: Which Parts of the LLM Get Quantized
Scope determines where quantization is applied in the model and runtime path.
| Scope option | What is quantized | Memory gain | Accuracy risk | Notes |
| --- | --- | --- | --- | --- |
| Weights-only | Model parameters | High | Low to medium | Most common first step |
| Weights + activations | Parameters + runtime activations | Higher | Medium | Better throughput potential |
| Weights + activations + KV cache | Adds cache compression | Very high | Medium to high | Long-context quality needs careful testing |
| Selective/mixed scope | Some layers kept high precision | Medium to high | Lower | Practical compromise |
Common pattern: quantize most linear layer weights first, keep sensitive heads in higher precision, then add activations only after quality baselines pass.
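This selective-scope pattern is easy to express as a plan over module names. A sketch, assuming a LLaMA-style module naming scheme; the choice of sensitive patterns here is illustrative, not a recommendation:

```python
# Assumption: embeddings, the output head, and the first block are "sensitive".
SENSITIVE_PATTERNS = ("embed", "lm_head", "layers.0.")

def plan_scope(module_names):
    """Assign a precision per module: a selective/mixed-scope plan."""
    plan = {}
    for name in module_names:
        if any(p in name for p in SENSITIVE_PATTERNS):
            plan[name] = "bf16"   # sensitive: keep high precision
        else:
            plan[name] = "int4"   # bulk linear weights: quantize
    return plan

modules = ["model.embed_tokens", "model.layers.0.self_attn.q_proj",
           "model.layers.12.mlp.gate_proj", "lm_head"]
for name, prec in plan_scope(modules).items():
    print(f"{prec:5s} {name}")
```

Real tools expose the same idea through options such as skip lists or per-module overrides; the point is that scope is a per-module decision, not a global flag.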
By Mapping: How Float Values Become Low-Bit Values
Mapping defines the numeric transformation from float tensors to low-bit formats.
Symmetric mapping
Values are centered around zero, typically with a single scale.
- Simpler and often faster.
- Works well when tensor distributions are roughly zero-centered.
Asymmetric mapping
Uses scale plus zero-point, allowing shifted ranges.
- Better fit for non-zero-centered distributions.
- Slightly more metadata/handling complexity.
Non-uniform mapping (example: NF4)
Not all quantization levels are equally spaced.
- Better alignment with weight distributions in some LLMs.
- Common in 4-bit weight quantization pipelines.
| Mapping type | Formula style | Hardware friendliness | Typical use |
| --- | --- | --- | --- |
| Symmetric | q = round(x / s) | High | INT8 weight or activation paths |
| Asymmetric | q = round(x / s) + z | High | INT8 with offset-friendly runtimes |
| Non-uniform | Codebook/learned bins | Medium | 4-bit LLM weight quantization |
If two methods use the same bit width but different mapping, quality can differ significantly.
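The difference the zero-point makes shows up directly in reconstruction error. This sketch (toy per-tensor scales, illustrative function names) quantizes a value from a non-negative range with both mappings:

```python
def quant_symmetric(x, s, n_bits=8):
    """q = round(x/s), clipped; returns the dequantized value."""
    qmax = 2 ** (n_bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(x / s)))
    return q * s

def quant_asymmetric(x, s, z, n_bits=8):
    """q = round(x/s) + z, clipped to [0, 2^n - 1]; returns dequantized value."""
    q = max(0, min(2 ** n_bits - 1, round(x / s) + z))
    return s * (q - z)

# A shifted range like [0, 6] wastes half the symmetric grid on negative
# levels that never occur; the asymmetric zero-point uses all 256 levels.
lo, hi = 0.0, 6.0
s_sym = max(abs(lo), abs(hi)) / 127
s_asym = (hi - lo) / 255
z = round(-lo / s_asym)

x = 3.14
print("sym error :", abs(x - quant_symmetric(x, s_sym)))
print("asym error:", abs(x - quant_asymmetric(x, s_asym, z)))
```

Same 8 bits, different mapping, measurably different error: that is the point of the table above.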
Popular Approaches: Weight Quantization and Activation Quantization
These are the two most widely discussed practical approaches.
Weight quantization
Weight quantization compresses model parameters (usually linear layers).
Why it is popular:
- Big memory savings with manageable quality impact.
- Often enough to move from "cannot deploy" to "production feasible."
Typical setup:
- 8-bit (safer) or 4-bit (more aggressive) weights.
- Per-channel or group-wise scales.
- Optional selective high precision for sensitive layers.
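Group-wise scaling, used by formats such as Q4_K_M and GPTQ group quantization, keeps one scale per small block of weights instead of one per tensor, so a single outlier only distorts its own group. A toy sketch (illustrative names, symmetric mapping assumed):

```python
def quantize_groupwise(weights, group_size=4, n_bits=4):
    """Group-wise symmetric quantization: one scale per `group_size` weights.

    Smaller groups track local ranges better at the cost of more scale
    metadata (which is where the "before metadata overhead" caveat comes from).
    """
    qmax = 2 ** (n_bits - 1) - 1
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        absmax = max(abs(w) for w in group)
        scale = absmax / qmax if absmax else 1.0
        qs = [max(-qmax - 1, min(qmax, round(w / scale))) for w in group]
        out.append((qs, scale))
    return out

# Two very different local ranges: one global scale would crush the first group.
w = [0.01, -0.02, 0.015, 0.005, 0.9, -1.1, 0.8, 0.2]
for qs, s in quantize_groupwise(w):
    print(qs, round(s, 4))
```

With a single per-tensor scale, the small weights in the first group would all round to zero; per-group scales preserve them.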
Activation quantization
Activation quantization compresses intermediate runtime tensors produced during inference.
Why teams use it:
- Further reduces bandwidth and memory traffic.
Why teams delay it:
- More sensitive to input distribution shifts.
- Requires strong eval coverage (long context, tool calling, domain prompts).
| Approach | Biggest benefit | Biggest challenge | Good default order |
| --- | --- | --- | --- |
| Weight quantization | Large memory reduction | Layer sensitivity at low bits | Start here |
| Activation quantization | Extra speed and memory gains | Runtime distribution sensitivity | Add second |
In short: weight quantization is the baseline optimization; activation quantization is the scaling optimization.
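Static activation quantization boils down to watching representative traffic and recording value ranges per tensor. A minimal observer sketch (the class name is illustrative; real backends attach equivalent logic via forward hooks):

```python
class ActivationObserver:
    """Running min/max tracker: the core of static activation calibration."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, batch):
        self.lo = min(self.lo, min(batch))
        self.hi = max(self.hi, max(batch))

    def asymmetric_params(self, n_bits=8):
        # Extend the range to include 0 so zero activations stay exact.
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)
        scale = (hi - lo) / (2 ** n_bits - 1)
        zero_point = round(-lo / scale) if scale else 0
        return scale, zero_point

obs = ActivationObserver()
for batch in [[0.1, 4.2, 0.0], [2.7, 9.6, 0.4]]:  # stand-ins for calibration traffic
    obs.observe(batch)
print(obs.asymmetric_params())
```

The fragility is visible in the code: if production activations exceed the observed max, they clip, which is exactly the distribution-shift risk listed above.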
Deep Dive: Why Timing, Scope, and Mapping Interact Inside the Runtime
The internals
At inference time, quantized LLMs run through three hidden mechanisms:
- Tensor representation changes: weights/activations stored in low-bit format with scales.
- Kernel path changes: runtime chooses quantized GEMM kernels if available.
- Rescaling/dequantization points: outputs are rescaled at specific boundaries.
Small scope or mapping changes can push execution to a different kernel path, so "smaller model" does not always mean lower latency.
Mathematical model (lightweight)
A common affine quantization mapping is:
$$ q = \text{clip}\left(\text{round}\left(\frac{x}{s}\right) + z, q_{\min}, q_{\max}\right) $$
$$ \hat{x} = s \cdot (q - z) $$
The reconstruction error is:
$$ e = x - \hat{x} $$
Your deployment goal is to keep $e$ small enough that task-level metrics stay within budget.
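Plugging numbers into the affine formulas above makes the error tangible: for values inside the clipping range, round-to-nearest guarantees $|e| \le s/2$. A quick sketch (illustrative function names):

```python
def affine_quantize(x, s, z, qmin=-128, qmax=127):
    """q = clip(round(x/s) + z, qmin, qmax)"""
    return max(qmin, min(qmax, round(x / s) + z))

def affine_dequantize(q, s, z):
    """x_hat = s * (q - z)"""
    return s * (q - z)

s, z = 0.05, 3
for x in [1.0, 1.013, -6.4]:
    q = affine_quantize(x, s, z)
    x_hat = affine_dequantize(q, s, z)
    print(f"x={x:+.3f}  q={q:+4d}  x_hat={x_hat:+.3f}  e={x - x_hat:+.4f}")
```

The per-value bound $s/2$ is why scale granularity (per-tensor vs per-channel vs per-group) matters: smaller scales over tighter value ranges mean smaller worst-case error.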
Performance analysis
For decoder-only LLMs, per-layer compute class stays similar to the unquantized path, but constants change:
- Time complexity trend: the operation class stays the same, but lower precision often improves throughput because less data moves between memory and compute.
- Space complexity trend: parameter memory roughly scales with bit width (FP16 to INT8 is about 2x smaller; FP16 to 4-bit is about 4x smaller before metadata overhead).
- Bottlenecks: memory bandwidth, unsupported kernels, and dequantization overhead can limit gains.
| Change | Usually improves | Can regress when |
| --- | --- | --- |
| Lower weight precision | Memory footprint, model fit | Kernel path is not optimized |
| Activation quantization | Throughput, memory traffic | Calibration misses production distribution |
| More aggressive mapping | Compression ratio | Quantization error hurts key tasks |
Visualizing a Quantization Strategy Flow
```mermaid
flowchart TD
    A[Start with FP16 or BF16 LLM] --> B[Choose Timing Axis]
    B --> C{PTQ or QAT?}
    C -->|PTQ| D[Select Scope: Weights Only]
    C -->|QAT| E[Train with Quantization Simulation]
    D --> F[Select Mapping: Symmetric / Asymmetric / NF4]
    E --> F
    F --> G[Benchmark Memory, Latency, Quality]
    G --> H{Targets met?}
    H -- No --> I[Expand Scope or Adjust Mapping]
    I --> G
    H -- Yes --> J[Canary Deploy + Fallback]
    J --> K[Production Rollout]
```
Real-World Applications: Input, Process, Output
Case study 1: Customer support assistant on shared GPUs
| Stage | Details |
| --- | --- |
| Input | Multilingual support prompts, medium context, strict p95 latency |
| Process | PTQ + INT8 weights first, then selective activation quantization on stable layers |
| Output | Lower memory usage, better concurrency, acceptable quality drift |
Case study 2: On-prem legal drafting assistant with long context
| Stage | Details |
| --- | --- |
| Input | Long-context legal prompts with domain terms |
| Process | Weight-only 4-bit in most layers, output head and selected attention blocks kept BF16 |
| Output | Model fits target hardware, but long-context eval required extra iteration |
Both cases succeed by sequencing decisions across timing, scope, and mapping instead of maximizing compression immediately.
Trade-offs and Failure Modes You Should Expect
| Trade-off or failure mode | What it looks like in production | Mitigation |
| --- | --- | --- |
| Memory saved, quality drops | Answers remain fluent but become less accurate | Add task-specific eval thresholds before rollout |
| Low-bit model, no latency gain | Smaller model but unchanged p95 | Validate backend kernel support early |
| Activation drift | Good offline metrics, bad real traffic performance | Use representative calibration and shadow traffic |
| Over-quantized sensitive layers | Hallucinations or format breakage in structured tasks | Keep selective layers in higher precision |
| Aggressive scope change | Improvements on average, poor long-tail reliability | Canary release with automated rollback |
Intermediate-level rule: do not ship quantization based on memory metrics alone. Always include task-quality and tail-latency checks.
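That rule can be enforced mechanically as a pre-rollout gate. A sketch, with illustrative metric names and thresholds you would replace with your own SLA:

```python
def rollout_gate(metrics, budget):
    """Block a quantized-model rollout unless every budgeted metric passes.

    Thresholds and metric names here are illustrative placeholders.
    """
    failures = []
    if metrics["task_accuracy"] < budget["min_task_accuracy"]:
        failures.append("task accuracy below threshold")
    if metrics["p95_latency_ms"] > budget["max_p95_latency_ms"]:
        failures.append("tail latency over budget")
    if metrics["json_validity_rate"] < budget["min_json_validity_rate"]:
        failures.append("structured-output regressions")
    return (len(failures) == 0, failures)

budget = {"min_task_accuracy": 0.97, "max_p95_latency_ms": 850,
          "min_json_validity_rate": 0.995}
ok, why = rollout_gate({"task_accuracy": 0.981, "p95_latency_ms": 910,
                        "json_validity_rate": 0.999}, budget)
print("ship!" if ok else f"blocked: {why}")
```

Note the gate fails this example on tail latency even though accuracy and JSON validity pass: memory savings alone never clear the bar.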
Decision Guide: Choosing by Constraint
| Situation | Recommendation |
| --- | --- |
| Use when | Start with PTQ + weight quantization (INT8 or safe 4-bit) when memory and cost are immediate problems. |
| Avoid when | Avoid activation quantization as the first move if you do not have production-like calibration/eval data. |
| Alternative | Use mixed precision: quantize most layers, keep sensitive modules higher precision. |
| Edge cases | For long context, tool use, or strict JSON output, run dedicated eval suites before full rollout. |
If deployment is blocked by memory, optimize scope first. If quality fails after PTQ, revisit timing (QAT). If two same-bit methods differ, inspect mapping and kernel support.
Practical Examples: Weight-First and Activation-Aware Paths
Example 1: Weight quantization with 4-bit loading
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_cfg,
    device_map="auto",
)

prompt = "List three trade-offs of LLM quantization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
What this demonstrates: a weight-first quantization strategy that is widely used for fast prototyping and production pilots.
Activation quantization in full LLM stacks is backend-dependent. Apply it after a weight-only baseline passes, then validate with production-like prompts, long-context tests, and structured-output checks.
AutoGPTQ, AutoAWQ, and bitsandbytes: Quantization Libraries in Practice
bitsandbytes (the bnb library by Tim Dettmers) integrates directly with the HuggingFace transformers from_pretrained() API to load models in INT8 or NF4 (4-bit) precision without a separate quantization step; it is the fastest path from a HuggingFace model card to a quantized inference session.
AutoGPTQ implements the GPTQ algorithm (layer-wise weight quantization guided by approximate second-order, Hessian-based information) for aggressive 4-bit quantization with better quality retention than naive round-to-nearest. AutoAWQ implements the AWQ (Activation-Aware Weight Quantization) algorithm, which identifies the small fraction (roughly 1%) of weight channels most important to output quality and protects them via per-channel scaling before quantization.
```python
# ── bitsandbytes: NF4 (4-bit NormalFloat) via HuggingFace BitsAndBytesConfig ──
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # non-uniform 4-bit: better weight distribution fit
    bnb_4bit_use_double_quant=True,        # quantize the scale factors too (~0.4-bit savings)
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_bnb = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_cfg,
    device_map="auto",
)

# ── AutoGPTQ: GPTQ 4-bit from a pre-quantized model checkpoint ────────────────
from auto_gptq import AutoGPTQForCausalLM

model_gptq = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
    model_basename="model",
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,                      # set True for faster inference on supported GPUs
)

# ── AutoAWQ: AWQ 4-bit from a pre-quantized model checkpoint ──────────────────
from awq import AutoAWQForCausalLM

model_awq = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    fuse_layers=True,                      # fuse attention layers for ~20% throughput improvement
    trust_remote_code=False,
    safetensors=True,
)

# ── Comparison: same prompt, three quantization backends ──────────────────────
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
prompt = "Explain the difference between PTQ and QAT in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

for name, model in [("bnb-nf4", model_bnb), ("gptq", model_gptq), ("awq", model_awq)]:
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=60)
    print(f"\n[{name}]", tokenizer.decode(out[0], skip_special_tokens=True))
```
| Library | Algorithm | Timing | Best for | Quality vs. NF4 |
| --- | --- | --- | --- | --- |
| bitsandbytes | NF4 / INT8 | PTQ (load-time) | Fast prototyping, LoRA fine-tuning | Baseline |
| AutoGPTQ | GPTQ | PTQ (offline, calibration) | Inference-only deployments needing small model size | +0.5–1% on coding/math |
| AutoAWQ | AWQ | PTQ (offline, activation-aware) | Low-bit deployment with quality-critical tasks | +0.5–1.5% on reasoning |
fuse_layers=True in AutoAWQ fuses attention and MLP modules into optimized CUDA kernels. Note that this fusion is a runtime optimization layered on top of AWQ's activation-aware scale selection: the fusion buys throughput, while the activation-aware quantization itself is what protects quality.
For a full deep-dive on AutoGPTQ calibration pipelines, AWQ activation channel analysis, and bitsandbytes QLoRA fine-tuning workflows, a dedicated follow-up post is planned.
Lessons Learned from Quantization Projects
- Classify decisions by timing, scope, and mapping before choosing tools.
- Weight quantization is usually the highest-ROI first step.
- Activation quantization can unlock additional speed, but calibration quality becomes critical.
- Same bit width does not mean same quality; mapping and granularity matter.
- Kernel compatibility can dominate real latency outcomes.
- Selective precision is often better than aggressive all-layer quantization.
TLDR: Summary & Key Takeaways
- LLM quantization is best understood as a 3-axis design space: timing, scope, and mapping.
- By Timing: PTQ is fast and practical; QAT helps recover quality when PTQ is insufficient.
- By Scope: start with weights, then add activations if needed.
- By Mapping: symmetric, asymmetric, and non-uniform mappings create different error behavior.
- Weight quantization is the most common production entry point.
- Activation quantization is powerful but requires stronger evaluation discipline.
- Production success depends on joint optimization of memory, latency, and task-level quality.
One-liner: The best quantization strategy is the one that meets your product SLA with the smallest quality compromise, not the one with the lowest bit count.
Practice Quiz
Which axis answers "when do we introduce low precision into the model lifecycle"?
- A) Scope
- B) Timing
- C) Mapping
Correct Answer: B
You reduced a model to 4-bit weights but p95 latency did not improve. What is the most likely next check?
Correct Answer: Validate kernel/backend support and dequantization overhead on the target hardware.
In most teams, which approach is the safer first step before activation quantization?
Correct Answer: PTQ with weight quantization (often INT8 or conservative 4-bit).
Open-ended: Design an evaluation plan for adding activation quantization to an LLM that must generate valid JSON and handle long-context prompts.