Series

LLM Engineering

A structured path through AI engineering, retrieval, evaluation, and production guardrails.

Articles: 49

Estimated reading: 14h 7m

Knowledge level: Intermediate to Advanced

Readers: 2,964

About this series

Learn with real-world examples

Connect articles into a structured path

Best practices and trade-offs

Interview-focused insights

Continuously updated content

Who is this for?

Software engineers and developers learning this topic.

Last Updated

May 30, 2026

Created by

Abstract Algorithms

All Articles

Article 1

ANN Index Types Explained: When to Choose Flat, HNSW, IVF, or IVF-PQ

TLDR: If your dataset is small and correctness is critical, use Flat. If you need high recall with low latency and enough RAM, use HNSW. If your corpus is huge and memory is your bottleneck, use IVF-P

14 min read

Article 2

RAG vs Fine-Tuning: When to Use Each (and When to Combine Them)

TLDR: Use RAG when facts change frequently and answers must be source-grounded. Use fine-tuning when you need stable behavior: tone, format, and domain-specific reasoning. Use RAG + fine-t

31 min read

Article 3

Fine-Tuning LLMs with LoRA and QLoRA: A Practical Deep-Dive

TLDR: LoRA freezes the base model and trains two tiny matrices per layer: 0.1% of parameters, 70% less GPU memory, near-identical quality. QLoRA adds 4-bit NF4 quantization of the frozen base, enab

31 min read

Article 4

Build vs Buy: Deploying Your Own LLM vs Using ChatGPT, Gemini, and Claude APIs

TLDR: Use the API until you hit $10K/month or a hard data privacy requirement. Then add a semantic cache. Then evaluate hybrid routing. Self-hosting full model serving is only cost-effective at > 50M

31 min read

Article 5

Fine-Tuning LLMs: The Complete Engineer's Guide to SFT, LoRA, and RLHF

TLDR: A pretrained LLM is a generalist. Fine-tuning makes it a specialist. Supervised Fine-Tuning (SFT) teaches it your domain's language through labeled examples. LoRA does the same with 99% fewer tr

30 min read

Article 6

Chain of Thought Prompting: Teaching LLMs to Think Step by Step

TLDR: Chain of Thought (CoT) prompting tells a language model to reason out loud before answering. By generating intermediate steps, the model steers itself toward correct conclusions — turning guessw

27 min read

Article 7

LLM Hallucinations: Causes, Detection, and Mitigation Strategies

TLDR: LLMs hallucinate because they are trained to predict the next plausible token — not the next true token. Understanding the three hallucination types (factual, faithfulness, open-domain) plus the

30 min read

Article 8

Dense LLM Architecture: How Every Parameter Works on Every Token

TLDR: In a dense LLM every single parameter is active for every token in every forward pass — no routing, no selection. A transformer block runs multi-head self-attention (Q, K, V) followed by a feed-

24 min read

Article 9

Managed API LLMs vs Self-Hosted Models: When to Switch and When Not To

TLDR: Most teams should start with managed LLM APIs because they buy speed, reliability, model quality, and low operational burden. Move to self-hosted or open-weight models only when you have stable

17 min read

Article 10

LLM Software Development Pitfalls: What to Avoid and When to Simplify

TLDR: Most bad LLM products do not fail because the model is weak. They fail because teams wrap a maybe-useful model in too much architecture: prompt spaghetti, no eval harness, weak tool schemas, hug

20 min read

Article 11

LLM Model Selection Guide: GPT-4o vs Claude vs Llama vs Mistral — When to Use Which

TLDR: 🧠 Choosing the right LLM can save you 80% on costs while maintaining quality. This guide provides a decision framework, cost comparison, and practical examples to help engineering teams select

23 min read

Article 12

LLM Observability: Tracing, Logging, and Debugging Production AI Systems

TLDR: 🔍 LLM observability is radically different from traditional APM—non-deterministic outputs, variable token costs, and multi-step reasoning chains require specialized tracing. LangSmith provides

19 min read

Article 13

LLM Evaluation Frameworks: How to Measure Model Quality (RAGAS, DeepEval, TruLens)

TLDR: 📏 Traditional ML metrics (accuracy, F1) fail for LLMs because there's no single "correct" answer. RAGAS measures RAG pipeline quality with faithfulness, answer relevance, and context precision.

16 min read

Article 14

Context Window Management: Strategies for Long Documents and Extended Conversations

TLDR: 🧠 Context windows are LLM memory limits. When conversations grow past 4K-128K tokens, you need strategies: sliding windows (cheap, lossy), summarization (balanced), RAG (selective), map-reduce

20 min read

Article 15

Step-by-Step: How to Expose a Skill as an MCP Server

TLDR: Turn any Python function into a multi-client MCP server in 11 steps — from annotation to Docker. 📖 The Copy-Paste Problem: Why Skills Die at IDE Boundaries A developer pastes their summarize_pr_diff function into a Slack message because thei...

26 min read

Article 16

Headless Agents: Deploy Skills as MCP Servers — Full Guide from Concept to Three Clients

TLDR: Build an MCP server once and call it from Cursor, Claude Desktop, and VS Code without rewrites — this guide takes you from a single Python function to a containerized, authenticated, three-clien

33 min read

Article 17

Types of LLM Quantization: By Timing, Scope, and Mapping

TLDR: There is no single "best" LLM quantization. You classify and choose quantization along three axes: when you quantize (timing), what you quantize (scope), and how values are encoded (mapping). In

17 min read

Article 18

AI Architecture Patterns: Routers, Planner-Worker Loops, Memory Layers, and Evaluation Guardrails

TLDR: A single agent loop is enough for a demo, but production AI systems need explicit layers for routing, execution, memory, and evaluation. Those layers determine safety, latency, cost, and traceab

14 min read

Article 19

Practical LLM Quantization in Colab: A Hugging Face Walkthrough

TLDR: This is a practical, notebook-style quantization guide for Google Colab and Hugging Face. You will quantize real models, run inference, compare memory/latency, and learn when to use 4-bit NF4 vs

15 min read

Article 20

LLM Skills vs Tools: The Missing Layer in Agent Design

TLDR: A tool is a single callable capability (search, SQL, calculator). A skill is a reusable mini-workflow that coordinates multiple tool calls with policy, guardrails, retries, and output structure.

16 min read

Article 21

LLM Skill Registries, Routing Policies, and Evaluation for Production Agents

TLDR: If tools are primitives and skills are reusable routines, then the skill registry + router + evaluator is your production control plane. This layer decides which skill runs, under what constrain

14 min read

Article 22

GPTQ vs AWQ vs NF4: Choosing the Right LLM Quantization Pipeline

TLDR: GPTQ, AWQ, and NF4 all shrink LLMs, but they optimize different constraints. GPTQ focuses on post-training reconstruction error, AWQ protects salient weights for better quality at low bits, and

15 min read

Article 23

SFT for LLMs: A Practical Guide to Supervised Fine-Tuning

TLDR: Supervised fine-tuning (SFT) is the stage where a pretrained model learns task-specific response behavior from curated input-output examples. It is usually the first alignment step after pretrai

12 min read

Article 24

RLHF in Practice: From Human Preferences to Better LLM Policies

TLDR: Reinforcement Learning from Human Feedback (RLHF) helps align language models with human preferences after pretraining and SFT. The typical pipeline is: collect preference comparisons, train a r

12 min read

Article 25

PEFT, LoRA, and QLoRA: A Practical Guide to Efficient LLM Fine-Tuning

TLDR: Full fine-tuning updates every model weight, which is expensive in memory, compute, and storage. PEFT methods update only a small trainable slice. LoRA learns low-rank adapters on top of frozen

14 min read

Article 26

LLM Model Naming Conventions: How to Read Names and Why They Matter

TLDR: LLM names encode practical decisions: model family, size, training stage, context window, format, and quantization level. If you can decode naming conventions, you can avoid costly deployment mi

12 min read

Article 27

Why Embeddings Matter: Solving Key Issues in Data Representation

TLDR: Embeddings convert words (and images, users, products) into dense numerical vectors in a geometric space where semantic similarity = geometric proximity. "King - Man + Woman ≈ Queen" is not magi

14 min read

Article 28

What are Logits in Machine Learning and Why They Matter

TLDR: Logits are the raw, unnormalized scores produced by the final layer of a neural network — before any probability transformation. Softmax converts them to probabilities. Temperature scales them b

11 min read

Article 29

Text Decoding Strategies: Greedy, Beam Search, and Sampling

TLDR: An LLM doesn't "write" text — it generates a probability distribution over all possible next tokens and then uses a decoding strategy to pick one. Greedy, Beam Search, and Sampling are different

16 min read

Article 30

RLHF Explained: How We Teach AI to Be Nice

TLDR: A raw LLM is a super-smart parrot that read the entire internet — including its worst parts. RLHF (Reinforcement Learning from Human Feedback) is the training pipeline that transforms it from a

14 min read

Article 31

Mastering Prompt Templates: System, User, and Assistant Roles with LangChain

TLDR: A production prompt is not a string — it is a structured message list with system, user, and optional assistant roles. LangChain's ChatPromptTemplate turns this structure into a reusable, testab

14 min read

Article 32

Prompt Engineering Guide: From Zero-Shot to Chain-of-Thought

TLDR: Prompt Engineering is the art of writing instructions that guide an LLM toward the answer you want. Zero-Shot, Few-Shot, and Chain-of-Thought are systematic techniques — not guesswork — that can

13 min read

Article 33

Multistep AI Agents: The Power of Planning

TLDR: A simple ReAct agent reacts one tool call at a time. A multistep agent plans a complete task decomposition upfront, then executes each step sequentially — handling complex goals that require 5-1

15 min read

Article 34

LoRA Explained: How to Fine-Tune LLMs on a Budget

TLDR: Fine-tuning a 7B-parameter LLM updates billions of weights and requires expensive GPUs. LoRA (Low-Rank Adaptation) freezes the original weights and trains only tiny adapter matrices that are add

13 min read

Article 35

Diffusion Models: How AI Creates Art from Noise

TLDR: Diffusion models work by first learning to add noise to an image, then learning to undo that noise. At inference time you start from pure static and iteratively denoise into a meaningful image.

12 min read

Article 36

The Developer's Guide: When to Use Code, ML, LLMs, or Agents

TLDR: AI is a tool, not a religion. Use Code for deterministic logic (banking, math). Use Traditional ML for structured predictions (fraud, recommendations). Use LLMs for unstructured text (summarizat

15 min read

Article 37

AI Agents Explained: When LLMs Start Using Tools

TLDR: A standard LLM is a brain in a jar — it can reason but cannot act. An AI Agent connects that brain to tools (web search, code execution, APIs). Instead of just answering a question, an agent exe

13 min read

Article 38

A Guide to Pre-training Large Language Models

TLDR: Pre-training is the phase where an LLM learns "Language" and "World Knowledge" by reading petabytes of text. It uses Self-Supervised Learning to predict the next word in a sentence. This creates

15 min read

Article 39

A Beginner's Guide to Vector Database Principles

TLDR: A vector database stores meaning as numbers so you can search by intent, not exact keywords. That is why "reset my password" can find "account recovery steps" even if the words are different.

14 min read

Article 40

LLM Model Quantization: Why, When, and How to Deploy Smaller, Faster Models

TLDR: Quantization converts high-precision model weights and activations (FP16/FP32) into lower-precision formats (INT8 or INT4) so LLMs run with less memory, lower latency, and lower cost. The key is

13 min read

Article 41

LLM Hyperparameters Guide: Temperature, Top-P, and Top-K Explained

TLDR: Temperature, Top-p, and Top-k are three sampling controls that determine how "creative" or "deterministic" an LLM's output is. Temperature rescales the probability distribution; Top-k limits the

16 min read

Article 42

Mastering Prompt Templates: System, User, and Assistant Roles with LangChain

TLDR: Prompt templates are the contract between your application and the LLM. Role-based messages (System / User / Assistant) provide structure. LangChain's ChatPromptTemplate and MessagesPlaceholder

13 min read

Article 43

Tokenization Explained: How LLMs Understand Text

TLDR: LLMs don't read words — they read tokens. A token is roughly 4 characters. Byte Pair Encoding (BPE) builds an efficient subword vocabulary by iteratively merging frequent character pairs. Tokeni

12 min read

Article 44

RAG Explained: How to Give Your LLM a Brain Upgrade

TLDR: LLMs have a training cut-off and no access to private data. RAG (Retrieval-Augmented Generation) solves both problems by retrieving relevant documents from an external store and injecting them i

11 min read

Article 45

Variational Autoencoders (VAE): The Art of Compression and Creation

TLDR: A VAE learns to compress data into a smooth probabilistic latent space, then generate new samples by decoding random points from that space. The reparameterization trick is what makes it trainab

13 min read

Article 46

LLM Terms You Should Know: A Helpful Glossary

TLDR: The world of LLMs has its own dense vocabulary. This post is your decoder ring — covering foundation terms (tokens, context window), generation settings (temperature, top-p), safety concepts (ha

14 min read

Article 47

Advanced AI: Agents, RAG, and the Future of Intelligence

TLDR: Large Language Models are brilliant "brains in a jar." Retrieval-Augmented Generation (RAG) hands them a constantly refreshed memory, while AI Agents give them tools to act in the world. Combine

15 min read

Article 48

Large Language Models (LLMs): The Generative AI Revolution

TLDR: Large Language Models predict the next token, one at a time, using a Transformer architecture trained on billions of words. At scale, this simple objective produces emergent reasoning, coding, a

14 min read

Article 49

Natural Language Processing (NLP): Teaching Computers to Read

TLDR: 🌟 NLP turns raw text into numbers so machines can read, understand, and generate language. The field evolved from counting words (Bag-of-Words) to contextual Transformers — each leap brings ric

14 min read
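Several of the summaries above compress a mechanism into a sentence. As one illustration, the point in Article 28 that softmax turns logits into probabilities, with temperature rescaling them, can be sketched in a few lines of Python. This is a minimal sketch, not code from the article: the function name and the example logits are hypothetical, and it assumes the common formulation in which each logit is divided by the temperature before softmax is applied.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to probabilities, dividing by temperature first.

    Assumption: temperature is applied as logits / T before softmax.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens.
example_logits = [2.0, 1.0, 0.1]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Running it shows the general pattern: a temperature below 1 concentrates probability on the highest-scoring token, while a temperature above 1 flattens the distribution without changing the ranking.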

LLM Engineering: Learning Roadmap

You read a post about LoRA fine-tuning. Then you see one about RAG that mentions embeddings you don't understand. So you search for embeddings, find a vector database guide that assumes you know about quantization, and dive into a quantization post that references RLHF concepts you've never seen. Three hours later, you're reading about diffusion models with no clear path back to your original goal.

This is the classic LLM learning trap. The field moves at breakneck speed, posts proliferate across dozens of blogs, and everything connects to everything else. Without a structured path, you accumulate fragments instead of building a coherent mental model. You know isolated techniques but struggle to see how they fit together into production systems that actually work.

TLDR: This roadmap organizes 37 LLM Engineering posts into decision-tree learning paths based on your goal: ship an app fast (App Developer), customize models (ML Engineer), build autonomous agents (Agent Builder), or understand the theory (Research Track). Start with fundamentals, then choose your path.

What You'll Learn

Understand LLM Engineering through real published examples

Follow a sequence of 49 articles from fundamentals to deeper topics

Connect related concepts: ANN indexes, vector databases, and RAG

Practice explaining trade-offs and implementation decisions

Prerequisites

Basic API knowledge

Familiarity with data pipelines

Curiosity about model behavior

FAQs

How should I read this series?

Start from the first article if you are new, or use the article list to jump into the most relevant topic.

Is progress automatic?

Progress is tracked from the articles you open in this browser, using its local learning history.