A Journey from AI to LLMs and MCP - 2 - How LLMs Work — Embeddings, Vectors, and Context Windows

In our last post, we explored the evolution of AI—from rule-based systems to deep learning—and how Large Language Models (LLMs) like GPT-4 and Claude represent a transformative leap in capability.

But how do these models actually work?

In this post, we’ll peel back the curtain on the inner workings of LLMs. We’ll explore the fundamental concepts that make these models tick: embeddings, vector spaces, and context windows. You’ll walk away with a clearer understanding of how LLMs “understand” language—and what their limits are.

How LLMs Think: It’s All Math Underneath

Despite their fluent text output, LLMs don’t truly “understand” language in the human sense. Instead, they operate on numerical representations of text, using vast networks of mathematical weights to predict the next word in a sequence.

The key mechanism behind this: transformers.

Transformers revolutionized NLP with attention mechanisms, which let a model weigh the relevance of every token in a sequence against every other token, instead of processing text strictly one step at a time the way RNNs do.
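
To make that concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation inside a transformer layer. The matrices are made-up toy values; real models use learned weights and many attention heads per layer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how relevant each token is to every other token
    weights = softmax(scores, axis=-1)   # each row sums to 1: attention weights per token
    return weights @ V                   # weighted mix of the value vectors

# Toy example: 3 tokens, each represented by a 4-dimensional vector (made-up numbers)
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```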

Here’s the simplified flow (a code sketch follows the list):

  1. Text is tokenized (split into chunks)
  2. Tokens are converted into embeddings (vectors)
  3. Those vectors pass through layers of attention to capture meaning
  4. The model generates the next token based on probability
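
To see that flow end to end, here is a hedged sketch using the open-source Hugging Face transformers library with the small GPT-2 model as a stand-in (it assumes transformers and torch are installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # small open model as a stand-in
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The capital of France is"
inputs = tokenizer(text, return_tensors="pt")          # step 1: text -> token ids

with torch.no_grad():
    outputs = model(**inputs)                          # steps 2-3: embeddings + attention layers

# Step 4: the logits at the last position define a probability distribution over the next token
next_token_logits = outputs.logits[0, -1]
next_token_id = int(torch.argmax(next_token_logits))
print(tokenizer.decode([next_token_id]))               # the model's single most probable continuation
```
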

But what are these embeddings and why do they matter?

Embeddings: From Words to Numbers

Before an LLM can do anything with language, it must convert words into numbers it can operate on.

That’s where embeddings come in.

What is an embedding?

An embedding is a high-dimensional vector (think: a long list of numbers) that represents the meaning of a word or phrase.

Words with similar meanings have similar embeddings.

For example:

Embedding("dog") ≈ Embedding("puppy") Embedding("Paris") ≈ Embedding("London")

These vectors live in an abstract vector space, where distance encodes similarity.
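
"Similar embeddings" in practice means vectors that point in nearly the same direction, usually measured with cosine similarity. Here is a minimal sketch with made-up 4-dimensional vectors; real embeddings typically have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means identical direction, 0.0 means unrelated, -1.0 means opposite
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up toy embeddings; real ones come from a trained model
dog   = np.array([0.8, 0.1, 0.6, 0.2])
puppy = np.array([0.7, 0.2, 0.5, 0.1])
paris = np.array([0.1, 0.9, 0.2, 0.8])

print(cosine_similarity(dog, puppy))  # high: related meanings
print(cosine_similarity(dog, paris))  # lower: unrelated meanings
```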

LLMs use embeddings not just for input, but throughout every layer of their neural network to understand relationships, context, and meaning.

Vector Search and Semantic Understanding

Because embeddings encode meaning, they’re also incredibly useful for semantic search.

Instead of matching exact words (like keyword search), vector search compares embeddings to find text that’s conceptually similar.

For example:

  • Query: “How do I fix a leaking pipe?”
  • Match: “Plumbing repair for minor water leaks”

Even though the words don’t overlap, the meaning does—and that’s what embeddings capture.
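
As a sketch of how such a match could be computed, the snippet below uses the open-source sentence-transformers library and its all-MiniLM-L6-v2 model (both assumptions, not requirements of this series) to embed a query and a few documents and rank them by similarity:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small open embedding model

documents = [
    "Plumbing repair for minor water leaks",
    "Recipes for a quick weeknight dinner",
    "How to repaint a wooden fence",
]
query = "How do I fix a leaking pipe?"

doc_embeddings = model.encode(documents)   # one vector per document
query_embedding = model.encode(query)      # one vector for the query

scores = util.cos_sim(query_embedding, doc_embeddings)[0]  # cosine similarity to each document
best = int(scores.argmax())
print(documents[best])  # "Plumbing repair for minor water leaks"
```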

This is the foundation for many powerful AI techniques like:

  • Document similarity
  • Retrieval-Augmented Generation (RAG) (more on this in Blog 3)
  • Context injection from external data sources

Context Windows: The Model’s Working Memory

Another crucial concept in LLMs is the context window—the maximum number of tokens the model can “see” at once.

Every input to an LLM gets broken into tokens, and the model has a limited capacity for how many tokens it can process per request.

| Model | Max Context Window |
| --- | --- |
| GPT-3.5 | 4,096 tokens (~3,000 words) |
| GPT-4 Turbo | Up to 128,000 tokens |
| Claude 3 Opus | Up to 200,000 tokens |
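
To get a feel for how quickly tokens add up, you can count them before sending a request. This sketch assumes OpenAI's open-source tiktoken tokenizer is installed; other providers ship their own tokenizers with slightly different counts.

```python
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")  # pick the tokenizer that matches the model

text = "Large language models read text as tokens, not words."
tokens = encoding.encode(text)

print(len(text.split()), "words")   # rough word count
print(len(tokens), "tokens")        # what actually counts against the context window
```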

If you go over the limit, you’ll need to:

  • Truncate input, losing information (a minimal truncation sketch follows this list)
  • Summarize
  • Use techniques like RAG or memory management
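
For the first option, a naive approach keeps only as many tokens as fit. The sketch below reuses tiktoken with a made-up budget; anything past the cut-off is simply dropped.

```python
import tiktoken

MAX_TOKENS = 3000  # made-up budget, leaving room for the prompt template and the reply

encoding = tiktoken.encoding_for_model("gpt-4")
long_document = "Paragraph of text. " * 2000  # stand-in for a document that exceeds the window

tokens = encoding.encode(long_document)
truncated = encoding.decode(tokens[:MAX_TOKENS])  # keep the first MAX_TOKENS tokens; the rest is lost
print(len(tokens), "->", len(encoding.encode(truncated)), "tokens")
```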

TL;DR: The larger the context window, the more the model can “remember” during a conversation or task.

Limitations of Embeddings and Context Windows

Even though LLMs are powerful, they come with trade-offs:

Embedding limitations:

  • Don’t always reflect nuanced context (e.g., sarcasm, tone)
  • Fixed dimensionality: can’t represent everything
  • Require separate handling for different modalities (text vs images)

Context window limitations:

  • Long documents may get truncated or ignored
  • Memory is not persistent: everything resets after a session unless you manually re-include previous context (see the sketch after this list)
  • More tokens = higher latency and cost
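
Because of that middle point, chat applications typically resend the running conversation with every request. The sketch below assumes the OpenAI Python client is installed and an API key is configured; other providers' chat APIs follow the same pattern.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(question: str) -> str:
    # The model has no memory of earlier turns: the whole history is re-sent every time
    history.append({"role": "user", "content": question})
    response = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

print(ask("What is an embedding?"))
print(ask("And how is it different from a token?"))  # only works because the first turn was re-sent
```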

These limits are precisely why so much effort goes into enhancing LLMs through fine-tuning, retrieval systems, and smarter prompt engineering.

We’ll dive into that next.

Recap: Key Concepts from This Post

| Concept | What It Is | Why It Matters |
| --- | --- | --- |
| Embeddings | Vector representations of tokens/text | Enable semantic understanding & search |
| Vector Space | Mathematical space where embeddings live | Allows similarity comparison & clustering |
| Context Window | Max token size per LLM input | Defines how much the model can “see” |
| Attention | Weighs token relationships dynamically | Enables context awareness in LLMs |

🔮 Up Next: Making LLMs Smarter with Fine-Tuning, Prompt Engineering, and RAG

In our next post, we’ll show how to enhance LLM performance using proven techniques:

  • Fine-tuning
  • Prompt engineering
  • Retrieval-Augmented Generation (RAG)

These strategies help you move beyond limitations—and get the most out of your models.