
Embeddings, Vector Storage, and RAG: Complete Guide to AI Knowledge Systems (2026)

TL;DR: Embeddings convert text into numerical vectors that capture semantic meaning. Vector databases like pgvector store and search those vectors efficiently. RAG (Retrieval-Augmented Generation) ties it all together — retrieving relevant documents from a vector store and feeding them to an LLM so it can answer questions using your custom knowledge base. This guide walks through the full stack with working TypeScript code: from generating embeddings with OpenAI, to storing them in Supabase with pgvector, to building a complete document Q&A system with the Vercel AI SDK.

What Are Embeddings?

An embedding is a numerical vector — an array of floating-point numbers — that represents the semantic meaning of a piece of text. When you run the sentence "The cat sat on the mat" through a model like OpenAI's text-embedding-3-small, you get back an array of 1,536 numbers. That array is the embedding.

The key insight is that similar meanings produce similar vectors. The embeddings for "How do I reset my password?" and "I forgot my login credentials" will be close together in vector space, even though they share almost no words. Meanwhile, "The weather is nice today" will be far away from both.

This works because embedding models are trained on massive text corpora to learn relationships between concepts. During training, the model learns to map semantically related text to nearby points in a high-dimensional space. The number of dimensions varies by model — OpenAI's small model uses 1,536 dimensions, while their large model uses 3,072.

import OpenAI from 'openai'

const openai = new OpenAI()

async function getEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  })
  return response.data[0].embedding // number[] with 1536 dimensions
}

const embedding = await getEmbedding('How do embeddings work?')
console.log(embedding.length) // 1536
console.log(embedding.slice(0, 5)) // [-0.0023, 0.0451, -0.0112, ...]

You can think of embeddings as coordinates in a meaning space. Just as GPS coordinates tell you where something is on Earth, embedding vectors tell you where something is in semantic space. And just as you can calculate the distance between two GPS points, you can calculate the distance between two embeddings to measure how similar their meanings are.

If you're new to the transformer architecture that powers these models, start there — understanding attention mechanisms will clarify why embeddings capture meaning so effectively.

How Vector Search Works

Once you have embeddings, you need a way to find the most similar ones to a given query. This is vector search — also called nearest-neighbor search or similarity search.

Distance Metrics

There are three common ways to measure the "distance" between two vectors:

Cosine similarity measures the angle between two vectors, ignoring their magnitude. A score of 1.0 means identical direction (same meaning), 0 means orthogonal (unrelated), and -1 means opposite. This is the most common metric for text embeddings because it normalizes for document length.

Dot product multiplies corresponding elements and sums the results. It's faster to compute than cosine similarity but sensitive to vector magnitude. OpenAI's embedding models are normalized, so dot product and cosine similarity produce the same rankings.

Euclidean distance measures the straight-line distance between two points. Lower values mean more similar. It's less common for text embeddings but useful when magnitude matters.

function cosineSimilarity(a: number[], b: number[]): number {
  let dotProduct = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB))
}
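The claim that dot product and cosine similarity produce the same rankings for normalized vectors is easy to check: once both vectors have length 1, the denominator in the cosine formula is 1, leaving just the dot product. A quick sketch (helper names here are illustrative):

```typescript
// Dot product: multiply corresponding elements and sum.
function dotProduct(a: number[], b: number[]): number {
  return a.reduce((sum, val, i) => sum + val * b[i], 0)
}

// Scale a vector to unit length.
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, val) => sum + val * val, 0))
  return v.map((val) => val / norm)
}

// For unit vectors, dot product and cosine similarity are identical,
// so either metric produces the same ranking.
const a = normalize([3, 4]) // [0.6, 0.8]
const b = normalize([4, 3]) // [0.8, 0.6]
console.log(dotProduct(a, b)) // ≈ 0.96
```

This is why OpenAI's pre-normalized embeddings let you use the cheaper dot product without changing search results.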

With a small dataset (under 100k vectors), you can do an exact nearest-neighbor search — compare the query vector against every stored vector and return the closest ones. This guarantees the best results but gets slow as your dataset grows.

For larger datasets, approximate nearest-neighbor (ANN) algorithms trade a tiny amount of accuracy for massive speed improvements. The most popular is HNSW (Hierarchical Navigable Small World), which builds a multi-layer graph structure that enables logarithmic-time search. pgvector supports HNSW indexes natively, delivering sub-millisecond queries on millions of vectors.

With pgvectorscale, PostgreSQL achieves 471 QPS at 99% recall on 50M vectors in Timescale's published benchmarks — competitive with dedicated vector databases and 11.4x faster than Qdrant in the same comparison.

The RAG Architecture

RAG (Retrieval-Augmented Generation) is the architecture that makes embeddings and vector search useful for building AI applications. Instead of relying solely on what an LLM learned during training, RAG retrieves relevant information from your data and includes it in the prompt. Organizations using RAG report 60-80% reduction in hallucinations compared to vanilla LLM responses.

The flow has two phases:

Ingestion (Offline)

  1. Load your documents (PDFs, markdown files, database rows, web pages)
  2. Chunk documents into smaller segments (typically 200-1,000 tokens)
  3. Embed each chunk using an embedding model
  4. Store the vectors alongside the original text in a vector database

Query (Real-time)

  1. Embed the user's question using the same embedding model
  2. Search the vector database for the most similar document chunks
  3. Augment the LLM prompt with the retrieved chunks as context
  4. Generate a response grounded in the retrieved information

User Question
      │
      ▼
┌──────────────┐
│  Embed Query │ ──→ [0.02, -0.15, 0.33, ...]
└──────────────┘
      │
      ▼
┌──────────────────┐
│  Vector Database │ ──→ Top-K similar chunks
│  (pgvector)      │
└──────────────────┘
      │
      ▼
┌───────────────────────────────────┐
│  LLM Prompt                       │
│  System: Answer using context     │
│  Context: [retrieved chunks]      │
│  Question: {user question}        │
└───────────────────────────────────┘
      │
      ▼
┌───────────────┐
│  LLM Response │ ──→ Grounded answer
└───────────────┘

The beauty of RAG is that you don't need to fine-tune the model. You can update your knowledge base by re-embedding new documents — no retraining required. This makes RAG the dominant architecture for building AI systems with custom knowledge in 2026.

Embedding Models Compared

Choosing the right embedding model affects search quality, latency, and cost. Here's how the top models compare as of early 2026:

| Model | Provider | Dimensions | Price per 1M Tokens | MTEB Score | Best For |
|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1,536 | $0.02 | 62.3 | Best cost-performance ratio |
| text-embedding-3-large | OpenAI | 3,072 | $0.13 | 64.6 | High-accuracy applications |
| embed-v4 | Cohere | 1,024 | $0.10 | 65.2 | Multilingual, hybrid search |
| Gemini Embedding 001 | Google | 768 | $0.00 (free preview) | 68.3 | Multimodal embeddings |
| BGE-M3 | BAAI | 1,024 | Free (self-hosted) | 63.0 | Open-source, multilingual |
| Nomic Embed v1.5 | Nomic AI | 768 | Free (self-hosted) | 62.2 | Local via Ollama |
| voyage-3 | Voyage AI | 1,024 | $0.06 | 67.1 | Code and technical content |

For most developers building RAG systems, OpenAI's text-embedding-3-small is the right starting point. At $0.02 per million tokens, you can embed a million document chunks for under $1. The quality gap between it and more expensive models rarely justifies the 6.5x price increase — validate on your own data before upgrading.

If you're running models locally on a Mac Mini or similar hardware, Nomic Embed and BGE-M3 run well through Ollama and produce solid results without any API costs.

Vector Database Comparison

Vector databases store embeddings and provide efficient similarity search. The landscape in 2026 ranges from PostgreSQL extensions to fully managed cloud services:

| Database | Type | Max Vectors | Hybrid Search | Pricing | Best For |
|---|---|---|---|---|---|
| pgvector | Postgres extension | ~100M | Via pg_trgm | Free (self-hosted) | Teams already using Postgres |
| Pinecone | Managed cloud | Billions | Yes | Free tier, then usage-based | Zero-ops managed service |
| Weaviate | Open-source / Cloud | Billions | Native BM25 + vector | Free (self-hosted) / Cloud plans | Hybrid search, multitenancy |
| Qdrant | Open-source / Cloud | Billions | Payload filtering | Free (self-hosted) / Cloud plans | Filtering + vector search |
| Chroma | Open-source | ~1M | Metadata filtering | Free | Prototyping, local dev |

pgvector is the default recommendation for teams that already have Postgres in their stack. It avoids the operational complexity of a second data store and gives you transactional consistency that dedicated vector databases can't match. Supabase includes pgvector built-in, so if you're already on Supabase, you get vector search for free.

Pinecone is the right choice if you need a fully managed service with no infrastructure to maintain. You get an API key, create an index, and start querying — no instances to size, no indexes to tune.

Weaviate stands out for its native hybrid search that combines BM25 keyword matching with vector similarity in a single query. If your use case involves a mix of exact-match and semantic queries, Weaviate handles both natively.

For most RAG applications under 10M vectors, pgvector on Supabase gives you the best balance of simplicity, cost, and performance.

Building a RAG System with pgvector and Supabase

Let's build the vector storage layer. First, enable pgvector and create the documents table in Supabase:

-- Enable the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table for document chunks with embeddings
CREATE TABLE documents (
  id BIGSERIAL PRIMARY KEY,
  content TEXT NOT NULL,
  metadata JSONB DEFAULT '{}',
  embedding VECTOR(1536) -- Match your model's dimensions
);

-- Create an HNSW index for fast approximate search
CREATE INDEX ON documents
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

The HNSW index parameters control the accuracy-speed tradeoff: m is the number of connections per node (higher = more accurate, more memory), and ef_construction controls build-time accuracy (higher = better index, slower to build). The defaults of m=16 and ef_construction=64 work well for most datasets.

Next, create a SQL function for similarity search. Supabase's PostgREST doesn't support pgvector operators directly, so you need to wrap the query in a function and call it via .rpc():

CREATE OR REPLACE FUNCTION match_documents(
  query_embedding VECTOR(1536),
  match_threshold FLOAT DEFAULT 0.78,
  match_count INT DEFAULT 5
)
RETURNS TABLE (
  id BIGINT,
  content TEXT,
  metadata JSONB,
  similarity FLOAT
)
LANGUAGE sql STABLE
AS $$
  SELECT
    documents.id,
    documents.content,
    documents.metadata,
    1 - (documents.embedding <=> query_embedding) AS similarity
  FROM documents
  WHERE 1 - (documents.embedding <=> query_embedding) > match_threshold
  ORDER BY documents.embedding <=> query_embedding
  LIMIT match_count;
$$;

The <=> operator computes cosine distance (lower = more similar). We subtract from 1 to convert it to cosine similarity (higher = more similar) and filter by a threshold to exclude irrelevant results.

Now let's write the TypeScript code to embed and store documents:

import { createClient } from '@supabase/supabase-js'
import OpenAI from 'openai'

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_KEY!)
const openai = new OpenAI()

async function embedDocument(content: string, metadata: Record<string, unknown> = {}) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: content,
  })

  const { error } = await supabase.from('documents').insert({
    content,
    metadata,
    embedding: response.data[0].embedding,
  })

  if (error) throw error
}

async function searchDocuments(query: string, matchCount = 5) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query,
  })

  const { data, error } = await supabase.rpc('match_documents', {
    query_embedding: response.data[0].embedding,
    match_threshold: 0.78,
    match_count: matchCount,
  })

  if (error) throw error
  return data
}

Framework Comparison: LangChain vs LlamaIndex vs Vercel AI SDK

Three frameworks dominate the RAG ecosystem in 2026. Each takes a different approach:

LangChain / LangGraph

LangChain is the most comprehensive framework with pre-built integrations for every vector store, embedding model, and LLM. In 2026, production workloads have largely shifted to LangGraph, which adds stateful agent workflows on top of LangChain's primitives.

Best for: Complex agent pipelines, multi-step reasoning, tool-augmented RAG
Trade-off: 101.2 kB gzipped bundle, heavier abstraction layer, steeper learning curve

import { ChatOpenAI } from '@langchain/openai'
import { SupabaseVectorStore } from '@langchain/community/vectorstores/supabase'
import { OpenAIEmbeddings } from '@langchain/openai'
import { createRetrievalChain } from 'langchain/chains/retrieval'

const vectorStore = await SupabaseVectorStore.fromExistingIndex(
  new OpenAIEmbeddings({ model: 'text-embedding-3-small' }),
  { client: supabase, tableName: 'documents', queryName: 'match_documents' }
)

const retriever = vectorStore.asRetriever({ k: 5 })

LlamaIndex

LlamaIndex is purpose-built for search and retrieval. It has the most optimized ingestion pipeline and retrieves documents 40% faster than LangChain in benchmarks.

Best for: Document-heavy RAG, structured data querying, knowledge base construction
Trade-off: Narrower scope than LangChain — less flexibility for non-RAG tasks

import { VectorStoreIndex, SimpleDirectoryReader } from 'llamaindex'

const documents = await new SimpleDirectoryReader().loadData('./docs')
const index = await VectorStoreIndex.fromDocuments(documents)
const queryEngine = index.asQueryEngine()

const response = await queryEngine.query('How does authentication work?')

Vercel AI SDK

The Vercel AI SDK takes a minimal, streaming-first approach. Rather than abstracting away the RAG pipeline, it gives you composable primitives (embed, streamText, generateObject) that you wire together yourself.

Best for: Next.js applications, streaming UI, edge runtime, minimal bundle size
Trade-off: You build the RAG pipeline yourself — less out-of-the-box compared to LangChain

import { openai } from '@ai-sdk/openai'
import { embed, streamText } from 'ai'

const { embedding } = await embed({
  model: openai.embedding('text-embedding-3-small'),
  value: 'How does authentication work?',
})
// Use embedding to query your vector store, then pass results to streamText

For Next.js developers, the Vercel AI SDK is the clear winner. It reduces streaming UI implementation from 100+ lines to ~20, supports the edge runtime, and its 67.5 kB gzipped bundle is roughly two-thirds the size of LangChain's 101.2 kB. You can always add LangChain later for complex agent workflows.

Working Example: Document Q&A with Next.js

Here's a complete implementation of a document Q&A system using Next.js, the Vercel AI SDK, OpenAI embeddings, and pgvector on Supabase. This example includes both the ingestion script and the query API route.

Ingestion Script

// scripts/ingest.ts
import { createClient } from '@supabase/supabase-js'
import OpenAI from 'openai'
import { readFileSync, readdirSync } from 'fs'
import { join } from 'path'

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_KEY!)
const openai = new OpenAI()

function chunkText(text: string, maxTokens = 500, overlap = 50): string[] {
  const words = text.split(/\s+/)
  const chunks: string[] = []
  // Rough word-to-token ratio: ~0.75 words per token.
  // Floor so the loop step stays an integer for any maxTokens value.
  const chunkSize = Math.floor(maxTokens * 0.75)

  for (let i = 0; i < words.length; i += chunkSize - overlap) {
    const chunk = words.slice(i, i + chunkSize).join(' ')
    if (chunk.trim()) chunks.push(chunk.trim())
  }

  return chunks
}

async function ingestDirectory(dirPath: string) {
  const files = readdirSync(dirPath).filter((f) => f.endsWith('.md') || f.endsWith('.txt'))

  for (const file of files) {
    const content = readFileSync(join(dirPath, file), 'utf-8')
    const chunks = chunkText(content)

    console.log(`Processing ${file}: ${chunks.length} chunks`)

    // Batch embed chunks (OpenAI supports batch input)
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: chunks,
    })

    const rows = chunks.map((chunk, i) => ({
      content: chunk,
      metadata: { source: file, chunkIndex: i },
      embedding: response.data[i].embedding,
    }))

    const { error } = await supabase.from('documents').insert(rows)
    if (error) {
      console.error(`Failed to insert chunks for ${file}:`, error.message)
    }
  }

  console.log('Ingestion complete.')
}

ingestDirectory('./docs')

Query API Route

// app/api/chat/route.ts
import { openai } from '@ai-sdk/openai'
import { embed, streamText } from 'ai'
import { createClient } from '@supabase/supabase-js'

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_KEY!)

export async function POST(req: Request) {
  const { messages } = await req.json()
  const lastMessage = messages[messages.length - 1].content

  // 1. Embed the user's question
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: lastMessage,
  })

  // 2. Search for relevant documents
  const { data: documents } = await supabase.rpc('match_documents', {
    query_embedding: embedding,
    match_threshold: 0.78,
    match_count: 5,
  })

  // 3. Build context from retrieved documents
  const context = documents?.map((doc: { content: string }) => doc.content).join('\n\n---\n\n')

  // 4. Generate a response with context
  const result = streamText({
    model: openai('gpt-4o'),
    system: `You are a helpful assistant. Answer the user's question using ONLY the provided context. If the context doesn't contain the answer, say so honestly.

Context:
${context}`,
    messages,
  })

  return result.toDataStreamResponse()
}

Client Component

// app/page.tsx
'use client'

import { useChat } from '@ai-sdk/react'

export default function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat()

  return (
    <div className="mx-auto max-w-2xl p-4">
      <div className="space-y-4">
        {messages.map((m) => (
          <div key={m.id} className={m.role === 'user' ? 'text-right' : 'text-left'}>
            <span className="inline-block rounded-lg bg-gray-100 px-4 py-2 dark:bg-gray-800">
              {m.content}
            </span>
          </div>
        ))}
      </div>
      <form onSubmit={handleSubmit} className="mt-4 flex gap-2">
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="Ask a question about your documents..."
          className="flex-1 rounded-lg border px-4 py-2"
        />
        <button type="submit" className="rounded-lg bg-blue-500 px-4 py-2 text-white">
          Send
        </button>
      </form>
    </div>
  )
}

Install the required dependencies:

pnpm add ai @ai-sdk/openai @ai-sdk/react @supabase/supabase-js openai

Chunking Strategies

How you split documents into chunks has a huge impact on retrieval quality. A chunk that's too large dilutes the relevant information with noise. A chunk that's too small loses critical context. Recent benchmarks show chunking strategy can swing accuracy by 70%.

Fixed-Size Chunking

Split text into chunks of N tokens with M tokens of overlap. Simple, predictable, and surprisingly effective.

function fixedSizeChunk(text: string, size = 512, overlap = 64): string[] {
  const words = text.split(/\s+/)
  const chunks: string[] = []
  for (let i = 0; i < words.length; i += size - overlap) {
    chunks.push(words.slice(i, i + size).join(' '))
  }
  return chunks
}

When to use: Default starting point. A NAACL 2025 study found that fixed 200-word chunks matched or beat semantic chunking across retrieval and answer generation tasks.

Recursive Chunking

Splits hierarchically — first by sections (##), then paragraphs, then sentences — until chunks fit size limits. This preserves document structure.

When to use: Markdown docs, technical documentation, any content with clear structural hierarchy. Benchmarks place recursive 512-token splitting at 69% accuracy vs 54% for semantic chunking across academic papers.
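A minimal sketch of the idea: try the largest separator first, and recurse to smaller ones only when a piece is still too big. The separator list and word-based size check here are illustrative choices, not any specific library's API:

```typescript
// Recursively split text: headings first, then paragraphs, then sentences.
// Separators and the word-count limit are illustrative.
const SEPARATORS = ['\n## ', '\n\n', '. ']

function recursiveChunk(text: string, maxWords = 512, depth = 0): string[] {
  const wordCount = text.split(/\s+/).filter(Boolean).length
  // Small enough, or no smaller separators left: emit as one chunk.
  if (wordCount <= maxWords || depth >= SEPARATORS.length) {
    return text.trim() ? [text.trim()] : []
  }
  // Split on the current separator; each piece recurses at the next level.
  return text
    .split(SEPARATORS[depth])
    .flatMap((piece) => recursiveChunk(piece, maxWords, depth + 1))
}
```

A production version would re-attach the stripped heading markers and enforce overlap; this shows only the hierarchical descent that preserves structure.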

Semantic Chunking

Uses embeddings to detect topic shifts and splits where the semantic similarity between consecutive sentences drops below a threshold.

When to use: Long, unstructured documents (transcripts, emails, chat logs) where topic boundaries aren't marked by formatting. The computational cost is higher — you're making embedding API calls during chunking — so only use this when fixed-size chunks produce poor results.
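The mechanism can be sketched with a pluggable embedding function — `embedFn` below stands in for whatever embedding call you use (OpenAI, a local model), and the 0.75 threshold is an illustrative starting point, not a universal constant:

```typescript
type EmbedFn = (text: string) => Promise<number[]>

function cosineSim(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb))
}

// Semantic chunking sketch: embed each sentence, start a new chunk
// wherever similarity to the previous sentence drops below the threshold.
async function semanticChunk(
  text: string,
  embedFn: EmbedFn,
  threshold = 0.75,
): Promise<string[]> {
  const sentences = text.split(/(?<=[.!?])\s+/).filter((s) => s.trim())
  if (sentences.length === 0) return []

  const embeddings = await Promise.all(sentences.map(embedFn))
  const chunks: string[][] = [[sentences[0]]]

  for (let i = 1; i < sentences.length; i++) {
    if (cosineSim(embeddings[i - 1], embeddings[i]) < threshold) {
      chunks.push([]) // topic shift detected: start a new chunk
    }
    chunks[chunks.length - 1].push(sentences[i])
  }
  return chunks.map((c) => c.join(' '))
}
```

Note the cost implication directly in the code: one embedding call per sentence, before you've stored anything.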

Best Practices

  • Start with 400-512 tokens, 10-20% overlap. This is the recommended baseline from Weaviate's benchmarks
  • Include metadata. Always store the source file, section heading, page number, or URL with each chunk. This enables source attribution in your RAG responses
  • Overlap prevents context loss. If a key fact spans a chunk boundary, overlap ensures both adjacent chunks contain it
  • Benchmark on your data. Chunking is highly domain-dependent. Test 3-4 strategies on a set of real queries and measure retrieval precision

Hybrid Search

Vector search excels at semantic similarity but struggles with exact matches. Searching for "error code E-1042" will fail with pure vector search because the embedding won't capture the specific code. This is where hybrid search helps: it combines vector similarity with keyword matching.

BM25 is the classic keyword-matching algorithm (used by Elasticsearch, Solr, and most search engines). Hybrid search runs both BM25 and vector search in parallel, then merges the results using Reciprocal Rank Fusion (RRF):
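RRF itself is simple to sketch in application code: each document's score is the sum of 1/(k + rank) across every result list it appears in, where k = 60 is the conventional constant from the original RRF paper:

```typescript
// Reciprocal Rank Fusion: merge ranked result lists by summing 1/(k + rank).
// Documents ranked highly in multiple lists accumulate the highest scores.
// k = 60 is the conventional constant; larger k flattens rank differences.
function reciprocalRankFusion(resultLists: string[][], k = 60): string[] {
  const scores = new Map<string, number>()
  for (const list of resultLists) {
    list.forEach((id, index) => {
      const rank = index + 1 // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank))
    })
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id)
}

// A document ranked #2 in both lists beats one ranked #1 in only one list.
console.log(reciprocalRankFusion([['a', 'b', 'c'], ['b', 'c', 'd']]))
```

The SQL function below implements a weighted variant of the same idea directly in Postgres.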

-- Hybrid search in pgvector with ts_vector for keyword matching
CREATE INDEX documents_search_idx ON documents USING GIN (to_tsvector('english', content));

CREATE OR REPLACE FUNCTION hybrid_search(
  query_text TEXT,
  query_embedding VECTOR(1536),
  match_count INT DEFAULT 5,
  keyword_weight FLOAT DEFAULT 0.3,
  semantic_weight FLOAT DEFAULT 0.7
)
RETURNS TABLE (id BIGINT, content TEXT, metadata JSONB, score FLOAT)
LANGUAGE sql STABLE
AS $$
  WITH semantic AS (
    SELECT id, content, metadata,
      1 - (embedding <=> query_embedding) AS similarity,
      ROW_NUMBER() OVER (ORDER BY embedding <=> query_embedding) AS rank
    FROM documents
    ORDER BY embedding <=> query_embedding
    LIMIT match_count * 2
  ),
  keyword AS (
    SELECT id, content, metadata,
      ts_rank(to_tsvector('english', content), plainto_tsquery('english', query_text)) AS rank_score,
      ROW_NUMBER() OVER (ORDER BY ts_rank(to_tsvector('english', content), plainto_tsquery('english', query_text)) DESC) AS rank
    FROM documents
    WHERE to_tsvector('english', content) @@ plainto_tsquery('english', query_text)
    LIMIT match_count * 2
  )
  SELECT
    COALESCE(s.id, k.id) AS id,
    COALESCE(s.content, k.content) AS content,
    COALESCE(s.metadata, k.metadata) AS metadata,
    (COALESCE(semantic_weight / s.rank, 0) + COALESCE(keyword_weight / k.rank, 0)) AS score
  FROM semantic s
  FULL OUTER JOIN keyword k ON s.id = k.id
  ORDER BY score DESC
  LIMIT match_count;
$$;

Hybrid search pays off when your corpus mixes exact identifiers with natural language:

  • Technical documentation — users search for error codes, function names, config keys
  • E-commerce — product SKUs, brand names mixed with natural language queries
  • Legal/compliance — exact statute numbers alongside conceptual questions
  • Mixed query patterns — some users type keywords, others ask natural language questions

If your queries are purely conversational ("How do I deploy to production?"), pure vector search works fine. Add keyword matching when you see exact-match queries failing.

Evaluation and Testing RAG Quality

Building a RAG system is straightforward. Building one that consistently returns good answers is hard. You need to measure quality across three dimensions:

Retrieval Metrics

  • Precision@K — Of the K documents retrieved, how many were actually relevant? If you retrieve 5 documents and 3 are relevant, precision@5 = 0.6
  • Recall@K — Of all relevant documents in the database, how many did you retrieve? If there are 10 relevant documents and you retrieved 3, recall@5 = 0.3
  • Mean Reciprocal Rank (MRR) — How high does the first relevant result appear? If the first relevant result is position 2, MRR = 0.5
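The three retrieval metrics above can be computed from nothing more than the retrieved IDs (in rank order) and the set of known-relevant IDs — a minimal sketch:

```typescript
// Retrieval metrics for a single query: compare retrieved doc IDs
// (in rank order) against the set of known-relevant IDs.
function retrievalMetrics(retrieved: string[], relevant: Set<string>) {
  const hits = retrieved.filter((id) => relevant.has(id)).length
  const firstHit = retrieved.findIndex((id) => relevant.has(id))
  return {
    precisionAtK: hits / retrieved.length,       // relevant fraction of retrieved
    recallAtK: hits / relevant.size,             // retrieved fraction of relevant
    reciprocalRank: firstHit === -1 ? 0 : 1 / (firstHit + 1),
  }
}
```

Average `reciprocalRank` across your whole test set to get MRR.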

Generation Metrics

  • Faithfulness — Does the answer only use information from the retrieved context? Hallucinated information that goes beyond the context scores poorly
  • Answer relevance — Does the answer actually address the user's question?
  • Correctness — Is the answer factually correct compared to a ground truth?

Practical Evaluation Approach

Build a test set of 50-100 question-answer pairs with known correct answers and the expected source documents. Run your RAG pipeline on these questions and measure:

interface EvalResult {
  question: string
  expectedAnswer: string
  actualAnswer: string
  retrievedDocs: string[]
  expectedDocs: string[]
  metrics: {
    precisionAtK: number
    faithfulness: number // LLM-as-judge score 0-1
    answerRelevance: number // LLM-as-judge score 0-1
  }
}

Tools like Ragas and LangSmith automate this evaluation loop. The key is to measure before tweaking — change one variable at a time (chunk size, overlap, top-K, model) and track how each change affects your metrics.

Production Considerations

Cache Embeddings

Embedding the same query twice wastes money and adds latency. Cache embeddings by hashing the input text:

import { createHash } from 'crypto'

const embeddingCache = new Map<string, number[]>()

async function getCachedEmbedding(text: string): Promise<number[]> {
  const hash = createHash('sha256').update(text).digest('hex')
  if (embeddingCache.has(hash)) return embeddingCache.get(hash)!

  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  })

  const embedding = response.data[0].embedding
  embeddingCache.set(hash, embedding)
  return embedding
}

For production, replace the in-memory Map with Redis or a similar cache.

Rate Limits and Batching

OpenAI's embedding API allows batch input — pass an array of strings instead of making individual calls. This is faster and less likely to hit rate limits:

// Batch embed up to 2048 inputs at once
const response = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: chunks, // string[] — up to 2048 items
})
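For corpora with more than 2,048 chunks, split the input into batches first — a sketch (the 2,048 cap reflects OpenAI's documented per-request limit; verify against current docs before relying on it):

```typescript
// Split an arbitrary-length list of chunks into API-sized batches.
function toBatches<T>(items: T[], batchSize = 2048): T[][] {
  const batches: T[][] = []
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize))
  }
  return batches
}

// Each batch then becomes one embeddings.create({ input: batch }) call.
```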

Cost Optimization

At text-embedding-3-small pricing ($0.02/1M tokens):

  • 10,000 document chunks (~500 tokens each) = ~$0.10 to embed
  • 100,000 queries/month (~20 tokens each) = ~$0.04/month
  • Total embedding costs stay well under $10/month for most applications
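The arithmetic behind those bullets generalizes to a one-line estimator (the $0.02/1M default is the text-embedding-3-small price quoted in this article; check current pricing):

```typescript
// Estimated embedding cost in USD: total tokens × price per million tokens.
function embeddingCostUSD(
  chunkCount: number,
  avgTokensPerChunk: number,
  pricePerMTokens = 0.02, // text-embedding-3-small, per the table above
): number {
  return (chunkCount * avgTokensPerChunk / 1_000_000) * pricePerMTokens
}

console.log(embeddingCostUSD(10_000, 500)) // ≈ 0.10, matching the figure above
```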

The LLM generation step (GPT-4o, Claude) is where the real cost is. Optimize by:

  • Reducing top-K — fetch 3-5 documents instead of 10
  • Trimming context — only pass the most relevant portions of each chunk
  • Caching responses — cache answers for common/repeated questions
  • Using smaller models — GPT-4o-mini or Claude Haiku for simple Q&A

Monitoring

Track these metrics in production:

  • Retrieval latency — p50/p95 time for vector search queries
  • Similarity scores — average similarity of top-K results (low scores = poor retrieval)
  • Empty results rate — percentage of queries that return no results above threshold
  • LLM "I don't know" rate — how often the model can't answer from the context (indicates coverage gaps)
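Two of these metrics fall straight out of your query logs — a sketch, where the `QueryLog` shape is an assumption to adapt to whatever you actually record:

```typescript
// Monitoring sketch: compute p95 latency and empty-results rate from logs.
// QueryLog is a hypothetical record shape, not a library type.
interface QueryLog {
  latencyMs: number
  topSimilarity: number | null // null when no result cleared the threshold
}

function monitoringStats(logs: QueryLog[]) {
  const sorted = [...logs].sort((a, b) => a.latencyMs - b.latencyMs)
  const p95Index = Math.min(sorted.length - 1, Math.floor(sorted.length * 0.95))
  const empty = logs.filter((l) => l.topSimilarity === null).length
  return {
    p95LatencyMs: sorted[p95Index].latencyMs,
    emptyResultsRate: empty / logs.length,
  }
}
```

A rising empty-results rate is usually the earliest sign of a coverage gap in your knowledge base.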

Log every query, the retrieved documents, and the generated response. This data is invaluable for identifying failure patterns and improving your RAG pipeline over time. If you're using PostHog for analytics, track these as custom events to build dashboards around RAG quality.


RAG has evolved from an experimental technique to the standard architecture for building AI systems with custom knowledge. The stack is mature: OpenAI embeddings for encoding, pgvector on Supabase for storage and search, and the Vercel AI SDK for building the application layer. Start with the working example above, benchmark on your own data, and iterate on chunk size and retrieval strategy until your quality metrics hit target.


Sources

Research Papers

  • arXiv: Survey on Knowledge-Oriented RAG (2025)
  • arXiv: Static Word Embeddings for Sentence Semantics (2025)
  • arXiv: MTEB Massive Text Embedding Benchmark (2025 Update)
  • arXiv: Blending Learning to Rank and Dense Representations (2025)
  • arXiv: Text Embedding Evaluation Benchmark (2025)
  • arXiv: Generative Query Rewriting for Passage Retrieval (2025)

Documentation & Tools

  • OpenAI Embeddings API Docs
  • Supabase pgvector Guide
  • Vercel AI SDK Docs
  • Pinecone: What is a Vector Database?