Model Selection Tips
OpenAI (GPT)
- Best for: General purpose, function calling
- Pricing: Token-based (input + output)
- Context: 128K tokens (GPT-4o, GPT-4 Turbo); 16K tokens (GPT-3.5 Turbo)
Anthropic (Claude)
- Best for: Long documents, analysis, coding
- Pricing: Token-based
- Context: 200K tokens (all Claude 3 & 4 models)
- Latest: Claude Sonnet 4.5, Claude Opus 4
Open Source (Llama, Mistral, Phi)
- Best for: Cost control, data privacy, customization
- Pricing: Infrastructure costs only
- Deployment: Self-hosted or cloud
OpenAI Example
from openai import OpenAI
client = OpenAI(api_key="your-api-key")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)
print(response.choices[0].message.content)
Anthropic Claude Example
import anthropic
client = anthropic.Anthropic(api_key="your-api-key")
message = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ]
)
print(message.content[0].text)
Chat Completion
- Supports system, user, and assistant roles.
- Maintains conversation history for context.
- Can be used for customer support, virtual assistants, and interactive applications.
Text Completion
- Generates text from a prompt without conversation history.
- Useful for code generation, document drafting, and auto-complete features.
- Often used in IDEs and productivity tools.
Embeddings
- Converts text into high-dimensional vectors.
- Enables semantic search, clustering, and similarity matching.
- Used in Retrieval-Augmented Generation (RAG) and recommendation systems.
Function Calling
- Allows models to call external tools or APIs with structured arguments.
- Enables automation, data retrieval, and integration with other services.
- Supports workflows like tool-augmented agents and dynamic task execution.
Zero-Shot Prompting
Direct instruction without examples.
Classify the sentiment: "This product exceeded my expectations!"
- Simple and fast
- Works well for common tasks
- May struggle with complex or domain-specific tasks
Few-Shot Prompting
Provide examples before the actual task.
Classify sentiment as positive, negative, or neutral:
Review: "Amazing quality!" → Positive
Review: "Terrible experience." → Negative
Review: "It's okay, nothing special." → Neutral
Review: "Best purchase ever!" → ?
- Dramatically improves accuracy
- 3-5 examples usually sufficient
- Examples should match your use case
Chain-of-Thought Prompting
Ask the model to show its reasoning steps. Critical for math and logic.
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Let's think step by step:
1. Roger starts with 5 balls
2. He buys 2 cans with 3 balls each = 2 × 3 = 6 balls
3. Total = 5 + 6 = 11 balls
System Prompts
Set behavior and context for the entire conversation.
System: You are a Python expert who provides concise,
production-ready code with error handling. Always include
type hints and docstrings.
Prompt Templates
Summarize this document in {num_sentences} sentences,
focusing on {key_aspects}.
Document: {text}
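The template above can be filled with plain `str.format`; the keyword names match the `{num_sentences}`, `{key_aspects}`, and `{text}` slots, and the document text here is a placeholder:

```python
# Fill the prompt template with concrete values before sending it.
TEMPLATE = (
    "Summarize this document in {num_sentences} sentences, "
    "focusing on {key_aspects}.\n"
    "Document: {text}"
)

prompt = TEMPLATE.format(
    num_sentences=3,
    key_aspects="costs and risks",
    text="(placeholder document text)",
)
```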
Tokenization
Text is split into subword units before processing. One token ≠ one word.
Input: "Hello, world!"
Tokens: ["Hello", ",", " world", "!"]
Token count: 4
Input: "artificial intelligence"
Tokens: ["art", "ificial", " intelligence"]
Token count: 3
Input: "ChatGPT is amazing!"
Tokens: ["Chat", "GPT", " is", " amazing", "!"]
Token count: 5
- Common words: Usually 1 token
- Rare words: Split into multiple tokens
- Numbers/Special Chars: Often vary in tokenization
- Impacts:
- Cost: APIs charge per token
- Speed: More tokens = slower generation
- Context: Models have fixed token limits
Writing Efficiency
- Use common words over rare synonyms
- Avoid excessive formatting (markdown, JSON) when not needed
- Remove redundant whitespace
- Be concise in system prompts
Counting Tokens
# OpenAI (tiktoken)
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")
tokens = encoding.encode("Your text here")
print(f"Token count: {len(tokens)}")
# Anthropic (approximate): ~4 characters per token for English
text = "Your text here"
approx_tokens = len(text) / 4
Common Token Traps
- JSON formatting: Adds 20-30% overhead with braces, quotes, commas
- Code blocks: Indentation and syntax can double token count
- Repeated context: Don't resend full conversation history every time
- System prompts: Keep under 200 tokens; they're sent with every request
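The "don't resend full history" tip can be sketched as a simple trim: keep the system prompt plus only the most recent turns. `max_turns` is an illustrative knob, not a recommended value:

```python
# Keep the system prompt and the last N non-system messages.
def trim_history(messages: list[dict], max_turns: int = 6) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]
```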
When to Optimize
- High volume: >1,000 requests/day → optimize aggressively
- Long context: Near model limits (e.g., 120K/128K) → must optimize
- Low volume: <100 requests/day → don't over-optimize
Token Savings Example
❌ Inefficient (85 tokens):
"Please provide a comprehensive and detailed analysis..."
✅ Efficient (12 tokens):
"Analyze this data:"
Savings: 73 tokens = ~$0.002 per request × 10K requests = $20/month
| Database | Type | Scale | Pricing | Best For |
|---|---|---|---|---|
| Pinecone | Managed | Billions | $70+/mo | Production, zero ops |
| Weaviate | Open/Managed | Billions | Free (self-host) | Hybrid search, GraphQL |
| ChromaDB | Embedded | Millions | Free | Prototyping, local dev |
| Qdrant | Open/Managed | Billions | Free (self-host) | Performance, filtering |
| FAISS | Library | Billions | Free | Research, custom needs |
| Milvus | Open/Managed | Trillions | Free (self-host) | Massive scale, GPUs |
| pgvector | PostgreSQL ext | Millions | Free | Existing Postgres apps |
Key Features Comparison
- Hybrid Search: Pinecone, Weaviate, Qdrant (vector + keyword)
- Built-in Embeddings: Weaviate (auto-vectorize text)
- Metadata Filtering: All except FAISS
- Multi-tenancy: Pinecone, Weaviate, Qdrant
- GPU Acceleration: Milvus, FAISS
Decision Tree
- Just starting/prototyping? → ChromaDB (easiest setup)
- Already using Postgres? → pgvector (no new infrastructure)
- Production, don't want to manage infra? → Pinecone (fully managed)
- Need open-source + production-ready? → Weaviate or Qdrant
- Massive scale (>10B vectors)? → Milvus
- Research/custom algorithms? → FAISS
- Tight budget? → ChromaDB or self-hosted Qdrant
Performance Characteristics
| Database | Query Speed (1M vectors) | Setup Time |
|---|---|---|
| FAISS | <10ms | 5 min |
| ChromaDB | <50ms | 2 min |
| Qdrant | <20ms | 15 min |
| Pinecone | <50ms | 10 min (signup) |
| Weaviate | <30ms | 20 min |
Common Gotchas
- ChromaDB: Not for production at scale, no built-in auth
- FAISS: No persistence layer, you must manage data storage
- Pinecone: Can get expensive at scale ($0.096/1M queries)
- Milvus: Complex setup, needs DevOps expertise
- pgvector: Slower than specialized vector DBs at >1M vectors
Quick Start: ChromaDB
pip install chromadb
import chromadb
client = chromadb.Client()
collection = client.create_collection("docs")
# Add vectors
collection.add(
    documents=["This is doc 1", "This is doc 2"],
    ids=["id1", "id2"]
)
# Query
results = collection.query(
    query_texts=["find similar docs"],
    n_results=2
)
print(results)
Quick Start: Pinecone
pip install pinecone
from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key="your-key")
# Create index (create_index returns no handle; connect to it afterwards)
pc.create_index(
    name="my-index",
    dimension=1536,  # OpenAI embedding size
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)
index = pc.Index("my-index")
# Upsert vectors
index.upsert([
    ("id1", [0.1, 0.2, ...], {"text": "doc 1"}),
    ("id2", [0.3, 0.4, ...], {"text": "doc 2"})
])
# Query
results = index.query(
    vector=[0.1, 0.2, ...],
    top_k=5,
    include_metadata=True
)
Index Types Comparison
| Method | Speed | Accuracy (Recall) | Memory | Best Scale |
|---|---|---|---|---|
| Flat | Slow | 100% | Low | <100K vectors |
| HNSW | Very Fast | 95-99% | High | 1M-100M |
| IVF | Fast | 90-95% | Medium | 100M-1B |
| PQ | Fast | 85-90% | Very Low | 1B+ |
| LSH | Very Fast | 80-85% | Low | 100M+ |
Detailed Breakdown
Flat Index (Brute Force)
- How it works: Compares query to every vector. No optimization.
- Query time: O(n) - linear with dataset size
- Use when: <10K vectors, 100% accuracy required, baseline testing
- Popular in: FAISS
# FAISS Flat Index
import faiss
import numpy as np
dimension = 128
vectors = np.random.rand(10_000, dimension).astype("float32")
query = np.random.rand(1, dimension).astype("float32")
index = faiss.IndexFlatL2(dimension)
index.add(vectors)  # Add all vectors
distances, ids = index.search(query, k=5)
HNSW (Hierarchical Navigable Small World)
- How it works: Multi-layer graph. Navigate from top (sparse) to bottom (dense).
- Query time: O(log n) - logarithmic!
- Build time: Slow (hours for 10M vectors)
- Memory: 30-50% overhead per vector
- Recall tuning: Adjust `ef` (higher = better recall, slower queries)
- Use when: Need fast queries, can afford memory, dataset <100M
- Popular in: Pinecone, Weaviate, Qdrant (default), ChromaDB (default), FAISS
# FAISS HNSW
import faiss
index = faiss.IndexHNSWFlat(dimension, 32) # 32 = M (connections per node)
index.hnsw.efConstruction = 40 # Build quality
index.add(vectors)
index.hnsw.efSearch = 16 # Query quality (higher = better recall)
distances, ids = index.search(query, k=5)
IVF (Inverted File Index)
- How it works: Cluster vectors into groups (centroids). Search only nearest clusters.
- Query time: O(√n) - sublinear
- nprobe: Number of clusters to search (1-20% of total)
- Recall tradeoff: More nprobe = better recall but slower
- Use when: 100M-1B vectors, balanced speed/accuracy
- Popular in: FAISS, Milvus
# FAISS IVF
import faiss
nlist = 100 # Number of clusters
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
# Train on sample
index.train(training_vectors)
index.add(vectors)
# Query
index.nprobe = 10 # Search 10 clusters (10% of 100)
distances, ids = index.search(query, k=5)
PQ (Product Quantization)
- How it works: Compress vectors into smaller codes. Lossy compression.
- Memory savings: 8-32x reduction (1536D → 48-96 bytes)
- Accuracy loss: 10-15% lower recall than Flat
- Use when: Limited RAM, billion-scale, can tolerate lower recall
- Popular in: FAISS, often combined with IVF (IVFPQ)
# FAISS IVFPQ (IVF + Product Quantization)
import faiss
nlist = 100
m = 8 # Number of subquantizers
nbits = 8 # Bits per subquantizer
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)
index.train(training_vectors)
index.add(vectors)
index.nprobe = 10
distances, ids = index.search(query, k=5)
LSH (Locality-Sensitive Hashing)
- How it works: Hash similar vectors to same buckets. Probabilistic.
- Best for: Very high dimensions (>2000D), approximate search
- Tradeoff: Fast but lower recall (80-85%)
- Use when: Real-time requirements, can tolerate missed results
Choosing the Right Index
| Your Situation | Recommended Index |
|---|---|
| Testing/Development (<10K vectors) | Flat |
| Small-Medium (10K-1M vectors) | HNSW |
| Large (1M-100M vectors) | HNSW or IVF |
| Massive (100M-1B+ vectors) | IVF + PQ |
| Limited RAM | IVF + PQ or LSH |
| Need 99%+ recall | HNSW (high ef) or Flat |
| Ultra-fast queries (<10ms) | HNSW (GPU) or LSH |
Performance Benchmarks (1M vectors, 1536D)
| Index | Build Time | Query Time | Recall@10 | Memory |
|---|---|---|---|---|
| Flat | 1s | 100ms | 100% | 6GB |
| HNSW (M=32) | 10min | 2ms | 98% | 9GB |
| IVF (nlist=1000) | 5min | 8ms | 92% | 6GB |
| IVFPQ (m=8) | 5min | 5ms | 87% | 750MB |
Hybrid Approaches (Best of Both Worlds)
- IVF-HNSW: HNSW over the cluster centroids for coarse search, then IVF list scan for fine search
- IVFPQ: IVF clustering + PQ compression (most common for large scale)
- Two-stage: Fast approximate index → Rerank with exact distances
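The two-stage pattern can be shown in a toy NumPy sketch: a cheap coarse score over truncated dimensions stands in for a compressed index (such as IVFPQ), and survivors are reranked with exact full-precision distances. All sizes here are illustrative:

```python
# Two-stage retrieval: approximate candidate generation, exact rerank.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((10_000, 64)).astype("float32")
query = rng.random(64).astype("float32")

# Stage 1 (approximate): distances on a truncated 16-dim view,
# a stand-in for a compressed/ANN index.
coarse = np.linalg.norm(docs[:, :16] - query[:16], axis=1)
candidates = np.argsort(coarse)[:100]

# Stage 2 (exact): rerank the 100 candidates with full-precision distances.
exact = np.linalg.norm(docs[candidates] - query, axis=1)
reranked = candidates[np.argsort(exact)][:10]
```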
Tuning Tips
- HNSW M: 16 (fast), 32 (balanced), 64 (high recall). Higher = more memory.
- HNSW ef: 100-200 for build, 10-50 for search. Higher = better recall.
- IVF nlist: sqrt(N) is a good starting point. More clusters = finer granularity.
- IVF nprobe: 1-5% of nlist. Start with 5-10, tune based on recall needs.
- Always benchmark: Test recall@k on your specific data!
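"Benchmark recall@k" concretely means: take the exact top-k from a Flat index as ground truth, and measure what fraction the approximate index also returned. A minimal metric function:

```python
# recall@k: overlap between exact and approximate top-k result IDs.
def recall_at_k(exact_ids: list, approx_ids: list, k: int) -> float:
    return len(set(exact_ids[:k]) & set(approx_ids[:k])) / k
```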
Common Mistakes
- ❌ Using Flat index for >100K vectors (too slow)
- ❌ Not training IVF indexes properly (needs representative sample)
- ❌ Setting HNSW ef too low (poor recall)
- ❌ Not measuring recall (blindly trusting approximate results)
- ❌ Using PQ without understanding accuracy loss
RAG Pipeline
- Query Processing: Embed user question, extract keywords.
- Retrieval: Vector search (semantic) + Keyword search (hybrid).
- Reranking: Reorder results by true relevance.
- Context Assembly: Combine top-k docs with prompt.
- Generation: LLM produces answer with citations.
Chunking Strategies
- Fixed-Size: Split by token count (e.g., 512). Simple but may break context.
- Semantic: Split at natural boundaries (paragraphs). Preserves meaning.
- Recursive: Split by headers → paragraphs → sentences. Best for structured docs.
- Optimal Size: 400-800 tokens for most use cases.
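Fixed-size chunking with overlap can be sketched as below; whitespace-split words are a rough proxy for tokens (a real pipeline would count tokenizer tokens, e.g. with tiktoken), and the size/overlap values are the illustrative ones from the list above:

```python
# Fixed-size chunking with overlap so context spans chunk boundaries.
def chunk(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
```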
Hybrid Search
Combine dense (vector) and sparse (keyword) retrieval. Typically 70% semantic + 30% keyword.
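One common way to combine the two score lists is min-max normalization followed by a weighted blend; the 70/30 split below is the illustrative ratio from the text, not a universal constant:

```python
# Weighted fusion of semantic and keyword scores, keyed by doc ID.
def fuse(semantic: dict, keyword: dict, alpha: float = 0.7) -> dict:
    def norm(scores: dict) -> dict:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid divide-by-zero on flat scores
        return {k: (v - lo) / span for k, v in scores.items()}
    s, k = norm(semantic), norm(keyword)
    docs = set(s) | set(k)
    return {d: alpha * s.get(d, 0.0) + (1 - alpha) * k.get(d, 0.0) for d in docs}
```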
Reranking
Use a Cross-Encoder or Rerank API to re-score top results. Improves precision by 20-40%.
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([(query, doc.text) for doc in results])
- Model: Select from available LLMs (e.g., GPT-4, Claude, Llama).
- Context Window: Max tokens per request (e.g., 128K for GPT-4o).
- Temperature: Controls randomness/creativity (0.0-2.0).
- System Prompt: Sets assistant behavior and tone.
- Max Tokens: Output length limit.
Temperature Scale (0.0 - 2.0)
- 0.0 (Deterministic): Factual Q&A, Code, Data Extraction.
- 0.3 (Focused): Summaries, Technical writing.
- 0.7 (Balanced): General chat, Explanations. (Default)
- 1.0+ (Creative): Brainstorming, Marketing copy.
Other Parameters
- Top-p: Nucleus sampling (alternative to temp).
- Frequency Penalty: Reduces repetition of words.
- Presence Penalty: Encourages new topics.
Popular Models
- OpenAI text-embedding-3-small: Cost-effective, standard.
- OpenAI text-embedding-3-large: High precision.
- Cohere embed-english-v3.0: Specialized for search/clustering.
- Sentence Transformers: Open source, run locally.
Distance Metrics
- Cosine Similarity: Most common for text. Measures angle.
- Dot Product: Faster, sensitive to magnitude.
- Euclidean: Good for spatial data.
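The three metrics side by side in NumPy; note that for unit-normalized vectors, cosine similarity and dot product give the same ranking:

```python
# Cosine measures angle, dot product also reflects magnitude,
# Euclidean measures straight-line distance.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the length

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0 (same angle)
dot = a @ b                                               # 28.0
euclidean = np.linalg.norm(a - b)                         # ~3.74
```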
Vision + Text
- GPT-4o / GPT-4V: Image understanding + generation.
- Claude 3.5 Sonnet: Chart reading and UI analysis.
- Gemini 1.5 Pro: Video understanding and long context.
Streaming
Start showing results immediately to improve perceived latency.
Finetuning
✅ Do it for: Domain-specific language (medical/legal), consistent output formats, or high-volume tasks.
❌ Don't do it for: General knowledge, reasoning tasks (use RAG instead), or small datasets.
- Full Finetuning: Updates all parameters. Expensive.
- LoRA (Low-Rank Adaptation): Updates small adapters. Efficient.
- QLoRA: Quantized LoRA. Finetune large models on consumer GPUs.
Cost Optimization
- Model Cascade: Try cheap models first, fall back to expensive ones.
- Caching: Cache exact queries and embeddings.
- Batch Processing: Process non-urgent requests in batches (50% cheaper).
- Token Management: Truncate input, set max_tokens, use stop sequences.
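Exact-query caching from the list above can be sketched as a hash-keyed lookup; `call_llm` is a hypothetical stand-in for any provider call:

```python
# Cache completions by a hash of the full prompt; repeat queries are free.
import hashlib

_cache: dict[str, str] = {}

def cached_completion(prompt: str, call_llm) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)  # only pay for cache misses
    return _cache[key]
```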
Customer Support
RAG over help docs + Function calling for tickets. Critical: Accuracy and smooth handoff.
Code Generation
IDE auto-complete, bug fixing, refactoring. Best practice: Human review loop.
Document Q&A
Legal/Medical analysis. Critical: Citations and semantic chunking.
Data Extraction
Invoice processing, form filling. Use JSON mode for structured output.
Hallucinations
- Use RAG to ground responses
- Lower temperature (0.0-0.3)
- Request citations: "Cite your sources"
High Latency
- Use smaller models (Haiku, GPT-3.5)
- Enable streaming responses
- Reduce max_tokens output
Cost Overruns
- Cache common queries
- Use batch processing for non-urgent tasks
- Monitor tokens per request