Model Selection Tips
OpenAI (GPT)
- Best for: General purpose, function calling
- Pricing: Token-based (input + output)
- Context: 128K tokens (GPT-4o, GPT-4 Turbo); 16K tokens (GPT-3.5 Turbo)
Anthropic (Claude)
- Best for: Long documents, analysis, coding
- Pricing: Token-based
- Context: 200K tokens (all Claude 3 & 4 models)
- Latest: Claude Sonnet 4.5, Claude Opus 4
Open Source (Llama, Mistral, Phi)
- Best for: Cost control, data privacy, customization
- Pricing: Infrastructure costs only
- Deployment: Self-hosted or cloud
OpenAI Example
from openai import OpenAI
client = OpenAI(api_key="your-api-key")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)
print(response.choices[0].message.content)
Anthropic Claude Example
import anthropic
client = anthropic.Anthropic(api_key="your-api-key")
message = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ]
)
print(message.content[0].text)
Chat Completion
- Supports system, user, and assistant roles.
- Maintains conversation history for context.
- Can be used for customer support, virtual assistants, and interactive applications.
Text Completion
- Generates text from a prompt without conversation history.
- Useful for code generation, document drafting, and auto-complete features.
- Often used in IDEs and productivity tools.
Embeddings
- Converts text into high-dimensional vectors.
- Enables semantic search, clustering, and similarity matching.
- Used in Retrieval-Augmented Generation (RAG) and recommendation systems.
Function Calling
- Allows models to call external tools or APIs with structured arguments.
- Enables automation, data retrieval, and integration with other services.
- Supports workflows like tool-augmented agents and dynamic task execution.
Zero-Shot Prompting
Direct instruction without examples.
Classify the sentiment: "This product exceeded my expectations!"
- Simple and fast
- Works well for common tasks
- May struggle with complex or domain-specific tasks
Few-Shot Prompting
Provide examples before the actual task.
Classify sentiment as positive, negative, or neutral:
Review: "Amazing quality!" → Positive
Review: "Terrible experience." → Negative
Review: "It's okay, nothing special." → Neutral
Review: "Best purchase ever!" → ?
- Dramatically improves accuracy
- 3-5 examples usually sufficient
- Examples should match your use case
Chain-of-Thought Prompting
Ask the model to show its reasoning steps. Critical for math and logic.
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Let's think step by step:
1. Roger starts with 5 balls
2. He buys 2 cans with 3 balls each = 2 × 3 = 6 balls
3. Total = 5 + 6 = 11 balls
System Prompts
Set behavior and context for the entire conversation.
System: You are a Python expert who provides concise,
production-ready code with error handling. Always include
type hints and docstrings.
Prompt Templates
Summarize this document in {num_sentences} sentences,
focusing on {key_aspects}.
Document: {text}
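The template above can be filled with plain `str.format`; the keyword names match the `{num_sentences}`, `{key_aspects}`, and `{text}` slots, and the document text here is a placeholder:

```python
# Fill the prompt template with concrete values before sending it.
TEMPLATE = (
    "Summarize this document in {num_sentences} sentences, "
    "focusing on {key_aspects}.\n"
    "Document: {text}"
)

prompt = TEMPLATE.format(
    num_sentences=3,
    key_aspects="costs and risks",
    text="(placeholder document text)",
)
```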
Tokenization
Text is split into subword units before processing. One token ≠ one word.
Input: "Hello, world!"
Tokens: ["Hello", ",", " world", "!"]
Token count: 4
Input: "artificial intelligence"
Tokens: ["art", "ificial", " intelligence"]
Token count: 3
Input: "ChatGPT is amazing!"
Tokens: ["Chat", "GPT", " is", " amazing", "!"]
Token count: 5
- Common words: Usually 1 token
- Rare words: Split into multiple tokens
- Numbers/Special Chars: Often vary in tokenization
- Impacts:
- Cost: APIs charge per token
- Speed: More tokens = slower generation
- Context: Models have fixed token limits
Writing Efficiency
- Use common words over rare synonyms
- Avoid excessive formatting (markdown, JSON) when not needed
- Remove redundant whitespace
- Be concise in system prompts
Counting Tokens
# OpenAI (tiktoken)
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")
tokens = encoding.encode("Your text here")
print(f"Token count: {len(tokens)}")
# Anthropic (approximate): ~4 characters per token for English
text = "Your text here"
approx_tokens = len(text) / 4
Common Token Traps
- JSON formatting: Adds 20-30% overhead with braces, quotes, commas
- Code blocks: Indentation and syntax can double token count
- Repeated context: Don't resend full conversation history every time
- System prompts: Keep under 200 tokens; they're sent with every request
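The "don't resend full history" tip can be sketched as a simple trim: keep the system prompt plus only the most recent turns. `max_turns` is an illustrative knob, not a recommended value:

```python
# Keep the system prompt and the last N non-system messages.
def trim_history(messages: list[dict], max_turns: int = 6) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]
```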
When to Optimize
- High volume: >1,000 requests/day → optimize aggressively
- Long context: Near model limits (e.g., 120K/128K) → must optimize
- Low volume: <100 requests/day → don't over-optimize
Token Savings Example
❌ Inefficient (85 tokens):
"Please provide a comprehensive and detailed analysis..."
✅ Efficient (12 tokens):
"Analyze this data:"
Savings: 73 tokens = ~$0.002 per request × 10K requests = $20/month
| Database | Type | Scale | Pricing | Best For |
|---|---|---|---|---|
| Pinecone | Managed | Billions | $70+/mo | Production, zero ops |
| Weaviate | Open/Managed | Billions | Free (self-host) | Hybrid search, GraphQL |
| ChromaDB | Embedded | Millions | Free | Prototyping, local dev |
| Qdrant | Open/Managed | Billions | Free (self-host) | Performance, filtering |
| FAISS | Library | Billions | Free | Research, custom needs |
| Milvus | Open/Managed | Trillions | Free (self-host) | Massive scale, GPUs |
| pgvector | PostgreSQL ext | Millions | Free | Existing Postgres apps |
Key Features Comparison
- Hybrid Search: Pinecone, Weaviate, Qdrant (vector + keyword)
- Built-in Embeddings: Weaviate (auto-vectorize text)
- Metadata Filtering: All except FAISS
- Multi-tenancy: Pinecone, Weaviate, Qdrant
- GPU Acceleration: Milvus, FAISS
Decision Tree
- Just starting/prototyping? → ChromaDB (easiest setup)
- Already using Postgres? → pgvector (no new infrastructure)
- Production, don't want to manage infra? → Pinecone (fully managed)
- Need open-source + production-ready? → Weaviate or Qdrant
- Massive scale (>10B vectors)? → Milvus
- Research/custom algorithms? → FAISS
- Tight budget? → ChromaDB or self-hosted Qdrant
Performance Characteristics
| Database | Query Speed (1M vectors) | Setup Time |
|---|---|---|
| FAISS | <10ms | 5 min |
| ChromaDB | <50ms | 2 min |
| Qdrant | <20ms | 15 min |
| Pinecone | <50ms | 10 min (signup) |
| Weaviate | <30ms | 20 min |
Common Gotchas
- ChromaDB: Not for production at scale, no built-in auth
- FAISS: No persistence layer, you must manage data storage
- Pinecone: Can get expensive at scale ($0.096/1M queries)
- Milvus: Complex setup, needs DevOps expertise
- pgvector: Slower than specialized vector DBs at >1M vectors
Quick Start: ChromaDB
pip install chromadb
import chromadb
client = chromadb.Client()
collection = client.create_collection("docs")
# Add vectors
collection.add(
    documents=["This is doc 1", "This is doc 2"],
    ids=["id1", "id2"]
)
# Query
results = collection.query(
    query_texts=["find similar docs"],
    n_results=2
)
print(results)
Quick Start: Pinecone
pip install pinecone
from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key="your-key")
# Create index (create_index returns no handle; connect to it afterwards)
pc.create_index(
    name="my-index",
    dimension=1536,  # OpenAI embedding size
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)
index = pc.Index("my-index")
# Upsert vectors
index.upsert([
    ("id1", [0.1, 0.2, ...], {"text": "doc 1"}),
    ("id2", [0.3, 0.4, ...], {"text": "doc 2"})
])
# Query
results = index.query(
    vector=[0.1, 0.2, ...],
    top_k=5,
    include_metadata=True
)
Index Types Comparison
| Method | Speed | Accuracy (Recall) | Memory | Best Scale |
|---|---|---|---|---|
| Flat | Slow | 100% | Low | <100K vectors |
| HNSW | Very Fast | 95-99% | High | 1M-100M |
| IVF | Fast | 90-95% | Medium | 100M-1B |
| PQ | Fast | 85-90% | Very Low | 1B+ |
| LSH | Very Fast | 80-85% | Low | 100M+ |
Detailed Breakdown
Flat Index (Brute Force)
- How it works: Compares query to every vector. No optimization.
- Query time: O(n) - linear with dataset size
- Use when: <10K vectors, 100% accuracy required, baseline testing
- Popular in: FAISS
# FAISS Flat Index
import faiss
import numpy as np
dimension = 128
vectors = np.random.rand(10_000, dimension).astype("float32")
query = np.random.rand(1, dimension).astype("float32")
index = faiss.IndexFlatL2(dimension)
index.add(vectors)  # Add all vectors
distances, ids = index.search(query, k=5)
HNSW (Hierarchical Navigable Small World)
- How it works: Multi-layer graph. Navigate from top (sparse) to bottom (dense).
- Query time: O(log n) - logarithmic!
- Build time: Slow (hours for 10M vectors)
- Memory: 30-50% overhead per vector
- Recall tuning: Adjust `ef` (higher = better recall, slower queries)
- Use when: Need fast queries, can afford memory, dataset <100M
- Popular in: Pinecone, Weaviate, Qdrant (default), ChromaDB (default), FAISS
# FAISS HNSW
import faiss
index = faiss.IndexHNSWFlat(dimension, 32) # 32 = M (connections per node)
index.hnsw.efConstruction = 40 # Build quality
index.add(vectors)
index.hnsw.efSearch = 16 # Query quality (higher = better recall)
distances, ids = index.search(query, k=5)
IVF (Inverted File Index)
- How it works: Cluster vectors into groups (centroids). Search only nearest clusters.
- Query time: O(√n) - sublinear
- nprobe: Number of clusters to search (1-20% of total)
- Recall tradeoff: More nprobe = better recall but slower
- Use when: 100M-1B vectors, balanced speed/accuracy
- Popular in: FAISS, Milvus
# FAISS IVF
import faiss
nlist = 100 # Number of clusters
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
# Train on sample
index.train(training_vectors)
index.add(vectors)
# Query
index.nprobe = 10 # Search 10 clusters (10% of 100)
distances, ids = index.search(query, k=5)
PQ (Product Quantization)
- How it works: Compress vectors into smaller codes. Lossy compression.
- Memory savings: 8-32x reduction (1536D → 48-96 bytes)
- Accuracy loss: 10-15% lower recall than Flat
- Use when: Limited RAM, billion-scale, can tolerate lower recall
- Popular in: FAISS, often combined with IVF (IVFPQ)
# FAISS IVFPQ (IVF + Product Quantization)
import faiss
nlist = 100
m = 8 # Number of subquantizers
nbits = 8 # Bits per subquantizer
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)
index.train(training_vectors)
index.add(vectors)
index.nprobe = 10
distances, ids = index.search(query, k=5)
LSH (Locality-Sensitive Hashing)
- How it works: Hash similar vectors to same buckets. Probabilistic.
- Best for: Very high dimensions (>2000D), approximate search
- Tradeoff: Fast but lower recall (80-85%)
- Use when: Real-time requirements, can tolerate missed results
Choosing the Right Index
| Your Situation | Recommended Index |
|---|---|
| Testing/Development (<10K vectors) | Flat |
| Small-Medium (10K-1M vectors) | HNSW |
| Large (1M-100M vectors) | HNSW or IVF |
| Massive (100M-1B+ vectors) | IVF + PQ |
| Limited RAM | IVF + PQ or LSH |
| Need 99%+ recall | HNSW (high ef) or Flat |
| Ultra-fast queries (<10ms) | HNSW (GPU) or LSH |
Performance Benchmarks (1M vectors, 1536D)
| Index | Build Time | Query Time | Recall@10 | Memory |
|---|---|---|---|---|
| Flat | 1s | 100ms | 100% | 6GB |
| HNSW (M=32) | 10min | 2ms | 98% | 9GB |
| IVF (nlist=1000) | 5min | 8ms | 92% | 6GB |
| IVFPQ (m=8) | 5min | 5ms | 87% | 750MB |
Hybrid Approaches (Best of Both Worlds)
- IVF-HNSW: HNSW over the cluster centroids for coarse search, then IVF list scan for fine search
- IVFPQ: IVF clustering + PQ compression (most common for large scale)
- Two-stage: Fast approximate index → Rerank with exact distances
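The two-stage pattern can be shown in a toy NumPy sketch: a cheap coarse score over truncated dimensions stands in for a compressed index (such as IVFPQ), and survivors are reranked with exact full-precision distances. All sizes here are illustrative:

```python
# Two-stage retrieval: approximate candidate generation, exact rerank.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((10_000, 64)).astype("float32")
query = rng.random(64).astype("float32")

# Stage 1 (approximate): distances on a truncated 16-dim view,
# a stand-in for a compressed/ANN index.
coarse = np.linalg.norm(docs[:, :16] - query[:16], axis=1)
candidates = np.argsort(coarse)[:100]

# Stage 2 (exact): rerank the 100 candidates with full-precision distances.
exact = np.linalg.norm(docs[candidates] - query, axis=1)
reranked = candidates[np.argsort(exact)][:10]
```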
Tuning Tips
- HNSW M: 16 (fast), 32 (balanced), 64 (high recall). Higher = more memory.
- HNSW ef: 100-200 for build, 10-50 for search. Higher = better recall.
- IVF nlist: sqrt(N) is a good starting point. More clusters = finer granularity.
- IVF nprobe: 1-5% of nlist. Start with 5-10, tune based on recall needs.
- Always benchmark: Test recall@k on your specific data!
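"Benchmark recall@k" concretely means: take the exact top-k from a Flat index as ground truth, and measure what fraction the approximate index also returned. A minimal metric function:

```python
# recall@k: overlap between exact and approximate top-k result IDs.
def recall_at_k(exact_ids: list, approx_ids: list, k: int) -> float:
    return len(set(exact_ids[:k]) & set(approx_ids[:k])) / k
```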
Common Mistakes
- ❌ Using Flat index for >100K vectors (too slow)
- ❌ Not training IVF indexes properly (needs representative sample)
- ❌ Setting HNSW ef too low (poor recall)
- ❌ Not measuring recall (blindly trusting approximate results)
- ❌ Using PQ without understanding accuracy loss
RAG Pipeline
- Query Processing: Embed user question, extract keywords.
- Retrieval: Vector search (semantic) + Keyword search (hybrid).
- Reranking: Reorder results by true relevance.
- Context Assembly: Combine top-k docs with prompt.
- Generation: LLM produces answer with citations.
Chunking Strategies
- Fixed-Size: Split by token count (e.g., 512). Simple but may break context.
- Semantic: Split at natural boundaries (paragraphs). Preserves meaning.
- Recursive: Split by headers → paragraphs → sentences. Best for structured docs.
- Optimal Size: 400-800 tokens for most use cases.
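Fixed-size chunking with overlap can be sketched as below; whitespace-split words are a rough proxy for tokens (a real pipeline would count tokenizer tokens, e.g. with tiktoken), and the size/overlap values are the illustrative ones from the list above:

```python
# Fixed-size chunking with overlap so context spans chunk boundaries.
def chunk(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
```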
Hybrid Search
Combine dense (vector) and sparse (keyword) retrieval. Typically 70% semantic + 30% keyword.
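One common way to combine the two score lists is min-max normalization followed by a weighted blend; the 70/30 split below is the illustrative ratio from the text, not a universal constant:

```python
# Weighted fusion of semantic and keyword scores, keyed by doc ID.
def fuse(semantic: dict, keyword: dict, alpha: float = 0.7) -> dict:
    def norm(scores: dict) -> dict:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid divide-by-zero on flat scores
        return {k: (v - lo) / span for k, v in scores.items()}
    s, k = norm(semantic), norm(keyword)
    docs = set(s) | set(k)
    return {d: alpha * s.get(d, 0.0) + (1 - alpha) * k.get(d, 0.0) for d in docs}
```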
Reranking
Use a Cross-Encoder or Rerank API to re-score top results. Improves precision by 20-40%.
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([(query, doc.text) for doc in results])
- Model: Select from available LLMs (e.g., GPT-4, Claude, Llama).
- Context Window: Max tokens per request (e.g., 128K for GPT-4o).
- Temperature: Controls randomness/creativity (0.0-2.0).
- System Prompt: Sets assistant behavior and tone.
- Max Tokens: Output length limit.
Temperature Scale (0.0 - 2.0)
- 0.0 (Deterministic): Factual Q&A, Code, Data Extraction.
- 0.3 (Focused): Summaries, Technical writing.
- 0.7 (Balanced): General chat, Explanations. (Default)
- 1.0+ (Creative): Brainstorming, Marketing copy.
Other Parameters
- Top-p: Nucleus sampling (alternative to temp).
- Frequency Penalty: Reduces repetition of words.
- Presence Penalty: Encourages new topics.
Popular Models
- OpenAI text-embedding-3-small: Cost-effective, standard.
- OpenAI text-embedding-3-large: High precision.
- Cohere embed-english-v3.0: Specialized for search/clustering.
- Sentence Transformers: Open source, run locally.
Distance Metrics
- Cosine Similarity: Most common for text. Measures angle.
- Dot Product: Faster, sensitive to magnitude.
- Euclidean: Good for spatial data.
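The three metrics side by side in NumPy; note that for unit-normalized vectors, cosine similarity and dot product give the same ranking:

```python
# Cosine measures angle, dot product also reflects magnitude,
# Euclidean measures straight-line distance.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the length

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0 (same angle)
dot = a @ b                                               # 28.0
euclidean = np.linalg.norm(a - b)                         # ~3.74
```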
Vision + Text
- GPT-4o / GPT-4V: Image understanding + generation.
- Claude 3.5 Sonnet: Chart reading and UI analysis.
- Gemini 1.5 Pro: Video understanding and long context.
Streaming
Start showing results immediately to improve perceived latency.
Finetuning
✅ Do it for: Domain-specific language (medical/legal), consistent output formats, or high-volume tasks.
❌ Don't do it for: General knowledge, reasoning tasks (use RAG instead), or small datasets.
- Full Finetuning: Updates all parameters. Expensive.
- LoRA (Low-Rank Adaptation): Updates small adapters. Efficient.
- QLoRA: Quantized LoRA. Finetune large models on consumer GPUs.
Cost Optimization
- Model Cascade: Try cheap models first, fall back to expensive ones.
- Caching: Cache exact queries and embeddings.
- Batch Processing: Process non-urgent requests in batches (50% cheaper).
- Token Management: Truncate input, set max_tokens, use stop sequences.
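Exact-query caching from the list above can be sketched as a hash-keyed lookup; `call_llm` is a hypothetical stand-in for any provider call:

```python
# Cache completions by a hash of the full prompt; repeat queries are free.
import hashlib

_cache: dict[str, str] = {}

def cached_completion(prompt: str, call_llm) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)  # only pay for cache misses
    return _cache[key]
```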
Customer Support
RAG over help docs + Function calling for tickets. Critical: Accuracy and smooth handoff.
Code Generation
IDE auto-complete, bug fixing, refactoring. Best practice: Human review loop.
Document Q&A
Legal/Medical analysis. Critical: Citations and semantic chunking.
Data Extraction
Invoice processing, form filling. Use JSON mode for structured output.
Hallucinations
- Use RAG to ground responses
- Lower temperature (0.0-0.3)
- Request citations: "Cite your sources"
High Latency
- Use smaller models (Haiku, GPT-3.5)
- Enable streaming responses
- Reduce max_tokens output
Cost Overruns
- Cache common queries
- Use batch processing for non-urgent tasks
- Monitor tokens per request