🚀 Quick Start Guide
Getting started with LLMs, making your first API calls, and understanding common patterns.
Choosing Your Model

OpenAI (GPT)

  • Best for: General purpose, function calling
  • Pricing: Token-based (input + output)
  • Context: 128K tokens (GPT-4o, GPT-4 Turbo); 16K tokens (GPT-3.5 Turbo)

Anthropic (Claude)

  • Best for: Long documents, analysis, coding
  • Pricing: Token-based
  • Context: 200K tokens (all Claude 3 & 4 models)
  • Latest: Claude Sonnet 4.5, Claude Opus 4

Open Source (Llama, Mistral, Phi)

  • Best for: Cost control, data privacy, customization
  • Pricing: Infrastructure costs only
  • Deployment: Self-hosted or cloud
Your First API Call

OpenAI Example

from openai import OpenAI

client = OpenAI(api_key="your-api-key")  # or set the OPENAI_API_KEY env var

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)

Anthropic Claude Example

import anthropic

client = anthropic.Anthropic(api_key="your-api-key")
message = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ]
)

print(message.content[0].text)
Common Patterns

Chat Completion

  • Supports system, user, and assistant roles.
  • Maintains conversation history for context.
  • Can be used for customer support, virtual assistants, and interactive applications.
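Conversation history is just a growing list of role-tagged messages. A minimal sketch of the pattern (the `add_turn` helper is illustrative, not part of any SDK):

```python
# The whole list is sent as `messages` on every request, so the model
# sees prior turns as context.
history = [{"role": "system", "content": "You are a helpful assistant."}]

def add_turn(history, role, content):
    """Append one message to the running conversation history."""
    history.append({"role": role, "content": content})
    return history

add_turn(history, "user", "What is an LLM?")
add_turn(history, "assistant", "A large language model trained on text.")
```

Each assistant reply is appended before the next user turn, which is what lets the model answer follow-up questions.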

Text Completion

  • Generates text from a prompt without conversation history.
  • Useful for code generation, document drafting, and auto-complete features.
  • Often used in IDEs and productivity tools.

Embeddings

  • Converts text into high-dimensional vectors.
  • Enables semantic search, clustering, and similarity matching.
  • Used in Retrieval-Augmented Generation (RAG) and recommendation systems.

Function Calling

  • Allows models to call external tools or APIs with structured arguments.
  • Enables automation, data retrieval, and integration with other services.
  • Supports workflows like tool-augmented agents and dynamic task execution.
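The shape of function calling can be sketched offline: a tool schema in the OpenAI-style "tools" format, plus local dispatch of a simulated model response. The `get_weather` function and the `tool_call` payload here are invented for illustration:

```python
import json

# Tool schema the model is shown (OpenAI-style "tools" format)
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city):
    """Stand-in for a real weather API; returns canned data."""
    return {"city": city, "temp_c": 21}

# Suppose the model responded with this structured call (arguments arrive
# as a JSON string in chat APIs):
tool_call = {"name": "get_weather", "arguments": json.dumps({"city": "Paris"})}

dispatch = {"get_weather": get_weather}
args = json.loads(tool_call["arguments"])
result = dispatch[tool_call["name"]](**args)
```

In a real loop, `result` is sent back to the model as a tool message so it can compose the final answer.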
🛠️ Prompt Engineering
Techniques for crafting effective prompts including zero-shot, few-shot, chain-of-thought, and prompt templates.
Zero-Shot vs Few-Shot

Zero-Shot Prompting

Direct instruction without examples.

Classify the sentiment: "This product exceeded my expectations!"
  • Simple and fast
  • Works well for common tasks
  • May struggle with complex or domain-specific tasks

Few-Shot Prompting

Provide examples before the actual task.

Classify sentiment as positive, negative, or neutral:

Review: "Amazing quality!" → Positive
Review: "Terrible experience." → Negative
Review: "It's okay, nothing special." → Neutral
Review: "Best purchase ever!" → ?
  • Dramatically improves accuracy
  • 3-5 examples usually sufficient
  • Examples should match your use case
Chain-of-Thought (CoT)

Ask the model to show its reasoning steps. Critical for math and logic.

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. 
Each can has 3 tennis balls. How many tennis balls does he have now?

A: Let's think step by step:
1. Roger starts with 5 balls
2. He buys 2 cans with 3 balls each = 2 × 3 = 6 balls
3. Total = 5 + 6 = 11 balls
System Prompts & Templates

System Prompts

Set behavior and context for the entire conversation.

System: You are a Python expert who provides concise, 
production-ready code with error handling. Always include 
type hints and docstrings.

Prompt Templates

Summarize this document in {num_sentences} sentences, 
focusing on {key_aspects}.

Document: {text}
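The template above is plain Python string formatting; the filled-in values here are illustrative:

```python
template = (
    "Summarize this document in {num_sentences} sentences, "
    "focusing on {key_aspects}.\n\nDocument: {text}"
)

# Fill the placeholders per request
prompt = template.format(
    num_sentences=2,
    key_aspects="costs and risks",
    text="Q3 spending rose 12% while headcount stayed flat.",
)
```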
🔢 Tokenization
Understanding how text is split into tokens, token counts, and optimization strategies.
How Tokenization Works

Text is split into subword units before processing. One token ≠ one word.

Input: "Hello, world!"
Tokens: ["Hello", ",", " world", "!"]
Token count: 4

Input: "artificial intelligence"
Tokens: ["art", "ificial", " intelligence"]
Token count: 3

Input: "ChatGPT is amazing!"
Tokens: ["Chat", "GPT", " is", " amazing", "!"]
Token count: 5
  • Common words: Usually 1 token
  • Rare words: Split into multiple tokens
  • Numbers/Special Chars: Often vary in tokenization
  • Impacts:
    • Cost: APIs charge per token
    • Speed: More tokens = slower generation
    • Context: Models have fixed token limits
Token Efficiency Tips

Writing Efficiency

  • Use common words over rare synonyms
  • Avoid excessive formatting (markdown, JSON) when not needed
  • Remove redundant whitespace
  • Be concise in system prompts

Counting Tokens

# OpenAI (tiktoken)
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")
tokens = encoding.encode("Your text here")
print(f"Token count: {len(tokens)}")

# Anthropic (approximate)
# ~4 chars per token for English
char_count = len(text)
approx_tokens = char_count / 4

Common Token Traps

  • JSON formatting: Adds 20-30% overhead with braces, quotes, commas
  • Code blocks: Indentation and syntax can double token count
  • Repeated context: Don't resend full conversation history every time
  • System prompts: Keep under 200 tokens; they're sent with every request

When to Optimize

  • High volume: >1,000 requests/day → optimize aggressively
  • Long context: Near model limits (e.g., 120K/128K) → must optimize
  • Low volume: <100 requests/day → don't over-optimize

Token Savings Example

❌ Inefficient (85 tokens):
"Please provide a comprehensive and detailed analysis..."

✅ Efficient (12 tokens):
"Analyze this data:"

Savings: 73 tokens = ~$0.002 per request × 10K requests = $20/month
🗄️ Vector Databases
Popular vector databases, indexing methods, and setup examples for semantic search.
Popular Vector DBs
Database   Type             Scale      Pricing            Best For
Pinecone   Managed          Billions   $70+/mo            Production, zero ops
Weaviate   Open/Managed     Billions   Free (self-host)   Hybrid search, GraphQL
ChromaDB   Embedded         Millions   Free               Prototyping, local dev
Qdrant     Open/Managed     Billions   Free (self-host)   Performance, filtering
FAISS      Library          Billions   Free               Research, custom needs
Milvus     Open/Managed     Trillions  Free (self-host)   Massive scale, GPUs
pgvector   PostgreSQL ext   Millions   Free               Existing Postgres apps

Key Features Comparison

  • Hybrid Search: Pinecone, Weaviate, Qdrant (vector + keyword)
  • Built-in Embeddings: Weaviate (auto-vectorize text)
  • Metadata Filtering: All except FAISS
  • Multi-tenancy: Pinecone, Weaviate, Qdrant
  • GPU Acceleration: Milvus, FAISS

Decision Tree

  • Just starting/prototyping? → ChromaDB (easiest setup)
  • Already using Postgres? → pgvector (no new infrastructure)
  • Production, don't want to manage infra? → Pinecone (fully managed)
  • Need open-source + production-ready? → Weaviate or Qdrant
  • Massive scale (>10B vectors)? → Milvus
  • Research/custom algorithms? → FAISS
  • Tight budget? → ChromaDB or self-hosted Qdrant

Performance Characteristics

Database   Query Speed (1M vectors)   Setup Time
FAISS      <10ms                      5 min
ChromaDB   <50ms                      2 min
Qdrant     <20ms                      15 min
Pinecone   <50ms                      10 min (signup)
Weaviate   <30ms                      20 min

Common Gotchas

  • ChromaDB: Not for production at scale, no built-in auth
  • FAISS: No persistence layer, you must manage data storage
  • Pinecone: Can get expensive at scale ($0.096/1M queries)
  • Milvus: Complex setup, needs DevOps expertise
  • pgvector: Slower than specialized vector DBs at >1M vectors

Quick Start: ChromaDB

pip install chromadb

import chromadb
client = chromadb.Client()
collection = client.create_collection("docs")

# Add vectors
collection.add(
    documents=["This is doc 1", "This is doc 2"],
    ids=["id1", "id2"]
)

# Query
results = collection.query(
    query_texts=["find similar docs"],
    n_results=2
)
print(results)

Quick Start: Pinecone

pip install pinecone  # formerly published as pinecone-client

from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key="your-key")

# Create index; create_index does not return a handle, so open it afterwards
pc.create_index(
    name="my-index",
    dimension=1536,  # OpenAI embedding size
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)
index = pc.Index("my-index")

# Upsert vectors
index.upsert([
    ("id1", [0.1, 0.2, ...], {"text": "doc 1"}),
    ("id2", [0.3, 0.4, ...], {"text": "doc 2"})
])

# Query
results = index.query(
    vector=[0.1, 0.2, ...],
    top_k=5,
    include_metadata=True
)
Indexing Methods

Index Types Comparison

Method   Speed       Accuracy (Recall)   Memory     Best Scale
Flat     Slow        100%                Low        <100K vectors
HNSW     Very Fast   95-99%              High       1M-100M
IVF      Fast        90-95%              Medium     100M-1B
PQ       Fast        85-90%              Very Low   1B+
LSH      Very Fast   80-85%              Low        100M+

Detailed Breakdown

Flat Index (Brute Force)
  • How it works: Compares query to every vector. No optimization.
  • Query time: O(n) - linear with dataset size
  • Use when: <10K vectors, 100% accuracy required, baseline testing
  • Popular in: FAISS, ChromaDB (default)
# FAISS Flat Index
import faiss
index = faiss.IndexFlatL2(dimension)
index.add(vectors)  # Add all vectors
distances, ids = index.search(query, k=5)
HNSW (Hierarchical Navigable Small World)
  • How it works: Multi-layer graph. Navigate from top (sparse) to bottom (dense).
  • Query time: O(log n) - logarithmic!
  • Build time: Slow (hours for 10M vectors)
  • Memory: 30-50% overhead per vector
  • Recall tuning: Adjust `ef` (higher = better recall, slower queries)
  • Use when: Need fast queries, can afford memory, dataset <100M
  • Popular in: Pinecone, Weaviate, Qdrant (default), FAISS
# FAISS HNSW
import faiss
index = faiss.IndexHNSWFlat(dimension, 32)  # 32 = M (connections per node)
index.hnsw.efConstruction = 40  # Build quality
index.add(vectors)

index.hnsw.efSearch = 16  # Query quality (higher = better recall)
distances, ids = index.search(query, k=5)
IVF (Inverted File Index)
  • How it works: Cluster vectors into groups (centroids). Search only nearest clusters.
  • Query time: O(√n) - sublinear
  • nprobe: Number of clusters to search (1-20% of total)
  • Recall tradeoff: More nprobe = better recall but slower
  • Use when: 100M-1B vectors, balanced speed/accuracy
  • Popular in: FAISS, Milvus
# FAISS IVF
import faiss
nlist = 100  # Number of clusters
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

# Train on sample
index.train(training_vectors)
index.add(vectors)

# Query
index.nprobe = 10  # Search 10 clusters (10% of 100)
distances, ids = index.search(query, k=5)
PQ (Product Quantization)
  • How it works: Compress vectors into smaller codes. Lossy compression.
  • Memory savings: 8-32x reduction (1536D → 48-96 bytes)
  • Accuracy loss: 10-15% lower recall than Flat
  • Use when: Limited RAM, billion-scale, can tolerate lower recall
  • Popular in: FAISS, often combined with IVF (IVFPQ)
# FAISS IVFPQ (IVF + Product Quantization)
import faiss
nlist = 100
m = 8  # Number of subquantizers
nbits = 8  # Bits per subquantizer

quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)

index.train(training_vectors)
index.add(vectors)
index.nprobe = 10
distances, ids = index.search(query, k=5)
LSH (Locality-Sensitive Hashing)
  • How it works: Hash similar vectors to same buckets. Probabilistic.
  • Best for: Very high dimensions (>2000D), approximate search
  • Tradeoff: Fast but lower recall (80-85%)
  • Use when: Real-time requirements, can tolerate missed results

Choosing the Right Index

Your Situation                       Recommended Index
Testing/Development (<10K vectors)   Flat
Small-Medium (10K-1M vectors)        HNSW
Large (1M-100M vectors)              HNSW or IVF
Massive (100M-1B+ vectors)           IVF + PQ
Limited RAM                          IVF + PQ or LSH
Need 99%+ recall                     HNSW (high ef) or Flat
Ultra-fast queries (<10ms)           HNSW (GPU) or LSH

Performance Benchmarks (1M vectors, 1536D)

Index              Build Time   Query Time   Recall@10   Memory
Flat               1s           100ms        100%        6GB
HNSW (M=32)        10min        2ms          98%         9GB
IVF (nlist=1000)   5min         8ms          92%         6GB
IVFPQ (m=8)        5min         5ms          87%         750MB

Hybrid Approaches (Best of Both Worlds)

  • IVF-HNSW: HNSW for coarse search, IVF for fine search
  • IVFPQ: IVF clustering + PQ compression (most common for large scale)
  • Two-stage: Fast approximate index → Rerank with exact distances

Tuning Tips

  • HNSW M: 16 (fast), 32 (balanced), 64 (high recall). Higher = more memory.
  • HNSW ef: 100-200 for build, 10-50 for search. Higher = better recall.
  • IVF nlist: sqrt(N) is a good starting point. More clusters = finer granularity.
  • IVF nprobe: 1-5% of nlist. Start with 5-10, tune based on recall needs.
  • Always benchmark: Test recall@k on your specific data!

Common Mistakes

  • ❌ Using Flat index for >100K vectors (too slow)
  • ❌ Not training IVF indexes properly (needs representative sample)
  • ❌ Setting HNSW ef too low (poor recall)
  • ❌ Not measuring recall (blindly trusting approximate results)
  • ❌ Using PQ without understanding accuracy loss
🔍 RAG (Retrieval-Augmented Generation)
Retrieval-Augmented Generation combines information retrieval with text generation to produce more accurate, up-to-date, and contextually relevant responses.
RAG Architecture
  1. Query Processing: Embed user question, extract keywords.
  2. Retrieval: Vector search (semantic) + Keyword search (hybrid).
  3. Reranking: Reorder results by true relevance.
  4. Context Assembly: Combine top-k docs with prompt.
  5. Generation: LLM produces answer with citations.
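The pipeline above can be sketched end to end. In this toy version, keyword overlap stands in for real embedding retrieval so the example runs offline, and the final LLM call is left out:

```python
docs = [
    "Refunds are processed within 5 business days.",
    "Shipping is free on orders over $50.",
    "Our support line is open 9am-5pm EST.",
]

def retrieve(query, docs, k=2):
    """Toy retrieval: rank docs by word overlap with the query
    (a real system would use vector similarity here)."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:k]

query = "how long do refunds take"
context = "\n".join(retrieve(query, docs))

# Context assembly: top-k docs prepended to the prompt for generation
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The assembled `prompt` is what gets sent to the LLM in the Generation step.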
Chunking Strategies
  • Fixed-Size: Split by token count (e.g., 512). Simple but may break context.
  • Semantic: Split at natural boundaries (paragraphs). Preserves meaning.
  • Recursive: Split by headers → paragraphs → sentences. Best for structured docs.
  • Optimal Size: 400-800 tokens for most use cases.
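A minimal fixed-size chunker with overlap, using words as a stand-in for tokens (real systems would count tokens with a tokenizer):

```python
def chunk_text(text, chunk_size=50, overlap=10):
    """Fixed-size chunking by word count, with overlap so content
    spanning a chunk boundary isn't lost entirely."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

chunks = chunk_text("word " * 120)  # 120 words -> 3 overlapping chunks
```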
Retrieval & Reranking

Hybrid Search

Combine dense (vector) and sparse (keyword) retrieval. Typically 70% semantic + 30% keyword.
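The 70/30 split is a weighted score fusion; a sketch, assuming both score lists are already normalized to [0, 1]:

```python
def hybrid_score(semantic, keyword, alpha=0.7):
    """Weighted fusion of a dense (semantic) and sparse (keyword) score."""
    return alpha * semantic + (1 - alpha) * keyword

# (doc_id, semantic_score, keyword_score) -- values are illustrative
candidates = [("doc1", 0.9, 0.2), ("doc2", 0.5, 0.95)]
ranked = sorted(candidates, key=lambda c: -hybrid_score(c[1], c[2]))
```

Tune `alpha` per corpus: keyword-heavy domains (IDs, part numbers) usually want a lower value.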

Reranking

Use a Cross-Encoder or Rerank API to re-score top results. Improves precision by 20-40%.

from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([(query, doc.text) for doc in results])
⚙️ Model Configuration
Key settings, temperature, model selection, and multimodal features.
Model Configuration
  • Model: Select from available LLMs (e.g., GPT-4, Claude, Llama).
  • Context Window: Max tokens per request (e.g., 128K for GPT-4o).
  • Temperature: Controls randomness/creativity (0.0-2.0).
  • System Prompt: Sets assistant behavior and tone.
  • Max Tokens: Output length limit.
Temperature Scale (0.0 - 2.0)

  • 0.0 (Deterministic): Factual Q&A, Code, Data Extraction.
  • 0.3 (Focused): Summaries, Technical writing.
  • 0.7 (Balanced): General chat, Explanations. (Default)
  • 1.0+ (Creative): Brainstorming, Marketing copy.

Other Parameters

  • Top-p: Nucleus sampling (alternative to temp).
  • Frequency Penalty: Reduces repetition of words.
  • Presence Penalty: Encourages new topics.
Models & Metrics

Popular Models

  • OpenAI text-embedding-3-small: Cost-effective, standard.
  • OpenAI text-embedding-3-large: High precision.
  • Cohere embed-english-v3.0: Specialized for search/clustering.
  • Sentence Transformers: Open source, run locally.

Distance Metrics

  • Cosine Similarity: Most common for text. Measures angle.
  • Dot Product: Faster, sensitive to magnitude.
  • Euclidean: Good for spatial data.
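The difference between cosine and dot product shows up with vectors that point the same way but differ in length; a quick numpy illustration:

```python
import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 4.0, 4.0])  # same direction as a, twice the magnitude

dot = float(a @ b)             # grows with magnitude
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # angle only: 1.0
```

This is why cosine is the default for text embeddings: it ignores vector length and compares direction only. (With unit-normalized embeddings, the two metrics rank identically.)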
Multimodal & Streaming

Vision + Text

  • GPT-4o / GPT-4V: Image understanding + generation.
  • Claude 3.5 Sonnet: Chart reading and UI analysis.
  • Gemini 1.5 Pro: Video understanding and long context.

Streaming

Start showing results immediately to improve perceived latency.

🚀 Model Deployment & Optimization
Finetuning, methods, and cost optimization for LLMs in production.
When to Finetune

✅ Do it for: Domain-specific language (medical/legal), consistent output formats, or high volume tasks.

❌ Don't do it for: General knowledge, reasoning tasks (use RAG instead), or small datasets.

Methods (LoRA vs Full)
  • Full Finetuning: Updates all parameters. Expensive.
  • LoRA (Low-Rank Adaptation): Updates small adapters. Efficient.
  • QLoRA: Quantized LoRA. Finetune large models on consumer GPUs.
Cost Optimization
  • Model Cascade: Try cheap models first, fallback to expensive.
  • Caching: Cache exact queries and embeddings.
  • Batch Processing: Process non-urgent requests in batches (50% cheaper).
  • Token Management: Truncate input, set max_tokens, use stop sequences.
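Exact-query caching is often the cheapest win on the list above. A sketch using the standard library (`cached_answer` and the counter are illustrative; the counter stands in for a paid model call):

```python
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=1024)
def cached_answer(prompt):
    calls["count"] += 1                # stands in for the expensive API call
    return f"(model answer to: {prompt})"

cached_answer("What is RAG?")
cached_answer("What is RAG?")          # identical prompt: served from cache
```

For production, the same idea is usually backed by Redis or similar, keyed on a hash of (model, prompt, parameters) so different temperatures don't collide.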
🛠️ Use Cases & Troubleshooting
Applications and solutions to common LLM issues.
Applications

Customer Support

RAG over help docs + Function calling for tickets. Critical: Accuracy and smooth handoff.

Code Generation

IDE auto-complete, bug fixing, refactoring. Best practice: Human review loop.

Document Q&A

Legal/Medical analysis. Critical: Citations and semantic chunking.

Data Extraction

Invoice processing, form filling. Use JSON mode for structured output.

Common Issues & Solutions

Hallucinations

  • Use RAG to ground responses
  • Lower temperature (0.0-0.3)
  • Request citations: "Cite your sources"

High Latency

  • Use smaller models (Haiku, GPT-3.5)
  • Enable streaming responses
  • Reduce max_tokens output

Cost Overruns

  • Cache common queries
  • Use batch processing for non-urgent tasks
  • Monitor tokens per request