Central Tendency
- Mean: Average of all values → sum of values / count
- Median: Middle value when sorted (robust to outliers)
- Mode: Most frequently occurring value
Spread & Variability
- Variance: Average squared deviation from mean
- Standard Deviation: Square root of variance (same units as data)
- Range: Max - Min
- Interquartile Range (IQR): Q3 - Q1 (middle 50% of data)
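These summary statistics map directly onto NumPy calls; a minimal sketch with a made-up sample:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = data.mean()         # 5.0
median = np.median(data)   # 4.5 (middle of the sorted values)
variance = data.var()      # 4.0 (population variance)
std = data.std()           # 2.0 (same units as the data)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1              # 1.5 (spread of the middle 50%)
```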
Probability Concepts
- Probability: P(event) = favorable outcomes / total outcomes (0 to 1)
- Conditional Probability: P(A|B) = probability of A given B occurred
- Independence: Events don't affect each other → P(A and B) = P(A) × P(B)
- Bayes' Theorem: Update beliefs with new evidence
P(A|B) = P(B|A) × P(A) / P(B)
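A worked Bayes update with invented numbers (a rare condition and an imperfect test):

```python
# Hypothetical numbers: 1% prevalence, 99% true-positive rate, 5% false-positive rate
p_a = 0.01                 # P(A): prior probability of the condition
p_b_given_a = 0.99         # P(B|A): positive test given the condition
p_b_given_not_a = 0.05     # P(B|not A): false positive rate

# Total probability of a positive test: P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # 0.167: most positives are false positives
```

The counterintuitive result (a positive test means only a ~17% chance of the condition) is why updating the prior matters.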
Distributions
- Normal (Gaussian): Bell curve, symmetric around mean
- Uniform: All outcomes equally likely
- Binomial: Success/failure over n trials
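Sampling from these distributions is a quick way to see their shapes; a sketch using NumPy's random generator:

```python
import numpy as np

rng = np.random.default_rng(42)

normal = rng.normal(loc=0, scale=1, size=10_000)    # bell curve around 0
uniform = rng.uniform(low=0, high=1, size=10_000)   # flat on [0, 1)
binomial = rng.binomial(n=10, p=0.5, size=10_000)   # successes in 10 trials

# Sample means land near the theoretical values: ~0, ~0.5, ~5
print(normal.mean(), uniform.mean(), binomial.mean())
```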
Vectors
Definition: Ordered list of numbers [x₁, x₂, ..., xₙ]
Operations:
- Addition: Add corresponding elements
- Scalar multiplication: Multiply each element by a number
- Dot product: Sum of element-wise products → measures similarity
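A quick sketch of these operations in NumPy; normalizing the dot product by the vectors' lengths gives cosine similarity, one common way it measures similarity:

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

print(u + v)        # [5 7 9]  (element-wise addition)
print(2 * u)        # [2 4 6]  (scalar multiplication)
dot = np.dot(u, v)  # 1*4 + 2*5 + 3*6 = 32

# Cosine similarity: dot product divided by the product of the lengths
cos_sim = dot / (np.linalg.norm(u) * np.linalg.norm(v))
print(round(cos_sim, 3))  # 0.975 (nearly parallel vectors)
```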
Matrices
Definition: 2D array of numbers (rows × columns)
Operations:
- Addition/Subtraction: Element-wise (same dimensions)
- Multiplication: Row × Column (inner dimensions must match)
- Transpose: Flip rows → columns (Aᵀ)
Key Concepts
- Identity Matrix (I): Diagonal of 1s, rest 0s
- Dimensions: Shape of data (e.g., 100 samples × 5 features)
- Matrix multiplication: Transforms data (used in neural networks)
- Inverse: A⁻¹ such that A × A⁻¹ = I (used in solving equations)
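A small NumPy check of these identities (the matrix values are arbitrary):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
I = np.eye(2)                  # 2×2 identity matrix

A_inv = np.linalg.inv(A)
# A @ A_inv recovers the identity (up to floating-point error)
print(np.allclose(A @ A_inv, I))  # True

# Solving A x = b via the inverse (np.linalg.solve is preferred in practice)
b = np.array([5.0, 11.0])
x = A_inv @ b                  # x = [1, 2]
print(np.allclose(A @ x, b))      # True
```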
Variables & Data Types
# Variable Assignment
x = 42           # Integer
y = 3.14         # Float
name = "ML"      # String
is_valid = True  # Boolean
# Dynamic Typing
x = "Now I'm a string"
type(x)          # <class 'str'>
Lists, Tuples & Dictionaries
# List (Mutable, Ordered)
nums = [1, 2, 3]
nums.append(4) # Add to end
nums.pop() # Remove last
nums[0] = 10 # Modify
# Tuple (Immutable, Ordered)
coords = (10, 20)
x, y = coords # Unpacking
# Dictionary (Key-Value)
data = {"id": 1, "val": 0.5}
data["id"] # Access: 1
data.keys() # Get keys
data.values() # Get values
Loops & Control Flow
# If-else
if x > 10:
    print("Large")
elif x > 5:
    print("Medium")
else:
    print("Small")
# For Loop
for i in range(5):
    print(i)
# Loop over List
fruits = ["apple", "banana"]
for fruit in fruits:
    print(fruit)
# While Loop
count = 0
while count < 5:
    count += 1
# List Comprehension
squares = [x**2 for x in range(5)]
Functions
def greet(name, greeting="Hello"):
    """Function with default parameter"""
    return f"{greeting}, {name}!"

result = greet("Alice")      # "Hello, Alice!"
result = greet("Bob", "Hi")  # "Hi, Bob!"
File I/O
# Read file
with open('data.txt', 'r') as f:
    content = f.read()        # Entire file as one string
# Or read a list of lines (fresh handle; read() already consumed the file)
with open('data.txt', 'r') as f:
    lines = f.readlines()
# Write file
with open('output.txt', 'w') as f:
    f.write("Hello, World!\n")
Core Purpose: Fast numeric arrays; foundation of most numeric work in Python
Array Creation
import numpy as np
# From list
arr = np.array([1, 2, 3, 4])
# Special arrays
zeros = np.zeros((3, 4))         # 3×4 array of zeros
ones = np.ones((2, 3))           # 2×3 array of ones
range_arr = np.arange(0, 10, 2)  # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5)  # 5 evenly spaced values
random = np.random.rand(3, 3)    # 3×3 random values in [0, 1)
Vectorized Math (Fast Element-wise Operations)
# Element-wise operations (no loops needed!)
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
a + b        # [5, 7, 9]
a * 2        # [2, 4, 6]
a ** 2       # [1, 4, 9]
a * b        # [4, 10, 18] (element-wise)
np.sqrt(a)   # [1.0, 1.414, 1.732]
np.exp(a)    # Exponential
np.log(a)    # Natural log
# Aggregations
arr.sum()    # Sum all elements
arr.mean()   # Average
arr.std()    # Standard deviation
arr.min()    # Minimum
arr.max()    # Maximum
Linear Algebra Operations
# Dot product (vector multiplication)
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.dot(a, b)      # 1*4 + 2*5 + 3*6 = 32
# Matrix multiplication
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
np.matmul(A, B)   # or A @ B
# Transpose
A.T               # Flip rows and columns
# Matrix operations
np.linalg.inv(A)  # Inverse
np.linalg.det(A)  # Determinant
np.linalg.eig(A)  # Eigenvalues/eigenvectors
Indexing & Slicing
arr = np.array([10, 20, 30, 40, 50])
arr[0]         # 10 (first element)
arr[-1]        # 50 (last element)
arr[1:4]       # [20, 30, 40] (slice)
arr[arr > 25]  # [30, 40, 50] (boolean indexing)
# 2D arrays
matrix = np.array([[1, 2, 3], [4, 5, 6]])
matrix[0, 1]   # 2 (row 0, col 1)
matrix[:, 1]   # [2, 5] (all rows, col 1)
Reshaping
arr = np.arange(12)
arr.reshape(3, 4)   # 3×4 matrix
arr.reshape(-1, 1)  # Column vector (auto-calculate rows)
arr.flatten()       # 1D array
Core Purpose: Tabular data manipulation (DataFrame, Series)
Loading Data
import pandas as pd
# From CSV (most common)
df = pd.read_csv('data.csv')
# With options
df = pd.read_csv('data.csv',
                 sep=',',          # Delimiter
                 header=0,         # Row for column names
                 index_col=0,      # Column to use as index
                 na_values=['?'])  # Custom missing values
# Other formats
df = pd.read_excel('data.xlsx')
df = pd.read_json('data.json')
df = pd.read_sql(query, connection)
Inspection Methods
# Quick overview
df.head()          # First 5 rows
df.tail(3)         # Last 3 rows
df.info()          # Column types, non-null counts, memory usage
df.describe()      # Statistics for numeric columns
df.shape           # (rows, columns)
df.columns         # Column names
df.dtypes          # Data types per column
df.isnull().sum()  # Count missing values per column
Data Cleaning
# Handle missing values
df.dropna() # Remove rows with any NaN
df.dropna(subset=['age']) # Drop rows where 'age' is NaN
df.fillna(0) # Fill NaN with 0
df.fillna(df.mean(numeric_only=True))  # Fill with column means
df['age'] = df['age'].fillna(df['age'].median())  # Fill with the median
# Boolean filtering (masks)
df[df['age'] > 25] # Rows where age > 25
df[df['city'] == 'NYC'] # Rows where city is NYC
df[(df['age'] > 25) & (df['salary'] > 50000)] # Multiple conditions
df[df['name'].str.contains('Alice')] # String matching
# Remove duplicates
df.drop_duplicates()
df.drop_duplicates(subset=['email'])
Selection & Manipulation
# Select columns
df['name'] # Single column (Series)
df[['name', 'age']] # Multiple columns (DataFrame)
# Select rows
df.iloc[0] # First row by position
df.loc[0] # First row by label
df.iloc[0:3] # First 3 rows
# Add/modify columns
df['age_squared'] = df['age'] ** 2
df['full_name'] = df['first'] + ' ' + df['last']
# Drop columns
df.drop('city', axis=1, inplace=True)
df.drop(['col1', 'col2'], axis=1)
# Sort
df.sort_values('age', ascending=False)
df.sort_values(['city', 'age'])
# Group by and aggregate
df.groupby('city')['age'].mean()
df.groupby('city').agg({'age': 'mean', 'salary': 'sum'})
Core Purpose: Low-level plotting library for customizable visualizations
Basic Plot Types
import matplotlib.pyplot as plt
# Line plot
plt.plot([1, 2, 3, 4], [2, 4, 6, 8], marker='o', linestyle='--', color='blue')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot')
plt.legend(['Data'])
plt.grid(True)
plt.show()
# Scatter plot
plt.scatter(x, y, c='red', s=100, alpha=0.5)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Scatter Plot')
plt.show()
# Histogram
plt.hist(data, bins=30, color='green', alpha=0.7, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Distribution')
plt.show()
# Bar chart
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 56, 78]
plt.bar(categories, values, color='purple')
plt.xlabel('Category')
plt.ylabel('Count')
plt.title('Bar Chart')
plt.show()
Customization
# Labels and titles
plt.xlabel('X Label', fontsize=12)
plt.ylabel('Y Label', fontsize=12)
plt.title('My Plot', fontsize=14, fontweight='bold')
# Legend
plt.plot(x, y1, label='Line 1')
plt.plot(x, y2, label='Line 2')
plt.legend(loc='upper right')
# Grid and style
plt.grid(True, linestyle='--', alpha=0.5)
plt.style.use('seaborn-v0_8') # or 'ggplot', 'fivethirtyeight'
# Figure size
plt.figure(figsize=(10, 6))
# Save figure
plt.savefig('plot.png', dpi=300, bbox_inches='tight')
Subplots
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].plot(x, y)
axes[0, 0].set_title('Plot 1')
axes[0, 1].scatter(x, y)
axes[0, 1].set_title('Plot 2')
axes[1, 0].hist(data, bins=20)
axes[1, 0].set_title('Plot 3')
axes[1, 1].bar(categories, values)
axes[1, 1].set_title('Plot 4')
plt.tight_layout()
plt.show()
Core Purpose: High-level statistical visualization built on Matplotlib; great for EDA patterns and relationships
Setup
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('whitegrid') # 'darkgrid', 'white', 'dark', 'ticks'
sns.set_palette('husl') # Color palette
Distribution Plots
# Histogram with KDE (kernel density estimate)
sns.histplot(data=df, x='age', kde=True, bins=20)
plt.show()
# Box plot (quartiles, outliers)
sns.boxplot(data=df, x='city', y='age')
plt.show()
# Violin plot (distribution shape)
sns.violinplot(data=df, x='city', y='age')
plt.show()
# Distribution plot
sns.displot(data=df, x='age', kind='kde')
plt.show()
Relationship Plots
# Scatter plot with hue and size encodings
sns.scatterplot(data=df, x='age', y='salary', hue='city', size='experience')
plt.show()
# Regression plot (scatter with fitted regression line)
sns.regplot(data=df, x='age', y='salary')
plt.show()
# Pair plot (all pairwise relationships)
sns.pairplot(df, hue='city')
plt.show()
# Joint plot (scatter + distributions)
sns.jointplot(data=df, x='age', y='salary', kind='scatter')
plt.show()
Correlation & Heatmaps
# Correlation heatmap
corr = df.corr(numeric_only=True)
sns.heatmap(corr,
            annot=True,       # Show values
            cmap='coolwarm',  # Color scheme
            center=0,         # Center colormap at 0
            square=True,      # Square cells
            linewidths=1)     # Cell borders
plt.title('Correlation Matrix')
plt.show()
Categorical Plots
import numpy as np
# Bar plot (aggregates with the mean by default)
sns.barplot(data=df, x='city', y='age', estimator=np.mean)
plt.show()
# Count plot (frequency)
sns.countplot(data=df, x='city')
plt.show()
# Point plot (with error bars)
sns.pointplot(data=df, x='city', y='age')
plt.show()
# Strip plot (all points)
sns.stripplot(data=df, x='city', y='age', jitter=True)
plt.show()
Advanced: FacetGrid for Multi-panel Plots
# Create a grid of panels based on categorical variables
g = sns.FacetGrid(df, col='city', row='gender', height=4)
g.map(sns.histplot, 'age', bins=20)
plt.show()
Problem-Solving Framework
- Understand: What is the question asking? What data do I have?
- Break Down: Divide complex problems into smaller steps
- Pattern Recognition: Look for similarities to known problems
- Test Assumptions: Verify your understanding with simple examples
- Iterate: Start simple, then add complexity
ML-Specific Thinking
- Data First: Always explore data before modeling (distributions, missing values, outliers)
- Baseline: Start with simple models (e.g., mean prediction) before complex ones
- Validation: Split data (train/test) to evaluate performance honestly
- Feature Engineering: Transform raw data into meaningful inputs
- Debugging: Print shapes, check for NaNs, visualize intermediate results
Common Pitfalls to Avoid
- ❌ Assuming data is clean (always check!)
- ❌ Overfitting (model memorizes training data)
- ❌ Data leakage (test data influences training)
- ❌ Ignoring class imbalance
- ❌ Not scaling features (important for many algorithms)
💡 Pro Tips for Success
- NumPy: Use vectorized operations instead of loops (100x faster!)
- Pandas: Chain operations with method chaining: df.dropna().groupby('city')['age'].mean()
- Matplotlib: Use plt.style.use() for consistent aesthetics
- Seaborn: Perfect for quick EDA; automatically handles DataFrames
- Practice with real datasets (Kaggle, UCI ML Repository)
- Use Jupyter notebooks for interactive exploration
- Google errors and read documentation (it's part of learning!)
# 1. Load data
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
# 2. Inspect
df.head()
df.info()
df.describe()
df.isnull().sum()
# 3. Visualize (EDA)
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(df)
sns.heatmap(df.corr(numeric_only=True), annot=True)
sns.histplot(df['target'], kde=True)
# 4. Clean
df = df.dropna()
df = df[df['age'] > 0] # Filter outliers
# 5. Prepare
X = df.drop('target', axis=1) # Features
y = df['target'] # Target
# 6. Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 7. Train model (example)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
# 8. Evaluate
score = model.score(X_test, y_test)
print(f"R² Score: {score}")
# 9. Predict
predictions = model.predict(X_test)
File-Based Data Sources
CSV (Comma-Separated Values)
import pandas as pd
# Basic load
df = pd.read_csv('data.csv')
# Advanced options
df = pd.read_csv('data.csv',
                 sep=',',                     # Delimiter (can be '\t', '|', etc.)
                 header=0,                    # Row number for column names
                 index_col=0,                 # Column to use as row index
                 na_values=['NA', '?', '-'],  # Custom missing values
                 parse_dates=['date'],        # Convert to datetime
                 encoding='utf-8',            # Handle special characters
                 low_memory=False)            # For large files
# Read specific columns only
df = pd.read_csv('data.csv', usecols=['name', 'age', 'salary'])
# Read in chunks (for huge files)
chunk_iter = pd.read_csv('large_file.csv', chunksize=10000)
for chunk in chunk_iter:
    process(chunk)
Excel Files
# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# Multiple sheets
excel_file = pd.ExcelFile('data.xlsx')
print(excel_file.sheet_names)
# All sheets as dictionary
all_sheets = pd.read_excel('data.xlsx', sheet_name=None)
JSON
# Simple JSON
df = pd.read_json('data.json')
# Nested JSON
df = pd.read_json('data.json', orient='records')
# From API response
import requests
response = requests.get('https://api.example.com/data')
df = pd.DataFrame(response.json())
# Normalize nested JSON
from pandas import json_normalize
df = json_normalize(data, record_path=['items'])
Parquet (Fast & Efficient)
# Read Parquet
df = pd.read_parquet('data.parquet')
# Write Parquet (compressed)
df.to_parquet('output.parquet', compression='gzip')
Database Sources
SQL Databases (Relational)
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:password@localhost:5432/dbname')
# Read table or query
df = pd.read_sql_table('customers', engine)
# Execute SQL query
query = """
SELECT c.name, c.age, o.total
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.date >= '2024-01-01'
"""
df = pd.read_sql_query(query, engine)
# Write to database
df.to_sql('new_table', engine, if_exists='replace', index=False)
NoSQL Databases
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
db = client['mydatabase']
collection = db['customers']
# Query and convert to DataFrame
cursor = collection.find({'age': {'$gt': 25}})
df = pd.DataFrame(list(cursor))
# Redis
import redis
import json
r = redis.Redis(host='localhost', port=6379, db=0)
data = json.loads(r.get('user:1'))
1. Initial Data Inspection
# Load data
df = pd.read_csv('data.csv')
# Quick overview
print(df.shape) # (rows, columns)
print(df.head()) # First 5 rows
print(df.info()) # Data types, non-null counts
print(df.describe()) # Statistics for numeric columns
print(df.isnull().sum()) # Missing values per column
print(df.duplicated().sum()) # Duplicate rows
2. Handling Missing Values
# Detection
missing_pct = (df.isnull().sum() / len(df)) * 100
print(missing_pct[missing_pct > 0])
# Visualization
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.show()
# Strategies
df.dropna() # Remove rows with any NaN
df.dropna(subset=['age']) # Remove rows with NaN in 'age'
df.fillna(0) # Fill with constant
df['age'] = df['age'].fillna(df['age'].median())  # Fill with median
df['price'] = df['price'].ffill()  # Forward fill (fillna(method=...) is deprecated)
df['salary'] = df.groupby('dept')['salary'].transform(lambda x: x.fillna(x.mean()))
3. Handling Duplicates
# Check
print(f"Duplicate rows: {df.duplicated().sum()}")
# Remove (keep first occurrence)
df_clean = df.drop_duplicates()
# Remove based on specific columns
df_clean = df.drop_duplicates(subset=['email', 'phone'])
4. Handling Outliers
# IQR method
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
# Remove
df_clean = df[(df['age'] >= lower) & (df['age'] <= upper)]
# Cap (Winsorization)
df['age'] = df['age'].clip(lower=lower, upper=upper)
# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df['age']))
outliers = df[z_scores > 3]
5. Data Type Corrections
# Convert to numeric
df['age'] = pd.to_numeric(df['age'], errors='coerce')
# Convert to datetime
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')
# Convert to categorical
df['category'] = df['category'].astype('category')
6. Handling Inconsistent Data
# Standardize text
df['city'] = df['city'].str.strip().str.title()
# Fix typos
df['city'] = df['city'].replace({'NY': 'New York', 'NYC': 'New York'})
# Remove special characters
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)
Univariate Analysis (Single Variable)
import seaborn as sns
import matplotlib.pyplot as plt
# Distribution of a numeric variable
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
sns.histplot(df['age'], kde=True)
plt.subplot(1, 2, 2)
sns.boxplot(x=df['age'])
plt.show()
# Categorical variable
sns.countplot(data=df, x='category')
plt.xticks(rotation=45)
plt.show()
Bivariate Analysis (Two Variables)
# Numeric vs numeric
sns.scatterplot(data=df, x='age', y='salary', hue='department')
plt.show()
# Categorical vs numeric
sns.boxplot(data=df, x='department', y='salary')
plt.show()
# Categorical vs categorical
pd.crosstab(df['department'], df['gender']).plot(kind='bar')
plt.show()
Multivariate Analysis
# Correlation heatmap (numeric columns only)
corr = df.select_dtypes(include=[np.number]).corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
# Pair plot
sns.pairplot(df, hue='category')
plt.show()
1. Feature Engineering
# Date features
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['is_weekend'] = df['date'].dt.dayofweek.isin([5, 6]).astype(int)
# Mathematical transformations
df['age_squared'] = df['age'] ** 2
df['bmi'] = df['weight'] / (df['height'] ** 2)
# Binning
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 60, 100],
                         labels=['Child', 'Adult', 'Senior'])
# Interaction features
df['income_per_age'] = df['income'] / df['age']
df['age_income'] = df['age'] * df['income']
2. Feature Selection
# Correlation with the target
corr = df.corr()['target'].abs().sort_values(ascending=False)
top_features = corr[1:6].index.tolist()
# Variance threshold
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.1)
X_high_var = selector.fit_transform(X)
# Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE
rfe = RFE(model, n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)
3. Dimensionality Reduction
# PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # Keep 95% of variance
X_pca = pca.fit_transform(X_scaled)
# t-SNE (visualization)
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X_scaled)
4. Scaling and Normalization
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# StandardScaler (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Fit on train only!
# MinMaxScaler (range [0, 1])
scaler = MinMaxScaler()
# RobustScaler (robust to outliers)
scaler = RobustScaler()
5. Encoding Categorical Variables
# Label Encoding (Ordinal)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['grade'] = le.fit_transform(df['grade'])
# One-Hot Encoding (Nominal)
df = pd.get_dummies(df, columns=['city'], drop_first=True)
# Target Encoding
target_means = df.groupby('city')['salary'].mean()
df['city_encoded'] = df['city'].map(target_means)
Complete Data Cleaning Pipeline
def clean_data(df):
    # 1. Initial inspection
    print("Shape:", df.shape)
    # 2. Remove duplicates
    df = df.drop_duplicates()
    # 3. Handle missing values
    num_cols = df.select_dtypes(include=[np.number]).columns
    for col in num_cols:
        df[col] = df[col].fillna(df[col].median())
    cat_cols = df.select_dtypes(include=['object']).columns
    for col in cat_cols:
        df[col] = df[col].fillna(df[col].mode()[0])
    # 4. Remove outliers (IQR)
    for col in num_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        df = df[~((df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR)))]
    # 5. Data type corrections
    date_cols = [c for c in df.columns if 'date' in c.lower()]
    for col in date_cols:
        df[col] = pd.to_datetime(df[col], errors='coerce')
    # 6. Encode
    df = pd.get_dummies(df, columns=cat_cols, drop_first=True)
    return df
Data Cleaning Checklist
- ✅ Load data: Check source and format
- ✅ Initial inspection: Shape, info, missing values
- ✅ Duplicates: Identify and remove
- ✅ Missing Data: Drop, fill, or impute
- ✅ Outliers: Detect and handle
- ✅ Data Types: Fix dates, numbers, categories
- ✅ Standardize text: Strip, lowercase, fix typos
- ✅ Feature engineering: Create new features
- ✅ Encode: Label or One-Hot encoding
- ✅ Scale: Standard or MinMax (fit on train only!)
- ✅ Split: Train/Test split BEFORE scaling
Best Practices
- Always keep a copy of raw data: Never modify original.
- Document cleaning steps: Use notebooks/comments.
- Visualize before and after: Verify cleaning.
- Handle missing data thoughtfully: Don't just drop.
- Scale after splitting: Prevent data leakage.
- Validate cleaning: Check if results make sense.
Common Pitfalls to Avoid
- ❌ Data Leakage: Using test stats for training scaling.
- ❌ Dropping too much data: Losing info.
- ❌ Ignoring Types: Treating numbers as strings.
- ❌ Over-cleaning: Removing valid outliers.
- ❌ Not handling categorical: Models need numbers.
- ❌ Scaling before splitting: Leakage risk.
Quick Reference: Data Format Comparison
| Format | Speed | Use Case |
|---|---|---|
| CSV | Slow | Simple, Sharing |
| Excel | Slow | Business Reports |
| JSON | Medium | APIs, Nested |
| Parquet | Fast | Big Data |
| SQL | Fast | Structured Queries |
Supervised Learning
Definition: Learning from labeled data (input-output pairs). Goal is to learn a mapping function f(X) → y.
- Regression: Predict continuous values (price, temperature).
- Classification: Predict discrete categories (spam/not spam).
Unsupervised Learning
Definition: Learning from unlabeled data to find hidden patterns.
- Clustering: Group similar data points.
- Dimensionality Reduction: Reduce feature space.
- Anomaly Detection: Identify unusual patterns.
Semi-Supervised Learning
Mix of labeled and unlabeled data. Useful when labeling is expensive but data is abundant.
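One way to sketch this in scikit-learn is `SelfTrainingClassifier`: samples labeled -1 are treated as unlabeled, and the wrapped model iteratively labels them from its own confident predictions. The data and 80/20 label split below are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # a linearly separable target

# Pretend 80% of labels are unknown (-1 marks unlabeled samples)
y_partial = y.copy()
y_partial[rng.random(200) < 0.8] = -1

model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, y_partial)
print(model.score(X, y))  # accuracy against the full labels
```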
Reinforcement Learning (RL)
Learning through interaction with an environment (trial and error) to maximize cumulative reward.
# Simple Q-Learning example (tabular RL)
import numpy as np
# Initialize Q-table (states ร actions)
Q = np.zeros((n_states, n_actions))
# Training loop
# Assumes a Gym-style `env` and hyperparameters epsilon, alpha, gamma are defined
for episode in range(1000):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(Q[state])        # Exploit
        # Take action
        next_state, reward, done, _ = env.step(action)
        # Q-learning update
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
Regression vs Classification
| Aspect | Regression | Classification |
|---|---|---|
| Output | Continuous number | Discrete category |
| Examples | Price, temperature | Spam/ham, cat/dog |
| Evaluation | MSE, RMSE, R² | Accuracy, F1, ROC-AUC |
When to use Regression
- Predicting quantities (sales, revenue)
- Forecasting (time series)
- Estimating continuous relationships
When to use Classification
- Yes/No decisions (approve loan)
- Multi-class problems (image recognition)
- Ranking/prioritization
1. Linear Regression
Fits a straight line to predict continuous values. Good for linear relationships, interpretable.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
print(f"R² Score: {model.score(X_test, y_test):.2f}")
2. Logistic Regression
Predicts probability of binary outcomes (0 or 1). Good baseline for classification.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred_proba = model.predict_proba(X_test)
3. Decision Trees
Tree of if-then-else rules. Handles non-linear data, interpretable, but prone to overfitting.
from sklearn.tree import DecisionTreeClassifier, plot_tree
model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train, y_train)
# Visualize
plot_tree(model, filled=True)
4. Random Forest
Ensemble of decision trees. High accuracy, robust, handles non-linear data.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
# Feature importance
importances = model.feature_importances_
5. K-Nearest Neighbors (KNN)
Classifies based on K closest examples. Simple, no training, but slow on large data.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
# Scale features first!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_scaled, y)
6. Support Vector Machines (SVM)
Finds optimal hyperplane. Great for high-dimensional data (text/images).
from sklearn.svm import SVC
model = SVC(kernel='rbf', C=1.0)
model.fit(X_train_scaled, y_train)
7. Naive Bayes
Probabilistic classifier assuming feature independence. Fast, great for text.
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train, y_train)
8. Gradient Boosting (XGBoost)
Sequential ensemble of weak learners. State-of-the-art accuracy for tabular data.
from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)
1. K-Means Clustering
Groups data into K clusters based on similarity.
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
clusters = model.fit_predict(X)
# Elbow method for choosing K
inertias = [KMeans(n_clusters=k).fit(X).inertia_ for k in range(1, 10)]
2. Principal Component Analysis (PCA)
Reduces dimensionality while preserving variance.
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)  # Keep 95% of variance
X_pca = pca.fit_transform(X_scaled)
3. Autoencoders
Neural network that learns compressed representation. Good for anomaly detection.
# Simple autoencoder (Keras)
from tensorflow import keras
from tensorflow.keras import layers
input_img = keras.Input(shape=(input_dim,))
encoded = layers.Dense(32, activation='relu')(input_img)
decoded = layers.Dense(input_dim, activation='sigmoid')(encoded)
autoencoder = keras.Model(input_img, decoded)
Comparison Table
| Algorithm | Type | Pros | Cons |
|---|---|---|---|
| Linear Reg | Reg | Fast, Interpretable | Linear only |
| Random Forest | Both | Accurate, Robust | Slow, Opaque |
| XGBoost | Both | High Accuracy | Overfitting risk |
| SVM | Both | High Dimensions | Slow on large data |
Selection Guide
By Problem Type:
- Regression: Linear Regression (baseline) → Random Forest/XGBoost
- Binary Class: Logistic Regression (baseline) → Random Forest/XGBoost
- Multi-class: Random Forest or Neural Networks
- Clustering: K-Means (if K known) or DBSCAN
By Data Size:
- Small (<1k): Logistic Reg, Naive Bayes, KNN
- Medium: Random Forest, SVM, XGBoost
- Large (>100k): Neural Networks, XGBoost
Regression Metrics
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_test, y_pred)  # Lower is better
r2 = r2_score(y_test, y_pred)             # 1.0 is perfect
Classification Metrics
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)              # Balances precision/recall
auc = roc_auc_score(y_test, y_pred_proba)  # Threshold-independent
Cross-Validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
print(f"CV Acc: {scores.mean():.2%}")
Best Practices
- Start Simple: Baseline with Linear/Logistic Regression.
- Understand Data: Visualize and clean before modeling.
- Feature Engineering: Often more impactful than model choice.
- Always Split: Train/Test split to prevent leakage.
- Scale Features: Essential for KNN, SVM, Neural Nets.
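The split-then-scale practice above can be sketched with a `Pipeline`, which guarantees the scaler is only ever fitted on training data (synthetic data for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)  # toy target

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 2. Scale inside a pipeline: fit() uses X_train statistics only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

The same pipeline can be passed to `cross_val_score`, which re-fits the scaler per fold and keeps the evaluation leak-free.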
Common Mistakes
- ❌ Data Leakage: Scaling before splitting.
- ❌ Imbalanced Data: Using accuracy as the only metric.
- ❌ Overfitting: High train accuracy, low test accuracy.
- ❌ No Cross-Validation: Unreliable performance estimates.
Learning Path
- Beginner: Linear/Logistic Reg, Decision Trees, K-Means.
- Intermediate: Random Forest, KNN, Naive Bayes, PCA.
- Advanced: SVM, Gradient Boosting, Neural Networks, RL.
Quick Decision Tree
Predict Number?
├─ Yes → REGRESSION
│   ├─ Linear? → Linear Regression
│   └─ Complex? → Random Forest / XGBoost
└─ No → CLASSIFICATION
    ├─ Binary? → Logistic Regression
    ├─ Text? → Naive Bayes
    └─ Complex? → Random Forest / XGBoost
Linear Regression
Predicts continuous values: y = mx + b
- When: Linear relationship between features and target
- Assumptions: Linearity, independence, homoscedasticity, normality
- Pros: Fast, interpretable, works with small data
- Cons: Assumes linearity, sensitive to outliers
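A minimal fit on synthetic data (values invented for illustration); the model recovers the slope and intercept it was generated from:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # ≈ 2.0 and ≈ 1.0
```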
Logistic Regression
Binary classification using sigmoid function
- When: Binary outcomes (yes/no, 0/1)
- Output: Probability between 0 and 1
- Pros: Probabilistic output, fast, interpretable
- Cons: Linear decision boundary
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Decision Trees
Tree structure of if-else decisions
- Pros: Easy to interpret, handles non-linear relationships, no scaling needed
- Cons: Prone to overfitting, unstable (small changes → different tree)
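A short sketch using the Iris dataset as a stand-in: `export_text` prints the learned if-else rules, and `max_depth` is the main guard against overfitting.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Limiting depth keeps the tree small and readable
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))  # the if-else rules as plain text
```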
Random Forests
Ensemble of many decision trees (bagging)
- How: Build multiple trees on random subsets, average predictions
- Pros: Reduces overfitting, handles missing values, feature importance
- Cons: Less interpretable, slower than single tree
- Best for: Tabular data, when you need robust performance
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_depth=10)
rf.fit(X_train, y_train)
# Feature importance
importances = rf.feature_importances_
Great baseline model for tabular data
Core Concept
Find the hyperplane that maximizes margin between classes
The Kernel Trick
Transform data to higher dimensions without computing coordinates
- Linear: For linearly separable data
- RBF (Radial Basis Function): Most common, handles non-linear
- Polynomial: For polynomial relationships
When to Use
- High-dimensional spaces (text, images)
- Clear margin of separation
- Small to medium datasets
Pros & Cons
Pros: Effective in high dimensions, memory efficient
Cons: Slow on large datasets, requires feature scaling
from sklearn.svm import SVC
svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train, y_train)
Best for: Text classification, image recognition
Architecture Components
- Input Layer: Receives features
- Hidden Layers: Learn representations (deep = many layers)
- Output Layer: Produces predictions
- Activation Functions: ReLU (hidden), Sigmoid/Softmax (output)
Key Concepts
- Backpropagation: Update weights using gradient descent
- Learning Rate: How big each update step is (0.001-0.01 typical)
- Epochs: Full passes through training data
- Batch Size: Samples processed before updating weights
When to Use
Complex patterns, images, text, audio, large datasets
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
model = Sequential([
    Dense(64, activation='relu', input_shape=(10,)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X_train, y_train, epochs=50, batch_size=32)
Deep learning powerhouse
k-Nearest Neighbors (k-NN)
Classify based on k closest training examples
- Pros: Simple, no training phase, works for multi-class
- Cons: Slow prediction, sensitive to scale and irrelevant features
- Tip: Always scale features, try k=3,5,7
k-Means Clustering
Partition data into k clusters (unsupervised)
- How: Assign points to nearest centroid, update centroids, repeat
- Use for: Customer segmentation, data compression
- Choosing k: Elbow method (plot within-cluster sum of squares)
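The elbow method above can be sketched in a few lines with scikit-learn: fit k-means for a range of k values and plot the inertia (within-cluster sum of squares). The blob data below is synthetic, purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs (illustrative only)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

# Fit k-means for k = 1..6 and record inertia; the "elbow" is
# where the curve's decrease levels off (here, around k=3)
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)
```

Inertia always shrinks as k grows, so you look for the point of diminishing returns rather than the minimum.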
Naive Bayes
Probabilistic classifier using Bayes' theorem
- Assumption: Features are independent (rarely true but works anyway)
- Best for: Text classification (spam detection, sentiment)
- Pros: Fast, works with small data, handles high dimensions
# k-NN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
# k-Means
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
TensorFlow/Keras
Google's production-ready deep learning framework
- Best for: Production deployment, mobile (TensorFlow Lite), research
- Pros: Industry standard, excellent documentation, TensorBoard visualization
- Keras: High-level API for TensorFlow (easy to use)
- Use when: Need production deployment, mobile apps, or serving at scale
import tensorflow as tf
from tensorflow import keras
# Sequential API (simple)
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(10,)),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    callbacks=[keras.callbacks.EarlyStopping(patience=5)]
)
# Functional API (complex architectures)
inputs = keras.Input(shape=(10,))
x = keras.layers.Dense(64, activation='relu')(inputs)
x = keras.layers.Dense(32, activation='relu')(x)
outputs = keras.layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs=inputs, outputs=outputs)
PyTorch
Facebook's research-focused deep learning framework
- Best for: Research, experimentation, dynamic models
- Pros: Pythonic, dynamic computation graphs, easier debugging
- Popular in: Academic research, NLP (Hugging Face), computer vision
- Use when: Need flexibility, research, or custom architectures
import torch
import torch.nn as nn
import torch.optim as optim
# Define model
class NeuralNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        x = torch.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))
        return x
model = NeuralNet()
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training loop
for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()
TensorFlow vs PyTorch
| Aspect | TensorFlow | PyTorch |
|---|---|---|
| Ease of Use | Keras makes it easy | More Pythonic, intuitive |
| Learning Curve | Moderate | Easier for Python devs |
| Deployment | Excellent (TF Serving, Lite) | Good (TorchServe) |
| Research | Good | Dominant in academia |
| Debugging | Harder (static graphs) | Easier (dynamic graphs) |
| Community | Large, industry-focused | Large, research-focused |
Common Use Cases
- Computer Vision: Both (PyTorch slightly preferred)
- NLP: PyTorch (Hugging Face Transformers)
- Production/Mobile: TensorFlow
- Research Papers: PyTorch
- Time Series: Both
Key Libraries
- TensorFlow: Keras, TensorBoard, TF Data, TF Lite
- PyTorch: torchvision, torchtext, Lightning (wrapper)
- Both: ONNX (model interchange format)
Accuracy
Correct predictions / Total predictions
- When: Balanced classes
- Misleading when: Imbalanced data (e.g., 95% class A, 5% class B)
Precision
True Positives / (True Positives + False Positives)
- Question: Of predicted positives, how many are correct?
- Use when: False positives are costly (spam filter)
Recall (Sensitivity)
True Positives / (True Positives + False Negatives)
- Question: Of actual positives, how many did we catch?
- Use when: False negatives are costly (disease detection)
F1-Score
Harmonic mean of precision and recall: 2 × (Precision × Recall) / (Precision + Recall)
- Use when: Balance between precision and recall matters
ROC-AUC
Area Under the Receiver Operating Characteristic curve
- Plots True Positive Rate vs False Positive Rate
- AUC = 1.0: Perfect classifier
- AUC = 0.5: Random guessing
- Use when: Comparing models across thresholds
from sklearn.metrics import classification_report, roc_auc_score
print(classification_report(y_test, y_pred))
auc = roc_auc_score(y_test, y_pred_proba)
Choose metric based on business impact
Mean Squared Error (MSE)
Average of squared differences: Σ(actual - predicted)² / n
- Penalizes large errors heavily
- Units are the square of the target variable's units (less interpretable)
Root Mean Squared Error (RMSE)
Square root of MSE: √MSE
- Same units as target variable
- Most common regression metric
- More interpretable than MSE
Mean Absolute Error (MAE)
Average of absolute differences: Σ|actual - predicted| / n
- Less sensitive to outliers than MSE/RMSE
- Same units as target variable
- More robust metric
R² (Coefficient of Determination)
Proportion of variance explained: 1 - (SS_res / SS_tot)
- R² = 1.0: Perfect predictions
- R² = 0.0: As good as predicting the mean
- Can be negative for bad models
- Scale-independent (comparable across datasets)
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
RMSE for magnitude, R² for model quality
Why Cross-Validation?
Get more reliable performance estimates using all data for both training and validation
k-Fold Cross-Validation
- Split data into k folds (typically k=5 or 10)
- Train on k-1 folds, validate on remaining fold
- Repeat k times, average results
- Pros: Every sample used for both training and validation
Stratified k-Fold
- Maintains class distribution in each fold
- Use for: Imbalanced classification problems
Leave-One-Out (LOO)
- k = n (number of samples)
- Use for: Very small datasets
- Con: Computationally expensive
Time Series Split
- Respects temporal ordering
- Critical for: Sequential data (stocks, sales)
from sklearn.model_selection import cross_val_score, StratifiedKFold
# Simple k-fold
scores = cross_val_score(model, X, y, cv=5)
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
# Stratified k-fold
skf = StratifiedKFold(n_splits=5, shuffle=True)
scores = cross_val_score(model, X, y, cv=skf)
Always use CV for model selection
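For the time-series case above, scikit-learn's TimeSeriesSplit keeps every training fold strictly earlier than its validation fold. A minimal sketch on synthetic time-ordered data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered samples (illustrative)
tscv = TimeSeriesSplit(n_splits=4)

for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index:
    # no future information leaks into the past
    assert train_idx.max() < test_idx.min()
```

Passing `cv=tscv` to cross_val_score works the same way as the k-fold examples above.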
The Matrix
| | Predicted Pos | Predicted Neg |
|---|---|---|
| Actual Pos | TP | FN |
| Actual Neg | FP | TN |
Understanding Each Cell
- True Positive (TP): Correctly predicted positive
- True Negative (TN): Correctly predicted negative
- False Positive (FP): Incorrectly predicted positive (Type I error)
- False Negative (FN): Incorrectly predicted negative (Type II error)
What to Look For
- High FP? Model predicts positive too eagerly (raise the decision threshold)
- High FN? Model too conservative (lower the decision threshold)
- Imbalanced diagonal? Class imbalance or poor model
Quick Code
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
Always visualize your confusion matrix
Grid Search
Try every combination of specified parameters
- Pros: Exhaustive, guaranteed to find best in grid
- Cons: Exponentially slow with more parameters
- Use when: Few parameters, small ranges
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
Random Search
Sample random combinations
- Pros: Faster, explores more space
- Cons: May miss optimal
- Use when: Many parameters, large ranges
from sklearn.model_selection import RandomizedSearchCV
param_dist = {
    'n_estimators': [100, 200, 300, 400],
    'max_depth': [10, 20, 30, 40, None],
    'min_samples_split': [2, 5, 10, 15]
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_dist,
    n_iter=20,  # Number of random combinations
    cv=5,
    n_jobs=-1
)
random_search.fit(X_train, y_train)
Key Hyperparameters by Algorithm
Random Forest: n_estimators, max_depth, min_samples_split
SVM: C (regularization), kernel, gamma
Neural Networks: learning_rate, batch_size, hidden_layers, neurons
XGBoost: learning_rate, max_depth, n_estimators, subsample
Start with defaults, then tune the most important params
By Problem Type
Binary Classification:
- Logistic Regression (baseline)
- Random Forest (robust)
- XGBoost (high performance)
- Neural Networks (complex patterns)
Multi-class Classification:
- Random Forest
- XGBoost
- Naive Bayes (text)
Regression:
- Linear Regression (baseline)
- Random Forest
- XGBoost
- Neural Networks
Clustering:
- K-Means (spherical clusters)
- DBSCAN (arbitrary shapes, outliers)
- Hierarchical (dendrograms)
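The DBSCAN entry above deserves a quick sketch, since its outlier handling is what sets it apart from k-Means: points that fall in no dense region get the label -1. The data here is synthetic, for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus one far-away isolated point
X = np.vstack([
    rng.normal(0, 0.2, size=(30, 2)),
    rng.normal(5, 0.2, size=(30, 2)),
    [[20.0, 20.0]],  # isolated point, should become noise
])

# eps: neighborhood radius; min_samples: density threshold
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
# The two blobs become clusters; the isolated point is labeled -1
```

Unlike k-Means, no k is specified; the number of clusters falls out of the density parameters.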
By Data Characteristics
Small Data (<10k samples):
- Logistic Regression, Naive Bayes
- Simple models to avoid overfitting
Large Data (>100k samples):
- Neural Networks, XGBoost
- Can learn complex patterns
High Dimensional (many features):
- Regularized models (Lasso, Ridge)
- Random Forest (handles many features)
- Feature selection first
Imbalanced Classes:
- Random Forest with class_weight='balanced'
- XGBoost with scale_pos_weight
- SMOTE for oversampling
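The class-weight option above is a one-argument change in scikit-learn. A minimal sketch on a synthetic 95/5 imbalanced dataset (the data and split ratio are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced dataset: ~95% class 0, ~5% class 1
X, y = make_classification(
    n_samples=1000, weights=[0.95, 0.05], random_state=42
)

# 'balanced' reweights classes inversely to their frequency,
# so minority-class mistakes cost more during training
rf = RandomForestClassifier(class_weight='balanced', random_state=42)
rf.fit(X, y)
```

Pair this with F1 or AUC rather than accuracy when judging the result.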
Quick Decision Tree
Need interpretability? → Logistic Regression or Decision Tree
Need high accuracy? → XGBoost or Random Forest
Have images/text? → Neural Networks (CNN/RNN)
Limited time? → Start with Random Forest
Always try multiple algorithms
Data Leakage
Information from test set leaks into training
- Example: Scaling before train/test split
- Fix: Always split first, then preprocess
- Example: Using future information in time series
- Fix: Use time-based split
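The split-first rule can be sketched directly: fit the scaler on the training data only, then apply that same transform to the test set. The random data below is a stand-in for any feature matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)           # placeholder features
y = np.random.randint(0, 2, 100)     # placeholder labels

# WRONG: StandardScaler().fit_transform(X) before splitting lets the
# scaler see test-set statistics — that's leakage.

# RIGHT: split first, fit the scaler on training data only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)   # statistics from train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # same transform, never refit
```

scikit-learn Pipelines enforce this ordering automatically inside cross-validation.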
Class Imbalance
One class dominates dataset (e.g., 95% vs 5%)
- Symptom: High accuracy but poor recall on minority class
- Solutions:
- Use stratified sampling
- Oversample minority class (SMOTE)
- Undersample majority class
- Use class weights
- Change evaluation metric (F1, AUC instead of accuracy)
Poor Performance Checklist
- ✓ Check for data leakage
- ✓ Verify train/test split is correct
- ✓ Look for missing values
- ✓ Check feature scaling
- ✓ Examine class distribution
- ✓ Plot learning curves (more data needed?)
- ✓ Try different algorithms
- ✓ Engineer better features
Model Not Learning
- Neural Networks: Learning rate too high/low, bad initialization
- All models: Features not informative, need more data
Overfitting Signs
- Training accuracy >> test accuracy (gap >10%)
- Performance degrades on new data
- Model too complex for data size
# Check for data leakage
from sklearn.model_selection import cross_val_score
# If CV score is much worse than train score → suspect overfitting or leakage
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"CV: {cv_scores.mean():.3f}, Train: {train_score:.3f}")
⚠️ Always validate on unseen data
Complete ML Pipeline
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
# 1. Load data
df = pd.read_csv('data.csv')
# 2. Basic exploration
print(df.info())
print(df.describe())
print(df.isnull().sum())
# 3. Prepare features and target
X = df.drop('target', axis=1)
y = df['target']
# 4. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# 5. Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 6. Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# 7. Evaluate
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# 8. Cross-validation
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
print(f"CV Score: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
Pandas Essentials
# Load data
df = pd.read_csv('file.csv')
# Exploration
df.head()
df.shape
df.dtypes
df.describe()
df.isnull().sum()
# Selection
df['column']
df[['col1', 'col2']]
df[df['age'] > 30]
# Missing values
df.dropna()
df.fillna(df.mean(numeric_only=True))
# Encoding
pd.get_dummies(df, columns=['category'])
# Group by
df.groupby('category')['value'].mean()
Bookmark this for quick reference!