Central Tendency
- Mean: Average of all values → sum of values / count
- Median: Middle value when sorted (robust to outliers)
- Mode: Most frequently occurring value
Spread & Variability
- Variance: Average squared deviation from mean
- Standard Deviation: Square root of variance (same units as data)
- Range: Max - Min
- Interquartile Range (IQR): Q3 - Q1 (middle 50% of data)
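These summary statistics map directly onto NumPy calls; a minimal sketch with a made-up sample:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = data.mean()         # 5.0
median = np.median(data)   # 4.5 (middle of the sorted values)
variance = data.var()      # 4.0 (population variance)
std = data.std()           # 2.0 (same units as the data)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1              # 1.5 (spread of the middle 50%)
```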
Probability Concepts
- Probability: P(event) = favorable outcomes / total outcomes (0 to 1)
- Conditional Probability: P(A|B) = probability of A given B occurred
- Independence: Events don't affect each other → P(A and B) = P(A) × P(B)
- Bayes' Theorem: Update beliefs with new evidence
P(A|B) = P(B|A) × P(A) / P(B)
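A worked Bayes update with invented numbers (a rare condition and an imperfect test):

```python
# Hypothetical numbers: 1% prevalence, 99% true-positive rate, 5% false-positive rate
p_a = 0.01                 # P(A): prior probability of the condition
p_b_given_a = 0.99         # P(B|A): positive test given the condition
p_b_given_not_a = 0.05     # P(B|not A): false positive rate

# Total probability of a positive test: P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # 0.167: most positives are false positives
```

The counterintuitive result (a positive test means only a ~17% chance of the condition) is why updating the prior matters.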
Distributions
- Normal (Gaussian): Bell curve, symmetric around mean
- Uniform: All outcomes equally likely
- Binomial: Success/failure over n trials
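Sampling from these distributions is a quick way to see their shapes; a sketch using NumPy's random generator:

```python
import numpy as np

rng = np.random.default_rng(42)

normal = rng.normal(loc=0, scale=1, size=10_000)    # bell curve around 0
uniform = rng.uniform(low=0, high=1, size=10_000)   # flat on [0, 1)
binomial = rng.binomial(n=10, p=0.5, size=10_000)   # successes in 10 trials

# Sample means land near the theoretical values: ~0, ~0.5, ~5
print(normal.mean(), uniform.mean(), binomial.mean())
```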
Vectors
Definition: Ordered list of numbers [x₁, x₂, ..., xₙ]
Operations:
- Addition: Add corresponding elements
- Scalar multiplication: Multiply each element by a number
- Dot product: Sum of element-wise products → measures similarity
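A quick sketch of these operations in NumPy; normalizing the dot product by the vectors' lengths gives cosine similarity, one common way it measures similarity:

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

print(u + v)        # [5 7 9]  (element-wise addition)
print(2 * u)        # [2 4 6]  (scalar multiplication)
dot = np.dot(u, v)  # 1*4 + 2*5 + 3*6 = 32

# Cosine similarity: dot product divided by the product of the lengths
cos_sim = dot / (np.linalg.norm(u) * np.linalg.norm(v))
print(round(cos_sim, 3))  # 0.975 (nearly parallel vectors)
```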
Matrices
Definition: 2D array of numbers (rows × columns)
Operations:
- Addition/Subtraction: Element-wise (same dimensions)
- Multiplication: Row × Column (inner dimensions must match)
- Transpose: Flip rows → columns (Aᵀ)
Key Concepts
- Identity Matrix (I): Diagonal of 1s, rest 0s
- Dimensions: Shape of data (e.g., 100 samples × 5 features)
- Matrix multiplication: Transforms data (used in neural networks)
- Inverse: A⁻¹ such that A × A⁻¹ = I (used in solving equations)
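A small NumPy check of these identities (the matrix values are arbitrary):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
I = np.eye(2)                  # 2×2 identity matrix

A_inv = np.linalg.inv(A)
# A @ A_inv recovers the identity (up to floating-point error)
print(np.allclose(A @ A_inv, I))  # True

# Solving A x = b via the inverse (np.linalg.solve is preferred in practice)
b = np.array([5.0, 11.0])
x = A_inv @ b                  # x = [1, 2]
print(np.allclose(A @ x, b))      # True
```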
Variables & Data Types
# Variable Assignment
x = 42           # Integer
y = 3.14         # Float
name = "ML"      # String
is_valid = True  # Boolean
# Dynamic Typing
x = "Now I'm a string"
type(x)          # <class 'str'>
Lists, Tuples & Dictionaries
# List (Mutable, Ordered)
nums = [1, 2, 3]
nums.append(4) # Add to end
nums.pop() # Remove last
nums[0] = 10 # Modify
# Tuple (Immutable, Ordered)
coords = (10, 20)
x, y = coords # Unpacking
# Dictionary (Key-Value)
data = {"id": 1, "val": 0.5}
data["id"] # Access: 1
data.keys() # Get keys
data.values() # Get values
Loops & Control Flow
# If-else
if x > 10:
    print("Large")
elif x > 5:
    print("Medium")
else:
    print("Small")
# For Loop
for i in range(5):
    print(i)
# Loop over List
fruits = ["apple", "banana"]
for fruit in fruits:
    print(fruit)
# While Loop
count = 0
while count < 5:
    count += 1
# List Comprehension
squares = [x**2 for x in range(5)]
Functions
def greet(name, greeting="Hello"):
    """Function with default parameter"""
    return f"{greeting}, {name}!"

result = greet("Alice")      # "Hello, Alice!"
result = greet("Bob", "Hi")  # "Hi, Bob!"
File I/O
# Read file
with open('data.txt', 'r') as f:
    content = f.read()        # Entire file as one string
# Or read a list of lines (fresh handle; read() already consumed the file)
with open('data.txt', 'r') as f:
    lines = f.readlines()
# Write file
with open('output.txt', 'w') as f:
    f.write("Hello, World!\n")
Core Purpose: Fast numeric arrays; foundation of most numeric work in Python
Array Creation
import numpy as np
# From list
arr = np.array([1, 2, 3, 4])
# Special arrays
zeros = np.zeros((3, 4))         # 3×4 array of zeros
ones = np.ones((2, 3))           # 2×3 array of ones
range_arr = np.arange(0, 10, 2)  # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5)  # 5 evenly spaced values
random = np.random.rand(3, 3)    # 3×3 random values in [0, 1)
Vectorized Math (Fast Element-wise Operations)
# Element-wise operations (no loops needed!)
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
a + b        # [5, 7, 9]
a * 2        # [2, 4, 6]
a ** 2       # [1, 4, 9]
a * b        # [4, 10, 18] (element-wise)
np.sqrt(a)   # [1.0, 1.414, 1.732]
np.exp(a)    # Exponential
np.log(a)    # Natural log
# Aggregations
arr.sum()    # Sum all elements
arr.mean()   # Average
arr.std()    # Standard deviation
arr.min()    # Minimum
arr.max()    # Maximum
Linear Algebra Operations
# Dot product (vector multiplication)
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.dot(a, b)      # 1*4 + 2*5 + 3*6 = 32
# Matrix multiplication
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
np.matmul(A, B)   # or A @ B
# Transpose
A.T               # Flip rows and columns
# Matrix operations
np.linalg.inv(A)  # Inverse
np.linalg.det(A)  # Determinant
np.linalg.eig(A)  # Eigenvalues/eigenvectors
Indexing & Slicing
arr = np.array([10, 20, 30, 40, 50])
arr[0]         # 10 (first element)
arr[-1]        # 50 (last element)
arr[1:4]       # [20, 30, 40] (slice)
arr[arr > 25]  # [30, 40, 50] (boolean indexing)
# 2D arrays
matrix = np.array([[1, 2, 3], [4, 5, 6]])
matrix[0, 1]   # 2 (row 0, col 1)
matrix[:, 1]   # [2, 5] (all rows, col 1)
Reshaping
arr = np.arange(12)
arr.reshape(3, 4)   # 3×4 matrix
arr.reshape(-1, 1)  # Column vector (auto-calculate rows)
arr.flatten()       # 1D array
Core Purpose: Tabular data manipulation (DataFrame, Series)
Loading Data
import pandas as pd
# From CSV (most common)
df = pd.read_csv('data.csv')
# With options
df = pd.read_csv('data.csv',
                 sep=',',          # Delimiter
                 header=0,         # Row for column names
                 index_col=0,      # Column to use as index
                 na_values=['?'])  # Custom missing values
# Other formats
df = pd.read_excel('data.xlsx')
df = pd.read_json('data.json')
df = pd.read_sql(query, connection)
Inspection Methods
# Quick overview
df.head()          # First 5 rows
df.tail(3)         # Last 3 rows
df.info()          # Column types, non-null counts, memory usage
df.describe()      # Statistics for numeric columns
df.shape           # (rows, columns)
df.columns         # Column names
df.dtypes          # Data types per column
df.isnull().sum()  # Count missing values per column
Data Cleaning
# Handle missing values
df.dropna() # Remove rows with any NaN
df.dropna(subset=['age']) # Drop rows where 'age' is NaN
df.fillna(0) # Fill NaN with 0
df.fillna(df.mean(numeric_only=True))  # Fill with column means
df['age'] = df['age'].fillna(df['age'].median())  # Fill with the median
# Boolean filtering (masks)
df[df['age'] > 25] # Rows where age > 25
df[df['city'] == 'NYC'] # Rows where city is NYC
df[(df['age'] > 25) & (df['salary'] > 50000)] # Multiple conditions
df[df['name'].str.contains('Alice')] # String matching
# Remove duplicates
df.drop_duplicates()
df.drop_duplicates(subset=['email'])
Selection & Manipulation
# Select columns
df['name'] # Single column (Series)
df[['name', 'age']] # Multiple columns (DataFrame)
# Select rows
df.iloc[0] # First row by position
df.loc[0] # First row by label
df.iloc[0:3] # First 3 rows
# Add/modify columns
df['age_squared'] = df['age'] ** 2
df['full_name'] = df['first'] + ' ' + df['last']
# Drop columns
df.drop('city', axis=1, inplace=True)
df.drop(['col1', 'col2'], axis=1)
# Sort
df.sort_values('age', ascending=False)
df.sort_values(['city', 'age'])
# Group by and aggregate
df.groupby('city')['age'].mean()
df.groupby('city').agg({'age': 'mean', 'salary': 'sum'})
Core Purpose: Low-level plotting library for customizable visualizations
Basic Plot Types
import matplotlib.pyplot as plt
# Line plot
plt.plot([1, 2, 3, 4], [2, 4, 6, 8], marker='o', linestyle='--', color='blue')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot')
plt.legend(['Data'])
plt.grid(True)
plt.show()
# Scatter plot
plt.scatter(x, y, c='red', s=100, alpha=0.5)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Scatter Plot')
plt.show()
# Histogram
plt.hist(data, bins=30, color='green', alpha=0.7, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Distribution')
plt.show()
# Bar chart
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 56, 78]
plt.bar(categories, values, color='purple')
plt.xlabel('Category')
plt.ylabel('Count')
plt.title('Bar Chart')
plt.show()
Customization
# Labels and titles
plt.xlabel('X Label', fontsize=12)
plt.ylabel('Y Label', fontsize=12)
plt.title('My Plot', fontsize=14, fontweight='bold')
# Legend
plt.plot(x, y1, label='Line 1')
plt.plot(x, y2, label='Line 2')
plt.legend(loc='upper right')
# Grid and style
plt.grid(True, linestyle='--', alpha=0.5)
plt.style.use('seaborn-v0_8') # or 'ggplot', 'fivethirtyeight'
# Figure size
plt.figure(figsize=(10, 6))
# Save figure
plt.savefig('plot.png', dpi=300, bbox_inches='tight')
Subplots
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].plot(x, y)
axes[0, 0].set_title('Plot 1')
axes[0, 1].scatter(x, y)
axes[0, 1].set_title('Plot 2')
axes[1, 0].hist(data, bins=20)
axes[1, 0].set_title('Plot 3')
axes[1, 1].bar(categories, values)
axes[1, 1].set_title('Plot 4')
plt.tight_layout()
plt.show()
Core Purpose: High-level statistical visualization built on Matplotlib; great for EDA patterns and relationships
Setup
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('whitegrid') # 'darkgrid', 'white', 'dark', 'ticks'
sns.set_palette('husl') # Color palette
Distribution Plots
# Histogram with KDE (kernel density estimate)
sns.histplot(data=df, x='age', kde=True, bins=20)
plt.show()
# Box plot (quartiles, outliers)
sns.boxplot(data=df, x='city', y='age')
plt.show()
# Violin plot (distribution shape)
sns.violinplot(data=df, x='city', y='age')
plt.show()
# Distribution plot
sns.displot(data=df, x='age', kind='kde')
plt.show()
Relationship Plots
# Scatter plot with hue and size encodings
sns.scatterplot(data=df, x='age', y='salary', hue='city', size='experience')
plt.show()
# Regression plot (scatter with fitted regression line)
sns.regplot(data=df, x='age', y='salary')
plt.show()
# Pair plot (all pairwise relationships)
sns.pairplot(df, hue='city')
plt.show()
# Joint plot (scatter + distributions)
sns.jointplot(data=df, x='age', y='salary', kind='scatter')
plt.show()
Correlation & Heatmaps
# Correlation heatmap
corr = df.corr(numeric_only=True)
sns.heatmap(corr,
            annot=True,       # Show values
            cmap='coolwarm',  # Color scheme
            center=0,         # Center colormap at 0
            square=True,      # Square cells
            linewidths=1)     # Cell borders
plt.title('Correlation Matrix')
plt.show()
Categorical Plots
import numpy as np
# Bar plot (aggregates with the mean by default)
sns.barplot(data=df, x='city', y='age', estimator=np.mean)
plt.show()
# Count plot (frequency)
sns.countplot(data=df, x='city')
plt.show()
# Point plot (with error bars)
sns.pointplot(data=df, x='city', y='age')
plt.show()
# Strip plot (all points)
sns.stripplot(data=df, x='city', y='age', jitter=True)
plt.show()
Advanced: FacetGrid for Multi-panel Plots
# Create a grid of panels based on categorical variables
g = sns.FacetGrid(df, col='city', row='gender', height=4)
g.map(sns.histplot, 'age', bins=20)
plt.show()
Problem-Solving Framework
- Understand: What is the question asking? What data do I have?
- Break Down: Divide complex problems into smaller steps
- Pattern Recognition: Look for similarities to known problems
- Test Assumptions: Verify your understanding with simple examples
- Iterate: Start simple, then add complexity
ML-Specific Thinking
- Data First: Always explore data before modeling (distributions, missing values, outliers)
- Baseline: Start with simple models (e.g., mean prediction) before complex ones
- Validation: Split data (train/test) to evaluate performance honestly
- Feature Engineering: Transform raw data into meaningful inputs
- Debugging: Print shapes, check for NaNs, visualize intermediate results
Common Pitfalls to Avoid
- ❌ Assuming data is clean (always check!)
- ❌ Overfitting (model memorizes training data)
- ❌ Data leakage (test data influences training)
- ❌ Ignoring class imbalance
- ❌ Not scaling features (important for many algorithms)
💡 Pro Tips for Success
- NumPy: Use vectorized operations instead of loops (100x faster!)
- Pandas: Chain operations with method chaining: df.dropna().groupby('city')['age'].mean()
- Matplotlib: Use plt.style.use() for consistent aesthetics
- Seaborn: Perfect for quick EDA; automatically handles DataFrames
- Practice with real datasets (Kaggle, UCI ML Repository)
- Use Jupyter notebooks for interactive exploration
- Google errors and read documentation (it's part of learning!)
# 1. Load data
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
# 2. Inspect
df.head()
df.info()
df.describe()
df.isnull().sum()
# 3. Visualize (EDA)
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(df)
sns.heatmap(df.corr(numeric_only=True), annot=True)
sns.histplot(df['target'], kde=True)
# 4. Clean
df = df.dropna()
df = df[df['age'] > 0] # Filter outliers
# 5. Prepare
X = df.drop('target', axis=1) # Features
y = df['target'] # Target
# 6. Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 7. Train model (example)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
# 8. Evaluate
score = model.score(X_test, y_test)
print(f"R² Score: {score}")
# 9. Predict
predictions = model.predict(X_test)
File-Based Data Sources
CSV (Comma-Separated Values)
import pandas as pd
# Basic load
df = pd.read_csv('data.csv')
# Advanced options
df = pd.read_csv('data.csv',
                 sep=',',                     # Delimiter (can be '\t', '|', etc.)
                 header=0,                    # Row number for column names
                 index_col=0,                 # Column to use as row index
                 na_values=['NA', '?', '-'],  # Custom missing values
                 parse_dates=['date'],        # Convert to datetime
                 encoding='utf-8',            # Handle special characters
                 low_memory=False)            # For large files
# Read specific columns only
df = pd.read_csv('data.csv', usecols=['name', 'age', 'salary'])
# Read in chunks (for huge files)
chunk_iter = pd.read_csv('large_file.csv', chunksize=10000)
for chunk in chunk_iter:
    process(chunk)
Excel Files
# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# Multiple sheets
excel_file = pd.ExcelFile('data.xlsx')
print(excel_file.sheet_names)
# All sheets as dictionary
all_sheets = pd.read_excel('data.xlsx', sheet_name=None)
JSON
# Simple JSON
df = pd.read_json('data.json')
# Nested JSON
df = pd.read_json('data.json', orient='records')
# From API response
import requests
response = requests.get('https://api.example.com/data')
df = pd.DataFrame(response.json())
# Normalize nested JSON
from pandas import json_normalize
df = json_normalize(data, record_path=['items'])
Parquet (Fast & Efficient)
# Read Parquet
df = pd.read_parquet('data.parquet')
# Write Parquet (compressed)
df.to_parquet('output.parquet', compression='gzip')
Database Sources
SQL Databases (Relational)
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:password@localhost:5432/dbname')
# Read table or query
df = pd.read_sql_table('customers', engine)
# Execute SQL query
query = """
SELECT c.name, c.age, o.total
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.date >= '2024-01-01'
"""
df = pd.read_sql_query(query, engine)
# Write to database
df.to_sql('new_table', engine, if_exists='replace', index=False)
NoSQL Databases
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
db = client['mydatabase']
collection = db['customers']
# Query and convert to DataFrame
cursor = collection.find({'age': {'$gt': 25}})
df = pd.DataFrame(list(cursor))
# Redis
import redis
import json
r = redis.Redis(host='localhost', port=6379, db=0)
data = json.loads(r.get('user:1'))
1. Initial Data Inspection
# Load data
df = pd.read_csv('data.csv')
# Quick overview
print(df.shape) # (rows, columns)
print(df.head()) # First 5 rows
print(df.info()) # Data types, non-null counts
print(df.describe()) # Statistics for numeric columns
print(df.isnull().sum()) # Missing values per column
print(df.duplicated().sum()) # Duplicate rows
2. Handling Missing Values
# Detection
missing_pct = (df.isnull().sum() / len(df)) * 100
print(missing_pct[missing_pct > 0])
# Visualization
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.show()
# Strategies
df.dropna() # Remove rows with any NaN
df.dropna(subset=['age']) # Remove rows with NaN in 'age'
df.fillna(0) # Fill with constant
df['age'] = df['age'].fillna(df['age'].median())  # Fill with median
df['price'] = df['price'].ffill()  # Forward fill (fillna(method=...) is deprecated)
df['salary'] = df.groupby('dept')['salary'].transform(lambda x: x.fillna(x.mean()))
3. Handling Duplicates
# Check
print(f"Duplicate rows: {df.duplicated().sum()}")
# Remove (keep first occurrence)
df_clean = df.drop_duplicates()
# Remove based on specific columns
df_clean = df.drop_duplicates(subset=['email', 'phone'])
4. Handling Outliers
# IQR method
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
# Remove
df_clean = df[(df['age'] >= lower) & (df['age'] <= upper)]
# Cap (Winsorization)
df['age'] = df['age'].clip(lower=lower, upper=upper)
# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df['age']))
outliers = df[z_scores > 3]
5. Data Type Corrections
# Convert to numeric
df['age'] = pd.to_numeric(df['age'], errors='coerce')
# Convert to datetime
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')
# Convert to categorical
df['category'] = df['category'].astype('category')
6. Handling Inconsistent Data
# Standardize text
df['city'] = df['city'].str.strip().str.title()
# Fix typos
df['city'] = df['city'].replace({'NY': 'New York', 'NYC': 'New York'})
# Remove special characters
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)
Univariate Analysis (Single Variable)
import seaborn as sns
import matplotlib.pyplot as plt
# Distribution of a numeric variable
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
sns.histplot(df['age'], kde=True)
plt.subplot(1, 2, 2)
sns.boxplot(x=df['age'])
plt.show()
# Categorical variable
sns.countplot(data=df, x='category')
plt.xticks(rotation=45)
plt.show()
Bivariate Analysis (Two Variables)
# Numeric vs numeric
sns.scatterplot(data=df, x='age', y='salary', hue='department')
plt.show()
# Categorical vs numeric
sns.boxplot(data=df, x='department', y='salary')
plt.show()
# Categorical vs categorical
pd.crosstab(df['department'], df['gender']).plot(kind='bar')
plt.show()
Multivariate Analysis
# Correlation heatmap (numeric columns only)
corr = df.select_dtypes(include=[np.number]).corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
# Pair plot
sns.pairplot(df, hue='category')
plt.show()
1. Feature Engineering
# Date features
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['is_weekend'] = df['date'].dt.dayofweek.isin([5, 6]).astype(int)
# Mathematical transformations
df['age_squared'] = df['age'] ** 2
df['bmi'] = df['weight'] / (df['height'] ** 2)
# Binning
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 60, 100],
                         labels=['Child', 'Adult', 'Senior'])
# Interaction features
df['income_per_age'] = df['income'] / df['age']
df['age_income'] = df['age'] * df['income']
2. Feature Selection
# Correlation with the target
corr = df.corr()['target'].abs().sort_values(ascending=False)
top_features = corr[1:6].index.tolist()
# Variance threshold
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.1)
X_high_var = selector.fit_transform(X)
# Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE
rfe = RFE(model, n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)
3. Dimensionality Reduction
# PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # Keep 95% of variance
X_pca = pca.fit_transform(X_scaled)
# t-SNE (visualization)
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X_scaled)
4. Scaling and Normalization
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# StandardScaler (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Fit on train only!
# MinMaxScaler (range [0, 1])
scaler = MinMaxScaler()
# RobustScaler (robust to outliers)
scaler = RobustScaler()
5. Encoding Categorical Variables
# Label Encoding (Ordinal)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['grade'] = le.fit_transform(df['grade'])
# One-Hot Encoding (Nominal)
df = pd.get_dummies(df, columns=['city'], drop_first=True)
# Target Encoding
target_means = df.groupby('city')['salary'].mean()
df['city_encoded'] = df['city'].map(target_means)
Complete Data Cleaning Pipeline
def clean_data(df):
    # 1. Initial inspection
    print("Shape:", df.shape)
    # 2. Remove duplicates
    df = df.drop_duplicates()
    # 3. Handle missing values
    num_cols = df.select_dtypes(include=[np.number]).columns
    for col in num_cols:
        df[col] = df[col].fillna(df[col].median())
    cat_cols = df.select_dtypes(include=['object']).columns
    for col in cat_cols:
        df[col] = df[col].fillna(df[col].mode()[0])
    # 4. Remove outliers (IQR)
    for col in num_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        df = df[~((df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR)))]
    # 5. Data type corrections
    date_cols = [c for c in df.columns if 'date' in c.lower()]
    for col in date_cols:
        df[col] = pd.to_datetime(df[col], errors='coerce')
    # 6. Encode
    df = pd.get_dummies(df, columns=cat_cols, drop_first=True)
    return df
Data Cleaning Checklist
- ✅ Load data: Check source and format
- ✅ Initial inspection: Shape, info, missing values
- ✅ Duplicates: Identify and remove
- ✅ Missing Data: Drop, fill, or impute
- ✅ Outliers: Detect and handle
- ✅ Data Types: Fix dates, numbers, categories
- ✅ Standardize text: Strip, lowercase, fix typos
- ✅ Feature engineering: Create new features
- ✅ Encode: Label or One-Hot encoding
- ✅ Scale: Standard or MinMax (fit on train only!)
- ✅ Split: Train/Test split BEFORE scaling
Best Practices
- Always keep a copy of raw data: Never modify original.
- Document cleaning steps: Use notebooks/comments.
- Visualize before and after: Verify cleaning.
- Handle missing data thoughtfully: Don't just drop.
- Scale after splitting: Prevent data leakage.
- Validate cleaning: Check if results make sense.
Common Pitfalls to Avoid
- ❌ Data Leakage: Using test stats for training scaling.
- ❌ Dropping too much data: Losing info.
- ❌ Ignoring Types: Treating numbers as strings.
- ❌ Over-cleaning: Removing valid outliers.
- ❌ Not handling categorical: Models need numbers.
- ❌ Scaling before splitting: Leakage risk.
Quick Reference: Data Format Comparison
| Format | Speed | Use Case |
|---|---|---|
| CSV | Slow | Simple, Sharing |
| Excel | Slow | Business Reports |
| JSON | Medium | APIs, Nested |
| Parquet | Fast | Big Data |
| SQL | Fast | Structured Queries |
Supervised Learning
Definition: Learning from labeled data (input-output pairs). Goal is to learn a mapping function f(X) → y.
- Regression: Predict continuous values (price, temperature).
- Classification: Predict discrete categories (spam/not spam).
Unsupervised Learning
Definition: Learning from unlabeled data to find hidden patterns.
- Clustering: Group similar data points.
- Dimensionality Reduction: Reduce feature space.
- Anomaly Detection: Identify unusual patterns.
Semi-Supervised Learning
Mix of labeled and unlabeled data. Useful when labeling is expensive but data is abundant.
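One way to sketch this in scikit-learn is `SelfTrainingClassifier`: samples labeled -1 are treated as unlabeled, and the wrapped model iteratively labels them from its own confident predictions. The data and 80/20 label split below are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # a linearly separable target

# Pretend 80% of labels are unknown (-1 marks unlabeled samples)
y_partial = y.copy()
y_partial[rng.random(200) < 0.8] = -1

model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, y_partial)
print(model.score(X, y))  # accuracy against the full labels
```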
Reinforcement Learning (RL)
Learning through interaction with an environment (trial and error) to maximize cumulative reward.
# Simple Q-Learning example (tabular RL)
import numpy as np
# Initialize Q-table (states ร actions)
Q = np.zeros((n_states, n_actions))
# Training loop
# Assumes a Gym-style `env` and hyperparameters epsilon, alpha, gamma are defined
for episode in range(1000):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(Q[state])        # Exploit
        # Take action
        next_state, reward, done, _ = env.step(action)
        # Q-learning update
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
Regression vs Classification
| Aspect | Regression | Classification |
|---|---|---|
| Output | Continuous number | Discrete category |
| Examples | Price, temperature | Spam/ham, cat/dog |
| Evaluation | MSE, RMSE, R² | Accuracy, F1, ROC-AUC |
When to use Regression
- Predicting quantities (sales, revenue)
- Forecasting (time series)
- Estimating continuous relationships
When to use Classification
- Yes/No decisions (approve loan)
- Multi-class problems (image recognition)
- Ranking/prioritization
1. Linear Regression
Fits a straight line to predict continuous values. Good for linear relationships, interpretable.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
print(f"R² Score: {model.score(X_test, y_test):.2f}")
2. Logistic Regression
Predicts probability of binary outcomes (0 or 1). Good baseline for classification.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred_proba = model.predict_proba(X_test)
3. Decision Trees
Tree of if-then-else rules. Handles non-linear data, interpretable, but prone to overfitting.
from sklearn.tree import DecisionTreeClassifier, plot_tree
model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train, y_train)
# Visualize
plot_tree(model, filled=True)
4. Random Forest
Ensemble of decision trees. High accuracy, robust, handles non-linear data.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
# Feature importance
importances = model.feature_importances_
5. K-Nearest Neighbors (KNN)
Classifies based on K closest examples. Simple, no training, but slow on large data.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
# Scale features first!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_scaled, y)
6. Support Vector Machines (SVM)
Finds optimal hyperplane. Great for high-dimensional data (text/images).
from sklearn.svm import SVC
model = SVC(kernel='rbf', C=1.0)
model.fit(X_train_scaled, y_train)
7. Naive Bayes
Probabilistic classifier assuming feature independence. Fast, great for text.
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train, y_train)
8. Gradient Boosting (XGBoost)
Sequential ensemble of weak learners. State-of-the-art accuracy for tabular data.
from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)
1. K-Means Clustering
Groups data into K clusters based on similarity.
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
clusters = model.fit_predict(X)
# Elbow method for choosing K
inertias = [KMeans(n_clusters=k).fit(X).inertia_ for k in range(1, 10)]
2. Principal Component Analysis (PCA)
Reduces dimensionality while preserving variance.
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)  # Keep 95% of variance
X_pca = pca.fit_transform(X_scaled)
3. Autoencoders
Neural network that learns compressed representation. Good for anomaly detection.
# Simple autoencoder (Keras)
from tensorflow import keras
from tensorflow.keras import layers
input_img = keras.Input(shape=(input_dim,))
encoded = layers.Dense(32, activation='relu')(input_img)
decoded = layers.Dense(input_dim, activation='sigmoid')(encoded)
autoencoder = keras.Model(input_img, decoded)
Comparison Table
| Algorithm | Type | Pros | Cons |
|---|---|---|---|
| Linear Reg | Reg | Fast, Interpretable | Linear only |
| Random Forest | Both | Accurate, Robust | Slow, Opaque |
| XGBoost | Both | High Accuracy | Overfitting risk |
| SVM | Both | High Dimensions | Slow on large data |
Selection Guide
By Problem Type:
- Regression: Linear Regression (baseline) → Random Forest/XGBoost
- Binary Class: Logistic Regression (baseline) → Random Forest/XGBoost
- Multi-class: Random Forest or Neural Networks
- Clustering: K-Means (if K known) or DBSCAN
By Data Size:
- Small (<1k): Logistic Reg, Naive Bayes, KNN
- Medium: Random Forest, SVM, XGBoost
- Large (>100k): Neural Networks, XGBoost
Regression Metrics
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_test, y_pred)  # Lower is better
r2 = r2_score(y_test, y_pred)             # 1.0 is perfect
Classification Metrics
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)              # Balances precision/recall
auc = roc_auc_score(y_test, y_pred_proba)  # Threshold-independent
Cross-Validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
print(f"CV Acc: {scores.mean():.2%}")
Best Practices
- Start Simple: Baseline with Linear/Logistic Regression.
- Understand Data: Visualize and clean before modeling.
- Feature Engineering: Often more impactful than model choice.
- Always Split: Train/Test split to prevent leakage.
- Scale Features: Essential for KNN, SVM, Neural Nets.
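The split-then-scale practice above can be sketched with a `Pipeline`, which guarantees the scaler is only ever fitted on training data (synthetic data for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)  # toy target

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 2. Scale inside a pipeline: fit() uses X_train statistics only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

The same pipeline can be passed to `cross_val_score`, which re-fits the scaler per fold and keeps the evaluation leak-free.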
Common Mistakes
- ❌ Data Leakage: Scaling before splitting.
- ❌ Imbalanced Data: Using accuracy as the only metric.
- ❌ Overfitting: High train accuracy, low test accuracy.
- ❌ No Cross-Validation: Unreliable performance estimates.
Learning Path
- Beginner: Linear/Logistic Reg, Decision Trees, K-Means.
- Intermediate: Random Forest, KNN, Naive Bayes, PCA.
- Advanced: SVM, Gradient Boosting, Neural Networks, RL.
Quick Decision Tree
Predict Number?
├─ Yes → REGRESSION
│   ├─ Linear? → Linear Regression
│   └─ Complex? → Random Forest / XGBoost
└─ No → CLASSIFICATION
    ├─ Binary? → Logistic Regression
    ├─ Text? → Naive Bayes
    └─ Complex? → Random Forest / XGBoost
Linear Regression
Predicts continuous values: y = mx + b
- When: Linear relationship between features and target
- Assumptions: Linearity, independence, homoscedasticity, normality
- Pros: Fast, interpretable, works with small data
- Cons: Assumes linearity, sensitive to outliers
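A minimal fit on synthetic data (values invented for illustration); the model recovers the slope and intercept it was generated from:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # ≈ 2.0 and ≈ 1.0
```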
Logistic Regression
Binary classification using sigmoid function
- When: Binary outcomes (yes/no, 0/1)
- Output: Probability between 0 and 1
- Pros: Probabilistic output, fast, interpretable
- Cons: Linear decision boundary
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Decision Trees
Tree structure of if-else decisions
- Pros: Easy to interpret, handles non-linear relationships, no scaling needed
- Cons: Prone to overfitting, unstable (small changes → different tree)
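A short sketch using the Iris dataset as a stand-in: `export_text` prints the learned if-else rules, and `max_depth` is the main guard against overfitting.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Limiting depth keeps the tree small and readable
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))  # the if-else rules as plain text
```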
Random Forests
Ensemble of many decision trees (bagging)
- How: Build multiple trees on random subsets, average predictions
- Pros: Reduces overfitting, handles missing values, feature importance
- Cons: Less interpretable, slower than single tree
- Best for: Tabular data, when you need robust performance
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_depth=10)
rf.fit(X_train, y_train)
# Feature importance
importances = rf.feature_importances_
Great baseline model for tabular data
Core Concept
Find the hyperplane that maximizes margin between classes
The Kernel Trick
Transform data to higher dimensions without computing coordinates
- Linear: For linearly separable data
- RBF (Radial Basis Function): Most common, handles non-linear
- Polynomial: For polynomial relationships
When to Use
- High-dimensional spaces (text, images)
- Clear margin of separation
- Small to medium datasets
Pros & Cons
Pros: Effective in high dimensions, memory efficient
Cons: Slow on large datasets, requires feature scaling
from sklearn.svm import SVC
svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train, y_train)
Best for: Text classification, image recognition
Architecture Components
- Input Layer: Receives features
- Hidden Layers: Learn representations (deep = many layers)
- Output Layer: Produces predictions
- Activation Functions: ReLU (hidden), Sigmoid/Softmax (output)
Key Concepts
- Backpropagation: Update weights using gradient descent
- Learning Rate: How big each update step is (0.001-0.01 typical)
- Epochs: Full passes through training data
- Batch Size: Samples processed before updating weights
When to Use
Complex patterns, images, text, audio, large datasets
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
model = Sequential([
    Dense(64, activation='relu', input_shape=(10,)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X_train, y_train, epochs=50, batch_size=32)
Deep learning powerhouse
k-Nearest Neighbors (k-NN)
Classify based on k closest training examples
- Pros: Simple, no training phase, works for multi-class
- Cons: Slow prediction, sensitive to scale and irrelevant features
- Tip: Always scale features, try k=3,5,7
k-Means Clustering
Partition data into k clusters (unsupervised)
- How: Assign points to nearest centroid, update centroids, repeat
- Use for: Customer segmentation, data compression
- Choosing k: Elbow method (plot within-cluster sum of squares)
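The elbow method above can be sketched in a few lines with scikit-learn: fit k-means for a range of k values and plot the inertia (within-cluster sum of squares). The blob data below is synthetic, purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs (illustrative only)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

# Fit k-means for k = 1..6 and record inertia; the "elbow" is
# where the curve's decrease levels off (here, around k=3)
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)
```

Inertia always shrinks as k grows, so you look for the point of diminishing returns rather than the minimum.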
Naive Bayes
Probabilistic classifier using Bayes' theorem
- Assumption: Features are independent (rarely true but works anyway)
- Best for: Text classification (spam detection, sentiment)
- Pros: Fast, works with small data, handles high dimensions
# k-NN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
# k-Means
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
TensorFlow/Keras
Google's production-ready deep learning framework
- Best for: Production deployment, mobile (TensorFlow Lite), research
- Pros: Industry standard, excellent documentation, TensorBoard visualization
- Keras: High-level API for TensorFlow (easy to use)
- Use when: Need production deployment, mobile apps, or serving at scale
import tensorflow as tf
from tensorflow import keras
# Sequential API (simple)
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(10,)),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    callbacks=[keras.callbacks.EarlyStopping(patience=5)]
)
# Functional API (complex architectures)
inputs = keras.Input(shape=(10,))
x = keras.layers.Dense(64, activation='relu')(inputs)
x = keras.layers.Dense(32, activation='relu')(x)
outputs = keras.layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs=inputs, outputs=outputs)
PyTorch
Facebook's research-focused deep learning framework
- Best for: Research, experimentation, dynamic models
- Pros: Pythonic, dynamic computation graphs, easier debugging
- Popular in: Academic research, NLP (Hugging Face), computer vision
- Use when: Need flexibility, research, or custom architectures
import torch
import torch.nn as nn
import torch.optim as optim
# Define model
class NeuralNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        x = torch.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))
        return x
model = NeuralNet()
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training loop
for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()
TensorFlow vs PyTorch
| Aspect | TensorFlow | PyTorch |
|---|---|---|
| Ease of Use | Keras makes it easy | More Pythonic, intuitive |
| Learning Curve | Moderate | Easier for Python devs |
| Deployment | Excellent (TF Serving, Lite) | Good (TorchServe) |
| Research | Good | Dominant in academia |
| Debugging | Harder (static graphs) | Easier (dynamic graphs) |
| Community | Large, industry-focused | Large, research-focused |
Common Use Cases
- Computer Vision: Both (PyTorch slightly preferred)
- NLP: PyTorch (Hugging Face Transformers)
- Production/Mobile: TensorFlow
- Research Papers: PyTorch
- Time Series: Both
Key Libraries
- TensorFlow: Keras, TensorBoard, TF Data, TF Lite
- PyTorch: torchvision, torchtext, Lightning (wrapper)
- Both: ONNX (model interchange format)
Accuracy
Correct predictions / Total predictions
- When: Balanced classes
- Misleading when: Imbalanced data (e.g., 95% class A, 5% class B)
Precision
True Positives / (True Positives + False Positives)
- Question: Of predicted positives, how many are correct?
- Use when: False positives are costly (spam filter)
Recall (Sensitivity)
True Positives / (True Positives + False Negatives)
- Question: Of actual positives, how many did we catch?
- Use when: False negatives are costly (disease detection)
F1-Score
Harmonic mean of precision and recall: 2 × (Precision × Recall) / (Precision + Recall)
- Use when: Balance between precision and recall matters
ROC-AUC
Area Under the Receiver Operating Characteristic curve
- Plots True Positive Rate vs False Positive Rate
- AUC = 1.0: Perfect classifier
- AUC = 0.5: Random guessing
- Use when: Comparing models across thresholds
from sklearn.metrics import classification_report, roc_auc_score
print(classification_report(y_test, y_pred))
auc = roc_auc_score(y_test, y_pred_proba)
Choose metric based on business impact
Mean Squared Error (MSE)
Average of squared differences: Σ(actual - predicted)² / n
- Penalizes large errors heavily
- Units are the square of the target variable's units (less interpretable)
Root Mean Squared Error (RMSE)
Square root of MSE: √MSE
- Same units as target variable
- Most common regression metric
- More interpretable than MSE
Mean Absolute Error (MAE)
Average of absolute differences: Σ|actual - predicted| / n
- Less sensitive to outliers than MSE/RMSE
- Same units as target variable
- More robust metric
R² (Coefficient of Determination)
Proportion of variance explained: 1 - (SS_res / SS_tot)
- R² = 1.0: Perfect predictions
- R² = 0.0: As good as predicting the mean
- Can be negative for bad models
- Scale-independent (comparable across datasets)
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
RMSE for magnitude, R² for model quality
Why Cross-Validation?
Get more reliable performance estimates using all data for both training and validation
k-Fold Cross-Validation
- Split data into k folds (typically k=5 or 10)
- Train on k-1 folds, validate on remaining fold
- Repeat k times, average results
- Pros: Every sample used for both training and validation
Stratified k-Fold
- Maintains class distribution in each fold
- Use for: Imbalanced classification problems
Leave-One-Out (LOO)
- k = n (number of samples)
- Use for: Very small datasets
- Con: Computationally expensive
Time Series Split
- Respects temporal ordering
- Critical for: Sequential data (stocks, sales)
from sklearn.model_selection import cross_val_score, StratifiedKFold
# Simple k-fold
scores = cross_val_score(model, X, y, cv=5)
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
# Stratified k-fold
skf = StratifiedKFold(n_splits=5, shuffle=True)
scores = cross_val_score(model, X, y, cv=skf)
Always use CV for model selection
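For the time-series case above, scikit-learn's TimeSeriesSplit keeps every training fold strictly earlier than its validation fold. A minimal sketch on synthetic time-ordered data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered samples (illustrative)
tscv = TimeSeriesSplit(n_splits=4)

for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index:
    # no future information leaks into the past
    assert train_idx.max() < test_idx.min()
```

Passing `cv=tscv` to cross_val_score works the same way as the k-fold examples above.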
The Matrix
| | Predicted Pos | Predicted Neg |
|---|---|---|
| Actual Pos | TP | FN |
| Actual Neg | FP | TN |
Understanding Each Cell
- True Positive (TP): Correctly predicted positive
- True Negative (TN): Correctly predicted negative
- False Positive (FP): Incorrectly predicted positive (Type I error)
- False Negative (FN): Incorrectly predicted negative (Type II error)
What to Look For
- High FP? Model predicts positive too eagerly (raise the decision threshold)
- High FN? Model too conservative (lower the decision threshold)
- Imbalanced diagonal? Class imbalance or poor model
Quick Code
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
Always visualize your confusion matrix
Grid Search
Try every combination of specified parameters
- Pros: Exhaustive, guaranteed to find best in grid
- Cons: Exponentially slow with more parameters
- Use when: Few parameters, small ranges
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
Random Search
Sample random combinations
- Pros: Faster, explores more space
- Cons: May miss optimal
- Use when: Many parameters, large ranges
from sklearn.model_selection import RandomizedSearchCV
param_dist = {
    'n_estimators': [100, 200, 300, 400],
    'max_depth': [10, 20, 30, 40, None],
    'min_samples_split': [2, 5, 10, 15]
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_dist,
    n_iter=20,  # Number of random combinations
    cv=5,
    n_jobs=-1
)
random_search.fit(X_train, y_train)
Key Hyperparameters by Algorithm
Random Forest: n_estimators, max_depth, min_samples_split
SVM: C (regularization), kernel, gamma
Neural Networks: learning_rate, batch_size, hidden_layers, neurons
XGBoost: learning_rate, max_depth, n_estimators, subsample
Start with defaults, then tune the most important params
By Problem Type
Binary Classification:
- Logistic Regression (baseline)
- Random Forest (robust)
- XGBoost (high performance)
- Neural Networks (complex patterns)
Multi-class Classification:
- Random Forest
- XGBoost
- Naive Bayes (text)
Regression:
- Linear Regression (baseline)
- Random Forest
- XGBoost
- Neural Networks
Clustering:
- K-Means (spherical clusters)
- DBSCAN (arbitrary shapes, outliers)
- Hierarchical (dendrograms)
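The DBSCAN entry above deserves a quick sketch, since its outlier handling is what sets it apart from k-Means: points that fall in no dense region get the label -1. The data here is synthetic, for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus one far-away isolated point
X = np.vstack([
    rng.normal(0, 0.2, size=(30, 2)),
    rng.normal(5, 0.2, size=(30, 2)),
    [[20.0, 20.0]],  # isolated point, should become noise
])

# eps: neighborhood radius; min_samples: density threshold
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
# The two blobs become clusters; the isolated point is labeled -1
```

Unlike k-Means, no k is specified; the number of clusters falls out of the density parameters.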
By Data Characteristics
Small Data (<10k samples):
- Logistic Regression, Naive Bayes
- Simple models to avoid overfitting
Large Data (>100k samples):
- Neural Networks, XGBoost
- Can learn complex patterns
High Dimensional (many features):
- Regularized models (Lasso, Ridge)
- Random Forest (handles many features)
- Feature selection first
Imbalanced Classes:
- Random Forest with class_weight='balanced'
- XGBoost with scale_pos_weight
- SMOTE for oversampling
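The class-weight option above is a one-argument change in scikit-learn. A minimal sketch on a synthetic 95/5 imbalanced dataset (the data and split ratio are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced dataset: ~95% class 0, ~5% class 1
X, y = make_classification(
    n_samples=1000, weights=[0.95, 0.05], random_state=42
)

# 'balanced' reweights classes inversely to their frequency,
# so minority-class mistakes cost more during training
rf = RandomForestClassifier(class_weight='balanced', random_state=42)
rf.fit(X, y)
```

Pair this with F1 or AUC rather than accuracy when judging the result.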
Quick Decision Tree
Need interpretability? → Logistic Regression or Decision Tree
Need high accuracy? → XGBoost or Random Forest
Have images/text? → Neural Networks (CNN/RNN)
Limited time? → Start with Random Forest
Always try multiple algorithms
Data Leakage
Information from test set leaks into training
- Example: Scaling before train/test split
- Fix: Always split first, then preprocess
- Example: Using future information in time series
- Fix: Use time-based split
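The split-first rule can be sketched directly: fit the scaler on the training data only, then apply that same transform to the test set. The random data below is a stand-in for any feature matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)           # placeholder features
y = np.random.randint(0, 2, 100)     # placeholder labels

# WRONG: StandardScaler().fit_transform(X) before splitting lets the
# scaler see test-set statistics — that's leakage.

# RIGHT: split first, fit the scaler on training data only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)   # statistics from train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # same transform, never refit
```

scikit-learn Pipelines enforce this ordering automatically inside cross-validation.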
Class Imbalance
One class dominates dataset (e.g., 95% vs 5%)
- Symptom: High accuracy but poor recall on minority class
- Solutions:
- Use stratified sampling
- Oversample minority class (SMOTE)
- Undersample majority class
- Use class weights
- Change evaluation metric (F1, AUC instead of accuracy)
Poor Performance Checklist
- ✓ Check for data leakage
- ✓ Verify train/test split is correct
- ✓ Look for missing values
- ✓ Check feature scaling
- ✓ Examine class distribution
- ✓ Plot learning curves (more data needed?)
- ✓ Try different algorithms
- ✓ Engineer better features
Model Not Learning
- Neural Networks: Learning rate too high/low, bad initialization
- All models: Features not informative, need more data
Overfitting Signs
- Training accuracy >> test accuracy (gap >10%)
- Performance degrades on new data
- Model too complex for data size
# Check for data leakage
from sklearn.model_selection import cross_val_score
# If CV score is much worse than train score → suspect overfitting or leakage
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"CV: {cv_scores.mean():.3f}, Train: {train_score:.3f}")
⚠️ Always validate on unseen data
Complete ML Pipeline
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
# 1. Load data
df = pd.read_csv('data.csv')
# 2. Basic exploration
print(df.info())
print(df.describe())
print(df.isnull().sum())
# 3. Prepare features and target
X = df.drop('target', axis=1)
y = df['target']
# 4. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# 5. Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 6. Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# 7. Evaluate
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# 8. Cross-validation
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
print(f"CV Score: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
Pandas Essentials
# Load data
df = pd.read_csv('file.csv')
# Exploration
df.head()
df.shape
df.dtypes
df.describe()
df.isnull().sum()
# Selection
df['column']
df[['col1', 'col2']]
df[df['age'] > 30]
# Missing values
df.dropna()
df.fillna(df.mean(numeric_only=True))
# Encoding
pd.get_dummies(df, columns=['category'])
# Group by
df.groupby('category')['value'].mean()
Bookmark this for quick reference!