📚 Fundamentals
Core concepts including statistics, linear algebra, Python libraries (NumPy, Pandas), and visualization tools essential for Machine Learning.
Statistics & Probability Basics

Central Tendency

  • Mean: Average of all values → sum of values / count
  • Median: Middle value when sorted (robust to outliers)
  • Mode: Most frequently occurring value

Spread & Variability

  • Variance: Average squared deviation from mean
  • Standard Deviation: Square root of variance (same units as data)
  • Range: Max - Min
  • Interquartile Range (IQR): Q3 - Q1 (middle 50% of data)
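
Both groups of measures can be computed directly with NumPy; a minimal sketch (the sample data is made up):

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = data.mean()                    # sum of values / count -> 5.0
median = np.median(data)              # middle value of sorted data
variance = data.var()                 # average squared deviation -> 4.0
std = data.std()                      # sqrt of variance -> 2.0
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                         # spread of the middle 50%
```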

Probability Concepts

  • Probability: P(event) = favorable outcomes / total outcomes (0 to 1)
  • Conditional Probability: P(A|B) = probability of A given B occurred
  • Independence: Events don't affect each other → P(A and B) = P(A) × P(B)
  • Bayes' Theorem: Update beliefs with new evidence
    P(A|B) = P(B|A) × P(A) / P(B)
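
A worked Bayes example with made-up numbers (a hypothetical test: 1% prevalence, 99% sensitivity, 5% false-positive rate):

```python
# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_disease = 0.01
p_pos_given_disease = 0.99
p_pos_given_healthy = 0.05

# Law of total probability gives the denominator P(positive)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
# ~0.17: at low prevalence, even a good test yields mostly false positives
```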

Distributions

  • Normal (Gaussian): Bell curve, symmetric around mean
  • Uniform: All outcomes equally likely
  • Binomial: Success/failure over n trials
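
These distributions can be sampled with NumPy's random generator; a quick sketch (sizes and parameters chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

normal = rng.normal(loc=0, scale=1, size=10_000)     # bell curve around 0
uniform = rng.uniform(low=0, high=1, size=10_000)    # all values equally likely
binomial = rng.binomial(n=10, p=0.5, size=10_000)    # successes in 10 trials

# Sample means land near the theoretical means (0, 0.5, and 5)
normal.mean(), uniform.mean(), binomial.mean()
```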
Linear Algebra Essentials

Vectors

Definition: Ordered list of numbers [x₁, x₂, ..., xₙ]

Operations:

  • Addition: Add corresponding elements
  • Scalar multiplication: Multiply each element by a number
  • Dot product: Sum of element-wise products → measures similarity

Matrices

Definition: 2D array of numbers (rows × columns)

Operations:

  • Addition/Subtraction: Element-wise (same dimensions)
  • Multiplication: Row × Column (inner dimensions must match)
  • Transpose: Flip rows ↔ columns (Aᵀ)
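
The inner-dimension rule can be checked in a couple of lines (a NumPy sketch):

```python
import numpy as np

A = np.ones((2, 3))   # 2×3
B = np.ones((3, 4))   # 3×4: inner dimensions (3 and 3) match

C = A @ B             # matrix product
C.shape               # (2, 4): the outer dimensions survive
```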

Key Concepts

  • Identity Matrix (I): Diagonal of 1s, rest 0s
  • Dimensions: Shape of data (e.g., 100 samples × 5 features)
  • Matrix multiplication: Transforms data (used in neural networks)
  • Inverse: A⁻¹ such that A × A⁻¹ = I (used in solving equations)
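
These concepts map directly onto np.linalg; a small sketch with a hand-picked invertible matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

I = np.eye(2)               # identity: diagonal of 1s
A_inv = np.linalg.inv(A)    # inverse

# A times its inverse recovers the identity (up to floating-point error)
np.allclose(A @ A_inv, I)   # True
```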
Python Basics

Variables & Data Types

# Variable Assignment
x = 42          # Integer
y = 3.14        # Float
name = "ML"     # String
is_valid = True # Boolean

# Dynamic Typing
x = "Now I'm a string"
type(x)         # <class 'str'>

Lists, Tuples & Dictionaries

# List (Mutable, Ordered)
nums = [1, 2, 3]
nums.append(4)      # Add to end
nums.pop()          # Remove last
nums[0] = 10        # Modify

# Tuple (Immutable, Ordered)
coords = (10, 20)
x, y = coords       # Unpacking

# Dictionary (Key-Value)
data = {"id": 1, "val": 0.5}
data["id"]          # Access: 1
data.keys()         # Get keys
data.values()       # Get values

Loops & Control Flow

# If-else
if x > 10:
    print("Large")
elif x > 5:
    print("Medium")
else:
    print("Small")

# For Loop
for i in range(5):
    print(i)

# Loop over List
fruits = ["apple", "banana"]
for fruit in fruits:
    print(fruit)

# While Loop
count = 0
while count < 5:
    count += 1

# List Comprehension
squares = [x**2 for x in range(5)]

Functions

def greet(name, greeting="Hello"):
    """Function with default parameter"""
    return f"{greeting}, {name}!"

result = greet("Alice")  # "Hello, Alice!"
result = greet("Bob", "Hi")  # "Hi, Bob!"

File I/O

# Read file
with open('data.txt', 'r') as f:
    content = f.read()       # Entire file as one string

# Read line by line (use a fresh handle; read() leaves the cursor at EOF)
with open('data.txt', 'r') as f:
    lines = f.readlines()    # List of lines

# Write file
with open('output.txt', 'w') as f:
    f.write("Hello, World!\n")
NumPy Essentials

Core Purpose: Fast numeric arrays; foundation of most numeric work in Python

Array Creation

import numpy as np

# From list
arr = np.array([1, 2, 3, 4])

# Special arrays
zeros = np.zeros((3, 4))      # 3×4 array of zeros
ones = np.ones((2, 3))        # 2×3 array of ones
range_arr = np.arange(0, 10, 2)  # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5)  # 5 evenly spaced values
random = np.random.rand(3, 3)    # 3×3 random [0,1)

Vectorized Math (Fast Element-wise Operations)

# Element-wise operations (no loops needed!)
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

a + b      # [5, 7, 9]
a * 2      # [2, 4, 6]
a ** 2     # [1, 4, 9]
a * b      # [4, 10, 18] (element-wise)
np.sqrt(a) # [1.0, 1.414, 1.732]
np.exp(a)  # Exponential
np.log(a)  # Natural log

# Aggregations
arr.sum()      # Sum all elements
arr.mean()     # Average
arr.std()      # Standard deviation
arr.min()      # Minimum
arr.max()      # Maximum

Linear Algebra Operations

# Dot product (vector multiplication)
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.dot(a, b)  # 1*4 + 2*5 + 3*6 = 32

# Matrix multiplication
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
np.matmul(A, B)  # or A @ B

# Transpose
A.T  # Flip rows and columns

# Matrix operations
np.linalg.inv(A)    # Inverse
np.linalg.det(A)    # Determinant
np.linalg.eig(A)    # Eigenvalues/eigenvectors

Indexing & Slicing

arr = np.array([10, 20, 30, 40, 50])

arr[0]         # 10 (first element)
arr[-1]        # 50 (last element)
arr[1:4]       # [20, 30, 40] (slice)
arr[arr > 25]  # [30, 40, 50] (boolean indexing)

# 2D arrays
matrix = np.array([[1, 2, 3], [4, 5, 6]])
matrix[0, 1]   # 2 (row 0, col 1)
matrix[:, 1]   # [2, 5] (all rows, col 1)

Reshaping

arr = np.arange(12)
arr.reshape(3, 4)    # 3×4 matrix
arr.reshape(-1, 1)   # Column vector (auto-calculate rows)
arr.flatten()        # 1D array
Pandas Essentials

Core Purpose: Tabular data manipulation (DataFrame, Series)

Loading Data

import pandas as pd

# From CSV (most common)
df = pd.read_csv('data.csv')

# With options
df = pd.read_csv('data.csv', 
                 sep=',',           # Delimiter
                 header=0,          # Row for column names
                 index_col=0,       # Column to use as index
                 na_values=['?'])   # Custom missing values

# Other formats
df = pd.read_excel('data.xlsx')
df = pd.read_json('data.json')
df = pd.read_sql(query, connection)

Inspection Methods

# Quick overview
df.head()           # First 5 rows
df.tail(3)          # Last 3 rows
df.info()           # Column types, non-null counts, memory usage
df.describe()       # Statistics for numeric columns
df.shape            # (rows, columns)
df.columns          # Column names
df.dtypes           # Data types per column
df.isnull().sum()   # Count missing values per column

Data Cleaning

# Handle missing values
df.dropna()                    # Remove rows with any NaN
df.dropna(subset=['age'])      # Drop rows where 'age' is NaN
df.fillna(0)                   # Fill NaN with 0
df.fillna(df.mean(numeric_only=True))  # Fill with column means
df['age'] = df['age'].fillna(df['age'].median())  # Assign back; inplace fills are deprecated

# Boolean filtering (masks)
df[df['age'] > 25]                          # Rows where age > 25
df[df['city'] == 'NYC']                     # Rows where city is NYC
df[(df['age'] > 25) & (df['salary'] > 50000)]  # Multiple conditions
df[df['name'].str.contains('Alice')]        # String matching

# Remove duplicates
df.drop_duplicates()
df.drop_duplicates(subset=['email'])

Selection & Manipulation

# Select columns
df['name']              # Single column (Series)
df[['name', 'age']]     # Multiple columns (DataFrame)

# Select rows
df.iloc[0]              # First row by position
df.loc[0]               # First row by label
df.iloc[0:3]            # First 3 rows

# Add/modify columns
df['age_squared'] = df['age'] ** 2
df['full_name'] = df['first'] + ' ' + df['last']

# Drop columns
df.drop('city', axis=1, inplace=True)
df.drop(['col1', 'col2'], axis=1)

# Sort
df.sort_values('age', ascending=False)
df.sort_values(['city', 'age'])

# Group by and aggregate
df.groupby('city')['age'].mean()
df.groupby('city').agg({'age': 'mean', 'salary': 'sum'})
Matplotlib Essentials

Core Purpose: Low-level plotting library for customizable visualizations

Basic Plot Types

import matplotlib.pyplot as plt

# Line plot
plt.plot([1, 2, 3, 4], [2, 4, 6, 8], marker='o', linestyle='--', color='blue')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot')
plt.legend(['Data'])
plt.grid(True)
plt.show()

# Scatter plot
plt.scatter(x, y, c='red', s=100, alpha=0.5)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Scatter Plot')
plt.show()

# Histogram
plt.hist(data, bins=30, color='green', alpha=0.7, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Distribution')
plt.show()

# Bar chart
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 56, 78]
plt.bar(categories, values, color='purple')
plt.xlabel('Category')
plt.ylabel('Count')
plt.title('Bar Chart')
plt.show()

Customization

# Labels and titles
plt.xlabel('X Label', fontsize=12)
plt.ylabel('Y Label', fontsize=12)
plt.title('My Plot', fontsize=14, fontweight='bold')

# Legend
plt.plot(x, y1, label='Line 1')
plt.plot(x, y2, label='Line 2')
plt.legend(loc='upper right')

# Grid and style
plt.grid(True, linestyle='--', alpha=0.5)
plt.style.use('seaborn-v0_8')  # or 'ggplot', 'fivethirtyeight'

# Figure size
plt.figure(figsize=(10, 6))

# Save figure
plt.savefig('plot.png', dpi=300, bbox_inches='tight')

Subplots

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

axes[0, 0].plot(x, y)
axes[0, 0].set_title('Plot 1')

axes[0, 1].scatter(x, y)
axes[0, 1].set_title('Plot 2')

axes[1, 0].hist(data, bins=20)
axes[1, 0].set_title('Plot 3')

axes[1, 1].bar(categories, values)
axes[1, 1].set_title('Plot 4')

plt.tight_layout()
plt.show()
Seaborn Essentials

Core Purpose: High-level statistical visualization built on Matplotlib; great for EDA patterns and relationships

Setup

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('whitegrid')  # 'darkgrid', 'white', 'dark', 'ticks'
sns.set_palette('husl')     # Color palette

Distribution Plots

# Histogram with KDE (kernel density estimate)
sns.histplot(data=df, x='age', kde=True, bins=20)
plt.show()

# Box plot (quartiles, outliers)
sns.boxplot(data=df, x='city', y='age')
plt.show()

# Violin plot (distribution shape)
sns.violinplot(data=df, x='city', y='age')
plt.show()

# Distribution plot
sns.displot(data=df, x='age', kind='kde')
plt.show()

Relationship Plots

# Scatter plot with regression line
sns.scatterplot(data=df, x='age', y='salary', hue='city', size='experience')
plt.show()

# Regression plot
sns.regplot(data=df, x='age', y='salary')
plt.show()

# Pair plot (all pairwise relationships)
sns.pairplot(df, hue='city')
plt.show()

# Joint plot (scatter + distributions)
sns.jointplot(data=df, x='age', y='salary', kind='scatter')
plt.show()

Correlation & Heatmaps

# Correlation heatmap (numeric_only avoids errors on text columns)
corr = df.corr(numeric_only=True)
sns.heatmap(corr, 
            annot=True,        # Show values
            cmap='coolwarm',   # Color scheme
            center=0,          # Center colormap at 0
            square=True,       # Square cells
            linewidths=1)      # Cell borders
plt.title('Correlation Matrix')
plt.show()

Categorical Plots

# Bar plot (with aggregation; assumes `import numpy as np`)
sns.barplot(data=df, x='city', y='age', estimator=np.mean)
plt.show()

# Count plot (frequency)
sns.countplot(data=df, x='city')
plt.show()

# Point plot (with error bars)
sns.pointplot(data=df, x='city', y='age')
plt.show()

# Strip plot (all points)
sns.stripplot(data=df, x='city', y='age', jitter=True)
plt.show()

Advanced: FacetGrid for Multi-panel Plots

# Create grid based on categorical variable
g = sns.FacetGrid(df, col='city', row='gender', height=4)
g.map(sns.histplot, 'age', bins=20)
plt.show()
Logical & Analytical Thinking Tips

Problem-Solving Framework

  1. Understand: What is the question asking? What data do I have?
  2. Break Down: Divide complex problems into smaller steps
  3. Pattern Recognition: Look for similarities to known problems
  4. Test Assumptions: Verify your understanding with simple examples
  5. Iterate: Start simple, then add complexity

ML-Specific Thinking

  • Data First: Always explore data before modeling (distributions, missing values, outliers)
  • Baseline: Start with simple models (e.g., mean prediction) before complex ones
  • Validation: Split data (train/test) to evaluate performance honestly
  • Feature Engineering: Transform raw data into meaningful inputs
  • Debugging: Print shapes, check for NaNs, visualize intermediate results

Common Pitfalls to Avoid

  • โŒ Assuming data is clean (always check!)
  • โŒ Overfitting (model memorizes training data)
  • โŒ Data leakage (test data influences training)
  • โŒ Ignoring class imbalance
  • โŒ Not scaling features (important for many algorithms)

💡 Pro Tips for Success

  • NumPy: Use vectorized operations instead of loops (often orders of magnitude faster)
  • Pandas: Chain operations with method chaining: df.dropna().groupby('city')['age'].mean()
  • Matplotlib: Use plt.style.use() for consistent aesthetics
  • Seaborn: Perfect for quick EDA; automatically handles DataFrames
  • Practice with real datasets (Kaggle, UCI ML Repository)
  • Use Jupyter notebooks for interactive exploration
  • Google errors and read documentation (it's part of learning!)
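
The vectorization tip above is easy to sanity-check with a quick timing sketch (absolute numbers vary by machine):

```python
import time
import numpy as np

arr = np.arange(1_000_000)

start = time.perf_counter()
loop_result = sum(x * 2 for x in arr)      # Python-level loop over elements
loop_time = time.perf_counter() - start

start = time.perf_counter()
vec_result = int((arr * 2).sum())          # vectorized loop in compiled code
vec_time = time.perf_counter() - start

loop_result == vec_result                  # same answer, far faster vectorized
```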
Typical ML Workflow
# 1. Load data
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')

# 2. Inspect
df.head()
df.info()
df.describe()
df.isnull().sum()

# 3. Visualize (EDA)
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df)
sns.heatmap(df.corr(numeric_only=True), annot=True)
sns.histplot(df['target'], kde=True)

# 4. Clean
df = df.dropna()
df = df[df['age'] > 0]  # Filter outliers

# 5. Prepare
X = df.drop('target', axis=1)  # Features
y = df['target']                # Target

# 6. Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 7. Train model (example)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

# 8. Evaluate
score = model.score(X_test, y_test)
print(f"R² Score: {score}")

# 9. Predict
predictions = model.predict(X_test)
🧹 Data Handling & Cleaning
Collecting raw data, fixing errors/missing values, and transforming it into a format ready for modeling.
Data Collection & Sources

File-Based Data Sources

CSV (Comma-Separated Values)

import pandas as pd

# Basic load
df = pd.read_csv('data.csv')

# Advanced options
df = pd.read_csv('data.csv',
                 sep=',',              # Delimiter (can be '\t', '|', etc.)
                 header=0,             # Row number for column names
                 index_col=0,          # Column to use as row index
                 na_values=['NA', '?', '-'],  # Custom missing values
                 parse_dates=['date'], # Convert to datetime
                 encoding='utf-8',     # Handle special characters
                 low_memory=False)     # For large files

# Read specific columns only
df = pd.read_csv('data.csv', usecols=['name', 'age', 'salary'])

# Read in chunks (for huge files)
chunk_iter = pd.read_csv('large_file.csv', chunksize=10000)
for chunk in chunk_iter:
    process(chunk)

Excel Files

# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Multiple sheets
excel_file = pd.ExcelFile('data.xlsx')
print(excel_file.sheet_names)

# All sheets as dictionary
all_sheets = pd.read_excel('data.xlsx', sheet_name=None)

JSON

# Simple JSON
df = pd.read_json('data.json')

# Nested JSON
df = pd.read_json('data.json', orient='records')

# From API response
import requests
response = requests.get('https://api.example.com/data')
df = pd.DataFrame(response.json())

# Normalize nested JSON
from pandas import json_normalize
df = json_normalize(data, record_path=['items'])

Parquet (Fast & Efficient)

# Read Parquet
df = pd.read_parquet('data.parquet')

# Write Parquet (compressed)
df.to_parquet('output.parquet', compression='gzip')

Database Sources

SQL Databases (Relational)

from sqlalchemy import create_engine
engine = create_engine('postgresql://user:password@localhost:5432/dbname')

# Read table or query
df = pd.read_sql_table('customers', engine)

# Execute SQL query
query = """
SELECT c.name, c.age, o.total
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.date >= '2024-01-01'
"""
df = pd.read_sql_query(query, engine)

# Write to database
df.to_sql('new_table', engine, if_exists='replace', index=False)

NoSQL Databases

from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
db = client['mydatabase']
collection = db['customers']

# Query and convert to DataFrame
cursor = collection.find({'age': {'$gt': 25}})
df = pd.DataFrame(list(cursor))

# Redis
import redis
import json
r = redis.Redis(host='localhost', port=6379, db=0)
data = json.loads(r.get('user:1'))
Data Cleaning Essentials

1. Initial Data Inspection

# Load data
df = pd.read_csv('data.csv')

# Quick overview
print(df.shape)           # (rows, columns)
print(df.head())          # First 5 rows
print(df.info())          # Data types, non-null counts
print(df.describe())      # Statistics for numeric columns
print(df.isnull().sum())  # Missing values per column
print(df.duplicated().sum())  # Duplicate rows

2. Handling Missing Values

# Detection
missing_pct = (df.isnull().sum() / len(df)) * 100
print(missing_pct[missing_pct > 0])

# Visualization
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.show()

# Strategies
df.dropna()                    # Remove rows with any NaN
df.dropna(subset=['age'])      # Remove rows with NaN in 'age'
df.fillna(0)                   # Fill with constant
df['age'] = df['age'].fillna(df['age'].median())   # Fill with median
df['price'] = df['price'].ffill()                  # Forward fill (fillna(method=...) is deprecated)
df['salary'] = df.groupby('dept')['salary'].transform(lambda x: x.fillna(x.mean()))

3. Handling Duplicates

# Check
print(f"Duplicate rows: {df.duplicated().sum()}")

# Remove (keep first occurrence)
df_clean = df.drop_duplicates()

# Remove based on specific columns
df_clean = df.drop_duplicates(subset=['email', 'phone'])

4. Handling Outliers

# IQR Method
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Remove
df_clean = df[(df['age'] >= lower) & (df['age'] <= upper)]

# Cap (Winsorization)
df['age'] = df['age'].clip(lower=lower, upper=upper)

# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df['age']))
outliers = df[z_scores > 3]

5. Data Type Corrections

# Convert to numeric
df['age'] = pd.to_numeric(df['age'], errors='coerce')

# Convert to datetime
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')

# Convert to categorical
df['category'] = df['category'].astype('category')

6. Handling Inconsistent Data

# Standardize text
df['city'] = df['city'].str.strip().str.title()

# Fix typos
df['city'] = df['city'].replace({'NY': 'New York', 'NYC': 'New York'})

# Remove special characters
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)
Visualization & Patterns

Univariate Analysis (Single Variable)

import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of numeric variable
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
sns.histplot(df['age'], kde=True)
plt.subplot(1, 2, 2)
sns.boxplot(x=df['age'])
plt.show()

# Categorical variable
sns.countplot(data=df, x='category')
plt.xticks(rotation=45)
plt.show()

Bivariate Analysis (Two Variables)

# Numeric vs Numeric
sns.scatterplot(data=df, x='age', y='salary', hue='department')
plt.show()

# Categorical vs Numeric
sns.boxplot(data=df, x='department', y='salary')
plt.show()

# Categorical vs Categorical
pd.crosstab(df['department'], df['gender']).plot(kind='bar')
plt.show()

Multivariate Analysis

# Correlation heatmap
corr = df.select_dtypes(include=[np.number]).corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()

# Pair Plot
sns.pairplot(df, hue='category')
plt.show()
Preprocessing Techniques

1. Feature Engineering

# Date features
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['is_weekend'] = df['date'].dt.dayofweek.isin([5, 6]).astype(int)

# Mathematical transformations
df['age_squared'] = df['age'] ** 2
df['bmi'] = df['weight'] / (df['height'] ** 2)

# Binning
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 60, 100], labels=['Child', 'Adult', 'Senior'])

# Interaction Features
df['income_per_age'] = df['income'] / df['age']
df['age_income'] = df['age'] * df['income']

2. Feature Selection

# Correlation
corr = df.corr(numeric_only=True)['target'].abs().sort_values(ascending=False)
top_features = corr[1:6].index.tolist()

# Variance Threshold
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.1)
X_high_var = selector.fit_transform(X)

# Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE
rfe = RFE(model, n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)

3. Dimensionality Reduction

# PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

# t-SNE (Visualization)
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X_scaled)

4. Scaling and Normalization

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# StandardScaler (Mean=0, Std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# MinMaxScaler (Range [0, 1])
scaler = MinMaxScaler()

# RobustScaler (Robust to outliers)
scaler = RobustScaler()

5. Encoding Categorical Variables

# Label Encoding (Ordinal)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['grade'] = le.fit_transform(df['grade'])

# One-Hot Encoding (Nominal)
df = pd.get_dummies(df, columns=['city'], drop_first=True)

# Target Encoding
target_means = df.groupby('city')['salary'].mean()
df['city_encoded'] = df['city'].map(target_means)
Pipeline & Checklist

Complete Data Cleaning Pipeline

def clean_data(df):
    # 1. Initial inspection
    print("Shape:", df.shape)
    
    # 2. Remove duplicates
    df = df.drop_duplicates()
    
    # 3. Handle missing values
    num_cols = df.select_dtypes(include=[np.number]).columns
    for col in num_cols:
        df[col] = df[col].fillna(df[col].median())
    
    cat_cols = df.select_dtypes(include=['object']).columns
    for col in cat_cols:
        df[col] = df[col].fillna(df[col].mode()[0])
    
    # 4. Remove outliers (IQR)
    for col in num_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        df = df[~((df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR)))]
        
    # 5. Data type corrections
    date_cols = [c for c in df.columns if 'date' in c.lower()]
    for col in date_cols:
        df[col] = pd.to_datetime(df[col], errors='coerce')
    
    # 6. Encode
    df = pd.get_dummies(df, columns=cat_cols, drop_first=True)
    
    return df

Data Cleaning Checklist

  • ✅ Load data: Check source and format
  • ✅ Initial inspection: Shape, info, missing values
  • ✅ Duplicates: Identify and remove
  • ✅ Missing Data: Drop, fill, or impute
  • ✅ Outliers: Detect and handle
  • ✅ Data Types: Fix dates, numbers, categories
  • ✅ Standardize text: Strip, lowercase, fix typos
  • ✅ Feature engineering: Create new features
  • ✅ Encode: Label or One-Hot encoding
  • ✅ Scale: Standard or MinMax (fit on train only!)
  • ✅ Split: Train/Test split BEFORE scaling
Best Practices & Pitfalls

Best Practices

  • Always keep a copy of raw data: Never modify original.
  • Document cleaning steps: Use notebooks/comments.
  • Visualize before and after: Verify cleaning.
  • Handle missing data thoughtfully: Don't just drop.
  • Scale after splitting: Prevent data leakage.
  • Validate cleaning: Check if results make sense.

Common Pitfalls to Avoid

  • โŒ Data Leakage: Using test stats for training scaling.
  • โŒ Dropping too much data: Losing info.
  • โŒ Ignoring Types: Treating numbers as strings.
  • โŒ Over-cleaning: Removing valid outliers.
  • โŒ Not handling categorical: Models need numbers.
  • โŒ Scaling before splitting: Leakage risk.

Quick Reference: Data Format Comparison

Format    Speed    Use Case
CSV       Slow     Simple, sharing
Excel     Slow     Business reports
JSON      Medium   APIs, nested data
Parquet   Fast     Big data
SQL       Fast     Structured queries
🧠 Core ML Concepts and Algorithms
Machine Learning is about teaching computers to learn patterns from data without being explicitly programmed. This section covers fundamental ML concepts, algorithm categories, and practical guidance on when and why to use each approach. Key Principle ("No Free Lunch"): no single algorithm works best for all problems; understanding your data and problem type guides algorithm selection.
Machine Learning Paradigms

Supervised Learning

Definition: Learning from labeled data (input-output pairs). Goal is to learn a mapping function f(X) → y.

  • Regression: Predict continuous values (price, temperature).
  • Classification: Predict discrete categories (spam/not spam).

Unsupervised Learning

Definition: Learning from unlabeled data to find hidden patterns.

  • Clustering: Group similar data points.
  • Dimensionality Reduction: Reduce feature space.
  • Anomaly Detection: Identify unusual patterns.
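
A minimal contrast between the two paradigms on toy data (scikit-learn assumed available; the numbers are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([2.0, 4.0, 6.0, 20.0, 22.0, 24.0])   # labels follow y = 2x

# Supervised: learn the mapping X -> y from labeled pairs
reg = LinearRegression().fit(X, y)

# Unsupervised: find structure in X alone (two obvious groups)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```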

Semi-Supervised Learning

Mix of labeled and unlabeled data. Useful when labeling is expensive but data is abundant.
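
scikit-learn sketches this idea with SelfTrainingClassifier, where unlabeled points are marked -1 (a toy example; data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Pretend only the first 50 labels are known; -1 marks "unlabeled"
y_partial = y.copy()
y_partial[50:] = -1

model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, y_partial)          # confident predictions become new labels
model.score(X, y)
```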

Reinforcement Learning (RL)

Learning through interaction with an environment (trial and error) to maximize cumulative reward.

# Simple Q-learning example (tabular RL)
# Assumes a Gym-style `env` plus n_states, n_actions, and the hyperparameters
# alpha (learning rate), gamma (discount), epsilon (exploration) defined elsewhere
import numpy as np

# Initialize Q-table (states × actions)
Q = np.zeros((n_states, n_actions))

# Training loop
for episode in range(1000):
    state = env.reset()
    done = False
    
    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(Q[state])  # Exploit
        
        # Take action
        next_state, reward, done, _ = env.step(action)
        
        # Q-learning update
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
Problem Types Deep Dive

Regression vs Classification

Aspect       Regression           Classification
Output       Continuous number    Discrete category
Examples     Price, temperature   Spam/ham, cat/dog
Evaluation   MSE, RMSE, R²        Accuracy, F1, ROC-AUC

When to use Regression

  • Predicting quantities (sales, revenue)
  • Forecasting (time series)
  • Estimating continuous relationships

When to use Classification

  • Yes/No decisions (approve loan)
  • Multi-class problems (image recognition)
  • Ranking/prioritization
Supervised Learning Algorithms

1. Linear Regression

Fits a straight line to predict continuous values. Good for linear relationships, interpretable.

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
print(f"R² Score: {model.score(X_test, y_test):.2f}")

2. Logistic Regression

Predicts probability of binary outcomes (0 or 1). Good baseline for classification.

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred_proba = model.predict_proba(X_test)

3. Decision Trees

Tree of if-then-else rules. Handles non-linear data, interpretable, but prone to overfitting.

from sklearn.tree import DecisionTreeClassifier, plot_tree
model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train, y_train)

# Visualize
plot_tree(model, filled=True)

4. Random Forest

Ensemble of decision trees. High accuracy, robust, handles non-linear data.

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Feature importance
importances = model.feature_importances_

5. K-Nearest Neighbors (KNN)

Classifies based on the K closest examples. Simple, no training step, but prediction is slow on large datasets.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Scale features first!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_scaled, y)

6. Support Vector Machines (SVM)

Finds optimal hyperplane. Great for high-dimensional data (text/images).

from sklearn.svm import SVC
model = SVC(kernel='rbf', C=1.0)
model.fit(X_train_scaled, y_train)

7. Naive Bayes

Probabilistic classifier assuming feature independence. Fast, great for text.

from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train, y_train)

8. Gradient Boosting (XGBoost)

Sequential ensemble of weak learners. State-of-the-art accuracy for tabular data.

from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)
Unsupervised Learning Algorithms

1. K-Means Clustering

Groups data into K clusters based on similarity.

from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
clusters = model.fit_predict(X)

# Elbow method for K
inertias = [KMeans(k).fit(X).inertia_ for k in range(1, 10)]

2. Principal Component Analysis (PCA)

Reduces dimensionality while preserving variance.

from sklearn.decomposition import PCA
pca = PCA(n_components=0.95) # Keep 95% variance
X_pca = pca.fit_transform(X_scaled)

3. Autoencoders

Neural network that learns compressed representation. Good for anomaly detection.

# Simple autoencoder (Keras; assumes input_dim is defined)
from tensorflow import keras
from tensorflow.keras import layers

input_img = keras.Input(shape=(input_dim,))
encoded = layers.Dense(32, activation='relu')(input_img)
decoded = layers.Dense(input_dim, activation='sigmoid')(encoded)
autoencoder = keras.Model(input_img, decoded)
Algorithm Comparison & Selection

Comparison Table

Algorithm       Type   Pros                 Cons
Linear Reg      Reg    Fast, interpretable  Linear only
Random Forest   Both   Accurate, robust     Slow, opaque
XGBoost         Both   High accuracy        Overfitting risk
SVM             Both   High dimensions      Slow on large data

Selection Guide

By Problem Type:

  • Regression: Linear Regression (baseline) → Random Forest/XGBoost
  • Binary Class: Logistic Regression (baseline) → Random Forest/XGBoost
  • Multi-class: Random Forest or Neural Networks
  • Clustering: K-Means (if K known) or DBSCAN

By Data Size:

  • Small (<1k): Logistic Reg, Naive Bayes, KNN
  • Medium: Random Forest, SVM, XGBoost
  • Large (>100k): Neural Networks, XGBoost
Model Evaluation

Regression Metrics

from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, y_pred) # Lower is better
r2 = r2_score(y_test, y_pred) # 1.0 is perfect

Classification Metrics

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred) # Balance precision/recall
auc = roc_auc_score(y_test, y_pred_proba) # Threshold independent

Cross-Validation

from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
print(f"CV Acc: {scores.mean():.2%}")
Best Practices & Pitfalls

Best Practices

  • Start Simple: Baseline with Linear/Logistic Regression.
  • Understand Data: Visualize and clean before modeling.
  • Feature Engineering: Often more impactful than model choice.
  • Always Split: Train/Test split to prevent leakage.
  • Scale Features: Essential for KNN, SVM, Neural Nets.

Common Mistakes

  • โŒ Data Leakage: Scaling before splitting.
  • โŒ Imbalanced Data: Using accuracy as the only metric.
  • โŒ Overfitting: High train accuracy, low test accuracy.
  • โŒ No Cross-Validation: Unreliable performance estimates.
Learning Path & Decision Tree

Learning Path

  1. Beginner: Linear/Logistic Reg, Decision Trees, K-Means.
  2. Intermediate: Random Forest, KNN, Naive Bayes, PCA.
  3. Advanced: SVM, Gradient Boosting, Neural Networks, RL.

Quick Decision Tree

Predict Number?
├─ Yes → REGRESSION
│  ├─ Linear? → Linear Regression
│  └─ Complex? → Random Forest / XGBoost
└─ No → CLASSIFICATION
   ├─ Binary? → Logistic Regression
   ├─ Text? → Naive Bayes
   └─ Complex? → Random Forest / XGBoost
๐Ÿ”ง Key Algorithms
Essential machine learning algorithms including Linear Regression, Logistic Regression, Decision Trees, Random Forest, SVM, k-NN, Naive Bayes, and Neural Networks with their key concepts and use cases.
Linear/Logistic Regression
โ–ผ

Linear Regression

Predicts continuous values: y = mx + b

  • When: Linear relationship between features and target
  • Assumptions: Linearity, independence, homoscedasticity, normality
  • Pros: Fast, interpretable, works with small data
  • Cons: Assumes linearity, sensitive to outliers
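A minimal fit on toy data, showing that the learned coefficients recover the underlying slope and intercept (synthetic data, illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y ≈ 2x + 1 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X.ravel() + 1 + rng.normal(0, 0.5, size=100)

model = LinearRegression()
model.fit(X, y)
print(model.coef_[0], model.intercept_)  # close to slope 2, intercept 1
```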

Logistic Regression

Binary classification using sigmoid function

  • When: Binary outcomes (yes/no, 0/1)
  • Output: Probability between 0 and 1
  • Pros: Probabilistic output, fast, interpretable
  • Cons: Linear decision boundary
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Decision Trees/Random Forests
โ–ผ

Decision Trees

Tree structure of if-else decisions

  • Pros: Easy to interpret, handles non-linear relationships, no scaling needed
  • Cons: Prone to overfitting, unstable (small changes โ†’ different tree)

Random Forests

Ensemble of many decision trees (bagging)

  • How: Build multiple trees on random subsets, average predictions
  • Pros: Reduces overfitting, handles missing values, feature importance
  • Cons: Less interpretable, slower than single tree
  • Best for: Tabular data, when you need robust performance
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, max_depth=10)
rf.fit(X_train, y_train)

# Feature importance
importances = rf.feature_importances_
Great baseline model for tabular data
Support Vector Machines
โ–ผ

Core Concept

Find the hyperplane that maximizes margin between classes

The Kernel Trick

Transform data to higher dimensions without computing coordinates

  • Linear: For linearly separable data
  • RBF (Radial Basis Function): Most common, handles non-linear
  • Polynomial: For polynomial relationships

When to Use

  • High-dimensional spaces (text, images)
  • Clear margin of separation
  • Small to medium datasets

Pros & Cons

Pros: Effective in high dimensions, memory efficient

Cons: Slow on large datasets, requires feature scaling

from sklearn.svm import SVC

svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train, y_train)
Best for: Text classification, image recognition
Neural Networks
โ–ผ

Architecture Components

  • Input Layer: Receives features
  • Hidden Layers: Learn representations (deep = many layers)
  • Output Layer: Produces predictions
  • Activation Functions: ReLU (hidden), Sigmoid/Softmax (output)

Key Concepts

  • Backpropagation: Update weights using gradient descent
  • Learning Rate: How big each update step is (0.001-0.01 typical)
  • Epochs: Full passes through training data
  • Batch Size: Samples processed before updating weights

When to Use

Complex patterns, images, text, audio, large datasets

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(64, activation='relu', input_shape=(10,)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X_train, y_train, epochs=50, batch_size=32)
Deep learning powerhouse
k-NN, k-Means, Naive Bayes
โ–ผ

k-Nearest Neighbors (k-NN)

Classify based on k closest training examples

  • Pros: Simple, no training phase, works for multi-class
  • Cons: Slow prediction, sensitive to scale and irrelevant features
  • Tip: Always scale features, try k=3,5,7
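The tip above can be sketched as a small search over k, with scaling folded into a pipeline so it happens inside each CV fold (synthetic data, illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Compare a few small odd values of k; pick the best CV score
for k in (3, 5, 7):
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"k={k}: CV accuracy {score:.3f}")
```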

k-Means Clustering

Partition data into k clusters (unsupervised)

  • How: Assign points to nearest centroid, update centroids, repeat
  • Use for: Customer segmentation, data compression
  • Choosing k: Elbow method (plot within-cluster sum of squares)
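The elbow method from the bullet above, sketched on synthetic blobs; `inertia_` is k-means' within-cluster sum of squares, which drops sharply until k reaches the true cluster count and then flattens:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"k={k}: WCSS={km.inertia_:.1f}")
```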

Naive Bayes

Probabilistic classifier using Bayes' theorem

  • Assumption: Features are conditionally independent given the class (rarely true, but often works anyway)
  • Best for: Text classification (spam detection, sentiment)
  • Pros: Fast, works with small data, handles high dimensions
# k-NN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)

# k-Means
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
Deep Learning Frameworks
โ–ผ

TensorFlow/Keras

Google's production-ready deep learning framework

  • Best for: Production deployment, mobile (TensorFlow Lite), research
  • Pros: Industry standard, excellent documentation, TensorBoard visualization
  • Keras: High-level API for TensorFlow (easy to use)
  • Use when: Need production deployment, mobile apps, or serving at scale
import tensorflow as tf
from tensorflow import keras

# Sequential API (simple)
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(10,)),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    callbacks=[keras.callbacks.EarlyStopping(patience=5)]
)

# Functional API (complex architectures)
inputs = keras.Input(shape=(10,))
x = keras.layers.Dense(64, activation='relu')(inputs)
x = keras.layers.Dense(32, activation='relu')(x)
outputs = keras.layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs=inputs, outputs=outputs)

PyTorch

Facebook's research-focused deep learning framework

  • Best for: Research, experimentation, dynamic models
  • Pros: Pythonic, dynamic computation graphs, easier debugging
  • Popular in: Academic research, NLP (Hugging Face), computer vision
  • Use when: Need flexibility, research, or custom architectures
import torch
import torch.nn as nn
import torch.optim as optim

# Define model
class NeuralNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
        self.dropout = nn.Dropout(0.2)
        
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        x = torch.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))
        return x

model = NeuralNet()
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()

TensorFlow vs PyTorch

Aspect          TensorFlow                    PyTorch
--------------  ----------------------------  ------------------------
Ease of Use     Keras makes it easy           More Pythonic, intuitive
Learning Curve  Moderate                      Easier for Python devs
Deployment      Excellent (TF Serving, Lite)  Good (TorchServe)
Research        Good                          Dominant in academia
Debugging       Harder (static graphs)        Easier (dynamic graphs)
Community       Large, industry-focused       Large, research-focused

Common Use Cases

  • Computer Vision: Both (PyTorch slightly preferred)
  • NLP: PyTorch (Hugging Face Transformers)
  • Production/Mobile: TensorFlow
  • Research Papers: PyTorch
  • Time Series: Both

Key Libraries

  • TensorFlow: Keras, TensorBoard, TF Data, TF Lite
  • PyTorch: torchvision, torchtext, Lightning (wrapper)
  • Both: ONNX (model interchange format)
Start with Keras for simplicity, PyTorch for research
๐Ÿ“Š Model Evaluation
Comprehensive metrics and techniques for evaluating machine learning models including accuracy, precision, recall, F1-score, ROC-AUC, and confusion matrices for both classification and regression tasks.
Classification Metrics
โ–ผ

Accuracy

Correct predictions / Total predictions

  • When: Balanced classes
  • Misleading when: Imbalanced data (e.g., 95% class A, 5% class B)

Precision

True Positives / (True Positives + False Positives)

  • Question: Of predicted positives, how many are correct?
  • Use when: False positives are costly (spam filter)

Recall (Sensitivity)

True Positives / (True Positives + False Negatives)

  • Question: Of actual positives, how many did we catch?
  • Use when: False negatives are costly (disease detection)

F1-Score

Harmonic mean of precision and recall: 2 ร— (Precision ร— Recall) / (Precision + Recall)

  • Use when: Balance between precision and recall matters
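A quick worked example of the formula, with hypothetical precision and recall values:

```python
# Hypothetical values for illustration
precision, recall = 0.8, 0.6

# Harmonic mean penalizes imbalance between the two
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # 0.686
```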

ROC-AUC

Area Under the Receiver Operating Characteristic curve

  • Plots True Positive Rate vs False Positive Rate
  • AUC = 1.0: Perfect classifier
  • AUC = 0.5: Random guessing
  • Use when: Comparing models across thresholds
from sklearn.metrics import classification_report, roc_auc_score

print(classification_report(y_test, y_pred))
auc = roc_auc_score(y_test, y_pred_proba)
Choose metric based on business impact
Regression Metrics
โ–ผ

Mean Squared Error (MSE)

Average of squared differences: ฮฃ(actual - predicted)ยฒ / n

  • Penalizes large errors heavily
  • Units are the square of the target's units (less interpretable)

Root Mean Squared Error (RMSE)

Square root of MSE: โˆšMSE

  • Same units as target variable
  • Most common regression metric
  • More interpretable than MSE

Mean Absolute Error (MAE)

Average of absolute differences: ฮฃ|actual - predicted| / n

  • Less sensitive to outliers than MSE/RMSE
  • Same units as target variable
  • More robust metric

Rยฒ (Coefficient of Determination)

Proportion of variance explained: 1 - (SS_res / SS_tot)

  • Rยฒ = 1.0: Perfect predictions
  • Rยฒ = 0.0: As good as predicting mean
  • Can be negative for bad models
  • Scale-independent (compare across datasets)
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
RMSE for magnitude, Rยฒ for model quality
Cross-Validation
โ–ผ

Why Cross-Validation?

Get more reliable performance estimates using all data for both training and validation

k-Fold Cross-Validation

  • Split data into k folds (typically k=5 or 10)
  • Train on k-1 folds, validate on remaining fold
  • Repeat k times, average results
  • Pros: Every sample used for both training and validation

Stratified k-Fold

  • Maintains class distribution in each fold
  • Use for: Imbalanced classification problems

Leave-One-Out (LOO)

  • k = n (number of samples)
  • Use for: Very small datasets
  • Con: Computationally expensive

Time Series Split

  • Respects temporal ordering
  • Critical for: Sequential data (stocks, sales)
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Simple k-fold
scores = cross_val_score(model, X, y, cv=5)
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

# Stratified k-fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)
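For the time-series case above, sklearn's TimeSeriesSplit keeps every training window strictly before its validation window. A minimal sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered samples, illustrative only
X = np.arange(12).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=3)

# Each training window ends before its validation window begins
for train_idx, val_idx in tscv.split(X):
    print(f"train={train_idx.tolist()} val={val_idx.tolist()}")
```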
Always use CV for model selection
Confusion Matrix
โ–ผ

The Matrix

                Predicted
                 Pos    Neg
Actual  Pos     TP     FN
        Neg     FP     TN

Understanding Each Cell

  • True Positive (TP): Correctly predicted positive
  • True Negative (TN): Correctly predicted negative
  • False Positive (FP): Incorrectly predicted positive (Type I error)
  • False Negative (FN): Incorrectly predicted negative (Type II error)
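Precision, recall, and accuracy fall straight out of these four cells (hypothetical counts for illustration):

```python
# Hypothetical cell counts
tn, fp, fn, tp = 50, 5, 10, 35

precision = tp / (tp + fp)                   # 35/40 = 0.875
recall = tp / (tp + fn)                      # 35/45 ≈ 0.778
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 85/100 = 0.85
print(precision, recall, accuracy)
```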

What to Look For

  • High FP? Model too aggressive → raise the decision threshold
  • High FN? Model too conservative → lower the decision threshold
  • Imbalanced diagonal? Class imbalance or poor model

Quick Code

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
Always visualize your confusion matrix
๐Ÿ’ก Practical Tips
Expert guidance on hyperparameter tuning, avoiding overfitting, dealing with imbalanced data, feature engineering best practices, and essential ML workflow tips for building production-ready models.
Hyperparameter Tuning
โ–ผ

Grid Search

Try every combination of specified parameters

  • Pros: Exhaustive, guaranteed to find best in grid
  • Cons: Exponentially slow with more parameters
  • Use when: Few parameters, small ranges
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_

Random Search

Sample random combinations

  • Pros: Faster, explores more space
  • Cons: May miss optimal
  • Use when: Many parameters, large ranges
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': [100, 200, 300, 400],
    'max_depth': [10, 20, 30, 40, None],
    'min_samples_split': [2, 5, 10, 15]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_dist,
    n_iter=20,  # Number of random combinations
    cv=5,
    n_jobs=-1
)

random_search.fit(X_train, y_train)

Key Hyperparameters by Algorithm

Random Forest: n_estimators, max_depth, min_samples_split

SVM: C (regularization), kernel, gamma

Neural Networks: learning_rate, batch_size, hidden_layers, neurons

XGBoost: learning_rate, max_depth, n_estimators, subsample

Start with defaults, then tune most important params
Algorithm Selection Guide
โ–ผ

By Problem Type

Binary Classification:

  • Logistic Regression (baseline)
  • Random Forest (robust)
  • XGBoost (high performance)
  • Neural Networks (complex patterns)

Multi-class Classification:

  • Random Forest
  • XGBoost
  • Naive Bayes (text)

Regression:

  • Linear Regression (baseline)
  • Random Forest
  • XGBoost
  • Neural Networks

Clustering:

  • K-Means (spherical clusters)
  • DBSCAN (arbitrary shapes, outliers)
  • Hierarchical (dendrograms)

By Data Characteristics

Small Data (<10k samples):

  • Logistic Regression, Naive Bayes
  • Simple models to avoid overfitting

Large Data (>100k samples):

  • Neural Networks, XGBoost
  • Can learn complex patterns

High Dimensional (many features):

  • Regularized models (Lasso, Ridge)
  • Random Forest (handles many features)
  • Feature selection first

Imbalanced Classes:

  • Random Forest with class_weight='balanced'
  • XGBoost with scale_pos_weight
  • SMOTE for oversampling
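A minimal sketch of the class-weight option on an imbalanced toy set; `'balanced'` reweights classes inversely to their frequency. (SMOTE lives in the separate imbalanced-learn package and is not shown here.)

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 90/10 imbalance
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

clf = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                             random_state=0)

# Score with F1 rather than accuracy, per the advice above
f1 = cross_val_score(clf, X, y, cv=5, scoring='f1').mean()
print(f"F1 (minority class): {f1:.3f}")
```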

Quick Decision Tree

Need interpretability? โ†’ Logistic Regression or Decision Tree

Need high accuracy? โ†’ XGBoost or Random Forest

Have images/text? โ†’ Neural Networks (CNN/RNN)

Limited time? โ†’ Start with Random Forest

Always try multiple algorithms
Common Pitfalls & Debugging
โ–ผ

Data Leakage

Information from test set leaks into training

  • Example: Scaling before train/test split
  • Fix: Always split first, then preprocess
  • Example: Using future information in time series
  • Fix: Use time-based split
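The scaling fix in code: split first, fit the scaler on the training portion only, then apply the same transform to both splits (sketch with synthetic data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

scaler = StandardScaler().fit(X_train)   # statistics from training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # test set never influences the scaler
```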

Class Imbalance

One class dominates dataset (e.g., 95% vs 5%)

  • Symptom: High accuracy but poor recall on minority class
  • Solutions:
    • Use stratified sampling
    • Oversample minority class (SMOTE)
    • Undersample majority class
    • Use class weights
    • Change evaluation metric (F1, AUC instead of accuracy)

Poor Performance Checklist

  • โœ“ Check for data leakage
  • โœ“ Verify train/test split is correct
  • โœ“ Look for missing values
  • โœ“ Check feature scaling
  • โœ“ Examine class distribution
  • โœ“ Plot learning curves (more data needed?)
  • โœ“ Try different algorithms
  • โœ“ Engineer better features

Model Not Learning

  • Neural Networks: Learning rate too high/low, bad initialization
  • All models: Features not informative, need more data

Overfitting Signs

  • Training accuracy >> test accuracy (gap >10%)
  • Performance degrades on new data
  • Model too complex for data size
# Check for data leakage
from sklearn.model_selection import cross_val_score

# Train score far above CV score → overfitting;
# a suspiciously high CV score can itself signal leakage
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"CV: {cv_scores.mean():.3f}, Train: {train_score:.3f}")
โš ๏ธ Always validate on unseen data
Quick Reference Code
โ–ผ

Complete ML Pipeline

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# 1. Load data
df = pd.read_csv('data.csv')

# 2. Basic exploration
print(df.info())
print(df.describe())
print(df.isnull().sum())

# 3. Prepare features and target
X = df.drop('target', axis=1)
y = df['target']

# 4. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 5. Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 6. Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# 7. Evaluate
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# 8. Cross-validation
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
print(f"CV Score: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

Pandas Essentials

# Load data
df = pd.read_csv('file.csv')

# Exploration
df.head()
df.shape
df.dtypes
df.describe()
df.isnull().sum()

# Selection
df['column']
df[['col1', 'col2']]
df[df['age'] > 30]

# Missing values
df.dropna()
df.fillna(df.mean(numeric_only=True))  # numeric columns only

# Encoding
pd.get_dummies(df, columns=['category'])

# Group by
df.groupby('category')['value'].mean()
Bookmark this for quick reference!