Data Preprocessing

Data preprocessing is a crucial step in any machine learning workflow. It involves cleaning, transforming, and organizing raw data so it becomes suitable for training reliable and accurate machine learning models. Since models heavily depend on data quality, preprocessing ensures the dataset is consistent, complete, and ready for analysis.

Importance of Data Preprocessing

Real-world datasets often contain missing values, noise, outliers, and inconsistencies. Without preprocessing, these issues can severely degrade model performance.

Key Benefits
  • Improves data quality and consistency
  • Enhances model accuracy and generalization
  • Reduces training time and computational cost
  • Makes features compatible with ML algorithms
  • Improves model interpretability

Key Preprocessing Techniques

1. Handling Missing Data

Missing values occur due to errors, incomplete entries, or sensor failures.

Techniques & Examples
  • Identify: df.isnull().sum()
  • Mean/Median/Mode Imputation: e.g., fill missing salary values with the median salary
  • Forward/Backward Fill: e.g., fill missing temperature readings using previous values
  • Drop Rows/Columns: only when a small share of the data (roughly under 5–10%) is missing
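
A minimal pandas sketch of these steps, using a small made-up DataFrame (the salary and temperature columns are hypothetical):

    import pandas as pd
    import numpy as np

    # Hypothetical dataset with missing salary and temperature readings
    df = pd.DataFrame({
        "salary": [52000, np.nan, 61000, 58000, np.nan],
        "temperature": [21.5, np.nan, 22.1, np.nan, 23.0],
    })

    # Identify missing values per column
    print(df.isnull().sum())

    # Median imputation for salary
    df["salary"] = df["salary"].fillna(df["salary"].median())

    # Forward fill for time-ordered temperature readings
    df["temperature"] = df["temperature"].ffill()

    # Or drop any rows that still contain missing values
    df = df.dropna()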

2. Handling Outliers

Outliers are extreme values that can distort model training and skew analysis.

Detection & Treatment Examples
  • Box Plot + IQR Method: flag values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR
  • Z-score Method: remove values where |z| > 3
  • Scatter Plot: e.g., detecting a 25-year-old earning $10M per year
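
A short sketch of both detection methods on a hypothetical income column (the extreme value mirrors the scatter-plot example above):

    import pandas as pd

    # Hypothetical incomes with one extreme value
    incomes = pd.Series([45000, 52000, 48000, 61000, 10_000_000])

    # IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = incomes.quantile([0.25, 0.75])
    iqr = q3 - q1
    iqr_outliers = (incomes < q1 - 1.5 * iqr) | (incomes > q3 + 1.5 * iqr)

    # Z-score method: flag values where |z| > 3
    z = (incomes - incomes.mean()) / incomes.std()
    z_outliers = z.abs() > 3

    print(incomes[iqr_outliers])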

3. Feature Scaling

Scaling puts features on comparable ranges so that large-valued features do not dominate distance-based or gradient-based training.

Scaling Techniques
  • Min-Max Scaling: (X − Xmin) / (Xmax − Xmin); best for KNN, Neural Networks, SVM
  • Standardization: (X − μ) / σ; best for Logistic Regression, PCA, SVM

Example: Scale Age (0–100) and Income (20k–500k) to a common range.
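
A minimal sketch using scikit-learn's MinMaxScaler and StandardScaler; the age and income values are hypothetical:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    # Hypothetical features on very different scales: age and income
    X = np.array([[25, 40000], [47, 120000], [62, 500000], [33, 20000]], dtype=float)

    # Min-max scaling maps each feature to [0, 1]
    X_minmax = MinMaxScaler().fit_transform(X)

    # Standardization gives each feature zero mean and unit variance
    X_std = StandardScaler().fit_transform(X)

    print(X_minmax)
    print(X_std)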

4. Encoding Categorical Variables

Convert text categories into numerical values for model training.

Encoding Types
  • Nominal: One-Hot Encoding, e.g., Color: Red → [1,0,0], Blue → [0,1,0], Green → [0,0,1]
  • Ordinal: Ordinal Encoding, e.g., Size: Small → 0, Medium → 1, Large → 2
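
Both encodings in pandas, assuming hypothetical color and size columns:

    import pandas as pd

    # Hypothetical dataset with one nominal and one ordinal column
    df = pd.DataFrame({
        "color": ["Red", "Blue", "Green", "Blue"],
        "size": ["Small", "Large", "Medium", "Small"],
    })

    # One-hot encode the nominal column (no implied order)
    df = pd.get_dummies(df, columns=["color"])

    # Ordinal encode size, preserving its natural order
    size_order = {"Small": 0, "Medium": 1, "Large": 2}
    df["size"] = df["size"].map(size_order)

    print(df)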

5. Feature Engineering

Feature engineering involves creating new, meaningful features to improve model performance.

Examples
  • From Birth_Date → Age, Is_Adult, Generation
  • From House_Size + Rooms → Area_Per_Room
  • From Last_Login → Days_Since_Last_Active
  • Total_Price = Quantity × Unit_Price
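
A pandas sketch deriving these features; the raw columns and the fixed reference date are hypothetical:

    import pandas as pd

    # Hypothetical raw columns matching the examples above
    df = pd.DataFrame({
        "birth_date": pd.to_datetime(["1990-04-12", "2010-08-30"]),
        "house_size": [1200, 2400],
        "rooms": [3, 6],
        "quantity": [2, 5],
        "unit_price": [9.99, 4.50],
        "last_login": pd.to_datetime(["2024-01-01", "2024-03-15"]),
    })

    today = pd.Timestamp("2024-06-01")

    # Derived features
    df["age"] = (today - df["birth_date"]).dt.days // 365
    df["is_adult"] = df["age"] >= 18
    df["area_per_room"] = df["house_size"] / df["rooms"]
    df["days_since_last_active"] = (today - df["last_login"]).dt.days
    df["total_price"] = df["quantity"] * df["unit_price"]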

6. Dimensionality Reduction

Reduces the number of features while retaining essential information.

Common Methods
  • PCA: reduce 50 correlated features to 10 components
  • Feature Selection: SelectKBest, RFE
  • t-SNE / UMAP: for visualization only
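
A PCA sketch with scikit-learn, using a synthetic 50-feature dataset for illustration:

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA

    # Synthetic dataset with 50 (partly redundant) features
    X, y = make_classification(n_samples=200, n_features=50,
                               n_informative=10, random_state=42)

    # Reduce 50 features to 10 principal components
    pca = PCA(n_components=10)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                      # (200, 10)
    print(pca.explained_variance_ratio_.sum())  # fraction of variance retained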

7. Handling Imbalanced Datasets

Class imbalance is common in fraud detection, rare disease prediction, and anomaly detection, where a model trained on the raw data tends to ignore the minority class.

Techniques
  • Oversampling: duplicate minority-class samples; simple but can cause overfitting
  • Undersampling: remove majority-class samples; may discard important information
  • SMOTE: generate synthetic minority-class samples; widely used and effective
  • Class Weights: weight errors on the minority class more heavily during training; effective for tree-based and linear models
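
A sketch of the class-weights approach on a synthetic imbalanced dataset; the commented lines show SMOTE, which lives in the separate imbalanced-learn package:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic 95/5 imbalanced dataset
    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                               random_state=42)

    # class_weight="balanced" reweights errors inversely to class frequency
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X, y)

    # SMOTE alternative (requires imbalanced-learn):
    # from imblearn.over_sampling import SMOTE
    # X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)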
Golden Rule

Always split the data into training, validation, and test sets before preprocessing; fit imputers, scalers, and encoders on the training set only, then apply the fitted transformations to the other splits. Fitting on the full dataset leaks statistics from held-out data into training.
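
A minimal sketch of the leakage-safe pattern: split first, then fit the scaler on the training data only:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=500, random_state=42)

    # Split FIRST, so test data never influences the fitted preprocessing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Fit the scaler on the training set only ...
    scaler = StandardScaler().fit(X_train)

    # ... then apply the same learned parameters to both splits
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)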