Data Preprocessing
Data preprocessing is a crucial step in any machine learning workflow. It involves cleaning, transforming, and organizing raw data so it becomes suitable for training reliable and accurate machine learning models. Since models heavily depend on data quality, preprocessing ensures the dataset is consistent, complete, and ready for analysis.
Importance of Data Preprocessing
Real-world datasets often contain missing values, noise, outliers, and inconsistencies. Without preprocessing, these issues can severely degrade model performance.
Key Benefits
- Improves data quality and consistency
- Enhances model accuracy and generalization
- Reduces training time and computational cost
- Makes features compatible with ML algorithms
- Improves model interpretability
Key Preprocessing Techniques
1. Handling Missing Data
Missing values arise from data-entry errors, incomplete records, or sensor failures.
Techniques & Examples
- Identify: `df.isnull().sum()`
- Mean/Median/Mode Imputation: e.g., fill missing salary values with the median salary
- Forward/Backward Fill: e.g., fill missing temperature readings using previous values
- Drop Rows/Columns: typically only when less than 5–10% of the values are missing (see the sketch below)
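A minimal pandas sketch of these techniques, using a small hypothetical DataFrame (the `salary` and `temperature` columns are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps (columns invented for illustration)
df = pd.DataFrame({
    "salary": [52000, np.nan, 61000, 58000, np.nan],
    "temperature": [21.5, np.nan, 22.1, np.nan, 23.0],
})

# Identify: count missing values per column
print(df.isnull().sum())

# Median imputation for a skewed numeric column
df["salary"] = df["salary"].fillna(df["salary"].median())

# Forward fill for time-ordered sensor readings
df["temperature"] = df["temperature"].ffill()

# Drop any rows that remain incomplete
df = df.dropna()
```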
2. Handling Outliers
Outliers are extreme values that can distort model training and skew analysis.
Detection & Treatment Examples
- Box Plot + IQR Method: flag values above Q3 + 1.5×IQR
- Z-score Method: remove values where |z| > 3
- Scatter Plot: e.g., detecting a 25-year-old earning $10M per year (see the sketch below)
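Both detection rules take only a few lines; the sketch below assumes a hypothetical `incomes` series. Note that on a sample this small the extreme value inflates the standard deviation, so the z-score rule is most reliable on larger datasets.

```python
import pandas as pd

# Hypothetical income data with one extreme value
incomes = pd.Series([48_000, 52_000, 55_000, 61_000, 58_000, 10_000_000])

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = incomes.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = incomes[(incomes < q1 - 1.5 * iqr) | (incomes > q3 + 1.5 * iqr)]

# Z-score method: flag values where |z| > 3
z = (incomes - incomes.mean()) / incomes.std()
z_outliers = incomes[z.abs() > 3]
```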
3. Feature Scaling
Scaling ensures that features contribute equally during model training.
Scaling Techniques
| Technique | Formula | Best Used For |
|---|---|---|
| Min-Max Scaling | (X − Xmin) / (Xmax − Xmin) | KNN, Neural Networks, SVM |
| Standardization | (X − μ) / σ | Logistic Regression, PCA, SVM |
Example: Scale Age (0–100) and Income (20k–500k) to a common range.
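A sketch of both techniques with scikit-learn, on hypothetical `age` and `income` columns:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales
df = pd.DataFrame({
    "age": [23, 45, 31, 60, 18],
    "income": [25_000, 480_000, 72_000, 150_000, 21_000],
})

# Min-max scaling: each feature mapped to [0, 1]
X_minmax = MinMaxScaler().fit_transform(df)

# Standardization: each feature rescaled to zero mean, unit variance
X_standard = StandardScaler().fit_transform(df)
```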
4. Encoding Categorical Variables
Convert text categories into numerical values for model training.
Encoding Types
| Type | Method | Example |
|---|---|---|
| Nominal | One-Hot Encoding | Color: Red → [1,0,0], Blue → [0,1,0] |
| Ordinal | Ordinal Encoding | Size: Small→0, Medium→1, Large→2 |
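Both encodings in pandas, on a hypothetical frame with `color` and `size` columns:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["Red", "Blue", "Green"],
    "size": ["Small", "Large", "Medium"],
})

# Nominal: one-hot encode color into binary indicator columns
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal: map size onto its natural order
size_order = {"Small": 0, "Medium": 1, "Large": 2}
df["size_encoded"] = df["size"].map(size_order)
```

scikit-learn's `OneHotEncoder` and `OrdinalEncoder` provide the same transforms in a form that slots into a `Pipeline`.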
5. Feature Engineering
Feature engineering involves creating new, meaningful features to improve model performance.
Examples
- From `Birth_Date` → `Age`, `Is_Adult`, `Generation`
- From `House_Size` + `Rooms` → `Area_Per_Room`
- From `Last_Login` → `Days_Since_Last_Active`
- `Total_Price` = `Quantity` × `Unit_Price` (see the sketch below)
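Most of these derived features take one line each in pandas; the sketch below uses invented values and a fixed reference date so the arithmetic is reproducible:

```python
import pandas as pd

df = pd.DataFrame({
    "Birth_Date": pd.to_datetime(["1990-04-12", "2008-09-30"]),
    "Last_Login": pd.to_datetime(["2023-12-20", "2023-11-02"]),
    "House_Size": [1200, 2400],   # hypothetical square feet
    "Rooms": [4, 6],
    "Quantity": [3, 2],
    "Unit_Price": [19.99, 250.00],
})

today = pd.Timestamp("2024-01-01")  # fixed reference date for reproducibility
df["Age"] = (today - df["Birth_Date"]).dt.days // 365
df["Is_Adult"] = df["Age"] >= 18
df["Days_Since_Last_Active"] = (today - df["Last_Login"]).dt.days
df["Area_Per_Room"] = df["House_Size"] / df["Rooms"]
df["Total_Price"] = df["Quantity"] * df["Unit_Price"]
```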
6. Dimensionality Reduction
Reduces the number of features while retaining essential information.
Common Methods
- PCA: reduce 50 correlated features to 10 components
- Feature Selection: SelectKBest, RFE
- t-SNE / UMAP: for visualization only
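A sketch of the first two methods with scikit-learn on synthetic data (the shapes and the choice of k = 10 are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))    # 200 samples, 50 features
y = rng.integers(0, 2, size=200)  # binary labels

# PCA: project the 50 features onto 10 principal components
X_pca = PCA(n_components=10).fit_transform(X)

# Feature selection: keep the 10 features most associated with y
X_best = SelectKBest(f_classif, k=10).fit_transform(X, y)
```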
7. Handling Imbalanced Datasets
Important in fraud detection, rare disease prediction, and anomaly detection.
Techniques
| Technique | Description | Notes |
|---|---|---|
| Oversampling | Duplicate minority samples | Can cause overfitting |
| Undersampling | Remove majority samples | May lose important information |
| SMOTE | Generate synthetic minority samples | Widely used and effective |
| Class Weights | Penalize majority-class errors | Effective for tree-based and linear models |
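Class weights need only a constructor argument in scikit-learn; SMOTE comes from the separate imbalanced-learn package. The dataset below is synthetic, with a hypothetical 95/5 class split:

```python
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem where ~95% of samples belong to one class
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# Class weights: misclassifying the rare class costs more
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# SMOTE: generate synthetic minority samples before training
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
```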
Golden Rule
Always split the data into training, validation, and test sets before fitting any preprocessing step, and fit imputers, scalers, and encoders on the training set only; computing statistics from the full dataset and applying them to held-out data causes data leakage.
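A minimal sketch of this leakage-safe order of operations with scikit-learn (synthetic data; the split ratio is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)

# 1. Split first, before any statistics are computed from the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Fit the preprocessing step on the training set only
scaler = StandardScaler().fit(X_train)

# 3. Apply the already-fitted transform to both splits
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Wrapping the scaler and model in a `sklearn.pipeline.Pipeline` enforces this ordering automatically, including during cross-validation.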