Evaluation Metrics, Overfitting/Underfitting
1. Overfitting and Underfitting
1.1 Overfitting
Definition
Overfitting occurs when a machine learning model learns both the underlying patterns and the random fluctuations (noise) present in the training data. As a result, the model performs extremely well on the training set but exhibits poor generalization performance on unseen data.
Key Characteristics
- Very low training error
- Significantly higher validation/test error (large generalization gap)
- High variance and low bias
Common Causes
- Model complexity excessively high relative to the amount of training data (e.g., very deep neural networks, high-degree polynomials; see the sketch after this list)
- Limited or non-representative training data
- Lack of regularization
- Training continued far beyond the point of best validation performance (no early stopping)
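A minimal sketch of the first cause above, on synthetic data assumed here purely for illustration: a high-degree polynomial fit to a small training set drives the training error down while the held-out error stays much larger.

```python
# Minimal sketch: a degree-15 polynomial fit to 20 noisy samples of a sine wave
# (synthetic data assumed for illustration) memorizes the training set.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(20, 1))
y_train = np.sin(2 * np.pi * X_train[:, 0]) + rng.normal(scale=0.2, size=20)
X_test = rng.uniform(0, 1, size=(200, 1))
y_test = np.sin(2 * np.pi * X_test[:, 0]) + rng.normal(scale=0.2, size=200)

model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression()).fit(X_train, y_train)
print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))  # very small
print("test  MSE:", mean_squared_error(y_test, model.predict(X_test)))    # much larger: a wide generalization gap
```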
Prevention and Mitigation Techniques
- L1 (Lasso), L2 (Ridge), or Elastic Net regularization
- Dropout, DropConnect, and stochastic depth (in neural networks)
- Early stopping using a validation set (see the sketch after this list)
- Data augmentation and acquiring additional training examples
- Cross-validation for more reliable performance estimation
- Model pruning, architecture simplification, or reducing capacity
- Ensemble methods (bagging, random forests) to reduce variance
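As a concrete illustration of early stopping combined with L2 regularization, here is a minimal scikit-learn sketch; the arrays X_tr, y_tr, X_val, y_val are assumed to already exist, and the patience value is an illustrative choice, not a recommendation.

```python
# Minimal early-stopping sketch (X_tr, y_tr, X_val, y_val are assumed to exist).
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

model = SGDClassifier(loss="log_loss", penalty="l2", alpha=1e-4)   # L2-regularized logistic loss
best_val, best_epoch, patience, waited = np.inf, 0, 5, 0
classes = np.unique(y_tr)

for epoch in range(200):
    model.partial_fit(X_tr, y_tr, classes=classes)            # one pass over the training data
    val = log_loss(y_val, model.predict_proba(X_val))         # validation loss; never used for updates
    if val < best_val:
        best_val, best_epoch, waited = val, epoch, 0
    else:
        waited += 1
        if waited >= patience:                                # no improvement for `patience` epochs
            break                                             # stop; best_epoch marks the point to keep
```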
1.2 Underfitting
Definition
Underfitting occurs when a model is too simple (has insufficient capacity) to capture the true underlying structure or relationships in the data.
Key Characteristics
- High error on both training and validation/test sets
- High bias and low variance
Common Causes
- Model capacity too low for the complexity of the problem (e.g., fitting a linear model to nonlinear data)
- Overly strong regularization
- Insufficient training duration (especially for iterative algorithms)
- Inadequate or poorly engineered features
Remediation Techniques
- Increase model capacity (add layers, neurons, higher-degree terms, etc.; see the sketch after this list)
- Perform feature engineering or use more expressive feature representations
- Decrease regularization strength
- Allow the model to train for more epochs or use better optimization algorithms
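The following sketch illustrates the first two remedies on synthetic data (assumed here for illustration): a plain linear model underfits a quadratic relationship, while adding polynomial features raises capacity enough to capture it.

```python
# Minimal sketch: fixing underfitting by adding polynomial features (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)      # quadratic signal + noise

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("linear R^2:", linear.score(X, y))   # low: too little capacity, high bias
print("poly   R^2:", poly.score(X, y))     # close to 1: capacity now matches the data
```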
1.3 Bias–Variance Trade-off
The generalization error of a model can be decomposed as:
Generalization Error = Bias² + Variance + Irreducible Error
- Bias: Error arising from overly simplistic assumptions in the learning algorithm (dominant in underfitting)
- Variance: Error arising from sensitivity to small fluctuations in the training set (dominant in overfitting)
- Irreducible Error: Inherent noise in the data that cannot be eliminated
The goal is to find the optimal model complexity that minimizes the sum of bias² and variance.
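A rough way to see the decomposition empirically is to retrain the same model on bootstrap resamples of the training data and examine the spread of its predictions at a fixed input. The sketch below does this for an unpruned decision tree on synthetic data; the model, data, and query point are all illustrative assumptions.

```python
# Minimal sketch: empirical bias²/variance estimate at one query point via bootstrap retraining.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)          # known true function: sin(x)
x_query = np.array([[1.0]])                                    # fixed point at which to inspect predictions

preds = []
for _ in range(50):                                            # retrain on 50 bootstrap resamples
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])         # high-capacity, high-variance model
    preds.append(tree.predict(x_query)[0])

preds = np.array(preds)
bias_sq = (preds.mean() - np.sin(1.0)) ** 2                    # squared bias against the true function
variance = preds.var()                                         # sensitivity to the particular training sample
print(f"bias² ≈ {bias_sq:.4f}, variance ≈ {variance:.4f}")
```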
2. Dataset Splitting Strategies
| Split | Purpose | Typical Proportion | Notes |
|---|---|---|---|
| Training set | Parameter learning (weights, coefficients) | 60–98% | Primary data used for gradient-based optimization |
| Validation set | Hyperparameter tuning and early stopping decisions | 10–20% (or 1–2% on very large datasets) | Never used for gradient updates |
| Test set | Final unbiased evaluation of model performance | 10–20% | Seen only once, after all tuning is complete |
Best Practices
- Keep test set completely isolated until final evaluation
- Use stratified splitting for classification to preserve class distribution
- For small datasets: k-fold cross-validation (commonly k=5 or 10; see the sketch after this list)
- For very large datasets: a single train/validation split is often sufficient
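A minimal sketch of these practices with scikit-learn, assuming a feature matrix X and labels y already exist: a stratified hold-out test split, stratified 5-fold cross-validation for tuning, and a single final evaluation on the untouched test set.

```python
# Minimal sketch of the splitting strategy above (X, y are assumed to exist).
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)          # test set stays untouched until the end

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_trainval, y_trainval, cv=cv)
print("CV accuracy:", scores.mean())                           # estimate used for model/hyperparameter selection

final_model = LogisticRegression(max_iter=1000).fit(X_trainval, y_trainval)
print("Test accuracy:", final_model.score(X_test, y_test))     # reported once, after all tuning is done
```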
3. Evaluation Metrics
3.1 Classification Metrics
| Metric | Formula | Interpretation | Best Used When |
|---|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness | Classes are balanced |
| Precision | TP / (TP + FP) | Proportion of positive predictions that are correct | Minimizing false positives is critical |
| Recall (Sensitivity) | TP / (TP + FN) | Proportion of actual positives correctly identified | Minimizing false negatives is critical |
| F1-Score | 2 × (Precision × Recall) / (Precision + Recall) | Harmonic mean of precision and recall | Imbalanced classes, need single metric |
| ROC-AUC | Area under the ROC curve | Ability to discriminate classes across thresholds | Threshold-independent model comparison |
| PR-AUC | Area under Precision-Recall curve | Performance on highly imbalanced datasets | Highly skewed positive/negative ratio |
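All of the metrics in the table can be computed with scikit-learn; the sketch below assumes binary labels y_true, hard predictions y_pred, and scores or probabilities y_score already exist, and uses average precision as the usual summary of the PR curve.

```python
# Minimal sketch: classification metrics from the table (y_true, y_pred, y_score assumed to exist).
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))             # needs scores/probabilities, not hard labels
print("PR-AUC   :", average_precision_score(y_true, y_score))   # average precision summarizes the PR curve
```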
3.2 Regression Metrics
| Metric | Formula | Properties | Typical Use Case |
|---|---|---|---|
| Mean Absolute Error (MAE) | (1/n) Σ \|yᵢ − ŷᵢ\| | Robust to outliers; same units as the target | Reporting typical error size when outliers should not dominate |
| Mean Squared Error (MSE) | (1/n) Σ (yᵢ − ŷᵢ)² | Differentiable, penalizes large errors heavily | Most optimization algorithms minimize MSE |
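Both metrics follow directly from their formulas; a minimal NumPy sketch, assuming equal-length arrays y_true and y_pred:

```python
# Minimal sketch: MAE and MSE computed from the formulas above (y_true, y_pred assumed to exist).
import numpy as np

mae = np.mean(np.abs(y_true - y_pred))    # (1/n) Σ |yᵢ − ŷᵢ|
mse = np.mean((y_true - y_pred) ** 2)     # (1/n) Σ (yᵢ − ŷᵢ)²
print(f"MAE = {mae:.4f}, MSE = {mse:.4f}")
```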
4. Summary of Key Relationships
| Situation | Training Error | Validation/Test Error | Diagnosis | Action |
|---|---|---|---|---|
| Overfitting | Very low | High | High variance | Regularize, simplify, more data |
| Underfitting | High | High | High bias | Increase capacity, better features |
| Good generalization | Low | Low (close to training) | Balanced bias-variance | Monitor and deploy |
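The table can be turned into a simple diagnostic rule of thumb; the sketch below uses an illustrative error tolerance (the 0.05 value is an assumption, not a standard threshold).

```python
# Minimal sketch: rule-of-thumb diagnosis from training vs. validation error (illustrative threshold).
def diagnose(train_err, val_err, tol=0.05):
    if train_err > tol and val_err > tol:
        return "underfitting (high bias): increase capacity, improve features"
    if val_err - train_err > tol:
        return "overfitting (high variance): regularize, simplify, add data"
    return "good generalization: monitor and deploy"

print(diagnose(train_err=0.02, val_err=0.21))   # -> overfitting
print(diagnose(train_err=0.30, val_err=0.32))   # -> underfitting
print(diagnose(train_err=0.04, val_err=0.06))   # -> good generalization
```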