Evaluation Metrics, Overfitting/Underfitting

1. Overfitting and Underfitting

1.1 Overfitting

Definition
Overfitting occurs when a machine learning model learns both the underlying patterns and the random fluctuations (noise) present in the training data. As a result, the model performs extremely well on the training set but exhibits poor generalization performance on unseen data.

Key Characteristics

  • Very low training error
  • Significantly higher validation/test error (large generalization gap)
  • High variance and low bias

Common Causes

  • Model complexity excessively high relative to the amount of training data (e.g., very deep neural networks, high-degree polynomials)
  • Limited or non-representative training data
  • Lack of regularization
  • Training has continued far beyond the point of optimal validation performance

Prevention and Mitigation Techniques

  • L1 (Lasso), L2 (Ridge), or Elastic Net regularization
  • Dropout, DropConnect, and stochastic depth (in neural networks)
  • Early stopping using a validation set (illustrated in the sketch after this list)
  • Data augmentation and acquiring additional training examples
  • Cross-validation for more reliable performance estimation
  • Model pruning, architecture simplification, or reducing capacity
  • Ensemble methods (bagging, random forests) to reduce variance
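
As a minimal illustration of two of these techniques, the sketch below combines L2 regularization with early stopping on a held-out validation fraction. It assumes scikit-learn is available; the synthetic dataset from make_classification is only a stand-in for real training data.

```python
# Sketch: L2 regularization + early stopping with scikit-learn (assumed available).
# make_classification is a synthetic stand-in for real training data.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=50, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SGDClassifier(
    loss="log_loss",          # logistic regression via SGD ("log" in older scikit-learn)
    penalty="l2",             # L2 (ridge) regularization
    alpha=1e-3,               # regularization strength
    early_stopping=True,      # hold out part of the training data as a validation set
    validation_fraction=0.1,  # 10% of the training data drives the stopping decision
    n_iter_no_change=5,       # stop if the validation score stalls for 5 epochs
    max_iter=1000,
    random_state=0,
)
clf.fit(X_train, y_train)

print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", clf.score(X_test, y_test))
```

A large gap between the two reported accuracies would signal that stronger regularization, a simpler model, or more data is still needed.
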
1.2 Underfitting

Definition
Underfitting occurs when a model is too simple (has insufficient capacity) to capture the true underlying structure or relationships in the data.

Key Characteristics

  • High error on both training and validation/test sets
  • High bias and low variance

Common Causes

  • Model capacity too low for the complexity of the problem (e.g., fitting a linear model to nonlinear data)
  • Overly strong regularization
  • Insufficient training duration (especially for iterative algorithms)
  • Inadequate or poorly engineered features

Remediation Techniques

  • Increase model capacity (add layers, neurons, higher-degree terms, etc.; see the sketch after this list)
  • Perform feature engineering or use more expressive feature representations
  • Decrease regularization strength
  • Allow the model to train for more epochs or use better optimization algorithms
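
To make the first two remedies concrete, the sketch below fits a plain linear model and a higher-capacity polynomial pipeline to the same nonlinear data. scikit-learn and NumPy are assumed, and the dataset and degree of 5 are illustrative choices, not recommendations.

```python
# Sketch: fixing underfitting by increasing model capacity (polynomial features).
# Assumes scikit-learn and NumPy; the data and the degree are illustrative choices.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=300)   # nonlinear ground truth + noise

linear = LinearRegression().fit(X, y)                # underfits: a line cannot follow sin(x)
poly = make_pipeline(PolynomialFeatures(degree=5), LinearRegression()).fit(X, y)

print("linear R^2:", linear.score(X, y))   # low even on the training data -> high bias
print("poly   R^2:", poly.score(X, y))     # substantially higher once capacity is added
```
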
1.3 Bias–Variance Trade-off

The generalization error of a model can be decomposed as:
Generalization Error = Bias² + Variance + Irreducible Error

  • Bias: Error arising from overly simplistic assumptions in the learning algorithm (dominant in underfitting)
  • Variance: Error arising from sensitivity to small fluctuations in the training set (dominant in overfitting)
  • Irreducible Error: Inherent noise in the data that cannot be eliminated
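
For squared loss, with y = f(x) + ε, noise variance σ², and f̂ the model fitted on a random training set D, the decomposition can be written out explicitly (a standard result, stated here for reference):

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\left(y - \hat{f}(x; D)\right)^{2}\right]
  = \underbrace{\left(\mathbb{E}_{D}\!\left[\hat{f}(x; D)\right] - f(x)\right)^{2}}_{\text{Bias}^{2}}
  + \underbrace{\mathbb{E}_{D}\!\left[\left(\hat{f}(x; D) - \mathbb{E}_{D}\!\left[\hat{f}(x; D)\right]\right)^{2}\right]}_{\text{Variance}}
  + \underbrace{\sigma^{2}}_{\text{Irreducible error}}
```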

The goal is to find the optimal model complexity that minimizes the sum of bias² and variance.
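
The sketch below traces this trade-off empirically by sweeping polynomial degree and comparing training scores against cross-validated scores. scikit-learn and NumPy are assumed; the dataset and degree range are illustrative.

```python
# Sketch: locating a good model complexity by sweeping polynomial degree.
# Assumes scikit-learn and NumPy; the dataset and degree range are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.2 * rng.normal(size=200)

degrees = [1, 2, 3, 5, 8, 12, 15]
model = make_pipeline(PolynomialFeatures(), LinearRegression())
train_scores, val_scores = validation_curve(
    model, X, y,
    param_name="polynomialfeatures__degree",  # step name assigned by make_pipeline
    param_range=degrees,
    cv=5,
)

for d, tr, va in zip(degrees, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"degree {d:2d}  train R^2 {tr:6.2f}  cv R^2 {va:6.2f}")
# Typical pattern: low degrees score poorly on both (high bias); very high degrees
# push the training score up while the cross-validated score deteriorates (high
# variance). The best degree sits where the cross-validated score peaks.
```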

2. Dataset Splitting Strategies

| Split | Purpose | Typical Proportion | Notes |
|---|---|---|---|
| Training set | Parameter learning (weights, coefficients) | 60–98% | Primary data used for gradient-based optimization |
| Validation set | Hyperparameter tuning and early stopping decisions | 10–20% (or 1–2% on very large datasets) | Never used for gradient updates |
| Test set | Final unbiased evaluation of model performance | 10–20% | Seen only once, after all tuning is complete |

Best Practices

  • Keep test set completely isolated until final evaluation
  • Use stratified splitting for classification to preserve class distribution (see the sketch after this list)
  • For small datasets: k-fold cross-validation (commonly k=5 or 10)
  • For very large datasets: single train/validation split is often sufficient
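
A minimal sketch of these practices, assuming scikit-learn; the 60/20/20 proportions and the synthetic imbalanced dataset are illustrative choices:

```python
# Sketch: stratified train/validation/test split plus stratified k-fold CV.
# Assumes scikit-learn; the 60/20/20 proportions and synthetic data are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# First carve off the test set (20%), preserving the class ratio, then split the
# remainder into training (60% overall) and validation (20% overall).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)  # 0.25 * 0.8 = 0.2

# For small datasets, stratified k-fold CV gives a more reliable estimate.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(cv.split(X_rest, y_rest)):
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} validation samples")
```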

3. Evaluation Metrics

3.1 Classification Metrics

| Metric | Formula | Interpretation | Best Used When |
|---|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness | Classes are balanced |
| Precision | TP / (TP + FP) | Proportion of positive predictions that are correct | Minimizing false positives is critical |
| Recall (Sensitivity) | TP / (TP + FN) | Proportion of actual positives correctly identified | Minimizing false negatives is critical |
| F1-Score | 2 × (Precision × Recall) / (Precision + Recall) | Harmonic mean of precision and recall | Imbalanced classes, need single metric |
| ROC-AUC | Area under the ROC curve | Ability to discriminate classes across thresholds | Threshold-independent model comparison |
| PR-AUC | Area under the Precision–Recall curve | Performance on highly imbalanced datasets | Highly skewed positive/negative ratio |
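
The sketch below computes each of these metrics with scikit-learn on toy labels and scores (the values are chosen purely for illustration); average_precision_score is used as the usual single-number summary of the PR curve.

```python
# Sketch: classification metrics with scikit-learn on toy labels/scores.
# y_true, y_score are illustrative values, not real model output.
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]                        # ground-truth labels
y_score = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.4, 0.8, 0.9, 0.95]   # predicted P(class 1)
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]               # hard labels at threshold 0.5

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))            # threshold-independent
print("PR-AUC   :", average_precision_score(y_true, y_score))  # average precision, a standard PR-curve summary
```
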
3.2 Regression Metrics

| Metric | Formula | Properties | Typical Use Case |
|---|---|---|---|
| Mean Absolute Error (MAE) | (1/n) Σ \|yᵢ − ŷᵢ\| | Robust to outliers; same units as the target | When large errors should not be penalized disproportionately |
| Mean Squared Error (MSE) | (1/n) Σ (yᵢ − ŷᵢ)² | Differentiable, penalizes large errors heavily | Most optimization algorithms minimize MSE |
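
The same formulas can be evaluated directly with NumPy and cross-checked against scikit-learn; the target and prediction values below are toy numbers:

```python
# Sketch: MAE and MSE computed by hand (NumPy) and with scikit-learn, on toy values.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))   # (1/n) Σ |yᵢ − ŷᵢ|
mse = np.mean((y_true - y_pred) ** 2)    # (1/n) Σ (yᵢ − ŷᵢ)²

print("MAE:", mae, "=", mean_absolute_error(y_true, y_pred))
print("MSE:", mse, "=", mean_squared_error(y_true, y_pred))
```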

4. Summary of Key Relationships

| Situation | Training Error | Validation/Test Error | Diagnosis | Action |
|---|---|---|---|---|
| Overfitting | Very low | High | High variance | Regularize, simplify, more data |
| Underfitting | High | High | High bias | Increase capacity, better features |
| Good generalization | Low | Low (close to training) | Balanced bias–variance | Monitor and deploy |
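
As a rough rule of thumb only, this table can be turned into a small diagnostic helper; the error thresholds below are arbitrary illustrative assumptions, not standard values.

```python
# Sketch: rule-of-thumb diagnosis from training/validation error.
# The thresholds are arbitrary illustrative assumptions, not standard values.
def diagnose(train_error: float, val_error: float,
             high_error: float = 0.15, large_gap: float = 0.05) -> str:
    if train_error >= high_error:
        return "underfitting (high bias): increase capacity, improve features"
    if val_error - train_error >= large_gap:
        return "overfitting (high variance): regularize, simplify, gather more data"
    return "good generalization: monitor and deploy"

print(diagnose(train_error=0.02, val_error=0.20))  # -> overfitting
print(diagnose(train_error=0.30, val_error=0.32))  # -> underfitting
print(diagnose(train_error=0.05, val_error=0.07))  # -> good generalization
```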