Virtual Labs

Random Forest

Part 1: Binary Classification (Bank Marketing Dataset)

The objective of this part is to classify bank customers into two categories (subscribed / not subscribed to a term deposit) using a Random Forest classifier, compare its performance against a single Decision Tree, and explore how key hyperparameters affect accuracy and generalisation.

Step 1: Import the required libraries: numpy, pandas, matplotlib, seaborn, and the relevant modules from sklearn (tree, ensemble, model_selection, metrics).

Step 2: Load the Bank Marketing dataset (bank_marketing.csv). The dataset contains approximately $45{,}211$ records from direct phone-based marketing campaigns run by a Portuguese bank. It includes $17$ feature columns describing customer demographics, financial status, and campaign details, along with a binary target variable indicating the subscription outcome (yes/no).

Step 3: Perform Exploratory Data Analysis:

Use head(), info(), describe(), and isnull().sum() to understand the structure and quality of the data.
Check the class distribution of the target variable using value_counts() to identify any class imbalance.
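
The inspection calls above can be sketched as follows. This is a minimal illustration on a small, randomly generated DataFrame, not the real data: the three columns and the target column name "y" are assumptions, and in the lab the DataFrame would come from reading bank_marketing.csv instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded data (the column name "y" is assumed).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 90, size=200),
    "job": rng.choice(["admin.", "technician", "services"], size=200),
    "y": rng.choice(["no", "yes"], size=200, p=[0.88, 0.12]),
})

print(df.head())       # first five rows
df.info()              # column dtypes and non-null counts
print(df.describe())   # summary statistics for the numeric columns

missing = df.isnull().sum()                           # missing values per column
class_share = df["y"].value_counts(normalize=True)    # class proportions
print(missing)
print(class_share)
```

Passing normalize=True to value_counts() reports proportions rather than raw counts, which makes any imbalance between the two classes easier to read at a glance.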

Step 4: Define Features and Target:

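Features (X): all columns other than the target, for example age, job, marital, education, default, housing, loan, contact, month, day_of_week, duration, campaign, pdays, previous, poutcome, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed. The exact set depends on the version of the dataset stored in bank_marketing.csv, so check the loaded column names against this list.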
Target (Y): subscription outcome (yes / no).
Encode categorical features using LabelEncoder (applied one column at a time, since it accepts a single 1-D column) or pd.get_dummies().
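
As a rough sketch of this step, the example below separates features from target and one-hot encodes the categorical columns with pd.get_dummies(). The four-row DataFrame is made up for illustration, it uses only a few of the columns named above, and the target column name "y" is an assumption.

```python
import pandas as pd

# Tiny hypothetical sample; the target column name "y" is an assumption.
df = pd.DataFrame({
    "age": [30, 45, 52, 28],
    "job": ["admin.", "technician", "admin.", "services"],
    "marital": ["married", "single", "married", "single"],
    "duration": [120, 300, 85, 410],
    "y": ["no", "yes", "no", "yes"],
})

X = df.drop(columns="y")                  # feature columns
y = df["y"].map({"no": 0, "yes": 1})      # binary target as 0/1

# One-hot encode the categorical columns; numeric columns pass through unchanged.
X_encoded = pd.get_dummies(X, columns=["job", "marital"])
print(X_encoded.columns.tolist())
```

One-hot encoding avoids implying an order among categories, at the cost of more columns; integer codes from LabelEncoder keep the column count unchanged.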

Step 5: Split the dataset into an $80\%$ Training Set and a $20\%$ Testing Set using train_test_split() with a fixed random_state for reproducibility.
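
A minimal sketch of the split, using synthetic data from make_classification as a stand-in for the encoded bank data (an assumption made so the example runs on its own). The stratify=y argument is an optional addition not required by the step; it keeps the class proportions similar in both sets, which can help when the target is imbalanced.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the encoded features and target.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.88, 0.12], random_state=42
)

# 80% training, 20% testing; random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(f"positive share: train={y_train.mean():.3f}, test={y_test.mean():.3f}")
```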

Step 6: Train a Single Decision Tree as Baseline:

Initialize a DecisionTreeClassifier with max_depth=5 and criterion='gini'.
Fit on the training data and record its test accuracy. This serves as a baseline to compare against the Random Forest.
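
The baseline might look like the sketch below. The data is synthetic (an assumption for illustration), and random_state=42 on the tree is added only to make the run repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded bank data.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Shallow tree with Gini impurity as the split criterion.
tree = DecisionTreeClassifier(max_depth=5, criterion="gini", random_state=42)
tree.fit(X_train, y_train)

tree_acc = accuracy_score(y_test, tree.predict(X_test))
print(f"Decision Tree test accuracy: {tree_acc:.3f}")
```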

Step 7: Train a Random Forest Classifier:

Initialize RandomForestClassifier with n_estimators=100, criterion='gini', and random_state=42.
Fit on the training data and record test accuracy.
Compare the Random Forest accuracy with that of the single Decision Tree from Step 6 to see whether, and by how much, the ensemble improves on the baseline.
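
One way to set up this comparison is sketched below: both models are fitted on the same split and their test accuracies are collected side by side. The data is synthetic (an assumption), so the numbers it prints say nothing about the bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded bank data.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

tree = DecisionTreeClassifier(max_depth=5, criterion="gini", random_state=42)
forest = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=42)

# Fit each model on the same training data and score it on the same test data.
scores = {}
for name, model in [("Decision Tree", tree), ("Random Forest", forest)]:
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```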

Step 8: Hyperparameter Exploration:

Vary n_estimators (e.g., $10, 50, 100, 200, 500$): Train a Random Forest for each value, record the test accuracy, and plot accuracy vs. number of trees to observe diminishing returns.
Vary max_depth (e.g., $3, 5, 10, 20$, and None): Observe the trade-off between underfitting and overfitting.
Vary max_features (e.g., 'sqrt', 'log2', $0.5$, $1.0$): Observe how feature randomness affects performance and tree diversity.
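
The three sweeps can share one loop, as in the sketch below. The helper name sweep is hypothetical, the data is synthetic, and each sweep holds the other settings at n_estimators=100 and random_state=42 (an assumption; the step does not fix them). Plotting is left out to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded bank data.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def sweep(param_name, values):
    """Return {value: test accuracy} for one hyperparameter (hypothetical helper)."""
    results = {}
    for value in values:
        # Defaults first, then the swept parameter overrides its entry if present.
        params = {"n_estimators": 100, "random_state": 42, param_name: value}
        model = RandomForestClassifier(**params).fit(X_train, y_train)
        results[value] = model.score(X_test, y_test)
    return results

n_trees_acc = sweep("n_estimators", [10, 50, 100, 200, 500])
depth_acc = sweep("max_depth", [3, 5, 10, 20, None])
max_feat_acc = sweep("max_features", ["sqrt", "log2", 0.5, 1.0])

for label, results in [("n_estimators", n_trees_acc),
                       ("max_depth", depth_acc),
                       ("max_features", max_feat_acc)]:
    print(label, {k: round(v, 3) for k, v in results.items()})
```

The keys and values of n_trees_acc can be passed directly to a matplotlib line plot to draw accuracy against the number of trees.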

Step 9: OOB Score Analysis:

Train a RandomForestClassifier with oob_score=True and access oob_score_ after fitting.
Compare the OOB score with the test set accuracy to check whether the OOB score provides a reliable estimate of generalisation without a separate validation set.
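
A minimal sketch of the OOB comparison, again on synthetic data (an assumption). Each tree is scored on the training samples left out of its bootstrap sample, so oob_score_ is computed from the training set alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded bank data.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# oob_score=True requires bootstrap sampling, which is the default.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
forest.fit(X_train, y_train)

oob_acc = forest.oob_score_                 # accuracy on out-of-bag samples
test_acc = forest.score(X_test, y_test)     # accuracy on the held-out test set
print(f"OOB accuracy:  {oob_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
```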

Step 10: Feature Importance Visualisation:

Extract feature_importances_ from the trained Random Forest.
Plot a horizontal bar chart to show the relative importance of each feature.
Identify which features contribute most to the classification and discuss how this aids interpretation.
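
The extraction and plot can be sketched as below. The data is synthetic, and the feature names are borrowed from the list in Step 4 purely as labels (an assumption), so the ranking shown has no bearing on the real dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Labels only; the underlying columns are synthetic.
feature_names = ["age", "duration", "campaign", "pdays",
                 "previous", "emp.var.rate", "euribor3m", "nr.employed"]
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances, sorted so the largest bar is drawn at the top.
importances = pd.Series(forest.feature_importances_, index=feature_names).sort_values()

fig, ax = plt.subplots(figsize=(6, 4))
ax.barh(importances.index, importances.values)
ax.set_xlabel("Mean decrease in impurity")
ax.set_title("Random Forest feature importances")
fig.tight_layout()
```

Calling plt.show() afterwards displays the chart. The importances are normalised to sum to 1, so each bar can be read as a share of the total.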

Step 11: Evaluate Model Performance:

Generate predictions on the test set using the best Random Forest configuration.
Compute Accuracy, Precision, Recall, and F1-Score.
Plot a Confusion Matrix heatmap to visualise true positives, true negatives, false positives, and false negatives.
Display the full Classification Report using classification_report().
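
The metrics in this step could be computed as in the sketch below. The data is synthetic and imbalanced, the class names "no" and "yes" are taken from the target description, and the model uses default settings rather than a tuned configuration (all assumptions for illustration). The heatmap is omitted; only the matrix itself is computed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the encoded bank data (1 = "yes").
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
y_pred = forest.fit(X_train, y_train).predict(X_test)

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)   # computed for the positive class
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_test, y_pred)
print(cm)

report = classification_report(y_test, y_pred, target_names=["no", "yes"])
print(report)
```

The array cm can be passed to seaborn's heatmap function with annot=True to produce the annotated plot described above.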

Part 2: Multi-Class Classification (Seeds Dataset)

The objective of this part is to classify wheat seeds into three varieties (Kama, Rosa, and Canadian) based on geometric measurements, using a Random Forest classifier. This part reinforces ensemble concepts in a multi-class setting and includes comparative analysis and hyperparameter tuning.

Step 1: Import the required libraries: numpy, pandas, matplotlib, seaborn, and the relevant modules from sklearn.

Step 2: Load the Seeds dataset (Seeds_Dataset.csv). The dataset contains $210$ wheat seed samples with $7$ numerical features (Area, Perimeter, Compactness, Length of Kernel, Width of Kernel, Asymmetry Coefficient, Length of Kernel Groove) and a class label identifying the seed variety ($1$, $2$, or $3$).