Feedforward Neural Network (MLP)
Procedure
Objective: Explore the structure and training of MLPs on tabular data by training an MLP on the Iris dataset (4 features, 3 classes) and visualising forward and backprop flows and hidden-layer activations for selected samples.
The Iris dataset used contains four input features — Sepal Length, Sepal Width, Petal Length and Petal Width — and three classes: Iris-setosa, Iris-versicolor and Iris-virginica.
This experiment uses an MLP with one input layer, two hidden layers and one output layer, and visualises forward and backward propagation. Different optimisers (RMSprop, SGD and Adam) are compared to determine which yields the best accuracy.
Steps
Import libraries: numpy, pandas, matplotlib for visualisation, and sklearn for data handling, pipelines and evaluation. Use a deep learning framework (e.g., TensorFlow/Keras or PyTorch) for model implementation.
Dataset loading and description:
- Load the Iris dataset (e.g., from Kaggle or sklearn.datasets).
- The dataset shape is (150, 5): four feature columns and one label column.
- Class distribution: Iris-setosa: 50, Iris-versicolor: 50, Iris-virginica: 50 (balanced).
- This dataset has no missing values or duplicate rows; minimal cleaning is required.
- Scale features (e.g., StandardScaler) and encode labels (one-hot encoding).
- Split into train/test sets with 80% training and 20% testing; use stratify to preserve class proportions.
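The loading, scaling, encoding, and stratified-split steps above can be sketched with scikit-learn and NumPy (a minimal sketch using sklearn.datasets.load_iris rather than a downloaded Kaggle CSV):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset: X has shape (150, 4), y has shape (150,)
X, y = load_iris(return_X_y=True)

# Standardise features to zero mean and unit variance
X = StandardScaler().fit_transform(X)

# One-hot encode the three class labels
y_onehot = np.eye(3)[y]

# 80/20 split, stratified on the original labels to preserve class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y_onehot, test_size=0.2, stratify=y, random_state=42
)
```

With stratification, each of the three classes contributes exactly 40 samples to the training set and 10 to the test set.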
Initialise parameters and model building:
- Typical hyperparameters: epochs=100, batch_size=8, learning_rate=0.01, optimiser=RMSprop (compare with SGD and Adam).
- Build the model with one input layer, two hidden dense layers, and one output layer (softmax). Display the model summary and plot the model architecture.
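One way to realise this architecture is with Keras (a sketch only; the hidden-layer widths of 16 and 8 are illustrative assumptions, not values prescribed by the procedure):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(optimizer="rmsprop"):
    # 4 inputs -> two hidden dense layers -> 3-way softmax output
    model = keras.Sequential([
        keras.Input(shape=(4,)),
        layers.Dense(16, activation="relu"),    # hidden layer 1 (width assumed)
        layers.Dense(8, activation="relu"),     # hidden layer 2 (width assumed)
        layers.Dense(3, activation="softmax"),  # one output unit per class
    ])
    model.compile(optimizer=optimizer,
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
model.summary()  # display the layer-by-layer architecture
```

Passing "sgd" or "adam" as the optimizer argument builds the variants to compare; keras.utils.plot_model(model) renders the architecture diagram but additionally requires pydot and Graphviz to be installed.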
Model training:
- The model is trained by minimising a cost function, which measures the difference between the predicted output and the true labels. Since the Iris dataset is a multi-class classification problem, the categorical cross-entropy loss is used:
- J(θ) = −(1/N) Σ_{i=1..N} Σ_{c=1..C} y_{i,c} log(ŷ_{i,c})
- where:
- y_{i,c} is the true label (one-hot encoded),
- ŷ_{i,c} is the predicted probability for class c,
- N is the number of training samples,
- C is the number of classes (here, C = 3),
- θ represents all model parameters (weights and biases).
- The gradients are computed using backpropagation, and the optimisation algorithms (RMSprop, SGD, Adam) update the parameters to minimise this cost function.
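For a single sample, the loss and the standard softmax-plus-cross-entropy gradient (∂J/∂z = ŷ − y at the output pre-activations) can be checked numerically with a framework-independent NumPy sketch; the pre-activation values below are arbitrary example numbers:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

def cross_entropy(y_true, y_pred):
    # y_true is one-hot, so only the true class contributes to the sum
    return -np.sum(y_true * np.log(y_pred))

z = np.array([2.0, 1.0, 0.1])       # example output-layer pre-activations
y_true = np.array([1.0, 0.0, 0.0])  # true class: index 0 (one-hot)

y_pred = softmax(z)
loss = cross_entropy(y_true, y_pred)

# Gradient of the loss w.r.t. the pre-activations z
grad_z = y_pred - y_true
```

The gradient components sum to zero (both ŷ and y sum to one), and the entry for the true class is negative, pushing its pre-activation up during the optimiser's update.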
- Train for 100 epochs with mini-batches of size 8. Reserve 20% of the training data for validation.
- Plot training and validation curves for loss and accuracy versus epochs.
- Optionally, visualise forward and backward flows for selected samples and record gradient norms for analysis.
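The loss and accuracy curves can be plotted directly from the history dictionary that Keras' model.fit(...) returns (a plotting sketch; the history values below are illustrative placeholders, not real training results):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Placeholder history; in practice use `history = model.fit(...).history`
history = {
    "loss":         [1.0, 0.7, 0.5, 0.4],
    "val_loss":     [1.1, 0.8, 0.6, 0.5],
    "accuracy":     [0.4, 0.6, 0.8, 0.9],
    "val_accuracy": [0.3, 0.5, 0.7, 0.85],
}

fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
for key in ("loss", "val_loss"):
    ax_loss.plot(history[key], label=key)
ax_loss.set_xlabel("epoch"); ax_loss.set_ylabel("loss"); ax_loss.legend()
for key in ("accuracy", "val_accuracy"):
    ax_acc.plot(history[key], label=key)
ax_acc.set_xlabel("epoch"); ax_acc.set_ylabel("accuracy"); ax_acc.legend()
fig.savefig("training_curves.png")
```

A widening gap between the training and validation curves in these plots is the usual signal of overfitting.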
Model evaluation:
- Evaluate the model on the test set using accuracy, precision, recall, and F1 score. Visualise results using a confusion matrix.
- Show a classification report (precision, recall, F1-score, support) and compute macro and weighted averages.
- Analyse misclassifications and class-wise performance.
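The evaluation metrics above come directly from sklearn.metrics; the sketch below uses hypothetical predictions over a 30-sample test set (in practice, y_pred would come from np.argmax(model.predict(X_test), axis=1)):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score)

# Hypothetical labels and predictions (0=setosa, 1=versicolor, 2=virginica)
y_true = np.array([0] * 10 + [1] * 10 + [2] * 10)
y_pred = y_true.copy()
y_pred[12] = 2   # one versicolor misclassified as virginica
y_pred[25] = 1   # one virginica misclassified as versicolor

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(acc)
print(cm)
# Per-class precision, recall, F1, support, plus macro/weighted averages
print(classification_report(
    y_true, y_pred, target_names=["setosa", "versicolor", "virginica"]))
```

The off-diagonal entries of the confusion matrix localise the misclassifications; for Iris, these almost always lie between versicolor and virginica, whose petal measurements overlap.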
Gradient checkpoints (backprop flow):
- Record L2 norms of gradients per layer (both hidden layers and output layer) for selected samples and save checkpoints for analysis.
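Recording per-layer gradient norms can be done without any framework by writing out one backward pass by hand; this NumPy sketch uses a randomly initialised 4-16-8-3 network (widths and batch size are assumptions) on a random mini-batch:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Tiny 4 -> 16 -> 8 -> 3 MLP with random weights (widths are assumptions)
W1 = rng.normal(0, 0.5, (4, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, (16, 8)); b2 = np.zeros(8)
W3 = rng.normal(0, 0.5, (8, 3));  b3 = np.zeros(3)

X = rng.normal(size=(8, 4))            # a mini-batch of 8 samples
Y = np.eye(3)[rng.integers(0, 3, 8)]   # random one-hot targets

# Forward pass, keeping the intermediate activations
h1 = relu(X @ W1 + b1)
h2 = relu(h1 @ W2 + b2)
p = softmax(h2 @ W3 + b3)

# Backward pass (softmax + cross-entropy: output delta is p - Y)
d3 = (p - Y) / len(X)
gW3 = h2.T @ d3
d2 = (d3 @ W3.T) * (h2 > 0)  # backprop through ReLU of hidden layer 2
gW2 = h1.T @ d2
d1 = (d2 @ W2.T) * (h1 > 0)  # backprop through ReLU of hidden layer 1
gW1 = X.T @ d1

# Per-layer gradient L2 norms -- the "checkpoints" recorded for analysis
grad_norms = {name: float(np.linalg.norm(g))
              for name, g in [("hidden1", gW1), ("hidden2", gW2),
                              ("output", gW3)]}
print(grad_norms)
```

Saving this dictionary at intervals during training shows how gradient magnitudes shrink as they flow from the output layer back to the first hidden layer.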