k-Nearest Neighbors (KNN)
The objective of this experiment is to classify iris flowers into three categories (Setosa, Versicolor, and Virginica) based on a set of morphological features. The input to the model consists of four independent variables representing sepal and petal measurements, while the output is a categorical dependent variable indicating the species of the flower. In this experiment, the k-Nearest Neighbors (KNN) algorithm is used to perform the classification based on distance similarity, and the effectiveness of the model is evaluated using standard performance metrics.
Step 1: Import the required libraries such as NumPy, Pandas, Matplotlib, Seaborn, and Scikit-learn.
Step 2: Load the Iris dataset using load_iris() from sklearn.datasets. The dataset contains 150 samples, each with four input features (X) representing sepal and petal measurements and one output variable (y) indicating the flower species.
Step 3: Extract the feature matrix X, target labels y, feature names, and target class names from the dataset.
Step 4: Create a Pandas DataFrame using the extracted feature values.
Step 5: Add the class label names as a separate column in the DataFrame to improve interpretability.
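A minimal sketch of Steps 4-5 (the column name "species" is an arbitrary choice):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
# Step 4: build a DataFrame from the raw feature values
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Step 5: map each integer label to its class name for readability
df["species"] = [iris.target_names[i] for i in iris.target]
print(df.head())
```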
Step 6: Perform exploratory data analysis (EDA) using head(), info(), and describe() to understand the structure and summary statistics of the dataset.
Step 7: Check the distribution of classes using value_counts().
Step 8: Check for missing values in the dataset using isnull().sum().
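Steps 6-8 can be carried out with the following calls; the Iris dataset is balanced (50 samples per class) and has no missing values, which this EDA confirms:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = [iris.target_names[i] for i in iris.target]

df.info()                             # column dtypes and non-null counts
print(df.describe())                  # summary statistics per feature
print(df["species"].value_counts())   # class distribution: 50 per class
print(df.isnull().sum())              # missing-value check: all zeros
```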
Step 9: Visualize feature distributions using histograms with kernel density estimation (KDE) for each feature across different classes.
Step 10: Generate a correlation heatmap to analyze relationships among the numerical features.
Step 11: The dataset already provides integer-encoded target labels (0, 1, 2), so no separate label encoding is required; add these encoded values as a new column in the DataFrame.
Step 12: Define the feature set X using the four numerical attributes: sepal length, sepal width, petal length, and petal width.
Step 13: Define the target variable y as the encoded class labels.
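Steps 11-13 reduce to attaching the ready-made integer labels and slicing the DataFrame (the column name "target" is an arbitrary choice):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# load_iris() already supplies integer-encoded labels (0, 1, 2),
# so no LabelEncoder call is needed; just attach them as a column.
df["target"] = iris.target

X = df[iris.feature_names]   # the four numeric attributes
y = df["target"]             # the encoded class labels
```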
Step 14: Split the dataset into training and testing sets using a 70%-30% ratio with stratified sampling and a fixed random state to ensure reproducibility.
Step 15: Apply feature scaling using StandardScaler() by fitting the scaler on the training data and transforming both training and testing datasets to ensure fair distance computation.
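Steps 14-15 can be sketched as follows; random_state=42 is an assumed, arbitrary seed (the text only requires that some fixed seed be used):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
# Stratified 70/30 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.30,
    stratify=iris.target, random_state=42)  # assumed seed value

# Fit the scaler on the training data only, then transform both sets,
# so no information from the test set leaks into the scaling.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Scaling matters for KNN because Euclidean distance is dominated by features with larger numeric ranges unless all features are brought to a comparable scale.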
Step 16: Binarize the test labels using label_binarize() to enable multi-class ROC-AUC computation.
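Step 16 turns each integer label into a one-hot indicator row, which the One-vs-Rest ROC-AUC computation expects. A small sketch with toy labels standing in for the real y_test:

```python
import numpy as np
from sklearn.preprocessing import label_binarize

y_test = np.array([0, 2, 1, 1, 0])  # toy stand-in for the actual test labels
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
print(y_test_bin)  # each row is a one-hot indicator for the true class
```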
Step 17: Initialize the k-Nearest Neighbors classifier with the number of neighbors set to k = 10.
Step 18: Train the KNN model using the scaled training dataset.
Step 19: Predict the class labels using predict() and class probabilities using predict_proba() on the test dataset.
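Steps 17-19 end-to-end; the split and random_state=42 seed repeat the earlier steps (the seed is an assumed value):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.30,
    stratify=iris.target, random_state=42)  # assumed seed
scaler = StandardScaler().fit(X_train)

knn = KNeighborsClassifier(n_neighbors=10)   # Step 17: k = 10
knn.fit(scaler.transform(X_train), y_train)  # Step 18: train on scaled data

# Step 19: hard labels and per-class probabilities on the test set
y_pred = knn.predict(scaler.transform(X_test))
y_proba = knn.predict_proba(scaler.transform(X_test))  # shape (n_test, 3)
```

For KNN, predict_proba returns the fraction of the k neighbors belonging to each class, so each row of y_proba sums to 1.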
Step 20: Evaluate the model performance using Accuracy, Precision (macro average), Recall (macro average), F1-Score (macro average), and ROC-AUC (One-vs-Rest, multi-class).
Step 21: Generate a classification report showing precision, recall, and F1-score for each class.
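Steps 20-21 might be computed as below (the pipeline is repeated so the snippet is self-contained; random_state=42 is an assumed seed). Note that roc_auc_score accepts the integer labels plus per-class probabilities directly when multi_class="ovr" is given:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, classification_report)

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.30,
    stratify=iris.target, random_state=42)  # assumed seed
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=10).fit(
    scaler.transform(X_train), y_train)

y_pred = knn.predict(scaler.transform(X_test))
y_proba = knn.predict_proba(scaler.transform(X_test))

# Step 20: macro-averaged metrics treat all three classes equally
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average="macro")
recall = recall_score(y_test, y_pred, average="macro")
f1 = f1_score(y_test, y_pred, average="macro")
roc_auc = roc_auc_score(y_test, y_proba, multi_class="ovr", average="macro")

# Step 21: per-class precision, recall, and F1
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```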
Step 22: Plot a confusion matrix heatmap to visualize the classification performance of the model.
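Step 22 can be sketched with seaborn's heatmap (the pipeline is repeated for self-containment; random_state=42, the color map, and the plot labels are assumed choices):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.30,
    stratify=iris.target, random_state=42)  # assumed seed
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=10).fit(
    scaler.transform(X_train), y_train)
y_pred = knn.predict(scaler.transform(X_test))

cm = confusion_matrix(y_test, y_pred)  # rows = true class, cols = predicted
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title("KNN confusion matrix (k = 10)")
plt.show()
```

Off-diagonal cells reveal which species are confused with each other; for Iris, any errors usually fall between Versicolor and Virginica, since Setosa is linearly separable from the other two.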