30 Fitting Machine Learning Models
- The package scikit-learn includes a comprehensive set of tools for predictive analysis in a machine learning framework
- The package is built on other packages such as NumPy, SciPy and Matplotlib
- It includes modelling options for both supervised and unsupervised problems
- Methodologies include classification, regression, clustering and dimension reduction
- It also includes several supporting tools such as model selection, preprocessing, feature extraction etc.
30.1 Steps of model fitting
Import specific modules of scikit-learn
Import other relevant libraries: pandas, numpy
Read the data into a pandas DataFrame
Summary information of the dataset
Create training and testing datasets
Exploratory analysis of the training dataset
Build and fit the model
Evaluate the model on the test dataset
Explore the model performance
Use the model to predict on unseen data
30.2 Load Libraries & Read Data
Import scikit-learn
Note: We are importing modules for building a Decision Tree classifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier, plot_tree
Import other libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Set the working directory to the data folder
Read the iris dataset as a Python object DF
DF = pd.read_csv('iris.csv')
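If iris.csv is not at hand, an equivalent DataFrame can be built from the copy of the iris data bundled with scikit-learn. This is a minimal sketch rather than the chapter's own loading step: the column names are assumptions chosen to mirror the summary output in the next section, and the bundled values may differ slightly from those in a given CSV file.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the iris data bundled with scikit-learn (no file needed)
iris = load_iris()

# Assumed column names, chosen to mirror the CSV used in this chapter
DF = pd.DataFrame(
    iris.data,
    columns=['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth'],
)

# Map the integer targets (0, 1, 2) to their species names
DF['Species'] = iris.target_names[iris.target]

print(DF.shape)
print(DF.head())
```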
30.3 Data Summary
Note: The code demonstrates only a selective summary
# Describe the data
DF.describe()
SepalLength SepalWidth PetalLength PetalWidth
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
# Summary by group
DF.groupby('Species').size()
Species
setosa 50
versicolor 50
virginica 50
dtype: int64
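A few further checks are often useful at this stage. The sketch below looks at column types, missing values and per-species means. It assumes a stand-in DF built from scikit-learn's bundled iris data with hypothetical column names, so it runs without iris.csv.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for the DF read from iris.csv (column names are assumptions)
iris = load_iris()
DF = pd.DataFrame(
    iris.data,
    columns=['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth'],
)
DF['Species'] = iris.target_names[iris.target]

# Column types and number of missing values per column
print(DF.dtypes)
missing = DF.isnull().sum()
print(missing)

# Mean of each measurement within each species
group_means = DF.groupby('Species').mean()
print(group_means)
```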
30.4 Train-Test Split
# Train-Test split
train, test = train_test_split(DF, test_size = 0.3, stratify = DF[['Species']], random_state = 1234)
print('Full Data: ', DF.shape)
Full Data: (150, 5)
print('Train: ', train.shape)
Train: (105, 5)
print('Test: ', test.shape)
Test: (45, 5)
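The stratify argument keeps the class proportions the same in both parts of the split. The following sketch checks this by counting species in each part. It assumes a stand-in DF built from scikit-learn's bundled iris data (hypothetical column names) in place of iris.csv.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in for the DF read from iris.csv (column names are assumptions)
iris = load_iris()
DF = pd.DataFrame(
    iris.data,
    columns=['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth'],
)
DF['Species'] = iris.target_names[iris.target]

train, test = train_test_split(
    DF, test_size=0.3, stratify=DF['Species'], random_state=1234
)

# With stratification, each species keeps its share in both parts
train_counts = train['Species'].value_counts()
test_counts = test['Species'].value_counts()
print(train_counts)
print(test_counts)
```

With 50 rows per species and a 30% test share, each species should contribute equally to the training set and to the test set.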
30.5 Cross-validation
Rather than relying on a single train/validation split, we can apply a procedure called cross-validation (CV) to the training data. Evaluating over several splits gives a more reliable estimate of generalisation performance when comparing models or parameters. The procedure is illustrated in the following figure:
[Figure: cross-validation workflow. Image credit: scikit-learn]
Import the ShuffleSplit class, which allows you to define the number of splits and the proportion of data held out for testing in each split, in this case 30%.
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
Once you have defined the splits, you can pass them to cross_val_score, which is described in the section on fitting the classification model with cross-validation.
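To see what the splitter produces, the sketch below iterates over the splits directly. X_demo is a hypothetical stand-in with 105 rows (the size of the training set above); only the number of rows matters here, not the values.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Hypothetical stand-in for the 105 training rows
X_demo = np.arange(105).reshape(-1, 1)

cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)

split_sizes = []
for i, (fit_idx, val_idx) in enumerate(cv.split(X_demo)):
    # Within one split, the two index sets never overlap
    assert len(np.intersect1d(fit_idx, val_idx)) == 0
    split_sizes.append((len(fit_idx), len(val_idx)))
    print(f'Split {i}: {len(fit_idx)} rows to fit, {len(val_idx)} rows to validate')
```

Each split is drawn independently, so, unlike k-fold cross-validation, a given row can land in the held-out part of more than one split.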
30.6 Exploratory Data Analysis
Note: The code demonstrates only a selective exploratory data analysis; it is not comprehensive.
# Scatter plot matrix
sns.pairplot(train, hue = 'Species', palette = 'colorblind')
# Correlation plot
plt.clf()
# Species is non-numeric, so drop it before computing correlations
corrmat = train.drop(['Species'], axis = 1).corr()
sns.heatmap(corrmat, annot = True, square = True);
plt.show()
30.6.1 Build and fit the classification model
Here, we build a Decision Tree classifier
X_train = train.drop(['Species'], axis = 1)
y_train = train[['Species']]
X_test = test.drop(['Species'], axis = 1)
y_test = test[['Species']]
# feature names
fn = X_train.columns
# class names
cn = ['setosa', 'versicolor', 'virginica']
clf_DT = DecisionTreeClassifier(max_depth = 3, random_state = 123)
clf_DT.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=3, random_state=123)
Note: Python allows all of these assignments to be written in a single line
X_train, y_train, X_test, y_test = train.drop(['Species'], axis = 1), train[['Species']], test.drop(['Species'], axis = 1), test[['Species']]
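The feature names and class names defined above are typically passed to plot_tree, which was imported at the start, to draw the fitted tree. The following is a minimal sketch under stated assumptions: it fits on the full iris data bundled with scikit-learn instead of the training split, uses hypothetical feature names, and selects a non-interactive matplotlib backend so it also runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()

# Assumed names, mirroring fn and cn in the chapter
fn = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
cn = ['setosa', 'versicolor', 'virginica']

# For brevity the tree is fitted on the full bundled data, not a training split
clf_DT = DecisionTreeClassifier(max_depth=3, random_state=123)
clf_DT.fit(iris.data, iris.target)

# Draw the tree; each node shows its split rule, sample counts and majority class
fig, ax = plt.subplots(figsize=(10, 6))
nodes = plot_tree(clf_DT, feature_names=fn, class_names=cn, filled=True, ax=ax)
print('Tree depth:', clf_DT.get_depth())
plt.close(fig)  # in an interactive session, call plt.show() instead
```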
30.6.2 Test the model on Test data
Note: test_pred provides the predicted class name for each test row; test_prob provides the predicted probability of each class.
test_pred = clf_DT.predict(X_test)
test_prob = clf_DT.predict_proba(X_test)
print('Test Prediction: ', test_pred.shape)
Test Prediction: (45,)
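The two outputs are related: the predicted class is the one with the highest predicted probability, and the columns of the probability array follow the order of the classifier's classes_ attribute. The sketch below illustrates this. It assumes scikit-learn's bundled iris data in place of iris.csv, so the resulting split and tree may not match the chapter's exactly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: bundled iris features and species names as labels
iris = load_iris()
X = iris.data
y = iris.target_names[iris.target]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1234
)

clf_DT = DecisionTreeClassifier(max_depth=3, random_state=123)
clf_DT.fit(X_train, y_train)

test_pred = clf_DT.predict(X_test)        # one class name per row
test_prob = clf_DT.predict_proba(X_test)  # one probability per class per row

# Columns of test_prob are ordered as in clf_DT.classes_
print(clf_DT.classes_)
print(test_prob.shape)

# Recover the predicted label as the class with the highest probability
from_prob = clf_DT.classes_[test_prob.argmax(axis=1)]
print(np.array_equal(from_prob, test_pred))
```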
30.6.3 Model Performance Metrics
print('Model Accuracy: ', f'{metrics.accuracy_score(y_test, test_pred): 0.4f}')
Model Accuracy: 0.9556
# Precision Recall scores
print("Precision, Recall, Confusion matrix, on test data\n")
Precision, Recall, Confusion matrix, on test data
print(metrics.classification_report(y_test, test_pred, digits = 3))
              precision    recall  f1-score   support

      setosa      1.000     1.000     1.000        15
  versicolor      0.933     0.933     0.933        15
   virginica      0.933     0.933     0.933        15

    accuracy                          0.956        45
   macro avg      0.956     0.956     0.956        45
weighted avg      0.956     0.956     0.956        45
# Confusion matrix
print(metrics.confusion_matrix(y_test, test_pred))
[[15 0 0]
[ 0 14 1]
[ 0 1 14]]
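The figures in the classification report can be recovered by hand from the confusion matrix. In scikit-learn's convention, rows are true classes and columns are predicted classes. The sketch below applies the standard definitions to the matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[15, 0, 0],
               [0, 14, 1],
               [0, 1, 14]])

correct = np.diag(cm)                  # correctly classified rows per class
precision = correct / cm.sum(axis=0)   # correct / all rows predicted as that class
recall = correct / cm.sum(axis=1)      # correct / all rows truly in that class
accuracy = correct.sum() / cm.sum()    # correct / all rows

print('Precision:', np.round(precision, 3))
print('Recall:   ', np.round(recall, 3))
print('Accuracy: ', round(float(accuracy), 4))
```

For versicolor, for example, 14 of the 15 rows predicted as versicolor are correct, which gives the precision of 0.933 shown in the report.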
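plt.clf()
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it
CM = metrics.ConfusionMatrixDisplay.from_estimator(clf_DT, X_test, y_test, normalize = None)
CM.ax_.set_title('Decision Tree Confusion matrix, without normalization');
# plt.show()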
Fit the classification model with cross validation
To employ cross-validation on our previously defined splits, import the cross_val_score function and add the following lines after you have defined the model.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf_DT, X_train, y_train, cv=cv, scoring='accuracy')
This produces one accuracy score per split. Since we configured 5 splits, scores contains 5 values.
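The per-split scores are usually summarised by their mean and spread. The following is a minimal, self-contained sketch: X_demo and y_demo are stand-ins taken from scikit-learn's bundled iris data (the full data, not the chapter's training split), so the scores will not match a run on iris.csv exactly.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): bundled iris features and integer labels
iris = load_iris()
X_demo, y_demo = iris.data, iris.target

cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
clf_DT = DecisionTreeClassifier(max_depth=3, random_state=123)

# cross_val_score fits a fresh copy of the estimator on each split
scores = cross_val_score(clf_DT, X_demo, y_demo, cv=cv, scoring='accuracy')

print(scores)
print(f'Mean accuracy: {scores.mean():0.4f} (std {scores.std():0.4f})')
```

A small standard deviation relative to the mean suggests the accuracy estimate is not overly sensitive to which rows happen to be held out.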