Feature selection in machine learning
Very often we find ourselves with feature vectors that have a large number of components. It is generally understood, and widely accepted, that a large number of features is often a bad idea, so in most cases we should try to reduce the number of features in the dataset. Fewer features can make the relationship between the input and the output easier to understand, and they yield simpler models that require fewer resources to run.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import math
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import os
Dataset
The dataset we will be using is the HAPT data from the UCI repository. It is a Human Activity Recognition dataset consisting of 561 features, 7767 training samples, and 3162 test samples. The feature values are normalized between -1 and 1.
# load the pre-processed HAPT data; the with-statement closes the file automatically
with open("Saved Data/HAPT_Feature_Data.pickle", 'rb') as file:
    x_train, y_train, x_test, y_test = pickle.load(file)
x_train.describe()
tBodyAcc-Mean-1 | tBodyAcc-Mean-2 | tBodyAcc-Mean-3 | tBodyAcc-STD-1 | tBodyAcc-STD-2 | tBodyAcc-STD-3 | tBodyAcc-Mad-1 | tBodyAcc-Mad-2 | tBodyAcc-Mad-3 | tBodyAcc-Max-1 | ... | fBodyGyroJerkMag-MeanFreq-1 | fBodyGyroJerkMag-Skewness-1 | fBodyGyroJerkMag-Kurtosis-1 | tBodyAcc-AngleWRTGravity-1 | tBodyAccJerk-AngleWRTGravity-1 | tBodyGyro-AngleWRTGravity-1 | tBodyGyroJerk-AngleWRTGravity-1 | tXAxisAcc-AngleWRTGravity-1 | tYAxisAcc-AngleWRTGravity-1 | tZAxisAcc-AngleWRTGravity-1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | ... | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 |
mean | 0.038759 | -0.000647 | -0.018155 | -0.599017 | -0.634424 | -0.691270 | -0.623886 | -0.657884 | -0.740154 | -0.360200 | ... | 0.161745 | -0.316548 | -0.625132 | 0.016774 | 0.018471 | 0.009239 | -0.005184 | -0.485936 | 0.050310 | -0.052888 |
std | 0.101996 | 0.099974 | 0.089927 | 0.441481 | 0.367558 | 0.321641 | 0.418113 | 0.348005 | 0.272619 | 0.499259 | ... | 0.237319 | 0.313899 | 0.302581 | 0.331326 | 0.443540 | 0.601208 | 0.477218 | 0.509278 | 0.300866 | 0.276196 |
min | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | ... | -0.958535 | -1.000000 | -1.000000 | -0.976580 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -0.987874 |
25% | 0.032037 | -0.011209 | -0.028448 | -0.992140 | -0.983570 | -0.984661 | -0.992902 | -0.984131 | -0.986661 | -0.795613 | ... | 0.020312 | -0.548129 | -0.843966 | -0.108225 | -0.261002 | -0.470267 | -0.373565 | -0.810953 | -0.047752 | -0.140560 |
50% | 0.038975 | -0.002921 | -0.019602 | -0.914202 | -0.827970 | -0.827696 | -0.924421 | -0.838559 | -0.852735 | -0.717007 | ... | 0.170819 | -0.353980 | -0.710071 | 0.017627 | 0.029079 | 0.001515 | -0.005503 | -0.706619 | 0.176777 | 0.004583 |
75% | 0.044000 | 0.004303 | -0.011676 | -0.246026 | -0.313069 | -0.450478 | -0.294903 | -0.362671 | -0.540521 | 0.054178 | ... | 0.316240 | -0.137462 | -0.503837 | 0.167695 | 0.314876 | 0.496871 | 0.352690 | -0.488765 | 0.246834 | 0.109507 |
max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.945956 | 1.000000 | 1.000000 | 0.960341 | 1.000000 | 1.000000 | ... | 1.000000 | 0.938491 | 0.911653 | 1.000000 | 1.000000 | 0.998702 | 0.991288 | 1.000000 | 0.482229 | 1.000000 |
8 rows × 561 columns
x_test.describe()
tBodyAcc-Mean-1 | tBodyAcc-Mean-2 | tBodyAcc-Mean-3 | tBodyAcc-STD-1 | tBodyAcc-STD-2 | tBodyAcc-STD-3 | tBodyAcc-Mad-1 | tBodyAcc-Mad-2 | tBodyAcc-Mad-3 | tBodyAcc-Max-1 | ... | fBodyGyroJerkMag-MeanFreq-1 | fBodyGyroJerkMag-Skewness-1 | fBodyGyroJerkMag-Kurtosis-1 | tBodyAcc-AngleWRTGravity-1 | tBodyAccJerk-AngleWRTGravity-1 | tBodyGyro-AngleWRTGravity-1 | tBodyGyroJerk-AngleWRTGravity-1 | tXAxisAcc-AngleWRTGravity-1 | tYAxisAcc-AngleWRTGravity-1 | tZAxisAcc-AngleWRTGravity-1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | ... | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 |
mean | 0.040530 | -0.001695 | -0.019453 | -0.609770 | -0.631731 | -0.711263 | -0.636741 | -0.654112 | -0.758061 | -0.358429 | ... | 0.162821 | -0.284730 | -0.595537 | 0.015090 | 0.021202 | 0.043183 | -0.020684 | -0.511371 | 0.067774 | -0.045618 |
std | 0.101559 | 0.102384 | 0.083897 | 0.405103 | 0.360294 | 0.284112 | 0.379665 | 0.342661 | 0.241093 | 0.479339 | ... | 0.221286 | 0.313079 | 0.309507 | 0.329827 | 0.442320 | 0.626639 | 0.501336 | 0.506231 | 0.326118 | 0.239647 |
min | -0.751552 | -0.962639 | -0.814359 | -0.999715 | -0.999611 | -0.999911 | -0.999274 | -0.999319 | -0.999505 | -0.893297 | ... | -1.000000 | -0.997185 | -0.985065 | -1.000000 | -0.993402 | -0.998898 | -0.990616 | -0.983811 | -0.914248 | -1.000000 |
25% | 0.031544 | -0.011462 | -0.028986 | -0.990017 | -0.979300 | -0.981256 | -0.991374 | -0.980940 | -0.983562 | -0.793628 | ... | 0.029500 | -0.522444 | -0.828862 | -0.111039 | -0.252667 | -0.493294 | -0.438279 | -0.828871 | -0.007741 | -0.096185 |
50% | 0.038861 | -0.002700 | -0.019488 | -0.807078 | -0.686914 | -0.738534 | -0.828256 | -0.707638 | -0.778504 | -0.673711 | ... | 0.176792 | -0.322259 | -0.681600 | 0.013685 | 0.032937 | 0.043555 | -0.022624 | -0.728029 | 0.175138 | -0.005486 |
75% | 0.043751 | 0.005224 | -0.011233 | -0.270904 | -0.358639 | -0.487716 | -0.324933 | -0.403433 | -0.578883 | 0.052758 | ... | 0.313719 | -0.095637 | -0.452381 | 0.161560 | 0.316195 | 0.609499 | 0.387438 | -0.528704 | 0.257544 | 0.094051 |
max | 0.976950 | 0.989925 | 0.766017 | 0.465271 | 1.000000 | 0.848957 | 0.439686 | 1.000000 | 0.689027 | 0.909166 | ... | 0.984200 | 1.000000 | 1.000000 | 0.998898 | 0.986347 | 1.000000 | 1.000000 | 0.829446 | 1.000000 | 0.972160 |
8 rows × 561 columns
# converting the dataframes into numpy arrays
y_train = y_train.values
y_test = y_test.values
Univariate Selection
In univariate selection we use different statistical measures to assess the dependence between two variables; here we measure it between each feature and the class labels. The features are then ranked by how strongly they correlate with the labels. The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif
'''
The available scoring functions are
For regression: f_regression, mutual_info_regression
For classification: chi2, f_classif, mutual_info_classif
For the chi2 measure the feature values must be non-negative.
The methods based on the F-test estimate the degree of linear dependency between two random
variables. On the other hand, mutual information methods can capture any kind of statistical
dependency, but they require more samples for accurate estimation.
Using a regression scoring function with a classification problem will give useless
results.
'''
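Since our HAPT features are normalized to [-1, 1], they cannot be fed to chi2 directly. A minimal sketch of the workaround, using synthetic data from make_classification in place of the HAPT arrays: rescale the features to [0, 1] first, then apply chi2.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in data; chi2 requires non-negative feature values
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_scaled = MinMaxScaler().fit_transform(X)  # shift every feature into [0, 1]

X_new = SelectKBest(chi2, k=5).fit_transform(X_scaled, y)
print(X_new.shape)  # (200, 5)
```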
print("Before data shape {}".format(x_train.shape))
k_select = SelectKBest(f_classif, k=200)
k_fit = k_select.fit(x_train, y_train)
k_scores = pd.DataFrame(k_fit.scores_)
k_columns = pd.DataFrame(x_train.columns)
# concatenate the two dataframes
k_feature_scores = pd.concat([k_columns, k_scores], axis=1)
k_feature_scores.columns = ['Features', 'Score']
Before data shape (7767, 561)
# print the 50 best features for the mutual information measure; to get these
# scores, just replace f_classif above with mutual_info_classif
print(k_feature_scores.nlargest(50, 'Score'))
Features Score
9 tBodyAcc-Max-1 1.040566
89 tBodyAccJerk-Max-1 0.989803
381 fBodyAccJerk-BandsEnergyOld-1 0.963095
53 tGravityAcc-Min-2 0.948072
389 fBodyAccJerk-BandsEnergyOld-9 0.947161
50 tGravityAcc-Max-2 0.938974
90 tBodyAccJerk-Max-2 0.925289
229 tBodyAccJerkMag-Max-1 0.923134
271 fBodyAcc-Mad-1 0.921421
92 tBodyAccJerk-Min-1 0.912167
310 fBodyAcc-BandsEnergyOld-9 0.911884
268 fBodyAcc-STD-1 0.909825
393 fBodyAccJerk-BandsEnergyOld-13 0.909533
203 tBodyAccMag-Max-1 0.908645
216 tGravityAccMag-Max-1 0.908645
314 fBodyAcc-BandsEnergyOld-13 0.907935
12 tBodyAcc-Min-1 0.905583
3 tBodyAcc-STD-1 0.904722
281 fBodyAcc-Energy-1 0.904376
302 fBodyAcc-BandsEnergyOld-1 0.900154
265 fBodyAcc-Mean-1 0.896168
16 tBodyAcc-Energy-1 0.895885
83 tBodyAccJerk-STD-1 0.895730
360 fBodyAccJerk-Energy-1 0.895420
93 tBodyAccJerk-Min-2 0.893954
171 tBodyGyroJerk-Max-3 0.892226
96 tBodyAccJerk-Energy-1 0.891629
6 tBodyAcc-Mad-1 0.887885
347 fBodyAccJerk-STD-1 0.886659
94 tBodyAccJerk-Min-3 0.886502
174 tBodyGyroJerk-Min-3 0.884669
86 tBodyAccJerk-Mad-1 0.882345
344 fBodyAccJerk-Mean-1 0.882204
287 fBodyAcc-ropy-1 0.881254
350 fBodyAccJerk-Mad-1 0.879969
230 tBodyAccJerkMag-Min-1 0.876834
233 tBodyAccJerkMag-IQR-1 0.875518
91 tBodyAccJerk-Max-3 0.874960
256 tBodyGyroJerkMag-Min-1 0.874171
255 tBodyGyroJerkMag-Max-1 0.872292
169 tBodyGyroJerk-Max-1 0.871754
274 fBodyAcc-Max-1 0.871643
234 tBodyAccJerkMag-ropy-1 0.866766
228 tBodyAccJerkMag-Mad-1 0.866394
172 tBodyGyroJerk-Min-1 0.865644
353 fBodyAccJerk-Max-1 0.865535
503 fBodyAccMag-STD-1 0.864301
99 tBodyAccJerk-IQR-1 0.863093
226 tBodyAccJerkMag-Mean-1 0.862525
231 tBodyAccJerkMag-SMA-1 0.862525
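Note that fitting SelectKBest only computes the scores; to actually reduce the data we still have to call transform, and the same fitted selector must be applied to both the training and the test set so they keep identical columns. A self-contained sketch, with synthetic make_classification data standing in for the HAPT arrays:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# synthetic stand-ins for x_train / x_test
X_tr, y_tr = make_classification(n_samples=100, n_features=50, random_state=0)
X_te, _ = make_classification(n_samples=30, n_features=50, random_state=1)

k_select = SelectKBest(f_classif, k=10).fit(X_tr, y_tr)
X_tr_red = k_select.transform(X_tr)    # keep the 10 best-scoring columns
X_te_red = k_select.transform(X_te)    # apply the SAME selection to the test set
print(X_tr_red.shape, X_te_red.shape)  # (100, 10) (30, 10)
```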
# print the 50 best features for the f_classif measure
print(k_feature_scores.nlargest(50, 'Score'))
Features Score
366 fBodyAccJerk-ropy-1 14496.239726
40 tGravityAcc-Mean-1 12054.131140
367 fBodyAccJerk-ropy-1.1 11306.825668
49 tGravityAcc-Max-1 11065.679008
52 tGravityAcc-Min-1 11031.900335
56 tGravityAcc-Energy-1 10720.464105
234 tBodyAccJerkMag-ropy-1 9814.528595
523 fBodyAccJerkMag-ropy-1 9682.241808
102 tBodyAccJerk-ropy-1 9523.607692
287 fBodyAcc-ropy-1 9418.547047
9 tBodyAcc-Max-1 8671.414741
104 tBodyAccJerk-ropy-1.2 8157.204363
368 fBodyAccJerk-ropy-1.2 8025.484573
271 fBodyAcc-Mad-1 8024.732276
103 tBodyAccJerk-ropy-1.1 7942.885584
3 tBodyAcc-STD-1 7708.283718
558 tXAxisAcc-AngleWRTGravity-1 7688.959321
280 fBodyAcc-SMA-1 7664.668858
265 fBodyAcc-Mean-1 7271.657144
268 fBodyAcc-STD-1 6981.717609
200 tBodyAccMag-Mean-1 6711.954898
205 tBodyAccMag-SMA-1 6711.954898
213 tGravityAccMag-Mean-1 6711.954898
218 tGravityAccMag-SMA-1 6711.954898
6 tBodyAcc-Mad-1 6653.273981
15 tBodyAcc-SMA-1 6642.539739
288 fBodyAcc-ropy-1.1 6490.443702
510 fBodyAccMag-ropy-1 6275.685439
226 tBodyAccJerkMag-Mean-1 6012.889757
231 tBodyAccJerkMag-SMA-1 6012.889757
95 tBodyAccJerk-SMA-1 5903.222356
184 tBodyGyroJerk-ropy-1.2 5880.204260
228 tBodyAccJerkMag-Mad-1 5800.225498
359 fBodyAccJerk-SMA-1 5784.397398
83 tBodyAccJerk-STD-1 5725.835979
203 tBodyAccMag-Max-1 5704.186208
216 tGravityAccMag-Max-1 5704.186208
86 tBodyAccJerk-Mad-1 5695.291359
289 fBodyAcc-ropy-1.2 5650.141394
344 fBodyAccJerk-Mean-1 5531.241992
233 tBodyAccJerkMag-IQR-1 5407.956917
347 fBodyAccJerk-STD-1 5386.003335
227 tBodyAccJerkMag-STD-1 5385.158618
350 fBodyAccJerk-Mad-1 5383.402360
272 fBodyAcc-Mad-2 5349.873965
502 fBodyAccMag-Mean-1 5288.676934
507 fBodyAccMag-SMA-1 5288.676934
504 fBodyAccMag-Mad-1 5266.209025
266 fBodyAcc-Mean-2 5262.554965
87 tBodyAccJerk-Mad-2 5254.592044
Feature Importance
We can also get the importance of each feature from the feature_importances_ or coef_ attribute of any estimator that exposes one. Features whose coef_ or feature_importances_ values fall below some threshold can be considered unimportant and removed. We can specify the threshold numerically or use a heuristic to find it. Tree-based estimators can be used to compute feature importances through their feature_importances_ attribute.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
xtra_clf = ExtraTreesClassifier(n_estimators=100)
xtra_clf.fit(x_train, y_train)
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=False)
# plot the graph of feature importance of the Extra Tree Classifier
xtra_feature_importance = pd.Series(xtra_clf.feature_importances_, index=x_train.columns)
plt.figure(figsize=(10, 10))
xtra_feature_importance.nlargest(20).plot(kind='barh')
plt.show()
# let's select the features that are important as per the Extra Tree classifier
select_model = SelectFromModel(xtra_clf, prefit=True)
x_train_xtra = select_model.transform(x_train)
print("After xtra shape {}".format(x_train_xtra.shape))
After xtra shape (7767, 158)
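By default SelectFromModel uses the mean importance as the cutoff; the threshold parameter lets us pick a different heuristic or an explicit number. A sketch on synthetic data (make_classification standing in for the HAPT arrays), using the "median" heuristic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=200, n_features=40, random_state=0)
clf = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)

# "median" keeps features whose importance is at least the median importance,
# i.e. the top half of the 40 features; a float such as 0.01 would instead
# set an absolute cutoff on the importance value
sel = SelectFromModel(clf, prefit=True, threshold="median")
X_reduced = sel.transform(X)
print(X_reduced.shape)  # (200, 20)
```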
Correlation Matrix with Heatmap
Correlation gives us a measure of how two quantities are related to each other. What we want to do here is discover the correlated features in the dataset. If two features, say x_1 and x_2, are highly correlated, then we can drop either one of them: we gain no information by using both to build the model, and we may even degrade the model by keeping them together. Correlation can be positive (an increase in x_1 accompanies an increase in x_2) or negative (an increase in x_1 accompanies a decrease in x_2).
A heatmap makes it easy to identify which features are correlated with each other. We will use the Seaborn library to draw the heatmap. Since the original 561 features are far too many to display at once, we will only show the correlation between the first 10 features.
import seaborn as sns
x_train_heat = x_train.drop(columns=x_train.columns[10:])
plt.figure(figsize=(10, 10))
sns.heatmap(x_train_heat.corr(), annot=True, fmt=".2f")
plt.show()
From the heatmap we can see that the features tBodyAcc-Mean-1, tBodyAcc-Mean-2 and tBodyAcc-Mean-3 are largely uncorrelated with each other. The features tBodyAcc-Max-1 and tBodyAcc-STD-1 are highly correlated, with a correlation of 0.97. We can use this to select the uncorrelated features and decrease the dimension of the data.
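One common way to turn this observation into a selection rule is to compute the absolute correlation matrix, look only at its upper triangle (so each pair is examined once), and drop every feature that correlates above some cutoff with an earlier feature. A sketch on a tiny toy frame (the names a, b, c and the 0.95 cutoff are illustrative choices, not from the HAPT data):

```python
import numpy as np
import pandas as pd

# toy frame: b is nearly a copy of a, c is independent noise
rng = np.random.RandomState(0)
a = rng.randn(100)
df = pd.DataFrame({"a": a,
                   "b": a + 0.01 * rng.randn(100),
                   "c": rng.randn(100)})

corr = df.corr().abs()
# keep only the upper triangle (k=1 excludes the diagonal of 1.0s)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop, df_reduced.columns.tolist())  # ['b'] ['a', 'c']
```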