Feature selection in machine learning
Very often we find ourselves with feature vectors that have a large number of components. It is generally understood, and widely accepted, that a large number of features is often a bad idea, so in most cases we should try to reduce the number of features in the dataset. Fewer features can make the relationship between the input and the output easier to understand, and they yield simpler models that require fewer resources to run.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import math
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import os
Dataset
The dataset we will be using is the HAPT data from the UCI repository. It is a Human Activity Recognition dataset consisting of 561 features, 7767 training samples, and 3162 test samples. The feature values are normalized between -1 and 1.
# load the pre-processed HAPT data; the with-statement closes the file automatically
with open("Saved Data/HAPT_Feature_Data.pickle", 'rb') as file:
    x_train, y_train, x_test, y_test = pickle.load(file)
x_train.describe()
tBodyAcc-Mean-1 | tBodyAcc-Mean-2 | tBodyAcc-Mean-3 | tBodyAcc-STD-1 | tBodyAcc-STD-2 | tBodyAcc-STD-3 | tBodyAcc-Mad-1 | tBodyAcc-Mad-2 | tBodyAcc-Mad-3 | tBodyAcc-Max-1 | ... | fBodyGyroJerkMag-MeanFreq-1 | fBodyGyroJerkMag-Skewness-1 | fBodyGyroJerkMag-Kurtosis-1 | tBodyAcc-AngleWRTGravity-1 | tBodyAccJerk-AngleWRTGravity-1 | tBodyGyro-AngleWRTGravity-1 | tBodyGyroJerk-AngleWRTGravity-1 | tXAxisAcc-AngleWRTGravity-1 | tYAxisAcc-AngleWRTGravity-1 | tZAxisAcc-AngleWRTGravity-1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | ... | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 | 7767.000000 |
mean | 0.038759 | -0.000647 | -0.018155 | -0.599017 | -0.634424 | -0.691270 | -0.623886 | -0.657884 | -0.740154 | -0.360200 | ... | 0.161745 | -0.316548 | -0.625132 | 0.016774 | 0.018471 | 0.009239 | -0.005184 | -0.485936 | 0.050310 | -0.052888 |
std | 0.101996 | 0.099974 | 0.089927 | 0.441481 | 0.367558 | 0.321641 | 0.418113 | 0.348005 | 0.272619 | 0.499259 | ... | 0.237319 | 0.313899 | 0.302581 | 0.331326 | 0.443540 | 0.601208 | 0.477218 | 0.509278 | 0.300866 | 0.276196 |
min | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | ... | -0.958535 | -1.000000 | -1.000000 | -0.976580 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -0.987874 |
25% | 0.032037 | -0.011209 | -0.028448 | -0.992140 | -0.983570 | -0.984661 | -0.992902 | -0.984131 | -0.986661 | -0.795613 | ... | 0.020312 | -0.548129 | -0.843966 | -0.108225 | -0.261002 | -0.470267 | -0.373565 | -0.810953 | -0.047752 | -0.140560 |
50% | 0.038975 | -0.002921 | -0.019602 | -0.914202 | -0.827970 | -0.827696 | -0.924421 | -0.838559 | -0.852735 | -0.717007 | ... | 0.170819 | -0.353980 | -0.710071 | 0.017627 | 0.029079 | 0.001515 | -0.005503 | -0.706619 | 0.176777 | 0.004583 |
75% | 0.044000 | 0.004303 | -0.011676 | -0.246026 | -0.313069 | -0.450478 | -0.294903 | -0.362671 | -0.540521 | 0.054178 | ... | 0.316240 | -0.137462 | -0.503837 | 0.167695 | 0.314876 | 0.496871 | 0.352690 | -0.488765 | 0.246834 | 0.109507 |
max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.945956 | 1.000000 | 1.000000 | 0.960341 | 1.000000 | 1.000000 | ... | 1.000000 | 0.938491 | 0.911653 | 1.000000 | 1.000000 | 0.998702 | 0.991288 | 1.000000 | 0.482229 | 1.000000 |
8 rows × 561 columns
x_test.describe()
tBodyAcc-Mean-1 | tBodyAcc-Mean-2 | tBodyAcc-Mean-3 | tBodyAcc-STD-1 | tBodyAcc-STD-2 | tBodyAcc-STD-3 | tBodyAcc-Mad-1 | tBodyAcc-Mad-2 | tBodyAcc-Mad-3 | tBodyAcc-Max-1 | ... | fBodyGyroJerkMag-MeanFreq-1 | fBodyGyroJerkMag-Skewness-1 | fBodyGyroJerkMag-Kurtosis-1 | tBodyAcc-AngleWRTGravity-1 | tBodyAccJerk-AngleWRTGravity-1 | tBodyGyro-AngleWRTGravity-1 | tBodyGyroJerk-AngleWRTGravity-1 | tXAxisAcc-AngleWRTGravity-1 | tYAxisAcc-AngleWRTGravity-1 | tZAxisAcc-AngleWRTGravity-1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | ... | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 | 3162.000000 |
mean | 0.040530 | -0.001695 | -0.019453 | -0.609770 | -0.631731 | -0.711263 | -0.636741 | -0.654112 | -0.758061 | -0.358429 | ... | 0.162821 | -0.284730 | -0.595537 | 0.015090 | 0.021202 | 0.043183 | -0.020684 | -0.511371 | 0.067774 | -0.045618 |
std | 0.101559 | 0.102384 | 0.083897 | 0.405103 | 0.360294 | 0.284112 | 0.379665 | 0.342661 | 0.241093 | 0.479339 | ... | 0.221286 | 0.313079 | 0.309507 | 0.329827 | 0.442320 | 0.626639 | 0.501336 | 0.506231 | 0.326118 | 0.239647 |
min | -0.751552 | -0.962639 | -0.814359 | -0.999715 | -0.999611 | -0.999911 | -0.999274 | -0.999319 | -0.999505 | -0.893297 | ... | -1.000000 | -0.997185 | -0.985065 | -1.000000 | -0.993402 | -0.998898 | -0.990616 | -0.983811 | -0.914248 | -1.000000 |
25% | 0.031544 | -0.011462 | -0.028986 | -0.990017 | -0.979300 | -0.981256 | -0.991374 | -0.980940 | -0.983562 | -0.793628 | ... | 0.029500 | -0.522444 | -0.828862 | -0.111039 | -0.252667 | -0.493294 | -0.438279 | -0.828871 | -0.007741 | -0.096185 |
50% | 0.038861 | -0.002700 | -0.019488 | -0.807078 | -0.686914 | -0.738534 | -0.828256 | -0.707638 | -0.778504 | -0.673711 | ... | 0.176792 | -0.322259 | -0.681600 | 0.013685 | 0.032937 | 0.043555 | -0.022624 | -0.728029 | 0.175138 | -0.005486 |
75% | 0.043751 | 0.005224 | -0.011233 | -0.270904 | -0.358639 | -0.487716 | -0.324933 | -0.403433 | -0.578883 | 0.052758 | ... | 0.313719 | -0.095637 | -0.452381 | 0.161560 | 0.316195 | 0.609499 | 0.387438 | -0.528704 | 0.257544 | 0.094051 |
max | 0.976950 | 0.989925 | 0.766017 | 0.465271 | 1.000000 | 0.848957 | 0.439686 | 1.000000 | 0.689027 | 0.909166 | ... | 0.984200 | 1.000000 | 1.000000 | 0.998898 | 0.986347 | 1.000000 | 1.000000 | 0.829446 | 1.000000 | 0.972160 |
8 rows × 561 columns
# converting the dataframes into numpy arrays
y_train = y_train.values
y_test = y_test.values
Univariate Selection
In univariate selection we use different statistical measures to assess the dependence between two variables; here we measure it between each feature and the class labels. The features are then ranked by how strongly they correlate with the labels. The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif
'''
The available scoring functions are
For regression: f_regression, mutual_info_regression
For classification: chi2, f_classif, mutual_info_classif
For the chi2 measure the feature values must be non-negative.
The methods based on the F-test estimate the degree of linear dependency between two random
variables. On the other hand, mutual information methods can capture any kind of statistical
dependency, but they require more samples for accurate estimation.
Using a regression scoring function with a classification problem will give useless
results.
'''
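Since our HAPT features are normalized to [-1, 1], they cannot be fed to chi2 directly. A minimal sketch of the workaround, using synthetic data from make_classification in place of the HAPT arrays: rescale the features to [0, 1] first, then apply chi2.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in data; chi2 requires non-negative feature values
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_scaled = MinMaxScaler().fit_transform(X)  # shift every feature into [0, 1]

X_new = SelectKBest(chi2, k=5).fit_transform(X_scaled, y)
print(X_new.shape)  # (200, 5)
```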
print("Before data shape {}".format(x_train.shape))
k_select = SelectKBest(f_classif, k=200)
k_fit = k_select.fit(x_train, y_train)
k_scores = pd.DataFrame(k_fit.scores_)
k_columns = pd.DataFrame(x_train.columns)
# concatenate the two dataframes
k_feature_scores = pd.concat([k_columns, k_scores], axis=1)
k_feature_scores.columns = ['Features', 'Score']
Before data shape (7767, 561)
# print the 50 best features for the mutual information measure; to get these
# scores, just replace f_classif above with mutual_info_classif
print(k_feature_scores.nlargest(50, 'Score'))
Features Score
9 tBodyAcc-Max-1 1.040566
89 tBodyAccJerk-Max-1 0.989803
381 fBodyAccJerk-BandsEnergyOld-1 0.963095
53 tGravityAcc-Min-2 0.948072
389 fBodyAccJerk-BandsEnergyOld-9 0.947161
50 tGravityAcc-Max-2 0.938974
90 tBodyAccJerk-Max-2 0.925289
229 tBodyAccJerkMag-Max-1 0.923134
271 fBodyAcc-Mad-1 0.921421
92 tBodyAccJerk-Min-1 0.912167
310 fBodyAcc-BandsEnergyOld-9 0.911884
268 fBodyAcc-STD-1 0.909825
393 fBodyAccJerk-BandsEnergyOld-13 0.909533
203 tBodyAccMag-Max-1 0.908645
216 tGravityAccMag-Max-1 0.908645
314 fBodyAcc-BandsEnergyOld-13 0.907935
12 tBodyAcc-Min-1 0.905583
3 tBodyAcc-STD-1 0.904722
281 fBodyAcc-Energy-1 0.904376
302 fBodyAcc-BandsEnergyOld-1 0.900154
265 fBodyAcc-Mean-1 0.896168
16 tBodyAcc-Energy-1 0.895885
83 tBodyAccJerk-STD-1 0.895730
360 fBodyAccJerk-Energy-1 0.895420
93 tBodyAccJerk-Min-2 0.893954
171 tBodyGyroJerk-Max-3 0.892226
96 tBodyAccJerk-Energy-1 0.891629
6 tBodyAcc-Mad-1 0.887885
347 fBodyAccJerk-STD-1 0.886659
94 tBodyAccJerk-Min-3 0.886502
174 tBodyGyroJerk-Min-3 0.884669
86 tBodyAccJerk-Mad-1 0.882345
344 fBodyAccJerk-Mean-1 0.882204
287 fBodyAcc-ropy-1 0.881254
350 fBodyAccJerk-Mad-1 0.879969
230 tBodyAccJerkMag-Min-1 0.876834
233 tBodyAccJerkMag-IQR-1 0.875518
91 tBodyAccJerk-Max-3 0.874960
256 tBodyGyroJerkMag-Min-1 0.874171
255 tBodyGyroJerkMag-Max-1 0.872292
169 tBodyGyroJerk-Max-1 0.871754
274 fBodyAcc-Max-1 0.871643
234 tBodyAccJerkMag-ropy-1 0.866766
228 tBodyAccJerkMag-Mad-1 0.866394
172 tBodyGyroJerk-Min-1 0.865644
353 fBodyAccJerk-Max-1 0.865535
503 fBodyAccMag-STD-1 0.864301
99 tBodyAccJerk-IQR-1 0.863093
226 tBodyAccJerkMag-Mean-1 0.862525
231 tBodyAccJerkMag-SMA-1 0.862525
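Note that fitting SelectKBest only computes the scores; to actually reduce the data we still have to call transform, and the same fitted selector must be applied to both the training and the test set so they keep identical columns. A self-contained sketch, with synthetic make_classification data standing in for the HAPT arrays:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# synthetic stand-ins for x_train / x_test
X_tr, y_tr = make_classification(n_samples=100, n_features=50, random_state=0)
X_te, _ = make_classification(n_samples=30, n_features=50, random_state=1)

k_select = SelectKBest(f_classif, k=10).fit(X_tr, y_tr)
X_tr_red = k_select.transform(X_tr)    # keep the 10 best-scoring columns
X_te_red = k_select.transform(X_te)    # apply the SAME selection to the test set
print(X_tr_red.shape, X_te_red.shape)  # (100, 10) (30, 10)
```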
# print the 50 best features for the f_classif measure
print(k_feature_scores.nlargest(50, 'Score'))
Features Score
366 fBodyAccJerk-ropy-1 14496.239726
40 tGravityAcc-Mean-1 12054.131140
367 fBodyAccJerk-ropy-1.1 11306.825668
49 tGravityAcc-Max-1 11065.679008
52 tGravityAcc-Min-1 11031.900335
56 tGravityAcc-Energy-1 10720.464105
234 tBodyAccJerkMag-ropy-1 9814.528595
523 fBodyAccJerkMag-ropy-1 9682.241808
102 tBodyAccJerk-ropy-1 9523.607692
287 fBodyAcc-ropy-1 9418.547047
9 tBodyAcc-Max-1 8671.414741
104 tBodyAccJerk-ropy-1.2 8157.204363
368 fBodyAccJerk-ropy-1.2 8025.484573
271 fBodyAcc-Mad-1 8024.732276
103 tBodyAccJerk-ropy-1.1 7942.885584
3 tBodyAcc-STD-1 7708.283718
558 tXAxisAcc-AngleWRTGravity-1 7688.959321
280 fBodyAcc-SMA-1 7664.668858
265 fBodyAcc-Mean-1 7271.657144
268 fBodyAcc-STD-1 6981.717609
200 tBodyAccMag-Mean-1 6711.954898
205 tBodyAccMag-SMA-1 6711.954898
213 tGravityAccMag-Mean-1 6711.954898
218 tGravityAccMag-SMA-1 6711.954898
6 tBodyAcc-Mad-1 6653.273981
15 tBodyAcc-SMA-1 6642.539739
288 fBodyAcc-ropy-1.1 6490.443702
510 fBodyAccMag-ropy-1 6275.685439
226 tBodyAccJerkMag-Mean-1 6012.889757
231 tBodyAccJerkMag-SMA-1 6012.889757
95 tBodyAccJerk-SMA-1 5903.222356
184 tBodyGyroJerk-ropy-1.2 5880.204260
228 tBodyAccJerkMag-Mad-1 5800.225498
359 fBodyAccJerk-SMA-1 5784.397398
83 tBodyAccJerk-STD-1 5725.835979
203 tBodyAccMag-Max-1 5704.186208
216 tGravityAccMag-Max-1 5704.186208
86 tBodyAccJerk-Mad-1 5695.291359
289 fBodyAcc-ropy-1.2 5650.141394
344 fBodyAccJerk-Mean-1 5531.241992
233 tBodyAccJerkMag-IQR-1 5407.956917
347 fBodyAccJerk-STD-1 5386.003335
227 tBodyAccJerkMag-STD-1 5385.158618
350 fBodyAccJerk-Mad-1 5383.402360
272 fBodyAcc-Mad-2 5349.873965
502 fBodyAccMag-Mean-1 5288.676934
507 fBodyAccMag-SMA-1 5288.676934
504 fBodyAccMag-Mad-1 5266.209025
266 fBodyAcc-Mean-2 5262.554965
87 tBodyAccJerk-Mad-2 5254.592044
Feature Importance
We can also get the importance of each feature from the feature_importances_ or coef_ attribute of any estimator that exposes one. Features whose coef_ or feature_importances_ values fall below some threshold can be considered unimportant and removed. We can specify the threshold numerically or use a heuristic to find it. Tree-based estimators can be used to compute feature importances through their feature_importances_ attribute.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
xtra_clf = ExtraTreesClassifier(n_estimators=100)
xtra_clf.fit(x_train, y_train)
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=False)
# plot the graph of feature importance of the Extra Tree Classifier
xtra_feature_importance = pd.Series(xtra_clf.feature_importances_, index=x_train.columns)
plt.figure(figsize=(10, 10))
xtra_feature_importance.nlargest(20).plot(kind='barh')
plt.show()
# let's select the features that are important as per the Extra Tree classifier
select_model = SelectFromModel(xtra_clf, prefit=True)
x_train_xtra = select_model.transform(x_train)
print("After xtra shape {}".format(x_train_xtra.shape))
After xtra shape (7767, 158)
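By default SelectFromModel uses the mean importance as the cutoff; the threshold parameter lets us pick a different heuristic or an explicit number. A sketch on synthetic data (make_classification standing in for the HAPT arrays), using the "median" heuristic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=200, n_features=40, random_state=0)
clf = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)

# "median" keeps features whose importance is at least the median importance,
# i.e. the top half of the 40 features; a float such as 0.01 would instead
# set an absolute cutoff on the importance value
sel = SelectFromModel(clf, prefit=True, threshold="median")
X_reduced = sel.transform(X)
print(X_reduced.shape)  # (200, 20)
```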
Correlation Matrix with Heatmap
Correlation gives us a measure of how two quantities are related to each other. What we want to do here is discover the correlated features in the dataset. If two features, say x_1 and x_2, are highly correlated, then we can drop either one of them: we gain no information by using both to build the model, and we may even degrade the model by keeping them together. Correlation can be positive (an increase in x_1 accompanies an increase in x_2) or negative (an increase in x_1 accompanies a decrease in x_2).
A heatmap makes it easy to identify which features are correlated with each other. We will use the Seaborn library to draw the heatmap. Since the original 561 features are far too many to display at once, we will only show the correlation between the first 10 features.
import seaborn as sns
x_train_heat = x_train.drop(columns=x_train.columns[10:])
plt.figure(figsize=(10, 10))
sns.heatmap(x_train_heat.corr(), annot=True, fmt=".2f")
plt.show()
From the heatmap we can see that the features tBodyAcc-Mean-1, tBodyAcc-Mean-2 and tBodyAcc-Mean-3 are largely uncorrelated with each other. The features tBodyAcc-Max-1 and tBodyAcc-STD-1 are highly correlated, with a correlation of 0.97. We can use this to select the uncorrelated features and decrease the dimension of the data.
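One common way to turn this observation into a selection rule is to compute the absolute correlation matrix, look only at its upper triangle (so each pair is examined once), and drop every feature that correlates above some cutoff with an earlier feature. A sketch on a tiny toy frame (the names a, b, c and the 0.95 cutoff are illustrative choices, not from the HAPT data):

```python
import numpy as np
import pandas as pd

# toy frame: b is nearly a copy of a, c is independent noise
rng = np.random.RandomState(0)
a = rng.randn(100)
df = pd.DataFrame({"a": a,
                   "b": a + 0.01 * rng.randn(100),
                   "c": rng.randn(100)})

corr = df.corr().abs()
# keep only the upper triangle (k=1 excludes the diagonal of 1.0s)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop, df_reduced.columns.tolist())  # ['b'] ['a', 'c']
```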