SD Microdata Analysis - Part III

In Part I of this series, ACS PUMS data was used to generate the training and test sets for the analysis to follow. In Part II, this data was further transformed into a form acceptable as input to the various ML algorithm implementations in Python's sklearn library. In Part III we apply one or more ML algorithms in an attempt to answer the questions posed in Part I (repeated here under Data Questions). Conclusions are under Insights.

Data Questions

  1. Can income levels be reliably predicted using demographic, education-level, and other indicators for individuals in the various communities of San Diego County?
  2. Which factors contribute the most to income predictions?

Data Analysis

Read in the training and test datasets, apply the desired modeling algorithms, then evaluate and iterate as necessary. Note: This data has been pre-processed for missing values and has undergone further transformations to prepare it for input to sklearn.

In [1]:
import sys
import os
import pandas as pd 
import numpy as np 
import pprint

# display related
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import pylab, mlab, gridspec, cm
from IPython.core.pylabtools import figsize, getfigs
from IPython.display import display, HTML
from pylab import *

# ML related
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import Perceptron
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import expon
from sklearn.svm import SVC
from sklearn import metrics

# Misc
import time
import h5py
import pickle

%matplotlib inline
In [2]:
custom_css = "ipy_styles.css"
HTML(open(custom_css, "r").read())
Out[2]:
In [3]:
CWD = os.getcwd()

DATADIR = 'data'

DATAFILE_TRAIN = "afc_pums_skl_train.h5"
DATAFILE_TEST = "afc_pums_skl_test.h5"

DATAFILE_VARS = "afc_pums_skl_vars.pkl"
In [4]:
global train_features, train_labels, test_features, test_labels
global skl_vars

Classification

In answering Q1 we are attempting to classify each sample as belonging to one of two classes - AGI<=50K or AGI>50K - represented in the dataset by 0 and 1 respectively. As a first step in selecting an appropriate classification algorithm, it is useful to know whether the data is linearly separable.

In [5]:
train_df = pd.read_hdf(os.path.join(DATADIR,DATAFILE_TRAIN), 'train_data')
test_df = pd.read_hdf(os.path.join(DATADIR,DATAFILE_TEST), 'test_data')
    
with open(os.path.join(DATADIR,DATAFILE_VARS), 'rb') as pklf:    
    skl_vars = pickle.load(pklf)
Sample Size

Before any further analysis, though, let's take a closer look at the size of our training and test sets. ML algorithms - some of which are employed in tests of linear separability - are computationally intensive, and this proves challenging with larger datasets (100K+ samples), notwithstanding the fact that popular implementations of these algorithms, such as those in sklearn, are highly optimized for performance.

In [6]:
train_sample_count = train_df.shape[0]
test_sample_count = test_df.shape[0]
In [7]:
s = "<b>Sample Size</b>: <b>Train</b> {} <b>Test</b> {}".format(train_df.shape[0],test_df.shape[0])
display(HTML(s))
total = train_sample_count + test_sample_count
s = "<b>Sample Ratio</b> (Train-Test) : {}% - {}%".format(round((train_sample_count/total)*100,2), 
                                                          round((test_sample_count/total)*100,2))
display(HTML(s))
Sample Size: Train 1663117 Test 140852
Sample Ratio (Train-Test) : 92.19% - 7.81%
Memory Size

We took stock of the sample size above but what does it translate to in terms of the amount of RAM needed?

In [11]:
train_mem_size_bytes = (train_df.as_matrix()).nbytes 
test_mem_size_bytes = (test_df.as_matrix()).nbytes 

s = "<b>Size in Memory:</b> <b>Train </b> {} MB <b>Test </b> {} MB".format(round((train_mem_size_bytes/1000000),2),
                                                                           round((test_mem_size_bytes/1000000),2))
display(HTML(s))
Size in Memory: Train 651.94 MB Test 55.21 MB

At 1.6M and 0.14M samples for training and testing, we have a lot of data to work with. But do we need to use all of it? For the reasons mentioned earlier it is reasonable to start with a smaller amount of data, see what results we get, and add more data as needed. We start off with about 10% of the original data for training while keeping the test set unchanged.

But ahead of subsampling our data it would help to take stock of how imbalanced (or not) our data is. This is needed so we can maintain the same ratio of positive-to-negative samples in any random sample we generate from the original dataset.

Imbalance

Imbalanced data, as it concerns classification problems, refers to one or more of the classes being significantly under-represented in the data compared to the others. Data in the real world often suffers from imbalance of this kind (e.g. fraudulent cases in financial transactions). While some ML algorithms inherently deal with imbalanced data, most require resampling techniques to get to a more balanced dataset. Oversampling the minority class is generally preferred to undersampling the majority class, since undersampling discards samples and can distort the naturally occurring distributions of the data. In general, the information loss that accompanies undersampling is best avoided when possible.
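For illustration, here is a minimal sketch of what random oversampling could look like on our training set. This is hypothetical - as discussed below, we do not oversample in this analysis - and simply resamples the minority class with replacement until the classes are level:

# A sketch of random oversampling (illustrative only - not applied in this analysis).
# The minority (positive) class is resampled with replacement until it matches
# the majority class in size; the result is recombined and shuffled.
pos_df = train_df[train_df['LABEL'] == 1]
neg_df = train_df[train_df['LABEL'] == 0]
pos_oversampled = pos_df.sample(n=len(neg_df), replace=True, random_state=2017)
balanced_df = pd.concat([neg_df, pos_oversampled]).sample(frac=1, random_state=2017)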

In [12]:
def class_ratio(samples,sample_name):
    s = "<b>{} Samples</b>: <b>Pos</b> {} <b>Neg</b> {} <b>Ratio</b> (Pos-Neg) {}% - {}%".format(sample_name, 
                                                                                            samples.value_counts()[1],
                                                                                            samples.value_counts()[0],
                                                                                            round(samples.mean()*100,2),
                                                                                            round((1-samples.mean())*100,2))
    display(HTML(s))
    
class_ratio(train_df['LABEL'],"Train")
class_ratio(test_df['LABEL']," Test")
Train Samples: Pos 344687 Neg 1318430 Ratio (Pos-Neg) 20.73% - 79.27%
Test Samples: Pos 31471 Neg 109381 Ratio (Pos-Neg) 22.34% - 77.66%

A balanced dataset is one with a near 50/50 split between the two classes; we have a 20/80 split for both our training and test sets. In real-world classification problems such as fraud detection or cancer diagnosis, the penalties for mis-classifying the minority class - i.e. not flagging a fraudulent transaction, or not diagnosing a cancerous tumor as such - are huge, hence the need to prioritize accuracy for one class over the other. That is, however, not the case with what we are trying to do here: we care equally about classifying both positive and negative samples.

Given that we have a substantial amount of data to work with, that our test set has the same positive-to-negative sample ratio as the training set, and that we care equally about the accuracy of either class, it is reasonable to leave this dataset as is. In effect, we will not be attempting to balance the dataset.
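Note that the simple random draw used below preserves the class ratio only approximately. If we needed to preserve it exactly, one option is a stratified draw - a sketch using sklearn's train_test_split follows; it is not what we actually run:

# A sketch of an exactly-stratified 10% subsample (illustrative only - the
# subsample below is a simple random draw, which is sufficient for our purposes).
from sklearn.model_selection import train_test_split

strat_train_df, _ = train_test_split(train_df, train_size=0.1,
                                     stratify=train_df['LABEL'], random_state=2017)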

In [13]:
train_idx = np.random.choice(train_df.index.values,int(0.1 * train_sample_count),replace=False)
test_idx = np.random.choice(test_df.index.values,int(1 * test_sample_count),replace=False)

train_sample_df = train_df.loc[train_idx,:]
test_sample_df = test_df.loc[test_idx,:]

Below we verify that resampling our data has not changed the ratio of positive-to-negative samples in the training and test sets.

In [14]:
class_ratio(train_sample_df['LABEL'],"Train")
class_ratio(test_sample_df['LABEL']," Test")
Train Samples: Pos 34174 Neg 132137 Ratio (Pos-Neg) 20.55% - 79.45%
Test Samples: Pos 31471 Neg 109381 Ratio (Pos-Neg) 22.34% - 77.66%
In [15]:
features = skl_vars[:-1]
train_features = train_sample_df[features].as_matrix()
train_labels = train_sample_df['LABEL'].values
test_features = test_sample_df[features].as_matrix()
test_labels = test_sample_df['LABEL'].values
Linear Separability

Data belonging to two classes is said to be linearly separable if there is at least one hyperplane separating them. In determining the linear separability of a dataset we are determining whether its features have a linear or a non-linear relationship to the output. There are many techniques for doing this - the perceptron and linear programming are among the popular ones. Below, we use sklearn's Perceptron to fit and predict on the training data. If the data is linearly separable, the perceptron should converge to 0 errors on the training set.

In [16]:
perceptron = Perceptron(max_iter=1000, alpha=0.0001, random_state = 0)
perceptron.fit(train_features, train_labels)
predicted = perceptron.predict(train_features)
In [17]:
def plot_confusionmatrix(cm,title):
    
    f, ax = plt.subplots()
    # render the confusion matrix as a heatmap
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Oranges)

    plt.title(title)
    plt.ylabel('Actual',fontweight='bold')
    plt.xlabel('Predicted',fontweight='bold')

    truth_values = ['Negative','Positive']
    tick_marks = np.arange(len(truth_values))
    plt.xticks(tick_marks, truth_values)
    plt.yticks(tick_marks, truth_values, rotation=90, va='center')

    s = [['TN','FP'], ['FN', 'TP']]
  
    for i in range(2):
        for j in range(2):
            plt.text(j,i, str(s[i][j]) + " = " + str(cm[i][j]), va='center', ha='center')
            
    return ax
In [18]:
cm = confusion_matrix(train_labels, predicted)
 
ax = plot_confusionmatrix(cm, "Perceptron Confusion Matrix")
plt.show()
In [19]:
accuracy = perceptron.score(train_features,train_labels)
s = "<b>Perceptron Accuracy</b> (Train): {0:.2f}%".format(accuracy*100)
display(HTML(s))
Perceptron Accuracy (Train): 79.58%
Which Classifier?

With an accuracy well below 100% - even after multiple passes, with adjustments to the learning rate and max iterations - we can safely conclude that the data is not linearly separable. Armed with this knowledge we can now turn to non-linear classifiers: a kernelized SVM classifier, a Random Forest classifier (an ensemble constructed from many decision tree classifiers), an MLP classifier and, finally, a KNN classifier.

While it is rather easy to decide when to use a non-linear classifier in lieu of a linear one, the choice between the various non-linear classifiers themselves is not as easily made. SVM is often able to outperform KNN and Random Forest but is limited by its inability to scale to larger datasets. MLP requires a large dataset to be useful. KNN, Random Forest and SVM are all non-parametric algorithms, while MLP is parametric, which is to say that it has a fixed number of parameters that does not depend on the size of the training set. There is also the fact that, except for Random Forest, these algorithms do not inherently handle categorical data, requiring non-trivial transformations of the input. One option, and one that is often used, is to apply each of these algorithms to the data, compare their performance and pick the best.

This comparison of classifiers from sklearn can be used to get a sense for how these algorithms perform at a very rudimentary level.

Support Vector Machines (SVM)

Here, we apply a kernelized SVM classifier to the data and take it through the necessary steps - hyper-parameter tuning and model evaluation among others - to determine whether it is the right classifier to use. Note: we are using the default rbf (Radial Basis Function) kernel here since it is the most flexible of the available kernels and has the potential to significantly outperform the others (e.g. poly).
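For reference, the rbf kernel scores the similarity of two samples as a function of the squared Euclidean distance between them, with the gamma parameter (tuned later in this notebook) controlling how quickly that similarity decays:

\begin{equation*} K(x, x') = \exp\left(-\gamma \, \lVert x - x' \rVert^2\right) \end{equation*}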

In [20]:
start_t = time.time()
# create and train the SVM classifier
clf = SVC()
clf.fit(train_features,train_labels)
end_t = time.time()
In [21]:
start_p = time.time()
# predict on the test set
predicted = clf.predict(test_features)
end_p = time.time()
In [22]:
s = "<b>Execution Time</b> (10% samples): <b>Training</b> {}s <b>Prediction</b> {}s".format(round((end_t-start_t),2),
                                                                                            round((end_p-start_p),2))
display(HTML(s))
Execution Time (10% samples): Training 1095.73s Prediction 592.84s
Evaluation Metrics

The confusion matrix and the classification report are two readily available tools within sklearn.metrics for assessing the performance of a model and computing further metrics from it.

In [23]:
cm = confusion_matrix(test_labels, predicted)
 
ax = plot_confusionmatrix(cm,"SVM Confusion Matrix")
plt.show()
In [24]:
# note: classification_report expects (y_true, y_pred)
classification_report = metrics.classification_report(test_labels, predicted, target_names=['AGI<=50K','AGI>50K'])
print(classification_report)
             precision    recall  f1-score   support

   AGI<=50K       0.84      0.94      0.89    109381
    AGI>50K       0.64      0.39      0.49     31471

avg / total       0.80      0.82      0.80    140852

Although precision and recall, and often the f1-score, are the preferred metrics for evaluating classification models - especially those with imbalanced data - we are more concerned with the sensitivity and specificity of the model in this case. Sensitivity is the true positive rate and is the same as recall, while specificity is the true negative rate.
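In terms of the confusion-matrix cells plotted earlier, these are:

\begin{equation*} \text{sensitivity} = \frac{TP}{TP + FN} \qquad \text{specificity} = \frac{TN}{TN + FP} \end{equation*}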

In [25]:
def accuracy_measures(test_labels,predicted):
    cm = confusion_matrix(test_labels, predicted)
    accuracy = metrics.accuracy_score(test_labels,predicted)
    sensitivity = (cm[1][1]/(cm[1][0]+cm[1][1]))  # TP / (FN + TP)
    specificity = (cm[0][0]/(cm[0][0]+cm[0][1]))  # TN / (TN + FP)
    s = "<b>Accuracy</b>: {}% <b>True Positive Rate</b>: {}% <b>True Negative Rate</b>: {}%".format(round(accuracy*100,2),
                                                                                                    round(sensitivity*100,2),
                                                                                                    round(specificity*100,2))
    display(HTML(s))
In [26]:
null_accuracy = max(test_labels.mean(), 1 - test_labels.mean())
s = "<b>Null Accuracy</b>: {}%".format(round(null_accuracy*100,2))
display(HTML(s))

accuracy_measures(test_labels,predicted)
Null Accuracy: 77.66%
Accuracy: 81.58% True Positive Rate: 39.27% True Negative Rate: 93.75%

From the above results we find that our model outperforms a null model - one that always predicts 0 - by only a small margin. This is not a model that performs particularly well.

Drilling down further, we see that accuracy for the positive samples is particularly low. Given how skewed the data is, this is not entirely unexpected. Some approaches to addressing this include balancing the data and adding class weights to the model to give more prominence to the minority class.

Computational Complexity

One approach that is tempting, but prohibitively expensive given the computational complexity of the SVM classifier, is increasing the size of the training set.

\begin{equation*} \text{SVM computational complexity} = O(n_{\text{features}} \times n_{\text{samples}}^2) \end{equation*}

From sklearn's reference guide for SVC:

The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.

At 10% of the original dataset, the existing training set of ~166K samples is already large enough to make SVM considerably slow. Any increase in size will lead to at least a quadratic increase in computation time and is best avoided.
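As a rough back-of-the-envelope check (a sketch assuming fit time scales with the square of the sample count), the training time we measured above suggests a single fit on the full training set would take over a day:

# Rough O(n^2) extrapolation from the measured fit time (a sketch, not a benchmark).
t_10pct = 1095.73                # seconds to fit on ~166K samples (10%)
t_full_est = t_10pct * 10 ** 2   # 10x the samples -> ~100x the fit time
print(round(t_full_est / 3600, 1), "hours")   # roughly 30 hours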

Balanced Data

Let's try balancing the training set so we have a 50/50 split between the two classes, re-run the classification and see what our accuracy measures look like.

In [27]:
# label-based indices for each class (safe to use with .loc below)
train_pos_idx = train_df.index[train_df['LABEL']==1].values
train_neg_idx = train_df.index[train_df['LABEL']==0].values

train_idx_1 = np.random.choice(train_pos_idx,int(0.05 * train_sample_count),replace=False)
train_idx_2 = np.random.choice(train_neg_idx,int(0.05 * train_sample_count),replace=False)

train_idx = np.sort(np.concatenate((train_idx_1,train_idx_2),axis=0))
train_sample_df = train_df.loc[train_idx,:]

train_features = train_sample_df[features].as_matrix()
train_labels = train_sample_df['LABEL'].values

class_ratio(train_sample_df['LABEL'],"Train")
Train Samples: Pos 83155 Neg 83155 Ratio (Pos-Neg) 50.0% - 50.0%
In [28]:
start_t = time.time()
clf = SVC()
clf.fit(train_features,train_labels)
end_t = time.time()
In [29]:
start_p = time.time()
predicted = clf.predict(test_features)
end_p = time.time()
In [30]:
s = "<b>Execution Time</b> (10% Balanced): <b>Training</b> {}s <b>Prediction</b> {}s".format(round((end_t-start_t),2),
                                                                                             round((end_p-start_p),2))
display(HTML(s))
Execution Time (10% Balanced): Training 2084.31s Prediction 688.58s
In [31]:
accuracy_measures(test_labels,predicted)
Accuracy: 71.13% True Positive Rate: 88.19% True Negative Rate: 66.23%

This is a very interesting result and, in a way, validates arguments both for and against balancing datasets. The fact that the true positive rate more than doubled is a clear indication that the imbalance in the training set affected outcomes. At the same time, the drop in the true negative rate - and its contribution to the drop in overall accuracy - indicates that the training set ratios need to be more reflective of the test set data. Note: this latter point is an important one to keep in mind when modeling real-world data.

Class Based Weights

Below, we use the training data with its 20/80 class split as-is, but set the class_weight argument to balanced. From the SVC reference guide:

Set the parameter C of class i to (class_weight[i] x C) for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
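As a quick illustration of what that formula yields (a toy example, not our actual data), a 20/80 split weights the minority class four times as heavily as the majority class:

# The "balanced" heuristic from the quote above, applied to a toy 20/80 label vector.
y_toy = np.array([0] * 80 + [1] * 20)
weights = len(y_toy) / (2 * np.bincount(y_toy))
print(weights)   # [0.625 2.5] -> class 1 is weighted 4x class 0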

In [32]:
train_idx = np.random.choice(train_df.index.values,int(0.1 * train_sample_count),replace=False)
train_sample_df = train_df.loc[train_idx,:]
train_features = train_sample_df[features].as_matrix()
train_labels = train_sample_df['LABEL'].values
In [33]:
start_t = time.time()
# create and train the SVM classifier
clf = SVC(class_weight='balanced')
clf.fit(train_features,train_labels)
end_t = time.time()
In [34]:
start_p = time.time()
# predict on the test set
predicted = clf.predict(test_features)
end_p = time.time()
In [35]:
s = "<b>Execution Time</b> (10% Balanced): <b>Training</b> {}s <b>Prediction</b> {}s".format(round((end_t-start_t),2),
                                                                                             round((end_p-start_p),2))
display(HTML(s))
Execution Time (10% Balanced): Training 2240.3s Prediction 681.43s
In [36]:
accuracy_measures(test_labels,predicted)
Accuracy: 70.77% True Positive Rate: 89.16% True Negative Rate: 65.48%

From the above, it is clear that setting the class_weight parameter to balanced has nearly the same effect on outcomes as balancing the dataset itself. Given these results, what other strategies can we employ to boost the accuracy?

Hyper-parameter Tuning

From the above runs of the SVM classifier, the choice of a non-linear kernel - rbf in our case - makes the most sense. It is less clear whether using a balanced class_weight helps. Furthermore, we don't quite know what values to use for other hyper-parameters such as C and gamma, both of which greatly affect the performance of the classifier. C is the penalty for misclassifications: it trades off the smoothness of the decision boundary against the number of classification errors, and a higher value of C results in a more non-linear fit and, consequently, lower training error. gamma is the kernel coefficient: it controls the influence that each sample exerts on the decision boundary, and a higher value of gamma results in points closer to the decision boundary having a larger influence, leading to a more non-linear fit. In general, higher values of C and gamma result in higher non-linearity in the model fit and can lead to overfitting. More on these parameters here.

By default, sklearn sets both C and gamma to low values. There are many methods for tuning these parameters to find optimal values. sklearn offers GridSearchCV and RandomizedSearchCV, two cross-validated methods for parameter tuning, documented here in great detail. Here is a comparison of the two for a Random Forest classifier. We will be using randomized search here, mainly due to the lower computational cost involved.
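Note that randomized search can also draw hyper-parameter values from continuous distributions rather than a fixed grid. A sketch of what that could look like here, using the expon import from earlier (the search we actually run below sticks to discrete log-scale values):

# Sampling C and gamma from continuous exponential distributions (a sketch -
# the actual search below uses discrete logspace values instead).
param_dist_cont = {'C': expon(scale=10), 'gamma': expon(scale=0.1),
                   'kernel': ['rbf'], 'class_weight': ['balanced', None]}
clf_cont = RandomizedSearchCV(SVC(), param_distributions=param_dist_cont,
                              n_iter=20, scoring='accuracy', n_jobs=-1, cv=3)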

Tuning on Sub-sampled Data

Ideally, all modeling, including that carried out in service of hyper-parameter tuning, would use the entire dataset available for the purpose. But as previously mentioned, the computational complexity of many ML algorithms, and of SVM in particular, is a big deterrent to this. Subsampling is therefore more common in practice than it should be. So what is the effect of subsampling on hyper-parameter tuning? It depends on the data, but in general it does take a toll on the achievable accuracy. Here we use 10% of the original dataset for hyper-parameter tuning.

In [37]:
train_idx = np.random.choice(train_df.index.values,int(0.1 * train_sample_count),replace=False)
train_sample_df = train_df.loc[train_idx,:]
train_features = train_sample_df[features].as_matrix()
train_labels = train_sample_df['LABEL'].values
Parameter Distributions

Typical hyper-parameter ranges sweep through values on a log scale, covering an equal number of values below and above the defaults. With just 3 of these parameters being tuned below, we already have a large number of combinations (6x6x2 = 72) for which the algorithm has to be run. Considering that a single run of the algorithm on 10% of the samples took ~45 minutes, we are looking at a run time close to 60 hours!
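The sweep values for C and gamma come from np.logspace - six values per parameter, spanning five orders of magnitude:

# The log-scale sweep used in the search below.
print(np.logspace(-3, 2, 6))   # -> 0.001, 0.01, 0.1, 1, 10, 100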

Cross Validation

And it does not end there. A typical problem with hyper-parameter tuning - given the number of times that a model is trained and scored on the same data - is overfitting, and cross-validation is used as a way to mitigate this. A more detailed explanation can be found here. Effectively, each combination of parameters is fit and validated k times in k-fold cross-validation. For our case, choosing even a small number of folds such as 3 will result in 216 (72x3) runs. Assuming a ~28-minute execution time per run (each fold fits on only a subset of the data, hence less than the single-run time above), this amounts to roughly 99 hours, or about 4 days, of run time! Yes, there is no such thing as instant gratification in ML.

Parallelism

From the sklearn documentation we learn that sklearn uses Python's joblib library to parallelize operations where possible. Here is an excerpt from the Changelog for sklearn v0.17:

Upgraded to joblib 0.9.3 to benefit from the new automatic batching of short tasks. This makes it possible for scikit-learn to benefit from parallelism when many very short tasks are executed in parallel, for instance by the grid_search.GridSearchCV meta-estimator with n_jobs > 1 used with a large grid of parameters on a small dataset.

And here is one more:

GridSearchCV and RandomizedSearchCV evaluate each parameter setting independently. Computations can be run in parallel if your OS supports it, by using the keyword n_jobs=-1. See function signature for more details.

In [38]:
param_dist = {'C': np.logspace(-3, 2, 6), 'gamma': np.logspace(-3, 2, 6),
              'kernel': ['rbf'], 'class_weight':['balanced', None]}

# with n_iter equal to the total number of parameter combinations (6x6x2 = 72),
# this randomized search is effectively an exhaustive grid search
clf = RandomizedSearchCV(SVC(), param_distributions=param_dist, n_iter=72, scoring='accuracy',
                         n_jobs=-1, cv=3)
start_t = time.time()
clf.fit(train_features,train_labels)
end_t = time.time()
Best Parameters

Once the randomized search completes its run, the best parameters it determined can be accessed as below.

In [39]:
clf.best_params_
Out[39]:
{'C': 10.0,
 'class_weight': None,
 'gamma': 0.10000000000000001,
 'kernel': 'rbf'}
In [40]:
# Ref: http://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html
# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

Below we print the top 5 models in terms of accuracy and the parameter values for each.

In [41]:
report(clf.cv_results_,5)
Model with rank: 1
Mean validation score: 0.832 (std: 0.001)
Parameters: {'kernel': 'rbf', 'gamma': 0.10000000000000001, 'class_weight': None, 'C': 10.0}

Model with rank: 2
Mean validation score: 0.831 (std: 0.001)
Parameters: {'kernel': 'rbf', 'gamma': 0.01, 'class_weight': None, 'C': 100.0}

Model with rank: 3
Mean validation score: 0.830 (std: 0.001)
Parameters: {'kernel': 'rbf', 'gamma': 0.10000000000000001, 'class_weight': None, 'C': 100.0}

Model with rank: 4
Mean validation score: 0.829 (std: 0.000)
Parameters: {'kernel': 'rbf', 'gamma': 1.0, 'class_weight': None, 'C': 1.0}

Model with rank: 5
Mean validation score: 0.829 (std: 0.001)
Parameters: {'kernel': 'rbf', 'gamma': 0.001, 'class_weight': None, 'C': 100.0}

Best Estimator

The clf object uses the best parameter settings by default when called to predict, having been refit on the full training data with those settings once the search completed. Another way of making this explicit is to use the best_estimator_ attribute of clf.

In [42]:
start_p = time.time()
predicted = clf.best_estimator_.predict(test_features)
end_p = time.time()
In [43]:
s = "<b>Execution Time</b> (10%): <b>Training</b> {}s <b>Prediction</b> {}s".format(round((end_t-start_t),2),
                                                                                    round((end_p-start_p),2))
display(HTML(s))
Execution Time (10%): Training 283626.43s Prediction 556.81s

It appears we were able to improve upon the estimated single core execution time of 99h by about 20h using all 4 cores available to us.

In [44]:
accuracy_measures(test_labels,predicted)
Accuracy: 82.14% True Positive Rate: 42.21% True Negative Rate: 93.63%

Bagging

We've clearly not been able to improve the accuracy much over our previous iterations of the model. Two likely reasons for this have been discussed previously and are restated below:

  • Loss resulting from subsampling the original data (we've been using only 10% of the training data so far)
  • Loss due to imbalanced data (positive samples are under-represented)

Bagging, a.k.a. Bootstrap Aggregation, can help counter both. Here we essentially create a meta-estimator by training many base estimators, each on a sub-sample randomly drawn (in our case, without replacement) from the training set. We will be using all of the original training set for this purpose. The final prediction is an aggregate of the predictions of the base estimators.

sklearn provides an easy way to accomplish this through the BaggingClassifier.

In [45]:
# use the entire training set
train_features = train_df[features].as_matrix()
train_labels = train_df['LABEL'].values
In [49]:
from sklearn.ensemble import BaggingClassifier

N_EST = 10
params = clf.best_estimator_.get_params()

est = SVC() 
est.set_params(**params)

# each base estimator trains on a random 10% sample drawn without replacement (bootstrap=False)
clf_b = BaggingClassifier(est, n_estimators=N_EST, max_samples=1/N_EST, max_features=1.0, 
                          bootstrap = False, oob_score=False, n_jobs=-1)

start_t = time.time()
clf_b.fit(train_features,train_labels)
end_t = time.time()
In [50]:
start_p = time.time()
predicted = clf_b.predict(test_features)
end_p = time.time()
In [51]:
s = "<b>Execution Time</b> (100%): <b>Training</b> {}s <b>Prediction</b> {}s".format(round((end_t-start_t),2),
                                                                                    round((end_p-start_p),2))
display(HTML(s))
Execution Time (100%): Training 17636.48s Prediction 2553.48s

Note that prediction takes much longer here than for the other models since it aggregates the predictions of all the base estimators. To be more precise, this is what the sklearn documentation says:

The predicted class of an input sample is computed as the class with the highest mean predicted probability. If base estimators do not implement a predict_proba method, then it resorts to voting.
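Since SVC does not expose predict_proba unless it is constructed with probability=True, the bagging classifier here resorts to voting. A minimal sketch of that aggregation, assuming clf_b has been fitted as above:

# Majority vote across the fitted base estimators - effectively what the
# bagging classifier's voting fallback computes for our binary labels.
votes = np.stack([est.predict(test_features) for est in clf_b.estimators_])
majority_vote = (votes.mean(axis=0) >= 0.5).astype(int)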

In [52]:
accuracy_measures(test_labels,predicted)
Accuracy: 82.16% True Positive Rate: 41.51% True Negative Rate: 93.85%
Kernel Approximation

And lastly, we'll apply a technique known as kernel approximation to our dataset and compare the performance to the previous models. This technique transforms the existing feature space into one in which the data may be more amenable to a linear algorithm. More on this here and here.

In [61]:
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier

gamma = clf.best_estimator_.get_params()['gamma']
feature_mapper = RBFSampler(gamma=gamma, random_state=2017)
start_m = time.time()
train_features_new = feature_mapper.fit_transform(train_features)
end_m = time.time()

clf_sgd = SGDClassifier(tol=None,max_iter=1000)   
start_t = time.time()
clf_sgd.fit(train_features_new,train_labels)
end_t = time.time()
In [62]:
# use transform (not fit_transform) so the test set gets the same feature mapping
test_features_new = feature_mapper.transform(test_features)
start_p = time.time()
predicted = clf_sgd.predict(test_features_new)
end_p = time.time()
In [63]:
accuracy_measures(test_labels,predicted)
Accuracy: 81.12% True Positive Rate: 33.49% True Negative Rate: 94.82%
In [64]:
s = "<b>Execution Time</b> (Kernel Approximation): <b>Training</b> {}s <b>Prediction</b> {}s".format(
    round((end_t-start_t),2),round((end_p-start_p),2))
display(HTML(s))
Execution Time (Kernel Approximation): Training 663.7s Prediction 0.01s
Feature Importance

While the accuracy from many of the models is not as high as one might want, it may still be useful to know which features contributed the most towards it. This is standard practice once a model is trained and can often be accomplished easily through sklearn's API. However, SVM classifiers that use a non-linear kernel, such as rbf in our case, are among the small subset of models for which this is not possible. With a non-linear kernel the feature space is transformed into a new high-dimensional space, and any coefficients associated with features in that space can no longer be mapped back to the original features of the dataset.

Even with the SGD classifier, our input features have been transformed too significantly by RBFSampler for the coefficients (which can be accessed via the coef_ attribute of the classifier) to be interpreted in terms of the original features.
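One model-agnostic workaround, not pursued further here, is permutation importance: shuffle one feature at a time in the test set and measure the resulting drop in accuracy. A minimal sketch (the function name and seed are illustrative), usable with any fitted classifier:

# A sketch of permutation importance: the accuracy drop observed when a single
# feature's values are shuffled is a proxy for that feature's contribution.
def permutation_importance_sketch(model, X, y, seed=2017):
    rng = np.random.RandomState(seed)
    baseline = metrics.accuracy_score(y, model.predict(X))
    drops = []
    for j in range(X.shape[1]):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])   # break column j's link to y
        drops.append(baseline - metrics.accuracy_score(y, model.predict(X_perm)))
    return drops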

Next Steps?

We've now built a number of models and seem to have come up short. What could we possibly do to further improve upon these results? Should we try out classifiers other than SVM, perhaps MLP? Should we revisit feature selection? Maybe throwing out as many features as we did (in Part I of this series) wasn't such a great idea. Should we consider fetching even more than the 1.6M samples we've already trained this model on? Why restrict the training set to PUMS data from California when we can get data for the entire nation? All are valid questions, but the answers depend on whether our lower-than-desired accuracy is a result of high bias or high variance. And this is where learning curves can help.

Learning Curve

An important concept in machine learning is the bias-variance tradeoff. A key objective of effective model building is not only to reduce bias - i.e. arrive at a model that fits the data well - but also to lower variance - i.e. ensure that the model generalizes to a broad spectrum of unseen data. Here is sklearn's primer on learning curves, part of its comprehensive guide on model selection and evaluation.

Let's plot the learning curve for our data/model and see what we learn. More on the learning curve here.

In [73]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import learning_curve  

params = clf.best_estimator_.get_params()

est = SVC() 
est.set_params(**params)

# each cv split trains on up to 10% of the original dataset and validates on a held-out 2%
cv = ShuffleSplit(n_splits=5, test_size=int(0.02 * train_sample_count), train_size=int(0.1 * train_sample_count), 
                  random_state=2017)

start_t = time.time()
train_sizes, train_scores, valid_scores = learning_curve(est, train_features, train_labels, 
                                                         train_sizes=np.linspace(0.2, 1.0, 5), cv=cv, n_jobs=-1)
end_t = time.time()
In [74]:
# code ref: http://docs.w3cub.com/scikit_learn/auto_examples/model_selection/plot_learning_curve/
# function to plot the learning curve
def plot_learning_curve(train_sizes, train_scores, test_scores, title, ylim):
    
    plt.figure()
    plt.title(title)
    
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt
In [75]:
title = "Learning Curve: SVM (rbf)" 
ylim = (0.7, 1.01)

plot_learning_curve(train_sizes, train_scores, valid_scores, title, ylim)
plt.show()

From the above plot we see that both curves start off at a fairly high accuracy (without much separating them) and converge very quickly, at about 130K samples, staying flat even as more data is added. Clearly, adding more data is not going to improve the situation any further - something that was already evident with the Bagging classifier we built previously. Any improvement will therefore need to come from finding a better fit to the existing data. In other words, we need to work on lowering the bias (if at all possible) rather than spend any more time and effort lowering the variance.

Insights

With models based on the SVM classifier we were able to correctly predict the income levels of San Diego County residents, given their demographic information, approximately 82% of the time. This is only slightly better than the null accuracy (~77%) and not particularly satisfying. There remain several unexplored options for improving the accuracy; a few that we might benefit from exploring further are listed below:

  • Applying other classifiers such as KNN, Random Forest, MLP
  • Revisiting feature-selection

That said, the classification accuracy reported for the UCI Adult dataset (which was the inspiration for this effort) was no more than 86% (as detailed in this paper), using an ensemble method (NBTree) that combined Naive Bayes and Decision Tree classifiers.

Additionally, apart from the steps outlined above, it would help to scale the ML workflow to take advantage of computing clusters (AWS among others) to train our future models. Even though our dataset doesn't exactly qualify as "Big Data", the complexity of some of our models is enough to lock up resources on the local machine for days on end, even at this scale.

Additional Analysis

We continue our analysis further in an attempt to improve the prediction accuracy for the data we have. You can find this analysis here.