In Part I of this series the ACS PUMS data was used (after pre-processing steps to handle missing data and labeling) to generate the training and test sets. In Part II we will further wrangle these datasets into a form that is suitable for use by Python's scikit-learn (`sklearn` for short) package. We will also apply the transformations needed to allow specific Machine Learning (ML) algorithms to train on this data for predicting income levels on the test set. The data questions we are attempting to answer through this series are repeated under Data Questions. Any conclusions will be part of the last report in this series.
Read in the training and test datasets and wrangle them into a form suitable for input to `sklearn` algorithms, specifically those related to classification. Note that this data is pre-processed for missing values.
import sys
import os
import pandas as pd
import numpy as np
import pprint
# display related
from IPython.display import display, HTML
# ML related
from sklearn.preprocessing import LabelBinarizer, StandardScaler
# Misc
import time
import h5py
import pickle
CWD = os.getcwd()
DATADIR = 'data'
DATAFILE_TRAIN = "afc_pums_train.csv"
DATAFILE_TEST = "afc_pums_test.csv"
OUTFILE_TRAIN = "afc_pums_skl_train.h5"
OUTFILE_TEST = "afc_pums_skl_test.h5"
OUTFILE_VARS = "afc_pums_skl_vars.pkl"
Read the data files and transform them into the format required by `sklearn`. Note: `sklearn` works with `numpy` arrays, not with `pandas` dataframes. One can, however, use `sklearn-pandas` to apply the necessary transformations to entire dataframes and convert them to a format suitable for input to `sklearn`. Or one could apply transformations to individual fields of the dataframe, concatenate the resulting output, and convert it to `numpy` matrices. We will go with the latter option since it helps break out and explain each step.
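For reference, the `sklearn-pandas` route would look roughly like the sketch below. It is not used in the rest of this notebook, and `DataFrameMapper` requires a separate `pip install sklearn-pandas`; the column names are the ones from our dataset.

from sklearn_pandas import DataFrameMapper
# bundle per-column transformations into one object that emits a numpy array
mapper = DataFrameMapper([
    (['PWGTP', 'AGEP'], StandardScaler()),  # scale continuous fields
    ('CAT', LabelBinarizer()),              # binarize the output label
])
# X = mapper.fit_transform(train_df)  # -> numpy array, once train_df is loaded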
train_df = pd.read_csv(os.path.join(DATADIR,DATAFILE_TRAIN),skipinitialspace=True)
display(train_df.head(n=3))
test_df = pd.read_csv(os.path.join(DATADIR,DATAFILE_TEST),skipinitialspace=True)
display(test_df.head(n=3))
Below we see that there are no missing values and the data types are all numeric. However, we know that the majority of the variables in this dataset are categorical (with numeric rather than character codes). We need to transform these categorical variables to the appropriate format before using them for classification.
train_df.info()
`sklearn` routines expect all data to be in numerical form, and this applies to output labels as well. Below we convert the output labels under `CAT` to their numerical equivalents. The `sklearn` reference guide offers handy examples on how to do this.
lb = LabelBinarizer()
lb.fit(['AGI<=50K','AGI>50K'])
train_df['LABEL'] = lb.transform(train_df['CAT']).ravel()  # ravel: transform returns an (n, 1) array
train_df.head(n=3)
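As a quick sanity check, the fitted binarizer exposes the classes it learned and an inverse mapping back to the original string labels:

print(lb.classes_)  # ['AGI<=50K' 'AGI>50K'], so LABEL 1 corresponds to 'AGI>50K'
print(lb.inverse_transform(train_df[['LABEL']].values[:3]))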
Categorical variables require special handling prior to their use in modeling algorithms. The treatment varies depending on whether a variable is ordinal or nominal. Ordinal variables merely need to be converted into numerical levels; in our case, since all of the variables are already expressed as numerical values, we do not need to carry out any additional transformations. Nominal variables, however, need to be converted to a one-hot encoding, in effect replacing each nominal variable with many dummy variables.
Below is the list of categorical variables in our dataset by type:

- Ordinal: `SCHL`
- Nominal: `COW`, `MAR`, `RELP`, `SEX`, `DIS`, `POBP`, `RAC1P`
pd.options.display.float_format = '{:20,.0f}'.format
categorical_vars = ['COW', 'MAR', 'RELP', 'SCHL', 'SEX', 'DIS', 'POBP', 'RAC1P']
train_df[categorical_vars].describe().loc[['min','max']]
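The ranges above already hint at the problem; counting the distinct levels directly makes the cardinality differences explicit:

train_df[categorical_vars].nunique()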
We see, from above, that nearly all nominal variables can be transformed to their one-hot encodings relatively easily given their small ranges, with the exception of `POBP`. Given the 1.6M samples we have, introducing one-hot encodings for high-cardinality categoricals like `POBP` is going to be very challenging from a memory standpoint. This article provides a good overview of techniques to handle such high-cardinality nominals. We will use the supervised ratio technique, the simplest and most intuitive of the lot.
# generate dummy variables for low-cardinality nominals
df_cow = pd.get_dummies(train_df['COW'])
df_cow.columns = ["COW_{}".format(str(int(s))) for s in df_cow.columns]
df_cow.head(n=3)
df_mar = pd.get_dummies(train_df['MAR'])
df_mar.columns = ["MAR_{}".format(str(int(s))) for s in df_mar.columns]
df_relp = pd.get_dummies(train_df['RELP'])
df_relp.columns = ["RELP_{}".format(str(int(s))) for s in df_relp.columns]
df_sex = pd.get_dummies(train_df['SEX'])
df_sex.columns = ["SEX_{}".format(str(int(s))) for s in df_sex.columns]
df_dis = pd.get_dummies(train_df['DIS'])
df_dis.columns = ["DIS_{}".format(str(int(s))) for s in df_dis.columns]
df_race = pd.get_dummies(train_df['RAC1P'])
df_race.columns = ["RACE_{}".format(str(int(s))) for s in df_race.columns]
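As an aside, the per-column blocks above could be produced in a single call. The sketch below assumes the codes are integer-valued, which is why `astype(int)` reproduces the `COW_1`-style column names, with `RAC1P` renamed to `RACE` as above:

# one-shot encoding of all low-cardinality nominals
low_card = ['COW', 'MAR', 'RELP', 'SEX', 'DIS', 'RAC1P']
dummies = pd.get_dummies(train_df[low_card].astype(int), columns=low_card,
                         prefix=['COW', 'MAR', 'RELP', 'SEX', 'DIS', 'RACE'])
dummies.head(n=3)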
In the supervised ratio technique applied below, we convert each sample of the high-cardinality categorical (`POBP`) into a continuous value representing its contribution to positive labels (i.e., labels whose value is 1) in the entire dataset.
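Concretely, for each level x of `POBP`:

    SR(x) = P(x) / (P(x) + N(x))

where P(x) and N(x) are the counts of positive and negative training samples with `POBP == x`.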
source = train_df['POBP']
target = train_df['LABEL']

# cache of already-computed ratios, keyed by POBP code
dict_sr = dict()

def supervised_ratio(x):
    if x in dict_sr:
        sr = dict_sr[x]
    else:
        # select the labels of all samples whose `source` value matches `x`
        tt = target[source == x]
        # count how many of those samples are positive and negative
        pi = (tt == 1).sum()
        ni = (tt == 0).sum()
        # positive-to-total ratio for this code
        sr = pi / (pi + ni)
        dict_sr[x] = sr
    return float(sr)
start = time.time()
pob_ratios = train_df['POBP'].apply(supervised_ratio)
end = time.time()
print("supervised_ratio over the training set took {:.1f}s".format(end - start))
df_pob = pob_ratios.to_frame()
pd.options.display.float_format = '{:20,.4f}'.format
df_pob.head(n=10).T
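The row-wise `apply` above is easy to follow but slow at this scale. The same ratios can be computed in a single vectorized pass; this is a sketch assuming `LABEL` holds 0/1 values, as constructed above:

# per-code mean of the 0/1 label equals the positive fraction, i.e. the
# supervised ratio; `map` then broadcasts it back to every sample
ratios_by_code = train_df.groupby('POBP')['LABEL'].mean()
pob_ratios_fast = train_df['POBP'].map(ratios_by_code)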
# concatenate all the dataframes generated so far, along with data for quantitative variables from the
# original training data
sk_train_df = pd.concat([train_df[['PWGTP','AGEP']],df_cow,df_mar,df_relp,df_sex,df_dis,df_pob,df_race,train_df[['LABEL']]],
axis=1)
Now that we have most of our data in the right format for input to `sklearn`, there is one last thing left to do: scale the features. In this case we just need to scale the continuous variables. The one-hot encoded variables and any ordinals should not be scaled.
scaler = StandardScaler()
sk_train_df[['PWGTP','AGEP']] = scaler.fit_transform(sk_train_df[['PWGTP','AGEP']])
pd.options.display.float_format = '{:20,.4f}'.format
sk_train_df.head(n=3)
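A quick check confirms that the scaled columns now have (approximately) zero mean and unit standard deviation:

sk_train_df[['PWGTP','AGEP']].describe().loc[['mean','std']]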
All the transformations applied to the training data must also be applied to the test data.
pd.options.display.float_format = '{:20,.4f}'.format
test_df['LABEL'] = lb.transform(test_df['CAT']).ravel()  # same binarizer fitted on the training labels
# NOTE: this assumes every category level in the training set also appears
# in the test set; otherwise the dummy columns would not align
df_cow = pd.get_dummies(test_df['COW'])
df_cow.columns = ["COW_{}".format(str(int(s))) for s in df_cow.columns]
df_mar = pd.get_dummies(test_df['MAR'])
df_mar.columns = ["MAR_{}".format(str(int(s))) for s in df_mar.columns]
df_relp = pd.get_dummies(test_df['RELP'])
df_relp.columns = ["RELP_{}".format(str(int(s))) for s in df_relp.columns]
df_sex = pd.get_dummies(test_df['SEX'])
df_sex.columns = ["SEX_{}".format(str(int(s))) for s in df_sex.columns]
df_dis = pd.get_dummies(test_df['DIS'])
df_dis.columns = ["DIS_{}".format(str(int(s))) for s in df_dis.columns]
df_race = pd.get_dummies(test_df['RAC1P'])
df_race.columns = ["RACE_{}".format(str(int(s))) for s in df_race.columns]
# point `supervised_ratio` at the test data; codes already seen during
# training hit the cache in `dict_sr` and so reuse the training-set ratios,
# while previously unseen POBP codes are computed from the test labels
source = test_df['POBP']
target = test_df['LABEL']
pob_ratios = test_df['POBP'].apply(supervised_ratio)
df_pob = pob_ratios.to_frame()
sk_test_df = pd.concat([test_df[['PWGTP','AGEP']],df_cow,df_mar,df_relp,df_sex,df_dis,df_pob,df_race,test_df[['LABEL']]],
axis=1)
sk_test_df[['PWGTP','AGEP']] = scaler.transform(sk_test_df[['PWGTP','AGEP']])  # transform only: reuse the scaling fitted on the training set
sk_test_df.head(n=3)
While we could save the dataframes to CSV, it is more efficient to store them as HDF5 binary files. We also save the column headers as a pickle file for later use.
var_names = sk_train_df.columns.tolist()  # avoid shadowing the builtin `vars`
# bzip2 results in the smallest size on disk but may result in more read/writes to disk
sk_train_df.to_hdf(OUTFILE_TRAIN, 'train_data', mode='w', complevel=9, complib='bzip2')
sk_test_df.to_hdf(OUTFILE_TEST, 'test_data', mode='w', complevel=9, complib='bzip2')
with open(OUTFILE_VARS, 'wb') as pklf:
    pickle.dump(var_names, pklf)
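For the next part of the series, the saved artifacts can be read back as follows; this is a minimal sketch using the same file and key names as above:

# round-trip: reload the transformed datasets and the column list
train_back = pd.read_hdf(OUTFILE_TRAIN, 'train_data')
test_back = pd.read_hdf(OUTFILE_TEST, 'test_data')
with open(OUTFILE_VARS, 'rb') as pklf:
    var_names = pickle.load(pklf)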
Below is a summary of the operations carried out in this analysis:

- Converted the output labels under `CAT` to numerical form using `LabelBinarizer`
- One-hot encoded the low-cardinality nominal variables using `pd.get_dummies`
- Encoded the high-cardinality `POBP` variable using the `supervised_ratio` technique
- Scaled the continuous variables (`PWGTP`, `AGEP`) using `StandardScaler`
- Saved the transformed datasets as HDF5 files and the column names as a pickle file

For additional analysis and insights please refer to Part III of this series located here.