SD Microdata Analysis - Part II

In Part I of this series the ACS PUMS data was used (after pre-processing steps to handle missing data and labeling) to generate the training and test sets. In Part II we will further wrangle these datasets into a form that is suitable for use by Python's scikit-learn (sklearn for short) package. We will also be applying transformations needed to allow specific Machine Learning (ML) algorithms to train on this data for predicting income-levels on the test set. The data questions we are attempting to answer through this series is repeated under Data Questions. Any conclusions will be part of the last report in this series.

Data Questions

  1. Can incomes levels be reliably predicted using demographic, eduction level, among other indicators for individuals in various communities of the San Diego county?
  2. Which factors contribute the most to income predictions?

Data Analysis

Read in the training and test datasets, wrangle the datasets into a form suitable for input to sklearn algorithms, specifically those related to classification. Note that this data is pre-processed for missing values.

In [2]:
import sys
import os
import pandas as pd 
import numpy as np 
import pprint

# display related
from IPython.display import display, HTML

# ML related
from sklearn.preprocessing import LabelBinarizer, StandardScaler

# Misc
import time
import h5py
import pickle
In [3]:
CWD = os.getcwd()

DATADIR = 'data'

DATAFILE_TRAIN = "afc_pums_train.csv"
DATAFILE_TEST = "afc_pums_test.csv"

OUTFILE_TRAIN = "afc_pums_skl_train.h5"
OUTFILE_TEST = "afc_pums_skl_test.h5"

OUTFILE_VARS = "afc_pums_skl_vars.pkl"

Explore the Data

Read the data files and tranform them into the right format as required by sklearn. Note: sklearn works with numpy arrays, not with pandas dataframes. One can however use sklearn-pandas to use apply the necessary transformations to entire dataframes and convert it to a format suitable for input to sklearn. Or one could choose to apply transformations to individual fields of the dataframe, concatenate the resulting output and convert it to numpy matrices. We will go with the latter option since it helps break-out and explain each step.

In [4]:
train_df = pd.read_csv(os.path.join(DATADIR,DATAFILE_TRAIN),skipinitialspace=True)
display(train_df.head(n=3))
test_df = pd.read_csv(os.path.join(DATADIR,DATAFILE_TEST),skipinitialspace=True)
display(test_df.head(n=3))
PUMA ST PWGTP AGEP COW MAR RELP SCHL SEX DIS POBP RAC1P CAT
0 8508 6 21 63 1.0 1 0 19.0 1 2 6 1 AGI>50K
1 8508 6 22 60 1.0 1 1 19.0 2 2 6 1 AGI<=50K
2 8105 6 14 79 9.0 2 0 16.0 2 1 6 1 AGI<=50K
PUMA ST PWGTP AGEP COW MAR RELP SCHL SEX DIS POBP RAC1P CAT
0 7311 6 12 93 9.0 2 0 18.0 1 1 40 1 AGI<=50K
1 7311 6 20 36 1.0 5 2 19.0 1 2 6 9 AGI<=50K
2 7313 6 3 28 1.0 3 17 17.0 1 2 6 8 AGI<=50K

Below we see that there are no missing values and the data-types are all numeric. However, we know that the majority of the variables in this dataset are categorical (with numeric instead of character codes). We need to transform these categorical variables to the appropriate format before we use it for classification.

In [5]:
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1663117 entries, 0 to 1663116
Data columns (total 13 columns):
PUMA     1663117 non-null int64
ST       1663117 non-null int64
PWGTP    1663117 non-null int64
AGEP     1663117 non-null int64
COW      1663117 non-null float64
MAR      1663117 non-null int64
RELP     1663117 non-null int64
SCHL     1663117 non-null float64
SEX      1663117 non-null int64
DIS      1663117 non-null int64
POBP     1663117 non-null int64
RAC1P    1663117 non-null int64
CAT      1663117 non-null object
dtypes: float64(2), int64(10), object(1)
memory usage: 165.0+ MB

Handling Categorical Output

`sklearn` routines expect all data to be in numerical form and this applies to output labels as well. Below we convert the output labels under CAT to their numerical equivalents. The sklearn reference guide offers handy examples on how to do this.

In [6]:
lb = LabelBinarizer()
lb.fit(['AGI<=50K','AGI>50K'])
train_df['LABEL'] = lb.transform(train_df['CAT'])
train_df.head(n=3)
Out[6]:
PUMA ST PWGTP AGEP COW MAR RELP SCHL SEX DIS POBP RAC1P CAT LABEL
0 8508 6 21 63 1.0 1 0 19.0 1 2 6 1 AGI>50K 1
1 8508 6 22 60 1.0 1 1 19.0 2 2 6 1 AGI<=50K 0
2 8105 6 14 79 9.0 2 0 16.0 2 1 6 1 AGI<=50K 0

Handling Categorical Features

Categorial variables require special handling prior to their use in modeling algorithms. The treatment varies depending on whether it is an ordinal or a nominal variable. Ordinal variables merely need to be converted into numerical levels. In this case, since all of our variables are expressed as numerical values we do not need to carry out any additional transformations. However, nominal variables need to be converted to a one-hot encoding format, in effect, requiring many dummy variables to replace each nominal variable.

Below is the list of categorical variables in our dataset by type:

  • ordinal: SCHL
  • nominal: COW, MAR, RELP, SEX, DIS, POBP, RAC1P
In [7]:
pd.options.display.float_format = '{:20,.0f}'.format

categorical_vars = ['COW', 'MAR', 'RELP', 'SCHL', 'SEX', 'DIS', 'POBP', 'RAC1P']
train_df[categorical_vars].describe().loc[['min','max']]
Out[7]:
COW MAR RELP SCHL SEX DIS POBP RAC1P
min 1 1 0 1 1 1 1 1
max 9 5 17 24 2 2 554 9

We see, from above, that nearly all nominal variables can be relatively easily tranformed to their one-hot-encodings given their smaller ranges except for POBP. Given the 1.6M samples we have, introducing one-hot encodings for high-cardinality categoricals like POBP is going to be very challenging from a memory standpoint. This article provides a good overview on techniques to handle such high-cardinality nominals. We will use the supervised ratio technique, the simplest and most intuitive of the lot.

In [8]:
# generate dummy variables for low-cardinality nominals
df_cow = pd.get_dummies(train_df['COW'])
df_cow.columns = ["COW_{}".format(str(int(s))) for s in df_cow.columns]
df_cow.head(n=3)
Out[8]:
COW_1 COW_2 COW_3 COW_4 COW_5 COW_6 COW_7 COW_8 COW_9
0 1 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 1
In [9]:
df_mar = pd.get_dummies(train_df['MAR'])
df_mar.columns = ["MAR_{}".format(str(int(s))) for s in df_mar.columns]
df_relp = pd.get_dummies(train_df['RELP'])
df_relp.columns = ["RELP_{}".format(str(int(s))) for s in df_relp.columns]
df_sex = pd.get_dummies(train_df['SEX'])
df_sex.columns = ["SEX_{}".format(str(int(s))) for s in df_sex.columns]
df_dis = pd.get_dummies(train_df['DIS'])
df_dis.columns = ["DIS_{}".format(str(int(s))) for s in df_dis.columns]
df_race = pd.get_dummies(train_df['RAC1P'])
df_race.columns = ["RACE_{}".format(str(int(s))) for s in df_race.columns]
High Cardinality Categoricals

In the supervised ratio technique applied below we convert each sample of the high-cardinality categorical (POBP) into a continous value representing its contribution to positive labels (i.e.: labels whose value is 1) in the entire dataset.

In [10]:
source = train_df['POBP']
target = train_df['LABEL']

global dict_sr
dict_sr = dict()
def supervised_ratio(x):
    
    if x in dict_sr:
        sr = dict_sr[x]
    else:
        # find all samples whose value for the `source` field is the 
        # same as that for sample `x`
        idx = np.where(source==x)
        tt = target.loc[idx]
        # find out how many of those samples are positive and negative        
        pi = (tt==1).astype(int).sum()
        ni = (tt==0).astype(int).sum()
        # find the positive-to-negative ratio
        sr = (pi/(pi+ni))
        
        dict_sr[x] = sr
        
    return float(sr)
In [11]:
start = time.time()
pob_ratios = train_df['POBP'].apply(supervised_ratio)
end = time.time()
df_pob = pob_ratios.to_frame()
In [12]:
pd.options.display.float_format = '{:20,.4f}'.format
df_pob.head(n=10).T
Out[12]:
0 1 2 3 4 5 6 7 8 9
POBP 0.1696 0.1696 0.1696 0.1696 0.1696 0.3702 0.1696 0.3272 0.2331 0.4062
In [13]:
# concatenate all the dataframes generated so far, along with data for quantitative variables from the 
# original training data
sk_train_df = pd.concat([train_df[['PWGTP','AGEP']],df_cow,df_mar,df_relp,df_sex,df_dis,df_pob,df_race,train_df[['LABEL']]],
                        axis=1)

Scaling

Now that we have most of our data in the right format for input to sklearn there's one last thing left to do - scale the features. In this case we just need to scale the continuous variables. The one-hot encoded variables and any ordinals should not be scaled.

In [14]:
scaler = StandardScaler()
sk_train_df[['PWGTP','AGEP']] = scaler.fit_transform(sk_train_df[['PWGTP','AGEP']])
In [15]:
pd.options.display.float_format = '{:20,.4f}'.format
sk_train_df.head(n=3)
Out[15]:
PWGTP AGEP COW_1 COW_2 COW_3 COW_4 COW_5 COW_6 COW_7 COW_8 ... RACE_1 RACE_2 RACE_3 RACE_4 RACE_5 RACE_6 RACE_7 RACE_8 RACE_9 LABEL
0 0.0353 1.0032 1 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 1
1 0.1039 0.8692 1 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
2 -0.4450 1.7176 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0

3 rows × 49 columns

Apply Transformations to Test Data

All the transformations applied to the training data must also be applied to the test data.

In [16]:
pd.options.display.float_format = '{:20,.4f}'.format

test_df['LABEL'] = lb.transform(test_df['CAT'])

df_cow = pd.get_dummies(test_df['COW'])
df_cow.columns = ["COW_{}".format(str(int(s))) for s in df_cow.columns]
df_mar = pd.get_dummies(test_df['MAR'])
df_mar.columns = ["MAR_{}".format(str(int(s))) for s in df_mar.columns]
df_relp = pd.get_dummies(test_df['RELP'])
df_relp.columns = ["RELP_{}".format(str(int(s))) for s in df_relp.columns]
df_sex = pd.get_dummies(test_df['SEX'])
df_sex.columns = ["SEX_{}".format(str(int(s))) for s in df_sex.columns]
df_dis = pd.get_dummies(test_df['DIS'])
df_dis.columns = ["DIS_{}".format(str(int(s))) for s in df_dis.columns]
df_race = pd.get_dummies(test_df['RAC1P'])
df_race.columns = ["RACE_{}".format(str(int(s))) for s in df_race.columns]

source = test_df['POBP']
target = test_df['LABEL']

pob_ratios = test_df['POBP'].apply(supervised_ratio)
df_pob = pob_ratios.to_frame()

sk_test_df = pd.concat([test_df[['PWGTP','AGEP']],df_cow,df_mar,df_relp,df_sex,df_dis,df_pob,df_race,test_df[['LABEL']]],
                        axis=1)
sk_test_df[['PWGTP','AGEP']] = scaler.fit_transform(sk_test_df[['PWGTP','AGEP']])
sk_test_df.head(n=3)
Out[16]:
PWGTP AGEP COW_1 COW_2 COW_3 COW_4 COW_5 COW_6 COW_7 COW_8 ... RACE_1 RACE_2 RACE_3 RACE_4 RACE_5 RACE_6 RACE_7 RACE_8 RACE_9 LABEL
0 -0.6112 2.3540 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
1 -0.1314 -0.2042 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
2 -1.1509 -0.5633 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0

3 rows × 49 columns

Storing the Data

While we could save the dataframes to CSV, it is more efficient to convert store them as HDF5 binary files. We also save the column headers as a pickle file for later use.

In [17]:
vars = sk_train_df.columns.tolist()

# bzip2 results in the smallest size on disk but may result in more read/writes to disk
sk_train_df.to_hdf(OUTFILE_TRAIN, 'train_data', mode='w', complevel=9, complib='bzip2')
sk_test_df.to_hdf(OUTFILE_TEST, 'test_data', mode='w',complevel=9, complib='bzip2')

with open(OUTFILE_VARS, 'wb') as pklf:
    pickle.dump(vars, pklf)

Outro

Below is a summary of the operations carried in this analysis:

  • Applying the following transformation on the training and test sets:
    • Convert the label from string to binary format
    • Convert low-cardinality categorical features to numericals by applying one-hot encoding
    • Convert high-cardinality categorical features to continuous variables using the supervised_ratio technique
    • Scale the numerical features using a standard scaler
  • Storing the train and test sets as HDF5 files for further analysis

Additional Analysis

For additional analysis and insights please refer to part III of this series located here.