This notebook is part of a series that investigates microdata - i.e., survey response data at the individual level, as opposed to responses aggregated by region - for populations in San Diego County. Here we look at the potential for predicting income levels based on demographic, education-level, and other information specified as part of the microdata for different zip codes in the region. The data questions answered by this series are listed under Data Questions. The conclusions are part of the last report in this series.
The American Community Survey (ACS) from the US Census Bureau makes microdata samples available for communities across the US. The microdata samples, referred to as PUMS (Public Use Microdata Samples), are available in 1-yr and 5-yr formats. This analysis uses the 2016 5-yr estimates since they have higher accuracy and are available for all communities without restrictions on population size. Note that PUMS data is captured with restricted geographic information (owing to privacy protections) and is most commonly identified using PUMAs (Public Use Microdata Areas). More on PUMAs here. For this analysis a tool from the Missouri Census Data Center was used to map PUMAs to zip codes.
This tutorial and handbook are good starting points to familiarize oneself with PUMS data. An overview of technical details for PUMS data is here. Additional technical details can be found here. Guidance specific to the 2012-2016 5-yr PUMS data can be found here.
The 2016 5-yr PUMS data in CSV format was downloaded using American FactFinder (AFF). Note that this file contains samples for all of California since it is not possible to filter PUMS files by geography. The PUMA-to-zipcode crosswalk file is also California specific.
import sys
import os
import pandas as pd
import numpy as np
import pprint
# display related
from IPython.display import display, HTML
CWD = os.getcwd()
DATADIR = 'data'                               # directory containing the input data files
DATAFILE_PUMS = "afc_pums_5yr_2016.csv"        # 2016 5-yr PUMS person records for California
DATAFILE_XWALK = "puma_zipcode_xwalk.csv"      # PUMA-to-zipcode crosswalk (California)
DATAFILE_CODES = "afc_pums_field_codes.csv"    # selected field codes and descriptive names
OUTFILE_TRAIN = "afc_pums_train.csv"           # output: training set
OUTFILE_TEST = "afc_pums_test.csv"             # output: test set (San Diego County)
Since the data file is fairly large (~1.6GB), we begin by exploring a small subset of it.
df = pd.read_csv(os.path.join(DATADIR,DATAFILE_PUMS),skipinitialspace=True,nrows=10)
df.head()
%pprint
df.columns.values.tolist()
The above fields are listed out (along with descriptions and valid values) in the 2016 5-yr PUMS Data Dictionary. We will only be looking at a subset of these fields - those related to demographic, education level among others - for this analysis.
From the above we see that this dataset comes with a lot of features, 283 to be precise. Many, like the replicate weights (PWGTP{1-80}), carry little of the core information specific to each sample and instead capture statistical measures associated with the sample for use in deriving population estimates. More about them here. Given our task here - predicting the income level - and the size of this dataset (at ~1.6M samples it is more than adequate for standard ML modeling), we have no need for the replicate weights.
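As an aside, here is a minimal sketch of how one might identify and exclude the replicate-weight columns programmatically (assuming df still holds the full set of columns from the exploratory read above; this is not part of the original workflow, which instead uses a curated field-codes file):
import re
# Replicate weights are named PWGTP1 through PWGTP80; the base person weight PWGTP
# (no numeric suffix) is not matched by this pattern and is left alone.
replicate_cols = [c for c in df.columns if re.fullmatch(r'PWGTP\d+', c)]
core_cols = [c for c in df.columns if c not in replicate_cols]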
Among the remaining ~200 features there is a one-to-one mapping from each core variable to a data-quality flag - these are called person record allocation flags (FXXXXP). These flags indicate whether the value of a variable was allocated (imputed) or actually observed (i.e., received as part of the survey response). For the purpose of our analysis we do not care how a value was arrived at, as long as the methodology used is well designed and reasonable.
Discounting the flags, we have 128 features to work with. Of these, 11 are race related (RACXXX). Although more detailed race data would likely benefit the accuracy of our models given existing trends (Ref: Racial Wage Gap in the US), we've decided to use a single high-level variable (RAC1P) to encode racial information and forgo the others.
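In the same hedged spirit, one way to drop the detailed race recodes while keeping RAC1P (assuming, per the data dictionary, that the detailed race variables all share the RAC prefix):
# Keep the high-level race recode RAC1P; drop the other detailed RACxxx variables.
detailed_race_cols = [c for c in df.columns if c.startswith('RAC') and c != 'RAC1P']
kept_cols = [c for c in df.columns if c not in detailed_race_cols]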
From among the remaining features we chose those that closely resemble the ones used by an earlier dataset that inspired the creation of this one. This was driven primarily by a desire to compare achievable accuracy measures on similar datasets (despite them being nearly 25 years apart) and, to some extent, by a desire to restrict features to those related to core demographic information.
Note: It may well turn out that many of the unused features were simply not available previously and that they may indeed help explain a significant enough portion of the variation in the dataset. And if that is the case we would benefit from revisiting feature selection at a later stage.
# read in desired field codes and descriptive names from file and store them for lookup
df_codes = pd.read_csv(os.path.join(DATADIR,DATAFILE_CODES),skipinitialspace=True)
dict_codes = dict(zip(df_codes.FieldCode,df_codes.FieldName))
cols = df_codes.FieldCode.values
cols.tolist()
Read the PUMA-to-Zipcode crosswalk file and extract zip codes for PUMAs specific to San Diego County. Also create a list of unique PUMAs for the San Diego region. Note: we are only taking PUMAs for the cities that are part of San Diego County. This will therefore not account for the unincorporated and non-municipal areas of the county.
# the default utf-8 encoding leads to decode errors for this file, so use ISO-8859-1
df_xwalk = pd.read_csv(os.path.join(DATADIR,DATAFILE_XWALK),skipinitialspace=True,skiprows=[1],encoding = "ISO-8859-1")
df_xwalk.head()
sd_cities = ['Carlsbad','Chula Vista','Coronado','Del Mar','El Cajon','Encinitas','Escondido','Imperial Beach',
'La Mesa','Lemon Grove','National City','Oceanside','Poway','San Diego','San Marcos','Santee',
'Solana Beach','Vista']
sd_zipnames = [s + ', CA' for s in sd_cities]
df_xwalk_sd = df_xwalk.loc[df_xwalk['zipname'].isin(sd_zipnames)]
df_xwalk_sd = df_xwalk_sd[['puma12','zcta5','zipname']]
sd_pumas = df_xwalk_sd.puma12.unique()
Now read the full dataset of PUMS samples, extract those specific to San Diego County, and set them aside. This will form our test set. The rest of the dataset, or a portion of it, will be the training set.
df = pd.read_csv(os.path.join(DATADIR,DATAFILE_PUMS),skipinitialspace=True,usecols=cols)
# settings to display all columns
pd.set_option('display.max_columns', 50)
df.head(n=10)
Before splitting the data into test and training sets, we need to carry out the following transformation (common to both sets): create a categorical income label, CAT, based on total person income (PINCP), using a threshold of $50,000. Note: PINCP already aggregates the individual income components such as INTP, RETP, OIP and so on.
df['CAT'] = np.where(df['PINCP']>50000.0, 'AGI>50K', 'AGI<=50K')
df.head()
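Before moving on, an optional sanity check on the balance of the new income label; this step is not part of the original flow but gives a quick sense of how skewed the two classes are:
# fraction of samples in each income category
df['CAT'].value_counts(normalize=True)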
Now that we have the income category label we can remove all other fields related to income.
income_cols = ['ADJINC','INTP','OIP','PAP','RETP','SEMP','SSIP','SSP','WAGP','POVPIP']
df.drop(income_cols,axis=1,inplace=True)
Let's also identify other fields that are not essential or do not fit the criteria we are using to predict income levels.
drop_cols = ['RT','SERIALNO','SPORDER','CIT','JWTR','WKHP','WKW']
df.drop(drop_cols,axis=1,inplace=True)
Now clearly there are many missing values here, as is evident from the above dump of a very small slice of the data. There are many ways to handle missing data, and there is no hard-and-fast rule as to which one is best. These strategies either discard, impute, or model the missing data.
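For illustration only, here is a minimal sketch of the imputation route using scikit-learn's SimpleImputer with a most-frequent strategy; the analysis below takes a different, more targeted approach:
from sklearn.impute import SimpleImputer
# Illustration only: replace missing values in every column with that column's
# most frequent value, instead of discarding the affected rows.
imputer = SimpleImputer(strategy='most_frequent')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)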
Let's get a sense of how much is missing to see if discarding is even an option. Below is the percentage of samples with at least one missing value:
int((df.isnull().values.any(axis=1).sum()/df.shape[0])*100)
That's an unusually high percentage of samples with NaNs. It's very likely that one or two fields are responsible for this. Let's see if there are such fields. Below is the percentage of NaNs in each column:
col_nans = ((df.isnull().values.sum(axis=0)/df.shape[0])*100).astype(int)
col_nans.tolist()
It does look like a handful of columns result in a very high percentage of NaNs. Let's examine what those columns are.
hi_nan_idx = np.where(col_nans > 50)[0]
hi_nan_cols = df.columns.values[hi_nan_idx]
hi_nan_cols.tolist()
The fields refer to Spoken English Level, Science-Engineering (STEM) Field and Science-Engineering Related Field in the dataset. All of these seem to be fairly important features for our classification algorithm since they all appear to have a correlation with income levels.
Place of Birth (POBP) is a fairly strong indicator of the level of spoken English, so one could argue that ENG could be dropped. Both STEM fields have an unusually large share of NaNs at 76% each, and it is non-trivial to infer their values, although some strategies do exist (e.g., immigrants, especially those with higher incomes, tend to come from STEM fields). POBP seems to play a part here too, so we may do well to discard these STEM fields as well.
tmpdf = df.drop(hi_nan_cols,axis=1)
int((tmpdf.isnull().values.any(axis=1).sum()/tmpdf.shape[0])*100)
We see that dropping ENG, SCIENGP and SCIENGRLP leads to a significant reduction (51% to be precise) in the number of samples with missing values. Let's now take a look at the next group of high-NaN fields and decide what to do with them.
col_nans = ((tmpdf.isnull().values.sum(axis=0)/tmpdf.shape[0])*100).astype(int)
(tmpdf.columns.values[np.where(col_nans > 20)[0]]).tolist()
The above fields are Class of Worker (COW) and Industry (INDP). Let's take a look at samples that have NaNs for either of these fields and see what we might learn.
tmpdf.loc[np.isnan(tmpdf['COW']) | np.isnan(tmpdf['INDP'])].head(n=3)
tmpdf.loc[~np.isnan(tmpdf['COW']) & ~np.isnan(tmpdf['INDP'])].head(n=3)
You can adjust the value of n above to view larger slices of the data for each of the conditions. There appears to be a relation between Total Earnings (PERNP) and the two fields. Further, COW and INDP appear to be NaN (or not) for the same samples. Let's test these theories.
cow_nan_idx = np.where(np.isnan(tmpdf['COW']))[0]
s = "<b>NaN Counts</b>: <b>INDP</b> {} <b>COW</b> {}".format(np.isnan(tmpdf.INDP.values[cow_nan_idx]).sum(),len(cow_nan_idx))
display(HTML(s))
unq = np.unique(tmpdf.PERNP.values[cow_nan_idx])
unq[np.where(np.isnan(unq))] = 0.
s = "Unique values of PERNP when COW and INDP are NaN: {}".format(np.unique(unq))
display(HTML(s))
As suspected, the same samples have NaNs for both COW and INDP. Further, we see that when they are NaN, Total Earnings is either NaN or 0. This suggests a strong correlation between the two fields and allows us to drop one of them. Note: it's still possible to argue that retaining both is of value, but it's a weak argument.
If we are going to retain one of them, it is better to retain COW since INDP is more of a refinement of COW, and also because it is easier to infer COW for the remaining missing values than it is for INDP. Given that individuals without any earnings are the ones who have a NaN for worker class, we can reasonably conclude that these individuals are either retired or unemployed. The closest value one can assign is 9. From the data dictionary, this is what that value represents:
9 .Unemployed and last worked 5 years ago or earlier or never worked
tmpdf.loc[cow_nan_idx,['COW']] = 9.
tmpdf.drop(['INDP','PERNP','PINCP'],axis=1,inplace=True)
int((tmpdf.isnull().values.any(axis=1).sum()/tmpdf.shape[0])*100)
With these changes we are down to 3% of samples with at least one missing value. This is a small enough percentage of samples that we can discard these samples without concern for how it might affect the accuracy of our model.
nan_samples = np.where(tmpdf.isnull().values.any(axis=1))[0]
tmpdf.drop(nan_samples,inplace=True)
int((tmpdf.isnull().values.any(axis=1).sum()/tmpdf.shape[0])*100)
And there we have it, a dataset with zero missing values. We can now split out the San Diego specific samples and keep them aside as the test set.
Below we filter the data on PUMAs specific to San Diego County, such that the test set is exclusive to those PUMAs while the training set covers the rest of California excluding San Diego County.
test_df = tmpdf.loc[tmpdf['PUMA'].isin(sd_pumas) & (tmpdf['ST'] == 6)]
train_df = tmpdf.loc[~(tmpdf['PUMA'].isin(sd_pumas) & (tmpdf['ST'] == 6))]
total = train_df.shape[0] + test_df.shape[0]
s = "<b>Sample Counts</b>: <b>Train</b> {} <b>Test</b> {}".format(train_df.shape[0],test_df.shape[0])
display(HTML(s))
s = "<b>Sample Percentages</b>: <b>Train</b> {:.2f} <b>Test</b> {:.2f}".format(((train_df.shape[0]/total)*100),
((test_df.shape[0]/total)*100))
display(HTML(s))
Looks like San Diego PUMAs account for nearly 8% of this PUMS dataset. This is a good sample size for our test set. We write the training and test data sets to file for use in further analysis.
# write the datasets out to file
train_df.to_csv(OUTFILE_TRAIN, encoding='utf-8', index=False)
test_df.to_csv(OUTFILE_TEST, encoding='utf-8', index=False)
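As an optional sanity check (assuming the files were written to the current working directory as above), the two files can be read back to confirm they are NaN-free and share the same columns:
# Optional: read the files back and verify they are NaN-free with matching columns.
check_train = pd.read_csv(OUTFILE_TRAIN)
check_test = pd.read_csv(OUTFILE_TEST)
assert not check_train.isnull().values.any()
assert not check_test.isnull().values.any()
assert list(check_train.columns) == list(check_test.columns)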
Below is a summary of the operations carried out in this part of the analysis:
- Read in the 2016 5-yr PUMS samples for California, keeping only the selected subset of fields.
- Created the income category label (CAT) from total person income (PINCP) and dropped the individual income fields.
- Dropped non-essential and high-NaN fields (ENG, SCIENGP, SCIENGRLP, INDP), assigned COW = 9 for individuals with no earnings, and discarded the remaining samples with missing values.
- Split the data into a San Diego County test set and a training set covering the rest of California, and wrote both to file.
For additional analysis and insights please refer to part II of this series located here.