San Diego Microdata Analysis - Part I

This notebook is part of a series that investigates microdata - i.e., survey response data at the individual level, as opposed to responses aggregated by region - for populations in San Diego County. Here we look at the potential for predicting income levels based on demographic, education-level and other information specified as part of the microdata for different zip codes in the region. The data questions answered by this series are listed under Data Questions. The conclusions are part of the last report in this series.

Data Questions

  1. Can income levels be reliably predicted using demographic, education-level and other indicators for individuals in various communities of San Diego County?
  2. Which factors contribute the most to income level predictions?

Data Source

The American Community Survey (ACS) from the US Census Bureau makes available microdata samples for communities across the US. The microdata samples, referred to as PUMS (Public Use Microdata Samples), are available in 1-yr and 5-yr formats. This analysis used the 2016 5-yr estimates since they have higher accuracy and are available for all communities without restrictions on population size. Note that PUMS data is captured with restricted geographic information (owing to privacy protections) and is most commonly identified using PUMAs (Public Use Microdata Areas). More on PUMAs here. For this analysis a tool from the Missouri Census Data Center was used to map PUMAs to zip codes.

This tutorial and handbook are good starting points to familiarize oneself with PUMS data. An overview of technical details for PUMS data is here. Additional technical details can be found here. Guidance specific to the 2012-2016 5-yr PUMS data can be found here.

Data Analysis

The 2016 5-yr PUMS data in CSV format was downloaded using American FactFinder (AFF). Note that this file contains samples for all of California since it is not possible to filter PUMS files by geography at download time. The PUMA-to-Zipcode crosswalk file is also California-specific.

In [1]:
import sys
import os
import pandas as pd 
import numpy as np 
import pprint

# display related
from IPython.display import display, HTML
In [2]:
CWD = os.getcwd()

DATADIR = 'data'

DATAFILE_PUMS = "afc_pums_5yr_2016.csv"
DATAFILE_XWALK = "puma_zipcode_xwalk.csv"
DATAFILE_CODES = "afc_pums_field_codes.csv"

OUTFILE_TRAIN = "afc_pums_train.csv"
OUTFILE_TEST = "afc_pums_test.csv"

Exploring the Data

We explore a small subset of this data to begin with since the datafile is fairly large (~1.6GB).

In [3]:
df = pd.read_csv(os.path.join(DATADIR,DATAFILE_PUMS),skipinitialspace=True,nrows=10)
df.head()
Out[3]:
RT SERIALNO SPORDER PUMA ST ADJINC PWGTP AGEP CIT CITWP ... PWGTP71 PWGTP72 PWGTP73 PWGTP74 PWGTP75 PWGTP76 PWGTP77 PWGTP78 PWGTP79 PWGTP80
0 P 2012000000011 1 8508 6 1056030 21 63 1 NaN ... 37 38 23 22 35 6 23 7 7 22
1 P 2012000000011 2 8508 6 1056030 22 60 1 NaN ... 42 36 24 22 33 6 23 6 6 22
2 P 2012000000022 1 8105 6 1056030 14 79 1 NaN ... 14 25 22 22 14 4 14 23 13 25
3 P 2012000000031 1 1305 6 1056030 20 49 1 NaN ... 20 5 21 21 37 39 6 7 30 30
4 P 2012000000031 2 1305 6 1056030 19 48 1 NaN ... 18 5 16 17 38 33 6 6 29 31

5 rows × 283 columns

In [4]:
%pprint
df.columns.values.tolist() 
Pretty printing has been turned OFF
Out[4]:
['RT', 'SERIALNO', 'SPORDER', 'PUMA', 'ST', 'ADJINC', 'PWGTP', 'AGEP', 'CIT', 'CITWP', 'COW', 'DDRS', 'DEAR', 'DEYE', 'DOUT', 'DPHY', 'DRAT', 'DRATX', 'DREM', 'ENG', 'FER', 'GCL', 'GCM', 'GCR', 'HINS1', 'HINS2', 'HINS3', 'HINS4', 'HINS5', 'HINS6', 'HINS7', 'INTP', 'JWMNP', 'JWRIP', 'JWTR', 'LANX', 'MAR', 'MARHD', 'MARHM', 'MARHT', 'MARHW', 'MARHYP', 'MIG', 'MIL', 'MLPA', 'MLPB', 'MLPCD', 'MLPE', 'MLPFG', 'MLPH', 'MLPI', 'MLPJ', 'MLPK', 'NWAB', 'NWAV', 'NWLA', 'NWLK', 'NWRE', 'OIP', 'PAP', 'RELP', 'RETP', 'SCH', 'SCHG', 'SCHL', 'SEMP', 'SEX', 'SSIP', 'SSP', 'WAGP', 'WKHP', 'WKL', 'WKW', 'WRK', 'YOEP', 'ANC', 'ANC1P', 'ANC2P', 'DECADE', 'DIS', 'DRIVESP', 'ESP', 'ESR', 'FOD1P', 'FOD2P', 'HICOV', 'HISP', 'INDP', 'JWAP', 'JWDP', 'LANP', 'MIGPUMA', 'MIGSP', 'MSP', 'NAICSP', 'NATIVITY', 'NOP', 'OC', 'OCCP', 'PAOC', 'PERNP', 'PINCP', 'POBP', 'POVPIP', 'POWPUMA', 'POWSP', 'PRIVCOV', 'PUBCOV', 'QTRBIR', 'RAC1P', 'RAC2P', 'RAC3P', 'RACAIAN', 'RACASN', 'RACBLK', 'RACNH', 'RACNUM', 'RACPI', 'RACSOR', 'RACWHT', 'RC', 'SCIENGP', 'SCIENGRLP', 'SFN', 'SFR', 'SOCP', 'VPS', 'WAOB', 'FAGEP', 'FANCP', 'FCITP', 'FCITWP', 'FCOWP', 'FDDRSP', 'FDEARP', 'FDEYEP', 'FDISP', 'FDOUTP', 'FDPHYP', 'FDRATP', 'FDRATXP', 'FDREMP', 'FENGP', 'FESRP', 'FFERP', 'FFODP', 'FGCLP', 'FGCMP', 'FGCRP', 'FHINS1P', 'FHINS2P', 'FHINS3C', 'FHINS3P', 'FHINS4C', 'FHINS4P', 'FHINS5C', 'FHINS5P', 'FHINS6P', 'FHINS7P', 'FHISP', 'FINDP', 'FINTP', 'FJWDP', 'FJWMNP', 'FJWRIP', 'FJWTRP', 'FLANP', 'FLANXP', 'FMARP', 'FMARHDP', 'FMARHMP', 'FMARHTP', 'FMARHWP', 'FMARHYP', 'FMIGP', 'FMIGSP', 'FMILPP', 'FMILSP', 'FOCCP', 'FOIP', 'FPAP', 'FPERNP', 'FPINCP', 'FPOBP', 'FPOWSP', 'FPRIVCOVP', 'FPUBCOVP', 'FRACP', 'FRELP', 'FRETP', 'FSCHGP', 'FSCHLP', 'FSCHP', 'FSEMP', 'FSEXP', 'FSSIP', 'FSSP', 'FWAGP', 'FWKHP', 'FWKLP', 'FWKWP', 'FWRKP', 'FYOEP', 'PWGTP1', 'PWGTP2', 'PWGTP3', 'PWGTP4', 'PWGTP5', 'PWGTP6', 'PWGTP7', 'PWGTP8', 'PWGTP9', 'PWGTP10', 'PWGTP11', 'PWGTP12', 'PWGTP13', 'PWGTP14', 'PWGTP15', 'PWGTP16', 'PWGTP17', 'PWGTP18', 'PWGTP19', 'PWGTP20', 'PWGTP21', 'PWGTP22', 'PWGTP23', 'PWGTP24', 'PWGTP25', 'PWGTP26', 'PWGTP27', 'PWGTP28', 'PWGTP29', 'PWGTP30', 'PWGTP31', 'PWGTP32', 'PWGTP33', 'PWGTP34', 'PWGTP35', 'PWGTP36', 'PWGTP37', 'PWGTP38', 'PWGTP39', 'PWGTP40', 'PWGTP41', 'PWGTP42', 'PWGTP43', 'PWGTP44', 'PWGTP45', 'PWGTP46', 'PWGTP47', 'PWGTP48', 'PWGTP49', 'PWGTP50', 'PWGTP51', 'PWGTP52', 'PWGTP53', 'PWGTP54', 'PWGTP55', 'PWGTP56', 'PWGTP57', 'PWGTP58', 'PWGTP59', 'PWGTP60', 'PWGTP61', 'PWGTP62', 'PWGTP63', 'PWGTP64', 'PWGTP65', 'PWGTP66', 'PWGTP67', 'PWGTP68', 'PWGTP69', 'PWGTP70', 'PWGTP71', 'PWGTP72', 'PWGTP73', 'PWGTP74', 'PWGTP75', 'PWGTP76', 'PWGTP77', 'PWGTP78', 'PWGTP79', 'PWGTP80']

The above fields are listed (along with descriptions and valid values) in the 2016 5-yr PUMS Data Dictionary. We will only be looking at a subset of these fields - those related to demographics, education level and a few other attributes - for this analysis.

Feature Selection

From above we see that this dataset comes with a lot of features, 283 to be precise. Many, like the replicate weights (PWGTP{1-80}), are less concerned with the core information specific to each sample and have more to do with statistical measures associated with the sample for use in deriving population estimates. More about them here. Given our task here - predicting the income level - and the size of this dataset (at ~1.6M samples it is more than adequate for standard ML modeling), we have no need for the replicate weights.
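
For context, below is a minimal sketch of how the replicate weights would be used to attach a standard error to a weighted population estimate. The subset and the age cut-off are arbitrary illustrations; the 4/80 factor comes from the successive difference replication formula described in the PUMS accuracy documentation.

# weighted estimate of a population total (e.g. persons aged 65+),
# using the main person weight PWGTP
subset = df.loc[df['AGEP'] >= 65]
estimate = subset['PWGTP'].sum()

# the 80 replicate weights give 80 alternative estimates, combined via the
# successive difference replication formula into a standard error
rep_cols = ['PWGTP{}'.format(i) for i in range(1, 81)]
rep_estimates = subset[rep_cols].sum(axis=0)
std_error = np.sqrt((4.0 / 80.0) * ((rep_estimates - estimate) ** 2).sum())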

From among the remaining ~200 features there is a one-to-one mapping from core variables to data-quality flags - these are called person record-allocation flags (FXXXXP). These flags indicate whether the value of a variable was allocated or actually observed (i.e., received as part of the survey response). For the purpose of our analysis we do not care how a value was arrived at, as long as the methodology used is well designed and reasonable.

Discounting the flags, we have 128 features to work with. Of these, 11 are race related (RACXXX). Although having more detailed race data is likely to benefit the accuracy of our models given existing trends (Ref: Racial Wage Gap in the US), we've decided to use a single high-level variable (RAC1P) to encode racial information and forgo the others.
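
As an aside, here is a minimal sketch of how the replicate-weight and detailed race columns could be identified programmatically. The notebook itself uses a hand-curated field list (DATAFILE_CODES) in cell In [5] below; the name patterns here are assumptions based on the data dictionary.

import re

all_cols = df.columns.values.tolist()

# replicate weights PWGTP1..PWGTP80 (the main weight PWGTP does not match)
rep_weight_cols = [c for c in all_cols if re.fullmatch(r'PWGTP\d+', c)]

# detailed race recodes; only the high-level RAC1P is retained
race_cols = [c for c in all_cols if c.startswith('RAC') and c != 'RAC1P']

# the allocation flags (FAGEP, FCOWP, ...) are best taken from the data
# dictionary rather than a name pattern, since a plain "starts with F" match
# would also catch substantive fields such as FER, FOD1P and FOD2P
cols_to_drop = rep_weight_cols + race_cols
reduced_df = df.drop(columns=cols_to_drop)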

Among the remaining features we chose those that closely resemble the ones used by an earlier dataset that inspired the creation of this one. This was driven primarily by a desire to compare achievable accuracy measures on similar datasets (despite the two being nearly 25 years apart) and to some extent by a desire to restrict features to those related to core demographic information.

Note: It may well turn out that many of the unused features were simply not available previously and that they do help explain a significant portion of the variation in the dataset. If that is the case, we would benefit from revisiting feature selection at a later stage.

In [5]:
# read in desired field codes and descriptive names from file and store them for lookup
df_codes = pd.read_csv(os.path.join(DATADIR,DATAFILE_CODES),skipinitialspace=True)
dict_codes = dict(zip(df_codes.FieldCode,df_codes.FieldName))
cols = df_codes.FieldCode.values
cols.tolist()
Out[5]:
['RT', 'SERIALNO', 'SPORDER', 'PUMA', 'ST', 'ADJINC', 'PWGTP', 'AGEP', 'CIT', 'COW', 'ENG', 'INTP', 'JWTR', 'MAR', 'OIP', 'PAP', 'RELP', 'RETP', 'SCHL', 'SEMP', 'SEX', 'SSIP', 'SSP', 'WAGP', 'WKHP', 'WKW', 'DIS', 'INDP', 'PERNP', 'PINCP', 'POBP', 'POVPIP', 'RAC1P', 'SCIENGP', 'SCIENGRLP']

Read the PUMA-to-Zipcode crosswalk file and extract zip codes for PUMAs specific to San Diego County. Also create a list of unique PUMAs for the San Diego region. Note: we are only taking PUMAs for the cities that are part of San Diego County. This will therefore not account for the unincorporated and non-municipal areas of the county.

In [6]:
# the default utf-8 encoding leads to decode errors for this file; use ISO-8859-1 instead
df_xwalk = pd.read_csv(os.path.join(DATADIR,DATAFILE_XWALK),skipinitialspace=True,skiprows=[1],encoding = "ISO-8859-1")
df_xwalk.head()
Out[6]:
state puma12 zcta5 stab zipname PUMAname intptlon intptlat pop14 afact AFACT2
0 6 300 89010 CA Dyer, NV Alpine, Amador, Calaveras, Inyo, Mariposa, Mon... -117.914727 37.487642 30.552528 0.000164 1.0
1 6 7101 89019 CA Sandy Valley, NV San Bernardino County (Northeast)--Twentynine ... -115.645203 35.756243 71.624408 0.000495 1.0
2 6 300 89060 CA Pahrump, NV Alpine, Amador, Calaveras, Inyo, Mariposa, Mon... -116.144460 36.185628 29.780006 0.000160 1.0
3 6 300 89061 CA Pahrump, NV Alpine, Amador, Calaveras, Inyo, Mariposa, Mon... -115.911793 35.965713 50.626011 0.000272 1.0
4 6 5700 89439 CA Verdi, NV Nevada & Sierra Counties -120.017768 39.536953 74.148148 0.000728 1.0
In [7]:
sd_cities = ['Carlsbad','Chula Vista','Coronado','Delmar','El Cajon','Encinitas','Escondido','Imperial Beach',
             'La Mesa','Lemon Grove','National City','Oceanside','Poway','San Diego','San Marcos','Santee',
             'Solana Beach','Vista']

sd_zipnames = [s + ', CA' for s in sd_cities]

df_xwalk_sd = df_xwalk.loc[df_xwalk['zipname'].isin(sd_zipnames)]
df_xwalk_sd = df_xwalk_sd[['puma12','zcta5','zipname']]
sd_pumas = df_xwalk_sd.puma12.unique()

Pre-Processing

Now read the full dataset of PUMS samples, extract those specific to San Diego County and set them aside. These will form our test set. The rest of the dataset, or a portion of it, will be the training set.

In [8]:
df = pd.read_csv(os.path.join(DATADIR,DATAFILE_PUMS),skipinitialspace=True,usecols=cols)
In [9]:
# setting to display all columns
pd.set_option('display.max_columns', 50)

df.head(n=10)
Out[9]:
RT SERIALNO SPORDER PUMA ST ADJINC PWGTP AGEP CIT COW ENG INTP JWTR MAR OIP PAP RELP RETP SCHL SEMP SEX SSIP SSP WAGP WKHP WKW DIS INDP PERNP PINCP POBP POVPIP RAC1P SCIENGP SCIENGRLP
0 P 2012000000011 1 8508 6 1056030 21 63 1 1.0 NaN 0.0 1.0 1 0.0 0.0 0 0.0 19.0 0.0 1 0.0 0.0 65000.0 40.0 1.0 2 1190.0 65000.0 65000.0 6 501.0 1 NaN NaN
1 P 2012000000011 2 8508 6 1056030 22 60 1 1.0 NaN 0.0 1.0 1 0.0 0.0 1 0.0 19.0 0.0 2 0.0 0.0 40000.0 40.0 1.0 2 7070.0 40000.0 40000.0 6 501.0 1 NaN NaN
2 P 2012000000022 1 8105 6 1056030 14 79 1 NaN 1.0 0.0 NaN 2 0.0 0.0 0 18000.0 16.0 0.0 2 0.0 21600.0 0.0 NaN NaN 1 NaN 0.0 39600.0 6 363.0 1 NaN NaN
3 P 2012000000031 1 1305 6 1056030 20 49 1 1.0 NaN 0.0 1.0 1 0.0 0.0 0 0.0 21.0 0.0 1 0.0 0.0 69000.0 60.0 1.0 2 4670.0 69000.0 69000.0 6 501.0 1 1.0 2.0
4 P 2012000000031 2 1305 6 1056030 19 48 1 1.0 NaN 0.0 1.0 1 0.0 0.0 1 0.0 22.0 0.0 2 0.0 0.0 200000.0 50.0 1.0 2 6990.0 200000.0 200000.0 6 501.0 1 1.0 2.0
5 P 2012000000040 1 2901 6 1056030 2 60 1 1.0 NaN 0.0 1.0 3 0.0 0.0 0 0.0 16.0 0.0 2 0.0 0.0 55000.0 40.0 1.0 2 5380.0 55000.0 55000.0 17 429.0 1 NaN NaN
6 P 2012000000040 2 2901 6 1056030 2 30 1 NaN NaN 0.0 NaN 1 0.0 0.0 2 0.0 17.0 0.0 2 0.0 10200.0 0.0 NaN NaN 2 NaN 0.0 10200.0 6 429.0 1 NaN NaN
7 P 2012000000067 1 8511 6 1056030 16 50 1 3.0 NaN 0.0 1.0 1 0.0 0.0 0 0.0 23.0 0.0 2 0.0 0.0 80000.0 50.0 2.0 2 7860.0 80000.0 80000.0 15 501.0 6 2.0 2.0
8 P 2012000000071 1 7311 6 1056030 12 93 1 NaN NaN 1800.0 NaN 2 0.0 0.0 0 36000.0 18.0 0.0 1 0.0 2400.0 0.0 NaN NaN 1 NaN 0.0 40200.0 40 501.0 1 NaN NaN
9 P 2012000000071 2 7311 6 1056030 20 36 1 1.0 NaN 0.0 1.0 5 0.0 0.0 2 0.0 19.0 0.0 1 0.0 0.0 34000.0 40.0 1.0 2 7880.0 34000.0 34000.0 6 501.0 9 NaN NaN

Before splitting the data into test and training sets though, we need to carry out the following transformations (common to both sets):

  • Create a category field with two classes based on AGI (Aggregated Gross Income) as below:
    • AGI<=50K
    • AGI>50K
  • Handle missing values

Notes:

  • The AGI is given by PINCP, which covers both earnings and other income. In effect, we need not sum up the individual income fields such as INTP, RETP, OIP and so on.
  • The AGI need not be adjusted by ADJINC since these are 5-yr samples (a sketch of the adjustment, had it been needed, follows this list).
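
For reference, this is roughly what the adjustment would look like if constant 2016-dollar incomes were needed. ADJINC is stored with six implied decimal places (e.g. 1056030 corresponds to a factor of 1.056030); the sketch is illustrative and not applied anywhere below.

# hypothetical adjustment (not applied in this analysis)
pincp_2016_dollars = df['PINCP'] * (df['ADJINC'] / 1e6)
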
Label Data
In [10]:
df['CAT'] = np.where(df['PINCP']>50000.0, 'AGI>50K', 'AGI<=50K')
df.head()
Out[10]:
RT SERIALNO SPORDER PUMA ST ADJINC PWGTP AGEP CIT COW ENG INTP JWTR MAR OIP PAP RELP RETP SCHL SEMP SEX SSIP SSP WAGP WKHP WKW DIS INDP PERNP PINCP POBP POVPIP RAC1P SCIENGP SCIENGRLP CAT
0 P 2012000000011 1 8508 6 1056030 21 63 1 1.0 NaN 0.0 1.0 1 0.0 0.0 0 0.0 19.0 0.0 1 0.0 0.0 65000.0 40.0 1.0 2 1190.0 65000.0 65000.0 6 501.0 1 NaN NaN AGI>50K
1 P 2012000000011 2 8508 6 1056030 22 60 1 1.0 NaN 0.0 1.0 1 0.0 0.0 1 0.0 19.0 0.0 2 0.0 0.0 40000.0 40.0 1.0 2 7070.0 40000.0 40000.0 6 501.0 1 NaN NaN AGI<=50K
2 P 2012000000022 1 8105 6 1056030 14 79 1 NaN 1.0 0.0 NaN 2 0.0 0.0 0 18000.0 16.0 0.0 2 0.0 21600.0 0.0 NaN NaN 1 NaN 0.0 39600.0 6 363.0 1 NaN NaN AGI<=50K
3 P 2012000000031 1 1305 6 1056030 20 49 1 1.0 NaN 0.0 1.0 1 0.0 0.0 0 0.0 21.0 0.0 1 0.0 0.0 69000.0 60.0 1.0 2 4670.0 69000.0 69000.0 6 501.0 1 1.0 2.0 AGI>50K
4 P 2012000000031 2 1305 6 1056030 19 48 1 1.0 NaN 0.0 1.0 1 0.0 0.0 1 0.0 22.0 0.0 2 0.0 0.0 200000.0 50.0 1.0 2 6990.0 200000.0 200000.0 6 501.0 1 1.0 2.0 AGI>50K
Remove Non-Essential Fields

Now that we have the income category label we can remove all other fields related to income.

In [11]:
income_cols = ['ADJINC','INTP','OIP','PAP','RETP','SEMP','SSIP','SSP','WAGP','POVPIP']
df.drop(income_cols,axis=1,inplace=True)

Let's identify other fields that are either not essential or do not fit the criteria we are using to predict income levels.

In [12]:
drop_cols = ['RT','SERIALNO','SPORDER','CIT','JWTR','WKHP','WKW']
df.drop(drop_cols,axis=1,inplace=True)
Handling Missing Values

Clearly there are many missing values here, as is evident from the above dump of a very small slice of the data. There are many ways to handle missing data, and there is no hard-and-fast rule as to which one is best. These strategies either discard, impute, or model the missing data.
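
For concreteness, here is a minimal sketch of the first two strategies in pandas terms (illustrative only; the actual choices are made field by field below):

# option 1: discard every sample that has at least one missing value
df_dropped = df.dropna()

# option 2: impute, e.g. fill each column's NaNs with that column's most
# frequent value (model-based imputation, the third option, is omitted here)
df_imputed = df.fillna(df.mode().iloc[0])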

Let's get a sense of how much is missing to see if discarding is even an option. Below is the percentage of samples with at least one missing value:

In [13]:
int((df.isnull().values.any(axis=1).sum()/df.shape[0])*100)
Out[13]:
93

That's an unusually high percentage of samples with NaNs. It's very likely that one or two fields are responsible for this. Let's see if there are such fields. Below is the percentage of NaNs in each column:

In [14]:
col_nans = ((df.isnull().values.sum(axis=0)/df.shape[0])*100).astype(int)
col_nans.tolist()
Out[14]:
[0, 0, 0, 0, 42, 60, 0, 0, 3, 0, 0, 42, 19, 17, 0, 0, 76, 76, 0]

It does look like a handful of columns result in a very high percentage of NaNs. Let's examine what those columns are.

In [15]:
hi_nan_idx = np.where(col_nans > 50)[0]
hi_nan_cols = df.columns.values[hi_nan_idx]
hi_nan_cols.tolist()
Out[15]:
['ENG', 'SCIENGP', 'SCIENGRLP']

These fields refer to Spoken English Level, Science-Engineering (STEM) Field and Science-Engineering Related Field in the dataset. All of these seem to be fairly important features for our classification task since they all appear to be correlated with income levels.

Place of Birth (POBP) is a fairly strong indicator of the level of spoken English, so one could argue that ENG could be dropped. Both STEM fields have an unusually large share of NaNs (76% each) and it is non-trivial to infer their values, although some strategies do exist (e.g., immigrants, especially those with higher incomes, tend to come from STEM fields). POBP seems to play a part here too, and we may do well to discard these STEM fields.
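
If one did want to retain ENG rather than drop it, a hypothetical imputation based on POBP could look like the sketch below. This is illustrative only; the analysis that follows simply drops all three high-NaN fields.

# hypothetical alternative (not used below): fill missing ENG with the most
# common ENG value among respondents born in the same place (POBP)
eng_mode_by_pobp = df.groupby('POBP')['ENG'].agg(
    lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan)
eng_filled = df['ENG'].fillna(df['POBP'].map(eng_mode_by_pobp))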

In [16]:
tmpdf = df.drop(hi_nan_cols,axis=1)
int((tmpdf.isnull().values.any(axis=1).sum()/tmpdf.shape[0])*100)
Out[16]:
42

We see that dropping ENG, SCIENGP and SCIENGRLP leads to a significant reduction (51 percentage points, to be precise) in the share of samples with missing values. Let's now take a look at the next group of high-NaN fields and decide what to do with them.

In [17]:
col_nans = ((tmpdf.isnull().values.sum(axis=0)/tmpdf.shape[0])*100).astype(int)
(tmpdf.columns.values[np.where(col_nans > 20)[0]]).tolist()
Out[17]:
['COW', 'INDP']

The above fields are Worker Class and Industry. Let's take a look at samples that have NaNs for either of these fields and see what we might learn.

In [18]:
tmpdf.loc[np.isnan(tmpdf['COW']) | np.isnan(tmpdf['INDP'])].head(n=3)
Out[18]:
PUMA ST PWGTP AGEP COW MAR RELP SCHL SEX DIS INDP PERNP PINCP POBP RAC1P CAT
2 8105 6 14 79 NaN 2 0 16.0 2 1 NaN 0.0 39600.0 6 1 AGI<=50K
6 2901 6 2 30 NaN 1 2 17.0 2 2 NaN 0.0 10200.0 6 1 AGI<=50K
8 7311 6 12 93 NaN 2 0 18.0 1 1 NaN 0.0 40200.0 40 1 AGI<=50K
In [19]:
tmpdf.loc[~np.isnan(tmpdf['COW']) | ~np.isnan(tmpdf['INDP'])].head(n=3)
Out[19]:
PUMA ST PWGTP AGEP COW MAR RELP SCHL SEX DIS INDP PERNP PINCP POBP RAC1P CAT
0 8508 6 21 63 1.0 1 0 19.0 1 2 1190.0 65000.0 65000.0 6 1 AGI>50K
1 8508 6 22 60 1.0 1 1 19.0 2 2 7070.0 40000.0 40000.0 6 1 AGI<=50K
3 1305 6 20 49 1.0 1 0 21.0 1 2 4670.0 69000.0 69000.0 6 1 AGI>50K

You can adjust the value of n above to view larger slices of the data for each of the conditions. There appears to be a relation between Total Earnings (PERNP) and the two fields. Further, COW and INDP appear to be either both NaN or both present for the same samples. Let's test these theories.

In [20]:
cow_nan_idx = np.where(np.isnan(tmpdf['COW']))[0]
s = "<b>NaN Counts</b>: <b>INDP</b> {} <b>COW</b> {}".format(np.isnan(tmpdf.INDP.values[cow_nan_idx]).sum(),len(cow_nan_idx))
display(HTML(s))
unq = np.unique(tmpdf.PERNP.values[cow_nan_idx])
unq[np.where(np.isnan(unq))] = 0.
s = "Unique values of PERNP when COW and INDP are NaN: {}".format(np.unique(unq))
display(HTML(s))
NaN Counts: INDP 789759 COW 789759
Unique values of PERNP when COW and INDP are NaN: [ 0.]

As suspected, the same samples have NaNs for both COW and INDP. Further, we see that when they are NaN, Total Earnings is either NaN or 0. This suggests a strong correlation between the two and allows us to drop one of them. Note: it's still possible to argue that retaining both is of value, but it's a weak argument.

If we are going to retain one of them, it is better to retain COW since INDP is more of a refinement of COW, and also because it is easier to infer COW for the remaining missing values than it is for INDP. Given that individuals without any earnings are the ones who have a NaN for worker class, we can reasonably conclude that these individuals are either retired or unemployed. The closest value that one can assign is 9. From the data dictionary, this is what that value represents:

9 .Unemployed and last worked 5 years ago or earlier or never worked

In [21]:
tmpdf.loc[cow_nan_idx,['COW']] = 9.
tmpdf.drop(['INDP','PERNP','PINCP'],axis=1,inplace=True)
int((tmpdf.isnull().values.any(axis=1).sum()/tmpdf.shape[0])*100)
Out[21]:
3

With these changes we are down to 3% of samples with at least one missing value. This is a small enough percentage that we can discard these samples without concern for how it might affect the accuracy of our model.

In [22]:
nan_samples = np.where(tmpdf.isnull().values.any(axis=1))[0]
tmpdf.drop(nan_samples,inplace=True)
int((tmpdf.isnull().values.any(axis=1).sum()/tmpdf.shape[0])*100)
Out[22]:
0

And there we have it, a dataset with zero missing values. We can now split off the San Diego specific samples and set them aside as the test set.

Training and Test Sets

Below we filter the data on PUMAs specific to San Diego County such that the test set is exclusive to those PUMAs, while the training set contains PUMAs for the rest of California.

In [23]:
test_df = tmpdf.loc[tmpdf['PUMA'].isin(sd_pumas) & (tmpdf['ST'] == 6)]
train_df = tmpdf.loc[~(tmpdf['PUMA'].isin(sd_pumas) & (tmpdf['ST'] == 6))]
total = train_df.shape[0] + test_df.shape[0]

s = "<b>Sample Counts</b>: <b>Train</b> {} <b>Test</b> {}".format(train_df.shape[0],test_df.shape[0])
display(HTML(s))
s = "<b>Sample Percentages</b>: <b>Train</b> {:.2f} <b>Test</b> {:.2f}".format(((train_df.shape[0]/total)*100),
                                                                     ((test_df.shape[0]/total)*100))
display(HTML(s))
Sample Counts: Train 1663117 Test 140852
Sample Percentages: Train 92.19 Test 7.81

Looks like San Diego PUMAs account for nearly 8% of this PUMS dataset, which is a good sample size for our test set. We write the training and test datasets to file for use in further analysis.

In [24]:
# write the datasets out to file
train_df.to_csv(OUTFILE_TRAIN, encoding='utf-8', index=False)
test_df.to_csv(OUTFILE_TEST, encoding='utf-8', index=False)

Outro

Below is a summary of the operations carried out in this part of the analysis:

  • Download the raw datasets: (a) 2016 5yr ACS PUMS data for California (b) PUMA to Zipcode Crosswalk
  • Use the PUMA-to-Zipcode crosswalk (b) to identify the San Diego County PUMAs used to partition the PUMS samples in (a)
  • Transform the dataset to retain relevant features
  • Label the dataset with desired classes based on income levels
  • Handle missing data
  • Partition the dataset into training and test sets based on PUMAs
  • Store the training and test datasets to file

Additional Analysis

For additional analysis and insights please refer to part II of this series located here.