SOCIO-ECONOMIC, SAN DIEGO

Income and the Individual

This week, as I was browsing through datasets at the UCI Machine Learning Repository - a well-known source of open datasets for machine learning enthusiasts - I came across one that caught my attention. It is among the more popular datasets on the site, and it interested me for two reasons: (a) its use of microdata samples from the American Community Survey, a.k.a. PUMS (Public Use Microdata Sample) data, something I’d been wanting to explore for a while, and (b) its relative simplicity - roughly 10-15 features capturing demographics, education level, and place of birth, among other information, for each individual, along with an associated income level. The machine learning task this data was meant for was classification: the features were to be used to predict the income class (AGI <= 50K or AGI > 50K) of each sample.
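To make the classification target concrete, here is a minimal sketch of how a raw income figure maps onto the two classes. The 50K boundary comes from the dataset description above; the field names (`age`, `education_years`, `income`) and the sample values are invented for illustration and are not taken from the dataset.

```python
# Illustrative only: the field names and values below are invented,
# not taken from the dataset.
THRESHOLD = 50_000  # class boundary given in the dataset description


def income_class(income):
    """Return 1 if income is above 50K, else 0 (the '<= 50K' class)."""
    return 1 if income > THRESHOLD else 0


records = [
    {"age": 39, "education_years": 13, "income": 42_000},
    {"age": 52, "education_years": 16, "income": 50_000},
    {"age": 45, "education_years": 18, "income": 87_500},
]

# Separate the predictive features from the target label.
features = [[r["age"], r["education_years"]] for r in records]
labels = [income_class(r["income"]) for r in records]

print(labels)  # [0, 0, 1]
```

Note that a record sitting exactly on the boundary falls into the "<= 50K" class.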

The data came from a 1994 PUMS release and contained no geographic information. The research paper that originally examined this data reported ~85% accuracy (not particularly high, but not bad given the limited set of features) using NBTree, a hybrid of naive Bayes and decision-tree classifiers. I wondered whether more recent data would perform similarly and what, if anything, that would imply for specific geographies. So I collected more recent PUMS data, in this case the 5-year estimates spanning 2012-2016, and set out to test the algorithms on data specific to San Diego County. Note that while the test data was specific to San Diego, the training data came from across California.
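The geographic split described above can be sketched roughly as follows. This is not the code from the notebooks: it uses only the Python standard library, the rows are synthetic stand-ins rather than real PUMS records, and the county list, feature scaling, and toy labelling rule are all assumptions made for illustration. It also assumes San Diego rows are held out of training, which the post does not state. The classifier is a bare-bones perceptron; in practice a library implementation such as scikit-learn's `Perceptron` would be the more natural choice.

```python
import random

random.seed(0)  # make the synthetic data reproducible

# Hypothetical county list; real PUMS data identifies geography differently.
COUNTIES = ["San Diego", "Los Angeles", "Alameda", "Fresno"]


def make_row():
    """Build one synthetic person record (a stand-in, not real PUMS data)."""
    age = random.uniform(18, 70)
    schooling = random.uniform(8, 20)
    # Toy rule plus noise: older, more-schooled people tend to land in class 1.
    score = 0.08 * (age - 40) + 0.5 * (schooling - 14) + random.gauss(0, 1)
    return {
        "county": random.choice(COUNTIES),
        "x": [age / 70, schooling / 20],  # crude scaling to roughly [0, 1]
        "y": 1 if score > 0 else 0,
    }


rows = [make_row() for _ in range(2000)]

# Geographic split: evaluate on San Diego only, train on the other counties.
# Assumption: San Diego rows are excluded from training.
train = [r for r in rows if r["county"] != "San Diego"]
test = [r for r in rows if r["county"] == "San Diego"]


def predict(w, b, x):
    """Classify x as 1 if the linear score is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def train_perceptron(data, epochs=20, lr=0.1):
    """Classic perceptron rule: adjust weights only on misclassified rows."""
    data = list(data)  # copy so shuffling leaves the caller's list untouched
    w, b = [0.0] * len(data[0]["x"]), 0.0
    for _ in range(epochs):
        random.shuffle(data)
        for r in data:
            err = r["y"] - predict(w, b, r["x"])  # -1, 0, or +1
            if err:
                w = [wi + lr * err * xi for wi, xi in zip(w, r["x"])]
                b += lr * err
    return w, b


def accuracy(data, w, b):
    """Fraction of rows whose predicted class matches the label."""
    hits = sum(predict(w, b, r["x"]) == r["y"] for r in data)
    return hits / len(data)


w, b = train_perceptron(train)
test_accuracy = accuracy(test, w, b)
print(f"train rows: {len(train)}, San Diego test rows: {len(test)}")
print(f"test accuracy: {test_accuracy:.3f}")
```

Because the data here is synthetic, the printed accuracy says nothing about the real results; the point is only the shape of the split, with training on the wider state and evaluation on a single county.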

While this effort did not yield satisfactory results - the models I tried averaged around 82% accuracy - the process of taking this data through the rigors of a full-scale machine learning workflow left me with lessons I couldn’t have learned elsewhere.

Additional Analysis

For additional analysis and the Python code used to generate these plots and insights, take a look at this series of Python notebooks: [1], [2], [3].

ANALYSIS: Python
VISUALIZATION: matplotlib
FORMAT: csv
ALGORITHMS: Perceptron, SVM