What follows is a look at the demographics of San Diego County. The objective is to gain a deeper understanding of the demographic data through insightful visualizations; in other words, the emphasis is on visualizing the data rather than analyzing it. The questions answered by this effort are listed under Data Questions. This effort is intended solely as an exercise in visualizing the demographic data.
What is the demographic make-up of San Diego County and its cities?
The American Community Survey (ACS) from the US Census Bureau is the gold standard among publicly available sources of data on the US population. ACS datasets are available in the form of tables, with each table assigned a unique identifier. The full list of tables (1000+) can be accessed here and downloaded manually. The ACS also provides APIs for downloading the datasets.
Guides to using data from the ACS are here. The methodology used to capture the data is detailed here, and code lists, among other documentation, can be found here.
For the demographic data, the 5-year estimates were chosen since they are more precise than the 1-year estimates. For the purpose of this analysis, the data profiles (one for each of the cities) contain all of the higher-level demographic information we are looking for.
The datasets were downloaded via the API. Instructions on constructing the URL request for 5-year estimates can be found here.
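To make the request format concrete, a single data-profile request for the 2016 5-year total population estimate (variable DP05_0001E) for San Diego city (place FIPS 66000) in California (state FIPS 06) would look roughly like the following; the table, variable, and FIPS codes are illustrative and the API key is a placeholder:

https://api.census.gov/data/2016/acs/acs5/profile?get=NAME,DP05_0001E&for=place:66000&in=state:06&key=<YOUR_KEY>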
import pandas as pd
import numpy as np
import requests
import json
from IPython.display import display, HTML
from enum import Enum, unique, auto
from itertools import islice
Constructing an API request for downloading data from the ACS involves deciding on the following: the estimate series (1-, 3-, or 5-year), the product type (detailed tables, data profiles, subject tables, and so on), the tables of interest, and the geography type (state, county, place, and so on).
Once each of the above types is decided, one needs to further choose among the available content for each type, e.g., the actual year (2016, 2015, ...) for which to fetch the estimates, the specific tables within each product type (DP05, B01001, ...), and the names of geographies (San Diego, Los Angeles, ... if county is the chosen geography type).
All of this can be quite overwhelming. One approach is to browse what's available for each selection using the American Fact Finder (AFF) Search Tool and pick out values based on what the search yields. It is even more useful to save the selections to disk through the save search feature. This saves the selections as <file>.aff, an XML file with the full search terms. This file can be further used to capture the information (such as place ids and so on) needed to construct the URL. Additionally, there is also a spreadsheet of FIPS codes for all geographic areas selectable with ACS queries here.
A developer's guide to using the API is here, and tutorials on using the API are here and here. The ACS site also lists many examples of URL requests for downloading its datasets.
NOTE: Buried in the developer's guide is the fact that there is a limit on the number of variables that can be requested per API call. It is currently 50.
# these are the possible values for the product type of ACS 5-year estimates
@unique
class ProductType(Enum):
    DETAILED_TABLE = auto()
    COMPARISON_PROFILE = auto()
    DATA_PROFILE = auto()
    QUICK_TABLE = auto()
    SUBJECT_TABLE = auto()
    GEOGRAPHY_HEADER = auto()

# these are the possible values for variable types of a detailed table
# other tables have a subset of these values
@unique
class VarType(Enum):
    ESTIMATE = auto()
    MARGIN_ERROR = auto()
    PERCENT_ESTIMATE = auto()
    PERCENT_MARGIN = auto()
    ESTIMATE_MARGIN = auto()
    ESTIMATE_PERCENT = auto()
    ALL = auto()
PER_QUERY_VARIABLE_LIMIT = 50
HEADER_VAR = 'NAME'
# common vars appended to the fetched data
COMMON_VARS = ['state','place']
# base URLs for requesting 2016 ACS 5-year estimates
BASE_URL_DATA_PROFILE = "https://api.census.gov/data/2016/acs/acs5/profile"
BASE_URL_DETAILED_TABLE = "https://api.census.gov/data/2016/acs/acs5"
CA_STATE_FIPS = 6
SD_CITIES_FIPS = [11194, 13392, 16378, 18506, 21712, 22678, 22804, 36294, 40004, 41124, 50398, 53322, 58520,
                  66000, 68196, 70224, 72506, 82996]
# NOTE: the DP05 dataset is fetched in two batches (variables 1-49 and 50-84).
# This is a workaround to avoid the server error response (400) received when all variables are requested at once.
DATASETS = [{'type': ProductType.DATA_PROFILE, 'ids': ['DP05'], 'var_start': 1, 'var_end': 84, 'var_type': VarType.ESTIMATE}]
CONF_FILE = 'acs_api.conf'
OUT_FILE = 'acs_5yr_2016_dp05_sd.csv'
Data is downloaded by constructing an HTTP request as per the API specification. The HTTP response is a two-dimensional array in JSON format (as per the ACS guidelines). The JSON response is decoded prior to further analysis.
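As a rough sketch of what that response looks like once decoded (the figures below are placeholders, not actual ACS values), the first row carries the column names and each subsequent row carries the values for one place:

# illustrative shape of a decoded response; the numbers are placeholders, not real ACS values
example_response = [
    ["NAME", "DP05_0001E", "state", "place"],                   # header row: column names
    ["San Diego city, California", "1234567", "06", "66000"],   # one data row per requested place
]
# the same two-step conversion is used later in this notebook
pd.DataFrame(data=example_response[1:], columns=example_response[0])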
acs_api_key = None
state_fips = format(CA_STATE_FIPS,'02d')
# utility function to return a list of suffixes for the specified variable type
# list may contain one or more suffixes
#
# var_type: Enum (VarType) indicating the type of variable (estimate, margin of error, ...)
#
# Returns: list of suffixes as characters
def var_type_to_suffix(var_type):
    # determine suffix based on var type
    suffix_list = ['E']
    if (var_type is VarType.MARGIN_ERROR):
        suffix_list = ['M']
    elif (var_type is VarType.PERCENT_ESTIMATE):
        suffix_list = ['PE']
    elif (var_type is VarType.PERCENT_MARGIN):
        suffix_list = ['PM']
    elif (var_type is VarType.ESTIMATE_MARGIN):
        suffix_list = ['E','M']
    elif (var_type is VarType.ESTIMATE_PERCENT):
        suffix_list = ['E','PE']
    elif (var_type is VarType.ALL):
        suffix_list = ['E','M','PE','PM']
    return suffix_list
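A couple of quick calls illustrate the mapping from variable type to suffixes:

var_type_to_suffix(VarType.ESTIMATE)          # ['E']
var_type_to_suffix(VarType.ESTIMATE_PERCENT)  # ['E', 'PE']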
# utility function to construct a string of variable names for the
# specified product type and table id
#
# prod_type: Product Type as needed by the ACS API
# id: Table-ID as relevant to the product type
# vars: Indices of the variables (within the specified table) that are being requested
# var_type: Enum indicating the type of variable (estimate, margin of error)
#
# Returns: a single string of comma-separated variable names
#
# NOTE: this function does not do any error-checking
def get_acs_table_vars(prod_type, id, vars, var_type):
    # always get header names for the variables
    vlist_str = HEADER_VAR
    # default product type: detailed table
    fmt = '03d'
    if (prod_type is ProductType.DATA_PROFILE):
        fmt = '04d'
    # determine suffix based on var type
    suffix_list = var_type_to_suffix(var_type)
    prefix = id + '_'
    for idx, v in enumerate(vars):
        vl = [(prefix + format(v, fmt)) + s for s in suffix_list]
        vs = ",".join(vl)
        vlist_str += "," + vs
    return vlist_str
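For instance, requesting the first two estimate variables of the DP05 data profile yields the comma-separated string below:

get_acs_table_vars(ProductType.DATA_PROFILE, 'DP05', [1, 2], VarType.ESTIMATE)
# -> 'NAME,DP05_0001E,DP05_0002E'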
# Function to construct a URL request for the specified data and return the
# JSON formatted response
#
# prod_type: Enum indicating the product type as needed by the ACS API
# id: Table-ID as relevant to the product type
# vars: Indices of the variables (within the specified table) that are being requested
# var_type: Enum indicating the type of variable (estimate, margin of error)
#
# Returns: JSON formatted data fetched from ACS, if successful
#          None, upon failure
#
# Raises: TypeError, ValueError, other exceptions
def fetch_acs_data_sd_cities(prod_type, id, vars, var_type):
    global acs_api_key
    data = None
    if not isinstance(prod_type, ProductType):
        raise TypeError('prod_type must be of type ProductType')
    if not isinstance(var_type, VarType):
        raise TypeError('var_type must be of type VarType')
    # specify product type
    # default product type: detailed table
    url = BASE_URL_DETAILED_TABLE
    if (prod_type is ProductType.DATA_PROFILE):
        url = BASE_URL_DATA_PROFILE
    # specify variables
    url += "?get="
    var_list = get_acs_table_vars(prod_type, id, vars, var_type)
    url += var_list
    # specify geographies
    url += "&for=place:"
    url += ",".join([str(fc) for fc in SD_CITIES_FIPS])
    url += ("&in=state:" + state_fips)
    url += "&key="
    # add api-key
    if acs_api_key is None:
        with open(CONF_FILE, 'r') as f:
            acs_api_key = f.read()
    url += acs_api_key
    #print(url)
    try:
        # The response for all queries is formatted as a two dimensional JSON array
        # where the first row provides column names and subsequent rows provide
        # data values. (Ref: API Guide)
        resp = requests.get(url)
        # proceed further only if we received a 200 OK status
        if resp.ok:
            #data = resp.json()
            data = json.loads(resp.content)
        else:
            resp.raise_for_status()
    except ValueError as e:
        print("Error: JSON Decoder: {}".format(str(e)))
        raise e
    except Exception as e:
        raise e
    return data
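As a quick sanity check (assuming a valid API key is present in acs_api.conf), a small request can be issued directly; the first row of the result should roughly be the header ['NAME', 'DP05_0001E', 'DP05_0002E', 'DP05_0003E', 'state', 'place']:

sample = fetch_acs_data_sd_cities(ProductType.DATA_PROFILE, 'DP05', [1, 2, 3], VarType.ESTIMATE)
#print(sample[0])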
# function to return a list of tuples with start and end variable indices complying
# with the per-query limit
#
# start_id: start id of the variable
# end_id: end id of the variable
# var_type: Enum indicating the type of variable (estimate, margin of error)
#
# Returns: List of tuples containing the start and end index of each batch of
#          variables
def batch_vars(start_id, end_id, var_type):
    batches = []
    suffix_list = var_type_to_suffix(var_type)
    factor = len(suffix_list)
    # true limit is discounted by 1 since NAME takes up one spot
    # and then scaled by factor
    # this is used for index generation for each batch
    limit = int((PER_QUERY_VARIABLE_LIMIT - 1)/factor)
    if (limit):
        num_vars = ((end_id-start_id)+1)
        num_vars *= factor
        # the number of batches itself is still determined by the
        # query limit
        num_batches = int(num_vars/PER_QUERY_VARIABLE_LIMIT)
        si = start_id
        for i in range(1, num_batches+1):
            ei = si + limit - 1
            batches.append((si, ei))
            si = ei + 1
        batches.append((si, end_id))
    #for (i,j) in batches:
    #    print("({},{})".format(i,j))
    return batches
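With the estimate-only variable type, one spot is reserved for NAME and the batches come out to 49 variables each; for the DP05 profile used below this gives:

batch_vars(1, 84, VarType.ESTIMATE)
# -> [(1, 49), (50, 84)]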
We loop through the datasets specified in DATASETS and download the data via the ACS API. The code below takes care to chunk the requests based on the number of variables to be fetched and the API query limit. It then re-assembles the data received for the various chunks and stores it as a pandas dataframe in a dictionary keyed by the table ID.
from collections import defaultdict
data_dict = defaultdict(list)
for d in DATASETS:
    prod_type = d['type']
    var_type = d['var_type']
    table_ids = d['ids']
    var_id_start = d['var_start']
    var_id_end = d['var_end']
    for id in table_ids:
        try:
            # split the variables into batches to account for the per-query API limit
            var_batches = batch_vars(var_id_start, var_id_end, var_type)
            table_df = None
            for i, (batch_start, batch_end) in enumerate(var_batches):
                # fetch the data for each batch
                vars = [no for no in range(batch_start, batch_end+1)]
                batch_data = fetch_acs_data_sd_cities(prod_type, id, vars, var_type)
                # convert json data from each batch to a dataframe and then concatenate
                # the data frames for all batches belonging to the same table
                df = pd.DataFrame(data=batch_data[1:], columns=batch_data[0]).T
                # use the NAME row (the well-known place names) as the column labels
                df.columns = df.loc[HEADER_VAR]
                df = df.drop(HEADER_VAR, axis=0)
                # each batch carries the common rows (state, place);
                # remove them here and re-attach a single copy to the concatenated df
                df_common = df.filter(items=COMMON_VARS, axis=0)
                df = df.drop(COMMON_VARS, axis=0)
                #display(df.head())
                table_df = pd.concat([table_df, df], axis=0)
            # re-attach common variables
            table_df = pd.concat([table_df, df_common], axis=0)
            data_dict[id].append(table_df)
            #display(table_df.head())
        except Exception as e:
            print(e)
            break
Since we're looking at a single table for this analysis, we'll create a shortcut for it and use that going forward. Here is a sample of what this table looks like for the ACS demographics data profile given by table ID DP05.
df = data_dict['DP05'][0]
df.head(n=2)
We see that while the places are identified by their well-known names, the variables themselves are not. This is fixed by downloading the variable information (available in HTML, XML, and JSON formats) and recoding the variable ids to well-known labels.
#
# This code fragment is here so we know how to fetch the variable meta-data and incorporate it into
# the dataset. Currently, we don't actually use it in meaningful ways.
#
url = "https://api.census.gov/data/2016/acs/acs5/profile/variables.json"
resp = requests.get(url)
if resp.ok:
    data = resp.json() # decode json to dict
else:
    print("Error downloading variable names for table DP05 / response status ({})".format(resp.status_code))
# data is in the form of a dictionary as below
#
# {
# 'variables':
# {
# 'for': {...}
# 'in': {...}
# 'var_id_1': {
# 'concept': '',
# 'group': '',
# 'label': '',
# 'limit': 0,
# 'predicateType': 'int',
# 'validValues': []
# },
# 'var_id_2': {...},
# ...
# }
# }
# we could do a lot with the above information such as validate the variable values to conform to the type, limit and expected
# values. For our present purpose we will assume the data is valid. We are primarily interested in the label so we can recode
# columns to well-known representations rather than the unique IDs they are currently.
df.columns = [s.split("city, California")[0] for s in df.columns]
vars = data['variables']
labels = []
id_list = list(df.index)
for id in id_list:
    l = 'NA'
    if id in vars:
        l = vars[id]['label']
    labels.append(l)
df.insert(0, 'VarName', labels)
df.columns.name = None
df.head(n=2)
We'll choose a select few of these variables for exploratory analysis via Tableau. Below, we write the selection out to file as a CSV, to be imported into Tableau.
# the names in cols correspond to the var ids in the vars list below
cols = ['Total','Male','Female','Median Age','18+','65+','Male-65+','Female-65+','1 Race','2+ Races',
        'White','African American','Native American','Asian','Pacific Islander','Other','Hispanic']
vars = [1,2,3,17,18,21,26,27,29,30,59,60,61,62,63,64,66]
d = DATASETS[0]
var_str = get_acs_table_vars(d['type'],d['ids'][0],vars,d['var_type'])
var_list = var_str.split(",")[1:] # drop HEADER_VAR
var_list += COMMON_VARS
df = df.drop('VarName', axis=1)
df = df.T
df = df[var_list]
df.columns = cols + COMMON_VARS
# convert the count fields to integers; Median Age stays a float
df['Median Age'] = df['Median Age'].astype('float')
int_cols = [c for c in cols if c != 'Median Age']
df[int_cols] = df[int_cols].astype('int')
# insert fields that are derived
df.insert(4, '<18', (df['Total'] - df['18+']))
df.insert(6, '0-64', (df['Total'] - df['65+']))
# convert index to column
df.reset_index(inplace=True)
df.rename(columns={'index':'City','state': 'State', 'place': 'Place'}, inplace=True)
# re-arrange columns
cols = df.columns.tolist()
cols = cols[-2:] + cols[:-2]
df = df.reindex(columns= cols)
df.head()
df.to_csv(OUT_FILE, encoding='utf-8', index=False)
Tableau visualizations are typically preferred when conveying insights outside of the engineering/developer community. Two things that make Tableau particularly good in such cases are the ability to make interactive visualizations and to present them as narratives or stories. The ease and speed with which one can create these high-quality, highly engaging visualizations is what makes the tool so popular, especially in the enterprise world.
The public (and free) version of Tableau is used for the visualizations here. Tableau Public can be downloaded as a desktop app (just like the professional version), used to create visualizations locally, and then to publish them online for anyone to view. This is a great way to share work based on open data.
The visualizations that follow provide an overview of the demographics for the 18 cities that make up San Diego County. Each field of the demographic data can additionally be compared against other fields from the same city or across cities.
Tableau 10.2+ recognizes zip-code based geographies, allowing it to automatically draw zip codes (those identified as playing a geographic role) on a map. However, all other geographies must be associated with a geospatial file - commonly known as a shapefile - in order to be rendered on a map. Conveniently enough, the Census Bureau offers free downloads of many shapefiles specific to US geographies. Furthermore, these files can be downloaded based on the FIPS codes that are part of ACS datasets.
Shapefiles for all places in California can be downloaded as a single zip archive here. Once downloaded, they can be loaded into Tableau and merged with the ACS demographics dataset (which is similarly loaded into Tableau). This tutorial helps explain how shapefiles can be used within Tableau.
There are a great many things one can do with the above visualization, such as zooming in and out, searching for a city using the search box in the upper left corner, selecting a city by hovering over it on the map, and triggering other actions such as tool-tip displays (as seen above) that contain information about the selected region.
One of the main reasons for using interactive visualizations is the need and ability to filter data. Here, a filter for cities and another for individual fields of the demographic data allow one to drill down and compare just the right amount of data, at the right level of detail.
One of the key goals, if not "the" goal, of data visualization is identifying relationships between the many features in the data. Comparing and contrasting the values assumed by one or more variables across the dataset is the means by which to do this. It is no surprise then that visualization tools are expected to provide easy-to-use means of making these comparisons. Tableau excels at this task. This, along with the many chart options (with helpful suggestions) it provides, helps make data visualization using Tableau a breeze.
Here, race-based population counts are compared across the cities in San Diego County. Stacking up the bars this way allows one not only to get a sense of the racial composition within each city, but also to make comparisons across several cities.
Bubbles are a great way of sizing up and comparing one or more quantities. Here, male and female population counts are represented as bubbles. Based on this plot, gender ratios appear to be very well balanced across the county.
Here we compare the populations of adults (18+) with non-adults (<18) and seniors (65+) with non-seniors (0-64). The visualization has been sorted (a default feature that Tableau provides) by population counts in each category to gain easy access to additional insights.
Dashboards and Stories are two features of Tableau geared towards the presentation of the visualizations created within Tableau. Dashboards allow several visualizations to be juxtaposed within a single view and two or more of them to be controlled using the same triggers, such as filters. This makes for a rich presentation of multiple inter-related visualizations. Stories, as the name suggests, help tell a story by combining narrative elements with visualizations.
This dashboard combines the map-based high-level overview with the ability to select a city and the demographic data fields. The selected city is highlighted both on the map and in the bar plot for the various demographic fields chosen.
This dashboard combines the three visualizations comparing gender, age and race across the county, allowing one to get a sense of the entire dataset, across the county through a single view.