What follows is a look at the demographics of San Diego County. The objective is to gain a deeper understanding of the demographic data through insightful visualizations; in other words, the emphasis is on visualizing the data rather than analyzing it. The questions answered by this effort are listed under Data Questions. This effort is intended solely as an exercise in visualizing the demographic data.
What is the demographic make-up of San Diego County and its cities?
The American Community Survey (ACS) from the US Census Bureau is the gold standard among publicly available sources of data on the US population. ACS datasets are available in the form of tables, with each table assigned a unique identifier. The full list of tables (1000+) can be accessed here and downloaded manually. The ACS also provides APIs for downloading the datasets.
Guides to using data from the ACS are here. The methodology used to capture the data is detailed here, and code lists, among other documentation, can be found here.
For the demographic data, the 5-year estimates were chosen since they are more precise than the 1-year estimates. For the purpose of this analysis, the data profiles (one for each of the cities) contain all of the higher-level demographic information we are looking for.
The datasets were downloaded via the API. Instructions on constructing the URL request for 5-year estimates can be found here.
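To make the request format concrete, a single data-profile request for the 2016 5-year total population estimate (variable DP05_0001E) for San Diego city (place FIPS 66000) in California (state FIPS 06) would look roughly like the following; the table, variable, and FIPS codes are illustrative and the API key is a placeholder:

https://api.census.gov/data/2016/acs/acs5/profile?get=NAME,DP05_0001E&for=place:66000&in=state:06&key=<YOUR_KEY>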
import pandas as pd
import numpy as np
import requests
import json
from IPython.display import display, HTML
from enum import Enum, unique, auto
from itertools import islice
Constructing an API request for downloading data from the ACS involves deciding on the following: the estimate series (1-, 3-, or 5-year), the product type (detailed tables, data profiles, subject tables, and so on), the tables of interest, and the geography type (state, county, place, and so on).
Once each of the above types is decided, one needs to further choose among the available content for each type, e.g., the actual year (2016, 2015, ...) for which to fetch the estimates, the specific tables within each product type (DP05, B01001, ...), and the names of geographies (San Diego, Los Angeles, ... if county is the chosen geography type).
All of this can be quite overwhelming. One approach is to browse what's available for each selection using the American Fact Finder (AFF) Search Tool and pick out values based on what the search yields. It is even more useful to save the selections to disk through the save search feature. This saves the selections as <file>.aff, an XML file with the full search terms. This file can be further used to capture the information (such as place ids and so on) needed to construct the URL. Additionally, there is also a spreadsheet of FIPS codes for all geographic areas selectable with ACS queries here.
A developer's guide to using the API is here, and tutorials on using the API are here and here. The ACS site also lists many examples of URL requests for downloading its datasets.
NOTE: Buried in the developer's guide is the fact that there is a limit on the number of variables that can be requested per API call. It is currently 50.
# these are the possible values for the product type of ACS 5-year estimates
@unique
class ProductType(Enum):
    DETAILED_TABLE = auto()
    COMPARISON_PROFILE = auto()
    DATA_PROFILE = auto()
    QUICK_TABLE = auto()
    SUBJECT_TABLE = auto()
    GEOGRAPHY_HEADER = auto()

# these are the possible values for variable types of a detailed table
# other tables have a subset of these values
@unique
class VarType(Enum):
    ESTIMATE = auto()
    MARGIN_ERROR = auto()
    PERCENT_ESTIMATE = auto()
    PERCENT_MARGIN = auto()
    ESTIMATE_MARGIN = auto()
    ESTIMATE_PERCENT = auto()
    ALL = auto()
PER_QUERY_VARIABLE_LIMIT = 50
HEADER_VAR = 'NAME'
# common vars appended to the fetched data
COMMON_VARS = ['state','place']
# base URLs for requesting 2016 ACS 5-year estimates
BASE_URL_DATA_PROFILE = "https://api.census.gov/data/2016/acs/acs5/profile"
BASE_URL_DETAILED_TABLE = "https://api.census.gov/data/2016/acs/acs5"
CA_STATE_FIPS = 6
SD_CITIES_FIPS = [11194, 13392, 16378, 18506, 21712, 22678, 22804, 36294, 40004, 41124, 50398, 53322, 58520,
                  66000, 68196, 70224, 72506, 82996]
# NOTE: the DP05 dataset is fetched in two batches (variables 1-49 and 50-84).
# This is a workaround to avoid the server error response (400) received when all variables are requested at once.
DATASETS = [{'type': ProductType.DATA_PROFILE, 'ids': ['DP05'], 'var_start': 1, 'var_end': 84, 'var_type': VarType.ESTIMATE}]
CONF_FILE = 'acs_api.conf'
OUT_FILE = 'acs_5yr_2016_dp05_sd.csv'
Data is downloaded by constructing an HTTP request as per the API specification. The HTTP response is a two-dimensional array in JSON format (as per the ACS guidelines). The JSON response is decoded prior to further analysis.
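As a rough sketch of what that response looks like once decoded (the figures below are placeholders, not actual ACS values), the first row carries the column names and each subsequent row carries the values for one place:

# illustrative shape of a decoded response; the numbers are placeholders, not real ACS values
example_response = [
    ["NAME", "DP05_0001E", "state", "place"],                   # header row: column names
    ["San Diego city, California", "1234567", "06", "66000"],   # one data row per requested place
]
# the same two-step conversion is used later in this notebook
pd.DataFrame(data=example_response[1:], columns=example_response[0])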
acs_api_key = None
state_fips = format(CA_STATE_FIPS,'02d')
# utility function to return a list of suffixes for the specified variable type
# list may contain one or more suffixes
#
# var_type: Enum (VarType) indicating the type of variable (estimate, margin of error, ...)
#
# Returns: list of suffixes as characters
def var_type_to_suffix(var_type):
    # determine suffix based on var type
    suffix_list = ['E']
    if (var_type is VarType.MARGIN_ERROR):
        suffix_list = ['M']
    elif (var_type is VarType.PERCENT_ESTIMATE):
        suffix_list = ['PE']
    elif (var_type is VarType.PERCENT_MARGIN):
        suffix_list = ['PM']
    elif (var_type is VarType.ESTIMATE_MARGIN):
        suffix_list = ['E','M']
    elif (var_type is VarType.ESTIMATE_PERCENT):
        suffix_list = ['E','PE']
    elif (var_type is VarType.ALL):
        suffix_list = ['E','M','PE','PM']
    return suffix_list
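A couple of quick calls illustrate the mapping from variable type to suffixes:

var_type_to_suffix(VarType.ESTIMATE)          # ['E']
var_type_to_suffix(VarType.ESTIMATE_PERCENT)  # ['E', 'PE']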
# utility function to construct a string of variable names for the
# specified product type and table id
#
# prod_type: Product Type as needed by the ACS API
# id: Table-ID as relevant to the product type
# vars: Indices of the variables (within the specified table) that are being requested
# var_type: Enum indicating the type of variable (estimate, margin of error)
#
# Returns: a single string of comma-separated variable names
#
# NOTE: this function does not do any error-checking
def get_acs_table_vars(prod_type, id, vars, var_type):
    # always get header names for the variables
    vlist_str = HEADER_VAR
    # default product type: detailed table
    fmt = '03d'
    if (prod_type is ProductType.DATA_PROFILE):
        fmt = '04d'
    # determine suffix based on var type
    suffix_list = var_type_to_suffix(var_type)
    prefix = id + '_'
    for idx, v in enumerate(vars):
        vl = [(prefix + format(v, fmt)) + s for s in suffix_list]
        vs = ",".join(vl)
        vlist_str += "," + vs
    return vlist_str
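For instance, requesting the first two estimate variables of the DP05 data profile yields the comma-separated string below:

get_acs_table_vars(ProductType.DATA_PROFILE, 'DP05', [1, 2], VarType.ESTIMATE)
# -> 'NAME,DP05_0001E,DP05_0002E'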
# Function to construct a URL request for the specified data and return the
# JSON formatted response
#
# prod_type: Enum indicating the product type as needed by the ACS API
# id: Table-ID as relevant to the product type
# vars: Indices of the variables (within the specified table) that are being requested
# var_type: Enum indicating the type of variable (estimate, margin of error)
#
# Returns: JSON formatted data fetched from ACS, if successful
#          None, upon failure
#
# Raises: TypeError, ValueError, other exceptions
def fetch_acs_data_sd_cities(prod_type, id, vars, var_type):
    global acs_api_key
    data = None
    if not isinstance(prod_type, ProductType):
        raise TypeError('prod_type must be of type ProductType')
    if not isinstance(var_type, VarType):
        raise TypeError('var_type must be of type VarType')
    # specify product type
    # default product type: detailed table
    url = BASE_URL_DETAILED_TABLE
    if (prod_type is ProductType.DATA_PROFILE):
        url = BASE_URL_DATA_PROFILE
    # specify variables
    url += "?get="
    var_list = get_acs_table_vars(prod_type, id, vars, var_type)
    url += var_list
    # specify geographies
    url += "&for=place:"
    url += ",".join([str(fc) for fc in SD_CITIES_FIPS])
    url += ("&in=state:" + state_fips)
    url += "&key="
    # add api-key
    if acs_api_key is None:
        with open(CONF_FILE, 'r') as f:
            acs_api_key = f.read()
    url += acs_api_key
    #print(url)
    try:
        # The response for all queries is formatted as a two dimensional JSON array
        # where the first row provides column names and subsequent rows provide
        # data values. (Ref: API Guide)
        resp = requests.get(url)
        # proceed further only if we received a 200 OK status
        if resp.ok:
            #data = resp.json()
            data = json.loads(resp.content)
        else:
            resp.raise_for_status()
    except ValueError as e:
        print("Error: JSON Decoder: {}".format(str(e)))
        raise e
    except Exception as e:
        raise e
    return data
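As a quick sanity check (assuming a valid API key is present in acs_api.conf), a small request can be issued directly; the first row of the result should roughly be the header ['NAME', 'DP05_0001E', 'DP05_0002E', 'DP05_0003E', 'state', 'place']:

sample = fetch_acs_data_sd_cities(ProductType.DATA_PROFILE, 'DP05', [1, 2, 3], VarType.ESTIMATE)
#print(sample[0])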
# function to return a list of tuples with start and end variable indices complying
# with the per-query limit
#
# start_id: start id of the variable
# end_id: end id of the variable
# var_type: Enum indicating the type of variable (estimate, margin of error)
#
# Returns: List of tuples containing the start and end index of each batch of
#          variables
def batch_vars(start_id, end_id, var_type):
    batches = []
    suffix_list = var_type_to_suffix(var_type)
    factor = len(suffix_list)
    # true limit is discounted by 1 since NAME takes up one spot
    # and then scaled by factor
    # this is used for index generation for each batch
    limit = int((PER_QUERY_VARIABLE_LIMIT - 1)/factor)
    if (limit):
        num_vars = ((end_id-start_id)+1)
        num_vars *= factor
        # the number of batches itself is still determined by the
        # query limit
        num_batches = int(num_vars/PER_QUERY_VARIABLE_LIMIT)
        si = start_id
        for i in range(1, num_batches+1):
            ei = si + limit - 1
            batches.append((si, ei))
            si = ei + 1
        batches.append((si, end_id))
    #for (i,j) in batches:
    #    print("({},{})".format(i,j))
    return batches
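With the estimate-only variable type, one spot is reserved for NAME and the batches come out to 49 variables each; for the DP05 profile used below this gives:

batch_vars(1, 84, VarType.ESTIMATE)
# -> [(1, 49), (50, 84)]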
We loop through the datasets specified in DATASETS and download the data via the ACS API. The code below takes care to chunk the requests based on the number of variables to be fetched and the API query limit. It then re-assembles the data received for the various chunks and stores it as a pandas dataframe in a dictionary keyed by the table ID.
from collections import defaultdict
data_dict = defaultdict(list)
for d in DATASETS:
    prod_type = d['type']
    var_type = d['var_type']
    table_ids = d['ids']
    var_id_start = d['var_start']
    var_id_end = d['var_end']
    for id in table_ids:
        try:
            # split the variables into batches to account for the per-query API limit
            var_batches = batch_vars(var_id_start, var_id_end, var_type)
            table_df = None
            for i, (batch_start, batch_end) in enumerate(var_batches):
                # fetch the data for each batch
                vars = [no for no in range(batch_start, batch_end+1)]
                batch_data = fetch_acs_data_sd_cities(prod_type, id, vars, var_type)
                # convert json data from each batch to a dataframe and then concatenate
                # the data frames for all batches belonging to the same table
                df = pd.DataFrame(data=batch_data[1:], columns=batch_data[0]).T
                # use the NAME row (the well-known place names) as the column labels
                df.columns = df.loc[HEADER_VAR]
                df = df.drop(HEADER_VAR, axis=0)
                # each batch carries the common rows (state, place);
                # remove them here and re-attach a single copy to the concatenated df
                df_common = df.filter(items=COMMON_VARS, axis=0)
                df = df.drop(COMMON_VARS, axis=0)
                #display(df.head())
                table_df = pd.concat([table_df, df], axis=0)
            # re-attach common variables
            table_df = pd.concat([table_df, df_common], axis=0)
            data_dict[id].append(table_df)
            #display(table_df.head())
        except Exception as e:
            print(e)
            break
Since we're looking at a single table for this analysis, we'll create a shortcut for it and use that going forward. Here is a sample of what this table looks like for the ACS demographics data profile given by table ID DP05.
df = data_dict['DP05'][0]
df.head(n=2)
We see that while the places are identified by their well-known names, the variables themselves are not. This is fixed by downloading the variable information (available in HTML, XML, and JSON formats) and recoding the variable ids to well-known labels.
#
# This code fragment is here so we know how to fetch the variable meta-data and incorporate it into
# the dataset. Currently, we don't actually use it in meaningful ways.
#
url = "https://api.census.gov/data/2016/acs/acs5/profile/variables.json"
resp = requests.get(url)
if resp.ok:
    data = resp.json() # decode json to dict
else:
    print("Error downloading variable names for table DP05 / response status ({})".format(resp.status_code))
# data is in the form of a dictionary as below
#
# {
# 'variables':
# {
# 'for': {...}
# 'in': {...}
# 'var_id_1': {
# 'concept': '',
# 'group': '',
# 'label': '',
# 'limit': 0,
# 'predicateType': 'int',
# 'validValues': []
# },
# 'var_id_2': {...},
# ...
# }
# }
# we could do a lot with the above information such as validate the variable values to conform to the type, limit and expected
# values. For our present purpose we will assume the data is valid. We are primarily interested in the label so we can recode
# columns to well-known representations rather than the unique IDs they are currently.
df.columns = [s.split("city, California")[0] for s in df.columns]
vars = data['variables']
labels = []
id_list = list(df.index)
for id in id_list:
    l = 'NA'
    if id in vars:
        l = vars[id]['label']
    labels.append(l)
df.insert(0, 'VarName', labels)
df.columns.name = None
df.head(n=2)
We'll choose a select few of these variables for exploratory analysis via Tableau. Below, we write the selection out to file as a CSV, to be imported into Tableau.
# the names in cols correspond to the var ids in the vars list below
cols = ['Total','Male','Female','Median Age','18+','65+','Male-65+','Female-65+','1 Race','2+ Races',
        'White','African American','Native American','Asian','Pacific Islander','Other','Hispanic']
vars = [1,2,3,17,18,21,26,27,29,30,59,60,61,62,63,64,66]
d = DATASETS[0]
var_str = get_acs_table_vars(d['type'],d['ids'][0],vars,d['var_type'])
var_list = var_str.split(",")[1:] # drop HEADER_VAR
var_list += COMMON_VARS
df = df.drop('VarName', axis=1)
df = df.T
df = df[var_list]
df.columns = cols + COMMON_VARS
# convert the count fields to integers; Median Age stays a float
df['Median Age'] = df['Median Age'].astype('float')
int_cols = [c for c in cols if c != 'Median Age']
df[int_cols] = df[int_cols].astype('int')
# insert fields that are derived
df.insert(4, '<18', (df['Total'] - df['18+']))
df.insert(6, '0-64', (df['Total'] - df['65+']))
# convert index to column
df.reset_index(inplace=True)
df.rename(columns={'index':'City','state': 'State', 'place': 'Place'}, inplace=True)
# re-arrange columns
cols = df.columns.tolist()
cols = cols[-2:] + cols[:-2]
df = df.reindex(columns= cols)
df.head()
df.to_csv(OUT_FILE, encoding='utf-8', index=False)
Tableau visualizations are typically preferred when conveying insights outside of the engineering/developer community. Two things that make Tableau particularly good in such cases are the ability to make interactive visualizations and to present them as narratives or stories. The ease and speed with which one can create these high-quality, highly engaging visualizations is what makes the tool so popular, especially in the enterprise world.
The public (and free) version of Tableau is used for the visualizations here. Tableau Public can be downloaded as a desktop app (just like the professional version), used to create visualizations locally, and then to publish them online for anyone to view. This is a great way to share work based on open data.
The visualizations that follow provide an overview of the demographics for the 18 cities that make up San Diego County. Each field of the demographic data can additionally be compared against other fields from the same city or across cities.
Tableau 10.2+ recognizes zip-code based geographies, allowing it to automatically draw zip codes (those identified as playing a geographic role) on a map. However, all other geographies must be associated with a geospatial file - commonly known as a shapefile - in order to be rendered on a map. Conveniently enough, the Census Bureau offers free downloads of many shapefiles specific to US geographies. Furthermore, these files can be downloaded based on the FIPS codes that are part of ACS datasets.
Shapefiles for all places in California can be downloaded as a single zip archive here. Once downloaded, they can be loaded into Tableau and merged with the ACS demographics dataset (which is similarly loaded into Tableau). This tutorial helps explain how shapefiles can be used within Tableau.
There are a great many things one can do with the above visualization, such as zooming in and out, searching for a city using the search box in the upper left corner, selecting a city by hovering over it on the map, and triggering other actions such as tool-tip displays (as seen above) that contain information about the selected region.
One of the main reasons for using interactive visualizations is the need and ability to filter data. Here, a filter for cities and another for individual fields of the demographic data allow one to drill down and compare just the right amount of data, at the right level of detail.
One of the key goals, if not "the" goal, of data visualization is identifying relationships between the many features in the data. Comparing and contrasting the values assumed by one or more variables across the dataset is the means by which to do this. It is no surprise then that visualization tools are expected to provide easy-to-use means of making these comparisons. Tableau excels at this task. This, along with the many chart options (with helpful suggestions) it provides, helps make data visualization using Tableau a breeze.
Here, race-based population counts are compared across the cities in San Diego County. Stacking up the bars this way allows one not only to get a sense of the racial composition within each city, but also to make comparisons across several cities.
Bubbles are a great way of sizing up and comparing one or more quantities. Here, male and female population counts are represented as bubbles. Based on this plot, gender ratios appear to be very well balanced across the county.
Here we compare the populations of adults (18+) with non-adults (<18) and seniors (65+) with non-seniors (0-64). The visualization has been sorted (a default feature that Tableau provides) by population counts in each category to gain easy access to additional insights.
Dashboards and Stories are two features of Tableau geared towards the presentation of the visualizations created within Tableau. Dashboards allow several visualizations to be juxtaposed within a single view and two or more of them to be controlled using the same triggers, such as filters. This makes for a rich presentation of multiple inter-related visualizations. Stories, as the name suggests, help tell a story by combining narrative elements with visualizations.
This dashboard combines the map-based high-level overview with the ability to select a city and the demographic data fields. The selected city is highlighted both on the map and in the bar plot for the various demographic fields chosen.
This dashboard combines the three visualizations comparing gender, age and race across the county, allowing one to get a sense of the entire dataset, across the county through a single view.