San Diego County: Analysis of Influenza Datasets

This report contains analysis (and related commentary) of data collected on the influenza virus, commonly referred to as the Flu virus, in the San Diego county. California state law requires reporting of influenza outbreaks, including deaths of individuals between the ages of 0 and 64 years resulting from the same. The analysis aims to answer questions under Data Questions below. Results from the analysis can be found under Insights.

Data Source

All of the data used in this analysis is sourced from the open data portal of SD HHSA (San Diego County Health & Human Services Agency) and can be downloaded from this link.

The codefile can be found here. Note that the codefile covers all datasets from the SD HHSA and is not specific to the flu dataset.

Data Questions

These questions are intended to understand the impact of flu virus over time, across age-groups and across communities.

  1. What are the flu trends year-over-year?
  2. Are different communities impacted differently by the flu?
  3. Which age-groups are the most vulnerable and have flu trends changed for these groups?

Data Analysis

Below is the full set of transformations carried out on the dataset, as well as exploratory visualizations on the same to answer the data questions.

In [1]:
import sys
import os
import pandas as pd 
import numpy as np 
import pprint
from itertools import product, groupby

# plotting and display related
import matplotlib.pyplot as plt
from matplotlib import pylab, mlab, gridspec, cm
from IPython.core.pylabtools import figsize, getfigs
from IPython.display import display, HTML
from pylab import *

# custom utilities
import utils.pyutils as pyt

%matplotlib inline
In [2]:
CWD = os.getcwd()

DATADIR = 'data'
DATAFILE = "Flu_Public_Health_Data.csv"

Exploring the Data

We first inspect the fields in the data to decide which ones we'd need for the analysis.

In [3]:
df = pd.read_csv(os.path.join(DATADIR,DATAFILE),skipinitialspace=True)
In [4]:
%pprint
df.columns.values.tolist() 
Pretty printing has been turned OFF
Out[4]:
['CONDITION', 'OUTCOME', 'Year', 'Geography', 'GeoType', 'GeoName', 'GeoID', 'Region', 'SES', 'District', 'Total', 'TotalRate', 'AARate', 'Age0_14', 'Age0_14Rate', 'Age15_24', 'Age15_24Rate', 'Age25_44', 'Age25_44Rate', 'Age45_64', 'Age45_64Rate', 'Age65Plus', 'Age65PlusRate', 'Total_Male', 'Total_MaleRate', 'AA_TotalMaleRate', 'Age0_14_Male', 'Age0_14_MaleRate', 'Age15_24_Male', 'Age15_24_MaleRate', 'Age25_44_Male', 'Age25_44_MaleRate', 'Age45_64_Male', 'Age45_64_MaleRate', 'Age65Plus_Male', 'Age65Plus_MaleRate', 'Total_Female', 'Total_FemaleRate', 'AA_TotalFemaleRate', 'Age0_14_Female', 'Age0_14_FemaleRate', 'Age15_24_Female', 'Age15_24_FemaleRate', 'Age25_44_Female', 'Age25_44_FemaleRate', 'Age45_64_Female', 'Age45_64_FemaleRate', 'Age45Plus_Female', 'Age45Plus_FemaleRate', 'Age65Plus_Female', 'Age65Plus_FemaleRate', 'White_Total', 'White_TotalRate', 'White_Age0_14', 'White_Age0_14Rate', 'White_Age15_24', 'White_Age15_24Rate', 'White_Age25_44', 'White_Age25_44Rate', 'White_Age45_64', 'White_Age45_64Rate', 'White_Age65Plus', 'White_Age65PlusRate', 'White_Total_Male', 'White_Total_MaleRate', 'White_Age0_14_Male', 'White_Age0_14_MaleRate', 'White_Age15_24_Male', 'White_Age15_24_MaleRate', 'White_Age25_44_Male', 'White_Age25_44_MaleRate', 'White_Age45_64_Male', 'White_Age45_64_MaleRate', 'White_Age65Plus_Male', 'White_Age65Plus_MaleRate', 'White_Total_Female', 'White_Total_FemaleRate', 'White_Age0_14_Female', 'White_Age0_14_FemaleRate', 'White_Age15_24_Female', 'White_Age15_24_FemaleRate', 'White_Age25_44_Female', 'White_Age25_44_FemaleRate', 'White_Age45_64_Female', 'White_Age45_64_FemaleRate', 'White_Age45Plus_Female', 'White_Age45Plus_FemaleRate', 'White_Age65Plus_Female', 'White_Age65Plus_FemaleRate', 'Black_Total', 'Black_TotalRate', 'Black_Age0_14', 'Black_Age0_14Rate', 'Black_Age15_24', 'Black_Age15_24Rate', 'Black_Age25_44', 'Black_Age25_44Rate', 'Black_Age45_64', 'Black_Age45_64Rate', 'Black_Age65Plus', 'Black_Age65PlusRate', 'Black_Total_Male', 'Black_Total_MaleRate', 'Black_Age0_14_Male', 'Black_Age0_14_MaleRate', 'Black_Age15_24_Male', 'Black_Age15_24_MaleRate', 'Black_Age25_44_Male', 'Black_Age25_44_MaleRate', 'Black_Age45_64_Male', 'Black_Age45_64_MaleRate', 'Black_Age65Plus_Male', 'Black_Age65Plus_MaleRate', 'Black_Total_Female', 'Black_Total_FemaleRate', 'Black_Age0_14_Female', 'Black_Age0_14_FemaleRate', 'Black_Age15_24_Female', 'Black_Age15_24_FemaleRate', 'Black_Age25_44_Female']
In [5]:
%pprint
Pretty printing has been turned ON

From a look at the fields that are part of this data we can see that it contains breakdowns of the statistics by ethnicity and gender. For our analysis we are not concerned with this breakdown and can drop the related fields.

In [6]:
df = df.iloc[:,:23]
df.head(n=3)
Out[6]:
CONDITION OUTCOME Year Geography GeoType GeoName GeoID Region SES District ... Age0_14 Age0_14Rate Age15_24 Age15_24Rate Age25_44 Age25_44Rate Age45_64 Age45_64Rate Age65Plus Age65PlusRate
0 Flu Death 2015 San Diego County NaN SAN DIEGO COUNTY 99 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 60.0 14.49
1 Flu Death 2015 Central Region Region CENTRAL 3 CENTRAL NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 6.0 12.06
2 Flu Death 2015 Central San Diego SRA CENTRAL SAN DIEGO 1 CENTRAL Low Income Supervisorial District 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3 rows × 23 columns

Let's examine the contents of a few other key fields to get a better sense of what this data contains.

In [7]:
display(HTML("<b>Years</b>: {}".format(list(df.Year.unique()))))
display(HTML("<b>Outcome</b>: {}".format(list(df.OUTCOME.unique()))))
display(HTML("<b>Region</b>: {}".format(list(df.Region.unique()))))
display(HTML("<b>SES</b>: {}".format(list(df.SES.unique()))))
Years: [2015, 2012, 2014, 2013, 2011]
Outcome: ['Death', 'Hospitalization', 'ED Discharge']
Region: [nan, 'CENTRAL', 'EAST', 'NORTH CENTRAL', 'NORTH COASTAL', 'NORTH INLAND', 'SOUTH']
SES: [nan, 'Low Income', 'Lowest Income', 'Moderately High Income', 'Highest Income', 'Moderately Low Income', 'High Income']

The data contains records for only the last 5 years. This is not very helpful when trying to understand long-term trends but nevertheless these records are contiguous and will help understand some short term ones. More importantly, data is available not only for all the major areas in the county but for both fatalities and non-fatal outcomes resulting from the outbreak.

That said, the percentage of the population afflicted by the flu is likely significantly larger than the aggregate of the outcomes recorded in the data since most people either self-medicate or are handled by non-emergency facilities (including their local physician). One should however expect trends in the larger population to be reflected even in this smaller subset comprising of deaths, hospitalizations and Emergency Dept (ED) discharges.

In [8]:
display(HTML("<b>Geo Type</b>: {}".format(list(df.GeoType.unique()))))
Geo Type: [nan, 'Region', 'SRA', 'Municipal', 'SupervisorDistrict']

Geographies

From the GeoType field values listed above one can see that the data is recorded for many different geographies for any given year. From the code book we learn that there are 41 Sub-Regional Area (SRAs), 18 city municipalities, 6 HHSA Service Regions and 5 Supervisory districts. The numbers are reported on the basis of each of these geographies and more so since any one of these geographies cannot easily be summed up in terms of the other. E.g.: An SRA can cut across two regions, as also across municipal boundaries.

Essentially, data must eventually be filtered by the geography type in order to be further analyzed. For this analysis we will opt for data recorded on a city municipality basis. Although, it must be noted that the SRA is the geography of lowest granularity in the data. It is typically good to retain information at the lowest granularity and aggregate them into larger geographies. Further, SRAs have additional information in the form of SES (Socio-Economic Sector) within the data. This information is lost when using larger geographies such as munipalities.

In [9]:
df = df[df.GeoType == "Municipal"]
df.shape
Out[9]:
(242, 23)
In [10]:
pyt.summarize(df)
Dimensions: (242, 23)
CONDITION                     (<U3 null: 0 len: 242 unq: 1, [Flu])
OUTCOME          (<U15 null: 0 len: 242 unq: 3, [Hospitalizatio...
Year             (int32 null: 0 len: 242 unq: 5, [2012, 2015, 2...
Geography        (<U22 null: 0 len: 242 unq: 19, [City of Sante...
GeoType                 (<U9 null: 0 len: 242 unq: 1, [Municipal])
GeoName          (<U14 null: 0 len: 242 unq: 19, [SANTEE, CARLS...
GeoID            (int32 null: 0 len: 242 unq: 19, [16, 1, 2, 5,...
Region                  (float64 null: 242 len: 242 unq: 1, [nan])
SES                     (float64 null: 242 len: 242 unq: 1, [nan])
District                (float64 null: 242 len: 242 unq: 1, [nan])
Total            (float64 null: 78 len: 242 unq: 98, [nan, 5.0,...
TotalRate        (float64 null: 78 len: 242 unq: 164, [nan, 1.9...
AARate           (float64 null: 78 len: 242 unq: 163, [nan, 2.1...
Age0_14          (float64 null: 146 len: 242 unq: 57, [nan, 25....
Age0_14Rate      (float64 null: 146 len: 242 unq: 97, [nan, 133...
Age15_24         (float64 null: 169 len: 242 unq: 40, [nan, 12....
Age15_24Rate     (float64 null: 169 len: 242 unq: 74, [nan, 82....
Age25_44         (float64 null: 147 len: 242 unq: 49, [nan, 5.0...
Age25_44Rate     (float64 null: 147 len: 242 unq: 96, [nan, 138...
Age45_64         (float64 null: 150 len: 242 unq: 41, [nan, 8.0...
Age45_64Rate     (float64 null: 150 len: 242 unq: 91, [nan, 2.5...
Age65Plus        (float64 null: 134 len: 242 unq: 43, [nan, 5.0...
Age65PlusRate    (float64 null: 134 len: 242 unq: 108, [nan, 18...
dtype: object

Missing Data

From the summary of the remaining data we see that a number of fields contain NaN's (represented by the null count in the summary). We purge fields (i.e.: columns) for which all values are NULL while continuing to retain rows for which all key indicators are NULL. Such rows are place holders for regions which reported no data with the specified outcome.

In [11]:
# drop cols with NAs for ALL rows
df = df.dropna(how='all',axis=1)
df.shape

# drop rows with NAs for ALL of the key indicators
#df = df.dropna(subset=['Total','Age0_14','Age15_24','Age25_44','Age45_64','Age65Plus'],how='all')
#df = df.fillna(0)
Out[11]:
(242, 20)
Analysis: Q1, Q3

Note that in the table below we aggregrate numbers by year and outcome. On close inspection one can see that the Total does not add up to the numbers under the various age groups. This is likely due to the fact that numbers less than 5 are not identified on an age group basis even though they show up in the total. This has to do with compliance requirements of healthcare open data owing to privacy issues.

In [12]:
pd.options.display.float_format = '{:,.0f}'.format

# group by Year
df_yr = df.groupby(['Year','OUTCOME'])

cols = ['Total','Age0_14','Age15_24','Age25_44','Age45_64','Age65Plus']

df_yr_totals = df_yr[cols].sum().fillna(0)
df_yr_totals
Out[12]:
Total Age0_14 Age15_24 Age25_44 Age45_64 Age65Plus
Year OUTCOME
2011 Death 6 0 0 0 0 0
ED Discharge 1,742 548 400 476 197 69
Hospitalization 310 32 7 35 49 120
2012 Death 5 0 0 0 0 5
ED Discharge 1,199 384 189 352 150 56
Hospitalization 171 21 5 13 28 68
2013 Death 40 0 0 0 0 30
ED Discharge 4,579 1,916 596 993 685 355
Hospitalization 768 69 8 47 105 471
2014 Death 28 0 0 0 8 9
ED Discharge 3,286 973 511 920 614 243
Hospitalization 549 52 8 65 143 214
2015 Death 44 0 0 0 0 44
ED Discharge 5,186 2,170 695 1,020 723 550
Hospitalization 786 65 7 35 102 517
In [13]:
#plt.style.use = 'default'

matplotlib.rcParams['font.size'] = 10
matplotlib.rcParams['font.weight'] = 'bold'
matplotlib.rcParams['font.family'] = 'sans-serif'
matplotlib.rcParams['font.sans-serif'] = ['Tahoma']

# just plot the totals for each group
df_plot = df_yr_totals.unstack()['Total']

plt.plot(df_plot,marker='o')
plt.legend(df_plot.columns, prop={'size': 12, 'weight': 'regular'})

fig = plt.gcf()
fig.set_size_inches(12, 6)

ax = plt.gca()

ax.grid(False)
ax.set_facecolor('white')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.xaxis.set_major_locator(MaxNLocator(integer=True))

title = "San Diego Cities: Flu Trends (2011-2015), Aggregates"
ax.set_title(title,fontsize=14, fontweight='bold',color='black')
ax.title.set_position([.5,1.05])

plt.show()
In [14]:
pd.options.display.float_format = '{:,.2f}'.format

def hl_max_outcome(s):
    '''
    highlight the maximum in a Series based on outcome.
    '''
    ed = s.xs('ED Discharge',level='OUTCOME')
    d = s.xs('Death',level='OUTCOME')
    h = s.xs('Hospitalization',level='OUTCOME')
    
    #is_max = ((s == ed.max()) | (s == d.max()) | (s == h.max())) & (s != 0)
    #return ['background-color: yellow' if v else '' for v in is_max]
    
    color = []
    for v in s:
        
        if v == 0:
            color.append('')
            continue
        
        if v == d.max():
            color.append('background-color: lightskyblue')
        elif v == h.max():
            color.append('background-color: lightgreen')
        elif v == ed.max():
            color.append('background-color: sandybrown')
        else:
            color.append('')
            
    return color

cols = ['TotalRate','Age0_14Rate','Age15_24Rate','Age25_44Rate','Age45_64Rate','Age65PlusRate']

df_yr_rates = df_yr[cols].agg([np.mean,np.max,np.min]).fillna(0)
df_yr_rates_hl = df_yr_rates.style.apply(hl_max_outcome)

df_yr_rates_hl
Out[14]:
TotalRate Age0_14Rate Age15_24Rate Age25_44Rate Age45_64Rate Age65PlusRate
mean amax amin mean amax amin mean amax amin mean amax amin mean amax amin mean amax amin
Year OUTCOME
2011 Death 0.46 0.46 0.46 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ED Discharge 59.8394 88.54 20.72 107.512 189.07 71.28 94.5207 165.23 47.05 57.7264 83.94 32.61 33.263 52.05 19.4 32.908 55.68 22.45
Hospitalization 10.29 16.5 4.76 10.1333 15.04 6.57 3.47 3.47 3.47 10.49 18.66 4.78 10.7 11.6 9.8 55.3933 69.87 32.19
2012 Death 0.34 0.34 0.34 0 0 0 0 0 0 0 0 0 0 0 0 3.16 3.16 3.16
ED Discharge 43.3418 63.55 18.4 82.2808 153.37 43.54 65.398 128.62 31.62 51.0823 86.54 24.76 32.556 98.26 16.24 22.41 31.31 17.72
Hospitalization 8.57857 14.31 5.29 6.49 7.13 5.85 2.31 2.31 2.31 2.905 3.92 1.89 7.59333 13.45 3.63 39.435 81.16 21.25
2013 Death 4.91667 10.99 1.74 0 0 0 0 0 0 0 0 0 0 0 0 28.5033 63.32 10.61
ED Discharge 150.304 255.62 90.41 335.062 559.97 179.9 127.075 186.36 58.43 125.976 298.85 68.97 102.08 209.14 47.08 110.461 208.12 70.62
Hospitalization 27.6975 54.22 16.36 17.9417 26.59 10.54 3.81 3.81 3.81 10.815 20.49 5.61 37.2783 119.51 13.97 130.921 214.86 77.64
2014 Death 1.84 3.28 0.93 0 0 0 0 0 0 0 0 0 2.51 2.51 2.51 5.72 5.72 5.72
ED Discharge 98.0994 154.56 46.89 169.958 375.67 52.61 111.922 241.03 54.52 99.2947 211.63 33 73.9575 129.22 25.04 66.2023 121.25 40.96
Hospitalization 19.03 30.55 10.05 14.304 26.31 8.53 3.88 3.88 3.88 13.5183 21.5 5.77 28.935 45.85 16.4 60.3191 102.18 42.57
2015 Death 3.3475 7.67 1.73 0 0 0 0 0 0 0 0 0 0 0 0 23.8075 48.04 14.27
ED Discharge 163.573 374.35 67.15 379.297 800.89 133.87 154.313 302.2 82.25 130.378 307.22 56.39 103.512 207.57 31.81 157.35 348.34 85.5
Hospitalization 27.6073 71.84 9.74 13.19 20.68 8.83 3.42 3.42 3.42 7.95667 12.2 4.63 23.705 37.65 15.01 149.985 501.77 48.91

In the above table, the mean, max and min rates (among all cities that reported data) for each outcome (per population of 100,000) can be seen for each of the five years in the dataset. Years 2013 and 2015 standout due to the high rates of death, hospitalization and ED discharges respectively.

Analysis: Q2, Q3

As we look into exploring the dataset for city specific insights we first list out the cities for which samples are recorded. We see that the unincorporated areas of the San Diego county are listed as a single municipal area. It is important to also note the fact that cities within the county vary widely by both population and geographical area.

In [15]:
display(HTML("<b>Geo Name</b>: {}".format(list(df.GeoName.unique()))))
Geo Name: ['SANTEE', 'CARLSBAD', 'CHULA VISTA', 'EL CAJON', 'ENCINITAS', 'ESCONDIDO', 'LA MESA', 'LEMON GROVE', 'NATIONAL CITY', 'OCEANSIDE', 'POWAY', 'SAN DIEGO', 'SAN MARCOS', 'UNINCORPORATED', 'CORONADO', 'VISTA', 'SOLANA BEACH', 'DEL MAR', 'IMPERIAL BEACH']
In [16]:
pd.options.display.float_format = '{:,.0f}'.format

cols = ['Total','Age0_14','Age15_24','Age25_44','Age45_64','Age65Plus']
prop = ['GeoName','Year','OUTCOME']

df_city = df.groupby(prop, as_index=True, group_keys=False)
df_city = df_city.apply(lambda x: x.set_index(prop))
df_city = df_city[cols].fillna(0)
display(df_city.head())
display(df_city.tail())
Total Age0_14 Age15_24 Age25_44 Age45_64 Age65Plus
GeoName Year OUTCOME
CARLSBAD 2011 ED Discharge 53 17 9 15 6 5
Hospitalization 6 0 0 0 0 0
2012 ED Discharge 42 18 0 13 5 0
Hospitalization 0 0 0 0 0 0
2013 Death 0 0 0 0 0 0
Total Age0_14 Age15_24 Age25_44 Age45_64 Age65Plus
GeoName Year OUTCOME
VISTA 2014 Death 0 0 0 0 0 0
ED Discharge 90 35 17 21 10 7
Hospitalization 14 0 0 0 0 5
2015 ED Discharge 186 101 19 39 14 14
Hospitalization 19 0 0 0 0 12
In [17]:
df_city_totals = df_city['Total'].xs(2015,level='Year').unstack()
df_city_totals.plot(kind='bar', stacked=True, colormap='Set1')

fig = plt.gcf()
fig.set_size_inches(18, 6)

plt.legend(df_plot.columns, prop={'size': 12, 'weight': 'regular'})

ax = plt.gca()
ax.grid(False)
ax.set_facecolor('white')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

ax.xaxis.label.set_visible(False)

title = "San Diego Cities: Flu Trends Yr 2015"
ax.set_title(title,fontsize=14, fontweight='bold',color='black')
ax.title.set_position([.5,1.05])

Once we split the data by municipality and plot them, as above, we see how this difference in population and area plays out. The Unincorporated Areas and the City of San Diego together account for a large share of those afflicted by the flu. But given differences in population of the various cities it is best to plot the rates of flu affliction to determine if these cities are indeed prone to bigger outbreaks of the flu than the others.

In [18]:
df_city_sd = df_city.xs('SAN DIEGO',level='GeoName',drop_level=True)
df_unincorporated = df_city.xs('UNINCORPORATED',level='GeoName',drop_level=True)

#df_city = df_city.drop(['SAN DIEGO','UNINCORPORATED'], level='GeoName')

fig, (ax1,ax2) = plt.subplots(1,2)

fig = plt.gcf()
fig.set_size_inches(18, 6)

df_city_sd['Total'].unstack().plot(kind='bar',stacked=True,ax=ax1,colormap='Set1')

title = "City of San Diego: Flu Trends (2011 - 2015)"
ax1.set_title(title,fontsize=12, fontweight='bold',color='black')
ax1.title.set_position([.5,1.05])

df_unincorporated['Total'].unstack().plot(kind='bar',stacked=True,ax=ax2,colormap='Set1')
   
title = "Unincorporated Areas: Flu Trends (2011 - 2015)"
ax2.set_title(title,fontsize=12, fontweight='bold',color='black')
ax2.title.set_position([.5,1.05])

for ax in [ax1,ax2]:
    ax.legend(df_plot.columns, prop={'size': 12, 'weight': 'regular'})
    ax.grid(False)
    ax.set_facecolor('white')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.xaxis.label.set_visible(False)    
    locs, labels = xticks()
    ax.set_xticklabels(labels,rotation='30')
In [19]:
pd.options.display.float_format = '{:,.2f}'.format

cols = ['TotalRate','Age0_14Rate','Age15_24Rate','Age25_44Rate','Age45_64Rate','Age65PlusRate']
prop = ['GeoName','Year','OUTCOME']

df_city = df.groupby(prop, as_index=True, group_keys=False)
df_city = df_city.apply(lambda x: x.set_index(prop))
df_city = df_city[cols].fillna(0)
df_city.xs('SAN DIEGO',level='GeoName')
Out[19]:
TotalRate Age0_14Rate Age15_24Rate Age25_44Rate Age45_64Rate Age65PlusRate
Year OUTCOME
2011 Death 0.46 0.00 0.00 0.00 0.00 0.00
ED Discharge 52.05 82.85 74.20 50.98 26.19 26.44
Hospitalization 11.03 8.79 3.47 4.78 11.60 43.66
2012 Death 0.34 0.00 0.00 0.00 0.00 3.16
ED Discharge 32.34 54.61 34.35 31.53 19.51 22.85
Hospitalization 5.66 5.85 2.31 1.89 5.70 21.25
2013 Death 1.74 0.00 0.00 0.00 0.00 11.58
ED Discharge 138.59 301.65 131.48 100.85 88.56 102.13
Hospitalization 23.26 11.53 3.81 6.13 13.97 137.64
2014 Death 1.31 0.00 0.00 0.00 2.51 5.72
ED Discharge 102.91 147.55 114.34 95.72 88.91 67.73
Hospitalization 17.23 8.53 3.88 5.77 23.03 68.72
2015 Death 1.73 0.00 0.00 0.00 0.00 14.73
ED Discharge 158.61 371.32 148.43 97.84 95.84 145.22
Hospitalization 25.41 20.68 3.42 4.63 15.01 137.97
In [20]:
df_city_rates = df_city['TotalRate'].xs(2015,level='Year').unstack()
df_city_rates = df_city_rates.sort_values(['ED Discharge','Hospitalization','Death'],ascending=False)

fig, (ax1,ax2) = plt.subplots(1,2)
fig.set_size_inches(18, 12)

df_city_rates.plot(kind='barh', stacked=False, ax=ax1, colormap='Set1')

ax1.legend(df_plot.columns, prop={'size': 12, 'weight': 'regular'})
ax1.grid(False)
ax1.set_facecolor('white')
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)

ax1.yaxis.label.set_visible(False)

title = "San Diego Cities: Flu Rates Yr 2015"
ax1.set_title(title,fontsize=14, fontweight='bold',color='black')
ax1.title.set_position([.5,1.05])

df_city_totals['NonFatalities'] = df_city_totals['ED Discharge'] + df_city_totals['Hospitalization']

labels = df_city_totals.index
sizes = df_city_totals.NonFatalities.fillna(0).values
#cs=cm.Set1(sizes/len(sizes))
cmap = cm.OrRd
cs = cmap(np.linspace(0., 1., len(labels)))

pie_whole = ax2.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=False, startangle=90, colors=cs, labeldistance=1.05)
ax2.axis('equal')

for slice in pie_whole[0]:
    slice.set_edgecolor('white')

title = "San Diego Cities: Flu (Non-Fatalities) Yr 2015"
ax2.set_title(title,fontsize=14, fontweight='bold',color='black')
ax2.title.set_position([.5,1.05])
Analysis: Q3

We've seen both flu incidences as well as the rate of flu afflictions for various age groups across both different time windows as well as geographies as part of the analysis for Q1 and Q2. Here, we split the data on age take a look at how they are distributed across the various age-groups.

In [21]:
pd.options.display.float_format = '{:,.0f}'.format

total = ['Total']
seniors = ['Age65Plus']
nonseniors = ['Age0_14','Age15_24','Age25_44','Age45_64']

nonfatalities = ['ED Discharge','Hospitalization']

# add a new column with aggregated nos across non-senior age groups
df_yr_totals['Age0_64'] = df_yr_totals[nonseniors].sum(axis=1)
# sum up non-fatal outcomes for each year and age-group
df_nf = df_yr_totals.loc(axis=0)[:,nonfatalities].groupby(level=['Year']).sum(axis=0)

# get this data-frame in the right format so we can concatenate it to df_yr_totals
df_nf['OUTCOME'] = ['NonFatalities'] * 5
df_nf.set_index('OUTCOME', append=True, inplace=True)
df_nf.reorder_levels(['Year','OUTCOME'])
#df_nf

df_age_totals = pd.concat([df_yr_totals,df_nf],axis=0,levels=['Year']).sort_index(axis=0)
df_age_totals.tail(n=8)
Out[21]:
Total Age0_14 Age15_24 Age25_44 Age45_64 Age65Plus Age0_64
Year OUTCOME
2014 Death 28 0 0 0 8 9 8
ED Discharge 3,286 973 511 920 614 243 3,018
Hospitalization 549 52 8 65 143 214 268
NonFatalities 3,835 1,025 519 985 757 457 3,286
2015 Death 44 0 0 0 0 44 0
ED Discharge 5,186 2,170 695 1,020 723 550 4,608
Hospitalization 786 65 7 35 102 517 209
NonFatalities 5,972 2,235 702 1,055 825 1,067 4,817
In [54]:
# sort columns here so we can mitigate errors resulting from 
# re-ordering caused by stack()
df_age_totals = df_age_totals.reindex_axis(sorted(df_age_totals.columns), axis=1)

df_plot = df_age_totals.unstack(level=-1)
# stack causes re-ordering of index values
# CR: https://github.com/pandas-dev/pandas/issues/15105
df_plot = df_plot.stack(level=0)

years = [2015,2014,2013,2012,2011]
groups = ['Age0_14','Age0_64','Age65Plus']
outcomes = ['Death','Hospitalization','ED Discharge']

df_plot = df_plot.loc(axis=0)[years,groups]
df_plot = df_plot[outcomes]

# convert to multi-index columns
#df_plot.columns.rename('GROUPS',level=0,inplace=True)

#levels = [list(i.values) for i in df_plot.columns.levels]
#levels = [['Stats']] + levels

#df_plot.columns = pd.MultiIndex.from_product(levels, names=(None, 'GROUPS','OUTCOME'))
#df_plot.head(n=3)
In [34]:
def add_line(ax, xpos, ypos):
    line = plt.Line2D([xpos, xpos], [ypos + .1, ypos],
                      transform=ax.transAxes, color='black')
    line.set_clip_on(False)
    ax.add_line(line)

def label_len(my_index,level):
    labels = my_index.get_level_values(level)
    return [(k, sum(1 for i in g)) for k,g in groupby(labels)]

def label_group_bar_table(ax, df):
    ypos = -.1
    scale = 1./df.index.size
    for level in range(df.index.nlevels)[::-1]:
        pos = 0
        for label, rpos in label_len(df.index,level):
            lxpos = (pos + .5 * rpos)*scale
            ax.text(lxpos, ypos, label, ha='center', transform=ax.transAxes)
            add_line(ax, pos*scale, ypos)
            pos += rpos
        add_line(ax, pos*scale , ypos)
        ypos -= .1
In [57]:
fig = plt.gcf()
fig.set_size_inches(18, 8)

ax = fig.add_subplot(111)

df_plot.plot(kind='bar',stacked=False,ax=fig.gca(),colormap='Set1')

#Below 3 lines remove default labels
labels = ['' for item in ax.get_xticklabels()]
ax.set_xticklabels(labels)
ax.set_xlabel('')

label_group_bar_table(ax, df_plot)
fig.subplots_adjust(bottom=.1*df.index.nlevels)

#ax.legend(df_plot.columns, prop={'size': 12, 'weight': 'regular'})
ax.grid(False)
ax.set_facecolor('white')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

title = "San Diego Cities: Flu Trends (2011-2015), Aggregates"
ax.set_title(title,fontsize=14, fontweight='bold',color='black')
ax.title.set_position([.5,1.05])

plt.show()

Insights

  • In any given year, the number of ED discharges outnumber the hospitalizations and both vastly outnumber the fatalities. This is true across all age groups except seniors (65 years and older) and across all municipalities represented in the dataset. Among seniors, the number of hospitalizations are likely to be closer to ED discharges, often overtaking them.
  • In the last 5 years spanning 2011-2015, 2013 and 2015 had larger outbreaks of flu than the other years, characterized by significantly higher rates of flu among all age-groups. Note: Such differences can also be the result of more robust data reporting and gathering in those years.
  • In the year 2015, the City of San Diego and the Unincorporate Areas accounted for a significantly larger share of all flu incidences (non-fatal) with approximately 42% and 14% of the total across all municipalities. This was a result of these regions having a greater total population than a bigger outbreak of the flu as revealed by inspecting the flu rates for the year. This is likely to be a pattern repeated each year, although this was not explored here.
  • In the year 2015, eight municipalities had higher rates of flu (non-fatal incidences) than the City of San Diego, with Lemon Grove having the highest rate.
  • Seniors are the dominant group affected by fatalities from flu with outcomes among non-seniors being dominated by ED discharges.