This report contains analysis (and related commentary) of data collected on the influenza virus, commonly referred to as the flu, in San Diego County. California state law requires reporting of influenza outbreaks, including deaths from influenza of individuals between the ages of 0 and 64 years. The analysis aims to answer the questions listed under Data Questions below; results can be found under Insights.
All of the data used in this analysis is sourced from the open data portal of SD HHSA (San Diego County Health & Human Services Agency) and can be downloaded from this link.
The code file can be found here. Note that it covers all datasets from the SD HHSA and is not specific to the flu dataset.
These questions are intended to help us understand the impact of the flu over time, across age groups and across communities.
Below is the full set of transformations carried out on the dataset, along with the exploratory visualizations used to answer the data questions.
import sys
import os
import pandas as pd
import numpy as np
import pprint
from itertools import product, groupby
# plotting and display related
import matplotlib.pyplot as plt
from matplotlib import pylab, mlab, gridspec, cm
from IPython.core.pylabtools import figsize, getfigs
from IPython.display import display, HTML
from pylab import *
# custom utilities
import utils.pyutils as pyt
%matplotlib inline
CWD = os.getcwd()
DATADIR = 'data'
DATAFILE = "Flu_Public_Health_Data.csv"
We first inspect the fields in the data to decide which ones we'd need for the analysis.
df = pd.read_csv(os.path.join(DATADIR,DATAFILE),skipinitialspace=True)
%pprint
df.columns.values.tolist()
%pprint
A look at the fields shows that the data contains breakdowns of the statistics by ethnicity and gender. Our analysis is not concerned with these breakdowns, so the related fields can be dropped.
df = df.iloc[:,:23]
df.head(n=3)
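For reference, and as a quick check on the positional slice above, the columns it removed can be listed by re-reading just the CSV header and comparing it against the trimmed frame (a small sketch; it assumes the file is still at the same path):
# re-read only the header row and list the breakdown columns dropped by the slice above
all_cols = pd.read_csv(os.path.join(DATADIR, DATAFILE), nrows=0).columns
dropped_cols = [c for c in all_cols if c not in df.columns]
dropped_cols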
Let's examine the contents of a few other key fields to get a better sense of what this data contains.
display(HTML("<b>Years</b>: {}".format(list(df.Year.unique()))))
display(HTML("<b>Outcome</b>: {}".format(list(df.OUTCOME.unique()))))
display(HTML("<b>Region</b>: {}".format(list(df.Region.unique()))))
display(HTML("<b>SES</b>: {}".format(list(df.SES.unique()))))
The data contains records for only the last five years. That is too short a window for long-term trends, but the records are contiguous and can still reveal short-term ones. More importantly, data is available not only for all the major areas in the county but also for both fatal and non-fatal outcomes of the outbreaks.
That said, the share of the population afflicted by the flu is likely much larger than the aggregate of the outcomes recorded here, since most people either self-medicate or are treated at non-emergency facilities (including their local physician). One should, however, expect trends in the larger population to be reflected even in this smaller subset comprising deaths, hospitalizations and Emergency Department (ED) discharges.
display(HTML("<b>Geo Type</b>: {}".format(list(df.GeoType.unique()))))
From the GeoType values listed above one can see that the data is recorded for several different geographies in any given year. From the code book we learn that there are 41 Sub-Regional Areas (SRAs), 18 city municipalities, 6 HHSA Service Regions and 5 Supervisory districts. The numbers are reported separately for each of these geographies, not least because one geography cannot easily be expressed as a sum of another; an SRA, for example, can cut across both region and municipal boundaries.
Essentially, the data must be filtered by geography type before it can be analyzed further. For this analysis we opt for data recorded on a city municipality basis. It should be noted, though, that the SRA is the finest-grained geography in the data; it is usually preferable to retain information at the finest granularity and aggregate it up into larger geographies. SRAs also carry additional information in the form of SES (Socio-Economic Sector), which is lost when using larger geographies such as municipalities.
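Before filtering, a quick sanity check against the code book counts (41 SRAs, 18 municipalities, 6 regions, 5 districts) is to count the distinct geography names recorded under each GeoType:
# distinct geography names per geography type, as a sanity check against the code book
df.groupby('GeoType')['GeoName'].nunique()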
df = df[df.GeoType == "Municipal"]
df.shape
pyt.summarize(df)
From the summary of the remaining data we see that a number of fields contain NaNs (represented by the null count in the summary). We purge fields (i.e. columns) whose values are NULL for every row, while retaining rows whose key indicators are all NULL: such rows are placeholders for regions that reported no data for the specified outcome.
# drop cols with NAs for ALL rows
df = df.dropna(how='all',axis=1)
df.shape
# drop rows with NAs for ALL of the key indicators
#df = df.dropna(subset=['Total','Age0_14','Age15_24','Age25_44','Age45_64','Age65Plus'],how='all')
#df = df.fillna(0)
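It is also worth quantifying how many of the retained rows are pure placeholders, i.e. rows where every key indicator is NULL (a quick check using the same indicator columns referenced above):
# count placeholder rows: every key indicator column is NULL for these rows
indicator_cols = ['Total','Age0_14','Age15_24','Age25_44','Age45_64','Age65Plus']
df[indicator_cols].isnull().all(axis=1).sum()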
Note that in the table below we aggregate numbers by year and outcome. On close inspection one can see that the Total does not equal the sum of the age-group columns. This is likely because counts of less than 5 are suppressed at the age-group level even though they are included in the total, a privacy-driven compliance requirement for healthcare open data.
pd.options.display.float_format = '{:,.0f}'.format
# group by Year
df_yr = df.groupby(['Year','OUTCOME'])
cols = ['Total','Age0_14','Age15_24','Age25_44','Age45_64','Age65Plus']
df_yr_totals = df_yr[cols].sum().fillna(0)
df_yr_totals
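The size of the gap attributable to the suppressed small counts can be made explicit by subtracting the sum of the age-group columns from the reported Total in the aggregated table above:
# gap between the reported Total and the sum of the age-group columns;
# a positive gap reflects counts suppressed (less than 5) at the age-group level
age_cols = ['Age0_14','Age15_24','Age25_44','Age45_64','Age65Plus']
df_yr_totals['Total'] - df_yr_totals[age_cols].sum(axis=1)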
#plt.style.use = 'default'
matplotlib.rcParams['font.size'] = 10
matplotlib.rcParams['font.weight'] = 'bold'
matplotlib.rcParams['font.family'] = 'sans-serif'
matplotlib.rcParams['font.sans-serif'] = ['Tahoma']
# just plot the totals for each group
df_plot = df_yr_totals.unstack()['Total']
plt.plot(df_plot,marker='o')
plt.legend(df_plot.columns, prop={'size': 12, 'weight': 'regular'})
fig = plt.gcf()
fig.set_size_inches(12, 6)
ax = plt.gca()
ax.grid(False)
ax.set_facecolor('white')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
title = "San Diego Cities: Flu Trends (2011-2015), Aggregates"
ax.set_title(title,fontsize=14, fontweight='bold',color='black')
ax.title.set_position([.5,1.05])
plt.show()
pd.options.display.float_format = '{:,.2f}'.format
def hl_max_outcome(s):
'''
highlight the maximum in a Series based on outcome.
'''
ed = s.xs('ED Discharge',level='OUTCOME')
d = s.xs('Death',level='OUTCOME')
h = s.xs('Hospitalization',level='OUTCOME')
#is_max = ((s == ed.max()) | (s == d.max()) | (s == h.max())) & (s != 0)
#return ['background-color: yellow' if v else '' for v in is_max]
color = []
for v in s:
if v == 0:
color.append('')
continue
if v == d.max():
color.append('background-color: lightskyblue')
elif v == h.max():
color.append('background-color: lightgreen')
elif v == ed.max():
color.append('background-color: sandybrown')
else:
color.append('')
return color
cols = ['TotalRate','Age0_14Rate','Age15_24Rate','Age25_44Rate','Age45_64Rate','Age65PlusRate']
df_yr_rates = df_yr[cols].agg([np.mean,np.max,np.min]).fillna(0)
df_yr_rates_hl = df_yr_rates.style.apply(hl_max_outcome)
df_yr_rates_hl
The table above shows the mean, max and min rates (per 100,000 population, among all cities that reported data) for each outcome in each of the five years in the dataset. Years 2013 and 2015 stand out for their high rates of death, hospitalization and ED discharges.
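To see which city is behind the maximum TotalRate in each year and outcome, one can index back into the municipal records (a sketch; it assumes every Year/OUTCOME group has at least one non-null TotalRate):
# locate the city reporting the highest TotalRate within each Year/OUTCOME group
# (assumes at least one non-null TotalRate per group)
idx_max = df.groupby(['Year','OUTCOME'])['TotalRate'].idxmax()
df.loc[idx_max, ['Year','OUTCOME','GeoName','TotalRate']]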
Turning to city-specific insights, we first list the cities for which records exist. The unincorporated areas of San Diego County are listed as a single municipal area. Note also that cities within the county vary widely in both population and geographical area.
display(HTML("<b>Geo Name</b>: {}".format(list(df.GeoName.unique()))))
pd.options.display.float_format = '{:,.0f}'.format
cols = ['Total','Age0_14','Age15_24','Age25_44','Age45_64','Age65Plus']
prop = ['GeoName','Year','OUTCOME']
df_city = df.groupby(prop, as_index=True, group_keys=False)
df_city = df_city.apply(lambda x: x.set_index(prop))
df_city = df_city[cols].fillna(0)
display(df_city.head())
display(df_city.tail())
df_city_totals = df_city['Total'].xs(2015,level='Year').unstack()
df_city_totals.plot(kind='bar', stacked=True, colormap='Set1')
fig = plt.gcf()
fig.set_size_inches(18, 6)
plt.legend(df_city_totals.columns, prop={'size': 12, 'weight': 'regular'})
ax = plt.gca()
ax.grid(False)
ax.set_facecolor('white')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.xaxis.label.set_visible(False)
title = "San Diego Cities: Flu Trends Yr 2015"
ax.set_title(title,fontsize=14, fontweight='bold',color='black')
ax.title.set_position([.5,1.05])
Once we split the data by municipality and plot it, as above, we see how these differences in population and area play out. The Unincorporated Areas and the City of San Diego together account for a large share of those afflicted by the flu. Given the differences in population between cities, however, it is better to plot rates of flu affliction to determine whether these cities are truly more prone to large outbreaks than the others.
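That dominance can be quantified directly from the 2015 totals plotted above (a quick check on df_city_totals; shares are of reported counts only):
# each municipality's percentage share of all reported 2015 outcomes, largest first
city_share = 100 * df_city_totals.sum(axis=1) / df_city_totals.sum(axis=1).sum()
city_share.sort_values(ascending=False).head()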
df_city_sd = df_city.xs('SAN DIEGO',level='GeoName',drop_level=True)
df_unincorporated = df_city.xs('UNINCORPORATED',level='GeoName',drop_level=True)
#df_city = df_city.drop(['SAN DIEGO','UNINCORPORATED'], level='GeoName')
fig, (ax1,ax2) = plt.subplots(1,2)
fig = plt.gcf()
fig.set_size_inches(18, 6)
df_city_sd['Total'].unstack().plot(kind='bar',stacked=True,ax=ax1,colormap='Set1')
title = "City of San Diego: Flu Trends (2011 - 2015)"
ax1.set_title(title,fontsize=12, fontweight='bold',color='black')
ax1.title.set_position([.5,1.05])
df_unincorporated['Total'].unstack().plot(kind='bar',stacked=True,ax=ax2,colormap='Set1')
title = "Unincorporated Areas: Flu Trends (2011 - 2015)"
ax2.set_title(title,fontsize=12, fontweight='bold',color='black')
ax2.title.set_position([.5,1.05])
for ax in [ax1,ax2]:
ax.legend(df_plot.columns, prop={'size': 12, 'weight': 'regular'})
ax.grid(False)
ax.set_facecolor('white')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.xaxis.label.set_visible(False)
    # rotate the year tick labels on each axes directly (avoids relying on pylab's current-axes xticks())
    plt.setp(ax.get_xticklabels(), rotation=30)
pd.options.display.float_format = '{:,.2f}'.format
cols = ['TotalRate','Age0_14Rate','Age15_24Rate','Age25_44Rate','Age45_64Rate','Age65PlusRate']
prop = ['GeoName','Year','OUTCOME']
df_city = df.groupby(prop, as_index=True, group_keys=False)
df_city = df_city.apply(lambda x: x.set_index(prop))
df_city = df_city[cols].fillna(0)
df_city.xs('SAN DIEGO',level='GeoName')
df_city_rates = df_city['TotalRate'].xs(2015,level='Year').unstack()
df_city_rates = df_city_rates.sort_values(['ED Discharge','Hospitalization','Death'],ascending=False)
fig, (ax1,ax2) = plt.subplots(1,2)
fig.set_size_inches(18, 12)
df_city_rates.plot(kind='barh', stacked=False, ax=ax1, colormap='Set1')
ax1.legend(df_city_rates.columns, prop={'size': 12, 'weight': 'regular'})
ax1.grid(False)
ax1.set_facecolor('white')
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)
ax1.yaxis.label.set_visible(False)
title = "San Diego Cities: Flu Rates Yr 2015"
ax1.set_title(title,fontsize=14, fontweight='bold',color='black')
ax1.title.set_position([.5,1.05])
df_city_totals['NonFatalities'] = df_city_totals['ED Discharge'] + df_city_totals['Hospitalization']
labels = df_city_totals.index
sizes = df_city_totals.NonFatalities.fillna(0).values
#cs=cm.Set1(sizes/len(sizes))
cmap = cm.OrRd
cs = cmap(np.linspace(0., 1., len(labels)))
pie_whole = ax2.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=False, startangle=90, colors=cs, labeldistance=1.05)
ax2.axis('equal')
for wedge in pie_whole[0]:
    wedge.set_edgecolor('white')
title = "San Diego Cities: Flu (Non-Fatalities) Yr 2015"
ax2.set_title(title,fontsize=14, fontweight='bold',color='black')
ax2.title.set_position([.5,1.05])
We have seen both flu incidence counts and flu affliction rates for the various age groups, across time windows and geographies, as part of the analysis for Q1 and Q2. Here we split the data by age and look at how the outcomes are distributed across the age groups.
pd.options.display.float_format = '{:,.0f}'.format
total = ['Total']
seniors = ['Age65Plus']
nonseniors = ['Age0_14','Age15_24','Age25_44','Age45_64']
nonfatalities = ['ED Discharge','Hospitalization']
# add a new column with aggregated nos across non-senior age groups
df_yr_totals['Age0_64'] = df_yr_totals[nonseniors].sum(axis=1)
# sum up non-fatal outcomes for each year and age-group
df_nf = df_yr_totals.loc(axis=0)[:,nonfatalities].groupby(level='Year').sum()
# get this data-frame in the right format so we can concatenate it to df_yr_totals
df_nf['OUTCOME'] = 'NonFatalities'
df_nf.set_index('OUTCOME', append=True, inplace=True)
df_nf = df_nf.reorder_levels(['Year','OUTCOME'])
#df_nf
df_age_totals = pd.concat([df_yr_totals, df_nf]).sort_index()
df_age_totals.tail(n=8)
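As a complementary view on the same table (a quick sketch; the shares will not sum to exactly 100% because of the small counts suppressed at the age-group level), each age group's share of the yearly total can be computed per outcome:
# percentage share of each age group in the reported yearly total for every outcome
# (suppressed small counts mean the shares need not sum to exactly 100)
age_cols = ['Age0_14','Age15_24','Age25_44','Age45_64','Age65Plus']
(df_age_totals[age_cols].div(df_age_totals['Total'], axis=0) * 100)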
# sort columns here so we can mitigate errors resulting from
# re-ordering caused by stack()
df_age_totals = df_age_totals.reindex(columns=sorted(df_age_totals.columns))
df_plot = df_age_totals.unstack(level=-1)
# stack causes re-ordering of index values
# CR: https://github.com/pandas-dev/pandas/issues/15105
df_plot = df_plot.stack(level=0)
years = [2015,2014,2013,2012,2011]
groups = ['Age0_14','Age0_64','Age65Plus']
outcomes = ['Death','Hospitalization','ED Discharge']
df_plot = df_plot.loc(axis=0)[years,groups]
df_plot = df_plot[outcomes]
# convert to multi-index columns
#df_plot.columns.rename('GROUPS',level=0,inplace=True)
#levels = [list(i.values) for i in df_plot.columns.levels]
#levels = [['Stats']] + levels
#df_plot.columns = pd.MultiIndex.from_product(levels, names=(None, 'GROUPS','OUTCOME'))
#df_plot.head(n=3)
def add_line(ax, xpos, ypos):
line = plt.Line2D([xpos, xpos], [ypos + .1, ypos],
transform=ax.transAxes, color='black')
line.set_clip_on(False)
ax.add_line(line)
def label_len(my_index,level):
labels = my_index.get_level_values(level)
return [(k, sum(1 for i in g)) for k,g in groupby(labels)]
def label_group_bar_table(ax, df):
ypos = -.1
scale = 1./df.index.size
for level in range(df.index.nlevels)[::-1]:
pos = 0
for label, rpos in label_len(df.index,level):
lxpos = (pos + .5 * rpos)*scale
ax.text(lxpos, ypos, label, ha='center', transform=ax.transAxes)
add_line(ax, pos*scale, ypos)
pos += rpos
add_line(ax, pos*scale , ypos)
ypos -= .1
fig = plt.gcf()
fig.set_size_inches(18, 8)
ax = fig.add_subplot(111)
df_plot.plot(kind='bar',stacked=False,ax=fig.gca(),colormap='Set1')
#Below 3 lines remove default labels
labels = ['' for item in ax.get_xticklabels()]
ax.set_xticklabels(labels)
ax.set_xlabel('')
label_group_bar_table(ax, df_plot)
fig.subplots_adjust(bottom=.1*df_plot.index.nlevels)
#ax.legend(df_plot.columns, prop={'size': 12, 'weight': 'regular'})
ax.grid(False)
ax.set_facecolor('white')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
title = "San Diego Cities: Flu Trends (2011-2015), Aggregates"
ax.set_title(title,fontsize=14, fontweight='bold',color='black')
ax.title.set_position([.5,1.05])
plt.show()