This is an analysis of the homeless population in San Diego County. Part of the analysis compares homelessness trends in San Diego with those in two other counties of similar socio-economic characteristics. The analysis aims to answer the questions listed under Data Questions below. Results from the analysis can be found under Insights.
All of the data used in this analysis is sourced from the HUD Exchange, the information portal of the US Department of Housing and Urban Development (HUD). Cities and counties across the US are required by HUD to report annual counts of the homeless. These counts are captured across the US on a single day in January through the PIT (Point-in-Time) count. The raw dataset containing PIT counts for all counties can be downloaded from this link: PIT Counts Since 2007. Information on the methodology used to capture this data is here.
What follows is the full set of transformations carried out on the data, along with the exploratory visualizations and statistical tests used to answer the data questions.
import sys
import os
import pandas as pd
import numpy as np
from scipy import stats
import warnings
warnings.filterwarnings(action='once')
import matplotlib.pyplot as plt
from matplotlib import pylab, mlab, gridspec, cm
from IPython.core.pylabtools import figsize, getfigs
from IPython.display import display, HTML, display_html
from pylab import *
import seaborn as sns
#sns.set(color_codes=True)
# custom utilities
import utils.pyutils as pyt
%matplotlib inline
def display_tables(*args):
    # render several DataFrames side by side as inline HTML tables
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)
CWD = os.getcwd()
DATADIR = 'data'
DATAFILE = "pit-counts-by-coc.xlsx"
# load the spreadsheet
xl = pd.ExcelFile(os.path.join(DATADIR,DATAFILE))
print(xl.sheet_names)
We can see from the output above that the dataset has 10 years' worth of homelessness data, spanning 2007-2016. We will use all 10 years of this data for the counties of our choice.
One aspect of this analysis is the comparison of homelessness trends in San Diego County with those in other US counties of similar socio-economic characteristics. Specifically, the following size and socio-economic indicators were used to assess similarity:
Note: This still leaves out other factors such as Unemployment Rate, Median Home Price and Population Density, all of which have an impact on homelessness. For example, denser areas are likely to have greater economic opportunities.
The IndexMundi site was used to compare county stats and the following counties were chosen based on the criteria listed above.
Of the two, Orange County is clearly the more similar to San Diego County, given that both are prominent Southern California regions in close proximity to each other. Dallas County was chosen precisely because it lacks this geographic proximity to San Diego and yet is similar enough for comparison.
COUNTIES = ['San Diego City and County CoC', 'Dallas City & County/Irving CoC','Santa Ana/Anaheim/Orange County CoC']
df = pd.DataFrame()
# iterate through each year's data and collate them into a single dataframe
for sheet in xl.sheet_names[:-2]:
    tmp_df = xl.parse(sheet)
    #filter out data for only counties we want
    tmp_df = tmp_df[tmp_df['CoC Name'].isin(COUNTIES)]
    #retain only data that is present for all years
    tmp_df = tmp_df.iloc[:,1:11]
    #rename cols to remove year so they can be merged
    tmp_df.columns = [s.split(',')[0] for s in tmp_df.columns]
    tmp_df['Year'] = int(sheet)
    #display(tmp_df)
    # append each data frame
    df = pd.concat([df,tmp_df],ignore_index=True,axis='rows')
#convert columns to numeric
cols = df.columns.tolist()
cols = cols[1:-1]
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
pd.options.display.float_format = '{:.0f}'.format
#df = df.set_index('Year')
df.head(n=6)
pyt.summarize(df)
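As a minimal sketch of the collation step above, the following uses two small in-memory frames in place of the Excel sheets. The CoC names, the counts, and the `", <year>"` column suffix are assumptions made for illustration; the suffix is inferred from the `split(',')` call rather than confirmed against the workbook.

```python
import pandas as pd

# Hypothetical per-year sheets standing in for the workbook.
# Column names are assumed to end in ", <year>".
sheets = {
    "2007": pd.DataFrame({"CoC Name": ["A CoC", "B CoC"],
                          "Total Homeless, 2007": [100, 250]}),
    "2008": pd.DataFrame({"CoC Name": ["A CoC", "B CoC"],
                          "Total Homeless, 2008": [120, 240]}),
}

frames = []
for year, sheet_df in sheets.items():
    # keep only the regions of interest
    tmp = sheet_df[sheet_df["CoC Name"].isin(["A CoC"])].copy()
    # drop the year suffix so that columns line up across sheets
    tmp.columns = [c.split(",")[0] for c in tmp.columns]
    tmp["Year"] = int(year)
    frames.append(tmp)

# one concat at the end instead of growing a frame inside the loop
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Collecting the per-year frames in a list and concatenating once is a small design choice; it gives the same result as concatenating inside the loop.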
Often, plotting time-series data is enough to identify a definite trend (upward, downward, or cyclical), and that is what we do here with the annual homeless counts for San Diego County.
df_sd = df[df['CoC Name'] == COUNTIES[0]]
df_plot = df_sd.drop(columns=['CoC Name','Sheltered Homeless Individuals','Unsheltered Homeless Individuals',
'Sheltered Homeless People in Families','Unsheltered Homeless People in Families'])
# rearrange cols so we can plot paired values appropriately
cols = df_plot.columns.tolist()
cols = cols[1:] + cols[0:1]
df_plot = df_plot[cols]
# convert the data to long-format, suitable for plotting
df_melted = df_plot.melt('Year', var_name='Category', value_name='Count')
#display_tables(df_melted.head(n=2),df_melted.tail(n=2))
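The wide-to-long conversion above can be illustrated on a toy frame. This is only a sketch; the two years and the counts are made up, and only the `melt` call mirrors the code above.

```python
import pandas as pd

# Made-up wide-format data: one row per year, one column per category
wide = pd.DataFrame({
    "Year": [2007, 2008],
    "Sheltered Homeless": [10, 12],
    "Unsheltered Homeless": [20, 18],
})

# Long format: one row per (Year, Category) pair
long_df = wide.melt("Year", var_name="Category", value_name="Count")
print(long_df)
```

Each count column becomes a block of rows labelled by its column name, which is the shape that the `hue` argument of the plotting call expects.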
sns.set(style="ticks", rc={'lines.linewidth': 0.9,
'font.size': 10,
'font.weight': 'bold',
'font.family': 'sans-serif',
'font.sans-serif': ['Tahoma'],
'legend.fontsize': 12,
'legend.handlelength': 2,
'xtick.major.pad': 20,
'ytick.major.pad': 20})
g = sns.factorplot(x='Year', y='Count', hue='Category', data=df_melted,
palette="Paired",
size = 10, aspect = 1.5,
legend=False)
sns.despine()
fig = g.fig
ax = fig.get_axes()[0]
fig.set_size_inches(12, 6)
ax.grid(False)
ax.set_facecolor('white')
ax.legend(bbox_to_anchor=(1, 1), loc=2, borderaxespad=0.)
ax.xaxis.label.set_visible(False)
ax.yaxis.label.set_visible(False)
title = "San Diego County: Homeless Counts (2007-2016)"
ax.set_title(title,fontsize=14, fontweight='bold',color='black')
ax.title.set_position([.5,1.05])
No clear trend emerges from the plot above, which points to a certain complexity in the region's homelessness situation. A look at the spread of these counts should give a better understanding.
# np.float64 is referenced explicitly rather than via the pylab star import
df_plot.describe(include=[np.float64])
fig, ax = plt.subplots()
g = sns.boxplot(x="Count", y="Category", data=df_melted, whis=np.inf, palette="Paired",ax=ax)
g = sns.swarmplot(x="Count", y="Category", data=df_melted, size=4, color="red", linewidth=0)
g = sns.despine()
fig.set_size_inches(10, 5)
ax.grid(False)
ax.set_facecolor('white')
ax.legend(bbox_to_anchor=(1, 1), loc=2, borderaxespad=0.)
ax.xaxis.label.set_visible(False)
ax.yaxis.label.set_visible(False)
title = "Summary Stats: San Diego County Homeless Counts (2007-2016)"
ax.set_title(title,fontsize=13, fontweight='bold',color='black')
ax.title.set_position([.5,1.05])
It is telling that the spreads for the sheltered homeless and for homeless people in families are much tighter than the others.
For the sheltered homeless, this indicates that either shelter capacity has remained more or less the same over the last decade or access to shelters is poorly managed. Given the many reports of shelters being overrun, it is hard not to conclude that it is more the former.
The smaller spread for homeless families is likely due to factors that afford families greater economic opportunities: the possible presence of multiple adults (each of whom can earn a living), the availability of a greater share of government programs (primarily for families with children), and social factors such as a higher motivation to get out of a homeless situation or never to enter one. This is also consistent with the widely reported observation that many of the homeless are mentally ill people who are separated from their families, whether through abandonment or for some other reason.
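One way to put a number on "tighter spread" is the coefficient of variation (standard deviation divided by mean), which allows categories of different sizes to be compared. The sketch below uses made-up counts, not the PIT data, and is only meant to show the calculation.

```python
import pandas as pd

# Made-up yearly counts for two categories (not the actual PIT figures)
counts = pd.DataFrame({
    "Sheltered Homeless": [4000, 4100, 3900, 4050],
    "Unsheltered Homeless": [3000, 5000, 4200, 5600],
})

# Coefficient of variation per column: sample std relative to the mean
cv = counts.std() / counts.mean()
print(cv.round(3))
```

A smaller value indicates counts that stay closer to their average from year to year.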
It would also be interesting to plot the pairwise relationships between the variables in this dataset.
# compare only the homeless counts
cols = list(df_plot.select_dtypes(include=[np.float64]).columns.values)
sns.set(style="ticks", rc={'xtick.major.pad': 5,
'ytick.major.pad': 5,
'font.weight': 'regular',})
plt.rcParams["axes.labelpad"] = 20
plt.rcParams["axes.labelweight"] = 'bold'
def corrfunc(x, y, **kws):
    # annotate the current axes with the Pearson correlation coefficient
    r, _ = stats.pearsonr(x, y)
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r),
                xy=(.1, .9), xycoords=ax.transAxes, color='red')
g = sns.pairplot(df_plot[cols], x_vars=["Homeless Individuals", "Homeless People in Families", "Total Homeless"],
y_vars=["Unsheltered Homeless", "Sheltered Homeless"], kind="reg", size=3)
g.map(corrfunc)
g.fig.suptitle('Homeless Category Relationships: San Diego County', fontweight='bold', fontsize=13);
# used to adjust space between each plot in the grid
plt.subplots_adjust(hspace=0.3, wspace=0.6)
# adjust distance between title and grid
plt.subplots_adjust(top=0.85)
Based on the above plots (and the r values listed), the counts of homeless individuals track the unsheltered counts closely, which suggests that homeless individuals mostly tend to be unsheltered and that the total homeless population is dominated by those who are unsheltered. Homeless people in families also show a moderately high correlation with the sheltered homeless, indicating that this group is likely to be mostly sheltered.
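For reference, the r value annotated on each panel is the Pearson correlation coefficient. The sketch below computes it by hand on made-up numbers and checks the result against `scipy.stats.pearsonr`; the data are purely illustrative.

```python
import numpy as np
from scipy import stats

# Made-up paired observations (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# r = sum((x - mean_x)(y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
xc, yc = x - x.mean(), y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

r_scipy, p_value = stats.pearsonr(x, y)
print(round(r_manual, 4), round(r_scipy, 4))
```

With only ten yearly observations per series, r values from the real data should be read as rough indications rather than precise estimates.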
A key aspect of understanding time-series data such as what we have here is identifying the presence or absence of a trend. This is assessed by fitting a line through the time-series data and measuring its slope. Here we pick the data for Sheltered Homeless and Unsheltered Homeless, since these seem to be representative of the other trend lines, as seen in the trend plot above.
pd.options.display.float_format = '{:.2f}'.format
df_sd_stats = pd.DataFrame(data=np.zeros((2,5)),columns=['slope','intercept','r_val','p_val','se'],
index = ['Sheltered Homeless','Unsheltered Homeless'])
df_trends = df_sd.sort_values(['Year']).reset_index()
df_trends['Time Delta'] = (df_trends['Year'] - df_trends['Year'][0]).values
df_sd_stats.loc['Sheltered Homeless'] = list(stats.linregress(df_trends['Time Delta'],df_trends['Sheltered Homeless']))
df_sd_stats.loc['Unsheltered Homeless'] = list(stats.linregress(df_trends['Time Delta'],df_trends['Unsheltered Homeless']))
df_sd_stats
Here is the gist of what the above results indicate:
Note that this interpretation comes from our attempt to fit a linear model; the only conclusion one can draw from it is that the homeless trends for the county are not linear.
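As a hedged illustration of how the output of `stats.linregress` can be read, the sketch below fits a line to a made-up series that fluctuates without drifting. The counts and the 0.05 significance level are assumptions for the example, not values from this analysis.

```python
import numpy as np
from scipy import stats

years = np.arange(10)  # time delta 0..9
# Made-up counts that bounce around with no steady drift
counts = np.array([50, 80, 60, 90, 40, 85, 55, 95, 45, 70])

result = stats.linregress(years, counts)

alpha = 0.05  # assumed significance level
# A slope whose p-value exceeds alpha gives no evidence of a linear trend
has_linear_trend = result.pvalue < alpha

print("slope={:.2f} r={:.2f} p={:.2f}".format(
    result.slope, result.rvalue, result.pvalue))
print("linear trend supported:", has_linear_trend)
```

A failure to find a linear trend does not mean the series is flat; it only means that a straight line is a poor description of it.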
sns.set(style="ticks")
fig, (ax1,ax2) = plt.subplots(1,2)
fig = plt.gcf()
fig.set_size_inches(18, 6)
p = sns.regplot(x='Time Delta', y='Sheltered Homeless', data=df_trends, order=4, ci=None, scatter_kws={"s": 50}, ax=ax1)
p = p.set(xticks=np.arange(0,10,1))
p = sns.regplot(x='Time Delta', y='Unsheltered Homeless', data=df_trends, order=4, ci=None, scatter_kws={"s": 50}, ax=ax2)
p = p.set(xticks=np.arange(0,10,1))
sns.despine()
for ax in (ax1,ax2):
    ax.grid(False)
    ax.set_facecolor('white')
    ax.xaxis.labelpad = 20
    ax.yaxis.labelpad = 20
    ax.set_xlabel("Time Delta (Years)",fontsize=12,fontweight='bold')
ax1.set_ylabel("Sheltered Homeless Count",fontsize=12,fontweight='bold');
ax2.set_ylabel("Unsheltered Homeless Count",fontsize=12,fontweight='bold');
A great way to test whether to fit a linear or a non-linear model, and even to experiment with the parameters of each, is to use Seaborn's lmplot and regplot functions. Here we fit fourth-order polynomials (order=4) to the county's sheltered and unsheltered homeless counts. That said, polynomial fits result in complex models and may not be the best way to understand trends; with only ten data points per series, a fourth-order fit is also prone to overfitting. Some approaches favor piecewise linear models over non-linear ones.
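To make the comparison between linear, polynomial and piecewise linear fits concrete, here is a minimal sketch on a made-up series that rises and then falls. The data and the breakpoint at index 5 are assumptions chosen for illustration; in practice the breakpoint would have to be chosen or estimated.

```python
import numpy as np

t = np.arange(10, dtype=float)
# Made-up series: steady rise for five years, then a steady fall
y = np.array([5., 6., 7., 8., 9., 9., 7., 5., 3., 1.])

def sse(y_true, y_pred):
    """Sum of squared errors between observations and fitted values."""
    return float(((y_true - y_pred) ** 2).sum())

# Single straight line and a fourth-order polynomial over the whole range
linear = np.polyval(np.polyfit(t, y, 1), t)
quartic = np.polyval(np.polyfit(t, y, 4), t)

# Piecewise linear: two independent lines split at an assumed breakpoint
bp = 5
left = np.polyval(np.polyfit(t[:bp], y[:bp], 1), t[:bp])
right = np.polyval(np.polyfit(t[bp:], y[bp:], 1), t[bp:])
piecewise = np.concatenate([left, right])

print("linear SSE:   ", round(sse(y, linear), 3))
print("quartic SSE:  ", round(sse(y, quartic), 3))
print("piecewise SSE:", round(sse(y, piecewise), 3))
```

On this toy series the two-segment fit describes the data with four easily interpreted parameters (two slopes, two intercepts), which is the appeal of piecewise models that the paragraph above alludes to.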
Comparing trends across regions with similar socio-economic characteristics offers insights in that it can help identify differences in policy, all else being equal. But because this is a comparison of observational data, as opposed to data from a controlled experiment, many other factors are likely to influence any differences in trends. Nevertheless, it is a worthwhile pursuit, especially if it yields significant differences.
We start by plotting trends for the counties of choice.
# function to plot the dataframe passed in and save it as a PNG
def table_as_png(df):
    from pandas.plotting import table
    # set fig size
    fig, ax = plt.subplots(figsize=(12, 3),dpi=100);
    p = plt.subplots_adjust(top=0.97, bottom=0.2, left=0.05, right=0.97, hspace=0.2);
    # no axes
    ax.xaxis.set_visible(False)
    ax.yaxis.set_visible(False)
    # no frame
    ax.set_frame_on(False)
    # plot table
    tab = table(ax, df, loc='upper right',colWidths=[0.17]*len(df.columns),
                cellLoc = 'center', rowLoc = 'center');
    # set font manually
    tab.auto_set_font_size(False)
    tab.set_fontsize(8)
    # save the result
    plt.savefig('table.png',dpi=200,transparent=True)
    plt.close();
pd.options.display.float_format = '{:.0f}'.format
df_plot = df.drop(columns=['Sheltered Homeless Individuals','Unsheltered Homeless Individuals',
'Sheltered Homeless People in Families','Unsheltered Homeless People in Families'])
# rename county names
df_plot['CoC Name'] = df_plot['CoC Name'].map({ 'San Diego City and County CoC': 'San Diego',
'Dallas City & County/Irving CoC': 'Dallas',
'Santa Ana/Anaheim/Orange County CoC': 'Orange'})
# rearrange cols so we can plot paired values appropriately
cols = df_plot.columns.tolist()
cols = [cols[-1]] + [cols[0]] + cols[1:-1]
df_plot = df_plot[cols]
df_plot = df_plot.rename(columns={'CoC Name':'County'})
# save this so we can use it to stack the dataframe
categories = cols[2:]
#df_plot.head(n=6)
df_county = df_plot.groupby(['County'])
df_county_means = df_county[categories].aggregate(np.mean)
table_as_png(df_county_means);
display(HTML("<h6 style='text-align: left'>Average Homeless Counts (2007-2016)</h6>"))
df_county_means
# convert the data to long-format, suitable for plotting
df_melted = df_plot.set_index(['Year','County']).rename_axis(['Category'],axis=1).stack().reset_index()
df_melted = df_melted.rename(columns={0:'Count'})
#display_tables(df_melted.head(n=5),df_melted.tail(n=5))
sns.set(style="ticks")
# ndarray.sort() sorts in place and returns None, so use sorted() to get the ordered years
order = sorted(df_melted['Year'].unique())
hue_order=['San Diego','Dallas','Orange']
def plot(x,y, **kwargs):
    # draw one facet: a point plot per county, with standard-deviation error bars
    data = kwargs.pop('data')
    sns.pointplot(x,y, hue='County', data=data, ci='sd', errwidth=1, order=order, hue_order=hue_order, scale=0.5)
g = sns.FacetGrid(df_melted, col='Category', col_wrap=3, sharey='col', size=4.5, aspect=1.1)
g = g.map_dataframe(plot,'Year','Count', palette='deep')
g = g.set_titles(col_template="{col_name}", fontweight='bold', fontsize=11)
g.axes[4].legend(loc='upper right', bbox_to_anchor=(1.4, 1, 0, 0), fontsize=13)
for ax in g.axes.flat:
    ax.xaxis.label.set_visible(False)
    ax.yaxis.label.set_visible(False)
    plt.setp(ax.get_xticklabels(), visible=True)
    ax.set_xticklabels(ax.get_xticklabels(),rotation=45 )
# used to adjust space between each plot in the grid
plt.subplots_adjust(hspace=0.3, wspace=0.1)
# adjust distance between title and grid
plt.subplots_adjust(top=0.85)
g.fig.suptitle('Homeless Trend Comparison: San Diego, Dallas and Orange Counties', fontweight='bold', fontsize=15);
The following are observable from these plots:
NOTE: Given the non-linearities in the trend lines for the various homeless categories, comparing them across counties using statistical techniques is non-trivial. That, combined with the fact that trend differences are easily observable in the plots above, means that statistical tests to quantify trend differences have been left out of this analysis.
The following are the key findings of this analysis:
A comprehensive report on the homeless across the US for the 10 years from 2007 to 2016 can be found here: