Homeless in San Diego: A Comparative Analysis¶

This is an analysis of the homeless population of San Diego County. Part of the analysis compares homelessness trends in San Diego with those of two other counties that have similar socio-economic characteristics. The analysis aims to answer the questions listed under Data Questions below; the results are summarized under Insights.

Data Source¶

All of the data used in this analysis is sourced from the HUD Exchange, the information portal of the US Department of Housing and Urban Development (HUD). HUD requires cities and counties across the US to report counts of the homeless annually. The counts are captured on a single day in January, across the US, through the PIT (Point-in-Time) count. The raw dataset containing PIT counts for all counties can be downloaded from this link: PIT Counts Since 2007. Information on the methodology used to capture this data is here.

Data Questions¶

What does the homelessness trend in San Diego look like?
How does homelessness in San Diego compare to other counties?

Data Analysis¶

Following is the full set of transformations carried out on the data, as well as exploratory visualizations and statistical tests used to answer the data questions.

import sys
import os
import pandas as pd 
import numpy as np 

from scipy import stats
  
import warnings
warnings.filterwarnings(action='once')

import matplotlib.pyplot as plt
from matplotlib import pylab, mlab, gridspec, cm
from IPython.core.pylabtools import figsize, getfigs
from IPython.display import display, HTML, display_html
from pylab import *

import seaborn as sns
#sns.set(color_codes=True)

# custom utilities
import utils.pyutils as pyt  

%matplotlib inline


def display_tables(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)

CWD = os.getcwd()

DATADIR = 'data'
DATAFILE = "pit-counts-by-coc.xlsx"

# load the spreadsheet
xl = pd.ExcelFile(os.path.join(DATADIR,DATAFILE))
print(xl.sheet_names)

['2016', '2015', '2014', '2013', '2012', '2011', '2010', '2009', '2008', '2007', 'CoC Mergers', 'Revisions']

We can see from the sheet names above that the dataset has 10 years' worth of homelessness data, spanning 2007-2016. We will use all 10 years of this data for the counties of our choice.
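The collation loop further down selects the year sheets with the positional slice `xl.sheet_names[:-2]`, which assumes the two non-year sheets always come last. A minimal sketch of a more defensive selection, assuming year sheets are exactly those with four-digit names (the helper name `year_sheets` is hypothetical and not part of the original notebook):

```python
def year_sheets(sheet_names):
    """Return only the sheet names that look like four-digit years."""
    return [name for name in sheet_names if name.isdigit() and len(name) == 4]

# sheet names as printed by xl.sheet_names above
sheet_names = ['2016', '2015', '2014', '2013', '2012', '2011', '2010',
               '2009', '2008', '2007', 'CoC Mergers', 'Revisions']

years = year_sheets(sheet_names)
print(len(years), min(years), max(years))  # prints: 10 2007 2016
```

For this workbook the result is the same as the slice; the difference only matters if the sheet order changes in a later release of the file.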

Counties for Comparison¶

One aspect of this analysis is the comparison of homelessness trends in San Diego County with those of other US counties that have similar socio-economic characteristics. Specifically, the following size and socio-economic indicators were used to assess similarity:

Population Size
Housing Units per Household
Demographic Makeup - Age, Ethnicity
Median Household Income
Population (%) Below Poverty Level
Per Capita Income

Note: This still leaves out other factors such as Unemployment Rate, Median Home Price and Population Density, all of which have an impact on homelessness. For example, denser areas are likely to have greater economic opportunities.

The IndexMundi site was used to compare county stats and the following counties were chosen based on the criteria listed above.

Clearly, Orange County is far more similar to San Diego County, given that both are prominent Southern California regions in close proximity to each other. Dallas County was chosen precisely because it does not share this geographic proximity to San Diego and yet is similar enough for comparison.

COUNTIES = ['San Diego City and County CoC', 'Dallas City & County/Irving CoC','Santa Ana/Anaheim/Orange County CoC']

df = pd.DataFrame()
# iterate through each year's data and collate them into a single dataframe
for sheet in xl.sheet_names[:-2]:
    tmp_df = xl.parse(sheet)
    #filter out data for only counties we want
    tmp_df = tmp_df[tmp_df['CoC Name'].isin(COUNTIES)]
    #retain only data that is present for all years
    tmp_df = tmp_df.iloc[:,1:11]
    #rename cols to remove year so they can be merged
    tmp_df.columns = [s.split(',')[0] for s in tmp_df.columns]
    tmp_df['Year'] = int(sheet)    
    #display(tmp_df)   
    # append each data frame
    df = pd.concat([df,tmp_df],ignore_index=True,axis='rows')
    
#convert columns to numeric
cols = df.columns.tolist()
cols = cols[1:-1]
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

pd.options.display.float_format = '{:.0f}'.format
#df = df.set_index('Year')
df.head(n=6)

pyt.summarize(df)

Dimensions: (30, 11)
CoC Name                                   (<U35 null: 0 len: 30 unq: 3, [San Diego City ...
Total Homeless                             (float64 null: 0 len: 30 unq: 28, [8669.0, 431...
Sheltered Homeless                         (float64 null: 0 len: 30 unq: 27, [3729.0, 211...
Unsheltered Homeless                       (float64 null: 0 len: 30 unq: 25, [4940.0, 220...
Homeless Individuals                       (float64 null: 0 len: 30 unq: 28, [6955.0, 302...
Sheltered Homeless Individuals             (float64 null: 0 len: 30 unq: 28, [2297.0, 833...
Unsheltered Homeless Individuals           (float64 null: 0 len: 30 unq: 25, [4658.0, 219...
Homeless People in Families                (float64 null: 0 len: 30 unq: 27, [1714.0, 129...
Sheltered Homeless People in Families      (float64 null: 0 len: 30 unq: 28, [1432.0, 128...
Unsheltered Homeless People in Families    (float64 null: 0 len: 30 unq: 22, [282.0, 6.0,...
Year                                       (int32 null: 0 len: 30 unq: 10, [2016, 2015, 2...
dtype: object
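The collation loop above relies on stripping the year from each column header so that the yearly sheets can be stacked under common column names. A minimal sketch of that one step, assuming headers of the form "<measure>, <year>" (the example headers are hypothetical, since the raw spreadsheet headers are not shown here, and `strip_year` is an illustrative helper):

```python
def strip_year(column_name):
    """Keep only the text before the first comma, e.g. 'Total Homeless, 2016' -> 'Total Homeless'."""
    return column_name.split(',')[0].strip()

# hypothetical headers for two yearly sheets (assumed format, not copied from the HUD file)
headers_2016 = ['CoC Name', 'Total Homeless, 2016', 'Sheltered Homeless, 2016']
headers_2015 = ['CoC Name', 'Total Homeless, 2015', 'Sheltered Homeless, 2015']

normalized_2016 = [strip_year(h) for h in headers_2016]
normalized_2015 = [strip_year(h) for h in headers_2015]

# once normalized, both sheets share the same column names and can be concatenated
print(normalized_2016 == normalized_2015)
```

Headers without a comma, such as `CoC Name`, pass through unchanged, which is why the same rule can be applied to every column.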

Analysis: Q1¶

Often, plotting time-series data is enough to identify a definite trend (upward, downward, or cyclical) and that is what we will do with the annual homeless counts for San Diego county.

df_sd = df[df['CoC Name'] == COUNTIES[0]]
df_plot = df_sd.drop(columns=['CoC Name','Sheltered Homeless Individuals','Unsheltered Homeless Individuals',
                                'Sheltered Homeless People in Families','Unsheltered Homeless People in Families'])

# rearrange cols so we can plot paired values appropriately
cols = df_plot.columns.tolist()
cols = cols[1:] + cols[0:1]
df_plot = df_plot[cols]

# convert the data to long-format, suitable for plotting
df_melted = df_plot.melt('Year', var_name='Category',  value_name='Count')

#display_tables(df_melted.head(n=2),df_melted.tail(n=2))

sns.set(style="ticks", rc={'lines.linewidth': 0.9,
                           'font.size': 10,
                           'font.weight': 'bold',
                           'font.family': 'sans-serif',
                           'font.sans-serif': ['Tahoma'],
                           'legend.fontsize': 12,
                           'legend.handlelength': 2,
                           'xtick.major.pad': 20,
                           'ytick.major.pad': 20})

g = sns.factorplot(x='Year', y='Count', hue='Category', data=df_melted, 
                   palette="Paired",
                   size = 10, aspect = 1.5,
                   legend=False)
sns.despine()

fig = g.fig
ax = fig.get_axes()[0]

fig.set_size_inches(12, 6)

ax.grid(False)
ax.set_facecolor('white')
ax.legend(bbox_to_anchor=(1, 1), loc=2, borderaxespad=0.)
ax.xaxis.label.set_visible(False)
ax.yaxis.label.set_visible(False)

title = "San Diego County: Homeless Counts (2007-2016)"
ax.set_title(title,fontsize=14, fontweight='bold',color='black')
ax.title.set_position([.5,1.05])

No clear trend emerges from the plot above, which points to a certain complexity in the region's homelessness situation. A look at the spread of these counts should give a better understanding.

df_plot.describe(include=[np.float64])

fig, ax = plt.subplots()

g = sns.boxplot(x="Count", y="Category", data=df_melted, whis=np.inf, palette="Paired",ax=ax)
g = sns.swarmplot(x="Count", y="Category", data=df_melted, size=4, color="red", linewidth=0)
g = sns.despine()

fig.set_size_inches(10, 5)

ax.grid(False)
ax.set_facecolor('white')
ax.legend(bbox_to_anchor=(1, 1), loc=2, borderaxespad=0.)
ax.xaxis.label.set_visible(False)
ax.yaxis.label.set_visible(False)

title = "Summary Stats: San Diego County Homeless Counts (2007-2016)" 
ax.set_title(title,fontsize=13, fontweight='bold',color='black')
ax.title.set_position([.5,1.05])

It is telling that the spreads for the homeless in shelters and for those in families are much tighter than the others.

For the sheltered homeless, this indicates that either shelter capacity has remained more or less the same over the last decade or access to shelters is poorly managed. Based on the many reports of shelters being overrun, it is hard not to conclude that it is more of the former.

The smaller spread in homeless families is likely due to factors that afford families greater economic opportunities, both through the presence of possibly multiple adults (who can each earn a living) and through the availability of a greater share of government programs (primarily for families with children), as well as social factors such as a higher motivation to get out of a homeless situation or to never enter one. This is also consistent with the commonly reported observation that many of the homeless are mentally ill people who are separated from their families, whether through abandonment or for some other reason.
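The spreads discussed above are absolute (in numbers of people), so categories with larger counts naturally show larger spreads. A minimal sketch of a relative measure, the coefficient of variation (standard deviation divided by mean), using the rounded mean and std values from the San Diego summary statistics table (the `describe()` output); the choice of this measure is an addition for illustration, not part of the original analysis:

```python
# (mean, std) per category, copied from the rounded describe() output for San Diego
summary = {
    'Sheltered Homeless':          (4179, 295),
    'Unsheltered Homeless':        (4462, 713),
    'Homeless Individuals':        (6688, 753),
    'Homeless People in Families': (1952, 254),
    'Total Homeless':              (8640, 734),
}

# coefficient of variation = std / mean (unitless, so comparable across categories)
cv = {category: std / mean for category, (mean, std) in summary.items()}

# list categories from tightest to widest relative spread
for category, value in sorted(cv.items(), key=lambda item: item[1]):
    print(f"{category:<30} {value:.1%}")
```

On these rounded figures the sheltered count remains the tightest in relative terms, while the family count is tight mainly in absolute terms, so the two views need not rank the categories the same way.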

Pair-Wise Relationships¶

Something else that would be interesting to plot here would be the pair-wise relationships between variables in this dataset.

# compare only the homeless counts
cols = list(df_plot.select_dtypes(include=[np.float64]).columns.values)

sns.set(style="ticks", rc={'xtick.major.pad': 5,
                           'ytick.major.pad': 5,
                           'font.weight': 'regular',})

plt.rcParams["axes.labelpad"] = 20
plt.rcParams["axes.labelweight"] = 'bold'

def corrfunc(x, y, **kws):
    r, _ = stats.pearsonr(x, y)
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r),
                xy=(.1, .9), xycoords=ax.transAxes, color='red')

g = sns.pairplot(df_plot[cols], x_vars=["Homeless Individuals", "Homeless People in Families", "Total Homeless"],
                 y_vars=["Unsheltered Homeless", "Sheltered Homeless"], kind="reg", size=3)
g.map(corrfunc)

g.fig.suptitle('Homeless Category Relationships: San Diego County', fontweight='bold', fontsize=13);

# used to adjust space between each plot in the grid
plt.subplots_adjust(hspace=0.3, wspace=0.6)
# adjust distance between title and grid
plt.subplots_adjust(top=0.85)

Based on the above plots (and the r values listed), we can see that homeless individuals mostly tend to be unsheltered and that the total homeless population is dominated by the unsheltered. Homeless people in families also show a moderately high correlation with the sheltered homeless, suggesting that this group is likely to be mostly sheltered.
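For reference, the r value annotated on each panel is the Pearson correlation coefficient: the covariance of the two series divided by the product of their standard deviations. A minimal sketch of that calculation, using made-up counts purely for illustration (they are NOT values from the HUD dataset, and `pearson_r` is a hypothetical helper rather than the notebook's `corrfunc`):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their individual variation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations from the mean of x
    dy = y - y.mean()   # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# made-up illustrative yearly counts (assumption, not real data)
individuals = [5300, 6100, 6400, 6700, 7000, 7700]
unsheltered = [3400, 3900, 4300, 4200, 4900, 5600]

r = pearson_r(individuals, unsheltered)
print(f"r = {r:.2f}")
```

A high r says that two counts rise and fall together from year to year; the share of one group that falls inside another is better read directly from the counts themselves.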

Trend Analysis¶

A key aspect of understanding time-series data such as ours is identifying the presence or absence of a trend. This is assessed by fitting a line through the time-series data and measuring its slope. Here we pick the data for Sheltered Homeless and Unsheltered Homeless, since these seem to be representative of the other trend lines, as seen from the plot above.

pd.options.display.float_format = '{:.2f}'.format

df_sd_stats = pd.DataFrame(data=np.zeros((2,5)),columns=['slope','intercept','r_val','p_val','se'],
                          index = ['Sheltered Homeless','Unsheltered Homeless'])

df_trends = df_sd.sort_values(['Year']).reset_index()
df_trends['Time Delta'] = (df_trends['Year'] - df_trends['Year'][0]).values

df_sd_stats.loc['Sheltered Homeless'] = list(stats.linregress(df_trends['Time Delta'],df_trends['Sheltered Homeless']))
df_sd_stats.loc['Unsheltered Homeless'] = list(stats.linregress(df_trends['Time Delta'],df_trends['Unsheltered Homeless']))

df_sd_stats

Here is the gist of what the above results indicate:

The slopes for both trends are positive, indicating an upward trend over time.
The p-values for both trends are greater than 0.05, implying that while a trend is present it is not statistically significant (i.e., the slope does not differ significantly from 0).
The r-value for the sheltered homeless is low enough (<= 0.20) to indicate little to no linear correlation between time and the sheltered homeless count. In other words, it does not vary significantly with time.
The r-value for the unsheltered homeless is moderate (0.20 <= r <= 0.40), which is to say that there is some correlation between time and the unsheltered homeless count. In other words, it varies slightly with time and, based on the slope, it increases with time.
The standard error (se) values for the two trends reflect their spreads: the sheltered homeless counts have a tighter spread and a much smaller se.

Note that this interpretation comes from our attempt to fit a linear model; the most one can conclude from it is that a straight line describes the county's homeless trends poorly.
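The columns of the regression table are not independent of each other, which gives a quick consistency check on the results. For a simple linear regression on n points, the slope divided by its standard error is the t-statistic, and r = t / sqrt(t^2 + n - 2). A minimal sketch using the rounded table values and n = 10 annual counts (the helper name `r_from_slope` is hypothetical):

```python
import math

N_YEARS = 10  # one PIT count per year, 2007-2016

# (slope, standard error of the slope), rounded values from the regression table
fits = {
    'Sheltered Homeless':   (19.19, 33.72),
    'Unsheltered Homeless': (94.24, 76.26),
}

def r_from_slope(slope, se, n):
    """Recover the correlation coefficient from the slope's t-statistic."""
    t = slope / se                           # t-statistic for the hypothesis slope == 0
    return t / math.sqrt(t * t + (n - 2))    # n - 2 degrees of freedom

recovered = {name: r_from_slope(slope, se, N_YEARS) for name, (slope, se) in fits.items()}
for name, r in recovered.items():
    print(f"{name}: r = {r:.2f}")
```

The recovered values round to 0.20 and 0.40, in agreement with the r_val column, and both t-statistics are small (well under 2), which is in line with the large p-values reported.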

sns.set(style="ticks")

fig, (ax1,ax2) = plt.subplots(1,2)

fig = plt.gcf()
fig.set_size_inches(18, 6)

p = sns.regplot(x='Time Delta', y='Sheltered Homeless', data=df_trends, order=4, ci=None, scatter_kws={"s": 50}, ax=ax1)
p = p.set(xticks=np.arange(0,10,1))

p = sns.regplot(x='Time Delta', y='Unsheltered Homeless', data=df_trends, order=4, ci=None, scatter_kws={"s": 50}, ax=ax2)
p = p.set(xticks=np.arange(0,10,1))

sns.despine()

for ax in (ax1,ax2):
    ax.grid(False)
    ax.set_facecolor('white')
    ax.xaxis.labelpad = 20
    ax.yaxis.labelpad = 20
    ax.set_xlabel("Time Delta (Years)",fontsize=12,fontweight='bold')
    
ax1.set_ylabel("Sheltered Homeless Count",fontsize=12,fontweight='bold');
ax2.set_ylabel("Unsheltered Homeless Count",fontsize=12,fontweight='bold');

Seaborn's lmplot and regplot functions are a convenient way to test whether a linear or a non-linear model fits better, and to experiment with the parameters of each. Here we show fourth-order polynomial fits (order=4) for the county's sheltered and unsheltered homeless counts. That said, polynomial fits produce complex models and may not be the best way to understand trends; some approaches favor piecewise linear models over non-linear ones.
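One reason for caution with higher-order fits is that in-sample error can only go down as the polynomial degree goes up, so a closer fit is not by itself evidence of a better model. A minimal sketch with NumPy, using made-up counts for illustration (they are NOT the HUD figures, and `residual_sum_of_squares` is a hypothetical helper):

```python
import numpy as np

# time delta in years and made-up illustrative counts (assumption, not real data)
t = np.arange(10)
counts = np.array([3700, 3900, 4100, 4300, 4600, 4400, 4200, 4100, 4500, 4600], dtype=float)

def residual_sum_of_squares(degree):
    """Fit a polynomial of the given degree and return the squared error on the fitted points."""
    coeffs = np.polyfit(t, counts, deg=degree)
    fitted = np.polyval(coeffs, t)
    return float(((counts - fitted) ** 2).sum())

# compare a straight line, a quadratic and a fourth-order polynomial
rss = {degree: residual_sum_of_squares(degree) for degree in (1, 2, 4)}
for degree, value in rss.items():
    print(f"degree {degree}: RSS = {value:.0f}")
```

With only 10 points per series, a fourth-order polynomial uses 5 parameters, so some of the apparent improvement over a straight line is simply extra flexibility.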

Analysis: Q2¶

Comparing trends across regions with similar socio-economic characteristics can help identify differences in policy, all else being equal. But because this is a comparison of observational data, as opposed to data from a controlled experiment, many other factors are likely to influence any differences in trends. Nevertheless, this is a worthwhile pursuit, especially if it yields significant differences.

We start by plotting trends for the counties of choice.

# function to plot the dataframe passed in and save it as a PNG
def table_as_png(df):
    
    from pandas.plotting import table
    
    # set fig size
    fig, ax = plt.subplots(figsize=(12, 3),dpi=100); 
    p = plt.subplots_adjust(top=0.97, bottom=0.2, left=0.05, right=0.97, hspace=0.2);
    # no axes
    ax.xaxis.set_visible(False)  
    ax.yaxis.set_visible(False)  
    # no frame
    ax.set_frame_on(False)  
    # plot table
    tab = table(ax, df, loc='upper right',colWidths=[0.17]*len(df.columns),
                cellLoc = 'center', rowLoc = 'center');  
    # set font manually
    tab.auto_set_font_size(False)
    tab.set_fontsize(8)     
    # save the result
    plt.savefig('table.png',dpi=200,transparent=True)
    plt.close();

pd.options.display.float_format = '{:.0f}'.format

df_plot = df.drop(columns=['Sheltered Homeless Individuals','Unsheltered Homeless Individuals',
                           'Sheltered Homeless People in Families','Unsheltered Homeless People in Families'])

# rename county names 
df_plot['CoC Name'] = df_plot['CoC Name'].map({ 'San Diego City and County CoC': 'San Diego',
                                                'Dallas City & County/Irving CoC': 'Dallas',
                                                'Santa Ana/Anaheim/Orange County CoC': 'Orange'})

# rearrange cols so we can plot paired values appropriately
cols = df_plot.columns.tolist()
cols = [cols[-1]]  + [cols[0]] + cols[1:-1]
df_plot = df_plot[cols]
df_plot = df_plot.rename(columns={'CoC Name':'County'})

# save this so we can use it to stack the dataframe
categories = cols[2:]

#df_plot.head(n=6)
df_county = df_plot.groupby(['County'])
df_county_means = df_county[categories].aggregate(np.mean)

table_as_png(df_county_means);
display(HTML("<h6 style='text-align: left'>Average Homeless Counts (2007-2016)</h6>"))
df_county_means

# convert the data to long-format, suitable for plotting
df_melted = df_plot.set_index(['Year','County']).rename_axis(['Category'],axis=1).stack().reset_index()
df_melted = df_melted.rename(columns={0:'Count'})
#display_tables(df_melted.head(n=5),df_melted.tail(n=5))

sns.set(style="ticks")

# sorted list of years; note that ndarray.sort() sorts in place and returns None
order = sorted(df_melted['Year'].unique())
hue_order=['San Diego','Dallas','Orange']

def plot(x,y, **kwargs):    
    data = kwargs.pop('data')
    sns.pointplot(x=x, y=y, hue='County', data=data, ci='sd', errwidth=1, order=order, hue_order=hue_order, scale=0.5)
                
g = sns.FacetGrid(df_melted, col='Category', col_wrap=3, sharey='col', size=4.5, aspect=1.1)
g = g.map_dataframe(plot,'Year','Count', palette='deep')

g = g.set_titles(col_template="{col_name}", fontweight='bold', fontsize=11)
g.axes[4].legend(loc='upper right', bbox_to_anchor=(1.4, 1, 0, 0), fontsize=13)

for ax in g.axes.flat:
    ax.xaxis.label.set_visible(False)
    ax.yaxis.label.set_visible(False)
    plt.setp(ax.get_xticklabels(), visible=True)
    ax.set_xticklabels(ax.get_xticklabels(),rotation=45 )

# used to adjust space between each plot in the grid
plt.subplots_adjust(hspace=0.3, wspace=0.1)
# adjust distance between title and grid
plt.subplots_adjust(top=0.85)
g.fig.suptitle('Homeless Trend Comparison: San Diego, Dallas and Orange Counties', fontweight='bold', fontsize=15);

The following are observable from these plots:

All three counties exhibit similar trends for the sheltered homeless and for homeless in families
Dallas county exhibits a significantly different trend for unsheltered homeless in that its unsheltered homeless counts are almost stationary (i.e.: not varying with time) compared to the highly non-linear patterns in both San Diego and Orange counties.
Dallas county has much lower homeless counts (in absolute terms) across all categories. This is likely explained in part by the difference in population size between Dallas and the other counties, but further analysis is required to understand other factors that contribute to these low numbers.
San Diego county has the highest homeless counts across all categories (with the exception of 2 years when its count was lower than that of Orange county) and a significantly higher average than the other counties.
The difference in homeless counts between San Diego and Orange counties is especially striking given their significant similarities. This potentially points to differences in policy that are worth examining.

NOTE: Given the non-linearities in the trend lines for the various homeless categories, comparing them across counties using statistical techniques is non-trivial. That, combined with the fact that trend differences are easily observable from the plots above, means that statistical tests to quantify trend differences have been left out of this analysis.
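Because absolute counts depend on population size, a ratio computed within each county is one simple way to compare counties on a like-for-like basis. A minimal sketch, using the rounded averages from the "Average Homeless Counts (2007-2016)" table, of the share of each county's homeless population that is unsheltered (this ratio is an illustrative addition, not a statistic computed in the original analysis):

```python
# average counts per county, copied from the county means table (rounded values)
county_means = {
    'Dallas':    {'total': 3499, 'unsheltered': 300},
    'Orange':    {'total': 5477, 'unsheltered': 2989},
    'San Diego': {'total': 8640, 'unsheltered': 4462},
}

# fraction of each county's average homeless population that is unsheltered
unsheltered_share = {
    county: values['unsheltered'] / values['total']
    for county, values in county_means.items()
}

for county, share in unsheltered_share.items():
    print(f"{county:<10} {share:.1%}")
```

On these averages, the unsheltered share in Dallas comes out far below the roughly one-half seen in San Diego and Orange counties, which matches the visual impression from the unsheltered panel above.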

Insights¶

The following are the key findings of this analysis:

San Diego homeless counts show an upward, though not statistically significant, trend over the decade spanning 2007-2016
The following two trends were observed across all counties whose data was analysed for the same period
- The number of homeless in shelters has remained steady (neither increasing nor decreasing significantly) over the same period
- Individuals (those without support from families) are the most vulnerable group, prone to homelessness by significantly larger numbers than those within families.
San Diego has greater numbers of homeless on average, across all categories, than Dallas and Orange counties, two counties of similar socio-economic characteristics

Outro¶

A comprehensive report on the homeless across the US for the 10 years from 2007 to 2016 can be found here:

	CoC Name	Total Homeless	Sheltered Homeless	Unsheltered Homeless	Homeless Individuals	Sheltered Homeless Individuals	Unsheltered Homeless Individuals	Homeless People in Families	Sheltered Homeless People in Families	Unsheltered Homeless People in Families	Year
0	San Diego City and County CoC	8669	3729	4940	6955	2297	4658	1714	1432	282	2016
1	Santa Ana/Anaheim/Orange County CoC	4319	2118	2201	3028	833	2195	1291	1285	6	2016
2	Dallas City & County/Irving CoC	3810	3071	739	2633	1900	733	1177	1171	6	2016
3	San Diego City and County CoC	8742	4586	4156	6761	2849	3912	1981	1737	244	2015
4	Santa Ana/Anaheim/Orange County CoC	4452	2251	2201	3073	878	2195	1379	1373	6	2015
5	Dallas City & County/Irving CoC	3141	2778	363	2219	1863	356	922	915	7	2015

	Sheltered Homeless	Unsheltered Homeless	Homeless Individuals	Homeless People in Families	Total Homeless
count	10	10	10	10	10
mean	4179	4462	6688	1952	8640
std	295	713	753	254	734
min	3729	3345	5295	1673	7326
25%	3980	4020	6345	1744	8401
50%	4143	4310	6656	1924	8588
75%	4406	4848	6998	2025	8845
max	4586	5642	7763	2460	10013

	Total Homeless	Sheltered Homeless	Unsheltered Homeless	Homeless Individuals	Homeless People in Families
County
Dallas	3499	3199	300	2082	1417
Orange	5477	2488	2989	4000	1477
San Diego	8640	4179	4462	6688	1952

	slope	intercept	r_val	p_val	se
Sheltered Homeless	19.19	4092.53	0.20	0.58	33.72
Unsheltered Homeless	94.24	4037.44	0.40	0.25	76.26