Homeless in San Diego: A Comparative Analysis

This is an analysis of the homeless population in San Diego County. Part of the analysis compares homelessness trends in San Diego with those in two other counties of similar socio-economic characteristics. The analysis aims to answer the questions listed under Data Questions below. Results can be found under Insights.

Data Source

All of the data used in this analysis is sourced from the HUD Exchange, the information portal of the Housing and Urban Development (HUD) agency of the US government. Cities and counties across the US are required by HUD to report counts of the homeless annually. These counts are captured across the US on a single day in January through the PIT (Point-in-Time) count. The raw dataset containing PIT counts for all counties can be downloaded from this link: PIT Counts Since 2007. Information on the methodology used to capture this data is here.

Data Questions

  1. What does the homelessness trend in San Diego look like?
  2. How does homelessness in San Diego compare to other counties?

Data Analysis

Following is the full set of transformations carried out on the data, as well as exploratory visualizations and statistical tests used to answer the data questions.

In [1]:
import sys
import os
import pandas as pd 
import numpy as np 

from scipy import stats
  
import warnings
warnings.filterwarnings(action='once')

import matplotlib.pyplot as plt
from matplotlib import pylab, mlab, gridspec, cm
from IPython.core.pylabtools import figsize, getfigs
from IPython.display import display, HTML, display_html
from pylab import *

import seaborn as sns
#sns.set(color_codes=True)

# custom utilities
import utils.pyutils as pyt  

%matplotlib inline
In [2]:
def display_tables(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)
In [3]:
CWD = os.getcwd()

DATADIR = 'data'
DATAFILE = "pit-counts-by-coc.xlsx"
In [4]:
# load the spreadsheet
xl = pd.ExcelFile(os.path.join(DATADIR,DATAFILE))
print(xl.sheet_names)
['2016', '2015', '2014', '2013', '2012', '2011', '2010', '2009', '2008', '2007', 'CoC Mergers', 'Revisions']

We can see from the sheet names above that the dataset has 10 years' worth of homelessness data spanning 2007-2016. We will use all 10 years of this data for the counties of our choice.

Counties for Comparison

One aspect of this analysis is the comparison of homelessness trends in San Diego County with those of other US counties with similar socio-economic characteristics. Specifically, the following size and socio-economic indicators were used to assess similarity:

  • Population Size
  • Housing Units per Household
  • Demographic Makeup - Age, Ethnicity
  • Median Household Income
  • Population (%) Below Poverty Level
  • Per Capita Income

Note: This still leaves out other factors such as Unemployment Rate, Median Home Price and Population Density, all of which have an impact on homelessness. E.g.: Denser areas are likely to have greater economic opportunities.

The IndexMundi site was used to compare county stats and the following counties were chosen based on the criteria listed above.

Orange County is clearly the more similar of the two to San Diego County, given that both are prominent Southern California regions in close proximity to each other. Dallas County was chosen precisely because it does not share this geographic proximity to San Diego and yet is similar enough for comparison.

In [5]:
COUNTIES = ['San Diego City and County CoC', 'Dallas City & County/Irving CoC','Santa Ana/Anaheim/Orange County CoC']
In [6]:
df = pd.DataFrame()
# iterate through each year's data and collate them into a single dataframe
for sheet in xl.sheet_names[:-2]:
    tmp_df = xl.parse(sheet)
    #filter out data for only counties we want
    tmp_df = tmp_df[tmp_df['CoC Name'].isin(COUNTIES)]
    #retain only data that is present for all years
    tmp_df = tmp_df.iloc[:,1:11]
    #rename cols to remove year so they can be merged
    tmp_df.columns = [s.split(',')[0] for s in tmp_df.columns]
    tmp_df['Year'] = int(sheet)    
    #display(tmp_df)   
    # append each data frame
    df = pd.concat([df,tmp_df],ignore_index=True,axis='rows')
    
#convert columns to numeric
cols = df.columns.tolist()
cols = cols[1:-1]
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

pd.options.display.float_format = '{:.0f}'.format
#df = df.set_index('Year')
df.head(n=6)    
Out[6]:
CoC Name Total Homeless Sheltered Homeless Unsheltered Homeless Homeless Individuals Sheltered Homeless Individuals Unsheltered Homeless Individuals Homeless People in Families Sheltered Homeless People in Families Unsheltered Homeless People in Families Year
0 San Diego City and County CoC 8669 3729 4940 6955 2297 4658 1714 1432 282 2016
1 Santa Ana/Anaheim/Orange County CoC 4319 2118 2201 3028 833 2195 1291 1285 6 2016
2 Dallas City & County/Irving CoC 3810 3071 739 2633 1900 733 1177 1171 6 2016
3 San Diego City and County CoC 8742 4586 4156 6761 2849 3912 1981 1737 244 2015
4 Santa Ana/Anaheim/Orange County CoC 4452 2251 2201 3073 878 2195 1379 1373 6 2015
5 Dallas City & County/Irving CoC 3141 2778 363 2219 1863 356 922 915 7 2015
In [7]:
pyt.summarize(df)
Dimensions: (30, 11)
CoC Name                                   (<U35 null: 0 len: 30 unq: 3, [San Diego City ...
Total Homeless                             (float64 null: 0 len: 30 unq: 28, [8669.0, 431...
Sheltered Homeless                         (float64 null: 0 len: 30 unq: 27, [3729.0, 211...
Unsheltered Homeless                       (float64 null: 0 len: 30 unq: 25, [4940.0, 220...
Homeless Individuals                       (float64 null: 0 len: 30 unq: 28, [6955.0, 302...
Sheltered Homeless Individuals             (float64 null: 0 len: 30 unq: 28, [2297.0, 833...
Unsheltered Homeless Individuals           (float64 null: 0 len: 30 unq: 25, [4658.0, 219...
Homeless People in Families                (float64 null: 0 len: 30 unq: 27, [1714.0, 129...
Sheltered Homeless People in Families      (float64 null: 0 len: 30 unq: 28, [1432.0, 128...
Unsheltered Homeless People in Families    (float64 null: 0 len: 30 unq: 22, [282.0, 6.0,...
Year                                       (int32 null: 0 len: 30 unq: 10, [2016, 2015, 2...
dtype: object

Analysis: Q1

Often, plotting time-series data is enough to identify a definite trend (upward, downward, or cyclical) and that is what we will do with the annual homeless counts for San Diego county.

In [8]:
df_sd = df[df['CoC Name'] == COUNTIES[0]]
df_plot = df_sd.drop(columns=['CoC Name','Sheltered Homeless Individuals','Unsheltered Homeless Individuals',
                                'Sheltered Homeless People in Families','Unsheltered Homeless People in Families'])

# rearrange cols so we can plot paired values appropriately
cols = df_plot.columns.tolist()
cols = cols[1:] + cols[0:1]
df_plot = df_plot[cols]

# convert the data to long-format, suitable for plotting
df_melted = df_plot.melt('Year', var_name='Category',  value_name='Count')

#display_tables(df_melted.head(n=2),df_melted.tail(n=2))
In [9]:
sns.set(style="ticks", rc={'lines.linewidth': 0.9,
                           'font.size': 10,
                           'font.weight': 'bold',
                           'font.family': 'sans-serif',
                           'font.sans-serif': ['Tahoma'],
                           'legend.fontsize': 12,
                           'legend.handlelength': 2,
                           'xtick.major.pad': 20,
                           'ytick.major.pad': 20})

g = sns.factorplot(x='Year', y='Count', hue='Category', data=df_melted, 
                   palette="Paired",
                   size = 10, aspect = 1.5,
                   legend=False)
sns.despine()

fig = g.fig
ax = fig.get_axes()[0]

fig.set_size_inches(12, 6)

ax.grid(False)
ax.set_facecolor('white')
ax.legend(bbox_to_anchor=(1, 1), loc=2, borderaxespad=0.)
ax.xaxis.label.set_visible(False)
ax.yaxis.label.set_visible(False)

title = "San Diego County: Homeless Counts (2007-2016)"
ax.set_title(title,fontsize=14, fontweight='bold',color='black')
ax.title.set_position([.5,1.05])

No clear trend emerges from the plot above, which points to a certain complexity in the region's homelessness situation. A look at the spread of these counts should help us understand it better.

In [10]:
df_plot.describe(include=[float64])
Out[10]:
       Sheltered Homeless  Unsheltered Homeless  Homeless Individuals  Homeless People in Families  Total Homeless
count                  10                    10                    10                           10              10
mean                 4179                  4462                  6688                         1952            8640
std                   295                   713                   753                          254             734
min                  3729                  3345                  5295                         1673            7326
25%                  3980                  4020                  6345                         1744            8401
50%                  4143                  4310                  6656                         1924            8588
75%                  4406                  4848                  6998                         2025            8845
max                  4586                  5642                  7763                         2460           10013
In [11]:
fig, ax = plt.subplots()

g = sns.boxplot(x="Count", y="Category", data=df_melted, whis=np.inf, palette="Paired",ax=ax)
g = sns.swarmplot(x="Count", y="Category", data=df_melted, size=4, color="red", linewidth=0)
g = sns.despine()

fig.set_size_inches(10, 5)

ax.grid(False)
ax.set_facecolor('white')
ax.legend(bbox_to_anchor=(1, 1), loc=2, borderaxespad=0.)
ax.xaxis.label.set_visible(False)
ax.yaxis.label.set_visible(False)

title = "Summary Stats: San Diego County Homeless Counts (2007-2016)" 
ax.set_title(title,fontsize=13, fontweight='bold',color='black')
ax.title.set_position([.5,1.05])

It is telling that the spreads for the homeless in shelters and for those who are in families are much tighter than the others.

For the sheltered homeless, this indicates that either shelter capacity has remained more or less the same over the last decade or access to shelters is poorly managed. Based on the many reports of shelters being overrun, it is hard not to conclude that it is more of the former.

The smaller spread in homeless families is likely due to factors that afford families greater economic opportunities: the possible presence of multiple adults (who can each earn a living) and the availability of a greater share of government programs (primarily for families with children). Social factors, such as a higher motivation to get out of a homeless situation or to never enter one, may also play a part. This is also consistent with the commonly reported observation that many of the homeless are mentally ill people who are separated from their families, either through abandonment or for some other reason.
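
The spreads discussed above are absolute (standard deviations in head counts). As a complementary view, the sketch below computes the coefficient of variation (std / mean) for each category, using the rounded mean and std values copied from the describe() output above. The `summary` frame and the `cv` column are names introduced here for illustration only.

```python
import pandas as pd

# Mean and standard deviation copied (rounded) from the describe() output
# above for San Diego County, 2007-2016.
summary = pd.DataFrame(
    {
        "mean": [4179, 4462, 6688, 1952, 8640],
        "std": [295, 713, 753, 254, 734],
    },
    index=[
        "Sheltered Homeless",
        "Unsheltered Homeless",
        "Homeless Individuals",
        "Homeless People in Families",
        "Total Homeless",
    ],
)

# Coefficient of variation: spread relative to the size of each category.
summary["cv"] = summary["std"] / summary["mean"]

# Smallest relative spread first.
print(summary.sort_values("cv").round(3))
```

Because the categories differ considerably in size, the ranking by relative spread need not match the ranking by raw standard deviation, so it is worth reading the two together.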

Pair-Wise Relationships

Something else that would be interesting to plot here would be the pair-wise relationships between variables in this dataset.

In [44]:
# compare only the homeless counts
cols = list(df_plot.select_dtypes(include=[float64]).columns.values)

sns.set(style="ticks", rc={'xtick.major.pad': 5,
                           'ytick.major.pad': 5,
                           'font.weight': 'regular',})

plt.rcParams["axes.labelpad"] = 20
plt.rcParams["axes.labelweight"] = 'bold'

def corrfunc(x, y, **kws):
    r, _ = stats.pearsonr(x, y)
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r),
                xy=(.1, .9), xycoords=ax.transAxes, color='red')

g = sns.pairplot(df_plot[cols], x_vars=["Homeless Individuals", "Homeless People in Families", "Total Homeless"],
                 y_vars=["Unsheltered Homeless", "Sheltered Homeless"], kind="reg", size=3)
g.map(corrfunc)

g.fig.suptitle('Homeless Category Relationships: San Diego County', fontweight='bold', fontsize=13);

# used to adjust space between each plot in the grid
plt.subplots_adjust(hspace=0.3, wspace=0.6)
# adjust distance between title and grid
plt.subplots_adjust(top=0.85)

Based on the above plots (and the r values shown on them), homeless individuals appear to be mostly unsheltered, and the total homeless population appears to be dominated by the unsheltered. Homeless people in families also show a moderately high correlation with the sheltered homeless, suggesting that this group is likely to be mostly sheltered.
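
Correlations describe how two counts move together from year to year, so it can help to check the composition of each group directly. The following is a minimal sketch that uses only the two San Diego rows (2016 and 2015) visible in the Out[6] table above; the `sd` and `shares` frames are hypothetical names introduced for this illustration, and the same calculation could be applied to all ten years of `df_sd`.

```python
import pandas as pd

# San Diego rows for 2016 and 2015, copied from the Out[6] table above.
sd = pd.DataFrame(
    {
        "Year": [2016, 2015],
        "Homeless Individuals": [6955, 6761],
        "Unsheltered Homeless Individuals": [4658, 3912],
        "Homeless People in Families": [1714, 1981],
        "Sheltered Homeless People in Families": [1432, 1737],
    }
).set_index("Year")

# Share of each group that falls in the sheltered/unsheltered sub-category.
shares = pd.DataFrame(
    {
        "individuals_unsheltered": sd["Unsheltered Homeless Individuals"]
        / sd["Homeless Individuals"],
        "families_sheltered": sd["Sheltered Homeless People in Families"]
        / sd["Homeless People in Families"],
    }
)
print(shares.round(3))
```

For these two years at least, more than half of homeless individuals are unsheltered and more than four fifths of homeless people in families are sheltered, which agrees with the reading of the pair plots.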

Trend Analysis

A key aspect of understanding time-series data such as the one we have here is to identify the presence or absence of a trend. This is assessed by fitting a line through the time-series data and measuring its slope. Here we pick data for Sheltered Homeless and Unsheltered Homeless since these seem to be representative of other trend lines as seen from the plot above.

In [13]:
pd.options.display.float_format = '{:.2f}'.format

df_sd_stats = pd.DataFrame(data=np.zeros((2,5)),columns=['slope','intercept','r_val','p_val','se'],
                          index = ['Sheltered Homeless','Unsheltered Homeless'])

df_trends = df_sd.sort_values(['Year']).reset_index()
df_trends['Time Delta'] = (df_trends['Year'] - df_trends['Year'][0]).values

df_sd_stats.loc['Sheltered Homeless'] = list(stats.linregress(df_trends['Time Delta'],df_trends['Sheltered Homeless']))
df_sd_stats.loc['Unsheltered Homeless'] = list(stats.linregress(df_trends['Time Delta'],df_trends['Unsheltered Homeless']))

df_sd_stats
Out[13]:
                      slope  intercept  r_val  p_val     se
Sheltered Homeless    19.19    4092.53   0.20   0.58  33.72
Unsheltered Homeless  94.24    4037.44   0.40   0.25  76.26

Here is the gist of what the above results indicate:

  • The slopes for both trends are positive, indicating an upward trend with time.
  • The p-values for both trends are greater than 0.05, implying that neither slope is statistically significant (i.e., neither deviates significantly from a line with slope 0).
  • The r-value for the sheltered homeless is low (0.20), indicating little linear correlation between time and the sheltered homeless count. In other words, the count does not vary significantly with time.
  • The r-value for the unsheltered homeless is moderate (0.40), indicating some linear correlation between time and the unsheltered homeless count. In other words, the count varies slightly with time and, based on the slope, increases with time.
  • The standard error (se) of each slope reflects the spread of the underlying counts, with the sheltered homeless having the tighter spread and the much smaller se.

Note that this interpretation comes from our attempt to fit a linear model, and the main conclusion one can draw from it is that a straight line does not describe the county's homeless trends well.
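
To make the link between the slope, its standard error and the p-value concrete, here is a minimal sketch that recomputes the test by hand from the rounded values in the Out[13] table. It assumes the standard two-sided t-test on the slope of a simple linear regression with n - 2 degrees of freedom; the names `fits`, `dof` and `results` are introduced here for illustration. Because the inputs are rounded, the output should only approximately match the table.

```python
from scipy import stats

n_years = 10       # 2007-2016, one PIT count per year
dof = n_years - 2  # a simple linear fit estimates two parameters

# (slope, standard error) copied, rounded, from the Out[13] table above.
fits = {
    "Sheltered Homeless": (19.19, 33.72),
    "Unsheltered Homeless": (94.24, 76.26),
}

results = {}
for name, (slope, se) in fits.items():
    t_stat = slope / se                          # test statistic for H0: slope = 0
    p_val = 2 * stats.t.sf(abs(t_stat), dof)     # two-sided p-value
    r_val = t_stat / (t_stat ** 2 + dof) ** 0.5  # r recovered from t and dof
    results[name] = (t_stat, p_val, r_val)
    print(f"{name}: t = {t_stat:.2f}, p = {p_val:.2f}, r = {r_val:.2f}")
```

A slope that is small relative to its standard error yields a small t statistic and therefore a large p-value, which is why neither trend is significant with only ten yearly observations.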

In [14]:
sns.set(style="ticks")

fig, (ax1,ax2) = plt.subplots(1,2)

fig = plt.gcf()
fig.set_size_inches(18, 6)

p = sns.regplot(x='Time Delta', y='Sheltered Homeless', data=df_trends, order=4, ci=None, scatter_kws={"s": 50}, ax=ax1)
p = p.set(xticks=np.arange(0,10,1))

p = sns.regplot(x='Time Delta', y='Unsheltered Homeless', data=df_trends, order=4, ci=None, scatter_kws={"s": 50}, ax=ax2)
p = p.set(xticks=np.arange(0,10,1))

sns.despine()

for ax in (ax1,ax2):
    ax.grid(False)
    ax.set_facecolor('white')
    ax.xaxis.labelpad = 20
    ax.yaxis.labelpad = 20
    ax.set_xlabel("Time Delta (Years)",fontsize=12,fontweight='bold')
    
ax1.set_ylabel("Sheltered Homeless Count",fontsize=12,fontweight='bold');
ax2.set_ylabel("Unsheltered Homeless Count",fontsize=12,fontweight='bold');

A great way to test out whether to fit a linear or a non-linear model and even to play around with parameters within each is by using Seaborn's lmplot and regplot functions. Here we find a couple of polynomial fits for the county's sheltered and unsheltered homeless counts. But all said and done polynomial fits result in complex models and may not necessarily be the best way to understand trends. Some approaches favor piecewise linear models over non-linear ones.
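
The caution about polynomial fits can be illustrated with a short sketch. The yearly counts below are hypothetical (they are NOT the PIT data) and serve only to show how the residual sum of squares behaves as the polynomial order grows; `rss` and `rss_by_order` are names introduced for this example.

```python
import numpy as np

# Hypothetical yearly counts, for illustration only (not the PIT data).
x = np.arange(10)
y = np.array([4100, 4300, 4050, 4400, 4700, 5200, 4900, 4300, 4200, 4600],
             dtype=float)


def rss(order):
    """Residual sum of squares of a least-squares polynomial fit."""
    coeffs = np.polyfit(x, y, order)
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))


# Compare a straight line with quadratic and quartic fits.
rss_by_order = {order: rss(order) for order in (1, 2, 4)}
for order, value in rss_by_order.items():
    print(f"order {order}: RSS = {value:.0f}")
```

A higher-order polynomial can never fit the same points worse than a lower-order one, so a smaller residual on its own is not evidence that the more complex model is better, particularly with only ten observations.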

Analysis: Q2

Comparing trends across regions with similar socio-economic characteristics is useful because, all else being equal, it can help identify differences in policy. But since this is a comparison of observational data, as opposed to experimental data, many other factors are likely to influence any differences in trends. Nevertheless, this is a worthwhile pursuit, especially if it yields significant differences.

We start by plotting trends for the counties of choice.

In [15]:
# function to plot the dataframe passed in and save it as a PNG
def table_as_png(df):
    
    from pandas.plotting import table
    
    # set fig size
    fig, ax = plt.subplots(figsize=(12, 3),dpi=100); 
    p = plt.subplots_adjust(top=0.97, bottom=0.2, left=0.05, right=0.97, hspace=0.2);
    # no axes
    ax.xaxis.set_visible(False)  
    ax.yaxis.set_visible(False)  
    # no frame
    ax.set_frame_on(False)  
    # plot table
    tab = table(ax, df, loc='upper right',colWidths=[0.17]*len(df.columns),
                cellLoc = 'center', rowLoc = 'center');  
    # set font manually
    tab.auto_set_font_size(False)
    tab.set_fontsize(8)     
    # save the result
    plt.savefig('table.png',dpi=200,transparent=True)
    plt.close();
In [16]:
pd.options.display.float_format = '{:.0f}'.format

df_plot = df.drop(columns=['Sheltered Homeless Individuals','Unsheltered Homeless Individuals',
                           'Sheltered Homeless People in Families','Unsheltered Homeless People in Families'])

# rename county names 
df_plot['CoC Name'] = df_plot['CoC Name'].map({ 'San Diego City and County CoC': 'San Diego',
                                                'Dallas City & County/Irving CoC': 'Dallas',
                                                'Santa Ana/Anaheim/Orange County CoC': 'Orange'})

# rearrange cols so we can plot paired values appropriately
cols = df_plot.columns.tolist()
cols = [cols[-1]]  + [cols[0]] + cols[1:-1]
df_plot = df_plot[cols]
df_plot = df_plot.rename(columns={'CoC Name':'County'})

# save this so we can use it to stack the dataframe
categories = cols[2:]

#df_plot.head(n=6)
df_county = df_plot.groupby(['County'])
df_county_means = df_county[categories].aggregate(np.mean)

table_as_png(df_county_means);
display(HTML("<h6 style='text-align: left'>Average Homeless Counts (2007-2016)</h6>"))
df_county_means
Average Homeless Counts (2007-2016)
Out[16]:
           Total Homeless  Sheltered Homeless  Unsheltered Homeless  Homeless Individuals  Homeless People in Families
County
Dallas               3499                3199                   300                  2082                         1417
Orange               5477                2488                  2989                  4000                         1477
San Diego            8640                4179                  4462                  6688                         1952
In [17]:
# convert the data to long-format, suitable for plotting
df_melted = df_plot.set_index(['Year','County']).rename_axis(['Category'],axis=1).stack().reset_index()
df_melted = df_melted.rename(columns={0:'Count'})
#display_tables(df_melted.head(n=5),df_melted.tail(n=5))
In [24]:
sns.set(style="ticks")

# sorted array of years for the x-axis (ndarray.sort() would return None)
order = np.sort(df_melted['Year'].unique())
hue_order=['San Diego','Dallas','Orange']

def plot(x,y, **kwargs):    
    data = kwargs.pop('data')
    sns.pointplot(x,y, hue='County', data=data, ci='sd', errwidth=1, order=order, hue_order=hue_order, scale=0.5)
                
g = sns.FacetGrid(df_melted, col='Category', col_wrap=3, sharey='col', size=4.5, aspect=1.1)
g = g.map_dataframe(plot,'Year','Count', palette='deep')

g = g.set_titles(col_template="{col_name}", fontweight='bold', fontsize=11)
g.axes[4].legend(loc='upper right', bbox_to_anchor=(1.4, 1, 0, 0), fontsize=13)

for ax in g.axes.flat:
    ax.xaxis.label.set_visible(False)
    ax.yaxis.label.set_visible(False)
    plt.setp(ax.get_xticklabels(), visible=True)
    ax.set_xticklabels(ax.get_xticklabels(),rotation=45 )

# used to adjust space between each plot in the grid
plt.subplots_adjust(hspace=0.3, wspace=0.1)
# adjust distance between title and grid
plt.subplots_adjust(top=0.85)
g.fig.suptitle('Homeless Trend Comparison: San Diego, Dallas and Orange Counties', fontweight='bold', fontsize=15);

The following are observable from these plots:

  • All three counties exhibit similar trends for the sheltered homeless and for homeless in families
  • Dallas county exhibits a significantly different trend for unsheltered homeless in that its unsheltered homeless counts are almost stationary (i.e.: not varying with time) compared to the highly non-linear patterns in both San Diego and Orange counties.
  • Dallas county has much lower homeless counts (in absolute terms) across all categories. This is likely explained in part by the difference in population size between Dallas and the other counties, but further analysis is required to understand other factors that contribute to these low numbers.
  • San Diego county has the highest homeless counts across all categories (with the exception of 2 years when its counts were lower than those of Orange county) and a significantly higher average than the other counties.
  • The difference in homeless counts between San Diego and Orange counties is especially striking given their significant similarities. This potentially points to differences in policy that are worth examining.

NOTE: Given the non-linearities in the trend lines for the various homeless categories, comparing them across counties using statistical techniques is non-trivial. That, combined with the fact that trend differences are easily observable from the plots above, means that statistical tests to quantify trend differences have been left out of this analysis.
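
Short of a formal test, one simple way to put a number on the contrast in unsheltered counts is the unsheltered share of each county's total. The sketch below uses only the 2007-2016 averages from the Out[16] table above; `means` and `unsheltered_share` are names introduced for illustration, and the result is a ratio of averages rather than an average of yearly ratios.

```python
import pandas as pd

# Average counts for 2007-2016, copied from the Out[16] table above.
means = pd.DataFrame(
    {
        "Total Homeless": [3499, 5477, 8640],
        "Unsheltered Homeless": [300, 2989, 4462],
    },
    index=["Dallas", "Orange", "San Diego"],
)

# Share of the average total that is unsheltered, per county.
means["unsheltered_share"] = (
    means["Unsheltered Homeless"] / means["Total Homeless"]
)
print(means.round(3))
```

On these averages, fewer than one in ten of the homeless in Dallas are unsheltered, compared with roughly half in Orange and San Diego counties.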

Insights

The following are the key findings of this analysis:

  • San Diego homeless counts trended upwards over the decade spanning 2007-2016, although the fitted linear trends are not statistically significant
  • The following two trends were observed across all counties whose data was analysed for the same period
    • The number of homeless in shelters has remained steady (neither increasing nor decreasing significantly) over the same period
    • Individuals (those without support from families) are the most vulnerable group, prone to homelessness by significantly larger numbers than those within families.
  • San Diego has greater numbers of homeless on average, across all categories, than Dallas and Orange counties, two counties of similar socio-economic characteristics

Outro

A comprehensive report on the homeless across the US for the 10 years from 2007 to 2016 can be found here: