This is an analysis of the socio-economic impact of severe weather events in the San Diego county. This analysis is carried out for the last 20 years starting 1997 and ending in 2016. The analysis aims to answer questions specified under Data Questions below. Results from the analysis can be found under Insights.
All of the data used in this analysis is sourced from the NOAA (National Oceanic and Atmospheric Administration). This data can be downloaded by either of the methods below. Each years data was downloaded as a seperate CSV (gzipped) file.
Follow links from the NOAA main page as specified below
NOAA -> Frequently Accessed Tools & Resources -> Climate Data & Reports -> Find Climate Data Online -> Storm Events (found under Additional Data Access on the same page) -> Bulk Data Download -> HTTP
The code file for the dataset explaining the fields can be found here.
Socio-Economic impact refers to the impact in terms of the population, such as deaths and injuries, and in terms of damage to material assets such as property and crops. With this analysis we are attempting to quantify this impact. Specifically, we are looking to answer the following questions:
Below are R scripts and commetaries detailing the analysis carried out on the dataset.
#suppress warnings related to R versions among others
options(warn=-1,scipen=999)
#do this to turn warnings back on
#options(warn=0)
The raw data files are expected to be in a directory named data under the current workspace.
# load the necessary packages
suppressPackageStartupMessages(require(R.utils,quietly = TRUE))
suppressPackageStartupMessages(require(car,quietly = TRUE))
suppressPackageStartupMessages(require(tools,quietly = TRUE))
suppressPackageStartupMessages(require(ggplot2,quietly = TRUE))
suppressPackageStartupMessages(require(matrixStats,quietly = TRUE))
suppressPackageStartupMessages(require(repr,quietly = TRUE))
suppressPackageStartupMessages(require(reshape2,quietly = TRUE))
suppressPackageStartupMessages(require(scales,quietly = TRUE))
suppressPackageStartupMessages(require(R2HTML,quietly = TRUE))
CWD <- getwd()
DATADIR <- paste(CWD,"data",sep="/")
# create a temp directory to store figures generated here
path.figures <- paste(basename(tempdir()),"figure","",sep = "/")
We read in data for each year, filter it out for the county of interest (in this case, San Diego) and collate the data from each of the files for the 20 year period from 1997 to 2016 into a single data frame.
BASE_FILE_PREFIX <- "StormEvents_details-ftp_v1.0"
TARGET_YR_START <- 1997
TARGET_YR_END <- 2016
target_years <- TARGET_YR_START:TARGET_YR_END
# these are the fields we care about and their datatypes; also specified in the documentation
target_cols <- c("STATE","STATE_FIPS","YEAR","EVENT_TYPE","CZ_TYPE","CZ_FIPS",
"CZ_NAME","INJURIES_DIRECT","INJURIES_INDIRECT","DEATHS_DIRECT",
"DEATHS_INDIRECT","DAMAGE_PROPERTY","DAMAGE_CROPS")
target_cols_class <- c("character","numeric","numeric","character","character","numeric",
"character","numeric","numeric","numeric",
"numeric","character","character")
# FIPS code for San Diego
SD_COUNTY_FIPS <- 73
SD_COUNTY_NAME <- "SAN DIEGO"
# this function takes in the year (e.g.: 2016) and reads data from the appropriate datafile
# returning the CSV data, filtered out for the San Diego county, as a data frame
ExtractCSV <- function(yr)
{
datafile_prefix <- paste(BASE_FILE_PREFIX,yr,sep="_d")
datafile <- grep(datafile_prefix,dir(DATADIR),value=TRUE)
datafile <- paste(DATADIR,datafile,sep="/")
# check if file is not already uncompressed
if(file_ext(datafile)=="gz")
{
gunzip(datafile)
csvfile <- strsplit(datafile,".gz")[[1]]
}
else
csvfile = datafile
# create a connection with the positions of the cols we care about
col_pos <- match(target_cols,scan(csvfile,sep=',',what=character(),nlines=1))
con <- pipe(paste("cut -d, -f",paste(col_pos,collapse=',')," ",csvfile,sep=''))
# read the data
csvdata <- read.csv(con,stringsAsFactors = FALSE,allowEscapes = TRUE,strip.white = TRUE)
# filter by county
df <- csvdata[with(csvdata,grepl("san diego",CZ_NAME,perl=T,ignore.case = T)),]
#df <- csvdata[(csvdata$CZ_NAME == SD_COUNTY_NAME),]
return (df)
}
df <- as.data.frame(do.call(rbind,sapply(target_years,ExtractCSV,simplify=FALSE)))
head(df)
We inspect the data for missing values and discard any samples that do not have useful information for this analysis.
cat(paste("# Samples:", nrow(df), "\n"))
cat(paste("# Events: ",length(unique(df$EVENT_TYPE)),"\n"))
unique(df$EVENT_TYPE)
# check for missing values
ifelse (sum(colSums(is.na(df)))==0,cat("No missing data\n"),print(colSums(is.na(df))))
# use only samples where not all properties are missing
good <- !(is.na(df$DEATHS_DIRECT) & is.na(df$DEATHS_INDIRECT) & is.na(df$INJURIES_DIRECT) & is.na(df$INJURIES_INDIRECT) & is.na(df$DAMAGE_PROPERTY) & is.na(df$DAMAGE_CROPS))
df <- df[good,]
cat(paste("# Samples Remaining:",nrow(df),"\n"))
We convert values in the numeric fields to their appropriate data type since they are all read in as character strings.
# convert variables to the appropriate types
for(i in 1:ncol(df))
{
if(target_cols_class[i] == "numeric")
df[,i] <- as.numeric(df[,i])
}
Inspecting the DAMAGE_PROPERTY and DAMAGE_CROPS fields and from the code file it is clear that these need to be transformed in order to arrive at the actual damage in terms of US Dollars (USD). We essentially multiply the numerical part of each of these fields by the exponent determined by the character (K is a thousand, M is a million and so on) following the numerical part.
cat(paste("Property Damage (Raw):",sep=" "))
cat(unique(df$DAMAGE_PROPERTY)[1:10])
cat(paste("\n"," Crop Damage (Raw): ",sep=" "))
cat(unique(df$DAMAGE_CROPS)[1:10])
# convert property and crop damage values to numeric by multiplying the exponents (K, M, B) with the
# appropriate multiple of 10;
ProcessExponent <- function(x)
{
val <- 0
exp <- sapply(strsplit(x,""),tail,1)
num <- as.numeric(gsub(".$","",x))
if (!is.na(num))
{
fac <- ifelse((exp=="K"),10**3,ifelse((exp=="M"),
10**6,ifelse((exp=="B"),10**9,10**0)))
val <- (num * fac)
}
return (val)
}
df$DAMAGE_PROPERTY_USD <- sapply(df$DAMAGE_PROPERTY,ProcessExponent)
df$DAMAGE_CROPS_USD <- sapply(df$DAMAGE_CROPS,ProcessExponent)
This is the result of the above transformation.
cat(paste("Property Damage (USD): ",sep=" "))
cat(unique(df$DAMAGE_PROPERTY_USD)[1:10])
cat(paste("\n"," Crop Damage (USD): ",sep=" "))
cat(unique(df$DAMAGE_CROPS_USD)[1:10])
Since we are attempting to understand the impact by event we first group the data by event type. Social impact pertains to the deaths and injuries in the data. While it is good that the data seperates out the deaths and injuries by whether they were caused directly or indirectly by the event, we are more interested in the overall impact in this analysis.
The output is a list of events that had the greatest impact by deaths, followed by injuries.
# group by event type
df_evt <- split(df,as.factor(df$EVENT_TYPE))
# function to collect and compute information related to population damage
# sum up the deaths and injuries, collapsing direct and indirect counts, the specified input (x)
# also count the number of instances within x
SocialImpact <- function(x)
{
c((sum(x$DEATHS_DIRECT,na.rm=TRUE) + sum(x$DEATHS_INDIRECT,na.rm=TRUE)),
(sum(x$INJURIES_DIRECT,na.rm = TRUE) + sum(x$INJURIES_INDIRECT,na.rm = TRUE)),
nrow(x))
}
df_social <- as.data.frame(do.call(rbind,
sapply(df_evt,SocialImpact,simplify=FALSE)))
names(df_social) <- c("DEATHS","INJURIES","INCIDENTS")
# compute % of total impact
df_social$DEATHS_PCT <- round((df_social$DEATHS/sum(df_social$DEATHS))*100,2)
df_social$INJURIES_PCT <- round((df_social$INJURIES/sum(df_social$INJURIES))*100,2)
# order the weather events by number of deaths first, then by number of
# injuries in the descending order
order <- order(df_social$DEATHS,df_social$INJURIES,decreasing = TRUE)
df_social <- df_social[order,]
In the table below, the number of deaths and injuries resulting from each event across the entire span is given by DEATHS and INJURIES respectively. Note that these measures are aggregates of both direct and indirect occurences for each impact type. INCIDENTS represents the number of incidents of the event across the entire span (20 years). The percentage of total impact (in deaths and injuries) accounted for by each event is also represented.
head(df_social,n=10)
plot_options <- list(theme(panel.background = element_rect(fill = "transparent", colour = NA),
legend.background = element_rect(fill = "transparent", colour = NA),
legend.key = element_rect(fill = "white"),
axis.line.x = element_line(color="gray", size = 0.5),
axis.line.y = element_line(color="gray", size = 0.5),
plot.title=element_text(family="Times", face="bold", size=8,color="lightyellow4"),
#legend.title = element_text(family="Times", face="bold", size=8),
legend.title = element_blank(),
legend.text = element_text(size=7, face="bold"),
legend.position = c(0.9,0.9),
axis.title.x = element_text(size=8,face="bold",color='brown3'),
axis.title.y = element_text(size=8,face="bold",color='brown3'),
axis.ticks = element_line(color='gray'),
axis.text.x = element_text(size=8,angle=45, hjust=1,face="bold"),
axis.text.y = element_text(size=7,face="bold")))
options(repr.plot.width = 5, repr.plot.height = 4)
# we're only plotting the top 10 events
df_social_topten <- df_social[1:10,1:2]
# reshape the data frame to suit plotting needs
df_social_topten$event <- rownames(df_social_topten)
df_melt <- melt(df_social_topten,id.vars='event')
# this is needed since ggplot (by default) does not stick to the order
# in the original data frame
ordered_plot <- function(lvl)
{
df_melt$event <- factor(df_melt$event, levels = df_melt$event[order(-df_melt$value[df_melt$variable == lvl])])
ggplot(data=df_melt) +
geom_bar(aes(x=event, y=value, fill = variable), width = 0.8,
position = 'dodge', stat="identity") +
plot_options
}
plot <- ordered_plot(levels(df_melt$variable)[1]) +
xlab("Weather Event") + ylab("") +
ggtitle("Severe Weather Events: Social Impact by Event (1997-2016), San Diego County") +
scale_fill_manual(name="",values=c('DEATHS'='lightblue','INJURIES'='lightsalmon'),
labels = c("Deaths","Injuries"))
plot
We carry out similar analysis to output events by their greatest economic impact. Here the list is ordered by damage to property, followed by crops since the San Diego county is a predominantly urban county with property being the dominant asset owned by the population as opposed to crops.
# function to collect and compute information related to crop damage
# sum up the damage collapsing direct and indirect instances for each group
EconomicImpact <- function(x)
{
c(sum(x$DAMAGE_PROPERTY_USD,na.rm=TRUE), sum(x$DAMAGE_CROPS_USD,na.rm=TRUE),nrow(x))
}
df_econ <- as.data.frame(do.call(rbind,
sapply(df_evt,EconomicImpact,simplify=FALSE)))
names(df_econ) <- c("DAMAGE_PROPERTY","DAMAGE_CROPS","INCIDENTS")
# order the weather events by number of deaths first, then by number of
# injuries in the descending order
order <- order(df_econ$DAMAGE_PROPERTY,df_econ$DAMAGE_CROPS,decreasing = TRUE)
df_econ <- df_econ[order,]
# compute % of total impact
df_econ$DAMAGE_PROPERTY_PCT <- round((df_econ$DAMAGE_PROPERTY/sum(df_econ$DAMAGE_PROPERTY))*100,2)
df_econ$DAMAGE_CROPS_PCT <- round((df_econ$DAMAGE_CROPS/sum(df_econ$DAMAGE_CROPS))*100,2)
In the table below, damage to property and crops in USD is given by DAMAGE_PROPERTY andDAMAGE_CROPS respectively. INCIDENTS represents the number of incidents of the event across the entire span (20 years). The percentage of total damage (in property and crops) accounted for by each event is also represented.
head(df_econ,n=10)
options(repr.plot.width = 6, repr.plot.height = 5)
# we're only plotting the top 10 events
df_econ_topten <- df_econ[1:10,1:2]
# reshape the data frame to suit plotting needs
df_econ_topten$event <- rownames(df_econ_topten)
df_melt <- melt(df_econ_topten,id.vars='event')
# do this since we are using log scale (this way we
# prevent log(0) from being plotted)
df_melt[df_melt == 0] <- NA
breaks = 10**c(0:10)
labs = c("1","10","100","1K","10K","100K","1M","10M","100M","1B","10B")
plot <- ordered_plot(levels(df_melt$variable)[1]) +
scale_y_log10(breaks = breaks, labels = labs) +
theme(axis.text.x = element_text(size=8,angle=0, hjust=0.5,face="bold")) +
coord_flip() +
xlab("Weather Event") + ylab("Amount (USD)") +
ggtitle("Severe Weather Events: Economic Impact by Event (1997-2016), San Diego County") +
scale_fill_manual(name="",values=c('DAMAGE_PROPERTY'='lightblue','DAMAGE_CROPS'='lightsalmon'),
labels = c("Property Damage","Crop Damage"))
plot
We start here by grouping the data by the year of occurence and subsequently summing them by group.
# group by year
df_yr <- split(df,as.factor(df$YEAR))
df_yr_social <- as.data.frame(do.call(rbind,
sapply(df_yr,SocialImpact,simplify=FALSE)))
names(df_yr_social) <- c("DEATHS","INJURIES","INCIDENTS")
options(repr.plot.width = 6, repr.plot.height = 5)
x_ticks <- as.numeric(rownames(df_yr_social))
x_ticks_min <- min(x_ticks)
x_ticks_max <- max(x_ticks)
y_ticks_min <- min(colMins(data.matrix(df_yr_social)))
y_ticks_max <- max(colMaxs(data.matrix(df_yr_social)))
plot <- ggplot(data=df_yr_social) +
geom_col(aes(x=x_ticks,y=INCIDENTS),width=0.8,fill='black',alpha='0.1') +
geom_line(aes(x=x_ticks,y=DEATHS,group=1,color='DEATHS'),size=0.8) +
geom_point(aes(x=x_ticks,y=DEATHS,group=1,color='DEATHS'),size=0.9) +
geom_line(aes(x=x_ticks,y=INJURIES,group=1,color='INJURIES'),size=0.8) +
geom_point(aes(x=x_ticks,y=INJURIES,group=1,color='INJURIES'),size=0.9) +
scale_x_continuous(breaks=seq(x_ticks_min,x_ticks_max,1)) +
scale_y_continuous(breaks=seq(y_ticks_min,y_ticks_max,20)) +
xlab('Year') + ylab('') +
ggtitle("Severe Weather Events: Social Impact by Year (1997-2016), San Diego County") +
scale_color_manual(name = "", values=c('DEATHS'='lightblue','INJURIES'='lightsalmon'),
labels = c("Deaths","Injuries"))
In the plot below, the incidents are represented by bars for each year. Note that here the number of incidents is the aggregate across all event types.
t(df_yr_social)
plot + plot_options
df_yr_econ <- as.data.frame(do.call(rbind,
sapply(df_yr,EconomicImpact,simplify=FALSE)))
names(df_yr_econ) <- c("DAMAGE_PROPERTY","DAMAGE_CROPS","INCIDENTS")
options(repr.plot.width = 6, repr.plot.height = 5)
df_yr_econ[df_yr_econ == 0] <- 0
x_ticks <- as.numeric(rownames(df_yr_econ))
x_ticks_min <- min(x_ticks)
x_ticks_max <- max(x_ticks)
y_ticks_min <- min(colMins(data.matrix(df_yr_econ)))
y_ticks_max <- max(colMaxs(data.matrix(df_yr_econ)))
breaks = 10**c(0:10)
labs = c("1","10","100","1K","10K","100K","1M","10M","100M","1B","10B")
plot <- ggplot(data=df_yr_econ) +
geom_line(aes(x=x_ticks,y=INCIDENTS,group=1,linetype='INCIDENTS'),color="black") +
geom_point(aes(x=x_ticks,y=INCIDENTS,group=1),color="black",size=0.8) +
geom_line(aes(x=x_ticks,y=DAMAGE_PROPERTY,group=1,color='DAMAGE_PROPERTY')) +
geom_point(aes(x=x_ticks,y=DAMAGE_PROPERTY,group=1,color='DAMAGE_PROPERTY'),size=0.8) +
geom_line(aes(x=x_ticks,y=DAMAGE_CROPS,group=1,color='DAMAGE_CROPS')) +
geom_point(aes(x=x_ticks,y=DAMAGE_CROPS,group=1,color='DAMAGE_CROPS'),size=0.8) +
scale_x_continuous(breaks=seq(x_ticks_min,x_ticks_max,1)) +
scale_y_log10(breaks = breaks, labels = labs) +
xlab('Year') + ylab('Amount (USD)') +
ggtitle("Severe Weather Events: Economic Impact by Year (1997-2016), San Diego County") +
scale_color_manual(name = "", values=c('DAMAGE_PROPERTY'='lightsalmon','DAMAGE_CROPS'='lightblue'),
labels = c("Crop Damage","Property Damage")) +
scale_linetype_manual(name="",values=c('INCIDENTS'='dashed'),labels=c("Incidents"))
t(df_yr_econ)
plot + plot_options
# filter out data specific to wildfires
df_wildfire <- df[df$EVENT_TYPE == 'Wildfire',]
# group by year
df_wf_yr <- split(df_wildfire,as.factor(df_wildfire$YEAR))
df_wf_yr_social <- as.data.frame(do.call(rbind,sapply(df_wf_yr,SocialImpact,simplify=FALSE)))
names(df_wf_yr_social) <- c("DEATHS","INJURIES","INCIDENTS")
df_wf_yr_social$YEAR <- rownames(df_wf_yr_social)
df_wf_yr_econ <- as.data.frame(do.call(rbind,sapply(df_wf_yr,EconomicImpact,simplify=FALSE)))
names(df_wf_yr_econ) <- c("DAMAGE_PROPERTY","DAMAGE_CROPS","INCIDENTS")
df_wf_yr_econ$YEAR <- rownames(df_wf_yr_econ)
# merge the data
df_wf_yr <- merge(df_wf_yr_social, df_wf_yr_econ)
# order the data by year (do this before computng percentages)
df_wf_yr <- df_wf_yr[order(df_wf_yr$YEAR),]
# compute the percentages of key indicators that wildfires accounted for, each year
df_wf_yr$INCIDENTS_PCT <- round((df_wf_yr$INCIDENTS/df_yr_econ$INCIDENTS)*100,2)
df_wf_yr$DEATHS_PCT <- round((df_wf_yr$DEATHS/df_yr_social$DEATHS)*100,2)
df_wf_yr$INJURIES_PCT <- round((df_wf_yr$INJURIES/df_yr_social$INJURIES)*100,2)
df_wf_yr$DAMAGE_PROPERTY_PCT <- round((df_wf_yr$DAMAGE_PROPERTY/df_yr_econ$DAMAGE_PROPERTY)*100,2)
df_wf_yr$DAMAGE_CROPS_PCT <- round((df_wf_yr$DAMAGE_CROPS/df_yr_econ$DAMAGE_CROPS)*100,2)
# re-order the columns
df_wf_yr <- df_wf_yr[,c(c(2:6),1,c(7:11))]
# remove row indices
rownames(df_wf_yr) <- NULL
options(repr.plot.width = 6, repr.plot.height = 5)
df_wf_yr[df_wf_yr == 0] <- 0
x_ticks <- df_wf_yr$YEAR
x_ticks_min <- min(x_ticks)
x_ticks_max <- max(x_ticks)
y_ticks_min <- min(colMins(data.matrix(df_wf_yr)))
y_ticks_max <- max(colMaxs(data.matrix(df_wf_yr)))
breaks = 10**c(0:10)
labs = c("1","10","100","1K","10K","100K","1M","10M","100M","1B","10B")
plot <- ggplot(data=df_wf_yr) +
geom_col(aes(x=x_ticks,y=INCIDENTS),width=0.8,fill='black',alpha='0.1') +
geom_line(aes(x=x_ticks,y=DEATHS,group=1,color='DEATHS'),size=0.8) +
geom_point(aes(x=x_ticks,y=DEATHS,group=1,color='DEATHS'),size=0.9) +
geom_line(aes(x=x_ticks,y=INJURIES,group=1,color='INJURIES'),size=0.8) +
geom_point(aes(x=x_ticks,y=INJURIES,group=1,color='INJURIES'),size=0.9) +
geom_line(aes(x=x_ticks,y=DAMAGE_PROPERTY,group=1,color='DAMAGE_PROPERTY'),size=0.8) +
geom_point(aes(x=x_ticks,y=DAMAGE_PROPERTY,group=1,color='DAMAGE_PROPERTY'),size=0.9) +
geom_line(aes(x=x_ticks,y=DAMAGE_CROPS,group=1,color='DAMAGE_CROPS'),size=0.8) +
geom_point(aes(x=x_ticks,y=DAMAGE_CROPS,group=1,color='DAMAGE_CROPS'),size=0.9) +
scale_x_discrete(breaks=seq(x_ticks_min,x_ticks_max,1)) +
scale_y_log10(breaks = breaks, labels = labs) +
xlab('Year') + ylab('') +
ggtitle("Wildfires: Socio-Economic Impact by Year (1997-2016), San Diego County") +
scale_color_manual(name = "",
values=c('DEATHS'='firebrick','INJURIES'='lightsalmon','DAMAGE_PROPERTY'='darkseagreen',
'DAMAGE_CROPS'='deepskyblue3'),
labels = c("Crop Damage", "Property Damage","Deaths","Injuries")) +
scale_linetype_manual(name="",values=c('INCIDENTS'='dashed'),labels=c("Incidents"))
Since wildfires have a disproportionately large impact on the region it is useful to take a closer look at it. This table indicates high impact nature of wildfires in that a few incidents can result in widespread death and destruction.
As before, the bars represent the number of wildfire incidents in each year.
t(df_wf_yr[,c(1,6:11)])
plot + plot_options
These are a few of the insights gathered from the analysis above:
This is by no means an exhaustive analysis. Many possibilities yet remain - investigating the cyclicality of event occurences, as well as that of the event impact (e.g.: an event may have a high impact one year followed by several years with low impact), assessing the effect of one event on another (e.g.: high impact wildfires are often followed by landslides, this is especially true in Southern California where the wet season follows right after the Santa Ana Winds, the time when fires are most likey to spread) and so on.