Collaborators: Shahryar Shagoshtasbi, Bernard Bikki, Erfan Farhad Zadeh
The main purpose of this project is to take you through the entire data science pipeline. To do so, we have chosen to focus on how different types of storms in the U.S. affect people and the world. As you may have heard, in June 2017 President Donald Trump announced that the U.S. would no longer be part of the Paris climate agreement. Major decisions such as this once more brought up conversations about climate change and its effects on our country. While many people debate whether climate change is real, we have decided to investigate further by diving into data science and providing some facts regarding climate change. In addition, we will focus on how people's lives are affected. To make the tutorial easier to follow, we focus only on U.S. temperatures and the different types of devastating storms that have happened in the U.S.
Why is this important? Throughout time, we have seen many people's lives turned upside down by natural disasters. Storms cause many casualties and much damage, and it is important to try to predict how storms will affect people in the future so we can prepare for upcoming disasters. By looking at previous locations, we can analyze which cities are affected the most and allocate resources to those cities when a disaster is imminent. In addition, we need to look at average annual temperatures and try to figure out whether there are any causal relationships with other variables.
Throughout this tutorial we will go through the Data Science Lifecycle as follows:
At this stage of the data science life cycle, we look for a dataset related to our topic. Since we are interested in storms in the U.S., we started searching for such a dataset. Just like any other scientific work, we also have to pay attention to the legitimacy of the source. Luckily, we were able to find such a dataset on the National Centers for Environmental Information website.
Using the HTTP access, download every storm events "details" and "fatalities" file in .csv.gz format from 1950 to 2020. Store all of the details files in a folder named "details" and all of the fatalities files in a folder named "fatalities" in your project directory.
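If you would rather script the download, here is a minimal sketch using the requests library. The listing URL and the file-name pattern below are assumptions based on the NCEI bulk CSV directory; adjust them to match the page you are actually downloading from.

import os
import re
import requests

# assumed NCEI bulk-download directory; verify against the Storm Events Database page
BASE = "https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"

os.makedirs("details", exist_ok=True)
os.makedirs("fatalities", exist_ok=True)

listing = requests.get(BASE).text
# file names are assumed to look like StormEvents_details-ftp_v1.0_dYYYY_cYYYYMMDD.csv.gz
for name in re.findall(r'href="(StormEvents_(?:details|fatalities)[^"]+\.csv\.gz)"', listing):
    year = re.search(r'_d(\d{4})_', name)
    if year is None or not (1950 <= int(year.group(1)) <= 2020):
        continue
    folder = "details" if "details" in name else "fatalities"
    dest = os.path.join(folder, name)
    if not os.path.exists(dest):
        with open(dest, "wb") as f:
            f.write(requests.get(BASE + name).content)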
However, that is not all the information we need, since we want to find the relationship between climate change and storms in the U.S. Because this dataset does not include the temperatures over the years, we need another dataset. Keeping in mind that the data must come from a legitimate source, we found what we were looking for, also on the National Centers for Environmental Information website.
Select data for average temperatures (12-month scale) from 1950 to 2020 and click plot. After the data has been retrieved, click the Excel icon to download the data in CSV format. Make sure that the downloaded file (labeled "temperatures.csv") is in your project directory.
During this project we will be using the Python language, along with tools such as IPython and Jupyter Notebook. If you haven't heard of Jupyter notebooks before, make sure to learn more about them here.
Just like any other Python project, we need to import some libraries. Here are some of the libraries we will be using throughout this tutorial.
import os
import folium
import warnings
import datetime
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from scipy.stats import norm
from sklearn import linear_model
from IPython.display import HTML
from folium.plugins import TimestampedGeoJson
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
One of the main libraries that we will be using throughout this project is Pandas. Pandas is an open-source data analysis tool that was built on top of the Python programming language and it is going to help us manipulate the data in an easy and flexible way. With the vast library of tools available, you can transform data very easily as you will see below.
Another library that helps maximize efficiency is NumPy. This library allows for easy computation for large datasets and it is another way to store and manipulate information.
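As a tiny, self-contained illustration (the table below is made up and unrelated to the storm data), this is the kind of filtering, grouping, and array math the two libraries give us:

import numpy as np
import pandas as pd

# a small, made-up table of storm deaths per state and year
toy = pd.DataFrame({
    'STATE': ['MARYLAND', 'MARYLAND', 'TEXAS', 'TEXAS'],
    'YEAR': [2019, 2020, 2019, 2020],
    'DEATHS': [2, 5, 7, 3],
})
# pandas: filter rows and aggregate with groupby
print(toy[toy['DEATHS'] > 2].groupby('STATE')['DEATHS'].sum())
# numpy: fast element-wise math on whole arrays
print(np.mean(np.asarray(toy['DEATHS']) * 1.5))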
If you prefer to run the data-processing step that follows as a standalone script rather than in the notebook, you can run:
python generate_combined_data.py
# retrieve data starting from year (1950 + start_year), original value = 0
start_year = 0
# used to retrieve the top n rows, in this case, the top 1000 rows
n = 1000
# ignore warnings
warnings.filterwarnings('ignore')
def removeUnnecessaryColumnsBefore(df):
    # for the sake of the tutorial, drop the columns we will not use
    # (the combined date/time columns are rebuilt later from their components)
    cols = ['BEGIN_DATE_TIME', 'END_DATE_TIME', 'MONTH_NAME', 'SOURCE',
            'EPISODE_NARRATIVE', 'EVENT_NARRATIVE', 'DATA_SOURCE', 'STATE_FIPS',
            'CZ_TYPE', 'CZ_FIPS', 'CZ_NAME', 'WFO', 'TOR_F_SCALE', 'TOR_LENGTH',
            'TOR_WIDTH', 'TOR_OTHER_WFO', 'TOR_OTHER_CZ_STATE', 'TOR_OTHER_CZ_FIPS',
            'TOR_OTHER_CZ_NAME', 'BEGIN_RANGE', 'BEGIN_AZIMUTH', 'BEGIN_LOCATION',
            'END_RANGE', 'END_AZIMUTH', 'END_LOCATION', 'CZ_TIMEZONE', 'FLOOD_CAUSE',
            'CATEGORY', 'DAMAGE_CROPS', 'MAGNITUDE', 'MAGNITUDE_TYPE']
    return df.drop(columns=cols)
def removeUnnecessaryColumnsAfter(df):
    # drop the raw date/time components and the direct/indirect splits,
    # which were combined into new columns in the steps above
    cols = ['BEGIN_YEARMONTH', 'BEGIN_DAY', 'BEGIN_TIME',
            'END_YEARMONTH', 'END_DAY', 'END_TIME',
            'INJURIES_DIRECT', 'INJURIES_INDIRECT',
            'DEATHS_DIRECT', 'DEATHS_INDIRECT']
    return df.drop(columns=cols)
def getTopNRows(df, n):
    df = df.reset_index(drop=True).fillna("")
    # convert the DAMAGE_PROPERTY text (e.g. "2.5K", "10M", "1B") into a float dollar amount,
    # then keep only the top n rows with the most deaths
    for index, row in df.iterrows():
        # positional column 5 is assumed to hold the DAMAGE_PROPERTY text;
        # reading it by name (row['DAMAGE_PROPERTY']) would be a more robust alternative
        x = str(df.iloc[index, 5])
        # mul is the multiplier implied by the K/M/B suffix
        # (it only coincidentally shares the value 1000 with n)
        mul = 1000
        if x.isdigit():
            mul = 1
        elif x == "":
            x = "0"
            mul = 1
        elif x[-1] == "K":
            x = x[:-1]
            mul = 1000
        elif x[-1] == "M":
            x = x[:-1]
            mul = 1000000
        elif x[-1] == "B":
            x = x[:-1]
            mul = 1000000000
        else:
            x = "0"
        # note: comparing against np.nan with == is always False, so check for the empty string instead
        if x == "":
            df.at[index, 'DAMAGE_PROPERTY'] = 0
        else:
            df.at[index, 'DAMAGE_PROPERTY'] = float(x) * mul
    # combine direct and indirect injuries and deaths
    df['INJURIES'] = df['INJURIES_DIRECT'] + df['INJURIES_INDIRECT']
    df['DEATHS'] = df['DEATHS_DIRECT'] + df['DEATHS_INDIRECT']
    # sort by deaths first, then injuries, then property damage; a single multi-key sort is used
    # because three consecutive single-key sorts would only preserve the last ordering
    df = df.sort_values(by=['DEATHS', 'INJURIES', 'DAMAGE_PROPERTY'], ascending=False)
    # return the top n rows
    df = df.head(n).reset_index(drop=True)
    return df
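# Aside (optional): the K/M/B suffix conversion above can also be written in a vectorized way.
# This is only a sketch of an equivalent alternative and is not used anywhere else in the tutorial.
def parseDamageVectorized(damage):
    s = damage.fillna("0").astype(str).replace("", "0")
    mult = s.str[-1].map({'K': 1e3, 'M': 1e6, 'B': 1e9}).fillna(1.0)
    nums = pd.to_numeric(s.str.rstrip('KMB'), errors='coerce').fillna(0.0)
    return nums * mult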
def modifyDates(df):
    for index, row in df.iterrows():
        # build the begin/end dates from the YEARMONTH and DAY columns of the current row
        bdate = str(row['BEGIN_YEARMONTH']) + str(row['BEGIN_DAY'])
        bdate = datetime.datetime.strptime(bdate, '%Y%m%d').date()
        edate = str(row['END_YEARMONTH']) + str(row['END_DAY'])
        edate = datetime.datetime.strptime(edate, '%Y%m%d').date()
        # get begin time (stored as an integer such as 1530 for 15:30)
        btime = str(row['BEGIN_TIME'])
        btlen = len(btime)
        if btlen == 4:
            bhour = btime[:2]
            bmin = btime[2:]
        elif btlen == 3:
            bhour = btime[0]
            bmin = btime[1:]
        elif btlen == 2:
            bhour = "0"
            bmin = btime
        elif int(btime) == 0:
            bhour = "0"
            bmin = "0"
        else:
            bhour = "0"
            bmin = btime[0]
        btime = datetime.time(int(bhour), int(bmin))
        # get end time
        etime = str(row['END_TIME'])
        etlen = len(etime)
        if etlen == 4:
            ehour = etime[:2]
            emin = etime[2:]
        elif etlen == 3:
            ehour = etime[0]
            emin = etime[1:]
        elif etlen == 2:
            ehour = "0"
            emin = etime
        elif int(etime) == 0:
            ehour = "0"
            emin = "0"
        else:
            ehour = "0"
            emin = etime[0]
        etime = datetime.time(int(ehour), int(emin))
        # combine begin/end date and time objects
        df.at[index, 'BEGIN_DATE_TIME'] = datetime.datetime.combine(bdate, btime)
        df.at[index, 'END_DATE_TIME'] = datetime.datetime.combine(edate, etime)
    return df
path = './details/'
# use a for loop to iterate through each yearly "details" file and combine them into one big data frame
frames = []
c = 0
# sort the file names so the files are processed chronologically and start_year skips the earliest years
for file in sorted(os.listdir(path)):
    if file.endswith(".gz"):
        if c >= start_year:
            temp_df = pd.read_csv(path + file, compression='gzip', low_memory=False)
            # remove initial columns
            temp_df = removeUnnecessaryColumnsBefore(temp_df)
            # since there is an excessive amount of data, keep only the top n rows for each year,
            # ordered by deaths, injuries, and property damage (the most severe events first)
            temp_df = getTopNRows(temp_df, n)
            # format dates
            temp_df = modifyDates(temp_df)
            # collect the yearly frames; they are concatenated below and later written to a file
            frames.append(temp_df)
        c = c + 1
# DataFrame.append was removed in pandas 2.0, so concatenate the collected frames instead
details = pd.concat(frames, ignore_index=True)
# filter out columns
details = removeUnnecessaryColumnsAfter(details)
details.head(10)
def removeUnnecessaryColumnsFatalities(df):
    # keep only the columns we need from the fatalities files (EVENT_ID plus the fatality attributes)
    cols = ['FAT_YEARMONTH', 'FAT_DAY', 'FAT_TIME', 'FATALITY_TYPE',
            'EVENT_YEARMONTH', 'FATALITY_ID', 'FATALITY_DATE']
    return df.drop(columns=cols)
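# NOTE: the loop below calls modifyDatesFatalities, which is not shown in the original write-up.
# A minimal sketch is given here, assuming it mirrors modifyDates and combines the FAT_YEARMONTH,
# FAT_DAY and FAT_TIME fields into a single FATALITY_DATE_TIME column (the column name and the
# exact handling of missing values are assumptions).
def modifyDatesFatalities(df):
    for index, row in df.iterrows():
        if pd.isna(row['FAT_YEARMONTH']) or pd.isna(row['FAT_DAY']):
            continue  # leave rows with missing date components untouched
        fdate = str(int(row['FAT_YEARMONTH'])) + str(int(row['FAT_DAY'])).zfill(2)
        fdate = datetime.datetime.strptime(fdate, '%Y%m%d').date()
        hhmm = '0' if pd.isna(row['FAT_TIME']) else str(int(row['FAT_TIME']))
        hhmm = hhmm.zfill(4)
        ftime = datetime.time(int(hhmm[:2]) % 24, int(hhmm[2:]) % 60)
        df.at[index, 'FATALITY_DATE_TIME'] = datetime.datetime.combine(fdate, ftime)
    return df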
path = './fatalities/'
# use a for loop to iterate through each yearly "fatalities" file and combine them into one big data frame
frames = []
c = 0
# sort the file names so the files are processed chronologically and start_year skips the earliest years
for file in sorted(os.listdir(path)):
    if file.endswith(".gz"):
        if c >= start_year:
            temp_df = pd.read_csv(path + file, compression='gzip', low_memory=False)
            # format dates
            temp_df = modifyDatesFatalities(temp_df)
            # filter out columns
            temp_df = removeUnnecessaryColumnsFatalities(temp_df)
            # collect the yearly frames; they are concatenated below
            frames.append(temp_df)
        c = c + 1
# DataFrame.append was removed in pandas 2.0, so concatenate the collected frames instead
fatalities = pd.concat(frames, ignore_index=True)
# merge details and fatalities
details = pd.merge(details, fatalities, on='EVENT_ID', how='left')
details = details.reset_index(drop=True).fillna("")
details.head(10)
temperatures = pd.read_csv('temperatures.csv', low_memory=False)
temperatures = temperatures.iloc[4:].reset_index(drop=True).fillna("")
temperatures.columns = ['YEAR', 'AVG_TEMP', 'ANOMALY']
for index, row in temperatures.iterrows():
yearmonth = str(temperatures.iloc[index]['YEAR'])
year = yearmonth[0:4]
temperatures.at[index, 'YEAR'] = int(year)
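# the same year extraction, written as a single vectorized line (equivalent to the loop above)
temperatures['YEAR'] = temperatures['YEAR'].astype(str).str[:4].astype(int)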
# merge details and temperatures
details = pd.merge(details, temperatures, on='YEAR', how='left')
details = details.reset_index(drop=True).fillna("")
# remove anomaly and episode id columns
del details['ANOMALY']
del details['EPISODE_ID']
# store modified data
details.to_csv('details.csv', index=False)
# final cleaned dataframe
details.head(10)
It seems that our data has missing values (NaNs). There are many ways to handle missing data; one method is simply to drop all entries with missing values.
The other extreme is to build a model that samples from the data and fills in (imputes) the missing values.
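The modeling extreme is beyond the scope of this tutorial, but for illustration, here is a minimal sketch of simple imputation on the fatality columns (we do not use this anywhere below; the column names follow the combined_df data frame used in the examples that follow):

# simple imputation sketch (not used in this tutorial): fill missing ages with the median age
# and missing sex/location values with an explicit "Unknown" category
imputed_df = combined_df.copy()
imputed_df['FATALITY_AGE'] = pd.to_numeric(imputed_df['FATALITY_AGE'], errors='coerce')
imputed_df['FATALITY_AGE'] = imputed_df['FATALITY_AGE'].fillna(imputed_df['FATALITY_AGE'].median())
imputed_df['FATALITY_SEX'] = imputed_df['FATALITY_SEX'].replace('', np.nan).fillna('Unknown')
imputed_df['FATALITY_LOCATION'] = imputed_df['FATALITY_LOCATION'].replace('', np.nan).fillna('Unknown')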
Ex: Drop all rows with NaN in the specific columns of interest. This is not done during preprocessing, because some of the rows that would be removed when filtering on one attribute might be useful when analyzing another attribute.
combined_df.dropna(subset=['FATALITY_AGE'], inplace=True)
combined_df.dropna(subset=['FATALITY_SEX'], inplace=True)
combined_df.dropna(subset=['FATALITY_LOCATION'], inplace=True)
Ex: Drop all duplicates to work with any attribute other than fatalities (age, sex, location)
combined_df = combined_df.drop_duplicates(subset='EVENT_ID', keep="first")
# read from the saved file, assuming the data processing step above has been completed ONCE!
# (running the data processing step more than once is just a waste of computation time)
combined_df = pd.read_csv("details.csv", low_memory=False)
df_fatality = combined_df.dropna(subset=['FATALITY_AGE'])
df_fatality = df_fatality.dropna(subset=['FATALITY_SEX'])
df_fatality = df_fatality.dropna(subset=['FATALITY_LOCATION'])
df_fatality.head(10)
combined_df = combined_df.drop_duplicates(subset='EVENT_ID', keep="first")
combined_df.head(10)
In this section of the data science life cycle, we are going to graph the data in order to gain a better understanding of it. We also perform statistical analyses to gain mathematical evidence for the trends we may discover. In other words, as the title indicates, we are going to further explore the data.
# getting the different types of storms
combined_df.EVENT_TYPE.unique()
# add a new column with the total number of occurrences of each event type (# of hurricanes, tornadoes, etc.)
result = combined_df.groupby('EVENT_TYPE').first()
result['COUNT'] = combined_df['EVENT_TYPE'].value_counts()
result.reset_index(inplace=True)
result = result[['EVENT_TYPE','COUNT']]
# print the counts of each event type
print(result)
combined_df = pd.merge(combined_df, result, on='EVENT_TYPE', how='left')
combined_df.head(10)
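As an aside, the same counts table can be produced a bit more directly with value_counts; this is just an equivalent alternative, not a required step:

# equivalent one-liner: count each event type and turn the result into a two-column data frame
event_counts = combined_df['EVENT_TYPE'].value_counts().rename_axis('EVENT_TYPE').reset_index(name='COUNT')
print(event_counts)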
But in order to get a better understanding, why don't we use the power of visualization?!
For that purpose, we are going to mainly use the Matplotlib and Seaborn libraries throughout the rest of this tutorial. Please feel free to learn more about them through the provided links, as they are some of the greatest GIFTS to humankind.
plt.figure(figsize = (30, 10))
plt.title('Total Number of Event Type Occurrences')
sns.set(font_scale=2.4)
# order the events based on the number of occurrences
count_bar = sns.countplot(x = 'EVENT_TYPE', data = combined_df,
                          order = combined_df['EVENT_TYPE'].value_counts().index)
count_bar.set_xticklabels(count_bar.get_xticklabels(), rotation=40, ha="right");
count_bar.set(xlabel='Event Type', ylabel='Number of Occurrences');
def label_event_type (row):
event = row['EVENT_TYPE']
if event in ['Thunderstorm Wind', 'Hail', 'Tornado', 'Drought', 'Flood', 'Heat', \
'Flash Flood', 'Winter Weather', 'Rip Current', 'Winter Storm']:
return event
return 'Other'
combined_df['EVENT_TYPE_MODIFIED'] = combined_df.apply (lambda row: label_event_type(row), axis=1)
df_numbers_of_types = combined_df.groupby('EVENT_TYPE_MODIFIED')['EVENT_ID'].nunique()
df_numbers_of_types.sort_values(ascending=False)
label = list(map(str, df_numbers_of_types.keys()))
plt.figure(figsize = (14, 14))
plt.pie(df_numbers_of_types, labels = label, autopct = '%1.1f%%', textprops={'fontsize': 16, 'fontweight': "600"})
plt.title('Event Types Based on Occurrence Percentage')
plt.show()
Since Thunderstorm Wind had the largest share in our pie chart, we might expect it to be the leading cause of fatalities among the event types. To find out, let's make another graph of the number of deaths associated with each event type.
# plot for showing total deaths for each kind of event
total_death_in_event=combined_df.groupby('EVENT_TYPE')['DEATHS'].nunique().sort_values(ascending = False)
plt.figure(figsize=(30,8))
sns.set(font_scale=1.2)
ax = total_death_in_event.plot(kind='bar', \
ylabel = 'Number of Deaths', xlabel = 'Event Type' ,\
title = 'Total Deaths Based on Event Type')
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right");
Now let's look at the relationship between each event type and the number of injured people.
# plot for showing total injuries for each kind of event
total_injury_in_event=combined_df.groupby('EVENT_TYPE')['INJURIES'].nunique().sort_values(ascending = False)
plt.figure(figsize=(30,8))
sns.set(font_scale=1.2)
ax = total_injury_in_event.plot(kind='bar', \
ylabel = 'Number of Injuries', xlabel = 'Event Type' ,\
title = 'Total Injuries Based on Event Type')
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right");
ax.set_ylim(0, 180);
Now, let's find out if there is a specific trend in the number of people who passed away over the years.
# number of deaths per year
total_death_in_year=combined_df.groupby('YEAR')['DEATHS'].nunique()
plt.figure(figsize=(30,8))
sns.set(font_scale=1.5)
plt.title('Total Deaths Per Year')
plt.xlabel('Year', fontsize = 24)
plt.ylabel('Number of Fatalities', fontsize = 24)
total_death_in_year.plot(kind='bar', fontsize = 16);
As we can see, there is no specific pattern in the number of deaths over the years, but in general the average number of deaths has increased since the 1990s.
Now, let's find out if there is a specific trend in the number of people who got injured over the years.
# number of injuries per year
total_injuries_in_year=combined_df.groupby('YEAR')['INJURIES'].nunique()
plt.figure(figsize=(30,8))
sns.set(font_scale=1.5)
plt.title('Total Injuries Per Year')
plt.xlabel('Year', fontsize = 24)
plt.ylabel('Number of Injuries', fontsize = 24)
total_injuries_in_year.plot(kind='bar', fontsize = 16);
After visualizing the total injuries over the years, it doesn't seem that there is a specific trend for it.
Next, let's figure out which states are affected the most. To do so, we are going to analyze each state's property damage, injuries, and deaths.
However, to feed the data into a visualization, we need to sum up the numbers for each state. That is why we are going to use an amazing tool named groupby, which helps us aggregate the numbers by state.
We are going to make a new data frame for this purpose and use it for the next few steps.
dots = combined_df[['STATE', 'DAMAGE_PROPERTY', 'INJURIES', 'DEATHS']].groupby(by=['STATE'], as_index=False).sum()
dots.head(10)
sns.set_theme(style="whitegrid")
# Make the PairGrid
g = sns.PairGrid(dots.sort_values("DAMAGE_PROPERTY", ascending=False),
x_vars=dots.columns[1:4], y_vars=["STATE"],
height=20, aspect=.25)
# Draw a dot plot using the stripplot function
g.map(sns.stripplot, size=10, orient="h",
palette="flare_r", linewidth=1, edgecolor="w")
g.set(ylabel="State")
# Use semantically meaningful titles for the columns
titles = ["Damage Property", "Injuries", "Deaths"]
for ax, title in zip(g.axes.flat, titles):
# Set a different title for each axes
ax.set(xlabel=title)
# Make the grid horizontal instead of vertical
ax.xaxis.grid(False)
ax.yaxis.grid(True)
sns.despine(left=True, bottom=True)
Now that we have visualized the death- and injury-related data, we have a better understanding of it. However, putting it all together may help us understand the data even better. That is why we are going to use an amazing plot from the Seaborn library named pairplot.
sns.set(font_scale=2.5)
sns.pairplot(dots, height=10);
Now, we are curious to find out whether there is a relationship between the age/gender of the people who either passed away or got injured and the other variables. To do so, we are going to make a new data frame. For future use, we will add the location to this new data frame as well.
fdots = df_fatality[['STATE', 'DAMAGE_PROPERTY', 'INJURIES', 'DEATHS', 'FATALITY_AGE', 'FATALITY_SEX', 'FATALITY_LOCATION']]
fdots.head(10)
After making the data frame, let's find out whether there is any trend between the number of people who got injured or passed away and their age. To do so, let's draw a line plot that shows all of the aforementioned data so we can gain a better understanding.
sns.set(font_scale=2.5)
sns.set(rc={'figure.figsize':(30,12)})
plt.title('Injuries / Deaths VS. Age')
sns.lineplot(data=fdots, x="FATALITY_AGE", y="INJURIES")
sns.lineplot(data=fdots, x="FATALITY_AGE", y="DEATHS")
plt.xlabel("Age")
plt.ylabel("Number of Injuries / Deaths")
plt.legend(labels=['Injuries', 'Deaths']);
Now let's see if we can find any trend between gender and the number of injuries. To make it more interesting, let's add the location as well and see what we find.
loc = sns.barplot(data=fdots, x="FATALITY_LOCATION", y="INJURIES", hue='FATALITY_SEX', ci=None)
sns.set(font_scale=2.5)
sns.set(rc={'figure.figsize':(30,8)})
plt.xlabel("Location")
plt.ylabel("Number of Injuries")
plt.title("The Relation Between Location, Gender, and Number of Injured")
loc.set_xticklabels(loc.get_xticklabels(), rotation=40, ha="right");
And now, let's visualize the same data, this time regarding the number of deaths.
loc = sns.barplot(data=fdots, x="FATALITY_LOCATION", y="DEATHS", hue='FATALITY_SEX', ci=None)
sns.set(font_scale=2.5)
sns.set(rc={'figure.figsize':(30,8)})
plt.xlabel("Location")
plt.ylabel("Number of Deaths")
plt.title("The Relation Between Location, Gender, and Number of Deaths")
loc.set_xticklabels(loc.get_xticklabels(), rotation=40, ha="right");
It can be seen that, in general, the number of casualties is higher among women. Just as with injuries, the top three locations causing the most deaths, aside from the "Other" category, are Churches, Permanent Structures, and Long Span Roofs.
Speaking of trends between the genders, let's see if we can find a trend in the number of injuries over the years per gender.
loc = sns.barplot(data=df_fatality, x="YEAR", y="INJURIES", hue='FATALITY_SEX', ci=None)
sns.set(font_scale=2.5)
sns.set(rc={'figure.figsize':(30,8)})
plt.xlabel("Year")
plt.ylabel("Number of Injuries")
plt.title("Number of Injured Over the Years per Gender")
loc.set_xticklabels(loc.get_xticklabels(), rotation=40, ha="right");
We can see that women were injured almost twice as often as men in the 2011 storms.
Now let's see if the same thing happens regarding the number of people who passed away.
loc = sns.barplot(data=df_fatality, x="YEAR", y="DEATHS", hue='FATALITY_SEX', ci=None)
sns.set(rc={'figure.figsize':(30,8)})
plt.xlabel("Year")
plt.ylabel("Number of Deaths")
plt.title("Number of Deaths Over the Years per Gender")
loc.set_xticklabels(loc.get_xticklabels(), rotation=40, ha="right");
Interestingly, we can see the same trend in 2011, with women leading in the number of deaths. However, in 2005 more people passed away than were injured.
Now, let's see which age range has been affected more than others in terms of fatalities, which here represent both injuries and deaths.
This time, let's use another graph from the Seaborn library called displot. This graph gives a good representation of the distribution of the population we are studying.
sns.displot(combined_df['FATALITY_AGE'], height = 12)
plt.title("Distribution of Age");
Now, you may want an even better visualization to understand how the storms are hitting the U.S. Luckily, there are libraries that help us implement an interactive map on which we can show our data. Two of the most widely used libraries are Folium and ipyleaflet.
Here, we are going to use the Folium package.
Let's start by making another data frame, as we are going to apply some changes.
# copying the data frame into a new data frame
storm_df= combined_df.copy()
# dropping the rows that have NaN coordinates so that we are able to map them
storm_df.dropna(subset=['END_LAT','END_LON'], inplace = True)
storm_df.head()
# implementing the interactive map using the Folium Package
# setting the starting latitude, longitude and zoom.
storm_map = folium.Map(location=[37.0902, -95.7129], zoom_start=4.45, tiles='Stamen Terrain')
# this function returns a color based on the value of damage property
def color_producer(damage):
if damage > 2200:
return 'red'
elif 1000 < damage <= 2200 :
return 'orange'
elif 500 < damage <= 1000 :
return 'yellow'
elif 100< damage <= 500:
return 'lightgreen'
else:
return 'darkgreen'
After that, we need to loop through our data set to get the required data: the end latitude and longitude (the marker's location), the property damage (which determines the marker's color), and the event type (shown in the marker's popup).
During this process, we will add each marker to the map by using CircleMarker and add_to() from the Folium package.
# looping through the data frame to get the data required for the markers
for i in range(0, len(storm_df)):
    folium.CircleMarker(
        location=[storm_df.iloc[i]['END_LAT'], storm_df.iloc[i]['END_LON']],
        radius=2,
        fill = True,
        fill_opacity = 0.3, # setting the inner circle opacity
        color = color_producer(storm_df.iloc[i]['DAMAGE_PROPERTY']),
        opacity = 0.4,
        bubbling_mouse_events = True,
        popup = storm_df.iloc[i]['EVENT_TYPE'] # setting up an info popup with the event type as its text
    ).add_to(storm_map)
storm_map
| Damage (D) | Color |
| --- | --- |
| D > 2200 | Red |
| 1000 < D <= 2200 | Orange |
| 500 < D <= 1000 | Yellow |
| 100 < D <= 500 | LightGreen |
| D <= 100 | DarkGreen |
Now let's turn to the temperature data and see if we can find a pattern in the U.S. average temperature over the years.
plt.figure(figsize = (40, 20))
sns.set(font_scale=2)
p = sns.lineplot(x="YEAR", y="AVG_TEMP",
data = combined_df, linewidth = 6)
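To back the visual impression with a number, we can also run a quick statistical check. This is a small addition to the analysis above: scipy's linregress (scipy is already one of our dependencies) estimates the slope of a linear trend in the yearly average temperature and a p-value for whether that slope differs from zero.

from scipy.stats import linregress

# keep one average-temperature value per year (the storm rows duplicate AVG_TEMP many times)
yearly = combined_df[['YEAR', 'AVG_TEMP']].copy()
yearly['AVG_TEMP'] = pd.to_numeric(yearly['AVG_TEMP'], errors='coerce')
yearly = yearly.dropna().drop_duplicates(subset=['YEAR']).sort_values('YEAR')

# slope of the linear trend and the p-value for a non-zero slope
trend = linregress(yearly['YEAR'].astype(float), yearly['AVG_TEMP'])
print('slope (degrees per year):', round(trend.slope, 4))
print('p-value for a non-zero trend:', trend.pvalue)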
Do you recall the pair plots we made previously?
Let's make another one to gain a better understanding of the correlation between temperature and deaths, injuries, and property damage.
To do so, let's make another data frame, since we are about to sum up the numeric values.
temp = df_fatality[['YEAR', 'STATE', 'DAMAGE_PROPERTY', 'INJURIES', 'DEATHS', 'AVG_TEMP']]
tempcol = temp[['YEAR', 'AVG_TEMP']]
del temp['AVG_TEMP']
temp = temp.groupby(by=['YEAR'], as_index=False).sum()
temp = pd.merge(temp, tempcol, on='YEAR', how='left')
temp = temp.drop_duplicates(subset=['YEAR'], keep='first').reset_index()
del temp['index']
temp.head(50)
And now let's plot the data by using the pairplot.
sns.set(font_scale=2.5)
sns.pairplot(temp.drop(columns=['YEAR']), height=10);
Now let's see how temperature has been affecting our lives over the years by relating it to the damage that storms bring us.
To do so, we will be using a scatter plot. As a means of showing all the data in one graph, we will vary the size and the color of each circle.
sns.set(style="white")
sns.relplot(x="YEAR", y="DAMAGE_PROPERTY", hue="AVG_TEMP",size = "DAMAGE_PROPERTY",
sizes=(10, 200), alpha=0.2,
height=11, data=combined_df,
legend = "brief");
# combine all property damage, injuries, and deaths grouped by year for better visualization of the annual data
dots2 = combined_df[['YEAR', 'DAMAGE_PROPERTY', 'INJURIES', 'DEATHS']].groupby(by=['YEAR'], as_index=False).sum()
dots2 = pd.merge(dots2, tempcol, on='YEAR', how='left').dropna(subset=['AVG_TEMP']).drop_duplicates().reset_index(drop=True)
dots2.head(10)
# get all fatalities data for further analysis
fdots2 = df_fatality[['YEAR', 'STATE', 'DAMAGE_PROPERTY', 'INJURIES', 'DEATHS', \
'FATALITY_AGE', 'FATALITY_SEX', 'FATALITY_LOCATION']]
fdots2.head(10)
def runML(X, Y, xlabel, ylabel, title):
# split the data into training/testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
# create linear regression object
model = linear_model.LinearRegression()
# train the model using the training sets
model.fit(X_train, Y_train)
# make predictions using the testing set
Y_pred = model.predict(X_test)
print('Ordinary Least Squares (OLS)')
print('Coefficients: ', model.coef_[0])
print('Intercept: ', model.intercept_)
print('Mean squared error: %.2f'
% mean_squared_error(Y_test, Y_pred))
print('Coefficient of determination: %.2f'
% r2_score(Y_test, Y_pred))
plt.scatter(X, Y, color='black')
plt.plot(X_test, Y_pred, color='blue', linewidth=3)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.title(title)
plt.show()
def runMLRANSAC(X, Y, xlabel, ylabel, title, num):
# Robustly fit linear model with RANSAC algorithm
ransac = linear_model.RANSACRegressor()
ransac.fit(X, Y)
inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)
Xin = X[inlier_mask]
Yin = Y[inlier_mask]
# split the data into training/testing sets
X_train, X_test, Y_train, Y_test = train_test_split(Xin, Yin, test_size=0.2)
# create linear regression object
model = linear_model.LinearRegression()
# train the model using the training sets
model.fit(X_train, Y_train)
# make predictions using the testing set
Y_pred = model.predict(X_test)
score = r2_score(Y_test, Y_pred)
if ((score < 0.95 or score > 1) and num < 250):
runMLRANSAC(X, Y, xlabel, ylabel, title, num+1)
else:
print('RANSAC')
print('Coefficients: ', model.coef_[0])
print('Intercept: ', model.intercept_)
print('Mean squared error: %.2f'
% mean_squared_error(Y_test, Y_pred))
print('Coefficient of determination: %.2f'
% r2_score(Y_test, Y_pred))
plt.scatter(X[inlier_mask], Y[inlier_mask], color='green', marker='.', s=200,
label='Inliers')
plt.scatter(X[outlier_mask], Y[outlier_mask], color='red', marker='.', s=200,
label='Outliers')
plt.plot(X_test, Y_pred, color='blue', linewidth=3, label='Line of Best Fit')
plt.legend(loc='lower right')
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.title(title)
plt.show()
def runOtherML(X, Y):
# split the data into training/testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)
# logistic regression
log_reg = linear_model.LogisticRegression()
log_reg.fit(X_train, Y_train)
lscore = log_reg.score(X_test, Y_test)
print("Logistic Regression Accuracy: ", lscore)
# SVM
svm = SVC(gamma='auto')
svm.fit(X_train, Y_train)
sscore = svm.score(X_test, Y_test)
print("SVM Accuracy: ", sscore)
# KNN
kfin = 0
kscore = 0
for k in range(1,21):
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, Y_train)
score = knn.score(X_test, Y_test)
if score > kscore:
kfin = k
kscore = score
print("K Nearest Neighbors Accuracy for k = ", k, ": ", score)
c1 = "red"
c2 = "green"
a1 = 0.55
a2 = 1
h = 0.5
# show figures
plt.barh(['Logistic Regression'], 1, color = c1, alpha = a1, height = h)
plt.barh(['SVM'], 1, color = c1, alpha = a1, height = h)
plt.barh(['KNN'], 1, color = c1, alpha = a1, height = h)
plt.barh(['Logistic Regression'], lscore, color = c2, alpha = a2, height = h)
plt.barh(['SVM'], sscore, color = c2, alpha = a2, height = h)
plt.barh(['KNN'], kscore, color = c2, alpha = a2, height = h)
plt.title("Model Accuracy")
X = dots2[['YEAR']]
Y = dots2[['AVG_TEMP']]
runML(X, Y, 'Year', 'Average Temperature', 'Temperature Changes Over Time')
runMLRANSAC(X, Y, 'Year', 'Average Temperature', 'Temperature Changes Over Time', 0)
X = dots2[['YEAR']]
Y = dots2[['DAMAGE_PROPERTY']]
runML(X, Y, 'Year', 'Property Damage', 'Property Damage Over Time')
runMLRANSAC(X, Y, 'Year', 'Property Damage', 'Property Damage Over Time', 0)
dots3 = dots2.sort_values(by=['DAMAGE_PROPERTY'], ascending=False)[2:]
X = dots3[['DAMAGE_PROPERTY']]
Y = dots3[['DEATHS']]
runML(X, Y, 'Damage Property', 'Deaths', 'Relationship Between Property Damage and Number of Deaths')
runMLRANSAC(X, Y, 'Damage Property', 'Deaths', 'Relationship Between Property Damage and Number of Deaths', 0)
dots4 = dots2.sort_values(by=['DAMAGE_PROPERTY'], ascending=False)[2:]
X = dots4[['DAMAGE_PROPERTY']]
Y = dots4[['INJURIES']]
runML(X, Y, 'Damage Property', 'Injuries', 'Relationship Between Property Damage and Number of Injuries')
runMLRANSAC(X, Y, 'Damage Property', 'Injuries', 'Relationship Between Property Damage and Number of Injuries', 0)
X = combined_df[['YEAR']]
Y = combined_df[['AVG_TEMP']]
runML(X, Y, 'Year', 'Average Temperature', 'Temperature Changes Over Time')
runMLRANSAC(X, Y, 'Year', 'Average Temperature', 'Temperature Changes Over Time', 0)
X = combined_df[['YEAR']]
Y = combined_df[['DAMAGE_PROPERTY']]
runML(X, Y, 'Year', 'Property Damage', 'Property Damage Over Time')
runMLRANSAC(X, Y, 'Year', 'Property Damage', 'Property Damage Over Time', 0)
df = combined_df[['YEAR', 'DAMAGE_PROPERTY', 'DEATHS']].drop_duplicates(subset=['YEAR', 'DAMAGE_PROPERTY'], keep='first')
df = df.loc[df['DEATHS'] <= 200]
X = df[['DAMAGE_PROPERTY']]
Y = df[['DEATHS']]
runML(X, Y, 'Damage Property', 'Deaths', 'Relationship Between Property Damage and Number of Deaths')
runOtherML(X, Y)
df = combined_df[['YEAR', 'DAMAGE_PROPERTY', 'INJURIES']].drop_duplicates(subset=['YEAR', 'DAMAGE_PROPERTY'], keep='first')
df = df.loc[df['INJURIES'] <= 1000]
X = df[['DAMAGE_PROPERTY']]
Y = df[['INJURIES']]
runML(X, Y, 'Damage Property', 'Injuries', 'Relationship Between Property Damage and Number of Injuries')
runOtherML(X, Y)