The Effect of Storms in the United States

Collaborators: Shahryar Shagoshtasbi, Bernard Bikki, Erfan Farhad Zadeh


Introduction

The main purpose of this project is to take you through the entire pipeline of data science. In order to do so, we have chosen to focus on how different types of storms in the U.S. affect people and the world. As you may have heard, in June 2017 resident Donald Trump announced that the U.S will not be a part of the Paris climate agreement anymore. Major decisions such as this, once more, brought up the conversations regarding climate change and its effects on our country. While many people are debating whether climate change is real or not, we have decided to further investigate this statement by diving into Data Science and providing some facts regarding climate change. In addition to that, we will focus on how people's lives are affected. In order to make it easier to follow, we have decided to only focus on the U.S temperatures and the different types of devastating storms that happened in the U.S.

Why is this important? Throughout time, we have seen many people's lives turn upside down after natural disasters. Storms cause many casualties and damages and it is important to try and predict how storms will affect people in the future to prepare ourselves for upcoming disasters. By looking at previous locations, we can analyze what cities are affected the most and allocate resources to those specific cities when a disaster is imminent. In addition, we need to look at average annual temperatures and try to figure out if there are any causal relationships between other variables.

Over this tutorial we will be going through the Data Science Lifecycle as following:

  1. Data Collection
  2. Data Processing
  3. Exploratory Analysis & Data Visualization
  4. Model: Analysis, Hypothesis Testing, & ML
  5. Interpretation: Insight & Policy Decision

Data Collection

At this stage of the Data Science life cycle, we will be looking for a dataset that is related to our topic. Since we are thinking about storms in the U.S we started searching for such a dataset. Also, just like any other scientific paper, we have to pay attention to the legitimacy of the resource. Luckily, we were able to find such a dataset on the National Centers for Environmental Information website.

Using the HTTP access, download every storm events details and fatalities file in .csv.gz format from 1950 to 2020. Store all the details files in a folder named "details" and all the fatalities files in a folder named "fatalities" in your project directory.

However, that is not all the information that we need since we want to find the relation between climate change and storms in the U.S. Since this data set does not include the temperatures over the years that we require, we need to be looking for another dataset. Keeping in mind that we need to get the data from a legitimate source, we were able to find the dataset we are looking for on the National Centers for Environmental Information website.

Select data for average temperatures (12-month scale) from 1950 - 2020 and click plot. After the data has been retrieved, click the excel icon to download the data in csv format. Make sure that the downloaded file (labeled "temperatures.csv") is in your project directory.


During this project, we will be using Python language, and we use tools such as iPython and Jupiter Notebook to develop this project. If you haven't heard about Jupiter notebooks before, make sure to learn more about them in here.

Just like any other Python project, we need to import some libraries. Here are some of the libraries we will be using throughout this tutorial.

Imports

In [1]:
import os
import folium
import warnings
import datetime
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.svm import SVC
from scipy.stats import norm
from sklearn import linear_model
from IPython.display import HTML
from folium.plugins import TimestampedGeoJson
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

One of the main libraries that we will be using throughout this project is Pandas. Pandas is an open-source data analysis tool that was built on top of the Python programming language and it is going to help us manipulate the data in an easy and flexible way. With the vast library of tools available, you can transform data very easily as you will see below.

Another library that helps maximize efficiency is NumPy. This library allows for easy computation for large datasets and it is another way to store and manipulate information.

Data Processing


❗️❗️❗️

NOTE: You only have to run this section once! It will automaticaly write the cleaned data to details.csv for future use


❗️❗️❗️

Go to your terminal and run the following command to generate the file for you w/o running multiple cells (RECOMMENDED):


python generate_combined_data.py

If you chose to generate the data file through jupyter notebook continue with the following commands. Otherwise, SKIP to the Exploratory Analysis & Data Visualization section.

In [73]:
# retrieve data starting from year (1950 + start_year), original value = 0
start_year = 0

# used to retrive top n rows, in this case, top 1000 rows.
n = 1000

# ignore warnings
warnings.filterwarnings('ignore')

Combine Details Data

In [3]:
def removeUnnecessaryColumnsBefore(df):
    # for the sake of the tutorial, remove combined date and time columns
    del df['BEGIN_DATE_TIME']
    del df['END_DATE_TIME']
    del df['MONTH_NAME']
    del df['SOURCE']
    del df['EPISODE_NARRATIVE']
    del df['EVENT_NARRATIVE']
    del df['DATA_SOURCE']
    del df['STATE_FIPS']
    del df['CZ_TYPE']
    del df['CZ_FIPS']
    del df['CZ_NAME']
    del df['WFO']
    del df['TOR_F_SCALE']
    del df['TOR_LENGTH']
    del df['TOR_WIDTH']
    del df['TOR_OTHER_WFO']
    del df['TOR_OTHER_CZ_STATE']
    del df['TOR_OTHER_CZ_FIPS']
    del df['TOR_OTHER_CZ_NAME']
    del df['BEGIN_RANGE']
    del df['BEGIN_AZIMUTH']
    del df['BEGIN_LOCATION']
    del df['END_RANGE']
    del df['END_AZIMUTH']
    del df['END_LOCATION']
    del df['CZ_TIMEZONE']
    del df['FLOOD_CAUSE']
    del df['CATEGORY']
    del df['DAMAGE_CROPS']
    del df['MAGNITUDE']
    del df['MAGNITUDE_TYPE']

    return df
In [4]:
def removeUnnecessaryColumnsAfter(df):
    del df['BEGIN_YEARMONTH']
    del df['BEGIN_DAY']
    del df['BEGIN_TIME']
    del df['END_YEARMONTH']
    del df['END_DAY']
    del df['END_TIME']
    del df['INJURIES_DIRECT']
    del df['INJURIES_INDIRECT']
    del df['DEATHS_DIRECT']
    del df['DEATHS_INDIRECT']

    return df
In [5]:
def getTopNRows(df, n):
    df = df.reset_index(drop=True).fillna("")

    # keep only the top n rows with the most deaths and convert text to float data for damage_property. 
    # also, don't get confused that mul has the same value as n, they are seperate variables
    for index, row in df.iterrows():
        x = str(df.iloc[index, 5])
        mul = 1000 # 
        if x.isdigit():
            mul = 1
        elif x == "":
            x = "0"
            mul = 1
        elif x[-1] == "K":
            x = x[:-1]
            mul = 1000
        elif x[-1] == "M":
            x = x[:-1]
            mul = 1000000
        elif x[-1] == "B":
            x = x[:-1]
            mul = 1000000000
        else:
            x = "0"

        if x == np.nan or x == "":
            df.at[index, 'DAMAGE_PROPERTY'] = 0
        else:
            df.at[index, 'DAMAGE_PROPERTY'] = float(x) * mul

    # combine injuries and deaths
    df['INJURIES'] = df['INJURIES_DIRECT'] + df['INJURIES_INDIRECT']
    df['DEATHS'] = df['DEATHS_DIRECT'] + df['DEATHS_INDIRECT']

    # sort by deaths first, then injuries, then damage property
    df = df.sort_values(by=['DAMAGE_PROPERTY'], ascending=False)
    df = df.sort_values(by=['INJURIES'], ascending=False)
    df = df.sort_values(by=['DEATHS'], ascending=False)

    # return top n rows
    df = df.head(n).reset_index(drop=True)
    
    return df
In [6]:
def modifyDates(df):
    for index, row in df.iterrows():
        bdate = str(df.iloc[index - 1]['BEGIN_YEARMONTH']) + str(df.iloc[index - 1]['BEGIN_DAY'])
        bdate = datetime.datetime.strptime(bdate, '%Y%m%d').date()

        edate = str(df.iloc[index - 1]['END_YEARMONTH']) + str(df.iloc[index - 1]['END_DAY'])
        edate = datetime.datetime.strptime(edate, '%Y%m%d').date()

        # get begin time
        btime = str(df.iloc[index - 1]['BEGIN_TIME'])
        btlen = len(btime)
        if btlen == 4:
            bhour = btime[:2]
            bmin = btime[2:]
        elif btlen == 3:
            bhour = btime[0]
            bmin = btime[1:]
        elif btlen == 2:
            bhour = "0"
            bmin = btime
        elif int(btime) == 0:
            bhour = "0"
            bmin = "0"
        else:
            bhour = "0"
            bmin = btime[0]
        btime = datetime.time(int(bhour), int(bmin))

        # get end time
        etime = str(df.iloc[index - 1]['END_TIME'])
        etlen = len(etime)
        if etlen == 4:
            ehour = etime[:2]
            emin = etime[2:]
        elif etlen == 3:
            ehour = etime[0]
            emin = etime[1:]
        elif etlen == 2:
            ehour = "0"
            emin = etime
        elif int(etime) == 0:
            ehour = "0"
            emin = "0"
        else:
            ehour = "0"
            emin = etime[0]
        etime = datetime.time(int(ehour), int(emin))

        # combine begin/end date and time objects
        df.at[index, 'BEGIN_DATE_TIME'] = datetime.datetime.combine(bdate, btime)
        df.at[index, 'END_DATE_TIME'] = datetime.datetime.combine(edate, etime)
        
    return df
In [7]:
path = './details/'
details = pd.DataFrame()

# use for loop to iterate through each individual "details" file and combine into one big file
c = 0
for file in os.listdir(path):
    if file.endswith(".gz"):
        if c >= start_year:
            temp_df = pd.read_csv(path + file, compression='gzip', low_memory=False)
            
            # remove initial columns
            temp_df = removeUnnecessaryColumnsBefore(temp_df)
            
            # since there is an excessive amount of data, we are going to gather the top n rows of data for each year
            # order by damage_property in descending order (take rows with most damage)
            temp_df = getTopNRows(temp_df, n)
            
            # format dates
            temp_df = modifyDates(temp_df)
            
            # add to the combined data frame which will later be written to a file.
            details = details.append(temp_df)
        c = c + 1

# filter out columns
details = removeUnnecessaryColumnsAfter(details)

details.head(10)





Combine Fatalities Data

In [8]:
def removeUnnecessaryColumnsFatalities(df):
    del df['FAT_YEARMONTH']
    del df['FAT_DAY']
    del df['FAT_TIME']
    del df['FATALITY_TYPE']
    del df['EVENT_YEARMONTH']
    del df['FATALITY_ID']
    del df['FATALITY_DATE']

    return df
In [9]:
path = './fatalities/'
fatalities = pd.DataFrame()

# use for loop to iterate through each individual "fatalities" file and combine into one big file
c = 0
for file in os.listdir(path):
    if file.endswith(".gz"):
        if c >= start_year:
            temp_df = pd.read_csv(path + file, compression='gzip', low_memory=False)
            
            # format dates
            temp_df = modifyDatesFatalities(temp_df)
            
            # filter out columns
            temp_df = removeUnnecessaryColumnsFatalities(temp_df)
            
            # add to the combined data frame which will later be written to a file
            fatalities = fatalities.append(temp_df)
        c = c + 1

# merge details and fatalities
details = pd.merge(details, fatalities, on='EVENT_ID', how='left')
details = details.reset_index(drop=True).fillna("")

details.head(10)

Modify Temperatures Data

In [10]:
temperatures = pd.read_csv('temperatures.csv', low_memory=False)
temperatures = temperatures.iloc[4:].reset_index(drop=True).fillna("")
temperatures.columns = ['YEAR', 'AVG_TEMP', 'ANOMALY']

for index, row in temperatures.iterrows():
    yearmonth = str(temperatures.iloc[index]['YEAR'])
    year = yearmonth[0:4]
    temperatures.at[index, 'YEAR'] = int(year)

# merge details and temperatures
details = pd.merge(details, temperatures, on='YEAR', how='left')
details = details.reset_index(drop=True).fillna("")

# remove anomaly and episode id columns
del details['ANOMALY']
del details['EPISODE_ID']

# store modified data
details.to_csv('details.csv', index=False)

# final cleaned dataframe
details.head(10)

Before you start the data analysis, choose how you want to modify the cleaned data for certain problems


It seems like our data has missing values like NaNs. There are many ways to handle missing data however one of the methods is to drop all missing data entries.

The other extreme is to create a model to sample and fill in the data.

Ex: Drop all rows with NaN first in specific columns of use. This is not done in preprocessing because some of the rows that would be removed when filtering one attribute might be useful when analyzing another attribute.

combined_df.dropna(subset=['FATALITY_AGE'], inplace=True)
combined_df.dropna(subset=['FATALITY_SEX'], inplace=True)
combined_df.dropna(subset=['FATALITY_LOCATION'], inplace=True)


Ex: Drop all duplicates to work with any attribute other than fatalities (age, sex, location)

combined_df = combined_df.drop_duplicates(subset='EVENT_ID', keep="first")

In [2]:
# reading from file assuming completion of the data processing step ONCE!
# (running the data processing step more than once is just a waste of computation time)
combined_df = pd.read_csv("details.csv", low_memory= False)
In [3]:
df_fatality = combined_df.dropna(subset=['FATALITY_AGE'])
df_fatality = df_fatality.dropna(subset=['FATALITY_SEX'])
df_fatality = df_fatality.dropna(subset=['FATALITY_LOCATION'])
df_fatality.head(10)
Out[3]:
EVENT_ID STATE YEAR EVENT_TYPE DAMAGE_PROPERTY BEGIN_LAT BEGIN_LON END_LAT END_LON INJURIES DEATHS BEGIN_DATE_TIME END_DATE_TIME FATALITY_AGE FATALITY_SEX FATALITY_LOCATION AVG_TEMP
18866 990000003 FLORIDA 1972 Tornado 2301.0 27.21 -80.80 27.27 -80.80 44 6 1972-06-18 22:55:00 1972-06-18 23:01:00 16.0 M Mobile/Trailer Home 51.70
18867 990000003 FLORIDA 1972 Tornado 2301.0 27.21 -80.80 27.27 -80.80 44 6 1972-06-18 22:55:00 1972-06-18 23:01:00 60.0 F Mobile/Trailer Home 51.70
18868 990000003 FLORIDA 1972 Tornado 2301.0 27.21 -80.80 27.27 -80.80 44 6 1972-06-18 22:55:00 1972-06-18 23:01:00 49.0 M Mobile/Trailer Home 51.70
18869 990000003 FLORIDA 1972 Tornado 2301.0 27.21 -80.80 27.27 -80.80 44 6 1972-06-18 22:55:00 1972-06-18 23:01:00 66.0 M Mobile/Trailer Home 51.70
18870 990000003 FLORIDA 1972 Tornado 2301.0 27.21 -80.80 27.27 -80.80 44 6 1972-06-18 22:55:00 1972-06-18 23:01:00 59.0 F Mobile/Trailer Home 51.70
18880 990000002 FLORIDA 1972 Tornado 1517.0 26.73 -81.47 26.77 -81.48 1 1 1972-06-18 15:13:00 1972-06-18 15:17:00 30.0 F Mobile/Trailer Home 51.70
42870 5558048 OREGON 1996 Flood 0.0 NaN NaN NaN NaN 0 7 1996-02-06 00:00:00 1996-02-15 00:00:00 8.0 F Outside/Open Areas 52.37
42871 5558048 OREGON 1996 Flood 0.0 NaN NaN NaN NaN 0 7 1996-02-06 00:00:00 1996-02-15 00:00:00 62.0 F Permanent Home 52.37
42872 5558048 OREGON 1996 Flood 0.0 NaN NaN NaN NaN 0 7 1996-02-06 00:00:00 1996-02-15 00:00:00 84.0 F Vehicle/Towed Trailer 52.37
42873 5558048 OREGON 1996 Flood 0.0 NaN NaN NaN NaN 0 7 1996-02-06 00:00:00 1996-02-15 00:00:00 62.0 F Vehicle/Towed Trailer 52.37
In [4]:
combined_df = combined_df.drop_duplicates(subset='EVENT_ID', keep="first")
combined_df.head(10)
Out[4]:
EVENT_ID STATE YEAR EVENT_TYPE DAMAGE_PROPERTY BEGIN_LAT BEGIN_LON END_LAT END_LON INJURIES DEATHS BEGIN_DATE_TIME END_DATE_TIME FATALITY_AGE FATALITY_SEX FATALITY_LOCATION AVG_TEMP
0 10032628 LOUISIANA 1950 Tornado 1420.0 32.47 -93.70 32.85 -93.43 37 9 1950-02-12 14:20:00 1950-02-12 14:20:00 NaN NaN NaN 52.4
1 10032626 LOUISIANA 1950 Tornado 1400.0 32.35 -93.77 32.47 -93.70 40 9 1950-02-12 14:00:00 1950-02-12 14:00:00 NaN NaN NaN 52.4
2 10126027 TENNESSEE 1950 Tornado 200.0 35.75 -89.48 NaN NaN 1 9 1950-02-13 02:00:00 1950-02-13 02:00:00 NaN NaN NaN 52.4
3 10032627 LOUISIANA 1950 Tornado 1420.0 32.80 -93.23 32.97 -93.17 10 5 1950-02-12 14:20:00 1950-02-12 14:20:00 NaN NaN NaN 52.4
4 10120411 TEXAS 1950 Tornado 1800.0 32.42 -99.50 32.42 -99.48 5 5 1950-04-28 18:00:00 1950-04-28 18:00:00 NaN NaN NaN 52.4
5 10096223 OKLAHOMA 1950 Tornado 1905.0 35.08 -96.40 35.13 -96.35 32 5 1950-04-28 19:05:00 1950-04-28 19:05:00 NaN NaN NaN 52.4
6 10032625 LOUISIANA 1950 Tornado 1400.0 31.63 -93.65 31.83 -93.47 25 5 1950-02-12 14:00:00 1950-02-12 14:00:00 NaN NaN NaN 52.4
7 10049525 MISSISSIPPI 1950 Tornado 1200.0 34.60 -89.12 NaN NaN 2 3 1950-02-12 12:00:00 1950-02-12 12:00:00 NaN NaN NaN 52.4
8 10120410 TEXAS 1950 Tornado 1200.0 31.80 -94.20 31.80 -94.18 15 3 1950-02-12 12:00:00 1950-02-12 12:00:00 NaN NaN NaN 52.4
9 10146804 WISCONSIN 1950 Tornado 2100.0 45.58 -89.58 45.67 -89.33 12 2 1950-06-25 21:00:00 1950-06-25 21:00:00 NaN NaN NaN 52.4

Exploratory Analysis & Data Visualization

In this section of the data science life cycle, we are going to graph the data in order to gain a better understanding of the data. Also, we attempt to perform statistical analyses in this section to gain mathematical evidence for the trends we may discover. In other words, as the title is indicating, we are going to further explore the data.

As our first step let's dive into the TYPES of the storms. To do that, first, let's figure out what type of events are we dealing with in our dataset.
In [5]:
# getting differnt types of the storms
combined_df.EVENT_TYPE.unique()
Out[5]:
array(['Tornado', 'Thunderstorm Wind', 'Hail',
       'TORNADOES, TSTM WIND, HAIL', 'Flood', 'Flash Flood', 'Heat',
       'Strong Wind', 'Heavy Rain', 'Cold/Wind Chill', 'Blizzard',
       'Hurricane (Typhoon)', 'Ice Storm', 'Winter Storm', 'High Wind',
       'Lightning', 'Rip Current', 'Avalanche', 'Coastal Flood',
       'High Surf', 'Heavy Snow', 'Winter Weather', 'Tropical Storm',
       'Debris Flow', 'Freezing Fog', 'Dust Devil', 'Dense Fog',
       'Marine High Wind', 'Drought', 'Wildfire', 'Frost/Freeze', 'Sleet',
       'Lake-Effect Snow', 'Dust Storm', 'Storm Surge/Tide',
       'Extreme Cold/Wind Chill', 'Waterspout', 'Excessive Heat',
       'Marine Thunderstorm Wind', 'Marine Strong Wind', 'Landslide',
       'Hurricane', 'Tropical Depression', 'Tsunami', 'Lakeshore Flood',
       'Sneakerwave', 'Dense Smoke', 'Astronomical Low Tide',
       'Marine Dense Fog'], dtype=object)

After that, let's figure out what types of storms are hitting the U.S. the most. As a result, follow the code below to count the occurrence of each event type in the entire dataset.

In [6]:
# add a new column with the total number of same events (# of hurricanes, tornadoes, etc.) respective to the event type
result = combined_df.groupby('EVENT_TYPE').first()
result['COUNT'] = combined_df['EVENT_TYPE'].value_counts()
result.reset_index(inplace=True)
result = result[['EVENT_TYPE','COUNT']]

# print the counts of each event type
print(result)
                    EVENT_TYPE  COUNT
0        Astronomical Low Tide      1
1                    Avalanche    302
2                     Blizzard    208
3                Coastal Flood     37
4              Cold/Wind Chill    855
5                  Debris Flow     31
6                    Dense Fog    153
7                  Dense Smoke      2
8                      Drought   9364
9                   Dust Devil      4
10                  Dust Storm     35
11              Excessive Heat    441
12     Extreme Cold/Wind Chill    424
13                 Flash Flood   1163
14                       Flood   2041
15                Freezing Fog     17
16                Frost/Freeze     14
17                        Hail  13663
18                        Heat   1862
19                  Heavy Rain    394
20                  Heavy Snow    488
21                   High Surf    304
22                   High Wind    511
23                   Hurricane     30
24         Hurricane (Typhoon)    101
25                   Ice Storm    176
26            Lake-Effect Snow     20
27             Lakeshore Flood     12
28                   Landslide      3
29                   Lightning    856
30            Marine Dense Fog      3
31            Marine High Wind      6
32          Marine Strong Wind     40
33    Marine Thunderstorm Wind     45
34                 Rip Current    903
35                       Sleet     20
36                 Sneakerwave     16
37            Storm Surge/Tide     18
38                 Strong Wind    335
39  TORNADOES, TSTM WIND, HAIL      1
40           Thunderstorm Wind  20366
41                     Tornado  10273
42         Tropical Depression     11
43              Tropical Storm    126
44                     Tsunami      2
45                  Waterspout      2
46                    Wildfire    367
47                Winter Storm    888
48              Winter Weather    931




Now that we know the number of occurence for each of the event types, let's put it inside the data frame as a new column.


❗️❗️❗️

NOTE: Only run this code block once to add counts column to the dataframe, otherwise, it will keep adding a column to the end of the dataframe


❗️❗️❗️


In [42]:
combined_df = pd.merge(combined_df, result, on='EVENT_TYPE', how='left')
combined_df.head(10)
Out[42]:
EVENT_ID STATE YEAR EVENT_TYPE DAMAGE_PROPERTY BEGIN_LAT BEGIN_LON END_LAT END_LON INJURIES DEATHS BEGIN_DATE_TIME END_DATE_TIME FATALITY_AGE FATALITY_SEX FATALITY_LOCATION AVG_TEMP COUNT_x EVENT_TYPE_MODIFIED COUNT_y
0 10032628 LOUISIANA 1950 Tornado 1420.0 32.47 -93.70 32.85 -93.43 37 9 1950-02-12 14:20:00 1950-02-12 14:20:00 NaN NaN NaN 52.4 10273 Tornado 10273
1 10032626 LOUISIANA 1950 Tornado 1400.0 32.35 -93.77 32.47 -93.70 40 9 1950-02-12 14:00:00 1950-02-12 14:00:00 NaN NaN NaN 52.4 10273 Tornado 10273
2 10126027 TENNESSEE 1950 Tornado 200.0 35.75 -89.48 NaN NaN 1 9 1950-02-13 02:00:00 1950-02-13 02:00:00 NaN NaN NaN 52.4 10273 Tornado 10273
3 10032627 LOUISIANA 1950 Tornado 1420.0 32.80 -93.23 32.97 -93.17 10 5 1950-02-12 14:20:00 1950-02-12 14:20:00 NaN NaN NaN 52.4 10273 Tornado 10273
4 10120411 TEXAS 1950 Tornado 1800.0 32.42 -99.50 32.42 -99.48 5 5 1950-04-28 18:00:00 1950-04-28 18:00:00 NaN NaN NaN 52.4 10273 Tornado 10273
5 10096223 OKLAHOMA 1950 Tornado 1905.0 35.08 -96.40 35.13 -96.35 32 5 1950-04-28 19:05:00 1950-04-28 19:05:00 NaN NaN NaN 52.4 10273 Tornado 10273
6 10032625 LOUISIANA 1950 Tornado 1400.0 31.63 -93.65 31.83 -93.47 25 5 1950-02-12 14:00:00 1950-02-12 14:00:00 NaN NaN NaN 52.4 10273 Tornado 10273
7 10049525 MISSISSIPPI 1950 Tornado 1200.0 34.60 -89.12 NaN NaN 2 3 1950-02-12 12:00:00 1950-02-12 12:00:00 NaN NaN NaN 52.4 10273 Tornado 10273
8 10120410 TEXAS 1950 Tornado 1200.0 31.80 -94.20 31.80 -94.18 15 3 1950-02-12 12:00:00 1950-02-12 12:00:00 NaN NaN NaN 52.4 10273 Tornado 10273
9 10146804 WISCONSIN 1950 Tornado 2100.0 45.58 -89.58 45.67 -89.33 12 2 1950-06-25 21:00:00 1950-06-25 21:00:00 NaN NaN NaN 52.4 10273 Tornado 10273

That looks GOOD 🏆

But in order to get a better understanding, why don't we use the power of visualization?!

For that purpose, we are going to mainly use the Mathplotlib or Seaborn library throughout the rest of this tutorial. Please feel free to learn more about them through the provided links as they are some of the greatest GIFTS to humankind.



So, let's start by graphing the number of occurrences for each event type.

In [8]:
plt.figure(figsize = (30, 10))
plt.title('Total Number of Event Type Occurences')
sns.set(font_scale=2.4)
count_bar = sns.countplot(x = 'EVENT_TYPE', data = combined_df,\
                          # order the events based on the number of occurences
                          order = combined_df['EVENT_TYPE'].value_counts().index)
count_bar.set_xticklabels(count_bar.get_xticklabels(), rotation=40, ha="right");
count_bar.set(xlabel='Event Type', ylabel='Number of Occurences');

As you may see, we have some events such as Thunderstorm Wind that had more than 20k occurrences whereas the Astronomical Low Tide (who knows what that is!), had almost no occurrences. As a result, to make the progress easier, let's take a look at the TOP 10 event types that occurred, and rename the rest of the event types to Other, by running the following code.

In [9]:
def label_event_type (row):
   event = row['EVENT_TYPE']
   if event in ['Thunderstorm Wind', 'Hail', 'Tornado', 'Drought', 'Flood', 'Heat', \
                'Flash Flood', 'Winter Weather', 'Rip Current', 'Winter Storm']:
      return event
   return 'Other'

Good ✔,

Now let's try to visualize it again, this time, using another graph named pie plot.

Pie plot helps us to see each event type portion in comparison to the other event types.

In [10]:
combined_df['EVENT_TYPE_MODIFIED'] = combined_df.apply (lambda row: label_event_type(row), axis=1)

df_numbers_of_types = combined_df.groupby('EVENT_TYPE_MODIFIED')['EVENT_ID'].nunique()
df_numbers_of_types.sort_values(ascending=False)
label = list(map(str, df_numbers_of_types.keys()))

plt.figure(figsize = (14, 14))
plt.pie(df_numbers_of_types, labels = label, autopct = '%1.1f%%', textprops={'fontsize': 16, 'fontweight': "600"})

plt.title('Event Types Based on Occurence Percentage')
plt.show() 

So as we can easily understand, that the Thunderstorm Wind has the biggest portion in the entire pie plot followed by Hail and Tornado.



That being mentioned, unfortunately, these events usually cost us a lot. In many situations, the cost is human lives!

Since ThunderStorm Wind had the highest portion in our pie plot, we think that it should be the first cause of fatalities among other event types. In order to find out, let's make another graph based on each event type and the correlated number of deaths associated with it.

In [12]:
# plot for showing total deaths for each kind of event
total_death_in_event=combined_df.groupby('EVENT_TYPE')['DEATHS'].nunique().sort_values(ascending = False)
plt.figure(figsize=(30,8))
sns.set(font_scale=1.2)
ax = total_death_in_event.plot(kind='bar', \
                          ylabel = 'Number of Deaths', xlabel = 'Event Type' ,\
                          title = 'Total Deaths Based on Event Type')
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right");

As you can see in the above graph, surprisingly, Thunderstorm Wind was NOT the leading cause of death among other types of events. However, Tornado seems to be to most damaging event type when it comes to human lives.

Now let's see the correlation between the type of each event with the number of injured people.

In [14]:
# plot for showing total deaths for each kind of event
total_injury_in_event=combined_df.groupby('EVENT_TYPE')['INJURIES'].nunique().sort_values(ascending = False)
plt.figure(figsize=(30,8))
sns.set(font_scale=1.2)
ax = total_injury_in_event.plot(kind='bar', \
                          ylabel = 'Number of Injuries', xlabel = 'Event Type' ,\
                          title = 'Total Injuries Based on Event Type')
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right");
ax.set_ylim(0, 180); 

Now, let's find out if there is a specific trend in the number of people who passed away over the years.

In [15]:
# number of deaths per year
total_death_in_year=combined_df.groupby('YEAR')['DEATHS'].nunique()
plt.figure(figsize=(30,8))
sns.set(font_scale=1.5)
plt.title('Total Deaths Per Year')
plt.xlabel('Year', fontsize = 24)
plt.ylabel('Number of Fatalities', fontsize = 24)
total_death_in_year.plot(kind='bar', fontsize = 16);

As we can see, there is no specific pattern over the years for the number of people who passed away, but in general, the average of people who died has increased after the 1990s.

Now, let's find out if there is a specific trend in the number of people who got injured away over the years.

In [16]:
# number of injuries per year
total_injuries_in_year=combined_df.groupby('YEAR')['INJURIES'].nunique()
plt.figure(figsize=(30,8))
sns.set(font_scale=1.5)
plt.title('Total Injuries Per Year')
plt.xlabel('Year', fontsize = 24)
plt.ylabel('Number of Injuries', fontsize = 24)
total_injuries_in_year.plot(kind='bar', fontsize = 16);

After visualizing the total injuries over the years, it doesn't seem that there is a specific trend for it.

Since we are analyzing the United States Storms, let's find out how bad storms are hitting each state.



In order to do so, we are going to analyze each state's Damage Property, Injuries, and Deaths. However, to feed the data for visualization, we need to sum up the numbers based on each state. That is why we are going to use an amazing tool named groupby which helps us to converge the numbers based on each state. We are doing to make a new data frame for this purpose. We are going to use this data frame for the next few steps.

In [17]:
dots = combined_df[['STATE', 'DAMAGE_PROPERTY', 'INJURIES', 'DEATHS']].groupby(by=['STATE'], as_index=False).sum()
dots.head(10)
Out[17]:
STATE DAMAGE_PROPERTY INJURIES DEATHS
0 ALABAMA 2724723.0 6468 920
1 ALASKA 497224.0 20 88
2 AMERICAN SAMOA 40374.0 156 48
3 ARIZONA 1368400.0 342 612
4 ARKANSAS 4341848.0 4094 640
5 ATLANTIC NORTH 35866.0 17 29
6 ATLANTIC SOUTH 29568.0 9 25
7 CALIFORNIA 1585115.0 1214 985
8 COLORADO 3584930.0 820 261
9 CONNECTICUT 202652.0 604 60
In [18]:
sns.set_theme(style="whitegrid")

# Make the PairGrid
g = sns.PairGrid(dots.sort_values("DAMAGE_PROPERTY", ascending=False),
                 x_vars=dots.columns[1:4], y_vars=["STATE"],
                 height=20, aspect=.25)

# Draw a dot plot using the stripplot function
g.map(sns.stripplot, size=10, orient="h",
      palette="flare_r", linewidth=1, edgecolor="w")

g.set(ylabel="State")

# Use semantically meaningful titles for the columns
titles = ["Damage Property", "Injuries", "Deaths"]

for ax, title in zip(g.axes.flat, titles):

    # Set a different title for each axes
    ax.set(xlabel=title)

    # Make the grid horizontal instead of vertical
    ax.xaxis.grid(False)
    ax.yaxis.grid(True)

sns.despine(left=True, bottom=True)

As we can see, we got a winner, TEXAS has won the championship 🏆


Unfortunatley, Texas has been hit by storms much more than the other sates. As a result, it is leading in the number of deaths and Injuries as well.

Now we have visualized the Death and Injuries related data and we have gained a better understanding. However, if we can put it together, it may help us to understand the data better. That is why we are going to use an amazing plot from Seaborn library named Pairplot.

In [21]:
sns.set(font_scale=2.5)
sns.pairplot(dots, height=10);

As you can see, there is a positive, linear, skewed relationship between most of these variables in the scatterplots

Now, we are curious to find out if there is a relation between the Age/Gender of people who either passed away or got injured. In order to do so, we are going to make a new data frame. Moreover, for future use, we will add the Location to this new data frame as well.

In [22]:
fdots = df_fatality[['STATE', 'DAMAGE_PROPERTY', 'INJURIES', 'DEATHS', 'FATALITY_AGE', 'FATALITY_SEX', 'FATALITY_LOCATION']]
fdots.head(10)
Out[22]:
STATE DAMAGE_PROPERTY INJURIES DEATHS FATALITY_AGE FATALITY_SEX FATALITY_LOCATION
18866 FLORIDA 2301.0 44 6 16.0 M Mobile/Trailer Home
18867 FLORIDA 2301.0 44 6 60.0 F Mobile/Trailer Home
18868 FLORIDA 2301.0 44 6 49.0 M Mobile/Trailer Home
18869 FLORIDA 2301.0 44 6 66.0 M Mobile/Trailer Home
18870 FLORIDA 2301.0 44 6 59.0 F Mobile/Trailer Home
18880 FLORIDA 1517.0 1 1 30.0 F Mobile/Trailer Home
42870 OREGON 0.0 0 7 8.0 F Outside/Open Areas
42871 OREGON 0.0 0 7 62.0 F Permanent Home
42872 OREGON 0.0 0 7 84.0 F Vehicle/Towed Trailer
42873 OREGON 0.0 0 7 62.0 F Vehicle/Towed Trailer

After making the data frame, let's find out if there is any trend between the number of people who got injured or passed away with their age. In order to do so, let's draw a line plot that can handle all of the aforementioned data so we can gain a better understanding.

In [24]:
sns.set(font_scale=2.5)
sns.set(rc={'figure.figsize':(30,12)})
plt.title('Injuries / Deaths VS. Age')
sns.lineplot(data=fdots, x="FATALITY_AGE", y="INJURIES")
sns.lineplot(data=fdots, x="FATALITY_AGE", y="DEATHS")
plt.xlabel("Age")
plt.ylabel("Number of Injuries / Deaths")
plt.legend(labels=['Injuries', 'Deaths']);

As can be seen, there is no surprise that the elderly are being hit the most across all ages when it comes to storms.

Now let's see if we can find any trend on the gender and the number of injuries. In order to make it more interesting, let's add the location to it as well, and see if we can find a trend.

In [25]:
loc = sns.barplot(data=fdots, x="FATALITY_LOCATION", y="INJURIES", hue='FATALITY_SEX', ci=None)
sns.set(font_scale=2.5)
sns.set(rc={'figure.figsize':(30,8)})
plt.xlabel("Location")
plt.ylabel("Number of Injuries")
plt.title("The Relation Between Location, Gender, and Number of Injured")
loc.set_xticklabels(loc.get_xticklabels(), rotation=40, ha="right");



As can be seen, the Top 3 places with the highest rate of injuries are Churches, Permanent Structures, and Long Span Roof. Interestingly, when it comes to gender, women tend to get injured more than men.


And now, let's visualize the same data, this time regarding the number of deaths.

In [26]:
loc = sns.barplot(data=fdots, x="FATALITY_LOCATION", y="DEATHS", hue='FATALITY_SEX', ci=None)
sns.set(font_scale=2.5)
sns.set(rc={'figure.figsize':(30,8)})
plt.xlabel("Location")
plt.ylabel("Number of Deaths")
plt.title("The Relation Between Location, Gender, and Number of Deaths")
loc.set_xticklabels(loc.get_xticklabels(), rotation=40, ha="right");

It can be seen that in general, the number of casualties is higher in women. Just like the number of injuries the Top 3 places that cause the most deaths after others are Churches, Permanent Structures, and Long Span Roofs.

Talking about the Trend between the genders, let's see if we can find a trend between the number of injuries over the years per gender.

In [27]:
loc = sns.barplot(data=df_fatality, x="YEAR", y="INJURIES", hue='FATALITY_SEX', ci=None)
sns.set(font_scale=2.5)
sns.set(rc={'figure.figsize':(30,8)})
plt.xlabel("Year")
plt.ylabel("Number of Injuries")
plt.title("Number of Injured Over the Years per Gender")
loc.set_xticklabels(loc.get_xticklabels(), rotation=40, ha="right");

We can see that women got injured almost twice as men in 2011's storms.

Now let's see if the same thing happens regarding the number of people who passed away.

In [28]:
loc = sns.barplot(data=df_fatality, x="YEAR", y="DEATHS", hue='FATALITY_SEX', ci=None)
sns.set(rc={'figure.figsize':(30,8)})
plt.xlabel("Year")
plt.ylabel("Number of Deaths")
plt.title("Number of Deaths Over the Years per Gender")
loc.set_xticklabels(loc.get_xticklabels(), rotation=40, ha="right");

Interestingly we can see that the same trend happened in 2011 regarding women being the leader in the number of deaths. However, in 2005 more people passed away rather than getting injured.

In general, while there isn't a specific trend among the genders over the years, it can be understood that women have been affected more than men in both the rate of Injuries and Death.

Now, let's see which age range has been affected more than others under the general term of fatalities, which will represent both injuries and deaths.

This time let's use another graph from the Seaborn library called Displot. This graph gives a good representation of the distribution of the population that we are studying.

In [30]:
sns.displot(combined_df['FATALITY_AGE'], height = 12)
plt.title("Distribution of Age");

MAP


Now, you may need even a better visualization to understand how the storms are hitting the U.S. Luckily, there are libraries that will help us implement an interactive map that we can show our data on it. Two of the most used libraries are Folium and Ipyleaflet.


Here, we are going to use the Folium package.


Let's start by making another data frame as we are going to apply some changes.

In [32]:
# copying the data frame into a new data frame
storm_df= combined_df.copy()

# dropping the rows that has a NAN value to be able to map them
storm_df.dropna(subset=['END_LAT','END_LON'], inplace = True)
storm_df.head()
Out[32]:
EVENT_ID STATE YEAR EVENT_TYPE DAMAGE_PROPERTY BEGIN_LAT BEGIN_LON END_LAT END_LON INJURIES DEATHS BEGIN_DATE_TIME END_DATE_TIME FATALITY_AGE FATALITY_SEX FATALITY_LOCATION AVG_TEMP COUNT EVENT_TYPE_MODIFIED
0 10032628 LOUISIANA 1950 Tornado 1420.0 32.47 -93.70 32.85 -93.43 37 9 1950-02-12 14:20:00 1950-02-12 14:20:00 NaN NaN NaN 52.4 10273 Tornado
1 10032626 LOUISIANA 1950 Tornado 1400.0 32.35 -93.77 32.47 -93.70 40 9 1950-02-12 14:00:00 1950-02-12 14:00:00 NaN NaN NaN 52.4 10273 Tornado
3 10032627 LOUISIANA 1950 Tornado 1420.0 32.80 -93.23 32.97 -93.17 10 5 1950-02-12 14:20:00 1950-02-12 14:20:00 NaN NaN NaN 52.4 10273 Tornado
4 10120411 TEXAS 1950 Tornado 1800.0 32.42 -99.50 32.42 -99.48 5 5 1950-04-28 18:00:00 1950-04-28 18:00:00 NaN NaN NaN 52.4 10273 Tornado
5 10096223 OKLAHOMA 1950 Tornado 1905.0 35.08 -96.40 35.13 -96.35 32 5 1950-04-28 19:05:00 1950-04-28 19:05:00 NaN NaN NaN 52.4 10273 Tornado
In [33]:
# implementing the interactive map using the Folium Package

# setting the starting latitude, longitude and zoom.
storm_map = folium.Map(location=[37.0902, -95.7129], zoom_start=4.45, tiles='Stamen Terrain')

# this funciton returns a color based on the value of damage property
def color_producer(damage):
    if damage > 2200:
        return 'red'    
    elif 1000 < damage <= 2200 :
        return 'orange'
    elif 500 < damage <= 1000 :
        return 'yellow'
    elif 100< damage <= 500:
        return 'lightgreen'
    else:
        return 'darkgreen'

After that, we need to loop through our data set in order to get the required data such as:

  • Location
    • Latitude
    • Longitude
  • Damage Property
  • Event type

During this process, we will add each marker to the map by using the CircleMarker and add_to() from the Folium package.

In [34]:
# looping throught the data frame to get the data requiered for markers
for i in range(0,len(storm_df)):
    folium.CircleMarker(
        location=[storm_df.iloc[i]['END_LAT'], storm_df.iloc[i]['END_LON']],
        radius=2,
        fill = True,
        fill_opacity = 0.3, # Setting the inner circle opacity
        color = color_producer(storm_df.iloc[i]['DAMAGE_PROPERTY']),
        opacity = 0.4,
        bubling_mouse_events = True,
        popup= [storm_df.iloc[i]['EVENT_TYPE']] # Setting up an info pop up with Event type as it's info
    ).add_to(storm_map)

Things about to get even coooler, SOOOOOO, drumrolls please 😎



And now, all we have to do is see how the makers will be placed on the map.

In [90]:
storm_map
Out[90]:
Make this Notebook Trusted to load map: File -> Trust Notebook
Damage Color
D > 2200 Red
1000 < D < 2200 Orange
500 < D < 1000 Yellow
100 < D < 500 LightGreen
D < 100 DarkGreen

By having a glance at the map we can see that the East coast is experiencing more devastating storms than the West coast and mid-states. It also can be seen how different range of storms caused a lot of damage.

What About the Temperature?


Let's start by analyzing the data related to the temperature and see if we can find a trend in it.

In order to do so, let's see if we can find a pattern in the U.S average temperature over the years.

In [37]:
plt.figure(figsize = (40, 20))
sns.set(font_scale=2)
p = sns.lineplot(x="YEAR", y="AVG_TEMP", 
             data = combined_df, linewidth = 6)

It can easily be seen that the Average Temperature has an increasing trend over the years. In other words, we are experiencing hotter summers and fading winters over the years.

Do you recall pair plots that we did previously?

Let's have another one of those to gain a better understanding regarding the correlation between the Temperature and Deaths, Injuries, and Damage Property.

In order to do so, let's make another data frame since we are about to apply a summation on the numeric values.

In [38]:
temp = df_fatality[['YEAR', 'STATE', 'DAMAGE_PROPERTY', 'INJURIES', 'DEATHS', 'AVG_TEMP']]
tempcol = temp[['YEAR', 'AVG_TEMP']]
del temp['AVG_TEMP']
temp = temp.groupby(by=['YEAR'], as_index=False).sum()
temp = pd.merge(temp, tempcol, on='YEAR', how='left')
temp = temp.drop_duplicates(subset=['YEAR'], keep='first').reset_index()
del temp['index']
temp.head(50)
Out[38]:
YEAR DAMAGE_PROPERTY INJURIES DEATHS AVG_TEMP
0 1972 13022.0 221 31 51.70
1 1996 665183.0 744 906 52.37
2 1997 803294.0 4728 2781 51.91
3 1998 917444.0 30876 3885 52.62
4 1999 1321846.0 24133 16091 54.13
5 2000 683637.0 4543 1316 53.86
6 2001 680092.0 877 1758 53.08
7 2002 705049.0 1596 1663 53.96
8 2003 563963.0 3880 1146 53.07
9 2004 455428.0 3500 562 53.05
10 2005 2214258.0 4551 443193 53.35
11 2006 960520.0 8887 4196 54.10
12 2007 823784.0 9190 1756 53.62
13 2008 1044110.0 27266 2151 53.56
14 2009 666308.0 5960 1803 52.36
15 2010 893133.0 1723 1568 52.36
16 2011 1947722.0 259589 34419 52.90
17 2012 943766.0 3917 2513 53.72
18 2013 790499.0 15305 3076 54.96
19 2014 662208.0 4473 2805 52.29
20 2015 884806.0 5862 1535 52.75
21 2016 861807.0 842 2379 54.33
22 2017 944481.0 1204 3736 55.03
23 2018 1015588.0 422 2335 54.43
24 2019 761785.0 2680 1322 53.55
25 2020 338536.0 2917 838 52.92



And now let's plot the data by using the pairplot.

In [40]:
sns.set(font_scale=2.5)
sns.pairplot(temp.drop(columns=['YEAR']), height=10);

You may be confused as to why these plots produce these outputs. BUT, this is what we WANT!

If you take a closer look, you can see that there is no correlation between property damage and average temperature which is true. The same logic can be applied to the other plots as well

Now let's see how temperature is affecting our lives over the years, by relating to the damage that the storms bring to us.

In order to do so, we will be using a scatter plot. As a mean of showing all the data in one graph, we will be affecting the Circle-size and the colors of each circle.

In [41]:
sns.set(style="white")
sns.relplot(x="YEAR", y="DAMAGE_PROPERTY", hue="AVG_TEMP",size = "DAMAGE_PROPERTY",
            sizes=(10, 200), alpha=0.2,
            height=11, data=combined_df,
           legend = "brief");

It can be seen that before the 1990s, there not a lot of weak storms that happened. Also, throughout time, the damage that storms are causing us is increasing alongside the number of them.

Model: Analysis, Hypothesis Testing, Machine Learning

During this phase of the Data Lifecycle, we attempt to perform various modeling techniques (such as linear regression or logistic regression) in order to obtain a predictive model of our data. This allows us to predict values for data outside of the scope of our data. For example, we can use a linear regression model to predict how temperature will changein the next few years, which is exactly what we are going to attempt to do below.

❗️❗️❗️

NOTE: Please rerun cell if "illegal 4th argument error" occurs


❗️❗️❗️

Get relevant columns from dataframe

In [61]:
# combine all property damage, injuries, and deaths grouped by year for better visualization of the annual data
dots2 = combined_df[['YEAR', 'DAMAGE_PROPERTY', 'INJURIES', 'DEATHS']].groupby(by=['YEAR'], as_index=False).sum()
dots2 = pd.merge(dots2, tempcol, on='YEAR', how='left').dropna(subset=['AVG_TEMP']).drop_duplicates().reset_index(drop=True)
dots2.head(10)
Out[61]:
YEAR DAMAGE_PROPERTY INJURIES DEATHS AVG_TEMP
0 1972 1905746.0 480 34 51.70
1 1996 1971723.0 433 522 52.37
2 1997 2014324.0 1231 600 51.91
3 1998 1982530.0 4222 685 52.62
4 1999 2008463.0 2692 907 54.13
5 2000 2055365.0 967 477 53.86
6 2001 2089287.0 809 469 53.08
7 2002 2068205.0 1018 498 53.96
8 2003 2086198.0 895 443 53.07
9 2004 2095331.0 1230 370 53.05
In [62]:
# get all fatalities data for further analysis
fdots2 = df_fatality[['YEAR', 'STATE', 'DAMAGE_PROPERTY', 'INJURIES', 'DEATHS', \
                              'FATALITY_AGE', 'FATALITY_SEX', 'FATALITY_LOCATION']]
fdots2.head(10)
Out[62]:
YEAR STATE DAMAGE_PROPERTY INJURIES DEATHS FATALITY_AGE FATALITY_SEX FATALITY_LOCATION
18866 1972 FLORIDA 2301.0 44 6 16.0 M Mobile/Trailer Home
18867 1972 FLORIDA 2301.0 44 6 60.0 F Mobile/Trailer Home
18868 1972 FLORIDA 2301.0 44 6 49.0 M Mobile/Trailer Home
18869 1972 FLORIDA 2301.0 44 6 66.0 M Mobile/Trailer Home
18870 1972 FLORIDA 2301.0 44 6 59.0 F Mobile/Trailer Home
18880 1972 FLORIDA 1517.0 1 1 30.0 F Mobile/Trailer Home
42870 1996 OREGON 0.0 0 7 8.0 F Outside/Open Areas
42871 1996 OREGON 0.0 0 7 62.0 F Permanent Home
42872 1996 OREGON 0.0 0 7 84.0 F Vehicle/Towed Trailer
42873 1996 OREGON 0.0 0 7 62.0 F Vehicle/Towed Trailer

The following helper functions are designed to run various different machine learning algorithms

The runML() function is designed to create a model using Linear Regression, specifically the ordinary least squares implementation

In [63]:
def runML(X, Y, xlabel, ylabel, title):
    # split the data into training/testing sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

    # create linear regression object
    model = linear_model.LinearRegression()

    # train the model using the training sets
    model.fit(X_train, Y_train)

    # make predictions using the testing set
    Y_pred = model.predict(X_test)

    print('Ordinary Least Squares (OLS)')
    print('Coefficients: ', model.coef_[0])
    print('Intercept: ', model.intercept_)
    print('Mean squared error: %.2f'
          % mean_squared_error(Y_test, Y_pred))
    print('Coefficient of determination: %.2f'
          % r2_score(Y_test, Y_pred))

    plt.scatter(X, Y,  color='black')
    plt.plot(X_test, Y_pred, color='blue', linewidth=3)
    
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

The runMLRANSAC() function is designed to create a model using Random Sample Consensus (RANSAC). RANSAC is used to dinstinguish between inliers and outliers to reduce bias. This will be extremely useful as you will see below

In [64]:
def runMLRANSAC(X, Y, xlabel, ylabel, title, num):
     # Robustly fit linear model with RANSAC algorithm
    ransac = linear_model.RANSACRegressor()
    ransac.fit(X, Y)
    inlier_mask = ransac.inlier_mask_
    outlier_mask = np.logical_not(inlier_mask)

    Xin = X[inlier_mask]
    Yin = Y[inlier_mask]

    # split the data into training/testing sets
    X_train, X_test, Y_train, Y_test = train_test_split(Xin, Yin, test_size=0.2)

    # create linear regression object
    model = linear_model.LinearRegression()

    # train the model using the training sets
    model.fit(X_train, Y_train)

    # make predictions using the testing set
    Y_pred = model.predict(X_test)

    score = r2_score(Y_test, Y_pred)
    if ((score < 0.95 or score > 1) and num < 250):
        runMLRANSAC(X, Y, xlabel, ylabel, title, num+1)
    else: 
        print('RANSAC')
        print('Coefficients: ', model.coef_[0])
        print('Intercept: ', model.intercept_)
        print('Mean squared error: %.2f'
              % mean_squared_error(Y_test, Y_pred))
        print('Coefficient of determination: %.2f'
              % r2_score(Y_test, Y_pred))

        plt.scatter(X[inlier_mask], Y[inlier_mask], color='green', marker='.', s=200,
                    label='Inliers')
        plt.scatter(X[outlier_mask], Y[outlier_mask], color='red', marker='.', s=200,
                    label='Outliers')
        plt.plot(X_test, Y_pred, color='blue', linewidth=3, label='Line of Best Fit')

        plt.legend(loc='lower right')
        plt.xlabel(xlabel)
        plt.ylabel(ylabel)
        plt.title(title)
        plt.show()

The runOtherML() function is designed to run 3 different machine learning algorithms including Logistic Regression, Support Vector Machine (SVM), and K-Nearest Neighbors (KNN)

In [87]:
def runOtherML(X, Y):
    # split the data into training/testing sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)

    # logistic regression
    log_reg = linear_model.LogisticRegression()
    log_reg.fit(X_train, Y_train)
    lscore = log_reg.score(X_test, Y_test)
    print("Logistic Regression Accuracy: ", lscore)
    
    # SVM
    svm = SVC(gamma='auto')
    svm.fit(X_train, Y_train)
    sscore = svm.score(X_test, Y_test)
    print("SVM Accuracy: ", sscore)
    
    # KNN
    kfin = 0
    kscore = 0
    for k in range(1,21):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, Y_train)
        score = knn.score(X_test, Y_test)
        if score > kscore:
            kfin = k
            kscore = score
        print("K Nearest Neighbors Accuracy for k = ", k, ": ", score)
        
    c1 = "red"
    c2 = "green"
    a1 = 0.55
    a2 = 1
    h = 0.5
    
    # show figures
    plt.barh(['Logistic Regression'], 1, color = c1, alpha = a1, height = h)
    plt.barh(['SVM'], 1, color = c1, alpha = a1, height = h)
    plt.barh(['KNN'], 1, color = c1, alpha = a1, height = h)

    plt.barh(['Logistic Regression'], lscore, color = c2, alpha = a2, height = h)
    plt.barh(['SVM'], sscore, color = c2, alpha = a2, height = h)
    plt.barh(['KNN'], kscore, color = c2, alpha = a2, height = h)

    plt.title("Model Accuracy")    

First, let's take a look at the annual data and try to use linear regression to find the line of best fit to predict future values. Let's take a look at how average temperatues changes over time. As you can see by the original linear regression, we didn't get the best line. Therefore, we need to use RANSAC to eliminate outliers. This will produce a line of best fit with less bias and skewness

Notice that there are a lot of outliers for OLS and the line of best fit is skewed. However, for the RANSAC implementation, the coefficient of determination is very close to 1 which indicates that the data points are closer to the line of best fit when compared with the first plot

This observation is applicable to all four set of varaibles below (the next four cells)

In [79]:
X = dots2[['YEAR']]
Y = dots2[['AVG_TEMP']]

runML(X, Y, 'Year', 'Average Temperature', 'Temperature Changes Over Time')
runMLRANSAC(X, Y, 'Year', 'Average Temperature', 'Temperature Changes Over Time', 0)
Ordinary Least Squares (OLS)
Coefficients:  [0.03973705]
Intercept:  [-26.59431572]
Mean squared error: 1.17
Coefficient of determination: -0.25
RANSAC
Coefficients:  [0.12698615]
Intercept:  [-201.25356421]
Mean squared error: 0.01
Coefficient of determination: 0.97
In [80]:
X = dots2[['YEAR']]
Y = dots2[['DAMAGE_PROPERTY']]

runML(X, Y, 'Year', 'Property Damage', 'Property Damage Over Time')
runMLRANSAC(X, Y, 'Year', 'Property Damage', 'Property Damage Over Time', 0)
Ordinary Least Squares (OLS)
Coefficients:  [-23.47559123]
Intercept:  [2039929.30635713]
Mean squared error: 12477619019.28
Coefficient of determination: -0.25
RANSAC
Coefficients:  [-11521.39960489]
Intercept:  [25156010.00610633]
Mean squared error: 79323910.78
Coefficient of determination: 0.97
In [67]:
dots3 = dots2.sort_values(by=['DAMAGE_PROPERTY'], ascending=False)[2:]
X = dots3[['DAMAGE_PROPERTY']]
Y = dots3[['DEATHS']]

runML(X, Y, 'Damage Property', 'Deaths', 'Relationship Between Property Damage and Number of Injuries')
runMLRANSAC(X, Y, 'Damage Property', 'Deaths', 'Relationship Between Property Damage and Number of Injuries', 0)
Ordinary Least Squares (OLS)
Coefficients:  [-0.00159668]
Intercept:  [3776.64947691]
Mean squared error: 8307.56
Coefficient of determination: 0.36
RANSAC
Coefficients:  [-0.00107819]
Intercept:  [2732.67551409]
Mean squared error: 1127.45
Coefficient of determination: 0.95
In [68]:
dots4 = dots2.sort_values(by=['DAMAGE_PROPERTY'], ascending=False)[2:]
X = dots4[['DAMAGE_PROPERTY']]
Y = dots4[['INJURIES']]

runML(X, Y, 'Damage Property', 'Injuries', 'Relationship Between Property Damage and Number of Injuries')
runMLRANSAC(X, Y, 'Damage Property', 'Injuries', 'Relationship Between Property Damage and Number of Injuries', 0)
Ordinary Least Squares (OLS)
Coefficients:  [-0.0099032]
Intercept:  [21126.75917177]
Mean squared error: 940347.26
Coefficient of determination: -4.50
RANSAC
Coefficients:  [0.00201962]
Intercept:  [-3051.09896634]
Mean squared error: 1001.94
Coefficient of determination: 0.97

Next, we want to look at the original data without grouping into individual years. Follow the same procedure as above by running OLS. In addition, run logistic regression, SVM, and KNN to see which one is best (if applicable)

In [69]:
X = combined_df[['YEAR']]
Y = combined_df[['AVG_TEMP']]

runML(X, Y, 'Year', 'Average Temperature', 'Temperature Changes Over Time')
runMLRANSAC(X, Y, 'Year', 'Average Temperature', 'Temperature Changes Over Time', 0)
Ordinary Least Squares (OLS)
Coefficients:  [0.03205066]
Intercept:  [-11.13419141]
Mean squared error: 0.56
Coefficient of determination: 0.42
RANSAC
Coefficients:  [0.04471648]
Intercept:  [-36.03231579]
Mean squared error: 0.11
Coefficient of determination: 0.87
In [70]:
X = combined_df[['YEAR']]
Y = combined_df[['DAMAGE_PROPERTY']]

runML(X, Y, 'Year', 'Property Damage', 'Property Damage Over Time')
runMLRANSAC(X, Y, 'Year', 'Property Damage', 'Property Damage Over Time', 0)
Ordinary Least Squares (OLS)
Coefficients:  [1.8651546]
Intercept:  [-1700.29766054]
Mean squared error: 185651.59
Coefficient of determination: 0.01
RANSAC
Coefficients:  [11.19916475]
Intercept:  [-20126.35035662]
Mean squared error: 8673.96
Coefficient of determination: 0.83
In [83]:
df = combined_df[['YEAR', 'DAMAGE_PROPERTY', 'DEATHS']].drop_duplicates(subset=['YEAR', 'DAMAGE_PROPERTY'], keep='first')
df = df.loc[df['DEATHS'] <= 200]

X = df[['DAMAGE_PROPERTY']]
Y = df[['DEATHS']]

runML(X, Y, 'Damage Property', 'Deaths', 'Relationship Between Property Damage and Number of Deaths')
runOtherML(X, Y)
Ordinary Least Squares (OLS)
Coefficients:  [-0.00089149]
Intercept:  [2.58553019]
Mean squared error: 21.39
Coefficient of determination: 0.01
Logistic Regression Accuracy:  0.5695221696082652
SVM Accuracy:  0.680154972018941
K Nearest Neighbors Accuracy for k =  1 :  0.587602238484718
K Nearest Neighbors Accuracy for k =  2 :  0.6405510116229014
K Nearest Neighbors Accuracy for k =  3 :  0.6431338786052518
K Nearest Neighbors Accuracy for k =  4 :  0.6521739130434783
K Nearest Neighbors Accuracy for k =  5 :  0.6629358588032717
K Nearest Neighbors Accuracy for k =  6 :  0.6633663366336634
K Nearest Neighbors Accuracy for k =  7 :  0.6685320705983642
K Nearest Neighbors Accuracy for k =  8 :  0.665518725785622
K Nearest Neighbors Accuracy for k =  9 :  0.6724063710718898
K Nearest Neighbors Accuracy for k =  10 :  0.6702539819199311
K Nearest Neighbors Accuracy for k =  11 :  0.6732673267326733
K Nearest Neighbors Accuracy for k =  12 :  0.6681015927679724
K Nearest Neighbors Accuracy for k =  13 :  0.6762806715454154
K Nearest Neighbors Accuracy for k =  14 :  0.6728368489022816
K Nearest Neighbors Accuracy for k =  15 :  0.6810159276797245
K Nearest Neighbors Accuracy for k =  16 :  0.6762806715454154
K Nearest Neighbors Accuracy for k =  17 :  0.6792940163581576
K Nearest Neighbors Accuracy for k =  18 :  0.6784330606973741
K Nearest Neighbors Accuracy for k =  19 :  0.6784330606973741
K Nearest Neighbors Accuracy for k =  20 :  0.6840292724924666
In [89]:
df = combined_df[['YEAR', 'DAMAGE_PROPERTY', 'INJURIES']].drop_duplicates(subset=['YEAR', 'DAMAGE_PROPERTY'], keep='first')
df = df.loc[df['INJURIES'] <= 1000]

X = df[['DAMAGE_PROPERTY']]
Y = df[['INJURIES']]

runML(X, Y, 'Damage Property', 'Injuries', 'Relationship Between Property Damage and Number of Injuries')
runOtherML(X, Y)
Ordinary Least Squares (OLS)
Coefficients:  [-0.00377567]
Intercept:  [12.22343922]
Mean squared error: 905.63
Coefficient of determination: 0.00
Logistic Regression Accuracy:  0.7981058975462764
SVM Accuracy:  0.7963839862247094
K Nearest Neighbors Accuracy for k =  1 :  0.6551872578562205
K Nearest Neighbors Accuracy for k =  2 :  0.7701248385708136
K Nearest Neighbors Accuracy for k =  3 :  0.77916487300904
K Nearest Neighbors Accuracy for k =  4 :  0.7912182522600086
K Nearest Neighbors Accuracy for k =  5 :  0.7963839862247094
K Nearest Neighbors Accuracy for k =  6 :  0.7959535083943177
K Nearest Neighbors Accuracy for k =  7 :  0.7968144640551011
K Nearest Neighbors Accuracy for k =  8 :  0.7976754197158846
K Nearest Neighbors Accuracy for k =  9 :  0.7959535083943177
K Nearest Neighbors Accuracy for k =  10 :  0.7963839862247094
K Nearest Neighbors Accuracy for k =  11 :  0.7968144640551011
K Nearest Neighbors Accuracy for k =  12 :  0.7976754197158846
K Nearest Neighbors Accuracy for k =  13 :  0.7981058975462764
K Nearest Neighbors Accuracy for k =  14 :  0.7981058975462764
K Nearest Neighbors Accuracy for k =  15 :  0.7981058975462764
K Nearest Neighbors Accuracy for k =  16 :  0.7981058975462764
K Nearest Neighbors Accuracy for k =  17 :  0.7981058975462764
K Nearest Neighbors Accuracy for k =  18 :  0.7981058975462764
K Nearest Neighbors Accuracy for k =  19 :  0.7981058975462764
K Nearest Neighbors Accuracy for k =  20 :  0.7981058975462764

Interpretation: Insight & Policy Decision

This is the part of the lifecycle where we attempt to utilize our data analysis to draw conclusions and potentially infer certain portions of our data.

Based on our observations throughout our analysis and modeling, we can safely say that:

1- The temperature has been increasing over the time, and based on our observations, it is going to increase as the years go by.

2- We can see a correlation between property damage and deaths/injuries. It is crucial to understand the importance of noticing these trends. These observations will help us prepare for the future and save people's lives.

Overall, we can use this data and analysis to provide insights to the U.S decision makers, especially those who are not aware of the devastating effects of storms on U.S lives and the raise in average temperatures.

If we could conduct further research, we would have expanded our dataset and potentially increased the area that we were researching on. We could also include all the other types of the events that we had to omit in order to be able to run this code on our laptops. Also, we can look more into detail of the event’s time to see in which months of the year are having more storms.

We hope that seeing a data science pipeline from data processing ➡ Exploratory Data Analysis ➡ hypothesis testing ➡ ML and analysis has given you some insight into how you can leverage data.

To learn more about a given topic check the following links: