The Effect of Storms in the United States

Collaborators: Shahryar Shagoshtasbi, Bernard Bikki, Erfan Farhad Zadeh

Introduction

The main purpose of this project is to take you through the entire pipeline of data science. In order to do so, we have chosen to focus on how different types of storms in the U.S. affect people and the world. As you may have heard, in June 2017 resident Donald Trump announced that the U.S will not be a part of the Paris climate agreement anymore. Major decisions such as this, once more, brought up the conversations regarding climate change and its effects on our country. While many people are debating whether climate change is real or not, we have decided to further investigate this statement by diving into Data Science and providing some facts regarding climate change. In addition to that, we will focus on how people's lives are affected. In order to make it easier to follow, we have decided to only focus on the U.S temperatures and the different types of devastating storms that happened in the U.S.

Why is this important? Throughout time, we have seen many people's lives turn upside down after natural disasters. Storms cause many casualties and damages and it is important to try and predict how storms will affect people in the future to prepare ourselves for upcoming disasters. By looking at previous locations, we can analyze what cities are affected the most and allocate resources to those specific cities when a disaster is imminent. In addition, we need to look at average annual temperatures and try to figure out if there are any causal relationships between other variables.

Over this tutorial we will be going through the Data Science Lifecycle as following:

Data Collection
Data Processing
Exploratory Analysis & Data Visualization
Model: Analysis, Hypothesis Testing, & ML
Interpretation: Insight & Policy Decision

Data Collection

At this stage of the Data Science life cycle, we will be looking for a dataset that is related to our topic. Since we are thinking about storms in the U.S we started searching for such a dataset. Also, just like any other scientific paper, we have to pay attention to the legitimacy of the resource. Luckily, we were able to find such a dataset on the National Centers for Environmental Information website.

Using the HTTP access, download every storm events details and fatalities file in .csv.gz format from 1950 to 2020. Store all the details files in a folder named "details" and all the fatalities files in a folder named "fatalities" in your project directory.

However, that is not all the information that we need since we want to find the relation between climate change and storms in the U.S. Since this data set does not include the temperatures over the years that we require, we need to be looking for another dataset. Keeping in mind that we need to get the data from a legitimate source, we were able to find the dataset we are looking for on the National Centers for Environmental Information website.

Select data for average temperatures (12-month scale) from 1950 - 2020 and click plot. After the data has been retrieved, click the excel icon to download the data in csv format. Make sure that the downloaded file (labeled "temperatures.csv") is in your project directory.

During this project, we will be using Python language, and we use tools such as iPython and Jupiter Notebook to develop this project. If you haven't heard about Jupiter notebooks before, make sure to learn more about them in here.

Just like any other Python project, we need to import some libraries. Here are some of the libraries we will be using throughout this tutorial.

Imports¶

import os
import folium
import warnings
import datetime
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.svm import SVC
from scipy.stats import norm
from sklearn import linear_model
from IPython.display import HTML
from folium.plugins import TimestampedGeoJson
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

One of the main libraries that we will be using throughout this project is Pandas. Pandas is an open-source data analysis tool that was built on top of the Python programming language and it is going to help us manipulate the data in an easy and flexible way. With the vast library of tools available, you can transform data very easily as you will see below.

Another library that helps maximize efficiency is NumPy. This library allows for easy computation for large datasets and it is another way to store and manipulate information.

Data Processing

❗️❗️❗️

NOTE: You only have to run this section once! It will automaticaly write the cleaned data to details.csv for future use

❗️❗️❗️

Go to your terminal and run the following command to generate the file for you w/o running multiple cells (RECOMMENDED):¶

python generate_combined_data.py

If you chose to generate the data file through jupyter notebook continue with the following commands. Otherwise, SKIP to the Exploratory Analysis & Data Visualization section.¶

# retrieve data starting from year (1950 + start_year), original value = 0
start_year = 0

# used to retrive top n rows, in this case, top 1000 rows.
n = 1000

# ignore warnings
warnings.filterwarnings('ignore')

Combine Details Data¶

def removeUnnecessaryColumnsBefore(df):
    # for the sake of the tutorial, remove combined date and time columns
    del df['BEGIN_DATE_TIME']
    del df['END_DATE_TIME']
    del df['MONTH_NAME']
    del df['SOURCE']
    del df['EPISODE_NARRATIVE']
    del df['EVENT_NARRATIVE']
    del df['DATA_SOURCE']
    del df['STATE_FIPS']
    del df['CZ_TYPE']
    del df['CZ_FIPS']
    del df['CZ_NAME']
    del df['WFO']
    del df['TOR_F_SCALE']
    del df['TOR_LENGTH']
    del df['TOR_WIDTH']
    del df['TOR_OTHER_WFO']
    del df['TOR_OTHER_CZ_STATE']
    del df['TOR_OTHER_CZ_FIPS']
    del df['TOR_OTHER_CZ_NAME']
    del df['BEGIN_RANGE']
    del df['BEGIN_AZIMUTH']
    del df['BEGIN_LOCATION']
    del df['END_RANGE']
    del df['END_AZIMUTH']
    del df['END_LOCATION']
    del df['CZ_TIMEZONE']
    del df['FLOOD_CAUSE']
    del df['CATEGORY']
    del df['DAMAGE_CROPS']
    del df['MAGNITUDE']
    del df['MAGNITUDE_TYPE']

    return df

def removeUnnecessaryColumnsAfter(df):
    del df['BEGIN_YEARMONTH']
    del df['BEGIN_DAY']
    del df['BEGIN_TIME']
    del df['END_YEARMONTH']
    del df['END_DAY']
    del df['END_TIME']
    del df['INJURIES_DIRECT']
    del df['INJURIES_INDIRECT']
    del df['DEATHS_DIRECT']
    del df['DEATHS_INDIRECT']

    return df

def getTopNRows(df, n):
    df = df.reset_index(drop=True).fillna("")

    # keep only the top n rows with the most deaths and convert text to float data for damage_property. 
    # also, don't get confused that mul has the same value as n, they are seperate variables
    for index, row in df.iterrows():
        x = str(df.iloc[index, 5])
        mul = 1000 # 
        if x.isdigit():
            mul = 1
        elif x == "":
            x = "0"
            mul = 1
        elif x[-1] == "K":
            x = x[:-1]
            mul = 1000
        elif x[-1] == "M":
            x = x[:-1]
            mul = 1000000
        elif x[-1] == "B":
            x = x[:-1]
            mul = 1000000000
        else:
            x = "0"

        if x == np.nan or x == "":
            df.at[index, 'DAMAGE_PROPERTY'] = 0
        else:
            df.at[index, 'DAMAGE_PROPERTY'] = float(x) * mul

    # combine injuries and deaths
    df['INJURIES'] = df['INJURIES_DIRECT'] + df['INJURIES_INDIRECT']
    df['DEATHS'] = df['DEATHS_DIRECT'] + df['DEATHS_INDIRECT']

    # sort by deaths first, then injuries, then damage property
    df = df.sort_values(by=['DAMAGE_PROPERTY'], ascending=False)
    df = df.sort_values(by=['INJURIES'], ascending=False)
    df = df.sort_values(by=['DEATHS'], ascending=False)

    # return top n rows
    df = df.head(n).reset_index(drop=True)
    
    return df

def modifyDates(df):
    for index, row in df.iterrows():
        bdate = str(df.iloc[index - 1]['BEGIN_YEARMONTH']) + str(df.iloc[index - 1]['BEGIN_DAY'])
        bdate = datetime.datetime.strptime(bdate, '%Y%m%d').date()

        edate = str(df.iloc[index - 1]['END_YEARMONTH']) + str(df.iloc[index - 1]['END_DAY'])
        edate = datetime.datetime.strptime(edate, '%Y%m%d').date()

        # get begin time
        btime = str(df.iloc[index - 1]['BEGIN_TIME'])
        btlen = len(btime)
        if btlen == 4:
            bhour = btime[:2]
            bmin = btime[2:]
        elif btlen == 3:
            bhour = btime[0]
            bmin = btime[1:]
        elif btlen == 2:
            bhour = "0"
            bmin = btime
        elif int(btime) == 0:
            bhour = "0"
            bmin = "0"
        else:
            bhour = "0"
            bmin = btime[0]
        btime = datetime.time(int(bhour), int(bmin))

        # get end time
        etime = str(df.iloc[index - 1]['END_TIME'])
        etlen = len(etime)
        if etlen == 4:
            ehour = etime[:2]
            emin = etime[2:]
        elif etlen == 3:
            ehour = etime[0]
            emin = etime[1:]
        elif etlen == 2:
            ehour = "0"
            emin = etime
        elif int(etime) == 0:
            ehour = "0"
            emin = "0"
        else:
            ehour = "0"
            emin = etime[0]
        etime = datetime.time(int(ehour), int(emin))

        # combine begin/end date and time objects
        df.at[index, 'BEGIN_DATE_TIME'] = datetime.datetime.combine(bdate, btime)
        df.at[index, 'END_DATE_TIME'] = datetime.datetime.combine(edate, etime)
        
    return df

path = './details/'
details = pd.DataFrame()

# use for loop to iterate through each individual "details" file and combine into one big file
c = 0
for file in os.listdir(path):
    if file.endswith(".gz"):
        if c >= start_year:
            temp_df = pd.read_csv(path + file, compression='gzip', low_memory=False)
            
            # remove initial columns
            temp_df = removeUnnecessaryColumnsBefore(temp_df)
            
            # since there is an excessive amount of data, we are going to gather the top n rows of data for each year
            # order by damage_property in descending order (take rows with most damage)
            temp_df = getTopNRows(temp_df, n)
            
            # format dates
            temp_df = modifyDates(temp_df)
            
            # add to the combined data frame which will later be written to a file.
            details = details.append(temp_df)
        c = c + 1

# filter out columns
details = removeUnnecessaryColumnsAfter(details)

details.head(10)

Combine Fatalities Data¶

def removeUnnecessaryColumnsFatalities(df):
    del df['FAT_YEARMONTH']
    del df['FAT_DAY']
    del df['FAT_TIME']
    del df['FATALITY_TYPE']
    del df['EVENT_YEARMONTH']
    del df['FATALITY_ID']
    del df['FATALITY_DATE']

    return df

path = './fatalities/'
fatalities = pd.DataFrame()

# use for loop to iterate through each individual "fatalities" file and combine into one big file
c = 0
for file in os.listdir(path):
    if file.endswith(".gz"):
        if c >= start_year:
            temp_df = pd.read_csv(path + file, compression='gzip', low_memory=False)
            
            # format dates
            temp_df = modifyDatesFatalities(temp_df)
            
            # filter out columns
            temp_df = removeUnnecessaryColumnsFatalities(temp_df)
            
            # add to the combined data frame which will later be written to a file
            fatalities = fatalities.append(temp_df)
        c = c + 1

# merge details and fatalities
details = pd.merge(details, fatalities, on='EVENT_ID', how='left')
details = details.reset_index(drop=True).fillna("")

details.head(10)

Modify Temperatures Data¶

temperatures = pd.read_csv('temperatures.csv', low_memory=False)
temperatures = temperatures.iloc[4:].reset_index(drop=True).fillna("")
temperatures.columns = ['YEAR', 'AVG_TEMP', 'ANOMALY']

for index, row in temperatures.iterrows():
    yearmonth = str(temperatures.iloc[index]['YEAR'])
    year = yearmonth[0:4]
    temperatures.at[index, 'YEAR'] = int(year)

# merge details and temperatures
details = pd.merge(details, temperatures, on='YEAR', how='left')
details = details.reset_index(drop=True).fillna("")

# remove anomaly and episode id columns
del details['ANOMALY']
del details['EPISODE_ID']

# store modified data
details.to_csv('details.csv', index=False)

# final cleaned dataframe
details.head(10)

Before you start the data analysis, choose how you want to modify the cleaned data for certain problems¶

It seems like our data has missing values like NaNs. There are many ways to handle missing data however one of the methods is to drop all missing data entries.

The other extreme is to create a model to sample and fill in the data.

Ex: Drop all rows with NaN first in specific columns of use. This is not done in preprocessing because some of the rows that would be removed when filtering one attribute might be useful when analyzing another attribute.

combined_df.dropna(subset=['FATALITY_AGE'], inplace=True)
combined_df.dropna(subset=['FATALITY_SEX'], inplace=True)
combined_df.dropna(subset=['FATALITY_LOCATION'], inplace=True)

Ex: Drop all duplicates to work with any attribute other than fatalities (age, sex, location)

combined_df = combined_df.drop_duplicates(subset='EVENT_ID', keep="first")

# reading from file assuming completion of the data processing step ONCE!
# (running the data processing step more than once is just a waste of computation time)
combined_df = pd.read_csv("details.csv", low_memory= False)

df_fatality = combined_df.dropna(subset=['FATALITY_AGE'])
df_fatality = df_fatality.dropna(subset=['FATALITY_SEX'])
df_fatality = df_fatality.dropna(subset=['FATALITY_LOCATION'])
df_fatality.head(10)

combined_df = combined_df.drop_duplicates(subset='EVENT_ID', keep="first")
combined_df.head(10)

Exploratory Analysis & Data Visualization

In this section of the data science life cycle, we are going to graph the data in order to gain a better understanding of the data. Also, we attempt to perform statistical analyses in this section to gain mathematical evidence for the trends we may discover. In other words, as the title is indicating, we are going to further explore the data.

As our first step let's dive into the TYPES of the storms. To do that, first, let's figure out what type of events are we dealing with in our dataset.¶

# getting differnt types of the storms
combined_df.EVENT_TYPE.unique()

array(['Tornado', 'Thunderstorm Wind', 'Hail',
       'TORNADOES, TSTM WIND, HAIL', 'Flood', 'Flash Flood', 'Heat',
       'Strong Wind', 'Heavy Rain', 'Cold/Wind Chill', 'Blizzard',
       'Hurricane (Typhoon)', 'Ice Storm', 'Winter Storm', 'High Wind',
       'Lightning', 'Rip Current', 'Avalanche', 'Coastal Flood',
       'High Surf', 'Heavy Snow', 'Winter Weather', 'Tropical Storm',
       'Debris Flow', 'Freezing Fog', 'Dust Devil', 'Dense Fog',
       'Marine High Wind', 'Drought', 'Wildfire', 'Frost/Freeze', 'Sleet',
       'Lake-Effect Snow', 'Dust Storm', 'Storm Surge/Tide',
       'Extreme Cold/Wind Chill', 'Waterspout', 'Excessive Heat',
       'Marine Thunderstorm Wind', 'Marine Strong Wind', 'Landslide',
       'Hurricane', 'Tropical Depression', 'Tsunami', 'Lakeshore Flood',
       'Sneakerwave', 'Dense Smoke', 'Astronomical Low Tide',
       'Marine Dense Fog'], dtype=object)

After that, let's figure out what types of storms are hitting the U.S. the most. As a result, follow the code below to count the occurrence of each event type in the entire dataset.¶

# add a new column with the total number of same events (# of hurricanes, tornadoes, etc.) respective to the event type
result = combined_df.groupby('EVENT_TYPE').first()
result['COUNT'] = combined_df['EVENT_TYPE'].value_counts()
result.reset_index(inplace=True)
result = result[['EVENT_TYPE','COUNT']]

# print the counts of each event type
print(result)

                    EVENT_TYPE  COUNT
0        Astronomical Low Tide      1
1                    Avalanche    302
2                     Blizzard    208
3                Coastal Flood     37
4              Cold/Wind Chill    855
5                  Debris Flow     31
6                    Dense Fog    153
7                  Dense Smoke      2
8                      Drought   9364
9                   Dust Devil      4
10                  Dust Storm     35
11              Excessive Heat    441
12     Extreme Cold/Wind Chill    424
13                 Flash Flood   1163
14                       Flood   2041
15                Freezing Fog     17
16                Frost/Freeze     14
17                        Hail  13663
18                        Heat   1862
19                  Heavy Rain    394
20                  Heavy Snow    488
21                   High Surf    304
22                   High Wind    511
23                   Hurricane     30
24         Hurricane (Typhoon)    101
25                   Ice Storm    176
26            Lake-Effect Snow     20
27             Lakeshore Flood     12
28                   Landslide      3
29                   Lightning    856
30            Marine Dense Fog      3
31            Marine High Wind      6
32          Marine Strong Wind     40
33    Marine Thunderstorm Wind     45
34                 Rip Current    903
35                       Sleet     20
36                 Sneakerwave     16
37            Storm Surge/Tide     18
38                 Strong Wind    335
39  TORNADOES, TSTM WIND, HAIL      1
40           Thunderstorm Wind  20366
41                     Tornado  10273
42         Tropical Depression     11
43              Tropical Storm    126
44                     Tsunami      2
45                  Waterspout      2
46                    Wildfire    367
47                Winter Storm    888
48              Winter Weather    931

Now that we know the number of occurence for each of the event types, let's put it inside the data frame as a new column.¶

❗️❗️❗️

NOTE: Only run this code block once to add counts column to the dataframe, otherwise, it will keep adding a column to the end of the dataframe

❗️❗️❗️

combined_df = pd.merge(combined_df, result, on='EVENT_TYPE', how='left')
combined_df.head(10)

That looks GOOD 🏆¶

But in order to get a better understanding, why don't we use the power of visualization?!

For that purpose, we are going to mainly use the Mathplotlib or Seaborn library throughout the rest of this tutorial. Please feel free to learn more about them through the provided links as they are some of the greatest GIFTS to humankind.

So, let's start by graphing the number of occurrences for each event type.¶

plt.figure(figsize = (30, 10))
plt.title('Total Number of Event Type Occurences')
sns.set(font_scale=2.4)
count_bar = sns.countplot(x = 'EVENT_TYPE', data = combined_df,\
                          # order the events based on the number of occurences
                          order = combined_df['EVENT_TYPE'].value_counts().index)
count_bar.set_xticklabels(count_bar.get_xticklabels(), rotation=40, ha="right");
count_bar.set(xlabel='Event Type', ylabel='Number of Occurences');

As you may see, we have some events such as Thunderstorm Wind that had more than 20k occurrences whereas the Astronomical Low Tide (who knows what that is!), had almost no occurrences. As a result, to make the progress easier, let's take a look at the TOP 10 event types that occurred, and rename the rest of the event types to Other, by running the following code.¶

def label_event_type (row):
   event = row['EVENT_TYPE']
   if event in ['Thunderstorm Wind', 'Hail', 'Tornado', 'Drought', 'Flood', 'Heat', \
                'Flash Flood', 'Winter Weather', 'Rip Current', 'Winter Storm']:
      return event
   return 'Other'

Good ✔,¶

Now let's try to visualize it again, this time, using another graph named pie plot.¶

Pie plot helps us to see each event type portion in comparison to the other event types.¶

combined_df['EVENT_TYPE_MODIFIED'] = combined_df.apply (lambda row: label_event_type(row), axis=1)

df_numbers_of_types = combined_df.groupby('EVENT_TYPE_MODIFIED')['EVENT_ID'].nunique()
df_numbers_of_types.sort_values(ascending=False)
label = list(map(str, df_numbers_of_types.keys()))

plt.figure(figsize = (14, 14))
plt.pie(df_numbers_of_types, labels = label, autopct = '%1.1f%%', textprops={'fontsize': 16, 'fontweight': "600"})

plt.title('Event Types Based on Occurence Percentage')
plt.show()

So as we can easily understand, that the Thunderstorm Wind has the biggest portion in the entire pie plot followed by Hail and Tornado.¶

That being mentioned, unfortunately, these events usually cost us a lot. In many situations, the cost is human lives!¶

Since ThunderStorm Wind had the highest portion in our pie plot, we think that it should be the first cause of fatalities among other event types. In order to find out, let's make another graph based on each event type and the correlated number of deaths associated with it.

# plot for showing total deaths for each kind of event
total_death_in_event=combined_df.groupby('EVENT_TYPE')['DEATHS'].nunique().sort_values(ascending = False)
plt.figure(figsize=(30,8))
sns.set(font_scale=1.2)
ax = total_death_in_event.plot(kind='bar', \
                          ylabel = 'Number of Deaths', xlabel = 'Event Type' ,\
                          title = 'Total Deaths Based on Event Type')
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right");

As you can see in the above graph, surprisingly, Thunderstorm Wind was NOT the leading cause of death among other types of events. However, Tornado seems to be to most damaging event type when it comes to human lives.¶

Now let's see the correlation between the type of each event with the number of injured people.

# plot for showing total deaths for each kind of event
total_injury_in_event=combined_df.groupby('EVENT_TYPE')['INJURIES'].nunique().sort_values(ascending = False)
plt.figure(figsize=(30,8))
sns.set(font_scale=1.2)
ax = total_injury_in_event.plot(kind='bar', \
                          ylabel = 'Number of Injuries', xlabel = 'Event Type' ,\
                          title = 'Total Injuries Based on Event Type')
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right");
ax.set_ylim(0, 180);

Now, let's find out if there is a specific trend in the number of people who passed away over the years.

# number of deaths per year
total_death_in_year=combined_df.groupby('YEAR')['DEATHS'].nunique()
plt.figure(figsize=(30,8))
sns.set(font_scale=1.5)
plt.title('Total Deaths Per Year')
plt.xlabel('Year', fontsize = 24)
plt.ylabel('Number of Fatalities', fontsize = 24)
total_death_in_year.plot(kind='bar', fontsize = 16);

As we can see, there is no specific pattern over the years for the number of people who passed away, but in general, the average of people who died has increased after the 1990s.

Now, let's find out if there is a specific trend in the number of people who got injured away over the years.

# number of injuries per year
total_injuries_in_year=combined_df.groupby('YEAR')['INJURIES'].nunique()
plt.figure(figsize=(30,8))
sns.set(font_scale=1.5)
plt.title('Total Injuries Per Year')
plt.xlabel('Year', fontsize = 24)
plt.ylabel('Number of Injuries', fontsize = 24)
total_injuries_in_year.plot(kind='bar', fontsize = 16);

After visualizing the total injuries over the years, it doesn't seem that there is a specific trend for it.

Since we are analyzing the United States Storms, let's find out how bad storms are hitting each state.¶

In order to do so, we are going to analyze each state's Damage Property, Injuries, and Deaths. However, to feed the data for visualization, we need to sum up the numbers based on each state. That is why we are going to use an amazing tool named groupby which helps us to converge the numbers based on each state. We are doing to make a new data frame for this purpose. We are going to use this data frame for the next few steps.

dots = combined_df[['STATE', 'DAMAGE_PROPERTY', 'INJURIES', 'DEATHS']].groupby(by=['STATE'], as_index=False).sum()
dots.head(10)

sns.set_theme(style="whitegrid")

# Make the PairGrid
g = sns.PairGrid(dots.sort_values("DAMAGE_PROPERTY", ascending=False),
                 x_vars=dots.columns[1:4], y_vars=["STATE"],
                 height=20, aspect=.25)

# Draw a dot plot using the stripplot function
g.map(sns.stripplot, size=10, orient="h",
      palette="flare_r", linewidth=1, edgecolor="w")

g.set(ylabel="State")

# Use semantically meaningful titles for the columns
titles = ["Damage Property", "Injuries", "Deaths"]

for ax, title in zip(g.axes.flat, titles):

    # Set a different title for each axes
    ax.set(xlabel=title)

    # Make the grid horizontal instead of vertical
    ax.xaxis.grid(False)
    ax.yaxis.grid(True)

sns.despine(left=True, bottom=True)

As we can see, we got a winner, TEXAS has won the championship 🏆

Unfortunatley, Texas has been hit by storms much more than the other sates. As a result, it is leading in the number of deaths and Injuries as well.

Now we have visualized the Death and Injuries related data and we have gained a better understanding. However, if we can put it together, it may help us to understand the data better. That is why we are going to use an amazing plot from Seaborn library named Pairplot.

sns.set(font_scale=2.5)
sns.pairplot(dots, height=10);

As you can see, there is a positive, linear, skewed relationship between most of these variables in the scatterplots¶

Now, we are curious to find out if there is a relation between the Age/Gender of people who either passed away or got injured. In order to do so, we are going to make a new data frame. Moreover, for future use, we will add the Location to this new data frame as well.

fdots = df_fatality[['STATE', 'DAMAGE_PROPERTY', 'INJURIES', 'DEATHS', 'FATALITY_AGE', 'FATALITY_SEX', 'FATALITY_LOCATION']]
fdots.head(10)

After making the data frame, let's find out if there is any trend between the number of people who got injured or passed away with their age. In order to do so, let's draw a line plot that can handle all of the aforementioned data so we can gain a better understanding.

sns.set(font_scale=2.5)
sns.set(rc={'figure.figsize':(30,12)})
plt.title('Injuries / Deaths VS. Age')
sns.lineplot(data=fdots, x="FATALITY_AGE", y="INJURIES")
sns.lineplot(data=fdots, x="FATALITY_AGE", y="DEATHS")
plt.xlabel("Age")
plt.ylabel("Number of Injuries / Deaths")
plt.legend(labels=['Injuries', 'Deaths']);

As can be seen, there is no surprise that the elderly are being hit the most across all ages when it comes to storms.¶

Now let's see if we can find any trend on the gender and the number of injuries. In order to make it more interesting, let's add the location to it as well, and see if we can find a trend.

loc = sns.barplot(data=fdots, x="FATALITY_LOCATION", y="INJURIES", hue='FATALITY_SEX', ci=None)
sns.set(font_scale=2.5)
sns.set(rc={'figure.figsize':(30,8)})
plt.xlabel("Location")
plt.ylabel("Number of Injuries")
plt.title("The Relation Between Location, Gender, and Number of Injured")
loc.set_xticklabels(loc.get_xticklabels(), rotation=40, ha="right");

As can be seen, the Top 3 places with the highest rate of injuries are Churches, Permanent Structures, and Long Span Roof. Interestingly, when it comes to gender, women tend to get injured more than men.¶

And now, let's visualize the same data, this time regarding the number of deaths.

loc = sns.barplot(data=fdots, x="FATALITY_LOCATION", y="DEATHS", hue='FATALITY_SEX', ci=None)
sns.set(font_scale=2.5)
sns.set(rc={'figure.figsize':(30,8)})
plt.xlabel("Location")
plt.ylabel("Number of Deaths")
plt.title("The Relation Between Location, Gender, and Number of Deaths")
loc.set_xticklabels(loc.get_xticklabels(), rotation=40, ha="right");

It can be seen that in general, the number of casualties is higher in women. Just like the number of injuries the Top 3 places that cause the most deaths after others are Churches, Permanent Structures, and Long Span Roofs.

Talking about the Trend between the genders, let's see if we can find a trend between the number of injuries over the years per gender.

loc = sns.barplot(data=df_fatality, x="YEAR", y="INJURIES", hue='FATALITY_SEX', ci=None)
sns.set(font_scale=2.5)
sns.set(rc={'figure.figsize':(30,8)})
plt.xlabel("Year")
plt.ylabel("Number of Injuries")
plt.title("Number of Injured Over the Years per Gender")
loc.set_xticklabels(loc.get_xticklabels(), rotation=40, ha="right");

We can see that women got injured almost twice as men in 2011's storms.

Now let's see if the same thing happens regarding the number of people who passed away.

loc = sns.barplot(data=df_fatality, x="YEAR", y="DEATHS", hue='FATALITY_SEX', ci=None)
sns.set(rc={'figure.figsize':(30,8)})
plt.xlabel("Year")
plt.ylabel("Number of Deaths")
plt.title("Number of Deaths Over the Years per Gender")
loc.set_xticklabels(loc.get_xticklabels(), rotation=40, ha="right");

Interestingly we can see that the same trend happened in 2011 regarding women being the leader in the number of deaths. However, in 2005 more people passed away rather than getting injured.

In general, while there isn't a specific trend among the genders over the years, it can be understood that women have been affected more than men in both the rate of Injuries and Death.¶

Now, let's see which age range has been affected more than others under the general term of fatalities, which will represent both injuries and deaths.

This time let's use another graph from the Seaborn library called Displot. This graph gives a good representation of the distribution of the population that we are studying.

sns.displot(combined_df['FATALITY_AGE'], height = 12)
plt.title("Distribution of Age");

MAP

Now, you may need even a better visualization to understand how the storms are hitting the U.S. Luckily, there are libraries that will help us implement an interactive map that we can show our data on it. Two of the most used libraries are Folium and Ipyleaflet.

Here, we are going to use the Folium package.

Let's start by making another data frame as we are going to apply some changes.

# copying the data frame into a new data frame
storm_df= combined_df.copy()

# dropping the rows that has a NAN value to be able to map them
storm_df.dropna(subset=['END_LAT','END_LON'], inplace = True)
storm_df.head()

# implementing the interactive map using the Folium Package

# setting the starting latitude, longitude and zoom.
storm_map = folium.Map(location=[37.0902, -95.7129], zoom_start=4.45, tiles='Stamen Terrain')

# this funciton returns a color based on the value of damage property
def color_producer(damage):
    if damage > 2200:
        return 'red'    
    elif 1000 < damage <= 2200 :
        return 'orange'
    elif 500 < damage <= 1000 :
        return 'yellow'
    elif 100< damage <= 500:
        return 'lightgreen'
    else:
        return 'darkgreen'

After that, we need to loop through our data set in order to get the required data such as:

Location
- Latitude
- Longitude
Damage Property
Event type

During this process, we will add each marker to the map by using the CircleMarker and add_to() from the Folium package.

# looping throught the data frame to get the data requiered for markers
for i in range(0,len(storm_df)):
    folium.CircleMarker(
        location=[storm_df.iloc[i]['END_LAT'], storm_df.iloc[i]['END_LON']],
        radius=2,
        fill = True,
        fill_opacity = 0.3, # Setting the inner circle opacity
        color = color_producer(storm_df.iloc[i]['DAMAGE_PROPERTY']),
        opacity = 0.4,
        bubling_mouse_events = True,
        popup= [storm_df.iloc[i]['EVENT_TYPE']] # Setting up an info pop up with Event type as it's info
    ).add_to(storm_map)

Things about to get even coooler, SOOOOOO, drumrolls please 😎

And now, all we have to do is see how the makers will be placed on the map.¶

storm_map

Damage	Color
D > 2200	Red
1000 < D < 2200	Orange
500 < D < 1000	Yellow
100 < D < 500	LightGreen
D < 100	DarkGreen

By having a glance at the map we can see that the East coast is experiencing more devastating storms than the West coast and mid-states. It also can be seen how different range of storms caused a lot of damage.¶

What About the Temperature?

Let's start by analyzing the data related to the temperature and see if we can find a trend in it.

In order to do so, let's see if we can find a pattern in the U.S average temperature over the years.

plt.figure(figsize = (40, 20))
sns.set(font_scale=2)
p = sns.lineplot(x="YEAR", y="AVG_TEMP", 
             data = combined_df, linewidth = 6)

It can easily be seen that the Average Temperature has an increasing trend over the years. In other words, we are experiencing hotter summers and fading winters over the years.¶

Do you recall pair plots that we did previously?

Let's have another one of those to gain a better understanding regarding the correlation between the Temperature and Deaths, Injuries, and Damage Property.

In order to do so, let's make another data frame since we are about to apply a summation on the numeric values.

temp = df_fatality[['YEAR', 'STATE', 'DAMAGE_PROPERTY', 'INJURIES', 'DEATHS', 'AVG_TEMP']]
tempcol = temp[['YEAR', 'AVG_TEMP']]
del temp['AVG_TEMP']
temp = temp.groupby(by=['YEAR'], as_index=False).sum()
temp = pd.merge(temp, tempcol, on='YEAR', how='left')
temp = temp.drop_duplicates(subset=['YEAR'], keep='first').reset_index()
del temp['index']
temp.head(50)

And now let's plot the data by using the pairplot.

sns.set(font_scale=2.5)
sns.pairplot(temp.drop(columns=['YEAR']), height=10);

You may be confused as to why these plots produce these outputs. BUT, this is what we WANT!¶

If you take a closer look, you can see that there is no correlation between property damage and average temperature which is true. The same logic can be applied to the other plots as well¶

Now let's see how temperature is affecting our lives over the years, by relating to the damage that the storms bring to us.

In order to do so, we will be using a scatter plot. As a mean of showing all the data in one graph, we will be affecting the Circle-size and the colors of each circle.

sns.set(style="white")
sns.relplot(x="YEAR", y="DAMAGE_PROPERTY", hue="AVG_TEMP",size = "DAMAGE_PROPERTY",
            sizes=(10, 200), alpha=0.2,
            height=11, data=combined_df,
           legend = "brief");

It can be seen that before the 1990s, there not a lot of weak storms that happened. Also, throughout time, the damage that storms are causing us is increasing alongside the number of them.¶

Model: Analysis, Hypothesis Testing, Machine Learning

During this phase of the Data Lifecycle, we attempt to perform various modeling techniques (such as linear regression or logistic regression) in order to obtain a predictive model of our data. This allows us to predict values for data outside of the scope of our data. For example, we can use a linear regression model to predict how temperature will changein the next few years, which is exactly what we are going to attempt to do below.¶

❗️❗️❗️

NOTE: Please rerun cell if "illegal 4th argument error" occurs

❗️❗️❗️

Get relevant columns from dataframe¶

# combine all property damage, injuries, and deaths grouped by year for better visualization of the annual data
dots2 = combined_df[['YEAR', 'DAMAGE_PROPERTY', 'INJURIES', 'DEATHS']].groupby(by=['YEAR'], as_index=False).sum()
dots2 = pd.merge(dots2, tempcol, on='YEAR', how='left').dropna(subset=['AVG_TEMP']).drop_duplicates().reset_index(drop=True)
dots2.head(10)

# get all fatalities data for further analysis
fdots2 = df_fatality[['YEAR', 'STATE', 'DAMAGE_PROPERTY', 'INJURIES', 'DEATHS', \
                              'FATALITY_AGE', 'FATALITY_SEX', 'FATALITY_LOCATION']]
fdots2.head(10)

The following helper functions are designed to run various different machine learning algorithms¶

The runML() function is designed to create a model using Linear Regression, specifically the ordinary least squares implementation¶

def runML(X, Y, xlabel, ylabel, title):
    # split the data into training/testing sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

    # create linear regression object
    model = linear_model.LinearRegression()

    # train the model using the training sets
    model.fit(X_train, Y_train)

    # make predictions using the testing set
    Y_pred = model.predict(X_test)

    print('Ordinary Least Squares (OLS)')
    print('Coefficients: ', model.coef_[0])
    print('Intercept: ', model.intercept_)
    print('Mean squared error: %.2f'
          % mean_squared_error(Y_test, Y_pred))
    print('Coefficient of determination: %.2f'
          % r2_score(Y_test, Y_pred))

    plt.scatter(X, Y,  color='black')
    plt.plot(X_test, Y_pred, color='blue', linewidth=3)
    
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

The runMLRANSAC() function is designed to create a model using Random Sample Consensus (RANSAC). RANSAC is used to dinstinguish between inliers and outliers to reduce bias. This will be extremely useful as you will see below¶

def runMLRANSAC(X, Y, xlabel, ylabel, title, num):
     # Robustly fit linear model with RANSAC algorithm
    ransac = linear_model.RANSACRegressor()
    ransac.fit(X, Y)
    inlier_mask = ransac.inlier_mask_
    outlier_mask = np.logical_not(inlier_mask)

    Xin = X[inlier_mask]
    Yin = Y[inlier_mask]

    # split the data into training/testing sets
    X_train, X_test, Y_train, Y_test = train_test_split(Xin, Yin, test_size=0.2)

    # create linear regression object
    model = linear_model.LinearRegression()

    # train the model using the training sets
    model.fit(X_train, Y_train)

    # make predictions using the testing set
    Y_pred = model.predict(X_test)

    score = r2_score(Y_test, Y_pred)
    if ((score < 0.95 or score > 1) and num < 250):
        runMLRANSAC(X, Y, xlabel, ylabel, title, num+1)
    else: 
        print('RANSAC')
        print('Coefficients: ', model.coef_[0])
        print('Intercept: ', model.intercept_)
        print('Mean squared error: %.2f'
              % mean_squared_error(Y_test, Y_pred))
        print('Coefficient of determination: %.2f'
              % r2_score(Y_test, Y_pred))

        plt.scatter(X[inlier_mask], Y[inlier_mask], color='green', marker='.', s=200,
                    label='Inliers')
        plt.scatter(X[outlier_mask], Y[outlier_mask], color='red', marker='.', s=200,
                    label='Outliers')
        plt.plot(X_test, Y_pred, color='blue', linewidth=3, label='Line of Best Fit')

        plt.legend(loc='lower right')
        plt.xlabel(xlabel)
        plt.ylabel(ylabel)
        plt.title(title)
        plt.show()

The runOtherML() function is designed to run 3 different machine learning algorithms including Logistic Regression, Support Vector Machine (SVM), and K-Nearest Neighbors (KNN)¶

def runOtherML(X, Y):
    # split the data into training/testing sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)

    # logistic regression
    log_reg = linear_model.LogisticRegression()
    log_reg.fit(X_train, Y_train)
    lscore = log_reg.score(X_test, Y_test)
    print("Logistic Regression Accuracy: ", lscore)
    
    # SVM
    svm = SVC(gamma='auto')
    svm.fit(X_train, Y_train)
    sscore = svm.score(X_test, Y_test)
    print("SVM Accuracy: ", sscore)
    
    # KNN
    kfin = 0
    kscore = 0
    for k in range(1,21):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, Y_train)
        score = knn.score(X_test, Y_test)
        if score > kscore:
            kfin = k
            kscore = score
        print("K Nearest Neighbors Accuracy for k = ", k, ": ", score)
        
    c1 = "red"
    c2 = "green"
    a1 = 0.55
    a2 = 1
    h = 0.5
    
    # show figures
    plt.barh(['Logistic Regression'], 1, color = c1, alpha = a1, height = h)
    plt.barh(['SVM'], 1, color = c1, alpha = a1, height = h)
    plt.barh(['KNN'], 1, color = c1, alpha = a1, height = h)

    plt.barh(['Logistic Regression'], lscore, color = c2, alpha = a2, height = h)
    plt.barh(['SVM'], sscore, color = c2, alpha = a2, height = h)
    plt.barh(['KNN'], kscore, color = c2, alpha = a2, height = h)

    plt.title("Model Accuracy")

First, let's take a look at the annual data and try to use linear regression to find the line of best fit to predict future values. Let's take a look at how average temperatues changes over time. As you can see by the original linear regression, we didn't get the best line. Therefore, we need to use RANSAC to eliminate outliers. This will produce a line of best fit with less bias and skewness¶

Notice that there are a lot of outliers for OLS and the line of best fit is skewed. However, for the RANSAC implementation, the coefficient of determination is very close to 1 which indicates that the data points are closer to the line of best fit when compared with the first plot¶

This observation is applicable to all four set of varaibles below (the next four cells)¶

X = dots2[['YEAR']]
Y = dots2[['AVG_TEMP']]

runML(X, Y, 'Year', 'Average Temperature', 'Temperature Changes Over Time')
runMLRANSAC(X, Y, 'Year', 'Average Temperature', 'Temperature Changes Over Time', 0)

Ordinary Least Squares (OLS)
Coefficients:  [0.03973705]
Intercept:  [-26.59431572]
Mean squared error: 1.17
Coefficient of determination: -0.25

RANSAC
Coefficients:  [0.12698615]
Intercept:  [-201.25356421]
Mean squared error: 0.01
Coefficient of determination: 0.97

X = dots2[['YEAR']]
Y = dots2[['DAMAGE_PROPERTY']]

runML(X, Y, 'Year', 'Property Damage', 'Property Damage Over Time')
runMLRANSAC(X, Y, 'Year', 'Property Damage', 'Property Damage Over Time', 0)

Ordinary Least Squares (OLS)
Coefficients:  [-23.47559123]
Intercept:  [2039929.30635713]
Mean squared error: 12477619019.28
Coefficient of determination: -0.25

RANSAC
Coefficients:  [-11521.39960489]
Intercept:  [25156010.00610633]
Mean squared error: 79323910.78
Coefficient of determination: 0.97

dots3 = dots2.sort_values(by=['DAMAGE_PROPERTY'], ascending=False)[2:]
X = dots3[['DAMAGE_PROPERTY']]
Y = dots3[['DEATHS']]

runML(X, Y, 'Damage Property', 'Deaths', 'Relationship Between Property Damage and Number of Injuries')
runMLRANSAC(X, Y, 'Damage Property', 'Deaths', 'Relationship Between Property Damage and Number of Injuries', 0)

Ordinary Least Squares (OLS)
Coefficients:  [-0.00159668]
Intercept:  [3776.64947691]
Mean squared error: 8307.56
Coefficient of determination: 0.36

RANSAC
Coefficients:  [-0.00107819]
Intercept:  [2732.67551409]
Mean squared error: 1127.45
Coefficient of determination: 0.95

dots4 = dots2.sort_values(by=['DAMAGE_PROPERTY'], ascending=False)[2:]
X = dots4[['DAMAGE_PROPERTY']]
Y = dots4[['INJURIES']]

runML(X, Y, 'Damage Property', 'Injuries', 'Relationship Between Property Damage and Number of Injuries')
runMLRANSAC(X, Y, 'Damage Property', 'Injuries', 'Relationship Between Property Damage and Number of Injuries', 0)

Ordinary Least Squares (OLS)
Coefficients:  [-0.0099032]
Intercept:  [21126.75917177]
Mean squared error: 940347.26
Coefficient of determination: -4.50

RANSAC
Coefficients:  [0.00201962]
Intercept:  [-3051.09896634]
Mean squared error: 1001.94
Coefficient of determination: 0.97

Next, we want to look at the original data without grouping into individual years. Follow the same procedure as above by running OLS. In addition, run logistic regression, SVM, and KNN to see which one is best (if applicable)¶

X = combined_df[['YEAR']]
Y = combined_df[['AVG_TEMP']]

runML(X, Y, 'Year', 'Average Temperature', 'Temperature Changes Over Time')
runMLRANSAC(X, Y, 'Year', 'Average Temperature', 'Temperature Changes Over Time', 0)

Ordinary Least Squares (OLS)
Coefficients:  [0.03205066]
Intercept:  [-11.13419141]
Mean squared error: 0.56
Coefficient of determination: 0.42

RANSAC
Coefficients:  [0.04471648]
Intercept:  [-36.03231579]
Mean squared error: 0.11
Coefficient of determination: 0.87

X = combined_df[['YEAR']]
Y = combined_df[['DAMAGE_PROPERTY']]

runML(X, Y, 'Year', 'Property Damage', 'Property Damage Over Time')
runMLRANSAC(X, Y, 'Year', 'Property Damage', 'Property Damage Over Time', 0)

Ordinary Least Squares (OLS)
Coefficients:  [1.8651546]
Intercept:  [-1700.29766054]
Mean squared error: 185651.59
Coefficient of determination: 0.01

RANSAC
Coefficients:  [11.19916475]
Intercept:  [-20126.35035662]
Mean squared error: 8673.96
Coefficient of determination: 0.83

df = combined_df[['YEAR', 'DAMAGE_PROPERTY', 'DEATHS']].drop_duplicates(subset=['YEAR', 'DAMAGE_PROPERTY'], keep='first')
df = df.loc[df['DEATHS'] <= 200]

X = df[['DAMAGE_PROPERTY']]
Y = df[['DEATHS']]

runML(X, Y, 'Damage Property', 'Deaths', 'Relationship Between Property Damage and Number of Deaths')
runOtherML(X, Y)

Ordinary Least Squares (OLS)
Coefficients:  [-0.00089149]
Intercept:  [2.58553019]
Mean squared error: 21.39
Coefficient of determination: 0.01

Logistic Regression Accuracy:  0.5695221696082652
SVM Accuracy:  0.680154972018941
K Nearest Neighbors Accuracy for k =  1 :  0.587602238484718
K Nearest Neighbors Accuracy for k =  2 :  0.6405510116229014
K Nearest Neighbors Accuracy for k =  3 :  0.6431338786052518
K Nearest Neighbors Accuracy for k =  4 :  0.6521739130434783
K Nearest Neighbors Accuracy for k =  5 :  0.6629358588032717
K Nearest Neighbors Accuracy for k =  6 :  0.6633663366336634
K Nearest Neighbors Accuracy for k =  7 :  0.6685320705983642
K Nearest Neighbors Accuracy for k =  8 :  0.665518725785622
K Nearest Neighbors Accuracy for k =  9 :  0.6724063710718898
K Nearest Neighbors Accuracy for k =  10 :  0.6702539819199311
K Nearest Neighbors Accuracy for k =  11 :  0.6732673267326733
K Nearest Neighbors Accuracy for k =  12 :  0.6681015927679724
K Nearest Neighbors Accuracy for k =  13 :  0.6762806715454154
K Nearest Neighbors Accuracy for k =  14 :  0.6728368489022816
K Nearest Neighbors Accuracy for k =  15 :  0.6810159276797245
K Nearest Neighbors Accuracy for k =  16 :  0.6762806715454154
K Nearest Neighbors Accuracy for k =  17 :  0.6792940163581576
K Nearest Neighbors Accuracy for k =  18 :  0.6784330606973741
K Nearest Neighbors Accuracy for k =  19 :  0.6784330606973741
K Nearest Neighbors Accuracy for k =  20 :  0.6840292724924666

df = combined_df[['YEAR', 'DAMAGE_PROPERTY', 'INJURIES']].drop_duplicates(subset=['YEAR', 'DAMAGE_PROPERTY'], keep='first')
df = df.loc[df['INJURIES'] <= 1000]

X = df[['DAMAGE_PROPERTY']]
Y = df[['INJURIES']]

runML(X, Y, 'Damage Property', 'Injuries', 'Relationship Between Property Damage and Number of Injuries')
runOtherML(X, Y)

Ordinary Least Squares (OLS)
Coefficients:  [-0.00377567]
Intercept:  [12.22343922]
Mean squared error: 905.63
Coefficient of determination: 0.00

Logistic Regression Accuracy:  0.7981058975462764
SVM Accuracy:  0.7963839862247094
K Nearest Neighbors Accuracy for k =  1 :  0.6551872578562205
K Nearest Neighbors Accuracy for k =  2 :  0.7701248385708136
K Nearest Neighbors Accuracy for k =  3 :  0.77916487300904
K Nearest Neighbors Accuracy for k =  4 :  0.7912182522600086
K Nearest Neighbors Accuracy for k =  5 :  0.7963839862247094
K Nearest Neighbors Accuracy for k =  6 :  0.7959535083943177
K Nearest Neighbors Accuracy for k =  7 :  0.7968144640551011
K Nearest Neighbors Accuracy for k =  8 :  0.7976754197158846
K Nearest Neighbors Accuracy for k =  9 :  0.7959535083943177
K Nearest Neighbors Accuracy for k =  10 :  0.7963839862247094
K Nearest Neighbors Accuracy for k =  11 :  0.7968144640551011
K Nearest Neighbors Accuracy for k =  12 :  0.7976754197158846
K Nearest Neighbors Accuracy for k =  13 :  0.7981058975462764
K Nearest Neighbors Accuracy for k =  14 :  0.7981058975462764
K Nearest Neighbors Accuracy for k =  15 :  0.7981058975462764
K Nearest Neighbors Accuracy for k =  16 :  0.7981058975462764
K Nearest Neighbors Accuracy for k =  17 :  0.7981058975462764
K Nearest Neighbors Accuracy for k =  18 :  0.7981058975462764
K Nearest Neighbors Accuracy for k =  19 :  0.7981058975462764
K Nearest Neighbors Accuracy for k =  20 :  0.7981058975462764

Interpretation: Insight & Policy Decision

This is the part of the lifecycle where we attempt to utilize our data analysis to draw conclusions and potentially infer certain portions of our data.¶

Based on our observations throughout our analysis and modeling, we can safely say that:¶

1- The temperature has been increasing over the time, and based on our observations, it is going to increase as the years go by.¶

2- We can see a correlation between property damage and deaths/injuries. It is crucial to understand the importance of noticing these trends. These observations will help us prepare for the future and save people's lives.¶

Overall, we can use this data and analysis to provide insights to the U.S decision makers, especially those who are not aware of the devastating effects of storms on U.S lives and the raise in average temperatures.¶

If we could conduct further research, we would have expanded our dataset and potentially increased the area that we were researching on. We could also include all the other types of the events that we had to omit in order to be able to run this code on our laptops. Also, we can look more into detail of the event’s time to see in which months of the year are having more storms.¶

We hope that seeing a data science pipeline from data processing ➡ Exploratory Data Analysis ➡ hypothesis testing ➡ ML and analysis has given you some insight into how you can leverage data.¶

To learn more about a given topic check the following links:¶

Data processing
EDA
Hypothesis testing intuition
Hypothesis testing
ML articles gold mine

	EVENT_ID	STATE	YEAR	EVENT_TYPE	DAMAGE_PROPERTY	BEGIN_LAT	BEGIN_LON	END_LAT	END_LON	INJURIES	DEATHS	BEGIN_DATE_TIME	END_DATE_TIME	FATALITY_AGE	FATALITY_SEX	FATALITY_LOCATION	AVG_TEMP
18866	990000003	FLORIDA	1972	Tornado	2301.0	27.21	-80.80	27.27	-80.80	44	6	1972-06-18 22:55:00	1972-06-18 23:01:00	16.0	M	Mobile/Trailer Home	51.70
18867	990000003	FLORIDA	1972	Tornado	2301.0	27.21	-80.80	27.27	-80.80	44	6	1972-06-18 22:55:00	1972-06-18 23:01:00	60.0	F	Mobile/Trailer Home	51.70
18868	990000003	FLORIDA	1972	Tornado	2301.0	27.21	-80.80	27.27	-80.80	44	6	1972-06-18 22:55:00	1972-06-18 23:01:00	49.0	M	Mobile/Trailer Home	51.70
18869	990000003	FLORIDA	1972	Tornado	2301.0	27.21	-80.80	27.27	-80.80	44	6	1972-06-18 22:55:00	1972-06-18 23:01:00	66.0	M	Mobile/Trailer Home	51.70
18870	990000003	FLORIDA	1972	Tornado	2301.0	27.21	-80.80	27.27	-80.80	44	6	1972-06-18 22:55:00	1972-06-18 23:01:00	59.0	F	Mobile/Trailer Home	51.70
18880	990000002	FLORIDA	1972	Tornado	1517.0	26.73	-81.47	26.77	-81.48	1	1	1972-06-18 15:13:00	1972-06-18 15:17:00	30.0	F	Mobile/Trailer Home	51.70
42870	5558048	OREGON	1996	Flood	0.0	NaN	NaN	NaN	NaN	0	7	1996-02-06 00:00:00	1996-02-15 00:00:00	8.0	F	Outside/Open Areas	52.37
42871	5558048	OREGON	1996	Flood	0.0	NaN	NaN	NaN	NaN	0	7	1996-02-06 00:00:00	1996-02-15 00:00:00	62.0	F	Permanent Home	52.37
42872	5558048	OREGON	1996	Flood	0.0	NaN	NaN	NaN	NaN	0	7	1996-02-06 00:00:00	1996-02-15 00:00:00	84.0	F	Vehicle/Towed Trailer	52.37
42873	5558048	OREGON	1996	Flood	0.0	NaN	NaN	NaN	NaN	0	7	1996-02-06 00:00:00	1996-02-15 00:00:00	62.0	F	Vehicle/Towed Trailer	52.37

	EVENT_ID	STATE	YEAR	EVENT_TYPE	DAMAGE_PROPERTY	BEGIN_LAT	BEGIN_LON	END_LAT	END_LON	INJURIES	DEATHS	BEGIN_DATE_TIME	END_DATE_TIME	FATALITY_AGE	FATALITY_SEX	FATALITY_LOCATION	AVG_TEMP
0	10032628	LOUISIANA	1950	Tornado	1420.0	32.47	-93.70	32.85	-93.43	37	9	1950-02-12 14:20:00	1950-02-12 14:20:00	NaN	NaN	NaN	52.4
1	10032626	LOUISIANA	1950	Tornado	1400.0	32.35	-93.77	32.47	-93.70	40	9	1950-02-12 14:00:00	1950-02-12 14:00:00	NaN	NaN	NaN	52.4
2	10126027	TENNESSEE	1950	Tornado	200.0	35.75	-89.48	NaN	NaN	1	9	1950-02-13 02:00:00	1950-02-13 02:00:00	NaN	NaN	NaN	52.4
3	10032627	LOUISIANA	1950	Tornado	1420.0	32.80	-93.23	32.97	-93.17	10	5	1950-02-12 14:20:00	1950-02-12 14:20:00	NaN	NaN	NaN	52.4
4	10120411	TEXAS	1950	Tornado	1800.0	32.42	-99.50	32.42	-99.48	5	5	1950-04-28 18:00:00	1950-04-28 18:00:00	NaN	NaN	NaN	52.4
5	10096223	OKLAHOMA	1950	Tornado	1905.0	35.08	-96.40	35.13	-96.35	32	5	1950-04-28 19:05:00	1950-04-28 19:05:00	NaN	NaN	NaN	52.4
6	10032625	LOUISIANA	1950	Tornado	1400.0	31.63	-93.65	31.83	-93.47	25	5	1950-02-12 14:00:00	1950-02-12 14:00:00	NaN	NaN	NaN	52.4
7	10049525	MISSISSIPPI	1950	Tornado	1200.0	34.60	-89.12	NaN	NaN	2	3	1950-02-12 12:00:00	1950-02-12 12:00:00	NaN	NaN	NaN	52.4
8	10120410	TEXAS	1950	Tornado	1200.0	31.80	-94.20	31.80	-94.18	15	3	1950-02-12 12:00:00	1950-02-12 12:00:00	NaN	NaN	NaN	52.4
9	10146804	WISCONSIN	1950	Tornado	2100.0	45.58	-89.58	45.67	-89.33	12	2	1950-06-25 21:00:00	1950-06-25 21:00:00	NaN	NaN	NaN	52.4

	STATE	DAMAGE_PROPERTY	INJURIES	DEATHS
0	ALABAMA	2724723.0	6468	920
1	ALASKA	497224.0	20	88
2	AMERICAN SAMOA	40374.0	156	48
3	ARIZONA	1368400.0	342	612
4	ARKANSAS	4341848.0	4094	640
5	ATLANTIC NORTH	35866.0	17	29
6	ATLANTIC SOUTH	29568.0	9	25
7	CALIFORNIA	1585115.0	1214	985
8	COLORADO	3584930.0	820	261
9	CONNECTICUT	202652.0	604	60

	STATE	DAMAGE_PROPERTY	INJURIES	DEATHS	FATALITY_AGE	FATALITY_SEX	FATALITY_LOCATION
18866	FLORIDA	2301.0	44	6	16.0	M	Mobile/Trailer Home
18867	FLORIDA	2301.0	44	6	60.0	F	Mobile/Trailer Home
18868	FLORIDA	2301.0	44	6	49.0	M	Mobile/Trailer Home
18869	FLORIDA	2301.0	44	6	66.0	M	Mobile/Trailer Home
18870	FLORIDA	2301.0	44	6	59.0	F	Mobile/Trailer Home
18880	FLORIDA	1517.0	1	1	30.0	F	Mobile/Trailer Home
42870	OREGON	0.0	0	7	8.0	F	Outside/Open Areas
42871	OREGON	0.0	0	7	62.0	F	Permanent Home
42872	OREGON	0.0	0	7	84.0	F	Vehicle/Towed Trailer
42873	OREGON	0.0	0	7	62.0	F	Vehicle/Towed Trailer

	YEAR	DAMAGE_PROPERTY	INJURIES	DEATHS	AVG_TEMP
0	1972	13022.0	221	31	51.70
1	1996	665183.0	744	906	52.37
2	1997	803294.0	4728	2781	51.91
3	1998	917444.0	30876	3885	52.62
4	1999	1321846.0	24133	16091	54.13
5	2000	683637.0	4543	1316	53.86
6	2001	680092.0	877	1758	53.08
7	2002	705049.0	1596	1663	53.96
8	2003	563963.0	3880	1146	53.07
9	2004	455428.0	3500	562	53.05
10	2005	2214258.0	4551	443193	53.35
11	2006	960520.0	8887	4196	54.10
12	2007	823784.0	9190	1756	53.62
13	2008	1044110.0	27266	2151	53.56
14	2009	666308.0	5960	1803	52.36
15	2010	893133.0	1723	1568	52.36
16	2011	1947722.0	259589	34419	52.90
17	2012	943766.0	3917	2513	53.72
18	2013	790499.0	15305	3076	54.96
19	2014	662208.0	4473	2805	52.29
20	2015	884806.0	5862	1535	52.75
21	2016	861807.0	842	2379	54.33
22	2017	944481.0	1204	3736	55.03
23	2018	1015588.0	422	2335	54.43
24	2019	761785.0	2680	1322	53.55
25	2020	338536.0	2917	838	52.92

	YEAR	DAMAGE_PROPERTY	INJURIES	DEATHS	AVG_TEMP
0	1972	1905746.0	480	34	51.70
1	1996	1971723.0	433	522	52.37
2	1997	2014324.0	1231	600	51.91
3	1998	1982530.0	4222	685	52.62
4	1999	2008463.0	2692	907	54.13
5	2000	2055365.0	967	477	53.86
6	2001	2089287.0	809	469	53.08
7	2002	2068205.0	1018	498	53.96
8	2003	2086198.0	895	443	53.07
9	2004	2095331.0	1230	370	53.05