# Uncomment following line to install on colab
#! pip install qeds fiona geopandas xgboost gensim folium pyLDAvis descartes
Many data sources contain both numerical data and text.
We can use text to create features for any of the prediction methods that we have discussed.
Doing so requires encoding text into some numerical representation.
A good encoding preserves the meaning of the original text, while keeping dimensionality manageable.
In this lecture, we will learn how to work with text through an application — predicting fatalities from avalanche forecasts.
Snow avalanches are a hazard in the mountains. Avalanches can be partially predicted based on snow conditions, weather, and terrain. Avalanche Canada produces daily avalanche forecasts for various Canadian mountainous regions. These forecasts consist of 1-5 ratings for each of three elevation bands, as well as textual descriptions of recent avalanche observations, snowpack, and weather. Avalanche Canada also maintains a list of fatal avalanche incidents. In this lecture, we will attempt to predict fatal incidents from the text of avalanche forecasts. Since fatal incidents are rare, this prediction task will be quite difficult.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
# activate plot theme
import qeds
qeds.themes.mpl_style();
Avalanche Canada has an unstable JSON API. The API seems to be largely tailored to displaying information on the various Avalanche Canada websites, which does not make it easy to obtain large amounts of data. Nonetheless, getting information from the API is easier than scraping the website. Generally, whenever you’re considering scraping a website, you should first check whether the site has an API available.
# Get data on avalanche forecasts and incidents from Avalanche Canada
# Avalanche Canada has an unstable public api
# https://github.com/avalanche-canada/ac-web
# Since API might change, this code might break
import json
import os
import urllib.request
import urllib.parse
import pandas as pd
import time
import requests
import io
import zipfile
import shutil
# Incidents
url = "http://incidents.avalanche.ca/public/incidents/?format=json"
req = urllib.request.Request(url)
with urllib.request.urlopen(req) as response:
    result = json.loads(response.read().decode('utf-8'))
incident_list = result["results"]
# the incidents API is paginated; follow the "next" links to collect all pages
while result["next"] is not None:
    req = urllib.request.Request(result["next"])
    with urllib.request.urlopen(req) as response:
        result = json.loads(response.read().decode('utf-8'))
    incident_list = incident_list + result["results"]
incidents_brief = pd.DataFrame.from_dict(incident_list,orient="columns")
pd.options.display.max_rows = 20
pd.options.display.max_columns = 8
incidents_brief
# We can get more information about these incidents e.g. "https://www.avalanche.ca/incidents/37d909e4-c6de-43f1-8416-57a34cd48255"
# this information is also available through the API
def get_incident_details(id):
    url = "http://incidents.avalanche.ca/public/incidents/{}?format=json".format(id)
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req) as response:
        result = json.loads(response.read().decode('utf-8'))
    return(result)
incidentsfile = "avalanche_incidents.csv"
# To avoid overloading the Avalanche Canada servers, we save the incident details locally.
if not os.path.isfile(incidentsfile):
    incident_detail_list = incidents_brief.id.apply(get_incident_details).to_list()
    incidents = pd.DataFrame.from_dict(incident_detail_list, orient="columns")
    incidents.to_csv(incidentsfile)
else:
    incidents = pd.read_csv(incidentsfile)
incidents
Many incidents include coordinates, but others do not. Most, however, do include a place name. We can use Natural Resources Canada’s Geolocation Service to retrieve coordinates from place names.
# geocode locations without coordinates
def geolocate(location, province):
    url = "http://geogratis.gc.ca/services/geolocation/en/locate?q={},%20{}"
    req = urllib.request.Request(url.format(urllib.parse.quote(location), province))
    with urllib.request.urlopen(req) as response:
        result = json.loads(response.read().decode('utf-8'))
    if len(result) == 0:
        return([None, None])
    else:
        return(result[0]['geometry']['coordinates'])

if "alt_coord" not in incidents.columns:
    incidents["alt_coord"] = [
        geolocate(incidents.location[i], incidents.location_province[i])
        for i in incidents.index
    ]
    incidents.to_csv(incidentsfile)
Now that we have incident data, let’s create some figures.
# clean up activity names
incidents.group_activity.unique()
incidents.group_activity=incidents.group_activity.replace("Ski touring","Backcountry Skiing")
incidents.group_activity=incidents.group_activity.replace("Out-of-Bounds Skiing","Backcountry Skiing")
incidents.group_activity=incidents.group_activity.replace("Lift Skiing Closed","Backcountry Skiing")
incidents.group_activity=incidents.group_activity.replace("Skiing","Backcountry Skiing")
incidents.group_activity=incidents.group_activity.replace("Snowshoeing","Snowshoeing & Hiking")
incidents.group_activity=incidents.group_activity.replace("Snowshoeing and Hiking","Snowshoeing & Hiking")
incidents.group_activity=incidents.group_activity.replace("Mechanized Skiing","Heli or Cat Skiing")
incidents.group_activity=incidents.group_activity.replace("Heliskiing","Heli or Cat Skiing")
incidents.group_activity=incidents.group_activity.replace("At Outdoor Worksite","Work")
incidents.group_activity=incidents.group_activity.replace("Control Work","Work")
incidents.group_activity=incidents.group_activity.replace("Hunting/Fishing","Other Recreational")
incidents.group_activity=incidents.group_activity.replace("Inside Car/Truck on Road","Car/Truck/Building")
incidents.group_activity=incidents.group_activity.replace("Car/Truck on Road","Car/Truck/Building")
incidents.group_activity=incidents.group_activity.replace("Inside Building","Car/Truck/Building")
incidents.group_activity=incidents.group_activity.replace("Outside Building","Car/Truck/Building")
incidents.group_activity.unique()
fig, ax = plt.subplots(1,2, sharey=True, figsize=(12,4))
colors=plt.rcParams["axes.prop_cycle"].by_key()["color"]
incidents.groupby(['group_activity']).id.count().plot(kind='bar', title="Incidents by Activity", ax=ax[0])
incidents.groupby(['group_activity']).num_fatal.sum().plot(kind='bar', title="Deaths by Activity", ax=ax[1], color=colors[1])
ax[0].set_xlabel(None)
ax[1].set_xlabel(None);
incidents["date"] = pd.to_datetime(incidents.ob_date)
incidents["year"] = incidents.date.apply(lambda x: x.year)
incidents.date = incidents.date.dt.date
colors=plt.rcParams["axes.prop_cycle"].by_key()["color"]
f = incidents.groupby(["year"]).num_fatal.sum()
n = incidents.groupby(["year"]).id.count()
yearstart=1950
f=f[f.index>yearstart]
n=n[n.index>yearstart]
fig,ax = plt.subplots(1,1,figsize=(12,4))
n.plot(ax=ax)
f.plot(ax=ax)
ax.set_ylabel("Count")
ax.annotate("Incidents", (2010, 4), color=colors[0])
ax.annotate("Deaths", (2011, 15), color=colors[1]);
Since the incident data includes coordinates, we might as well make a map too. Unfortunately, some latitudes and longitudes contain obvious errors. Here, we try to fix them.
import re
# fix errors in latitude, longitude
latlon = incidents.location_coords
def makenumeric(cstr):
    if cstr is None:
        return([None, None])
    elif type(cstr) == str:
        return([float(s) for s in re.findall(r'-?\d+\.?\d*', cstr)])
    else:
        return(cstr)
latlon = latlon.apply(makenumeric)
def good_lat(lat):
    return(lat >= 41.6 and lat <= 83.12)  # min & max latitude for Canada
def good_lon(lon):
    return(lon >= -141 and lon <= -52.6)  # min & max longitude for Canada
def fixlatlon(c):
    if len(c) < 2 or type(c[0]) != float or type(c[1]) != float:
        c = [None, None]
        return(c)
    lat = c[0]
    lon = c[1]
    # some records have latitude and longitude swapped
    if not good_lat(lat) and good_lat(lon):
        tmp = lat
        lat = lon
        lon = tmp
    # some records are missing the negative sign on the longitude
    if not good_lon(lon) and good_lon(-lon):
        lon = -lon
    if not good_lon(lon) and good_lon(lat):
        tmp = lat
        lat = lon
        lon = tmp
    if not good_lon(lon) and good_lon(-lat):
        tmp = -lat
        lat = lon
        lon = tmp
    # drop coordinates that still look invalid
    if not good_lat(lat) or not good_lon(lon):
        c[0] = None
        c[1] = None
    else:
        c[0] = lat
        c[1] = lon
    return(c)
incidents["latlon"] = latlon.apply(fixlatlon)
def fill_missing_latlon(c, a):
    # fill in the geocoded coordinates, a, when the original coordinates, c, are missing
    if type(a) == str:
        a = [float(s) for s in re.findall(r'-?\d+\.?\d*', a)]
    if len(a) < 2:
        a = [None, None]
    # the geolocation service returns [lon, lat], so reverse the order
    return([a[1], a[0]] if type(c[0]) != float else c)
incidents["latlon_filled"] = [fill_missing_latlon(c, a) for c, a in zip(incidents["latlon"], incidents["alt_coord"])]
nmiss = sum([a[0] is None for a in incidents.latlon_filled])
n = len(incidents.latlon_filled)
print("{} of {} incidents have latitude & longitude".format(n-nmiss, n))
# download forecast region definitions
req = urllib.request.Request("https://www.avalanche.ca/api/forecasts")
with urllib.request.urlopen(req) as response:
    forecastregions = json.loads(response.read().decode('utf-8'))
You may have to uncomment the second line below if folium is not installed.
# Map forecast regions and incidents
#!pip install --user folium
import folium
import matplotlib
cmap = matplotlib.cm.get_cmap('Set1')
fmap = folium.Map(location=[60, -98],
                  zoom_start=3,
                  tiles='Stamen Terrain')
with urllib.request.urlopen(req) as response:
    regions_tmp = json.loads(response.read().decode('utf-8'))
folium.GeoJson(regions_tmp,
               tooltip=folium.GeoJsonTooltip(fields=["name"], aliases=[""]),
               highlight_function=lambda x: {'weight': 10},
               style_function=lambda x: {'weight': 1}).add_to(fmap)
activities = incidents.group_activity.unique()
for i in incidents.index:
    if incidents.latlon_filled[i][0] is not None and incidents.latlon_filled[i][1] is not None:
        cindex = [j for j, x in enumerate(activities) if x == incidents.group_activity[i]][0]
        txt = "{}, {}<br>{} deaths"
        txt = txt.format(incidents.group_activity[i],
                         incidents.ob_date[i],
                         incidents.num_fatal[i])
        pop = folium.Popup(incidents.comment[i], parse_html=True, max_width=400)
        folium.CircleMarker(incidents.latlon_filled[i],
                            tooltip=txt,
                            popup=pop,
                            color=matplotlib.colors.to_hex(cmap(cindex)),
                            fill=True, radius=5).add_to(fmap)
fmap
Take a moment to click around the map and read about some of the incidents.
Between presenting this information on a map and the list at https://www.avalanche.ca/incidents/, which do you prefer and why?
Later, we will want to match incidents to forecasts, so let’s find the closest region to each incident.
Note that distance here will be in units of degrees latitude and longitude (or whatever coordinate system we use). At the equator, a distance of 1 degree is approximately 60 nautical miles.
Since lines of longitude converge away from the equator, a degree of longitude covers less ground the farther north you go, so degree-based distances increasingly exaggerate east-west separation.
This is not much of a problem if we’re just finding the nearest region, but if we care about accurate distances, we should re-project the latitude and longitude into a different coordinate system, as in the sketch below.
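For example, here is a minimal sketch of re-projecting coordinates into a metre-based coordinate system before measuring distance. It assumes geopandas is available (it appears in the pip install line at the top of this lecture); the two points and the choice of EPSG:3347 (Statistics Canada Lambert) are illustrative assumptions, not part of the original analysis.
# Hedged sketch: project lat/lon onto a metre-based CRS before measuring distance.
import geopandas as gpd
from shapely.geometry import Point
pts = gpd.GeoDataFrame(
    geometry=[Point(-116.2, 51.4), Point(-118.1, 52.9)],  # (lon, lat) pairs, made up for illustration
    crs="EPSG:4326",  # coordinates start as ordinary latitude/longitude (WGS84)
)
pts_proj = pts.to_crs("EPSG:3347")  # Statistics Canada Lambert, measured in metres
dist_m = pts_proj.geometry.iloc[0].distance(pts_proj.geometry.iloc[1])
print("distance is roughly {:.0f} km".format(dist_m / 1000))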
# Match incidents to nearest forecast regions.
from shapely.geometry import Point, Polygon, shape
point = Point(incidents.latlon_filled[0][1],incidents.latlon_filled[0][0])
# distance (in degrees) from a point to each forecast region
def distances(latlon):
    point = Point(latlon[1], latlon[0])
    df = pd.DataFrame.from_dict([{'id': feature['id'],
                                  'distance': shape(feature['geometry']).distance(point)}
                                 for feature in forecastregions['features']])
    return(df)

def nearest_region(x):
    # return the id of the closest forecast region, or None when coordinates are missing
    if x[0] is None:
        return(None)
    d = distances(x)
    return(d.id[d.distance.idxmin()])

incidents['nearest_region'] = incidents.latlon_filled.apply(nearest_region)
incidents['nearest_distance'] = incidents.latlon_filled.apply(lambda x: None if x[0] is None else distances(x).distance.min())
incidents
We’ll now download all forecasts for all regions since November 2011 (roughly the earliest data available).
We can only request one forecast at a time, so this takes many hours to download.
To make this process run more quickly for readers, we ran the code ourselves and then stored the data in the cloud.
The function below will fetch all the forecasts from the cloud storage location and save them to a folder named avalanche_forecasts.
def download_cached_forecasts():
    # remove the directory if necessary
    dirname = "avalanche_forecasts"
    if os.path.isdir(dirname):
        shutil.rmtree(dirname)
    # download the zipped file and unzip it here
    url = "https://storage.googleapis.com/qeds/data/avalanche_forecasts.zip"
    with requests.get(url) as res:
        if not res.ok:
            raise ValueError("failed to download the cached forecasts")
        with zipfile.ZipFile(io.BytesIO(res.content)) as z:
            z.extractall(".")
download_cached_forecasts()
The code below is what we initially ran to obtain all the forecasts.
You will notice that this code checks whether the files can be found in the avalanche_forecasts directory (they will be if you ran download_cached_forecasts above!) and only downloads them if they aren’t found.
You can experiment with this caching by deleting one or more files from the avalanche_forecasts folder and re-running the cells below.
# Functions for downloading forecasts from Avalanche Canada
def get_forecast(date, region):
    url = "https://www.avalanche.ca/api/bulletin-archive/{}/{}.json".format(date.isoformat(), region)
    try:
        req = urllib.request.Request(url)
        with urllib.request.urlopen(req) as response:
            result = json.loads(response.read().decode('utf-8'))
        return(result)
    except Exception:
        # some date/region combinations have no forecast; treat any failure as missing
        return(None)

def get_forecasts(start, end, region):
    day = start
    forecasts = []
    # request one forecast per day, stopping at the end of the season or today
    while day <= end and day < pd.Timestamp.today():
        forecasts = forecasts + [get_forecast(day, region)]
        time.sleep(0.1)  # to avoid too much load on Avalanche Canada servers
        day = day + pd.Timedelta(1, "D")
    return(forecasts)

def get_season(year, region):
    start_month = 11
    start_day = 20
    last_month = 5
    last_day = 1
    if not os.path.isdir("avalanche_forecasts"):
        os.mkdir("avalanche_forecasts")
    seasonfile = "avalanche_forecasts/{}_{}-{}.json".format(region, year, year + 1)
    if not os.path.isfile(seasonfile):
        # download the forecasts for this season and cache them locally
        startdate = pd.to_datetime("{}-{}-{} 12:00".format(year, start_month, start_day))
        lastdate = pd.to_datetime("{}-{}-{} 12:00".format(year + 1, last_month, last_day))
        season = get_forecasts(startdate, lastdate, region)
        with open(seasonfile, 'w') as outfile:
            json.dump(season, outfile, ensure_ascii=False)
    else:
        # use the locally cached file if it exists
        with open(seasonfile, "rb") as json_data:
            season = json.load(json_data)
    return(season)

forecastlist = []
for year in range(2011, 2019):
    print("working on {}".format(year))
    for region in [region["id"] for region in forecastregions["features"]]:
        forecastlist = forecastlist + get_season(year, region)
# convert to DataFrame and extract some variables
forecasts = pd.DataFrame.from_dict([f for f in forecastlist if f is not None], orient="columns")
forecasts["danger_date"] = forecasts.dangerRatings.apply(lambda r: r[0]["date"])
forecasts["danger_date"] = pd.to_datetime(forecasts.danger_date, utc=True).dt.date
forecasts["danger_alpine"]=forecasts.dangerRatings.apply(lambda r: r[0]["dangerRating"]["alp"])
forecasts["danger_treeline"]=forecasts.dangerRatings.apply(lambda r: r[0]["dangerRating"]["tln"])
forecasts["danger_belowtree"]=forecasts.dangerRatings.apply(lambda r: r[0]["dangerRating"]["btl"])
forecasts.head()
# merge incidents to forecasts
adf = pd.merge(forecasts, incidents, how="left",
               left_on=["region", "danger_date"],
               right_on=["nearest_region", "date"],
               indicator=True)
adf["incident"] = adf._merge == "both"
print("There were {} incidents matched with forecast data. These occurred on {}% of day-regions with forecasts".format(adf.incident.sum(), adf.incident.mean() * 100))
import seaborn as sns
ratings=sorted(adf.danger_alpine.unique())
ava_colors = ["#52BA4A", "#FFF300", "#F79218", "#EF1C29", "#1A1A1A", "#BFBFBF"]
for x in ["danger_alpine", "danger_treeline", "danger_belowtree"]:
    fig = sns.catplot(x=x, kind="count", col="incident", order=ratings, data=adf, sharey=False,
                      palette=ava_colors, height=3, aspect=2)
    plt.subplots_adjust(top=0.9)
    fig.fig.suptitle(x.replace("danger_", ""))
    display(fig)
The first step when using text as data is to pre-process the text.
In preprocessing, we will remove HTML tags and CSS, tokenize the text into words, remove stopwords (very common words such as “the” and “a”) and punctuation, and lemmatize words (reduce them to a common root, so that, e.g., “snowed” and “snowing” both become “snow”).
from bs4 import BeautifulSoup
import nltk
import string
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Remove stopwords (the, a, is, etc)
stopwords = set(nltk.corpus.stopwords.words('english'))
stopwords=stopwords.union(set(string.punctuation))
# Lemmatize words e.g. snowed and snowing are both snow (verb)
wnl = nltk.WordNetLemmatizer()
def text_prep(txt):
    soup = BeautifulSoup(txt, "lxml")
    [s.extract() for s in soup('style')]  # remove css
    txt = soup.text  # remove html tags
    tokens = [token for token in nltk.tokenize.word_tokenize(txt)]
    tokens = [token for token in tokens if not token in stopwords]
    tokens = [wnl.lemmatize(token) for token in tokens]
    if len(tokens) == 0:
        tokens = ["EMPTYSTRING"]
    return(tokens)
text_prep(forecasts.highlights[1000])
Now, let’s apply this to all avalanche summaries.
text_data = [text_prep(txt) for txt in adf.avalancheSummary]
Let’s make a bar plot of the most common words.
wf = nltk.FreqDist([word for doc in text_data for word in doc]).most_common(20)
words = [x[0] for x in wf]
cnt = [x[1] for x in wf]
fig, ax = plt.subplots(figsize=(12,4))
ax.bar(range(len(words)), cnt);
ax.set_xticks(range(len(words)));
ax.set_xticklabels(words, rotation='vertical');
ax.set_title('Most common words in avalanche summaries');
ax.set_xlabel('Word');
ax.set_ylabel('Occurrences');
plt.show()
The “bag of words” approach is the simplest way to convert a collection of processed text documents to a feature matrix. We view each document as a bag of words, and our feature matrix counts how many times each word appears. This method is called a “bag of words” because we ignore the document’s word order.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
vectorizer = CountVectorizer(max_features=500, min_df=5, max_df=0.7)
text_data = [text_prep(txt) for txt in adf.avalancheSummary]
y = adf.incident
X = vectorizer.fit_transform([' '.join(doc) for doc in text_data])
We can also perform more complicated feature engineering. One extension of the “bag of words” method is to count pairs or triples of consecutive words. These are called n-grams and can be created by setting the ngram_range argument of CountVectorizer. Another alternative is to account for the fact that common words will inherently have higher counts by using term-frequency inverse-document-frequency weighting (see below).
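As a quick illustration (a minimal sketch, not part of the original analysis), the same documents can be vectorized with unigrams and bigrams by passing ngram_range=(1, 2); the max_features choice here is arbitrary.
# Hedged sketch: bag-of-words features augmented with bigram counts.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=1000, min_df=5, max_df=0.7)
X_bigrams = bigram_vectorizer.fit_transform([' '.join(doc) for doc in text_data])
print(X_bigrams.shape)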
After creating our feature matrix, we can now apply any classification method to predict incidents.
A common text data classifier is the Naive Bayes classifier. This classifier predicts incidents using Bayes’ rule:

$$ P(incident | words) = \frac{P(words|incident) P(incident)}{P(words)} $$

The classifier is naive, though; it assumes the words are independent of one another conditional on an incident:

$$ P(words|incident) = \prod_{w \in words} P(w|incident) $$

Although this assumption is implausible for text, the Naive Bayes classifier can be computed extremely quickly and sometimes performs quite well.
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=124)
from sklearn import naive_bayes
classifier = naive_bayes.MultinomialNB()
classifier.fit(Xtrain,ytrain)
np.mean(classifier.predict(Xtest)==ytest)
from sklearn import metrics
print(metrics.confusion_matrix(ytest, classifier.predict(Xtest)))
print(metrics.classification_report(ytest, classifier.predict(Xtest)))
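To see what the classifier has learned, we can also inspect which words shift the predicted probability most strongly toward an incident. This is a rough sketch of our own, using attributes of the fitted MultinomialNB model (classes_ is sorted, so for a boolean target index 0 is False and index 1 is True).
# Hedged sketch: words whose estimated log P(w | incident) most exceeds log P(w | no incident).
word_names = np.array(vectorizer.get_feature_names())
log_ratio = classifier.feature_log_prob_[1, :] - classifier.feature_log_prob_[0, :]
print(word_names[np.argsort(log_ratio)[::-1][:15]])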
# print text with highest predicted probabilities
phat=classifier.predict_proba(X)[:,1]
def remove_html(txt):
soup = BeautifulSoup(txt, "lxml")
[s.extract() for s in soup('style')] # remove css
return(soup.text)
docs = [remove_html(txt) for txt in adf.avalancheSummary]
txt_high = [(_,x) for _, x in sorted(zip(phat,docs), key=lambda pair: pair[0],reverse=True)]
txt_high[:10]
# print text with lowest predicted probabilities
txt_low = [(_,x) for _, x in sorted(zip(phat,docs), key=lambda pair: pair[0])]
txt_low[:10]
See exercise 1 in the exercise list
Predicting deaths from forecast text is very difficult because deaths are so rare. A prediction exercise more likely to succeed would be to predict the avalanche rating from the forecast text. However, doing so is a very artificial task, with little practical use.
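To put the accuracy numbers above in context, here is a quick sketch (our addition, not part of the original analysis) comparing the classifier with the trivial rule of never predicting an incident, which is hard to beat precisely because incidents are so rare.
# Hedged sketch: accuracy of always predicting "no incident" versus the fitted classifier.
baseline_accuracy = np.mean(ytest == False)
nb_accuracy = np.mean(classifier.predict(Xtest) == ytest)
print("never-an-incident baseline: {:.4f}, Naive Bayes: {:.4f}".format(baseline_accuracy, nb_accuracy))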
Another alternative would be to gather more data on non-fatal avalanches. Avalanche Canada also has user-submitted “Mountain Information Network” reports. These reports include observations of natural avalanches and information on non-fatal avalanche incidents. Since the data is user-submitted, it is messy and more difficult to work with. Nonetheless, working with it would be good practice and could lead to some insights.
The regression and classification methods that we have seen so far are examples of supervised learning — we are trying to predict an observed outcome. In unsupervised learning, we do not have an observed outcome to predict. Instead, we try to find informative patterns in the data. Unsupervised learning can be particularly useful with text data. We will look at two related techniques for topic modeling. These techniques attempt to extract distinct topics from a collection of text documents.
Latent semantic analysis is used by some search engines to rank the similarities among documents. Latent semantic analysis begins with a term document matrix, $ X $. The term document matrix is a number-of-documents by number-of-terms matrix whose (i,j)th entry measures how often term j appears in document i. This could be the same bag of words feature matrix we constructed above, or it could be some other measure. For this example, we will use the term-frequency, inverse-document-frequency representation.
$$ x^{tfidf}_{ij} = \frac{\text{occurrences of term } j \text{ in document } i}{\text{length of document } i} \log \left(\frac{\text{number of documents}}{\text{number of documents containing term } j}\right) $$

Given a term document matrix, $ X $, latent semantic analysis computes a lower rank approximation to $ X $ through the singular value decomposition. This lower rank approximation can potentially be interpreted or used instead of $ X $ for other learning algorithms. In other contexts, similar decompositions are referred to as principal components analysis or factor models.
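As a small illustration of this formula, here is a toy computation on made-up documents (note that sklearn's TfidfVectorizer, used below, applies a smoothed variant, so its numbers will not match this exactly).
# Hedged sketch: computing the tf-idf formula above directly on a tiny made-up corpus.
toy_docs = [["storm", "slab", "avalanche"], ["wind", "slab"], ["no", "avalanche", "observed"]]
vocab = sorted({w for d in toy_docs for w in d})
ndoc = len(toy_docs)
tfidf = np.zeros((ndoc, len(vocab)))
for i, d in enumerate(toy_docs):
    for j, w in enumerate(vocab):
        tf = d.count(w) / len(d)                # occurrences of term j / length of document i
        df = sum(w in doc for doc in toy_docs)  # number of documents containing term j
        tfidf[i, j] = tf * np.log(ndoc / df)
print(pd.DataFrame(tfidf, columns=vocab))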
# LSA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
tfidf_vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=True)
X = tfidf_vectorizer.fit_transform([' '.join(doc) for doc in text_data])
svd_model = TruncatedSVD(n_components=10, algorithm='randomized', n_iter=100, random_state=122)
svd_model.fit(X)
Here, we have computed a rank 10 approximation to the tf-idf matrix. We can see how much of the variance of the original matrix our 10 components reproduce. We can also look at how all terms in the documents contribute to each of the 10 components.
print(svd_model.explained_variance_ratio_)
print(svd_model.explained_variance_ratio_.cumsum())
terms = tfidf_vectorizer.get_feature_names()
comp_label=[]
for i, comp in enumerate(svd_model.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:7]
    print("Topic " + str(i) + ": ")
    message = ""
    for t in sorted_terms:
        message = message + "{:.2f} * {} + ".format(t[1], t[0])
    print(message)
    comp_label.append(message)
Finally, we can attempt to visualize the components.
LSA reduced the dimensionality of each document’s representation from thousands of term-frequency inverse-document-frequency values to ten components. While ten is more manageable than thousands, it is still too many dimensions to visualize effectively. t-SNE is a technique to further reduce dimensionality. t-SNE is a nonlinear transformation from many dimensions down to two that attempts to preserve the clustering of the original high-dimensional space. In other words, if a set of documents is closely clustered in the 10-dimensional LSA space, then they will also be close together in the 2-dimensional t-SNE representation.
lsa_topic_matrix = svd_model.transform(X)
from sklearn.manifold import TSNE
nplot = 2000 # reduce the size of the data to speed computation and make the plot less cluttered
lsa_topic_sample = lsa_topic_matrix[np.random.choice(lsa_topic_matrix.shape[0], nplot, replace=False)]
tsne_lsa_model = TSNE(n_components=2, perplexity=50, learning_rate=500,
                      n_iter=1000, verbose=10, random_state=0, angle=0.75)
tsne_lsa_vectors = tsne_lsa_model.fit_transform(lsa_topic_sample)
The t-SNE model creates a non-linear projection from our 10 dimensional LSA topics onto two dimensional space. It can be useful for visualizing high-dimensional data. One word of caution: the output of the t-SNE model can depend on the parameters of the algorithm. Failure to see clear clusters in the t-SNE visualization could mean either the original data was not clustered in higher dimensional space or that the t-SNE algorithm parameters were chosen poorly.
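One rough way to check this sensitivity (a sketch we add here, with an arbitrary alternative parameter choice) is to re-run t-SNE with a different perplexity and see whether the apparent clusters persist when plotted the same way as below.
# Hedged sketch: re-fit t-SNE with a different (arbitrarily chosen) perplexity to check
# whether the visual clustering is robust to the algorithm's parameters.
tsne_check_model = TSNE(n_components=2, perplexity=15, learning_rate=200,
                        n_iter=1000, random_state=0)
tsne_check_vectors = tsne_check_model.fit_transform(lsa_topic_sample)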
cmap = matplotlib.cm.get_cmap('Paired')
fig, ax = plt.subplots(1,2,figsize=(16,6))
n_topics=len(svd_model.components_)
lsa_keys = np.argmax(lsa_topic_sample, axis=1)
ax[0].scatter(x=tsne_lsa_vectors[:,0],y=tsne_lsa_vectors[:,1], color=[cmap(i) for i in lsa_keys], alpha=0.8)
bbox_props = dict(boxstyle="round4,pad=0.1", lw=0.2, fc="white")
for i in range(n_topics):
    m = tsne_lsa_vectors[lsa_keys == i, :].mean(axis=0)
    ax[0].text(m[0], m[1], str(i), ha="center", va="center",
               size=15, color=cmap(i),
               bbox=bbox_props)
    ax[1].text(0, 1 - (i + 1) / (n_topics + 1), "Topic " + str(i) + " : " + comp_label[i],
               ha="left", va="center", color=cmap(i))
ax[1].axis('off')
fig.tight_layout()
From this plot, we can immediately see two things. First, most documents are closest to topic 0. Second, most topics are not well-separated.
See exercise 2 in the exercise list
Latent Dirichlet allocation (LDA) produces output similar to latent semantic analysis, but LDA often produces nicer results. The statistical theory underlying LSA is built on continuous $ X $ features. LDA uses similar ideas, but takes into account that text is discrete.
# LDA
import gensim
# gensim works with a list of lists of tokens
text_data = [text_prep(txt) for txt in forecasts.avalancheSummary]
# convert to bag of words
dictionary = gensim.corpora.Dictionary(text_data)
bow_data = [dictionary.doc2bow(text) for text in text_data]
ldamodel = gensim.models.ldamodel.LdaModel(bow_data, num_topics = 5, id2word=dictionary, passes=15)
topics = ldamodel.print_topics(num_words=10)
for topic in topics:
    print(topic)
import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
lda_display = pyLDAvis.gensim.prepare(ldamodel, bow_data, dictionary)
lda_display
Exercise 1
Use another classification method to predict incidents. Check whether your method outperforms the Naive Bayes classifier.
Exercise 2
Apply LSA to the weather or snowpack descriptions. Can you notice any patterns?
Exercise 3
Apply LDA to the weather or snowpack descriptions. Can you notice any patterns?
Exercise 4
Use the reduced rank representation of text from LSA or LDA as a feature matrix to predict avalanche incidents. Compare the performance with the bag of words feature matrix.