Outcomes

- Be familiar with datetime
- Use built-in aggregation functions, and be able to create your own and apply them using the agg method
- Use built-in Series transformation functions, and be able to create your own and apply them using the apply method
- Use built-in scalar transformation functions, and be able to create your own and apply them using the applymap method

Data

- Monthly unemployment data by US state
# Uncomment following line to install on colab
#! pip install qeds
In this lecture, we will use unemployment data by state at a monthly frequency.
import pandas as pd
%matplotlib inline
# activate plot theme
import qeds
qeds.themes.mpl_style();
pd.__version__
First, we will download the data directly from a url and read it into a pandas DataFrame.
## Load up the data -- this will take a couple seconds
url = "https://storage.googleapis.com/qeds/data/state_unemployment.csv"
unemp_raw = pd.read_csv(url, parse_dates=["Date"])
The pandas read_csv function will determine most of the datatypes of the underlying columns. The exception here is that we need to give pandas a hint so it can load the Date column as a Python datetime type: this is the role of the parse_dates=["Date"] argument.
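To double-check that this hint worked, we can inspect the column dtypes; the Date column should show up as a datetime64 type.
# The Date column should have a datetime64 dtype
unemp_raw.dtypes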
We can see the basic structure of the downloaded data by getting the first 5 rows, which directly matches the underlying CSV file.
unemp_raw.head()
Note that a row has a date, state, labor force size, and unemployment rate.
For our analysis, we want to look at the unemployment rate across different states over time, which requires a transformation of the data similar to an Excel pivot-table.
# Don't worry about the details here quite yet
unemp_all = (
unemp_raw
.reset_index()
.pivot_table(index="Date", columns="state", values="UnemploymentRate")
)
unemp_all.head()
Finally, we can filter the columns to look at a subset of the states.
states = [
"Arizona", "California", "Florida", "Illinois",
"Michigan", "New York", "Texas"
]
unemp = unemp_all[states]
unemp.head()
When plotting, a DataFrame uses its column and index names to label the figure.
unemp.plot(figsize=(8, 6))
See exercise 1 in the exercise list
You might have noticed that our index now has a nice format for the
dates (YYYY-MM-DD
) rather than just a year.
This is because the dtype
of the index is a variant of datetime
.
unemp.index
We can index into a DataFrame with a DatetimeIndex
using string
representations of dates.
For example
# Data corresponding to a single date
unemp.loc["01/01/2000", :]
# Data for all days between New Years Day and June first in the year 2000
unemp.loc["01/01/2000":"06/01/2000", :]
We will learn more about what pandas can do with dates and times in an upcoming lecture on time series data.
Let’s talk about aggregations.
Loosely speaking, an aggregation is an operation that combines multiple values into a single value.
For example, computing the mean of three numbers (for example
[0, 1, 2]
) returns a single number (1).
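For instance, here is the same computation done on a small throwaway pandas Series (just an illustration, not part of the unemployment analysis):
# the mean of [0, 1, 2] is 1.0
pd.Series([0, 1, 2]).mean()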
We will use aggregations extensively as we analyze and manipulate our data.
Thankfully, pandas makes this easy!
pandas already has some of the most frequently used aggregations.
For example:
- mean
- var
- std
- min
- median
- max

Note
When looking for common operations, using “tab completion” goes a long way.
unemp.mean()
As seen above, the aggregation’s default is to aggregate each column.
However, by using the axis
keyword argument, you can do aggregations by
row as well.
unemp.var(axis=1).head()
The built-in aggregations will get us pretty far in our analysis, but sometimes we need more flexibility.
We can have pandas perform custom aggregations by following these two steps:
1. Write a Python function that takes a Series as an input and outputs a single value.
2. Call the agg method with our new function as an argument.

For example, below, we will classify states as “low unemployment” or “high unemployment” based on whether their mean unemployment level is above or below 6.5.
#
# Step 1: We write the (aggregation) function that we'd like to use
#
def high_or_low(s):
    """
    This function takes a pandas Series object and returns high
    if the mean is above 6.5 and low if the mean is below 6.5
    """
    if s.mean() < 6.5:
        out = "Low"
    else:
        out = "High"

    return out
#
# Step 2: Apply it via the agg method.
#
unemp.agg(high_or_low)
# How does this differ from unemp.agg(high_or_low)?
unemp.agg(high_or_low, axis=1).head()
Notice that agg
can also accept multiple functions at once.
unemp.agg([min, max, high_or_low])
See exercise 2 in the exercise list
Many analytical operations do not necessarily involve an aggregation.
The output of a function applied to a Series might need to be a new Series.
Some examples are computing the month-to-month percentage change in unemployment, or the cumulative sum of a column's values.
pandas comes with many transform functions including:

- cum(sum|min|max|prod)
- diff
- elementwise arithmetic (+, -, *, /)
- pct_change
- value_counts
- abs

Again, tab completion is helpful when trying to find these functions.
unemp.head()
unemp.pct_change().head()
unemp.diff().head()
Transforms can be split into two main categories:

1. Series transforms: functions that take in an entire Series and produce an entire Series as output.
2. Scalar transforms: functions that take a single value and produce a single value, such as the abs method, or adding a constant to each value of a Series.

pandas also simplifies applying custom Series transforms to a Series or the columns of a DataFrame. The steps are:

1. Write a Python function that takes a Series as input and outputs a new Series.
2. Pass our new function as an argument to the apply method (alternatively, the transform method).

As an example, we will standardize our unemployment data to have mean 0 and standard deviation 1.
After doing this, we can use an aggregation to determine at which date the unemployment rate is most different from “normal times” for each state.
#
# Step 1: We write the Series transform function that we'd like to use
#
def standardize_data(x):
    """
    Changes the data in a Series to become mean 0 with standard deviation 1
    """
    mu = x.mean()
    std = x.std()

    return (x - mu) / std
#
# Step 2: Apply our function via the apply method.
#
std_unemp = unemp.apply(standardize_data)
std_unemp.head()
# Takes the absolute value of all elements of the DataFrame
abs_std_unemp = std_unemp.abs()
abs_std_unemp.head()
# find the date when unemployment was "most different from normal" for each state
def idxmax(x):
    # idxmax of a Series will return the index of the maximal value
    return x.idxmax()
abs_std_unemp.agg(idxmax)
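Note that DataFrames also have a built-in idxmax method, so the same dates can be obtained without writing a custom aggregation; the line below is an equivalent shortcut:
# equivalent to the custom aggregation above
abs_std_unemp.idxmax()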
As you may have predicted, we can also apply custom scalar transforms to our pandas data.
To do this, we use the following pattern:
1. Define a Python function that takes in a single value and outputs a single value.
2. Pass this function as an argument to the applymap method of a DataFrame (for a single Series, use the map method).

Complete the exercise below to practice writing and using your own scalar transforms.
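Before turning to the exercise, here is a minimal sketch of the pattern; the rounding rule below is just an illustration (not part of the lecture's analysis), and the function name is made up.
#
# A scalar transform: takes a single number and returns a single number
#
def round_to_tenth(x):
    """
    Rounds a single unemployment rate to one decimal place
    """
    return round(x, 1)

# applymap applies the scalar function to every element of the DataFrame
unemp.applymap(round_to_tenth).head()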
See exercise 3 in the exercise list
We have seen how we can select subsets of data by referring to the index or column names.
However, we often want to select based on conditions met by the data itself.
Some examples are selecting only the months in which a particular state's unemployment rate was below some threshold, or only the rows in which one state's rate exceeded another's.
We will be able to do this by using a Series or list of boolean values to index into a Series or DataFrame.
Let’s look at some examples.
unemp_small = unemp.head() # Create smaller data so we can see what's happening
unemp_small
# list of booleans selects rows
unemp_small.loc[[True, True, True, False, False]]
# second argument selects columns, the ``:`` means "all".
# here we use it to select all columns
unemp_small.loc[[True, False, True, False, True], :]
# can use booleans to select both rows and columns
unemp_small.loc[[True, True, True, False, False], [True, False, False, False, False, True, True]]
We can use conditional statements to construct Series of booleans from our data.
unemp_small["Texas"] < 4.5
Once we have our Series of bools, we can use it to extract subsets of rows from our DataFrame.
unemp_small.loc[unemp_small["Texas"] < 4.5]
unemp_small["New York"] > unemp_small["Texas"]
big_NY = unemp_small["New York"] > unemp_small["Texas"]
unemp_small.loc[big_NY]
In the boolean section of the basics lecture, we saw
that we can use the words and
and or
to combine multiple booleans into
a single bool.
Recall:
True and False -> False
True and True -> True
False and False -> False
True or False -> True
True or True -> True
False or False -> False
We can do something similar in pandas, but instead of
bool1 and bool2
we write:
(bool_series1) & (bool_series2)
Likewise, instead of bool1 or bool2
we write:
(bool_series1) | (bool_series2)
small_NYTX = (unemp_small["Texas"] < 4.7) & (unemp_small["New York"] < 4.7)
small_NYTX
unemp_small[small_NYTX]
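The | operator works the same way. For example, mirroring the cell above, we could ask for the dates on which at least one of the two states had unemployment below 4.7 (the variable name here is just illustrative):
either_small = (unemp_small["Texas"] < 4.7) | (unemp_small["New York"] < 4.7)
unemp_small[either_small]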
isin

Sometimes, we will want to check whether a data point takes on one of several fixed values.
We could do this by writing (df["x"] == val_1) | (df["x"] == val_2)
(like we did above), but there is a better way: the .isin
method.
unemp_small["Michigan"].isin([3.3, 3.2])
# now select full rows where this Series is True
unemp_small.loc[unemp_small["Michigan"].isin([3.3, 3.2])]
.any and .all

Recall from the boolean section of the basics lecture
that the Python functions any
and all
are aggregation functions that
take a collection of booleans and return a single boolean.
any returns True whenever at least one of the inputs is True, while all is True only when all of the inputs are True.
Series and DataFrames with dtype
bool have .any
and .all
methods that apply this logic to pandas objects.
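A minimal illustration on a throwaway Series of booleans (not part of the unemployment data):
s = pd.Series([True, False, True])
s.any()   # True: at least one element is True
s.all()   # False: not every element is True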
Let’s use these methods to count how many months all the states in our sample had high unemployment.
As we work through this example, consider the “want operator”, a helpful concept from Nobel Laureate Tom Sargent for clearly stating the goal of our analysis and determining the steps necessary to reach the goal.
We always begin by writing Want:
followed by what we want to
accomplish.
In this case, we would write:
Want: Count the number of months in which all states in our sample had unemployment above 6.5%
After identifying the want, we work backwards to identify the steps necessary to accomplish our goal.
So, starting from the result, we have:
1. Count the number of True values in a Series indicating the dates for which all states had high unemployment.
2. Build the Series used in the previous step by using the .all method on a DataFrame containing booleans that indicate whether each state had high unemployment at each date.
3. Build the DataFrame used in the previous step using a > comparison.

Now that we have a clear plan, let's follow through and apply the want operator:
# Step 3: construct the DataFrame of bools
high = unemp > 6.5
high.head()
# Step 2: use the .all method on axis=1 to get the dates where all states have a True
all_high = high.all(axis=1)
all_high.head()
# Step 1: Call .sum to add up the number of True values in `all_high`
# (note that True == 1 and False == 0 in Python, so .sum will count Trues)
msg = "Out of {} months, {} had high unemployment across all states"
print(msg.format(len(all_high), all_high.sum()))
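As a complementary check, we could use .any instead of .all to count the months in which at least one of the states in our sample had high unemployment (a small extension, not part of the original walk-through):
# months where at least one state had unemployment above 6.5
high.any(axis=1).sum()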
See exercise 4 in the exercise list
Exercise 1

Looking at the displayed DataFrame above, can you identify the index? The columns?
You can use the cell below to verify your visual intuition.
# your code here
Exercise 2
Do the following exercises in separate code cells below:
- What is the minimum unemployment rate in each state?
- What is the median unemployment rate in each state?
- What is the maximum unemployment rate across all states, and in which year did it occur? (Hint: What Python type, not dtype, is returned by the aggregation? The idxmax method may help.)
- Classify each state as having low or high volatility based on the variance of its unemployment rate.
# min unemployment rate by state
# median unemployment rate by state
# max unemployment rate across all states and Year
# low or high volatility
Exercise 3
Imagine that we want to determine whether unemployment was high (> 6.5), medium (4.5 < x <= 6.5), or low (<= 4.5) for each state and each month.
1. Write a Python function that classifies a single unemployment value as high, medium, or low.
2. Pass your function to applymap (quiz: why applymap and not agg or apply?) and save the result in a new DataFrame called unemp_bins.
3. Use unemp_bins to count how many times each state had each of the three classifications, and display the result as a horizontal bar chart.
4. Apply the same transform, but to each date instead of to each state.
times each state had each of the three classifications. # Part 1: Write a Python function to classify unemployment levels.
# Part 2: Pass your function from part 1 to applymap
unemp_bins = unemp.applymap  # replace this comment with your code!!
# Part 3: Count the number of times each state had each classification.
## then make a horizontal bar chart here
# Part 4: Apply the same transform from part 3, but to each date instead of to each state.
Exercise 4
This exercise uses the unemp_bins DataFrame from the exercise above.