The Index

Prerequisites

Outcomes

  • Understand how the index is used to align data
  • Know how to set and reset the index
  • Understand how to select subsets of data by slicing on index and columns
  • Understand that for DataFrames, the column names also align data
In [1]:
# Uncomment following line to install on colab
#! pip install qeds
In [2]:
import pandas as pd
import numpy as np

So What is this Index?

Every Series or DataFrame has an index.

We told you that the index was the “row labels” for the data.

This is true, but an index in pandas does much more than label the rows.

The purpose of this lecture is to understand the importance of the index.

The pandas documentation says

Data alignment is intrinsic. The link between labels and data will not be broken unless done so explicitly by you.

In practice, the index and column names are used to make sure the data is properly aligned when operating on multiple DataFrames.

This is a somewhat abstract concept that is best understood by example…

Let’s begin by loading some data on GDP components that we collected from the World Bank’s World Development Indicators Dataset.

In [3]:
url = "https://storage.googleapis.com/qeds/data/wdi_data.csv"
df = pd.read_csv(url)
df.info()

df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72 entries, 0 to 71
Data columns (total 7 columns):
country        72 non-null object
year           72 non-null int64
GovExpend      72 non-null float64
Consumption    72 non-null float64
Exports        72 non-null float64
Imports        72 non-null float64
GDP            72 non-null float64
dtypes: float64(5), int64(1), object(1)
memory usage: 4.0+ KB
Out[3]:
country year GovExpend Consumption Exports Imports GDP
0 Canada 2017 0.372665 1.095475 0.582831 0.600031 1.868164
1 Canada 2016 0.364899 1.058426 0.576394 0.575775 1.814016
2 Canada 2015 0.358303 1.035208 0.568859 0.575793 1.794270
3 Canada 2014 0.353485 1.011988 0.550323 0.572344 1.782252
4 Canada 2013 0.351541 0.986400 0.518040 0.558636 1.732714

We’ll also extract a couple smaller DataFrames we can use in examples.

In [4]:
df_small = df.head(5)
df_small
Out[4]:
country year GovExpend Consumption Exports Imports GDP
0 Canada 2017 0.372665 1.095475 0.582831 0.600031 1.868164
1 Canada 2016 0.364899 1.058426 0.576394 0.575775 1.814016
2 Canada 2015 0.358303 1.035208 0.568859 0.575793 1.794270
3 Canada 2014 0.353485 1.011988 0.550323 0.572344 1.782252
4 Canada 2013 0.351541 0.986400 0.518040 0.558636 1.732714
In [5]:
df_tiny = df.iloc[[0, 3, 2, 4], :]
df_tiny
Out[5]:
country year GovExpend Consumption Exports Imports GDP
0 Canada 2017 0.372665 1.095475 0.582831 0.600031 1.868164
3 Canada 2014 0.353485 1.011988 0.550323 0.572344 1.782252
2 Canada 2015 0.358303 1.035208 0.568859 0.575793 1.794270
4 Canada 2013 0.351541 0.986400 0.518040 0.558636 1.732714
In [6]:
im_ex = df_small[["Imports", "Exports"]]
im_ex_copy = im_ex.copy()
im_ex_copy
Out[6]:
Imports Exports
0 0.600031 0.582831
1 0.575775 0.576394
2 0.575793 0.568859
3 0.572344 0.550323
4 0.558636 0.518040

Observe what happens when we evaluate im_ex + im_ex_copy.

In [7]:
im_ex + im_ex_copy
Out[7]:
Imports Exports
0 1.200063 1.165661
1 1.151550 1.152787
2 1.151585 1.137718
3 1.144688 1.100646
4 1.117272 1.036081

Notice that this operated elementwise, meaning that the + operation was applied to each element of im_ex and the corresponding element of im_ex_copy.

Let’s take a closer look at df_tiny:

In [8]:
df_tiny
Out[8]:
country year GovExpend Consumption Exports Imports GDP
0 Canada 2017 0.372665 1.095475 0.582831 0.600031 1.868164
3 Canada 2014 0.353485 1.011988 0.550323 0.572344 1.782252
2 Canada 2015 0.358303 1.035208 0.568859 0.575793 1.794270
4 Canada 2013 0.351541 0.986400 0.518040 0.558636 1.732714

Relative to im_ex notice a few things:

  • The row labeled 1 appears in im_ex but not df_tiny.
  • For row labels that appear in both, they are not in the same position within each DataFrame.
  • Certain columns appear only in df_tiny.
  • The Imports and Exports columns are the 6th and 5th columns of df_tiny and the 1st and 2nd of im_ex, respectively.

Now, let’s see what happens when we try df_tiny + im_ex.

In [9]:
im_ex_tiny = df_tiny + im_ex
im_ex_tiny
Out[9]:
Consumption Exports GDP GovExpend Imports country year
0 NaN 1.165661 NaN NaN 1.200063 NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN 1.137718 NaN NaN 1.151585 NaN NaN
3 NaN 1.100646 NaN NaN 1.144688 NaN NaN
4 NaN 1.036081 NaN NaN 1.117272 NaN NaN

Whoa, a lot happened! Let’s break it down.

Automatic Alignment

For all (row, column) combinations that appear in both DataFrames (e.g. rows [1, 3] and columns [Imports, Exports]), the value of im_ex_tiny is equal to df_tiny.loc[row, col] + im_ex.loc[row, col].

This happened even though the rows and columns were not in the same order.

We refer to this as pandas aligning the data for us.

To see how awesome this is, think about how to do something similar in Excel:

  • df_tiny and im_ex would be in different sheets.
  • The index and column names would be the first column and row in each sheet.
  • We would have a third sheet to hold the sum.
  • For each label in the first row and column of either the df_tiny sheet or the im_ex sheet we would have to do a IFELSE to check if the label exists in the other sheet and then a VLOOKUP to extract the value.

In pandas, this happens automatically, behind the scenes, and very quickly.

Handling Missing Data

For all elements in row 1 or columns ["country", "year", "GovExpend", "Consumption", "GDP"], the value in im_ex_tiny is NaN.

This is how pandas represents missing data.

So, when pandas was trying to look up the values in df_tiny and im_ex, it could only find a value in one DataFrame: the other value was missing.

When pandas tries to add a number to something that is missing, it says that the result is missing (spelled NaN).

See exercise 1 in the exercise list

Setting the Index

For a DataFrame df, the df.set_index method allows us to use one (or more) of the DataFrame’s columns as the index.

Here’s an example.

In [10]:
# first, create the DataFrame
df_year = df.set_index(["year"])
df_year.head()
Out[10]:
country GovExpend Consumption Exports Imports GDP
year
2017 Canada 0.372665 1.095475 0.582831 0.600031 1.868164
2016 Canada 0.364899 1.058426 0.576394 0.575775 1.814016
2015 Canada 0.358303 1.035208 0.568859 0.575793 1.794270
2014 Canada 0.353485 1.011988 0.550323 0.572344 1.782252
2013 Canada 0.351541 0.986400 0.518040 0.558636 1.732714

Now that the year is on the index, we can use .loc to extract all the data for a specific year.

In [11]:
df_year.loc[2010]
Out[11]:
country GovExpend Consumption Exports Imports GDP
year
2010 Canada 0.347332 0.921952 0.469949 0.500341 1.613543
2010 Germany 0.653386 1.915481 1.443735 1.266126 3.417095
2010 United Kingdom 0.521146 1.598563 0.690824 0.745065 2.452900
2010 United States 2.510143 10.185836 1.846280 2.360183 14.992053

This would be helpful, for example, if we wanted to compute the difference in the average of all our variables from one year to the next.

In [12]:
df_year.loc[2009].mean() - df_year.loc[2008].mean()
Out[12]:
GovExpend      0.033317
Consumption   -0.042998
Exports       -0.121425
Imports       -0.140042
GDP           -0.182610
dtype: float64

Notice that pandas did a few things for us.

  • After computing .mean(), the row labels (index) were the former column names.
  • These column names were used to align data when we wanted asked pandas to compute the difference.

Suppose that someone asked you, “What was the GDP in the US in 2010?”

To compute that using df_year you might do something like this:

In [13]:
df_year.loc[df_year["country"] == "United States", "GDP"].loc[2010]
Out[13]:
14.992052727

That was a lot of work!

Now, suppose that after seeing you extract that data, your friend asks you “What about GDP in Germany and the UK in 2010?”

To answer that question, you might write.

In [14]:
df_year.loc[df_year["country"].isin(["United Kingdom", "Germany"]), "GDP"].loc[2010]
Out[14]:
year
2010    3.417095
2010    2.452900
Name: GDP, dtype: float64

Notice that this code is similar to the code above, but now provides a result that is ambiguous.

The two elements in the series both have with label 2010.

How do we know which is which?

We might think that the first value corresponds to the United Kingdom because that is what we listed first in the call to isin, but we would be wrong!

Let’s check.

In [15]:
df_year.loc[2010]
Out[15]:
country GovExpend Consumption Exports Imports GDP
year
2010 Canada 0.347332 0.921952 0.469949 0.500341 1.613543
2010 Germany 0.653386 1.915481 1.443735 1.266126 3.417095
2010 United Kingdom 0.521146 1.598563 0.690824 0.745065 2.452900
2010 United States 2.510143 10.185836 1.846280 2.360183 14.992053

Setting just the year as index has one more potential issue: we will get data alignment only on the year, which may not be sufficient.

To demonstrate this point, suppose now you are asked to use our WDI dataset to compute an approximation for net exports and investment in in 2009.

As a seasoned economist, you would remember the expenditure formula for GDP is written

$$ GDP = Consumption + Investment + GovExpend + Net Exports $$

which we can rearrange to compute investment as a function of the variables in our DataFrame…

$$ Investment = GDP - Consumption - GovExpend - Net Exports $$

Note that we can compute NetExports as Exports - Imports.

In [16]:
nx = df_year["Exports"] - df_year["Imports"]
nx.head(19)
Out[16]:
year
2017   -0.017201
2016    0.000619
2015   -0.006934
2014   -0.022021
2013   -0.040596
2012   -0.041787
2011   -0.035878
2010   -0.030393
2009    0.000896
2008    0.004068
2007    0.032451
2006    0.053530
2005    0.072729
2004    0.091902
2003    0.097794
2002    0.121850
2001    0.122673
2000    0.118702
2017    0.264214
dtype: float64

Now, suppose that we accidentally had a bug in our code that swapped the data for Canada and Germany’s net exports in 2017.

Note

This example is contrived, but if you were getting unclean data from some resource or doing more complicated operations, this type of mistake becomes increasingly likely.

In [17]:
ca17 = nx.iloc[[0]]
g17 = nx.iloc[[18]]
nx.iloc[[0]] = g17
nx.iloc[[18]] = ca17

nx.head(19)
Out[17]:
year
2017    0.264214
2016    0.000619
2015   -0.006934
2014   -0.022021
2013   -0.040596
2012   -0.041787
2011   -0.035878
2010   -0.030393
2009    0.000896
2008    0.004068
2007    0.032451
2006    0.053530
2005    0.072729
2004    0.091902
2003    0.097794
2002    0.121850
2001    0.122673
2000    0.118702
2017   -0.017201
dtype: float64

Notice that if we now add nx to the DataFrame and compute investment pandas doesn’t complain.

In [18]:
df_year["NetExports"] = nx
df_year["Investment"] = df_year.eval("GDP - Consumption - GovExpend - NetExports")
df_year.head(19)
Out[18]:
country GovExpend Consumption Exports Imports GDP NetExports Investment
year
2017 Canada 0.372665 1.095475 0.582831 0.600031 1.868164 0.264214 0.135811
2016 Canada 0.364899 1.058426 0.576394 0.575775 1.814016 0.000619 0.390072
2015 Canada 0.358303 1.035208 0.568859 0.575793 1.794270 -0.006934 0.407692
2014 Canada 0.353485 1.011988 0.550323 0.572344 1.782252 -0.022021 0.438800
2013 Canada 0.351541 0.986400 0.518040 0.558636 1.732714 -0.040596 0.435369
2012 Canada 0.354342 0.961226 0.505969 0.547756 1.693428 -0.041787 0.419647
2011 Canada 0.351887 0.943145 0.492349 0.528227 1.664240 -0.035878 0.405086
2010 Canada 0.347332 0.921952 0.469949 0.500341 1.613543 -0.030393 0.374652
2009 Canada 0.339686 0.890078 0.440692 0.439796 1.565291 0.000896 0.334631
2008 Canada 0.330766 0.889602 0.506350 0.502281 1.612862 0.004068 0.388425
2007 Canada 0.318777 0.864012 0.530453 0.498002 1.596876 0.032451 0.381636
2006 Canada 0.311382 0.827643 0.524461 0.470931 1.564608 0.053530 0.372053
2005 Canada 0.303043 0.794390 0.519950 0.447222 1.524608 0.072729 0.354447
2004 Canada 0.299854 0.764357 0.508657 0.416754 1.477317 0.091902 0.321203
2003 Canada 0.294335 0.741796 0.481993 0.384199 1.433089 0.097794 0.299164
2002 Canada 0.286094 0.721974 0.490465 0.368615 1.407725 0.121850 0.277806
2001 Canada 0.279767 0.694230 0.484696 0.362023 1.366590 0.122673 0.269921
2000 Canada 0.270553 0.677713 0.499526 0.380823 1.342805 0.118702 0.275837
2017 Germany 0.745579 2.112009 1.930563 1.666348 3.883870 -0.017201 1.043482

Because we didn’t also have data alignment on the country, we would have overstated Canada’s investment by 281 billion USD and understated Germany’s by the same amount.

To make these types operation easier, we need to include both the year and country in the index…

Setting a Hierarchical Index

Include multiple columns in the index is advantageous in some situations.

These situations might include:

  • When we need more than one piece of information (column) to identify an observation (as in the Germany and UK GDP example above)
  • When we need data-alignment by more than one column

To achieve multiple columns in the index, we pass a list of multiple column names to set_index.

In [19]:
wdi = df.set_index(["country", "year"])
wdi.head(20)
Out[19]:
GovExpend Consumption Exports Imports GDP
country year
Canada 2017 0.372665 1.095475 0.582831 0.600031 1.868164
2016 0.364899 1.058426 0.576394 0.575775 1.814016
2015 0.358303 1.035208 0.568859 0.575793 1.794270
2014 0.353485 1.011988 0.550323 0.572344 1.782252
2013 0.351541 0.986400 0.518040 0.558636 1.732714
2012 0.354342 0.961226 0.505969 0.547756 1.693428
2011 0.351887 0.943145 0.492349 0.528227 1.664240
2010 0.347332 0.921952 0.469949 0.500341 1.613543
2009 0.339686 0.890078 0.440692 0.439796 1.565291
2008 0.330766 0.889602 0.506350 0.502281 1.612862
2007 0.318777 0.864012 0.530453 0.498002 1.596876
2006 0.311382 0.827643 0.524461 0.470931 1.564608
2005 0.303043 0.794390 0.519950 0.447222 1.524608
2004 0.299854 0.764357 0.508657 0.416754 1.477317
2003 0.294335 0.741796 0.481993 0.384199 1.433089
2002 0.286094 0.721974 0.490465 0.368615 1.407725
2001 0.279767 0.694230 0.484696 0.362023 1.366590
2000 0.270553 0.677713 0.499526 0.380823 1.342805
Germany 2017 0.745579 2.112009 1.930563 1.666348 3.883870
2016 0.734014 2.075615 1.844949 1.589495 3.801859

Notice that in the display above, the row labels seem to have two levels now.

The outer (or left-most) level is named country and the inner (or right-most) level is named year.

When a DataFrame’s index has multiple levels, we (and the pandas documentation) refer to the DataFrame as having a hierarchical index.

Slicing a Hierarchical Index

Now, we can answer our friend’s questions in a much more straightforward way.

In [20]:
wdi.loc[("United States", 2010), "GDP"]
Out[20]:
14.992052727
In [21]:
wdi.loc[(["United Kingdom", "Germany"], 2010), "GDP"]
Out[21]:
country         year
Germany         2010    3.417095
United Kingdom  2010    2.452900
Name: GDP, dtype: float64

As shown above, we can use wdi.loc to extract different slices of our national accounts data.

The rules for using .loc with a hierarchically-indexed DataFrame are similar to the ones we’ve learned for standard DataFrames, but they are a bit more elaborate as we now have more structure to our data.

We will summarize the main rules, and then work through an exercise that demonstrates each of them.

Slicing rules

pandas slicing reacts differently to lists and tuples.

It does this to provide more flexibility to select the data you want.

list in row slicing will be an “or” operation, where it chooses rows based on whether the index value corresponds to any element of the list.

tuple in row slicing will be used to denote a single hierarchical index and must include a value for each level.

Row slicing examples

  1. wdi.loc["United States"]: all rows where the outer most index value is equal to United States
  2. wdi.loc[("United States", 2010)]: all rows where the outer-most index value is equal to "United States and the second level is equal to 2010
  3. wdi.loc[["United States", "Canada"]]: all rows where the outer-most index is either "United States" or "Canada"
  4. wdi.loc[(["United States", "Canada"], [2010, 2011]), :]: all rows where the outer-most index is either "United States or "Canada" AND where the second level index is either 2010 or 2011
  5. wdi.loc[[("United States", 2010), ("Canada", 2011)], :]: all rows where the the two hierarchical indices are either ("United States", 2010) or ("Canada", 2011)

We can also restrict .loc to extract certain columns by doing:

  1. wdi.loc[rows, GDP]: return the rows specified by rows (see rules above) and only column named GDP (returned object will be a Series)
  2. df.loc[rows, ["GDP", "Consumption"]]: return the rows specified by rows (see rules above) and only columns GDP and Consumption

See exercise 2 in the exercise list

Alignment with MultiIndex

The data alignment features we talked about above also apply to a MultiIndex DataFrame.

The exercise below gives you a chance to experiment with this.

See exercise 3 in the exercise list

pd.IndexSlice

When we want to extract rows for a few values of the outer index and all values for an inner index level, we can use the convenient df.loc[[id11, id22]] shorthand.

We can use this notation to extract all the data for the United States and Canada.

In [22]:
wdi.loc[["United States", "Canada"]]
Out[22]:
GovExpend Consumption Exports Imports GDP
country year
Canada 2017 0.372665 1.095475 0.582831 0.600031 1.868164
2016 0.364899 1.058426 0.576394 0.575775 1.814016
2015 0.358303 1.035208 0.568859 0.575793 1.794270
2014 0.353485 1.011988 0.550323 0.572344 1.782252
2013 0.351541 0.986400 0.518040 0.558636 1.732714
2012 0.354342 0.961226 0.505969 0.547756 1.693428
2011 0.351887 0.943145 0.492349 0.528227 1.664240
2010 0.347332 0.921952 0.469949 0.500341 1.613543
2009 0.339686 0.890078 0.440692 0.439796 1.565291
2008 0.330766 0.889602 0.506350 0.502281 1.612862
2007 0.318777 0.864012 0.530453 0.498002 1.596876
2006 0.311382 0.827643 0.524461 0.470931 1.564608
2005 0.303043 0.794390 0.519950 0.447222 1.524608
2004 0.299854 0.764357 0.508657 0.416754 1.477317
2003 0.294335 0.741796 0.481993 0.384199 1.433089
2002 0.286094 0.721974 0.490465 0.368615 1.407725
2001 0.279767 0.694230 0.484696 0.362023 1.366590
2000 0.270553 0.677713 0.499526 0.380823 1.342805
United States 2017 2.405743 12.019266 2.287071 3.069954 17.348627
2016 2.407981 11.722133 2.219937 2.936004 16.972348
2015 2.373130 11.409800 2.222228 2.881337 16.710459
2014 2.334071 11.000619 2.209555 2.732228 16.242526
2013 2.353381 10.687214 2.118639 2.600198 15.853796
2012 2.398873 10.534042 2.045509 2.560677 15.567038
2011 2.434378 10.378060 1.978083 2.493194 15.224555
2010 2.510143 10.185836 1.846280 2.360183 14.992053
2009 2.507390 10.010687 1.646432 2.086299 14.617299
2008 2.407771 10.137847 1.797347 2.400349 14.997756
2007 2.351987 10.159387 1.701096 2.455016 15.018268
2006 2.314957 9.938503 1.564920 2.395189 14.741688
2005 2.287022 9.643098 1.431205 2.246246 14.332500
2004 2.267999 9.311431 1.335978 2.108585 13.846058
2003 2.233519 8.974708 1.218199 1.892825 13.339312
2002 2.193188 8.698306 1.192180 1.804105 12.968263
2001 2.112038 8.480461 1.213253 1.740797 12.746262
2000 2.040500 8.272097 1.287739 1.790995 12.620268

However, suppose we wanted to extract the data for all countries, but only the years 2005, 2007, and 2009.

We cannot do this using wdi.loc because the year is on the second level, not outer-most level of our index.

To get around this limitation, we can use the pd.IndexSlice helper.

Here’s an example.

In [23]:
wdi.loc[pd.IndexSlice[:, [2005, 2007, 2009]], :]
Out[23]:
GovExpend Consumption Exports Imports GDP
country year
Canada 2009 0.339686 0.890078 0.440692 0.439796 1.565291
2007 0.318777 0.864012 0.530453 0.498002 1.596876
2005 0.303043 0.794390 0.519950 0.447222 1.524608
Germany 2009 0.645023 1.908393 1.260525 1.121914 3.283144
2007 0.605624 1.894219 1.442436 1.213835 3.441356
2005 0.591184 1.866253 1.175200 1.028094 3.213777
United Kingdom 2009 0.519716 1.587152 0.653830 0.689011 2.411632
2007 0.504549 1.644789 0.710200 0.767699 2.527327
2005 0.490806 1.578914 0.640088 0.715951 2.403352
United States 2009 2.507390 10.010687 1.646432 2.086299 14.617299
2007 2.351987 10.159387 1.701096 2.455016 15.018268
2005 2.287022 9.643098 1.431205 2.246246 14.332500

Notice that the : in the first part of [:, ["A", "D"]] instructed pandas to give us rows for all values of the outer most index level and that the : just before ] said grab all the columns.

See exercise 4 in the exercise list

Multi-index Columns

The functionality of MultiIndex also applies to the column names.

Let’s see how it works.

In [24]:
wdiT = wdi.T  # .T means "transpose" or "swap rows and columns"
wdiT
Out[24]:
country Canada ... United States
year 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 ... 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000
GovExpend 0.372665 0.364899 0.358303 0.353485 0.351541 0.354342 0.351887 0.347332 0.339686 0.330766 ... 2.507390 2.407771 2.351987 2.314957 2.287022 2.267999 2.233519 2.193188 2.112038 2.040500
Consumption 1.095475 1.058426 1.035208 1.011988 0.986400 0.961226 0.943145 0.921952 0.890078 0.889602 ... 10.010687 10.137847 10.159387 9.938503 9.643098 9.311431 8.974708 8.698306 8.480461 8.272097
Exports 0.582831 0.576394 0.568859 0.550323 0.518040 0.505969 0.492349 0.469949 0.440692 0.506350 ... 1.646432 1.797347 1.701096 1.564920 1.431205 1.335978 1.218199 1.192180 1.213253 1.287739
Imports 0.600031 0.575775 0.575793 0.572344 0.558636 0.547756 0.528227 0.500341 0.439796 0.502281 ... 2.086299 2.400349 2.455016 2.395189 2.246246 2.108585 1.892825 1.804105 1.740797 1.790995
GDP 1.868164 1.814016 1.794270 1.782252 1.732714 1.693428 1.664240 1.613543 1.565291 1.612862 ... 14.617299 14.997756 15.018268 14.741688 14.332500 13.846058 13.339312 12.968263 12.746262 12.620268

5 rows × 72 columns

Notice that wdiT seems to have two levels of names for the columns.

The same logic laid out in the above row slicing rules applies when we have a hierarchical index for column names.

In [25]:
wdiT.loc[:, "United States"]
Out[25]:
year 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000
GovExpend 2.405743 2.407981 2.373130 2.334071 2.353381 2.398873 2.434378 2.510143 2.507390 2.407771 2.351987 2.314957 2.287022 2.267999 2.233519 2.193188 2.112038 2.040500
Consumption 12.019266 11.722133 11.409800 11.000619 10.687214 10.534042 10.378060 10.185836 10.010687 10.137847 10.159387 9.938503 9.643098 9.311431 8.974708 8.698306 8.480461 8.272097
Exports 2.287071 2.219937 2.222228 2.209555 2.118639 2.045509 1.978083 1.846280 1.646432 1.797347 1.701096 1.564920 1.431205 1.335978 1.218199 1.192180 1.213253 1.287739
Imports 3.069954 2.936004 2.881337 2.732228 2.600198 2.560677 2.493194 2.360183 2.086299 2.400349 2.455016 2.395189 2.246246 2.108585 1.892825 1.804105 1.740797 1.790995
GDP 17.348627 16.972348 16.710459 16.242526 15.853796 15.567038 15.224555 14.992053 14.617299 14.997756 15.018268 14.741688 14.332500 13.846058 13.339312 12.968263 12.746262 12.620268
In [26]:
wdiT.loc[:, ["United States", "Canada"]]
Out[26]:
country Canada ... United States
year 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 ... 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000
GovExpend 0.372665 0.364899 0.358303 0.353485 0.351541 0.354342 0.351887 0.347332 0.339686 0.330766 ... 2.507390 2.407771 2.351987 2.314957 2.287022 2.267999 2.233519 2.193188 2.112038 2.040500
Consumption 1.095475 1.058426 1.035208 1.011988 0.986400 0.961226 0.943145 0.921952 0.890078 0.889602 ... 10.010687 10.137847 10.159387 9.938503 9.643098 9.311431 8.974708 8.698306 8.480461 8.272097
Exports 0.582831 0.576394 0.568859 0.550323 0.518040 0.505969 0.492349 0.469949 0.440692 0.506350 ... 1.646432 1.797347 1.701096 1.564920 1.431205 1.335978 1.218199 1.192180 1.213253 1.287739
Imports 0.600031 0.575775 0.575793 0.572344 0.558636 0.547756 0.528227 0.500341 0.439796 0.502281 ... 2.086299 2.400349 2.455016 2.395189 2.246246 2.108585 1.892825 1.804105 1.740797 1.790995
GDP 1.868164 1.814016 1.794270 1.782252 1.732714 1.693428 1.664240 1.613543 1.565291 1.612862 ... 14.617299 14.997756 15.018268 14.741688 14.332500 13.846058 13.339312 12.968263 12.746262 12.620268

5 rows × 36 columns

In [27]:
wdiT.loc[:, (["United States", "Canada"], 2010)]
Out[27]:
country Canada United States
year 2010 2010
GovExpend 0.347332 2.510143
Consumption 0.921952 10.185836
Exports 0.469949 1.846280
Imports 0.500341 2.360183
GDP 1.613543 14.992053

See exercise 5 in the exercise list

Re-setting the Index

The df.reset_index method will move one or more level of the index back into the DataFrame as a normal column.

With no additional arguments, it moves all levels out of the index and sets the index of the returned DataFrame to the default of range(df.shape[0]).

In [28]:
wdi.reset_index()
Out[28]:
country year GovExpend Consumption Exports Imports GDP
0 Canada 2017 0.372665 1.095475 0.582831 0.600031 1.868164
1 Canada 2016 0.364899 1.058426 0.576394 0.575775 1.814016
2 Canada 2015 0.358303 1.035208 0.568859 0.575793 1.794270
3 Canada 2014 0.353485 1.011988 0.550323 0.572344 1.782252
4 Canada 2013 0.351541 0.986400 0.518040 0.558636 1.732714
5 Canada 2012 0.354342 0.961226 0.505969 0.547756 1.693428
6 Canada 2011 0.351887 0.943145 0.492349 0.528227 1.664240
7 Canada 2010 0.347332 0.921952 0.469949 0.500341 1.613543
8 Canada 2009 0.339686 0.890078 0.440692 0.439796 1.565291
9 Canada 2008 0.330766 0.889602 0.506350 0.502281 1.612862
10 Canada 2007 0.318777 0.864012 0.530453 0.498002 1.596876
11 Canada 2006 0.311382 0.827643 0.524461 0.470931 1.564608
12 Canada 2005 0.303043 0.794390 0.519950 0.447222 1.524608
13 Canada 2004 0.299854 0.764357 0.508657 0.416754 1.477317
14 Canada 2003 0.294335 0.741796 0.481993 0.384199 1.433089
15 Canada 2002 0.286094 0.721974 0.490465 0.368615 1.407725
16 Canada 2001 0.279767 0.694230 0.484696 0.362023 1.366590
17 Canada 2000 0.270553 0.677713 0.499526 0.380823 1.342805
18 Germany 2017 0.745579 2.112009 1.930563 1.666348 3.883870
19 Germany 2016 0.734014 2.075615 1.844949 1.589495 3.801859
20 Germany 2015 0.706115 2.033666 1.803081 1.527074 3.718482
21 Germany 2014 0.685990 1.999953 1.712270 1.445409 3.654924
22 Germany 2013 0.675471 1.979458 1.635030 1.394385 3.577015
23 Germany 2012 0.666454 1.967390 1.607455 1.354122 3.559587
24 Germany 2011 0.659528 1.941340 1.563277 1.355008 3.542160
25 Germany 2010 0.653386 1.915481 1.443735 1.266126 3.417095
26 Germany 2009 0.645023 1.908393 1.260525 1.121914 3.283144
27 Germany 2008 0.626140 1.905520 1.470300 1.241057 3.478602
28 Germany 2007 0.605624 1.894219 1.442436 1.213835 3.441356
29 Germany 2006 0.596868 1.894219 1.319574 1.142552 3.332692
... ... ... ... ... ... ... ...
42 United Kingdom 2011 0.521716 1.588172 0.735331 0.750540 2.493244
43 United Kingdom 2010 0.521146 1.598563 0.690824 0.745065 2.452900
44 United Kingdom 2009 0.519716 1.587152 0.653830 0.689011 2.411632
45 United Kingdom 2008 0.513870 1.635333 0.713184 0.753502 2.518585
46 United Kingdom 2007 0.504549 1.644789 0.710200 0.767699 2.527327
47 United Kingdom 2006 0.499312 1.604404 0.717506 0.786134 2.464591
48 United Kingdom 2005 0.490806 1.578914 0.640088 0.715951 2.403352
49 United Kingdom 2004 0.471828 1.531808 0.593046 0.667995 2.329987
50 United Kingdom 2003 0.452743 1.484092 0.562653 0.625696 2.276538
51 United Kingdom 2002 0.434954 1.433861 0.546092 0.606795 2.202971
52 United Kingdom 2001 0.418387 1.380779 0.533999 0.574277 2.149246
53 United Kingdom 2000 0.401274 1.332093 0.523797 0.548044 2.089877
54 United States 2017 2.405743 12.019266 2.287071 3.069954 17.348627
55 United States 2016 2.407981 11.722133 2.219937 2.936004 16.972348
56 United States 2015 2.373130 11.409800 2.222228 2.881337 16.710459
57 United States 2014 2.334071 11.000619 2.209555 2.732228 16.242526
58 United States 2013 2.353381 10.687214 2.118639 2.600198 15.853796
59 United States 2012 2.398873 10.534042 2.045509 2.560677 15.567038
60 United States 2011 2.434378 10.378060 1.978083 2.493194 15.224555
61 United States 2010 2.510143 10.185836 1.846280 2.360183 14.992053
62 United States 2009 2.507390 10.010687 1.646432 2.086299 14.617299
63 United States 2008 2.407771 10.137847 1.797347 2.400349 14.997756
64 United States 2007 2.351987 10.159387 1.701096 2.455016 15.018268
65 United States 2006 2.314957 9.938503 1.564920 2.395189 14.741688
66 United States 2005 2.287022 9.643098 1.431205 2.246246 14.332500
67 United States 2004 2.267999 9.311431 1.335978 2.108585 13.846058
68 United States 2003 2.233519 8.974708 1.218199 1.892825 13.339312
69 United States 2002 2.193188 8.698306 1.192180 1.804105 12.968263
70 United States 2001 2.112038 8.480461 1.213253 1.740797 12.746262
71 United States 2000 2.040500 8.272097 1.287739 1.790995 12.620268

72 rows × 7 columns

See exercise 6 in the exercise list

Choose the Index Carefully

So, now that we know that we use index and column names for aligning data, “how should we pick the index?” is a natural question to ask.

To guide us to the right answer, we will list the first two components to Hadley Wickham’s description of tidy data:

  1. Each column should each have one variable.
  2. Each row should each have one observation.

If we strive to have our data in a tidy form (we should), then when choosing the index, we should set:

  • the row labels (index) to be a unique identifier for an observation of data
  • the column names to identify one variable

For example, suppose we are looking data on interest rates.

Each column might represent one bond or asset and each row might represent the date.

Using hierarchical row and column indices allows us to store higher dimensional data in our (inherently) two dimensional DataFrame.

Know Your Goal

The correct column(s) to choose for the index often depends on the context of your analysis.

For example, if I were studying how GDP and consumption evolved over time for various countries, I would want time (year) and country name on the index

On the other hand, if I were trying to look at the differences across countries and variables within a particular year, I may opt to put the country and variable on the index and have years be columns.

Following the tidy data rules above and thinking about how you intend to use the data – and a little practice – will enable you to consistently select the correct index.

Exercises

Exercise 1

What happens when you apply the mean method to im_ex_tiny?

In particular, what happens to columns that have missing data? (HINT: also looking at the output of the sum method might help)

(back to text)

Exercise 2

For each of the examples below do the following:

  • Determine which of the rules above applies.
  • Identify the type of the returned value.
  • Explain why the slicing operation returned the data it did.

Write your answers.

In [29]:
wdi.loc[["United States", "Canada"]]
Out[29]:
GovExpend Consumption Exports Imports GDP
country year
Canada 2017 0.372665 1.095475 0.582831 0.600031 1.868164
2016 0.364899 1.058426 0.576394 0.575775 1.814016
2015 0.358303 1.035208 0.568859 0.575793 1.794270
2014 0.353485 1.011988 0.550323 0.572344 1.782252
2013 0.351541 0.986400 0.518040 0.558636 1.732714
2012 0.354342 0.961226 0.505969 0.547756 1.693428
2011 0.351887 0.943145 0.492349 0.528227 1.664240
2010 0.347332 0.921952 0.469949 0.500341 1.613543
2009 0.339686 0.890078 0.440692 0.439796 1.565291
2008 0.330766 0.889602 0.506350 0.502281 1.612862
2007 0.318777 0.864012 0.530453 0.498002 1.596876
2006 0.311382 0.827643 0.524461 0.470931 1.564608
2005 0.303043 0.794390 0.519950 0.447222 1.524608
2004 0.299854 0.764357 0.508657 0.416754 1.477317
2003 0.294335 0.741796 0.481993 0.384199 1.433089
2002 0.286094 0.721974 0.490465 0.368615 1.407725
2001 0.279767 0.694230 0.484696 0.362023 1.366590
2000 0.270553 0.677713 0.499526 0.380823 1.342805
United States 2017 2.405743 12.019266 2.287071 3.069954 17.348627
2016 2.407981 11.722133 2.219937 2.936004 16.972348
2015 2.373130 11.409800 2.222228 2.881337 16.710459
2014 2.334071 11.000619 2.209555 2.732228 16.242526
2013 2.353381 10.687214 2.118639 2.600198 15.853796
2012 2.398873 10.534042 2.045509 2.560677 15.567038
2011 2.434378 10.378060 1.978083 2.493194 15.224555
2010 2.510143 10.185836 1.846280 2.360183 14.992053
2009 2.507390 10.010687 1.646432 2.086299 14.617299
2008 2.407771 10.137847 1.797347 2.400349 14.997756
2007 2.351987 10.159387 1.701096 2.455016 15.018268
2006 2.314957 9.938503 1.564920 2.395189 14.741688
2005 2.287022 9.643098 1.431205 2.246246 14.332500
2004 2.267999 9.311431 1.335978 2.108585 13.846058
2003 2.233519 8.974708 1.218199 1.892825 13.339312
2002 2.193188 8.698306 1.192180 1.804105 12.968263
2001 2.112038 8.480461 1.213253 1.740797 12.746262
2000 2.040500 8.272097 1.287739 1.790995 12.620268
In [30]:
wdi.loc[(["United States", "Canada"], [2010, 2011, 2012]), :]
Out[30]:
GovExpend Consumption Exports Imports GDP
country year
Canada 2012 0.354342 0.961226 0.505969 0.547756 1.693428
2011 0.351887 0.943145 0.492349 0.528227 1.664240
2010 0.347332 0.921952 0.469949 0.500341 1.613543
United States 2012 2.398873 10.534042 2.045509 2.560677 15.567038
2011 2.434378 10.378060 1.978083 2.493194 15.224555
2010 2.510143 10.185836 1.846280 2.360183 14.992053
In [31]:
wdi.loc["United States"]
Out[31]:
GovExpend Consumption Exports Imports GDP
year
2017 2.405743 12.019266 2.287071 3.069954 17.348627
2016 2.407981 11.722133 2.219937 2.936004 16.972348
2015 2.373130 11.409800 2.222228 2.881337 16.710459
2014 2.334071 11.000619 2.209555 2.732228 16.242526
2013 2.353381 10.687214 2.118639 2.600198 15.853796
2012 2.398873 10.534042 2.045509 2.560677 15.567038
2011 2.434378 10.378060 1.978083 2.493194 15.224555
2010 2.510143 10.185836 1.846280 2.360183 14.992053
2009 2.507390 10.010687 1.646432 2.086299 14.617299
2008 2.407771 10.137847 1.797347 2.400349 14.997756
2007 2.351987 10.159387 1.701096 2.455016 15.018268
2006 2.314957 9.938503 1.564920 2.395189 14.741688
2005 2.287022 9.643098 1.431205 2.246246 14.332500
2004 2.267999 9.311431 1.335978 2.108585 13.846058
2003 2.233519 8.974708 1.218199 1.892825 13.339312
2002 2.193188 8.698306 1.192180 1.804105 12.968263
2001 2.112038 8.480461 1.213253 1.740797 12.746262
2000 2.040500 8.272097 1.287739 1.790995 12.620268
In [32]:
wdi.loc[("United States", 2010), ["GDP", "Exports"]]
Out[32]:
GDP        14.992053
Exports     1.846280
Name: (United States, 2010), dtype: float64
In [33]:
wdi.loc[("United States", 2010)]
Out[33]:
GovExpend       2.510143
Consumption    10.185836
Exports         1.846280
Imports         2.360183
GDP            14.992053
Name: (United States, 2010), dtype: float64
In [34]:
wdi.loc[[("United States", 2010), ("Canada", 2015)]]
Out[34]:
GovExpend Consumption Exports Imports GDP
country year
United States 2010 2.510143 10.185836 1.846280 2.360183 14.992053
Canada 2015 0.358303 1.035208 0.568859 0.575793 1.794270
In [35]:
wdi.loc[["United States", "Canada"], "GDP"]
Out[35]:
country        year
Canada         2017     1.868164
               2016     1.814016
               2015     1.794270
               2014     1.782252
               2013     1.732714
               2012     1.693428
               2011     1.664240
               2010     1.613543
               2009     1.565291
               2008     1.612862
               2007     1.596876
               2006     1.564608
               2005     1.524608
               2004     1.477317
               2003     1.433089
               2002     1.407725
               2001     1.366590
               2000     1.342805
United States  2017    17.348627
               2016    16.972348
               2015    16.710459
               2014    16.242526
               2013    15.853796
               2012    15.567038
               2011    15.224555
               2010    14.992053
               2009    14.617299
               2008    14.997756
               2007    15.018268
               2006    14.741688
               2005    14.332500
               2004    13.846058
               2003    13.339312
               2002    12.968263
               2001    12.746262
               2000    12.620268
Name: GDP, dtype: float64
In [36]:
wdi.loc["United States", "GDP"]
Out[36]:
year
2017    17.348627
2016    16.972348
2015    16.710459
2014    16.242526
2013    15.853796
2012    15.567038
2011    15.224555
2010    14.992053
2009    14.617299
2008    14.997756
2007    15.018268
2006    14.741688
2005    14.332500
2004    13.846058
2003    13.339312
2002    12.968263
2001    12.746262
2000    12.620268
Name: GDP, dtype: float64

(back to text)

Exercise 3

Try setting my_df to some subset of the rows in wdi (use one of the .loc variations above).

Then see what happens when you do wdi / my_df or my_df ** wdi.

Try changing the subset of rows in my_df and repeat until you understand what is happening.

(back to text)

Exercise 4

Below, we create wdi2, which is the same as df4 except that the levels of the index are swapped.

In the cells after df6 is defined, we have commented out a few of the slicing examples from the previous exercise.

For each of these examples, use pd.IndexSlice to extract the same data from df6.

(HINT: You will need to swap the order of the row slicing arguments within the pd.IndexSlice.)

In [37]:
wdi2 = df.set_index(["year", "country"])
In [38]:
# wdi.loc["United States"]
In [39]:
# wdi.loc[(["United States", "Canada"], [2010, 2011, 2012]), :]
In [40]:
# wdi.loc[["United States", "Canada"], "GDP"]

(back to text)

Exercise 5

Use pd.IndexSlice to extract all data from wdiT where the year level of the column names (the second level) is one of 2010, 2012, and 2014

(back to text)

Exercise 6

Look up the documentation for the reset_index method and study it to learn how to do the following:

  • Move just the year level of the index back as a column.
  • Completely throw away all levels of the index.
  • Remove the country of the index and do not keep it as a column.
In [41]:
# remove just year level and add as column
In [42]:
# throw away all levels of index
In [43]:
# Remove country from the index -- don't keep it as a column