Prerequisites
Outcomes
Data
# Uncomment following line to install on colab
#! pip install qeds
import pandas as pd
import numpy as np
Data can be saved in a variety of formats.
pandas understands how to write and read DataFrames to and from many of these formats.
We defer to the official documentation for a full description of how to interact with all the file formats, but will briefly discuss a few of them here.
What is it? CSVs store data as plain text (strings) where each row is a
line and columns are separated by ,
.
Pros
Cons
When to use:
What is it? xlsx is a binary file format used as Excel’s default.
Pros:
Cons:
When to use:
What is it? Parquet is a custom binary format designed for efficient reading and writing of data stored in columns.
Pros:
dtypes
used by pandas, including
multi-index DataFrames Cons:
When to use:
What is it? Feather is a custom binary format designed for efficient reading and writing of data stored in columns.
Pros:
dtypes
used by pandas Cons:
reset_index
before saving and then set_index
after loading When to use:
What is it? SQL is a language used to interact with relational databases… more info
Pros:
Cons:
When to use:
NOTE: We can cover interacting with SQL databases in a dedicated lecture – contact us for more information.
Let’s now talk about saving a DataFrame to a file.
As a general rule of thumb, if we have a DataFrame df
and we would
like to save to save it as a file of type FOO
, then we would call
the method named df.to_FOO(...)
.
We will show you how this can be done and try to highlight some of the items mentioned above.
But, we will not cover all possible options and features — we feel it is best to learn these as you need them by consulting the appropriate documentation.
First, we need some DataFrames to save. Let’s make them now.
Note that by default df2
will be approximately 10 MB.
If you need to change this number, adjust the value of
the wanted_mb
variable below.
np.random.seed(42) # makes sure we get the same random numbers each time
df1 = pd.DataFrame(
np.random.randint(0, 100, size=(10, 4)),
columns=["a", "b", "c", "d"]
)
wanted_mb = 10 # CHANGE THIS LINE
nrow = 100000
ncol = int(((wanted_mb * 1024**2) / 8) / nrow)
df2 = pd.DataFrame(
np.random.rand(nrow, ncol),
columns=["x{}".format(i) for i in range(ncol)]
)
print("df2.shape = ", df2.shape)
print("df2 is approximately {} MB".format(df2.memory_usage().sum() / (1024**2)))
# notice the plain text format -- one row per line, columns separated by `'`
print(df1.to_csv())
If we do pass an argument, the first argument will be used as the file name.
df1.to_csv("df1.csv")
Run the cell below to verify that the file was created.
import os
os.path.isfile("df1.csv")
Let’s see how long it takes to save df2
to a file. (Because of the %%time
at
the top, Jupyter will report the total time to run all code in
the cell)
%%time
df2.to_csv("df2.csv")
As we will see below, this isn’t as fastest file format we could choose.
When saving a DataFrame to an Excel workbook, we can choose both the name of the workbook (file) and the name of the sheet within the file where the DataFrame should be written.
We do this by passing the workbook name as the first argument and the sheet name as the second argument as follows.
df1.to_excel("df1.xlsx", "df1")
pandas also gives us the option to write more than one DataFrame to a workbook.
To do this, we need to first construct an instance of pd.ExcelWriter
and then pass that as the first argument to df.to_excel
.
Let’s see how this works.
with pd.ExcelWriter("df1.xlsx") as writer:
df1.to_excel(writer, "df1")
(df1 + 10).to_excel(writer, "df1 plus 10")
The
with ... as ... :
syntax used above is an example of a context manager.
We don’t need to understand all the details behind what this means (google it if you are curious).
For now, just recognize that particular syntax as the way to write multiple sheets to an Excel workbook.
WARNING:
Saving df2
to an excel file takes a very long time.
For that reason, we will just show the code and hard-code the output we saw when we ran the code.
%%time
df2.to_excel("df2.xlsx")
Wall time: 25.7 s
As noted above, the feather file format was developed for very efficient reading and writing between Python and your computer.
Support for this format is provided by a separate Python package called pyarrow
.
This package is not installed by default. To install it, copy/paste the code below into a code cell and execute.
markdown
!pip install pyarrow
The parameters for pyarrow.feather.write_feather
are the DataFrame and file name.
Let’s try it out.
import pyarrow.feather
pyarrow.feather.write_feather(df1, "df1.feather")
%%time
pyarrow.feather.write_feather(df2, "df2.feather")
An example timing result:
format | time |
---|---|
csv | 2.66 seconds |
xlsx | 25.7 seconds |
feather | 43 milliseconds |
As you can see, saving this DataFrame in the feather format was far faster than either CSV or Excel.
As with the df.to_FOO
family of methods, there are similar
pd.read_FOO
functions. (Note: they are in defined pandas, not as
methods on a DataFrame.)
These methods have many more options because data storage can be messy or wrong.
We will explore these in more detail in a separate lecture.
For now, we just want to highlight the differences in how to read data from each of the file formats.
Let’s start by reading the files we just created to verify that they match the data we began with.
# notice that index was specified in the first (0th -- why?) column of the file
df1_csv = pd.read_csv("df1.csv", index_col=0)
df1_csv.head()
df1_xlsx = pd.read_excel("df1.xlsx", "df1", index_col=0)
df1_xlsx.head()
# notice feather already knows what the index is
df1_feather = pyarrow.feather.read_feather("df1.feather")
df1_feather.head()
With the pd.read_FOO
family of functions, we can also read files
from places on the internet.
We saved our df1
DataFrame to a file
and posted it online.
Below, we show an example of using pd.read_csv
to read this file.
df1_url = "https://storage.googleapis.com/workshop_materials/df1.csv"
df1_web = pd.read_csv(df1_url, index_col=0)
df1_web.head()
Now it’s your turn…
In the cell below, the variable url
contains a web address to a csv
file containing the result of all NFL games from September 1920 to
February 2017.
Your task is to do the following:
pd.read_csv
to read this file into a DataFrame named nfl
nfl
nfl.xlsx
If you finish quickly, do some basic analysis of the data. Try to do something interesting. If you get stuck, here are some suggestions for what to try:
url = "https://raw.githubusercontent.com/fivethirtyeight/nfl-elo-game/"
url = url + "3488b7d0b46c5f6583679bc40fb3a42d729abd39/data/nfl_games.csv"
# your code here --- create more cells if necessary
If you want to remove the files we just created, run the following cell.
def try_remove(file):
if os.path.isfile(file):
os.remove(file)
for df in ["df1", "df2"]:
for extension in ["csv", "feather", "xlsx"]:
filename = df + "." + extension
try_remove(filename)