Chicago Public Libraries Data Analysis¶

Author: Jesus Cantu Jr.¶

Last Updated: July 26, 2023¶

The City of Chicago Data Portal (https://data.cityofchicago.org/) holds a rich repository of data regarding the city’s infrastructure. Data sets can be downloaded for free and are available, at different organizational levels, in a variety of categories ranging from administration and finance information to sanitation and public safety requests.

Here, we will be focusing on data sets that contain information about Chicago Public Libraries (CPL), such as location, hours of operation, and number of monthly visitors. Public libraries are crucial institutions that provide free access to information and resources for all. They promote lifelong learning, support education, and bridge the digital divide by offering computers and internet access. Libraries serve as community hubs, fostering social connections and cultural engagement. They contribute to democracy by providing unbiased information and resources for civic engagement.

Thus, it is important that they are maintained and that adequate resources are provided, especially to libraries that see many visitors per month. For this project, we are particularly interested in the seasonality behind the use of public libraries. Is there a time of the year during which they experience more visitors (e.g., summer vs. winter)? Has there been a decline or increase in the use of public libraries? And if so, did the COVID-19 pandemic contribute to any of these changes?

We hope to answer all of these questions within the following exploration.

In [457]:
# Import Libraries
import os
import re
import folium
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from scipy import stats
from scipy.stats import ttest_ind
from geopy import distance
from shapely.geometry import Point
from pandas_profiling import ProfileReport
from IPython.display import display, HTML
from IPython.display import clear_output
from sklearn.linear_model import LinearRegression

Spatial Analysis - Part 1¶

Let's start our investigation by looking at how public libraries are distributed within the city of Chicago. For that, we will be using the following data set:

Libraries - Locations, Contact Information, and Usual Hours of Operation¶

https://data.cityofchicago.org/Education/Libraries-Locations-Contact-Information-and-Usual-/x8fc-8rcq

The data set contains a total of 81 rows. Each row in the dataset represents a library location in Chicago and includes the following fields:

  • Name: The name of the library.
  • Hours of Operation: The usual hours during which the library is open.
  • Address: The physical address of the library.
  • City: The city where the library is located, which in this case would be Chicago.
  • State: The state where the library is located.
  • Zip: The zip code of the library's location.
  • Phone: The contact phone number of the library.
  • Website: The website URL of the library.
  • Location: The geographical location of the library.

Please note that branches may experience short-term closures due to utility interruptions, HVAC problems, and other unforeseen issues. For more up-to-date hours, please see https://chipublib.bibliocommons.com/locations or the Hours & Locations link at the top of any page on https://www.chipublib.org. For current closures, please see https://www.chipublib.org/news/cpl-location-closures.

In [350]:
# Load local copy of the data set 
# libraries_df = pd.read_csv("~/Raw_Data/Library_Locations/Libraries_Locations_Contact_Information_and_Usual_Hours_of_Operation.csv")

# Or download a more recent copy from URL
url = "https://data.cityofchicago.org/api/views/x8fc-8rcq/rows.csv?accessType=DOWNLOAD"
libraries_df = pd.read_csv(url)

libraries_df.head()
Out[350]:
NAME HOURS OF OPERATION ADDRESS CITY STATE ZIP PHONE WEBSITE LOCATION
0 Vodak-East Side Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... 3710 E. 106th St. Chicago IL 60617 (312) 747-5500 https://www.chipublib.org/locations/71/ (41.70283443594318, -87.61428978448026)
1 Albany Park Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... 3401 W. Foster Ave. Chicago IL 60625 (773) 539-5450 https://www.chipublib.org/locations/3/ (41.97557881655979, -87.71361314512697)
2 Avalon Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... 8148 S. Stony Island Ave. Chicago IL 60617 (312) 747-5234 https://www.chipublib.org/locations/8/ (41.746393038286826, -87.5860053710736)
3 Brainerd Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... 1350 W. 89th St. Chicago IL 60620 (312) 747-6291 https://www.chipublib.org/locations/13/ (41.73244482025524, -87.65772892721816)
4 Popular Library at Water Works Mon. - Thurs., 10-6; Fri. & Sat., 9-5; Sun., 1-5 163 E. Pearson St. Chicago IL 60611 (312) 742-8811 https://www.chipublib.org/locations/73/ (41.897484072390675, -87.62337776811282)
In [351]:
libraries_df.tail()
Out[351]:
NAME HOURS OF OPERATION ADDRESS CITY STATE ZIP PHONE WEBSITE LOCATION
76 Woodson Regional Library Mon. - Thurs., 9-8; Fri. & Sat., 9-5; Sun., 1-5 9525 S. Halsted St. Chicago IL 60628 (312) 747-6900 https://www.chipublib.org/locations/81/ (41.720694885749005, -87.64304817213312)
77 Mayfair Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... 4400 W. Lawrence Ave. Chicago IL 60630 (312) 744-1254 https://www.chipublib.org/locations/49/ (41.968242773953044, -87.737968778247)
78 Chicago Bee Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... 3647 S. State St. Chicago IL 60609 (312) 747-6872 https://www.chipublib.org/locations/18/ (41.82824306445502, -87.6263495444489)
79 Uptown Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... 929 W. Buena Ave. Chicago IL 60613 (312) 744-8400 https://www.chipublib.org/locations/70/ (41.95832305807637, -87.65424744448335)
80 Sherman Park Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... 5440 S. Racine Ave. Chicago IL 60609 (312) 747-0477 https://www.chipublib.org/locations/64/ (41.79476901885989, -87.65502837616037)
In [352]:
# Inspect data set structure
libraries_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81 entries, 0 to 80
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   NAME                81 non-null     object
 1   HOURS OF OPERATION  81 non-null     object
 2   ADDRESS             81 non-null     object
 3   CITY                81 non-null     object
 4   STATE               81 non-null     object
 5   ZIP                 81 non-null     int64 
 6   PHONE               81 non-null     object
 7   WEBSITE             81 non-null     object
 8   LOCATION            81 non-null     object
dtypes: int64(1), object(8)
memory usage: 5.8+ KB

Note that the 'LOCATION' column values are currently strings in the format '(latitude, longitude)'. Before they can be used in a mapping function, they need to be converted into tuples of floats.

For mapping, we will use the Python library folium because it makes it easy to create interactive maps directly from Python, and it integrates well with pandas data structures. The alternative would be to use a JavaScript library like Leaflet directly, but this would require writing JavaScript code and wouldn't integrate as smoothly with our pandas dataframe.

In [438]:
# Define a function to parse location strings into tuples of floats
def parse_location(location_str):
    # Remove parentheses from the location string and split it into two parts by the comma
    lat_str, lon_str = location_str[1:-1].split(", ")
    # Convert the parts to floats and return them as a tuple
    return float(lat_str), float(lon_str)

# Remove leading and trailing white spaces from all column names in the dataframe
# This is to prevent issues where a column name can't be found due to unexpected white spaces
libraries_df.columns = libraries_df.columns.str.strip()

# Create a new folium Map object
chicago_map = folium.Map(location = [41.8781, -87.6298], zoom_start = 11)

# Iterate over each row in the library dataframe
for idx, row in libraries_df.iterrows():
    # Use the parse_location function to convert the location string into a tuple of floats
    location = parse_location(row['LOCATION'])
    # Add a marker to the map at the given location
    # The tooltip includes the label 'Public Library:' followed by the name of the library (with HTML bold formatting)
    tooltip_text = f'<b>Public Library:</b> {row["NAME"]}'
    folium.Marker(location, tooltip = tooltip_text).add_to(chicago_map)

# Display the map
# Note, this will only work in an interactive environment like a Jupyter notebook
# If you're running this script from a file, you may need to save the map to an HTML file and open it in a web browser
chicago_map
Out[438]:
[Interactive folium map of Chicago Public Library locations renders here]

Determining whether libraries are "equally spaced" can be a bit complex due to the irregular shape of most city boundaries and the fact that population density can vary significantly across a city. However, there are a few different ways we could analyze the distribution of libraries across the city.

One way is to create a heatmap of library locations. This would give us a visual representation of areas with a high density of libraries versus areas with a low density.
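
A minimal sketch of that first idea, using folium's HeatMap plugin; it assumes the libraries_df dataframe and parse_location function defined above:

In [ ]:
# Build a heatmap of library locations with folium's HeatMap plugin
from folium.plugins import HeatMap

heat_map = folium.Map(location = [41.8781, -87.6298], zoom_start = 11)

# Collect a (latitude, longitude) pair for every library
heat_points = [parse_location(row['LOCATION']) for _, row in libraries_df.iterrows()]

# The radius parameter controls how far each point's influence spreads
HeatMap(heat_points, radius = 25).add_to(heat_map)
heat_map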

Another way is to calculate the nearest neighbor distance for each library, which is the distance to the closest other library. If the libraries were perfectly equally spaced, all these distances would be the same. By calculating these distances and looking at their distribution (e.g., by calculating the mean and standard deviation or creating a histogram), we can get an idea of how equally spaced the libraries are.

In [439]:
# Define a function to calculate distances from each library to all other libraries
def calculate_distances(data):
    # Initialize an empty list to hold the minimum distances for each library
    distances = []
    
    # Iterate over each row (library) in the data
    for idx1, row1 in data.iterrows():
        # Initialize an empty list to hold the distances from the current library to all other libraries
        library_distances = []
        # Parse the location of the current library
        location1 = parse_location(row1['LOCATION'])
        
        # Again iterate over each row (library) in the data
        for idx2, row2 in data.iterrows():
            # If the second library is not the same as the current library
            if idx1 != idx2:
                # Parse the location of the second library
                location2 = parse_location(row2['LOCATION'])
                # Calculate the distance between the current library and the second library
                dist = distance.distance(location1, location2).miles
                # Add the calculated distance to the list of distances for the current library
                library_distances.append(dist)
        
        # Find the minimum distance for the current library and add it to the list of minimum distances
        distances.append(min(library_distances))
    
    # Return the list of minimum distances
    return distances

# Call the function to calculate the minimum distance for each library
min_distances = calculate_distances(libraries_df)

# Calculate the mean (average) of the minimum distances
mean_min_dist = round(sum(min_distances) / len(min_distances), 2)
print(f"Mean minimum distance: {mean_min_dist} miles")
Mean minimum distance: 1.22 miles
In [440]:
# Create a histogram of the minimum distances
plt.hist(min_distances, bins = 5, edgecolor = 'black')
plt.xlabel('Minimum Distance (miles)')
plt.ylabel('Frequency')
plt.title('Histogram of Minimum Distances between Libraries')
plt.show()

The result we got, 1.22 miles, is the average distance to the nearest neighboring library for each library in our data set.

This number by itself doesn't tell us whether the libraries are "too close" or "too far apart", because that depends on many factors including the size and population density of Chicago, and what we consider to be an acceptable distance to a library.

However, it can provide some useful context. For example, if we know that the average block in Chicago is about 0.1 miles long, this result suggests that the average library has another library within a distance of about 1.2 miles (i.e., about 12 blocks).

To better interpret this number, we might want to compare it to other cities, or to some standard of what we believe is a reasonable distance to a library. We might also want to consider other factors, like how this distance varies across different neighborhoods, and whether there are any areas that are particularly underserved by libraries.

For example, we could calculate the standard deviation of these minimum distances to see how much they vary. If the standard deviation is small, it suggests that most libraries are about the same distance from their nearest neighbor. If it's large, it means there's a lot of variation, and some libraries are much further from their nearest neighbor than others.

In [356]:
# Calculate the standard deviation of the minimum distances
std_min_dist = round(np.std(min_distances), 2)
print(f"Standard deviation of minimum distances: {std_min_dist} miles")
Standard deviation of minimum distances: 0.34 miles

The standard deviation of 0.34 miles tells us about the variation or dispersion of the minimum distances between libraries. In other words, it gives us a measure of how much these distances typically deviate from the average minimum distance, which was calculated to be 1.22 miles.

A small standard deviation (relative to the mean) indicates that the data points tend to be very close to the mean; in this context, it would mean that most libraries have their nearest neighboring library at a distance close to the average of 1.22 miles. A high standard deviation, on the other hand, indicates that the data points are spread out over a large range of values; here, it would mean that the distances to the nearest neighboring library vary widely from the average of 1.22 miles.

With a standard deviation of 0.34 miles, we can say that while the average distance to the nearest library is 1.22 miles, the distances vary somewhat, and it's not uncommon for a library to have its nearest neighbor anywhere between roughly 0.88 miles (1.22 - 0.34) and 1.56 miles (1.22 + 0.34) away.

However, remember that the standard deviation is just a rough guideline: about 68% of libraries will have their nearest neighbor within this range only if the data follows a normal distribution. There may still be some libraries with much larger or smaller distances to their nearest neighbor.
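
As a quick, optional check of that normality assumption, we could run a Shapiro-Wilk test on the minimum distances; a sketch using scipy, which is already imported above:

In [ ]:
# Shapiro-Wilk test for normality of the nearest-neighbor distances
# A small p-value (e.g., < 0.05) would suggest the distances are not
# normally distributed, making the 68% figure only a rough heuristic
stat, p_value = stats.shapiro(min_distances)
print(f"Shapiro-Wilk statistic: {stat:.3f}, p-value: {p_value:.3f}")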

This suggests that while libraries in Chicago are somewhat evenly distributed, there is still some variation, and some areas of the city might be better served than others.


Spatial Analysis - Part 2¶

Let's continue our investigation by looking at how this distance varies across different neighborhoods, and whether any areas are particularly underserved by libraries, by incorporating the following data:

Boundaries - Community Areas¶

https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6

Current community area boundaries in Chicago.

In [441]:
# Create a new folium Map object
chicago_map2 = folium.Map(location = [41.8781, -87.6298], zoom_start = 11)

# Load GeoJSON file with community areas from a local copy 
# community_areas_gdf = gpd.read_file("~/Raw_Data/Boundaries_Community_Areas/Boundaries_Community_Areas_Current.geojson")

# Load GeoJSON file with community areas from a URL
community_areas_url = "https://data.cityofchicago.org/api/geospatial/cauq-8yn6?method=export&format=GeoJSON"
community_areas_gdf = gpd.read_file(community_areas_url)

# Make sure 'community' is in the columns of community_areas_gdf
assert 'community' in community_areas_gdf.columns, "'community' column not found in community_areas_gdf"

# Add the community areas layer to the map
community_layer = folium.GeoJson(community_areas_gdf, name = "Community Areas")

# Add tooltips to the community layer using the 'community' column
# The tooltip includes the label 'Community Area:' followed by the name of the community area
tooltips = folium.features.GeoJsonTooltip(fields = ['community'], labels = True, sticky = True,
                                          aliases = ['Community Area: '])
community_layer.add_child(tooltips)

# Add the community layer to the map
community_layer.add_to(chicago_map2)

# Iterate over each row in the library dataframe
for idx, row in libraries_df.iterrows():
    # Use the parse_location function to convert the location string into a tuple of floats
    location = parse_location(row['LOCATION'])
    # Add a marker to the map at the given location
    # The tooltip includes the label 'Public Library:' followed by the name of the library (with HTML bold formatting)
    tooltip_text = f'<b>Public Library:</b> {row["NAME"]}'
    folium.Marker(location, tooltip = tooltip_text).add_to(chicago_map2)

# Display the map
chicago_map2
Out[441]:
[Interactive folium map of community area boundaries and library locations renders here]

By adding the community area layer to the map, we can visually discern whether there are any community areas that do not have public libraries within their boundaries. Out of 77 community areas, it seems that only 15 of them don't have a public library positioned exactly inside their boundaries; the list includes: O'Hare, Edison Park, Lincoln Park, Belmont Cragin, Hermosa, East Garfield Park, Near South Side, Oakland, Hyde Park, Washington Park, Chatham, Burnside, Calumet Heights, East Side, and Auburn Gresham. However, note that in most of these cases the public library lies at the boundary between two community areas.
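
Rather than relying on the map alone, we could verify this count programmatically with a point-in-polygon spatial join. Here is a sketch using geopandas; it assumes the libraries_df, community_areas_gdf, and parse_location objects defined above (and geopandas 0.10+ for the predicate keyword):

In [ ]:
# Point-in-polygon check: which community areas contain no library?
# Shapely points take (x, y) = (longitude, latitude) order
library_points = []
for _, row in libraries_df.iterrows():
    lat, lon = parse_location(row['LOCATION'])
    library_points.append(Point(lon, lat))

libraries_gdf = gpd.GeoDataFrame(libraries_df.copy(), geometry = library_points,
                                 crs = community_areas_gdf.crs)

# A left join keeps every community area; areas with no matching library
# end up with NaN in the library columns
# Note, a library sitting exactly on a boundary does not count as 'contained'
joined = gpd.sjoin(community_areas_gdf, libraries_gdf, how = 'left', predicate = 'contains')
print(sorted(joined[joined['NAME'].isna()]['community']))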

This is not necessarily a bad thing, as most public libraries seem to be evenly dispersed across Chicago. To make the investigation more rigorous, we could select points within each community area (e.g., the centroid) and calculate distances to the nearest public library, as sketched below. Additionally, connecting to an API, like Google Maps, could give us further information on distances and how different modes of transportation might affect them, offering deeper insight into which neighborhoods might be underserved by libraries.
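
A sketch of that centroid idea, again assuming the objects defined above (centroids computed directly on latitude/longitude coordinates are only approximate, which is acceptable for a rough screen):

In [ ]:
# Distance from each community area centroid to its nearest public library
library_coords = [parse_location(row['LOCATION']) for _, row in libraries_df.iterrows()]

# Geometries store (lon, lat); geopy expects (lat, lon)
centroid_results = []
for name, centroid in zip(community_areas_gdf['community'], community_areas_gdf.geometry.centroid):
    dists = [distance.distance((centroid.y, centroid.x), coords).miles for coords in library_coords]
    centroid_results.append((name, round(min(dists), 2)))

# The ten community areas whose centroid is furthest from any library
for name, dist_miles in sorted(centroid_results, key = lambda t: t[1], reverse = True)[:10]:
    print(f"{name}: {dist_miles} miles")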


Hours of Operation¶

The "HOURS OF OPERATION" field in our data appears to include operating hours for different days of the week in a semi-structured format.

To make better sense of it, we could separate the hours of operation for each day of the week into separate columns. This process involves parsing the text and can be a bit complex because the format isn't perfectly consistent (for example, sometimes there's a space after the comma, and sometimes there isn't). Let's create a function that extracts the hours for each day of the week and apply it to the "HOURS OF OPERATION" column. We'll create a new dataframe that includes these parsed hours, and then merge it with the original dataframe.

Regular expressions allow us to match and extract information from strings based on patterns. The structure of the "HOURS OF OPERATION" strings seems to follow this pattern:

  • A sequence of day abbreviations (Mon., Tues., etc.), followed by a comma.
  • A time range (e.g., "10-6", "Noon-8"), followed by a semicolon.
  • This pattern repeats for each day or sequence of days.

Let's create a regex pattern that matches this structure, and then use that pattern to extract the hours for each day.
In [442]:
# Regular expression patterns for each day
# Note, each pattern expects the day's time range to be terminated by a
# semicolon, so a final segment without one (Sunday, listed last in this
# data) will not match and is recorded as an empty string
patterns = {
    "MON": r"Mon\..*?,\s*(.*?);",
    "TUES": r"Tues\..*?,\s*(.*?);",
    "WED": r"Wed\..*?,\s*(.*?);",
    "THURS": r"Thurs\..*?,\s*(.*?);",
    "FRI": r"Fri\..*?,\s*(.*?);",
    "SAT": r"Sat\..*?,\s*(.*?);",
    "SUN": r"Sun\..*?,\s*(.*?);",
}

# Function to parse hours of operation using regex
def parse_hours_regex(hours_str):
    hours_dict = {}
    
    for day, pattern in patterns.items():
        match = re.search(pattern, hours_str)
        hours_dict[day] = match.group(1).strip() if match else ""
    
    return hours_dict

# Apply the function to the "HOURS OF OPERATION" column
hours_df = libraries_df['HOURS OF OPERATION'].apply(parse_hours_regex).apply(pd.Series)

# Merge the hours dataframe with the original dataframe
parsed_data = pd.concat([libraries_df, hours_df], axis = 1)

parsed_data.head()
Out[442]:
NAME HOURS OF OPERATION ADDRESS CITY STATE ZIP PHONE WEBSITE LOCATION MON TUES WED THURS FRI SAT SUN
0 Vodak-East Side Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... 3710 E. 106th St. Chicago IL 60617 (312) 747-5500 https://www.chipublib.org/locations/71/ (41.70283443594318, -87.61428978448026) Noon-8 10-6 Noon-8 10-6 9-5 9-5
1 Albany Park Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... 3401 W. Foster Ave. Chicago IL 60625 (773) 539-5450 https://www.chipublib.org/locations/3/ (41.97557881655979, -87.71361314512697) 10-6 Noon-8 10-6 Noon-8 9-5 9-5
2 Avalon Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... 8148 S. Stony Island Ave. Chicago IL 60617 (312) 747-5234 https://www.chipublib.org/locations/8/ (41.746393038286826, -87.5860053710736) Noon-8 10-6 Noon-8 10-6 9-5 9-5
3 Brainerd Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... 1350 W. 89th St. Chicago IL 60620 (312) 747-6291 https://www.chipublib.org/locations/13/ (41.73244482025524, -87.65772892721816) 10-6 Noon-8 10-6 Noon-8 9-5 9-5
4 Popular Library at Water Works Mon. - Thurs., 10-6; Fri. & Sat., 9-5; Sun., 1-5 163 E. Pearson St. Chicago IL 60611 (312) 742-8811 https://www.chipublib.org/locations/73/ (41.897484072390675, -87.62337776811282) 10-6 10-6 9-5 9-5

Now that we have separated the hours of operation by day of the week, we can also compute the total hours of operation per week, by first converting the hours of operation into a numerical format. Currently, the hours of operation are stored as strings in a 12-hour clock format, such as "Noon-8" or "10-6".

We'll start by defining a function that converts a string representing a range of hours into a number representing the total hours. This function will handle different ways of representing times, such as "Noon" or "10".

Then, we'll apply this function to the hours of operation for each day, sum up the total hours for each week, and store the result in a new column.

In [443]:
# Function to calculate total hours from a time range string
def calculate_hours(time_range_str):
    if not time_range_str:
        return 0
    
    # Convert "Noon" to "12"
    time_range_str = time_range_str.replace("Noon", "12")
    
    # Split the string into start time and end time
    start_str, end_str = time_range_str.split("-")
    
    # Convert start and end times to integers
    start = int(start_str.strip())
    end = int(end_str.strip())
    
    # If the end time is less than the start time, add 12 to it (to handle times in the evening)
    if end < start:
        end += 12
    
    # Calculate the total hours
    total_hours = end - start
    
    return total_hours

# Apply the function to the hours for each day and sum up the total hours for each week
parsed_data["TOTAL HOURS"] = parsed_data[["MON", "TUES", "WED", "THURS", "FRI", "SAT", "SUN"]].applymap(calculate_hours).sum(axis=1)

parsed_data.head()
Out[443]:
NAME HOURS OF OPERATION ADDRESS CITY STATE ZIP PHONE WEBSITE LOCATION MON TUES WED THURS FRI SAT SUN TOTAL HOURS
0 Vodak-East Side Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... 3710 E. 106th St. Chicago IL 60617 (312) 747-5500 https://www.chipublib.org/locations/71/ (41.70283443594318, -87.61428978448026) Noon-8 10-6 Noon-8 10-6 9-5 9-5 48
1 Albany Park Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... 3401 W. Foster Ave. Chicago IL 60625 (773) 539-5450 https://www.chipublib.org/locations/3/ (41.97557881655979, -87.71361314512697) 10-6 Noon-8 10-6 Noon-8 9-5 9-5 48
2 Avalon Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... 8148 S. Stony Island Ave. Chicago IL 60617 (312) 747-5234 https://www.chipublib.org/locations/8/ (41.746393038286826, -87.5860053710736) Noon-8 10-6 Noon-8 10-6 9-5 9-5 48
3 Brainerd Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... 1350 W. 89th St. Chicago IL 60620 (312) 747-6291 https://www.chipublib.org/locations/13/ (41.73244482025524, -87.65772892721816) 10-6 Noon-8 10-6 Noon-8 9-5 9-5 48
4 Popular Library at Water Works Mon. - Thurs., 10-6; Fri. & Sat., 9-5; Sun., 1-5 163 E. Pearson St. Chicago IL 60611 (312) 742-8811 https://www.chipublib.org/locations/73/ (41.897484072390675, -87.62337776811282) 10-6 10-6 9-5 9-5 32

Now, let's visualize all of this information.

In [444]:
# Convert hours to numeric for plot
hours_numeric_df = hours_df.applymap(calculate_hours)

# Drop the Sunday column: the regex above requires a trailing semicolon, so the final 'Sun.' segment never matches and the column contains no values
hours_numeric_df = hours_numeric_df.drop(columns="SUN")

# Create histograms for hours of operation for each day
plt.figure(figsize = (14, 10))
for i, day in enumerate(["MON", "TUES", "WED", "THURS", "FRI", "SAT"], start = 1):
    plt.subplot(2, 3, i)
    plt.hist(hours_numeric_df[day], bins = range(0, 13), alpha = 0.7, color = 'skyblue', edgecolor = 'black')
    plt.title(f"{day} Hours of Operation")
    plt.xlabel("Hours")
    plt.ylabel("Frequency")

plt.tight_layout()
plt.show()

# Create histogram for total hours per week
plt.figure(figsize = (10, 6))
plt.hist(parsed_data["TOTAL HOURS"], bins = range(0, 60, 2), alpha = 0.7, color = 'skyblue', edgecolor = 'black')
plt.title("Total Hours of Operation Per Week")
plt.xlabel("Hours")
plt.ylabel("Frequency")
plt.show()

The histograms at the top show the distribution of hours of operation for each day from Monday to Saturday across all libraries. Each bar represents the number of libraries operating for a given number of hours. For example, in the "MON Hours of Operation" histogram, the bar over "8" hours means that there are over 70 libraries that operate for 8 hours on Mondays.

The histogram at the bottom shows the distribution of total hours of operation per week across all libraries. Each bar represents the number of libraries operating for a given total number of hours per week. For example, the tallest bar over "48" hours means that the most common total hours of operation per week is around 48 hours.

Let's dig a little deeper and obtain the names of the public libraries that are open for less than the typical 48 hours per week.

In [445]:
# Filter the dataframe for libraries open less than 48 hours per week
less_than_48 = parsed_data[parsed_data["TOTAL HOURS"] < 48][["NAME", "TOTAL HOURS"]]

# Sort by total hours
less_than_48 = less_than_48.sort_values("TOTAL HOURS")

less_than_48
Out[445]:
NAME TOTAL HOURS
42 Galewood-Mont Clare 0
4 Popular Library at Water Works 32
5 Little Italy 32
13 Portage-Cragin 32
33 Archer Heights 32
47 West Pullman 32
51 Austin 32
26 Sulzer Regional Library 38
49 Legler Regional 38
60 Harold Washington Library Center 38
76 Woodson Regional Library 38

Compared to the rest, it appears that most of these libraries are open for 32 or 38 hours per week, with one library (Galewood-Mont Clare) having a total of 0 hours of operation listed, which might be a data error or indicate that it is currently closed. Based on the list provided at https://www.chipublib.org/news/cpl-location-closures, we can see that Galewood-Mont Clare is currently closed.

As it stands, this data set does not provide any additional information that could tell us whether differences in hours of operation, for example, affect the way people utilize Chicago Public Libraries. Next, we will incorporate visitor count data to answer this and other questions relevant to our investigation.

But first, let's save the changes we made for later use.

In [380]:
# Melt the dataframe to long format (long format is better for analysis)
location_long_format = pd.melt(parsed_data, id_vars = ['NAME', 'HOURS OF OPERATION', 'ADDRESS', 'CITY', 'STATE', 'ZIP',
                                            'PHONE', 'WEBSITE', 'LOCATION', 'TOTAL HOURS'], 
                      value_vars = ["MON", "TUES", "WED", "THURS", "FRI", "SAT", "SUN"], 
                      var_name = 'DAY', value_name = 'HOURS')

# Convert hours to numerical format
location_long_format['HOURS'] = location_long_format['HOURS'].apply(calculate_hours)

# Rename specific columns
column_rename_dict = {'TOTAL HOURS': 'TOTAL HOURS WEEK', 'HOURS': 'HOURS OPEN'}
location_long_format = location_long_format.rename(columns=column_rename_dict)

# Specify the saving path for the CSV file
saving_path = '~/Processed_Data/Library_Locations_Processed.csv'

# Save the long format DataFrame to a CSV file
location_long_format.to_csv(saving_path, index = False)

location_long_format.head()
Out[380]:
NAME HOURS OF OPERATION ADDRESS CITY STATE ZIP PHONE WEBSITE LOCATION TOTAL HOURS WEEK DAY HOURS OPEN
0 Vodak-East Side Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... 3710 E. 106th St. Chicago IL 60617 (312) 747-5500 https://www.chipublib.org/locations/71/ (41.70283443594318, -87.61428978448026) 48 MON 8
1 Albany Park Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... 3401 W. Foster Ave. Chicago IL 60625 (773) 539-5450 https://www.chipublib.org/locations/3/ (41.97557881655979, -87.71361314512697) 48 MON 8
2 Avalon Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... 8148 S. Stony Island Ave. Chicago IL 60617 (312) 747-5234 https://www.chipublib.org/locations/8/ (41.746393038286826, -87.5860053710736) 48 MON 8
3 Brainerd Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... 1350 W. 89th St. Chicago IL 60620 (312) 747-6291 https://www.chipublib.org/locations/13/ (41.73244482025524, -87.65772892721816) 48 MON 8
4 Popular Library at Water Works Mon. - Thurs., 10-6; Fri. & Sat., 9-5; Sun., 1-5 163 E. Pearson St. Chicago IL 60611 (312) 742-8811 https://www.chipublib.org/locations/73/ (41.897484072390675, -87.62337776811282) 32 MON 8

Incorporating Visitor Data¶

We will continue our investigation into Chicago Public Libraries by adding information on the number of visitors that each location experiences per month and year. The Chicago Data Portal has multiple data sets, structured similarly to the one below for 2011 (a sketch for downloading them all programmatically follows this list):

  • Data for 2011: https://data.cityofchicago.org/Education/Libraries-2011-Visitors-by-Location/xxwy-zyzu
  • Data for 2012: https://data.cityofchicago.org/Education/Libraries-2012-Visitors-by-Location/zh3n-jtnt
  • Data for 2013: https://data.cityofchicago.org/Education/Libraries-2013-Visitors-by-Location/x74m-smqb
  • Data for 2014: https://data.cityofchicago.org/Education/Libraries-2014-Visitors-by-Location/si8n-dg3u
  • Data for 2015: https://data.cityofchicago.org/dataset/Libraries-2015-Visitors-by-Location/7imc-umy4
  • Data for 2016: https://data.cityofchicago.org/dataset/Libraries-2016-Visitors-by-Location/cpc6-pxmp
  • Data for 2017: https://data.cityofchicago.org/dataset/Libraries-2017-Visitors-by-Location/bk6j-nu5x
  • Data for 2018: https://data.cityofchicago.org/dataset/Libraries-2018-Visitors-by-Location/i7zz-iiza
  • Data for 2019: https://data.cityofchicago.org/dataset/Libraries-2019-Visitors-by-Location/sw6v-npyj
  • Data for 2020: https://data.cityofchicago.org/Education/Libraries-2020-Visitors-by-Location/pb9h-bnh4
  • Data for 2021: https://data.cityofchicago.org/Education/Libraries-2021-Visitors-by-Location/8i46-4b7w
  • Data for 2022: https://data.cityofchicago.org/Education/Libraries-2022-Visitors-by-Location/ykhx-yxn9
  • Data for 2023: https://data.cityofchicago.org/Education/Libraries-2023-Visitors-by-Location/74j2-zzz4
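
Since each of these files sits behind the same rows.csv endpoint we used earlier for the locations data, we could also pull them all straight from the portal. A sketch, with the four-character resource IDs taken from the URLs listed above:

In [ ]:
# Download every yearly visitor file directly from the Chicago Data Portal
dataset_ids = {
    2011: 'xxwy-zyzu', 2012: 'zh3n-jtnt', 2013: 'x74m-smqb', 2014: 'si8n-dg3u',
    2015: '7imc-umy4', 2016: 'cpc6-pxmp', 2017: 'bk6j-nu5x', 2018: 'i7zz-iiza',
    2019: 'sw6v-npyj', 2020: 'pb9h-bnh4', 2021: '8i46-4b7w', 2022: 'ykhx-yxn9',
    2023: '74j2-zzz4',
}

visitors_by_year = {
    year: pd.read_csv(f"https://data.cityofchicago.org/api/views/{dataset_id}/rows.csv?accessType=DOWNLOAD")
    for year, dataset_id in dataset_ids.items()
}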

Libraries - 2011 Visitors by Location¶

The data set contains 79 library locations (as the info() output below shows, the raw file also holds a couple of extra rows, such as totals, which we will drop during cleaning). Each row in the data set represents a library location in Chicago and includes the following fields:

  • Location: The name of the library.
  • January to December: The number of visitors each month.
  • YTD (Year To Date): The total number of visitors for the year.

This data set provides information on the number of visitors to each library in the Chicago Public Library system, which includes the Harold Washington Library Center, Sulzer and Woodson regional libraries, and over 70 neighborhood branches. Please note that some locations may experience sporadic closures due to facilities upgrades, emergency closures due to heating or air conditioning issues, or area power outages.

An asterisk (*) in the 'LOCATION' name signifies that the count does not reflect the total building visitor count due to the location of the traffic counter. Community room and program traffic are not included in the totals.

In [448]:
# Load the 2011 data set from the URL and preview it
visitors_2011 = pd.read_csv("https://data.cityofchicago.org/api/views/xxwy-zyzu/rows.csv?accessType=DOWNLOAD")
visitors_2011.head()
Out[448]:
LOCATION ADDRESS CITY ZIP CODE JANUARY FEBRUARY MARCH APRIL MAY JUNE JULY AUGUST SEPTEMBER OCTOBER NOVEMBER DECEMBER YTD
0 Albany Park 5150 N. Kimball Avenue CHICAGO 60625.0 9604.0 10500.0 9050.0 9300.0 8271.0 10984.0 9986.0 11078.0 9453.0 10213.0 9377.0 9609.0 117425.0
1 Altgeld 13281 S. Corliss Avenue CHICAGO 60827.0 5809.0 3899.0 5207.0 5201.0 4494.0 5760.0 3653.0 2414.0 4552.0 6891.0 5698.0 5079.0 58657.0
2 Archer Heights* 5055 S. Archer Avenue CHICAGO 60632.0 9829.0 9394.0 11342.0 11114.0 9365.0 11247.0 10329.0 11231.0 10373.0 11364.0 10011.0 9054.0 124653.0
3 Austin 5615 W. Race Avenue CHICAGO 60644.0 6713.0 6250.0 7054.0 9139.0 8857.0 9586.0 8352.0 10359.0 9151.0 10016.0 8461.0 8368.0 102306.0
4 Austin-Irving 6100 W. Irving Park Road CHICAGO 60634.0 11556.0 9904.0 13214.0 13064.0 10969.0 12587.0 12596.0 13638.0 12542.0 13286.0 11868.0 10628.0 145852.0
In [447]:
visitors_2011.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81 entries, 0 to 80
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   LOCATION   80 non-null     object 
 1   ADDRESS    79 non-null     object 
 2   CITY       79 non-null     object 
 3   ZIP CODE   79 non-null     float64
 4   JANUARY    80 non-null     float64
 5   FEBRUARY   80 non-null     float64
 6   MARCH      79 non-null     float64
 7   APRIL      79 non-null     float64
 8   MAY        79 non-null     float64
 9   JUNE       78 non-null     float64
 10  JULY       78 non-null     float64
 11  AUGUST     77 non-null     float64
 12  SEPTEMBER  78 non-null     float64
 13  OCTOBER    78 non-null     float64
 14  NOVEMBER   78 non-null     float64
 15  DECEMBER   78 non-null     float64
 16  YTD        79 non-null     float64
dtypes: float64(14), object(3)
memory usage: 10.9+ KB

Data Cleaning¶

I inspected all the data sets, from 2011 to 2023, and noticed a few things:

  • Mismatched data types occur when loading the CSV files
  • All missing values are left blank
  • There is no way to tell which year the data belongs to
  • Data for 2011 includes rows with Totals
  • Data after 2011 changes column 'ZIP CODE' to 'ZIP'
  • Data for 2015-2018 does not include the column 'ADDRESS' or 'CITY'
  • Data for 2016 contains empty rows
  • Data after 2019 includes 'ADDRESS' and 'CITY' again but changed the column name 'LOCATION' to 'BRANCH' to refer to the name of each public library
  • Data after 2019 uses the column 'LOCATION' to refer to geographic location (latitude & longitude)
  • Data for 2013 uses 'Location' instead of 'LOCATION' for its column name
  • Asterisks are used after the name of each library to highlight instances where the count does not reflect the total building visitor count due to location of traffic counter

These discrepancies make it difficult to analyze the data in aggregate and should be corrected. The code below cleans the data sets in an automated fashion and makes it possible to merge them all together. The format of each data set is also changed from wide to long to accommodate future exploration, particularly time series analysis. Additional changes, such as naming conventions, were also implemented to improve readability and comprehensibility. Note, columns that include location information (address, city, zip code, etc.) were removed, as this information can be incorporated later on from the previous data set (i.e., 'Libraries - Locations, Contact Information, and Usual Hours of Operation').

We will clean local copies of the data sets as it is easier to iterate through them. However, it should be noted that the data for 2023 is incomplete and should be redownloaded at a later date, if you wish to repeat the analysis.

In [392]:
# Get list of all the CSV files within a specific directory 
def get_csv_files_in_path(main_path, subdirectory):
    # Combine the main path and subdirectory to get the full directory path
    path = os.path.join(main_path, subdirectory)

    # Expand a leading '~' so paths like '~/Raw_Data' resolve correctly
    path = os.path.expanduser(path)

    # Check if the given path exists and is a directory
    if not os.path.exists(path) or not os.path.isdir(path):
        print(f"Invalid path: {path}")
        return []

    # Get a list of all files in the directory
    file_list = os.listdir(path)

    # Filter the list to include only CSV files
    csv_files = [os.path.join(path, file) for file in file_list if file.lower().endswith('.csv')]

    return csv_files

# Example usage
main_path_to_files = "~/Raw_Data"
subdirectory_path = "Visitors_By_Location"
csv_files = get_csv_files_in_path(main_path_to_files, subdirectory_path)

print("CSV files in the specified path:")
for csv_file in csv_files:
    print(csv_file)
CSV files in the specified path:
/Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2023_Visitors_by_Location.csv
/Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2014_Visitors_by_Location.csv
/Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2018_Visitors_by_Location.csv
/Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2015_Visitors_by_Location.csv
/Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2011_Visitors_by_Location.csv
/Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2019_Visitors_by_Location.csv
/Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2013_Visitors_by_Location.csv
/Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2020_Visitors_by_Location.csv
/Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2017_Visitors_by_Location.csv
/Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2021_Visitors_by_Location.csv
/Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2016_Visitors_by_Location.csv
/Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Raw_Data/Visitors_By_Location/Libraries_2012_Visitors_by_Location.csv

Again, all these files pertain to visitor count data for each library for the years 2011 to 2023, though note that no local copy for 2022 appears in the listing above. Let's process them!

The following process_csv_files function reads and processes a list of CSV files, applies transformations to the data, adds a 'SEASON' column, saves the processed files, and returns the merged dataframe containing the processed data from all the files.

In [393]:
def process_csv_files(csv_files, new_directory):
    processed_dfs = []  # List to store processed dataframes

    # Define a function to map month number to season
    def month_to_season(month):
        if month in [3, 4, 5]:
            return 'Spring'
        elif month in [6, 7, 8]:
            return 'Summer'
        elif month in [9, 10, 11]:
            return 'Fall'
        else:
            return 'Winter'
            
    for file_path in csv_files:      
        # Read the CSV file
        df = pd.read_csv(file_path)

        # Replace missing values with 0
        df = df.fillna(0)

        # Define the data types for each column
        data_types = {
            'JANUARY': int,
            'FEBRUARY': int,
            'MARCH': int,
            'APRIL': int,
            'MAY': int,
            'JUNE': int,
            'JULY': int,
            'AUGUST': int,
            'SEPTEMBER': int,
            'OCTOBER': int,
            'NOVEMBER': int,
            'DECEMBER': int,
            'YTD': int,
        }

        # Extract the year from the file name using a regular expression
        match = re.search(r"Libraries_(\d{4})", file_path)
        if not match:
            continue  # Skip files whose names do not contain a four-digit year
        year = int(match.group(1))

        # Remove the 'LOCATION' column for data after 2019
        if year > 2019:
            df = df.drop(columns = df.columns[df.columns.str.contains(r'(?i)^.*location.*\d*$')])

        # Check if the 'ZIP' or 'ZIP CODE' column exists and drop them
        if 'ZIP' in df.columns:
            df = df.drop(columns = ['ZIP'])
        if 'ZIP CODE' in df.columns:
            df = df.drop(columns = ['ZIP CODE'])

        # Check if the 'LOCATION' or 'BRANCH' column exists
        if 'LOCATION' in df.columns:
            # Rename the 'LOCATION' column to 'NAME'
            df = df.rename(columns = {'LOCATION': 'NAME'})
        elif 'BRANCH' in df.columns:
            # Rename the 'BRANCH' column to 'NAME'
            df = df.rename(columns = {'BRANCH': 'NAME'})

        # Check if the 'ADDRESS' column exists
        if 'ADDRESS' in df.columns:
            # Remove the 'ADDRESS' column
            df = df.drop(columns = ['ADDRESS'])

        # Check if the 'CITY' column exists
        if 'CITY' in df.columns:
            # Remove the 'CITY' column
            df = df.drop(columns = ['CITY'])

        # Apply the specified data types to the DataFrame
        df = df.astype(data_types)
      
        # Change column names
        df = df.rename(columns = {'JANUARY': 'JAN', 'FEBRUARY': 'FEB', 'MARCH': 'MAR', 'APRIL': 'APR', 'MAY': 'MAY',
                                  'JUNE': 'JUN', 'JULY': 'JUL', 'AUGUST': 'AUG', 'SEPTEMBER': 'SEP', 'OCTOBER': 'OCT',
                                  'NOVEMBER': 'NOV', 'DECEMBER': 'DEC', 'YTD': 'TOTAL_VISITORS_YEAR'})
        # Add the 'Year' column
        df['YEAR'] = year

        # Add a column for PROBLEM_WITH_COUNTER
        df['PROBLEM_WITH_COUNTER'] = df['NAME'].str.contains(r'\*', regex = True)

        # Remove trailing white spaces from the 'NAME' column
        df['NAME'] = df['NAME'].str.strip()

        # Remove asterisks from the 'NAME' column
        df['NAME'] = df['NAME'].str.replace(r'\*', '', regex = True)

        # Remove rows where 'Name' is empty or equals 'Total'
        df = df[df['NAME'].notna()]
        df = df[~df['NAME'].str.lower().str.contains('total')]

        # Convert from wide to long format
        df = df.melt(id_vars = ['NAME', 'YEAR', 'TOTAL_VISITORS_YEAR', 'PROBLEM_WITH_COUNTER'],
                     var_name = 'MONTH', value_name = 'VISITOR_COUNT')

        # Add a column for month number
        df['MONTH_NUMBER'] = pd.to_datetime(df['MONTH'], format = '%b', errors = 'coerce').dt.month

        # Add the 'SEASON' column
        df['SEASON'] = df['MONTH_NUMBER'].apply(month_to_season)

        # Rearrange the columns if desired
        df = df[['NAME', 'YEAR', 'TOTAL_VISITORS_YEAR', 'PROBLEM_WITH_COUNTER', 'MONTH', 'MONTH_NUMBER', 'SEASON', 'VISITOR_COUNT']]

        # Rearrange the rows by name, year, and month number
        df = df.sort_values(by = ['NAME', 'YEAR', 'MONTH_NUMBER'])

        # Reset the index if desired
        df = df.reset_index(drop = True)

        # Create the new file name
        file_name = os.path.basename(file_path)
        new_file_name = os.path.splitext(file_name)[0] + '_long_format.csv'

        # Construct the new file path
        new_file_path = os.path.join(new_directory, new_file_name)

        # Save the DataFrame to a new CSV file
        df.to_csv(new_file_path, index = False)

        processed_dfs.append(df)  # Store the processed DataFrame

    # Merge all the processed dataframes
    merged_df = pd.concat(processed_dfs, ignore_index = True)

    # Rearrange the rows in the merged dataframe by name, year, and month number
    merged_df = merged_df.sort_values(by = ['NAME', 'YEAR', 'MONTH_NUMBER'])

    # Reset the index of the merged dataframe
    merged_df = merged_df.reset_index(drop = True)

    # Save the merged dataframe to a separate CSV file
    merged_file_path = os.path.join(new_directory, 'Visitors_by_Location_merged_data_long_format.csv')
    merged_df.to_csv(merged_file_path, index = False)
    
    print("Done processing csv files!") 
    print(f"Clean files saved to: {new_directory}.")
    
    return merged_df
In [394]:
# Example usage
new_directory = "~/Processed_Data" # Path where you want the processed files saved
merged_df = process_csv_files(csv_files, new_directory)
Done processing csv files!
Clean files saved to: /Users/Jesse/Desktop/Workspace/Tableau/Chicago_Public_Libraries/Processed_Data.

Here's a more detailed summary of what the code above does:

  1. It initializes an empty list called processed_dfs to store processed dataframes.
  2. For each CSV file in the given csv_files list:
    • It reads the CSV file into a dataframe.
    • Replaces missing values in the dataframe with 0.
    • Defines the desired data types for each column.
    • Extracts the year from the file name using a regular expression.
    • If the year is greater than 2019, it removes any columns whose names contain the word "location" (case-insensitive) from the dataframe.
    • Checks if the 'ZIP' or 'ZIP CODE' columns exist and drops them if they do.
    • Renames the 'LOCATION' or 'BRANCH' column to 'NAME' if they exist.
    • Checks if the 'ADDRESS' or 'CITY' columns exist and drops them if they do.
    • Applies the specified data types to the dataframe columns.
    • Renames the columns to abbreviated month names and other desired column names.
    • Adds a 'YEAR' column containing the extracted year.
    • Adds a 'PROBLEM_WITH_COUNTER' column indicating if the 'NAME' contains an asterisk (*).
    • Removes trailing white spaces from the 'NAME' column.
    • Removes asterisks from the 'NAME' column.
    • Removes rows where the 'NAME' is empty or equals 'Total'.
    • Converts the dataframe from wide to long format using the 'NAME', 'YEAR', 'TOTAL_VISITORS_YEAR', and 'PROBLEM_WITH_COUNTER' columns as the identifier variables.
    • Adds a 'MONTH_NUMBER' column by converting the 'MONTH' column to month numbers using the specified format.
    • Adds a 'SEASON' column by mapping the 'MONTH_NUMBER' to its corresponding season.
    • Rearranges the columns in the desired order.
    • Sorts the rows by 'NAME', 'YEAR', and 'MONTH_NUMBER'.
    • Resets the dataframe index.
    • Saves the processed dataframe to a new CSV file with a modified file name.
    • Appends the processed dataframe to the processed_dfs list.
  3. After processing all CSV files, it merges all the processed dataframes into a single dataframe called merged_df.
  4. It sorts the rows in the merged_df dataframe by 'NAME', 'YEAR', and 'MONTH_NUMBER'.
  5. Resets the index of the merged_df dataframe.
  6. Saves the merged dataframe to a separate CSV file.
  7. Prints a message indicating the processing is complete and the directory where the clean files are saved.
  8. Finally, it returns the merged dataframe.

Finally, we can also incorporate the previous data set, which includes information about library location, hours of operation, etc.:

In [398]:
# Merge the two data sets on 'NAME' column
merged_full_df = pd.merge(merged_df, location_long_format, on = 'NAME')

# Save the full merged dataframe
merged_full_df.to_csv("~/Processed_Data/Public_Library_Location_Visitor_Count_2011_2023.csv", index = False)
merged_full_df.head()
Out[398]:
NAME YEAR TOTAL_VISITORS_YEAR PROBLEM_WITH_COUNTER MONTH MONTH_NUMBER SEASON VISITOR_COUNT HOURS OF OPERATION ADDRESS CITY STATE ZIP PHONE WEBSITE LOCATION TOTAL HOURS WEEK DAY HOURS OPEN
0 Albany Park 2011 117425 False JAN 1 Winter 9604 Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... 3401 W. Foster Ave. Chicago IL 60625 (773) 539-5450 https://www.chipublib.org/locations/3/ (41.97557881655979, -87.71361314512697) 48 MON 8
1 Albany Park 2011 117425 False JAN 1 Winter 9604 Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... 3401 W. Foster Ave. Chicago IL 60625 (773) 539-5450 https://www.chipublib.org/locations/3/ (41.97557881655979, -87.71361314512697) 48 TUES 8
2 Albany Park 2011 117425 False JAN 1 Winter 9604 Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... 3401 W. Foster Ave. Chicago IL 60625 (773) 539-5450 https://www.chipublib.org/locations/3/ (41.97557881655979, -87.71361314512697) 48 WED 8
3 Albany Park 2011 117425 False JAN 1 Winter 9604 Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... 3401 W. Foster Ave. Chicago IL 60625 (773) 539-5450 https://www.chipublib.org/locations/3/ (41.97557881655979, -87.71361314512697) 48 THURS 8
4 Albany Park 2011 117425 False JAN 1 Winter 9604 Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri... 3401 W. Foster Ave. Chicago IL 60625 (773) 539-5450 https://www.chipublib.org/locations/3/ (41.97557881655979, -87.71361314512697) 48 FRI 8
In [397]:
merged_full_df.tail()
Out[397]:
NAME YEAR TOTAL_VISITORS_YEAR PROBLEM_WITH_COUNTER MONTH MONTH_NUMBER SEASON VISITOR_COUNT HOURS OF OPERATION ADDRESS CITY STATE ZIP PHONE WEBSITE LOCATION TOTAL HOURS WEEK DAY HOURS OPEN
75175 Wrightwood-Ashburn 2023 26008 False DEC 12 Winter 0 Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... 8530 S. Kedzie Ave. Chicago IL 60652 (312) 747-2696 https://www.chipublib.org/locations/82/ (41.73795704970435, -87.70221598421591) 48 WED 8
75176 Wrightwood-Ashburn 2023 26008 False DEC 12 Winter 0 Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... 8530 S. Kedzie Ave. Chicago IL 60652 (312) 747-2696 https://www.chipublib.org/locations/82/ (41.73795704970435, -87.70221598421591) 48 THURS 8
75177 Wrightwood-Ashburn 2023 26008 False DEC 12 Winter 0 Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... 8530 S. Kedzie Ave. Chicago IL 60652 (312) 747-2696 https://www.chipublib.org/locations/82/ (41.73795704970435, -87.70221598421591) 48 FRI 8
75178 Wrightwood-Ashburn 2023 26008 False DEC 12 Winter 0 Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... 8530 S. Kedzie Ave. Chicago IL 60652 (312) 747-2696 https://www.chipublib.org/locations/82/ (41.73795704970435, -87.70221598421591) 48 SAT 8
75179 Wrightwood-Ashburn 2023 26008 False DEC 12 Winter 0 Mon. & Wed., Noon-8; Tues. & Thurs., 10-6; Fri... 8530 S. Kedzie Ave. Chicago IL 60652 (312) 747-2696 https://www.chipublib.org/locations/82/ (41.73795704970435, -87.70221598421591) 48 SUN 0

To summarize, our full merged data set contains the following attributes:

  • NAME: The name of the public library.
  • YEAR: The year of the record.
  • TOTAL_VISITORS_YEAR: The total number of visitors in the year.
  • PROBLEM_WITH_COUNTER: A boolean column indicating if there was a problem with the counter.
  • MONTH: The month of the record.
  • MONTH_NUMBER: The numerical representation of the month.
  • SEASON: The season of the record.
  • VISITOR_COUNT: The count of visitors.
  • HOURS OF OPERATION: The hours of operation for each day of the week.
  • ADDRESS: The address of the public library.
  • CITY: The city where the public library is located.
  • STATE: The state where the public library is located.
  • ZIP: The ZIP code of the public library.
  • PHONE: The phone number of the public library.
  • WEBSITE: The website of the public library.
  • LOCATION: The geographical coordinates of the public library.
  • TOTAL HOURS WEEK: Total hours of operation in a week.
  • DAY: The day of the week.
  • HOURS OPEN: The opening hours of the library for the specific day.

Data Profiling¶

Data profiling involves analyzing the contents and structure of a data set to gain insights into its characteristics. Here's an example of how we can perform data profiling on the full merged dataframe to further review it:

In [449]:
def perform_data_profiling(df):
    # Display general information about the dataframe
    print("Data Profile:")
    print("-" * 50)
    print(df.info())
    print("-" * 50)
    
    # Display summary statistics
    print("Summary Statistics:")
    print("-" * 50)
    print(df.describe())
    print("-" * 50)
    
    # Display the first few rows of the dataframe
    print("Sample Data:")
    print("-" * 50)
    print(df.head())
    print("-" * 50)
    
    # Display the number of unique values in each column
    print("Unique Value Counts:")
    print("-" * 50)
    for column in df.columns:
        unique_values = df[column].nunique()
        print(f"{column}: {unique_values} unique values")
    print("-" * 50)
    
    # Display missing value information
    print("Missing Value Counts:")
    print("-" * 50)
    print(df.isnull().sum())
    print("-" * 50)
    
    # Display data types of each column
    print("Data Types:")
    print("-" * 50)
    print(df.dtypes)
    print("-" * 50)
    
    # Display correlations between numeric columns
    numeric_columns = df.select_dtypes(include = 'number').columns
    if len(numeric_columns) > 1:
        print("Correlation Matrix:")
        print("-" * 50)
        print(df[numeric_columns].corr())
        print("-" * 50)

# Perform data profiling on the merged dataframe
perform_data_profiling(merged_full_df)
Data Profile:
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 75180 entries, 0 to 75179
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   NAME                  75180 non-null  object 
 1   YEAR                  75180 non-null  int64  
 2   TOTAL_VISITORS_YEAR   75180 non-null  int64  
 3   PROBLEM_WITH_COUNTER  75180 non-null  object 
 4   MONTH                 75180 non-null  object 
 5   MONTH_NUMBER          75180 non-null  int64  
 6   SEASON                75180 non-null  object 
 7   VISITOR_COUNT         75180 non-null  int64  
 8   HOURS OF OPERATION    75180 non-null  object 
 9   ADDRESS               75180 non-null  object 
 10  CITY                  75180 non-null  object 
 11  STATE                 75180 non-null  object 
 12  ZIP                   75180 non-null  int64  
 13  PHONE                 75180 non-null  object 
 14  WEBSITE               75180 non-null  object 
 15  LOCATION              75180 non-null  object 
 16  TOTAL HOURS WEEK      75180 non-null  int64  
 17  DAY                   75180 non-null  object 
 18  HOURS OPEN            75180 non-null  int64  
 19  Z_SCORE               75180 non-null  float64
dtypes: float64(1), int64(7), object(12)
memory usage: 14.1+ MB
None
--------------------------------------------------
Summary Statistics:
--------------------------------------------------
               YEAR  TOTAL_VISITORS_YEAR  MONTH_NUMBER  VISITOR_COUNT  \
count  75180.000000         7.518000e+04  75180.000000   75180.000000   
mean    2016.737430         8.269647e+04      6.500000    6891.372439   
std        3.621391         7.017199e+04      3.452075    6257.785494   
min     2011.000000         0.000000e+00      1.000000       0.000000   
25%     2014.000000         4.366100e+04      3.750000    3520.000000   
50%     2017.000000         7.505200e+04      6.500000    6403.000000   
75%     2020.000000         1.096040e+05      9.250000    9338.250000   
max     2023.000000         1.405964e+06     12.000000  141696.000000   

                ZIP  TOTAL HOURS WEEK    HOURS OPEN       Z_SCORE  
count  75180.000000      75180.000000  75180.000000  75180.000000  
mean   60633.706145         46.248045      6.606864      0.597374  
std       28.428708          6.669656      3.061526      0.801968  
min    60605.000000          0.000000      0.000000      0.000060  
25%    60617.000000         48.000000      8.000000      0.214833  
50%    60630.000000         48.000000      8.000000      0.467321  
75%    60643.000000         48.000000      8.000000      0.857447  
max    60827.000000         48.000000     11.000000     21.542049  
--------------------------------------------------
Sample Data:
--------------------------------------------------
          NAME  YEAR  TOTAL_VISITORS_YEAR PROBLEM_WITH_COUNTER MONTH  \
0  Albany Park  2011               117425                False   JAN   
1  Albany Park  2011               117425                False   JAN   
2  Albany Park  2011               117425                False   JAN   
3  Albany Park  2011               117425                False   JAN   
4  Albany Park  2011               117425                False   JAN   

   MONTH_NUMBER  SEASON  VISITOR_COUNT  \
0             1  Winter           9604   
1             1  Winter           9604   
2             1  Winter           9604   
3             1  Winter           9604   
4             1  Winter           9604   

                                  HOURS OF OPERATION              ADDRESS  \
0  Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri...  3401 W. Foster Ave.   
1  Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri...  3401 W. Foster Ave.   
2  Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri...  3401 W. Foster Ave.   
3  Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri...  3401 W. Foster Ave.   
4  Mon. & Wed., 10-6; Tues. & Thurs., Noon-8; Fri...  3401 W. Foster Ave.   

      CITY STATE    ZIP           PHONE  \
0  Chicago    IL  60625  (773) 539-5450   
1  Chicago    IL  60625  (773) 539-5450   
2  Chicago    IL  60625  (773) 539-5450   
3  Chicago    IL  60625  (773) 539-5450   
4  Chicago    IL  60625  (773) 539-5450   

                                  WEBSITE  \
0  https://www.chipublib.org/locations/3/   
1  https://www.chipublib.org/locations/3/   
2  https://www.chipublib.org/locations/3/   
3  https://www.chipublib.org/locations/3/   
4  https://www.chipublib.org/locations/3/   

                                  LOCATION  TOTAL HOURS WEEK    DAY  \
0  (41.97557881655979, -87.71361314512697)                48    MON   
1  (41.97557881655979, -87.71361314512697)                48   TUES   
2  (41.97557881655979, -87.71361314512697)                48    WED   
3  (41.97557881655979, -87.71361314512697)                48  THURS   
4  (41.97557881655979, -87.71361314512697)                48    FRI   

   HOURS OPEN   Z_SCORE  
0           8  0.433483  
1           8  0.433483  
2           8  0.433483  
3           8  0.433483  
4           8  0.433483  
--------------------------------------------------
Unique Value Counts:
--------------------------------------------------
NAME: 80 unique values
YEAR: 12 unique values
TOTAL_VISITORS_YEAR: 885 unique values
PROBLEM_WITH_COUNTER: 2 unique values
MONTH: 12 unique values
MONTH_NUMBER: 12 unique values
SEASON: 4 unique values
VISITOR_COUNT: 6961 unique values
HOURS OF OPERATION: 6 unique values
ADDRESS: 80 unique values
CITY: 1 unique values
STATE: 1 unique values
ZIP: 49 unique values
PHONE: 80 unique values
WEBSITE: 80 unique values
LOCATION: 80 unique values
TOTAL HOURS WEEK: 4 unique values
DAY: 7 unique values
HOURS OPEN: 3 unique values
Z_SCORE: 6961 unique values
--------------------------------------------------
Missing Value Counts:
--------------------------------------------------
NAME                    0
YEAR                    0
TOTAL_VISITORS_YEAR     0
PROBLEM_WITH_COUNTER    0
MONTH                   0
MONTH_NUMBER            0
SEASON                  0
VISITOR_COUNT           0
HOURS OF OPERATION      0
ADDRESS                 0
CITY                    0
STATE                   0
ZIP                     0
PHONE                   0
WEBSITE                 0
LOCATION                0
TOTAL HOURS WEEK        0
DAY                     0
HOURS OPEN              0
Z_SCORE                 0
dtype: int64
--------------------------------------------------
Data Types:
--------------------------------------------------
NAME                     object
YEAR                      int64
TOTAL_VISITORS_YEAR       int64
PROBLEM_WITH_COUNTER     object
MONTH                    object
MONTH_NUMBER              int64
SEASON                   object
VISITOR_COUNT             int64
HOURS OF OPERATION       object
ADDRESS                  object
CITY                     object
STATE                    object
ZIP                       int64
PHONE                    object
WEBSITE                  object
LOCATION                 object
TOTAL HOURS WEEK          int64
DAY                      object
HOURS OPEN                int64
Z_SCORE                 float64
dtype: object
--------------------------------------------------
Correlation Matrix:
--------------------------------------------------
                             YEAR  TOTAL_VISITORS_YEAR  MONTH_NUMBER  \
YEAR                 1.000000e+00        -3.583990e-01  4.462930e-17   
TOTAL_VISITORS_YEAR -3.583990e-01         1.000000e+00 -7.013253e-17   
MONTH_NUMBER         4.462930e-17        -7.013253e-17  1.000000e+00   
VISITOR_COUNT       -3.349105e-01         9.344625e-01 -1.777759e-02   
ZIP                 -2.902156e-02        -9.906362e-02 -7.209972e-15   
TOTAL HOURS WEEK    -3.199835e-02         4.678836e-02  6.400582e-17   
HOURS OPEN          -9.958526e-03         1.456147e-02  3.840921e-17   
Z_SCORE              1.401029e-01         6.466537e-01  3.503484e-02   

                     VISITOR_COUNT           ZIP  TOTAL HOURS WEEK  \
YEAR                     -0.334910 -2.902156e-02     -3.199835e-02   
TOTAL_VISITORS_YEAR       0.934463 -9.906362e-02      4.678836e-02   
MONTH_NUMBER             -0.017778 -7.209972e-15      6.400582e-17   
VISITOR_COUNT             1.000000 -9.257124e-02      4.372197e-02   
ZIP                      -0.092571  1.000000e+00     -2.547387e-01   
TOTAL HOURS WEEK          0.043722 -2.547387e-01      1.000000e+00   
HOURS OPEN                0.013607 -7.927978e-02      3.112200e-01   
Z_SCORE                   0.655241 -3.663064e-02     -8.461968e-02   

                       HOURS OPEN   Z_SCORE  
YEAR                -9.958526e-03  0.140103  
TOTAL_VISITORS_YEAR  1.456147e-02  0.646654  
MONTH_NUMBER         3.840921e-17  0.035035  
VISITOR_COUNT        1.360715e-02  0.655241  
ZIP                 -7.927978e-02 -0.036631  
TOTAL HOURS WEEK     3.112200e-01 -0.084620  
HOURS OPEN           1.000000e+00 -0.026335  
Z_SCORE             -2.633533e-02  1.000000  
--------------------------------------------------

We can also perform data profiling using pandas-profiling, a Python tool for exploratory data analysis that provides a quick, comprehensive overview of a dataset. It automates the generation of statistical summaries, visualizations, and insights, saving time for data analysts and scientists, and the HTML report it produces makes it easy to document and share analysis results.

In [402]:
# Perform data profiling using pandas_profiling
profile = ProfileReport(merged_full_df, title = 'Pandas Profiling Report: Merged Dataset')
profile.to_file("visitors_by_location_data_profiling_report.html")

To open the HTML file generated by pandas-profiling in a Jupyter notebook, you can use the IPython.display module, or click on the file to open it in your browser.

In [210]:
# Specify the file path of the HTML report
html_file = "visitors_by_location_data_profiling_report.html"

# Display the HTML report in the notebook
display(HTML(filename = html_file))

# Clear the displayed output
# clear_output()

The 'Overview' section of the report confirms that the processed data contains no duplicate rows or missing values/cells, while the remaining summary statistics and plots surface information worth keeping in mind moving forward, such as the moderate negative correlation between 'YEAR' and 'VISITOR_COUNT' (about -0.33 in the correlation matrix above).

ANALYSIS¶

Seasonality in Chicago Public Library Usage¶

We can group the data by 'SEASON' and compute the average 'VISITOR_COUNT' for each season. This would tell us whether there's a significant difference in library usage across seasons. Similarly, we can examine monthly trends to capture finer temporal patterns.

In [409]:
# Set the style of the plots
sns.set_style("whitegrid")

# Define a color palette for the seasons
season_colors = {'Spring': '#9ACD32', 'Summer': '#FF0000', 'Fall': '#FFA500', 'Winter': '#87CEEB'}

# Question 1: Is there a time of the year for which they experience more visitors (e.g., summer vs winter)?
seasonal_data_avg = merged_full_df.groupby('SEASON')['VISITOR_COUNT'].mean().reset_index()

# Plot the average visitor count by season
plt.figure(figsize = (8, 6))
sns.barplot(x ='SEASON', y = 'VISITOR_COUNT', data = seasonal_data_avg, 
            order = ['Spring', 'Summer', 'Fall', 'Winter'], 
            palette = season_colors)
plt.title('Average Visitor Count by Season')
plt.ylabel('Average Number of Visitors')  
plt.show()

The bar chart represents the average visitor count for each season. As we can see, the average number of visitors is highest in the summer and lowest in the winter among Chicago Public Libraries.

In [410]:
# Define custom colors for each season (using hex codes)
colors = {
    'Spring': '#9ACD32',    # Pastel green
    'Summer': '#FF0000',    # Red
    'Fall':   '#FFA500',    # Orange 
    'Winter': '#87CEEB'     # Light blue
}
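
# NOTE (assumption): count_by_year_season is expected to have been built
# earlier in the notebook as a YEAR x SEASON table of summed visitor counts,
# for example:
# count_by_year_season = (merged_full_df
#                         .groupby(['YEAR', 'SEASON'])['VISITOR_COUNT']
#                         .sum()
#                         .unstack())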

# Plot the counts of visitors by year and season with custom colors
fig, ax = plt.subplots(figsize = (12, 8))
count_by_year_season.plot(kind = 'bar', stacked = True, ax = ax, color = [colors[col] for col in count_by_year_season.columns])
ax.set_title('Number of Visitors to Libraries Each Year by Season')
ax.set_xlabel('Year')
ax.set_ylabel('Number of Visitors')

# Remove scientific notation and set the y-axis labels to regular format
ax.yaxis.set_major_formatter('{:.0f}'.format)

legend = ax.legend(title = 'Seasons')
plt.setp(legend.get_title(), fontsize = 'medium')
plt.xticks(rotation = 45)
plt.tight_layout()
plt.show()

Looking at visitor counts by season across years highlights not only the seasonal pattern described above, but also a year-over-year decline in the number of visitors within every season.

Change in Chicago Public Library Usage Over Time¶

We can also sum 'VISITOR_COUNT' by 'YEAR' and plot the yearly totals to visualize overall changes in library usage over time.

In [406]:
# Set the style of the plots
sns.set_style("whitegrid")

# Question 2: Has there been a decline or increase in the use of public libraries?
yearly_data = merged_full_df.groupby('YEAR')['VISITOR_COUNT'].sum().reset_index()

# Include years 2011 and 2023 if missing and set their visitor count to 0
yearly_data = yearly_data.set_index('YEAR').reindex(range(2011, 2024)).fillna(0).reset_index()

# Plot the visitor count by year
plt.figure(figsize = (12, 6))
sns.lineplot(x = 'YEAR', y = 'VISITOR_COUNT', data = yearly_data, marker = 'o', color = 'b')
plt.title('Total Visitor Count by Year')
plt.ylabel('Number of Visitors')

# Format y-axis tick labels to show the count directly without scientific notation
plt.ticklabel_format(style = 'plain', axis = 'y')

# Add a vertical line for the start of the COVID-19 pandemic (assumed in January 2020)
plt.axvline(x = 2020, color = 'red', linestyle = '--', label = 'Start of COVID-19 Pandemic')
plt.legend()  # Show the legend with the label for the vertical line

plt.xticks(rotation = 45)  # Rotate x-axis labels for better readability
plt.tight_layout()         # Adjust layout to avoid cropping labels
plt.show()

The line plot above shows the total number of visitors across Chicago Public Libraries each year. There is a clear trend showing an overall decrease in the number of visitors over time, which may be due to various factors such as the rise of digital media and the internet.

Impact of COVID-19 Pandemic¶

We can compare visitor counts before and after 2020, when the COVID-19 pandemic began to significantly affect public life, by visualizing the difference with a line plot or bar chart and then confirming its significance with a statistical test.

An independent-samples t-test is appropriate here because we want to compare the means of two independent groups: visitor counts before the COVID-19 pandemic (pre-COVID) and visitor counts from 2020 onwards (during COVID). The test determines whether the difference between the average visitor counts of the two periods is statistically significant. The independence assumption is met because each observation in one group is unrelated to the other group, and the large sample sizes make the test robust to moderate departures from normality. Passing equal_var = False runs Welch's version of the test, which does not assume equal variances in the two groups and is the safer choice for this analysis.

In [414]:
# Question 3: Did the COVID-19 pandemic contribute to any more changes?

# Let's consider years before 2020 (pre-COVID) and from 2020 onwards (during COVID)
pre_covid = merged_full_df[merged_full_df['YEAR'] < 2020]['VISITOR_COUNT']
during_covid = merged_full_df[merged_full_df['YEAR'] >= 2020]['VISITOR_COUNT']

# Calculate the average visitor count
average_pre_covid = pre_covid.mean()
average_during_covid = during_covid.mean()

# Print the average visitor count
print(f'Average visitor count before COVID-19: {average_pre_covid:.0f}')
print(f'Average visitor count during COVID-19: {average_during_covid:.0f}')

# Conduct the independent samples t-test
t_stat, p_value = stats.ttest_ind(pre_covid, during_covid, equal_var = False)

# Print the results
print(f'T-test statistic: {t_stat:.2f}')
print(f'P-value: {p_value:.5f}')
Average visitor count before COVID-19: 8283
Average visitor count during COVID-19: 3092
T-test statistic: 111.10
P-value: 0.00000

The average visitor count before the COVID-19 pandemic was 8283, while during the COVID-19 pandemic it decreased to 3092. The t-test statistic of 111.10 indicates a large difference in means between the two periods, and the p-value (reported as 0.00000, i.e., below the printed precision) shows that this difference is statistically significant.

Based on this test, it appears that the COVID-19 pandemic had a significant impact on visitor counts, leading to a considerable reduction in visitors during the pandemic compared to the pre-COVID period. However, we need to account for the overall decreasing trend in library usage when assessing the impact of COVID-19. One way to do this would be to compare the observed visitor count in 2020 and 2021 with the count predicted by a regression line fitted to the pre-2020 data. If the observed count is significantly lower than the predicted count, this would suggest an additional impact of COVID-19.

In [455]:
yearly_visitors = merged_full_df.groupby('YEAR')['TOTAL_VISITORS_YEAR'].sum()
# (TOTAL_VISITORS_YEAR repeats across day/month rows, so this sum is inflated
# by a constant per-row factor; observed and predicted values share the same
# scale, so the comparison below remains internally consistent)

# Restrict the series to pre-2020 years for fitting the trend line
pre_2020_data = yearly_visitors[yearly_visitors.index < 2020]

# Fit a linear regression to the pre-2020 data
model = LinearRegression()
model.fit(pre_2020_data.index.values.reshape(-1, 1), pre_2020_data.values)

# Predict visitor count for 2020, 2021, 2022, and 2023
predicted_2020 = model.predict(np.array([[2020]]))[0]
predicted_2021 = model.predict(np.array([[2021]]))[0]
predicted_2022 = model.predict(np.array([[2022]]))[0]
predicted_2023 = model.predict(np.array([[2023]]))[0]

# Plot the observed and predicted visitor count
plt.figure(figsize = (10, 6))
sns.lineplot(x = yearly_visitors.index, y = yearly_visitors.values, label = 'Observed')
plt.plot([2020, 2021, 2022, 2023], [predicted_2020, predicted_2021, predicted_2022, predicted_2023],
         'r--', label = 'Predicted')
plt.title('Total Visitors per Year')
plt.xlabel('Year')
plt.ylabel('Total Visitors')
plt.legend()

# Disable scientific notation for the y-axis tick labels
plt.ticklabel_format(axis = 'y', style = 'plain', useMathText = False)

plt.show()

The line plot shows the total number of visitors to public libraries each year. The blue line represents the observed visitor count, while the dashed red line represents the count predicted by a regression line fitted to the data from before 2020.

We can see that the observed visitor counts for 2020 and 2021 are significantly lower than the predicted counts. This suggests that the COVID-19 pandemic likely had an additional impact on library usage beyond the overall decreasing trend.

The decrease in visitors during the pandemic could be due to various factors, such as library closures, restrictions on public gatherings, and people avoiding public places to reduce the risk of virus transmission.

Note that this analysis is quite simplified. A more robust analysis might take into account other factors that could affect library usage, such as population changes and the availability of digital resources. Additionally, the linear trend line is a very simple model of the pre-2020 trend, and a more complex model might provide a more accurate prediction.

Outlier Detection and Treatment¶

Here we will use the Z-score method to detect outliers in the 'VISITOR_COUNT' column: a point's Z-score, z = (x − μ) / σ, measures how many standard deviations it lies from the mean, and we flag points with |z| > 3 as outliers. We will then remove these outliers from the data set and rerun our statistical testing on the cleaned data.

In [418]:
# Calculate Z-scores
merged_full_df['Z_SCORE'] = np.abs(stats.zscore(merged_full_df['VISITOR_COUNT']))

# Filter out outliers (we define outliers as data points with a Z-score greater than 3)
df_filtered_no_outliers = merged_full_df[merged_full_df['Z_SCORE'] <= 3]

# Drop the 'Z_SCORE' column as we no longer need it
df_filtered_no_outliers = df_filtered_no_outliers.drop('Z_SCORE', axis = 1)

# Segmentation Analysis
# Here we will perform separate analyses for each library
# As an example, let's compare the total visitor count for each library in 2023
df_2023 = df_filtered_no_outliers[df_filtered_no_outliers['YEAR'] == 2023]
visitor_counts_2023 = df_2023.groupby('NAME')['VISITOR_COUNT'].sum().sort_values(ascending = False)

# Display the libraries with the highest and lowest visitor counts in 2023
visitor_counts_2023
Out[418]:
NAME
Sulzer Regional Library              709492
Chinatown                            518014
Lincoln Park                         417928
Merlo                                382445
Logan Square                         380919
                                      ...  
Sherman Park                          39837
Chicago Bee                           37443
Brainerd                              37079
Harold Washtington Library Center         0
Galewood-Mont Clare                       0
Name: VISITOR_COUNT, Length: 80, dtype: int64

The output above shows the total visitor count for each library in 2023, sorted in descending order. The Sulzer Regional Library had the highest visitor count, while Galewood-Mont Clare and the Harold Washington Library Center show totals of zero. In the latter case this is likely an artifact of the outlier filter: the Harold Washington Library Center's large monthly counts carry Z-scores above the cutoff, leaving only its zero-count rows.

Please note that these results are based on the filtered dataset where outliers in visitor count (defined as having a Z-score greater than 3) have been removed.

Next, let's perform a hypothesis test again to assess whether the observed decrease in visitor count after 2020 is statistically significant. We'll use a t-test for this purpose, which compares the mean visitor count before and after 2020. The null hypothesis is that the means are equal, while the alternative hypothesis is that the mean visitor count after 2020 is lower.

Before we conduct the t-test, let's check the assumptions. The t-test assumes that the data are normally distributed and that the variances of the two groups are equal. Let's check these assumptions first.

In [419]:
# Check the distribution of visitor counts before and after 2020
before_2020 = df_filtered_no_outliers[df_filtered_no_outliers['YEAR'] < 2020]['VISITOR_COUNT']
after_2020 = df_filtered_no_outliers[df_filtered_no_outliers['YEAR'] >= 2020]['VISITOR_COUNT']

# Plot histograms
plt.figure(figsize = (12, 6))
plt.hist(before_2020, bins = 30, alpha = 0.5, label = 'Before 2020')
plt.hist(after_2020, bins = 30, alpha = 0.5, label = 'After 2020')
plt.title('Distribution of Visitor Count Before and After 2020')
plt.xlabel('Visitor Count')
plt.ylabel('Frequency')
plt.legend()
plt.show()

# Perform Levene's test for equal variances
from scipy.stats import levene
stat, p = levene(before_2020, after_2020)
print(f'Levene\'s test for equal variances: p = {p}')
Levene's test for equal variances: p = 3.3231360711195104e-300

The histograms show the distribution of visitor counts before and after 2020. Both distributions are positively skewed, suggesting that the assumption of normality may not hold. However, the t-test is quite robust to violations of this assumption, especially for large sample sizes.

Levene's test for equal variances gives a very small p-value, indicating that the variances of the two groups are significantly different. This violates one of the assumptions of the t-test. In this case, we can use a version of the t-test that does not assume equal variances, known as Welch's t-test.

Let's proceed with Welch's t-test to compare the mean visitor count before and after 2020.

In [422]:
# Perform Welch's t-test
t_stat, p = ttest_ind(before_2020, after_2020, equal_var = False)

# Print the results of the Welch's t-test
print(f'Welch\'s t-test statistic: {t_stat:.2f}')
print(f'p-value: {p:.5f}')
Welch's t-test statistic: 196.75
p-value: 0.00000

Welch's t-test gives a very small p-value, which is less than 0.05. This indicates that we can reject the null hypothesis of equal means, and conclude that the mean visitor count is significantly lower after 2020 compared to before 2020. This provides statistical evidence for a decrease in library usage after 2020, supporting the visual evidence from the regression line plot.

Please note that this result should be interpreted with caution, as the t-test does not prove causality. The observed decrease in library usage could be due to the COVID-19 pandemic, but it could also be influenced by other factors.

Finally, as mentioned earlier, a more complex analysis might involve multiple regression or time-series analysis to model the relationship between visitor count and multiple variables simultaneously, and to handle temporal patterns in the data. However, these types of analysis are quite complex and beyond the scope of this discussion. They would also require additional data that is not available in the current data set.

Additional Questions¶

The following queries can be performed to gain additional insights:

  • Highest number of visitors of all time by library:
In [423]:
merged_full_df.groupby('NAME')['TOTAL_VISITORS_YEAR'].max()
Out[423]:
NAME
Albany Park                 159397
Altgeld                      63052
Archer Heights              124653
Austin                      103092
Austin-Irving               145852
                             ...  
West Pullman                121069
West Town                   146851
Whitney M. Young, Jr.       128395
Woodson Regional Library    214173
Wrightwood-Ashburn           89580
Name: TOTAL_VISITORS_YEAR, Length: 80, dtype: int64
  • Lowest number of visitors of all time by library:
In [431]:
merged_full_df.groupby('NAME')['TOTAL_VISITORS_YEAR'].min()
Out[431]:
NAME
Albany Park                   421
Altgeld                      5666
Archer Heights              19245
Austin                      19049
Austin-Irving               16357
                            ...  
West Pullman                15932
West Town                   35757
Whitney M. Young, Jr.           0
Woodson Regional Library    30324
Wrightwood-Ashburn          12274
Name: TOTAL_VISITORS_YEAR, Length: 80, dtype: int64
  • Total and average number of visitors per year by library:
In [434]:
merged_full_df.groupby(['NAME', 'YEAR'])['VISITOR_COUNT'].agg(['sum', 'mean'])
Out[434]:
                               sum          mean
NAME               YEAR
Albany Park        2011     821975   9785.416667
                   2012     549395   6540.416667
                   2013       2947     35.083333
                   2014     371812   4426.333333
                   2015    1076257  12812.583333
...                 ...        ...           ...
Wrightwood-Ashburn 2018     368270   4384.166667
                   2019     331471   3946.083333
                   2020     149366   1778.166667
                   2021      85918   1022.833333
                   2023     182056   2167.333333

895 rows × 2 columns

  • Total and average number of visitors per month by library:
In [435]:
merged_full_df.groupby(['NAME', 'MONTH'])['VISITOR_COUNT'].agg(['sum', 'mean'])
Out[435]:
                               sum         mean
NAME               MONTH
Albany Park        APR      637819  7593.083333
                   AUG      676844  8057.666667
                   DEC      630105  7501.250000
                   FEB      642061  7643.583333
                   JAN      673918  8022.833333
...                 ...         ...          ...
Wrightwood-Ashburn MAR      406714  4841.833333
                   MAY      348362  4147.166667
                   NOV      397754  4735.166667
                   OCT      439789  5235.583333
                   SEP      409878  4879.500000

960 rows × 2 columns

  • Total and average number of visitors per season by library:
In [436]:
merged_full_df.groupby(['NAME', 'SEASON'])['VISITOR_COUNT'].agg(['sum', 'mean'])
Out[436]:
                                        sum         mean
NAME                     SEASON
Albany Park              Fall       2150246  8532.722222
                         Spring     1944299  7715.472222
                         Summer     2027312  8044.888889
                         Winter     1946084  7722.555556
Altgeld                  Fall        875259  3473.250000
...                         ...         ...          ...
Woodson Regional Library Winter      770105  9167.916667
Wrightwood-Ashburn       Fall       1247421  4950.083333
                         Spring     1153495  4577.361111
                         Summer     1257305  4989.305556
                         Winter     1150429  4565.194444

320 rows × 2 columns

Each of these queries returns a DataFrame indexed by library name (and, where applicable, by year, month, or season), with the aggregated visitor counts as columns. The groupby method groups the data by the specified columns, and the agg method applies one or more aggregating functions to each group. These dataframes could provide a great starting point for future research; see the sketch below for one way to query them further.
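For example, the season-level aggregation above can be flattened with reset_index and queried directly. The following minimal sketch (assuming merged_full_df is available, as above) finds each library's busiest season by average visitor count:

In [ ]:
# Flatten the grouped result into a tidy DataFrame
season_stats = (merged_full_df
                .groupby(['NAME', 'SEASON'])['VISITOR_COUNT']
                .agg(['sum', 'mean'])
                .reset_index())

# For each library, keep the row (season) with the highest mean visitor count
busiest_season = season_stats.loc[season_stats.groupby('NAME')['mean'].idxmax()]

print(busiest_season[['NAME', 'SEASON', 'mean']].head())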

  • Libraries with the highest attendance and lowest attendance:
In [460]:
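# NOTE (assumption): TOTAL_VISITORS_YEAR repeats on every day/month row for a
# given library and year, so these sums count each yearly total many times.
# The resulting figures are useful for ranking libraries against one another,
# not as raw attendance numbers.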
# Find the library with the highest attendance
highest_attendance_library = merged_full_df.groupby('NAME')['TOTAL_VISITORS_YEAR'].sum().idxmax()
highest_attendance_count = merged_full_df.groupby('NAME')['TOTAL_VISITORS_YEAR'].sum().max()

# Find the library with the lowest attendance
lowest_attendance_library = merged_full_df.groupby('NAME')['TOTAL_VISITORS_YEAR'].sum().idxmin()
lowest_attendance_count = merged_full_df.groupby('NAME')['TOTAL_VISITORS_YEAR'].sum().min()

print("Library with highest attendance:", highest_attendance_library, "with attendance count of", highest_attendance_count)
print("Library with lowest attendance:", lowest_attendance_library, "with attendance count of", lowest_attendance_count)
Library with highest attendance: Harold Washtington Library Center with attendance count of 225055320
Library with lowest attendance: Legler Regional with attendance count of 8639904

Conclusion¶

This Jupyter notebook carries out an extensive analysis of the public libraries in the city of Chicago. The analysis begins with an exploration of the libraries' geographical distribution across the city, using the Python library folium to generate interactive maps. The data used in the analysis includes the libraries' names, hours of operation, addresses, contact information, and geographical locations.

One of the key insights drawn from the spatial analysis is the distribution of libraries across the city. A heatmap of library locations provides a visual representation of areas with a high density of libraries versus areas with a low density. The analysis also explores the nearest-neighbor distance for each library, which gives an idea of how evenly spaced the libraries are. The notebook then analyzes the number of monthly visitors to these libraries, through both visualization and statistical testing, as well as changes in attendance over time (2011-2023) driven by major events such as the COVID-19 pandemic.

In conclusion, this project is a comprehensive exploration of the public libraries in Chicago, focusing on their spatial distribution and their usage patterns. The findings could help in strategic decision-making related to the placement of new libraries, resource allocation, and understanding the impact of factors like the COVID-19 pandemic on library usage.

Future Work¶

For a more robust analysis, we can consider incorporating additional data such as population and temperature.

  1. Population Data: The number of visitors to a library is likely to be influenced by the population of the area it serves. For instance, if the population of a city decreases, we might expect the number of library visitors to decrease as well. To account for this, we could incorporate population data into our analysis. We can obtain this data from the U.S. Census Bureau, which provides population estimates for cities and towns. Once we have this data, we could adjust the visitor counts by the population to get a per capita visitor count, which might provide a more accurate measure of library usage.

  2. Temperature Data: The weather could also influence library usage. For example, people might be more likely to visit a library if it's too hot or too cold outside. To account for this, we could incorporate historical temperature data into our analysis. This data can be obtained from various sources, such as the National Centers for Environmental Information (NCEI). We could then examine the correlation between temperature and library usage, and include temperature as a factor in our regression model.

Incorporating these additional data would involve several steps:

  • Data Collection: We would first need to collect the necessary data. For population data, we could use the U.S. Census Bureau's API to download population estimates for the relevant years and locations. For temperature data, we would need to find a source that provides historical temperature data for the locations we're interested in.

  • Data Cleaning and Preprocessing: The new data would likely require some cleaning and preprocessing. For example, we might need to handle missing values, convert data types, or merge multiple data sources.

  • Exploratory Data Analysis: Before incorporating the new data into our model, we would perform some exploratory data analysis to understand its characteristics and its relationship with library usage.

  • Modeling: We would then incorporate the new data into our regression model. This could be as simple as adding new variables to the model (see the sketch below), or it might involve more complex procedures such as feature engineering or model selection.

  • Evaluation and Interpretation: After fitting our model, we would evaluate its performance and interpret the results. This could involve calculating error metrics, checking the model's assumptions, and interpreting the coefficients.

Remember that the availability of data can vary, and the specifics of the analysis would depend on the data that we are able to obtain.
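
As a concrete illustration, the sketch below shows what such a regression might look like once the additional data has been collected. It is a minimal sketch only: the POPULATION and AVG_TEMP columns are hypothetical placeholders (with made-up numbers) standing in for the Census and NCEI data we would still need to obtain and merge.

In [ ]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical yearly frame (placeholder numbers, NOT real figures):
# TOTAL_VISITORS could come from merged_full_df.groupby('YEAR')['VISITOR_COUNT'].sum(),
# POPULATION from U.S. Census Bureau estimates, AVG_TEMP from NCEI records.
years = np.arange(2011, 2020)
yearly_df = pd.DataFrame({
    'YEAR': years,
    'TOTAL_VISITORS': np.linspace(9.5e6, 7.0e6, len(years)),   # placeholder
    'POPULATION': np.linspace(2.70e6, 2.67e6, len(years)),     # placeholder
    'AVG_TEMP': 50 + 2 * np.sin(years),                        # placeholder
})

# Per capita usage adjusts for population changes (point 1 above)
yearly_df['VISITORS_PER_CAPITA'] = yearly_df['TOTAL_VISITORS'] / yearly_df['POPULATION']

# Multiple regression of visitor totals on population and temperature (point 2)
X = sm.add_constant(yearly_df[['POPULATION', 'AVG_TEMP']])
model = sm.OLS(yearly_df['TOTAL_VISITORS'], X).fit()
print(model.summary())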

In [ ]: