Week 4 โ Wrangling with pandas
Learning how to clean, transform, and prepare data for effective visualization.
๐ Background & Motivation
Data visualization is only as good as the data behind it. Most real-world datasets are messy, requiring cleaning and transformation before they can be meaningfully visualized. Pandas is the cornerstone Python library for data wrangling, providing flexible tools to reshape, aggregate, and join datasets. This week introduces practical techniques for preparing data that will allow you to build more accurate and insightful visualizations.
๐ Learning Objectives
- Select, filter, group, summarize, and reshape data in pandas.
- Work with datetime and categorical data types.
- Join and merge datasets for richer analysis.
- Prepare clean datasets ready for visualization.
๐ Readings & Resources
Sample Data Sources:
- NYC Taxi trips sample (NYC Open Data)
- COVID-19 data (Johns Hopkins University)
๐ ๏ธ Setup Checklist
Ensure you can run:
pip install pandas matplotlib seaborn
๐งญ Lecture Outline
Session 1 (75 min โ Theory Focus)
- Introduction to pandas DataFrames & Series (15 min)
- Selection and filtering operations (20 min)
- GroupBy and aggregation (20 min)
- Intro to joining and merging: concepts and syntaxย (15 min)
- Hands-on with NYC Taxi dataset (20 min). Download the files from here
Session 2 (75 min - Hands-On)
- Recap & troubleshooting (10 min)
- Reshaping data: pivot, melt, stack/unstack (25 min)
- Working with datetime and categorical variables (20 min)
- Merging and joining multiple datasets (20 min)
- Download the Jupyter Notebook and the additional population table for the session.
- Download the complete Jupyter Notebook. Remember to have other files ready for the notebook.
๐ป Starter Notebook Snippets
import pandas as pd
# Load dataset
trips = pd.read_csv("nyc_taxi_sample.csv")
# Filter rows
filtered = trips[trips["passenger_count"] > 2]
# Group and aggregate
agg = trips.groupby("passenger_count")["fare_amount"].mean()
# Reshape with melt
df_wide = pd.DataFrame({
'id': [1, 2],
'math': [90, 80],
'english': [85, 78]
})
df_tidy = df_wide.melt(id_vars='id', var_name='subject', value_name='score')
# Parse dates
trips['pickup_datetime'] = pd.to_datetime(trips['pickup_datetime'])
๐งช In-Class Activity
- Use the COVID-19 dataset to:
- Filter by one country and visualize new cases over time.
- Group data by month and compare monthly averages.
- Reshape the dataset to tidy form and create a plot.
๐ Homework (Due next Thursday, Sept 25)
- Select a dataset with at least 5 columns.
- Perform the following in a notebook:
- Clean and filter the dataset.
- Apply at least 2 groupby operations with aggregations.
- Reshape the dataset at least once (pivot, melt, etc.).
- Create 3 visualizations from the cleaned dataset.
- Submit
.ipynband.html.
Rubric (10 pts)
- Correct wrangling operations applied (4)
- Creativity in reshaping & grouping (2)
- Clear and readable plots (2)
- Reproducibility (2)
๐งฉ Optional Extensions
- Merge two datasets to add richer context.
- Explore pandas string operations to clean messy text.
- Create a time series plot with rolling averages.
โ Submission Checklist
Before submitting, make sure:
-
Your assignment has fulfilled all the basic requirements listed above.
-
The operations and visualizations in the Jupyter notebook are correct.
- Use Quarto to render the notebook and ensure the content is displayed well.