Week 4 โ€” Wrangling with pandas

Learning how to clean, transform, and prepare data for effective visualization.


๐Ÿ“– Background & Motivation

Data visualization is only as good as the data behind it. Most real-world datasets are messy, requiring cleaning and transformation before they can be meaningfully visualized. Pandas is the cornerstone Python library for data wrangling, providing flexible tools to reshape, aggregate, and join datasets. This week introduces practical techniques for preparing data that will allow you to build more accurate and insightful visualizations.


๐Ÿ”Ž Learning Objectives

  • Select, filter, group, summarize, and reshape data in pandas.
  • Work with datetime and categorical data types.
  • Join and merge datasets for richer analysis.
  • Prepare clean datasets ready for visualization.

๐Ÿ“š Readings & Resources

Sample Data Sources:

  • NYC Taxi trips sample (NYC Open Data)
  • COVID-19 data (Johns Hopkins University)

๐Ÿ› ๏ธ Setup Checklist

Ensure you can run:

pip install pandas matplotlib seaborn

๐Ÿงญ Lecture Outline

Session 1 (75 min โ€” Theory Focus)

  1. Introduction to pandas DataFrames & Series (15 min)
  2. Selection and filtering operations (20 min)
  3. GroupBy and aggregation (20 min)
  4. Intro to joining and merging: concepts and syntaxย (15 min)
  5. Hands-on with NYC Taxi dataset (20 min). Download the files from here

Session 2 (75 min - Hands-On)

  1. Recap & troubleshooting (10 min)
  2. Reshaping data: pivot, melt, stack/unstack (25 min)
  3. Working with datetime and categorical variables (20 min)
  4. Merging and joining multiple datasets (20 min)
  5. Download the Jupyter Notebook and the additional population table for the session.
  6. Download the complete Jupyter Notebook. Remember to have other files ready for the notebook.

๐Ÿ’ป Starter Notebook Snippets

import pandas as pd

# Load dataset
trips = pd.read_csv("nyc_taxi_sample.csv")

# Filter rows
filtered = trips[trips["passenger_count"] > 2]

# Group and aggregate
agg = trips.groupby("passenger_count")["fare_amount"].mean()

# Reshape with melt
df_wide = pd.DataFrame({
    'id': [1, 2],
    'math': [90, 80],
    'english': [85, 78]
})
df_tidy = df_wide.melt(id_vars='id', var_name='subject', value_name='score')

# Parse dates
trips['pickup_datetime'] = pd.to_datetime(trips['pickup_datetime'])

๐Ÿงช In-Class Activity

  • Use the COVID-19 dataset to:
  • Filter by one country and visualize new cases over time.
  • Group data by month and compare monthly averages.
  • Reshape the dataset to tidy form and create a plot.

๐Ÿ  Homework (Due next Thursday, Sept 25)

  1. Select a dataset with at least 5 columns.
  2. Perform the following in a notebook:
  3. Clean and filter the dataset.
  4. Apply at least 2 groupby operations with aggregations.
  5. Reshape the dataset at least once (pivot, melt, etc.).
  6. Create 3 visualizations from the cleaned dataset.
  7. Submit .ipynb and .html.

Rubric (10 pts)

  • Correct wrangling operations applied (4)
  • Creativity in reshaping & grouping (2)
  • Clear and readable plots (2)
  • Reproducibility (2)

๐Ÿงฉ Optional Extensions

  • Merge two datasets to add richer context.
  • Explore pandas string operations to clean messy text.
  • Create a time series plot with rolling averages.

โœ… Submission Checklist

Before submitting, make sure:

  • Your assignment has fulfilled all the basic requirements listed above.

  • The operations and visualizations in the Jupyter notebook are correct.

  • Use Quarto to render the notebook and ensure the content is displayed well.