Week 1: Getting Started
A Quarto/Panel-style module page for DATS 2102, implemented in Python/Jupyter.
Background & Motivation
Data visualization is the bridge between raw data and human understanding. In data science, visualizations are not just decorative; they are powerful analytical tools that help reveal patterns, outliers, and trends that might remain hidden in tables or statistical summaries.
Well-designed visualizations can:
- Tell compelling, evidence-based stories that influence decision-making.
- Make complex concepts easier to grasp for diverse audiences.
- Identify and expose errors or inconsistencies in data during the exploratory stage.
- Enable collaboration between technical and non-technical stakeholders.
Applications span many domains:
- Public health: Tracking disease spread with interactive dashboards.
- Climate science: Mapping temperature anomalies over decades.
- Business analytics: Visualizing customer behavior or sales performance.
- Machine learning: Understanding model performance through ROC curves, feature importance charts, or clustering visualizations.
As data science projects grow in size and complexity, the ability to craft clear, truthful, and impactful visuals becomes as important as building the models themselves.
Learning Objectives
- Set up a reliable Python environment for data visualization.
- Navigate Jupyter Notebook/Lab and basic notebook hygiene (headings, code vs. markdown, restart & run all).
- Load and inspect tabular data with pandas.
- Produce first charts with matplotlib and seaborn.
Readings & Resources
Sample Data Sources for Practice:
- Seaborn Built-in Datasets
- Kaggle Datasets
- FiveThirtyEight Data
- Our World in Data
- Open Data DC
- UCI Machine Learning Repository
- data.gov
- GeoPandas Sample Datasets
- Social Explorer
Setup Checklist
- Install Anaconda or Miniconda.
- Create/activate environment:
  conda create -n dataviz python=3.12 -y
  conda activate dataviz
- Install libraries (CPU-friendly baseline):
  pip install pandas numpy matplotlib seaborn plotly altair geopandas
- Launch JupyterLab:
  jupyter lab
- (Optional) IDEs you can use: VS Code, PyCharm, Sublime Text; or run in Google Colab.
Troubleshooting
- If geopandas fails on Windows, try conda install -c conda-forge geopandas.
- If Jupyter can't see the env, run:
  python -m ipykernel install --user --name dataviz --display-name "Python (dataviz)"
Lecture Outline
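Once the environment is active, a quick way to confirm the installs is to ask Python which packages it can find. The sketch below is one possible check, not part of the official setup: the helper name `check_packages` is made up for illustration, and the package list simply mirrors the pip command in the checklist.

```python
from importlib import metadata

# Package names taken from the pip install line in the setup checklist.
PACKAGES = ["pandas", "numpy", "matplotlib", "seaborn", "plotly", "altair", "geopandas"]


def check_packages(names):
    """Return a dict mapping each package name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


report = check_packages(PACKAGES)
for name, version in report.items():
    print(f"{name:<12} {version or 'NOT INSTALLED'}")
```

Run it in a notebook cell that uses the dataviz kernel; any line reading NOT INSTALLED points to a package to reinstall.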
Session 1 (75 minutes)
- Course overview & syllabus tour (15 min)
- Why visualization in data science? (truthfulness, clarity, audience) (15 min)
- Environment setup: conda + Jupyter walkthrough, troubleshooting (30 min)
- First dataset in pandas: load CSV → DataFrame → quick EDA (15 min)
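The load CSV → DataFrame → quick EDA flow from the last item can be sketched as follows. To stay self-contained, the sketch reads CSV text from an in-memory string rather than a file. The rows are made-up placeholder values, not course data; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Made-up placeholder rows (NOT a course dataset); one score is left blank on purpose.
csv_text = """name,group,score
A,x,10
B,x,
C,y,30
D,y,40
"""

# Load: read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Quick EDA: size, first rows, column types, summary statistics, missing values.
print(df.shape)
print(df.head())
df.info()
print(df.describe(include="all"))
missing = df.isna().sum()
print(missing)
```

The blank score becomes NaN, which is why the missing-value count is worth checking before any plotting.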
Session 2 (75 minutes)
- Recap + Q&A on environment setup (10 min)
- Notebook workflow: cells, markdown, restart & run all, saving (20 min)
- Basic plotting: matplotlib bar/line; seaborn scatter/histogram (30 min)
- Guided practice with the penguins dataset: scatterplot, pairplot activity (15 min)
- Sample data 1 (customers_1000.csv); Sample data 2 (life_journey_data.csv); Sample data 3 (unemployment-x)
- See the detailed instructions in the notebook and download week1_session2.ipynb.
Starter Notebook Snippets
Load a tiny dataset (download the tab-separated file (tsv) version)
import pandas as pd
cities = pd.DataFrame({
"city": ["DC", "NY", "LA", "Chicago", "Houston"],
"population": [712_816, 8_336_817, 3_898_747, 2_746_388, 2_304_580]
})
cities.head()
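If you work from the tab-separated (TSV) download instead of typing the values in, `pd.read_csv` can read it once you set `sep="\t"`. The sketch below writes the same five cities to TSV text in memory and reads them back, so it runs without the file. The real file's name and exact columns are not shown on this page, so treat the layout here as an assumption.

```python
import io

import pandas as pd

cities = pd.DataFrame({
    "city": ["DC", "NY", "LA", "Chicago", "Houston"],
    "population": [712_816, 8_336_817, 3_898_747, 2_746_388, 2_304_580],
})

# Serialize to tab-separated text (index=False leaves out the row index).
tsv_text = cities.to_csv(sep="\t", index=False)

# Read it back; with a downloaded file, pass its path instead of the StringIO object.
cities_from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(cities_from_tsv)
```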
First charts (matplotlib → seaborn)
import matplotlib.pyplot as plt
import seaborn as sns
# Matplotlib bar chart
plt.bar(cities["city"], cities["population"])
plt.title("Population by City")
plt.xlabel("City"); plt.ylabel("Population")
plt.show()
# Seaborn bar chart
sns.barplot(data=cities, x="city", y="population")
plt.title("Population by City (Seaborn)")
plt.show()
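The same bar chart can also be built with matplotlib's object-oriented interface, which gives direct access to the axes for formatting. This is a sketch of one reasonable variant, not the required approach: sorting the bars and formatting the tick labels with thousands separators are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import FuncFormatter

cities = pd.DataFrame({
    "city": ["DC", "NY", "LA", "Chicago", "Houston"],
    "population": [712_816, 8_336_817, 3_898_747, 2_746_388, 2_304_580],
})

# Sort so the bars read from largest to smallest.
ordered = cities.sort_values("population", ascending=False)

# Create the figure and axes explicitly instead of relying on pyplot's implicit state.
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(ordered["city"], ordered["population"])
ax.set(title="Population by City", xlabel="City", ylabel="Population")

# Label y-axis ticks as 1,000,000 rather than in scientific notation.
ax.yaxis.set_major_formatter(FuncFormatter(lambda value, _pos: f"{value:,.0f}"))

fig.tight_layout()
plt.show()
```

Holding on to `fig` and `ax` pays off later, when a figure contains several subplots that each need their own labels.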
Quick EDA helpers
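# Summary statistics for every column (numeric and non-numeric).
# Wrapped in print() so it still shows when it is not the last line of a cell.
print(cities.describe(include="all"))
# Count of missing values in each column
print("Missing values by column:\n", cities.isna().sum())
In-Class Activity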
Using seaborn.load_dataset("penguins"):
- Make a scatterplot of flipper_length_mm vs body_mass_g colored by species.
- Add axis labels, a title, and a legend with a better title.
- Try a seaborn.pairplot to see relationships across multiple variables.
Hints
import seaborn as sns
# Load the penguins dataset and drop rows with missing values
penguins = sns.load_dataset("penguins").dropna()
ax = sns.scatterplot(data=penguins, x="flipper_length_mm", y="body_mass_g", hue="species")
ax.set(title="Penguins: Flipper vs Body Mass", xlabel="Flipper length (mm)", ylabel="Body mass (g)")
Homework (Due before Week 2)
- Set up your environment and confirm you can open/run notebooks.
- Import a CSV of your choice and submit one notebook that includes:
- A short markdown description of the dataset (source, what, who, when).
- Top 5 rows, .info(), and .describe().
- One bar chart or histogram, and one scatter plot.
- A brief paragraph reflecting on one insight + one limitation of the data.
- Export the notebook to HTML (File → Save and Export Notebook As) and upload both the .ipynb and .html files.
Rubric (10 pts)
- Reproducible environment & clean notebook structure (2)
- Correct loading/inspection & basic EDA (3)
- Two charts with sensible labels/titles (3)
- Insight + limitation reflection (2)
Optional Extensions
- Try the same chart in both matplotlib and seaborn; note the pros/cons you observe.
- Install altair (a declarative statistical visualization library for Python, built on top of Vega-Lite, useful for creating interactive charts with minimal code) and create the same scatterplot with tooltips.
- If you're comfortable with maps, test your geopandas install (geopandas.datasets.get_path('naturalearth_lowres')). Note that the geopandas.datasets module was removed in GeoPandas 1.0, so this call works only on older versions; on a current install, load a sample file through the separate geodatasets package instead.
Submission Checklist
This section lists everything you should verify before submitting your work for Week 1.