⬅️ Back to Projects

Air Quality & Pollen Forecasting (Boston)

This project is an end-to-end data science and machine learning system designed to forecast daily Air Quality Index (AQI) values and pollen levels for the Boston area. The goal is to support public-health–oriented decision-making for individuals with asthma, allergies, or other respiratory sensitivities by translating complex environmental data into actionable daily insights. Rather than treating this as a purely predictive task, the project emphasizes nonlinear environmental behavior, time-series rigor, and interpretability, acknowledging the stochastic and regime-based nature of biological and atmospheric systems.

Tech Stack: Python, Pandas, NumPy, Scikit-learn, LightGBM, XGBoost, Matplotlib, Streamlit, Makefile

Key Features:

Multi-Source Data Integration: Combined independently sourced datasets from Open-Meteo (weather), EPA AQS (pollutants and AQI), and EPHT (pollen), standardizing timestamps, units, and variable naming across more than a decade of daily observations.
Time-Series Feature Engineering: Constructed lagged and rolling features (1–365 days), cyclical encodings (day-of-year, month, weekday), and interaction terms while strictly enforcing chronological splits to prevent data leakage.
Nonlinear Environmental Modeling: Developed and evaluated multiple tree-based models (Random Forest, XGBoost, LightGBM) to capture nonlinear seasonality and regime-based behavior. Achieved R² up to 0.43 on held-out pollen test data using log-scaled targets.
Spike Detection & Hazard Modeling: Implemented a two-stage modeling approach (classifier → regressor) to distinguish pollen spike days from baseline conditions, achieving AUC = 0.98 for spike classification and enabling threshold-based health alerts.
Clustering & Regime Analysis: Applied K-Means clustering to weather, AQI, and pollen features to uncover distinct environmental regimes (e.g., high-ozone summer days, spring pollen release conditions), validating the need for nonlinear modeling.
Visualization & Deployment: Built a Streamlit dashboard to visualize seasonal trends, model predictions, and spike behavior. The entire pipeline is reproducible via Makefile, enabling one-command setup and execution.
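As a rough illustration of the leakage-safe feature engineering described above, here is a minimal sketch in plain Python. It is not the project's actual code (which uses Pandas); the function names, lag choices, window size, and synthetic data are all assumptions made for the example. It shows the three ideas together: cyclical day-of-year encoding, lagged and rolling features built only from earlier days, and a chronological train/test split.

```python
import math
from datetime import date, timedelta


def cyclical_day_of_year(d):
    """Encode day-of-year as (sin, cos) so Dec 31 and Jan 1 land close together."""
    angle = 2 * math.pi * d.timetuple().tm_yday / 365.25
    return math.sin(angle), math.cos(angle)


def build_rows(dates, values, lags=(1, 7), window=7):
    """Build one feature row per day using only values from strictly earlier days.

    `lags` and `window` are illustrative choices, not the project's settings.
    """
    rows = []
    start = max(max(lags), window)  # skip days without a full history
    for i in range(start, len(values)):
        sin_doy, cos_doy = cyclical_day_of_year(dates[i])
        row = {
            "date": dates[i],
            "target": values[i],
            "sin_doy": sin_doy,
            "cos_doy": cos_doy,
        }
        for lag in lags:
            row[f"lag_{lag}"] = values[i - lag]
        # Rolling mean over the previous `window` days, excluding day i itself.
        row[f"roll_mean_{window}"] = sum(values[i - window:i]) / window
        rows.append(row)
    return rows


def chronological_split(rows, test_fraction=0.2):
    """Hold out the most recent rows for testing; never shuffle a time series."""
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


# Synthetic daily series standing in for a pollen or AQI column (assumption).
dates = [date(2023, 1, 1) + timedelta(days=i) for i in range(100)]
values = [float(i) for i in range(100)]

rows = build_rows(dates, values)
train, test = chronological_split(rows)
```

Because every lag and rolling statistic looks strictly backward, and every training date precedes every test date, no information from the evaluation period reaches the model during fitting.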

Soft Skills:

Collaborative Modeling & Data Integrity: Worked closely with a multidisciplinary team to audit feature engineering pipelines, identify subtle sources of time-series data leakage, and collectively enforce strict chronological forecasting constraints across all models.
Model Evaluation & Technical Decision-Making: Collaborated with teammates to reassess overly optimistic performance targets, align evaluation metrics with domain realities, and prioritize health-relevant outcomes (e.g., spike detection and AQI category accuracy) over raw R² maximization.
Technical Communication & Team Alignment: Co-authored a comprehensive technical report and coordinated recorded walkthroughs, translating complex modeling choices, limitations, and public-health implications for both technical and non-technical audiences.
Reproducibility & Shared Ownership: Designed and maintained a reproducible project workflow (Makefile-driven setup, data ingestion, modeling, and visualization) to ensure consistent execution and smooth collaboration across the team.

GitHub
