Air Quality & Pollen Forecasting (Boston)
This project is an end-to-end data science and machine learning system designed to forecast daily Air Quality Index (AQI) values and pollen levels for the Boston area. The goal is to support public-health–oriented decision-making for individuals with asthma, allergies, or other respiratory sensitivities by translating complex environmental data into actionable daily insights. Rather than treating this as a purely predictive task, the project emphasizes nonlinear environmental behavior, time-series rigor, and interpretability, acknowledging the stochastic and regime-based nature of biological and atmospheric systems.
Tech Stack: Python, Pandas, NumPy, Scikit-learn, LightGBM, XGBoost, Matplotlib, Streamlit, Makefile
Key Features:
- Multi-Source Data Integration: Combined independently sourced datasets from Open-Meteo (weather), EPA AQS (pollutants and AQI), and EPHT (pollen), standardizing timestamps, units, and variable naming across more than a decade of daily observations
- Time-Series Feature Engineering: Constructed lagged and rolling features (1–365 days), cyclical encodings (day-of-year, month, weekday), and interaction terms while strictly enforcing chronological splits to prevent data leakage
- Nonlinear Environmental Modeling: Developed and evaluated multiple tree-based models (Random Forest, XGBoost, LightGBM) to capture nonlinear seasonality and regime-based behavior. Achieved R² up to 0.43 on held-out pollen test data using log-scaled targets
- Spike Detection & Hazard Modeling: Implemented a two-stage modeling approach (classifier → regressor) to distinguish pollen spike days from baseline conditions, achieving AUC = 0.98 for spike classification and enabling threshold-based health alerts
- Clustering & Regime Analysis: Applied K-Means clustering to weather, AQI, and pollen features to uncover distinct environmental regimes (e.g., high-ozone summer days, spring pollen release conditions), validating the need for nonlinear modeling
- Visualization & Deployment: Built a Streamlit dashboard to visualize seasonal trends, model predictions, and spike behavior. The entire pipeline is reproducible via a Makefile, enabling one-command setup and execution
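As an illustration of the leakage-safe feature engineering described above, here is a minimal sketch, not the project's actual code. The function names (`add_time_features`, `chronological_split`), the `pollen` column, the lag and window sizes, and the synthetic data are all assumptions made for the example. It shows lagged and rolling features built only from past values, a cyclical day-of-year encoding, and a split that keeps every test row later than every training row.

```python
import numpy as np
import pandas as pd


def add_time_features(df, target, lags=(1, 7), windows=(7,)):
    """Add lagged, rolling, and cyclical features that use only past values.

    Hypothetical helper: `df` is assumed to have a daily DatetimeIndex.
    """
    out = df.sort_index().copy()

    # Lag features: the value observed `lag` days earlier.
    for lag in lags:
        out[f"{target}_lag{lag}"] = out[target].shift(lag)

    # Rolling means: shift by one day first so each window ends at
    # yesterday and never includes the value being predicted.
    for w in windows:
        out[f"{target}_roll{w}"] = out[target].shift(1).rolling(w).mean()

    # Cyclical encoding: map day-of-year onto a circle so that
    # December 31 and January 1 end up close together.
    doy = out.index.dayofyear.to_numpy()
    out["doy_sin"] = np.sin(2 * np.pi * doy / 365.25)
    out["doy_cos"] = np.cos(2 * np.pi * doy / 365.25)

    # Early rows lack enough history for the lags and windows; drop them.
    return out.dropna()


def chronological_split(df, test_fraction=0.2):
    """Split without shuffling: the most recent rows form the test set."""
    cut = int(len(df) * (1 - test_fraction))
    return df.iloc[:cut], df.iloc[cut:]


# Synthetic daily series standing in for real pollen counts.
dates = pd.date_range("2020-01-01", periods=100, freq="D")
demo = pd.DataFrame({"pollen": np.arange(100, dtype=float)}, index=dates)

features = add_time_features(demo, "pollen")
train, test = chronological_split(features)
```

Shifting before taking the rolling mean is the detail that matters most here: without it, each window would include the current day's target and leak it into the features.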
Soft Skills:
- Collaborative Modeling & Data Integrity: Worked closely with a multidisciplinary team to audit feature engineering pipelines, identify subtle sources of time-series data leakage, and collectively enforce strict chronological forecasting constraints across all models
- Model Evaluation & Technical Decision-Making: Collaborated with teammates to reassess overly optimistic performance targets, align evaluation metrics with domain realities, and prioritize health-relevant outcomes (e.g., spike detection and AQI category accuracy) over raw R² maximization
- Technical Communication & Team Alignment: Co-authored a comprehensive technical report and coordinated recorded walkthroughs, translating complex modeling choices, limitations, and public-health implications for both technical and non-technical audiences
- Reproducibility & Shared Ownership: Designed and maintained a reproducible project workflow (Makefile-driven setup, data ingestion, modeling, and visualization) to ensure consistent execution and smooth collaboration across the team