My personal data processing, analysis, and reporting stack includes the following:
- Apache Airflow and Python for ETL task automation.
- MySQL as the primary data store.
- Verticals, a machine-learning-based anomaly detection system for time series data that I developed.
- Jupyter Lab for generating static HTML reports.
- Integration with the Google Sheets and Drive APIs as an alternative, convenient data storage option.
- Google Data Studio for interactive dashboards.
I've developed a handful of resources on the COVID-19 pandemic:
The Pandemic Overview report focuses on the progression of infections and deaths worldwide and in selected epicenters.
An automated Airflow process extracts Johns Hopkins data from GitHub each day, transforms it into row-per-day records, and loads them into my database. It then generates an HTML report using Jupyter Lab.
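The transform step above reshapes the source data into daily rows. The Johns Hopkins time-series CSVs store one column per date, so the core of such a transform can be sketched as follows. This is an illustrative sketch, not my actual pipeline code; the `Country/Region` key column matches the JHU layout, but the helper name and simplified schema are assumptions.

```python
import csv
import io

def wide_to_daily_rows(csv_text):
    """Reshape a wide time-series CSV (one column per date, as in the
    JHU CSSE repository) into one row per location per day.

    Returns a list of (location, date, cumulative_count) tuples,
    ready to be loaded into a database table.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    key_columns = {"Country/Region"}  # non-date columns kept as the row key
    rows = []
    for record in reader:
        location = record["Country/Region"]
        for column, value in record.items():
            if column not in key_columns:  # every remaining column is a date
                rows.append((location, column, int(value)))
    return rows

sample = (
    "Country/Region,1/22/20,1/23/20\n"
    "Italy,0,2\n"
)
print(wide_to_daily_rows(sample))
# [('Italy', '1/22/20', 0), ('Italy', '1/23/20', 2)]
```

In the real pipeline, the resulting rows would be bulk-inserted into MySQL by the Airflow load task.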
The Anomalies Analysis report highlights notable data points from the previous day using Verticals.
This report is also based on the source data from the process above. Verticals analyzes the data to identify data points from the previous day that are anomalous compared to the days before; these are then presented in an HTML report generated by Jupyter Lab.
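The core idea of comparing the previous day against the days before it can be illustrated with a simple standard-deviation check. This is a minimal stand-in for the kind of test an anomaly detector might run; Verticals' actual machine learning methods are not shown here, and the function name and threshold are assumptions.

```python
from statistics import mean, stdev

def is_anomalous(history, latest, threshold=3.0):
    """Flag the latest daily value if it deviates from the recent
    history by more than `threshold` standard deviations.

    A simplified illustration only, not Verticals' actual method.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold

# A steady series of roughly 100 new cases per day, then a sudden spike:
print(is_anomalous([98, 102, 99, 101, 100, 97, 103], 250))  # True
print(is_anomalous([98, 102, 99, 101, 100, 97, 103], 104))  # False
```

Points flagged this way are the ones that would surface in the daily HTML report.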
The US Dashboard provides overview and drill-down views for total and daily infections and deaths in the US.
This dashboard is also based on the source data above. An additional Airflow process calculates daily deltas for the time series data and uploads them to Google Sheets via the Sheets API for use by the dashboard.
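The delta calculation in that process amounts to differencing a cumulative series. A sketch of that step, with assumed function and variable names, might look like this; the Sheets API upload (e.g. via a client library such as gspread) is omitted:

```python
def daily_deltas(cumulative):
    """Convert a cumulative time series into day-over-day deltas.

    The first day has no prior value, so its delta equals its total.
    Illustrative sketch only; the upload to Google Sheets is handled
    by a separate task and is not shown here.
    """
    deltas = []
    previous = 0
    for value in cumulative:
        deltas.append(value - previous)
        previous = value
    return deltas

# Cumulative totals over four days -> new counts per day:
print(daily_deltas([10, 15, 15, 22]))  # [10, 5, 0, 7]
```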
All reports are automated and refresh daily.
Data Pipeline Diagram
Below is a diagram of how the stack is used to generate the reports above. With it, I can diagnose upstream failures and formulate a recovery plan to rebuild the affected data sets and reports.