In [1]:
import sys
from generallib import *

connection = getConnection()

rollingDays = 2
daysBack = 60
In [2]:
display(md('# COVID-19 Pandemic'))

COVID-19 Pandemic

This is an analysis of the effects of COVID-19. This report is automatically updated each morning.

Data sourced from Johns Hopkins CSSE and is available here.

In [3]:
display(md("Compiled ""%Y-%m-%d %H:%M:%S")+" UTC."))

Compiled 2020-07-09 10:49:50 UTC.

Why Should I Care?

2% case fatality rate may not seem high. Let's get some perspective by really looking at the numbers.

In recent years, the United States has an average annual mortality rate of 0.72%.(1) The average American knows around 600 people.(2) You may learn that someone you know personally has died once every two years or so.

In recent years, the common flu infects and produces symptoms in around 35 million Americans each year, or a little over 10% of the total population.(5) In your average-size circle of acquaintances, then, you may know around 60 people who show flu symptoms each year. The common flu has a mortality rate below 0.1%, so most people do not actually know anyone who has died from flu complications.

Now, let's compare this to our current situation. Mortality rate for COVID-19 has been calculated to be anywhere from 1% to 4%.(3,4) The consensus seems to be that the real mortality rate for COVID-19 will be around 2%. Imagine that COVID-19 spreads at the same rate as the common flu. On average, at least one person you know will die from COVID-19.

The situation may actually be worse, as recent research indicates that there is a high rate of asymptomatic COVID-19 infections, meaning people are walking around with it and have no idea.(4,6) This is what leads to the higher mortality calculations of 3% or 4%, because the deaths are carefully recorded+, while total infections may be under-counted by a large margin. If those estimates are true, two or three people you know may die.

Of course, all this is predicated on COVID-19 spreading as widely as the common flu. By following current guidelines and mandates for social distancing and sheltering in place, these numbers could be reduced.

+ Even this may not be true. Recent data out of Italy shows that many deaths may not have been counted at all,(7) falsely deflating reported mortality rates even further.


I am not an epidemiologist. I do not work in the medical field. I am a data analyst whose perspective is "numbers are numbers", but as we all know, context is key. Please take this data with a grain of salt.

Infections and Deaths Overview

In [4]:
query = f"""
    dataDate as Date,
    sum(confirmed) as TotalInfections,
    sum(dead) as TotalDeaths,
    sum(recovered) as TotalRecovered
    dataDate >= current_date - interval {daysBack} day
group by dataDate
order by dataDate
covidtotdf = generateTable(query,connection,columns=['Date','TotalInfections','TotalDeaths','TotalRecovered'])
In [5]:
fig = generateGenericGraphDF('Total Infections, Deaths, and Recovered Worldwide',covidtotdf,['TotalInfections','TotalDeaths','TotalRecovered'],labels=['infections','deaths','recovered'])

Total COVID-19 infections worldwide.

In [6]:
fig = generateGenericGraphDF('Total Deaths Worldwide',covidtotdf,['TotalDeaths'],labels=['deaths'],ylabel='Deaths')

Total deaths caused by COVID-19 worldwide.

World Epicenters

In [7]:
def getCountryQuery(country):
    return f"""
        sum(c1.dead) - sum(c2.dead),
        sum(c1.confirmed) - sum(c1.recovered) - sum(c2.dead),
        sum(c1.confirmed) - sum(c2.confirmed)
        covidMetrics c1
        inner join covidMetrics c2 on c1.province=c2.province and and c1.dataDate=c2.dataDate + interval 1 day
        and c1.dataDate >= current_date - interval {daysBack} day
    group by c1.dataDate
    order by c1.dataDate

covidusdf = generateTable(getCountryQuery('US'),connection,['Date','usdeath','usinf','usrecov','usdailydeath','usactive','usdailyconfirmed'])
covidusdf['usdeathprob'] = covidusdf['usdailydeath'].rolling(rollingDays).mean() / covidusdf['usactive'].rolling(rollingDays).mean()
covidusdf = covidusdf.set_index('Date')
for ctry in ['Russia','Brazil']:
    temp = generateTable(getCountryQuery(ctry),connection,['Date','death','inf','recov','dailydeath','active','dailyconfirmed'])
    temp = temp.set_index('Date')
    covidusdf[ctry + 'death'] = temp['death']
    covidusdf[ctry + 'inf'] = temp['inf']
    covidusdf[ctry + 'recov'] = temp['recov']
    covidusdf[ctry + 'dailydeath'] = temp['dailydeath']
    covidusdf[ctry + 'active'] = temp['active']
    covidusdf[ctry + 'dailyconfirmed'] = temp['dailyconfirmed']
    covidusdf[ctry + 'deathprob'] = temp['dailydeath'].rolling(rollingDays).mean() / temp['active'].rolling(rollingDays).mean()
In [8]:
fig = generateGenericGraphDF('Infections in Epicenters',covidusdf,['usinf','Brazilinf','Russiainf'],labels=['US','Brazil','Russia'],ylabel='Infections')

COVID-19 infections reported in epicenters.

In [9]:
fig = generateGenericGraphDF('Daily New Infections in Epicenters',covidusdf,['usdailyconfirmed','Brazildailyconfirmed','Russiadailyconfirmed'],labels=['US','Brazil','Russia'],ylabel='Infections',rolling=7)

Daily new infections reported in epicenters, and 7-day rolling average.

In [10]:
fig = generateGenericGraphDF('Deaths in Epicenters',covidusdf,['usdeath','Brazildeath','Russiadeath'],labels=['US','Brazil','Russia'],ylabel='Deaths')

COVID-19 deaths reported in epicenters.

In [11]:
fig = generateGenericGraphDF('Daily New Deaths in Epicenters',covidusdf,['usdailydeath','Brazildailydeath','Russiadailydeath'],labels=['US','Brazil','Russia'],ylabel='Deaths',rolling=7)

Daily new deaths reported in epicenters, and 7-day rolling average.

Relationship Between Infections and Deaths

Data above show that deaths rise some time after infections. This makes sense, as it takes time for someone to become sick enough that they succumb to the disease. But how long does this take? If we observe a spike in infections, how long before we see a similar spike in deaths?

In [12]:
cordf = pd.read_csv('covidCorrelation.csv')
cordf = cordf.set_index('days')

fig = generateGenericGraphDF('Expected Days Between Infection and Death',cordf,['R2'],labels=['Correlation'],ylabel='Correlation',xType='auto',xCol='days')
ba = BoxAnnotation(left=11.5,right=14.5,fill_color='red',fill_alpha=0.1)

Expected number of days that pass between an increase in infections and proportionate change in deaths, by country. Based on data from January 22 to April 7.

When a country experiences an increase in infections, it is most likely that a proprtionate increase in deaths will occur 12-14 days later. It is this correlation that allows experts to predict when they think "peak deaths" will occur, though these predictions are made using different mathematical methods.

This was calculated by determining the correlation (R2) of the linear regression for each dataset where percentage increase in infections was plotted with percentage increase in deaths X days later, where X was 1 to 24 days. Only time periods that began with a country having at least 50 deaths were considered. The correlation between percent increase in infections and percent increase in deaths 12 days later was highest.

In [13]:
def getStateTopQuery(state):
    return f"""
        and dataDate=(select max(dataDate) from covidMetricsUs)
        and (confirmed > 0 or dead > 0)
    order by confirmed desc

def getStateQuery(state):
    return f"""
        sum(c1.confirmed) - sum(c2.dead),
        sum(c1.dead) - sum(c2.dead)
        covidMetricsUs c1
        inner join covidMetricsUs c2 on c1.state=c2.state and and c1.dataDate=c2.dataDate + interval 1 day
        c1.dataDate >= current_date - interval {daysBack} day
        and c1.state='{state}'
    group by c1.dataDate
    order by c1.dataDate
statesdf = generateTable(getStateQuery('Washington'),connection,['Date','Washingtonconfirmed','Washingtondead','Washingtonactive','Washingtondailydead'])
statesdf = statesdf.set_index('Date')
statesdf['Washingtonddp'] = statesdf['Washingtondailydead'].rolling(rollingDays).mean() / statesdf['Washingtonactive'].rolling(rollingDays).mean()
for state in ['Florida','Georgia','Alaska','New York','California']:
    temp = generateTable(getStateQuery(state),connection,['Date','confirmed','dead','active','dailydead'])
    temp = temp.set_index('Date')
    state = state.replace(' ','_')
    statesdf[state + 'confirmed'] = temp['confirmed']
    statesdf[state + 'dead'] = temp['dead']
    statesdf[state + 'active'] = temp['active']
    statesdf[state + 'dailydead'] = temp['dailydead']
    statesdf[state + 'ddp'] = temp['dailydead'].rolling(rollingDays).mean() / temp['active'].rolling(rollingDays).mean()


By State

State-level analysis has been moved to the COVID-19 Dashboard. This dashboard has the added benefit of seeing all US states and territories, as well as a breakdown by county and daily change data.


  1. "Mortality in the United States, 2018", CDC. (
  2. "The Average American Knows How Many People?", NY Times. (
  3. "The WHO Estimated COVID-19 Mortality at 3.4%. That Doesn't Tell the Whole Story", Time. (
  4. "Coronavirus disease 2019 (COVID-19) Situation Report – 46", WHO. (
  5. "Disease Burden of Influenza", CDC. (
  6. "CDC Director On Models For The Months To Come: 'This Virus Is Going To Be With Us'", NPR. (
  7. "Italy's Coronavirus Death Toll Is Far Higher Than Reported", MSN. (