In [1]:
import sys
from generallib import *

connection = getConnection()

rollingDays = 2
daysBack = 60
In [2]:
display(md('# COVID-19 Pandemic'))

COVID-19 Pandemic

This is an analysis of the effects of COVID-19. This report is automatically updated each morning.

Data sourced from Johns Hopkins CSSE and is available here.

In [3]:
display(md("Compiled ""%Y-%m-%d %H:%M:%S")+" UTC."))

Compiled 2020-06-04 10:32:26 UTC.

Why Should I Care?

2% case fatality rate may not seem high. Let's get some perspective by really looking at the numbers.

In recent years, the United States has an average annual mortality rate of 0.72%.(1) The average American knows around 600 people.(2) You may learn that someone you know personally has died once every two years or so.

In recent years, the common flu infects and produces symptoms in around 35 million Americans each year, or a little over 10% of the total population.(5) In your average-size circle of acquaintances, then, you may know around 60 people who show flu symptoms each year. The common flu has a mortality rate below 0.1%, so most people do not actually know anyone who has died from flu complications.

Now, let's compare this to our current situation. Mortality rate for COVID-19 has been calculated to be anywhere from 1% to 4%.(3,4) The consensus seems to be that the real mortality rate for COVID-19 will be around 2%. Imagine that COVID-19 spreads at the same rate as the common flu. On average, at least one person you know will die from COVID-19.

The situation may actually be worse, as recent research indicates that there is a high rate of asymptomatic COVID-19 infections, meaning people are walking around with it and have no idea.(4,6) This is what leads to the higher mortality calculations of 3% or 4%, because the deaths are carefully recorded+, while total infections may be under-counted by a large margin. If those estimates are true, two or three people you know may die.

Of course, all this is predicated on COVID-19 spreading as widely as the common flu. By following current guidelines and mandates for social distancing and sheltering in place, these numbers could be reduced.

+ Even this may not be true. Recent data out of Italy shows that many deaths may not have been counted at all,(7) falsely deflating reported mortality rates even further.


I am not an epidemiologist. I do not work in the medical field. I am a data analyst whose perspective is "numbers are numbers", but as we all know, context is key. Please take this data with a grain of salt. Note that DDP metric below is completely made up by me after thinking through this data for about 10 minutes.

Infections and Deaths Overview

In [4]:
query = f"""
    dataDate as Date,
    sum(confirmed) as TotalInfections,
    sum(dead) as TotalDeaths,
    sum(recovered) as TotalRecovered
    dataDate >= current_date - interval {daysBack} day
group by dataDate
order by dataDate
covidtotdf = generateTable(query,connection,columns=['Date','TotalInfections','TotalDeaths','TotalRecovered'])
In [5]:
fig = generateGenericGraphDF('Total Infections, Deaths, and Recovered Worldwide',covidtotdf,['TotalInfections','TotalDeaths','TotalRecovered'],labels=['infections','deaths','recovered'])

Total COVID-19 infections worldwide.

In [6]:
fig = generateGenericGraphDF('Total Deaths Worldwide',covidtotdf,['TotalDeaths'],labels=['deaths'],ylabel='Deaths')

Total deaths caused by COVID-19 worldwide.

World Epicenters

In [7]:
def getCountryQuery(country):
    return f"""
        sum(c1.dead) - sum(c2.dead),
        sum(c1.confirmed) - sum(c1.recovered) - sum(c2.dead),
        sum(c1.confirmed) - sum(c2.confirmed)
        covidMetrics c1
        inner join covidMetrics c2 on c1.province=c2.province and and c1.dataDate=c2.dataDate + interval 1 day
        and c1.dataDate >= current_date - interval {daysBack} day
    group by c1.dataDate
    order by c1.dataDate

covidusdf = generateTable(getCountryQuery('US'),connection,['Date','usdeath','usinf','usrecov','usdailydeath','usactive','usdailyconfirmed'])
covidusdf['usdeathprob'] = covidusdf['usdailydeath'].rolling(rollingDays).mean() / covidusdf['usactive'].rolling(rollingDays).mean()
covidusdf = covidusdf.set_index('Date')
for ctry in ['China','Italy','Spain']:
    temp = generateTable(getCountryQuery(ctry),connection,['Date','death','inf','recov','dailydeath','active','dailyconfirmed'])
    temp = temp.set_index('Date')
    covidusdf[ctry + 'death'] = temp['death']
    covidusdf[ctry + 'inf'] = temp['inf']
    covidusdf[ctry + 'recov'] = temp['recov']
    covidusdf[ctry + 'dailydeath'] = temp['dailydeath']
    covidusdf[ctry + 'active'] = temp['active']
    covidusdf[ctry + 'dailyconfirmed'] = temp['dailyconfirmed']
    covidusdf[ctry + 'deathprob'] = temp['dailydeath'].rolling(rollingDays).mean() / temp['active'].rolling(rollingDays).mean()

# fix some misleading data
# China arbitrarily adjusted Wuhan deaths up on this day[datetime.datetime.strptime('2020-04-17','%Y-%m-%d').date(),'Chinadeathprob'] = 0[datetime.datetime.strptime('2020-04-18','%Y-%m-%d').date(),'Chinadeathprob'] = 0[datetime.datetime.strptime('2020-04-19','%Y-%m-%d').date(),'Chinadeathprob'] = 0[datetime.datetime.strptime('2020-04-20','%Y-%m-%d').date(),'Chinadeathprob'] = 0[datetime.datetime.strptime('2020-04-21','%Y-%m-%d').date(),'Chinadeathprob'] = 0[datetime.datetime.strptime('2020-04-22','%Y-%m-%d').date(),'Chinadeathprob'] = 0[datetime.datetime.strptime('2020-04-23','%Y-%m-%d').date(),'Chinadeathprob'] = 0
# Spain had massive reduction in infections on this day[datetime.datetime.strptime('2020-04-24','%Y-%m-%d').date(),'Spaindailyconfirmed'] = 0
In [8]:
fig = generateGenericGraphDF('Infections in Epicenters',covidusdf,['usinf','Italyinf','Spaininf'],labels=['US','Italy','Spain'],ylabel='Infections')

COVID-19 infections reported in epicenters.

In [9]:
fig = generateGenericGraphDF('Daily New Infections in Epicenters',covidusdf,['usdailyconfirmed','Italydailyconfirmed','Spaindailyconfirmed'],labels=['US','Italy','Spain'],ylabel='Infections',rolling=7)

Daily new infections reported in epicenters, and 7-day rolling average.

Of note is that infections in Spain started a rebound on April 15, 3 days after the Spanish government announced a loosening of restrictions.

In [10]:
fig = generateGenericGraphDF('Deaths in Epicenters',covidusdf,['usdeath','Italydeath','Spaindeath'],labels=['US','Italy','Spain'],ylabel='Deaths')

COVID-19 deaths reported in epicenters.

In [11]:
fig = generateGenericGraphDF('Daily New Deaths in Epicenters',covidusdf,['usdailydeath','Italydailydeath','Spaindailydeath'],labels=['US','Italy','Spain'],ylabel='Deaths',rolling=7)

Daily new deaths reported in epicenters, and 7-day rolling average.

Note that spike in deaths in China on April 17 was due to a retrospective adjustment to Wuhan, Hubei death count.

Relationship Between Infections and Deaths

Data above show that deaths rise some time after infections. This makes sense, as it takes time for someone to become sick enough that they succumb to the disease. But how long does this take? If we observe a spike in infections, how long before we see a similar spike in deaths?

In [12]:
cordf = pd.read_csv('covidCorrelation.csv')
cordf = cordf.set_index('days')

fig = generateGenericGraphDF('Expected Days Between Infection and Death',cordf,['R2'],labels=['Correlation'],ylabel='Correlation',xType='auto',xCol='days')
ba = BoxAnnotation(left=11.5,right=14.5,fill_color='red',fill_alpha=0.1)

Expected number of days that pass between an increase in infections and proportionate change in deaths, by country. Based on data from January 22 to April 7.

When a country experiences an increase in infections, it is most likely that a proprtionate increase in deaths will occur 12-14 days later. It is this correlation that allows experts to predict when they think "peak deaths" will occur, though these predictions are made using different mathematical methods.

This was calculated by determining the correlation (R2) of the linear regression for each dataset where percentage increase in infections was plotted with percentage increase in deaths X days later, where X was 1 to 24 days. Only time periods that began with a country having at least 50 deaths were considered. The correlation between percent increase in infections and percent increase in deaths 12 days later was highest.

Daily Death Probability

Actively infected total is calculated by subtracting total previous deaths and total recovered from total infections. Daily Death Probability (DDP) is the ratio between deaths on a particular day and that total. This metric essentially answers the question, "For all people who were still infected as of that day, what was the probability that one selected at random would die on that day?" and may indicate quality of healthcare. Note that I am not an epidemiologist.

In [13]:
fig = generateGenericGraphDF('Daily Death Probability (DDP) in Epicenters',covidusdf,['usdeathprob','Chinadeathprob','Italydeathprob','Spaindeathprob'],labels=['US','China','Italy','Spain'],ylabel='Probability')

Probability that any actively infected person will die on a given day on a 2-day rolling average.

DDP smooths over time as both numbers of actively infected and deaths increase. Of note:

  • China's DDP excluded for 4/17 and 4/18, due to arbitrary increase in death count.
  • Italy's and Spain's DDPs started declining in late March. This coincides with their transition from exponential to linear growth in deaths and indicated "turning the corner" in those countries.
In [14]:
def getStateTopQuery(state):
    return f"""
        and dataDate=(select max(dataDate) from covidMetricsUs)
        and (confirmed > 0 or dead > 0)
    order by confirmed desc

def getStateQuery(state):
    return f"""
        sum(c1.confirmed) - sum(c2.dead),
        sum(c1.dead) - sum(c2.dead)
        covidMetricsUs c1
        inner join covidMetricsUs c2 on c1.state=c2.state and and c1.dataDate=c2.dataDate + interval 1 day
        c1.dataDate >= current_date - interval {daysBack} day
        and c1.state='{state}'
    group by c1.dataDate
    order by c1.dataDate

statesdf = generateTable(getStateQuery('Washington'),connection,['Date','Washingtonconfirmed','Washingtondead','Washingtonactive','Washingtondailydead'])
statesdf = statesdf.set_index('Date')
statesdf['Washingtonddp'] = statesdf['Washingtondailydead'].rolling(rollingDays).mean() / statesdf['Washingtonactive'].rolling(rollingDays).mean()
for state in ['Florida','Georgia','Alaska','New York','California']:
    temp = generateTable(getStateQuery(state),connection,['Date','confirmed','dead','active','dailydead'])
    temp = temp.set_index('Date')
    state = state.replace(' ','_')
    statesdf[state + 'confirmed'] = temp['confirmed']
    statesdf[state + 'dead'] = temp['dead']
    statesdf[state + 'active'] = temp['active']
    statesdf[state + 'dailydead'] = temp['dailydead']
    statesdf[state + 'ddp'] = temp['dailydead'].rolling(rollingDays).mean() / temp['active'].rolling(rollingDays).mean()

By State


In [15]:
fig = generateGenericGraphDF('Infections in California',statesdf,['Californiaconfirmed'],labels=['infections in CA'],ylabel='Infections')
fig2 = generateGenericGraphDF('Deaths in California',statesdf,['Californiadead'],labels=['deaths in CA'],ylabel='Deaths')

COVID-19 infections and deaths reported in California.

In [16]:
temp = generateTable(getStateTopQuery('California'),connection,['County','Infections','Deaths'])
temp = temp.set_index('County')
temp.loc['TOTAL'] = temp.sum(axis=0)
temp = temp.sort_values('Infections',axis=0,ascending=False)
Infections Deaths
TOTAL 119348 4374
Los Angeles 58261 2489
Riverside 8155 342
San Diego 7674 276
Orange 6678 158
San Bernardino 5650 209
Alameda 3548 97
Santa Clara 2832 144
San Francisco 2613 43
Kern 2418 40
San Mateo 2212 84

Top ten locations in California for infections.

New York

In [17]:
fig = generateGenericGraphDF('Infections in New York',statesdf,['New_Yorkconfirmed'],labels=['infections in NY'],ylabel='Infections')
fig2 = generateGenericGraphDF('Deaths in New York',statesdf,['New_Yorkdead'],labels=['deaths in NY'],ylabel='Deaths')

COVID-19 infections and deaths reported in New York.

In [18]:
temp = generateTable(getStateTopQuery('New York'),connection,['County','Infections','Deaths'])
temp = temp.set_index('County')
temp.loc['TOTAL'] = temp.sum(axis=0)
temp = temp.sort_values('Infections',axis=0,ascending=False)
Infections Deaths
TOTAL 374085 30019
New York 204872 21688
Nassau 40644 2130
Suffolk 40062 1915
Westchester 33691 1381
Rockland 13259 646
Orange 10460 444
Erie 6234 524
Dutchess 3951 142
Monroe 3048 216
Onondaga 2256 139

Top ten locations in New York for infections.


In [19]:
fig = generateGenericGraphDF('Infections in Washington',statesdf,['Washingtonconfirmed'],labels=['infections in WA'],ylabel='Infections')
fig2 = generateGenericGraphDF('Deaths in Washington',statesdf,['Washingtondead'],labels=['deaths in WA'],ylabel='Deaths')

COVID-19 infections and deaths reported in Washington.

Because Washington was the first state in the US hit by a major outbreak, it is worth watching how the pandemic there plays out.

In [20]:
temp = generateTable(getStateTopQuery('Washington'),connection,['County','Infections','Deaths'])
temp = temp.set_index('County')
temp.loc['TOTAL'] = temp.sum(axis=0)
temp = temp.sort_values('Infections',axis=0,ascending=False)
Infections Deaths
TOTAL 22484 1135
King 8264 573
Yakima 3938 97
Snohomish 2991 150
Pierce 1968 82
Benton 838 63
Franklin 632 20
Spokane 627 34
Clark 550 22
Skagit 443 15
Whatcom 402 37

Top ten locations in Washington for infections.


In [21]:
fig = generateGenericGraphDF('Infections in Florida',statesdf,['Floridaconfirmed'],labels=['infections in FL'],ylabel='Infections')
fig2 = generateGenericGraphDF('Deaths in Florida',statesdf,['Floridadead'],labels=['deaths in FL'],ylabel='Deaths')

COVID-19 infections and deaths reported in Florida.

In [22]:
temp = generateTable(getStateTopQuery('Florida'),connection,['County','Infections','Deaths'])
temp = temp.set_index('County')
temp.loc['TOTAL'] = temp.sum(axis=0)
temp = temp.sort_values('Infections',axis=0,ascending=False)
Infections Deaths
TOTAL 58764 2566
Miami-Dade 18456 731
Broward 7339 318
Palm Beach 6477 359
Hillsborough 2384 84
Orange 2121 41
Lee 2087 113
Collier 1679 49
Duval 1669 51
Pinellas 1361 90
Manatee 1110 98

Top ten locations in Florida for infections.


In [23]:
fig = generateGenericGraphDF('Infections in Georgia',statesdf,['Georgiaconfirmed'],labels=['infections in GA'],ylabel='Infections')
fig2 = generateGenericGraphDF('Deaths in Georgia',statesdf,['Georgiadead'],labels=['deaths in GA'],ylabel='Deaths')

COVID-19 infections and deaths reported in Georgia.

In [24]:
temp = generateTable(getStateTopQuery('Georgia'),connection,['County','Infections','Deaths'])
temp = temp.set_index('County')
temp.loc['TOTAL'] = temp.sum(axis=0)
temp = temp.sort_values('Infections',axis=0,ascending=False)
Infections Deaths
TOTAL 48894 2123
Fulton 4688 244
Gwinnett 4063 135
DeKalb 3874 122
Cobb 3097 186
Hall 2539 50
Out of GA 2183 31
Dougherty 1790 149
Unassigned 1338 0
Clayton 1260 51
Cherokee 923 33

Top ten locations in Georgia for infections.


In [25]:
fig = generateGenericGraphDF('Infections in Alaska',statesdf,['Alaskaconfirmed'],labels=['infections in AK'],ylabel='Infections')
fig2 = generateGenericGraphDF('Deaths in Alaska',statesdf,['Alaskadead'],labels=['deaths in AK'],ylabel='Deaths')

COVID-19 infections and deaths reported in Alaska.

In [26]:
temp = generateTable(getStateTopQuery('Alaska'),connection,['City','Infections','Deaths'])
temp = temp.set_index('City')
temp.loc['TOTAL'] = temp.sum(axis=0)
temp = temp.sort_values('Infections',axis=0,ascending=False)
Infections Deaths
TOTAL 504 10
Anchorage 254 4
Fairbanks North Star 85 2
Kenai Peninsula 60 2
Matanuska-Susitna 34 1
Juneau 33 0
Ketchikan Gateway 16 0
Petersburg 4 1
Bethel 3 0
Southeast Fairbanks 3 0
Nome 3 0

Top ten locations in Alaska for infections.

COVID-19 Event Timeline

Below is a timeline of events related to COVID-19 in virus epicenters.

In [27]:
dates = [
names = [
    '1st US infection',
    'WH forms task force',
    'US travel restrictions',
    '1st non-Chinese death',
    'Diamond Princess',
    'Actual first',
    'Dow drops 1000',
    '1st untraceable US case',
    '1st death in US',
    '10 dead in WA',
    '100k worldwide',
    '500 US cases',
    'Nat\'l emerg.',
    'NYSE halted',
    'Spain loosens',
    'Stimulus payments',
    'GA loosens',
    'CO loosens',
descriptions = [
    'First COVID-19 infection in the US',
    'White House announces a dedicated task force',
    'Travel restrictions for those entering the US who have recently traveled in China',
    'First death of a COVID-19 victim outside of China',
    'Diamond Princess quarantine reported by media',
    'Actual first death occurred in US, as reported April 22',
    'Dow Jones sheds 1000 points, beginning a five-day correction',
    'First case in the US that could not be traced to an origin',
    'First death of a COVID-19 victim in the US',
    'Four more dead in Washington state, bringing total to ten in that state',
    'Worldwide infections pass 100,000 mark',
    'Over 500 infections in the US',
    'WHO officially declares COVID-19 a pandemic',
    'President Trump declares national emergency',
    'NYSE temporarily halted after 2,725 point drop',
    'DHS issues "no unnecessary travel" advisory',
    'Congress agrees on $2 trillion stimulus bill',
    'Trump extends distancing guidelines through April 30',
    'Spain begins loosening restrictions',
    'Stimulus bill payments start going out',
    'Georgia to allow gyms, barbers, etc. to open',
    'Colorado to lift stay-at-home mandate, though still asking citizens to stay home',
timelinedf = pd.DataFrame()
timelinedf['Date'] = dates
timelinedf['Event'] = names
timelinedf['Description'] = descriptions
timelinedf = timelinedf.set_index('Date')
fig,axis = getTimeline("COVID-19 Event Timeline",dates,names,interval=5)

Details on most recent events:

In [28]:
timelinedf = timelinedf.reset_index()
def prettyDateFormat(val):
    valDate = datetime.datetime.strptime(val,'%Y-%m-%d')
    return valDate.strftime('%b %d')

timelinedf['Date'] = timelinedf.apply(lambda x: prettyDateFormat(x['Date']), axis=1)
Date Event Description
Mar 11 Pandemic WHO officially declares COVID-19 a pandemic
Mar 13 Nat'l emerg. President Trump declares national emergency
Mar 16 NYSE halted NYSE temporarily halted after 2,725 point drop
Mar 16 Advisory DHS issues "no unnecessary travel" advisory
Mar 25 Stimulus Congress agrees on $2 trillion stimulus bill
Mar 29 Extension Trump extends distancing guidelines through April 30
Apr 12 Spain loosens Spain begins loosening restrictions
Apr 15 Stimulus payments Stimulus bill payments start going out
Apr 24 GA loosens Georgia to allow gyms, barbers, etc. to open
Apr 26 CO loosens Colorado to lift stay-at-home mandate, though still asking citizens to stay home

Why These States?

I have provided more detailed data for states that matter to my audience (mostly close friends and family). If you would like to see other states, just shoot me an email (


  1. "Mortality in the United States, 2018", CDC. (
  2. "The Average American Knows How Many People?", NY Times. (
  3. "The WHO Estimated COVID-19 Mortality at 3.4%. That Doesn't Tell the Whole Story", Time. (
  4. "Coronavirus disease 2019 (COVID-19) Situation Report – 46", WHO. (
  5. "Disease Burden of Influenza", CDC. (
  6. "CDC Director On Models For The Months To Come: 'This Virus Is Going To Be With Us'", NPR. (
  7. "Italy's Coronavirus Death Toll Is Far Higher Than Reported", MSN. (