#TidyTuesday: EDA of Beer Production

Jen Ren 2020-04-03 6 minute read

It’s been a while since I’ve done exploratory data analyses (EDAs), so I’m refreshing those skills with my first stab at playing with Tidy Tuesday datasets! This week, we’re looking at beer production in the US. The files, overview, and data dictionaries are in the 2020-03-31 folder of the rfordatascience/tidytuesday repository on GitHub.

library(tidyverse)
# Get the data

brewing_materials <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/brewing_materials.csv')
beer_taxed <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/beer_taxed.csv')
brewer_size <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/brewer_size.csv')
beer_states <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/beer_states.csv')

Let’s quickly run summary() on all of these datasets to get a sense of what they contain.
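Here’s a minimal sketch of that step. The helper name `summarize_all` is hypothetical (not part of any package), and the toy data frame only stands in for the real datasets so the snippet runs without a download.

```r
# Hypothetical helper: print summary() for every data frame in a named list
summarize_all <- function(datasets) {
  for (name in names(datasets)) {
    cat("\n==", name, "==\n")
    print(summary(datasets[[name]]))
  }
  # Return the dataset names invisibly so the call can be checked or piped
  invisible(names(datasets))
}

# Toy stand-in (made-up values, not the real data)
toy <- list(
  beer_states = data.frame(
    state   = c("AA", "BB"),
    year    = c(2019L, 2019L),
    barrels = c(100, NA)
  )
)
shown <- summarize_all(toy)

# With the real data loaded above, the call would look like:
# summarize_all(list(
#   brewing_materials = brewing_materials,
#   beer_taxed        = beer_taxed,
#   brewer_size       = brewer_size,
#   beer_states       = beer_states
# ))
```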

Let’s see what monthly beer sales look like.

monthly_sales <- 
  beer_taxed %>% 
  filter(type == 'Total Removals') %>% 
  select(
    year,
    month,
    barrels_sold = month_current,
    tax_rate
  )
# Explore annual sales trends over the years

scale_formatter_mil <- function(x) {
  scales::comma(x / 1000000, accuracy = 1)
}

monthly_sales %>% 
  ggplot(aes(x = month, y = barrels_sold, group = year, color = year)) + 
  geom_line() +
  # month is numeric (1-12), so use a continuous scale with month-name labels
  scale_x_continuous(
    breaks = 1:12, 
    labels = month.abb
  ) +
  scale_y_continuous(labels = scale_formatter_mil) + 
  labs(
    title = "Barrels of beer sold from 2008-2019",
    x = "Month",
    y = "Barrels sold (in millions)"
  )

Beer sales peak in the summer months, and it looks like those summer peaks have declined steadily over the years. Nearing the winter holidays at the end of the year, however, sales have stayed relatively consistent.

After the holidays, though, beer sales in recent years have plummeted to record lows. I wonder why this might be: is it a “New Year’s resolution” effect that’s become stronger over the years? Is beer increasingly seen as unhealthy? Or have tastes shifted since 2008 away from beer and toward some other alcohol?

It might be cool to try making some models to predict beer sales in 2020. I’ll also get to try out the tidymodels package, which I haven’t yet played with!


Let’s do a bit more EDA, since I also want to explore how each state’s beer production has trended over time. Which state produces the most? Which state has experienced the most change in production over time? Are there certain regions in the US that tend to produce more or less beer? My gut instinct is that the states where Anheuser-Busch has operations and domestic breweries will dominate the charts. We can also look at what kind of production each state leans toward: on premises, bottles/cans, or kegs/barrels. I expect states with a stronger tradition of craft breweries (maybe Illinois, with Chicago, and Oregon, with Portland) to have higher percentages of on-prem or kegs/barrels, while states like Missouri, where Anheuser-Busch has breweries, should be mostly bottles/cans for export.

beer_states %>% 
  filter(
    state != "total",
    !is.na(barrels)
  ) %>% 
  ggplot(aes(x = year, y = barrels, fill = type)) + 
  geom_col() + 
  facet_wrap(~ state) + 
  scale_x_continuous(labels = NULL) + 
  scale_y_continuous(labels = NULL) + 
  theme_minimal() +
  theme(legend.position = "bottom")

beer_states %>% 
  filter(year == 2019) %>% 
  group_by(state) %>% 
  summarize(total_production = sum(barrels, na.rm = TRUE)) %>% 
  ungroup() %>% 
  arrange(desc(total_production))
## # A tibble: 52 x 2
##    state total_production
##    <chr>            <dbl>
##  1 total       167077233.
##  2 CO           19097585.
##  3 TX           18187258.
##  4 CA           17872597.
##  5 OH           17676964.
##  6 VA           14179376.
##  7 GA           13411107.
##  8 MO           12169102.
##  9 FL           10373198.
## 10 WI            9227545.
## # … with 42 more rows

It looks like California, Colorado, Florida, Georgia, Missouri, Ohio, Texas, and Virginia have really high production levels consistently over the years. For all of them, bottle/cans dominate. This doesn’t surprise me as many of these states are where Anheuser-Busch has its domestic breweries.

At first glance, it looks like Illinois, Oregon, Ohio, and Pennsylvania have decently high proportions of non-bottle/can production.
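To put a number on that impression rather than eyeballing the facets, here’s a rough sketch that computes each state’s share of non-bottle/can production for one year. The helper name `non_bottle_share` is hypothetical, the toy rows are made up, and I’m assuming the bottle/can category is labeled "Bottles and Cans" in the `type` column (worth checking against the data dictionary).

```r
library(dplyr)

# Hypothetical helper: share of production that is NOT bottles/cans, per state
non_bottle_share <- function(data, target_year) {
  data %>%
    filter(state != "total", year == target_year, !is.na(barrels)) %>%
    group_by(state) %>%
    summarize(
      total_barrels = sum(barrels),
      # Assumes the bottle/can category is labeled "Bottles and Cans"
      non_bottle_can_share = sum(barrels[type != "Bottles and Cans"]) / sum(barrels),
      .groups = "drop"
    ) %>%
    arrange(desc(non_bottle_can_share))
}

# Toy stand-in with made-up numbers (same columns as beer_states)
toy_states <- tibble::tribble(
  ~state,  ~year, ~type,              ~barrels,
  "AA",    2019,  "Bottles and Cans",  800,
  "AA",    2019,  "Kegs and Barrels",  150,
  "AA",    2019,  "On Premises",        50,
  "BB",    2019,  "Bottles and Cans",  100,
  "BB",    2019,  "Kegs and Barrels",   60,
  "BB",    2019,  "On Premises",        40,
  "total", 2019,  "Bottles and Cans",  900
)

shares <- non_bottle_share(toy_states, 2019)
shares
```

On the real data, `non_bottle_share(beer_states, 2019)` should put the states with the most keg and on-prem production at the top.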

I’m going to quickly define a function that will help us do deeper dives into each state that we care about!

# Define helper to visualize state data

scale_formatter_1000 <- function(x) {
  scales::comma(x / 1000, accuracy = 1)
}
  
plot_state_production <- function (state_code) {  
  beer_states %>% 
  filter(
    state == state_code,
    !is.na(barrels)
  ) %>% 
  ggplot(aes(x = year, y = barrels, fill = type)) + 
  geom_col() +
  scale_x_continuous(breaks = 2008:2019) + # year is numeric, so label every year on a continuous scale
  scale_y_continuous(labels = scale_formatter_1000) + 
  theme(legend.position = "bottom") +
  labs(
    title = paste0(state_code, "'s beer production from 2008-2019"),
    x = "Year",
    y = "Barrels produced (in thousands)"
  )
}

Now let’s invoke it to peek at a few states:

plot_state_production("CA")

California’s production is among the highest: it ranked third in 2019, behind only Colorado and Texas. It’s seen a steady decline since 2008, though.
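Rankings like this are easier to read off once the national "total" row is dropped. Here’s a small sketch of that; the helper name `rank_states` is hypothetical and the toy rows are made up, so treat it as one way to do it rather than the exact code behind the numbers in this post.

```r
library(dplyr)

# Hypothetical helper: rank states by total production in a given year
rank_states <- function(data, target_year) {
  data %>%
    # Drop the national aggregate so it doesn't take the first rank
    filter(state != "total", year == target_year) %>%
    group_by(state) %>%
    summarize(total_production = sum(barrels, na.rm = TRUE), .groups = "drop") %>%
    mutate(rank = min_rank(desc(total_production))) %>%
    arrange(rank)
}

# Toy stand-in with made-up numbers (same columns as beer_states)
toy_states <- tibble::tribble(
  ~state,  ~year, ~type,              ~barrels,
  "AA",    2019,  "Bottles and Cans",  300,
  "AA",    2019,  "Kegs and Barrels",  100,
  "BB",    2019,  "Bottles and Cans",  500,
  "CC",    2019,  "Bottles and Cans",   NA,
  "CC",    2019,  "On Premises",        20,
  "total", 2019,  "Bottles and Cans",  920
)

ranked <- rank_states(toy_states, 2019)
ranked
```

With the real data, `rank_states(beer_states, 2019) %>% filter(state == "WA")` would pull out a single state’s rank.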

plot_state_production("WA")

Washington has significantly smaller production, ranking 20th of all states. However, its mix is quite different: kegs/barrels and on-prem production make up a far larger share. Even compared to Arizona, which is only slightly behind Washington in total production, Washington has a higher percentage of non-bottle/can production in 2019:

plot_state_production("AZ")

However, in Arizona’s case, the gap seems to come from a huge spike in bottle/can production in the past year. I wonder whether a major brand opened a brewery in Arizona, or whether a formerly craft-focused operation scaled up production?

plot_state_production("NJ")

New Jersey has seen a huge slide in production since 2008! I wonder why. Are operations moving to locations with lower operating costs (New Jersey is close to New York, and perhaps land and labor costs are going up)? Is it part of a general trend of moving manufacturing away from the tri-state area? Or is there decreased interest in that region?
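To quantify slides (or climbs) like this across every state at once, here’s a hedged sketch that compares total production between two years. The helper name `production_change` is hypothetical and the toy rows are made up; it also assumes every state has data in both years, since a missing year would show up as a total of zero.

```r
library(dplyr)

# Hypothetical helper: percent change in total production between two years
production_change <- function(data, start_year, end_year) {
  data %>%
    filter(state != "total", year %in% c(start_year, end_year)) %>%
    # First collapse the production types into one total per state and year
    group_by(state, year) %>%
    summarize(total = sum(barrels, na.rm = TRUE), .groups = "drop") %>%
    # Then compare the two years within each state
    group_by(state) %>%
    summarize(
      start_total = sum(total[year == start_year]),
      end_total   = sum(total[year == end_year]),
      pct_change  = (end_total - start_total) / start_total * 100,
      .groups = "drop"
    ) %>%
    arrange(pct_change)  # biggest declines first
}

# Toy stand-in with made-up numbers (same columns as beer_states)
toy_states <- tibble::tribble(
  ~state, ~year, ~type,              ~barrels,
  "AA",   2008,  "Bottles and Cans",   60,
  "AA",   2008,  "Kegs and Barrels",   40,
  "AA",   2019,  "Bottles and Cans",   50,
  "BB",   2008,  "On Premises",        10,
  "BB",   2019,  "On Premises",        30
)

changes <- production_change(toy_states, 2008, 2019)
changes
```

On the real data, `production_change(beer_states, 2008, 2019)` would list the steepest declines first; using `arrange(desc(pct_change))` instead would surface the biggest climbs.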

plot_state_production("WV")

On the total opposite end of the spectrum, West Virginia produces some of the smallest quantities of beer of any state; only North Dakota produces less. Most of its production is for on-prem consumption and in kegs/barrels. It’s also seen a huge climb since 2014-2015; I wonder if there was some event around then? We see a similar spike for North Dakota around the same time.


There’s still so much more to dive into, including the brewing_materials dataset and trying out tidymodels on beer_taxed. I’ll definitely be revisiting this dataset to see what else we might be able to find! So far, I’ve been most surprised to see spikes in bottled/canned beer production in Arizona and other statewide trends. I’m also surprised that beer sales in general have dipped in recent summer months, and especially around Jan/Feb, which seem like prime beer months with the Super Bowl. I wonder if other alcoholic beverages like White Claw have been cannibalizing some of the sales in recent years? In any case, this was a really fun refresher for diving back into EDA!