10 Aug 2017

tidyverse / EPL / rstats

Introduction

I was listening to Jeff Atwood’s interview on the podcast Developer on Fire and he said something that struck home with me. It was along the lines of, “The best time to start blogging is yesterday.” I had been considering starting a blog about #rstats but kept putting it off for any number of reasons. After listening to his interview, I decided now was as good a time as any. With the help of Yihui Xie’s blogdown, I was able to set up the basic webpage pretty easily too.

In honor of the start of a new English Premier League (EPL) season, I put together this exploratory data analysis of historical EPL data to see how teams typically start a season. I would love any feedback and suggestions! Please feel free to follow me on Twitter (I created my account just this past week to keep up with #rstats, rather than continually googling it as I have for the last year).

General Data Analysis

# loading in the required packages
suppressWarnings(suppressPackageStartupMessages({
  library(tidyr)
  library(ggplot2)
  library(lubridate)
  library(magrittr)
  library(tidyquant)
  library(purrr)
  library(ggjoy)
  library(dplyr)
}))

First, we will load in the data. All of the data used in this analysis comes from <www.football-data.co.uk> and can be found on my GitHub page under the epl repository.

files <- list.files(path = "epl_results", full.names = TRUE)

raw_data <- map(files, read.csv)
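As a minimal sketch of what this loading step does, the snippet below builds a hypothetical stand-in for the epl_results folder in a temporary directory and reads every CSV in it into a list. The file names, the "Team A" / "Team B" row, and the trailing blank row are made up for illustration; only the Aston Villa row is taken from the data shown later. It uses base R's lapply() in place of purrr's map(), which behaves the same way here.

```r
# Hypothetical stand-in for the epl_results folder: two tiny season files
tmp_dir <- file.path(tempdir(), "epl_results_demo")
dir.create(tmp_dir, showWarnings = FALSE)

writeLines(c("Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR",
             "E0,19/08/95,Aston Villa,Man United,3,1,H",
             ",,,,,,"),                       # stray empty row, as some files contain
           file.path(tmp_dir, "season_9596.csv"))
writeLines(c("Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR",
             "E0,17/08/96,Team A,Team B,2,0,H"),  # made-up match
           file.path(tmp_dir, "season_9697.csv"))

# One data frame per file, in the order list.files() returns them
files <- list.files(path = tmp_dir, pattern = "\\.csv$", full.names = TRUE)
raw_demo <- lapply(files, read.csv, stringsAsFactors = FALSE)
names(raw_demo) <- basename(files)

# Number of rows read from each file, including the stray empty row
sapply(raw_demo, nrow)
```

The stray row comes through with an empty Date, which is why filtering on a non-empty Date is a convenient way to drop such rows.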

Below is the column information provided by the website:

* Div = League Division
* Date = Match Date (dd/mm/yy)
* HomeTeam = Home Team
* AwayTeam = Away Team
* FTHG = Full Time Home Team Goals
* FTAG = Full Time Away Team Goals
* FTR = Full Time Result (H=Home Win, D=Draw, A=Away Win) 
* HTHG = Half Time Home Team Goals 
* HTAG = Half Time Away Team Goals 
* HTR = Half Time Result (H=Home Win, D=Draw, A=Away Win)

Match Statistics (where available):

* Attendance = Crowd Attendance
* Referee = Match Referee
* HS = Home Team Shots
* AS = Away Team Shots
* HST = Home Team Shots on Target
* AST = Away Team Shots on Target
* HHW = Home Team Hit Woodwork
* AHW = Away Team Hit Woodwork
* HC = Home Team Corners
* AC = Away Team Corners
* HF = Home Team Fouls Committed
* AF = Away Team Fouls Committed
* HO = Home Team Offsides
* AO = Away Team Offsides
* HY = Home Team Yellow Cards
* AY = Away Team Yellow Cards
* HR = Home Team Red Cards
* AR = Away Team Red Cards

The raw data set contains more columns, which have to do with betting odds; we will remove them, as they are not used in this analysis. Additionally, only two seasons’ data has attendance recorded, so this column will be removed as well. Looking through the data, the data sets for the 2000/2001 through 2016/2017 seasons have all of the match statistics listed above, while the data sets prior to 2000/2001 have only the general data and none of the match statistics. Because of these differences in available data, we will use only the general statistics for the first part of this analysis.

data_general <- map(raw_data, function(x){
  output <- x %>%
    filter(Date != "") %>%  #the csv files pulled in some extra blank rows, this line removes them
    mutate(Date = dmy(Date)) %>% #converting the Date column from dd/mm/yy text to a date object
    select(Div:HTR) %>%
    mutate(season = ifelse(month(Date) > 7,
                           year(Date) + 1,
                           year(Date)))  #labelling each match with the year in which its season ends
  return(output)
})

data <- do.call(rbind, data_general) %>% as_tibble()

data

## # A tibble: 8,360 x 11
##       Div       Date    HomeTeam       AwayTeam  FTHG  FTAG    FTR  HTHG
##    <fctr>     <date>      <fctr>         <fctr> <int> <int> <fctr> <int>
##  1     E0 1995-08-19 Aston Villa     Man United     3     1      H     3
##  2     E0 1995-08-19   Blackburn            QPR     1     0      H     1
##  3     E0 1995-08-19     Chelsea        Everton     0     0      D     0
##  4     E0 1995-08-19   Liverpool Sheffield Weds     1     0      H     0
##  5     E0 1995-08-19    Man City      Tottenham     1     1      D     0
##  6     E0 1995-08-19   Newcastle       Coventry     3     0      H     1
##  7     E0 1995-08-19 Southampton  Nott'm Forest     3     4      A     1
##  8     E0 1995-08-19    West Ham          Leeds     1     2      A     1
##  9     E0 1995-08-19   Wimbledon         Bolton     3     2      H     2
## 10     E0 1995-08-20     Arsenal  Middlesbrough     1     1      D     1
## # ... with 8,350 more rows, and 3 more variables: HTAG <int>, HTR <fctr>,
## #   season <dbl>
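The season column labels each match with the year in which its season ends, so a match in August 1995 belongs to season 1996. A small base R sketch of the same rule is shown below; it uses as.Date() and format() rather than the lubridate helpers from the post, and apart from the first date the dates are illustrative only.

```r
# dd/mm/yy strings in the format the website uses; "%y" maps 95 to 1995 and 16 to 2016
dates <- as.Date(c("19/08/95", "05/05/96", "13/08/16", "21/05/17"),
                 format = "%d/%m/%y")

month_num <- as.integer(format(dates, "%m"))
year_num  <- as.integer(format(dates, "%Y"))

# Matches from August onwards count towards the season ending the following year
season <- ifelse(month_num > 7, year_num + 1, year_num)

data.frame(date = dates, season = season)
```

The cut-off at month 7 assumes that no league match is played in July and that every season starts in August or later, which holds for the seasons covered here as far as the printed data shows.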

Now that the data from the csv files is all in one data frame, we can do some manipulation to get it into a more tidy format.

data_tidy <- data %>%
  gather(key = "venue", value = team, HomeTeam:AwayTeam) %>% 
  arrange(Date) %>%
  mutate_if(is.factor, as.character) %>%
  mutate(venue = ifelse(venue == "HomeTeam",
                        "Home",
                        "Away"),
         FTR = case_when(venue == "Home" & FTR == "H" ~ "W",
                         venue == "Home" & FTR == "A" ~ "L",
                         venue == "Away" & FTR == "H" ~ "L",
                         venue == "Away" & FTR == "A" ~ "W",
                         TRUE ~ FTR),
         HTR = case_when(venue == "Home" & HTR == "H" ~ "W",
                         venue == "Home" & HTR == "A" ~ "L",
                         venue == "Away" & HTR == "H" ~ "L",
                         venue == "Away" & HTR == "A" ~ "W",
                         TRUE ~ HTR),
         FTGF = ifelse(venue == "Home", FTHG, FTAG),  #Full Time Goals For
         FTGA = ifelse(venue == "Home", FTAG, FTHG),  #Full Time Goals Against
         HTGF = ifelse(venue == "Home", HTHG, HTAG),  #Half Time Goals For
         HTGA = ifelse(venue == "Home", HTAG, HTHG),  #Half Time Goals Against
         goal_diff = FTGF - FTGA,                    #goal difference
         points_earned = case_when(FTR == "W" ~ 3,           #adding points
                                   FTR == "D" ~ 1,
                                   FTR == "L" ~ 0)) %>% 
  select(Div, season, Date, team, venue, FTR, FTGF, 
         FTGA, HTR, HTGF, HTGA, goal_diff, points_earned) %>%
  group_by(season, team) %>%
  mutate(points = cumsum(points_earned),        #running points total for each team throughout the season
         goal_diff_tot = cumsum(goal_diff)) %>% #running goal difference for each team throughout the season
  ungroup()

data_tidy

## # A tibble: 16,720 x 15
##      Div season       Date        team venue   FTR  FTGF  FTGA   HTR  HTGF
##    <chr>  <dbl>     <date>       <chr> <chr> <chr> <int> <int> <chr> <int>
##  1    E0   1996 1995-08-19 Aston Villa  Home     W     3     1     W     3
##  2    E0   1996 1995-08-19   Blackburn  Home     W     1     0     W     1
##  3    E0   1996 1995-08-19     Chelsea  Home     D     0     0     D     0
##  4    E0   1996 1995-08-19   Liverpool  Home     W     1     0     D     0
##  5    E0   1996 1995-08-19    Man City  Home     D     1     1     L     0
##  6    E0   1996 1995-08-19   Newcastle  Home     W     3     0     W     1
##  7    E0   1996 1995-08-19 Southampton  Home     L     3     4     L     1
##  8    E0   1996 1995-08-19    West Ham  Home     L     1     2     W     1
##  9    E0   1996 1995-08-19   Wimbledon  Home     W     3     2     D     2
## 10    E0   1996 1995-08-19  Man United  Away     L     1     3     L     0
## # ... with 16,710 more rows, and 5 more variables: HTGA <int>,
## #   goal_diff <int>, points_earned <dbl>, points <dbl>,
## #   goal_diff_tot <int>
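To make the reshaping step easier to follow, here is a small sketch on two matches taken from the first rows of the data. It is not the code used above: it uses tidyr's pivot_longer() (the newer replacement for gather()) and, as a design variation, derives W/D/L from the goals for and against rather than recoding the H/D/A result. The object names matches and team_rows are made up for this example.

```r
library(dplyr)
library(tidyr)

# Two matches from the opening day of the 1995/1996 season
matches <- tibble(
  Date     = as.Date(c("1995-08-19", "1995-08-19")),
  HomeTeam = c("Aston Villa", "Southampton"),
  AwayTeam = c("Man United", "Nott'm Forest"),
  FTHG     = c(3L, 3L),
  FTAG     = c(1L, 4L)
)

team_rows <- matches %>%
  # one row per team per match instead of one row per match
  pivot_longer(c(HomeTeam, AwayTeam), names_to = "venue", values_to = "team") %>%
  mutate(
    venue = if_else(venue == "HomeTeam", "Home", "Away"),
    FTGF  = if_else(venue == "Home", FTHG, FTAG),   # goals for
    FTGA  = if_else(venue == "Home", FTAG, FTHG),   # goals against
    FTR   = case_when(FTGF > FTGA ~ "W",
                      FTGF < FTGA ~ "L",
                      TRUE        ~ "D"),
    points_earned = case_when(FTR == "W" ~ 3,
                              FTR == "D" ~ 1,
                              FTR == "L" ~ 0)
  ) %>%
  select(Date, team, venue, FTR, FTGF, FTGA, points_earned)

team_rows
```

Each match now contributes two rows, which is why the tidy data set has exactly twice as many rows as the original (16,720 versus 8,360).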

To ensure that our tidying did not create any missing values, we can use the summarise_all() function.

data_tidy %>% 
  summarise_all(function(x) sum(is.na(x)))

## # A tibble: 1 x 15
##     Div season  Date  team venue   FTR  FTGF  FTGA   HTR  HTGF  HTGA
##   <int>  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1     0      0     0     0     0     0     0     0     0     0     0
## # ... with 4 more variables: goal_diff <int>, points_earned <int>,
## #   points <int>, goal_diff_tot <int>
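This check matters because case_when() returns NA whenever none of its conditions match, so an unexpected result code would quietly turn into a missing value. In current versions of dplyr, summarise_all() is superseded by across(); the sketch below shows that form and a base R equivalent on a made-up tibble with deliberately missing values.

```r
library(dplyr)

# Made-up data with one missing value in each column
demo <- tibble(team   = c("Arsenal", "Chelsea", NA),
               points = c(3, NA, 1))

# Count missing values per column with across()
na_counts <- demo %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Base R equivalent, returned as a named vector
na_counts_base <- colSums(is.na(demo))

na_counts
na_counts_base
```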

Now that we know that the data is in a tidy format, we can begin exploring the data. As an Arsenal fan, I think we should start by looking at how Arsenal has done each year.

data_tidy %>%
  filter(team == "Arsenal",
         season < 2007) %>%
  ggplot(aes(Date, points)) +
  facet_wrap(~season, scales = "free") +
  geom_line() +
  theme_tq() +
  scale_x_date(date_breaks = "2 month", date_labels = "%m/%d") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, 
                                   vjust = 0.5, size = 12),
        axis.text.y = element_text(size = 12),
        strip.text = element_text(size = 12),
        axis.title = element_text(size = 12)) +
  labs(x = "", 
       y = "Point Tally",
       title = "Arsenal Point Tally over the course of each season from\nthe 1995/1996 season to the 2005/2006 season")

data_tidy %>%
  filter(team == "Arsenal",
         season >= 2007) %>%
  ggplot(aes(Date, points)) +
  facet_wrap(~season, scales = "free") +
  geom_line() +
  theme_tq() +
  scale_x_date(date_breaks = "2 month", 
               date_labels = "%m/%d") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, 
                                   vjust = 0.5, size = 12),
        axis.text.y = element_text(size = 12),
        strip.text = element_text(size = 12),
        axis.title = element_text(size = 12))+
  labs(x = "", 
       y = "Point Tally",
       title = "Arsenal Point Tally over the course of each season from\nthe 2006/2007 season to the 2016/2017 season")

From these graphs, it looks like Arsenal follow a similar pattern every year, which is not surprising since the 2017 season was the first in Arsene Wenger’s tenure in which they did not finish in the top 4. Looking at the plot for the 2017 season, it is clear that the period that killed their chances of finishing in the top 4 was the stretch of games between February and March, where their point tally flatlined. Let’s now take a look at each team’s average final point total.

season_ending <- data_tidy %>%
  group_by(season, team) %>%
  summarise(final_points = max(points), 
            final_goals_for = sum(FTGF), 
            final_goals_against = sum(FTGA),
            final_goal_diff = sum(goal_diff)) %>%
  ungroup() %>%
  group_by(season) %>%
  mutate(table_position = 21 - rank(final_points, ties.method = "min")) %>% #with 20 teams, the highest point total gets position 1
  ungroup()
  
season_ending

## # A tibble: 440 x 7
##    season        team final_points final_goals_for final_goals_against
##     <dbl>       <chr>        <dbl>           <int>               <int>
##  1   1996     Arsenal           63              49                  32
##  2   1996 Aston Villa           63              52                  35
##  3   1996   Blackburn           61              61                  47
##  4   1996      Bolton           29              39                  71
##  5   1996     Chelsea           50              46                  44
##  6   1996    Coventry           38              42                  60
##  7   1996     Everton           61              64                  44
##  8   1996       Leeds           43              40                  57
##  9   1996   Liverpool           71              70                  34
## 10   1996    Man City           38              33                  58
## # ... with 430 more rows, and 2 more variables: final_goal_diff <int>,
## #   table_position <dbl>

From this output, we can see right away that teams level on points receive the same rank from the rank function. The EPL breaks ties first by goal difference and then by goals scored. Looking at this first example, the tie between Arsenal and Aston Villa, we can see that both had a +17 overall goal difference but Aston Villa scored 3 more goals, meaning they finished 4th and Arsenal finished 5th. For now, we won’t worry about fixing these ties.
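If one did want to break the ties, a minimal sketch of the rule described above is to sort by points, then goal difference, then goals scored, and number the rows. The figures are the 1996 values for four teams from the output above; because only four teams are included, position_in_subset is the order within this subset, not the real league position.

```r
library(dplyr)

# 1996 season figures for four teams, copied from the season_ending output
table_1996 <- tibble(
  team                = c("Arsenal", "Aston Villa", "Blackburn", "Liverpool"),
  final_points        = c(63, 63, 61, 71),
  final_goals_for     = c(49, 52, 61, 70),
  final_goals_against = c(32, 35, 47, 34)
) %>%
  mutate(final_goal_diff = final_goals_for - final_goals_against) %>%
  # points first, then goal difference, then goals scored
  arrange(desc(final_points), desc(final_goal_diff), desc(final_goals_for)) %>%
  mutate(position_in_subset = row_number())

table_1996
```

Arsenal and Aston Villa are level on points and goal difference, so goals scored decides the order and Aston Villa end up one place above Arsenal.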

season_ending_stats <- season_ending %>%
  group_by(team) %>%
  summarise(avg_final_points = round(mean(final_points), 0), 
            sd_final_points = round(sd(final_points), 0), 
            avg_final_goals_for = round(mean(final_goals_for), 0), 
            sd_final_goals_for = round(sd(final_goals_for), 0),
            avg_final_goals_against = round(mean(final_goals_against), 0), 
            sd_final_goals_against = round(sd(final_goals_against), 0),
            avg_table_position = round(mean(table_position), 0),
            sd_table_poisiton = round(sd(table_position), 0),
            num_seasons = n())

season_ending_stats

## # A tibble: 46 x 10
##           team avg_final_points sd_final_points avg_final_goals_for
##          <chr>            <dbl>           <dbl>               <dbl>
##  1     Arsenal               75               7                  71
##  2 Aston Villa               50              11                  46
##  3    Barnsley               35             NaN                  37
##  4  Birmingham               43               7                  39
##  5   Blackburn               48              10                  48
##  6   Blackpool               39             NaN                  55
##  7      Bolton               44               9                  44
##  8 Bournemouth               44               3                  50
##  9    Bradford               31               7                  34
## 10     Burnley               34               5                  36
## # ... with 36 more rows, and 6 more variables: sd_final_goals_for <dbl>,
## #   avg_final_goals_against <dbl>, sd_final_goals_against <dbl>,
## #   avg_table_position <dbl>, sd_table_poisiton <dbl>, num_seasons <int>

As we can see from the tibble summary, several teams have been in the Premier League for only a single season, resulting in NaN values in the standard deviation columns. To avoid cluttering the next graph, we will remove any teams that have not been in the Premier League for at least 10 of the 22 seasons being analyzed.

season_ending_stats %>%
  filter(num_seasons >= 10) %>%
  arrange(-avg_final_points) %>%
  mutate(team = factor(team, team)) %>%
  ggplot(aes(team, avg_final_points)) +
  geom_bar(stat = "identity", fill = "red") +
  geom_point(color = "navy") +
  geom_errorbar(aes(ymin = avg_final_points - 2*sd_final_points, 
                    ymax = avg_final_points + 2*sd_final_points), 
                color = "navy", size = 1) +
  theme_tq() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, 
                                   vjust = 0.5, size = 12),
        axis.text.y = element_text(size = 12),
        axis.title = element_text(size = 12)) +
  labs(x = "Team",
       y = "Final Points",
       title = "Average Final Points from 1995/1996 season to 2016/2017 season",
       subtitle = "Only teams with 10 or more seasons in the EPL were included")

It looks like Manchester United have the highest average final points per season. They are followed closely by Arsenal and Chelsea, who are in turn closely followed by Liverpool, Manchester City, and Tottenham. It is worth noting that Chelsea and Manchester City both have much wider error bars (plus or minus two standard deviations) around their mean, indicating much more fluctuation in their final point tallies. Another interesting takeaway from this figure is that the lowest average final point tally for teams that have been in the Premier League for at least 10 of the previous 22 seasons is 39 points, a total that is typically enough to stay safe from relegation. However, the teams with the lowest average final point tallies, such as Sunderland and West Brom, see the lower tails of their error bars treading dangerously close to the relegation zone.
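For reference, the error bars are drawn at the average plus or minus two standard deviations. The arithmetic for three teams, using the rounded averages and standard deviations printed in the season_ending_stats output above, looks like this:

```r
# Rounded values copied from the season_ending_stats output
bars <- data.frame(
  team             = c("Arsenal", "Aston Villa", "Bolton"),
  avg_final_points = c(75, 50, 44),
  sd_final_points  = c(7, 11, 9)
)

# Same calculation as the ymin / ymax aesthetics in geom_errorbar()
bars$lower <- bars$avg_final_points - 2 * bars$sd_final_points
bars$upper <- bars$avg_final_points + 2 * bars$sd_final_points

bars
```

So Arsenal's bar runs from 61 to 89 points, while Aston Villa's runs from 28 to 72. If final point totals were roughly normally distributed, a range of this kind would cover about 95% of seasons, but that is an assumption rather than something checked here.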

Now, it will be interesting to see how these teams’ final point tallies look on the new joyplots from the ggjoy package.

season_ending %>%
  group_by(team) %>%
  mutate(num_seasons = n(),
         avg_final_points = round(mean(final_points), 0)) %>% #add these columns for filtering and factoring respectively
  ungroup() %>%
  filter(num_seasons >= 10) %>%
  arrange(avg_final_points) %>%
  mutate(team = factor(team, unique(team))) %>%
  ggplot(aes(final_points, team)) +
  geom_joy(scale = 0.9, rel_min_height = 0.01, 
           fill = "red", color = "black", size = 1) +
  theme_tq()+
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 12)) +
  labs(x = "Final Points",
       y = "Team",
       title = "Joyplot showing EPL teams' individual season final point tally distribution")

The joyplot confirms the conclusions drawn from the bar chart. Manchester United and Arsenal have the highest final point distributions, and both are close to a bell curve. Chelsea and Manchester City, however, both have much wider distributions, as they have had many more ups and downs over the past 22 years.

Looking at opening weekend results

Since this weekend marks the start of the 2017/2018 season, it will be interesting to see how teams have fared on opening weekend over the last 22 seasons. First, we will start by filtering for each team’s first game of the season.

opening_games <- data_tidy %>%
  group_by(season, team) %>%
  mutate(final_points = max(points), 
         final_goals_for = sum(FTGF), 
         final_goals_against = sum(FTGA),
         game = rank(Date)) %>%
  ungroup() %>%
  group_by(season, game) %>%
  mutate(table_position = 21 - min_rank(final_points)) %>%
  ungroup() %>%
  filter(game == 1) %>%
  group_by(team) %>%
  mutate(final_points_last = lag(final_points),
         final_goals_for_last = lag(final_goals_for),
         final_goals_against_last = lag(final_goals_against),
         final_table_position_last = lag(table_position)) %>%
  ungroup()

opening_games

## # A tibble: 440 x 24
##      Div season       Date        team venue   FTR  FTGF  FTGA   HTR  HTGF
##    <chr>  <dbl>     <date>       <chr> <chr> <chr> <int> <int> <chr> <int>
##  1    E0   1996 1995-08-19 Aston Villa  Home     W     3     1     W     3
##  2    E0   1996 1995-08-19   Blackburn  Home     W     1     0     W     1
##  3    E0   1996 1995-08-19     Chelsea  Home     D     0     0     D     0
##  4    E0   1996 1995-08-19   Liverpool  Home     W     1     0     D     0
##  5    E0   1996 1995-08-19    Man City  Home     D     1     1     L     0
##  6    E0   1996 1995-08-19   Newcastle  Home     W     3     0     W     1
##  7    E0   1996 1995-08-19 Southampton  Home     L     3     4     L     1
##  8    E0   1996 1995-08-19    West Ham  Home     L     1     2     W     1
##  9    E0   1996 1995-08-19   Wimbledon  Home     W     3     2     D     2
## 10    E0   1996 1995-08-19  Man United  Away     L     1     3     L     0
## # ... with 430 more rows, and 14 more variables: HTGA <int>,
## #   goal_diff <int>, points_earned <dbl>, points <dbl>,
## #   goal_diff_tot <int>, final_points <dbl>, final_goals_for <int>,
## #   final_goals_against <int>, game <dbl>, table_position <dbl>,
## #   final_points_last <dbl>, final_goals_for_last <int>,
## #   final_goals_against_last <int>, final_table_position_last <dbl>
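The two key steps above are numbering each team's games within a season and then using lag() to pull in the previous season's totals. A minimal sketch with a single made-up team ("Team A", with invented dates and points) shows how they fit together; row_number() after sorting by date is used here in place of rank(Date).

```r
library(dplyr)

# Made-up results: two games in each of two seasons for one team
results <- tibble(
  season        = c(2016, 2016, 2017, 2017),
  team          = "Team A",
  Date          = as.Date(c("2015-08-08", "2015-08-15", "2016-08-13", "2016-08-20")),
  points_earned = c(3, 1, 0, 3)
)

openers <- results %>%
  group_by(season, team) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(game         = row_number(),         # 1 = first game of the season
         final_points = sum(points_earned)) %>%
  ungroup() %>%
  filter(game == 1) %>%
  group_by(team) %>%
  arrange(season, .by_group = TRUE) %>%
  mutate(final_points_last = dplyr::lag(final_points)) %>%  # previous row for this team
  ungroup()

openers
```

Note that lag() takes the team's previous row, not necessarily the previous calendar season. For a team that was relegated and later promoted, the "last" values therefore come from its most recent earlier season in the data, and a team's first season in the data has NA.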

Now that we have a data set with every team’s opening-game results, we can look at some of the factors that could influence opening weekend results. First, let’s look at how home field advantage helps teams on opening weekend.

opening_home_adv <- opening_games %>%
  group_by(venue) %>%
  summarise(perc_W = round((sum(FTR == "W")/n()) * 100, 1), 
            perc_L = round((sum(FTR == "L")/n()) * 100, 1),
            perc_D = round((sum(FTR == "D")/n()) * 100, 1))

opening_home_adv

## # A tibble: 2 x 4
##   venue perc_W perc_L perc_D
##   <chr>  <dbl>  <dbl>  <dbl>
## 1  Away   30.5   42.3   27.3
## 2  Home   41.8   30.9   27.3

It looks like the home teams have better luck on the first weekend of the season than the away teams. Let’s look at how this compares with the home field advantage across all games to see if there is any difference for the first week.

home_adv <- data_tidy %>%
  group_by(venue) %>%
  summarise(perc_W = round((sum(FTR == "W")/n()) * 100, 1), 
            perc_L = round((sum(FTR == "L")/n()) * 100, 1),
            perc_D = round((sum(FTR == "D")/n()) * 100, 1))

home_adv

## # A tibble: 2 x 4
##   venue perc_W perc_L perc_D
##   <chr>  <dbl>  <dbl>  <dbl>
## 1  Away   27.5   46.5     26
## 2  Home   46.5   27.5     26
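A note on reading this table: across all games the home and away rows mirror each other exactly, because every match contributes one home row and one away row, so a home win is always someone's away loss. The sketch below reproduces that on four made-up scorelines using base R's table() and prop.table().

```r
# Four made-up scorelines
matches <- data.frame(home_goals = c(3, 1, 0, 2),
                      away_goals = c(1, 1, 2, 0))

# W / D / L from one team's point of view
result <- function(goals_for, goals_against) {
  ifelse(goals_for > goals_against, "W",
         ifelse(goals_for < goals_against, "L", "D"))
}

# One row per team per match, as in data_tidy
long <- rbind(
  data.frame(venue = "Home", FTR = result(matches$home_goals, matches$away_goals)),
  data.frame(venue = "Away", FTR = result(matches$away_goals, matches$home_goals))
)
long$FTR <- factor(long$FTR, levels = c("W", "D", "L"))

# Row percentages: share of W, D and L within each venue
pct <- round(100 * prop.table(table(long$venue, long$FTR), margin = 1), 1)
pct
```

The opening weekend table is not perfectly mirrored (30.5% away wins against 30.9% home losses), presumably because it keeps each team's first game, and a team's first game can be against an opponent that has already played.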

It looks like the away team actually fares slightly better on opening weekend than over the season as a whole (30.5% wins versus 27.5%). Let’s now take a look at the winning percentage by year for the home teams and the away teams.

opening_games %>%
  group_by(season, venue) %>%
  summarise(perc_W = round((sum(FTR == "W")/n()) * 100, 1), 
            perc_L = round((sum(FTR == "L")/n()) * 100, 1),
            perc_D = round((sum(FTR == "D")/n()) * 100, 1)) %>%
  gather(key = "result", value = "percentage", perc_W:perc_D) %>%
  mutate(result = case_when(result == "perc_W" ~ "Winning Percentage",
                            result == "perc_L" ~ "Losing Percentage",
                            TRUE ~ "Percent Draws")) %>%
  filter(venue == "Home") %>%
  ggplot(aes(season, percentage, color = result)) +
  geom_point() +
  geom_line(size = 1) +
  theme_tq() +
  scale_color_tq() +
  theme(axis.text.x = element_text(angle = 90, 
                                   vjust = 0.5, 
                                   hjust = 1, 
                                   size = 12),
        axis.text.y = element_text(size = 12),
        legend.title = element_blank()) +
  labs(x = "Year",
       y = "Percentage of Results by Year",
       title = "The Percent of Results (Win, Draw, Loss) per Season\nfor Home Teams on Opening Weekend")

It looks like the winning percentage for home teams fluctuates from season to season, with no clear trend as to whether being at home on opening weekend provides any benefit. Since 2014, home teams have had a higher losing percentage than winning percentage. Perhaps there has been some shift in recent years that has made playing away from home more favorable on opening weekend. More likely, however, this is just random variation, since each season contributes only about ten opening home games. While no overall inference can be drawn from the yearly figures, looking at the winning percentage for home and away opening games for the individual teams may provide some interesting results.

home_adv_team <- opening_games %>%
  group_by(team) %>%
  #removing teams that have not played at least ten seasons in the premier league
  mutate(num_seasons = n()) %>%
  filter(num_seasons >= 10) %>% 
  select(-num_seasons) %>%
  ungroup() %>%
  group_by(team, venue) %>%
  summarise(winning_percentage = round((sum(FTR == "W")/n()) * 100, 0), 
            num_games_total = n()) %>%
  ungroup()

home_adv_team

## # A tibble: 36 x 4
##           team venue winning_percentage num_games_total
##          <chr> <chr>              <dbl>           <int>
##  1     Arsenal  Away                 43               7
##  2     Arsenal  Home                 60              15
##  3 Aston Villa  Away                 31              13
##  4 Aston Villa  Home                 50               8
##  5   Blackburn  Away                 40               5
##  6   Blackburn  Home                 40              10
##  7      Bolton  Away                 43               7
##  8      Bolton  Home                 50               6
##  9     Chelsea  Away                 56               9
## 10     Chelsea  Home                 77              13
## # ... with 26 more rows

home_adv_team %>%
  arrange(-winning_percentage) %>%
  mutate(team = factor(team, unique(team)),
         venue = factor(venue, levels = c("Home", "Away"))) %>%
  ggplot(aes(team, winning_percentage, fill = venue)) +
  facet_wrap(~venue, scales = "fixed", ncol = 1) +
  geom_bar(stat = "identity") +
  theme_tq() +
  scale_fill_tq() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 12),
        axis.text.y = element_text(size = 12),
        strip.text = element_text(size = 12),
        legend.position = "none") +
  geom_text(aes(x = team, y = winning_percentage + 5, label = paste0("n=", as.character(num_games_total)))) +
  labs(x = "",
       y = "Winning Percentage",
       title = "Winning Percentage for Teams on Opening Weekend both Home and Away",
       subtitle = "Only teams with 10 or more seasons in the Premier League were included")

This figure provides much more insight into how home field advantage can affect individual teams. For example, Arsenal have a 60% winning percentage at home on opening weekend, and their opponent this weekend, Leicester City, have not won an opening game away from home in the Premier League since 1996. This is a very positive sign for the Gunners, as their match is being played at the Emirates Stadium (Arsenal’s home ground). Chelsea have a winning percentage in the high 70s when they play their opening game at home, which (unfortunately) seems promising for the Blues, as their opening match this year is at home.

Other noticeable effects of home field advantage show up for Newcastle United and Sunderland. Newcastle have a home winning percentage of 50% on opening weekend, but only a 10% winning percentage for away games on opening weekend. Fortunately for them, they are also playing at home this year. Unfortunately for them, they are matched up with West Ham, who don’t have much trouble playing on the road on opening weekend, with a 40% winning percentage away from home. Sunderland similarly have a home winning percentage of just under 40% on opening weekend, but a winning percentage only just north of 10% when playing away from home. However, Sunderland were relegated last season, so maybe they will have better luck in the Championship.

Several teams seem not to be affected by whether or not they play at home on opening weekend. Manchester United and Manchester City both see only a slight dip in winning percentage when away from home, and Blackburn have a 40% winning percentage both home and away on opening weekend.
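One caveat when reading these percentages is the small number of games behind each bar (the n= labels). As a rough sketch of how much uncertainty that implies, base R's binom.test() gives an exact confidence interval for a proportion. Arsenal's 60% at home comes from 15 games, which corresponds to 9 wins. Treating opening games as independent draws with a fixed win probability is a simplifying assumption made only for this illustration.

```r
# Arsenal at home on opening weekend: 60% of 15 games = 9 wins
wins  <- 9
games <- 15

# Exact (Clopper-Pearson) 95% confidence interval for the win probability
ci <- binom.test(x = wins, n = games)$conf.int

# Interval bounds as percentages
round(100 * ci, 0)
```

The resulting interval spans several tens of percentage points, so differences of 10 or 20 points between teams in the figure should be read as suggestive rather than conclusive.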

Let’s take a look at each team’s overall winning percentage on opening weekend.

opening_games %>%
  group_by(team) %>%
  #removing teams that have not played at least ten seasons in the premier league
  mutate(num_seasons = n()) %>%
  filter(num_seasons >= 10) %>% 
  select(-num_seasons) %>%
  ungroup() %>%
  group_by(team) %>%
  summarise(winning_percentage = round((sum(FTR == "W")/n()) * 100, 0), 
            num_games_total = n()) %>%
  ungroup() %>%
  arrange(-winning_percentage) %>%
  mutate(team = factor(team, unique(team))) %>%
  ggplot(aes(team, winning_percentage)) +
  geom_bar(stat = "identity", fill = "red", color = "black") +
  theme_tq() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 12),
        axis.text.y = element_text(size = 12),
        strip.text = element_text(size = 12),
        legend.position = "none") +
  geom_text(aes(x = team, y = winning_percentage + 5, label = paste0("n=", as.character(num_games_total)))) +
  labs(x = "",
       y = "Winning Percentage",
       title = "Winning Percentage for Teams on Opening Weekend",
       subtitle = "Only teams with 10 or more seasons in the Premier League were included")

It looks like the results follow a similar pattern to the previous figure, as expected. Chelsea and Manchester United have the best winning percentages (>60%) on opening weekend, with Arsenal, Liverpool, Manchester City, and West Ham all close behind in the 50% range.

Conclusions

It looks like the odds are in favor of several of the typical powerhouse teams performing well this weekend, if history has anything to say about it. There is much more that can be done with this data set, and I hope to revisit it at a later date. I hope you all enjoyed reading this, and I hope to put out more blog posts in the future!

And of course, COYG!!

Exploring English Premier League Historical Match Results
