Steps to get Football Data with a Python Package
Step 1: Installing and Importing the StatsBombPy Package
Begin by installing the StatsBombPy package via pip:
pip install statsbombpy
Then import the package with the following code:
from statsbombpy import sb
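Before moving on, it can help to confirm that the package is visible to the interpreter you are actually running. The following is a minimal sketch that uses only the standard library; the helper name is_installed is made up for illustration and is not part of statsbombpy.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path.

    Hypothetical helper for illustration; it does not import the package.
    """
    return importlib.util.find_spec(package_name) is not None


if not is_installed("statsbombpy"):
    print("statsbombpy not found; run: pip install statsbombpy")
```

If the message is printed even after installing, pip probably installed the package into a different Python environment than the one running the script.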
Step 2: Exploring Available Competitions:
To view the available competitions within the StatsBomb dataset, use:
sb.competitions()
Output:
competition_id season_id country_name competition_name competition_gender competition_youth competition_international season_name match_updated match_updated_360 match_available_360 match_available
0 9 27 Germany 1. Bundesliga male False False 2015/2016 2023-12-12T07:43:33.436182 None None 2023-12-12T07:43:33.436182
1 1267 107 Africa African Cup of Nations male False True 2023 2024-02-14T05:41:27.566989 None None 2024-02-14T05:41:27.566989
2 16 4 Europe Champions League male False False 2018/2019 2023-03-07T12:20:48.118250 2021-06-13T16:17:31.694 None 2023-03-07T12:20:48.118250
3 16 1 Europe Champions League male False False 2017/2018 2021-08-27T11:26:39.802832 2021-06-13T16:17:31.694 None 2021-01-23T21:55:30.425330
4 16 2 Europe Champions League male False False 2016/2017 2021-08-27T11:26:39.802832 2021-06-13T16:17:31.694 None 2020-07-29T05:00
... ... ... ... ... ... ... ... ... ... ... ... ...
66 55 43 Europe UEFA Euro male False True 2020 2023-02-24T21:26:47.128979 2023-04-27T22:38:34.970148 2023-04-27T22:38:34.970148 2023-02-24T21:26:47.128979
67 35 75 Europe UEFA Europa League male False False 1988/1989 2023-06-18T19:28:39.443883 2021-06-13T16:17:31.694 None 2023-06-18T19:28:39.443883
68 53 106 Europe UEFA Women's Euro female False True 2022 2023-10-24T03:36:54.066267 2023-10-24T03:37:29.085948 2023-10-24T03:37:29.085948 2023-10-24T03:36:54.066267
69 72 107 International Women's World Cup female False True 2023 2023-12-12T14:06:50.626363 2023-12-12T14:12:41.561162 2023-12-12T14:12:41.561162 2023-12-12T14:06:50.626363
70 72 30 International Women's World Cup female False True 2019 2023-07-27T10:33:48.273734 2021-06-13T16:17:31.694 None 2023-07-27T10:33:48.273734
71 rows × 12 columns
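The competition_id and season_id columns are the values that later calls need, so it is useful to look them up by name. The sketch below does this on a small DataFrame built from four rows of the output above, which means it runs without a network connection. The helper find_ids is hypothetical; with the real data you would pass sb.competitions() instead of the sample frame.

```python
import pandas as pd

# Four rows copied from the sb.competitions() output above (subset of columns).
competitions = pd.DataFrame(
    {
        "competition_id": [9, 16, 16, 16],
        "season_id": [27, 4, 1, 2],
        "competition_name": [
            "1. Bundesliga",
            "Champions League",
            "Champions League",
            "Champions League",
        ],
        "season_name": ["2015/2016", "2018/2019", "2017/2018", "2016/2017"],
    }
)


def find_ids(df, competition_name, season_name):
    """Return (competition_id, season_id) for the first matching row.

    Hypothetical helper; raises IndexError if no row matches.
    """
    mask = (df["competition_name"] == competition_name) & (
        df["season_name"] == season_name
    )
    row = df.loc[mask].iloc[0]
    return int(row["competition_id"]), int(row["season_id"])


ids = find_ids(competitions, "Champions League", "2017/2018")
print(ids)
```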
To filter out duplicate entries and display each competition only once, use drop_duplicates:
drop_duplicates(['country_name', 'competition_name']) removes duplicate rows from the DataFrame based on the specified columns ('country_name' and 'competition_name'). If several rows share the same country name and competition name, only the first occurrence is kept and the rest are dropped.
sb.competitions().drop_duplicates(['country_name', 'competition_name'])
Output:
competition_id season_id country_name competition_name competition_gender competition_youth competition_international season_name match_updated match_updated_360 match_available_360 match_available
0 9 27 Germany 1. Bundesliga male False False 2015/2016 2023-12-12T07:43:33.436182 None None 2023-12-12T07:43:33.436182
1 1267 107 Africa African Cup of Nations male False True 2023 2024-02-14T05:41:27.566989 None None 2024-02-14T05:41:27.566989
2 16 4 Europe Champions League male False False 2018/2019 2023-03-07T12:20:48.118250 2021-06-13T16:17:31.694 None 2023-03-07T12:20:48.118250
20 87 84 Spain Copa del Rey male False False 1983/1984 2020-07-29T05:00 2021-06-13T16:17:31.694 None 2020-07-29T05:00
23 37 90 England FA Women's Super League female False False 2020/2021 2023-02-25T14:52:09.326729 2021-06-13T16:17:31.694 None 2023-02-25T14:52:09.326729
26 1470 274 International FIFA U20 World Cup male False False 1979 2023-06-28T10:55:11.501179 None None 2023-06-28T10:55:11.501179
27 43 106 International FIFA World Cup male False True 2022 2023-11-05T04:23:26.649917 2023-11-21T15:37:11.589616 2023-11-21T15:37:11.589616 2023-11-05T04:23:26.649917
The result has one row per competition, covering tournaments and leagues such as the FIFA World Cup, Champions League, La Liga, and more.
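To see the keep-first behaviour in isolation, here is a small sketch on four rows taken from the competitions output (one Bundesliga season and three Champions League seasons). The sample frame stands in for sb.competitions(). The keep="last" call is a standard pandas option shown only for comparison.

```python
import pandas as pd

# Rows copied from the competitions output above (subset of columns).
competitions = pd.DataFrame(
    {
        "country_name": ["Germany", "Europe", "Europe", "Europe"],
        "competition_name": [
            "1. Bundesliga",
            "Champions League",
            "Champions League",
            "Champions League",
        ],
        "season_name": ["2015/2016", "2018/2019", "2017/2018", "2016/2017"],
    }
)

subset = ["country_name", "competition_name"]

# Default keep="first": the first season listed for each competition survives.
unique_first = competitions.drop_duplicates(subset)

# keep="last" retains the last listed season instead.
unique_last = competitions.drop_duplicates(subset, keep="last")

# Counting rows per competition shows how many seasons were collapsed.
seasons_per_competition = competitions.groupby(subset).size()

print(unique_first)
print(seasons_per_competition)
```

The surviving rows keep their original index labels, which is why the index in the deduplicated output above jumps from 2 to 20.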
Step 3: Exploring Specific Matches (e.g., FIFA World Cup 2018):
sb.matches(competition_id=43, season_id=3): This method fetches match data for a specific competition and season. Here, competition_id=43 is the ID of the FIFA World Cup, and season_id=3 is the ID of the 2018 season.
df_2018 = sb.matches(competition_id=43, season_id=3): This line assigns the retrieved match data to a DataFrame called df_2018.
df_2018.head(5): This line displays the first 5 rows of the df_2018 DataFrame, providing a glimpse of the match data for the 2018 season.
df_2018 = sb.matches(competition_id=43, season_id=3)
df_2018.head(5)
Output:
match_id match_date kick_off competition season home_team away_team home_score away_score match_status ... last_updated_360 match_week competition_stage stadium referee home_managers away_managers data_version shot_fidelity_version xy_fidelity_version
0 7585 2018-07-03 20:00:00.000 International - FIFA World Cup 2018 Colombia England 1 1 available ... 2021-06-13T16:17:31.694 4 Round of 16 Otkritie Bank Arena Mark Geiger José Néstor Pekerman Gareth Southgate 1.0.2 None None
1 7570 2018-06-28 20:00:00.000 International - FIFA World Cup 2018 England Belgium 0 1 available ... 2021-06-13T16:17:31.694 3 Group Stage Stadion Kaliningrad Damir Skomina Gareth Southgate Roberto Martínez Montoliú 1.0.2 None None
2 7586 2018-07-03 16:00:00.000 International - FIFA World Cup 2018 Sweden Switzerland 1 0 available ... 2021-06-13T16:17:31.694 4 Round of 16 Saint-Petersburg Stadium Damir Skomina Jan Olof Andersson Vladimir Petković 1.0.2 None None
3 7557 2018-06-25 20:00:00.000 International - FIFA World Cup 2018 Iran Portugal 1 1 available ... 2021-06-13T16:17:31.694 3 Group Stage Mordovia Arena Enrique Cáceres Carlos Manuel Brito Leal Queiróz Fernando Manuel Fernandes da Costa Santos 1.0.2 None None
4 7542 2018-06-20 14:00:00.000 International - FIFA World Cup 2018 Portugal Morocco 1 0 available ... 2021-06-13T16:17:31.694 2 Group Stage Stadion Luzhniki Mark Geiger Fernando Manuel Fernandes da Costa Santos Hervé Renard 1.0.2 None None
5 rows × 22 columns
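Once the matches are in a DataFrame, ordinary pandas filtering narrows them down to the games of interest. The sketch below rebuilds the five rows shown above (a subset of columns) so it runs offline; matches_for_team is a hypothetical helper, and with real data you would pass df_2018 instead.

```python
import pandas as pd

# The five rows from the df_2018.head(5) output above (subset of columns).
matches = pd.DataFrame(
    {
        "match_id": [7585, 7570, 7586, 7557, 7542],
        "home_team": ["Colombia", "England", "Sweden", "Iran", "Portugal"],
        "away_team": ["England", "Belgium", "Switzerland", "Portugal", "Morocco"],
        "competition_stage": [
            "Round of 16",
            "Group Stage",
            "Round of 16",
            "Group Stage",
            "Group Stage",
        ],
    }
)


def matches_for_team(df, team):
    """Return the rows where the team played at home or away (hypothetical helper)."""
    mask = (df["home_team"] == team) | (df["away_team"] == team)
    return df.loc[mask]


england = matches_for_team(matches, "England")
round_of_16 = matches[matches["competition_stage"] == "Round of 16"]

print(england[["match_id", "home_team", "away_team"]])
print(round_of_16["match_id"].tolist())
```

The same kind of filter on competition_stage is one way to locate the match_id of a particular knockout game, assuming you first check which stage labels appear in the column.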
Step 4: Retrieving Lineups:
This code retrieves the lineups for a single match in the StatsBomb dataset: the 2018 World Cup final. Let’s break down the code:
id_final_2018 = 8658: This line defines the id_final_2018 variable and assigns it the match ID 8658. This ID uniquely identifies the match whose lineups we want to retrieve.
lineups = sb.lineups(match_id=id_final_2018): This line calls the sb.lineups() method with the match_id=id_final_2018 argument to retrieve the lineups for that match. The result is stored in the lineups variable.
lineups.keys(): sb.lineups() returns a dictionary keyed by team name, so this line lists the two teams in the match. Each key maps to a DataFrame with information about the players in that team’s lineup.
id_final_2018 = 8658
lineups = sb.lineups(match_id=id_final_2018)
lineups.keys()
Output:
dict_keys(['France', 'Croatia'])
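Because the result is a dictionary with one entry per team, you typically loop over it or index it by team name. The sketch below uses a stand-in dictionary so it runs offline: the player rows are placeholders, the player_name column is an assumption about the real frames, and squad_sizes is a hypothetical helper.

```python
import pandas as pd

# Stand-in for the object returned by sb.lineups(): team name -> DataFrame.
# Player rows are placeholders; "player_name" is an assumed column name.
lineups = {
    "France": pd.DataFrame({"player_name": ["Player A", "Player B"]}),
    "Croatia": pd.DataFrame({"player_name": ["Player C", "Player D", "Player E"]}),
}


def squad_sizes(lineups_by_team):
    """Map each team name to the number of rows in its lineup frame."""
    return {team: len(frame) for team, frame in lineups_by_team.items()}


sizes = squad_sizes(lineups)

# A single team's lineup is accessed like any dictionary value.
france = lineups["France"]

print(sizes)
print(france.head())
```

With real data, printing lineups["France"].columns is a quick way to check which columns are actually available before relying on any of them.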
Step 5: Retrieving Match Events:
df_events = sb.events(match_id=id_final_2018): This line calls the sb.events() method with the match_id=id_final_2018 argument to retrieve event data for the match with the specified ID (id_final_2018). The result is stored in the df_events variable, a DataFrame containing information about the events that occurred during the match (e.g., goals, fouls, substitutions).
df_events.columns: This line retrieves the column names of the df_events DataFrame. Each column represents a different attribute of the events recorded during the match.
df_events = sb.events(match_id=id_final_2018)
df_events.columns
Output:
Index(['ball_receipt_outcome', 'ball_recovery_recovery_failure',
'block_deflection', 'carry_end_location', 'clearance_aerial_won',
'counterpress', 'dribble_outcome', 'dribble_overrun', 'duel_outcome',
'duel_type', 'duration', 'foul_committed_advantage',
'foul_committed_card', 'foul_committed_penalty', 'foul_committed_type',
'foul_won_advantage', 'foul_won_defensive', 'goalkeeper_body_part',
'goalkeeper_end_location', 'goalkeeper_outcome', 'goalkeeper_position',
'goalkeeper_technique', 'goalkeeper_type', 'id', 'index',
'injury_stoppage_in_chain', 'interception_outcome', 'location',
'match_id', 'minute', 'pass_aerial_won', 'pass_angle',
'pass_assisted_shot_id', 'pass_backheel', 'pass_body_part',
'pass_cross', 'pass_cut_back', 'pass_deflected', 'pass_end_location',
'pass_goal_assist', 'pass_height', 'pass_length', 'pass_outcome',
'pass_recipient', 'pass_recipient_id', 'pass_shot_assist',
'pass_switch', 'pass_type', 'period', 'play_pattern', 'player',
'player_id', 'position', 'possession', 'possession_team',
'possession_team_id', 'related_events', 'second', 'shot_aerial_won',
'shot_body_part', 'shot_deflected', 'shot_end_location',
'shot_first_time', 'shot_freeze_frame', 'shot_key_pass_id',
'shot_outcome', 'shot_statsbomb_xg', 'shot_technique', 'shot_type',
'substitution_outcome', 'substitution_outcome_id',
'substitution_replacement', 'substitution_replacement_id', 'tactics',
'team', 'team_id', 'timestamp', 'type', 'under_pressure'],
dtype='object')
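A quick way to get oriented in an event table with this many columns is to count events by their type. The sketch below does so on five rows taken from the end of this match's event table (the same rows appear in the Step 6 output), so it runs offline. With real data you would apply the same calls to df_events.

```python
import pandas as pd

# Five event rows from this match (subset of columns).
events = pd.DataFrame(
    {
        "team": ["France", "France", "France", "France", "Croatia"],
        "type": ["Carry", "Goal Keeper", "Pass", "Half End", "Half End"],
        "player": ["Hugo Lloris", "Hugo Lloris", "Hugo Lloris", None, None],
    }
)

# Number of events of each type, most frequent first.
type_counts = events["type"].value_counts()

# Number of events recorded for each team.
events_per_team = events.groupby("team")["type"].count()

print(type_counts)
print(events_per_team)
```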
Step 6: Filtering and sorting of event data
df_events = df_events[['timestamp','team', 'type', 'minute', 'location', 'pass_end_location', 'player']]: This line selects only the specified columns ('timestamp', 'team', 'type', 'minute', 'location', 'pass_end_location', 'player') from the df_events DataFrame and assigns the result back to df_events, so that further analysis works with these columns only.
df_events = df_events.sort_values(['minute', 'timestamp']): This line sorts the df_events DataFrame by the 'minute' and 'timestamp' columns in ascending order, so that events are ordered chronologically within each minute of the match.
df_events.tail(5): This line displays the last 5 rows of the sorted df_events DataFrame, which are the final events recorded in the match. Each row represents a specific event (e.g., pass, carry, end of half) along with details such as the team, player, location, and minute of the event.
df_events = df_events[['timestamp','team', 'type', 'minute', 'location', 'pass_end_location', 'player']]
df_events = df_events.sort_values(['minute', 'timestamp'])
df_events.tail(5)
Output:
timestamp team type minute location pass_end_location player
2215 00:49:45.427 France Carry 94 [5.0, 33.0] NaN Hugo Lloris
2960 00:49:45.427 France Goal Keeper 94 [5.0, 33.0] NaN Hugo Lloris
851 00:50:01.987 France Pass 95 [18.0, 31.0] [52.0, 25.0] Hugo Lloris
2967 00:50:03.760 France Half End 95 NaN NaN NaN
2968 00:50:03.760 Croatia Half End 95 NaN NaN NaN
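The location and pass_end_location columns hold [x, y] lists, which are awkward to use directly. As a sketch of one possible next step, the code below keeps only the passes, splits the coordinates into separate columns, and computes a straight-line pass length. It rebuilds the five rows shown above so it runs offline; the column names x, y, end_x, end_y, and length are made up for illustration.

```python
import pandas as pd

# The five rows from the df_events.tail(5) output above.
events = pd.DataFrame(
    {
        "timestamp": [
            "00:49:45.427",
            "00:49:45.427",
            "00:50:01.987",
            "00:50:03.760",
            "00:50:03.760",
        ],
        "team": ["France", "France", "France", "France", "Croatia"],
        "type": ["Carry", "Goal Keeper", "Pass", "Half End", "Half End"],
        "minute": [94, 94, 95, 95, 95],
        "location": [[5.0, 33.0], [5.0, 33.0], [18.0, 31.0], None, None],
        "pass_end_location": [None, None, [52.0, 25.0], None, None],
        "player": ["Hugo Lloris", "Hugo Lloris", "Hugo Lloris", None, None],
    }
)

# Keep only passes; copy so that adding columns does not touch the original frame.
passes = events[events["type"] == "Pass"].copy()

# Split the [x, y] lists into numeric columns.
passes["x"] = passes["location"].apply(lambda xy: xy[0])
passes["y"] = passes["location"].apply(lambda xy: xy[1])
passes["end_x"] = passes["pass_end_location"].apply(lambda xy: xy[0])
passes["end_y"] = passes["pass_end_location"].apply(lambda xy: xy[1])

# Straight-line distance between start and end point, in pitch units.
passes["length"] = (
    (passes["end_x"] - passes["x"]) ** 2 + (passes["end_y"] - passes["y"]) ** 2
) ** 0.5

print(passes[["player", "x", "y", "end_x", "end_y", "length"]])
```

Filtering to passes first matters here: rows of other types have no pass_end_location, so indexing into that column on the full table would fail.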
With the StatsBombPy package, obtaining football data takes only a few lines of code. By following the steps in this guide, analysts and enthusiasts alike can explore StatsBomb's free datasets covering competitions, matches, lineups, and events, and use them as a starting point for football analytics projects.
How to get Football Data with a Python Package
Football (soccer) is one of the most popular sports worldwide, captivating millions of fans with its thrilling matches and compelling narratives. In this article, we’ll explore how to easily access football data using Python.
Specifically, we’ll work with the free football data that StatsBomb shares through its Python package, statsbombpy.