Cutting Edge Football Analytics

using polars, keras and spektral




Joris Bekkers

Cutting Edge Football Analytics

and... kloppy & unravelsports




Joris Bekkers

Contents

  • Introduction

  • Football Data

  • GraphEPV
    • Expected Possession Value
    • Graph Neural Networks
    • World Cup 2022 Dataset
    • kloppy
    • unravelsports

Di Maria 36' (2-0)

Introduction

  • Joris Bekkers

  • Football Analytics Consultant
    • Research
    • Development
    • Deployment


Introduction

Sahasrabudhe & Bekkers (2023)

PySport

  • PySport Boardmember
    • Unite Sports Analytics Community
    • Grow and Maintain Open-Source
      • kloppy contributor
    • 🔗 PySport.org

    Football Data

    Football Data

    • Event Data
      • Shots, Passes, Defensive actions etc.
      • Only event location
      • Expected Goals
    • Positional Tracking Data
    • Box Score Data
    • GPS / LPS Data
    • Health Data & Questionnaire
    • Body Pose / Skeletal Data

    Football Data

    • Event Data
    • Positional Tracking Data
      • Players & Ball [25FPS]
      • Collected in-stadium or from TV footage
    • Box Score Data
    • GPS / LPS Data
    • Health Data & Questionnaire
    • Body Pose / Skeletal Data

    World Cup 2022

    Tracking & Event Data

    match_id  home_team      away_team    date
    10517     Argentina      France       2022-12-18T15:00:00
    10516     Croatia        Morocco      2022-12-17T15:00:00
    10515     France         Morocco      2022-12-14T19:00:00
    10514     Argentina      Croatia      2022-12-13T19:00:00
    10513     England        France       2022-12-10T19:00:00
    ...       ...            ...          ...
    3815      United States  Wales        2022-11-21T19:00:00
    3812      Senegal        Netherlands  2022-11-21T16:00:00
    3813      England        Iran         2022-11-21T13:00:00
    3814      Qatar          Ecuador      2022-11-20T16:00:00

    🔗 PFF.com

    GraphEPV

    • EPV: Expected Possession Value
      • Expected outcome of a sequence of player actions
      • Chance to score within the next t seconds or n actions
    • GNN: Graph Neural Network
      • Graphs to represent complex, non-linear relationships
      • One graph per frame of positional data
    • Label:
      • Annotate positional data with event data
      • Goal within next 10 seconds

    Existing Work

    Possession Value Frameworks

    Open-Source Tools


    Kloppy solves data standardization issues within football through a vendor-independent data model for both event and tracking data.

    • Load
    • Filter
    • Transform
    • Export
      • To polars DataFrame

    kloppy Supported Providers

    Provider             Event  Tracking  Public Data
    Metrica              ✅     ✅        🔗
    Sportec              ✅     ✅        🔗
    StatsPerform / Opta  ✅     ✅        -
    SecondSpectrum       ⌛     ✅        -
    PFF                  ⌛     ✅        🔗
    SkillCorner          -      ✅        🔗
    Hawkeye (2D)         -      ✅        -
    Signality            -      ✅        -
    Tracab               -      ✅        -
    StatsBomb            ✅     -         🔗
    WyScout              ✅     -         🔗
    DataFactory          ✅     -         -
    Impect               ⌛     -         -

    The unravelsports package aims to aid researchers, analysts and enthusiasts by providing intermediary steps in the complex process of converting raw sports data into meaningful information and actionable insights.

    • ⚽ 🏈 Polars DataFrame Conversion
      • Use kloppy to Standardize, Transform and Export
    • ⚽ 🏈 Graph Neural Network
      • Use polars to Filter and Enhance
      • Use spektral and keras to Train, Validate and Predict
      • Use mplsoccer to Plot and Animate
    • ⚽ Pressing Intensity
    • v1.0.0 released yesterday!
    Wirtz

    "Spektral is a Python library for graph deep learning, based on the Keras API and TensorFlow 2"

    Graph Neural Network


    Node Features

    Global Features

    Edge Features

    Adjacency Matrix

    Model Architecture

    (Sahasrabudhe & Bekkers 2023)
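
The adjacency matrix idea can be sketched in a few lines of plain Python. This is only an illustration, assuming that players are connected to their teammates and that the two teams are linked solely through the ball node; the function name and node ordering are hypothetical, and the exact behaviour of the unravelsports options may differ.

```python
def build_adjacency(team_ids, ball_index):
    """Return a symmetric 0/1 adjacency matrix as a list of lists.

    team_ids: one team label per node (the ball carries its own label).
    ball_index: position of the ball node in team_ids.
    """
    n = len(team_ids)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            same_team = team_ids[i] == team_ids[j]
            touches_ball = i == ball_index or j == ball_index
            # Teammates are connected (self-loops included); every node
            # is also connected to the ball.
            if same_team or touches_ball:
                adjacency[i][j] = 1
    return adjacency


# Hypothetical toy frame: two players per team plus the ball.
team_ids = ["home", "home", "away", "away", "ball"]
adjacency = build_adjacency(team_ids, ball_index=4)
for row in adjacency:
    print(row)
```

In this toy frame no home player is directly connected to an away player; information can only flow between teams via the ball.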

    Loading Tracking Data with kloppy


    from kloppy import pff
    
    match_id = 10513 # England v France
    
    dataset = (
        pff
        .load_tracking(
            raw_data=f"{match_id}.jsonl.bz2",
            meta_data=f"{match_id}.json",
            roster_meta_data=f"{match_id}.json",
        )
        .filter(
            # Remove penalty shootouts
            lambda frame: frame.period.id in [1, 2, 3, 4]
        )
    )
    
    • Total 7+ million frames in 64 Games

    Create Dataset with unravelsports

    from unravel.soccer import KloppyPolarsDataset
    
    kloppy_pl_ds = KloppyPolarsDataset(
        dataset,
    
        # Defaults
        ball_carrier_threshold = 25.0,
        max_player_speed = 12.0,
        max_ball_speed = 28.0,
        max_player_acceleration = 6.0,
        max_ball_acceleration = 13.5,
        orient_ball_owning = True,
        add_smoothing = True,
    )
    

    KloppyPolarsDataset

    • Use kloppy to:
      • Transform Attack Left to Right
      • Transform Coordinates
      • To Polars DataFrame
    • Additional Enhancements
      • Add Speed
      • Add Acceleration
      • Infer Goalkeeper
      • Infer Ball Carrier
      • Extra Checks
    • Output
      • kloppy_pl_ds.data
      • kloppy_pl_ds.settings

    
    (
        dataset
        .transform(
            # attack left to right
            to_orientation="BALL_OWNING_TEAM",
    
            # x [-52.5, 52.5] and y [-34, 34]
            to_coordinate_system="secondspectrum"
        )
        .to_df(engine="polars")
    )
    

    🔗 kloppy Documentation

    KloppyPolarsDataset.data

       period_id  timestamp        frame_id  ball_state  id              x       y      z  team_id         position_name  game_id         vx      vy      vz  v      ax  ay  az  a  ball_owning_team_id  is_ball_carrier
    0  1          0 days 00:00:00  10000     alive       DFL-OBJ-00008F  -20.67  -4.56  0  DFL-CLU-000005  RCB            DFL-MAT-J03WPY  0.393   -0.214  0   0.447  0   0   0   0  DFL-CLU-00000P       False
    1  1          0 days 00:00:00  10000     alive       DFL-OBJ-0000EJ  -8.86   -0.94  0  DFL-CLU-000005  UNK            DFL-MAT-J03WPY  -0.009  0.018   0   0.02   0   0   0   0  DFL-CLU-00000P       False
    2  1          0 days 00:00:00  10000     alive       DFL-OBJ-0000F8  -2.12   9.85   0  DFL-CLU-00000P  RM             DFL-MAT-J03WPY  0       0       0   0      0   0   0   0  DFL-CLU-00000P       False
    3  1          0 days 00:00:00  10000     alive       DFL-OBJ-0000NZ  0.57    23.23  0  DFL-CLU-00000P  RB             DFL-MAT-J03WPY  0.179   -0.134  0   0.223  0   0   0   0  DFL-CLU-00000P       False
    4  1          0 days 00:00:00  10000     alive       DFL-OBJ-0001HW  -46.26  0.08   0  DFL-CLU-000005  GK             DFL-MAT-J03WPY  0.357   0.071   0   0.364  0   0   0   0  DFL-CLU-00000P       False

    • No labels, yet...

    Labels

    Event Data

    • PFF event data is not yet supported in kloppy
      • Simply load .json files instead
      • No need for any transformations etc.
    • Create Labels
      • Frames that lead to a goal within 10 seconds [1]
        • For team in possession
        • In the same period
        • Backfill to cover all frames
      • Rest [0]
      • Only 172 goals
    frame_id  event_type  result
    252342    "SHOT"      "GOAL"
    162107    "SHOT"      "GOAL"
    164933    "SHOT"      "GOAL"
    68200     "PASS"      "COMPLETE"
    44867     "PASS"      "COMPLETE"

    Labels


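    import polars as pl

    # `events`: polars DataFrame of PFF events with frame_id, event_type and result
    kloppy_pl_ds.data = (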
        kloppy_pl_ds.data
        .join(events, on=["frame_id"], how="left")
        .sort(["frame_id"])
        .with_columns([
            # When does the next goal occur
            pl.when(pl.col("result") == "GOAL")
            .then(pl.col("timestamp"))
            .otherwise(None)
            .backward_fill()
            .over(["period_id"])
            .alias("next_goal_timestamp"),
            
            # Who scores the next goal
            pl.when(pl.col("result") == "GOAL")
            .then(pl.col("ball_owning_team_id"))
            .otherwise(None)
            .backward_fill()
            .over(["period_id"])
            .alias("next_goal_team_id"),
        ])
        .with_columns([
            # Does the next goal happen within 10 seconds, 
            # and is it scored by the team currently in possession
            pl.when(
                (pl.col("next_goal_timestamp") - pl.col("timestamp") < pl.duration(seconds=10)) &
                (pl.col("next_goal_team_id") == pl.col("ball_owning_team_id"))
            )
            .then(1)
            .otherwise(0)
            .alias("scores_in_10s")
        ])
    )
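
To see what the backfill produces, here is a small dependency-free sketch of the same labeling rule on a toy sequence of frames. The frame dictionaries and the `label_frames` helper are made up for illustration; the polars expression above remains the reference.

```python
def label_frames(frames, window_seconds=10.0):
    """Label each frame 1 if the team in possession scores within the window.

    frames: list of dicts with keys period_id, t (seconds), team, goal (bool),
    sorted by time. Returns one 0/1 label per frame.
    """
    labels = [0] * len(frames)
    next_goal = {}  # period_id -> (timestamp, scoring team) of the next goal
    # Walk backwards so every frame sees the nearest goal that follows it
    # in the same period (the equivalent of backward_fill over period_id).
    for i in range(len(frames) - 1, -1, -1):
        frame = frames[i]
        if frame["goal"]:
            next_goal[frame["period_id"]] = (frame["t"], frame["team"])
        goal = next_goal.get(frame["period_id"])
        if goal is None:
            continue
        goal_t, goal_team = goal
        if goal_t - frame["t"] < window_seconds and goal_team == frame["team"]:
            labels[i] = 1
    return labels


frames = [
    {"period_id": 1, "t": 0.0, "team": "A", "goal": False},   # more than 10s before the goal
    {"period_id": 1, "t": 5.0, "team": "B", "goal": False},   # opponent in possession
    {"period_id": 1, "t": 8.0, "team": "A", "goal": False},   # inside the window
    {"period_id": 1, "t": 12.0, "team": "A", "goal": True},   # the goal frame itself
    {"period_id": 1, "t": 13.0, "team": "A", "goal": False},  # after the goal
    {"period_id": 2, "t": 0.0, "team": "A", "goal": False},   # different period
]
labels = label_frames(frames)
print(labels)  # [0, 0, 1, 1, 0, 0]
```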
    

    Sanity Check

    Negative / Positive Labels

    • Goalkeepers (attack blue left, defense green right)
    • Ball locations (team a, team b)
    • Add Global Features

      • Time since start of a half
      • Time since last SHOT or PASS



    • Add Graph Ids

    kloppy_pl_ds.add_graph_ids(["game_id"])
    

    Convert to Graphs with unravelsports


    from unravel.soccer import SoccerGraphConverter
        
    converter = SoccerGraphConverter(
        dataset=kloppy_pl_ds,
        label_col="scores_in_10s",
        global_feature_cols=["period_normed", "timestamp_normed", "time_since_last_event"],
        sample_rate=(1/5),
        adjacency_matrix_type="split_by_team",
        adjacency_matrix_connect_type="ball",
        random_seed=True,
    )
    
    • Global Features
      • Columns in kloppy_pl_ds.data
    • Downsample
      • 7 million to 1.4 million
    • Default node_feature_funcs and edge_feature_funcs
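
The downsampling arithmetic can be sketched as follows, assuming `sample_rate=1/5` means keeping every fifth frame of the 25 FPS feed (so roughly 5 frames per second, and 7 million frames become about 1.4 million). The slicing below is an illustration, not necessarily how unravelsports selects frames.

```python
def downsample(frame_ids, sample_rate):
    """Keep every (1 / sample_rate)-th frame id."""
    step = round(1 / sample_rate)
    return frame_ids[::step]


frame_ids = list(range(100))  # 4 seconds of 25 FPS tracking data
kept = downsample(frame_ids, sample_rate=1 / 5)
print(len(kept), kept[:4])  # 20 [0, 5, 10, 15]
```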

    ⏩ Add different or custom features

    • SoccerGraphConverter has:
      • node_feature_funcs=[is_gk, etc..]
      • edge_feature_funcs=[distances_between_players_normed, etc..]
    • e.g. Acceleration not in defaults

    @graph_feature(is_custom=False, feature_type="node")
    def is_gk(**kwargs):
        return np.where(kwargs["is_gk"], 1, 0.1)
    
    @graph_feature(is_custom=False, feature_type="edge")
    def distances_between_players_normed(**kwargs):
        distances_between_players = np.linalg.norm(
            kwargs["position"][:, None, :] - kwargs["position"][None, :, :], axis=-1
        )
        return normalize_distance(
            distances_between_players, max_distance=kwargs["settings"].max_distance
        )
    

    Convert to Graphs

    • Write directly to compressed pickle file
      • One file per match
      • 64 matches
    converter.to_pickle(f"data/converted_matches/{match_id}.pickle.gz")
    
    • Converts each frame to spektral.data.Graph
      • node features
      • edge features
      • adjacency matrix
      • label
      • graph Id
      • frame Id

    ⏩ Plot Graph (Video)

    converter.plot(
        "video.mp4",
        timestamp=pl.duration(minutes=16, seconds=7),
        end_timestamp=pl.duration(minutes=16, seconds=18),
        fps=25,
        period_id=1,
        sort=True
    )
    

    Initialize spektral Dataset

    from unravel.utils import GraphDataset
    
    graph_dataset = GraphDataset(
        pickle_folder="data/converted_matches"
    )
    
    • >>> GraphDataset(n_graphs=1_393_005)
      • 8309 Positive labels
    • GraphDataset inherits from spektral.data.Dataset
      • Adds split_test_train_validation()
      • Adds split_test_train()
      • Adds dimensions()

    Split and Downsample

    train, test, val = graph_dataset.split_test_train_validation(
        split_train=5, split_test=1, split_validation=1, 
        train_label_ratio = (1 / 7), 
        by_graph_id=True, 
    )
    
    >>> train: n_graphs=39_179  |  85.7%  |  0:  33_582   |  1:  5597
    >>> test: n_graphs=221_342  | 99.45%  |  0:  220_144  |  1:  1198
    >>> val: n_graphs=218_567   | 99.31%  |  0:  217_053  |  1:  1514
    
    • Downsample training data
      • Better training results
    • Split by "game_id"
      • Ensures each game ends up in exactly one of the train, test or validation sets
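
Both ideas can be sketched without any library. The helpers below are hypothetical and only illustrate the principle: whole games are assigned to one split, and negatives in the training set are undersampled until positives make up `train_label_ratio` of it. With 5,597 positives and a ratio of 1/7 this gives the 33,582 negatives reported above; the number of games per split depends on rounding and shuffling, so it need not match the talk.

```python
import random


def split_game_ids(game_ids, parts=(5, 1, 1), seed=0):
    """Shuffle game ids and cut them into train/test/validation by ratio."""
    ids = list(game_ids)
    random.Random(seed).shuffle(ids)
    total = sum(parts)
    n_train = round(len(ids) * parts[0] / total)
    n_test = round(len(ids) * parts[1] / total)
    return ids[:n_train], ids[n_train:n_train + n_test], ids[n_train + n_test:]


def negatives_to_keep(n_positive, positive_ratio):
    """Number of negatives so that positives make up `positive_ratio` of the set."""
    return round(n_positive * (1 / positive_ratio - 1))


# Hypothetical game ids 0..63 standing in for the 64 World Cup matches.
train_ids, test_ids, val_ids = split_game_ids(range(64))
print(len(train_ids), len(test_ids), len(val_ids))
print(negatives_to_keep(5597, 1 / 7))  # 33582
```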

    Train GNN

    from spektral.data import DisjointLoader
    
    loader_tr = DisjointLoader(
        train, 
        batch_size=batch_size
    )
    loader_va = DisjointLoader(
        val, 
        epochs=1, 
        shuffle=False, 
        batch_size=batch_size
    )
    

    Batch of Graphs as disjoint union
    🔗 spektral.data.DisjointLoader
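
A disjoint union stacks the node features of all graphs in a batch, places their adjacency matrices on the diagonal of one larger matrix, and keeps an index that records which graph each node belongs to. The plain-Python sketch below shows the idea on two toy graphs; `disjoint_union` is a hypothetical helper, not the spektral implementation.

```python
def disjoint_union(graphs):
    """graphs: list of (node_features, adjacency) pairs.

    Returns stacked node features, a block-diagonal adjacency matrix and an
    index list mapping every node to the graph it came from.
    """
    total_nodes = sum(len(x) for x, _ in graphs)
    features, graph_index = [], []
    adjacency = [[0] * total_nodes for _ in range(total_nodes)]
    offset = 0
    for graph_id, (x, a) in enumerate(graphs):
        features.extend(x)
        graph_index.extend([graph_id] * len(x))
        # Copy this graph's adjacency into its own diagonal block.
        for i, row in enumerate(a):
            for j, value in enumerate(row):
                adjacency[offset + i][offset + j] = value
        offset += len(x)
    return features, adjacency, graph_index


g1 = ([[0.1], [0.2]], [[0, 1], [1, 0]])
g2 = ([[0.3], [0.4], [0.5]], [[0, 1, 0], [1, 0, 1], [0, 1, 0]])
x, a, i = disjoint_union([g1, g2])
print(i)  # [0, 0, 1, 1, 1]
for row in a:
    print(row)
```

Because the off-diagonal blocks are zero, no messages pass between graphs of the same batch.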

    Train GNN

    from unravel.classifiers import CrystalGraphClassifier
    
    # Architecture from Sahasrabudhe & Bekkers (2023)
    model = CrystalGraphClassifier()
    

    Train GNN

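    from tensorflow.keras.losses import BinaryCrossentropy
    from tensorflow.keras.metrics import AUC, BinaryAccuracy
    from tensorflow.keras.optimizers import Adam

    model.compile(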
        loss=BinaryCrossentropy(), 
        optimizer=Adam(), 
        metrics=[
            AUC(curve="PR", name="pr_auc"),
            BinaryAccuracy()
        ]
    )
    
    • Precision Recall AUC
      • Suited to heavily imbalanced datasets where the minority ("scores_in_10s") class is what matters
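
A quick calculation shows why accuracy says little here. Using the test split counts reported earlier, a model that always predicts "no goal" already scores above 99% accuracy, whereas a classifier with no skill has a PR AUC close to the positive rate. The variable names are illustrative.

```python
n_negative, n_positive = 220_144, 1_198  # test split counts from the split slide

# A model that always predicts "no goal" is right on every negative frame.
majority_accuracy = n_negative / (n_negative + n_positive)

# A classifier with no skill has a PR AUC close to the positive rate.
positive_rate = n_positive / (n_negative + n_positive)

print(f"always-negative accuracy: {majority_accuracy:.4f}")  # 0.9946
print(f"no-skill PR AUC baseline: {positive_rate:.4f}")      # 0.0054
```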

    Train GNN

    from tensorflow.keras.callbacks import EarlyStopping

    model.fit(
        loader_tr.load(), # DisjointLoader
        steps_per_epoch=loader_tr.steps_per_epoch,
        epochs=100,
        use_multiprocessing=True,
        validation_data=loader_va.load(), # DisjointLoader
        validation_steps=loader_va.steps_per_epoch,
        callbacks=[
            EarlyStopping(
                monitor="val_pr_auc", 
                patience=5,  
                restore_best_weights=True, 
                mode="max"
            ),
        ],
    )
    
    • Validation PR AUC 0.054
    • Test PR AUC 0.053
      • Roughly 10x better than random (a no-skill PR AUC equals the positive rate, about 0.54% on the test set)
      • Far from perfect
      • Uncalibrated due to undersampling
    • Improve? We need more goals!

    Calibration

    4x Calibration
    • Validation Set (n=218,567, 10 games)
    • 1514 Positive Samples (0.7%)

    Calibration

    2x Calibration
    • Temperature Scaling
    • Small sample
    • Needs more games...
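
Temperature scaling divides the model's logits by a single scalar T, chosen to minimise the negative log-likelihood on a held-out set. The sketch below fits T by a simple grid search on made-up probabilities and labels; it is a minimal illustration under those assumptions, not the procedure used for the plots in this talk.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p, eps=1e-7):
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return math.log(p / (1.0 - p))


def nll(probs, labels, temperature):
    """Mean negative log-likelihood after scaling logits by 1 / temperature."""
    total = 0.0
    for p, y in zip(probs, labels):
        q = sigmoid(logit(p) / temperature)
        q = min(max(q, 1e-7), 1.0 - 1e-7)
        total -= y * math.log(q) + (1 - y) * math.log(1.0 - q)
    return total / len(probs)


def fit_temperature(probs, labels, candidates):
    """Pick the candidate temperature with the lowest validation NLL."""
    return min(candidates, key=lambda t: nll(probs, labels, t))


# Made-up, overconfident validation predictions.
val_probs = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]
val_labels = [1, 0, 0, 1, 0, 0]
candidates = [0.5 + 0.1 * k for k in range(46)]  # 0.5 .. 5.0
temperature = fit_temperature(val_probs, val_labels, candidates)
calibrated = [sigmoid(logit(p) / temperature) for p in val_probs]
print(temperature, [round(p, 3) for p in calibrated])
```

A single parameter is cheap to fit, but with only a handful of positive samples the estimate of T remains noisy.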

    Prediction


    from unravel.utils import GraphDataset
    
    pred_dataset = GraphDataset(
        # World Cup Final GraphDataset
        pickle_file="10517.pickle.gz"
    )
    
    loader_pred = DisjointLoader(
        pred_dataset, 
        batch_size=2048, 
        epochs=1, 
        shuffle=False
    )
    
    preds = model.predict(
        loader_pred.load(), 
        use_multiprocessing=True
    )
    

    Prediction

    import polars as pl
    
    preds_df = pl.DataFrame({
        "frame_id": [graph.frame_id for graph in pred_dataset],
        "y_hat": preds.flatten()
    })
    
    • Each Graph in GraphDataset has a frame_id
    • Load KloppyPolarsDataset.data for World Cup Final
    • Join predictions back to KloppyPolarsDataset.data on frame_id

    Prediction


    Summary

    • Use kloppy to:
      • Standardize
      • Load
      • Filter
      • Transform
      • Export
    • Use unravelsports to:
      • convert to Graph
      • polars DataFrames
      • spektral Graph Neural Networks
      • keras Machine Learning

    American Football


    from unravel.american_football import BigDataBowlDataset, AmericanFootballGraphConverter
    

    🔗 BigDataBowl 2025

    ⏩ Open-Source Football Analytics Tools!
