Computer Vision

FIFASoccerDS

Multi-model computer vision pipeline for soccer video analysis: YOLOv8n for detection, ByteTrack for identity persistence, and a GraphSAGE graph neural network for tactical pattern recognition.

View on GitHub
Python · YOLOv8 · ByteTrack · PyTorch Geometric · FastAPI · MLflow
YOLOv8n
Detection Model
Nano variant
7 Stages
DVC Pipeline
End-to-end
3-Class
GNN Classification
GraphSAGE
50+
MLflow Experiments
Tracked

Pipeline Architecture

Video Input
MP4/RTSP · 1280×720
Frame Extract
OpenCV · stride 5
Detection
YOLOv8n · conf 0.25
Tracking
ByteTrack · Kalman + Hungarian
Graph Build
Spatial · 80px threshold
GNN Analysis
GraphSAGE · 3-class

Deep Dive

How It Works

1

Detection — YOLOv8 Nano

Each frame is processed through YOLOv8n with a confidence threshold of 0.25. The nano variant was chosen for real-time inference speed on consumer GPUs. The model detects players, the ball, and referees, with automatic device selection (CUDA when available).

Example

Frame → detections with bboxes, scores, and class IDs

src/detect/infer.py · python
from dataclasses import dataclass

from ultralytics import YOLO

@dataclass(slots=True)
class InferenceConfig:
    weights: str = "yolov8n.pt"
    device: str = "cuda_if_available"
    confidence: float = 0.25
    max_frames: int = 30

def load_model(config: InferenceConfig) -> YOLO:
    device = _resolve_device(config.device)
    model = YOLO(config.weights)
    model.to(device)
    return model

def run_inference(image_path, config=None, model=None):
    cfg = config or InferenceConfig()
    detector = model or load_model(cfg)
    results = detector.predict(image_path, conf=cfg.confidence, verbose=False)
    return results[0]

Output

[Detection(bbox=[120,340,180,520], score=0.92, class="player"),
 Detection(bbox=[540,280,560,300], score=0.87, class="ball")]
2

Tracking — ByteTrack Runtime

A Kalman-filter tracker with Hungarian-algorithm matching. Association is distance-based with an 80px threshold, and tracks persist for up to max_age = 15 frames without a match. Track IDs are reused only after a delay, preventing ID collisions.

Example

Persistent IDs across frames with Kalman-smoothed bboxes

src/track/bytetrack_runtime.py · python
from collections import deque

from scipy.optimize import linear_sum_assignment

class ByteTrackRuntime:
    def __init__(
        self,
        min_confidence: float = 0.25,
        distance_threshold: float = 80.0,
        max_age: int = 15,
        max_track_id: int = 10000,
        id_reuse_delay: int = 30,
    ) -> None:
        self._tracks: list[_TrackState] = []
        self._id_pool: deque[int] = deque()

    def update(self, frame_id, detections) -> Tracklets:
        # Kalman predict → Hungarian match → update tracks
        predictions = [track.predict() for track in self._tracks]
        # Cost matrix: center distance between predictions and
        # detections (helper elided in this excerpt)
        cost_matrix = self._distance_matrix(predictions, detections)
        track_indices, det_indices = linear_sum_assignment(cost_matrix)
        # Match, spawn new tracks, retire tracks older than max_age
        return Tracklets(frame_id=frame_id, items=outputs)

Output

Tracklets(frame_id=150, items=[
  Tracklet(track_id=7, bbox=[125,342,185,522], score=0.91),
  Tracklet(track_id=3, bbox=[400,200,440,380], score=0.88),
])
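The distance-based association inside `update` can be shown with a self-contained sketch (the function name `associate` and the toy coordinates are illustrative; the real tracker matches Kalman-predicted bbox centers against detection centers):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

DISTANCE_THRESHOLD = 80.0  # px, matching the tracker default

def associate(pred_centers: np.ndarray, det_centers: np.ndarray):
    # Cost matrix: Euclidean distance between every predicted
    # track center and every detection center.
    cost = np.linalg.norm(
        pred_centers[:, None, :] - det_centers[None, :, :], axis=-1
    )
    rows, cols = linear_sum_assignment(cost)
    # Gate: reject assignments farther apart than the threshold.
    return [(r, c) for r, c in zip(rows, cols)
            if cost[r, c] <= DISTANCE_THRESHOLD]

preds = np.array([[100.0, 100.0], [400.0, 300.0]])
dets = np.array([[405.0, 310.0], [110.0, 95.0], [900.0, 50.0]])
print(associate(preds, dets))  # → [(0, 1), (1, 0)]
```

The third detection stays unmatched and would spawn a new track; unmatched predictions age toward max_age.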
3

Graph Construction

Builds spatial interaction graphs from tracked detections over windowed frames. Nodes are tracked entities; edges connect pairs within an 80px distance threshold. Supports both spatial and temporal edges for cross-frame relationships.

Example

Window of 30 frames → spatial + temporal graph

src/graph/build_graph.py · python
from itertools import combinations

import torch
from torch_geometric.data import Data

from src.track.bytetrack_runtime import Tracklets

def build_track_graph(
    track_windows: list[Tracklets],
    window: int = 30,
    distance_threshold: float = 80.0,
    include_temporal_edges: bool = True,
    max_spatial_edges: int = 1000,
):
    # Node features: bbox coordinates [x1, y1, x2, y2]
    # Edges: spatial proximity + temporal identity links
    # (num_nodes, centers, node_features derived from the
    # windowed tracklets; elided in this excerpt)
    edges: list[list[int]] = []
    for i, j in combinations(range(num_nodes), 2):
        dist = torch.linalg.norm(centers[i] - centers[j])
        if dist <= distance_threshold:
            edges.append([i, j])  # undirected → store both directions
            edges.append([j, i])

    edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
    return Data(x=node_features, edge_index=edge_index)

Output

Data(
  x=[num_tracks, 4],     # bbox features per track
  edge_index=[2, N],     # spatial + temporal edges
)
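The temporal-identity links (`include_temporal_edges=True`) connect the same `track_id` across consecutive frames in the window. A pure-Python sketch of that indexing, assuming node indices are assigned by flattening frames in order (the helper name is illustrative):

```python
def temporal_edges(frame_track_ids: list[list[int]]) -> list[tuple[int, int]]:
    # frame_track_ids[t] lists the track IDs present in frame t,
    # in node-index order; node indices run frame by frame.
    edges = []
    offset = 0
    for prev, curr in zip(frame_track_ids, frame_track_ids[1:]):
        curr_offset = offset + len(prev)
        curr_index = {tid: curr_offset + j for j, tid in enumerate(curr)}
        for i, tid in enumerate(prev):
            if tid in curr_index:
                # Bidirectional identity link across the frame boundary.
                edges.append((offset + i, curr_index[tid]))
                edges.append((curr_index[tid], offset + i))
        offset = curr_offset
    return edges

# Track 7 and track 3 persist across the boundary; track 9 is new.
print(temporal_edges([[7, 3], [3, 7, 9]]))
# → [(0, 3), (3, 0), (1, 2), (2, 1)]
```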
4

GraphSAGE Classification

A 2-layer GraphSAGE network classifies player interactions into 3 categories. Uses global mean pooling over node embeddings for graph-level predictions. Trained with the standard PyTorch Geometric DataLoader pipeline.

Example

Graph-level interaction classification (3 classes)

src/models/gcn.py · python
from torch import nn
from torch_geometric.nn import SAGEConv, global_mean_pool

class GraphSAGENet(nn.Module):
    def __init__(
        self, in_channels: int = 4,
        hidden_channels: int = 64,
        num_classes: int = 3,
    ) -> None:
        super().__init__()
        self.conv1 = SAGEConv(in_channels, hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, hidden_channels)
        self.head = nn.Linear(hidden_channels, num_classes)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index).relu()
        pooled = global_mean_pool(x, batch)
        return self.head(pooled)

Output

logits: tensor([[-0.12, 1.45, 0.33]])
probabilities: tensor([[0.08, 0.72, 0.20]])

Performance

7
Pipeline Stages
DVC orchestrated
50+
MLflow Experiments
Tracked & versioned
3
GNN Classes
Interaction types
30
Frame Window
Temporal context
YOLOv8n
Detection backbone
DVC + MLflow
Pipeline & tracking
Docker
Containerized

Full Pipeline

pipeline.sh

Processing a 10-second La Liga clip (Barcelona vs Real Madrid):

Input
10s clip @ 30fps
300 frames, 1280×720
Preprocess
60 frames extracted
stride 5, resized
Detection
YOLOv8n inference
conf ≥ 0.25
Tracking
Persistent track IDs
Kalman + Hungarian
Graph
Spatial interaction graph
80px distance threshold
GNN
3-class predictions
GraphSAGE embeddings
Run the full pipeline · bash
# Run the DVC pipeline end-to-end
dvc repro

# Or run individual stages
python src/detect/infer.py --weights yolov8n.pt --confidence 0.25
python src/pipeline_full.py \
    --video data/raw/sample.mp4 \
    --model build/detection_yolov8n.plan \
    --output outputs/tracking_results

Architecture

Key Decisions

YOLOv8 Nano for Real-Time Speed

YOLOv8n provides the fastest inference for real-time video processing on consumer GPUs. Nano variant keeps latency low enough for live stream processing while maintaining usable detection accuracy.

ByteTrack over DeepSORT

Kalman filter + Hungarian algorithm matching is simpler and faster than DeepSORT's appearance-based re-ID. Distance threshold of 80px works well for soccer where players have predictable motion patterns.

MLflow + DVC over Managed Platforms

Lightweight experiment tracking without Kubernetes. MLflow tracks experiments, DVC versions data and orchestrates the 7-stage pipeline. Total cost: S3 storage only.

GraphSAGE for Interaction Classification

Graph structure naturally captures spatial relationships between players. SAGEConv layers aggregate neighbor features through message passing, enabling 3-class interaction classification from positional data.

What I Learned

FP16 inference is essential for real-time CV on consumer GPUs — 2x speedup with <1% accuracy loss.
Two-stage matching (ByteTrack) dramatically improved track continuity. Don't discard low-confidence detections.
GNNs are perfect for spatial relationship modeling in sports — players' relative positions define tactics.
The lightweight MLflow + DVC stack worked surprisingly well for a solo ML project.
Production ML is 20% model training, 80% pipeline engineering.
Weekly automated retraining prevented model drift as data distribution evolved.

Future Work

Pose estimation with SAM2 for injury risk and fatigue detection
Multi-camera fusion for 3D position estimation
Vision-language models for zero-shot player identification
Automated highlight generation using action detection
Edge deployment with INT8 quantization for mobile
Pass prediction GNN for expected threat heatmaps