Multi-model computer vision pipeline for soccer video analysis. YOLOv8n detection, ByteTrack persistence, and GraphSAGE neural networks for tactical pattern recognition.
Pipeline Architecture
Deep Dive
Each frame is processed through YOLOv8n with a confidence threshold of 0.25. The nano variant was chosen for its real-time inference speed on consumer GPUs. It detects players, the ball, and referees, with automatic device selection (CUDA when available).
Frame → detections with bboxes, scores, and class IDs
```python
from dataclasses import dataclass

import torch
from ultralytics import YOLO

@dataclass(slots=True)
class InferenceConfig:
    weights: str = "yolov8n.pt"
    device: str = "cuda_if_available"
    confidence: float = 0.25
    max_frames: int = 30

def _resolve_device(device: str) -> str:
    # "cuda_if_available" falls back to CPU when no GPU is present
    if device == "cuda_if_available":
        return "cuda" if torch.cuda.is_available() else "cpu"
    return device

def load_model(config: InferenceConfig) -> YOLO:
    device = _resolve_device(config.device)
    model = YOLO(config.weights)
    model.to(device)
    return model

def run_inference(image_path, config=None, model=None):
    cfg = config or InferenceConfig()
    detector = model or load_model(cfg)
    results = detector.predict(image_path, conf=cfg.confidence, verbose=False)
    return results[0]
```

Example output: `[Detection(bbox=[120, 340, 180, 520], score=0.92, class="player"), Detection(bbox=[540, 280, 560, 300], score=0.87, class="ball")]`
A Kalman filter-based tracker with Hungarian-algorithm matching. It uses distance-based association with an 80 px threshold and a max_age of 15 frames for track persistence; track IDs are reused only after a delay to prevent collisions.
Persistent IDs across frames with Kalman-smoothed bboxes
```python
from collections import deque

import numpy as np
from scipy.optimize import linear_sum_assignment

class ByteTrackRuntime:
    def __init__(
        self,
        min_confidence: float = 0.25,
        distance_threshold: float = 80.0,
        max_age: int = 15,
        max_track_id: int = 10000,
        id_reuse_delay: int = 30,
    ) -> None:
        # _TrackState and Tracklets are defined elsewhere in the module
        self._tracks: list[_TrackState] = []
        self._id_pool: deque[int] = deque()

    def update(self, frame_id, detections) -> Tracklets:
        # Kalman predict -> Hungarian match -> update tracks
        predictions = np.array([track.predict() for track in self._tracks])
        # Cost matrix: distance between predicted and detected positions
        cost_matrix = np.linalg.norm(
            predictions[:, None] - detections[None, :], axis=-1
        )
        track_indices, det_indices = linear_sum_assignment(cost_matrix)
        # Match pairs, create new tracks, clean up stale ones
        ...
        return Tracklets(frame_id=frame_id, items=outputs)
```

Example output: `Tracklets(frame_id=150, items=[Tracklet(track_id=7, bbox=[125, 342, 185, 522], score=0.91), Tracklet(track_id=3, bbox=[400, 200, 440, 380], score=0.88)])`
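The Hungarian matching step can be seen in isolation. A minimal sketch using scipy's `linear_sum_assignment` on a hypothetical 2×2 cost matrix (the coordinates are invented for illustration, not taken from the tracker):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical predicted track centers vs. detected centers (px)
tracks = np.array([[120.0, 400.0], [420.0, 300.0]])
dets = np.array([[425.0, 305.0], [123.0, 402.0]])

# Pairwise Euclidean distances form the assignment cost matrix
cost = np.linalg.norm(tracks[:, None, :] - dets[None, :, :], axis=-1)

# Hungarian algorithm picks the minimum-cost one-to-one assignment
row_ind, col_ind = linear_sum_assignment(cost)
matches = list(zip(row_ind.tolist(), col_ind.tolist()))
print(matches)  # → [(0, 1), (1, 0)]
```

Each track is paired with its nearest plausible detection even when the detections arrive in a different order, which is exactly why greedy nearest-neighbor matching is not enough.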
Builds spatial interaction graphs from tracked detections over windowed frames. Nodes are tracked entities; edges connect entities within an 80 px distance threshold. Supports both spatial and temporal edges for cross-frame relationships.
Window of 30 frames → spatial + temporal graph
```python
from itertools import combinations

import torch
from torch_geometric.data import Data

from src.track.bytetrack_runtime import Tracklets

def build_track_graph(
    track_windows: list[Tracklets],
    window: int = 30,
    distance_threshold: float = 80.0,
    include_temporal_edges: bool = True,
    max_spatial_edges: int = 1000,
):
    # Node features: bbox coordinates [x1, y1, x2, y2]
    # Edges: spatial proximity + temporal identity links
    ...
    for i, j in combinations(range(num_nodes), 2):
        dist = torch.linalg.norm(centers[i] - centers[j])
        if dist <= distance_threshold:
            edges.append([i, j])  # undirected: store both directions
            edges.append([j, i])
    ...
    return Data(x=node_features, edge_index=edge_index)
```

Resulting graph:

```
Data(
    x=[num_tracks, 4],   # bbox features per track
    edge_index=[2, N],   # spatial + temporal edges
)
```
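The spatial-edge rule (connect tracks within 80 px of each other) can be sketched in plain Python, independent of torch_geometric; the centers below are hypothetical:

```python
import math

# Hypothetical track centers (px) within one frame window
centers = [(150.0, 430.0), (200.0, 460.0), (600.0, 100.0)]
distance_threshold = 80.0

edges = []
for i in range(len(centers)):
    for j in range(i + 1, len(centers)):
        dist = math.dist(centers[i], centers[j])
        if dist <= distance_threshold:
            # Undirected graph: store both directions for edge_index
            edges.append((i, j))
            edges.append((j, i))

print(edges)  # → [(0, 1), (1, 0)]
```

Only the two nearby players become connected; the distant player stays isolated, which keeps the graph sparse and the message passing local.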
A 2-layer GraphSAGE network classifies player interactions into 3 categories. Uses global mean pooling over node embeddings for graph-level predictions. Trained with the standard PyTorch Geometric DataLoader pipeline.
Graph-level interaction classification (3 classes)
```python
from torch import nn
from torch_geometric.nn import SAGEConv, global_mean_pool

class GraphSAGENet(nn.Module):
    def __init__(
        self,
        in_channels: int = 4,
        hidden_channels: int = 64,
        num_classes: int = 3,
    ) -> None:
        super().__init__()
        self.conv1 = SAGEConv(in_channels, hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, hidden_channels)
        self.head = nn.Linear(hidden_channels, num_classes)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index).relu()
        pooled = global_mean_pool(x, batch)
        return self.head(pooled)
```

Example output:

```
logits:        tensor([[-0.12, 1.45, 0.33]])
probabilities: tensor([[0.08, 0.72, 0.20]])
```
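Global mean pooling itself is simple to illustrate. A numpy sketch of what `global_mean_pool` computes, with invented node embeddings and a two-graph batch:

```python
import numpy as np

# Hypothetical node embeddings for a batch of two graphs
x = np.array([
    [1.0, 2.0],   # graph 0, node 0
    [3.0, 4.0],   # graph 0, node 1
    [5.0, 6.0],   # graph 1, node 0
])
batch = np.array([0, 0, 1])  # graph assignment per node

# Mean-pool node embeddings within each graph → one vector per graph
pooled = np.stack([x[batch == g].mean(axis=0) for g in np.unique(batch)])
print(pooled)  # → [[2. 3.] [5. 6.]]
```

Pooling collapses a variable number of tracked players into a fixed-size graph embedding, which is what lets the linear head produce one 3-class prediction per window.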
Performance
Full Pipeline
Processing a 10-second La Liga clip (Barcelona vs Real Madrid):
```bash
# Run the DVC pipeline end-to-end
dvc repro

# Or run individual stages
python src/detect/infer.py --weights yolov8n.pt --confidence 0.25
python src/pipeline_full.py \
    --video data/raw/sample.mp4 \
    --model build/detection_yolov8n.plan \
    --output outputs/tracking_results
```

Architecture
YOLOv8n provides the fastest inference for real-time video processing on consumer GPUs. Nano variant keeps latency low enough for live stream processing while maintaining usable detection accuracy.
Kalman filter + Hungarian algorithm matching is simpler and faster than DeepSORT's appearance-based re-ID. Distance threshold of 80px works well for soccer where players have predictable motion patterns.
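The Kalman predict step behind this design can be sketched with a constant-velocity model. The state layout `[x, y, vx, vy]` and the numbers below are illustrative assumptions, not the tracker's actual internals:

```python
import numpy as np

# Constant-velocity state: [x, y, vx, vy]; one predict step per frame
state = np.array([120.0, 400.0, 3.0, -2.0])

dt = 1.0  # one frame
F = np.array([
    [1, 0, dt, 0],   # x  += vx * dt
    [0, 1, 0, dt],   # y  += vy * dt
    [0, 0, 1, 0],    # vx unchanged
    [0, 0, 0, 1],    # vy unchanged
], dtype=float)

predicted = F @ state
print(predicted[:2])  # → [123. 398.]
```

The predicted center, not the last observed one, is what enters the 80 px distance check, which is why predictable player motion makes the simple distance cost work.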
Lightweight experiment tracking without Kubernetes. MLflow tracks experiments, DVC versions data and orchestrates the 7-stage pipeline. Total cost: S3 storage only.
Graph structure naturally captures spatial relationships between players. SAGEConv layers aggregate neighbor features through message passing, enabling 3-class interaction classification from positional data.
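The mean aggregation at the heart of SAGEConv's message passing can be sketched in numpy. This shows aggregation only; the real layer also combines each node's own features and applies a learned linear transform. Features and adjacency here are invented:

```python
import numpy as np

# Node features and an adjacency list for a tiny 3-node graph
x = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]])
neighbors = {0: [1, 2], 1: [0], 2: [0]}

# Each node averages its neighbors' features (SAGE mean aggregator)
aggregated = np.stack(
    [x[neighbors[n]].mean(axis=0) for n in range(len(x))]
)
print(aggregated)  # → [[1. 2.] [0. 0.] [0. 0.]]
```

Stacking two such layers lets information flow two hops, so a player's embedding reflects not just nearby players but their neighborhoods too.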
What I Learned
Future Work