USV Obstacle Perception & Tracking

Brian Munoz Calonge

PyTorch · U-Net / ResNet-34 · MaSTr1325 · MODD2 · Kalman Filter · SORT

Overview

An end-to-end pipeline for detecting and tracking floating obstacles from an unmanned surface vehicle (USV) camera. The design splits the problem in two: a learned segmenter handles pixel-level classification, and a classical Kalman/SORT tracker handles temporal association.

The segmenter was trained on MaSTr1325 (1325 labeled coastal images, 3 classes: obstacle, water, sky). The tracker was evaluated on MODD2 stereo sequences, where the full pipeline runs at ~70 FPS end-to-end on a single consumer GPU.

Val mIoU (MaSTr1325): 0.991
Obstacle IoU: 0.979
Water IoU: 0.996
Sky IoU: 0.997
FPS end-to-end (consumer GPU): ~70
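For readers who want to reproduce the headline numbers, here is a minimal sketch of how per-class IoU and mIoU are commonly computed from a confusion matrix. The function names are illustrative, and whether the project accumulates the matrix over the whole validation set or averages per image is not stated, so treat this as one plausible definition rather than the project's evaluation code.

```python
def per_class_iou(confusion):
    """IoU per class from a square confusion matrix (rows = ground truth, cols = prediction)."""
    ious = []
    for c in range(len(confusion)):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                  # ground truth c, predicted as another class
        fp = sum(row[c] for row in confusion) - tp   # predicted c, ground truth is another class
        union = tp + fp + fn
        ious.append(tp / union if union else float("nan"))
    return ious


def mean_iou(ious):
    """Unweighted mean over classes."""
    return sum(ious) / len(ious)


# Per-class validation IoUs reported above: obstacle, water, sky.
reported = [0.979, 0.996, 0.997]
print(round(mean_iou(reported), 3))
```

The unweighted mean of the three reported class IoUs rounds to 0.991, which agrees with the reported validation mIoU.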

Pipeline

1. Camera frame: 1278 × 958
2. Segmentation: U-Net / ResNet-34
3. Obstacle mask: 3-class labels
4. Box detection: connected components
5. Kalman / SORT: IoU + Hungarian
6. Tracked obstacles: ID + trajectory

The segmenter labels every pixel as obstacle, water, or sky. Because MaSTr1325's obstacle class includes shoreline and piers, a geometric filter is needed to isolate floating obstacles. Two modes are used, depending on the scene:

  • enclosed keeps only blobs with open water directly above them, the defining signature of a fully floating obstacle (e.g. a buoy in open water).
  • adjacent keeps blobs that border water on any side, which also catches vessels that sit partially against a coastline.

Both modes drop sky-only regions, and both reject wide coast bands via a max-width fraction threshold.

Legend: obstacle, water, sky.
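The two filter modes can be sketched on a small label grid as follows. This is an illustrative pure-Python version, not the project's code: the class ids, the 4-connectivity, the exact "water directly above" test (every column's topmost blob pixel must have water right above it), and the default `max_width_frac` are all assumptions. A real pipeline would more likely use a vectorised connected-components routine.

```python
from collections import deque

OBSTACLE, WATER, SKY = 0, 1, 2  # assumed class ids, not confirmed by the source


def obstacle_blobs(mask):
    """Yield 4-connected obstacle components as lists of (row, col) pixels."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    for r0 in range(h):
        for c0 in range(w):
            if mask[r0][c0] != OBSTACLE or seen[r0][c0]:
                continue
            seen[r0][c0] = True
            queue, blob = deque([(r0, c0)]), []
            while queue:
                r, c = queue.popleft()
                blob.append((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < h and 0 <= nc < w
                            and not seen[nr][nc] and mask[nr][nc] == OBSTACLE):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            yield blob


def floating_boxes(mask, mode="enclosed", max_width_frac=0.5):
    """Return inclusive (x1, y1, x2, y2) boxes of blobs that pass the chosen filter."""
    h, w = len(mask), len(mask[0])
    boxes = []
    for blob in obstacle_blobs(mask):
        rows = [r for r, _ in blob]
        cols = [c for _, c in blob]
        x1, y1, x2, y2 = min(cols), min(rows), max(cols), max(rows)
        if (x2 - x1 + 1) / w > max_width_frac:
            continue  # too wide: treated as a coast band
        if mode == "enclosed":
            # Topmost blob pixel in each column must have water right above it.
            top = {}
            for r, c in blob:
                top[c] = min(r, top.get(c, r))
            keep = all(r > 0 and mask[r - 1][c] == WATER for c, r in top.items())
        else:  # "adjacent": any 4-neighbour of the blob is water
            keep = any(
                0 <= nr < h and 0 <= nc < w and mask[nr][nc] == WATER
                for r, c in blob
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            )
        if keep:
            boxes.append((x1, y1, x2, y2))
    return boxes
```

Under these assumptions, a blob whose top edge touches sky (for example a mast rising above the horizon) is rejected in enclosed mode but kept in adjacent mode, which mirrors the trade-off between the two modes.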

Training

Architecture: U-Net decoder with a ResNet-34 encoder pretrained on ImageNet, via segmentation_models_pytorch. ImageNet pretraining means the encoder already extracts useful low-level features before seeing any maritime data, so ~1125 training images (85% of the dataset) are enough.

ResNet-34 (~21 M parameters) was chosen over larger models to meet the real-time throughput requirement of a USV perception stack. Also, U-Net's skip connections preserve the fine spatial detail needed for sharp water/obstacle boundaries at the pixel level.
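As a rough illustration, a model matching this description can be built with the `Unet` class from segmentation_models_pytorch. This is a sketch under assumptions: the keyword values are inferred from the text (ResNet-34 encoder, ImageNet weights, 3 classes), and `SEGMENTER_KWARGS` and `build_segmenter` are hypothetical names, not taken from the project.

```python
SEGMENTER_KWARGS = {
    "encoder_name": "resnet34",
    "encoder_weights": "imagenet",  # downloads pretrained weights on first use
    "in_channels": 3,
    "classes": 3,                   # obstacle, water, sky
}


def build_segmenter(**overrides):
    """Build the U-Net. The import is lazy so this module loads without the library installed."""
    import segmentation_models_pytorch as smp

    return smp.Unet(**{**SEGMENTER_KWARGS, **overrides})
```

Calling `build_segmenter()` returns a `torch.nn.Module` that outputs per-pixel logits with 3 channels. With the default encoder depth, input height and width need to be divisible by 32, which the 512 × 384 training resolution satisfies.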

Augmentation (train only):

  • Horizontal flip  p = 0.5
  • Brightness / contrast  p = 0.3
  • Hue / saturation / value  p = 0.2
  • Gaussian noise  p = 0.15

Training configuration:

  • Framework: PyTorch
  • Input size: 512 × 384
  • Batch size: 8
  • Epochs: 40 (~10 min on a single GPU)
  • Optimizer: AdamW
  • Learning rate: 3 × 10⁻⁴
  • Weight decay: 1 × 10⁻⁴
  • LR schedule: cosine annealing
  • Loss: cross-entropy
  • Train / val: 85 % / 15 % (fixed seed)
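To make the learning-rate schedule concrete, the sketch below evaluates the standard closed-form cosine annealing curve for the listed base rate and epoch count. It assumes the schedule is stepped once per epoch over all 40 epochs and decays to zero; the source states neither, so `ETA_MIN` and the per-epoch stepping are assumptions.

```python
import math

BASE_LR = 3e-4  # learning rate from the training settings
EPOCHS = 40     # epoch count from the training settings
ETA_MIN = 0.0   # assumption: the source does not state a floor


def cosine_lr(epoch, base_lr=BASE_LR, t_max=EPOCHS, eta_min=ETA_MIN):
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))


for epoch in (0, 10, 20, 30, 40):
    print(epoch, f"{cosine_lr(epoch):.2e}")
```

In PyTorch, the equivalent setup would presumably pair `torch.optim.AdamW` (lr 3e-4, weight decay 1e-4) with `torch.optim.lr_scheduler.CosineAnnealingLR` and `T_max=40`. Note that the rate is still close to its maximum during the first few epochs.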

Validation mIoU plateaus above 0.99 by epoch 25. The dip at epoch 7 is most likely an artifact of the cosine LR schedule, where the rate is still near its peak, interacting with aggressive augmentation early in training.

Segmentation

Predictions blended over MaSTr1325 validation images (red = obstacle, blue = water, pale blue = sky).

Coastal scene: shoreline correctly labeled obstacle
Open water: open water with sky/water boundary
Marina: coast and boat

Obstacle Detection Filter

Raw segmentation includes the coastline. The enclosed filter keeps only blobs that have open water directly above them, isolating floating obstacles.

All detections: no filter, coast band dominates
Enclosed mode: floating obstacles only (no box in this case)

Tracking

SORT tracker with per-track Kalman filter. Each track is assigned a stable color-coded ID. min-hits=3 suppresses single-frame detections; max-age=20 keeps a track alive through up to 20 missed frames.
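The association and track-lifecycle rules can be sketched as below. Several simplifications are assumed: the Kalman predict/update step is omitted, the IoU gate `iou_min=0.3` is a placeholder, the detection that creates a track is counted as its first hit, and a brute-force search over permutations stands in for the Hungarian algorithm (it finds the same maximum-IoU assignment but only scales to a handful of boxes).

```python
from itertools import permutations

MIN_HITS, MAX_AGE = 3, 20  # values from the text


def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(tracks, dets, iou_min=0.3):
    """One-to-one assignment maximising total IoU; returns [(track_idx, det_idx)]."""
    if not tracks or not dets:
        return []
    swap = len(tracks) > len(dets)
    small, large = (dets, tracks) if swap else (tracks, dets)
    best, best_score = (), -1.0
    for perm in permutations(range(len(large)), len(small)):
        score = sum(iou(small[i], large[j]) for i, j in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    pairs = [(j, i) if swap else (i, j) for i, j in enumerate(best)]
    # Reject assigned pairs that overlap too little to be the same object.
    return [(t, d) for t, d in pairs if iou(tracks[t], dets[d]) >= iou_min]


class Track:
    """Lifecycle bookkeeping only; the per-track Kalman filter is left out of this sketch."""

    def __init__(self, track_id, box):
        self.id, self.box = track_id, box
        self.hit_streak, self.misses = 1, 0

    def update(self, box):
        self.box = box
        self.hit_streak += 1
        self.misses = 0

    def mark_missed(self):
        self.hit_streak = 0
        self.misses += 1

    @property
    def confirmed(self):
        return self.hit_streak >= MIN_HITS  # only confirmed tracks are reported

    @property
    def dead(self):
        return self.misses > MAX_AGE        # survives up to MAX_AGE missed frames
```

In a full tracker, `tracks` passed to `match` would be the Kalman-predicted boxes for the current frame, unmatched detections would spawn new `Track` objects, and unmatched tracks would call `mark_missed()`.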

With Kalman / SORT: stable IDs, smoothed boxes
Raw detector only: flickering, no IDs
Tracking frame (MODD2): sailboat (ID 2), dinghy (ID 35), distant landmass + ship (ID 1), artifact (ID 14).

IMU Motion Compensation

MODD2 provides per-frame IMU Euler angles (roll, pitch, yaw in radians, body-to-world convention). The pipeline implements rotation-induced homography compensation: between consecutive frames, the IMU rotation delta is converted to a 3 × 3 homography H = K · ΔR_cam · K⁻¹, where K is the camera intrinsic matrix and ΔR_cam is the inter-frame rotation expressed in the camera frame, and each track's Kalman-predicted box is warped by H before IoU matching. This corrects for the apparent obstacle displacement caused by camera rotation, which is typically induced by waves and vessel maneuvers.
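A minimal sketch of the warp step, assuming a zero-skew pinhole camera and taking ΔR_cam as the rotation that maps previous-frame camera coordinates to current-frame camera coordinates. The intrinsics in the usage note are placeholders rather than MODD2's calibration, the helper names are hypothetical, and plain lists are used only to keep the example dependency-free.

```python
import math


def matmul3(a, b):
    """Product of two 3 × 3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def rotation_homography(fx, fy, cx, cy, delta_r_cam):
    """H = K · ΔR_cam · K⁻¹ for a pinhole camera with zero skew."""
    k = [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]
    k_inv = [[1.0 / fx, 0.0, -cx / fx], [0.0, 1.0 / fy, -cy / fy], [0.0, 0.0, 1.0]]
    return matmul3(matmul3(k, delta_r_cam), k_inv)


def warp_point(h, u, v):
    """Apply H to pixel (u, v) in homogeneous coordinates and dehomogenise."""
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    return x / w, y / w


def warp_box(h, box):
    """Warp the four corners of (x1, y1, x2, y2) and return their axis-aligned bounds."""
    x1, y1, x2, y2 = box
    pts = [warp_point(h, u, v) for u, v in ((x1, y1), (x2, y1), (x2, y2), (x1, y2))]
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    return min(xs), min(ys), max(xs), max(ys)


def rot_y(theta):
    """Rotation about the camera y axis (assumed frame: x right, y down, z forward)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
```

For example, with placeholder intrinsics `fx = fy = 1000` and a principal point at the centre of a 1278 × 958 frame, `rotation_homography(1000.0, 1000.0, 639.0, 479.0, rot_y(0.004))` moves the principal point horizontally by `fx · tan(0.004)`, roughly 4 pixels, while an identity ΔR_cam leaves every box unchanged.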

Camera-to-IMU alignment (R_cam→IMU) was computed from the MODD2 calibration sequences: ground-plane PCA on stereo point clouds gives the camera's vertical axis; combined with flat-ground IMU readings, a two-vector alignment yields the 3 × 3 calibration rotation. The result shows that MODD2's IMU uses an NWU convention (X = bow, Y = port, Z = up).

Without IMU compensation: constant-velocity Kalman predictions only (kope67, adjacent mode)
With IMU compensation: predicted positions warped by H = K·ΔR·K⁻¹ before IoU matching (kope67, adjacent mode)

Conclusion: MODD2's choppy sequences contain only coastline imagery, and its floating-obstacle sequences have almost no camera motion (≤ 0.004 rad/frame). On the sailboat sequence, IMU compensation even slightly hurt tracking (33 vs 30 IDs, shorter average lifetime), because a near-identity homography adds more noise than it removes.

Limitations