USV Obstacle Perception & Tracking

Overview

An end-to-end pipeline for detecting and tracking floating obstacles from the camera of an unmanned surface vehicle (USV). The design splits the problem in two: a learned segmenter (training data plus model) handles pixel-level classification, and a classical Kalman/SORT tracker handles temporal association.

The segmenter was trained on MaSTr1325 (1325 labeled coastal images, 3 classes). The tracker was evaluated on MODD2 stereo sequences; the full pipeline runs at ~70 FPS end-to-end on a single consumer GPU.

Val mIoU (MaSTr1325)	0.991
Obstacle IoU	0.979
Water IoU	0.996
Sky IoU	0.997
FPS end-to-end (consumer GPU)	~70

Pipeline

Camera frame (1278 × 958) → Segmentation (U-Net / ResNet-34) → Obstacle mask (3-class labels) → Box detection (connected components) → Kalman / SORT (IoU + Hungarian) → Tracked obstacles (ID + trajectory)

The segmenter labels every pixel as obstacle, water, or sky. Because MaSTr1325's obstacle class includes shoreline and piers, a geometric filter isolates floating obstacles. Two modes are used depending on the scene: enclosed keeps only blobs with open water directly above them (the defining signature of a fully floating obstacle, e.g. a buoy in open water); adjacent keeps blobs that border water on any side, which also catches vessels close to a coastline. Both modes drop sky-only regions and wide coast bands via a max-width fraction threshold.
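The two filter modes can be sketched roughly as follows. This is a minimal illustration, not the project's code: the label ids, the function names `label_blobs` and `floating_boxes`, the 0.5 width fraction, and the reading of "water directly above" as "the row just above the bounding box is all water" are all assumptions.

```python
from collections import deque

import numpy as np

OBSTACLE, WATER, SKY = 0, 1, 2  # assumed label ids, not taken from the source


def label_blobs(mask):
    """Label 4-connected components of a boolean mask with a BFS flood fill."""
    labels = np.zeros(mask.shape, dtype=np.int32)
    h, w = mask.shape
    count = 0
    for r0, c0 in zip(*np.nonzero(mask)):
        if labels[r0, c0]:
            continue  # pixel already belongs to an earlier blob
        count += 1
        labels[r0, c0] = count
        queue = deque([(r0, c0)])
        while queue:
            r, c = queue.popleft()
            for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if 0 <= rr < h and 0 <= cc < w and mask[rr, cc] and not labels[rr, cc]:
                    labels[rr, cc] = count
                    queue.append((rr, cc))
    return labels, count


def floating_boxes(seg, mode="enclosed", max_width_frac=0.5):
    """Return inclusive (x0, y0, x1, y1) boxes of obstacle blobs that pass the filter."""
    w = seg.shape[1]
    labels, count = label_blobs(seg == OBSTACLE)
    water = seg == WATER
    boxes = []
    for i in range(1, count + 1):
        blob = labels == i
        rows, cols = np.nonzero(blob)
        y0, y1, x0, x1 = rows.min(), rows.max(), cols.min(), cols.max()
        if (x1 - x0 + 1) / w > max_width_frac:
            continue  # too wide: treated as a coast band
        if mode == "enclosed":
            # Assumed reading: every pixel in the row just above the box is water.
            keep = y0 > 0 and bool(water[y0 - 1, x0:x1 + 1].all())
        else:
            # "adjacent": at least one 4-neighbour of the blob is water.
            near = np.zeros_like(blob)
            near[1:, :] |= blob[:-1, :]
            near[:-1, :] |= blob[1:, :]
            near[:, 1:] |= blob[:, :-1]
            near[:, :-1] |= blob[:, 1:]
            keep = bool((near & water).any())
        if keep:
            boxes.append((int(x0), int(y0), int(x1), int(y1)))
    return boxes
```

On a toy mask with a wide coast band, a buoy surrounded by water, and a boat whose top reaches into the sky, the enclosed mode keeps only the buoy while the adjacent mode keeps both the buoy and the boat.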

Mask legend: obstacle, water, sky

Training

Architecture: U-Net decoder with a ResNet-34 encoder pretrained on ImageNet, via segmentation_models_pytorch. ImageNet pretraining means the encoder already extracts useful low-level features before seeing any maritime data, so ~1125 training images (85% of the dataset) are enough.

ResNet-34 (~21 M parameters) was chosen over larger models to meet the real-time throughput requirement of a USV perception stack. Also, U-Net's skip connections preserve the fine spatial detail needed for sharp water/obstacle boundaries at the pixel level.

Augmentation (train only):

Horizontal flip p = 0.5
Brightness / contrast p = 0.3
Hue / saturation / value p = 0.2
Gaussian noise p = 0.15

Framework	PyTorch
Input size	512 × 384
Batch size	8
Epochs	40 (~10 min on single GPU)
Optimizer	AdamW
Learning rate	3 × 10⁻⁴
Weight decay	1 × 10⁻⁴
LR schedule	Cosine annealing
Loss	Cross-entropy
Train / val	85 % / 15 % (fixed seed)
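For reference, the cosine annealing schedule from the table can be written in closed form. The sketch below assumes a single cycle over all 40 epochs and a floor of zero; neither detail is stated in the source, and the function name `cosine_lr` is illustrative.

```python
import math


def cosine_lr(epoch, base_lr=3e-4, total_epochs=40, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch (single cycle, assumed floor)."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start of each of the 40 epochs.
schedule = [cosine_lr(e) for e in range(40)]
```

Under these assumptions the rate is still about 93% of its initial value at epoch 7 and reaches half of it at epoch 20.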

mIoU plateaus above 0.99 by epoch 25. The dip at epoch 7 is a cosine LR artifact interacting with aggressive augmentation early in training.
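As a reminder of what the metric measures, per-class IoU and mIoU can be computed from a confusion matrix as sketched below. Whether the project accumulates the matrix over the whole validation set or averages per image is not stated, so this shows only the basic definition; the function names are illustrative.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes=3):
    """Count (target, pred) label pairs; rows are ground truth, columns are predictions."""
    idx = target.ravel() * num_classes + pred.ravel()
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def per_class_iou(cm):
    """IoU per class: true positives over (predicted + actual - true positives)."""
    tp = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    # Classes that never occur get NaN so they can be skipped when averaging.
    return np.where(union > 0, tp / np.maximum(union, 1), np.nan)


def mean_iou(cm):
    """Unweighted mean of the per-class IoUs, ignoring absent classes."""
    return float(np.nanmean(per_class_iou(cm)))
```

The three per-class values reported in the overview (0.979, 0.996, 0.997) average to 0.991, which matches the reported validation mIoU.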

Obstacle Detection Filter

Raw segmentation includes the coastline. The enclosed filter keeps only blobs that have open water directly above them, isolating floating obstacles.

No filter (all detections): coast band dominates

Enclosed mode: floating obstacles only (no box in this case)

Tracking

SORT tracker with a per-track Kalman filter. Each track is assigned a stable color-coded ID. min-hits=3 requires a track to be matched in three frames before it is reported, which suppresses single-frame detections; max-age=20 keeps a track alive through up to 20 missed frames.
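The association step between predicted track boxes and new detections can be illustrated as follows. This sketch swaps the Hungarian algorithm for a brute-force search over permutations, which finds the same maximum-IoU assignment but is only practical for a handful of boxes. The 0.3 IoU gate and the function names are assumptions, not values from the project.

```python
from itertools import permutations


def iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, iou_min=0.3):
    """Match tracks to detections by maximising total IoU (brute-force stand-in)."""
    n_t, n_d = len(tracks), len(detections)
    score = [[iou(t, d) for d in detections] for t in tracks]
    # Enumerate every one-to-one pairing of the smaller set into the larger one.
    if n_t <= n_d:
        candidates = (list(zip(range(n_t), p)) for p in permutations(range(n_d), n_t))
    else:
        candidates = (list(zip(p, range(n_d))) for p in permutations(range(n_t), n_d))
    best_pairs, best_total = [], -1.0
    for pairs in candidates:
        total = sum(score[t][d] for t, d in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    # Discard pairings whose overlap is below the gate.
    matches = sorted((t, d) for t, d in best_pairs if score[t][d] >= iou_min)
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    unmatched_tracks = [t for t in range(n_t) if t not in matched_t]
    unmatched_dets = [d for d in range(n_d) if d not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```

In a SORT-style loop, matched tracks are updated with their detection, unmatched detections start new tentative tracks, and unmatched tracks age until they exceed max-age.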

With Kalman / SORT: stable IDs, smoothed boxes

Raw detector only: flickering, no IDs

MODD2 tracking frame: sailboat (ID 2), dinghy (ID 35), distant landmass + ship (ID 1), artifact (ID 14).

IMU Motion Compensation

MODD2 provides per-frame IMU Euler angles (roll, pitch, yaw in radians, body-to-world convention). The pipeline implements rotation-induced homography compensation: between consecutive frames, the IMU delta is converted to a 3 × 3 homography H = K · ΔR_cam · K⁻¹, and each track's Kalman-predicted box is warped by H before IoU matching. This corrects for the apparent obstacle displacement caused by camera rotation, which is most likely induced by waves and vessel maneuvers.
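A minimal sketch of the warp, assuming ΔR_cam is already expressed in the camera frame and maps previous-frame camera coordinates to current-frame ones. The intrinsics in the usage note are hypothetical placeholders (not MODD2's calibration), and the function names are illustrative.

```python
import numpy as np


def rotation_homography(K, dR):
    """Homography induced by a pure camera rotation: H = K @ dR @ inv(K)."""
    return K @ dR @ np.linalg.inv(K)


def warp_box(box, H):
    """Warp the four corners of (x0, y0, x1, y1) by H and return the enclosing box."""
    x0, y0, x1, y1 = box
    corners = np.array([
        [x0, y0, 1.0],
        [x1, y0, 1.0],
        [x1, y1, 1.0],
        [x0, y1, 1.0],
    ]).T
    warped = H @ corners
    warped = warped[:2] / warped[2]  # back from homogeneous coordinates
    return (
        float(warped[0].min()),
        float(warped[1].min()),
        float(warped[0].max()),
        float(warped[1].max()),
    )


def rot_y(theta):
    """Rotation about the camera's vertical (y) axis, used here as a toy delta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
```

With a hypothetical focal length of 1000 px and the principal point at the image centre, a rotation of 0.004 rad about the vertical axis moves a point at the principal point by f · tan(0.004), roughly 4 px horizontally. An identity ΔR leaves every box unchanged.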

Camera-to-IMU alignment (R_cam→IMU) was computed from the MODD2 calibration sequences: ground-plane PCA on stereo point clouds gives the camera's vertical axis; combined with flat-ground IMU readings, a two-vector alignment yields the 3 × 3 calibration rotation. The result shows that MODD2's IMU uses an NWU convention (X = bow, Y = port, Z = up).
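The calibration steps can be sketched as a plane fit followed by a two-vector alignment. The source does not say which alignment method was used; the version below uses the classic TRIAD construction as one common choice, and the second vector pair (for example a shared forward direction) is an assumption. The function names are illustrative.

```python
import numpy as np


def plane_normal(points):
    """Unit normal of the best-fit plane through an (N, 3) point cloud via PCA/SVD."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[-1]  # direction of least variance; its sign is ambiguous


def triad(v1_a, v2_a, v1_b, v2_b):
    """Rotation R with R @ v_a ≈ v_b, built from two non-parallel vector pairs."""
    def frame(p, q):
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        t1 = p / np.linalg.norm(p)
        t2 = np.cross(p, q)
        t2 = t2 / np.linalg.norm(t2)
        t3 = np.cross(t1, t2)
        return np.column_stack([t1, t2, t3])

    # Each frame is orthonormal, so the transpose acts as the inverse.
    return frame(v1_b, v2_b) @ frame(v1_a, v2_a).T
```

Here the first pair would be the vertical axis seen in the camera frame (from `plane_normal`, with its sign fixed by hand) and in the IMU frame (from the flat-ground readings). TRIAD trusts the first pair fully and uses the second only to fix the remaining rotation about it.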

Without IMU compensation: constant-velocity Kalman predictions only (kope67, adjacent mode)

With IMU compensation: predicted positions warped by H = K·ΔR·K⁻¹ before IoU matching (kope67, adjacent mode)

Conclusion: MODD2's choppy sequences contain only coastline imagery, and its floating-obstacle sequences have almost no camera motion (≤ 0.004 rad/frame). On the sailboat sequence, IMU compensation even slightly hurt tracking (33 vs 30 IDs, shorter average lifetime), because a near-identity homography adds more noise than it removes.

Limitations

Single obstacle class. Boats, buoys, piers, and coastline share one label, so instance separation is purely geometric and overlapping vessels merge into one box.
Detection range. Frames are downscaled to 512 × 384 for inference. Distant small obstacles may not survive the resize. Running at higher resolution causes the model to hallucinate obstacles on water texture, because it was trained at 512 × 384, so a real fix requires retraining at higher resolution.
Constant-velocity model. The Kalman filter assumes linear motion, but the USV's own movement adds apparent velocity to every target. IMU rotation compensation is implemented (see above), but its benefit is limited by the dataset's lack of floating-obstacle sequences with significant camera motion.
IoU-only data association. Track crossings can swap object IDs. Using appearance embeddings (DeepSORT) could improve this.
In-distribution evaluation. The validation mIoU of 0.991 is measured on a held-out split of MaSTr1325, so it shares the same sensor, lighting, and general conditions as the training data. Night, glare, and fog are untested.