Continual Learning for Robust Video Segmentation of Robot-Assisted Surgical Tool
Gokul Adethya, Nitish N, Raghavan Balanathan, Sitara K
Abstract
Robust identification and segmentation of surgical tools in robot-assisted procedures are critical for safeguarding patient safety and driving progress toward fully automated surgeries. However, deep learning models often fail to handle the challenging visual conditions of real-world surgical scenes, including distortions such as bleeding, smoke, and low illumination. Conventional training on limited, high-quality data makes adaptation to diverse surgical conditions difficult, and standard adaptation approaches often cause catastrophic forgetting or violate privacy by requiring access to sensitive historical data. We propose a framework built on Domain-Incremental Continual Learning (CL) that achieves robust surgical tool segmentation while preserving data privacy across evolving domains. Our approach uses the Segment Anything Model 2 (SAM2) as a baseline, employing parameter-efficient Low-Rank Adaptation (LoRA) to facilitate adaptation within the CL framework. This enables sequential, privacy-preserving learning across domains and mitigates forgetting. We evaluate our methodology on challenging endoscopic video datasets containing various distortions, measuring performance with standard segmentation metrics (mIoU, DSC) and assessing knowledge retention via forgetting analysis. Our results demonstrate that the K-Means-based CL method achieves high performance across several learned domains, representing a significant advance toward reliable, adaptable, and privacy-conscious computer vision systems for real-world surgical applications.
Methodology
Domain Identification and Adaptive LoRA Loading (K-Means + CLIP)
Instead of adapting a single set of parameters over time, this method trains an individual LoRA adapter for each domain encountered during the learning phase. At inference, the domain of the input video is first identified using CLIP embeddings and K-Means clustering, and the corresponding adapter is loaded into the base segmentation model.
Training Phase:
- Each domain gets a dedicated LoRA adapter trained independently.
- Initialization can either be random or derived from the previous domain's adapter.
- The segmentation loss is optimized per domain.
- Optional knowledge distillation uses the previous domain’s adapter as a teacher.
- Each trained adapter is stored separately, preserving domain-specific expertise.
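A minimal sketch of this training loop is given below. It assumes a SAM2 wrapper exposing a hypothetical load_lora_state_dict method and LoRA parameter names containing "lora" (as in common PEFT implementations); it illustrates the scheme above, not the authors' implementation.

```python
# Minimal sketch of per-domain LoRA training with optional distillation.
# The model wrapper and load_lora_state_dict are assumed/hypothetical.
import copy
import torch
import torch.nn.functional as F

def train_domain_adapter(base_model, domain_loader, prev_adapter=None,
                         distill_weight=0.0, epochs=10, lr=1e-4):
    model = copy.deepcopy(base_model)
    if prev_adapter is not None:
        # Optional warm start from the previous domain's adapter.
        model.load_lora_state_dict(prev_adapter)

    teacher = None
    if distill_weight > 0 and prev_adapter is not None:
        # Frozen copy of the previous-domain model serves as the teacher.
        teacher = copy.deepcopy(base_model)
        teacher.load_lora_state_dict(prev_adapter)
        teacher.eval()

    # Only LoRA parameters are trained; the SAM2 backbone stays frozen.
    lora_params = [p for n, p in model.named_parameters() if "lora" in n]
    optimizer = torch.optim.AdamW(lora_params, lr=lr)

    for _ in range(epochs):
        for frames, masks in domain_loader:
            logits = model(frames)
            loss = F.binary_cross_entropy_with_logits(logits, masks)
            if teacher is not None:
                with torch.no_grad():
                    teacher_logits = teacher(frames)
                # Distillation term keeps predictions close to the
                # previous domain's adapter.
                loss = loss + distill_weight * F.mse_loss(logits, teacher_logits)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Store this domain's adapter separately for later selection.
    return {n: p.detach().cpu() for n, p in model.named_parameters() if "lora" in n}
```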
Inference Phase:
- Frame-level embeddings are extracted using a pre-trained CLIP model.
- Embeddings are averaged per video and clustered to compute domain anchor embeddings.
- A test sequence’s embedding is matched to the closest anchor.
- The corresponding LoRA adapter is dynamically loaded into the model.
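This pipeline can be sketched as follows. The clip package and scikit-learn KMeans are real libraries; the adapter store and the load_lora_state_dict call are hypothetical, matching the training sketch above.

```python
# Sketch of inference-time domain identification with CLIP + K-Means.
import numpy as np
import torch
import clip  # https://github.com/openai/CLIP
from sklearn.cluster import KMeans

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def video_embedding(frames):
    """Average normalized frame-level CLIP embeddings (frames: PIL images)."""
    with torch.no_grad():
        batch = torch.stack([preprocess(f) for f in frames]).to(device)
        embs = clip_model.encode_image(batch)
        embs = embs / embs.norm(dim=-1, keepdim=True)
    return embs.mean(dim=0).float().cpu().numpy()

def fit_anchors(per_video_embeddings, n_domains):
    """Cluster per-video embeddings into one anchor per domain (training time)."""
    km = KMeans(n_clusters=n_domains, n_init=10)
    km.fit(np.stack(per_video_embeddings))
    return km.cluster_centers_

def select_adapter(test_frames, anchors, adapters, model):
    """Match a test sequence to its closest anchor and load that adapter."""
    emb = video_embedding(test_frames)
    domain_id = int(np.argmin(np.linalg.norm(anchors - emb, axis=1)))
    model.load_lora_state_dict(adapters[domain_id])  # assumed API
    return domain_id
```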
This approach transforms the challenge of continual learning into a domain recognition and parameter selection problem, offering robustness to domain shifts and avoiding typical drawbacks of shared-parameter models.
Observations
Continual Learning Strategy Evaluation and Forgetting Analysis
The core challenge in continual learning is catastrophic forgetting, where the model degrades on earlier domains after learning new ones. To assess this, mean Dice Similarity Coefficient (DSC) and mean IoU (mIoU) were tracked on every previously seen domain as new ones were added in sequence:
Smoke → Blood → Low Brightness → Background Change → Regular
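For concreteness, the signed scores quoted below can be read under the following convention (an assumed reading, not necessarily the paper's exact definition): let $a_{t,i}$ be the metric on domain $i$ after the model has been trained through the $t$-th domain in the sequence. The reported change for domain $i$ is then

```latex
% Signed retention score (assumed convention):
% a_{t,i} = metric on domain i after training sequentially through domain t.
\Delta_{t,i} = a_{t,i} - a_{i,i}, \qquad t \ge i
```

Negative values indicate forgetting; values near zero or slightly positive indicate retention or even backward transfer.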
Key Findings:
1. Naive Sequential Fine-tuning
- Exhibited significant forgetting, especially on early domains.
- Example: the Smoke score changed by -0.27 after training on Background Change.
- Adapts well to recent domains but often overfits, hurting generalization.
- Shows minimal forward knowledge transfer to new tasks.
2. Learning without Forgetting (LwF)
- Demonstrated better retention than the Naive baseline.
- Example: the score change on Smoke shrank to -0.106 (vs. -0.27 for the Naive baseline).
- Uses distillation against the previous model to maintain old knowledge (see the loss sketch after this list), but may slow adaptation to new domains.
3. K-Means + CLIP Adapter Selection (KM)
- Showed the least forgetting, with scores often near zero or slightly positive.
- Example: +0.021 on Smoke after Low Brightness training.
- Uses separate adapters for each domain, preventing knowledge overwriting.
- Exhibits better forward generalization due to CLIP-guided adapter selection.
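For reference, the LwF objective in finding 2 can be sketched as a segmentation loss plus a distillation penalty toward the frozen previous-step model; $\lambda$ is an assumed trade-off weight, and the exact distillation loss used is not specified here:

```latex
\mathcal{L}_{\mathrm{LwF}}
  = \mathcal{L}_{\mathrm{seg}}\big(f_{\theta_t}(x),\, y\big)
  + \lambda\, \mathcal{L}_{\mathrm{KD}}\big(f_{\theta_t}(x),\, f_{\theta_{t-1}}(x)\big)
```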
The K-Means + CLIP strategy thus uses domain-aware inference to maintain high performance across tasks, avoiding the instability of sequential fine-tuning.