Gokul Adethya: Personal Website

Continual Learning for Robust Video Segmentation of Robot-Assisted Surgical Tool

Abstract

Robust identification and segmentation of surgical tools in robot-assisted procedures are critical for safeguarding patient safety and driving progress toward fully automated surgeries. However, deep learning models often fail to handle challenging visual conditions of real-world surgical scenes, including distortions like over-bleeding, smoke, and low illumination. Conventional training on limited, high-quality data makes adaptation to diverse surgical conditions difficult, with standard approaches often causing catastrophic forgetting and violating privacy due to the need for sensitive historical data.

We propose a new framework that applies Domain-Incremental Continual Learning (CL) to achieve robust surgical tool segmentation while preserving data privacy across evolving domains. Our approach utilizes the Segment Anything Model 2 (SAM2) as a baseline, employing parameter-efficient Low-Rank Adaptation (LoRA) to facilitate adaptation within the CL framework. This enables sequential, privacy-preserving learning across domains and mitigates forgetting.

We evaluate our methodology on challenging endoscopic video datasets containing various distortions, measuring performance using standard segmentation metrics (mIoU, DSC) and assessing knowledge retention via forgetting analysis. Our results demonstrate that the K-Means based CL method achieves high performance across several learned domains, presenting a significant advancement towards reliable, adaptable, and privacy-conscious computer vision systems for real-world surgical applications.

Methodology

Domain Identification and Adaptive LoRA Loading (K-Means + CLIP)

Instead of adapting a single set of parameters over time, this method trains individual LoRA adapters for each domain encountered during the learning phase. During inference, the domain of the input video is first identified using CLIP embeddings and K-Means clustering, and the corresponding adapter is loaded into the base segmentation model for inference.

Training Phase:

Each domain gets a dedicated LoRA adapter trained independently.
Initialization can either be random or derived from the previous domain's adapter.
The segmentation loss is optimized per domain.
Optional knowledge distillation uses the previous domain’s adapter as a teacher.
Each trained adapter is stored separately, preserving domain-specific expertise.

Inference Phase:

Frame-level embeddings are extracted using a pre-trained CLIP model.
Embeddings are averaged per video and clustered to compute domain anchor embeddings.
A test sequence’s embedding is matched to the closest anchor.
The corresponding LoRA adapter is dynamically loaded into the model.

This approach transforms the challenge of continual learning into a domain recognition and parameter selection problem, offering robustness to domain shifts and avoiding typical drawbacks of shared-parameter models.
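The inference steps above can be sketched in a few lines. This is a minimal illustration, not the project's code: the 3-dimensional vectors stand in for CLIP embeddings, the anchors are made up, and cosine similarity is an assumed choice for "closest" (the text does not name the distance).

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors (e.g. per-frame embeddings)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_adapter(frame_embeddings, anchors):
    """Return the domain whose anchor is most similar to the video-level embedding."""
    video_embedding = mean_vector(frame_embeddings)
    return max(anchors, key=lambda d: cosine_similarity(video_embedding, anchors[d]))

# Hypothetical anchors; real CLIP embeddings have hundreds of dimensions.
anchors = {
    "smoke": [0.9, 0.1, 0.0],
    "blood": [0.1, 0.9, 0.0],
    "low_brightness": [0.0, 0.1, 0.9],
}
test_frames = [[0.2, 0.8, 0.1], [0.0, 1.0, 0.0]]
chosen = select_adapter(test_frames, anchors)
print(chosen)  # name of the adapter that would be loaded
```

The returned domain name would then be used as the key for loading the stored LoRA adapter.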

Observations

Continual Learning Strategy Evaluation and Forgetting Analysis

The core challenge in continual learning is catastrophic forgetting, where the model degrades on earlier domains after learning new ones. To assess this, performance metrics like mean Dice Similarity Coefficient (DSC) and IoU were tracked across domains as new ones were added in sequence:

Smoke → Blood → Low Brightness → Background Change → Regular

Key Findings:

1. Naive Sequential Fine-tuning

Exhibited significant forgetting, especially on early domains.
Example: Performance on Smoke dropped by 0.27 after Background Change training.
Adapts well to recent domains but often overfits, hurting generalization.
Shows minimal forward knowledge transfer to new tasks.

2. Learning without Forgetting (LwF)

Demonstrated better retention than the Naive baseline.
Example: Forgetting on Smoke was reduced to -0.106.
Uses distillation to maintain old knowledge but may slow new domain adaptation.

3. K-Means + CLIP Adapter Selection (KM)

Showed the least forgetting, with scores often near zero or slightly positive.
Example: +0.021 on Smoke after Low Brightness training.
Uses separate adapters for each domain, preventing knowledge overwriting.
Exhibits better forward generalization due to CLIP-guided adapter selection.

This strategy uses domain-aware inference to maintain high performance across tasks and avoids the instability of sequential fine-tuning approaches.
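A forgetting score of the kind quoted above can be computed as the change in a domain's score between the step where it was learned and a later step. The sketch below assumes that definition and the sign convention used in the findings (negative means forgetting); the scores are made-up placeholders, not results from this study.

```python
def forgetting(history, domain):
    """Score change on `domain` between the step it was learned and the final step.

    `history` is a list of (trained_domain, {domain: score}) entries in training order.
    Negative values mean the model got worse on `domain` (forgetting).
    """
    learned_at = next(i for i, (name, _) in enumerate(history) if name == domain)
    score_when_learned = history[learned_at][1][domain]
    final_score = history[-1][1][domain]
    return final_score - score_when_learned

# Hypothetical DSC values, for illustration only.
history = [
    ("smoke", {"smoke": 0.80}),
    ("blood", {"smoke": 0.70, "blood": 0.75}),
    ("low_brightness", {"smoke": 0.60, "blood": 0.72, "low_brightness": 0.78}),
]
for name in ("smoke", "blood", "low_brightness"):
    print(name, round(forgetting(history, name), 3))
```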

WanDB Report

A Study on Regularization-Based Continual Learning Methods for Indic ASR

Abstract

India's linguistic diversity challenges inclusive Automatic Speech Recognition (ASR) system development. Traditional multilingual models, requiring simultaneous access to all language data, are impractical due to sequential data arrival and privacy constraints.

Continual Learning (CL) enables models to learn new languages sequentially without catastrophically forgetting prior knowledge. This paper investigates CL for ASR on Indian languages using a subset of the IndicSUPERB benchmark.

We employ a Conformer-based hybrid RNN-T/CTC model, initially pretrained on Hindi, which is subsequently trained incrementally on eight additional Indian languages, for a sequence of nine languages in total.

We evaluate three prominent regularization and distillation-based CL strategies: Elastic Weight Consolidation (EWC), Memory Aware Synapses (MAS), and Learning without Forgetting (LwF), chosen for their suitability in no-replay, privacy-conscious scenarios.

Performance is analyzed using Word Error Rate (WER) for both RNN-T and CTC paths on clean/noisy data, and knowledge retention via Backward Transfer. We explore varying training epochs (1, 2, 5 and 10) per task.

Results, compared against naive fine-tuning, demonstrate CL's efficacy in mitigating forgetting for scalable ASR in diverse Indian languages under realistic constraints.
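For readers unfamiliar with the two metrics, the sketch below computes WER as word-level edit distance divided by reference length, and Backward Transfer as the average change on earlier tasks after the final task. The sign of BWT is an assumption: it is flipped here so that higher is better for an error metric, which matches how "worse BWT" is used in the observations. All numbers are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def backward_transfer(wer):
    """wer[i][j] is the WER on task j after training through task i (j <= i).

    Assumed convention: positive means earlier tasks improved, negative means forgetting.
    """
    last = len(wer) - 1
    return sum(wer[j][j] - wer[last][j] for j in range(last)) / last

print(word_error_rate("a b c d", "a x c"))  # one substitution, one deletion
print(word_error_rate("a", "x y z"))        # insertions can push WER above 1.0

toy_wer = [[0.30], [0.40, 0.35], [0.50, 0.45, 0.25]]  # hypothetical values
print(backward_transfer(toy_wer))
```

Because insertions count as errors, WER is not bounded by 1.0, which is why staying below 1.0 is a meaningful threshold in the analysis that follows.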

Observations

CTC Benchmarking

As shown in Figure 1, the average WER across tasks reveals a clear ranking among methods. LwF achieves the best overall performance, followed by EWC, then MAS, with naive fine-tuning performing the worst. This ranking is particularly evident in short and medium task horizons. For longer sequences, however, the performance gap between methods narrows considerably. Naive fine-tuning, in particular, produces the highest WER maxima across tasks. When analyzing backward transfer (BWT), MAS performs best in short sequences, while LwF excels in medium-length tasks. For longer sequences, both MAS and LwF converge to similar average BWT values, whereas EWC and naive fine-tuning fall behind.

RNN-T Benchmarking

Figure 9 shows that RNN-T consistently outperforms CTC in WER across all continual learning strategies. Among these, EWC achieves the lowest WER across task lengths, demonstrating strong performance retention on the current task. However, this benefit comes at a cost: EWC exhibits the worst BWT of all methods, even lower than that of naive fine-tuning, indicating substantial forgetting. MAS shows some improvement in BWT for medium-length sequences, but for longer horizons, BWT scores deteriorate across all methods except EWC, eventually becoming nearly indistinguishable.

General Comparison of CL Methods under Noisy Settings

In noisy conditions (Figure 2), both LwF and MAS outperform EWC and the naive baseline in BWT, suggesting better retention of prior knowledge. Interestingly, noise appears to improve backward transfer, likely due to regularization effects. However, this improvement comes with a trade-off: WER increases, and models perform better on clean audio in absolute terms. This contrast indicates that noise can enhance stability, by reducing forgetting, while simultaneously impairing plasticity, by diminishing learning precision, which is reflected in the higher WER.

WER Performance Analysis

Figures 3 and 4 present WER trends over increasing task lengths. Evaluations are averaged over the last two and current tasks, categorized as short (1–3), medium (1–6), and long (1–9). In general, models perform better with clean data. Among the methods, LwF consistently maintains WER below 1.0, with high stability indicated by narrow shaded variance regions.

Interestingly, the upper bounds of noisy WER for LwF are comparable to the maxima seen under clean conditions. This can be attributed to its distillation-based loss, which prevents overfitting to noisy inputs by anchoring the model to previous predictions. MAS follows a similar pattern, though with slightly lower stability. EWC occasionally achieves better minimum WERs, particularly for short tasks, but continues to show poor BWT. The naive method performs surprisingly well in short sequences but fails to retain knowledge over longer horizons. Overall, LwF demonstrates the effectiveness of knowledge distillation in maintaining a balance between acquiring new knowledge and retaining previous learning. For longer sequences, average WER tends to decline, possibly due to simpler language characteristics in later tasks.

EWC Ablation Studies

In Figure 5, we examine the impact of different regularization strengths in EWC by testing λ_EWC ∈ {5, 10}. While both values yield similar outcomes, λ_EWC = 10 leads to slightly better WER in medium and long tasks, though the benefit is minimal in short tasks. BWT trends (Figure 8) for both values remain close to those of the naive baseline, suggesting limited ability to retain performance on earlier tasks. Additionally, results from epoch-wise ablation (Figure 11) show that increasing training epochs reduces WER, with the best results achieved at epoch 10. However, BWT steadily declines with more epochs (Figure 14), confirming the stability-plasticity trade-off: improved learning on new tasks often leads to increased forgetting of previous ones.
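To make the role of λ_EWC concrete, here is a minimal sketch of the standard EWC penalty, (λ/2) · Σ F_i (θ_i − θ*_i)², which is added to the new-task loss. The parameter values and Fisher estimates are placeholders; how the Fisher information was estimated in this study is not stated here.

```python
def ewc_penalty(params, anchor_params, fisher, lam):
    """Quadratic penalty that pulls parameters toward their values after the previous task.

    params:        current parameter values (flat list)
    anchor_params: parameter values saved after the previous task
    fisher:        per-parameter importance (diagonal Fisher estimate)
    lam:           regularization strength (lambda_EWC)
    """
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )

# Placeholder numbers: only the first parameter has moved.
params = [1.0, 2.0]
anchor_params = [0.0, 2.0]
fisher = [0.5, 4.0]

for lam in (5, 10):  # the two strengths compared in the ablation
    print(lam, ewc_penalty(params, anchor_params, fisher, lam))
```

Doubling λ doubles the penalty for the same drift, so a larger value trades plasticity for stability.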

LwF Ablation Studies

As shown in Figure 6, adjusting the distillation weight α_KD significantly impacts LwF’s performance. A higher value of 0.5 severely limits the model’s ability to learn new tasks, resulting in WERs close to 1.0 across all horizons—worse than naive fine-tuning for short sequences. In contrast, α_KD = 0.1 strikes a better balance, achieving WER comparable to or better than naive fine-tuning while maintaining much stronger BWT. As shown in Figure 8, the 0.5 configuration yields the highest BWT, primarily because the model barely updates and effectively freezes previous knowledge. The 0.1 setting enables more meaningful learning while controlling forgetting.
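The effect of α_KD can be illustrated with a small sketch of a distillation-regularized loss. It assumes the common form task_loss + α_KD · KL(teacher ‖ student) on temperature-softened outputs; the exact weighting and temperature used in the experiments are not given above, so both are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the frozen teacher's distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def lwf_loss(task_loss, student_logits, teacher_logits, alpha_kd):
    """Assumed combination: new-task loss plus weighted distillation term."""
    return task_loss + alpha_kd * kd_loss(student_logits, teacher_logits)

teacher = [2.0, 0.0, -1.0]   # outputs of the model before learning the new task
student = [0.5, 1.5, -1.0]   # outputs of the model being trained
for alpha in (0.1, 0.5):
    print(alpha, lwf_loss(1.0, student, teacher, alpha))
```

With a larger α_KD the distillation term dominates, so gradient steps that move the student away from the teacher are penalized more heavily, consistent with the near-frozen behavior described for 0.5.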

Epoch-wise trends (Figures 10 and 14) are consistent with those observed in EWC. Increasing the epochs improves WER but worsens BWT.

MAS Ablation Studies

In Figure 7, we compare MAS with regularization weights α_ctx of 0.3 and 1.0. The stronger setting of 1.0 consistently achieves better WER and shows more stable variance across tasks. Its shaded performance region closely overlaps with that of naive fine-tuning, though with lower dispersion. When examining BWT (Figure 8), the 0.3 configuration performs better, matching LwF in retaining knowledge.

As with the other methods, MAS exhibits the stability-plasticity trade-off: increasing epochs (Figure 12) lowers WER but leads to worsening BWT (Figure 14). This consistent trend across methods emphasizes the fundamental challenge in continual learning of effectively balancing the acquisition of new information with the retention of existing knowledge.

WanDB Report

Medical Multi-Modal Fusion & Cross Modal Alignment

Summer internship report

Dataset Analysis and Performance Report

This project focuses on analyzing and improving the performance of models using a multimodal dataset comprising Electronic Health Records (EHR) data, including demographic information, ICU vitals, Chest X-rays (CXR), ECG signals, and clinical notes. The primary tasks include predicting readmission, mortality, and length of stay (LOS) for patients based on data available within 24 to 48 hours after admission.

Modalities and Tasks

EHR Demographics:
Includes categorical data like race and age, with age discretized into groups due to deidentification.
ICU Vitals:
A selection of 39 vital signs, time series data, and categorical events like procedures.
CXR:
Time series of chest X-ray images.
ECG:
12-lead ECG signals represented as time series.
Clinical Notes:
Series of notes, such as discharge and radiology notes.

The tasks involve:

Readmission Prediction:
Whether a patient will be readmitted within a month after discharge.
Mortality Prediction:
Predicting if the patient will survive or die.
Length of Stay Prediction:
Whether the stay will exceed 3 or 7 days.

Dataset and Class Imbalance

The dataset shows heavy class imbalance across all tasks and modalities. This imbalance persists even when considering the intersection of multiple modalities, leading to a significant portion of the data being unused.
The dataset was split into training, validation, and test sets with an approximately 80%-10%-10% split, maintaining the proportions of each modality combination and balancing the classes.

Experimentation Setup

Data Pipeline:
A heavily modified HADM pipeline handles data loading and interlinked modality fusion. PyTorch Lightning with Distributed Data Parallel (DDP) is used for training, with configurations managed via YAML files.
Metrics:
Evaluation metrics include binary classification metrics monitored across train, validation, and test sets. Performance is also assessed at the modality and modality combination levels.
Model Structure:
The model pipeline includes a modality encoder, a time series encoder, and a binary classification head. Class imbalance is addressed through oversampling and undersampling strategies, depending on the skewness of the class distribution.
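As an illustration of the oversampling side of class balancing, the sketch below duplicates randomly chosen minority samples until every class matches the largest one. It is a generic technique shown under simple assumptions (in-memory lists, a fixed seed), not the project's actual sampler.

```python
import random
from collections import Counter

def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until all classes are equally frequent."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

samples = list(range(10))
labels = [0] * 8 + [1] * 2          # 80/20 imbalance
balanced_samples, balanced_labels = oversample(samples, labels)
print(Counter(labels), Counter(balanced_labels))
```

Undersampling is the mirror image: majority-class samples are dropped until the counts match the smallest class.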

Observations from Literature

Cross-modal alignment is rarely performed or analyzed extensively in current research, and ECG data is often excluded or underutilized in multimodal tasks. Reported F1 scores are typically below 0.7, highlighting the challenge in improving model performance.

Experimentation Results

Multi-Modal Learning Framework

1. EHR:

a. GRU-D vs RNN:

Table 1: GRU-D vs RNN F1-score comparison.

MODEL	TEST F1 SCORE
GRU-D	0.6625
RNN	0.71844

RNN demonstrates superior performance compared to GRU-D, both in terms of F1 score and training time. RNN achieved a higher F1 score (0.71844 vs. 0.6625) and was more efficient, taking almost 40% less time to train. This indicates that the RNN model is both faster and more effective for EHR data.

b. RNN vs RNN-TS:

Table 2: RNN vs RNN-TS F1-score comparison.

MODEL	TEST F1 SCORE
RNN	0.71844
RNN-TS	0.71549

RNN-TS (with time series data) performed almost identically to the standard RNN (F1 score of 0.71549 vs. 0.71844). This suggests that incorporating time series data as positional embeddings does not significantly enhance the model's performance and may even slightly degrade it.

c. RNN Class Balance vs. Without Class Balance:

Table 3: RNN Class Balance comparison.

MODEL	TEST F1 SCORE
RNN (NO CLASS BALANCE)	0.71288
RNN (WITH CLASS BALANCE)	0.71844

Class balancing provided a marginal improvement in performance, boosting the F1 score from 0.71288 to 0.71844. This suggests that while class balancing helps, its impact is not substantial for the EHR dataset.

d. Training Time:

Table 4: GRU-D vs RNN-TS training time.

MODEL	TRAINING TIME (SECONDS)
GRU-D	135,230
RNN-TS	82,968

RNN-TS is significantly faster to train compared to GRU-D, indicating that the simpler RNN architecture is more efficient in this context, potentially due to the reduced complexity in handling time-series data compared to GRU-D.

2. CXR:

Table 5: Class balance and positional embedding comparison.

CONFIGURATION	TEST F1 SCORE
No class balance and positional embedding	0.56609
With class balance and positional embedding	0.66634
With class balance but no positional embedding	0.62633

Class balancing improved the model’s performance significantly, increasing the F1 score from 0.56609 to 0.66634. Additionally, adding positional embeddings further enhanced the performance (F1 score: 0.66634 vs. 0.62633), suggesting that positional embeddings in CXR data play a role in improving the model's capacity to capture temporal information.

3. Notes:

Table 6: Frozen vs unfrozen configurations.

CONFIGURATION	TEST F1 SCORE
No class balance (Frozen)	0.12199
Frozen (With class balance)	0.5004
Frozen (No positional embedding)	0.24544
Unfrozen (No class balance)	0.66085
Unfrozen (With class balance)	0.63192

Unfreezing the notes encoder and allowing the model to fine-tune the notes embeddings significantly improved performance, with the F1 score rising to 0.66085 (without class balance). Class balancing marginally reduced performance in the unfrozen model (0.63192). Frozen models performed poorly, highlighting the importance of fine-tuning. Additionally, positional embeddings play a significant role, as removing them drastically dropped the F1 score from 0.5004 to 0.24544.

4. ECG:

Table 7: Class balance impact on ECG performance.

CONFIGURATION	TEST F1 SCORE	AUROC
Without class balance	0.0	0.5197
With class balance	0.6236	0.52089

Without class balancing, the ECG model failed to predict more than one class, resulting in an F1 score of 0. However, class balancing greatly improved the performance, yielding an F1 score of 0.6236. This demonstrates that class imbalance severely impacts the ECG modality, and balancing is crucial for achieving meaningful predictions. Additionally, this highlights that AUROC is not a reliable metric in skewed predictions, as the AUROC for the unbalanced ECG model was still comparable despite the F1 score being 0.
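The gap between F1 and AUROC for a collapsed classifier is easy to reproduce on toy data. In the sketch below every score falls under the 0.5 decision threshold, so no positive is ever predicted and F1 is 0, while AUROC (computed from the ranking of scores) still lands near 0.5. The labels and scores are invented for illustration.

```python
def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auroc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy imbalanced set: 8 negatives, 2 positives, all scores below the threshold.
y_true = [0] * 8 + [1] * 2
scores = [0.10, 0.20, 0.30, 0.40, 0.15, 0.25, 0.35, 0.45, 0.30, 0.22]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("F1:", f1_score(y_true, y_pred))
print("AUROC:", auroc(y_true, scores))
```

AUROC depends only on how scores are ordered, not on where the threshold sits, so it cannot reveal that the thresholded predictions have collapsed to a single class.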

5. Total Comparison:

Table 8: Summary across modalities.

MODALITY	TEST F1 SCORE	AUROC SCORE
Notes	0.66085	0.62601
ECG	0.6236	0.52089
RNN-TS (EHR)	0.71549	0.77536
CXR	0.66634	0.52081

Among the single modalities, the EHR data using RNN-TS performed the best, with an F1 score of 0.71549 and AUROC of 0.77536. CXR also performed well, with an F1 score of 0.66634. Notes and ECG followed with slightly lower scores. F1 remains the primary measure of interest: the AUROC values for ECG and CXR both sit near 0.52 despite F1 scores above 0.62, so AUROC is less reliable on imbalanced datasets such as these.

6. Multimodality:

Table 9: Fusion method comparison.

FUSION METHOD	F1 SCORE
Sum Fusion of CXR + EHR	0.69789
Transformer Fusion of CXR + EHR	0.70182
Sum Fusion of CXR + EHR + Notes	0.45058
Transformer Fusion of CXR + EHR + Notes	0.66817
Sum Fusion of CXR + ECG + EHR + Notes	0.51967
Transformer Fusion of CXR + ECG + EHR + Notes	0.61993

The transformer-based fusion consistently outperforms the sum fusion method across all modality combinations. For instance, Transformer CXR + EHR (F1 score: 0.70182) slightly outperforms Sum CXR + EHR (F1 score: 0.69789). Similarly, Transformer CXR + ECG + EHR + Notes (F1 score: 0.61993) shows significant improvement over the sum fusion of the same modalities (F1 score: 0.51967). This indicates that transformer-based fusion is more effective in learning from the complex interactions between modalities.

Table 10: Fusion vs. single modality baselines.

FUSION/MODALITY	F1 SCORE
Transformer CXR + ECG + EHR + Notes	0.61993
Transformer CXR + Notes + ECG	0.55538
Transformer CXR + EHR + ECG	0.45439
Transformer CXR + EHR + Notes	0.66817
Transformer CXR + Notes	0.64509
Transformer Notes + ECG	0.58547
SINGLE MODALITY BASELINE
EHR	0.71549
CXR	0.66634
Notes	0.66085
ECG	0.62360

Although multimodal fusion using transformers improves performance in some cases, adding more modalities does not always result in better outcomes. For example, Transformer CXR + ECG + EHR + Notes (F1 score: 0.61993) performed worse than EHR alone (F1 score: 0.71549). This suggests that poorly aligned modality combinations can degrade performance. Ideally, a multimodal model should exceed the maximum F1 score of its individual modalities (e.g., EHR or CXR alone). This is not consistently observed here, indicating that proper modality alignment and carefully designed fusion are critical for achieving optimal performance. Transformer-based fusion still provides better results than sum fusion, especially when integrating complex and diverse modalities such as CXR, ECG, EHR, and Notes.
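The criterion stated above (a fused model should beat its best constituent modality) can be checked directly against the reported numbers. The short script below uses the F1 scores from Table 10; the helper name and the printed layout are illustrative.

```python
# F1 scores copied from Table 10.
single_f1 = {"EHR": 0.71549, "CXR": 0.66634, "Notes": 0.66085, "ECG": 0.62360}
fusion_f1 = {
    "CXR + ECG + EHR + Notes": 0.61993,
    "CXR + Notes + ECG": 0.55538,
    "CXR + EHR + ECG": 0.45439,
    "CXR + EHR + Notes": 0.66817,
    "CXR + Notes": 0.64509,
    "Notes + ECG": 0.58547,
}

def gain_over_best_member(fusion_name, fusion_score, single_scores):
    """Return (fusion F1 minus best single-modality F1 among its members, that member)."""
    members = fusion_name.split(" + ")
    best_member = max(members, key=lambda m: single_scores[m])
    return fusion_score - single_scores[best_member], best_member

report = {
    name: gain_over_best_member(name, score, single_f1)
    for name, score in fusion_f1.items()
}
for name, (gain, best) in report.items():
    print(f"{name:26s} vs {best:5s}: {gain:+.5f}")
```

Every gain is negative, which restates the observation that none of the listed transformer fusions surpasses its strongest single modality.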

WanDB Report

Surgical Motion Planning

Project Summary:

The project focuses on the development of an embodied robotic agent designed to execute high-level natural language surgical instructions. This system leverages a combination of Reinforcement Learning (RL), Imitation Learning, and a pre-trained Large Language Model (LLM) to function as a "planner." The agent can perform short-horizon skills across multiple surgical scenarios using a library of pre-trained policies. These policies are selected based on the scenario, with skills that are common across scenarios mapped under common primitives with textual labels.

Methodology:

1. Planning and Filtering: The LLM is responsible for planning by determining the sequence of actions required to execute a given surgical instruction. A filtering process similar to SayCan is used, which ranks skills based on both logical reasoning and geometric feasibility. The LLM generates possible action permutations, which are then filtered to a manageable number using a Chain-of-Thought (CoT) process, reducing the likelihood of errors and improving interpretability.

2. Action Scoring: Each action is scored based on two criteria: the LLM's planning score and the RL policy’s affordance score, which reflects the geometric feasibility of the action. The best action is selected based on these combined scores, with a fallback to the highest LLM score if all actions are deemed infeasible.

3. Execution: Once an action is selected, a human-in-the-loop is given the option to approve or modify the action before execution. The agent updates the scene and action status after each step, which informs subsequent actions.

4. Experimental Setups: The system is tested in two main environments: SurRoL and LapGym, each with specific tasks like retracting tissue, reaching a tumor, and picking and placing gauze. The experiments explore various strategies, including closed-loop, open-loop, and inner monologue setups, with scene and action updates. The DINO model is utilized for detecting objects in the surgical environment, such as gauze and blood, by generating bounding boxes around these targets based on textual inputs provided by the LLM. This integration allows for accurate, real-time detection of key elements, which is critical for the success of surgical tasks.
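The action-scoring rule from step 2 can be sketched as follows. The product of the two scores is a SayCan-style assumption, and the feasibility threshold, skill names, and score values are hypothetical; the text only states that the scores are combined and that the highest LLM score is the fallback when nothing is feasible.

```python
def select_action(candidates, feasibility_threshold=0.5):
    """Pick an action from (name, llm_score, affordance_score) tuples.

    Feasible actions are ranked by llm_score * affordance_score (assumed combination).
    If no action clears the threshold, fall back to the highest LLM score.
    """
    feasible = [c for c in candidates if c[2] >= feasibility_threshold]
    if feasible:
        return max(feasible, key=lambda c: c[1] * c[2])[0]
    return max(candidates, key=lambda c: c[1])[0]

# Hypothetical skills and scores.
candidates = [
    ("retract_tissue", 0.6, 0.9),
    ("reach_tumor", 0.8, 0.2),   # preferred by the LLM but not geometrically feasible
    ("pick_gauze", 0.3, 0.7),
]
print(select_action(candidates))

all_infeasible = [("retract_tissue", 0.4, 0.1), ("reach_tumor", 0.7, 0.2)]
print(select_action(all_infeasible))  # fallback path
```

In the described system the selected action would still pass through the human-in-the-loop approval step before execution.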

Key Features:

Human-in-the-loop Feedback: Ensures control over the actions performed, allowing for adjustments based on the requirements.

Chain-of-Thought (CoT): Improves planning efficiency and makes the process more interpretable for surgical tasks.

Scene Description Generation: Simplifies the scene context for the LLM, making the planning process more effective and relatable to real-world applications.

The project demonstrates the feasibility of integrating LLMs with RL, Imitation Learning, and DINO for complex surgical tasks, providing a robust foundation for further development in autonomous surgical systems.

Demo

SaSi: A Self-augmented and Self-interpreted Deep Learning Approach for Few-shot Cryo-ET Particle Detection

Description:

This project tackles the critical task of particle localization and classification in cryo-electron tomography (cryo-ET), focusing on real-world challenges such as limited and weakly annotated data. The study is conducted on both private datasets and the SHREC 2021 simulated dataset, aiming to establish robust baselines and explore advanced techniques to improve model performance. The goal is to enhance the accuracy and robustness of particle detection and classification, making it more effective for structural biology applications.

Methodology:

1. Supervised Learning: The approach begins by generating pseudo-strong labels from weak labels, converting them into segmentation masks by drawing spheres of minimum radius around the particle centroids. These masks serve as ground truth for training. The tomograms are divided into smaller subtomograms, which are fed into the model to generate voxel-level probabilities for each particle type. Key techniques include:

Balanced Sampling and Window Method: Ensuring that training data is evenly sampled and that particle centroids are accurately localized within smaller windows.
AugMix Modification: A tailored version of AugMix is applied to improve data augmentation, focusing on enhancing the diversity and robustness of the training samples.

2. Self-Supervised Learning: To further improve the model's performance, SimCLR is employed. This technique leverages contrastive learning to learn representations that are invariant to data augmentation, improving the model’s ability to generalize from few-shot data.

3. Post-Processing: After initial predictions, advanced post-processing techniques are applied to refine the results:

Connected Components Analysis: Used to identify and isolate individual particles from the segmentation masks, improving the accuracy of particle localization.
MeanShift Clustering: Applied to group detected particles based on their spatial coordinates, helping to better classify and localize particles within the tomogram.
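The label-generation and post-processing steps above form a round trip that can be sketched on a toy volume: centroids are expanded into sphere masks (pseudo-strong labels), and connected components recover one centroid per particle. This is a pure-Python illustration with assumed 6-connectivity and a made-up volume size and radius; it omits the network itself and the MeanShift step.

```python
from collections import deque

def make_sphere_mask(shape, particles, radius):
    """Build a labeled 3D mask with one sphere per (z, y, x, class_id) centroid."""
    depth, height, width = shape
    mask = [[[0] * width for _ in range(height)] for _ in range(depth)]
    for cz, cy, cx, class_id in particles:
        for z in range(max(0, cz - radius), min(depth, cz + radius + 1)):
            for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
                for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                    if (z - cz) ** 2 + (y - cy) ** 2 + (x - cx) ** 2 <= radius ** 2:
                        mask[z][y][x] = class_id
    return mask

def component_centroids(mask):
    """Return (class_id, centroid, voxel_count) for each 6-connected component."""
    depth, height, width = len(mask), len(mask[0]), len(mask[0][0])
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    seen, results = set(), []
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                label = mask[z][y][x]
                if label == 0 or (z, y, x) in seen:
                    continue
                queue, voxels = deque([(z, y, x)]), []
                seen.add((z, y, x))
                while queue:  # breadth-first flood fill over same-class neighbors
                    cz, cy, cx = queue.popleft()
                    voxels.append((cz, cy, cx))
                    for dz, dy, dx in steps:
                        nz, ny, nx = cz + dz, cy + dy, cx + dx
                        if (0 <= nz < depth and 0 <= ny < height and 0 <= nx < width
                                and (nz, ny, nx) not in seen
                                and mask[nz][ny][nx] == label):
                            seen.add((nz, ny, nx))
                            queue.append((nz, ny, nx))
                n = len(voxels)
                centroid = tuple(sum(v[i] for v in voxels) / n for i in range(3))
                results.append((label, centroid, n))
    return results

particles = [(2, 2, 2, 1), (2, 2, 7, 2)]   # hypothetical weak labels
mask = make_sphere_mask((5, 5, 10), particles, radius=1)
detections = component_centroids(mask)
print(detections)
```

On real predictions, touching particles of the same class would merge into one component, which is one reason a clustering step such as MeanShift is useful afterwards.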

WanDB Report

SAM-PM: Enhancing Video Camouflaged Object Detection

PDF

Code

bibtex

Abstract:

In the domain of large foundation models, the Segment Anything Model (SAM) has gained notable recognition for its exceptional performance in image segmentation. However, tackling the video camouflaged object detection (VCOD) task presents a unique challenge. Camouflaged objects typically blend into the background, making them difficult to distinguish in still images. Additionally, ensuring temporal consistency in this context is a challenging problem. As a result, SAM encounters limitations and falls short when applied to the VCOD task. To overcome these challenges, we propose a new method called the SAM Propagation Module (SAM-PM). Our propagation module enforces temporal consistency within SAM by employing spatio-temporal cross-attention mechanisms. Moreover, we exclusively train the propagation module while keeping the SAM network weights frozen, allowing us to integrate task-specific insights with the vast knowledge accumulated by the large model. Our method effectively incorporates temporal consistency and domain-specific expertise into the segmentation network with an addition of less than 1% of SAM's parameters. Extensive experimentation reveals a substantial performance improvement in the VCOD benchmark when compared to the most recent state-of-the-art techniques. Code and pre-trained weights are open-sourced at https://github.com/SpiderNitt/SAM-PM

Methodology
The proposed SAM-PM framework adapts the Segment Anything Model (SAM) for the Video Camouflaged Object Detection (VCOD) task. SAM-PM consists of two main components: the Temporal Fusion Mask Module (TFMM) and the Memory Prior Affinity Module (MPAM).

Temporal Fusion Mask Module (TFMM): This module enhances temporal information by integrating mask embeddings from multiple frames. It uses a spatio-temporal cross-attention mechanism to create temporally enriched mask embeddings.

Memory Prior Affinity Module (MPAM): MPAM utilizes the temporally infused mask embeddings from TFMM along with image embeddings from the current and previous frames. It applies affinity to strengthen temporal consistency in mask predictions.

SAM-PM operates in a semi-supervised manner, where only the first frame's ground truth mask is used for training. The framework keeps SAM's weights frozen and trains only the SAM-PM components, ensuring parameter efficiency with less than 1 million parameters.

During training and inference, SAM-PM updates its memory with new frames and their predicted masks while discarding outdated data to maintain temporal coherence.
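The memory update described above behaves like a bounded buffer. A minimal sketch, assuming first-in-first-out eviction and a fixed capacity (neither is specified in the text), with strings standing in for embeddings and masks:

```python
from collections import deque

class FrameMemory:
    """Keeps the most recent (frame_embedding, predicted_mask) pairs."""

    def __init__(self, capacity):
        # deque with maxlen drops the oldest entry automatically when full
        self._items = deque(maxlen=capacity)

    def update(self, frame_embedding, predicted_mask):
        self._items.append((frame_embedding, predicted_mask))

    def contents(self):
        return list(self._items)

memory = FrameMemory(capacity=2)   # capacity is a hypothetical setting
for t in range(4):
    memory.update(f"embedding_{t}", f"mask_{t}")
print(memory.contents())
```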

WanDB Report

Test-time adaptation for Optical Flow

Optical Flow Estimation

In this research, we focused on optimizing the existing RAFT (Recurrent All-Pairs Field Transforms) model for optical flow estimation using a multi-GPU and multi-node setup. The model was deployed in a distributed computing environment using Distributed Data Parallel (DDP) processing, facilitated by Lightning-Fabric. The primary aim of this setup was to efficiently scale the model's training to accommodate larger batch sizes without running into out-of-memory (OOM) errors, a common challenge in deep learning tasks with high computational demands like optical flow estimation. The RAFT model's architecture was not modified; instead, we concentrated on improving the scalability and memory handling to enhance computational efficiency.

To achieve these optimizations, a cluster of multiple GPUs and nodes was configured to maximize parallel processing capabilities. The use of DDP allowed us to distribute the workload across multiple GPUs and nodes, synchronizing their operations to ensure smooth and efficient model training. This setup provided the necessary computational resources to handle large-scale data while maintaining memory efficiency across devices.

Several key optimization techniques were employed to maximize the efficiency and scalability of the RAFT model in this distributed environment. First, Fairscale’s CPU offloading was integrated to shift some memory loads from the GPU to the CPU, thereby freeing up GPU resources for more critical computations. This technique helped mitigate memory bottlenecks, allowing for the handling of larger batch sizes.

Next, we applied mixed-precision training, which combines 16-bit and 32-bit floating point operations. This significantly reduced memory consumption without compromising the accuracy of the optical flow estimations.

Finally, we implemented activation checkpointing to further minimize memory usage. By only storing key intermediate results and recomputing others when necessary, we reduced the overall memory footprint during training, allowing for a higher batch size without OOM errors.

These optimizations, applied in tandem with a distributed multi-GPU, multi-node setup, enabled us to expand the computational capabilities of the RAFT model, ensuring more efficient training while preserving memory integrity.

We begin with an input image \( x \), from which an initial optical flow map is generated:

\[ \text{opt} = \text{model}(x) \tag{9.1} \]

Following this, the optical flow map is iteratively refined through a series of augmentation steps. Each step involves generating an augmented input \( x' \) as follows:

\[ x' = \text{aug}(\text{opt}, x) \tag{9.2} \]

The model then processes the augmented input \( x' \) to produce an updated optical flow map:

\[ \text{opt}' = \text{model}(x') \tag{9.3} \]

The loss function is calculated as the variance between the original optical flow map \( \text{opt} \) and the augmented optical flow map \( \text{opt}' \):

\[ \text{Loss} = \text{Variance}(\text{opt}', \text{opt}) \tag{9.4} \]

The primary goal of this algorithm is to reduce the variance between the optical flow map generated for the original input and the augmented images. The augmentation process generates random patches with pixel values constrained by the minimum and maximum values of the input image. These patches are then placed in random positions in both the original image \( x \) and its augmented counterpart \( x' \). The optical flow map produced by the model during test-time adaptation guides the patch placement, and minimizing the variance between the original and augmented optical flow maps helps the model improve its predictions on unseen data.
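The adaptation loop in equations 9.1 to 9.4 can be sketched on a tiny example. Several simplifications are assumed and should not be read as the actual method: the "model" is a deterministic placeholder rather than RAFT, the flow map is a single-channel grid, patch placement is purely random (flow-guided placement is omitted), and the loss is taken to be the mean per-pixel variance across the predictions.

```python
import random

def add_random_patch(image, patch_size, rng):
    """Copy `image` and fill one random square patch with values in [min, max] of the image."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    height, width = len(image), len(image[0])
    top = rng.randrange(height - patch_size + 1)
    left = rng.randrange(width - patch_size + 1)
    augmented = [row[:] for row in image]
    for y in range(top, top + patch_size):
        for x in range(left, left + patch_size):
            augmented[y][x] = rng.uniform(lo, hi)
    return augmented

def variance_loss(maps):
    """Mean per-pixel variance across a list of equally sized maps."""
    n = len(maps)
    height, width = len(maps[0]), len(maps[0][0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            values = [m[y][x] for m in maps]
            mean = sum(values) / n
            total += sum((v - mean) ** 2 for v in values) / n
    return total / (height * width)

def toy_model(image):
    """Placeholder for the flow network; it only scales the input."""
    return [[2.0 * v for v in row] for row in image]

rng = random.Random(0)
x = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
opt = toy_model(x)                                   # eq. 9.1
x_aug = add_random_patch(x, patch_size=2, rng=rng)   # simplified stand-in for eq. 9.2
opt_aug = toy_model(x_aug)                           # eq. 9.3
loss = variance_loss([opt, opt_aug])                 # eq. 9.4
print(loss)
```

In an actual test-time adaptation step this loss would be backpropagated through the network; with several augmentations per input, all resulting maps would be passed to the variance term together.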

Figure 9: Percentage of absolute EPE score changes
Figure 10: EPE scores for different configurations
Table 11: EPE scores for different configurations

Batch Size	Number of Augmentations	EPE Value
8	2	2.75277
2	11	2.75071
6	3	2.78732
2	3	2.74143
6	2	2.80244
2	2	2.76932

Figure 9 illustrates the percentage decrease in absolute EPE score for different combinations of number of augmentations, number of epochs, and batch size, while Figure 10 shows the EPE scores for those combinations. Table 11 presents EPE scores for different batch sizes and numbers of augmentations on a system equipped with six GPUs.

Across these configurations, a reduction in EPE is observed, although the trend is not entirely consistent. This suggests that while augmentations and optimizations contribute to minimizing EPE, further tuning of batch size, GPU count, and augmentation strategies is needed to achieve definitive improvements.
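For readers unfamiliar with the metric, EPE (end-point error) is the average Euclidean distance between predicted and ground-truth flow vectors. The following is a minimal sketch of that standard definition; the function name and the flat list-of-tuples layout are illustrative assumptions, not the project's evaluation code.

```python
import math


def endpoint_error(pred_flow, gt_flow):
    """Average Euclidean distance between predicted and ground-truth (u, v) vectors."""
    if len(pred_flow) != len(gt_flow) or not pred_flow:
        raise ValueError("flows must be non-empty and of equal length")
    total = sum(
        math.hypot(pu - gu, pv - gv)
        for (pu, pv), (gu, gv) in zip(pred_flow, gt_flow)
    )
    return total / len(pred_flow)


# Three pixels: two predicted exactly, one off by the vector (3, 4).
pred = [(1.0, 0.0), (0.0, 0.0), (3.0, 4.0)]
gt = [(1.0, 0.0), (3.0, 4.0), (3.0, 4.0)]
epe = endpoint_error(pred, gt)  # (0 + 5 + 0) / 3
```

Lower values are better, so the differences in the third decimal place in Table 11 correspond to small changes in average per-pixel flow error.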

For example, the configuration with a batch size of 2 and 11 augmentations reached an EPE of 2.75071, the second-lowest value in Table 11 (the lowest, 2.74143, came from a batch size of 2 with 3 augmentations). This configuration illustrates how batch size is calculated in distributed training, where the net batch size is determined using:

\[ \text{Net Batch Size} = \text{Batch Size} \times (\text{Num of Augmentations} + 1) \times \text{Number of GPUs} \]

In this case:

\[ 2 \times (11 + 1) \times 6 = 144 \]
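The net batch size formula can be applied to every configuration in Table 11. The short sketch below does so, assuming the six-GPU setup stated for the table; the function name is illustrative.

```python
def net_batch_size(batch_size, num_augmentations, num_gpus):
    """Net batch size = per-GPU batch size x (augmentations + 1 original) x GPUs."""
    return batch_size * (num_augmentations + 1) * num_gpus


NUM_GPUS = 6  # as stated for Table 11

# (batch size, number of augmentations) pairs from Table 11
configs = [(8, 2), (2, 11), (6, 3), (2, 3), (6, 2), (2, 2)]
sizes = {cfg: net_batch_size(cfg[0], cfg[1], NUM_GPUS) for cfg in configs}

for (batch, augs), size in sizes.items():
    print(f"batch={batch}, augmentations={augs} -> net batch size {size}")
```

Under this formula, three of the six configurations share the same net batch size of 144, while the others are smaller, which is worth keeping in mind when comparing their EPE values.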

Tech Stack

Framework: Torch Scale

Framework: PyTorch Lightning Fabric

WandB Report

Currently in progress and kept private.

Intelligent OCR

This project involves the development of an advanced Optical Character Recognition (OCR) software, leveraging a combination of cutting-edge neural network models: CRAFT, Faster R-CNN, Tesseract, and a Siamese network model. The software is hosted on Azure Cloud, utilizing a virtual machine (VM) to run the server and process documents. The entire system is designed to convert scanned documents into editable text, extract bounding boxes for words and sentences, classify sentences into categories, and provide a comprehensive annotation interface for user modifications.

Key Features:

Model Integration: The OCR software integrates multiple state-of-the-art models trained using PyTorch on the FUND dataset. This setup ensures high accuracy and reliability in text extraction and classification.

Document Processing: Users can upload scanned documents in various formats (.png, .jpg, .jpeg, and .pdf) to the frontend website. For PDFs, only the first page is processed. The software extracts editable text, provides bounding boxes for each word and sentence, and classifies sentences into categories such as 'other', 'question', 'answer', and 'header'.

Annotation Interface: The frontend features a user-friendly annotation interface built with annotorious.js. Users can modify model predictions or annotate documents from scratch. Annotations can be saved and downloaded in a simple .txt format.

Offline Processing: The software supports offline batch processing, allowing users to process multiple images at once. Users can choose between output formats like MTX or FUND dataset formats.

Server Details: The models run on an Azure VM with Linux (Ubuntu 18.04), specifically a Standard B2s instance (2 vCPUs, 4 GiB memory). Due to server connection constraints, there might be occasional interruptions in access, but comprehensive setup instructions are provided for replicating the environment.

Training and Metrics: Model training is conducted using PyTorch, with detailed explanations of training steps and performance metrics. All trained models and predictions on public test datasets are available for download.

Real Estate VR

Problem Statement
Traditional real estate methods rely on physical visits and static images, limiting immersion and convenience. To enhance the real estate experience, we aim to integrate virtual reality (VR) technology, creating a user-friendly, cross-platform VR application. This solution will offer immersive property exploration and real-time updates, modernizing how users view and interact with properties.

Design
1. Asset Coordination and Environment Generation: Utilizing Unity’s assets and spatial mapping, we create a realistic, adaptable virtual environment for immersive exploration.
2. Advanced User Interaction with XR Device Simulator: An intuitive interface with Unity’s XR Device Simulator allows seamless navigation and interaction through a 'digital wand,' enhancing user experience with C# scripting.
3. Dynamic Building Placement with Spatial Intelligence: Real-time building placement is facilitated by Unity’s spatial analysis, offering precise options based on spatial context.
4. Futuristic Navigation Paradigms: Advanced navigation includes teleportation and aerial views, allowing instant spatial traversal and comprehensive property inspection.
5. Precision-Engineered C# Scripting: C# scripts ensure smooth user controls, dynamic building placement, and effective interaction management.
Our VR platform allows users to add plots, select and place buildings, and view properties with detailed information. It supports flythrough and teleportation, with eagle-eye views for broader city and plot planning.

Key Features
- Add and visualize new plots
- Place houses with size and price details
- Immersive VR interface
- Flythrough and teleportation travel
- Multi-plot monitoring with eagle-eye views

Tools Used
- Unity
- C#

Yet Another Python Compiler

A Python Compiler, built with Bison and Flex, processes Python code by tokenizing it, parsing it into a syntax tree, and performing semantic checks. It generates intermediate code and applies optimizations like constant folding and dead code elimination. Key features include lexical analysis, syntax and semantic error detection, and code optimization.

Overview
The project focuses on developing a compiler for a custom programming language, implementing various stages of compilation including lexical analysis, parsing, semantic analysis, intermediate code generation, and code optimization.

Tech Stack
Programming Language: Python
Tools: YACC (Yet Another Compiler Compiler), Lex (for lexical analysis)
Data Structures: Symbol tables, parse trees, abstract syntax trees (ASTs)

Features and Contributions

Lexical Analysis:
Token identification and lexical error detection.
Developed token identification and error detection mechanisms.
Created a symbol table to manage identifiers and constants.

Parser:
Syntax declaration, indentation and syntactic error detection.
Implemented syntax rules using YACC for expressions and control flow.
Generated parse trees and abstract syntax trees (ASTs).

Semantic Analysis:
SDD + SDT, Annotated Parse Tree, and Semantic Error detection.
Designed syntax-directed translations to ensure semantic correctness.
Detected semantic errors such as undeclared variables and misuse of reserved identifiers.

Intermediate Code Generation (ICG):
Generated three-address code and quadruples.
Implemented backpatching for jump statements.
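As an illustration of the intermediate code stage, the sketch below generates quadruples from a small expression tree and backpatches jump targets for a simple conditional. It is written in Python for readability and is only a sketch under assumptions: the nested-tuple AST, the `if<` and `goto` opcodes, and all helper names are hypothetical and not taken from the project's Bison/Flex sources.

```python
import itertools


def gen_quadruples(expr):
    """Translate a nested-tuple expression into quadruples (op, arg1, arg2, result)."""
    quads = []
    temp_ids = itertools.count(1)

    def visit(node):
        if not isinstance(node, tuple):
            return str(node)              # leaf: identifier or literal
        op, left, right = node
        arg1 = visit(left)
        arg2 = visit(right)
        result = f"t{next(temp_ids)}"     # fresh temporary for this operation
        quads.append((op, arg1, arg2, result))
        return result

    visit(expr)
    return quads


# a + b * c  ->  t1 = b * c ; t2 = a + t1
quads = gen_quadruples(("+", "a", ("*", "b", "c")))

# Backpatching: emit jumps with unknown targets, fill them in later.
code = []  # each entry: [op, arg1, arg2, result_or_target]


def emit(op, arg1=None, arg2=None, result=None):
    code.append([op, arg1, arg2, result])
    return len(code) - 1                  # index of the emitted instruction


def backpatch(indices, target):
    for i in indices:
        code[i][3] = target


# Source statement: if a < b: x = 1
true_list = [emit("if<", "a", "b")]       # jump taken when the condition holds
false_list = [emit("goto")]               # jump taken otherwise
backpatch(true_list, len(code))           # then-branch starts at the next index
emit("=", 1, None, "x")
backpatch(false_list, len(code))          # false jump skips past the then-branch
```

The jump instructions are emitted before their destinations are known, and the lists of pending indices are resolved once the relevant code position exists.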

Code Optimization:
Basic blocks, DAG, CFG, Induction Variable elimination.
Applied optimization techniques including constant folding, copy propagation, dead code elimination, and peephole optimization.
Enhanced code efficiency through these optimizations.
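Two of the optimizations named above, constant folding and dead code elimination, can be sketched over a single basic block of quadruples. This is a simplified illustration in Python, not the project's code: it assumes straight-line code, side-effect-free operations, string names for variables, and integer constants.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold_constants(quads):
    """Fold operations on integer constants and propagate the known values."""
    known = {}  # variable name -> constant value
    out = []
    for op, a1, a2, res in quads:
        a1 = known.get(a1, a1)
        a2 = known.get(a2, a2)
        if op in OPS and isinstance(a1, int) and isinstance(a2, int):
            known[res] = OPS[op](a1, a2)
            out.append(("=", known[res], None, res))
        elif op == "=" and isinstance(a1, int):
            known[res] = a1
            out.append((op, a1, a2, res))
        else:
            known.pop(res, None)          # result is no longer a known constant
            out.append((op, a1, a2, res))
    return out


def eliminate_dead_code(quads, live_out):
    """Drop instructions whose result is never used, scanning backwards."""
    live = set(live_out)
    kept = []
    for op, a1, a2, res in reversed(quads):
        if res in live:
            live.discard(res)
            live.update(a for a in (a1, a2) if isinstance(a, str))
            kept.append((op, a1, a2, res))
    kept.reverse()
    return kept


block = [
    ("*", 2, 3, "t1"),        # t1 = 2 * 3
    ("+", "t1", "a", "t2"),   # t2 = t1 + a
    ("-", 10, 4, "t3"),       # t3 = 10 - 4 (never used)
    ("=", "t2", None, "x"),   # x = t2
]

folded = fold_constants(block)
optimized = eliminate_dead_code(folded, live_out={"x"})
```

After folding, `t1` becomes the constant 6 and is substituted into the addition; the backward liveness pass then removes the assignments to `t1` and `t3`, leaving only the instructions that contribute to `x`.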

Education

Experience

Check the project reports for each experience under the Projects section!

Prof. Pengtao Xie Lab @ University of California San Diego

Research Intern | June 2025 – Sep 2025

PRM in LLMs via Meta-Learning for Math Reasoning

Fine-tuning Process Reward Models (PRMs) using Qwen2.5 7B with LoRA for step-level reward assignment on math reasoning tasks. Leveraging solution trajectories from o4-mini and Gemini 2.5 Pro, and exploring domain-weighted meta-learning strategies, benchmarked on the AIME.

VISTA @ Indian Institute of Science

Research Intern | Jun 2024 – Jan 2025

Multi-modality in Healthcare & Test-time Adaptation in Optical Flow

Working on cross-modal alignment and self-supervised learning for medical tasks (MIMIC), including mortality prediction. Exploring optical flow estimation and test-time adaptation using Torchscale and Fabric for multi-node training and performance optimization.

Xu Labs @ Carnegie Mellon University

Research Intern | Dec 2023 – May 2024

Few-shot Weak Label Cryo-ET Segmentation

First-author (submitted to PLOS Computational Biology) on SaSi, a self-augmented few-shot learning method for weakly supervised Cryo-ET segmentation. Worked on consistency loss, SimCLR, AugMix, and pretraining MAE for denoising/reconstruction. Adapted the Segment Anything Model.

Laboratory of Medical Mechatronics @ National University of Singapore

Research Intern | Feb 2023 – Feb 2024

Surgical Task & Motion Planning

Developed MASS, an LLM + RL framework for interpretable robotic motion planning in surgical scenarios using PyBullet and LapGym-SOFA (submitted to RA-L). Trained RL policies like HER with imitation learning and integrated Grounding DINO for enhanced planning.

Spider R&D Club

Head of ML Research | July 2022 – May 2025

Camouflage Video Segmentation

Proposed SAM-PM for video camouflaged object detection, improving SAM with minimal added parameters (accepted at CVPR 2024 workshops). Leading research in LLM-driven Task & Motion Planning, Reinforcement Learning with custom robotic arms, and Continual Learning.

Samsung PRISM

ML Research Intern | Aug 2022 – Mar 2023

Empathetic Response Generation

Worked on Empathetic Response Generation and emotion/intent classification using Flan-T5, BART, and RoBERTa with Hugging Face for edge devices. Performed Knowledge Distillation on T5 achieving a 77% size reduction while maintaining BLEU score.

NIT Trichy

ML Research Intern | Apr 2022 – Oct 2022

Natural Legal Language Processing

Worked on Natural Legal Language Processing using BERT, XLNET, and Hierarchical Transformers on judicial data. Published at the EMNLP 2022 NLLP workshop; the lightweight CNN model achieves an 80x speedup for sentence boundary detection.

News

Publications

Projects

Click on a project to get its project report and run reports!

Crypto MLOps Dashboard

A Study on Regularization-Based Continual Learning Methods for Indic ASR

Sequentially adapting a Conformer-based ASR model to nine Indian languages using regularization-based continual learning strategies, improving multilingual performance without forgetting.

Continual Learning for Robust Video Segmentation of Robot-Assisted Surgical Tool

A continual learning framework using SAM2 and LoRA for robust, privacy-preserving surgical tool segmentation across evolving, distorted endoscopic video domains.

MASS (My ASSistant)

MASS uses an LLM to assist surgeons in automating surgical robots by sequencing actions from simple instructions. The project develops robotic motion planning with LLMs and reinforcement/imitation learning for surgery.

Medical Multi-Modal Fusion & Cross Modal Alignment

This project focuses on multi-modal fusion and cross-modal representation alignment with self-supervised learning by integrating EHR, CXR, Notes, and ECG data.

SaSi: A Self-augmented and Self-interpreted Deep Learning Approach for Few-shot Cryo-ET Particle Detection

This project introduces SaSi, a deep learning approach for weakly labeled cryo-ET segmentation in the few-shot setting, using Consistency Loss, SimCLR, and AugMix to improve data efficiency and outperform existing methods.

Test-time adaptation for Optical Flow

A test-time adaptation method for optical flow that uses augmentation and pseudo predictions.

SAM-PM: Enhancing Video Camouflaged Object Detection

SAM-PM is a propagation module that enhances SAM's performance in video camouflaged object detection, achieving substantial improvements with minimal parameter addition (< 1M) while outperforming the previous SOTA by 37.9% on mIoU. It was accepted at the CVPR 2024 workshops.

Natural Legal Language Processing

Worked on Natural Legal Language Processing by using LLMs like BERT, XLNET, and Hierarchical Transformers on judiciary datasets. Developed a lightweight CNN model for sentence boundary detection that is 80x faster than traditional statistical models.

Intelligent OCR

My OCR integrates CRAFT, Faster R-CNN, Tesseract, and a Siamese network to perform sentence classification and key-value pair detection, including bounding boxes and linked information. It is built with PyTorch and hosted on Azure Cloud.

i-Pravesh

i-Pravesh is a smart attendance Android app that uses on-device (edge) face detection and recognition (MobileFaceNet + TensorFlow Lite) as the biometric authentication for recording attendance.

SummarizeIQ: An Integrated Summarization and Content Analysis Engine

Real Estate VR

Yet Another Python Compiler

Octree-Based 3D OpenGL Renderer

The Octree-Based 3D OpenGL Renderer, created with PyOpenGL, efficiently renders 3D scenes by using an octree data structure to manage and optimize rendering performance.

Sartorius Cell Instance Segmentation

Trained R-CNN, U-Net, and Detectron2 models using PyTorch to detect and delineate distinct objects of interest in biological images depicting neuronal cell types commonly used in studying neurological disorders.

Other Projects
