Soumava Paul

Hi, visitor! This is Soumava and you have reached my small corner on the World Wide Web. I am currently a research intern at the Astra Vision Group in INRIA Paris. I recently completed my Master's in Visual Computing at Universität des Saarlandes. My Master's thesis was on sparse-view 360° scene reconstruction using 2D diffusion priors, advised by Dr. Jan Eric Lenssen and Prof. Bernt Schiele at the Max-Planck-Institut für Informatik.

Before coming to Europe, I was a data scientist at HP Inc India, where I worked with Niranjan Damera Venkata on time series forecasting problems in finance. I have also spent time as a Research Fellow at the Indian Institute of Science, Bangalore, working with Prof. Soma Biswas on problems at the intersection of cross-modal retrieval and domain generalization.

Previously, I graduated from the Indian Institute of Technology Kharagpur with a bachelor's degree in Electrical Engineering and a minor in Computer Science. I was advised by Prof. K. Sreenivasa Rao for my bachelor's thesis on Singing Voice Detection. During my undergrad, I interned at IBM India Research Labs, Bangalore, where I worked on novel zero-shot learning algorithms.

News
  • [Sep 2024]  One paper accepted to ECCV Wild3D 2024.
  • [May 2024]  Defended my Master's thesis on sparse-view 360° scene reconstruction using 2D diffusion priors.
  • [Apr 2024]  Awarded the Deutschlandstipendium for strong academic performance during my Master's.
  • [Apr 2022]  Started my Master's at Universität des Saarlandes.
  • [Aug 2021]  Two papers accepted to ICCV 2021: one in the main track and one at a workshop.
  • [Jun 2021]  One paper on singing voice detection accepted to Interspeech 2021.
  • [May 2020]  Graduated from IIT Kharagpur after 4 wonderful years.
  • [Dec 2018]  Presented my first research paper at ICVGIP 2018 held at IIIT Hyderabad.
  • [Jul 2016]  Started my journey at IIT Kharagpur as an undergrad.
Research

I am interested in computer vision, specifically topics at the intersection of generative models and static, dynamic or deformable scene representations. In earlier years, my research spanned topics in domain generalization, zero-shot learning, music information retrieval and medical imaging. Representative papers are highlighted.

* denotes equal contribution
Sp2360: Sparse-view 360° Scene Reconstruction using Cascaded 2D Diffusion Priors
Soumava Paul, Christopher Wewer, Bernt Schiele, Jan Eric Lenssen
ECCV Wild3D 2024
abstract / bibtex / proceedings / arXiv / Code

We aim to tackle sparse-view reconstruction of a 360° 3D scene using priors from latent diffusion models (LDM). The sparse-view setting is ill-posed and underconstrained, especially for scenes where the camera rotates 360 degrees around a point, as no visual information is available beyond some frontal views focused on the central object(s) of interest. In this work, we show that pretrained 2D diffusion models can strongly improve the reconstruction of a scene with low-cost fine-tuning. Specifically, we present SparseSplat360 (Sp2360), a method that employs a cascade of in-painting and artifact removal models to fill in missing details and clean novel views. Due to superior training and rendering speeds, we use an explicit scene representation in the form of 3D Gaussians over NeRF-based implicit representations. We propose an iterative update strategy to fuse generated pseudo novel views with existing 3D Gaussians fitted to the initial sparse inputs. As a result, we obtain a multi-view consistent scene representation with details coherent with the observed inputs. Our evaluation on the challenging Mip-NeRF360 dataset shows that our proposed 2D to 3D distillation algorithm considerably improves the performance of a regularized version of 3DGS adapted to a sparse-view setting and outperforms existing sparse-view reconstruction methods in 360° scene reconstruction. Qualitatively, our method generates entire 360° scenes from as few as 9 input views, with a high degree of foreground and background detail.

        @inproceedings{
          paul2024sp,
          title={Sp2360: Sparse-view 360{\textopenbullet} Scene Reconstruction using Cascaded 2D Diffusion Priors},
          author={Soumava Paul and Christopher Wewer and Bernt Schiele and Jan Eric Lenssen},
          booktitle={ECCV 2024 Workshop on Wild 3D: 3D Modeling, Reconstruction, and Generation in the Wild},
          year={2024},
          url={https://openreview.net/forum?id=XuNhNyHHwK}
        }
      

Synthesizing novel views with a combination of diffusion-based image inpainting and artifact removal priors helps reconstruct complex 360° scenes without 3D-aware finetuning of a 2D diffusion model on million-scale multiview datasets.

Test-time Training for Data-efficient UCDR
Soumava Paul, Titir Dutta, Aheli Saha, Abhishek Samanta, Soma Biswas
arXiv preprint
abstract / bibtex / arXiv / Code

Image retrieval under generalized test scenarios has gained significant momentum in literature, and the recently proposed protocol of Universal Cross-domain Retrieval is a pioneer in this direction. A common practice in any such generalized classification or retrieval algorithm is to exploit samples from many domains during training to learn a domain-invariant representation of data. Such a criterion is often restrictive, and thus in this work, for the first time, we explore the generalized retrieval problem in a data-efficient manner. Specifically, we aim to generalize any pre-trained cross-domain retrieval network towards any unknown query domain/category, by means of adapting the model on the test data leveraging self-supervised learning techniques. Toward that goal, we explore different self-supervised loss functions (for example, RotNet, JigSaw, Barlow Twins, etc.) and analyze their effectiveness for the same. Extensive experiments demonstrate that the proposed approach is simple, easy to implement, and effective in handling data-efficient UCDR.

      @misc{paul2023testtime,
        title={Test-time Training for Data-efficient UCDR}, 
        author={Soumava Paul and Titir Dutta and Aheli Saha and Abhishek Samanta and Soma Biswas},
        year={2023},
        eprint={2208.09198},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
      }
      

Test-time training heuristics weaken the assumption of training data across multiple domains for universal cross-domain retrieval.

Universal Cross-Domain Retrieval: Generalizing Across Classes and Domains
Soumava Paul*, Titir Dutta*, Soma Biswas
ICCV 2021
abstract / bibtex / proceedings / arXiv / Code / Video / slides / poster

In this work, for the first time, we address the problem of universal cross-domain retrieval, where the test data can belong to classes or domains which are unseen during training. Due to the dynamically increasing number of categories and the practical constraint of training on every possible domain, which requires large amounts of data, generalizing to both unseen classes and domains is important. Towards that goal, we propose SnMpNet (Semantic Neighbourhood and Mixture Prediction Network), which incorporates two novel losses to account for the unseen classes and domains encountered during testing. Specifically, we introduce a novel Semantic Neighbourhood loss to bridge the knowledge gap between seen and unseen classes and ensure that the latent space embedding of the unseen classes is semantically meaningful with respect to its neighboring classes. We also introduce a mix-up based supervision at image-level as well as semantic-level of the data for training with the Mixture Prediction loss, which helps in efficient retrieval when the query belongs to an unseen domain. These losses are incorporated on the SE-ResNet50 backbone to obtain SnMpNet. Extensive experiments on two large-scale datasets, Sketchy Extended and DomainNet, and thorough comparisons with state-of-the-art justify the effectiveness of the proposed model.

      @InProceedings{Paul_2021_ICCV,
        author    = {Paul, Soumava and Dutta, Titir and Biswas, Soma},
        title     = {Universal Cross-Domain Retrieval: Generalizing Across Classes and Domains},
        booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
        month     = {October},
        year      = {2021},
        pages     = {12056-12064}
      }
      

Semantic domain-invariant learning of latent space embeddings prepares an image retrieval model for a challenging cross-domain retrieval scenario where test data can belong to classes or domains unseen during training.

Knowledge Distillation for Singing Voice Detection
Soumava Paul, Gurunath Reddy M, K. Sreenivasa Rao, PP Das
INTERSPEECH 2021
abstract / bibtex / proceedings / arXiv / Code / Video / slides

Singing Voice Detection (SVD) has been an active area of research in music information retrieval (MIR). Currently, two deep neural network-based methods, one based on CNN and the other on RNN, exist in literature that learn optimized features for the voice detection (VD) task and achieve state-of-the-art performance on common datasets. Both these models have a huge number of parameters (1.4M for CNN and 65.7K for RNN) and hence are not suitable for deployment on devices like smartphones or embedded sensors with limited capacity in terms of memory and computation power. The most popular method to address this issue is known as knowledge distillation in deep learning literature (in addition to model compression), where a large pre-trained network known as the teacher is used to train a smaller student network. Despite the wide applications of SVD in music information retrieval, to the best of our knowledge, model compression for practical deployment has not yet been explored. In this paper, efforts have been made to investigate this issue using both conventional as well as ensemble knowledge distillation techniques.

      @inproceedings{paul21b_interspeech,
        author={Soumava Paul and Gurunath Reddy M and K. Sreenivasa Rao and Partha Pratim Das},
        title={{Knowledge Distillation for Singing Voice Detection}},
        year=2021,
        booktitle={Proc. Interspeech 2021},
        pages={4159--4163},
        doi={10.21437/Interspeech.2021-636}
      }
      

Knowledge distillation with state-of-the-art voice detection models as teachers improves the performance of student models up to 1000x smaller in parameter count.

Addressing Target Shift in Zero-shot Learning using Grouped Adversarial Learning
Saneem Ahmed Chemmengath*, Soumava Paul*, Samarth Bharadwaj, Suranjana Samanta, Karthik Sankaranarayanan
ICCV 2021 MELEX Workshop   (Oral)
abstract / bibtex / proceedings / arXiv / Code / Video / slides

Zero-shot learning (ZSL) algorithms typically work by exploiting attribute correlations to make predictions for unseen classes. However, these correlations do not remain intact at test time in most practical settings, and the resulting change in these correlations leads to adverse effects on zero-shot learning performance. In this paper, we present a new paradigm for ZSL that: (i) utilizes the class-attribute mapping of unseen classes to estimate the change in target distribution (target shift), and (ii) proposes a novel technique called grouped Adversarial Learning (gAL) to reduce negative effects of this shift. Our approach is widely applicable for several existing ZSL algorithms, including those with implicit attribute predictions. We apply the proposed technique (gAL) on three popular ZSL algorithms: ALE, SJE, and DEVISE, and show performance improvements on 4 popular ZSL datasets: AwA2, aPY, CUB, and SUN.

      @InProceedings{Chemmengath_2021_ICCV,
        author    = {Chemmengath, Saneem A. and Paul, Soumava and Bharadwaj, Samarth and Samanta, Suranjana and Sankaranarayanan, Karthik},
        title     = {Addressing Target Shift in Zero-Shot Learning Using Grouped Adversarial Learning},
        booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
        month     = {October},
        year      = {2021},
        pages     = {2368-2377}
      }
      

Learning image attributes with grouped Adversarial Learning (gAL) reduces the effects of target shift in zero-shot learning algorithms.

Jointly Learning Convolutional Representations to Compress Radiological Images and Classify Thoracic Diseases in the Compressed Domain
Ekagra Ranjan*, Soumava Paul*, Siddharth Kapoor, Aupendu Kar, Ramanathan Sethuraman, Debdoot Sheet
ICVGIP, ACM, 2018   (Oral)
abstract / bibtex / proceedings / Code / slides

Deep learning models trained on natural images are commonly used for different classification tasks in the medical domain. Generally, very high dimensional medical images are down-sampled by using interpolation techniques before feeding them to deep learning models that are ImageNet compliant and accept only low-resolution images of size 224 x 224 px. This popular technique may lead to the loss of key information, thus hampering the classification, since significant pathological features in medical images are typically small and highly affected by such down-sampling. To combat this problem, we introduce a convolutional neural network (CNN) based classification approach which learns to reduce the resolution of the image using an autoencoder and at the same time classify it using another network, while both the tasks are trained jointly. This algorithm guides the model to learn essential representations from high-resolution images for classification along with reconstruction. We have used the publicly available dataset of chest X-rays to evaluate this approach and have outperformed state-of-the-art on test data. Besides, we have experimented with the effects of different augmentation approaches in this dataset and report baselines using some well known ImageNet class of CNNs.

      @inproceedings{ranjan2018jointly,
        title={Jointly learning convolutional representations to compress radiological images and classify thoracic diseases in the compressed domain},
        author={Ranjan, Ekagra and Paul, Soumava and Kapoor, Siddharth and Kar, Aupendu and Sethuraman, Ramanathan and Sheet, Debdoot},
        booktitle={Proceedings of the 11th Indian Conference on computer vision, graphics and image processing},
        pages={1--8},
        year={2018}
      }
      

Downscaling high-resolution chest X-rays using an autoencoder leads to superior retention of important pathological features for thoracic disease classification over usual interpolation techniques.

Side Hustles
Four Musings and a Mural
Soumava Paul, Neel Kelkar
project page / Code

Our submission for the Rendering Competition of the Computer Graphics-I course at UdS (Winter Term 2022/23). Finished in the A-tier with 80% of the points, narrowly missing out on the podium places.

ArMyo
LBS Hall (IIT Kgp) Hardware Modelling Team
Report / Code

Our submission for the Inter-Hall Hardware Modelling Competition at IIT Kgp in 2019. We built a robotic exoskeleton system for assisting patients with muscular atrophy. Secured 4th place out of 15 teams.

  Academic Service
  Teaching
  Misc
  • In my downtime, I like to play tennis, pingpong, and football.
  • I speak English, Bengali, Hindi, and beginner-level German.
  • On weekends, I follow the English Premier League and overthink which players to pick for my fantasy football team. My favorite footballers of all time are Son Heung-min and Zlatan Ibrahimović.
  • I use mvp18 as my web handle for social accounts, a nickname I came up with when I turned (you guessed it) 18! I had just cracked IIT-JEE and felt like a real hotshot back then. The next 4 years couldn't have been any more humbling :')
  • In an earlier life, I could write.
This revolver map seemed kinda cool. It shows total webpage visits since April 13, 2023.

Imitation is the most sincere form of flattery