Vision and Reasoning

Many vision and language tasks require commonsense reasoning beyond data-driven image and natural language processing. The cognitive science and active vision literature points to an explicit, iterative interaction among perception, reasoning, and memory (knowledge) modules (DeepIU ACS 2015).

Ongoing Projects

  1. SERB DST Startup Research Grant (2021-23) ~ INR 26 Lacs
    Topic: “Learning from Rules and Data for Image Analytics”
  2. IIT Kharagpur Faculty Startup Research Grant (2022-24) ~ INR 25 Lacs
    Topic: “The Role of Feedback in Vision-Language enabled Embodied Agents towards Applications in Desire Management”
    Joint PI: Prof. Pawan Goyal
  3. Counterfactual Reasoning in Videos
  4. Active Learning for 3D Video Grounding (with Dr. Maneesh Singh)

Captioning

In our earliest attempt (CVIU 2017), we combined image classification with reasoning over commonsense knowledge (extracted from training captions) to propose a Scene Description Graph as an intermediate representation of a natural image. We showed the efficacy of this representation on image captioning and image retrieval tasks, along with QA case studies.

Visual QA, Image Puzzles and Visual Reasoning

We have proposed instantiations of this abstract architecture to solve image puzzles, VQA, and visual reasoning tasks such as CLEVR. In our AAAI 2018 (VQA) and UAI 2018 (Puzzles) work, we proposed an explicit probabilistic soft logic layer on top of a neural architecture, which helps integrate commonsense knowledge and induces post-hoc interpretability.
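As a rough illustration of what a probabilistic soft logic (PSL) layer computes, the sketch below scores one weighted rule using the Łukasiewicz relaxation that PSL is built on. The rule, the predicate names, the truth values, and the weight are all hypothetical and are not taken from the AAAI 2018 or UAI 2018 papers.

```python
def soft_and(*truth_values):
    """Lukasiewicz conjunction over soft truth values in [0, 1]."""
    return max(0.0, sum(truth_values) - (len(truth_values) - 1))


def distance_to_satisfaction(body, head):
    """How far the rule `body -> head` is from being satisfied (0 = satisfied)."""
    return max(0.0, body - head)


def rule_penalty(weight, body_values, head_value):
    """Weighted hinge penalty that a single ground rule contributes."""
    return weight * distance_to_satisfaction(soft_and(*body_values), head_value)


# Hypothetical ground rule:
#   detected(img, "dog") & related("dog", "leash") -> answer(img, "leash")
detector_confidence = 0.9  # assumed output of a neural model
knowledge_score = 0.8      # assumed commonsense relatedness score
candidate_answer = 0.4     # assumed current belief in the answer

penalty = rule_penalty(2.0, [detector_confidence, knowledge_score], candidate_answer)
```

PSL inference chooses values for the unobserved atoms that minimise the weighted sum of such penalties. In this toy case, raising `candidate_answer` to 0.7 would remove the penalty for this rule.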

Later, for an end-to-end (differentiable) integration of spatial knowledge, we explored a combination of knowledge distillation, probabilistic logic, and relational networks in our WACV 2019 work on CLEVR.
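For readers unfamiliar with knowledge distillation, the following is a minimal sketch of the standard temperature-softened distillation loss, in which a student model is trained to match a teacher's output distribution. It is a generic illustration, not the WACV 2019 implementation; the logits and the temperature are made-up values.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0.0)
    # The T^2 factor is the conventional rescaling that keeps gradient
    # magnitudes comparable across temperatures.
    return (temperature ** 2) * kl


teacher_logits = [4.0, 1.0, 0.5]  # hypothetical teacher outputs
student_logits = [2.0, 1.5, 1.0]  # hypothetical student outputs
loss = distillation_loss(student_logits, teacher_logits)
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two diverge. In practice it is usually added to the ordinary task loss.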

Somak Aditya
Assistant Professor

My research interests include integrating knowledge and enabling higher-order reasoning in AI.

Publications

ERVQA: A Dataset to Benchmark the Readiness of Large Vision Language Models in Hospital Environments | In EMNLP 2024 (Main).
Tags: nlp, vision


Text2Afford: Probing Object Affordance Prediction abilities of Language Models solely from Text | In CoNLL 2024 (Main).
Tags: nlp, vision


Integrating Knowledge and Reasoning in Image Understanding | In IJCAI 2019.
Tags: vision, nlp


Knowledge and Reasoning for Image Understanding | Ph.D. Dissertation, Defended 2018.
(2019).
Tags: vision, nlp


Spatial Knowledge Distillation to aid Visual Reasoning | In IEEE WACV 2019.
Tags: vision, nlp, neurosymbolic


Explicit Reasoning over End-to-End Neural Architectures | In AAAI 2018.
Tags: vision, nlp, neurosymbolic


Visual common-sense for scene understanding using perception, semantic parsing and reasoning | In AAAI Spring Symposium, 2015.
Tags: vision, nlp