NLKI: A Lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA

NLKI Framework — Project Website

Paper (PDF) · arXiv · Code · EMNLP Findings
Aritra Dutta1 Swapnanil Mukherjee2 Deepanway Ghoshal3 Somak Aditya1
1 IIT Kharagpur   2 Ashoka University   3 Google DeepMind
NLKI Framework Block Diagram

Abstract

Commonsense visual question answering often hinges on knowledge that is missing from the image or the question. Small vision–language models (sVLMs) such as ViLT, VisualBERT, and FLAVA therefore lag behind their larger generative counterparts. To study the effect of careful commonsense knowledge integration on sVLMs, we present NLKI, an end-to-end framework that: (i) retrieves natural language facts, (ii) prompts an LLM to craft natural language explanations, and (iii) feeds both signals to sVLMs across two commonsense VQA datasets (CRIC, AOKVQA) and a visual-entailment dataset (e-SNLI-VE). Facts retrieved with a fine-tuned ColBERTv2 and an object-information-enriched prompt yield explanations that markedly reduce hallucination, while lifting end-to-end answer accuracy by up to 7% across the three datasets. This allows FLAVA and the other models in NLKI to match or exceed medium-sized VLMs such as Qwen-2 VL-2B and SmolVLM-2.5B. Because these benchmarks contain 10–25% label noise, additional finetuning with noise-robust losses (such as symmetric cross entropy and generalised cross entropy) adds another 2.5% on CRIC and 5.5% on AOKVQA. Our findings expose when LLM-based commonsense knowledge beats retrieval from commonsense knowledge bases, how noise-aware training stabilises small models under external knowledge augmentation, and why parameter-efficient commonsense reasoning is now within reach for 250M-parameter models.

Highlights

Plug & Play
Retriever + LLM explainer + lightweight reader that attaches to any sub-240M VLM.
Better Explanations
Prompts enriched with dense/region captions, objects, and retrieved facts reduce hallucination.
Accuracy Gains
Up to ~13% absolute on AOKVQA with FLAVA + Type-5 explanations; consistent lift across models.
Noise Robust
SCE / GCE losses stabilise training on label-noisy datasets like CRIC and AOKVQA.

Method: NLKI Pipeline

  1. Visual context extraction: dense/region captions plus detected objects.
  2. Knowledge retrieval: a fine-tuned ColBERTv2 fetches the top-k commonsense facts.
  3. LLM explanations: an instruction-tuned LLM (e.g., Llama-3-8B) crafts a short rationale (Type-5 prompt).
  4. Reader integration: prepend the rationale to the question and feed both to a small VLM for answer classification (steps 2–4 are sketched in code right after this list).
  5. Noise-robust training: apply SCE or CE+GCE depending on the dataset's noise level (see the loss sketch below).
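
The minimal sketch below shows how steps 2–4 compose. It is an illustration, not the released implementation: retriever.search, explainer.generate, and reader.classify are hypothetical stand-ins for the ColBERTv2 retriever, the instruction-tuned LLM, and the sVLM reader, and the prompt template is only an approximation of the Type-5 recipe (captions + objects + retrieved facts), not the exact one from the paper.

from typing import List

def build_type5_prompt(question: str, captions: List[str],
                       objects: List[str], facts: List[str]) -> str:
    # Approximation of the Type-5 prompt: visual context + retrieved facts.
    facts_block = "\n".join(f"- {fact}" for fact in facts)
    return (
        f"Image captions: {'; '.join(captions)}\n"
        f"Detected objects: {', '.join(objects)}\n"
        f"Relevant commonsense facts:\n{facts_block}\n"
        f"Question: {question}\n"
        "Write a short rationale that helps answer the question."
    )

def nlki_answer(question: str, captions: List[str], objects: List[str],
                retriever, explainer, reader, k: int = 5) -> str:
    facts = retriever.search(question, k=k)          # step 2: top-k facts
    prompt = build_type5_prompt(question, captions, objects, facts)
    rationale = explainer.generate(prompt)           # step 3: LLM rationale
    # Step 4: prepend the rationale to the question for the sVLM reader.
    return reader.classify(f"{rationale} {question}")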

All components are modular; swap the retriever, LLM, or reader as needed.
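
For step 5, here is a minimal PyTorch sketch of the two noise-robust losses. The hyper-parameter defaults (alpha, beta, q) follow the original SCE and GCE papers and are assumptions here, not necessarily our experimental configuration.

import torch
import torch.nn.functional as F

def sce_loss(logits: torch.Tensor, targets: torch.Tensor,
             alpha: float = 0.1, beta: float = 1.0) -> torch.Tensor:
    # Symmetric cross entropy (Wang et al., 2019): CE plus reverse CE.
    ce = F.cross_entropy(logits, targets)
    pred = F.softmax(logits, dim=1).clamp(1e-7, 1.0)
    # Clamp the one-hot labels so log(0) becomes a finite constant.
    one_hot = F.one_hot(targets, logits.size(1)).float().clamp(1e-4, 1.0)
    rce = (-pred * one_hot.log()).sum(dim=1).mean()
    return alpha * ce + beta * rce

def gce_loss(logits: torch.Tensor, targets: torch.Tensor,
             q: float = 0.7) -> torch.Tensor:
    # Generalised cross entropy (Zhang & Sabuncu, 2018): (1 - p_y^q) / q,
    # interpolating between CE (q -> 0) and noise-tolerant MAE (q = 1).
    p_y = F.softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_y.pow(q)) / q).mean()

In the paper's setting, these losses buy an extra ~2.5% on CRIC and ~5.5% on AOKVQA, whose labels are noisier; for the cleaner e-SNLI-VE, plain cross entropy suffices.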

NLKI Pipeline Diagram

Demonstration of NLKI

Results

Figure placeholder: accuracy vs. model size (AOKVQA val).
Table/plot placeholder: ablations over retrieval types, explanation types, and loss functions.
CRIC
+2–3% with Type-5 explanations, and a further ~2–3% with SCE on noisy labels.
AOKVQA
FLAVA: ~+13% absolute with NLKI; matches or surpasses 1–4B generative VLMs.
e-SNLI-VE
Cleaner labels: CE suffices; consistent gains from NLKI without heavy regularisation.

Datasets

Dataset     Type   Train   Val    Test   Answer
CRIC        VQA    364K    76K    84K    MCQ
AOKVQA      VQA    17K     1.1K   6.7K   MCQ / Free-form
e-SNLI-VE   NLI    401K    14K    14K    3-way


Resources

Paper (PDF): camera-ready version for EMNLP Findings.
Code: training scripts, checkpoints, and data preparation (GitHub).
Models & Checkpoints: FLAVA / ViLT / VisualBERT readers plus the fine-tuned ColBERTv2 retriever.
Processed Data: prompts, explanations, and retrieval indices.

BibTeX

@inproceedings{dutta2025nlki,
  title     = {NLKI: A Lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks},
  author    = {Aritra Dutta and Swapnanil Mukherjee and Deepanway Ghoshal and Somak Aditya},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP},
  year      = {2025}
}


TR²AIL Lab and External Collaborators

Aritra Dutta
TR²AIL Lab, IIT Kharagpur
LinkedIn
Swapnanil Mukherjee
Ashoka University
LinkedIn
Deepanway Ghoshal
Google DeepMind
LinkedIn
Somak Aditya
TR²AIL Lab, IIT Kharagpur
LinkedIn Homepage
Contact

For questions, please email traillab@gmail.com.

If you use our code or ideas, please cite the paper above. Thanks!
