NLKI: A Lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA

NLKI Framework — Project Website

Paper (PDF) · arXiv · Code · EMNLP Findings
Aritra Dutta1 Swapnanil Mukherjee2 Deepanway Ghoshal3 Somak Aditya1
1 IIT Kharagpur   2 Ashoka University   3 Google DeepMind
NLKI Framework Block Diagram

Abstract

Commonsense visual question answering often hinges on knowledge that is missing from the image or the question. Small vision–language models (sVLMs) such as ViLT, VisualBERT, and FLAVA therefore lag behind their larger generative counterparts. To study the effect of careful commonsense knowledge integration on sVLMs, we present NLKI, an end-to-end framework that: (i) retrieves natural language facts, (ii) prompts an LLM to craft natural language explanations, and (iii) feeds both signals to sVLMs across two commonsense VQA datasets (CRIC, AOKVQA) and a visual-entailment dataset (e-SNLI-VE). Facts retrieved with a fine-tuned ColBERTv2 and an object-information-enriched prompt yield explanations that markedly reduce hallucination, while lifting end-to-end answer accuracy by up to 7% across the three datasets. This allows FLAVA and the other models in NLKI to match or exceed medium-sized VLMs such as Qwen-2 VL-2B and SmolVLM-2.5B. Because these benchmarks contain 10–25% label noise, additional finetuning with noise-robust losses (such as symmetric cross entropy and generalised cross entropy) adds another 2.5% on CRIC and 5.5% on AOKVQA. Our findings expose when LLM-based commonsense knowledge beats retrieval from commonsense knowledge bases, how noise-aware training stabilises small models under external knowledge augmentation, and why parameter-efficient commonsense reasoning is now within reach for 250M-parameter models.

Highlights

Plug & Play
Retriever + LLM explainer + lightweight reader that attaches to any sub-240M VLM.
Better Explanations
Prompts enriched with dense/region captions, objects, and retrieved facts reduce hallucination.
Accuracy Gains
Up to ~13% absolute on AOKVQA with FLAVA + Type-5 explanations; consistent lift across models.
Noise Robust
SCE / GCE losses stabilise training on label-noisy datasets like CRIC and AOKVQA.

Method: NLKI Pipeline

  1. Visual context extraction: dense/region captions plus detected objects.
  2. Knowledge retrieval: a fine-tuned ColBERTv2 fetches the top-k commonsense facts.
  3. LLM explanations: an instruction-tuned LLM (e.g., Llama-3-8B) crafts a short rationale (Type-5 prompt).
  4. Reader integration: prepend the rationale to the question and feed both to a small VLM for answer classification (steps 2–4 are sketched in code right after this list).
  5. Noise-robust training: apply SCE or CE+GCE depending on the dataset's noise level (see the loss sketch below).
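
The minimal sketch below shows how steps 2–4 compose. It is an illustration, not the released implementation: retriever.search, explainer.generate, and reader.classify are hypothetical stand-ins for the ColBERTv2 retriever, the instruction-tuned LLM, and the sVLM reader, and the prompt template is only an approximation of the Type-5 recipe (captions + objects + retrieved facts), not the exact one from the paper.

from typing import List

def build_type5_prompt(question: str, captions: List[str],
                       objects: List[str], facts: List[str]) -> str:
    # Approximation of the Type-5 prompt: visual context + retrieved facts.
    facts_block = "\n".join(f"- {fact}" for fact in facts)
    return (
        f"Image captions: {'; '.join(captions)}\n"
        f"Detected objects: {', '.join(objects)}\n"
        f"Relevant commonsense facts:\n{facts_block}\n"
        f"Question: {question}\n"
        "Write a short rationale that helps answer the question."
    )

def nlki_answer(question: str, captions: List[str], objects: List[str],
                retriever, explainer, reader, k: int = 5) -> str:
    facts = retriever.search(question, k=k)          # step 2: top-k facts
    prompt = build_type5_prompt(question, captions, objects, facts)
    rationale = explainer.generate(prompt)           # step 3: LLM rationale
    # Step 4: prepend the rationale to the question for the sVLM reader.
    return reader.classify(f"{rationale} {question}")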

All components are modular; swap the retriever, LLM, or reader as needed.
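
For step 5, here is a minimal PyTorch sketch of the two noise-robust losses. The hyper-parameter defaults (alpha, beta, q) follow the original SCE and GCE papers and are assumptions here, not necessarily our experimental configuration.

import torch
import torch.nn.functional as F

def sce_loss(logits: torch.Tensor, targets: torch.Tensor,
             alpha: float = 0.1, beta: float = 1.0) -> torch.Tensor:
    # Symmetric cross entropy (Wang et al., 2019): CE plus reverse CE.
    ce = F.cross_entropy(logits, targets)
    pred = F.softmax(logits, dim=1).clamp(1e-7, 1.0)
    # Clamp the one-hot labels so log(0) becomes a finite constant.
    one_hot = F.one_hot(targets, logits.size(1)).float().clamp(1e-4, 1.0)
    rce = (-pred * one_hot.log()).sum(dim=1).mean()
    return alpha * ce + beta * rce

def gce_loss(logits: torch.Tensor, targets: torch.Tensor,
             q: float = 0.7) -> torch.Tensor:
    # Generalised cross entropy (Zhang & Sabuncu, 2018): (1 - p_y^q) / q,
    # interpolating between CE (q -> 0) and noise-tolerant MAE (q = 1).
    p_y = F.softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_y.pow(q)) / q).mean()

In the paper's setting, these losses buy an extra ~2.5% on CRIC and ~5.5% on AOKVQA, whose labels are noisier; for the cleaner e-SNLI-VE, plain cross entropy suffices.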

NLKI Pipeline Diagram

Demonstration of NLKI

Results

Figure placeholder: accuracy vs. model size (AOKVQA val).
Table/plot placeholder: ablations over retrieval types, explanation types, and loss functions.
CRIC
+2–3% with Type-5 explanations, and a further ~2–3% with SCE on noisy labels.
AOKVQA
FLAVA: ~+13% absolute with NLKI; matches or surpasses 1–4B generative VLMs.
e-SNLI-VE
Cleaner labels: CE suffices; consistent gains from NLKI without heavy regularisation.

Datasets

Dataset     Type   Train   Val    Test   Answer
CRIC        VQA    364K    76K    84K    MCQ
AOKVQA      VQA    17K     1.1K   6.7K   MCQ / Free-form
e-SNLI-VE   NLI    401K    14K    14K    3-way


Resources

Paper (PDF): camera-ready version for EMNLP Findings.
Code: training scripts, checkpoints, and data preparation (GitHub).
Models & Checkpoints: FLAVA / ViLT / VisualBERT readers plus the fine-tuned ColBERTv2 retriever.
Processed Data: prompts, explanations, and retrieval indices.

BibTeX

@inproceedings{dutta2025nlki,
  title     = {NLKI: A Lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks},
  author    = {Aritra Dutta and Swapnanil Mukherjee and Deepanway Ghoshal and Somak Aditya},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP},
  year      = {2025}
}


TR²AIL Lab and External Collaborators

Aritra Dutta
TR²AIL Lab, IIT Kharagpur
LinkedIn
Swapnanil Mukherjee
Ashoka University
LinkedIn
Deepanway Ghoshal
Google DeepMind
LinkedIn
Somak Aditya
TR²AIL Lab, IIT Kharagpur
LinkedIn Homepage
Contact

For questions, please email traillab@gmail.com.

If you use our code or ideas, please cite the paper above. Thanks!
