Gouthaman, K.V.,
Nambiar, A.[Athira],
Srinivas, K.S.[Kancheti Sai],
Mittal, A.[Anurag],
Linguistically-aware attention for reducing the semantic gap in
vision-language tasks,
PR(112), 2021, pp. 107812.
Elsevier DOI
2102
Attention models, Visual question answering,
Counting in visual question answering, Image captioning
BibRef
Zhou, K.Y.[Kai-Yang],
Yang, J.K.[Jing-Kang],
Loy, C.C.[Chen Change],
Liu, Z.W.[Zi-Wei],
Learning to Prompt for Vision-Language Models,
IJCV(130), No. 9, September 2022, pp. 2337-2348.
Springer DOI
2208
BibRef
Zhou, K.Y.[Kai-Yang],
Yang, J.K.[Jing-Kang],
Loy, C.C.[Chen Change],
Liu, Z.W.[Zi-Wei],
Conditional Prompt Learning for Vision-Language Models,
CVPR22(16795-16804)
IEEE DOI
2210
Training, Representation learning, Adaptation models,
Neural networks, Manuals
BibRef
Ma, C.C.[Cheng-Cheng],
Liu, Y.[Yang],
Deng, J.K.[Jian-Kang],
Xie, L.X.[Ling-Xi],
Dong, W.M.[Wei-Ming],
Xu, C.S.[Chang-Sheng],
Understanding and Mitigating Overfitting in Prompt Tuning for
Vision-Language Models,
CirSysVideo(33), No. 9, September 2023, pp. 4616-4629.
IEEE DOI Code:
WWW Link.
2310
BibRef
Chen, C.Q.[Chong-Qing],
Han, D.[Dezhi],
Chang, C.C.[Chin-Chen],
MPCCT: Multimodal vision-language learning paradigm with
context-based compact Transformer,
PR(147), 2024, pp. 110084.
Elsevier DOI Code:
WWW Link.
2312
Multimodal vision-language paradigms,
High-dependency modeling, Visual question answering (VQA),
Logical relationship reasoning
BibRef
Yu, Z.T.[Zheng-Tao],
Zhao, J.[Jia],
Guo, C.L.[Chen-Liang],
Yang, Y.[Ying],
StableNet: Distinguishing the hard samples to overcome language
priors in visual question answering,
IET-CV(18), No. 2, 2024, pp. 315-327.
DOI Link
2403
multimedia systems
BibRef
Bazi, Y.[Yakoub],
Bashmal, L.[Laila],
Rahhal, M.M.A.[Mohamad Mahmoud Al],
Ricci, R.[Riccardo],
Melgani, F.[Farid],
RS-LLaVA: A Large Vision-Language Model for Joint Captioning and
Question Answering in Remote Sensing Imagery,
RS(16), No. 9, 2024, pp. 1477.
DOI Link
2405
BibRef
Tan, Y.T.[Ying-Tao],
Chen, Y.Y.[Ying-Ying],
Wang, J.Q.[Jin-Qiao],
DSTA: Reinforcing Vision-Language Understanding for Scene-Text VQA
With Dual-Stream Training Approach,
SPLetters(32), 2025, pp. 6-10.
IEEE DOI
2501
Optical character recognition, Training, Visualization,
Feature extraction, Transformers, Text recognition, scene-text understanding
BibRef
Alsabbagh, A.R.[Abdel Rahman],
Mansour, T.[Tariq],
Al-Kharabsheh, M.[Mohammad],
Ebdah, A.S.[Abdel Salam],
Al-Emaryeen, R.[Roa'a],
Al-Nahhas, S.[Sara],
Mahafza, W.[Waleed],
Al-Kadi, O.[Omar],
MiniMedGPT: Efficient Large Vision-Language Model for Medical Visual
Question Answering,
PRL(189), 2025, pp. 8-16.
Elsevier DOI Code:
WWW Link.
2503
Medical VQA, Large Vision-Language Model, MedGPT,
Generative pre-trained transformers, Natural language processing
BibRef
Wang, X.[Xiao],
Wu, J.L.[Jian-Long],
Lin, Z.[Zijia],
Zhang, F.Z.[Fu-Zheng],
Zhang, D.[Di],
Nie, L.Q.[Li-Qiang],
Video DataFlywheel: Resolving the Impossible Data Trinity in
Video-Language Understanding,
PAMI(47), No. 4, April 2025, pp. 2912-2923.
IEEE DOI
2503
Noise, Annotations, Iterative methods, Scalability, Data models,
Question answering (information retrieval), Foundation models,
text-video retrieval
BibRef
Shen, R.[Ruoyue],
Inoue, N.[Nakamasa],
Guan, D.[Dayan],
Cai, R.[Rizhao],
Kot, A.C.[Alex C.],
Shinoda, K.[Koichi],
ContextualCoder: Adaptive In-Context Prompting for Programmatic
Visual Question Answering,
MultMed(27), 2025, pp. 4936-4949.
IEEE DOI
2509
BibRef
Earlier: A1, A2, A6, Only:
Pyramid Coder: Hierarchical Code Generator for Compositional Visual
Question Answering,
ICIP24(430-436)
IEEE DOI
2411
Codes, Context modeling, Computational modeling, Visualization,
Cognition, Training, Question answering (information retrieval),
visual question answering.
Training, Visualization, Codes, Accuracy, Large language models,
Natural languages, Visual question answering, Prompting methods
BibRef
Zhang, Y.H.[Yu-Hui],
Su, Y.C.[Yu-Chang],
Liu, Y.M.[Yi-Ming],
Wang, X.H.[Xiao-Han],
Burgess, J.[James],
Sui, E.[Elaine],
Wang, C.Y.[Chen-Yu],
Aklilu, J.[Josiah],
Lozano, A.[Alejandro],
Wei, A.[Anjiang],
Schmidt, L.[Ludwig],
Yeung-Levy, S.[Serena],
Automated Generation of Challenging Multiple-Choice Questions for
Vision Language Model Evaluation,
CVPR25(29580-29590)
IEEE DOI
2508
Visualization, Accuracy, Natural languages, Transforms,
Benchmark testing, Question answering (information retrieval),
multiple choice questions
BibRef
Jiang, X.[Xin],
Zheng, J.W.[Jun-Wei],
Liu, R.P.[Rui-Ping],
Li, J.H.[Jia-Hang],
Zhang, J.M.[Jia-Ming],
Matthiesen, S.[Sven],
Stiefelhagen, R.[Rainer],
@BENCH: Benchmarking Vision-Language Models for Human-centered
Assistive Technology,
WACV25(3934-3943)
IEEE DOI
2505
Image segmentation, Visualization, Depth measurement,
Optical character recognition, Visual impairment, VQA
BibRef
Wang, W.Z.[Wei-Zhen],
Duan, C.[Chenda],
Peng, Z.H.[Zheng-Hao],
Liu, Y.X.[Yu-Xin],
Zhou, B.[Bolei],
Embodied Scene Understanding for Vision Language Models via MetaVQA,
CVPR25(22453-22464)
IEEE DOI Code:
WWW Link.
2508
Visualization, Annotations, Decision making, Benchmark testing,
Robustness, Question answering (information retrieval),
Artificial intelligence
BibRef
Tian, X.Y.[Xin-Yu],
Zou, S.[Shu],
Yang, Z.Y.[Zhao-Yuan],
Zhang, J.[Jing],
Identifying and Mitigating Position Bias of Multi-image
Vision-Language Models,
CVPR25(10599-10609)
IEEE DOI Code:
WWW Link.
2508
Interpolation, Analytical models, Codes, Computational modeling,
Cognition, Question answering (information retrieval), position bias
BibRef
Sheng, L.J.[Li-Jun],
Liang, J.[Jian],
Wang, Z.[Zilei],
He, R.[Ran],
R-TPT: Improving Adversarial Robustness of Vision-Language Models
through Test-Time Prompt Tuning,
CVPR25(29958-29967)
IEEE DOI Code:
WWW Link.
2508
Training, Reviews, Foundation models, Training data, Minimization,
Entropy, Robustness, Safety, Tuning, vision-language models,
test time
BibRef
Das, D.[Deepayan],
Talon, D.[Davide],
Mancini, M.[Massimiliano],
Wang, Y.M.[Yi-Ming],
Ricci, E.[Elisa],
One VLM to Keep it Learning: Generation and Balancing for Data-free
Continual Visual Question Answering,
WACV25(5635-5645)
IEEE DOI Code:
WWW Link.
2505
Visualization, Adaptation models, Prevention and mitigation,
Training data, Quality control, Benchmark testing, Data models,
catastrophic forgetting
BibRef
Ishmam, M.F.[Md Farhan],
Tashdeed, I.[Ishmam],
Saadat, T.A.[Talukder Asir],
Ashmafee, M.H.[Md Hamjajul],
Kamal, A.R.M.[Abu Raihan Mostofa],
Hossain, M.A.[Md. Azam],
Visual Robustness Benchmark for Visual Question Answering (VQA),
WACV25(6623-6633)
IEEE DOI
2505
Measurement, Visualization, Computational modeling,
Large language models, Benchmark testing, Linguistics, Robustness, multimodal
BibRef
Chen, X.[Xi],
Djolonga, J.[Josip],
Padlewski, P.[Piotr],
Mustafa, B.[Basil],
Changpinyo, S.[Soravit],
Wu, J.L.[Jia-Lin],
Ruiz, C.R.[Carlos Riquelme],
Goodman, S.[Sebastian],
Wang, X.[Xiao],
Tay, Y.[Yi],
Shakeri, S.[Siamak],
Dehghani, M.[Mostafa],
Salz, D.[Daniel],
Lucic, M.[Mario],
Tschannen, M.[Michael],
Nagrani, A.[Arsha],
Hu, H.[Hexiang],
Joshi, M.[Mandar],
Pang, B.[Bo],
Montgomery, C.[Ceslee],
Pietrzyk, P.[Paulina],
Ritter, M.[Marvin],
Piergiovanni, A.[AJ],
Minderer, M.[Matthias],
Pavetic, F.[Filip],
Waters, A.[Austin],
Li, G.[Gang],
Alabdulmohsin, I.[Ibrahim],
Beyer, L.[Lucas],
Amelot, J.[Julien],
Lee, K.[Kenton],
Steiner, A.P.[Andreas Peter],
Li, Y.[Yang],
Keysers, D.[Daniel],
Arnab, A.[Anurag],
Xu, Y.Z.[Yuan-Zhong],
Rong, K.[Keran],
Kolesnikov, A.[Alexander],
Seyedhosseini, M.[Mojtaba],
Angelova, A.[Anelia],
Zhai, X.H.[Xiao-Hua],
Houlsby, N.[Neil],
Soricut, R.[Radu],
On Scaling Up a Multilingual Vision and Language Model,
CVPR24(14432-14444)
IEEE DOI
2410
Training, Visualization, Computational modeling, Object detection,
Benchmark testing, Question answering (information retrieval),
pretraining
BibRef
Li, R.J.[Rong-Jie],
Wu, Y.[Yu],
He, X.M.[Xu-Ming],
Learning by Correction: Efficient Tuning Task for Zero-Shot
Generative Vision-Language Reasoning,
CVPR24(13428-13437)
IEEE DOI
2410
Training, Visualization, Costs, Computational modeling, Cognition,
Question answering (information retrieval),
Vision-Language
BibRef
Khan, Z.[Zaid],
Fu, Y.[Yun],
Consistency and Uncertainty: Identifying Unreliable Responses From
Black-Box Vision-Language Models for Selective Visual Question
Answering,
CVPR24(10854-10863)
IEEE DOI
2410
Visualization, Uncertainty, Computational modeling, Closed box,
Predictive models, Question answering (information retrieval),
trustworthy ml
BibRef
Gu, T.C.[Tian-Cheng],
Yang, K.C.[Kai-Cheng],
Liu, D.[Dongnan],
Cai, W.D.[Wei-Dong],
LaPA: Latent Prompt Assist Model for Medical Visual Question
Answering,
DEF-AI-MIA24(4971-4980)
IEEE DOI Code:
WWW Link.
2410
Visualization, Accuracy, Medical services, Predictive models,
Feature extraction, Question answering (information retrieval), Data mining
BibRef
Feinglass, J.[Joshua],
Yang, Y.Z.[Ye-Zhou],
Towards Addressing the Misalignment of Object Proposal Evaluation for
Vision-Language Tasks via Semantic Grounding,
WACV24(4385-4395)
IEEE DOI
2404
Measurement, Visualization, Protocols, Annotations, Grounding,
Semantics, Question answering (information retrieval),
Image recognition and understanding
BibRef
Nadeem, A.[Asmar],
Hilton, A.[Adrian],
Dawes, R.[Robert],
Thomas, G.[Graham],
Mustafa, A.[Armin],
CAD: Contextual Multi-modal Alignment for Dynamic AVQA,
WACV24(7236-7248)
IEEE DOI
2404
Visualization, Semantics, Decision making, Robustness,
Question answering (information retrieval), Complexity theory,
Smartphones / end user devices
BibRef
Wu, W.[Wenyi],
Li, Q.[Qi],
Zhong, W.L.[Wen-Liang],
Huang, J.Z.[Jun-Zhou],
MIVC: Multiple Instance Visual Component for Visual-Language Models,
WACV24(8102-8111)
IEEE DOI
2404
Visualization, Computational modeling, Neural networks,
Question answering (information retrieval),
Image recognition and understanding
BibRef
Walmer, M.[Matthew],
Sikka, K.[Karan],
Sur, I.[Indranil],
Shrivastava, A.[Abhinav],
Jha, S.[Susmit],
Dual-Key Multimodal Backdoors for Visual Question Answering,
CVPR22(15354-15364)
IEEE DOI
2210
Visualization, Training data, Detectors, Feature extraction,
Question answering (information retrieval),
Vision + language
BibRef
Ding, Y.[Yang],
Yu, J.[Jing],
Liu, B.[Bang],
Hu, Y.[Yue],
Cui, M.X.[Ming-Xin],
Wu, Q.[Qi],
MuKEA: Multimodal Knowledge Extraction and Accumulation for
Knowledge-based Visual Question Answering,
CVPR22(5079-5088)
IEEE DOI
2210
Bridges, Visualization, Codes, Computational modeling,
Knowledge based systems, Semantics, Vision + language
BibRef
Gao, F.[Feng],
Ping, Q.[Qing],
Thattai, G.[Govind],
Reganti, A.[Aishwarya],
Wu, Y.N.[Ying Nian],
Natarajan, P.[Prem],
Transform-Retrieve-Generate: Natural Language-Centric
Outside-Knowledge Visual Question Answering,
CVPR22(5057-5067)
IEEE DOI
2210
Knowledge engineering, Visualization, Solid modeling,
Knowledge based systems, Natural languages, Transforms,
Visual reasoning
BibRef
Aflalo, E.[Estelle],
Du, M.[Meng],
Tseng, S.Y.[Shao-Yen],
Liu, Y.F.[Yong-Fei],
Wu, C.[Chenfei],
Duan, N.[Nan],
Lal, V.[Vasudev],
VL-InterpreT: An Interactive Visualization Tool for Interpreting
Vision-Language Transformers,
CVPR22(21374-21383)
IEEE DOI
2210
Heating systems, Visualization, Machine vision,
Computational modeling, Transformers, Question answering (information retrieval)
BibRef
Jain, V.,
Lodhavia, J.,
Automatic Question Tagging using k-Nearest Neighbors and Random
Forest,
ISCV20(1-4)
IEEE DOI
2011
learning (artificial intelligence),
question answering (information retrieval),
Natural Language Processing
BibRef
Chapter on Implementations and Applications, Databases, QBIC, Video Analysis, Hardware and Software, Inspection continues in
Vision-Language Models, Hallucination Mitigation.