20.4.3.3.18 Large Language Models and Visual Grounding

Large Language Models. LLM. Grounding. Visual Grounding.
See also Visual Question Answering, Query, VQA.

Chen, Z.X.[Zhi-Xuan], Bie, Y.[Yequan], Jin, H.B.[Hai-Bo], Chen, H.[Hao],
Large Language Model With Region-Guided Referring and Grounding for CT Report Generation,
MedImg(44), No. 8, August 2025, pp. 3139-3150.
IEEE DOI Code:
WWW Link. 2508
Computed tomography, Grounding, Feature extraction, Training, Medical diagnostic imaging, Accuracy, Geometry, Lungs, Visualization, large language model BibRef

Liu, Y.[Yi], Hou, H.W.[Hao-Wen], Ma, F.[Fei], Ni, S.G.[Shi-Guang], Yu, F.R.[Fei Richard],
MLLM-TA: Leveraging Multimodal Large Language Models for Precise Temporal Video Grounding,
SPLetters(32), 2025, pp. 281-285.
IEEE DOI 2501
Visualization, Grounding, Large language models, Feature extraction, Benchmark testing, Vectors, Training, video grounding BibRef

Li, G.Z.[Guo-Zhang], Ding, X.P.[Xin-Peng], Cheng, D.[De], Li, J.[Jie], Wang, N.N.[Nan-Nan], Gao, X.B.[Xin-Bo],
ETC: Temporal Boundary Expand Then Clarify for Weakly Supervised Video Grounding With Multimodal Large Language Model,
MultMed(27), 2025, pp. 1772-1782.
IEEE DOI 2504
Proposals, Grounding, Visualization, Annotations, Noise measurement, Location awareness, Large language models, Data augmentation, video grounding BibRef

Wu, J.L.[Jian-Long], Liu, W.[Wei], Liu, Y.[Ye], Liu, M.[Meng], Nie, L.Q.[Li-Qiang], Lin, Z.C.[Zhou-Chen], Chen, C.W.[Chang Wen],
A Survey on Video Temporal Grounding With Multimodal Large Language Model,
PAMI(48), No. 2, February 2026, pp. 1521-1541.
IEEE DOI 2601
Survey, Grounding. Videos, Grounding, Visualization, Surveys, Training, Question answering (information retrieval), Cognition, multimodal learning BibRef

Wang, P.[Peifu], Liang, Y.X.[Yi-Xiong], Cen, Y.G.[Yi-Gang], Cen, L.H.[Li-Hui], Qu, Z.[Zhe], Liu, J.L.[Jing-Ling], Kan, S.C.[Shi-Chao],
Integrating spatial features and dynamically learned temporal features via contrastive learning for video temporal grounding in LLM,
IVC(167), 2026, pp. 105895.
Elsevier DOI 2602
Large language model, Video temporal grounding, Video temporal localization, Contrastive learning BibRef


Liu, Y.[Yang], Jiang, L.[Le], Li, G.M.[Guo-Ming], Ye, X.Z.[Xiao-Zhou], Ouyang, Y.[Ye],
YOLO-VG: Enhancing Multi-Stage Feature Interaction for Visual Grounding,
ICIP25(469-473)
IEEE DOI 2601
Visualization, Grounding, Large language models, Semantics, Pipelines, Object detection, Data collection, Feature extraction, Large Language Models BibRef

Gao, J.[Jun], Li, Y.Q.[Yong-Qi], Cao, Z.Q.[Zi-Qiang], Li, W.J.[Wen-Jie],
Interleaved-Modal Chain-of-Thought,
CVPR25(19520-19529)
IEEE DOI 2508
Visualization, Grounding, Large language models, Memory management, Benchmark testing, Cognition, chain-of-thought prompting, vision-language models BibRef

Yu, C.L.[Chun-Lin], Wang, H.Q.[Han-Qing], Shi, Y.[Ye], Luo, H.Y.[Hao-Yang], Yang, S.[Sibei], Yu, J.Y.[Jing-Yi], Wang, J.Y.[Jing-Ya],
SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model,
CVPR25(1691-1701)
IEEE DOI Code:
WWW Link. 2508
Solid modeling, Grounding, Affordances, Large language models, Benchmark testing, Cognition, Intelligent agents, Context modeling, multi-modal large language model BibRef

Huang, Y.[Yangyu], Gao, T.Y.[Tian-Yi], Xu, H.R.[Hao-Ran], Zhao, Q.H.[Qi-Hao], Song, Y.[Yang], Gui, Z.P.[Zhi-Peng], Lv, T.C.[Teng-Chao], Chen, H.[Hao], Cui, L.[Lei], Li, S.[Scarlett], Wei, F.[Furu],
PEACE: Empowering Geologic Map Holistic Understanding with MLLMs,
CVPR25(3899-3908)
IEEE DOI Code:
WWW Link. 2508
Hands, Grounding, Geology, Large language models, Earthquakes, Feature extraction, Information retrieval, benchmark BibRef

Chen, W.B.[Wen-Bo], Xu, Z.[Zhen], Xu, R.[Ruotao], Wu, S.[Si], Wong, H.S.[Hau-San],
Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding,
CVPR25(3931-3941)
IEEE DOI 2508
Bridges, Visualization, Grounding, Large language models, Semantics, Transformers, Feature extraction, Feeds, visual grounding, multimodal BibRef

Wu, S.[Size], Jin, S.[Sheng], Zhang, W.W.[Wen-Wei], Xu, L.[Lumin], Liu, W.T.[Wen-Tao], Li, W.[Wei], Loy, C.C.[Chen Change],
F-LMM: Grounding Frozen Large Multimodal Models,
CVPR25(24710-24721)
IEEE DOI Code:
WWW Link. 2508
Visualization, Codes, Attention mechanisms, Grounding, Oral communication, Benchmark testing, Cognition, Decoding, visual segmentation BibRef

Qian, R.[Rui], Yin, X.[Xin], Dou, D.[Dejing],
Reasoning to Attend: Try to Understand How <SEG> Token Works,
CVPR25(24722-24731)
IEEE DOI Code:
WWW Link. 2508
Visualization, Vocabulary, Grounding, Computational modeling, Large language models, Semantics, Cognition, Decoding, large multimodal models BibRef

Chen, Y.Y.[Yan-Yuan], Xu, D.X.[De-Xuan], Huang, Y.[Yu], Zhan, S.K.[Song-Kun], Wang, H.[Hanpin], Chen, D.X.[Dong-Xue], Wang, X.P.[Xue-Ping], Qiu, M.K.[Mei-Kang], Li, H.[Hang],
MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output,
CVPR25(24732-24741)
IEEE DOI Code:
WWW Link. 2508
Visualization, Grounding, Terminology, Large language models, Computational modeling, Semantics, medical visual question answering BibRef

Huang, H.F.[Hai-Feng], Chen, X.[Xinyi], Chen, Y.L.[Yi-Lun], Li, H.[Hao], Han, X.[Xiaoshen], Wang, Z.[Zehan], Wang, T.[Tai], Pang, J.M.[Jiang-Miao], Zhao, Z.[Zhou],
RoboGround: Robotic Manipulation with Grounded Vision-Language Priors,
CVPR25(22540-22550)
IEEE DOI 2508
Codes, Grounding, Shape, Pipelines, robotic manipulation, grounded large vision-language models BibRef

Man, Y.Z.[Yun-Ze], Huang, D.A.[De-An], Liu, G.L.[Gui-Lin], Sheng, S.W.[Shi-Wei], Liu, S.L.[Shi-Long], Gui, L.Y.[Liang-Yan], Kautz, J.[Jan], Wang, Y.X.[Yu-Xiong], Yu, Z.[Zhiding],
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought,
CVPR25(14268-14280)
IEEE DOI 2508
Visualization, Accuracy, Grounding, Large language models, Benchmark testing, Cognition BibRef

Yin, H.[Heng], Ren, Y.Q.[Yu-Qiang], Yan, K.[Ke], Ding, S.H.[Shou-Hong], Hao, Y.T.[Yong-Tao],
ROD-MLLM: Towards More Reliable Object Detection in Multimodal Large Language Models,
CVPR25(14358-14368)
IEEE DOI 2508
Location awareness, Visualization, Grounding, Annotations, Large language models, Pipelines, Training data, Object detection, visual grounding BibRef

Liao, Y.H.[Yuan-Hong], Mahmood, R.[Rafid], Fidler, S.[Sanja], Acuna, D.[David],
Can Large Vision-Language Models Correct Semantic Grounding Errors By Themselves?,
CVPR25(14667-14678)
IEEE DOI 2508
Training, Vocabulary, Accuracy, Grounding, Statistical analysis, Semantics, Refining, Training data, Data models, Iterative methods, self-correction BibRef

Yuan, Z.H.[Zhi-Hao], Peng, Y.[Yibo], Ren, J.[Jinke], Liao, Y.H.[Ying-Hong], Han, Y.[Yatong], Feng, C.M.[Chun-Mei], Zhao, H.S.[Heng-Shuang], Li, G.B.[Guan-Bin], Cui, S.G.[Shu-Guang], Li, Z.[Zhen],
Empowering Large Language Models with 3D Situation Awareness,
CVPR25(19435-19445)
IEEE DOI 2508
Grounding, Large language models, Manuals, Observers, Data models, Trajectory, Videos, point cloud, vision and language BibRef

Kang, S.[Seil], Kim, J.[Jinyeong], Kim, J.[Junhyeok], Hwang, S.J.[Seong Jae],
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding,
CVPR25(9339-9350)
IEEE DOI 2508
Location awareness, Visualization, Image segmentation, Head, Attention mechanisms, Grounding, Semantics, Text to image, large vision-language model BibRef

Liu, Q.Y.[Qian-Yi], Zhang, S.Q.[Si-Qi], Qiao, Y.Y.[Yan-Yuan], Zhu, J.Y.[Jun-You], Li, X.[Xiang], Guo, L.T.[Long-Teng], Wang, Q.[Qunbo], He, X.J.[Xing-Jian], Wu, Q.[Qi], Liu, J.[Jing],
GroundingMate: Aiding Object Grounding for Goal-Oriented Vision-and-Language Navigation,
WACV25(1775-1784)
IEEE DOI 2505
Bridges, Navigation, Grounding, Large language models, Computational modeling, Natural languages, Cognition, Data mining, Object recognition BibRef

Yan, S.[Siming], Bai, M.[Min], Chen, W.F.[Wei-Feng], Zhou, X.[Xiong], Huang, Q.X.[Qi-Xing], Li, L.E.[Li Erran],
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-grained Reward Modeling,
ECCV24(LXI: 37-53).
Springer DOI 2412
BibRef

Chowdhury, S.[Sanjoy], Nag, S.[Sayan], Dasgupta, S.[Subhrajyoti], Chen, J.[Jun], Elhoseiny, M.[Mohamed], Gao, R.H.[Ruo-Han], Manocha, D.[Dinesh],
Meerkat: Audio-visual Large Language Model for Grounding in Space and Time,
ECCV24(LXIV: 52-70).
Springer DOI 2412
BibRef

Kuckreja, K.[Kartik], Danish, M.S.[Muhammad Sohail], Naseer, M.[Muzammal], Das, A.[Abhijit], Khan, S.[Salman], Khan, F.S.[Fahad Shahbaz],
GeoChat: Grounded Large Vision-Language Model for Remote Sensing,
CVPR24(27831-27840)
IEEE DOI 2410
Visualization, Scene classification, Grounding, Oral communication, Object detection, Benchmark testing, Data models BibRef

Song, C.H.[Chan Hee], Sadler, B.M.[Brian M.], Wu, J.[Jiaman], Chao, W.L.[Wei-Lun], Washington, C.[Clayton], Su, Y.[Yu],
LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models,
ICCV23(2986-2997)
IEEE DOI 2401
BibRef

You, K.[Keen], Zhang, H.T.[Hao-Tian], Schoop, E.[Eldon], Weers, F.[Floris], Swearngin, A.[Amanda], Nichols, J.[Jeffrey], Yang, Y.F.[Yin-Fei], Gan, Z.[Zhe],
FERRET-UI: Grounded Mobile UI Understanding with Multimodal LLMs,
ECCV24(LXIV: 240-255).
Springer DOI 2412
BibRef

Tong, S.B.[Sheng-Bang], Liu, Z.[Zhuang], Zhai, Y.X.[Yue-Xiang], Ma, Y.[Yi], LeCun, Y.[Yann], Xie, S.[Saining],
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs,
CVPR24(9568-9578)
IEEE DOI 2410
Representation learning, Visualization, Systematics, Correlation, Grounding, Large language models, Multimodal LLMs, Vision Language Model BibRef

Xu, J.R.[Jia-Rui], Zhou, X.Y.[Xing-Yi], Yan, S.[Shen], Gu, X.[Xiuye], Arnab, A.[Anurag], Sun, C.[Chen], Wang, X.L.[Xiao-Long], Schmid, C.[Cordelia],
Pixel Aligned Language Models,
CVPR24(13030-13039)
IEEE DOI 2410
Location awareness, Visualization, Grounding, Large language models, Machine vision, Computational modeling BibRef

Wu, P.H.[Peng-Hao], Xie, S.[Saining],
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs,
CVPR24(13084-13094)
IEEE DOI 2410
Training, Visualization, Grounding, Computational modeling, Seals, Benchmark testing, multimodal large language model, visual search BibRef

He, R.[Ruozhen], Cascante-Bonilla, P.[Paola], Yang, Z.Y.[Zi-Yan], Berg, A.C.[Alexander C.], Ordonez, V.[Vicente],
Improved Visual Grounding through Self-Consistent Explanations,
CVPR24(13095-13105)
IEEE DOI 2410
Location awareness, Visualization, Vocabulary, Grounding, Large language models, Data augmentation, Data models, visual grounding BibRef

Feng, C.[Chun], Hsu, J.[Joy], Liu, W.Y.[Wei-Yu], Wu, J.J.[Jia-Jun],
Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners,
CVPR24(13269-13278)
IEEE DOI 2410
Visualization, Solid modeling, Accuracy, Grounding, Large language models, 3D visual grounding, Language constraints BibRef

He, J.W.[Jun-Wen], Wang, Y.F.[Yi-Fan], Wang, L.J.[Li-Jun], Lu, H.C.[Hu-Chuan], He, J.Y.[Jun-Yan], Lan, J.P.[Jin-Peng], Luo, B.[Bin], Xie, X.[Xuansong],
Multi-Modal Instruction Tuned LLMs with Fine-Grained Visual Perception,
CVPR24(13980-13990)
IEEE DOI Code:
WWW Link. 2410
Image segmentation, Visualization, Technological innovation, Grounding, Computational modeling, Large language models, Natural languages BibRef

Yuan, Z.H.[Zhi-Hao], Ren, J.[Jinke], Feng, C.M.[Chun-Mei], Zhao, H.S.[Heng-Shuang], Cui, S.G.[Shu-Guang], Li, Z.[Zhen],
Visual Programming for Zero-Shot Open-Vocabulary 3D Visual Grounding,
CVPR24(20623-20633)
IEEE DOI Code:
WWW Link. 2410
Visualization, Vocabulary, Grounding, Annotations, Navigation, Large language models, Visual Grounding, Point Cloud, Vision and Language BibRef

Chen, G.[Gongwei], Shen, L.[Leyang], Shao, R.[Rui], Deng, X.[Xiang], Nie, L.Q.[Li-Qiang],
LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge,
CVPR24(26530-26540)
IEEE DOI 2410
Visualization, Accuracy, Grounding, Large language models, Semantics, Benchmark testing BibRef

Qu, M.X.[Meng-Xue], Chen, X.D.[Xiao-Dong], Liu, W.[Wu], Li, A.[Alicia], Zhao, Y.[Yao],
ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models,
PVUW24(1847-1856)
IEEE DOI 2410
Grounding, Annotations, Large language models, Supervised learning, Natural languages BibRef

Zhang, Y.[Yichi], Ma, Z.Q.[Zi-Qiao], Gao, X.F.[Xiao-Feng], Shakiah, S.[Suhaila], Gao, Q.[Qiaozi], Chai, J.[Joyce],
Groundhog: Grounding Large Language Models to Holistic Segmentation,
CVPR24(14227-14238)
IEEE DOI 2410
Training, Visualization, Grounding, Shape, Large language models, Semantics, Feature extraction, Multi-Modal, Language Grounding, Vision-Language Model BibRef

Kim, K.[Kibum], Yoon, K.[Kanghoon], Jeon, J.[Jaehyeong], In, Y.[Yeonjun], Moon, J.[Jinyoung], Kim, D.H.[Dong-Hyun], Park, C.[Chanyoung],
LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation,
CVPR24(28306-28316)
IEEE DOI Code:
WWW Link. 2410
Training, Visualization, Grounding, Large language models, Semantics, Genomics, Focusing, Scene Understanding, Large Language Model, Long-Tail Problem BibRef

Chapter on Implementations and Applications, Databases, QBIC, Video Analysis, Hardware and Software, Inspection continues in
Visual Grounding in Visual Question Answering.


Last update: Feb 26, 2026 at 10:58:24