Chen, Z.X.[Zhi-Xuan],
Bie, Y.[Yequan],
Jin, H.B.[Hai-Bo],
Chen, H.[Hao],
Large Language Model With Region-Guided Referring and Grounding for
CT Report Generation,
MedImg(44), No. 8, August 2025, pp. 3139-3150.
IEEE DOI Code:
WWW Link.
2508
Computed tomography, Grounding, Feature extraction, Training,
Medical diagnostic imaging, Accuracy, Geometry, Lungs, Visualization,
large language model
BibRef
Liu, Y.[Yi],
Hou, H.W.[Hao-Wen],
Ma, F.[Fei],
Ni, S.G.[Shi-Guang],
Yu, F.R.[Fei Richard],
MLLM-TA: Leveraging Multimodal Large Language Models for Precise
Temporal Video Grounding,
SPLetters(32), 2025, pp. 281-285.
IEEE DOI
2501
Visualization, Grounding, Large language models, Feature extraction,
Benchmark testing, Vectors, Training, video grounding
BibRef
Li, G.Z.[Guo-Zhang],
Ding, X.P.[Xin-Peng],
Cheng, D.[De],
Li, J.[Jie],
Wang, N.N.[Nan-Nan],
Gao, X.B.[Xin-Bo],
ETC: Temporal Boundary Expand Then Clarify for Weakly Supervised
Video Grounding With Multimodal Large Language Model,
MultMed(27), 2025, pp. 1772-1782.
IEEE DOI
2504
Proposals, Grounding, Visualization, Annotations, Noise measurement,
Location awareness, Large language models, Data augmentation,
video grounding
BibRef
Wu, J.L.[Jian-Long],
Liu, W.[Wei],
Liu, Y.[Ye],
Liu, M.[Meng],
Nie, L.Q.[Li-Qiang],
Lin, Z.C.[Zhou-Chen],
Chen, C.W.[Chang Wen],
A Survey on Video Temporal Grounding With Multimodal Large Language
Model,
PAMI(48), No. 2, February 2026, pp. 1521-1541.
IEEE DOI
2601
Survey, Grounding. Videos, Grounding, Visualization, Surveys, Training,
Question answering (information retrieval), Cognition,
multimodal learning
BibRef
Wang, P.[Peifu],
Liang, Y.X.[Yi-Xiong],
Cen, Y.G.[Yi-Gang],
Cen, L.H.[Li-Hui],
Qu, Z.[Zhe],
Liu, J.L.[Jing-Ling],
Kan, S.C.[Shi-Chao],
Integrating spatial features and dynamically learned temporal
features via contrastive learning for video temporal grounding in LLM,
IVC(167), 2026, pp. 105895.
Elsevier DOI
2602
Large language model, Video temporal grounding,
Video temporal localization, Contrastive learning
BibRef
Gao, J.[Jun],
Li, Y.Q.[Yong-Qi],
Cao, Z.Q.[Zi-Qiang],
Li, W.J.[Wen-Jie],
Interleaved-Modal Chain-of-Thought,
CVPR25(19520-19529)
IEEE DOI
2508
Visualization, Grounding, Large language models, Memory management,
Benchmark testing, Cognition, chain-of-thought prompting, vision-language models
BibRef
Yu, C.L.[Chun-Lin],
Wang, H.Q.[Han-Qing],
Shi, Y.[Ye],
Luo, H.Y.[Hao-Yang],
Yang, S.[Sibei],
Yu, J.Y.[Jing-Yi],
Wang, J.Y.[Jing-Ya],
SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large
Language Model,
CVPR25(1691-1701)
IEEE DOI Code:
WWW Link.
2508
Solid modeling, Grounding, Affordances, Large language models,
Benchmark testing, Cognition, Intelligent agents, Context modeling,
multi-modal large language model
BibRef
Huang, Y.[Yangyu],
Gao, T.Y.[Tian-Yi],
Xu, H.R.[Hao-Ran],
Zhao, Q.H.[Qi-Hao],
Song, Y.[Yang],
Gui, Z.P.[Zhi-Peng],
Lv, T.C.[Teng-Chao],
Chen, H.[Hao],
Cui, L.[Lei],
Li, S.[Scarlett],
Wei, F.[Furu],
PEACE: Empowering Geologic Map Holistic Understanding with MLLMs,
CVPR25(3899-3908)
IEEE DOI Code:
WWW Link.
2508
Hands, Grounding, Geology, Large language models, Earthquakes,
Feature extraction, Information retrieval, benchmark
BibRef
Chen, W.B.[Wen-Bo],
Xu, Z.[Zhen],
Xu, R.[Ruotao],
Wu, S.[Si],
Wong, H.S.[Hau-San],
Task-aware Cross-modal Feature Refinement Transformer with Large
Language Models for Visual Grounding,
CVPR25(3931-3941)
IEEE DOI
2508
Bridges, Visualization, Grounding, Large language models, Semantics,
Transformers, Feature extraction, Feeds, visual grounding, multimodal
BibRef
Wu, S.[Size],
Jin, S.[Sheng],
Zhang, W.W.[Wen-Wei],
Xu, L.[Lumin],
Liu, W.T.[Wen-Tao],
Li, W.[Wei],
Loy, C.C.[Chen Change],
F-LMM: Grounding Frozen Large Multimodal Models,
CVPR25(24710-24721)
IEEE DOI Code:
WWW Link.
2508
Visualization, Codes, Attention mechanisms, Grounding,
Oral communication, Benchmark testing, Cognition, Decoding,
visual segmentation
BibRef
Qian, R.[Rui],
Yin, X.[Xin],
Dou, D.[Dejing],
Reasoning to Attend: Try to Understand How <SEG> Token Works,
CVPR25(24722-24731)
IEEE DOI Code:
WWW Link.
2508
Visualization, Vocabulary, Grounding, Computational modeling,
Large language models, Semantics, Cognition, Decoding,
large multimodal models
BibRef
Chen, Y.Y.[Yan-Yuan],
Xu, D.X.[De-Xuan],
Huang, Y.[Yu],
Zhan, S.K.[Song-Kun],
Wang, H.[Hanpin],
Chen, D.X.[Dong-Xue],
Wang, X.P.[Xue-Ping],
Qiu, M.K.[Mei-Kang],
Li, H.[Hang],
MIMO: A medical vision language model with visual referring
multimodal input and pixel grounding multimodal output,
CVPR25(24732-24741)
IEEE DOI Code:
WWW Link.
2508
Visualization, Grounding, Terminology, Large language models,
Computational modeling, Semantics,
medical visual question answering
BibRef
Huang, H.F.[Hai-Feng],
Chen, X.[Xinyi],
Chen, Y.L.[Yi-Lun],
Li, H.[Hao],
Han, X.[Xiaoshen],
Wang, Z.[Zehan],
Wang, T.[Tai],
Pang, J.M.[Jiang-Miao],
Zhao, Z.[Zhou],
RoboGround: Robotic Manipulation with Grounded Vision-Language Priors,
CVPR25(22540-22550)
IEEE DOI
2508
Codes, Grounding, Shape, Pipelines, robotic manipulation,
large vision-language models
BibRef
Man, Y.Z.[Yun-Ze],
Huang, D.A.[De-An],
Liu, G.L.[Gui-Lin],
Sheng, S.W.[Shi-Wei],
Liu, S.L.[Shi-Long],
Gui, L.Y.[Liang-Yan],
Kautz, J.[Jan],
Wang, Y.X.[Yu-Xiong],
Yu, Z.[Zhiding],
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought,
CVPR25(14268-14280)
IEEE DOI
2508
Visualization, Accuracy, Grounding, Large language models,
Benchmark testing, Cognition
BibRef
Yin, H.[Heng],
Ren, Y.Q.[Yu-Qiang],
Yan, K.[Ke],
Ding, S.H.[Shou-Hong],
Hao, Y.T.[Yong-Tao],
ROD-MLLM: Towards More Reliable Object Detection in Multimodal Large
Language Models,
CVPR25(14358-14368)
IEEE DOI
2508
Location awareness, Visualization, Grounding, Annotations,
Large language models, Pipelines, Training data, Object detection,
visual grounding
BibRef
Liao, Y.H.[Yuan-Hong],
Mahmood, R.[Rafid],
Fidler, S.[Sanja],
Acuna, D.[David],
Can Large Vision-Language Models Correct Semantic Grounding Errors By
Themselves?,
CVPR25(14667-14678)
IEEE DOI
2508
Training, Vocabulary, Accuracy, Grounding, Statistical analysis,
Semantics, Refining, Training data, Data models, Iterative methods,
self-correction
BibRef
Yuan, Z.H.[Zhi-Hao],
Peng, Y.[Yibo],
Ren, J.[Jinke],
Liao, Y.H.[Ying-Hong],
Han, Y.[Yatong],
Feng, C.M.[Chun-Mei],
Zhao, H.S.[Heng-Shuang],
Li, G.B.[Guan-Bin],
Cui, S.G.[Shu-Guang],
Li, Z.[Zhen],
Empowering Large Language Models with 3D Situation Awareness,
CVPR25(19435-19445)
IEEE DOI
2508
Grounding, Large language models, Manuals, Observers,
Data models, Trajectory, Videos, point cloud, vision and language
BibRef
Kang, S.[Seil],
Kim, J.[Jinyeong],
Kim, J.[Junhyeok],
Hwang, S.J.[Seong Jae],
Your Large Vision-Language Model Only Needs A Few Attention Heads For
Visual Grounding,
CVPR25(9339-9350)
IEEE DOI
2508
Location awareness, Visualization, Image segmentation, Head,
Attention mechanisms, Grounding, Semantics, Text to image,
large vision-language model
BibRef
Liu, Q.Y.[Qian-Yi],
Zhang, S.Q.[Si-Qi],
Qiao, Y.Y.[Yan-Yuan],
Zhu, J.Y.[Jun-You],
Li, X.[Xiang],
Guo, L.T.[Long-Teng],
Wang, Q.[Qunbo],
He, X.J.[Xing-Jian],
Wu, Q.[Qi],
Liu, J.[Jing],
GroundingMate: Aiding Object Grounding for Goal-Oriented
Vision-and-Language Navigation,
WACV25(1775-1784)
IEEE DOI
2505
Bridges, Navigation, Grounding, Large language models, Computational modeling,
Natural languages, Cognition, Data mining, Object recognition
BibRef
Yan, S.[Siming],
Bai, M.[Min],
Chen, W.F.[Wei-Feng],
Zhou, X.[Xiong],
Huang, Q.X.[Qi-Xing],
Li, L.E.[Li Erran],
ViGoR: Improving Visual Grounding of Large Vision Language Models with
Fine-grained Reward Modeling,
ECCV24(LXI: 37-53).
Springer DOI
2412
BibRef
Chowdhury, S.[Sanjoy],
Nag, S.[Sayan],
Dasgupta, S.[Subhrajyoti],
Chen, J.[Jun],
Elhoseiny, M.[Mohamed],
Gao, R.H.[Ruo-Han],
Manocha, D.[Dinesh],
Meerkat: Audio-visual Large Language Model for Grounding in Space and
Time,
ECCV24(LXIV: 52-70).
Springer DOI
2412
BibRef
Kuckreja, K.[Kartik],
Danish, M.S.[Muhammad Sohail],
Naseer, M.[Muzammal],
Das, A.[Abhijit],
Khan, S.[Salman],
Khan, F.S.[Fahad Shahbaz],
GeoChat: Grounded Large Vision-Language Model for Remote Sensing,
CVPR24(27831-27840)
IEEE DOI
2410
Visualization, Scene classification, Grounding, Oral communication,
Object detection, Benchmark testing, Data models
BibRef
Song, C.H.[Chan Hee],
Sadler, B.M.[Brian M.],
Wu, J.[Jiaman],
Chao, W.L.[Wei-Lun],
Washington, C.[Clayton],
Su, Y.[Yu],
LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with
Large Language Models,
ICCV23(2986-2997)
IEEE DOI
2401
BibRef
You, K.[Keen],
Zhang, H.T.[Hao-Tian],
Schoop, E.[Eldon],
Weers, F.[Floris],
Swearngin, A.[Amanda],
Nichols, J.[Jeffrey],
Yang, Y.F.[Yin-Fei],
Gan, Z.[Zhe],
FERRET-UI: Grounded Mobile UI Understanding with Multimodal LLMs,
ECCV24(LXIV: 240-255).
Springer DOI
2412
BibRef
Tong, S.B.[Sheng-Bang],
Liu, Z.[Zhuang],
Zhai, Y.X.[Yue-Xiang],
Ma, Y.[Yi],
LeCun, Y.[Yann],
Xie, S.[Saining],
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs,
CVPR24(9568-9578)
IEEE DOI
2410
Representation learning, Visualization, Systematics, Correlation,
Grounding, Large language models, Multimodal LLMs, Vision Language Model
BibRef
Xu, J.R.[Jia-Rui],
Zhou, X.Y.[Xing-Yi],
Yan, S.[Shen],
Gu, X.[Xiuye],
Arnab, A.[Anurag],
Sun, C.[Chen],
Wang, X.L.[Xiao-Long],
Schmid, C.[Cordelia],
Pixel Aligned Language Models,
CVPR24(13030-13039)
IEEE DOI
2410
Location awareness, Visualization, Grounding,
Large language models, Machine vision, Computational modeling
BibRef
Wu, P.H.[Peng-Hao],
Xie, S.[Saining],
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs,
CVPR24(13084-13094)
IEEE DOI
2410
Training, Visualization, Grounding, Computational modeling, Seals,
Benchmark testing, multimodal large language model,
visual search
BibRef
He, R.[Ruozhen],
Cascante-Bonilla, P.[Paola],
Yang, Z.Y.[Zi-Yan],
Berg, A.C.[Alexander C.],
Ordonez, V.[Vicente],
Improved Visual Grounding through Self-Consistent Explanations,
CVPR24(13095-13105)
IEEE DOI
2410
Location awareness, Visualization, Vocabulary, Grounding,
Large language models, Data augmentation, Data models,
visual grounding
BibRef
Feng, C.[Chun],
Hsu, J.[Joy],
Liu, W.Y.[Wei-Yu],
Wu, J.J.[Jia-Jun],
Naturally Supervised 3D Visual Grounding with Language-Regularized
Concept Learners,
CVPR24(13269-13278)
IEEE DOI
2410
Visualization, Solid modeling, Accuracy, Grounding,
Large language models, 3D visual grounding, Language constraints
BibRef
He, J.W.[Jun-Wen],
Wang, Y.F.[Yi-Fan],
Wang, L.J.[Li-Jun],
Lu, H.C.[Hu-Chuan],
He, J.Y.[Jun-Yan],
Lan, J.P.[Jin-Peng],
Luo, B.[Bin],
Xie, X.[Xuansong],
Multi-Modal Instruction Tuned LLMs with Fine-Grained Visual
Perception,
CVPR24(13980-13990)
IEEE DOI Code:
WWW Link.
2410
Image segmentation, Visualization, Technological innovation,
Grounding, Computational modeling, Large language models, Natural languages
BibRef
Yuan, Z.H.[Zhi-Hao],
Ren, J.[Jinke],
Feng, C.M.[Chun-Mei],
Zhao, H.S.[Heng-Shuang],
Cui, S.G.[Shu-Guang],
Li, Z.[Zhen],
Visual Programming for Zero-Shot Open-Vocabulary 3D Visual Grounding,
CVPR24(20623-20633)
IEEE DOI Code:
WWW Link.
2410
Visualization, Vocabulary, Grounding, Annotations, Navigation,
Large language models, Visual Grounding, Point Cloud, Vision and Language
BibRef
Chen, G.[Gongwei],
Shen, L.[Leyang],
Shao, R.[Rui],
Deng, X.[Xiang],
Nie, L.Q.[Li-Qiang],
LION: Empowering Multimodal Large Language Model with Dual-Level
Visual Knowledge,
CVPR24(26530-26540)
IEEE DOI
2410
Visualization, Accuracy, Grounding, Large language models, Semantics,
Benchmark testing
BibRef
Qu, M.X.[Meng-Xue],
Chen, X.D.[Xiao-Dong],
Liu, W.[Wu],
Li, A.[Alicia],
Zhao, Y.[Yao],
ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large
Language Models,
PVUW24(1847-1856)
IEEE DOI
2410
Grounding, Annotations, Large language models, Supervised learning,
Natural languages
BibRef
Zhang, Y.[Yichi],
Ma, Z.Q.[Zi-Qiao],
Gao, X.F.[Xiao-Feng],
Shakiah, S.[Suhaila],
Gao, Q.[Qiaozi],
Chai, J.[Joyce],
Groundhog: Grounding Large Language Models to Holistic Segmentation,
CVPR24(14227-14238)
IEEE DOI
2410
Training, Visualization, Grounding, Shape, Large language models,
Semantics, Feature extraction, Multi-Modal, Language Grounding,
Vision-Language Model
BibRef
Kim, K.[Kibum],
Yoon, K.[Kanghoon],
Jeon, J.[Jaehyeong],
In, Y.[Yeonjun],
Moon, J.[Jinyoung],
Kim, D.H.[Dong-Hyun],
Park, C.[Chanyoung],
LLM4SGG: Large Language Models for Weakly Supervised Scene Graph
Generation,
CVPR24(28306-28316)
IEEE DOI Code:
WWW Link.
2410
Training, Visualization, Grounding, Large language models, Semantics,
Genomics, Focusing, Scene Understanding, Large Language Model,
Long-Tail Problem
BibRef
Chapter on Implementations and Applications, Databases, QBIC, Video Analysis, Hardware and Software, Inspection continues in
Visual Grounding in Visual Question Answering.