20.4.3.3.6 Large Language Models for Vision, LLM, LVLM

Large Language Models. LLM. Visual Reasoning.
See also General Spatial Reasoning and Geometric Reasoning Issues, Visual Relations.

Hu, Z.J.[Zhong-Jian], Yang, P.[Peng], Jiang, Y.S.[Yuan-Shuang], Bai, Z.J.[Zi-Jian],
Prompting large language model with context and pre-answer for knowledge-based VQA,
PR(151), 2024, pp. 110399.
Elsevier DOI 2404
Visual question answering, Large language model, Knowledge-based VQA, Fine-tuning, In-context learning BibRef

Zhang, Z.C.[Zi-Cheng], Wu, H.N.[Hao-Ning], Zhang, E.[Erli], Zhai, G.T.[Guang-Tao], Lin, W.S.[Wei-Si],
Q-Bench+: A Benchmark for Multi-Modal Foundation Models on Low-Level Vision From Single Images to Pairs,
PAMI(46), No. 12, December 2024, pp. 10404-10418.
IEEE DOI 2411
Visualization, Benchmark testing, Task analysis, Natural languages, Visual perception, Large language models, perception BibRef

Zhao, Z.[Zihao], Wang, S.[Sheng], Gu, J.[Jinchen], Zhu, Y.[Yitao], Mei, L.[Lanzhuju], Zhuang, Z.X.[Zi-Xu], Cui, Z.M.[Zhi-Ming], Wang, Q.[Qian], Shen, D.G.[Ding-Gang],
ChatCAD+: Toward a Universal and Reliable Interactive CAD Using LLMs,
MedImg(43), No. 11, November 2024, pp. 3755-3766.
IEEE DOI 2411
Solid modeling, Reliability, Medical diagnostic imaging, Chatbots, Visualization, Brain modeling, Databases, Large language models, computer-assisted diagnosis BibRef

Luo, H.[Haonan], Zeng, Y.J.[Yi-Jie], Yang, L.[Li], Chen, K.[Kexun], Shen, Z.X.[Zhi-Xuan], Lv, F.[Fengmao],
VLAI: Exploration and Exploitation based on Visual-Language Aligned Information for Robotic Object Goal Navigation,
IVC(151), 2024, pp. 105259.
Elsevier DOI Code:
WWW Link. 2411
Object goal navigation, Visual-to-language, Embodied artificial intelligence, Large language model BibRef

Mansourian, A.[Ali], Oucheikh, R.[Rachid],
ChatGeoAI: Enabling Geospatial Analysis for Public through Natural Language with Large Language Models,
IJGI(13), No. 10, 2024, pp. 348.
DOI Link 2411
BibRef

Li, D.[Diya], Zhao, Y.[Yue], Wang, Z.F.[Zhi-Fang], Jung, C.[Calvin], Zhang, Z.[Zhe],
Large Language Model-Driven Structured Output: A Comprehensive Benchmark and Spatial Data Generation Framework,
IJGI(13), No. 11, 2024, pp. 405.
DOI Link 2412
BibRef

Li, Y.[Yunxin], Hu, B.[Baotian], Chen, X.Y.[Xin-Yu], Ma, L.[Lin], Xu, Y.[Yong], Zhang, M.[Min],
LMEye: An Interactive Perception Network for Large Language Models,
MultMed(26), 2024, pp. 10952-10964.
IEEE DOI 2412
Visualization, Task analysis, Data models, Tuning, Large language models, Training, Cognition, interactive perception network BibRef

Shao, R.[Run], Zhang, Z.Y.[Zhao-Yang], Tao, C.[Chao], Zhang, Y.S.[Yun-Sheng], Peng, C.L.[Chang-Le], Li, H.F.[Hai-Feng],
Homogeneous tokenizer matters: Homogeneous visual tokenizer for remote sensing image understanding,
PandRS(218), 2024, pp. 294-310.
Elsevier DOI Code:
WWW Link. 2412
Remote sensing image understanding, Visual tokenizer, Homogeneous, Semantically independent region, Visual transformer model BibRef

Liu, T.Q.[Tian-Qi], Qin, Y.J.[Yan-Jun], Zhang, S.H.[Shang-Hang], Tao, X.M.[Xiao-Ming],
Empowering Corner Case Detection in Autonomous Vehicles With Multimodal Large Language Models,
SPLetters(32), 2025, pp. 51-55.
IEEE DOI 2501
Rare objects in odd locations. Object detection, Visualization, Autonomous vehicles, Large language models, Roads, Vectors, Transformers, object detection BibRef

Liu, Y.[Yi], Hou, H.[Haowen], Ma, F.[Fei], Ni, S.G.[Shi-Guang], Yu, F.R.[Fei Richard],
MLLM-TA: Leveraging Multimodal Large Language Models for Precise Temporal Video Grounding,
SPLetters(32), 2025, pp. 281-285.
IEEE DOI 2501
Visualization, Grounding, Large language models, Feature extraction, Benchmark testing, Vectors, Training, video grounding BibRef

Wang, Z.H.[Zhe-Hui], Luo, T.[Tao], Liu, C.[Cheng], Liu, W.C.[Wei-Chen], Goh, R.S.M.[Rick Siow Mong], Wong, W.F.[Weng-Fai],
Enabling Energy-Efficient Deployment of Large Language Models on Memristor Crossbar: A Synergy of Large and Small,
PAMI(47), No. 2, February 2025, pp. 916-933.
IEEE DOI 2501
Memristors, Computer architecture, Random access memory, Nonvolatile memory, Computational modeling, Neural networks, non-volatile memory BibRef


Chu, X.X.[Xiang-Xiang], Su, J.L.[Jian-Lin], Zhang, B.[Bo], Shen, C.H.[Chun-Hua],
VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks,
ECCV24(LXVI: 1-18).
Springer DOI 2412
Code:
WWW Link. BibRef

Long, F.C.[Fu-Chen], Qiu, Z.F.[Zhao-Fan], Yao, T.[Ting], Mei, T.[Tao],
VideoStudio: Generating Consistent-content and Multi-scene Videos,
ECCV24(LX: 468-485).
Springer DOI 2412
Code:
WWW Link. BibRef

Liu, S.L.[Shi-Long], Cheng, H.[Hao], Liu, H.T.[Hao-Tian], Zhang, H.[Hao], Li, F.[Feng], Ren, T.[Tianhe], Zou, X.[Xueyan], Yang, J.W.[Jian-Wei], Su, H.[Hang], Zhu, J.[Jun], Zhang, L.[Lei], Gao, J.F.[Jian-Feng], Li, C.Y.[Chun-Yuan],
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents,
ECCV24(XLVII: 126-142).
Springer DOI 2412
BibRef

Kong, X.H.[Xiang-Hao], Chen, J.[Jinyu], Wang, W.G.[Wen-Guan], Su, H.[Hang], Hu, X.L.[Xiao-Lin], Yang, Y.[Yi], Liu, S.[Si],
Controllable Navigation Instruction Generation with Chain of Thought Prompting,
ECCV24(XXIX: 37-54).
Springer DOI 2412
Instruction generation. BibRef

Zhu, W.Y.C.[William Yi-Cheng], Ye, K.[Keren], Ke, J.J.[Jun-Jie], Yu, J.[Jiahui], Guibas, L.J.[Leonidas J.], Milanfar, P.[Peyman], Yang, F.[Feng],
ARTVLM: Attribute Recognition Through Vision-based Prefix Language Modeling,
ECCV24(XXVII: 127-145).
Springer DOI 2412
Code:
WWW Link. BibRef

Kim, D.[Donggyun], Cho, S.[Seongwoong], Kim, S.[Semin], Luo, C.[Chong], Hong, S.[Seunghoon],
Chameleon: A Data-efficient Generalist for Dense Visual Prediction in the Wild,
ECCV24(XXIII: 422-441).
Springer DOI 2412
Code:
WWW Link. BibRef

Ke, F.[Fucai], Cai, Z.X.[Zhi-Xi], Jahangard, S.[Simindokht], Wang, W.Q.[Wei-Qing], Haghighi, P.D.[Pari Delir], Rezatofighi, H.[Hamid],
Hydra: A Hyper Agent for Dynamic Compositional Visual Reasoning,
ECCV24(XX: 132-149).
Springer DOI 2412
BibRef

Bao, X.Y.[Xiao-Yi], Sun, S.Y.[Si-Yang], Ma, S.L.[Shuai-Lei], Zheng, K.C.[Ke-Cheng], Guo, Y.X.[Yu-Xin], Zhao, G.S.[Guo-Sheng], Zheng, Y.[Yun], Wang, X.G.[Xin-Gang],
Cores: Orchestrating the Dance of Reasoning and Segmentation,
ECCV24(XVIII: 187-204).
Springer DOI 2412
BibRef

Liu, Z.[Zuyan], Liu, B.[Benlin], Wang, J.[Jiahui], Dong, Y.H.[Yu-Hao], Chen, G.Y.[Guang-Yi], Rao, Y.M.[Yong-Ming], Krishna, R.[Ranjay], Lu, J.W.[Ji-Wen],
Efficient Inference of Vision Instruction-following Models with Elastic Cache,
ECCV24(XVII: 54-69).
Springer DOI 2412
Code:
WWW Link. BibRef

Alaluf, Y.[Yuval], Richardson, E.[Elad], Tulyakov, S.[Sergey], Aberman, K.[Kfir], Cohen-Or, D.[Daniel],
MYVLM: Personalizing VLMs for User-specific Queries,
ECCV24(XIII: 73-91).
Springer DOI 2412
BibRef

Cai, R.[Rizhao], Song, Z.[Zirui], Guan, D.[Dayan], Chen, Z.H.[Zhen-Hao], Li, Y.H.[Yao-Hang], Luo, X.[Xing], Yi, C.Y.[Chen-Yu], Kot, A.C.[Alex C.],
BenchLMM: Benchmarking Cross-Style Visual Capability of Large Multimodal Models,
ECCV24(L: 340-358).
Springer DOI 2412
BibRef

Ma, Z.X.[Zi-Xian], Huang, W.[Weikai], Zhang, J.[Jieyu], Gupta, T.[Tanmay], Krishna, R.[Ranjay],
m&m's: A Benchmark to Evaluate Tool-use for multi-step multi-modal Tasks,
ECCV24(X: 18-34).
Springer DOI 2412
Code:
WWW Link. and
WWW Link. BibRef

Miao, Y.[Yang], Engelmann, F.[Francis], Vysotska, O.[Olga], Zhao, Z.H.[Zhong-Han], Chai, W.H.[Wen-Hao], Wang, X.[Xuan], Li, B.[Boyi], Hao, S.Y.[Sheng-Yu], Cao, S.D.[Shi-Dong], Ye, T.[Tian], Wang, G.A.[Gao-Ang],
See and Think: Embodied Agent in Virtual Environment,
ECCV24(VIII: 187-204).
Springer DOI 2412
BibRef

Liu, Y.[Yuan], Duan, H.D.[Hao-Dong], Zhang, Y.[Yuanhan], Li, B.[Bo], Zhang, S.Y.[Song-Yang], Zhao, W.[Wangbo], Yuan, Y.[Yike], Wang, J.Q.[Jia-Qi], He, C.H.[Cong-Hui], Liu, Z.W.[Zi-Wei], Chen, K.[Kai], Lin, D.[Dahua],
MMBENCH: Is Your Multi-Modal Model an All-Around Player?,
ECCV24(VI: 216-233).
Springer DOI 2412
BibRef

Liu, Y.[Yang], Ding, P.X.[Peng-Xiang], Huang, S.[Siteng], Zhang, M.[Min], Zhao, H.[Han], Wang, D.L.[Dong-Lin],
PITE: Pixel-Temporal Alignment for Large Video-Language Model,
ECCV24(V: 160-176).
Springer DOI 2412
BibRef

Liu, S.[Shi], Zheng, K.[Kecheng], Chen, W.[Wei],
Paying More Attention to Image: A Training-free Method for Alleviating Hallucination in LVLMs,
ECCV24(LXXXIII: 125-140).
Springer DOI 2412
BibRef

Tu, H.Q.[Hao-Qin], Cui, C.[Chenhang], Wang, Z.J.[Zi-Jun], Zhou, Y.Y.[Yi-Yang], Zhao, B.C.[Bing-Chen], Han, J.L.[Jun-Lin], Zhou, W.[Wangchunshu], Yao, H.X.[Hua-Xiu], Xie, C.[Cihang],
How Many Are in This Image? A Safety Evaluation Benchmark for Vision LLMs,
ECCV24(LI: 37-55).
Springer DOI 2412
BibRef

Panagopoulou, A.[Artemis], Xue, L.[Le], Yu, N.[Ning], Li, J.[Junnan], Li, D.X.[Dong-Xu], Joty, S.[Shafiq], Xu, R.[Ran], Savarese, S.[Silvio], Xiong, C.M.[Cai-Ming], Niebles, J.C.[Juan Carlos],
X-instructblip: A Framework for Aligning Image, 3d, Audio, Video to LLMs and its Emergent Cross-modal Reasoning,
ECCV24(XLV: 177-197).
Springer DOI 2412
BibRef

Mirza, M.J.[M. Jehanzeb], Karlinsky, L.[Leonid], Lin, W.[Wei], Doveh, S.[Sivan], Micorek, J.[Jakub], Kozinski, M.[Mateusz], Kuehne, H.[Hilde], Possegger, H.[Horst],
Meta-prompting for Automating Zero-shot Visual Recognition with LLMs,
ECCV24(II: 370-387).
Springer DOI 2412
BibRef

Yu, E.[En], Zhao, L.[Liang], Wei, Y.[Yana], Yang, J.R.[Jin-Rong], Wu, D.M.[Dong-Ming], Kong, L.Y.[Ling-Yu], Wang, T.[Tiancai], Ge, Z.[Zheng], Zhang, X.Y.[Xiang-Yu], Tao, W.B.[Wen-Bing],
Merlin: Empowering Multimodal LLMs with Foresight Minds,
ECCV24(IV: 425-443).
Springer DOI 2412
BibRef

Liu, Z.Y.[Zhao-Yang], Lai, Z.[Zeqiang], Gao, Z.W.[Zhang-Wei], Cui, E.[Erfei], Li, Z.H.[Zi-Heng], Zhu, X.[Xizhou], Lu, L.W.[Le-Wei], Chen, Q.F.[Qi-Feng], Qiao, Y.[Yu], Dai, J.F.[Ji-Feng], Wang, W.H.[Wen-Hai],
ControlLLM: Augment Language Models with Tools by Searching on Graphs,
ECCV24(XII: 89-105).
Springer DOI 2412
BibRef

Yao, Y.[Yi], Hsu, C.F.[Chan-Feng], Lin, J.H.[Jhe-Hao], Xie, H.X.[Hong-Xia], Lin, T.[Terence], Huang, Y.N.[Yi-Ning], Shuai, H.H.[Hong-Han], Cheng, W.H.[Wen-Huang],
The Fabrication of Reality and Fantasy: Scene Generation with LLM-assisted Prompt Interpretation,
ECCV24(XXII: 422-438).
Springer DOI 2412
BibRef

Wu, Y.X.[Yi-Xuan], Wang, Y.Z.[Yi-Zhou], Tang, S.X.[Shi-Xiang], Wu, W.H.[Wen-Hao], He, T.[Tong], Ouyang, W.L.[Wan-Li], Torr, P.H.S.[Philip H.S.], Wu, J.[Jian],
Dettoolchain: A New Prompting Paradigm to Unleash Detection Ability of MLLM,
ECCV24(XXXII: 164-182).
Springer DOI 2412
BibRef

Song, K.[Kunpeng], Zhu, Y.Z.[Yi-Zhe], Liu, B.C.[Bing-Chen], Yan, Q.[Qing], Elgammal, A.[Ahmed], Yang, X.[Xiao],
MOMA: Multimodal LLM Adapter for Fast Personalized Image Generation,
ECCV24(XL: 117-132).
Springer DOI 2412
BibRef

Wang, H.[Han], Ye, Y.J.[Yong-Jie], Wang, Y.J.[Yan-Jie], Nie, Y.X.[Yu-Xiang], Huang, C.[Can],
Elysium: Exploring Object-level Perception in Videos via MLLM,
ECCV24(XXII: 166-185).
Springer DOI 2412
BibRef

Gou, Y.H.[Yun-Hao], Chen, K.[Kai], Liu, Z.[Zhili], Hong, L.Q.[Lan-Qing], Xu, H.[Hang], Li, Z.G.[Zhen-Guo], Yeung, D.Y.[Dit-Yan], Kwok, J.T.[James T.], Zhang, Y.[Yu],
Eyes Closed, Safety on: Protecting Multimodal LLMs via Image-to-text Transformation,
ECCV24(XVII: 388-404).
Springer DOI 2412
BibRef

Guo, Z.H.[Zong-Hao], Xu, R.[Ruyi], Yao, Y.[Yuan], Cui, J.[Junbo], Ni, Z.[Zanlin], Ge, C.J.[Chun-Jiang], Chua, T.S.[Tat-Seng], Liu, Z.Y.[Zhi-Yuan], Huang, G.[Gao],
LLAVA-UHD: An LMM Perceiving Any Aspect Ratio and High-resolution Images,
ECCV24(LXXXIII: 390-406).
Springer DOI 2412
BibRef

Wang, D.S.[Dong-Sheng], Cui, J.[Jiequan], Li, M.[Miaoge], Lin, W.[Wang], Chen, B.[Bo], Zhang, H.W.[Han-Wang],
Instruction Tuning-free Visual Token Complement for Multimodal LLMs,
ECCV24(LXXXI: 446-462).
Springer DOI 2412
BibRef

You, K.[Keen], Zhang, H.T.[Hao-Tian], Schoop, E.[Eldon], Weers, F.[Floris], Swearngin, A.[Amanda], Nichols, J.[Jeffrey], Yang, Y.F.[Yin-Fei], Gan, Z.[Zhe],
FERRET-UI: Grounded Mobile UI Understanding with Multimodal LLMs,
ECCV24(LXIV: 240-255).
Springer DOI 2412
BibRef

McKinzie, B.[Brandon], Gan, Z.[Zhe], Fauconnier, J.P.[Jean-Philippe], Dodge, S.[Sam], Zhang, B.[Bowen], Dufter, P.[Philipp], Shah, D.[Dhruti], Du, X.Z.[Xian-Zhi], Peng, F.[Futang], Belyi, A.[Anton], Zhang, H.T.[Hao-Tian], Singh, K.[Karanjeet], Kang, D.[Doug], Hè, H.Y.[Hong-Yu], Schwarzer, M.[Max], Gunter, T.[Tom], Kong, X.[Xiang], Zhang, A.[Aonan], Wang, J.Y.[Jian-Yu], Wang, C.[Chong], Du, N.[Nan], Lei, T.[Tao], Wiseman, S.[Sam], Lee, M.[Mark], Wang, Z.[Zirui], Pang, R.[Ruoming], Grasch, P.[Peter], Toshev, A.[Alexander], Yang, Y.F.[Yin-Fei],
MM1: Methods, Analysis and Insights from Multimodal LLM Pre-training,
ECCV24(XXIX: 304-323).
Springer DOI 2412
BibRef

Zhou, G.Z.[Geng-Ze], Hong, Y.C.[Yi-Cong], Wang, Z.[Zun], Wang, X.E.[Xin Eric], Wu, Q.[Qi],
NAVGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-language Models,
ECCV24(VII: 260-278).
Springer DOI 2412
BibRef

Wei, H.R.[Hao-Ran], Kong, L.Y.[Ling-Yu], Chen, J.Y.[Jin-Yue], Zhao, L.[Liang], Ge, Z.[Zheng], Yang, J.R.[Jin-Rong], Wang, T.[Tiancai], Zhang, X.Y.[Xiang-Yu], Tao, W.B.[Wen-Bing],
Vary: Scaling up the Vision Vocabulary for Large Vision-language Model,
ECCV24(IV: 408-424).
Springer DOI 2412
BibRef

Wang, Y.[Yu], Liu, X.G.[Xiao-Geng], Li, Y.[Yu], Chen, M.[Muhao], Xiao, C.W.[Chao-Wei],
Adashield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting,
ECCV24(XX: 77-94).
Springer DOI 2412
BibRef

He, S.T.[Shu-Ting], Ding, H.H.[Heng-Hui], Jiang, X.D.[Xu-Dong], Wen, B.[Bihan],
Segpoint: Segment Any Point Cloud via Large Language Model,
ECCV24(XXII: 349-367).
Springer DOI 2412
BibRef

Zhao, H.H.[Henry Hengyuan], Zhou, P.[Pan], Shou, M.Z.[Mike Zheng],
Genixer: Empowering Multimodal Large Language Model as a Powerful Data Generator,
ECCV24(XXIII: 129-147).
Springer DOI 2412
BibRef

Fu, X.Y.[Xing-Yu], Hu, Y.S.[Yu-Shi], Li, B.Z.[Bang-Zheng], Feng, Y.[Yu], Wang, H.Y.[Hao-Yu], Lin, X.D.[Xu-Dong], Roth, D.[Dan], Smith, N.A.[Noah A.], Ma, W.C.[Wei-Chiu], Krishna, R.[Ranjay],
Blink: Multimodal Large Language Models Can See but Not Perceive,
ECCV24(XXIII: 148-166).
Springer DOI 2412
BibRef

Zhang, Z.K.[Zhi-Kai], Li, Y.T.[Yi-Tang], Huang, H.F.[Hao-Feng], Lin, M.X.[Ming-Xian], Yi, L.[Li],
Freemotion: Mocap-free Human Motion Synthesis with Multimodal Large Language Models,
ECCV24(XXIII: 403-421).
Springer DOI 2412
BibRef

Murugesan, B.[Balamurali], Silva-Rodríguez, J.[Julio], Ben Ayed, I.[Ismail], Dolz, J.[Jose],
Robust Calibration of Large Vision-language Adapters,
ECCV24(XXIV: 147-165).
Springer DOI 2412
BibRef

Xu, R.[Runsen], Wang, X.L.[Xiao-Long], Wang, T.[Tai], Chen, Y.L.[Yi-Lun], Pang, J.M.[Jiang-Miao], Lin, D.[Dahua],
Pointllm: Empowering Large Language Models to Understand Point Clouds,
ECCV24(XXV: 131-147).
Springer DOI 2412
BibRef

Cai, K.W.[Kai-Wen], Duan, Z.K.[Zhe-Kai], Liu, G.[Gaowen], Fleming, C.[Charles], Lu, C.X.X.[Chris Xiao-Xuan],
Self-adapting Large Visual-language Models to Edge Devices Across Visual Modalities,
ECCV24(XXVIII: 301-318).
Springer DOI 2412
BibRef

Yu, R.[Runpeng], Yu, W.H.[Wei-Hao], Wang, X.C.[Xin-Chao],
Attention Prompting on Image for Large Vision-language Models,
ECCV24(XXX: 251-268).
Springer DOI 2412
BibRef

Luo, Y.L.[Yu-Lin], An, R.[Ruichuan], Zou, B.[Bocheng], Tang, Y.M.[Yi-Ming], Liu, J.[JiaMing], Zhang, S.H.[Shang-Hang],
Llm as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model,
ECCV24(XXXIII: 235-252).
Springer DOI 2412
BibRef

Pi, R.J.[Ren-Jie], Han, T.Y.[Tian-Yang], Xiong, W.[Wei], Zhang, J.P.[Ji-Peng], Liu, R.[Runtao], Pan, R.[Rui], Zhang, T.[Tong],
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization,
ECCV24(XXXIII: 382-398).
Springer DOI 2412
BibRef

Chen, Y.[Yuan], Ding, Z.H.[Zi-Han], Wang, Z.Q.[Zi-Qin], Wang, Y.[Yan], Zhang, L.J.[Li-Jun], Liu, S.[Si],
Asynchronous Large Language Model Enhanced Planner for Autonomous Driving,
ECCV24(XXXVI: 22-38).
Springer DOI 2412
BibRef

Huang, Z.J.[Zhi-Jian], Tang, T.[Tao], Chen, S.X.[Shao-Xiang], Lin, S.[Sihao], Jie, Z.Q.[Ze-Qun], Ma, L.[Lin], Wang, G.[Guangrun], Liang, X.D.[Xiao-Dan],
Making Large Language Models Better Planners with Reasoning-decision Alignment,
ECCV24(XXXVI: 73-90).
Springer DOI 2412
BibRef

Xia, B.[Bin], Wang, S.Y.[Shi-Yin], Tao, Y.[Yingfan], Wang, Y.T.[Yi-Tong], Jia, J.Y.[Jia-Ya],
Llmga: Multimodal Large Language Model Based Generation Assistant,
ECCV24(XXXVIII: 389-406).
Springer DOI 2412
BibRef

Zhan, Y.F.[Yu-Fei], Zhu, Y.[Yousong], Chen, Z.Y.[Zhi-Yang], Yang, F.[Fan], Tang, M.[Ming], Wang, J.Q.[Jin-Qiao],
Griffon: Spelling Out All Object Locations at Any Granularity with Large Language Models,
ECCV24(XLII: 405-422).
Springer DOI 2412
BibRef

Li, Y.W.[Yan-Wei], Wang, C.Y.[Cheng-Yao], Jia, J.Y.[Jia-Ya],
Llama-vid: An Image is Worth 2 Tokens in Large Language Models,
ECCV24(XLVI: 323-340).
Springer DOI 2412
BibRef

Ju, C.[Chen], Wang, H.[Haicheng], Cheng, H.Z.[Hao-Zhe], Chen, X.[Xu], Zhai, Z.H.[Zhong-Hua], Huang, W.L.[Wei-Lin], Lan, J.S.[Jin-Song], Xiao, S.[Shuai], Zheng, B.[Bo],
Turbo: Informativity-driven Acceleration Plug-in for Vision-language Large Models,
ECCV24(XLVI: 436-455).
Springer DOI 2412
BibRef

Zhao, Q.[Qinyu], Xu, M.[Ming], Gupta, K.[Kartik], Asthana, A.[Akshay], Zheng, L.[Liang], Gould, S.[Stephen],
The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-language Models?,
ECCV24(XLVIII: 127-142).
Springer DOI 2412
BibRef

Lee, B.K.[Byung-Kwan], Park, B.[Beomchan], Kim, C.W.[Chae Won], Ro, Y.M.[Yong Man],
Moai: Mixture of All Intelligence for Large Language and Vision Models,
ECCV24(XLIX: 273-302).
Springer DOI 2412
BibRef

Liu, X.[Xin], Zhu, Y.C.[Yi-Chen], Gu, J.D.[Jin-Dong], Lan, Y.[Yunshi], Yang, C.[Chao], Qiao, Y.[Yu],
MM-Safetybench: A Benchmark for Safety Evaluation of Multimodal Large Language Models,
ECCV24(LVI: 386-403).
Springer DOI 2412
BibRef

Liu, R.[Ruyang], Li, C.[Chen], Tang, H.R.[Hao-Ran], Ge, Y.X.[Yi-Xiao], Shan, Y.[Ying], Li, G.[Ge],
ST-LLM: Large Language Models Are Effective Temporal Learners,
ECCV24(LVII: 1-18).
Springer DOI 2412
BibRef

Cheng, H.[Hao], Xiao, E.[Erjia], Gu, J.D.[Jin-Dong], Yang, L.[Le], Duan, J.[Jinhao], Zhang, J.[Jize], Cao, J.H.[Jia-Hang], Xu, K.D.[Kai-Di], Xu, R.[Renjing],
Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-language Models,
ECCV24(LIX: 179-196).
Springer DOI 2412
BibRef

Lin, Z.[Ziyi], Liu, D.Y.[Dong-Yang], Zhang, R.R.[Ren-Rui], Gao, P.[Peng], Qiu, L.[Longtian], Xiao, H.[Han], Qiu, H.[Han], Shao, W.Q.[Wen-Qi], Chen, K.Q.[Ke-Qin], Han, J.[JiaMing], Huang, S.Y.[Si-Yuan], Zhang, Y.[Yichi], He, X.M.[Xu-Ming], Qiao, Y.[Yu], Li, H.S.[Hong-Sheng],
Sphinx: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-modal Large Language Models,
ECCV24(LXII: 36-55).
Springer DOI 2412
BibRef

Chiquier, M.[Mia], Mall, U.[Utkarsh], Vondrick, C.[Carl],
Evolving Interpretable Visual Classifiers with Large Language Models,
ECCV24(LXIV: 183-201).
Springer DOI 2412
BibRef

Zhang, J.[Jinrui], Wang, T.[Teng], Zhang, H.G.[Hai-Gang], Lu, P.[Ping], Zheng, F.[Feng],
Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-language Models,
ECCV24(LXVIII: 196-213).
Springer DOI 2412
BibRef

Li, Y.F.[Yi-Fan], Guo, H.[Hangyu], Zhou, K.[Kun], Zhao, W.X.[Wayne Xin], Wen, J.R.[Ji-Rong],
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models,
ECCV24(LXXIII: 174-189).
Springer DOI 2412
BibRef

Wu, T.[Tianhe], Ma, K.[Kede], Liang, J.[Jie], Yang, Y.[Yujiu], Zhang, L.[Lei],
A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment,
ECCV24(LXXIV: 143-160).
Springer DOI 2412
BibRef

Muhtar, D.[Dilxat], Li, Z.[Zhenshi], Gu, F.[Feng], Zhang, X.L.[Xue-Liang], Xiao, P.F.[Peng-Feng],
Lhrs-bot: Empowering Remote Sensing with Vgi-enhanced Large Multimodal Language Model,
ECCV24(LXXIV: 440-457).
Springer DOI 2412
BibRef

Chen, L.[Liang], Zhao, H.Z.[Hao-Zhe], Liu, T.Y.[Tian-Yu], Bai, S.[Shuai], Lin, J.Y.[Jun-Yang], Zhou, C.[Chang], Chang, B.[Baobao],
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-play Inference Acceleration for Large Vision-language Models,
ECCV24(LXXXI: 19-35).
Springer DOI 2412
BibRef

Yang, Y.C.[Yu-Chen], Lee, K.[Kwonjoon], Dariush, B.[Behzad], Cao, Y.[Yinzhi], Lo, S.Y.[Shao-Yuan],
Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models,
ECCV24(LXXXI: 304-322).
Springer DOI 2412
BibRef

Chen, Y.C.[Yi-Chia], Li, W.H.[Wei-Hua], Sun, C.[Cheng], Wang, Y.C.F.[Yu-Chiang Frank], Chen, C.S.[Chu-Song],
Sam4mllm: Enhance Multi-modal Large Language Model for Referring Expression Segmentation,
ECCV24(LXXXI: 323-340).
Springer DOI 2412
BibRef

Zheng, S.[Sipeng], Zhou, B.[Bohan], Feng, Y.C.[Yi-Cheng], Wang, Y.[Ye], Lu, Z.Q.[Zong-Qing],
Unicode: Learning a Unified Codebook for Multimodal Large Language Models,
ECCV24(VIII: 426-443).
Springer DOI 2412
BibRef

Shi, B.F.[Bai-Feng], Wu, Z.Y.[Zi-Yang], Mao, M.L.[Mao-Lin], Wang, X.[Xin], Darrell, T.J.[Trevor J.],
When Do We Not Need Larger Vision Models?,
ECCV24(VIII: 444-462).
Springer DOI 2412
BibRef

Sun, G.H.[Guo-Hao], Qin, C.[Can], Wang, J.[Jiamian], Chen, Z.[Zeyuan], Xu, R.[Ran], Tao, Z.Q.[Zhi-Qiang],
SQ-LLAVA: Self-questioning for Large Vision-language Assistant,
ECCV24(IX: 156-172).
Springer DOI 2412
BibRef

Ye, Q.[Qilang], Yu, Z.T.[Zi-Tong], Shao, R.[Rui], Xie, X.Y.[Xin-Yu], Torr, P.H.S.[Philip H.S.], Cao, X.C.[Xiao-Chun],
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-visual Scenarios,
ECCV24(X: 146-164).
Springer DOI 2412
BibRef

Yu, Q.H.[Qi-Hang], Shen, X.H.[Xiao-Hui], Chen, L.C.[Liang-Chieh],
Towards Open-ended Visual Recognition with Large Language Models,
ECCV24(XIV: 359-376).
Springer DOI 2412
BibRef

Yan, C.[Cilin], Wang, H.C.[Hao-Chen], Yan, S.L.[Shi-Lin], Jiang, X.L.[Xiao-Long], Hu, Y.[Yao], Kang, G.L.[Guo-Liang], Xie, W.[Weidi], Gavves, E.[Efstratios],
VISA: Reasoning Video Object Segmentation via Large Language Models,
ECCV24(XV: 98-115).
Springer DOI 2412
BibRef

Huang, K.[Kai], Zou, H.[Hao], Xi, Y.[Ye], Wang, B.[BoChen], Xie, Z.[Zhen], Yu, L.[Liang],
IVTP: Instruction-guided Visual Token Pruning for Large Vision-language Models,
ECCV24(XVII: 214-230).
Springer DOI 2412
BibRef

Liu, H.T.[Hao-Tian], Li, C.Y.[Chun-Yuan], Li, Y.H.[Yu-Heng], Lee, Y.J.[Yong Jae],
Improved Baselines with Visual Instruction Tuning,
CVPR24(26286-26296)
IEEE DOI 2410
Training, Connectors, Visualization, Systematics, Codes, Computational modeling BibRef

Ren, Z.W.[Zhong-Wei], Huang, Z.C.[Zhi-Cheng], Wei, Y.C.[Yun-Chao], Zhao, Y.[Yao], Fu, D.M.[Dong-Mei], Feng, J.S.[Jia-Shi], Jin, X.J.[Xiao-Jie],
PixelLM: Pixel Reasoning with Large Multimodal Model,
CVPR24(26364-26373)
IEEE DOI 2410
Bridges, Image segmentation, Codes, Benchmark testing, Cognition, Decoding BibRef

Hu, Y.[Yutao], Li, T.[Tianbin], Lu, Q.[Quanfeng], Shao, W.Q.[Wen-Qi], He, J.J.[Jun-Jun], Qiao, Y.[Yu], Luo, P.[Ping],
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM,
CVPR24(22170-22183)
IEEE DOI Code:
WWW Link. 2410
Reflectivity, Visualization, Biological system modeling, Computational modeling, Medical services, Benchmark testing BibRef

Schiappa, M.[Madeline], Abdullah, R.[Raiyaan], Azad, S.[Shehreen], Claypoole, J.[Jared], Cogswell, M.[Michael], Divakaran, A.[Ajay], Rawat, Y.[Yogesh],
Probing Conceptual Understanding of Large Visual-Language Models,
WhatNext24(1797-1807)
IEEE DOI Code:
WWW Link. 2410
Training, Visualization, Shape, Snow, Color, Benchmark testing, Transformers, Robustness, Conceptual understanding BibRef

Yue, T.T.[Tong-Tian], Cheng, J.[Jie], Guo, L.T.[Long-Teng], Dai, X.Y.[Xing-Yuan], Zhao, Z.[Zijia], He, X.J.[Xing-Jian], Xiong, G.[Gang], Lv, Y.S.[Yi-Sheng], Liu, J.[Jing],
SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models,
CVPR24(13073-13083)
IEEE DOI Code:
WWW Link. 2410
Training, Codes, Computational modeling, Focusing, Benchmark testing BibRef

Wu, T.H.[Tsung-Han], Lian, L.[Long], Gonzalez, J.E.[Joseph E.], Li, B.[Boyi], Darrell, T.J.[Trevor J.],
Self-Correcting LLM-Controlled Diffusion Models,
CVPR24(6327-6336)
IEEE DOI Code:
WWW Link. 2410
Image synthesis, Pipelines, Text to image, Process control, Detectors, Superluminescent diodes, Diffusion models BibRef

Yue, X.[Xiang], Ni, Y.S.[Yuan-Sheng], Zheng, T.Y.[Tian-Yu], Zhang, K.[Kai], Liu, R.[Ruoqi], Zhang, G.[Ge], Stevens, S.[Samuel], Jiang, D.[Dongfu], Ren, W.M.[Wei-Ming], Sun, Y.X.[Yu-Xuan], Wei, C.[Cong], Yu, B.T.[Bo-Tao], Yuan, R.B.[Rui-Bin], Sun, R.L.[Ren-Liang], Yin, M.[Ming], Zheng, B.[Boyuan], Yang, Z.Z.[Zhen-Zhu], Liu, Y.[Yibo], Huang, W.H.[Wen-Hao], Sun, H.[Huan], Su, Y.[Yu], Chen, W.[Wenhu],
MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI,
CVPR24(9556-9567)
IEEE DOI 2410
Computational modeling, Artificial general intelligence, Social sciences, Manuals, Benchmark testing, Cognition, LLMs BibRef

Li, Z.[Zhuowan], Jasani, B.[Bhavan], Tang, P.[Peng], Ghadar, S.[Shabnam],
Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA,
CVPR24(13613-13623)
IEEE DOI 2410
Training, Visualization, Technological innovation, Accuracy, Computational modeling, Training data, Data augmentation BibRef

Zheng, D.[Duo], Huang, S.[Shijia], Zhao, L.[Lin], Zhong, Y.[Yiwu], Wang, L.W.[Li-Wei],
Towards Learning a Generalist Model for Embodied Navigation,
CVPR24(13624-13634)
IEEE DOI Code:
WWW Link. 2410
Training, Adaptation models, Solid modeling, Navigation, Soft sensors, Computational modeling, Visual-Language Navigation, LLM BibRef

Singh, S.[Simranjit], Fore, M.[Michael], Stamoulis, D.[Dimitrios],
GeoLLM-Engine: A Realistic Environment for Building Geospatial Copilots,
EarthVision24(585-594)
IEEE DOI 2410
Earth, Geology, Natural languages, Benchmark testing, Parallel processing, Geospatial analysis, Satellite images, Benchmark BibRef

Li, X.C.[Xu-Chen], Feng, X.K.[Xiao-Kun], Hu, S.Y.[Shi-Yu], Wu, M.[Meiqi], Zhang, D.L.[Dai-Ling], Zhang, J.[Jing], Huang, K.Q.[Kai-Qi],
DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based on LLM,
VDU24(7283-7292)
IEEE DOI 2410
Visualization, Annotations, Semantics, Natural languages, Benchmark testing BibRef

Zhang, Y.C.[Yue-Chen], Qian, S.J.[Sheng-Ju], Peng, B.[Bohao], Liu, S.[Shu], Jia, J.Y.[Jia-Ya],
Prompt Highlighter: Interactive Control for Multi-Modal LLMs,
CVPR24(13215-13224)
IEEE DOI 2410
Training, Semantics, Process control, Focusing, Reliability, Usability, VLM, LLM, Interactive Control, Image Caption, Training-Free BibRef

Kaul, P.[Prannay], Li, Z.Z.[Zhi-Zhong], Yang, H.[Hao], Dukler, Y.[Yonatan], Swaminathan, A.[Ashwin], Taylor, C.J., Soatto, S.[Stefano],
THRONE: An Object-Based Hallucination Benchmark for the Free-Form Generations of Large Vision-Language Models,
CVPR24(27218-27228)
IEEE DOI 2410
Measurement, Training, Ethics, Accuracy, Computational modeling, Graphics processing units, hallucination, benchmark, LLM, LVLM, large vision-language model BibRef

Özdemir, Ö.[Övgü], Akagündüz, E.[Erdem],
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts,
Prompting24(1562-1571)
IEEE DOI Code:
WWW Link. 2410
Visualization, Computational modeling, Large language models, Pipelines, Semantics, Question answering (information retrieval), image captioning BibRef

Shao, Z.W.[Zhen-Wei], Yu, Z.[Zhou], Wang, M.[Meng], Yu, J.[Jun],
Prompting Large Language Models with Answer Heuristics for Knowledge-Based Visual Question Answering,
CVPR23(14974-14983)
IEEE DOI 2309
BibRef

Wang, D.K.[Dong-Kai], Xuan, S.Y.[Shi-Yu], Zhang, S.L.[Shi-Liang],
LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model,
CVPR24(614-623)
IEEE DOI Code:
WWW Link. 2410
Location awareness, Training, Large language models, Pipelines, Training data, Cognition, Keypoint Localization, Large Language Model BibRef

Liu, H.[Hanchao], Zhan, X.H.[Xiao-Hang], Huang, S.L.[Shao-Li], Mu, T.J.[Tai-Jiang], Shan, Y.[Ying],
Programmable Motion Generation for Open-Set Motion Control Tasks,
CVPR24(1399-1408)
IEEE DOI 2410
Motion planning, Large language models, Computational modeling, Semantics, Dynamics, Training data BibRef

Zhu, L.[Lanyun], Chen, T.R.[Tian-Run], Ji, D.[Deyi], Ye, J.P.[Jie-Ping], Liu, J.[Jun],
LLaFS: When Large Language Models Meet Few-Shot Segmentation,
CVPR24(3065-3075)
IEEE DOI 2410
Training, Image segmentation, Visualization, Large language models, Natural language processing, Large vision-language models BibRef

Wu, J.F.[Jun-Feng], Jiang, Y.[Yi], Liu, Q.H.[Qi-Hao], Yuan, Z.H.[Ze-Huan], Bai, X.[Xiang], Bai, S.[Song],
General Object Foundation Model for Images and Videos at Scale,
CVPR24(3783-3795)
IEEE DOI Code:
WWW Link. 2410
Training, Visualization, Image segmentation, Grounding, Soft sensors, Large language models BibRef

Xia, Z.F.[Zhuo-Fan], Han, D.C.[Dong-Chen], Han, Y.Z.[Yi-Zeng], Pan, X.[Xuran], Song, S.[Shiji], Huang, G.[Gao],
GSVA: Generalized Segmentation via Multimodal Large Language Models,
CVPR24(3858-3869)
IEEE DOI Code:
WWW Link. 2410
Image segmentation, Visualization, Codes, Large language models, Benchmark testing BibRef

Zhao, L.[Lirui], Yang, Y.[Yue], Zhang, K.[Kaipeng], Shao, W.Q.[Wen-Qi], Zhang, Y.X.[Yu-Xin], Qiao, Y.[Yu], Luo, P.[Ping], Ji, R.R.[Rong-Rong],
DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model,
CVPR24(6390-6399)
IEEE DOI Code:
WWW Link. 2410
Training, Technological innovation, Accuracy, Codes, Large language models, Computational modeling, LLM Agent, LLM Tool Usage BibRef

Yao, J.[Junyi], Liu, Y.J.[Yi-Jiang], Dong, Z.[Zhen], Guo, M.F.[Ming-Fei], Hu, H.[Helan], Keutzer, K.[Kurt], Du, L.[Li], Zhou, D.[Daquan], Zhang, S.H.[Shang-Hang],
PromptCoT: Align Prompt Distribution via Adapted Chain-of-Thought,
CVPR24(7027-7037)
IEEE DOI 2410
Training, Adaptation models, Visualization, Computational modeling, Large language models, Semantics, Text to image BibRef

Cai, Z.P.[Zhi-Peng], Mueller, M.[Matthias], Birkl, R.[Reiner], Wofk, D.[Diana], Tseng, S.Y.[Shao-Yen], Cheng, J.[Junda], Stan, G.B.M.[Gabriela Ben-Melech], Lai, V.[Vasudev], Paulitsch, M.[Michael],
L-MAGIC: Language Model Assisted Generation of Images with Coherence,
CVPR24(7049-7058)
IEEE DOI Code:
WWW Link. 2410
Point cloud compression, Solid modeling, Layout, Superresolution, Estimation, Diffusion models, Image generation, large language models BibRef

Li, Y.[Yanyu], Liu, X.[Xian], Kag, A.[Anil], Hu, J.[Ju], Idelbayev, Y.[Yerlan], Sagar, D.[Dhritiman], Wang, Y.Z.[Yan-Zhi], Tulyakov, S.[Sergey], Ren, J.[Jian],
TextCraftor: Your Text Encoder can be Image Quality Controller,
CVPR24(7985-7995)
IEEE DOI 2410
Training, Measurement, Interpolation, Image synthesis, Large language models, Pipelines, Text to image, Stable Diffusion, Image and video synthesis and generation BibRef

Argaw, D.M.[Dawit Mureja], Yoon, S.H.[Seung-Hyun], Heilbron, F.C.[Fabian Caba], Deilamsalehy, H.[Hanieh], Bui, T.[Trung], Wang, Z.W.[Zhao-Wen], Dernoncourt, F.[Franck], Chung, J.S.[Joon Son],
Scaling Up Video Summarization Pretraining with Large Language Models,
CVPR24(8332-8341)
IEEE DOI 2410
Analytical models, Large language models, Computational modeling, Pipelines, Benchmark testing BibRef

Tong, S.[Shengbang], Liu, Z.[Zhuang], Zhai, Y.X.[Yue-Xiang], Ma, Y.[Yi], LeCun, Y.[Yann], Xie, S.[Saining],
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs,
CVPR24(9568-9578)
IEEE DOI 2410
Representation learning, Visualization, Systematics, Correlation, Grounding, Large language models, Multimodal LLMs, Vision Language Model BibRef

Lai, X.[Xin], Tian, Z.[Zhuotao], Chen, Y.[Yukang], Li, Y.W.[Yan-Wei], Yuan, Y.H.[Yu-Hui], Liu, S.[Shu], Jia, J.Y.[Jia-Ya],
LISA: Reasoning Segmentation via Large Language Model,
CVPR24(9579-9589)
IEEE DOI 2410
Image segmentation, Vocabulary, Visualization, Target recognition, Large language models, Benchmark testing BibRef

Shang, C.[Chenming], Zhou, S.[Shiji], Zhang, H.[Hengyuan], Ni, X.Z.[Xin-Zhe], Yang, Y.[Yujiu], Wang, Y.[Yuwang],
Incremental Residual Concept Bottleneck Models,
CVPR24(11030-11040)
IEEE DOI 2410
Measurement, Visualization, Accuracy, Large language models, Current measurement, Decision making, Closed box BibRef

Xie, Y.T.[Yu-Tong], Chen, Q.[Qi], Wang, S.[Sinuo], To, M.S.[Minh-Son], Lee, I.[Iris], Khoo, E.W.[Ee Win], Hendy, K.[Kerolos], Koh, D.[Daniel], Xia, Y.[Yong], Wu, Q.[Qi],
PairAug: What Can Augmented Image-Text Pairs Do for Radiology?,
CVPR24(11652-11661)
IEEE DOI Code:
WWW Link. 2410
Data privacy, Medical conditions, Large language models, Radiology, Data augmentation BibRef

Dong, Z.K.[Zhi-Kang], Liu, X.[Xiulong], Chen, B.[Bin], Polak, P.[Pawel], Zhang, P.[Peng],
MuseChat: A Conversational Music Recommendation System for Videos,
CVPR24(12775-12785)
IEEE DOI Code:
WWW Link. 2410
Accuracy, Large language models, Natural languages, Cognition, Recommender systems, Multimodal Learning, Music Information Retrieval BibRef

Li, F.[Feng], Jiang, Q.[Qing], Zhang, H.[Hao], Ren, T.[Tianhe], Liu, S.[Shilong], Zou, X.[Xueyan], Xu, H.Z.[Huai-Zhe], Li, H.Y.[Hong-Yang], Yang, J.W.[Jian-Wei], Li, C.Y.[Chun-Yuan], Zhang, L.[Lei], Gao, J.F.[Jian-Feng],
Visual in-Context Prompting,
CVPR24(12861-12871)
IEEE DOI Code:
WWW Link. 2410
Training, Visualization, Image segmentation, Codes, Large language models, Computer architecture BibRef

Sachdeva, R.[Ragav], Zisserman, A.[Andrew],
The Manga Whisperer: Automatically Generating Transcriptions for Comics,
CVPR24(12967-12976)
IEEE DOI Code:
WWW Link. 2410
Visualization, Codes, Large language models, Visual impairment, Oral communication, Linguistics BibRef

Ranasinghe, K.[Kanchana], Shukla, S.N.[Satya Narayan], Poursaeed, O.[Omid], Ryoo, M.S.[Michael S.], Lin, T.Y.[Tsung-Yu],
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs,
CVPR24(12977-12987)
IEEE DOI 2410
Training, Location awareness, Visualization, Image coding, Large language models, Pipelines, Cognition, LLM, VQA, Localization, Video BibRef

Xu, J.R.[Jia-Rui], Zhou, X.Y.[Xing-Yi], Yan, S.[Shen], Gu, X.[Xiuye], Arnab, A.[Anurag], Sun, C.[Chen], Wang, X.L.[Xiao-Long], Schmid, C.[Cordelia],
Pixel Aligned Language Models,
CVPR24(13030-13039)
IEEE DOI 2410
Location awareness, Visualization, Grounding, Large language models, Machine vision, Computational modeling BibRef

Ye, Q.H.[Qing-Hao], Xu, H.Y.[Hai-Yang], Ye, J.[Jiabo], Yan, M.[Ming], Hu, A.[Anwen], Liu, H.[Haowei], Qian, Q.[Qi], Zhang, J.[Ji], Huang, F.[Fei],
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration,
CVPR24(13040-13051)
IEEE DOI 2410
Large language models, Computational modeling, Collaboration, Cognition, Decoding, Vision Language BibRef

Qi, P.[Peng], Yan, Z.[Zehong], Hsu, W.[Wynne], Lee, M.L.[Mong Li],
Sniffer: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection,
CVPR24(13052-13062)
IEEE DOI 2410
Visualization, Adaptation models, Accuracy, Large language models, Computational modeling, Data models, multimodal misinformation, explainability BibRef

Wu, P.H.[Peng-Hao], Xie, S.[Saining],
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs,
CVPR24(13084-13094)
IEEE DOI 2410
Training, Visualization, Grounding, Computational modeling, Seals, Benchmark testing, multimodal large language model, visual search BibRef

He, R.[Ruozhen], Cascante-Bonilla, P.[Paola], Yang, Z.Y.[Zi-Yan], Berg, A.C.[Alexander C.], Ordonez, V.[Vicente],
Improved Visual Grounding through Self-Consistent Explanations,
CVPR24(13095-13105)
IEEE DOI 2410
Location awareness, Visualization, Vocabulary, Grounding, Large language models, Data augmentation, Data models, visual grounding BibRef

Zhong, S.S.[Shan-Shan], Huang, Z.Z.[Zhong-Zhan], Gao, S.[Shanghua], Wen, W.[Wushao], Lin, L.[Liang], Zitnik, M.[Marinka], Zhou, P.[Pan],
Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation,
CVPR24(13246-13257)
IEEE DOI Code:
WWW Link. 2410
Technological innovation, Codes, Large language models, Games, Cognition BibRef

Gao, Z.[Zhi], Du, Y.T.[Yun-Tao], Zhang, X.T.[Xin-Tong], Ma, X.J.[Xiao-Jian], Han, W.J.[Wen-Juan], Zhu, S.C.[Song-Chun], Li, Q.[Qing],
CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update,
CVPR24(13258-13268)
IEEE DOI 2410
Continuing education, Visualization, Limiting, Large language models, Training data, Tagging, Reflection, Compositional Reasoning BibRef

Feng, C.[Chun], Hsu, J.[Joy], Liu, W.Y.[Wei-Yu], Wu, J.J.[Jia-Jun],
Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners,
CVPR24(13269-13278)
IEEE DOI 2410
Visualization, Solid modeling, Accuracy, Grounding, Large language models, 3D visual grounding, Language constraints BibRef

Li, B.[Bohao], Ge, Y.Y.[Yu-Ying], Ge, Y.X.[Yi-Xiao], Wang, G.Z.[Guang-Zhi], Wang, R.[Rui], Zhang, R.M.[Rui-Mao], Shan, Y.[Ying],
SEED-Bench: Benchmarking Multimodal Large Language Models,
CVPR24(13299-13308)
IEEE DOI Code:
WWW Link. 2410
Accuracy, Codes, Annotations, Image synthesis, Large language models, Computational modeling, Benchmark, Multimodal, Hierarchical BibRef

Buettner, K.[Kyle], Malakouti, S.[Sina], Li, X.L.[Xiang Lorraine], Kovashka, A.[Adriana],
Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition,
CVPR24(13515-13524)
IEEE DOI 2410
Geography, Training, Large language models, Training data, Europe, Robustness BibRef

Tan, R.[Reuben], Sun, X.[Ximeng], Hu, P.[Ping], Wang, J.H.[Jui-Hsien], Deilamsalehy, H.[Hanieh], Plummer, B.A.[Bryan A.], Russell, B.[Bryan], Saenko, K.[Kate],
Koala: Key Frame-Conditioned Long Video-LLM,
CVPR24(13581-13591)
IEEE DOI 2410
Visualization, Accuracy, Large language models, Computational modeling, Benchmark testing, Question answering (information retrieval) BibRef

Liu, R.[Ruyang], Li, C.[Chen], Ge, Y.X.[Yi-Xiao], Li, T.H.[Thomas H.], Shan, Y.[Ying], Li, G.[Ge],
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning,
CVPR24(13658-13667)
IEEE DOI Code:
WWW Link. 2410
Training, Adaptation models, Visualization, Costs, Computational modeling, Graphics processing units, Video Large Language Models BibRef

Ding, X.P.[Xin-Peng], Han, J.H.[Jian-Hua], Xu, H.[Hang], Liang, X.D.[Xiao-Dan], Zhang, W.[Wei], Li, X.M.[Xiao-Meng],
Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models,
CVPR24(13668-13677)
IEEE DOI Code:
WWW Link. 2410
Bridges, Large language models, Semantics, Autonomous vehicles BibRef

Li, J.X.[Jia-Xuan], Vo, D.M.[Duc Minh], Sugimoto, A.[Akihiro], Nakayama, H.[Hideki],
EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension,
CVPR24(13733-13742)
IEEE DOI 2410
Training, Visualization, Adaptation models, Costs, Large language models, Memory management, Image Captioning, External Memory BibRef

Song, L.[Lin], Chen, Y.[Yukang], Yang, S.[Shuai], Ding, X.H.[Xiao-Han], Ge, Y.X.[Yi-Xiao], Chen, Y.C.[Ying-Cong], Shan, Y.[Ying],
Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs,
CVPR24(13763-13773)
IEEE DOI 2410
Training, Attention mechanisms, Computational modeling, Large language models, Benchmark testing, Natural language processing BibRef

Guo, Q.[Qiushan], de Mello, S.[Shalini], Yin, H.X.[Hong-Xu], Byeon, W.[Wonmin], Cheung, K.C.[Ka Chun], Yu, Y.Z.[Yi-Zhou], Luo, P.[Ping], Liu, S.[Sifei],
RegionGPT: Towards Region Understanding Vision Language Model,
CVPR24(13796-13806)
IEEE DOI 2410
Training, Visualization, Large language models, Pipelines, Training data, Object detection, Cognition BibRef

Yu, T.Y.[Tian-Yu], Yao, Y.[Yuan], Zhang, H.[Haoye], He, T.[Taiwen], Han, Y.F.[Yi-Feng], Cui, G.[Ganqu], Hu, J.Y.[Jin-Yi], Liu, Z.Y.[Zhi-Yuan], Zheng, H.T.[Hai-Tao], Sun, M.[Maosong],
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-Grained Correctional Human Feedback,
CVPR24(13807-13816)
IEEE DOI 2410
Image segmentation, Accuracy, Large language models, Computational modeling, Benchmark testing, Cognition, vision, hallucination BibRef

Xuan, S.Y.[Shi-Yu], Guo, Q.[Qingpei], Yang, M.[Ming], Zhang, S.L.[Shi-Liang],
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs,
CVPR24(13838-13848)
IEEE DOI Code:
WWW Link. 2410
Training, Visualization, Costs, Accuracy, Annotations, Large language models BibRef

Ganz, R.[Roy], Kittenplon, Y.[Yair], Aberdam, A.[Aviad], Avraham, E.B.[Elad Ben], Nuriel, O.[Oren], Mazor, S.[Shai], Litman, R.[Ron],
Question Aware Vision Transformer for Multimodal Reasoning,
CVPR24(13861-13871)
IEEE DOI 2410
Visualization, Image coding, Large language models, Focusing, Computer architecture, Transformers BibRef

Bansal, H.[Hritik], Bitton, Y.[Yonatan], Szpektor, I.[Idan], Chang, K.W.[Kai-Wei], Grover, A.[Aditya],
VideoCon: Robust Video-Language Alignment via Contrast Captions,
CVPR24(13927-13937)
IEEE DOI 2410
Large language models, Semantics, Question answering (information retrieval), Data models, large multimodal models BibRef

Wang, S.W.[Shao-Wei], Zhang, L.L.[Ling-Ling], Zhu, L.[Longji], Qin, T.[Tao], Yap, K.H.[Kim-Hui], Zhang, X.Y.[Xin-Yu], Liu, J.[Jun],
CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question Answering,
CVPR24(13969-13979)
IEEE DOI 2410
Bridges, Visualization, Large language models, Computational modeling, Natural languages, Large Language Model BibRef

He, J.W.[Jun-Wen], Wang, Y.F.[Yi-Fan], Wang, L.J.[Li-Jun], Lu, H.C.[Hu-Chuan], He, J.Y.[Jun-Yan], Lan, J.P.[Jin-Peng], Luo, B.[Bin], Xie, X.[Xuansong],
Multi-Modal Instruction Tuned LLMs with Fine-Grained Visual Perception,
CVPR24(13980-13990)
IEEE DOI Code:
WWW Link. 2410
Image segmentation, Visualization, Technological innovation, Grounding, Computational modeling, Large language models, Natural languages BibRef

Yu, Q.[Qiying], Sun, Q.[Quan], Zhang, X.S.[Xiao-Song], Cui, Y.F.[Yu-Feng], Zhang, F.[Fan], Cao, Y.[Yue], Wang, X.L.[Xin-Long], Liu, J.J.[Jing-Jing],
CapsFusion: Rethinking Image-Text Data at Scale,
CVPR24(14022-14032)
IEEE DOI 2410
Training, Knowledge engineering, Scalability, Large language models, Computational modeling, Noise BibRef

Yao, J.W.[Jia-Wei], Qian, Q.[Qi], Hu, J.[Juhua],
Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering,
CVPR24(14066-14075)
IEEE DOI Code:
WWW Link. 2410
Deep learning, Bridges, Visualization, Codes, Large language models, Face recognition BibRef

Zou, B.[Bo], Yang, C.[Chao], Qiao, Y.[Yu], Quan, C.B.[Cheng-Bin], Zhao, Y.J.[You-Jian],
LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction,
CVPR24(14089-14099)
IEEE DOI Code:
WWW Link. 2410
Visualization, Adaptation models, Codes, Computational modeling, Benchmark testing, Instruction Tuning, PEFT, Large Language Model BibRef

Huang, B.[Bin], Wang, X.[Xin], Chen, H.[Hong], Song, Z.[Zihan], Zhu, W.W.[Wen-Wu],
VTimeLLM: Empower LLM to Grasp Video Moments,
CVPR24(14271-14280)
IEEE DOI Code:
WWW Link. 2410
Training, Visualization, Grounding, Large language models, Benchmark testing, Cognition BibRef

Hong, W.[Wenyi], Wang, W.H.[Wei-Han], Lv, Q.S.[Qing-Song], Xu, J.Z.[Jia-Zheng], Yu, W.[Wenmeng], Ji, J.H.[Jun-Hui], Wang, Y.[Yan], Wang, Z.[Zihan], Dong, Y.X.[Yu-Xiao], Ding, M.[Ming], Tang, J.[Jie],
CogAgent: A Visual Language Model for GUI Agents,
CVPR24(14281-14290)
IEEE DOI Code:
WWW Link. 2410
Visualization, Limiting, Image resolution, Image recognition, Navigation, Large language models, Benchmark testing BibRef

Khan, Z.[Zaid], BG, V.K.[Vijay Kumar], Schulter, S.[Samuel], Fu, Y.[Yun], Chandraker, M.[Manmohan],
Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement,
CVPR24(14344-14353)
IEEE DOI Code:
WWW Link. 2410
Training, Visualization, Annotations, Large language models, Object detection, Question answering (information retrieval), visual question answering BibRef

Mitra, C.[Chancharik], Huang, B.[Brandon], Darrell, T.J.[Trevor J.], Herzig, R.[Roei],
Compositional Chain-of-Thought Prompting for Large Multimodal Models,
CVPR24(14420-14431)
IEEE DOI Code:
WWW Link. 2410
Bridges, Visualization, Codes, Annotations, Large language models, Benchmark testing, Large Multimodal Models, Multimodality, Prompting BibRef

Li, B.[Boyi], Wang, Y.[Yue], Mao, J.[Jiageng], Ivanovic, B.[Boris], Veer, S.[Sushant], Leung, K.[Karen], Pavone, M.[Marco],
Driving Everywhere with Large Language Model Policy Adaptation,
CVPR24(14948-14957)
IEEE DOI 2410
Measurement, Video on demand, Accuracy, Large language models, Planning, Large Language Models, Driving Copilot BibRef

Wei, Y.X.[Yu-Xi], Wang, Z.[Zi], Lu, Y.F.[Yi-Fan], Xu, C.X.[Chen-Xin], Liu, C.X.[Chang-Xing], Zhao, H.[Hao], Chen, S.[Siheng], Wang, Y.F.[Yan-Feng],
Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents,
CVPR24(15077-15087)
IEEE DOI Code:
WWW Link. 2410
Large language models, Face recognition, Natural languages, Collaboration, Lighting, Rendering (computer graphics), LLM agent BibRef

Shao, H.[Hao], Hu, Y.X.[Yu-Xuan], Wang, L.[Letian], Song, G.L.[Guang-Lu], Waslander, S.L.[Steven L.], Liu, Y.[Yu], Li, H.S.[Hong-Sheng],
LMDrive: Closed-Loop End-to-End Driving with Large Language Models,
CVPR24(15120-15130)
IEEE DOI 2410
Navigation, Large language models, Multimodal sensors, Natural languages, Benchmark testing, Software, LLM, autonomous driving BibRef

Ma, Y.S.[Yun-Sheng], Cui, C.[Can], Cao, X.[Xu], Ye, W.Q.[Wen-Qian], Liu, P.R.[Pei-Ran], Lu, J.[Juanwu], Abdelraouf, A.[Amr], Gupta, R.[Rohit], Han, K.T.[Kyung-Tae], Bera, A.[Aniket], Rehg, J.M.[James M.], Wang, Z.[Ziran],
LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs,
CVPR24(15141-15151)
IEEE DOI 2410
Codes, Large language models, Benchmark testing, Cognition, Safety, Pattern recognition BibRef

Zhang, J.W.[Jia-Wei], Xu, C.[Chejian], Li, B.[Bo],
ChatScene: Knowledge-Enabled Safety-Critical Scenario Generation for Autonomous Vehicles,
CVPR24(15459-15469)
IEEE DOI Code:
WWW Link. 2410
Training, Codes, Large language models, Transforms, Robustness, Safety, Autonomous Driving, Large Language Model, Safety-Critical Scenario BibRef

Liu, C.[Chaohu], Yin, K.[Kun], Cao, H.Y.[Hao-Yu], Jiang, X.H.[Xing-Hua], Li, X.[Xin], Liu, Y.[Yinsong], Jiang, D.Q.[De-Qiang], Sun, X.[Xing], Xu, L.[Linli],
HRVDA: High-Resolution Visual Document Assistant,
CVPR24(15534-15545)
IEEE DOI 2410
Training, Visualization, Large language models, Computational modeling, Training data, Transformers, Multimodal BibRef

Blau, T.[Tsachi], Fogel, S.[Sharon], Ronen, R.[Roi], Golts, A.[Alona], Tsiper, S.[Shahar], Avraham, E.B.[Elad Ben], Aberdam, A.[Aviad], Ganz, R.[Roy], Litman, R.[Ron],
GRAM: Global Reasoning for Multi-Page VQA,
CVPR24(15598-15607)
IEEE DOI 2410
Adaptation models, Visualization, Computational modeling, Large language models, Benchmark testing, Transformers, Cognition, Vision Language Models BibRef

Luo, C.[Chuwei], Shen, Y.F.[Yu-Fan], Zhu, Z.Q.[Zhao-Qing], Zheng, Q.[Qi], Yu, Z.[Zhi], Yao, C.[Cong],
LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding,
CVPR24(15630-15640)
IEEE DOI 2410
Large language models, Layout, Manuals, Inspection, Benchmark testing, Boosting, Document Understanding, Layout, Large Language Models BibRef

Yang, Y.[Yue], Sun, F.Y.[Fan-Yun], Weihs, L.[Luca], Vanderbilt, E.[Eli], Herrasti, A.[Alvaro], Han, W.[Winson], Wu, J.J.[Jia-Jun], Haber, N.[Nick], Krishna, R.[Ranjay], Liu, L.J.[Ling-Jie], Callison-Burch, C.[Chris], Yatskar, M.[Mark], Kembhavi, A.[Aniruddha], Clark, C.[Christopher],
Holodeck: Language Guided Generation of 3D Embodied AI Environments,
CVPR24(16277-16287)
IEEE DOI 2410
Training, Navigation, Large language models, Semantics, Layout, Stars, Embodied AI, 3D Scene Generation, Language-guided Generation BibRef

Qin, Y.[Yiran], Zhou, E.[Enshen], Liu, Q.[Qichang], Yin, Z.F.[Zhen-Fei], Sheng, L.[Lu], Zhang, R.M.[Rui-Mao], Qiao, Y.[Yu], Shao, J.[Jing],
MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception,
CVPR24(16307-16316)
IEEE DOI Code:
WWW Link. 2410
Visualization, Large language models, Active perception, Planning, Compounds BibRef

Zhang, S.[Sixian], Yu, X.Y.[Xin-Yao], Song, X.H.[Xin-Hang], Wang, X.H.[Xiao-Han], Jiang, S.Q.[Shu-Qiang],
Imagine Before Go: Self-Supervised Generative Map for Object Goal Navigation,
CVPR24(16414-16425)
IEEE DOI Code:
WWW Link. 2410
Training, Geometry, Navigation, Large language models, Semantics, Layout, Self-supervised learning, Embodied AI, Object Goal Navigation BibRef

Li, H.[Hao], Yang, X.[Xue], Wang, Z.[Zhaokai], Zhu, X.[Xizhou], Zhou, J.[Jie], Qiao, Y.[Yu], Wang, X.G.[Xiao-Gang], Li, H.S.[Hong-Sheng], Lu, L.W.[Le-Wei], Dai, J.F.[Ji-Feng],
Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft,
CVPR24(16426-16435)
IEEE DOI 2410
Learning systems, Codes, Large language models, Lava, Semantics, Reinforcement learning, Syntactics, Large Language Model, Reward Shaping BibRef

Liu, M.X.[Ming-Xuan], Hayes, T.L.[Tyler L.], Ricci, E.[Elisa], Csurka, G.[Gabriela], Volpi, R.[Riccardo],
SHiNe: Semantic Hierarchy Nexus for Open-Vocabulary Object Detection,
CVPR24(16634-16644)
IEEE DOI 2410
Vocabulary, Fuses, Large language models, Semantics, Detectors, Object detection, Open-vocabulary, Object Detection, Vision-Language BibRef

Lei, T.[Ting], Yin, S.F.[Shao-Feng], Liu, Y.[Yang],
Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection,
CVPR24(16657-16667)
IEEE DOI Code:
WWW Link. 2410
Vocabulary, Correlation, Large language models, Semantics, Natural languages, Detectors BibRef

Kim, J.[Jooyeon], Cho, E.[Eulrang], Kim, S.[Sehyung], Kim, H.W.J.[Hyun-Woo J.],
Retrieval-Augmented Open-Vocabulary Object Detection,
CVPR24(17427-17436)
IEEE DOI Code:
WWW Link. 2410
Portable media players, Visualization, Vocabulary, Large language models, Semantics, Detectors, Object detection, Retrieval-Augmentation BibRef

Saha, O.[Oindrila], van Horn, G.[Grant], Maji, S.[Subhransu],
Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions,
CVPR24(17542-17552)
IEEE DOI Code:
WWW Link. 2410
Training, Visualization, Large language models, Habitats, Benchmark testing, Birds, Zero Shot Learning, Fine-grained Classification BibRef

Toubal, I.E.[Imad Eddine], Avinash, A.[Aditya], Alldrin, N.G.[Neil Gordon], Dlabal, J.[Jan], Zhou, W.[Wenlei], Luo, E.[Enming], Stretcu, O.[Otilia], Xiong, H.[Hao], Lu, C.T.[Chun-Ta], Zhou, H.[Howard], Krishna, R.[Ranjay], Fuxman, A.[Ariel], Duerig, T.[Tom],
Modeling Collaborator: Enabling Subjective Vision Classification with Minimal Human Effort via LLM Tool-Use,
CVPR24(17553-17563)
IEEE DOI 2410
Visualization, Computational modeling, Large language models, Natural languages, Wildlife, Training data, Manuals, tool-use BibRef

Li, X.Q.[Xiao-Qi], Zhang, M.X.[Ming-Xu], Geng, Y.[Yiran], Geng, H.R.[Hao-Ran], Long, Y.X.[Yu-Xing], Shen, Y.[Yan], Zhang, R.R.[Ren-Rui], Liu, J.[JiaMing], Dong, H.[Hao],
ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation,
CVPR24(18061-18070)
IEEE DOI Code:
WWW Link. 2410
Training, Adaptation models, Large language models, Transforms, Predictive models, Robot sensing systems, Cognition, Embodied AI, Multi-modal Large Language Model BibRef

Han, T.[Tengda], Bain, M.[Max], Nagrani, A.[Arsha], Varol, G.[Gül], Xie, W.[Weidi], Zisserman, A.[Andrew],
AutoAD III: The Prequel - Back to the Pixels,
CVPR24(18164-18174)
IEEE DOI 2410
Training, Measurement, Visualization, Large language models, Current measurement, Training data, Computer architecture BibRef

Song, E.[Enxin], Chai, W.H.[Wen-Hao], Wang, G.[Guanhong], Zhang, Y.C.[Yu-Cheng], Zhou, H.Y.[Hao-Yang], Wu, F.[Feiyang], Chi, H.Z.[Hao-Zhe], Guo, X.[Xun], Ye, T.[Tian], Zhang, Y.T.[Yan-Ting], Lu, Y.[Yan], Hwang, J.N.[Jenq-Neng], Wang, G.[Gaoang],
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding,
CVPR24(18221-18232)
IEEE DOI Code:
WWW Link. 2410
Visualization, Costs, Large language models, Computational modeling, Manuals, Transformers BibRef

Qu, H.X.[Hao-Xuan], Cai, Y.J.[Yu-Jun], Liu, J.[Jun],
LLMs are Good Action Recognizers,
CVPR24(18395-18406)
IEEE DOI 2410
Accuracy, Large language models, Computer architecture, Linguistics, Benchmark testing, Skeleton BibRef

Chen, J.[Joya], Lv, Z.Y.[Zhao-Yang], Wu, S.W.[Shi-Wei], Lin, K.Q.[Kevin Qinghong], Song, C.[Chenan], Gao, D.F.[Di-Fei], Liu, J.W.[Jia-Wei], Gao, Z.T.[Zi-Teng], Mao, D.X.[Dong-Xing], Shou, M.Z.[Mike Zheng],
VideoLLM-online: Online Video Large Language Model for Streaming Video,
CVPR24(18407-18418)
IEEE DOI 2410
Training, Large language models, Soft sensors, Pipelines, Streaming media, Rendering (computer graphics), Data models BibRef

Zhu, A.[Anqi], Ke, Q.H.[Qiu-Hong], Gong, M.M.[Ming-Ming], Bailey, J.[James],
Part-Aware Unified Representation of Language and Skeleton for Zero-Shot Action Recognition,
CVPR24(18761-18770)
IEEE DOI Code:
WWW Link. 2410
Visualization, Source coding, Large language models, Natural languages, Skeleton, representation learning BibRef

Chen, T.J.[Tong-Jia], Yu, H.S.[Hong-Shan], Yang, Z.G.[Zhen-Geng], Li, Z.C.[Ze-Chuan], Sun, W.[Wei], Chen, C.[Chen],
OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition,
CVPR24(18888-18898)
IEEE DOI 2410
Training, Adaptation models, Visualization, Large language models, Semantics, Pipelines, Refining, Video Recognition, Multi-modality Video Understanding BibRef

Zhao, Q.H.[Qi-Hao], Dai, Y.[Yalun], Li, H.[Hao], Hu, W.[Wei], Zhang, F.[Fan], Liu, J.[Jun],
LTGC: Long-Tail Recognition via Leveraging LLMs-Driven Generated Content,
CVPR24(19510-19520)
IEEE DOI 2410
Semantic segmentation, Large language models, Computational modeling, Data visualization, Tail, Benchmark testing BibRef

Siddiqui, Y.[Yawar], Alliegro, A.[Antonio], Artemov, A.[Alexey], Tommasi, T.[Tatiana], Sirigatti, D.[Daniele], Rosov, V.[Vladislav], Dai, A.[Angela], Nießner, M.[Matthias],
MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers,
CVPR24(19615-19625)
IEEE DOI 2410
Geometry, Vocabulary, Solid modeling, Shape, Large language models, Transformers, Mesh Generation, Generative Models for 3D, Transformers BibRef

Yuan, Z.H.[Zhi-Hao], Ren, J.[Jinke], Feng, C.M.[Chun-Mei], Zhao, H.S.[Heng-Shuang], Cui, S.G.[Shu-Guang], Li, Z.[Zhen],
Visual Programming for Zero-Shot Open-Vocabulary 3D Visual Grounding,
CVPR24(20623-20633)
IEEE DOI Code:
WWW Link. 2410
Visualization, Vocabulary, Grounding, Annotations, Navigation, Large language models, Visual Grounding, Point Cloud, Vision and Language BibRef

Li, Z.[Zhe], Gao, Z.Y.[Zhang-Yang], Tan, C.[Cheng], Ren, B.[Bocheng], Yang, L.T.[Laurence T.], Li, S.Z.[Stan Z.],
General Point Model Pretraining with Autoencoding and Autoregressive,
CVPR24(20954-20964)
IEEE DOI Code:
WWW Link. 2410
Point cloud compression, Representation learning, Codes, Large language models, Vector quantization, Computational modeling BibRef

Li, K.C.[Kun-Chang], Wang, Y.[Yali], He, Y.[Yinan], Li, Y.Z.[Yi-Zhuo], Wang, Y.[Yi], Liu, Y.[Yi], Wang, Z.[Zun], Xu, J.[Jilan], Chen, G.[Guo], Lou, P.[Ping], Wang, L.M.[Li-Min], Qiao, Y.[Yu],
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark,
CVPR24(22195-22206)
IEEE DOI Code:
WWW Link. 2410
Training, Systematics, Large language models, Image annotation, Manuals, Benchmark testing BibRef

Taesiri, M.R.[Mohammad Reza], Feng, T.J.[Tian-Jun], Bezemer, C.P.[Cor-Paul], Nguyen, A.[Anh],
GlitchBench: Can Large Multimodal Models Detect Video Game Glitches?,
CVPR24(22444-22455)
IEEE DOI Code:
WWW Link. 2410
Video games, Visualization, Quality assurance, Large language models, Benchmark testing, Linguistics, Cognition, game testing BibRef

Zhang, R.[Ruiyi], Zhang, Y.Z.[Yan-Zhe], Chen, J.[Jian], Zhou, Y.F.[Yu-Fan], Gu, J.X.[Jiu-Xiang], Chen, C.[Changyou], Sun, T.[Tong],
TRINS: Towards Multimodal Language Models that Can Read,
CVPR24(22584-22594)
IEEE DOI 2410
Visualization, Annotations, Large language models, Computational modeling, Optical character recognition, Training data BibRef

Zhang, H.J.[Hao-Jie], Su, Y.Y.[Yong-Yi], Xu, X.[Xun], Jia, K.[Kui],
Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation,
CVPR24(23385-23395)
IEEE DOI 2410
Image segmentation, Costs, Large language models, Robustness, Computational efficiency, Domain Adaptation, Weakly Supervised Adaptation BibRef

Dunlap, L.[Lisa], Zhang, Y.H.[Yu-Hui], Wang, X.H.[Xiao-Han], Zhong, R.Q.[Rui-Qi], Darrell, T.J.[Trevor J.], Steinhardt, J.[Jacob], Gonzalez, J.E.[Joseph E.], Yeung-Levy, S.[Serena],
Describing Differences in Image Sets with Natural Language,
CVPR24(24199-24208)
IEEE DOI Code:
WWW Link. 2410
Analytical models, Large language models, Computational modeling, Natural languages, Human in the loop BibRef

Ishmam, A.M.[Alvi Md], Thomas, C.[Christopher],
Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-Grained Knowledge Alignment,
CVPR24(24820-24830)
IEEE DOI 2410
Training, Visualization, Correlation, Computational modeling, Large language models, Semantics, Adversarial attack and defense, Vision languge model BibRef

Wu, H.N.[Hao-Ning], Zhang, Z.C.[Zi-Cheng], Zhang, E.[Erli], Chen, C.F.[Chao-Feng], Liao, L.[Liang], Wang, A.[Annan], Xu, K.X.[Kai-Xin], Li, C.Y.[Chun-Yi], Hou, J.W.[Jing-Wen], Zhai, G.T.[Guang-Tao], Xue, G.[Geng], Sun, W.X.[Wen-Xiu], Yan, Q.[Qiong], Lin, W.S.[Wei-Si],
Q-Instruct: Improving Low-Level Visual Abilities for Multi-Modality Foundation Models,
CVPR24(25490-25500)
IEEE DOI 2410
Visualization, Accuracy, Large language models, Natural languages, Solids, Quality assessment BibRef

Yang, Y.J.[Yi-Jun], Zhou, T.Y.[Tian-Yi], Li, K.[Kanxue], Tao, D.P.[Da-Peng], Li, L.[Lusong], Shen, L.[Li], He, X.D.[Xiao-Dong], Jiang, J.[Jing], Shi, Y.H.[Yu-Hui],
Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld,
CVPR24(26265-26275)
IEEE DOI 2410
Training, Visualization, Imitation learning, Large language models, Robustness, Reflection, Embodied AI, Large Language Models, Imitation Learning BibRef

Hong, Y.[Yining], Zheng, Z.[Zishuo], Chen, P.H.[Pei-Hao], Wang, Y.F.[Yi-Fan], Li, J.[Junyan], Gan, C.[Chuang],
MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World,
CVPR24(26396-26406)
IEEE DOI 2410
Visualization, Correlation, Navigation, Large language models, Computational modeling BibRef

Chen, G.[Gongwei], Shen, L.[Leyang], Shao, R.[Rui], Deng, X.[Xiang], Nie, L.Q.[Li-Qiang],
LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge,
CVPR24(26530-26540)
IEEE DOI 2410
Visualization, Accuracy, Grounding, Large language models, Semantics, Benchmark testing BibRef

Zhang, Y.[Yichi], Dong, Y.P.[Yin-Peng], Zhang, S.Y.[Si-Yuan], Min, T.Z.[Tian-Zan], Su, H.[Hang], Zhu, J.[Jun],
Exploring the Transferability of Visual Prompting for Multimodal Large Language Models,
CVPR24(26552-26562)
IEEE DOI 2410
Training, Visualization, Adaptation models, Computational modeling, Large language models, Semantics, Feature extraction, Transferability BibRef

Han, J.[JiaMing], Gong, K.X.[Kai-Xiong], Zhang, Y.Y.[Yi-Yuan], Wang, J.Q.[Jia-Qi], Zhang, K.[Kaipeng], Lin, D.[Dahua], Qiao, Y.[Yu], Gao, P.[Peng], Yue, X.Y.[Xiang-Yu],
OneLLM: One Framework to Align All Modalities with Language,
CVPR24(26574-26585)
IEEE DOI Code:
WWW Link. 2410
Point cloud compression, Large language models, Pipelines, Benchmark testing, Functional magnetic resonance imaging, Routing BibRef

Xie, H.X.[Hong-Xia], Peng, C.J.[Chu-Jun], Tseng, Y.W.[Yu-Wen], Chen, H.J.[Hung-Jen], Hsu, C.F.[Chan-Feng], Shuai, H.H.[Hong-Han], Cheng, W.H.[Wen-Huang],
EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning,
CVPR24(26586-26595)
IEEE DOI Code:
WWW Link. 2410
Visualization, Emotion recognition, Large language models, Pipelines, Benchmark testing, Cognition BibRef

Wang, X.Y.[Xin-Yu], Zhuang, B.[Bohan], Wu, Q.[Qi],
ModaVerse: Efficiently Transforming Modalities with LLMs,
CVPR24(26596-26606)
IEEE DOI Code:
WWW Link. 2410
Training, Adaptation models, Large language models, Natural languages, Layout, Data models BibRef

Lin, J.[Ji], Yin, H.X.[Hong-Xu], Ping, W.[Wei], Molchanov, P.[Pavlo], Shoeybi, M.[Mohammad], Han, S.[Song],
VILA: On Pre-training for Visual Language Models,
CVPR24(26679-26689)
IEEE DOI 2410
Degradation, Visualization, Accuracy, Large language models, Benchmark testing, Cognition BibRef

Li, L.[Li], Peng, J.W.[Jia-Wei], Chen, H.[Huiyi], Gao, C.Y.[Chong-Yang], Yang, X.[Xu],
How to Configure Good In-Context Sequence for Visual Question Answering,
CVPR24(26700-26710)
IEEE DOI Code:
WWW Link. 2410
Visualization, Codes, Design methodology, Large language models, Question answering (information retrieval) BibRef

Lyu, Y.H.[Yuan-Huiyi], Zheng, X.[Xu], Zhou, J.Z.[Jia-Zhou], Wang, L.[Lin],
UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All,
CVPR24(26742-26752)
IEEE DOI 2410
Point cloud compression, Visualization, Large language models, Knowledge based systems, Infrared imaging, Contrastive learning, Data mining BibRef

Liang, T.[Tian], Huang, J.[Jing], Kong, M.[Ming], Chen, L.[Luyuan], Zhu, Q.[Qiang],
Querying as Prompt: Parameter-Efficient Learning for Multimodal Language Model,
CVPR24(26845-26855)
IEEE DOI Code:
WWW Link. 2410
Training, Bridges, Adaptation models, Technological innovation, Codes, Computational modeling, multimodal, large language model BibRef

Jiang, C.[Chaoya], Xu, H.Y.[Hai-Yang], Dong, M.[Mengfan], Chen, J.X.[Jia-Xing], Ye, W.[Wei], Yan, M.[Ming], Ye, Q.[Qinghao], Zhang, J.[Ji], Huang, F.[Fei], Zhang, S.K.[Shi-Kun],
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model,
CVPR24(27026-27036)
IEEE DOI Code:
WWW Link. 2410
Representation learning, Visualization, Codes, Large language models, Natural languages, Contrastive learning BibRef

Zhu, L.[Lei], Wei, F.[Fangyun], Lu, Y.[Yanye],
Beyond Text: Frozen Large Language Models in Visual Signal Comprehension,
CVPR24(27037-27047)
IEEE DOI Code:
WWW Link. 2410
Visualization, Vocabulary, Image recognition, Large language models, Semantics, Transforms, Feature extraction, Multi-modal learning BibRef

Pi, R.J.[Ren-Jie], Yao, L.W.[Le-Wei], Gao, J.[Jiahui], Zhang, J.P.[Ji-Peng], Zhang, T.[Tong],
PerceptionGPT: Effectively Fusing Visual Perception Into LLM,
CVPR24(27114-27123)
IEEE DOI 2410
Training, Visualization, Accuracy, Large language models, Decoding, Multimodal Learning BibRef

Tai, Y.[Yan], Fan, W.C.[Wei-Chen], Zhang, Z.[Zhao], Liu, Z.W.[Zi-Wei],
Link-Context Learning for Multimodal LLMs,
CVPR24(27166-27175)
IEEE DOI 2410
Training, Image recognition, Large language models, Oral communication, Propulsion, Cognition BibRef

Tang, Z.[Zineng], Yang, Z.[Ziyi], Khademi, M.[Mahmoud], Liu, Y.[Yang], Zhu, C.G.[Chen-Guang], Bansal, M.[Mohit],
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation,
CVPR24(27415-27424)
IEEE DOI 2410
Image synthesis, Large language models, Oral communication, Encoding, Cognition BibRef

Jain, J.[Jitesh], Yang, J.W.[Jian-Wei], Shi, H.[Humphrey],
VCoder: Versatile Vision Encoders for Multimodal Large Language Models,
CVPR24(27992-28002)
IEEE DOI 2410
Training, Visualization, Image segmentation, Costs, Image synthesis, Large language models, Machine vision BibRef

Yuan, Y.Q.[Yu-Qian], Li, W.[Wentong], Liu, J.[Jian], Tang, D.Q.[Dong-Qi], Luo, X.J.[Xin-Jie], Qin, C.[Chi], Zhang, L.[Lei], Zhu, J.[Jianke],
Osprey: Pixel Understanding with Visual Instruction Tuning,
CVPR24(28202-28211)
IEEE DOI Code:
WWW Link. 2410
Convolutional codes, Visualization, Computational modeling, Source coding, Large language models, Semantics BibRef

Zhai, A.J.[Albert J.], Shen, Y.[Yuan], Chen, E.Y.[Emily Y.], Wang, G.X.[Gloria X.], Wang, X.L.[Xin-Lei], Wang, S.[Sheng], Guan, K.Y.[Kai-Yu], Wang, S.[Shenlong],
Physical Property Understanding from Language-Embedded Feature Fields,
CVPR24(28296-28305)
IEEE DOI Code:
WWW Link. 2410
Point cloud compression, Visualization, Friction, Large language models, Estimation, digital twin BibRef

Zheng, Z.H.[Zhao-Heng], Wei, J.[Jingmin], Hu, X.F.[Xue-Feng], Zhu, H.D.[Hai-Dong], Nevatia, R.[Ram],
Large Language Models are Good Prompt Learners for Low-Shot Image Classification,
CVPR24(28453-28462)
IEEE DOI Code:
WWW Link. 2410
Learning systems, Training, Adaptation models, Codes, Large language models, Computational modeling BibRef

He, H.Y.[Hao-Yu], Pan, Z.Z.[Zi-Zheng], Liu, J.[Jing], Cai, J.F.[Jian-Fei], Zhuang, B.[Bohan],
Efficient Stitchable Task Adaptation,
CVPR24(28555-28565)
IEEE DOI Code:
WWW Link. 2410
Training, Deep learning, Adaptation models, Visualization, Scalability, Pipelines, Memory management, model stitching, large language model BibRef

Tian, X.Y.[Xin-Yu], Zou, S.[Shu], Yang, Z.Y.[Zhao-Yuan], Zhang, J.[Jing],
ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models,
CVPR24(28578-28587)
IEEE DOI Code:
WWW Link. 2410
Adaptation models, Visualization, Correlation, Computational modeling, Large language models, Semantics, few-shot adaptation BibRef

Han, G.X.[Guang-Xing], Lim, S.N.[Ser-Nam],
Few-Shot Object Detection with Foundation Models,
CVPR24(28608-28618)
IEEE DOI 2410
Training, Visualization, Large language models, Computational modeling, Object detection, Benchmark testing, Large Language Model BibRef

Roberts, J.[Jonathan], Lüddecke, T.[Timo], Sheikh, R.[Rehan], Han, K.[Kai], Albanie, S.[Samuel],
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs,
EarthVision24(554-563)
IEEE DOI 2410
Visualization, Image segmentation, Navigation, Large language models, Disasters, Focusing, Benchmark testing, Evaluation BibRef

Barbany, O.[Oriol], Huang, M.[Michael], Zhu, X.L.[Xin-Liang], Dhua, A.[Arnab],
Leveraging Large Language Models for Multimodal Search,
FGVC24(1201-1210)
IEEE DOI 2410
Large language models, Natural languages, Pipelines, Image retrieval, Computer architecture, LLM, retrieval, fashion, multimodal BibRef

Lv, J.X.[Jia-Xi], Huang, Y.[Yi], Yan, M.[Mingfu], Huang, J.C.[Jian-Cheng], Liu, J.Z.[Jian-Zhuang], Liu, Y.F.[Yi-Fan], Wen, Y.F.[Ya-Fei], Chen, X.X.[Xiao-Xin], Chen, S.F.[Shi-Feng],
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning,
PBDL24(1430-1440)
IEEE DOI Code:
WWW Link. 2410
Image synthesis, Large language models, Text to image, Fluid flow, Manuals, Diffusion models BibRef

Baldassini, F.B.[Folco Bertini], Shukor, M.[Mustafa], Cord, M.[Matthieu], Soulier, L.[Laure], Piwowarski, B.[Benjamin],
What Makes Multimodal In-Context Learning Work?,
Prompting24(1539-1550)
IEEE DOI 2410
Training, Analytical models, Codes, Large language models, Impedance matching, Large Language Models, Shortcuts learning BibRef

Wang, J.C.[Jun-Chi], Ke, L.[Lei],
LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning,
WhatNext24(1765-1774)
IEEE DOI Code:
WWW Link. 2410
Training, Image segmentation, Large language models, Design methodology, Pipelines, Cognition BibRef

Qu, M.X.[Meng-Xue], Chen, X.D.[Xiao-Dong], Liu, W.[Wu], Li, A.[Alicia], Zhao, Y.[Yao],
ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models,
PVUW24(1847-1856)
IEEE DOI 2410
Grounding, Annotations, Large language models, Supervised learning, Natural languages BibRef

Hakim, Z.I.A.[Zaber Ibn Abdul], Sarker, N.H.[Najibul Haque], Singh, R.P.[Rahul Pratap], Paul, B.[Bishmoy], Dabouei, A.[Ali], Xu, M.[Min],
Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning,
MULA24(1975-1985)
IEEE DOI 2410
Training, Adaptation models, Statistical analysis, Large language models, Estimation, Contrastive learning, Distance measurement BibRef

Deria, A.[Ankan], Kumar, K.[Komal], Chakraborty, S.[Snehashis], Mahapatra, D.[Dwarikanath], Roy, S.[Sudipta],
InVERGe: Intelligent Visual Encoder for Bridging Modalities in Report Generation,
MULA24(2028-2038)
IEEE DOI Code:
WWW Link. 2410
Training, Visualization, Computational modeling, Radiology, Transformers, Feature extraction, Decoding, Deep Learning, Large Language Model BibRef

Ma, F.P.[Fei-Peng], Zhou, Y.Z.[Yi-Zhou], Zhang, Y.[Yueyi], Wu, S.Y.[Si-Ying], Zhang, Z.[Zheyu], He, Z.L.[Zi-Long], Rao, F.Y.[Feng-Yun], Sun, X.Y.[Xiao-Yan],
Task Navigator: Decomposing Complex Tasks for Multimodal Large Language Models,
Reasoning24(2248-2257)
IEEE DOI 2410
Training, Systematics, Navigation, Large language models, Training data, Language and Vision, Multi-modal Vision BibRef

Arefeen, M.A.[Md Adnan], Debnath, B.[Biplob], Uddin, M.Y.S.[Md Yusuf Sarwar], Chakradhar, S.[Srimat],
ViTA: An Efficient Video-to-Text Algorithm using VLM for RAG-based Video Analysis System,
Reasoning24(2266-2274)
IEEE DOI 2410
Accuracy, Large language models, Natural language processing, Data models, Video Analytics, Large Language Models (LLMs) BibRef

Chen, Y.W.[Yu-Wei], Chu, S.Y.[Shi-Yong],
Large Language Models in Wargaming: Methodology, Application, and Robustness,
AML24(2894-2903)
IEEE DOI 2410
Navigation, Large language models, Decision making, Strategic planning, Solids, Robustness, Natural language processing BibRef

Lai, Z.X.[Zhi-Xin], Wu, J.[Jing], Chen, S.[Suiyao], Zhou, Y.C.[Yu-Cheng], Hovakimyan, N.[Naira],
Residual-based Language Models are Free Boosters for Biomedical Imaging Tasks,
DEF-AI-MIA24(5086-5096)
IEEE DOI Code:
WWW Link. 2410
Visualization, Large language models, Fasteners, Transformers, LLM, Biomedical Imaging BibRef

Verma, A.A.[Aayush Atul], Saeidi, A.[Amir], Hegde, S.[Shamanthak], Therala, A.[Ajay], Bardoliya, F.D.[Fenil Denish], Machavarapu, N.[Nagaraju], Ravindhiran, S.A.K.[Shri Ajay Kumar], Malyala, S.[Srija], Chatterjee, A.[Agneet], Yang, Y.Z.[Ye-Zhou], Baral, C.[Chitta],
Evaluating Multimodal Large Language Models across Distribution Shifts and Augmentations,
GenerativeFM24(5314-5324)
IEEE DOI 2410
Analytical models, Shape, Large language models, Computational modeling, Perturbation methods, Benchmark testing, Robustness BibRef

Fang, X.[Xi], Wang, W.G.[Wei-Gang], Lv, X.X.[Xiao-Xin], Yan, J.[Jun],
PCQA: A Strong Baseline for AIGC Quality Assessment Based on Prompt Condition,
NTIRE24(6167-6176)
IEEE DOI 2410
Image quality, Databases, Large language models, Semantics, Quality assessment, Ensemble learning, AIGC, multimodal learning BibRef

Ye, Z.[Zilyu], Liu, J.X.[Jin-Xiu], Cao, J.J.[Jin-Jin], Chen, Z.Y.[Zhi-Yang], Xuan, Z.W.[Zi-Wei], Zhou, M.Y.[Ming-Yuan], Liu, Q.[Qi], Qi, G.J.[Guo-Jun],
OpenStory: A Large-Scale Open-Domain Dataset for Subject-Driven Visual Storytelling,
VDU24(7953-7962)
IEEE DOI 2410
Training, Visualization, Annotations, Large language models, Pipelines, Manuals BibRef

Chen, X.Y.[Xiang-Yu], Liu, J.[Jing], Wang, Y.[Ye], Wang, P.P.[Pu Perry], Brand, M.[Matthew], Wang, G.H.[Guang-Hui], Koike-Akino, T.[Toshiaki],
SuperLoRA: Parameter-Efficient Unified Adaptation for Large Vision Models,
ECV24(8050-8055)
IEEE DOI 2410
Adaptation models, Tensors, Computational modeling, Large language models, Transfer learning, parameter efficiency, low-rank adaptation BibRef

Chen, Z.[Zhe], Wu, J.N.[Jian-Nan], Wang, W.H.[Wen-Hai], Su, W.J.[Wei-Jie], Chen, G.[Guo], Xing, S.[Sen], Zhong, M.[Muyan], Zhang, Q.L.[Qing-Long], Zhu, X.[Xizhou], Lu, L.W.[Le-Wei], Li, B.[Bin], Luo, P.[Ping], Lu, T.[Tong], Qiao, Y.[Yu], Dai, J.F.[Ji-Feng],
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks,
CVPR24(24185-24198)
IEEE DOI 2410
Training, Visualization, Image recognition, Large language models, Data models, Question answering (information retrieval), vision-language model BibRef

Zhang, J.Y.[Jimu-Yang], Huang, Z.M.[Zan-Ming], Ray, A.[Arijit], Ohn-Bar, E.[Eshed],
Feedback-Guided Autonomous Driving,
CVPR24(15000-15011)
IEEE DOI 2410
Training, Large language models, Cloning, Computer architecture, Network architecture, Real-time systems, Autonomous Driving, Large Language Model BibRef

Wei, C.[Chen], Liu, C.X.[Chen-Xi], Qiao, S.Y.[Si-Yuan], Zhang, Z.S.[Zhi-Shuai], Yuille, A.L.[Alan L.], Yu, J.[Jiahui],
De-Diffusion Makes Text a Strong Cross-Modal Interface,
CVPR24(13492-13503)
IEEE DOI 2410
Large language models, Natural languages, Text to image, Transforms, Diffusion models, Decoding, Diffusion, Generative Model, Vision and Language BibRef

Chen, Y.[Yangyi], Sikka, K.[Karan], Cogswell, M.[Michael], Ji, H.[Heng], Divakaran, A.[Ajay],
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback,
CVPR24(14239-14250)
IEEE DOI 2410
Training, Visualization, Annotations, Large language models, Natural languages, Reinforcement learning, Natural Language Feedback BibRef

Chen, B.[Boyuan], Xu, Z.[Zhuo], Kirmani, S.[Sean], Ichter, B.[Brian], Sadigh, D.[Dorsa], Guibas, L.J.[Leonidas J.], Xia, F.[Fei],
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities,
CVPR24(14455-14465)
IEEE DOI Code:
WWW Link. 2410
Training, Solid modeling, Visualization, Pipelines, Training data, Cognition, spatial reasoning, large language model, multimodal, vision language model BibRef

Dorkenwald, M.[Michael], Barazani, N.[Nimrod], Snoek, C.G.M.[Cees G. M.], Asano, Y.M.[Yuki M.],
PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs,
CVPR24(13548-13558)
IEEE DOI 2410
Training, Computational modeling, Machine vision, Large language models, Pipelines, Pins, Vision-Language Models, Efficient Adaption of VLMs BibRef

Cha, J.[Junbum], Kang, W.[Wooyoung], Mun, J.[Jonghwan], Roh, B.[Byungseok],
Honeybee: Locality-Enhanced Projector for Multimodal LLM,
CVPR24(13817-13827)
IEEE DOI Code:
WWW Link. 2410
Visualization, Codes, Large language models, Benchmark testing, Tuning, Multimodal LLM, Vision-Language BibRef

Huang, Q.D.[Qi-Dong], Dong, X.Y.[Xiao-Yi], Zhang, P.[Pan], Wang, B.[Bin], He, C.H.[Cong-Hui], Wang, J.Q.[Jia-Qi], Lin, D.[Dahua], Zhang, W.M.[Wei-Ming], Yu, N.H.[Neng-Hai],
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation,
CVPR24(13418-13427)
IEEE DOI Code:
WWW Link. 2410
Training, Measurement, Costs, Codes, Large language models, Focusing, Hallucination, Large vision-language model, Multimodal LLM, LLM BibRef

Zhang, Y.[Yichi], Ma, Z.Q.[Zi-Qiao], Gao, X.F.[Xiao-Feng], Shakiah, S.[Suhaila], Gao, Q.[Qiaozi], Chai, J.[Joyce],
Groundhog: Grounding Large Language Models to Holistic Segmentation,
CVPR24(14227-14238)
IEEE DOI 2410
Training, Visualization, Grounding, Shape, Large language models, Semantics, Feature extraction, Multi-Modal, Language Grounding, Vision-Language Model BibRef

Sun, Z.Y.[Ze-Yi], Fang, Y.[Ye], Wu, T.[Tong], Zhang, P.[Pan], Zang, Y.H.[Yu-Hang], Kong, S.[Shu], Xiong, Y.J.[Yuan-Jun], Lin, D.[Dahua], Wang, J.Q.[Jia-Qi],
Alpha-CLIP: A CLIP Model Focusing on Wherever you Want,
CVPR24(13019-13029)
IEEE DOI Code:
WWW Link. 2410
Point cloud compression, Visualization, Image recognition, Codes, Large language models, CLIP, Vision-language pretraining, MLLMs BibRef

Parashar, S.[Shubham], Lin, Z.Q.[Zhi-Qiu], Liu, T.[Tian], Dong, X.J.[Xiang-Jue], Li, Y.[Yanan], Ramanan, D.[Deva], Caverlee, J.[James], Kong, S.[Shu],
The Neglected Tails in Vision-Language Models,
CVPR24(12988-12997)
IEEE DOI 2410
Training, Visualization, Accuracy, Large language models, Text to image, Tail, Flowering plants, Vision-Language Models, Long tailed recognition BibRef

Yu, Q.F.[Qi-Fan], Li, J.C.[Jun-Cheng], Wei, L.[Longhui], Pang, L.[Liang], Ye, W.T.[Wen-Tao], Qin, B.S.[Bo-Sheng], Tang, S.L.[Si-Liang], Tian, Q.[Qi], Zhuang, Y.T.[Yue-Ting],
HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data,
CVPR24(12944-12953)
IEEE DOI Code:
WWW Link. 2410
Measurement, Visualization, Toxicology, Correlation, Codes, Large language models, Hallucinations, Vision-language reasoning BibRef

Luo, Y.[Yan], Shi, M.[Min], Khan, M.O.[Muhammad Osama], Afzal, M.M.[Muhammad Muneeb], Huang, H.[Hao], Yuan, S.[Shuaihang], Tian, Y.[Yu], Song, L.[Luo], Kouhana, A.[Ava], Elze, T.[Tobias], Fang, Y.[Yi], Wang, M.Y.[Meng-Yu],
FairCLIP: Harnessing Fairness in Vision-Language Learning,
CVPR24(12289-12301)
IEEE DOI Code:
WWW Link. 2410
Deep learning, Bridges, Analytical models, Ethics, Codes, Computational modeling, Fairness Learning, Large Language Models BibRef

Zara, G.[Giacomo], Conti, A.[Alessandro], Roy, S.[Subhankar], Lathuilière, S.[Stéphane], Rota, P.[Paolo], Ricci, E.[Elisa],
The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation,
ICCV23(10273-10283)
IEEE DOI 2401
BibRef

Liao, Z.[Zhaohe], Li, J.T.[Jiang-Tong], Niu, L.[Li], Zhang, L.Q.[Li-Qing],
Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering,
CVPR24(13395-13404)
IEEE DOI 2410
Measurement, Accuracy, Computational modeling, Aggregates, Large language models, Pipelines BibRef

Zhao, H.B.[Hong-Bo], Ni, B.[Bolin], Fan, J.S.[Jun-Song], Wang, Y.X.[Yu-Xi], Chen, Y.T.[Yun-Tao], Meng, G.F.[Gao-Feng], Zhang, Z.X.[Zhao-Xiang],
Continual Forgetting for Pre-Trained Vision Models,
CVPR24(28631-28642)
IEEE DOI Code:
WWW Link. 2410
Continuing education, Privacy, Codes, Large language models, Face recognition, Object detection, Continual Forgetting, Machine Unlearning BibRef

Kim, K.[Kibum], Yoon, K.[Kanghoon], Jeon, J.[Jaehyeong], In, Y.[Yeonjun], Moon, J.[Jinyoung], Kim, D.H.[Dong-Hyun], Park, C.[Chanyoung],
LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation,
CVPR24(28306-28316)
IEEE DOI Code:
WWW Link. 2410
Training, Visualization, Grounding, Large language models, Semantics, Genomics, Focusing, Scene Understanding, Large Language Model, Long-Tail Problem BibRef

Zhan, X.Y.[Xin-Yu], Yang, L.X.[Li-Xin], Zhao, Y.F.[Yi-Fei], Mao, K.[Kangrui], Xu, H.L.[Han-Lin], Lin, Z.[Zenan], Li, K.L.[Kai-Lin], Lu, C.[Cewu],
OakInk2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion,
CVPR24(445-456)
IEEE DOI Code:
WWW Link. 2410
Annotations, Affordances, Computational modeling, Large language models, Decoding BibRef

Li, Y.C.[Yi-Cong], Zhao, N.[Na], Xiao, J.B.[Jun-Bin], Feng, C.[Chun], Wang, X.[Xiang], Chua, T.S.[Tat-Seng],
LASO: Language-Guided Affordance Segmentation on 3D Object,
CVPR24(14251-14260)
IEEE DOI Code:
WWW Link. 2410
Visualization, Solid modeling, Shape, Affordances, Large language models, Semantics, Multimodal, 3D-Language, Vision-Language BibRef

Rotstein, N.[Noam], Bensaïd, D.[David], Brody, S.[Shaked], Ganz, R.[Roy], Kimmel, R.[Ron],
FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions,
WACV24(5677-5688)
IEEE DOI 2404
Training, Surveys, Visualization, Fuses, Optical character recognition, Training data, Algorithms, Image recognition and understanding BibRef

Pan, J.T.[Jun-Ting], Lin, Z.[Ziyi], Ge, Y.Y.[Yu-Ying], Zhu, X.T.[Xia-Tian], Zhang, R.R.[Ren-Rui], Wang, Y.[Yi], Qiao, Y.[Yu], Li, H.S.[Hong-Sheng],
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models,
MMFM23(272-283)
IEEE DOI 2401
BibRef

Guo, J.X.[Jia-Xian], Li, J.[Junnan], Li, D.X.[Dong-Xu], Tiong, A.M.H.[Anthony Meng Huat], Li, B.Y.[Bo-Yang], Tao, D.C.[Da-Cheng], Hoi, S.[Steven],
From Images to Textual Prompts: Zero-shot Visual Question Answering with Frozen Large Language Models,
CVPR23(10867-10877)
IEEE DOI 2309
BibRef

Chapter on Implementations and Applications, Databases, QBIC, Video Analysis, Hardware and Software, Inspection continues in
Image-Text Matching, Image Text Retrieval, Image-Text Retrieval.


Last update: Jan 20, 2025 at 11:36:25