Hu, Z.J.[Zhong-Jian],
Yang, P.[Peng],
Jiang, Y.S.[Yuan-Shuang],
Bai, Z.J.[Zi-Jian],
Prompting large language model with context and pre-answer for
knowledge-based VQA,
PR(151), 2024, pp. 110399.
Elsevier DOI
2404
Visual question answering, Large language model,
Knowledge-based VQA, Fine-tuning, In-context learning
BibRef
Zhang, Z.C.[Zi-Cheng],
Wu, H.N.[Hao-Ning],
Zhang, E.[Erli],
Zhai, G.T.[Guang-Tao],
Lin, W.S.[Wei-Si],
Q-Bench+: A Benchmark for Multi-Modal Foundation Models on Low-Level
Vision From Single Images to Pairs,
PAMI(46), No. 12, December 2024, pp. 10404-10418.
IEEE DOI
2411
Visualization, Benchmark testing, Task analysis, Natural languages,
Visual perception, Large language models, perception
BibRef
Zhao, Z.[Zihao],
Wang, S.[Sheng],
Gu, J.[Jinchen],
Zhu, Y.[Yitao],
Mei, L.[Lanzhuju],
Zhuang, Z.X.[Zi-Xu],
Cui, Z.M.[Zhi-Ming],
Wang, Q.[Qian],
Shen, D.G.[Ding-Gang],
ChatCAD+: Toward a Universal and Reliable Interactive CAD Using LLMs,
MedImg(43), No. 11, November 2024, pp. 3755-3766.
IEEE DOI
2411
Solid modeling, Reliability, Medical diagnostic imaging, Chatbots,
Visualization, Brain modeling, Databases, Large language models,
computer-assisted diagnosis
BibRef
Luo, H.[Haonan],
Zeng, Y.J.[Yi-Jie],
Yang, L.[Li],
Chen, K.[Kexun],
Shen, Z.X.[Zhi-Xuan],
Lv, F.[Fengmao],
VLAI: Exploration and Exploitation based on Visual-Language Aligned
Information for Robotic Object Goal Navigation,
IVC(151), 2024, pp. 105259.
Elsevier DOI Code:
WWW Link.
2411
Object goal navigation, Visual-to-language,
Embodied artificial intelligence, Large language model
BibRef
Mansourian, A.[Ali],
Oucheikh, R.[Rachid],
ChatGeoAI: Enabling Geospatial Analysis for Public through Natural
Language with Large Language Models,
IJGI(13), No. 10, 2024, pp. 348.
DOI Link
2411
BibRef
Li, D.[Diya],
Zhao, Y.[Yue],
Wang, Z.F.[Zhi-Fang],
Jung, C.[Calvin],
Zhang, Z.[Zhe],
Large Language Model-Driven Structured Output: A Comprehensive
Benchmark and Spatial Data Generation Framework,
IJGI(13), No. 11, 2024, pp. 405.
DOI Link
2412
BibRef
Li, Y.[Yunxin],
Hu, B.[Baotian],
Chen, X.Y.[Xin-Yu],
Ma, L.[Lin],
Xu, Y.[Yong],
Zhang, M.[Min],
LMEye: An Interactive Perception Network for Large Language Models,
MultMed(26), 2024, pp. 10952-10964.
IEEE DOI
2412
Visualization, Task analysis, Data models, Tuning,
Large language models, Training, Cognition,
interactive perception network
BibRef
Shao, R.[Run],
Zhang, Z.Y.[Zhao-Yang],
Tao, C.[Chao],
Zhang, Y.S.[Yun-Sheng],
Peng, C.L.[Chang-Le],
Li, H.F.[Hai-Feng],
Homogeneous tokenizer matters: Homogeneous visual tokenizer for
remote sensing image understanding,
PandRS(218), 2024, pp. 294-310.
Elsevier DOI Code:
WWW Link.
2412
Remote sensing image understanding, Visual tokenizer,
Homogeneous, Semantically independent region, Visual transformer model
BibRef
Liu, T.Q.[Tian-Qi],
Qin, Y.J.[Yan-Jun],
Zhang, S.H.[Shang-Hang],
Tao, X.M.[Xiao-Ming],
Empowering Corner Case Detection in Autonomous Vehicles With
Multimodal Large Language Models,
SPLetters(32), 2025, pp. 51-55.
IEEE DOI
2501
Rare objects in odd locations.
Object detection, Visualization, Autonomous vehicles,
Large language models, Roads, Vectors, Transformers, object detection
BibRef
Liu, Y.[Yi],
Hou, H.[Haowen],
Ma, F.[Fei],
Ni, S.G.[Shi-Guang],
Yu, F.R.[Fei Richard],
MLLM-TA: Leveraging Multimodal Large Language Models for Precise
Temporal Video Grounding,
SPLetters(32), 2025, pp. 281-285.
IEEE DOI
2501
Visualization, Grounding, Large language models, Feature extraction,
Benchmark testing, Vectors, Training, video grounding
BibRef
Wang, Z.H.[Zhe-Hui],
Luo, T.[Tao],
Liu, C.[Cheng],
Liu, W.C.[Wei-Chen],
Goh, R.S.M.[Rick Siow Mong],
Wong, W.F.[Weng-Fai],
Enabling Energy-Efficient Deployment of Large Language Models on
Memristor Crossbar: A Synergy of Large and Small,
PAMI(47), No. 2, February 2025, pp. 916-933.
IEEE DOI
2501
Memristors, Computer architecture, Random access memory,
Nonvolatile memory, Computational modeling, Neural networks,
non-volatile memory
BibRef
Long, F.C.[Fu-Chen],
Qiu, Z.F.[Zhao-Fan],
Yao, T.[Ting],
Mei, T.[Tao],
VideoStudio: Generating Consistent-content and Multi-scene Videos,
ECCV24(LX: 468-485).
Springer DOI
2412
Code:
WWW Link.
BibRef
Liu, S.L.[Shi-Long],
Cheng, H.[Hao],
Liu, H.T.[Hao-Tian],
Zhang, H.[Hao],
Li, F.[Feng],
Ren, T.[Tianhe],
Zou, X.[Xueyan],
Yang, J.W.[Jian-Wei],
Su, H.[Hang],
Zhu, J.[Jun],
Zhang, L.[Lei],
Gao, J.F.[Jian-Feng],
Li, C.Y.[Chun-Yuan],
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents,
ECCV24(XLVII: 126-142).
Springer DOI
2412
BibRef
Kong, X.H.[Xiang-Hao],
Chen, J.[Jinyu],
Wang, W.G.[Wen-Guan],
Su, H.[Hang],
Hu, X.L.[Xiao-Lin],
Yang, Y.[Yi],
Liu, S.[Si],
Controllable Navigation Instruction Generation with Chain of Thought
Prompting,
ECCV24(XXIX: 37-54).
Springer DOI
2412
Instruction generation.
BibRef
Zhu, W.Y.C.[William Yi-Cheng],
Ye, K.[Keren],
Ke, J.J.[Jun-Jie],
Yu, J.[Jiahui],
Guibas, L.J.[Leonidas J.],
Milanfar, P.[Peyman],
Yang, F.[Feng],
ARTVLM: Attribute Recognition Through Vision-based Prefix Language
Modeling,
ECCV24(XXVII: 127-145).
Springer DOI
2412
Code:
WWW Link.
BibRef
Kim, D.[Donggyun],
Cho, S.[Seongwoong],
Kim, S.[Semin],
Luo, C.[Chong],
Hong, S.[Seunghoon],
Chameleon: A Data-efficient Generalist for Dense Visual Prediction in
the Wild,
ECCV24(XXIII: 422-441).
Springer DOI
2412
Code:
WWW Link.
BibRef
Ke, F.[Fucai],
Cai, Z.X.[Zhi-Xi],
Jahangard, S.[Simindokht],
Wang, W.Q.[Wei-Qing],
Haghighi, P.D.[Pari Delir],
Rezatofighi, H.[Hamid],
Hydra: A Hyper Agent for Dynamic Compositional Visual Reasoning,
ECCV24(XX: 132-149).
Springer DOI
2412
BibRef
Bao, X.Y.[Xiao-Yi],
Sun, S.Y.[Si-Yang],
Ma, S.L.[Shuai-Lei],
Zheng, K.C.[Ke-Cheng],
Guo, Y.X.[Yu-Xin],
Zhao, G.S.[Guo-Sheng],
Zheng, Y.[Yun],
Wang, X.G.[Xin-Gang],
CoReS: Orchestrating the Dance of Reasoning and Segmentation,
ECCV24(XVIII: 187-204).
Springer DOI
2412
BibRef
Liu, Z.[Zuyan],
Liu, B.[Benlin],
Wang, J.[Jiahui],
Dong, Y.H.[Yu-Hao],
Chen, G.Y.[Guang-Yi],
Rao, Y.M.[Yong-Ming],
Krishna, R.[Ranjay],
Lu, J.W.[Ji-Wen],
Efficient Inference of Vision Instruction-following Models with Elastic
Cache,
ECCV24(XVII: 54-69).
Springer DOI
2412
Code:
WWW Link.
BibRef
Alaluf, Y.[Yuval],
Richardson, E.[Elad],
Tulyakov, S.[Sergey],
Aberman, K.[Kfir],
Cohen-Or, D.[Daniel],
MYVLM: Personalizing VLMS for User-specific Queries,
ECCV24(XIII: 73-91).
Springer DOI
2412
BibRef
Cai, R.[Rizhao],
Song, Z.[Zirui],
Guan, D.[Dayan],
Chen, Z.H.[Zhen-Hao],
Li, Y.H.[Yao-Hang],
Luo, X.[Xing],
Yi, C.Y.[Chen-Yu],
Kot, A.C.[Alex C.],
BenchLMM: Benchmarking Cross-Style Visual Capability of Large
Multimodal Models,
ECCV24(L: 340-358).
Springer DOI
2412
BibRef
Ma, Z.X.[Zi-Xian],
Huang, W.[Weikai],
Zhang, J.[Jieyu],
Gupta, T.[Tanmay],
Krishna, R.[Ranjay],
m&m's: A Benchmark to Evaluate Tool-use for multi-step multi-modal
Tasks,
ECCV24(X: 18-34).
Springer DOI
2412
WWW Link. and
WWW Link.
BibRef
Zhao, Z.H.[Zhong-Han],
Chai, W.H.[Wen-Hao],
Wang, X.[Xuan],
Li, B.[Boyi],
Hao, S.Y.[Sheng-Yu],
Cao, S.D.[Shi-Dong],
Ye, T.[Tian],
Wang, G.A.[Gao-Ang],
See and Think: Embodied Agent in Virtual Environment,
ECCV24(VIII: 187-204).
Springer DOI
2412
BibRef
Liu, Y.[Yuan],
Duan, H.D.[Hao-Dong],
Zhang, Y.[Yuanhan],
Li, B.[Bo],
Zhang, S.Y.[Song-Yang],
Zhao, W.[Wangbo],
Yuan, Y.[Yike],
Wang, J.Q.[Jia-Qi],
He, C.H.[Cong-Hui],
Liu, Z.W.[Zi-Wei],
Chen, K.[Kai],
Lin, D.[Dahua],
MMBench: Is Your Multi-Modal Model an All-Around Player?,
ECCV24(VI: 216-233).
Springer DOI
2412
BibRef
Liu, Y.[Yang],
Ding, P.X.[Peng-Xiang],
Huang, S.[Siteng],
Zhang, M.[Min],
Zhao, H.[Han],
Wang, D.L.[Dong-Lin],
PiTe: Pixel-Temporal Alignment for Large Video-Language Model,
ECCV24(V: 160-176).
Springer DOI
2412
BibRef
Liu, S.[Shi],
Zheng, K.[Kecheng],
Chen, W.[Wei],
Paying More Attention to Image: A Training-free Method for Alleviating
Hallucination in LVLMS,
ECCV24(LXXXIII: 125-140).
Springer DOI
2412
BibRef
Tu, H.Q.[Hao-Qin],
Cui, C.[Chenhang],
Wang, Z.J.[Zi-Jun],
Zhou, Y.Y.[Yi-Yang],
Zhao, B.C.[Bing-Chen],
Han, J.L.[Jun-Lin],
Zhou, W.[Wangchunshu],
Yao, H.X.[Hua-Xiu],
Xie, C.[Cihang],
How Many Are in This Image? A Safety Evaluation Benchmark for Vision
LLMs,
ECCV24(LI: 37-55).
Springer DOI
2412
BibRef
Panagopoulou, A.[Artemis],
Xue, L.[Le],
Yu, N.[Ning],
Li, J.[Junnan],
Li, D.X.[Dong-Xu],
Joty, S.[Shafiq],
Xu, R.[Ran],
Savarese, S.[Silvio],
Xiong, C.M.[Cai-Ming],
Niebles, J.C.[Juan Carlos],
X-InstructBLIP: A Framework for Aligning Image, 3D, Audio, Video to
LLMs and its Emergent Cross-modal Reasoning,
ECCV24(XLV: 177-197).
Springer DOI
2412
BibRef
Mirza, M.J.[M. Jehanzeb],
Karlinsky, L.[Leonid],
Lin, W.[Wei],
Doveh, S.[Sivan],
Micorek, J.[Jakub],
Kozinski, M.[Mateusz],
Kuehne, H.[Hilde],
Possegger, H.[Horst],
Meta-prompting for Automating Zero-shot Visual Recognition with LLMs,
ECCV24(II: 370-387).
Springer DOI
2412
BibRef
Yu, E.[En],
Zhao, L.[Liang],
Wei, Y.[Yana],
Yang, J.R.[Jin-Rong],
Wu, D.M.[Dong-Ming],
Kong, L.Y.[Ling-Yu],
Wang, T.[Tiancai],
Ge, Z.[Zheng],
Zhang, X.Y.[Xiang-Yu],
Tao, W.B.[Wen-Bing],
Merlin: Empowering Multimodal LLMs with Foresight Minds,
ECCV24(IV: 425-443).
Springer DOI
2412
BibRef
Liu, Z.Y.[Zhao-Yang],
Lai, Z.[Zeqiang],
Gao, Z.W.[Zhang-Wei],
Cui, E.[Erfei],
Li, Z.H.[Zi-Heng],
Zhu, X.[Xizhou],
Lu, L.W.[Le-Wei],
Chen, Q.F.[Qi-Feng],
Qiao, Y.[Yu],
Dai, J.F.[Ji-Feng],
Wang, W.H.[Wen-Hai],
ControlLLM: Augment Language Models with Tools by Searching on Graphs,
ECCV24(XII: 89-105).
Springer DOI
2412
BibRef
Yao, Y.[Yi],
Hsu, C.F.[Chan-Feng],
Lin, J.H.[Jhe-Hao],
Xie, H.X.[Hong-Xia],
Lin, T.[Terence],
Huang, Y.N.[Yi-Ning],
Shuai, H.H.[Hong-Han],
Cheng, W.H.[Wen-Huang],
The Fabrication of Reality and Fantasy: Scene Generation with
LLM-assisted Prompt Interpretation,
ECCV24(XXII: 422-438).
Springer DOI
2412
BibRef
Wu, Y.X.[Yi-Xuan],
Wang, Y.Z.[Yi-Zhou],
Tang, S.X.[Shi-Xiang],
Wu, W.H.[Wen-Hao],
He, T.[Tong],
Ouyang, W.L.[Wan-Li],
Torr, P.H.S.[Philip H.S.],
Wu, J.[Jian],
DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of
MLLM,
ECCV24(XXXII: 164-182).
Springer DOI
2412
BibRef
Song, K.[Kunpeng],
Zhu, Y.Z.[Yi-Zhe],
Liu, B.C.[Bing-Chen],
Yan, Q.[Qing],
Elgammal, A.[Ahmed],
Yang, X.[Xiao],
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation,
ECCV24(XL: 117-132).
Springer DOI
2412
BibRef
Wang, H.[Han],
Ye, Y.J.[Yong-Jie],
Wang, Y.J.[Yan-Jie],
Nie, Y.X.[Yu-Xiang],
Huang, C.[Can],
Elysium: Exploring Object-level Perception in Videos via MLLM,
ECCV24(XXII: 166-185).
Springer DOI
2412
BibRef
Gou, Y.H.[Yun-Hao],
Chen, K.[Kai],
Liu, Z.[Zhili],
Hong, L.Q.[Lan-Qing],
Xu, H.[Hang],
Li, Z.G.[Zhen-Guo],
Yeung, D.Y.[Dit-Yan],
Kwok, J.T.[James T.],
Zhang, Y.[Yu],
Eyes Closed, Safety on: Protecting Multimodal LLMs via Image-to-text
Transformation,
ECCV24(XVII: 388-404).
Springer DOI
2412
BibRef
Guo, Z.H.[Zong-Hao],
Xu, R.[Ruyi],
Yao, Y.[Yuan],
Cui, J.[Junbo],
Ni, Z.[Zanlin],
Ge, C.J.[Chun-Jiang],
Chua, T.S.[Tat-Seng],
Liu, Z.Y.[Zhi-Yuan],
Huang, G.[Gao],
LLaVA-UHD: An LMM Perceiving Any Aspect Ratio and High-resolution
Images,
ECCV24(LXXXIII: 390-406).
Springer DOI
2412
BibRef
Wang, D.S.[Dong-Sheng],
Cui, J.[Jiequan],
Li, M.[Miaoge],
Lin, W.[Wang],
Chen, B.[Bo],
Zhang, H.W.[Han-Wang],
Instruction Tuning-free Visual Token Complement for Multimodal LLMs,
ECCV24(LXXXI: 446-462).
Springer DOI
2412
BibRef
You, K.[Keen],
Zhang, H.T.[Hao-Tian],
Schoop, E.[Eldon],
Weers, F.[Floris],
Swearngin, A.[Amanda],
Nichols, J.[Jeffrey],
Yang, Y.F.[Yin-Fei],
Gan, Z.[Zhe],
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs,
ECCV24(LXIV: 240-255).
Springer DOI
2412
BibRef
McKinzie, B.[Brandon],
Gan, Z.[Zhe],
Fauconnier, J.P.[Jean-Philippe],
Dodge, S.[Sam],
Zhang, B.[Bowen],
Dufter, P.[Philipp],
Shah, D.[Dhruti],
Du, X.Z.[Xian-Zhi],
Peng, F.[Futang],
Belyi, A.[Anton],
Zhang, H.T.[Hao-Tian],
Singh, K.[Karanjeet],
Kang, D.[Doug],
Hè, H.Y.[Hong-Yu],
Schwarzer, M.[Max],
Gunter, T.[Tom],
Kong, X.[Xiang],
Zhang, A.[Aonan],
Wang, J.Y.[Jian-Yu],
Wang, C.[Chong],
Du, N.[Nan],
Lei, T.[Tao],
Wiseman, S.[Sam],
Lee, M.[Mark],
Wang, Z.[Zirui],
Pang, R.[Ruoming],
Grasch, P.[Peter],
Toshev, A.[Alexander],
Yang, Y.F.[Yin-Fei],
MM1: Methods, Analysis and Insights from Multimodal LLM Pre-training,
ECCV24(XXIX: 304-323).
Springer DOI
2412
BibRef
Zhou, G.Z.[Geng-Ze],
Hong, Y.C.[Yi-Cong],
Wang, Z.[Zun],
Wang, X.E.[Xin Eric],
Wu, Q.[Qi],
NavGPT-2: Unleashing Navigational Reasoning Capability for Large
Vision-language Models,
ECCV24(VII: 260-278).
Springer DOI
2412
BibRef
Wei, H.R.[Hao-Ran],
Kong, L.Y.[Ling-Yu],
Chen, J.Y.[Jin-Yue],
Zhao, L.[Liang],
Ge, Z.[Zheng],
Yang, J.R.[Jin-Rong],
Wang, T.[Tiancai],
Zhang, X.Y.[Xiang-Yu],
Tao, W.B.[Wen-Bing],
Vary: Scaling up the Vision Vocabulary for Large Vision-language Model,
ECCV24(IV: 408-424).
Springer DOI
2412
BibRef
Wang, Y.[Yu],
Liu, X.G.[Xiao-Geng],
Li, Y.[Yu],
Chen, M.[Muhao],
Xiao, C.W.[Chao-Wei],
AdaShield: Safeguarding Multimodal Large Language Models from
Structure-based Attack via Adaptive Shield Prompting,
ECCV24(XX: 77-94).
Springer DOI
2412
BibRef
He, S.T.[Shu-Ting],
Ding, H.H.[Heng-Hui],
Jiang, X.D.[Xu-Dong],
Wen, B.[Bihan],
SegPoint: Segment Any Point Cloud via Large Language Model,
ECCV24(XXII: 349-367).
Springer DOI
2412
BibRef
Zhao, H.H.[Henry Hengyuan],
Zhou, P.[Pan],
Shou, M.Z.[Mike Zheng],
Genixer: Empowering Multimodal Large Language Model as a Powerful Data
Generator,
ECCV24(XXIII: 129-147).
Springer DOI
2412
BibRef
Fu, X.Y.[Xing-Yu],
Hu, Y.S.[Yu-Shi],
Li, B.Z.[Bang-Zheng],
Feng, Y.[Yu],
Wang, H.Y.[Hao-Yu],
Lin, X.D.[Xu-Dong],
Roth, D.[Dan],
Smith, N.A.[Noah A.],
Ma, W.C.[Wei-Chiu],
Krishna, R.[Ranjay],
Blink: Multimodal Large Language Models Can See but Not Perceive,
ECCV24(XXIII: 148-166).
Springer DOI
2412
BibRef
Zhang, Z.K.[Zhi-Kai],
Li, Y.T.[Yi-Tang],
Huang, H.F.[Hao-Feng],
Lin, M.X.[Ming-Xian],
Yi, L.[Li],
FreeMotion: Mocap-free Human Motion Synthesis with Multimodal Large
Language Models,
ECCV24(XXIII: 403-421).
Springer DOI
2412
BibRef
Murugesan, B.[Balamurali],
Silva-Rodríguez, J.[Julio],
Ben Ayed, I.[Ismail],
Dolz, J.[Jose],
Robust Calibration of Large Vision-language Adapters,
ECCV24(XXIV: 147-165).
Springer DOI
2412
BibRef
Xu, R.[Runsen],
Wang, X.L.[Xiao-Long],
Wang, T.[Tai],
Chen, Y.L.[Yi-Lun],
Pang, J.M.[Jiang-Miao],
Lin, D.[Dahua],
PointLLM: Empowering Large Language Models to Understand Point Clouds,
ECCV24(XXV: 131-147).
Springer DOI
2412
BibRef
Cai, K.W.[Kai-Wen],
Duan, Z.K.[Zhe-Kai],
Liu, G.[Gaowen],
Fleming, C.[Charles],
Lu, C.X.X.[Chris Xiao-Xuan],
Self-adapting Large Visual-language Models to Edge Devices Across
Visual Modalities,
ECCV24(XXVIII: 301-318).
Springer DOI
2412
BibRef
Yu, R.[Runpeng],
Yu, W.H.[Wei-Hao],
Wang, X.C.[Xin-Chao],
Attention Prompting on Image for Large Vision-language Models,
ECCV24(XXX: 251-268).
Springer DOI
2412
BibRef
Luo, Y.L.[Yu-Lin],
An, R.[Ruichuan],
Zou, B.[Bocheng],
Tang, Y.M.[Yi-Ming],
Liu, J.M.[Jia-Ming],
Zhang, S.H.[Shang-Hang],
LLM as Dataset Analyst: Subpopulation Structure Discovery with Large
Language Model,
ECCV24(XXXIII: 235-252).
Springer DOI
2412
BibRef
Pi, R.J.[Ren-Jie],
Han, T.Y.[Tian-Yang],
Xiong, W.[Wei],
Zhang, J.P.[Ji-Peng],
Liu, R.[Runtao],
Pan, R.[Rui],
Zhang, T.[Tong],
Strengthening Multimodal Large Language Model with Bootstrapped
Preference Optimization,
ECCV24(XXXIII: 382-398).
Springer DOI
2412
BibRef
Chen, Y.[Yuan],
Ding, Z.H.[Zi-Han],
Wang, Z.Q.[Zi-Qin],
Wang, Y.[Yan],
Zhang, L.J.[Li-Jun],
Liu, S.[Si],
Asynchronous Large Language Model Enhanced Planner for Autonomous
Driving,
ECCV24(XXXVI: 22-38).
Springer DOI
2412
BibRef
Huang, Z.J.[Zhi-Jian],
Tang, T.[Tao],
Chen, S.X.[Shao-Xiang],
Lin, S.[Sihao],
Jie, Z.Q.[Ze-Qun],
Ma, L.[Lin],
Wang, G.[Guangrun],
Liang, X.D.[Xiao-Dan],
Making Large Language Models Better Planners with Reasoning-decision
Alignment,
ECCV24(XXXVI: 73-90).
Springer DOI
2412
BibRef
Xia, B.[Bin],
Wang, S.Y.[Shi-Yin],
Tao, Y.[Yingfan],
Wang, Y.T.[Yi-Tong],
Jia, J.Y.[Jia-Ya],
LLMGA: Multimodal Large Language Model Based Generation Assistant,
ECCV24(XXXVIII: 389-406).
Springer DOI
2412
BibRef
Zhan, Y.F.[Yu-Fei],
Zhu, Y.[Yousong],
Chen, Z.Y.[Zhi-Yang],
Yang, F.[Fan],
Tang, M.[Ming],
Wang, J.Q.[Jin-Qiao],
Griffon: Spelling Out All Object Locations at Any Granularity with
Large Language Models,
ECCV24(XLII: 405-422).
Springer DOI
2412
BibRef
Li, Y.W.[Yan-Wei],
Wang, C.Y.[Cheng-Yao],
Jia, J.Y.[Jia-Ya],
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models,
ECCV24(XLVI: 323-340).
Springer DOI
2412
BibRef
Ju, C.[Chen],
Wang, H.[Haicheng],
Cheng, H.Z.[Hao-Zhe],
Chen, X.[Xu],
Zhai, Z.H.[Zhong-Hua],
Huang, W.L.[Wei-Lin],
Lan, J.S.[Jin-Song],
Xiao, S.[Shuai],
Zheng, B.[Bo],
Turbo: Informativity-driven Acceleration Plug-in for Vision-language
Large Models,
ECCV24(XLVI: 436-455).
Springer DOI
2412
BibRef
Zhao, Q.[Qinyu],
Xu, M.[Ming],
Gupta, K.[Kartik],
Asthana, A.[Akshay],
Zheng, L.[Liang],
Gould, S.[Stephen],
The First to Know: How Token Distributions Reveal Hidden Knowledge in
Large Vision-language Models?,
ECCV24(XLVIII: 127-142).
Springer DOI
2412
BibRef
Lee, B.K.[Byung-Kwan],
Park, B.[Beomchan],
Kim, C.W.[Chae Won],
Ro, Y.M.[Yong Man],
MoAI: Mixture of All Intelligence for Large Language and Vision Models,
ECCV24(XLIX: 273-302).
Springer DOI
2412
BibRef
Liu, X.[Xin],
Zhu, Y.C.[Yi-Chen],
Gu, J.D.[Jin-Dong],
Lan, Y.[Yunshi],
Yang, C.[Chao],
Qiao, Y.[Yu],
MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large
Language Models,
ECCV24(LVI: 386-403).
Springer DOI
2412
BibRef
Liu, R.[Ruyang],
Li, C.[Chen],
Tang, H.R.[Hao-Ran],
Ge, Y.X.[Yi-Xiao],
Shan, Y.[Ying],
Li, G.[Ge],
ST-LLM: Large Language Models Are Effective Temporal Learners,
ECCV24(LVII: 1-18).
Springer DOI
2412
BibRef
Cheng, H.[Hao],
Xiao, E.[Erjia],
Gu, J.D.[Jin-Dong],
Yang, L.[Le],
Duan, J.[Jinhao],
Zhang, J.[Jize],
Cao, J.H.[Jia-Hang],
Xu, K.D.[Kai-Di],
Xu, R.[Renjing],
Unveiling Typographic Deceptions: Insights of the Typographic
Vulnerability in Large Vision-language Models,
ECCV24(LIX: 179-196).
Springer DOI
2412
BibRef
Lin, Z.[Ziyi],
Liu, D.Y.[Dong-Yang],
Zhang, R.R.[Ren-Rui],
Gao, P.[Peng],
Qiu, L.[Longtian],
Xiao, H.[Han],
Qiu, H.[Han],
Shao, W.Q.[Wen-Qi],
Chen, K.Q.[Ke-Qin],
Han, J.M.[Jia-Ming],
Huang, S.Y.[Si-Yuan],
Zhang, Y.[Yichi],
He, X.M.[Xu-Ming],
Qiao, Y.[Yu],
Li, H.S.[Hong-Sheng],
SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for
Multi-modal Large Language Models,
ECCV24(LXII: 36-55).
Springer DOI
2412
BibRef
Chiquier, M.[Mia],
Mall, U.[Utkarsh],
Vondrick, C.[Carl],
Evolving Interpretable Visual Classifiers with Large Language Models,
ECCV24(LXIV: 183-201).
Springer DOI
2412
BibRef
Zhang, J.[Jinrui],
Wang, T.[Teng],
Zhang, H.G.[Hai-Gang],
Lu, P.[Ping],
Zheng, F.[Feng],
Reflective Instruction Tuning: Mitigating Hallucinations in Large
Vision-language Models,
ECCV24(LXVIII: 196-213).
Springer DOI
2412
BibRef
Li, Y.F.[Yi-Fan],
Guo, H.[Hangyu],
Zhou, K.[Kun],
Zhao, W.X.[Wayne Xin],
Wen, J.R.[Ji-Rong],
Images are Achilles' Heel of Alignment: Exploiting Visual
Vulnerabilities for Jailbreaking Multimodal Large Language Models,
ECCV24(LXXIII: 174-189).
Springer DOI
2412
BibRef
Wu, T.[Tianhe],
Ma, K.[Kede],
Liang, J.[Jie],
Yang, Y.[Yujiu],
Zhang, L.[Lei],
A Comprehensive Study of Multimodal Large Language Models for Image
Quality Assessment,
ECCV24(LXXIV: 143-160).
Springer DOI
2412
BibRef
Muhtar, D.[Dilxat],
Li, Z.[Zhenshi],
Gu, F.[Feng],
Zhang, X.L.[Xue-Liang],
Xiao, P.F.[Peng-Feng],
Lhrs-bot: Empowering Remote Sensing with Vgi-enhanced Large Multimodal
Language Model,
ECCV24(LXXIV: 440-457).
Springer DOI
2412
BibRef
Chen, L.[Liang],
Zhao, H.Z.[Hao-Zhe],
Liu, T.Y.[Tian-Yu],
Bai, S.[Shuai],
Lin, J.Y.[Jun-Yang],
Zhou, C.[Chang],
Chang, B.[Baobao],
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-play Inference
Acceleration for Large Vision-language Models,
ECCV24(LXXXI: 19-35).
Springer DOI
2412
BibRef
Yang, Y.C.[Yu-Chen],
Lee, K.[Kwonjoon],
Dariush, B.[Behzad],
Cao, Y.[Yinzhi],
Lo, S.Y.[Shao-Yuan],
Follow the Rules: Reasoning for Video Anomaly Detection with Large
Language Models,
ECCV24(LXXXI: 304-322).
Springer DOI
2412
BibRef
Chen, Y.C.[Yi-Chia],
Li, W.H.[Wei-Hua],
Sun, C.[Cheng],
Wang, Y.C.F.[Yu-Chiang Frank],
Chen, C.S.[Chu-Song],
SAM4MLLM: Enhance Multi-modal Large Language Model for Referring
Expression Segmentation,
ECCV24(LXXXI: 323-340).
Springer DOI
2412
BibRef
Zheng, S.[Sipeng],
Zhou, B.[Bohan],
Feng, Y.C.[Yi-Cheng],
Wang, Y.[Ye],
Lu, Z.Q.[Zong-Qing],
UniCode: Learning a Unified Codebook for Multimodal Large Language
Models,
ECCV24(VIII: 426-443).
Springer DOI
2412
BibRef
Shi, B.F.[Bai-Feng],
Wu, Z.Y.[Zi-Yang],
Mao, M.L.[Mao-Lin],
Wang, X.[Xin],
Darrell, T.J.[Trevor J.],
When Do We Not Need Larger Vision Models?,
ECCV24(VIII: 444-462).
Springer DOI
2412
BibRef
Sun, G.H.[Guo-Hao],
Qin, C.[Can],
Wang, J.[Jiamian],
Chen, Z.[Zeyuan],
Xu, R.[Ran],
Tao, Z.Q.[Zhi-Qiang],
SQ-LLaVA: Self-questioning for Large Vision-language Assistant,
ECCV24(IX: 156-172).
Springer DOI
2412
BibRef
Ye, Q.[Qilang],
Yu, Z.T.[Zi-Tong],
Shao, R.[Rui],
Xie, X.Y.[Xin-Yu],
Torr, P.H.S.[Philip H.S.],
Cao, X.C.[Xiao-Chun],
CAT: Enhancing Multimodal Large Language Model to Answer Questions in
Dynamic Audio-visual Scenarios,
ECCV24(X: 146-164).
Springer DOI
2412
BibRef
Yu, Q.H.[Qi-Hang],
Shen, X.H.[Xiao-Hui],
Chen, L.C.[Liang-Chieh],
Towards Open-ended Visual Recognition with Large Language Models,
ECCV24(XIV: 359-376).
Springer DOI
2412
BibRef
Yan, C.[Cilin],
Wang, H.C.[Hao-Chen],
Yan, S.L.[Shi-Lin],
Jiang, X.L.[Xiao-Long],
Hu, Y.[Yao],
Kang, G.L.[Guo-Liang],
Xie, W.[Weidi],
Gavves, E.[Efstratios],
VISA: Reasoning Video Object Segmentation via Large Language Models,
ECCV24(XV: 98-115).
Springer DOI
2412
BibRef
Huang, K.[Kai],
Zou, H.[Hao],
Xi, Y.[Ye],
Wang, B.C.[Bo-Chen],
Xie, Z.[Zhen],
Yu, L.[Liang],
IVTP: Instruction-guided Visual Token Pruning for Large Vision-language
Models,
ECCV24(XVII: 214-230).
Springer DOI
2412
BibRef
Liu, H.T.[Hao-Tian],
Li, C.Y.[Chun-Yuan],
Li, Y.H.[Yu-Heng],
Lee, Y.J.[Yong Jae],
Improved Baselines with Visual Instruction Tuning,
CVPR24(26286-26296)
IEEE DOI
2410
Training, Connectors, Visualization, Systematics, Codes, Computational modeling
BibRef
Ren, Z.W.[Zhong-Wei],
Huang, Z.C.[Zhi-Cheng],
Wei, Y.C.[Yun-Chao],
Zhao, Y.[Yao],
Fu, D.M.[Dong-Mei],
Feng, J.S.[Jia-Shi],
Jin, X.J.[Xiao-Jie],
PixelLM: Pixel Reasoning with Large Multimodal Model,
CVPR24(26364-26373)
IEEE DOI
2410
Bridges, Image segmentation, Codes, Benchmark testing, Cognition, Decoding
BibRef
Hu, Y.[Yutao],
Li, T.[Tianbin],
Lu, Q.[Quanfeng],
Shao, W.Q.[Wen-Qi],
He, J.J.[Jun-Jun],
Qiao, Y.[Yu],
Luo, P.[Ping],
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for
Medical LVLM,
CVPR24(22170-22183)
IEEE DOI Code:
WWW Link.
2410
Reflectivity, Visualization, Biological system modeling,
Computational modeling, Medical services, Benchmark testing
BibRef
Schiappa, M.[Madeline],
Abdullah, R.[Raiyaan],
Azad, S.[Shehreen],
Claypoole, J.[Jared],
Cogswell, M.[Michael],
Divakaran, A.[Ajay],
Rawat, Y.[Yogesh],
Probing Conceptual Understanding of Large Visual-Language Models,
WhatNext24(1797-1807)
IEEE DOI Code:
WWW Link.
2410
Training, Visualization, Shape, Snow, Color, Benchmark testing,
Transformers, Robustness, Conceptual understanding
BibRef
Yue, T.T.[Tong-Tian],
Cheng, J.[Jie],
Guo, L.T.[Long-Teng],
Dai, X.Y.[Xing-Yuan],
Zhao, Z.[Zijia],
He, X.J.[Xing-Jian],
Xiong, G.[Gang],
Lv, Y.S.[Yi-Sheng],
Liu, J.[Jing],
SC-Tune: Unleashing Self-Consistent Referential Comprehension in
Large Vision Language Models,
CVPR24(13073-13083)
IEEE DOI Code:
WWW Link.
2410
Training, Codes, Computational modeling, Focusing, Benchmark testing
BibRef
Wu, T.H.[Tsung-Han],
Lian, L.[Long],
Gonzalez, J.E.[Joseph E.],
Li, B.[Boyi],
Darrell, T.J.[Trevor J.],
Self-Correcting LLM-Controlled Diffusion Models,
CVPR24(6327-6336)
IEEE DOI Code:
WWW Link.
2410
Image synthesis, Pipelines, Text to image, Process control,
Detectors, Superluminescent diodes, Diffusion models
BibRef
Yue, X.[Xiang],
Ni, Y.S.[Yuan-Sheng],
Zheng, T.Y.[Tian-Yu],
Zhang, K.[Kai],
Liu, R.[Ruoqi],
Zhang, G.[Ge],
Stevens, S.[Samuel],
Jiang, D.[Dongfu],
Ren, W.M.[Wei-Ming],
Sun, Y.X.[Yu-Xuan],
Wei, C.[Cong],
Yu, B.T.[Bo-Tao],
Yuan, R.B.[Rui-Bin],
Sun, R.L.[Ren-Liang],
Yin, M.[Ming],
Zheng, B.[Boyuan],
Yang, Z.Z.[Zhen-Zhu],
Liu, Y.[Yibo],
Huang, W.H.[Wen-Hao],
Sun, H.[Huan],
Su, Y.[Yu],
Chen, W.[Wenhu],
MMMU: A Massive Multi-Discipline Multimodal Understanding and
Reasoning Benchmark for Expert AGI,
CVPR24(9556-9567)
IEEE DOI
2410
Computational modeling, Artificial general intelligence,
Social sciences, Manuals, Benchmark testing, Cognition, LLMs
BibRef
Li, Z.[Zhuowan],
Jasani, B.[Bhavan],
Tang, P.[Peng],
Ghadar, S.[Shabnam],
Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators
for Reasoning-Based Chart VQA,
CVPR24(13613-13623)
IEEE DOI
2410
Training, Visualization, Technological innovation, Accuracy,
Computational modeling, Training data, Data augmentation
BibRef
Zheng, D.[Duo],
Huang, S.[Shijia],
Zhao, L.[Lin],
Zhong, Y.[Yiwu],
Wang, L.W.[Li-Wei],
Towards Learning a Generalist Model for Embodied Navigation,
CVPR24(13624-13634)
IEEE DOI Code:
WWW Link.
2410
Training, Adaptation models, Solid modeling, Navigation,
Soft sensors, Computational modeling, Visual-Language Navigation,
LLM
BibRef
Singh, S.[Simranjit],
Fore, M.[Michael],
Stamoulis, D.[Dimitrios],
GeoLLM-Engine: A Realistic Environment for Building Geospatial
Copilots,
EarthVision24(585-594)
IEEE DOI
2410
Earth, Geology, Natural languages, Benchmark testing,
Parallel processing, Geospatial analysis, Satellite images,
Benchmark
BibRef
Li, X.C.[Xu-Chen],
Feng, X.K.[Xiao-Kun],
Hu, S.Y.[Shi-Yu],
Wu, M.[Meiqi],
Zhang, D.L.[Dai-Ling],
Zhang, J.[Jing],
Huang, K.Q.[Kai-Qi],
DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based
on LLM,
VDU24(7283-7292)
IEEE DOI
2410
Visualization, Annotations, Semantics, Natural languages, Benchmark testing
BibRef
Zhang, Y.C.[Yue-Chen],
Qian, S.J.[Sheng-Ju],
Peng, B.[Bohao],
Liu, S.[Shu],
Jia, J.Y.[Jia-Ya],
Prompt Highlighter: Interactive Control for Multi-Modal LLMs,
CVPR24(13215-13224)
IEEE DOI
2410
Training, Semantics, Process control, Focusing,
Reliability, Usability, VLM, LLM, Interactive Control, Image Caption,
Training-Free
BibRef
Kaul, P.[Prannay],
Li, Z.Z.[Zhi-Zhong],
Yang, H.[Hao],
Dukler, Y.[Yonatan],
Swaminathan, A.[Ashwin],
Taylor, C.J.,
Soatto, S.[Stefano],
THRONE: An Object-Based Hallucination Benchmark for the Free-Form
Generations of Large Vision-Language Models,
CVPR24(27218-27228)
IEEE DOI
2410
Measurement, Training, Ethics, Accuracy, Computational modeling,
Graphics processing units, hallucination, benchmark, LLM, LVLM,
large vision-language model
BibRef
Özdemir, Ö.[Övgü],
Akagündüz, E.[Erdem],
Enhancing Visual Question Answering through Question-Driven Image
Captions as Prompts,
Prompting24(1562-1571)
IEEE DOI Code:
WWW Link.
2410
Visualization, Computational modeling, Large language models,
Pipelines, Semantics, Question answering (information retrieval),
image captioning
BibRef
Shao, Z.W.[Zhen-Wei],
Yu, Z.[Zhou],
Wang, M.[Meng],
Yu, J.[Jun],
Prompting Large Language Models with Answer Heuristics for
Knowledge-Based Visual Question Answering,
CVPR23(14974-14983)
IEEE DOI
2309
BibRef
Wang, D.K.[Dong-Kai],
Xuan, S.Y.[Shi-Yu],
Zhang, S.L.[Shi-Liang],
LocLLM: Exploiting Generalizable Human Keypoint Localization via
Large Language Model,
CVPR24(614-623)
IEEE DOI Code:
WWW Link.
2410
Location awareness, Training, Large language models, Pipelines,
Training data, Cognition, Keypoint Localization,
Large Language Model
BibRef
Liu, H.[Hanchao],
Zhan, X.H.[Xiao-Hang],
Huang, S.L.[Shao-Li],
Mu, T.J.[Tai-Jiang],
Shan, Y.[Ying],
Programmable Motion Generation for Open-Set Motion Control Tasks,
CVPR24(1399-1408)
IEEE DOI
2410
Motion planning, Large language models, Computational modeling,
Semantics, Dynamics, Training data
BibRef
Zhu, L.[Lanyun],
Chen, T.R.[Tian-Run],
Ji, D.[Deyi],
Ye, J.P.[Jie-Ping],
Liu, J.[Jun],
LLaFS: When Large Language Models Meet Few-Shot Segmentation,
CVPR24(3065-3075)
IEEE DOI
2410
Training, Image segmentation, Visualization, Large language models,
Natural language processing,
Large vision-language models
BibRef
Wu, J.F.[Jun-Feng],
Jiang, Y.[Yi],
Liu, Q.H.[Qi-Hao],
Yuan, Z.H.[Ze-Huan],
Bai, X.[Xiang],
Bai, S.[Song],
General Object Foundation Model for Images and Videos at Scale,
CVPR24(3783-3795)
IEEE DOI Code:
WWW Link.
2410
Training, Visualization, Image segmentation, Grounding, Soft sensors,
Large language models
BibRef
Xia, Z.F.[Zhuo-Fan],
Han, D.C.[Dong-Chen],
Han, Y.Z.[Yi-Zeng],
Pan, X.[Xuran],
Song, S.[Shiji],
Huang, G.[Gao],
GSVA: Generalized Segmentation via Multimodal Large Language Models,
CVPR24(3858-3869)
IEEE DOI Code:
WWW Link.
2410
Image segmentation, Visualization, Codes, Large language models,
Benchmark testing
BibRef
Zhao, L.[Lirui],
Yang, Y.[Yue],
Zhang, K.[Kaipeng],
Shao, W.Q.[Wen-Qi],
Zhang, Y.X.[Yu-Xin],
Qiao, Y.[Yu],
Luo, P.[Ping],
Ji, R.R.[Rong-Rong],
DiffAgent: Fast and Accurate Text-to-Image API Selection with Large
Language Model,
CVPR24(6390-6399)
IEEE DOI Code:
WWW Link.
2410
Training, Technological innovation, Accuracy, Codes,
Large language models, Computational modeling, LLM Agent, LLM Tool Usage
BibRef
Yao, J.[Junyi],
Liu, Y.J.[Yi-Jiang],
Dong, Z.[Zhen],
Guo, M.F.[Ming-Fei],
Hu, H.[Helan],
Keutzer, K.[Kurt],
Du, L.[Li],
Zhou, D.[Daquan],
Zhang, S.H.[Shang-Hang],
PromptCoT: Align Prompt Distribution via Adapted Chain-of-Thought,
CVPR24(7027-7037)
IEEE DOI
2410
Training, Adaptation models, Visualization, Computational modeling,
Large language models, Semantics, Text to image
BibRef
Cai, Z.P.[Zhi-Peng],
Mueller, M.[Matthias],
Birkl, R.[Reiner],
Wofk, D.[Diana],
Tseng, S.Y.[Shao-Yen],
Cheng, J.[Junda],
Stan, G.B.M.[Gabriela Ben-Melech],
Lai, V.[Vasudev],
Paulitsch, M.[Michael],
L-MAGIC: Language Model Assisted Generation of Images with Coherence,
CVPR24(7049-7058)
IEEE DOI Code:
WWW Link.
2410
Point cloud compression, Solid modeling, Layout, Superresolution,
Estimation, Diffusion models, Image generation, large language models
BibRef
Li, Y.[Yanyu],
Liu, X.[Xian],
Kag, A.[Anil],
Hu, J.[Ju],
Idelbayev, Y.[Yerlan],
Sagar, D.[Dhritiman],
Wang, Y.Z.[Yan-Zhi],
Tulyakov, S.[Sergey],
Ren, J.[Jian],
TextCraftor: Your Text Encoder can be Image Quality Controller,
CVPR24(7985-7995)
IEEE DOI
2410
Training, Measurement, Interpolation, Image synthesis,
Large language models, Pipelines, Text to image, Stable Diffusion,
Image and video synthesis and generation
BibRef
Argaw, D.M.[Dawit Mureja],
Yoon, S.H.[Seung-Hyun],
Heilbron, F.C.[Fabian Caba],
Deilamsalehy, H.[Hanieh],
Bui, T.[Trung],
Wang, Z.W.[Zhao-Wen],
Dernoncourt, F.[Franck],
Chung, J.S.[Joon Son],
Scaling Up Video Summarization Pretraining with Large Language Models,
CVPR24(8332-8341)
IEEE DOI
2410
Analytical models, Large language models, Computational modeling,
Pipelines, Benchmark testing
BibRef
Tong, S.[Shengbang],
Liu, Z.[Zhuang],
Zhai, Y.X.[Yue-Xiang],
Ma, Y.[Yi],
LeCun, Y.[Yann],
Xie, S.[Saining],
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs,
CVPR24(9568-9578)
IEEE DOI
2410
Representation learning, Visualization, Systematics, Correlation,
Grounding, Large language models, Multimodal LLMs, Vision Language Model
BibRef
Lai, X.[Xin],
Tian, Z.[Zhuotao],
Chen, Y.[Yukang],
Li, Y.W.[Yan-Wei],
Yuan, Y.H.[Yu-Hui],
Liu, S.[Shu],
Jia, J.Y.[Jia-Ya],
LISA: Reasoning Segmentation via Large Language Model,
CVPR24(9579-9589)
IEEE DOI
2410
Image segmentation, Vocabulary, Visualization, Target recognition,
Large language models, Benchmark testing
BibRef
Shang, C.[Chenming],
Zhou, S.[Shiji],
Zhang, H.[Hengyuan],
Ni, X.Z.[Xin-Zhe],
Yang, Y.[Yujiu],
Wang, Y.[Yuwang],
Incremental Residual Concept Bottleneck Models,
CVPR24(11030-11040)
IEEE DOI
2410
Measurement, Visualization, Accuracy, Large language models,
Current measurement, Decision making, Closed box
BibRef
Xie, Y.T.[Yu-Tong],
Chen, Q.[Qi],
Wang, S.[Sinuo],
To, M.S.[Minh-Son],
Lee, I.[Iris],
Khoo, E.W.[Ee Win],
Hendy, K.[Kerolos],
Koh, D.[Daniel],
Xia, Y.[Yong],
Wu, Q.[Qi],
PairAug: What Can Augmented Image-Text Pairs Do for Radiology?,
CVPR24(11652-11661)
IEEE DOI Code:
WWW Link.
2410
Data privacy, Medical conditions, Large language models, Radiology,
Data augmentation
BibRef
Dong, Z.K.[Zhi-Kang],
Liu, X.[Xiulong],
Chen, B.[Bin],
Polak, P.[Pawel],
Zhang, P.[Peng],
MuseChat: A Conversational Music Recommendation System for Videos,
CVPR24(12775-12785)
IEEE DOI Code:
WWW Link.
2410
Accuracy, Large language models, Natural languages, Cognition,
Recommender systems, Multimodal Learning,
Music Information Retrieval
BibRef
Li, F.[Feng],
Jiang, Q.[Qing],
Zhang, H.[Hao],
Ren, T.[Tianhe],
Liu, S.[Shilong],
Zou, X.[Xueyan],
Xu, H.Z.[Huai-Zhe],
Li, H.Y.[Hong-Yang],
Yang, J.W.[Jian-Wei],
Li, C.Y.[Chun-Yuan],
Zhang, L.[Lei],
Gao, J.F.[Jian-Feng],
Visual in-Context Prompting,
CVPR24(12861-12871)
IEEE DOI Code:
WWW Link.
2410
Training, Visualization, Image segmentation, Codes,
Large language models, Computer architecture
BibRef
Sachdeva, R.[Ragav],
Zisserman, A.[Andrew],
The Manga Whisperer: Automatically Generating Transcriptions for
Comics,
CVPR24(12967-12976)
IEEE DOI Code:
WWW Link.
2410
Visualization, Codes, Large language models, Visual impairment,
Oral communication, Linguistics
BibRef
Ranasinghe, K.[Kanchana],
Shukla, S.N.[Satya Narayan],
Poursaeed, O.[Omid],
Ryoo, M.S.[Michael S.],
Lin, T.Y.[Tsung-Yu],
Learning to Localize Objects Improves Spatial Reasoning in
Visual-LLMs,
CVPR24(12977-12987)
IEEE DOI
2410
Training, Location awareness, Visualization, Image coding,
Large language models, Pipelines, Cognition, LLM, VQA, Localization,
Video
BibRef
Xu, J.R.[Jia-Rui],
Zhou, X.Y.[Xing-Yi],
Yan, S.[Shen],
Gu, X.[Xiuye],
Arnab, A.[Anurag],
Sun, C.[Chen],
Wang, X.L.[Xiao-Long],
Schmid, C.[Cordelia],
Pixel Aligned Language Models,
CVPR24(13030-13039)
IEEE DOI
2410
Location awareness, Visualization, Grounding,
Large language models, Machine vision, Computational modeling
BibRef
Ye, Q.H.[Qing-Hao],
Xu, H.Y.[Hai-Yang],
Ye, J.[Jiabo],
Yan, M.[Ming],
Hu, A.[Anwen],
Liu, H.[Haowei],
Qian, Q.[Qi],
Zhang, J.[Ji],
Huang, F.[Fei],
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with
Modality Collaboration,
CVPR24(13040-13051)
IEEE DOI
2410
Large language models, Computational modeling, Collaboration,
Cognition, Decoding, Vision Language
BibRef
Qi, P.[Peng],
Yan, Z.[Zehong],
Hsu, W.[Wynne],
Lee, M.L.[Mong Li],
Sniffer: Multimodal Large Language Model for Explainable
Out-of-Context Misinformation Detection,
CVPR24(13052-13062)
IEEE DOI
2410
Visualization, Adaptation models, Accuracy, Large language models,
Computational modeling, Data models, multimodal misinformation,
explainability
BibRef
Wu, P.H.[Peng-Hao],
Xie, S.[Saining],
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs,
CVPR24(13084-13094)
IEEE DOI
2410
Training, Visualization, Grounding, Computational modeling, Seals,
Benchmark testing, multimodal large language model,
visual search
BibRef
He, R.[Ruozhen],
Cascante-Bonilla, P.[Paola],
Yang, Z.Y.[Zi-Yan],
Berg, A.C.[Alexander C.],
Ordonez, V.[Vicente],
Improved Visual Grounding through Self-Consistent Explanations,
CVPR24(13095-13105)
IEEE DOI
2410
Location awareness, Visualization, Vocabulary, Grounding,
Large language models, Data augmentation, Data models,
visual grounding
BibRef
Zhong, S.S.[Shan-Shan],
Huang, Z.Z.[Zhong-Zhan],
Gao, S.[Shanghua],
Wen, W.[Wushao],
Lin, L.[Liang],
Zitnik, M.[Marinka],
Zhou, P.[Pan],
Let's Think Outside the Box: Exploring Leap-of-Thought in Large
Language Models with Creative Humor Generation,
CVPR24(13246-13257)
IEEE DOI Code:
WWW Link.
2410
Technological innovation, Codes, Large language models, Games,
Cognition
BibRef
Gao, Z.[Zhi],
Du, Y.T.[Yun-Tao],
Zhang, X.T.[Xin-Tong],
Ma, X.J.[Xiao-Jian],
Han, W.J.[Wen-Juan],
Zhu, S.C.[Song-Chun],
Li, Q.[Qing],
CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update,
CVPR24(13258-13268)
IEEE DOI
2410
Continuing education, Visualization, Limiting,
Large language models, Training data, Tagging, Reflection,
Compositional Reasoning
BibRef
Feng, C.[Chun],
Hsu, J.[Joy],
Liu, W.Y.[Wei-Yu],
Wu, J.J.[Jia-Jun],
Naturally Supervised 3D Visual Grounding with Language-Regularized
Concept Learners,
CVPR24(13269-13278)
IEEE DOI
2410
Visualization, Solid modeling, Accuracy, Grounding,
Large language models, 3D visual grounding, Language constraints
BibRef
Li, B.[Bohao],
Ge, Y.Y.[Yu-Ying],
Ge, Y.X.[Yi-Xiao],
Wang, G.Z.[Guang-Zhi],
Wang, R.[Rui],
Zhang, R.M.[Rui-Mao],
Shan, Y.[Ying],
SEED-Bench: Benchmarking Multimodal Large Language Models,
CVPR24(13299-13308)
IEEE DOI Code:
WWW Link.
2410
Accuracy, Codes, Annotations, Image synthesis, Large language models,
Computational modeling, Benchmark, Multimodal, Hierarchical
BibRef
Buettner, K.[Kyle],
Malakouti, S.[Sina],
Li, X.L.[Xiang Lorraine],
Kovashka, A.[Adriana],
Incorporating Geo-Diverse Knowledge into Prompting for Increased
Geographical Robustness in Object Recognition,
CVPR24(13515-13524)
IEEE DOI
2410
Geography, Training, Large language models, Training data, Europe, Robustness
BibRef
Tan, R.[Reuben],
Sun, X.[Ximeng],
Hu, P.[Ping],
Wang, J.H.[Jui-Hsien],
Deilamsalehy, H.[Hanieh],
Plummer, B.A.[Bryan A.],
Russell, B.[Bryan],
Saenko, K.[Kate],
Koala: Key Frame-Conditioned Long Video-LLM,
CVPR24(13581-13591)
IEEE DOI
2410
Visualization, Accuracy, Large language models, Computational modeling,
Benchmark testing, Question answering (information retrieval)
BibRef
Liu, R.[Ruyang],
Li, C.[Chen],
Ge, Y.X.[Yi-Xiao],
Li, T.H.[Thomas H.],
Shan, Y.[Ying],
Li, G.[Ge],
BT-Adapter: Video Conversation is Feasible Without Video Instruction
Tuning,
CVPR24(13658-13667)
IEEE DOI Code:
WWW Link.
2410
Training, Adaptation models, Visualization, Costs,
Computational modeling, Graphics processing units,
Video Large Language Models
BibRef
Ding, X.P.[Xin-Peng],
Han, J.H.[Jian-Hua],
Xu, H.[Hang],
Liang, X.D.[Xiao-Dan],
Zhang, W.[Wei],
Li, X.M.[Xiao-Meng],
Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected
Multi-Modal Large Models,
CVPR24(13668-13677)
IEEE DOI Code:
WWW Link.
2410
Bridges, Large language models, Semantics, Autonomous vehicles
BibRef
Li, J.X.[Jia-Xuan],
Vo, D.M.[Duc Minh],
Sugimoto, A.[Akihiro],
Nakayama, H.[Hideki],
EVCap: Retrieval-Augmented Image Captioning with External Visual-Name
Memory for Open-World Comprehension,
CVPR24(13733-13742)
IEEE DOI
2410
Training, Visualization, Adaptation models, Costs,
Large language models, Memory management, Image Captioning,
External Memory
BibRef
Song, L.[Lin],
Chen, Y.[Yukang],
Yang, S.[Shuai],
Ding, X.H.[Xiao-Han],
Ge, Y.X.[Yi-Xiao],
Chen, Y.C.[Ying-Cong],
Shan, Y.[Ying],
Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs,
CVPR24(13763-13773)
IEEE DOI
2410
Training, Attention mechanisms, Computational modeling,
Large language models, Benchmark testing, Natural language processing
BibRef
Guo, Q.[Qiushan],
de Mello, S.[Shalini],
Yin, H.X.[Hong-Xu],
Byeon, W.[Wonmin],
Cheung, K.C.[Ka Chun],
Yu, Y.Z.[Yi-Zhou],
Luo, P.[Ping],
Liu, S.[Sifei],
RegionGPT: Towards Region Understanding Vision Language Model,
CVPR24(13796-13806)
IEEE DOI
2410
Training, Visualization, Large language models, Pipelines,
Training data, Object detection, Cognition
BibRef
Yu, T.Y.[Tian-Yu],
Yao, Y.[Yuan],
Zhang, H.[Haoye],
He, T.[Taiwen],
Han, Y.F.[Yi-Feng],
Cui, G.[Ganqu],
Hu, J.Y.[Jin-Yi],
Liu, Z.Y.[Zhi-Yuan],
Zheng, H.T.[Hai-Tao],
Sun, M.[Maosong],
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from
Fine-Grained Correctional Human Feedback,
CVPR24(13807-13816)
IEEE DOI
2410
Image segmentation, Accuracy, Large language models,
Computational modeling, Benchmark testing, Cognition, vision,
hallucination
BibRef
Xuan, S.Y.[Shi-Yu],
Guo, Q.[Qingpei],
Yang, M.[Ming],
Zhang, S.L.[Shi-Liang],
Pink: Unveiling the Power of Referential Comprehension for
Multi-modal LLMs,
CVPR24(13838-13848)
IEEE DOI Code:
WWW Link.
2410
Training, Visualization, Costs, Accuracy, Annotations, Large language models
BibRef
Ganz, R.[Roy],
Kittenplon, Y.[Yair],
Aberdam, A.[Aviad],
Avraham, E.B.[Elad Ben],
Nuriel, O.[Oren],
Mazor, S.[Shai],
Litman, R.[Ron],
Question Aware Vision Transformer for Multimodal Reasoning,
CVPR24(13861-13871)
IEEE DOI
2410
Visualization, Image coding, Large language models, Focusing,
Computer architecture, Transformers
BibRef
Bansal, H.[Hritik],
Bitton, Y.[Yonatan],
Szpektor, I.[Idan],
Chang, K.W.[Kai-Wei],
Grover, A.[Aditya],
VideoCon: Robust Video-Language Alignment via Contrast Captions,
CVPR24(13927-13937)
IEEE DOI
2410
Large language models, Semantics,
Question answering (information retrieval), Data models,
large multimodal models
BibRef
Wang, S.W.[Shao-Wei],
Zhang, L.L.[Ling-Ling],
Zhu, L.[Longji],
Qin, T.[Tao],
Yap, K.H.[Kim-Hui],
Zhang, X.Y.[Xin-Yu],
Liu, J.[Jun],
CoG-DQA: Chain-of-Guiding Learning with Large Language Models for
Diagram Question Answering,
CVPR24(13969-13979)
IEEE DOI
2410
Bridges, Visualization, Large language models,
Computational modeling, Natural languages, Large Language Model
BibRef
He, J.W.[Jun-Wen],
Wang, Y.F.[Yi-Fan],
Wang, L.J.[Li-Jun],
Lu, H.C.[Hu-Chuan],
He, J.Y.[Jun-Yan],
Lan, J.P.[Jin-Peng],
Luo, B.[Bin],
Xie, X.[Xuansong],
Multi-Modal Instruction Tuned LLMs with Fine-Grained Visual
Perception,
CVPR24(13980-13990)
IEEE DOI Code:
WWW Link.
2410
Image segmentation, Visualization, Technological innovation,
Grounding, Computational modeling, Large language models, Natural languages
BibRef
Yu, Q.[Qiying],
Sun, Q.[Quan],
Zhang, X.S.[Xiao-Song],
Cui, Y.F.[Yu-Feng],
Zhang, F.[Fan],
Cao, Y.[Yue],
Wang, X.L.[Xin-Long],
Liu, J.J.[Jing-Jing],
CapsFusion: Rethinking Image-Text Data at Scale,
CVPR24(14022-14032)
IEEE DOI
2410
Training, Knowledge engineering, Scalability,
Large language models, Computational modeling, Noise
BibRef
Yao, J.W.[Jia-Wei],
Qian, Q.[Qi],
Hu, J.[Juhua],
Multi-Modal Proxy Learning Towards Personalized Visual Multiple
Clustering,
CVPR24(14066-14075)
IEEE DOI Code:
WWW Link.
2410
Deep learning, Bridges, Visualization, Codes, Large language models,
Face recognition
BibRef
Zou, B.[Bo],
Yang, C.[Chao],
Qiao, Y.[Yu],
Quan, C.B.[Cheng-Bin],
Zhao, Y.J.[You-Jian],
LLaMA-Excitor: General Instruction Tuning via Indirect Feature
Interaction,
CVPR24(14089-14099)
IEEE DOI Code:
WWW Link.
2410
Visualization, Adaptation models, Codes, Computational modeling,
Benchmark testing, Instruction Tuning, PEFT,
Large Language Model
BibRef
Huang, B.[Bin],
Wang, X.[Xin],
Chen, H.[Hong],
Song, Z.[Zihan],
Zhu, W.W.[Wen-Wu],
VTimeLLM: Empower LLM to Grasp Video Moments,
CVPR24(14271-14280)
IEEE DOI Code:
WWW Link.
2410
Training, Visualization, Grounding, Large language models,
Benchmark testing, Cognition
BibRef
Hong, W.[Wenyi],
Wang, W.H.[Wei-Han],
Lv, Q.S.[Qing-Song],
Xu, J.Z.[Jia-Zheng],
Yu, W.[Wenmeng],
Ji, J.H.[Jun-Hui],
Wang, Y.[Yan],
Wang, Z.[Zihan],
Dong, Y.X.[Yu-Xiao],
Ding, M.[Ming],
Tang, J.[Jie],
CogAgent: A Visual Language Model for GUI Agents,
CVPR24(14281-14290)
IEEE DOI Code:
WWW Link.
2410
Visualization, Limiting, Image resolution, Image recognition,
Navigation, Large language models, Benchmark testing
BibRef
Khan, Z.[Zaid],
BG, V.K.[Vijay Kumar],
Schulter, S.[Samuel],
Fu, Y.[Yun],
Chandraker, M.[Manmohan],
Self-Training Large Language Models for Improved Visual Program
Synthesis With Visual Reinforcement,
CVPR24(14344-14353)
IEEE DOI Code:
WWW Link.
2410
Training, Visualization, Annotations, Large language models,
Object detection, Question answering (information retrieval),
visual question answering
BibRef
Mitra, C.[Chancharik],
Huang, B.[Brandon],
Darrell, T.J.[Trevor J.],
Herzig, R.[Roei],
Compositional Chain-of-Thought Prompting for Large Multimodal Models,
CVPR24(14420-14431)
IEEE DOI Code:
WWW Link.
2410
Bridges, Visualization, Codes, Annotations, Large language models,
Benchmark testing, Large Multimodal Models, Multimodality,
Prompting
BibRef
Li, B.[Boyi],
Wang, Y.[Yue],
Mao, J.[Jiageng],
Ivanovic, B.[Boris],
Veer, S.[Sushant],
Leung, K.[Karen],
Pavone, M.[Marco],
Driving Everywhere with Large Language Model Policy Adaptation,
CVPR24(14948-14957)
IEEE DOI
2410
Measurement, Video on demand, Accuracy, Large language models,
Planning, Large Language Models, Driving Copilot
BibRef
Wei, Y.X.[Yu-Xi],
Wang, Z.[Zi],
Lu, Y.F.[Yi-Fan],
Xu, C.X.[Chen-Xin],
Liu, C.X.[Chang-Xing],
Zhao, H.[Hao],
Chen, S.[Siheng],
Wang, Y.F.[Yan-Feng],
Editable Scene Simulation for Autonomous Driving via Collaborative
LLM-Agents,
CVPR24(15077-15087)
IEEE DOI Code:
WWW Link.
2410
Large language models, Face recognition, Natural languages,
Collaboration, Lighting, Rendering (computer graphics),
LLM agent
BibRef
Shao, H.[Hao],
Hu, Y.X.[Yu-Xuan],
Wang, L.[Letian],
Song, G.L.[Guang-Lu],
Waslander, S.L.[Steven L.],
Liu, Y.[Yu],
Li, H.S.[Hong-Sheng],
LMDrive: Closed-Loop End-to-End Driving with Large Language Models,
CVPR24(15120-15130)
IEEE DOI
2410
Navigation, Large language models, Multimodal sensors,
Natural languages, Benchmark testing, Software, LLM, autonomous driving
BibRef
Ma, Y.S.[Yun-Sheng],
Cui, C.[Can],
Cao, X.[Xu],
Ye, W.Q.[Wen-Qian],
Liu, P.R.[Pei-Ran],
Lu, J.[Juanwu],
Abdelraouf, A.[Amr],
Gupta, R.[Rohit],
Han, K.T.[Kyung-Tae],
Bera, A.[Aniket],
Rehg, J.M.[James M.],
Wang, Z.[Ziran],
LaMPilot: An Open Benchmark Dataset for Autonomous Driving with
Language Model Programs,
CVPR24(15141-15151)
IEEE DOI
2410
Codes, Large language models, Benchmark testing, Cognition, Safety,
Pattern recognition
BibRef
Zhang, J.W.[Jia-Wei],
Xu, C.[Chejian],
Li, B.[Bo],
ChatScene: Knowledge-Enabled Safety-Critical Scenario Generation for
Autonomous Vehicles,
CVPR24(15459-15469)
IEEE DOI Code:
WWW Link.
2410
Training, Codes, Large language models, Transforms, Robustness, Safety,
Autonomous Driving, Large Language Model, Safety-Critical Scenario
BibRef
Liu, C.[Chaohu],
Yin, K.[Kun],
Cao, H.Y.[Hao-Yu],
Jiang, X.H.[Xing-Hua],
Li, X.[Xin],
Liu, Y.[Yinsong],
Jiang, D.Q.[De-Qiang],
Sun, X.[Xing],
Xu, L.[Linli],
HRVDA: High-Resolution Visual Document Assistant,
CVPR24(15534-15545)
IEEE DOI
2410
Training, Visualization, Large language models,
Computational modeling, Training data, Transformers,
Multimodal
BibRef
Blau, T.[Tsachi],
Fogel, S.[Sharon],
Ronen, R.[Roi],
Golts, A.[Alona],
Tsiper, S.[Shahar],
Avraham, E.B.[Elad Ben],
Aberdam, A.[Aviad],
Ganz, R.[Roy],
Litman, R.[Ron],
GRAM: Global Reasoning for Multi-Page VQA,
CVPR24(15598-15607)
IEEE DOI
2410
Adaptation models, Visualization, Computational modeling,
Large language models, Benchmark testing, Transformers, Cognition,
Vision Language Models
BibRef
Luo, C.[Chuwei],
Shen, Y.F.[Yu-Fan],
Zhu, Z.Q.[Zhao-Qing],
Zheng, Q.[Qi],
Yu, Z.[Zhi],
Yao, C.[Cong],
LayoutLLM: Layout Instruction Tuning with Large Language Models for
Document Understanding,
CVPR24(15630-15640)
IEEE DOI
2410
Large language models, Layout, Manuals, Inspection, Benchmark testing,
Boosting, Document Understanding, Layout, Large Language Models
BibRef
Yang, Y.[Yue],
Sun, F.Y.[Fan-Yun],
Weihs, L.[Luca],
Vanderbilt, E.[Eli],
Herrasti, A.[Alvaro],
Han, W.[Winson],
Wu, J.J.[Jia-Jun],
Haber, N.[Nick],
Krishna, R.[Ranjay],
Liu, L.J.[Ling-Jie],
Callison-Burch, C.[Chris],
Yatskar, M.[Mark],
Kembhavi, A.[Aniruddha],
Clark, C.[Christopher],
Holodeck: Language Guided Generation of 3D Embodied AI Environments,
CVPR24(16277-16287)
IEEE DOI
2410
Training, Navigation, Large language models, Semantics, Layout, Stars,
Embodied AI, 3D Scene Generation, Language-guided Generation
BibRef
Qin, Y.[Yiran],
Zhou, E.[Enshen],
Liu, Q.[Qichang],
Yin, Z.F.[Zhen-Fei],
Sheng, L.[Lu],
Zhang, R.M.[Rui-Mao],
Qiao, Y.[Yu],
Shao, J.[Jing],
MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active
Perception,
CVPR24(16307-16316)
IEEE DOI Code:
WWW Link.
2410
Visualization, Large language models, Active perception, Planning,
Compounds
BibRef
Zhang, S.[Sixian],
Yu, X.Y.[Xin-Yao],
Song, X.H.[Xin-Hang],
Wang, X.H.[Xiao-Han],
Jiang, S.Q.[Shu-Qiang],
Imagine Before Go: Self-Supervised Generative Map for Object Goal
Navigation,
CVPR24(16414-16425)
IEEE DOI Code:
WWW Link.
2410
Training, Geometry, Navigation, Large language models, Semantics,
Layout, Self-supervised learning, Embodied AI, Object Goal Navigation
BibRef
Li, H.[Hao],
Yang, X.[Xue],
Wang, Z.[Zhaokai],
Zhu, X.[Xizhou],
Zhou, J.[Jie],
Qiao, Y.[Yu],
Wang, X.G.[Xiao-Gang],
Li, H.S.[Hong-Sheng],
Lu, L.W.[Le-Wei],
Dai, J.F.[Ji-Feng],
Auto MC-Reward: Automated Dense Reward Design with Large Language
Models for Minecraft,
CVPR24(16426-16435)
IEEE DOI
2410
Learning systems, Codes, Large language models, Lava, Semantics,
Reinforcement learning, Syntactics, Large Language Model, Reward Shaping
BibRef
Liu, M.X.[Ming-Xuan],
Hayes, T.L.[Tyler L.],
Ricci, E.[Elisa],
Csurka, G.[Gabriela],
Volpi, R.[Riccardo],
SHiNe: Semantic Hierarchy Nexus for Open-Vocabulary Object Detection,
CVPR24(16634-16644)
IEEE DOI
2410
Vocabulary, Fuses, Large language models, Semantics, Detectors,
Object detection, Open-vocabulary, Object Detection, Vision-Language
BibRef
Lei, T.[Ting],
Yin, S.F.[Shao-Feng],
Liu, Y.[Yang],
Exploring the Potential of Large Foundation Models for
Open-Vocabulary HOI Detection,
CVPR24(16657-16667)
IEEE DOI Code:
WWW Link.
2410
Vocabulary, Correlation, Large language models, Semantics,
Natural languages, Detectors
BibRef
Kim, J.[Jooyeon],
Cho, E.[Eulrang],
Kim, S.[Sehyung],
Kim, H.W.J.[Hyun-Woo J.],
Retrieval-Augmented Open-Vocabulary Object Detection,
CVPR24(17427-17436)
IEEE DOI Code:
WWW Link.
2410
Portable media players, Visualization, Vocabulary,
Large language models, Semantics, Detectors, Object detection,
Retrieval-Augmentation
BibRef
Saha, O.[Oindrila],
van Horn, G.[Grant],
Maji, S.[Subhransu],
Improved Zero-Shot Classification by Adapting VLMs with Text
Descriptions,
CVPR24(17542-17552)
IEEE DOI Code:
WWW Link.
2410
Training, Visualization, Large language models, Habitats,
Benchmark testing, Birds, Zero Shot Learning,
Fine-grained Classification
BibRef
Toubal, I.E.[Imad Eddine],
Avinash, A.[Aditya],
Alldrin, N.G.[Neil Gordon],
Dlabal, J.[Jan],
Zhou, W.[Wenlei],
Luo, E.[Enming],
Stretcu, O.[Otilia],
Xiong, H.[Hao],
Lu, C.T.[Chun-Ta],
Zhou, H.[Howard],
Krishna, R.[Ranjay],
Fuxman, A.[Ariel],
Duerig, T.[Tom],
Modeling Collaborator: Enabling Subjective Vision Classification with
Minimal Human Effort via LLM Tool-Use,
CVPR24(17553-17563)
IEEE DOI
2410
Visualization, Computational modeling, Large language models,
Natural languages, Wildlife, Training data, Manuals, tool-use
BibRef
Li, X.Q.[Xiao-Qi],
Zhang, M.X.[Ming-Xu],
Geng, Y.[Yiran],
Geng, H.R.[Hao-Ran],
Long, Y.X.[Yu-Xing],
Shen, Y.[Yan],
Zhang, R.R.[Ren-Rui],
Liu, J.[JiaMing],
Dong, H.[Hao],
ManipLLM: Embodied Multimodal Large Language Model for Object-Centric
Robotic Manipulation,
CVPR24(18061-18070)
IEEE DOI Code:
WWW Link.
2410
Training, Adaptation models, Large language models, Transforms,
Predictive models, Robot sensing systems, Cognition, Embodied AI,
Multi-modal Large Language Model
BibRef
Han, T.[Tengda],
Bain, M.[Max],
Nagrani, A.[Arsha],
Varol, G.[Gül],
Xie, W.[Weidi],
Zisserman, A.[Andrew],
AutoAD III: The Prequel: Back to the Pixels,
CVPR24(18164-18174)
IEEE DOI
2410
Training, Measurement, Visualization, Large language models,
Current measurement, Training data, Computer architecture
BibRef
Song, E.[Enxin],
Chai, W.H.[Wen-Hao],
Wang, G.[Guanhong],
Zhang, Y.C.[Yu-Cheng],
Zhou, H.Y.[Hao-Yang],
Wu, F.[Feiyang],
Chi, H.Z.[Hao-Zhe],
Guo, X.[Xun],
Ye, T.[Tian],
Zhang, Y.T.[Yan-Ting],
Lu, Y.[Yan],
Hwang, J.N.[Jenq-Neng],
Wang, G.[Gaoang],
MovieChat: From Dense Token to Sparse Memory for Long Video
Understanding,
CVPR24(18221-18232)
IEEE DOI Code:
WWW Link.
2410
Visualization, Costs, Large language models,
Computational modeling, Manuals, Transformers
BibRef
Qu, H.X.[Hao-Xuan],
Cai, Y.J.[Yu-Jun],
Liu, J.[Jun],
LLMs are Good Action Recognizers,
CVPR24(18395-18406)
IEEE DOI
2410
Accuracy, Large language models, Computer architecture,
Linguistics, Benchmark testing, Skeleton
BibRef
Chen, J.[Joya],
Lv, Z.Y.[Zhao-Yang],
Wu, S.W.[Shi-Wei],
Lin, K.Q.[Kevin Qinghong],
Song, C.[Chenan],
Gao, D.F.[Di-Fei],
Liu, J.W.[Jia-Wei],
Gao, Z.T.[Zi-Teng],
Mao, D.X.[Dong-Xing],
Shou, M.Z.[Mike Zheng],
VideoLLM-online: Online Video Large Language Model for Streaming
Video,
CVPR24(18407-18418)
IEEE DOI
2410
Training, Large language models, Soft sensors, Pipelines,
Streaming media, Rendering (computer graphics), Data models
BibRef
Zhu, A.[Anqi],
Ke, Q.H.[Qiu-Hong],
Gong, M.M.[Ming-Ming],
Bailey, J.[James],
Part-Aware Unified Representation of Language and Skeleton for
Zero-Shot Action Recognition,
CVPR24(18761-18770)
IEEE DOI Code:
WWW Link.
2410
Visualization, Source coding, Large language models,
Natural languages, Skeleton,
representation learning
BibRef
Chen, T.J.[Tong-Jia],
Yu, H.S.[Hong-Shan],
Yang, Z.G.[Zhen-Geng],
Li, Z.C.[Ze-Chuan],
Sun, W.[Wei],
Chen, C.[Chen],
OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor
for General Video Recognition,
CVPR24(18888-18898)
IEEE DOI
2410
Training, Adaptation models, Visualization, Large language models,
Semantics, Pipelines, Refining, Video Recognition,
Multi-modality Video Understanding
BibRef
Zhao, Q.H.[Qi-Hao],
Dai, Y.[Yalun],
Li, H.[Hao],
Hu, W.[Wei],
Zhang, F.[Fan],
Liu, J.[Jun],
LTGC: Long-Tail Recognition via Leveraging LLMs-Driven Generated
Content,
CVPR24(19510-19520)
IEEE DOI
2410
Semantic segmentation, Large language models,
Computational modeling, Data visualization, Tail, Benchmark testing
BibRef
Siddiqui, Y.[Yawar],
Alliegro, A.[Antonio],
Artemov, A.[Alexey],
Tommasi, T.[Tatiana],
Sirigatti, D.[Daniele],
Rosov, V.[Vladislav],
Dai, A.[Angela],
Nießner, M.[Matthias],
MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers,
CVPR24(19615-19625)
IEEE DOI
2410
Geometry, Vocabulary, Solid modeling, Shape, Large language models,
Transformers, Mesh Generation, Generative Models for 3D,
Transformers
BibRef
Yuan, Z.H.[Zhi-Hao],
Ren, J.[Jinke],
Feng, C.M.[Chun-Mei],
Zhao, H.S.[Heng-Shuang],
Cui, S.G.[Shu-Guang],
Li, Z.[Zhen],
Visual Programming for Zero-Shot Open-Vocabulary 3D Visual Grounding,
CVPR24(20623-20633)
IEEE DOI Code:
WWW Link.
2410
Visualization, Vocabulary, Grounding, Annotations, Navigation,
Large language models, Visual Grounding, Point Cloud, Vision and Language
BibRef
Li, Z.[Zhe],
Gao, Z.Y.[Zhang-Yang],
Tan, C.[Cheng],
Ren, B.[Bocheng],
Yang, L.T.[Laurence T.],
Li, S.Z.[Stan Z.],
General Point Model Pretraining with Autoencoding and Autoregressive,
CVPR24(20954-20964)
IEEE DOI Code:
WWW Link.
2410
Point cloud compression, Representation learning, Codes,
Large language models, Vector quantization, Computational modeling
BibRef
Li, K.C.[Kun-Chang],
Wang, Y.[Yali],
He, Y.[Yinan],
Li, Y.Z.[Yi-Zhuo],
Wang, Y.[Yi],
Liu, Y.[Yi],
Wang, Z.[Zun],
Xu, J.[Jilan],
Chen, G.[Guo],
Luo, P.[Ping],
Wang, L.M.[Li-Min],
Qiao, Y.[Yu],
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark,
CVPR24(22195-22206)
IEEE DOI Code:
WWW Link.
2410
Training, Systematics, Large language models, Image annotation,
Manuals, Benchmark testing
BibRef
Taesiri, M.R.[Mohammad Reza],
Feng, T.J.[Tian-Jun],
Bezemer, C.P.[Cor-Paul],
Nguyen, A.[Anh],
GlitchBench: Can Large Multimodal Models Detect Video Game Glitches?,
CVPR24(22444-22455)
IEEE DOI Code:
WWW Link.
2410
Video games, Visualization, Quality assurance, Large language models,
Benchmark testing, Linguistics, Cognition, game testing
BibRef
Zhang, R.[Ruiyi],
Zhang, Y.Z.[Yan-Zhe],
Chen, J.[Jian],
Zhou, Y.F.[Yu-Fan],
Gu, J.X.[Jiu-Xiang],
Chen, C.[Changyou],
Sun, T.[Tong],
TRINS: Towards Multimodal Language Models that Can Read,
CVPR24(22584-22594)
IEEE DOI
2410
Visualization, Annotations, Large language models,
Computational modeling, Optical character recognition, Training data
BibRef
Zhang, H.J.[Hao-Jie],
Su, Y.Y.[Yong-Yi],
Xu, X.[Xun],
Jia, K.[Kui],
Improving the Generalization of Segmentation Foundation Model under
Distribution Shift via Weakly Supervised Adaptation,
CVPR24(23385-23395)
IEEE DOI
2410
Image segmentation, Costs, Large language models, Robustness,
Computational efficiency, Domain Adaptation,
Weakly Supervised Adaptation
BibRef
Dunlap, L.[Lisa],
Zhang, Y.H.[Yu-Hui],
Wang, X.H.[Xiao-Han],
Zhong, R.Q.[Rui-Qi],
Darrell, T.J.[Trevor J.],
Steinhardt, J.[Jacob],
Gonzalez, J.E.[Joseph E.],
Yeung-Levy, S.[Serena],
Describing Differences in Image Sets with Natural Language,
CVPR24(24199-24208)
IEEE DOI Code:
WWW Link.
2410
Analytical models, Large language models, Computational modeling,
Natural languages, Human in the loop
BibRef
Ishmam, A.M.[Alvi Md],
Thomas, C.[Christopher],
Semantic Shield: Defending Vision-Language Models Against Backdooring
and Poisoning via Fine-Grained Knowledge Alignment,
CVPR24(24820-24830)
IEEE DOI
2410
Training, Visualization, Correlation, Computational modeling,
Large language models, Semantics, Adversarial attack and defense,
Vision languge model
BibRef
Wu, H.N.[Hao-Ning],
Zhang, Z.C.[Zi-Cheng],
Zhang, E.[Erli],
Chen, C.F.[Chao-Feng],
Liao, L.[Liang],
Wang, A.[Annan],
Xu, K.X.[Kai-Xin],
Li, C.Y.[Chun-Yi],
Hou, J.W.[Jing-Wen],
Zhai, G.T.[Guang-Tao],
Xue, G.[Geng],
Sun, W.X.[Wen-Xiu],
Yan, Q.[Qiong],
Lin, W.S.[Wei-Si],
Q-Instruct: Improving Low-Level Visual Abilities for Multi-Modality
Foundation Models,
CVPR24(25490-25500)
IEEE DOI
2410
Visualization, Accuracy, Large language models, Natural languages,
Solids, Quality assessment
BibRef
Yang, Y.J.[Yi-Jun],
Zhou, T.Y.[Tian-Yi],
Li, K.[Kanxue],
Tao, D.P.[Da-Peng],
Li, L.[Lusong],
Shen, L.[Li],
He, X.D.[Xiao-Dong],
Jiang, J.[Jing],
Shi, Y.H.[Yu-Hui],
Embodied Multi-Modal Agent trained by an LLM from a Parallel
TextWorld,
CVPR24(26265-26275)
IEEE DOI
2410
Training, Visualization, Imitation learning, Large language models,
Robustness, Reflection, Embodied AI, Large Language Models, Imitation Learning
BibRef
Hong, Y.[Yining],
Zheng, Z.[Zishuo],
Chen, P.H.[Pei-Hao],
Wang, Y.F.[Yi-Fan],
Li, J.[Junyan],
Gan, C.[Chuang],
MultiPLY: A Multisensory Object-Centric Embodied Large Language Model
in 3D World,
CVPR24(26396-26406)
IEEE DOI
2410
Visualization, Correlation, Navigation, Large language models,
Computational modeling
BibRef
Chen, G.[Gongwei],
Shen, L.[Leyang],
Shao, R.[Rui],
Deng, X.[Xiang],
Nie, L.Q.[Li-Qiang],
LION: Empowering Multimodal Large Language Model with Dual-Level
Visual Knowledge,
CVPR24(26530-26540)
IEEE DOI
2410
Visualization, Accuracy, Grounding, Large language models, Semantics,
Benchmark testing
BibRef
Zhang, Y.[Yichi],
Dong, Y.P.[Yin-Peng],
Zhang, S.Y.[Si-Yuan],
Min, T.Z.[Tian-Zan],
Su, H.[Hang],
Zhu, J.[Jun],
Exploring the Transferability of Visual Prompting for Multimodal
Large Language Models,
CVPR24(26552-26562)
IEEE DOI
2410
Training, Visualization, Adaptation models, Computational modeling,
Large language models, Semantics, Feature extraction,
Transferability
BibRef
Han, J.[JiaMing],
Gong, K.X.[Kai-Xiong],
Zhang, Y.Y.[Yi-Yuan],
Wang, J.Q.[Jia-Qi],
Zhang, K.[Kaipeng],
Lin, D.[Dahua],
Qiao, Y.[Yu],
Gao, P.[Peng],
Yue, X.Y.[Xiang-Yu],
OneLLM: One Framework to Align All Modalities with Language,
CVPR24(26574-26585)
IEEE DOI Code:
WWW Link.
2410
Point cloud compression, Large language models, Pipelines,
Benchmark testing, Functional magnetic resonance imaging, Routing
BibRef
Xie, H.X.[Hong-Xia],
Peng, C.J.[Chu-Jun],
Tseng, Y.W.[Yu-Wen],
Chen, H.J.[Hung-Jen],
Hsu, C.F.[Chan-Feng],
Shuai, H.H.[Hong-Han],
Cheng, W.H.[Wen-Huang],
EmoVIT: Revolutionizing Emotion Insights with Visual Instruction
Tuning,
CVPR24(26586-26595)
IEEE DOI Code:
WWW Link.
2410
Visualization, Emotion recognition, Large language models,
Pipelines, Benchmark testing, Cognition
BibRef
Wang, X.Y.[Xin-Yu],
Zhuang, B.[Bohan],
Wu, Q.[Qi],
ModaVerse: Efficiently Transforming Modalities with LLMs,
CVPR24(26596-26606)
IEEE DOI Code:
WWW Link.
2410
Training, Adaptation models, Large language models,
Natural languages, Layout, Data models
BibRef
Lin, J.[Ji],
Yin, H.X.[Hong-Xu],
Ping, W.[Wei],
Molchanov, P.[Pavlo],
Shoeybi, M.[Mohammad],
Han, S.[Song],
VILA: On Pre-training for Visual Language Models,
CVPR24(26679-26689)
IEEE DOI
2410
Degradation, Visualization, Accuracy, Large language models,
Benchmark testing, Cognition
BibRef
Li, L.[Li],
Peng, J.W.[Jia-Wei],
Chen, H.[Huiyi],
Gao, C.Y.[Chong-Yang],
Yang, X.[Xu],
How to Configure Good In-Context Sequence for Visual Question
Answering,
CVPR24(26700-26710)
IEEE DOI Code:
WWW Link.
2410
Visualization, Codes, Design methodology, Large language models,
Question answering (information retrieval)
BibRef
Lyu, Y.H.[Yuan-Huiyi],
Zheng, X.[Xu],
Zhou, J.Z.[Jia-Zhou],
Wang, L.[Lin],
UniBind: LLM-Augmented Unified and Balanced Representation Space to
Bind Them All,
CVPR24(26742-26752)
IEEE DOI
2410
Point cloud compression, Visualization, Large language models,
Knowledge based systems, Infrared imaging, Contrastive learning,
Data mining
BibRef
Liang, T.[Tian],
Huang, J.[Jing],
Kong, M.[Ming],
Chen, L.[Luyuan],
Zhu, Q.[Qiang],
Querying as Prompt: Parameter-Efficient Learning for Multimodal
Language Model,
CVPR24(26845-26855)
IEEE DOI Code:
WWW Link.
2410
Training, Bridges, Adaptation models, Technological innovation,
Codes, Computational modeling, multimodal,
large language model
BibRef
Jiang, C.[Chaoya],
Xu, H.Y.[Hai-Yang],
Dong, M.[Mengfan],
Chen, J.X.[Jia-Xing],
Ye, W.[Wei],
Yan, M.[Ming],
Ye, Q.[Qinghao],
Zhang, J.[Ji],
Huang, F.[Fei],
Zhang, S.K.[Shi-Kun],
Hallucination Augmented Contrastive Learning for Multimodal Large
Language Model,
CVPR24(27026-27036)
IEEE DOI Code:
WWW Link.
2410
Representation learning, Visualization, Codes,
Large language models, Natural languages, Contrastive learning
BibRef
Zhu, L.[Lei],
Wei, F.[Fangyun],
Lu, Y.[Yanye],
Beyond Text: Frozen Large Language Models in Visual Signal
Comprehension,
CVPR24(27037-27047)
IEEE DOI Code:
WWW Link.
2410
Visualization, Vocabulary, Image recognition, Large language models,
Semantics, Transforms, Feature extraction, Multi-modal learning
BibRef
Pi, R.J.[Ren-Jie],
Yao, L.W.[Le-Wei],
Gao, J.[Jiahui],
Zhang, J.P.[Ji-Peng],
Zhang, T.[Tong],
PerceptionGPT: Effectively Fusing Visual Perception Into LLM,
CVPR24(27114-27123)
IEEE DOI
2410
Training, Visualization, Accuracy, Large language models,
Decoding, Multimodal Learning
BibRef
Tai, Y.[Yan],
Fan, W.C.[Wei-Chen],
Zhang, Z.[Zhao],
Liu, Z.W.[Zi-Wei],
Link-Context Learning for Multimodal LLMs,
CVPR24(27166-27175)
IEEE DOI
2410
Training, Image recognition, Large language models,
Oral communication, Propulsion, Cognition
BibRef
Tang, Z.[Zineng],
Yang, Z.[Ziyi],
Khademi, M.[Mahmoud],
Liu, Y.[Yang],
Zhu, C.G.[Chen-Guang],
Bansal, M.[Mohit],
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any
Generation,
CVPR24(27415-27424)
IEEE DOI
2410
Image synthesis, Large language models, Oral communication,
Encoding, Cognition
BibRef
Jain, J.[Jitesh],
Yang, J.W.[Jian-Wei],
Shi, H.[Humphrey],
VCoder: Versatile Vision Encoders for Multimodal Large Language
Models,
CVPR24(27992-28002)
IEEE DOI
2410
Training, Visualization, Image segmentation, Costs, Image synthesis,
Large language models, Machine vision
BibRef
Yuan, Y.Q.[Yu-Qian],
Li, W.[Wentong],
Liu, J.[Jian],
Tang, D.Q.[Dong-Qi],
Luo, X.J.[Xin-Jie],
Qin, C.[Chi],
Zhang, L.[Lei],
Zhu, J.[Jianke],
Osprey: Pixel Understanding with Visual Instruction Tuning,
CVPR24(28202-28211)
IEEE DOI Code:
WWW Link.
2410
Convolutional codes, Visualization, Computational modeling,
Source coding, Large language models, Semantics
BibRef
Zhai, A.J.[Albert J.],
Shen, Y.[Yuan],
Chen, E.Y.[Emily Y.],
Wang, G.X.[Gloria X.],
Wang, X.L.[Xin-Lei],
Wang, S.[Sheng],
Guan, K.Y.[Kai-Yu],
Wang, S.[Shenlong],
Physical Property Understanding from Language-Embedded Feature Fields,
CVPR24(28296-28305)
IEEE DOI Code:
WWW Link.
2410
Point cloud compression, Visualization, Friction,
Large language models, Estimation,
digital twin
BibRef
Zheng, Z.H.[Zhao-Heng],
Wei, J.[Jingmin],
Hu, X.F.[Xue-Feng],
Zhu, H.D.[Hai-Dong],
Nevatia, R.[Ram],
Large Language Models are Good Prompt Learners for Low-Shot Image
Classification,
CVPR24(28453-28462)
IEEE DOI Code:
WWW Link.
2410
Learning systems, Training, Adaptation models, Codes,
Large language models, Computational modeling
BibRef
He, H.Y.[Hao-Yu],
Pan, Z.Z.[Zi-Zheng],
Liu, J.[Jing],
Cai, J.F.[Jian-Fei],
Zhuang, B.[Bohan],
Efficient Stitchable Task Adaptation,
CVPR24(28555-28565)
IEEE DOI Code:
WWW Link.
2410
Training, Deep learning, Adaptation models, Visualization,
Scalability, Pipelines, Memory management, model stitching,
large language model
BibRef
Tian, X.Y.[Xin-Yu],
Zou, S.[Shu],
Yang, Z.Y.[Zhao-Yuan],
Zhang, J.[Jing],
ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models,
CVPR24(28578-28587)
IEEE DOI Code:
WWW Link.
2410
Adaptation models, Visualization, Correlation, Computational modeling,
Large language models, Semantics, few-shot adaptation
BibRef
Han, G.X.[Guang-Xing],
Lim, S.N.[Ser-Nam],
Few-Shot Object Detection with Foundation Models,
CVPR24(28608-28618)
IEEE DOI
2410
Training, Visualization, Large language models,
Computational modeling, Object detection, Benchmark testing,
Large Language Model
BibRef
Roberts, J.[Jonathan],
Lüddecke, T.[Timo],
Sheikh, R.[Rehan],
Han, K.[Kai],
Albanie, S.[Samuel],
Charting New Territories: Exploring the Geographic and Geospatial
Capabilities of Multimodal LLMs,
EarthVision24(554-563)
IEEE DOI
2410
Visualization, Image segmentation, Navigation,
Large language models, Disasters, Focusing, Benchmark testing,
Evaluation
BibRef
Barbany, O.[Oriol],
Huang, M.[Michael],
Zhu, X.L.[Xin-Liang],
Dhua, A.[Arnab],
Leveraging Large Language Models for Multimodal Search,
FGVC24(1201-1210)
IEEE DOI
2410
Large language models, Natural languages, Pipelines,
Image retrieval, Computer architecture, LLM, retrieval, fashion,
multimodal
BibRef
Lv, J.X.[Jia-Xi],
Huang, Y.[Yi],
Yan, M.[Mingfu],
Huang, J.C.[Jian-Cheng],
Liu, J.Z.[Jian-Zhuang],
Liu, Y.F.[Yi-Fan],
Wen, Y.F.[Ya-Fei],
Chen, X.X.[Xiao-Xin],
Chen, S.F.[Shi-Feng],
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation
via Blender-Oriented GPT Planning,
PBDL24(1430-1440)
IEEE DOI Code:
WWW Link.
2410
Image synthesis, Large language models, Text to image, Fluid flow,
Manuals, Diffusion models
BibRef
Baldassini, F.B.[Folco Bertini],
Shukor, M.[Mustafa],
Cord, M.[Matthieu],
Soulier, L.[Laure],
Piwowarski, B.[Benjamin],
What Makes Multimodal In-Context Learning Work?,
Prompting24(1539-1550)
IEEE DOI
2410
Training, Analytical models, Codes, Large language models,
Impedance matching, Large Language Models, Shortcuts learning
BibRef
Wang, J.C.[Jun-Chi],
Ke, L.[Lei],
LLM-Seg: Bridging Image Segmentation and Large Language Model
Reasoning,
WhatNext24(1765-1774)
IEEE DOI Code:
WWW Link.
2410
Training, Image segmentation, Large language models,
Design methodology, Pipelines, Cognition
BibRef
Qu, M.X.[Meng-Xue],
Chen, X.D.[Xiao-Dong],
Liu, W.[Wu],
Li, A.[Alicia],
Zhao, Y.[Yao],
ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large
Language Models,
PVUW24(1847-1856)
IEEE DOI
2410
Grounding, Annotations, Large language models, Supervised learning,
Natural languages
BibRef
Hakim, Z.I.A.[Zaber Ibn Abdul],
Sarker, N.H.[Najibul Haque],
Singh, R.P.[Rahul Pratap],
Paul, B.[Bishmoy],
Dabouei, A.[Ali],
Xu, M.[Min],
Leveraging Generative Language Models for Weakly Supervised Sentence
Component Analysis in Video-Language Joint Learning,
MULA24(1975-1985)
IEEE DOI
2410
Training, Adaptation models, Statistical analysis,
Large language models, Estimation, Contrastive learning, Distance measurement
BibRef
Deria, A.[Ankan],
Kumar, K.[Komal],
Chakraborty, S.[Snehashis],
Mahapatra, D.[Dwarikanath],
Roy, S.[Sudipta],
InVERGe: Intelligent Visual Encoder for Bridging Modalities in Report
Generation,
MULA24(2028-2038)
IEEE DOI Code:
WWW Link.
2410
Training, Visualization, Computational modeling, Radiology,
Transformers, Feature extraction, Decoding, Deep Learning,
Large Language Model
BibRef
Ma, F.P.[Fei-Peng],
Zhou, Y.Z.[Yi-Zhou],
Zhang, Y.[Yueyi],
Wu, S.Y.[Si-Ying],
Zhang, Z.[Zheyu],
He, Z.L.[Zi-Long],
Rao, F.Y.[Feng-Yun],
Sun, X.Y.[Xiao-Yan],
Task Navigator: Decomposing Complex Tasks for Multimodal Large
Language Models,
Reasoning24(2248-2257)
IEEE DOI
2410
Training, Systematics, Navigation, Large language models,
Training data, Language and Vision, Multi-modal Vision
BibRef
Arefeen, M.A.[Md Adnan],
Debnath, B.[Biplob],
Uddin, M.Y.S.[Md Yusuf Sarwar],
Chakradhar, S.[Srimat],
ViTA: An Efficient Video-to-Text Algorithm using VLM for RAG-based
Video Analysis System,
Reasoning24(2266-2274)
IEEE DOI
2410
Accuracy, Large language models, Natural language processing,
Data models, Video Analytics,
Large Language Models (LLMs)
BibRef
Chen, Y.W.[Yu-Wei],
Chu, S.Y.[Shi-Yong],
Large Language Models in Wargaming: Methodology, Application, and
Robustness,
AML24(2894-2903)
IEEE DOI
2410
Navigation, Large language models, Decision making,
Strategic planning, Solids, Robustness, Natural language processing
BibRef
Lai, Z.X.[Zhi-Xin],
Wu, J.[Jing],
Chen, S.[Suiyao],
Zhou, Y.C.[Yu-Cheng],
Hovakimyan, N.[Naira],
Residual-based Language Models are Free Boosters for Biomedical
Imaging Tasks,
DEF-AI-MIA24(5086-5096)
IEEE DOI Code:
WWW Link.
2410
Visualization, Large language models, Fasteners, Transformers,
LLM, Biomedical Imaging
BibRef
Verma, A.A.[Aayush Atul],
Saeidi, A.[Amir],
Hegde, S.[Shamanthak],
Therala, A.[Ajay],
Bardoliya, F.D.[Fenil Denish],
Machavarapu, N.[Nagaraju],
Ravindhiran, S.A.K.[Shri Ajay Kumar],
Malyala, S.[Srija],
Chatterjee, A.[Agneet],
Yang, Y.Z.[Ye-Zhou],
Baral, C.[Chitta],
Evaluating Multimodal Large Language Models across Distribution
Shifts and Augmentations,
GenerativeFM24(5314-5324)
IEEE DOI
2410
Analytical models, Shape, Large language models,
Computational modeling, Perturbation methods, Benchmark testing, Robustness
BibRef
Fang, X.[Xi],
Wang, W.G.[Wei-Gang],
Lv, X.X.[Xiao-Xin],
Yan, J.[Jun],
PCQA: A Strong Baseline for AIGC Quality Assessment Based on Prompt
Condition,
NTIRE24(6167-6176)
IEEE DOI
2410
Image quality, Databases, Large language models, Semantics,
Quality assessment, Ensemble learning, AIGC,
multimodal learning
BibRef
Ye, Z.[Zilyu],
Liu, J.X.[Jin-Xiu],
Cao, J.J.[Jin-Jin],
Chen, Z.Y.[Zhi-Yang],
Xuan, Z.W.[Zi-Wei],
Zhou, M.Y.[Ming-Yuan],
Liu, Q.[Qi],
Qi, G.J.[Guo-Jun],
OpenStory: A Large-Scale Open-Domain Dataset for Subject-Driven
Visual Storytelling,
VDU24(7953-7962)
IEEE DOI
2410
Training, Visualization, Annotations, Large language models,
Pipelines, Manuals
BibRef
Chen, X.Y.[Xiang-Yu],
Liu, J.[Jing],
Wang, Y.[Ye],
Wang, P.P.[Pu Perry],
Brand, M.[Matthew],
Wang, G.H.[Guang-Hui],
Koike-Akino, T.[Toshiaki],
SuperLoRA: Parameter-Efficient Unified Adaptation for Large Vision
Models,
ECV24(8050-8055)
IEEE DOI
2410
Adaptation models, Tensors, Computational modeling,
Large language models, Transfer learning, parameter efficiency,
low-rank adaptation
BibRef
Chen, Z.[Zhe],
Wu, J.N.[Jian-Nan],
Wang, W.H.[Wen-Hai],
Su, W.J.[Wei-Jie],
Chen, G.[Guo],
Xing, S.[Sen],
Zhong, M.[Muyan],
Zhang, Q.L.[Qing-Long],
Zhu, X.[Xizhou],
Lu, L.W.[Le-Wei],
Li, B.[Bin],
Luo, P.[Ping],
Lu, T.[Tong],
Qiao, Y.[Yu],
Dai, J.F.[Ji-Feng],
InternVL: Scaling up Vision Foundation Models and Aligning for
Generic Visual-Linguistic Tasks,
CVPR24(24185-24198)
IEEE DOI
2410
Training, Visualization, Image recognition, Large language models,
Data models, Question answering (information retrieval),
vision-language model
BibRef
Zhang, J.Y.[Jimu-Yang],
Huang, Z.M.[Zan-Ming],
Ray, A.[Arijit],
Ohn-Bar, E.[Eshed],
Feedback-Guided Autonomous Driving,
CVPR24(15000-15011)
IEEE DOI
2410
Training, Large language models, Cloning, Computer architecture,
Network architecture, Real-time systems, Autonomous Driving,
Large Language Model
BibRef
Wei, C.[Chen],
Liu, C.X.[Chen-Xi],
Qiao, S.Y.[Si-Yuan],
Zhang, Z.S.[Zhi-Shuai],
Yuille, A.L.[Alan L.],
Yu, J.[Jiahui],
De-Diffusion Makes Text a Strong Cross-Modal Interface,
CVPR24(13492-13503)
IEEE DOI
2410
Large language models, Natural languages, Text to image,
Transforms, Diffusion models, Decoding, Diffusion, Generative Model,
Vision and Language
BibRef
Chen, Y.[Yangyi],
Sikka, K.[Karan],
Cogswell, M.[Michael],
Ji, H.[Heng],
Divakaran, A.[Ajay],
DRESS: Instructing Large Vision-Language Models to Align and
Interact with Humans via Natural Language Feedback,
CVPR24(14239-14250)
IEEE DOI
2410
Training, Visualization, Annotations, Large language models,
Natural languages, Reinforcement learning,
Natural Language Feedback
BibRef
Chen, B.[Boyuan],
Xu, Z.[Zhuo],
Kirmani, S.[Sean],
Ichter, B.[Brian],
Sadigh, D.[Dorsa],
Guibas, L.J.[Leonidas J.],
Xia, F.[Fei],
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning
Capabilities,
CVPR24(14455-14465)
IEEE DOI Code:
WWW Link.
2410
Training, Solid modeling, Visualization, Pipelines, Training data, Cognition,
spatial reasoning, large language model, multimodal, vision language model
BibRef
Dorkenwald, M.[Michael],
Barazani, N.[Nimrod],
Snoek, C.G.M.[Cees G. M.],
Asano, Y.M.[Yuki M.],
PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs,
CVPR24(13548-13558)
IEEE DOI
2410
Training, Computational modeling, Machine vision,
Large language models, Pipelines, Pins, Vision-Language Models,
Efficient Adaption of VLMs
BibRef
Cha, J.[Junbum],
Kang, W.[Wooyoung],
Mun, J.[Jonghwan],
Roh, B.[Byungseok],
Honeybee: Locality-Enhanced Projector for Multimodal LLM,
CVPR24(13817-13827)
IEEE DOI Code:
WWW Link.
2410
Visualization, Codes, Large language models, Benchmark testing,
Tuning, Multimodal LLM, Vision-Language
BibRef
Huang, Q.D.[Qi-Dong],
Dong, X.Y.[Xiao-Yi],
Zhang, P.[Pan],
Wang, B.[Bin],
He, C.H.[Cong-Hui],
Wang, J.Q.[Jia-Qi],
Lin, D.[Dahua],
Zhang, W.M.[Wei-Ming],
Yu, N.H.[Neng-Hai],
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models
via Over-Trust Penalty and Retrospection-Allocation,
CVPR24(13418-13427)
IEEE DOI Code:
WWW Link.
2410
Training, Measurement, Costs, Codes, Large language models, Focusing,
Hallucination, Large vision-language model, Multimodal LLM, LLM
BibRef
Zhang, Y.[Yichi],
Ma, Z.Q.[Zi-Qiao],
Gao, X.F.[Xiao-Feng],
Shakiah, S.[Suhaila],
Gao, Q.[Qiaozi],
Chai, J.[Joyce],
Groundhog: Grounding Large Language Models to Holistic Segmentation,
CVPR24(14227-14238)
IEEE DOI
2410
Training, Visualization, Grounding, Shape, Large language models,
Semantics, Feature extraction, Multi-Modal, Language Grounding,
Vision-Language Model
BibRef
Sun, Z.Y.[Ze-Yi],
Fang, Y.[Ye],
Wu, T.[Tong],
Zhang, P.[Pan],
Zang, Y.H.[Yu-Hang],
Kong, S.[Shu],
Xiong, Y.J.[Yuan-Jun],
Lin, D.[Dahua],
Wang, J.Q.[Jia-Qi],
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want,
CVPR24(13019-13029)
IEEE DOI Code:
WWW Link.
2410
Point cloud compression, Visualization, Image recognition, Codes,
Large language models, CLIP, Vision-language pretraining, MLLMs
BibRef
Parashar, S.[Shubham],
Lin, Z.Q.[Zhi-Qiu],
Liu, T.[Tian],
Dong, X.J.[Xiang-Jue],
Li, Y.[Yanan],
Ramanan, D.[Deva],
Caverlee, J.[James],
Kong, S.[Shu],
The Neglected Tails in Vision-Language Models,
CVPR24(12988-12997)
IEEE DOI
2410
Training, Visualization, Accuracy, Large language models,
Text to image, Tail, Flowering plants, Vision-Language Models,
Long tailed recognition
BibRef
Yu, Q.F.[Qi-Fan],
Li, J.C.[Jun-Cheng],
Wei, L.[Longhui],
Pang, L.[Liang],
Ye, W.T.[Wen-Tao],
Qin, B.S.[Bo-Sheng],
Tang, S.L.[Si-Liang],
Tian, Q.[Qi],
Zhuang, Y.T.[Yue-Ting],
HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual
Instruction Data,
CVPR24(12944-12953)
IEEE DOI Code:
WWW Link.
2410
Measurement, Visualization, Toxicology, Correlation, Codes,
Large language models, Hallucinations,
Vision-language reasoning
BibRef
Luo, Y.[Yan],
Shi, M.[Min],
Khan, M.O.[Muhammad Osama],
Afzal, M.M.[Muhammad Muneeb],
Huang, H.[Hao],
Yuan, S.[Shuaihang],
Tian, Y.[Yu],
Song, L.[Luo],
Kouhana, A.[Ava],
Elze, T.[Tobias],
Fang, Y.[Yi],
Wang, M.Y.[Meng-Yu],
FairCLIP: Harnessing Fairness in Vision-Language Learning,
CVPR24(12289-12301)
IEEE DOI Code:
WWW Link.
2410
Deep learning, Bridges, Analytical models, Ethics, Codes,
Computational modeling, Fairness Learning, Large Language Models
BibRef
Zara, G.[Giacomo],
Conti, A.[Alessandro],
Roy, S.[Subhankar],
Lathuilière, S.[Stéphane],
Rota, P.[Paolo],
Ricci, E.[Elisa],
The Unreasonable Effectiveness of Large Language-Vision Models for
Source-free Video Domain Adaptation,
ICCV23(10273-10283)
IEEE DOI
2401
BibRef
Liao, Z.[Zhaohe],
Li, J.T.[Jiang-Tong],
Niu, L.[Li],
Zhang, L.Q.[Li-Qing],
Align and Aggregate: Compositional Reasoning with Video Alignment and
Answer Aggregation for Video Question-Answering,
CVPR24(13395-13404)
IEEE DOI
2410
Measurement, Accuracy, Computational modeling, Aggregates,
Large language models, Pipelines
BibRef
Zhao, H.B.[Hong-Bo],
Ni, B.[Bolin],
Fan, J.S.[Jun-Song],
Wang, Y.X.[Yu-Xi],
Chen, Y.T.[Yun-Tao],
Meng, G.F.[Gao-Feng],
Zhang, Z.X.[Zhao-Xiang],
Continual Forgetting for Pre-Trained Vision Models,
CVPR24(28631-28642)
IEEE DOI Code:
WWW Link.
2410
Continuing education, Privacy, Codes, Large language models,
Face recognition, Object detection, Continual Forgetting, Machine Unlearning
BibRef
Kim, K.[Kibum],
Yoon, K.[Kanghoon],
Jeon, J.[Jaehyeong],
In, Y.[Yeonjun],
Moon, J.[Jinyoung],
Kim, D.H.[Dong-Hyun],
Park, C.[Chanyoung],
LLM4SGG: Large Language Models for Weakly Supervised Scene Graph
Generation,
CVPR24(28306-28316)
IEEE DOI Code:
WWW Link.
2410
Training, Visualization, Grounding, Large language models, Semantics,
Genomics, Focusing, Scene Understanding, Large Language Model,
Long-Tail Problem
BibRef
Zhan, X.Y.[Xin-Yu],
Yang, L.X.[Li-Xin],
Zhao, Y.F.[Yi-Fei],
Mao, K.[Kangrui],
Xu, H.L.[Han-Lin],
Lin, Z.[Zenan],
Li, K.L.[Kai-Lin],
Lu, C.[Cewu],
OakInk2: A Dataset of Bimanual Hands-Object Manipulation in Complex
Task Completion,
CVPR24(445-456)
IEEE DOI Code:
WWW Link.
2410
Annotations, Affordances, Computational modeling,
Large language models, Decoding
BibRef
Li, Y.C.[Yi-Cong],
Zhao, N.[Na],
Xiao, J.B.[Jun-Bin],
Feng, C.[Chun],
Wang, X.[Xiang],
Chua, T.S.[Tat-Seng],
LASO: Language-Guided Affordance Segmentation on 3D Object,
CVPR24(14251-14260)
IEEE DOI Code:
WWW Link.
2410
Visualization, Solid modeling, Shape, Affordances,
Large language models, Semantics, Multimodal, 3D-Language, Vision-Language
BibRef
Rotstein, N.[Noam],
Bensaïd, D.[David],
Brody, S.[Shaked],
Ganz, R.[Roy],
Kimmel, R.[Ron],
FuseCap: Leveraging Large Language Models for Enriched Fused Image
Captions,
WACV24(5677-5688)
IEEE DOI
2404
Training, Surveys, Visualization, Fuses,
Optical character recognition, Training data, Algorithms,
Image recognition and understanding
BibRef
Pan, J.T.[Jun-Ting],
Lin, Z.[Ziyi],
Ge, Y.Y.[Yu-Ying],
Zhu, X.T.[Xia-Tian],
Zhang, R.R.[Ren-Rui],
Wang, Y.[Yi],
Qiao, Y.[Yu],
Li, H.S.[Hong-Sheng],
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen
Large Language Models,
MMFM23(272-283)
IEEE DOI
2401
BibRef
Guo, J.X.[Jia-Xian],
Li, J.[Junnan],
Li, D.X.[Dong-Xu],
Tiong, A.M.H.[Anthony Meng Huat],
Li, B.Y.[Bo-Yang],
Tao, D.C.[Da-Cheng],
Hoi, S.[Steven],
From Images to Textual Prompts: Zero-shot Visual Question Answering
with Frozen Large Language Models,
CVPR23(10867-10877)
IEEE DOI
2309
BibRef
Chapter on Implementations and Applications, Databases, QBIC, Video Analysis, Hardware and Software, Inspection continues in
Image-Text Matching, Image Text Retrieval, Image-Text Retrieval.