VL18 * Shortcomings in Vision and Language
* Adding Object Detection Skills to Visual Dialogue Agents
* Distinctive-Attribute Extraction for Image Captioning
* How Clever Is the FiLM Model, and How Clever Can it Be?
* Image-Sensitive Language Modeling for Automatic Speech Recognition
* Knowing When to Look for What and Where: Evaluating Generation of Spatial Descriptions with Adaptive Attention
* Knowing Where to Look? Analysis on Attention of Visual Question Answering System
* MoQA: A Multi-modal Question Answering Architecture
* Pre-gen Metrics: Predicting Caption Quality Metrics Without Generating Captions
* Quantifying the Amount of Visual Information Used by Neural Caption Generators
* Towards a Fair Evaluation of Zero-Shot Action Recognition Using External Data
