Introduction
Authors: Lu Jiang, Junwei Liang, Liangliang Cao, Yannis Kalantidis, Sachin Farfade, Alexander Hauptmann
- Introduces a new task and dataset called MemexQA: given a collection of a user's videos or photos, answer questions about them. Difference from VQA: a question may require reasoning over multiple images instead of a single one.
- Answering the questions may require building a database from timestamps, annotations, metadata, and other information attached to the images and videos so that relevant items can be retrieved easily.
- Proposes MemexNet, supposedly a unified network architecture for image, text and video QA.
- Paper Link
Dataset
- Numbers: 13,591 personal photos from 630 albums belonging to 101 Flickr users, plus 20,860 crowdsourced questions with 417,200 multiple-choice answers.
- Available at https://memexqa.cs.cmu.edu/
- The answer space supposedly lies in between the closed answer set of current VQA tasks and fully open-ended QA.
- Each photo comes with a timestamp, GPS (if available), a photo title, photo tags, and album information, as well as photo captions from the SIND data. Questions were collected using Amazon Mechanical Turk (a rough example record is sketched below).
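As a concrete illustration, a single MemexQA example might look roughly like the sketch below; the field names and values are my own assumptions based on the metadata listed above, not the official schema. The small helper at the end illustrates the kind of metadata indexing mentioned earlier for answering retrieval-style questions.

```python
from collections import defaultdict

# Hypothetical MemexQA example record; field names are assumptions,
# not the official dataset schema.
example = {
    "album_id": "album_001",
    "photos": [
        {
            "photo_id": "photo_123",
            "title": "Hiking trip",
            "timestamp": "2007-07-21 14:32:00",
            "gps": (46.85, -121.76),          # may be missing for some photos
            "tags": ["hiking", "summer", "friends"],
            "caption": "We reached the lookout just before sunset.",  # SIND-style caption
        },
        # ... more photos in the album
    ],
    "question": "Where did we go hiking in the summer of 2007?",
    "choices": ["Mt. Rainier", "Yosemite", "Grand Canyon", "Yellowstone"],
    "answer": "Mt. Rainier",
}

def index_by_year(photos):
    """Group photos by the year in their timestamp, a simple example of the
    metadata 'database' that makes 'when'/'where' questions easy to look up."""
    by_year = defaultdict(list)
    for p in photos:
        by_year[p["timestamp"][:4]].append(p)
    return by_year
```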
Claims
- MemexQA is tougher than VQA.
- Questions are not easily answerable even by humans.
- Normal users took about 10x longer to answer a MemexQA question than a VQA question.
- MemexNet can also be trained on other QA tasks with good efficacy and scalability.
MemexNet
Architecture of MemexNet
- Inspired by the Watson QA system. Designed for the MemexQA dataset but can also be applied to text or video QA.
- Three major modules: Question Understanding, Search Engine and Answer Inference
- Question Understanding:
- Contains two subnetworks: a question encoder and a query encoder.
- The question encoder embeds a question into a latent state vector that expresses the question content and its estimated category; this vector is passed to the answer inference module.
- The query encoder produces a query embedding that serves as input to the search engine, usually a sparse vector over relevant concepts in a pretrained concept vocabulary.
- Uses an LSTM as the question encoder, and SkipGram and Visual Query Embedding networks as the query encoder.
- Search Engine:
- Returns a list of samples ranked by relevance to the embedded query vector.
- Uses an off-the-shelf state-of-the-art visual search engine and learns attention weights over the retrieved samples.
- Answer inference:
- Can be changed for different tasks
- For MemexQA, the modality information and match-class embeddings are concatenated so that samples at different rank positions carry different importance. The concatenated embedding is fed to an LSTM whose outputs are then attended over.
- The learned embeddings for the top-k retrieved samples, along with the question embedding and image CNN features, are passed to a classifier that picks the best answer among the multiple-choice options. See the paper's Answer Inference subsection for details; a rough sketch of the overall pipeline is below.
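To make the three-module pipeline concrete, here is a minimal PyTorch-style sketch. The module sizes, the `search_engine` callable, and the `sample_features` tensor (which stands in for precomputed per-photo CNN, modality, and match-class features) are all my assumptions; the authors' actual implementation differs in detail.

```python
import torch
import torch.nn as nn

class MemexNetSketch(nn.Module):
    """Rough sketch of the three-module pipeline (not the authors' code)."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512,
                 num_concepts=2000, num_choices=4, top_k=12):
        super().__init__()
        self.top_k = top_k
        # Question understanding
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.question_encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.query_encoder = nn.Linear(hidden_dim, num_concepts)  # concept-vocabulary query
        # Answer inference over the retrieved, rank-ordered top-k samples
        self.rank_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(hidden_dim * 2, num_choices)

    def forward(self, question_tokens, search_engine, sample_features):
        # 1) Question understanding: encode the question with an LSTM.
        emb = self.word_embed(question_tokens)                # (B, T, E)
        _, (h, _) = self.question_encoder(emb)
        q_vec = h[-1]                                         # (B, H)
        query = torch.relu(self.query_encoder(q_vec))         # query over concepts

        # 2) Search engine (treated as an off-the-shelf black box here):
        #    returns indices of the top-k photos/videos for each query.
        top_idx = search_engine(query, k=self.top_k)          # (B, k)
        retrieved = torch.stack([sample_features[b, idx]      # (B, k, H)
                                 for b, idx in enumerate(top_idx)])

        # 3) Answer inference: LSTM over rank-ordered samples, attention, classify.
        out, _ = self.rank_lstm(retrieved)                    # (B, k, H)
        weights = torch.softmax(self.attn(out), dim=1)        # (B, k, 1)
        evidence = (weights * out).sum(dim=1)                 # (B, H)
        logits = self.classifier(torch.cat([evidence, q_vec], dim=-1))
        return logits
```

The LSTM over the retrieved list (rather than simple pooling) reflects the paper's point that samples at different rank positions should carry different importance.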
Experiments
MemexQA
- Compared MemexNet with various state-of-the-art and classical VQA models (modified to take the additional features) and found that MemexNet outperforms all of them on MemexQA.
- Significant gap between model and human accuracy (0.487 vs. 0.927).
- Accuracy drops significantly on "when" and "where" questions compared with "how many", which motivates the need for MemexLookupNet.
TextQA
- Tested on SQuAD: treats a paragraph as an album and its sentences as images, with the last 3 layers of the BiDAF model as the answer inference network.
- The F1 scores reported for other models seem different from the ones on the SQuAD leaderboard.
- Achieves an F1 score of 0.767 (a minimal version of the metric is sketched below).
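For reference, the F1 reported on SQuAD is the standard token-overlap F1 between the predicted and ground-truth answer spans; a minimal version (omitting the answer normalization, e.g. punctuation and article stripping, that the official evaluation script performs) is sketched below.

```python
from collections import Counter

def squad_f1(prediction, ground_truth):
    """Token-overlap F1 as used in SQuAD-style evaluation (simplified)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```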
VideoQA
- Trained on the YFCC100M dataset.
- Doesn't use user-generated metadata, so that VideoQA performance is assessed on video content understanding alone.
- Achieves 0.52 accuracy.
- Claims it takes only 1.3 seconds to answer a question over more than 800k videos on a single-core Xeon 2.53 GHz CPU with 32 GB memory.
Notes
- Claims to be a general, cross-domain QA model.
- Doesn’t evaluate on VQA directly.
- It might have been better to release the dataset and MemexNet in separate papers, with direct comparisons on VQA for the latter.