February 14, 2018

Introduction

Authors: Lu Jiang, Junwei Liang, Liangliang Cao, Yannis Kalantidis, Sachin Farfade, Alexander Hauptmann

  • Introduces a new task and dataset called MemexQA: given a collection of a user's photos or videos, answer questions about them. The difference from VQA is that a question may require reasoning over multiple images instead of a single one.
  • Answering may require building a database from timestamps, annotations, metadata and other information attached to the images and videos, for easy information retrieval.
  • Proposes MemexNet, supposedly a unified network architecture for image, text and video QA.
  • Paper link: https://arxiv.org/abs/1708.01336

Dataset

  • Numbers: 13,591 personal photos from 630 albums belonging to 101 Flickr users, with 20,860 crowdsourced questions and 417,200 multiple-choice answer candidates.
  • Available at https://memexqa.cs.cmu.edu/
  • The answer space supposedly sits between the closed answer set of current VQA tasks and fully open-ended QA.
  • Each photo comes with a timestamp, GPS (if available), a photo title, photo tags and album information, as well as photo captions from the SIND data. Questions were collected via Amazon Mechanical Turk. (A hypothetical example record follows this list.)
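
Here is a hypothetical sketch of what a single photo's record might look like, assembled from the fields listed above; the actual schema and field names in the released dataset may differ.

```python
# Hypothetical per-photo record based on the fields described above;
# the released dataset's actual schema and field names may differ.
photo_record = {
    "photo_id": "photo_0001",           # made-up identifier
    "album_id": "album_0001",
    "timestamp": "2007-06-30 14:05:12",
    "gps": None,                        # GPS is only present for some photos
    "title": "Beach day",
    "tags": ["beach", "family"],
    "caption": "We spent the afternoon by the water.",  # from the SIND data
}
```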

Claims

  • MemexQA is tougher than VQA.
  • Questions are not easily answerable even by humans.
  • Ordinary users took 10x longer to solve a MemexQA task than a VQA task.
  • MemexNet can be trained effectively and scales to other QA tasks as well.

MemexNet

Architecture

[Figure: Architecture of MemexNet]

  • Inspired by the Watson QA system. Designed for the MemexQA dataset but can also be applied to text or video QA.
  • Three major modules: Question Understanding, Search Engine and Answer Inference
  • Question Understanding:
    • Contains two subnetworks: a question encoder and a query encoder.
    • The question encoder embeds a question into a latent state vector that expresses the question content and its estimated category. This is passed to the answer inference module.
    • The query encoder produces a query embedding which serves as input to the search engine, usually a sparse vector over a set of relevant concepts in a pretrained concept vocabulary.
    • Uses an LSTM as the question encoder, and SkipGram and Visual Query Embedding networks as the query encoder.
  • Search Engine:
    • Returns a list of samples ranked by relevance to the embedded query vector.
    • Uses an off-the-shelf, state-of-the-art visual search engine and learns attention weights over the retrieved samples.
  • Answer inference:
    • Can be swapped out for different tasks.
    • For MemexQA, the modality information and match-class embeddings are concatenated such that samples at different rank positions can carry different importance. This concatenated embedding vector is fed to an LSTM whose outputs are then attended over.
    • The learned embedding for the top-k samples, along with the question embedding and image CNN features, is passed to a classifier that picks the best answer among the multiple-choice options. See the Answer Inference subsection of the paper's MemexNet section for more detail. (A minimal sketch of the whole pipeline follows this list.)
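
To make the three-module flow concrete, here is a minimal PyTorch sketch of how such a pipeline could be wired together. It is my own illustration, not the paper's implementation: the class name, the dot-product stand-in for the off-the-shelf visual search engine, the ReLU-projected "soft sparse" query, and the bilinear choice scorer are all assumptions.

```python
import torch
import torch.nn as nn

class MemexNetSketch(nn.Module):
    """Illustrative stand-in for MemexNet, not the authors' implementation."""

    def __init__(self, vocab_size=1000, num_concepts=300, embed_dim=128,
                 hidden_dim=256, feat_dim=512, top_k=5):
        super().__init__()
        self.top_k = top_k
        # Question understanding: LSTM question encoder plus a projection
        # playing the role of the query encoder (soft "sparse" concept vector).
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.question_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.query_proj = nn.Linear(hidden_dim, num_concepts)
        # Answer inference: LSTM over the ranked retrieved samples so that
        # rank position matters, then attention over its outputs.
        self.rank_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)
        # Score each multiple-choice candidate against question + evidence.
        self.choice_embed = nn.Embedding(vocab_size, hidden_dim)
        self.scorer = nn.Bilinear(hidden_dim * 2, hidden_dim, 1)

    def search(self, query, index_concepts, index_feats):
        # Stand-in for the off-the-shelf search engine: rank indexed items
        # by dot product between the query and each item's concept vector.
        scores = query @ index_concepts.t()               # (B, N)
        top = scores.topk(self.top_k, dim=1).indices      # (B, k)
        return index_feats[top]                           # (B, k, feat_dim)

    def forward(self, question_ids, choice_ids, index_concepts, index_feats):
        # question_ids: (B, T) token ids; choice_ids: (B, C), one id per
        # candidate answer (a toy answer encoder).
        _, (h, _) = self.question_lstm(self.word_embed(question_ids))
        q_state = h[-1]                                   # (B, hidden_dim)
        query = torch.relu(self.query_proj(q_state))      # soft sparse query
        retrieved = self.search(query, index_concepts, index_feats)
        out, _ = self.rank_lstm(retrieved)                # (B, k, hidden_dim)
        alpha = torch.softmax(self.attn(out), dim=1)      # attention over ranks
        evidence = (alpha * out).sum(dim=1)               # (B, hidden_dim)
        context = torch.cat([q_state, evidence], dim=1)   # (B, 2*hidden_dim)
        choices = self.choice_embed(choice_ids)           # (B, C, hidden_dim)
        B, C, H = choices.shape
        logits = self.scorer(
            context.unsqueeze(1).expand(B, C, context.size(1)).reshape(B * C, -1),
            choices.reshape(B * C, H),
        ).view(B, C)
        return logits  # argmax over C gives the predicted choice

# Shape check with random data: 2 questions, 4 choices, 500 indexed photos.
net = MemexNetSketch()
logits = net(torch.randint(0, 1000, (2, 12)), torch.randint(0, 1000, (2, 4)),
             torch.rand(500, 300), torch.rand(500, 512))
print(logits.shape)  # torch.Size([2, 4])
```

The point the sketch preserves is structural: retrieval happens inside the forward pass, and the attention weights over the ranked samples are learned jointly with the question encoder, mirroring the paper's learned attention over retrieved samples.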

Experiments

MemexQA

  • Compared MemexNet with various state-of-the-art and classical VQA models (modified to take in the extra features) and found that MemexNet outperforms all of them on MemexQA.
  • Significant gap between model and human accuracy (0.487 vs. 0.927).
  • Accuracy on “when” and “where” questions drops significantly compared to “how many”, so there is a need for MemexLookupNet.

TextQA

  • Tested on SQuAD, treating a paragraph as an album and its sentences as images, with the last 3 layers of the BiDAF model as the answer inference network. (See the toy mapping sketch after this list.)
  • Reported F1 scores for other models seem different from the ones on the SQuAD leaderboard.
  • Achieves an F1 score of 0.767.
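
As a toy illustration of the paragraph-as-album mapping, here is my own sketch with hypothetical field names; it is not the authors' preprocessing code.

```python
import re

def squad_paragraph_to_album(paragraph, album_id="p0"):
    """Treat a SQuAD paragraph as an 'album' whose sentences are the
    retrievable 'images' that the search engine module ranks over."""
    # Naive sentence splitter; a real pipeline would use a proper tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph)
                 if s.strip()]
    return {
        "album_id": album_id,
        "items": [{"item_id": f"{album_id}-s{i}", "text": s}
                  for i, s in enumerate(sentences)],
    }

album = squad_paragraph_to_album(
    "Tesla was born in 1856. He moved to the United States in 1884."
)
print(album["items"][1]["text"])  # He moved to the United States in 1884.
```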

VideoQA

  • Trained on the YFCC100M dataset.
  • Doesn't use user-generated metadata, in order to assess VideoQA performance based on video content understanding alone.
  • Achieves 0.52 accuracy.
  • Claims it takes only 1.3 seconds to answer a question over 800k videos on a single-core Xeon 2.53 GHz CPU with 32 GB of memory. (A guess at how this can scale is sketched below.)
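
The paper's speed claim is plausible if the sparse concept queries are served from an inverted index, so that only videos sharing at least one active concept with the query are ever touched. The sketch below is my own guess at such a mechanism, not the authors' system.

```python
from collections import defaultdict

def build_inverted_index(video_concepts):
    """video_concepts: {video_id: {concept: weight}} from a concept detector."""
    index = defaultdict(list)
    for vid, concepts in video_concepts.items():
        for concept, weight in concepts.items():
            index[concept].append((vid, weight))
    return index

def retrieve(index, sparse_query, top_k=5):
    """sparse_query: {concept: weight}; score = dot product on shared concepts.
    Cost scales with the posting lists of the query's active concepts,
    not with the total number of videos."""
    scores = defaultdict(float)
    for concept, q_weight in sparse_query.items():
        for vid, weight in index.get(concept, ()):
            scores[vid] += q_weight * weight
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

index = build_inverted_index({
    "v1": {"beach": 0.9, "dog": 0.4},
    "v2": {"birthday": 0.8, "cake": 0.7},
})
print(retrieve(index, {"cake": 1.0, "birthday": 0.5}))  # [('v2', ~1.1)]
```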

Notes

  • Claims to be a general cross-domain QA model.
  • Doesn’t evaluate on VQA directly.
  • It might have been better to release the dataset in one paper and MemexNet in another, with comparisons on VQA.

Follow me on Twitter and Github, that's where I'm most active these days. You can also email me at [email protected]. Thanks!