Visual Question Answering

  • Combine visual features with a text sequence
    • CNN + LSTM
    • Generate text from images
      • Automatic scene description
    • Cross-modal
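
A minimal sketch of the CNN + LSTM combination above, in plain Python. Everything here is an assumption for illustration: the CNN is replaced by a random feature vector, the weights are untrained, the tiny vocabulary and answer list are made up, and the element-wise product is just one common way to fuse the two modalities.

```python
import math
import random

random.seed(0)

EMBED, HIDDEN, IMG_FEAT = 4, 6, 8          # hypothetical sizes
vocab = {"what": 0, "colour": 1, "is": 2, "the": 3, "ball": 4}
answers = ["red", "blue", "green"]          # hypothetical answer set


def rand_matrix(rows, cols):
    """Small random weights standing in for trained parameters."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, params):
    """One LSTM step; params maps gate name -> (weights, bias)."""
    gates = {}
    for name, (W, b) in params.items():
        pre = [p + bi for p, bi in zip(matvec(W, x + h), b)]
        act = math.tanh if name == "g" else sigmoid
        gates[name] = [act(p) for p in pre]
    c_new = [f * cj + i * g
             for f, cj, i, g in zip(gates["f"], c, gates["i"], gates["g"])]
    h_new = [o * math.tanh(cj) for o, cj in zip(gates["o"], c_new)]
    return h_new, c_new


embeddings = rand_matrix(len(vocab), EMBED)
lstm_params = {g: (rand_matrix(HIDDEN, EMBED + HIDDEN), [0.0] * HIDDEN)
               for g in "ifog"}
W_img = rand_matrix(HIDDEN, IMG_FEAT)       # projects image features to HIDDEN
W_out = rand_matrix(len(answers), HIDDEN)   # scores each candidate answer


def encode_question(tokens):
    """Run the LSTM over word embeddings; the last hidden state is the encoding."""
    h, c = [0.0] * HIDDEN, [0.0] * HIDDEN
    for tok in tokens:
        h, c = lstm_step(embeddings[vocab[tok]], h, c, lstm_params)
    return h


def answer_distribution(image_features, tokens):
    q = encode_question(tokens)
    v = [math.tanh(x) for x in matvec(W_img, image_features)]
    fused = [qi * vi for qi, vi in zip(q, v)]   # element-wise fusion (assumed)
    logits = matvec(W_out, fused)
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Stand-in for the output of a CNN applied to an image.
image_features = [random.random() for _ in range(IMG_FEAT)]
probs = answer_distribution(image_features, ["what", "colour", "is", "the", "ball"])
```

With untrained weights the distribution over answers is close to uniform; the point is only the data flow: image features and the question encoding are projected to the same size, fused, and scored against a fixed answer set.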

CNN + LSTM

  • Word-level embeddings, not character-level
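
A small illustration of the difference, assuming whitespace tokenisation and a made-up embedding table with placeholder values (a real model would learn these vectors):

```python
question = "what colour is the ball"

# Character-level: one token per character, including spaces.
char_tokens = list(question)

# Word-level: one token per word, which is what the LSTM consumes here.
word_tokens = question.split()

# Hypothetical vocabulary built from this single question.
vocab = {w: i for i, w in enumerate(sorted(set(word_tokens)))}

EMBED_DIM = 3
# Placeholder embedding table: one vector per vocabulary entry.
embedding_table = [[0.1 * (i + 1)] * EMBED_DIM for i in range(len(vocab))]

word_ids = [vocab[w] for w in word_tokens]
word_vectors = [embedding_table[i] for i in word_ids]

print(len(char_tokens), "character tokens vs", len(word_tokens), "word tokens")
```

The word-level sequence is much shorter, so the LSTM has fewer steps to carry information across, and each step already represents a meaningful unit.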

Freeform

  • Encode facts with two text streams (figure: VQA block)

Limitations

  • Repetitive answers
    • Not much variation
  • No creativity
    • Won't generalise beyond taught concepts