Abstract

Developing intelligent agents that can perceive and understand the rich visual world around us has been a long-standing goal in the field of artificial intelligence. In the last few years, significant progress has been made towards this goal, and recent advances in general visual and language understanding have been attributed to deep learning. Convolutional neural networks have been used to learn image representations, while recurrent neural networks have demonstrated the ability to generate text from visual stimuli. In this thesis, we develop methods and techniques using hybrid convolutional and recurrent neural network architectures that connect visual data and natural language utterances.

This work is divided into two broad parts. In the first part, we introduce a general purpose attention mechanism, modeled using a continuous function, for video understanding. The use of an attention-based hierarchical approach along with automatic boundary detection advances state-of-the-art video captioning results. We also develop techniques for summarizing and annotating long videos. In the second part, we introduce architectures along with training techniques to produce a common connection space where natural language sentences are efficiently and accurately connected with visual modalities. In this connection space, similar concepts lie close together, while dissimilar concepts lie far apart, irrespective of their modality. We discuss four modality transformations: visual to text, text to visual, visual to visual, and text to text. We introduce a novel attention mechanism to align multi-modal embeddings, which are learned through a multi-modal metric loss function. The common vector space is shown to enable bidirectional generation of images and text. The learned common vector space is evaluated on multiple image-text datasets for cross-modal retrieval and zero-shot retrieval. The models are shown to advance the state-of-the-art on tasks that require joint processing of images and natural language.
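
To make the idea of a common vector space concrete, the following is a minimal sketch of a generic bidirectional hinge ranking loss and a nearest-neighbor retrieval step. It is not the loss function or architecture used in the thesis, which the abstract does not specify. The function names (`cosine`, `ranking_loss`, `retrieve`), the margin value, and the toy two-dimensional embeddings are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranking_loss(image_vecs, text_vecs, margin=0.2):
    """Generic bidirectional hinge ranking loss (illustrative assumption).

    image_vecs[i] and text_vecs[i] are assumed to be a matching pair;
    every other combination is treated as a mismatch.
    """
    n = len(image_vecs)
    sims = [[cosine(image_vecs[i], text_vecs[j]) for j in range(n)]
            for i in range(n)]
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Image i as the query, mismatched text j as the negative.
            loss += max(0.0, margin - sims[i][i] + sims[i][j])
            # Text i as the query, mismatched image j as the negative.
            loss += max(0.0, margin - sims[i][i] + sims[j][i])
    return loss


def retrieve(query, candidates):
    """Index of the candidate closest to the query in the shared space."""
    return max(range(len(candidates)),
               key=lambda k: cosine(query, candidates[k]))


# Toy embeddings: matching pairs point in the same direction.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = ranking_loss(images, texts)
misaligned_loss = ranking_loss(images, [texts[1], texts[0]])
best_text_for_image_0 = retrieve(images[0], texts)
```

In this sketch the loss is zero once every matching pair is more similar than every mismatched pair by at least the margin, which is one common way to encourage similar concepts to lie close together regardless of modality. Cross-modal retrieval then reduces to a nearest-neighbor search in the shared space.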

Library of Congress Subject Headings

Machine learning; Neural networks (Computer science); Convolutions (Mathematics); Natural language processing (Computer science); Computer vision

Publication Date

12-2-2018

Document Type

Dissertation

Student Type

Graduate

Degree Name

Imaging Science (Ph.D.)

Department, Program, or Center

Chester F. Carlson Center for Imaging Science (COS)

Advisor

Raymond Ptucha

Advisor/Committee Member

Carl Salvaggio

Advisor/Committee Member

Nathan Cahill

Recommended Citation

Sah, Shagan, "Multi-Modal Deep Learning to Understand Vision and Language" (2018). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/9925

Campus

RIT – Main Campus

Plan Codes

IMGS-PHD
