Humans have an incredible ability to process and understand information from multiple sources such as images, video, text, and speech. The recent success of deep neural networks has enabled us to develop algorithms that give machines the ability to understand and interpret this information. Convolutional Neural Networks (CNNs) have become a standard for extracting rich features from visual stimuli. Recurrent Neural Networks (RNNs) and their variants, such as Long Short-Term Memory (LSTM) units, have been highly successful in encoding and decoding sequential information like speech and text. Although these networks perform well on narrow applications, there is a need both to broaden their applicability and to develop methods that correlate visual information with semantic content.
This master’s thesis develops a common vector space between images and text. This vector space places similar concepts, such as pictures of dogs and the word “puppy,” close together, while mapping disparate concepts far apart. Most cross-modal problems are solved using deep neural networks trained for specific tasks. This research formulates a unified model, combining a CNN and an RNN, that projects images and text into a common embedding space and also decodes the image and text embeddings into meaningful sentences. The model demonstrates diverse applications in cross-modal retrieval, image captioning, and sentence paraphrasing, and shows promising directions for neural networks to generalize well across different tasks.
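The core idea of a common embedding space can be illustrated with a minimal sketch: two modality-specific linear projections map image and text features into one shared space, where cosine similarity measures how close two concepts are. The dimensions, random weights, and function names below are illustrative assumptions, not the thesis's actual architecture (which uses a trained CNN and RNN).

```python
import numpy as np

def project(features, W):
    # Linearly project modality-specific features into the shared
    # space and L2-normalize, so the dot product is cosine similarity.
    z = features @ W
    return z / np.linalg.norm(z)

def similarity(img_emb, txt_emb):
    # Cosine similarity of two unit-norm embeddings, in [-1, 1].
    return float(img_emb @ txt_emb)

# Hypothetical toy sizes: 4-d image features, 3-d text features,
# projected into a 2-d shared space with random (untrained) weights.
rng = np.random.default_rng(0)
W_img = rng.standard_normal((4, 2))
W_txt = rng.standard_normal((3, 2))

img_emb = project(rng.standard_normal(4), W_img)
txt_emb = project(rng.standard_normal(3), W_txt)
print(similarity(img_emb, txt_emb))  # a value between -1 and 1
```

In a trained system, the projection weights would be learned so that matching image–text pairs score higher than mismatched ones; this sketch only shows the geometry of the shared space.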
Degree: Computer Engineering (MS)
Department, Program, or Center: Computer Engineering (KGCOE)
Peri, Dheeraj Kumar, "Multi-modal learning using deep neural networks" (2018). Thesis. Rochester Institute of Technology.
Campus: RIT – Main Campus