The exponential growth of deep learning has helped solve problems across many fields of study. Convolutional neural networks have become a go-to tool for extracting features from images. Similarly, recurrent architectures such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks are effective at extracting useful information from temporal data such as text and time series. Although these networks extract features well within a single modality, learning features across multiple modalities remains a challenging task. In this work, we develop a generative common vector space model that brings similar concepts from different modalities closer together in a common latent representation while pushing dissimilar concepts far apart in the same space. The model not only addresses the cross-modal retrieval problem but also uses the vectors produced by the common vector space model to generate realistic-looking data. This work focuses on the image and text modalities; however, it can be extended to other modalities as well. We train and evaluate the model on the Caltech-UCSD Birds (CUB) and Oxford-102 datasets.
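The pull-together/push-apart objective described in the abstract is commonly realized as a contrastive (hinge) loss over a shared embedding space. The sketch below is a minimal illustration of that idea, not the thesis's actual model; the embedding vectors and the margin value are hypothetical.

```python
import numpy as np

def contrastive_loss(a, b, same_concept, margin=1.0):
    """Hinge-style contrastive loss on a pair of embeddings.

    Matched pairs (same concept, different modalities) are penalized by
    their squared distance, pulling them together; mismatched pairs are
    penalized only when closer than `margin`, pushing them apart.
    """
    d = np.linalg.norm(a - b)
    if same_concept:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Hypothetical 2-D embeddings of an image and two text descriptions.
img_vec = np.array([0.9, 0.1])
txt_match = np.array([0.8, 0.2])      # describes the same concept
txt_mismatch = np.array([0.4, 0.6])   # describes a different concept

loss_pos = contrastive_loss(img_vec, txt_match, same_concept=True)
loss_neg = contrastive_loss(img_vec, txt_mismatch, same_concept=False)
```

Minimizing this loss over many image-text pairs drives matched concepts toward the same region of the latent space, which is what makes nearest-neighbor cross-modal retrieval in that space possible.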
Library of Congress Subject Headings
Machine learning; Neural networks (Computer science); Convolutions (Mathematics); Information retrieval; Data mining
Computer Engineering (MS)
Department, Program, or Center
Computer Engineering (KGCOE)
Udaiyar, Premkumar, "Cross-modal data retrieval and generation using deep neural networks" (2020). Thesis. Rochester Institute of Technology. Accessed from
RIT – Main Campus