Deep learning has enabled great advances in natural language processing, computer vision, and pattern recognition in general. Deep learning frameworks have been very successful at classification, object detection, segmentation, and translation. Before an object can be processed, a vector representation of it must be created. For example, sentences and images can be encoded with a sent2vec and an image2vec function, respectively, in preparation for input to a machine learning framework. Neural networks can learn efficient vector representations of images, text, audio, video, and 3D point clouds. However, transferring knowledge from one modality to another is a challenging task. In this work, we develop vector spaces that can handle data belonging to multiple modalities at the same time. In these spaces, similar objects are tightly clustered and dissimilar objects are far apart, irrespective of their modality. Such a vector space can be used for retrieval, search, and generation tasks. For example, given a picture of a person surfing, one can retrieve sentences or audio clips of a person surfing. We build a Multi-stage Common Vector Space (M-CVS) and a Reference Vector Space (RVS) that can handle image, text, audio, video, and 3D point cloud data. Both the M-CVS and the RVS can accommodate the addition of a new modality without changing the existing transforms or architecture. We evaluate our model by performing cross-modal retrieval on multiple benchmark datasets.
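Once embeddings from different modalities live in a common vector space, cross-modal retrieval reduces to nearest-neighbor search. The sketch below illustrates that final step only; it assumes the embeddings have already been produced by modality-specific encoders (the encoders themselves, and the function names here, are illustrative, not the thesis's implementation).

```python
import numpy as np

def normalize(x):
    # Project embeddings onto the unit sphere so that cosine
    # similarity reduces to a plain dot product.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cross_modal_retrieve(query_vec, gallery_vecs, k=1):
    # Rank gallery items (e.g. sentence embeddings) by cosine
    # similarity to a query from another modality (e.g. an image
    # embedding), assuming both already lie in the common space.
    sims = normalize(gallery_vecs) @ normalize(query_vec)
    return np.argsort(-sims)[:k]

# Toy example: a 2-D "image" query against three "sentence" vectors.
query = np.array([1.0, 0.0])
gallery = np.array([[0.0, 1.0],
                    [-1.0, 0.0],
                    [0.9, 0.1]])
top = cross_modal_retrieve(query, gallery, k=1)
```

Because both query and gallery are normalized, the modality of the query is irrelevant to the ranking, which is what makes a well-trained common space usable for image-to-text, text-to-audio, and similar retrieval directions.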
Library of Congress Subject Headings
Vector spaces; Machine learning; Neural networks (Computer science)
Computer Engineering (MS)
Department, Program, or Center
Computer Engineering (KGCOE)
Gopalakrishnan, Sabarish, "Vector Spaces for Multiple Modal Embeddings" (2019). Thesis. Rochester Institute of Technology. Accessed from
RIT – Main Campus