Recent research in computer vision and machine learning has demonstrated strong performance at detecting and recognizing objects in natural images. State-of-the-art results in the ImageNet Challenges for object detection, classification and localization report a top-5 classification error of 3.08% on the validation set, while similar classification experiments run by trained humans report an error of 5.1%. While some might argue that human accuracy is a function of training time, it can be said with confidence that automated classification models are at least as good as trained humans at such classification problems. The ability of these models to analyze and describe complex images, however, is still an active area of research.
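The 3.08% and 5.1% figures above are top-5 results: a prediction counts as correct when the true label appears among the five highest-scoring classes. The following is a minimal sketch of that metric, written as an error rate. The function name `top_k_error` and the toy scores are illustrative assumptions, not taken from the thesis or from the ImageNet evaluation code.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is NOT among the k top-scoring classes.

    scores: list of per-class score lists, one per example (hypothetical layout).
    labels: list of true class indices, one per example.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")

    misses = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score, truncated to k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Toy data: 3 examples, 6 classes (made-up numbers for illustration only).
scores = [
    [0.40, 0.30, 0.12, 0.10, 0.05, 0.03],
    [0.02, 0.10, 0.15, 0.20, 0.23, 0.30],
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],
]
labels = [1, 0, 5]

top5 = top_k_error(scores, labels, k=5)
top1 = top_k_error(scores, labels, k=1)
```

With only six classes the toy example is far easier than ImageNet's 1000-class setting, so it illustrates the definition rather than the difficulty of the task.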
Image description is a good starting point for imparting artificial intelligence to machines, since it requires them to analyze and describe complex visual scenes. This thesis introduces a generic, end-to-end trainable Fusion-based Recurrent Multi-Modal (FRMM) architecture to address multi-modal applications. FRMM allows each input modality to be independent in terms of architecture, parameters and length of input sequences. FRMM image description models blend convolutional neural network feature descriptors with sequential language data in a recurrent framework. In addition to introducing FRMMs, this work analyzes the impact of varying activation functions and vocabulary size. The Flickr8k, Flickr30K and MSCOCO datasets are used for training and testing, and the resulting models demonstrate state-of-the-art description results.
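The idea that each modality keeps its own architecture, parameters and sequence length before fusion can be sketched roughly as follows. This is not the FRMM implementation from the thesis: the Elman-style branch, the fusion by concatenation, and every name and size below (`SimpleRNN`, `fuse_and_predict`, the feature and hidden dimensions) are assumptions chosen only to keep the example small and dependency-free.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix (illustrative initialization only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


class SimpleRNN:
    """Hypothetical recurrent branch for one modality, with its own parameters."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        self.w_in = make_matrix(hidden_size, input_size, rng)
        self.w_rec = make_matrix(hidden_size, hidden_size, rng)

    def run(self, sequence):
        """Consume a sequence of any length and return the final hidden state."""
        hidden = [0.0] * self.hidden_size
        for step in sequence:
            from_input = matvec(self.w_in, step)
            from_state = matvec(self.w_rec, hidden)
            hidden = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
        return hidden


def fuse_and_predict(image_seq, word_seq, image_rnn, word_rnn, w_out):
    """Assumed fusion step: concatenate branch states, project to a word distribution."""
    fused = image_rnn.run(image_seq) + word_rnn.run(word_seq)
    return softmax(matvec(w_out, fused))


rng = random.Random(0)
image_rnn = SimpleRNN(input_size=8, hidden_size=6, rng=rng)  # stand-in for CNN features
word_rnn = SimpleRNN(input_size=5, hidden_size=4, rng=rng)   # stand-in for word vectors
vocab_size = 10
w_out = make_matrix(vocab_size, 6 + 4, rng)

image_seq = [[rng.random() for _ in range(8)]]                   # sequence length 1
word_seq = [[rng.random() for _ in range(5)] for _ in range(3)]  # sequence length 3
next_word_probs = fuse_and_predict(image_seq, word_seq, image_rnn, word_rnn, w_out)
```

The two branches differ in input size, hidden size and sequence length, and they meet only at the fusion step, which is the property the abstract highlights. A trained model would learn these weights end to end rather than sampling them at random.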
Library of Congress Subject Headings
Deep learning (Machine learning); Neural networks (Computer science); Computer vision; Image processing--Digital techniques
Computer Engineering (MS)
Department, Program, or Center
Computer Engineering (KGCOE)
Oruganti, Ram Manohar, "Image Description using Deep Neural Networks" (2016). Thesis. Rochester Institute of Technology. Accessed from
RIT – Main Campus