Abstract

Current research in computer vision and machine learning has demonstrated some great abilities at detecting and recognizing objects in natural images. Current state-of-the-art results in object detection, classification and localization in ImageNet Challenges have the validation accuracy for top 5 predictions for classification to be at 3.08% while similar classification experiments run by trained humans report an accuracy of 5.1%. While some people might argue that human accuracy is a function of training time it can be said with great confidence that automated classification models are at least as good as trained humans in classification problems. The ability of these models to analyze and describe complex images, however, is still an active area of research.

Image description is a good starting point for imparting artificial intelligence to machines by allowing them to analyze and describe complex visual scenes. This thesis work introduces a generic end-to-end trainable Fusion-based Recurrent Multi-Modal (FRMM) architecture to address multi-modal applications. FRMM allows each input modality to be independent in terms of architecture, parameters and length of input sequences. FRMM image description models seamlessly blend convolutional neural network feature descriptors with sequential language data in a recurrent framework. In addition to introducing FRMMs, this work also analyzes the impact of varying activation functions and vocabulary size. For training and testing Flickr8k, Flickr30K and MSCOCO datasets have been used, demonstrating state-of-the-art description results.

Library of Congress Subject Headings

Deep learning (Machine learning); Neural networks (Computer science); Computer vision; Image processing--Digital techniques

Publication Date

5-2016

Document Type

Thesis

Student Type

Graduate

Degree Name

Computer Engineering (MS)

Department, Program, or Center

Computer Engineering (KGCOE)

Advisor

Raymond Ptucha

Advisor/Committee Member

Andreas Savakis

Advisor/Committee Member

Christopher Kanan

Comments

Physical copy available from RIT's Wallace Library at Q325.5 .O78 2016

Recommended Citation

Oruganti, Ram Manohar, "Image Description using Deep Neural Networks" (2016). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/9043

Campus

RIT – Main Campus

Download

COinS

Theses

Image Description using Deep Neural Networks

Abstract

Library of Congress Subject Headings

Publication Date

Document Type

Student Type

Degree Name

Department, Program, or Center

Advisor

Advisor/Committee Member

Advisor/Committee Member

Comments

Recommended Citation

Campus

Search

Browse

Author Corner

RIT Links

Theses

Image Description using Deep Neural Networks

Author

Abstract

Library of Congress Subject Headings

Publication Date

Document Type

Student Type

Degree Name

Department, Program, or Center

Advisor

Advisor/Committee Member

Advisor/Committee Member

Comments

Recommended Citation

Campus

Share

Search

Browse

Author Corner

RIT Links