Abstract

The success of deep learning models in challenging computer vision and natural language processing tasks depends on good vector representations of data. For example, learning efficient and salient video representations is a fundamental step for many tasks, such as action recognition and next-frame prediction. Most deep learning methods rely on large datasets like ImageNet or MSCOCO for training, which are expensive and time-consuming to collect. Some earlier works in video representation learning relied on encoder-decoder style networks trained in an unsupervised fashion, which take in a few frames at a time. Research in self-supervised learning is growing and has shown promising results on image-related tasks, both for learning data representations and for pre-training network weights using unlabeled data. However, many of these techniques use static architectures like AlexNet, which fail to take into account the temporal aspect of videos. Learning frame-to-frame temporal relationships is essential to learning latent representations of video. In our work, we propose to learn this temporality by pairing static encodings with a recurrent long short-term memory (LSTM) network. This research also investigates combining different encoding architectures with the recurrent network so that it can accept varying numbers of input frames. We also introduce a novel self-supervised task in which the neural network has two objectives: predicting whether a tuple of input frames is temporally consistent and, if it is not, predicting the position of the incorrect tuple. Finally, efficacy is measured by using these trained networks on downstream tasks such as action recognition on the standard datasets UCF101 and HMDB51.
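
The two-part self-supervised task can be made concrete with a small data-sampling sketch. The abstract does not say how tuples are built, so everything here is an assumption for illustration only: the function name `make_order_sample`, the parameters `tuple_len`, `margin`, and `corrupt_prob`, and the reading of "position" as the slot whose frame was replaced by an out-of-order frame are all hypothetical. It is not the thesis's sampling procedure.

```python
import random
from typing import List, Optional, Tuple


def make_order_sample(
    num_frames: int,
    tuple_len: int = 4,
    margin: int = 2,
    corrupt_prob: float = 0.5,
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], int, int]:
    """Return (frame_indices, is_consistent, replaced_position).

    Hypothetical sampler: replaced_position is -1 for a consistent tuple.
    """
    rng = rng or random.Random()
    # Frames in [margin, num_frames - margin) form the ordered tuple; the
    # first and last `margin` frames are reserved as out-of-order fillers.
    inner = range(margin, num_frames - margin)
    if tuple_len < 2 or margin < 1 or len(inner) < tuple_len:
        raise ValueError("clip too short for this tuple length and margin")

    frames = sorted(rng.sample(inner, tuple_len))
    if rng.random() >= corrupt_prob:
        return frames, 1, -1  # temporally consistent tuple

    pos = rng.randrange(tuple_len)
    pools = []
    if pos > 0:
        # A frame earlier than the whole tuple breaks order in any slot but 0.
        pools.append(range(0, margin))
    if pos < tuple_len - 1:
        # A frame later than the whole tuple breaks order in any slot but the last.
        pools.append(range(num_frames - margin, num_frames))
    frames[pos] = rng.choice(rng.choice(pools))
    return frames, 0, pos


if __name__ == "__main__":
    sampler_rng = random.Random(0)
    for _ in range(3):
        print(make_order_sample(30, rng=sampler_rng))
```

Under these assumptions, each sample supplies a binary consistency label and, for inconsistent tuples, a position label, which could be fed to two classification heads on top of the recurrent network. Note that the replaced slot is not always the only slot that could explain the broken order, so a real pipeline might need a stricter corruption rule.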

Library of Congress Subject Headings

Neural networks (Computer science); Machine learning; Pattern recognition systems; Computer vision; Natural language processing (Computer science); Signal processing--Digital techniques

Publication Date

6-2020

Document Type

Thesis

Student Type

Graduate

Degree Name

Computer Engineering (MS)

Department, Program, or Center

Computer Engineering (KGCOE)

Advisor

Raymond Ptucha

Advisor/Committee Member

Alexander Loui

Advisor/Committee Member

Andres Kwasinski

Recommended Citation

Nagabandi, Sai Shashidhar, "Self-Supervised Video Representation Learning by Recurrent Networks and Frame Order Prediction" (2020). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/10450

Campus

RIT – Main Campus

Plan Codes

CMPE-MS
