The success of deep learning models in challenging tasks of computer vision and natural language processing depend on good vector representations of data. For example, learning efficient and salient video representations is one of the fundamental steps for many tasks like action recognition and next frame prediction. Most methods in deep learning rely on large datasets like ImageNet or MSCOCO for training, which is expensive and time consuming to collect. Some of the earlier works in video representation learning relied on encoder-decoder style networks in an unsupervised fashion, which would take in a few frames at a time. Research in the field of self-supervised learning is growing, and has shown promising results on image-related tasks to both learn data representations as well as pre-learn weights for networks using unlabeled data. However, many of these techniques use static architectures like AlexNet, which fail to take into account the temporal aspect of videos. Learning frame-to-frame temporal relationships is essential to learning latent representations of video. In our work, we propose to learn this temporality by pairing static encodings with a recurrent long short term memory network. This research will also investigate applying different methods of encoding architecture along with the recurrent network, to take in a range of different number of frames. We also introduce a novel self-supervised task in which the neural network has two tasks; predicting if a tuple of input frames is temporally consistent, and if not, predict the positioning of incorrect tuple. The efficacy is finally measured by using these trained networks on downstream tasks like action recognition on standard datasets UCF101 and HMDB51.
Library of Congress Subject Headings
Neural networks (Computer science); Machine learning; Pattern recognition systems; Computer vision; Natural language processing (Computer science); Signal processing--Digital techniques
Computer Engineering (MS)
Department, Program, or Center
Computer Engineering (KGCOE)
Nagabandi, Sai Shashidhar, "Self-Supervised Video Representation Learning by Recurrent Networks and Frame Order Prediction" (2020). Thesis. Rochester Institute of Technology. Accessed from
RIT – Main Campus