Abstract

Automating the segmentation of anomalous activities within long video sequences is complicated by the ambiguity of how such events are defined. This thesis approaches the problem by learning generative models that can identify meaningful sequences in videos with limited supervision. We propose two types of end-to-end trainable Convolutional Long Short-Term Memory (Conv-LSTM) networks that predict the subsequent video sequence from a given input. The first is an encoder-decoder model that learns spatio-temporal features from stacked non-overlapping image patches; the second is an autoencoder-based model that uses max-pooling layers to learn an abstraction of the entire image. The networks learn a model of “normal” activity from videos of usual events. Regularity scores are derived from the reconstruction errors of a set of predictions; abnormal video sequences yield lower regularity scores because they diverge further from the actual sequence over time. The models employ a composite structure, and the effects of “conditioning” on learning more meaningful representations are examined. The best model is chosen based on reconstruction and prediction accuracy. The Conv-LSTM models are evaluated both qualitatively and quantitatively, demonstrating competitive results on multiple anomaly detection datasets. When compared to state-of-the-art methods, Conv-LSTM units are shown to be effective for modeling and predicting learned events.
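
The sketch below is a minimal illustration, not the thesis code, of the autoencoder-based variant described above: a Conv-LSTM autoencoder whose max-pooling layers learn an abstraction of each frame, plus a regularity score computed by min-max normalizing and inverting the per-clip reconstruction error. The choice of TensorFlow/Keras, the layer sizes, the 10-frame clip length, and the 64x64 resolution are all illustrative assumptions.

import numpy as np
from tensorflow.keras import layers, models

def build_convlstm_autoencoder(frames=10, height=64, width=64, channels=1):
    """Conv-LSTM autoencoder that reconstructs a short video clip.

    Hypothetical configuration; filter counts and depths are assumptions.
    """
    inp = layers.Input(shape=(frames, height, width, channels))
    # Spatial encoder: max-pooling compresses each frame into an abstraction.
    x = layers.TimeDistributed(
        layers.Conv2D(32, 3, padding="same", activation="relu"))(inp)
    x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
    # Temporal encoding/decoding with stacked Conv-LSTM layers.
    x = layers.ConvLSTM2D(32, 3, padding="same", return_sequences=True)(x)
    x = layers.ConvLSTM2D(32, 3, padding="same", return_sequences=True)(x)
    # Spatial decoder: upsample back to the input resolution.
    x = layers.TimeDistributed(layers.UpSampling2D(2))(x)
    out = layers.TimeDistributed(
        layers.Conv2D(channels, 3, padding="same", activation="sigmoid"))(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")
    return model

def regularity_scores(model, clips):
    """Per-clip regularity: high for normal clips, low for anomalous ones."""
    recon = model.predict(clips, verbose=0)
    # Reconstruction error per clip (sum of squared pixel errors).
    err = np.square(clips - recon).reshape(len(clips), -1).sum(axis=1)
    # Min-max normalize and invert: larger errors yield lower regularity.
    err_norm = (err - err.min()) / (err.max() - err.min() + 1e-8)
    return 1.0 - err_norm

# Example usage: train on normal clips only, then score a mixed test set.
# model = build_convlstm_autoencoder()
# model.fit(normal_clips, normal_clips, epochs=10, batch_size=8)
# scores = regularity_scores(model, test_clips)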

Library of Congress Subject Headings

Image processing--Digital techniques; Machine learning; Optical pattern recognition

Publication Date

2016

Document Type

Thesis

Student Type

Graduate

Degree Name

Computer Engineering (MS)

Department, Program, or Center

Computer Engineering (KGCOE)


Advisor

Andreas Savakis

Advisor/Committee Member

Andres Kwasinski

Advisor/Committee Member

Roy Melton


Comments

Physical copy available from RIT's Wallace Library at TA1637 .M33 2016


Campus

RIT – Main Campus

Plan Codes