Bruno Artacho


In this dissertation we present multiple state-of-the-art deep learning methods for computer vision that use multi-scale approaches for two main tasks: pose estimation and semantic segmentation. For pose estimation, we introduce a complete framework that expands the fields-of-view of the network through a multi-scale approach, significantly increasing the effectiveness of conventional backbone architectures on several pose estimation tasks without requiring a larger network or postprocessing. Our multi-scale pose estimation framework contributes to research on single-person pose estimation in both 2D and 3D scenarios, pose estimation in videos, and the estimation of multiple people’s poses in a single image with both top-down and bottom-up approaches. In addition to the enhanced capability for multi-person pose estimation provided by our multi-scale approach, our framework also demonstrates a superior capacity to extend to the more detailed and computationally heavier task of full-body pose estimation, which includes up to 133 joints per person. For segmentation, we present a new efficient architecture for semantic segmentation, based on a “Waterfall” Atrous Spatial Pooling architecture, that achieves a considerable accuracy increase while decreasing the number of network parameters and the memory footprint. The proposed Waterfall architecture leverages the efficiency of progressive filtering in the cascade architecture while maintaining multi-scale fields-of-view comparable to spatial pyramid configurations. Additionally, our method does not rely on a postprocessing stage with conditional random fields, which further reduces complexity and required training time.
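The contrast between a spatial pyramid and a cascade whose intermediate outputs are all kept can be illustrated with simple receptive-field arithmetic. The sketch below is not the dissertation's implementation: it assumes stride-1 3x3 atrous convolutions, and the dilation rates (6, 12, 18, 24) and the function names are illustrative assumptions. It uses the standard result that a stride-1 atrous convolution with kernel size k and rate r widens the receptive field by (k - 1) * r along each axis.

```python
def receptive_field(rates, kernel=3):
    """Receptive field along one axis after applying stride-1 atrous
    convolutions with the given dilation rates in sequence."""
    rf = 1
    for rate in rates:
        rf += (kernel - 1) * rate
    return rf


def pyramid_fovs(rates, kernel=3):
    """Parallel (pyramid-style) branches: each branch filters the
    input directly, so each sees only its own dilation rate."""
    return [receptive_field([rate], kernel) for rate in rates]


def waterfall_fovs(rates, kernel=3):
    """Cascade in which every intermediate output is kept: stage i
    has passed through stages 0..i, so fields-of-view accumulate."""
    return [receptive_field(rates[: i + 1], kernel) for i in range(len(rates))]


# Assumed dilation rates, chosen only for illustration.
RATES = [6, 12, 18, 24]

if __name__ == "__main__":
    print("pyramid  :", pyramid_fovs(RATES))    # [13, 25, 37, 49]
    print("waterfall:", waterfall_fovs(RATES))  # [13, 37, 73, 121]
```

Under these assumptions, both configurations expose four distinct scales to the layers that follow, while the cascaded variant filters progressively instead of running each branch on the full input independently.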

Publication Date


Document Type


Student Type


Degree Name

Electrical and Computer Engineering (Ph.D)

Department, Program, or Center

Electrical Engineering (KGCOE)


Matthew Dye

Advisor/Committee Member

Andres Kwasinski

Advisor/Committee Member

Raymond Ptucha


RIT – Main Campus