In this dissertation, we present multiple state-of-the-art deep learning methods for two main computer vision tasks, pose estimation and semantic segmentation, using multi-scale approaches. For pose estimation, we introduce a complete framework that expands the fields-of-view of the network through a multi-scale approach, significantly increasing the effectiveness of conventional backbone architectures on several pose estimation tasks without requiring a larger network or postprocessing. Our multi-scale pose estimation framework contributes to research on single-person pose estimation in both 2D and 3D scenarios, pose estimation in videos, and multi-person pose estimation in a single image with both top-down and bottom-up approaches. In addition to the enhanced multi-person pose estimation capability provided by our multi-scale approach, our framework also demonstrates a superior capacity to extend to the more detailed and heavier task of full-body pose estimation, including up to 133 joints per person. For segmentation, we present a new efficient architecture for semantic segmentation, based on a “Waterfall” Atrous Spatial Pooling architecture, that achieves a considerable accuracy increase while decreasing the number of network parameters and the memory footprint. The proposed Waterfall architecture leverages the efficiency of progressive filtering in the cascade architecture while maintaining multi-scale fields-of-view comparable to spatial pyramid configurations. Additionally, our method does not rely on a postprocessing stage with conditional random fields, which further reduces complexity and required training time.
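The claim that a cascade of atrous convolutions can keep fields-of-view comparable to a spatial pyramid can be illustrated with simple receptive-field arithmetic. The sketch below is not taken from the dissertation: the dilation rates (6, 12, 18, 24), the 3x3 kernels, the stride of 1, and the function names are all assumptions chosen for illustration. It compares the field-of-view of each branch when the atrous convolutions run in parallel on the same input (pyramid style) with the field-of-view at each tap when every convolution consumes the output of the previous one (waterfall style).

```python
def atrous_extent(kernel_size, rate):
    """Effective side length of a single dilated (atrous) kernel."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def parallel_fields_of_view(rates, kernel_size=3):
    """Pyramid style: every branch sees the input directly."""
    return [atrous_extent(kernel_size, r) for r in rates]


def waterfall_fields_of_view(rates, kernel_size=3):
    """Waterfall style: each stage filters the previous stage's output,
    and the output of every stage is tapped as a branch.
    Assumes stride 1 and no pooling between stages."""
    fov = 1
    taps = []
    for r in rates:
        # Stacking a stride-1 convolution grows the field-of-view
        # by (effective kernel extent - 1).
        fov += atrous_extent(kernel_size, r) - 1
        taps.append(fov)
    return taps


# Hypothetical dilation rates, chosen only for illustration.
rates = [6, 12, 18, 24]
parallel = parallel_fields_of_view(rates)
waterfall = waterfall_fields_of_view(rates)
print("parallel branches:", parallel)
print("waterfall taps:   ", waterfall)
```

Under these assumptions, every waterfall tap covers at least the field-of-view of the corresponding parallel branch, because each stage reuses the filtering already done by the stages before it. This is only a counting argument about fields-of-view; it says nothing about accuracy or parameter counts.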
Electrical and Computer Engineering (Ph.D)
Department, Program, or Center
Electrical Engineering (KGCOE)
Artacho, Bruno, "Multi-Scale Architectures for Human Pose Estimation" (2022). Thesis. Rochester Institute of Technology. Accessed from
RIT – Main Campus