Advanced image-based application systems such as image retrieval and visual question answering depend heavily on semantic image region annotation. Progress in image region annotation, however, is limited by our incomplete understanding of how humans, the end users, process these images and image regions. In this work, we expand a framework for capturing image region annotations in which the interpretation of an image is shaped by the end user's visual perception skills, conceptual knowledge, and task-oriented goals. Human image understanding is reflected in individuals' visual and linguistic behaviors, but the meaningful computational integration and interpretation of their multimodal representations (e.g., gaze, text) remain a challenge. Our work explores the hypothesis that eye movements can help us understand experts' perceptual processes and that spoken language descriptions can reveal the conceptual elements of image inspection tasks. We propose that a meaningful relation exists between gaze, spoken narratives, and image content. Using unsupervised bitext alignment, we create mappings between participants' eye movements, which reveal key areas of images, and their spoken descriptions of those images. The resulting alignments are then used to annotate image regions with concept labels. Our alignment accuracy exceeds that of baseline alignments obtained under both simultaneous and fixed-delay temporal correspondence. Additionally, comparing alignment accuracy between a method that identifies image clusters from eye movements and a method that identifies clusters from image features shows that the two approaches perform well on different types of images and concept labels. This suggests that an image annotation framework could integrate information from more than one technique to handle heterogeneous images.
The resulting alignments can be used to create a database of low-level image features and high-level semantic annotations corresponding to perceptually important image regions. We demonstrate the applicability of the proposed framework with two datasets: one consisting of general-domain images and another with images from the domain of medicine. This work is an important contribution toward the highly challenging problem of fusing human-elicited multimodal data sources, a problem that will become increasingly important as low-resource scenarios become more common.
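The unsupervised bitext alignment described in the abstract can be illustrated with one standard aligner of this family, IBM Model 1 trained by expectation-maximization. The sketch below treats gaze fixation-cluster IDs as one "language" and narrative words as the other; the cluster names, words, and training data are illustrative assumptions, not details of the thesis's actual method.

```python
# Minimal sketch of unsupervised bitext alignment (IBM Model 1, EM training).
# One side of the "bitext" is a sequence of gaze fixation-cluster IDs; the
# other is the words of the spoken narrative. All data here is hypothetical.
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """EM estimation of translation probabilities t(word | region)."""
    regions = {r for rs, _ in pairs for r in rs}
    words = {w for _, ws in pairs for w in ws}
    # Uniform initialization over the word vocabulary.
    t = {(w, r): 1.0 / len(words) for r in regions for w in words}
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # normalizers per region
        for rs, ws in pairs:
            for w in ws:
                z = sum(t[(w, r)] for r in rs)
                for r in rs:
                    c = t[(w, r)] / z  # posterior alignment probability
                    count[(w, r)] += c
                    total[r] += c
        # M-step: renormalize expected counts into probabilities.
        t = {(w, r): count[(w, r)] / total[r] for (w, r) in count}
    return t


def align(t, regions, words):
    """Assign each narrative word to its most probable gaze region."""
    return [max(regions, key=lambda r: t.get((w, r), 0.0)) for w in words]
```

After training on region/word pairs, `align` yields a word-to-region mapping, which is the kind of correspondence that can then be used to attach concept labels to image regions.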

Degree Name

Imaging Science (Ph.D.)

Department, Program, or Center

Chester F. Carlson Center for Imaging Science (COS)


Advisor
Jeff B. Pelz

Advisor/Committee Member

Cecilia O. Alm

Advisor/Committee Member

Emily T. Prud'hommeaux


Campus
RIT – Main Campus