We introduce a new system for layout-based indexing and retrieval of mathematical expressions using substitution trees. Substitution trees can efficiently store and find hierarchically structured data based on similarity. Previously, Kohlhase and Sucan applied substitution trees to indexing mathematical expressions in operator tree representation (Content MathML) and to query-by-expression retrieval. In this investigation, we use substitution trees to index mathematical expressions in symbol layout tree representation (LaTeX) to group expressions based on the similarity of their symbols, symbol layout, sub-expressions and size. We describe our novel substitution tree indexing and retrieval algorithms and our many significant contributions to the behavior of these algorithms, including: allowing substitution trees to index and retrieve layout-based mathematical expressions instead of predicates; introducing a bias in the insertion function that helps group expressions in the index based on similarity in baseline size; modifying the search function to find expressions that are not identical to a search query yet are still structurally similar to it; and ranking search results based on their similarity in symbols and symbol layout to the search query. We report an experiment comparing our system against the term frequency-inverse document frequency (TF-IDF) keyword-based system of Zanibbi and Yuan and demonstrate that: in many cases, the two systems are comparable; our system excels at finding expressions identical to the search query and expressions containing relevant sub-expressions; and our system experiences some limitations due to the insertion bias and the presence of LaTeX formatting in expressions.
Future work includes: designing a different insertion bias that improves the quality of search results; modifying the behavior of the search and ranking functions; and extending the scope of the system so that it can index websites or non-LaTeX expressions (such as MathML or images). Overall, we present a promising first attempt at layout-based substitution tree indexing and retrieval for mathematical expressions.
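As a rough illustration of the kind of matching that substitution-based retrieval builds on, the sketch below matches a query pattern containing variables against a small symbol layout tree. It is not the thesis's algorithm: the `Node` class, the `right`/`sup`/`sub` relations, the `?name` variable convention, and the `match` function are all hypothetical names chosen for this example, and the insertion bias, the relaxed search, and the ranking described above are left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Node:
    """One symbol in a simplified symbol layout tree (hypothetical model)."""
    symbol: str
    right: Optional["Node"] = None  # next symbol on the same baseline
    sup: Optional["Node"] = None    # superscript region
    sub: Optional["Node"] = None    # subscript region


def is_var(node: Node) -> bool:
    # Assumption for this sketch: variables are symbols starting with "?".
    return node.symbol.startswith("?")


def match(pattern: Optional[Node], tree: Optional[Node],
          bindings: Optional[Dict[str, Node]] = None) -> Optional[Dict[str, Node]]:
    """Return variable bindings that make `pattern` equal to `tree`, or None."""
    if bindings is None:
        bindings = {}
    if pattern is None or tree is None:
        # Both regions must be empty together.
        return bindings if pattern is tree else None
    if is_var(pattern):
        # A variable stands for the whole sub-expression rooted here.
        bound = bindings.get(pattern.symbol)
        if bound is None:
            return {**bindings, pattern.symbol: tree}
        return bindings if bound == tree else None
    if pattern.symbol != tree.symbol:
        return None
    for p, t in ((pattern.sup, tree.sup),
                 (pattern.sub, tree.sub),
                 (pattern.right, tree.right)):
        bindings = match(p, t, bindings)
        if bindings is None:
            return None
    return bindings


# x^2 + 1 as a layout tree, and the query x^{?a} + ?b
expr = Node("x", sup=Node("2"), right=Node("+", right=Node("1")))
query = Node("x", sup=Node("?a"), right=Node("+", right=Node("?b")))
result = match(query, expr)
```

Here `result` binds `?a` to the superscript `2` and `?b` to the trailing `1`. A substitution tree stores such substitutions along the paths of an index so that many expressions sharing a common generalization can be tested together; this sketch only shows the matching of a single pattern against a single expression.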
Library of Congress Subject Headings
Mathematical symbols (Typefaces)--Classification; Information retrieval; Layout (Printing)
Department, Program, or Center
Computer Science (GCCIS)
Schellenberg, Matthew, "Layout-based substitution tree indexing and retrieval for mathematical expressions" (2011). Thesis. Rochester Institute of Technology. Accessed from
RIT – Main Campus