Documents found on the World Wide Web (WWW) may be composed of a single web page, or several web pages that are linked together by a table of contents or some other commonly known document construct. When a document spans multiple web pages, it is often inconvenient to print or download the entire document using available tools. This thesis introduces a concept called the document boundary to facilitate representation and analysis of multi-page web documents, and suggests a two-phase approach towards automated identification of document boundaries. In the first phase, individual pages are examined to determine which links are most likely to represent an intra-document link. This procedure is applied recursively to identify a group of candidate pages which may be part of the same document. In the second phase, the link topology and other features of the identified pages are examined in aggregate for indications of a multi-page document. A test suite of both single- and multi-page web documents was assembled using a mixture of handpicked documents and documents which were gathered by an arbitrary third party. The document boundary detection system was applied to the main page of each document. The document boundary detection system was able to achieve a success rate of 73% when its results were compared to the ground truth documents.
Library of Congress Subject Headings
Department, Program, or Center
Computer Engineering (KGCOE)
Sweet, James, "Reconstructing the Boundary of a Web Document" (2002). Thesis. Rochester Institute of Technology. Accessed from
RIT – Main Campus