Author

James Sweet

Abstract

Documents found on the World Wide Web (WWW) may be composed of a single web page, or several web pages that are linked together by a table of contents or some other commonly known document construct. When a document spans multiple web pages, it is often inconvenient to print or download the entire document using available tools. This thesis introduces a concept called the document boundary to facilitate representation and analysis of multi-page web documents, and suggests a two-phase approach towards automated identification of document boundaries. In the first phase, individual pages are examined to determine which links are most likely to represent an intra-document link. This procedure is applied recursively to identify a group of candidate pages which may be part of the same document. In the second phase, the link topology and other features of the identified pages are examined in aggregate for indications of a multi-page document. A test suite of both single- and multi-page web documents was assembled using a mixture of handpicked documents and documents which were gathered by an arbitrary third party. The document boundary detection system was applied to the main page of each document. The document boundary detection system was able to achieve a success rate of 73% when its results were compared to the ground truth documents.

Library of Congress Subject Headings

Web sites

Publication Date

4-1-2002

Document Type

Thesis

Department, Program, or Center

Computer Engineering (KGCOE)

Advisor

Harrington, Steven

Advisor/Committee Member

Jones, Price

Advisor/Committee Member

Shaaban, Muhammad

Comments

Note: imported from RIT’s Digital Media Library running on DSpace to RIT Scholar Works. Physical copy available through RIT's The Wallace Library at: TK5105.888 .S94 2003

Campus

RIT – Main Campus

Share

COinS