Abstract

Technical logbooks are a challenging and under-explored text type in automated event identification. Entries are typically short and written in non-standard yet technical language, which poses challenges for off-the-shelf NLP pipelines. Each logbook dataset typically represents a domain (a technical field such as automotive) and an application (e.g., maintenance). The fine granularity of the issue types described in these datasets also leads to class imbalance, making it difficult for models to accurately predict which issue each logbook entry describes. In this research, we focus on technical issue pre-processing, clustering, and classification, considering logbook datasets from the automotive, aviation, and facility maintenance domains. We developed MaintNet, a collaborative open-source library that includes logbook datasets from various domains and a pre-processing pipeline for cleaning unstructured datasets. Additionally, we adapted a feedback loop strategy from computer vision for handling extreme class imbalance, which resamples the training data based on the error observed in the prediction process. We further investigated the benefits of transfer learning from sources within the same domain (but different applications), from within the same application (but different domains), and from all available data to improve the performance of the classification models. Finally, we evaluated several data augmentation approaches, including synonym replacement, random swap, and random deletion, to address data scarcity in technical logbooks.
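As a rough illustration of the three augmentation operations named in the abstract, the following minimal Python sketch shows one common token-level formulation of each. It is not taken from the dissertation or from MaintNet: the function names, the example logbook entry, and the small synonym table are all hypothetical, and the dissertation's actual implementation may differ.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions, repeated n_swaps times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def synonym_replacement(tokens, synonyms, n, rng):
    """Replace up to n tokens that have an entry in the synonym table."""
    tokens = list(tokens)
    candidates = [i for i, t in enumerate(tokens) if t in synonyms]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        tokens[i] = rng.choice(synonyms[tokens[i]])
    return tokens


# Hypothetical logbook entry and synonym table, for illustration only.
entry = "left engine oil leak found during inspection".split()
synonyms = {"leak": ["seepage"], "found": ["observed", "detected"]}

rng = random.Random(0)  # fixed seed so runs are repeatable
swapped = random_swap(entry, 1, rng)
deleted = random_deletion(entry, 0.2, rng)
replaced = synonym_replacement(entry, synonyms, 1, rng)
```

Each function returns a new token list and leaves the input untouched, so several augmented variants can be generated from the same entry. In a technical domain the synonym table would likely need to be domain-specific, since general-purpose thesauri may not cover maintenance jargon.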

Library of Congress Subject Headings

Natural language processing (Computer science); Technology--Terminology; Machine learning; Data mining

Publication Date

7-2022

Document Type

Dissertation

Student Type

Graduate

Degree Name

Computing and Information Sciences (Ph.D.)

Department, Program, or Center

Computer Science (GCCIS)

Advisor

Travis Desell

Advisor/Committee Member

Marcos Zampieri

Advisor/Committee Member

Christian Newman

Recommended Citation

Akhbardeh, Farhad, "NLP and ML Methods for Pre-processing, Clustering and Classification of Technical Logbook Datasets" (2022). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/11227

Campus

RIT – Main Campus

Plan Codes

COMPIS-PHD


NLP and ML Methods for Pre-processing, Clustering and Classification of Technical Logbook Datasets
