Freddy T. Nguyen, MD, PhD

Research Fellow @ Massachusetts Institute of Technology, Transfusion Medicine Fellow @ Dartmouth-Hitchcock Medical Center

An Accessible, Efficient, and Accurate Natural Language Processing Method for Extracting Diagnostic Data from Pathology Reports

Hansen Lam, Freddy T. Nguyen, Xintong Wang, Aryeh Stock, Volha Lenskaya, Maryam Kooshesh, Peizi Li, Mohammad Qazi, Shenyu Wang, Mitra Dehghan, Xia Qian, Qiusheng Si, Alexandros Polydorides. Journal of Pathology Informatics 2022-11-08

Analysis of diagnostic information in pathology reports for clinical or translational research and for quality assessment/control often requires manual data extraction, which can be laborious, time-consuming, and error-prone.

Objective: We sought to develop, employ, and evaluate a simple, dictionary- and rule-based natural language processing (NLP) algorithm for generating searchable information on various types of parameters from diverse surgical pathology reports.

Design: Data were exported from the pathology laboratory information system (LIS) into Extensible Markup Language (XML) documents, which were parsed by NLP-based Python code into the desired data points and delivered to Excel spreadsheets. Accuracy and efficiency were compared with a manual data extraction method, with concordance measured by Cohen’s κ coefficient and corresponding P values.

Results: The automated method was highly concordant with the manual method (90-100%, P<.001), with excellent inter-observer reliability (Cohen’s κ: 0.86-1.0), in 3 clinicopathologic research scenarios: squamous dysplasia presence and grade in anal biopsies, epithelial dysplasia grade and location in colonoscopic surveillance biopsies, and adenocarcinoma grade and amount in prostate core biopsies. Notably, the automated method was 24-39 times faster and inherently linked each diagnosis to additional variables such as patient age and location, which would require additional processing time if extracted manually.

Conclusions: A simple, flexible, and scalable NLP-based platform can be used to correctly, safely, and quickly extract and deliver linked data from pathology reports into searchable spreadsheets for clinical and research purposes.
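The pipeline described in the abstract (XML export, dictionary- and rule-based parsing in Python, tabular output, and concordance by Cohen’s κ) can be illustrated with a minimal sketch. This is not the authors' code. The XML element and attribute names, the sample reports, the regular-expression rules, and every function name below are hypothetical, chosen only for illustration, and the sketch writes CSV with the standard library rather than an Excel spreadsheet.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the real LIS export schema is not given in the abstract.
SAMPLE_XML = """<reports>
  <report id="S1" age="54">
    <diagnosis>Anus, biopsy: High-grade squamous intraepithelial lesion (HSIL).</diagnosis>
  </report>
  <report id="S2" age="61">
    <diagnosis>Anus, biopsy: Negative for dysplasia.</diagnosis>
  </report>
  <report id="S3" age="47">
    <diagnosis>Anus, biopsy: Low-grade squamous intraepithelial lesion (LSIL).</diagnosis>
  </report>
</reports>"""

# Illustrative dictionary of ordered (pattern, label) rules; the first match wins.
# Real reports would need far more careful handling of negation and synonyms.
RULES = [
    (re.compile(r"negative for dysplasia", re.I), "negative"),
    (re.compile(r"high[- ]grade|HSIL", re.I), "high-grade"),
    (re.compile(r"low[- ]grade|LSIL", re.I), "low-grade"),
]


def classify(text):
    """Return the label of the first rule whose pattern occurs in the text."""
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return "unclassified"


def extract_rows(xml_text):
    """Parse the XML and keep each diagnosis linked to its report variables."""
    root = ET.fromstring(xml_text)
    rows = []
    for report in root.iter("report"):
        diagnosis = report.findtext("diagnosis", default="")
        rows.append({
            "report_id": report.get("id"),
            "age": report.get("age"),
            "dysplasia": classify(diagnosis),
        })
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text (a stand-in for the Excel output)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["report_id", "age", "dysplasia"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    n = len(a)
    labels = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)  # chance agreement
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    rows = extract_rows(SAMPLE_XML)
    print(to_csv(rows))
    # Hypothetical manual labels for the same three reports.
    manual = ["high-grade", "negative", "low-grade"]
    print(cohen_kappa([r["dysplasia"] for r in rows], manual))
```

Keeping one output row per report is what preserves the link between each diagnosis and variables such as patient age, which the abstract identifies as an advantage over manual extraction.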

Physician-scientist with extensive experience developing and translating nanotechnologies and biomedical optical technologies from the bench to clinic in areas of genetics, oncology, and cardiovascular diseases. Extensive experience in community building in healthcare innovation, research, medical, and physician-scientist communities through various leadership roles.
