Context.—

The terminology used by pathologists to describe and grade dysplasia and premalignant changes of the cervical epithelium has evolved over time. Unfortunately, the coexistence of different classification systems, combined with nonstandardized interpretive text, has created multiple layers of interpretive ambiguity.

Objective.—

To use natural language processing (NLP) to automate and expedite translation of interpretive text to a single most severe, and thus actionable, cervical intraepithelial neoplasia (CIN) diagnosis.

Design.—

We developed and applied NLP algorithms to 35 847 unstructured cervical pathology reports and assessed NLP performance in identifying the most severe diagnosis, compared to expert manual review. NLP performance was determined by calculating precision, recall, and F score.

Results.—

The NLP algorithms yielded a precision of 0.957, a recall of 0.925, and an F score of 0.94. Additionally, we estimated that the time to evaluate each monthly biopsy file was significantly reduced, from 30 hours to 0.5 hours.

Conclusions.—

A set of validated NLP algorithms applied to pathology reports can rapidly and efficiently assign a discrete, actionable diagnosis using CIN classification to assist with clinical management of cervical pathology and disease. Moreover, discrete diagnostic data encoded as CIN terminology can enhance the efficiency of clinical research.

Population-based cervical screening programs have led to substantial reductions in cervical cancer incidence and mortality worldwide. Cervical cancer screening aims to detect cervical precancers that can be treated before invasion occurs. Evaluation of screen-positive women with colposcopy and cervical biopsy is a cornerstone of cervical screening in most settings, since histology results determine clinical management, including excisional treatment, surveillance, or return to screening. Given that cervical screening is recommended for a large proportion of the population (currently from age 21 to 65 years in the United States1) and that about 5% of screened women may undergo colposcopy, cervical biopsies are among the most commonly performed pathology services. This high volume, combined with the need for electronic medical records to support cervical screening programs, underscores the importance of uniform classification of cervical precancers in pathology reports.

The terminology used by pathologists to describe and grade dysplasia and premalignant changes of the cervical epithelium has evolved over time.2–5 Two classification systems are currently used to report histologic cervical precancers: Lower Anogenital Squamous Terminology (LAST)6 and cervical intraepithelial neoplasia (CIN). Management guidelines rely mostly on the CIN nomenclature; however, the use of both squamous intraepithelial lesion and CIN nomenclatures within a single pathology report may create ambiguity. Narrative free text adds further ambiguity: variations in terminology, syntax, and modifiers can confound clinical interpretation and challenge the ability to capture discrete diagnoses.

Synoptic reporting, as defined by the College of American Pathologists (CAP),7 has been proposed as a solution to minimize ambiguity in pathology reports; however, it has not been widely implemented in clinical settings.

Kaiser Permanente Northern California (KPNC), an integrated health care delivery system with more than 4 million members, does not use synoptic reporting for cervical biopsies. Because the final diagnosis is unstructured, a manual review of each biopsy contained in the final pathology report is required to determine the most severe diagnosis for patient management. To improve reporting of cervical pathology results, we developed and applied a series of natural language processing (NLP) algorithms, an approach often used to extract specific outcomes from free text within the electronic medical record (EMR),8–12 to unstructured cervical pathology text to identify the most severe diagnosis. The accuracy of the NLP results was compared with results from a manual review completed by cytotechnologists and pathologists. Discordant manual interpretations were adjudicated by a cytopathologist.

Study Sample

Unstructured cervical pathology reports (N = 35 847) were identified by their unique accession numbers in KPNC's laboratory information system, based on cervical, uterine, and endocervical samples from patients of the KPNC health care delivery system between August 2019 and July 2020 (see Supplemental Table 1 in the supplemental digital content at https://meridian.allenpress.com/aplm in the February 2023 table of contents). This period represents the most recent complete year of pathology reports to which the developed NLP algorithms were applied. Each pathology report comprises 1 or more tissue samples per patient. A narrative result, which may include histologic nomenclature and unstructured text, is given for each tissue sample (see Supplemental Figure 1). A file with a single row of data for each accession number and its corresponding histology results was created as the input for the algorithms by eliminating extraneous spaces and carriage returns (see Supplemental Figure 2). No manual changes were made to the original text of the unstructured biopsy results.
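
For illustration, a minimal sketch of this input preparation follows. This is not the authors' pipeline; the file name, field layout, and sample accession number are hypothetical.

```python
# Minimal sketch (not the authors' actual pipeline) of the input
# preparation described above: one row per accession number, with
# extraneous spaces and carriage returns removed. The file name,
# field layout, and accession number are hypothetical.
import csv
import re

def flatten_report(raw_text: str) -> str:
    """Collapse carriage returns, newlines, and repeated spaces."""
    return re.sub(r"\s+", " ", raw_text).strip()

def build_input_file(reports: dict[str, str], out_path: str) -> None:
    """Write one row per accession number: (accession, flattened text)."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["accession_number", "report_text"])
        for accession, text in reports.items():
            writer.writerow([accession, flatten_report(text)])

# Example usage with a fabricated report:
reports = {"S20-00001": "CERVIX, BIOPSY:\r\n  - CIN 2-3\r\n  see comment"}
build_input_file(reports, "nlp_input.csv")
```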

Biopsy Diagnoses

To evaluate the unstructured histology results within a cervical pathology report as they relate to cervical cancer risk, we defined cervical biopsy diagnoses with discrete categorical labels and assigned each a severity rank score. The diagnoses ranged from “Review” (see Supplemental Figure 3), an outcome for which NLP was unable to assign any diagnosis (lowest rank), to a diagnosis of cervical cancer (highest rank) (Table 1). The NLP algorithms used these diagnoses and their ranks to assign a discrete outcome to each histology result and, subsequently, the single worst outcome to the pathology report. NLP results were compared with manually abstracted results at the level of individual categories to determine exact matches. Additionally, biopsy diagnoses were dichotomized as high risk (CIN2-CIN3 or higher) and low risk (less than CIN2-CIN3) to reflect the clinical treatment threshold and a cutoff used in many research studies.
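
A minimal sketch of this ranking logic follows. The actual category set and rank values are defined in Table 1, which is not reproduced here, so the labels and ranks below are assumptions for illustration only.

```python
# Illustrative severity ranking; the true categories and rank values
# come from Table 1 of the article, so this ordering is assumed.
SEVERITY_RANK = {
    "Review": 0,      # NLP unable to assign any diagnosis (lowest rank)
    "Benign": 1,
    "CIN1": 2,
    "CIN2-CIN3": 3,
    "CIN3": 4,
    "AIS": 5,         # adenocarcinoma in situ
    "Cancer": 6,      # cervical cancer (highest rank)
}

def worst_diagnosis(sample_diagnoses: list[str]) -> str:
    """Return the single most severe diagnosis among a report's samples."""
    return max(sample_diagnoses, key=SEVERITY_RANK.__getitem__)

def is_high_risk(diagnosis: str) -> bool:
    """Dichotomize: CIN2-CIN3 or higher is high risk."""
    return SEVERITY_RANK[diagnosis] >= SEVERITY_RANK["CIN2-CIN3"]

# A report with 3 tissue samples resolves to its single worst outcome:
print(worst_diagnosis(["Benign", "CIN2-CIN3", "CIN1"]))  # CIN2-CIN3
```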

Table 1

Cervical Biopsy Diagnoses


Development and Validation of NLP Algorithms

Although several companies offer NLP software, we used I2E, version 5.4.1 (Linguamatics NLP Platform, Linguamatics, an IQVIA company, Cambridge, United Kingdom), which enables text mining of unstructured text with user-created, rule-based algorithms to identify biopsy diagnoses.13 I2E is used by many top global pharmaceutical companies and health organizations and offers a graphical interface for developing algorithms (see Supplemental Figure 4) rather than requiring coded syntax.

More than 20 algorithms were created, defined mostly on the basis of CIN classifications; squamous intraepithelial lesion classifications were considered only when CIN classifications were not specified. The algorithms were iteratively created, tested, and validated on a smaller representative training sample (N = 2213) of pathology reports from before 2019, with biopsy diagnoses determined by a cytotechnologist and adjudicated by a pathologist. This iterative process also enabled rules to be created within the algorithms to evaluate pathology report phrases that accompany a diagnosis (see Supplemental Table 2) and that could otherwise cause an outcome to be misclassified.
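
As a rough illustration of this rule precedence, consider the sketch below. The study used I2E's graphical rule builder rather than code, so the regular expressions, labels, and the HSIL-to-CIN2-CIN3 mapping here are simplified assumptions, not the actual rules.

```python
# Simplified stand-in for the stated precedence: CIN terminology is
# evaluated first; squamous intraepithelial lesion (SIL) terms are
# consulted only when no CIN classification is specified. Patterns and
# the HSIL -> CIN2-CIN3 mapping are illustrative assumptions. A
# production rule set would also arbitrate multiple matches by severity.
import re

CIN_PATTERNS = [
    (re.compile(r"\bCIN\s*(2[-/ ]3|II[-/ ]III)\b", re.I), "CIN2-CIN3"),
    (re.compile(r"\bCIN\s*(3|III)\b", re.I), "CIN3"),
    (re.compile(r"\bCIN\s*(1|I)\b", re.I), "CIN1"),
]

SIL_PATTERNS = [
    (re.compile(r"\bhigh[- ]grade squamous intraepithelial lesion\b|\bHSIL\b", re.I), "CIN2-CIN3"),
    (re.compile(r"\blow[- ]grade squamous intraepithelial lesion\b|\bLSIL\b", re.I), "CIN1"),
]

def classify(histology_text: str) -> str:
    """Map one histology result to a discrete diagnosis, or 'Review'."""
    for pattern, label in CIN_PATTERNS:      # CIN terms take precedence
        if pattern.search(histology_text):
            return label
    for pattern, label in SIL_PATTERNS:      # SIL fallback
        if pattern.search(histology_text):
            return label
    return "Review"                          # unable to assign a diagnosis

print(classify("Squamous mucosa with HSIL (CIN 3)"))  # CIN3
print(classify("HSIL identified"))                    # CIN2-CIN3
```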

The outcomes from the NLP algorithms under development were compared with this gold standard sample. Each algorithm was modified and reevaluated until it achieved greater than 90% exact agreement with the known outcomes, a threshold set a priori. Upon assignment of a diagnostic result to each histology result within the pathology report (see Supplemental Figure 4), the most severe diagnosis was assigned to the report (see Supplemental Figures 5 and 6). We also defined a patient as high risk if the most severe pathology report diagnosis was CIN2-CIN3 or higher.

The final algorithms developed with the representative training sample were applied to the study sample, a standard practice for measuring NLP accuracy.14–18 To validate the NLP algorithms, a database of the study sample with a user interface was created and populated with the pathology text and the final diagnosis from the algorithms. This information was presented, unblinded to the NLP-assigned result, for manual cytotechnologist review and pathologist adjudication, in which the most severe final diagnosis was assigned to each pathology report. This validation process enabled comparison of the final diagnostic outcome between the NLP algorithms and expert review.

Statistical Analysis

The goal was to evaluate the performance of the NLP algorithms in the study sample by measuring, for each pathology report, the concordance between the CIN diagnoses assigned by the algorithms and those assigned by cytotechnologist review, with expert pathologist adjudication of equivocal diagnoses.

For each pathology report, the diagnostic result from the cytotechnologist review was compared with that from the NLP algorithms and assigned one of the following match criteria: “Exact Match” (same diagnostic grade), “Same Risk Match” (same risk level, eg, high risk), “Risk Mismatch” (differing risk category), or “Review” (unassigned NLP diagnosis). When comparing risk match level, high risk was defined as CIN2-CIN3 or higher. We measured the performance of the NLP algorithms by calculating the precision, recall, and F score for the exact CIN match categories and the risk categories. These parameters were calculated as Precision = TP/(TP + FP), Recall = TP/(TP + FN), and F score = 2/(1/Precision + 1/Recall), where TP = true positive, FP = false positive, and FN = false negative.
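
The match criteria and formulas above translate directly into code; a sketch follows. The category labels mirror the text, but the membership of the high-risk set (beyond the stated CIN2-CIN3-or-higher definition) is assumed.

```python
# Sketch of the match criteria and performance metrics defined above.
# The high-risk set follows the CIN2-CIN3-or-higher definition; its
# exact membership is otherwise an assumption.
HIGH_RISK = {"CIN2-CIN3", "CIN3", "AIS", "Cancer"}

def match_category(nlp_dx: str, manual_dx: str) -> str:
    if nlp_dx == "Review":
        return "Review"                    # unassigned NLP diagnosis
    if nlp_dx == manual_dx:
        return "Exact Match"               # same diagnostic grade
    if (nlp_dx in HIGH_RISK) == (manual_dx in HIGH_RISK):
        return "Same Risk Match"           # same risk level
    return "Risk Mismatch"                 # differing risk category

def precision_recall_f(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 / (1 / precision + 1 / recall)  # harmonic mean
    return precision, recall, f_score
```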

Results

Among all the pathology reports, 32 823 of 35 847 (91.6%) final NLP-assigned diagnoses matched the manual validation exactly or at the same risk level (Table 2). A total of 2594 records (7.2%) were not assigned a diagnosis by the NLP algorithms and were categorized as “Review.” Of these, the cytotechnologist review assigned 2578 (99.4%) as low-risk and 16 (0.6%) as high-risk diagnoses. The high-risk diagnoses were as follows: adenocarcinoma in situ (1), CIN2-CIN3 (7), CIN3 (2), endocervical adenocarcinoma (2), microinvasive carcinoma (1), and squamous carcinoma (3). The algorithms' inability to assign a diagnostic outcome was likely due to a lack of specificity in the interpretive text or to difficulty identifying specific combinations of CIN and non-CIN terms that would support a definitive outcome.

Table 2

Match Criteria Between Natural Language Processing and Manual Review


Among the pathology reports assigned a histologic diagnosis by the algorithms (ie, excluding reports assigned “Review”), 31 815 of 33 253 (95.7%) matched the diagnoses assigned at manual review by the cytotechnologist exactly. Among the results with a “Risk Mismatch” (N = 430), 230 (53.5%) were identified by the algorithms as high risk but determined at manual review to be low risk, and 192 (44.6%) were identified by the algorithms as low risk as opposed to a high-risk determination at review (Table 3). The distribution of individual diagnoses coded by the NLP algorithms and at final review among risk mismatch reports is shown in Table 4.

Table 3

Comparison of Risk Based on Final and Natural Language Processing (NLP) Codes Among Risk Mismatches and “Reviews”

Table 4

Distribution of Natural Language Processing (NLP) Code and Final Review Code Among Risk Mismatch Reports

Using the 3 match criteria, the performance of the NLP algorithms was assessed by calculating precision, recall, and F score. Defining true positives as “Exact Matches” (N = 31 815) and false positives as “Same Risk Match” and “Risk Mismatch” results, the precision, recall, and F score were 0.96, 0.93, and 0.94, respectively (Table 5). Performance improved slightly when “Exact Matches” and “Same Risk Match” results were combined as true positives, yielding precision, recall, and F score of 0.99, 0.93, and 0.96, respectively. When records that could not be assigned a diagnosis were excluded, the highest values were obtained: precision of 0.96, recall of 1.00, and F score of 0.98.
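
As a consistency check (our reading of the counts above, with the 1438 assigned-but-not-exact reports, ie, 33 253 minus 31 815, counted as false positives and the 2594 “Review” reports as false negatives): precision = 31 815/(31 815 + 1438) ≈ 0.957, recall = 31 815/(31 815 + 2594) ≈ 0.925, and F score = 2/(1/0.957 + 1/0.925) ≈ 0.94, reproducing the reported values; equivalently, precision_recall_f(31815, 1438, 2594) in the sketch above returns approximately (0.957, 0.925, 0.940).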

Table 5

Precision, Recall, and F Score


In addition to the high level of concordance between the NLP algorithms and the manual validation, the time to assign a diagnostic outcome was drastically reduced. We estimate the time to manually evaluate a pathology report at approximately 2 minutes per report, requiring approximately 30 hours to evaluate 3000 pathology reports per month. The final NLP algorithms can evaluate the same number of reports within 30 minutes, a 98% reduction in time.

Comment

Although cervical pathology reports provide an anatomic diagnosis, failure to include a CIN grade and/or ambiguity within the narrative text can lead to incorrect interpretation by the clinician and, in turn, inappropriate patient management. Furthermore, unstructured text challenges the ability to manually identify and extract the most severe discrete diagnosis to support screening, surveillance, and research efforts. In our study, we created and applied NLP algorithms to more than 35 000 unstructured cervical pathology reports from a large health care organization. The NLP algorithms achieved F scores of 0.94 or greater for both the exact CIN and the risk-based match definitions and drastically decreased the evaluation time for pathology reports.

The development of the algorithms and their performance were evaluated individually and collectively. This approach allowed us to focus on specific CIN categories, such as CIN2-CIN3, the most ambiguous in the textual interpretation of reports. By developing an algorithm for each diagnostic outcome, we gained performance insights, iteratively applying revisions against pathology reports with diagnoses previously assigned by manual review. Upon finalizing the algorithms and applying them to the study sample, we established a high level of concordance.

Unlike other projects evaluating the accuracy of NLP algorithms, in which a training or derivation data set is used to create algorithms that are then applied to a validation or study data set, we developed our algorithms iteratively by using a training data set that contained representative samples of diagnostic outcomes. This iterative process was necessary to account for reports containing both histologic nomenclature and narrative free text. Throughout, our goal was to increase the level of agreement without increasing the risk of misclassification, ensuring that modifications to one algorithm would not affect the diagnostic outcome of another.

Our NLP approach has important implications for both clinical decision-making and research. The new American Society for Colposcopy and Cervical Pathology (ASCCP) Risk-Based Management Consensus Guidelines incorporate current test results as well as previous screening and biopsy results into clinical decision-making. Our algorithms can facilitate the incorporation of these results into the EMR to allow more efficient and accurate risk estimation. This process is meant to supplement and clarify ambiguous pathology results and to assist physician-to-physician communication. With respect to clinical research, adoption of the EMR has generated large amounts of clinical data that can be leveraged to address important research questions with real-world evidence. A major challenge, however, is extracting an actionable diagnosis from narrative text accurately and in a timely manner. Our NLP algorithms address this challenge for cervical histology outcomes, achieving greater than 95% accuracy compared with the gold standard of manual review.

A current limitation of our NLP algorithms was their inability to assign a diagnosis to the 16 pathology reports (0.6% of “Review” records) judged high risk for cervical cancer at manual review, along with misclassifications within the low-risk group. Our goal was to minimize or eliminate these misclassifications and to provide the clinician with the most severe discrete diagnosis, similar to a discrete clinical laboratory result. Although the NLP misclassifications in Table 4 would not be acceptable for clinical management, particularly the benign diagnoses assigned by NLP, the algorithms were designed to minimize such misclassifications (Table 6). An example of a benign NLP outcome for which manual review determined adenocarcinoma in situ is provided in Supplemental Table 3. The NLP algorithms will never be perfect; some misclassifications will occur. Our algorithms are not currently intended to supplant manual review of cervical pathology outcomes but rather to expedite the coding of pathology reports with the most severe diagnosis. At present, clinical management decisions are not based on the NLP algorithms. Our approach demonstrates that NLP can be applied within health care systems that do not use synoptic reporting as an additional tool for reviewing pathology reports and could be extended to additional applications in pathology, as well as to other unstructured text in the EMR.

Table 6

Percentage of Misclassified Natural Language Processing Assigned “Benign” Diagnoses


The lack of clarity in cervical biopsy interpretation potentially impacts patient management, including follow-up and treatment. Our study demonstrates that NLP algorithms can codify an unstructured cervical biopsy report. Until standardized terminology (eg, synoptic reporting) for cervical biopsies is accepted and implemented, NLP algorithms can assist the clinical management of cervical pathology and disease by rapidly identifying a single severity-based discrete outcome when more than 1 biopsy is present, and can enhance the efficiency of clinical research.

The authors gratefully acknowledge Jeffrey Nauss, PhD, and Ryan Jansson, BS, of Linguamatics for their NLP technical assistance and support and Kalpana Vadnagra, BS, of the KPNC Regional Lab Technology Group for assisting with the development of the validation database and processes.

References

1. United States Preventive Services Task Force (USPSTF). Cervical cancer: screening. uspreventiveservicestaskforce.org. Accessed July 30, 2021.
2. Waxman AG, Chelmow D, Darragh TM, Lawson H, Moscicki AB. Revised terminology for cervical histopathology and its implications for management of high-grade squamous intraepithelial lesions of the cervix. Obstet Gynecol. 2012;120(6):1465-1471.
3. Nayar R, Wilbur DC. The Bethesda System for Reporting Cervical Cytology: a historical perspective. Acta Cytol. 2017;61(4-5):359-372.
4. Nuno T, Garcia F. The LAST Project and its implications for clinical care. Obstet Gynecol Clin North Am. 2013;40(2):225-233.
5. Stoler MH, Ronnett BM, Joste NE, Hunt WC, Cuzick J, Wheeler CM; New Mexico HPV Pap Registry Steering Committee. The interpretive variability of cervical biopsies and its relationship to HPV status. Am J Surg Pathol. 2015;39(6):729-736.
6. Darragh TM, Colgan T, Cox JT, et al. The Lower Anogenital Squamous Terminology Standardization Project for HPV-Associated Lesions: background and consensus recommendations from the College of American Pathologists and the American Society for Colposcopy and Cervical Pathology. Arch Pathol Lab Med. 2012;136(10):1266-1297.
7. College of American Pathologists. Resources & publications: cancer protocols. www.cap.org/cancerprotocols. Accessed July 30, 2021.
8. Sheikhalishahi S, Miotto R, Dudley JT, Lavelli A, Rinaldi F, Osmani V. Natural language processing of clinical notes on chronic diseases: systematic review. JMIR Med Inform. 2019;7(2):e12239.
9. Velupillai S, Suominen H, Liakata M, et al. Using clinical natural language processing for health outcomes research: overview and actionable suggestions for future advances. J Biomed Inform. 2018;88:11-19.
10. Ford E, Carroll JA, Smith HE, Scott D, Cassell JA. Extracting information from the text of electronic medical records to improve case detection: a systematic review. J Am Med Inform Assoc. 2016;23(5):1007-1015.
11. Elkin PL, Froehling D, Wahner-Roedler D, et al. NLP-based identification of pneumonia cases from free-text radiological reports. AMIA Annu Symp Proc. 2008;2008:172-176.
12. Si Y, Roberts K. A frame-based NLP system for cancer-related information extraction. AMIA Annu Symp Proc. 2018;2018:1524-1533.
13. Solomon MD, Tabada G, Allen A, Sung SH, Go AS. Large-scale identification of aortic stenosis and its severity using natural language processing on electronic health records. Cardiovasc Digit Health J. 2021;2(3):156-163.
14. Chaudhry R. NLP-enabled Decision Support for Cervical Cancer Screening and Surveillance: Final Report. Digital Healthcare Research (prepared by Mayo Clinic under grant R21 HS022911). Rockville, MD: Agency for Healthcare Research and Quality; 2017.
15. Wagholikar KB, MacLaughlin KL, Henry MR, et al. Clinical decision support with automated text processing for cervical cancer screening. J Am Med Inform Assoc. 2012;19(5):833-839.
16. Wang L, Luo L, Wang Y, Wampfler J, Yang P, Liu H. Natural language processing for populating lung cancer clinical research data. BMC Med Inform Decis Mak. 2019;19(suppl 5):239.
17. Zeng Z, Espino S, Roy A, et al. Using natural language processing and machine learning to identify breast cancer local recurrence. BMC Bioinformatics. 2018;19(suppl 17):498.
18. Moore CR, Farrag A, Ashkin E. Using natural language processing to extract abnormal results from cancer screening reports. J Patient Saf. 2017;13(3):138-143.

Author notes

Supplemental digital content is available for this article at https://meridian.allenpress.com/aplm in the February 2023 table of contents.

The authors have no relevant financial interest in the products or companies described in this article.
