The presence of allogeneic contamination impacts clinical reporting in cancer next-generation sequencing specimens. Although consensus guidelines recommend the identification of contaminating DNA as a part of quality control, implementation of contamination assessment methods in clinical molecular diagnostic laboratories has not been reported in the literature.
Our objective was to develop and implement a method to assess allogeneic contamination in clinical cancer next-generation sequencing specimens.
We describe a method to detect contamination based on the evaluation of single-nucleotide polymorphic sites from tumor-only specimens. We validate this method and apply it to a large cohort of cancer sequencing specimens.
Identification of specimen contamination was validated via in silico and in vitro mixtures, and reference range and reproducibility were established in a panel of normal specimens. The algorithm accurately detected an episode of systemic contamination due to reagent impurity. We prospectively applied this algorithm across 7571 clinical cancer specimens from a targeted next-generation sequencing panel, in which 262 specimens (3.5%) were predicted to be affected by greater than 5% contamination.
Allogeneic contamination can be inferred from intrinsic cancer next-generation sequencing data without paired normal sequencing. The adoption of this approach can be useful as a quality control measure for laboratories performing clinical next-generation sequencing.
Next-generation sequencing technology has enabled an efficient workflow for the broad sequencing of targeted cancer genomes. However, an ongoing challenge is allogeneic contamination, or the contamination of a sequencing specimen with DNA from another individual. The presence of contaminating DNA may decrease the observed allele fraction of variants in the target specimen, potentially leading to decreased sensitivity. Alternatively, pathogenic variants in the contaminating DNA may be identified by variant calling algorithms, leading to the reporting of false-positive results.
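Both effects can be illustrated with simple arithmetic. The sketch below is not part of the described method; it assumes that specimen and contaminant contribute reads in proportion to their DNA fraction, with equal copy number at the site. The function names and the 5% reporting threshold are hypothetical.

```python
def diluted_vaf(true_vaf: float, contamination: float) -> float:
    """Observed VAF of a specimen variant when the contaminant is reference at that site."""
    return true_vaf * (1.0 - contamination)


def contaminant_vaf(source_vaf: float, contamination: float) -> float:
    """Observed VAF of a contaminant-only variant when the specimen is reference at that site."""
    return source_vaf * contamination


# Hypothetical numbers: a somatic variant at 6% VAF in a specimen with 20% contamination.
REPORTING_THRESHOLD = 0.05  # assumed cutoff, for illustration only
observed = diluted_vaf(0.06, 0.20)
print(f"observed VAF {observed:.3f}, reportable: {observed >= REPORTING_THRESHOLD}")

# A heterozygous variant (VAF 0.5) in the contaminating DNA appears at a low allele fraction.
print(f"contaminant variant VAF {contaminant_vaf(0.5, 0.20):.3f}")
```

In this toy case the true variant drops below the assumed threshold (reduced sensitivity), while a variant belonging to another individual appears at a low allele fraction (a potential false positive).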
Allogeneic contamination is particularly important in the sequencing of cancer specimens. Cancer specimens invariably contain an admixture of neoplastic and nonneoplastic DNA, lowering the observed variant allele fractions of actionable somatic mutations. In specimens with low tumor content, the analytic specificity of next-generation sequencing (NGS) analysis for low–allele fraction variants depends on the ability to determine the presence of contaminating DNA sequences. Here we describe a method for detecting allogeneic contamination using intrinsic features of cancer NGS data and population-level assumptions. We validate this strategy using in silico mixing models, apply the algorithm across targeted NGS specimens, and describe evidence of NGS adapter contamination in our laboratory.
MATERIALS AND METHODS
Targeted Cancer Next-Generation Sequencing
Targeted NGS was performed using OncoPanel version 3, a hybrid capture panel of 447 cancer-associated genes.1 DNA from Centre d'Etude du Polymorphism Humain (CEPH) reference individuals was obtained from the Coriell Institute. Tumor sequencing was performed on frozen specimens or formalin-fixed, paraffin-embedded specimens from which unstained slides were macrodissected to enrich for regions with at least 20% tumor nuclei. Library preparation was performed using the KAPA HTP KK8234 kit (Kapa Biosystems Inc, Wilmington, Massachusetts). Dual-indexed DNA fragments were pooled, and hybrid capture was performed using the Agilent SureSelect XT Fast Reagent Kit (Agilent Technologies Inc, Santa Clara, California). Sequencing was performed on an Illumina HiSeq 2500 instrument (Illumina Inc, San Diego, California).
Sequencing data were analyzed using a custom pipeline.1 Reads were mapped to human genome hg19/GRCh37. MuTect version 1.1.4 was used to detect single-nucleotide variants.2 The Single Nucleotide Polymorphism Database (dbSNP) version 135 and COSMIC version 64 were used to filter variants.3,4 Nonlinear regression with least squares fit was performed using a 1-phase decay model in GraphPad Prism version 6. This study was approved by the Partners Human Research Committee.
Description of Algorithm and Rationale
Germ-line polymorphisms were enriched by filtering to include only single-nucleotide variants present in the dbSNP database and absent in the COSMIC database. Variants on chromosome X were excluded. Two values were evaluated: the number of filtered nonreference variants with variant allele fraction equal to 1 (α) and the total number of filtered nonreference variants (β).
The value α represents the number of variants at dbSNP sites with 100% variant allele fraction that do not match the genomic reference. β represents all variants at dbSNP sites that differ from the genomic reference. This metric operates on the premise that contaminating DNA contains single-nucleotide polymorphisms that differ from the patient specimen. In case of contamination, α would be expected to decrease, whereas β would be expected to increase (Figure 1). The α and β values were then modeled to a set of in silico contamination data to derive an estimated percent contamination, as described below.
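The counting step and the expected shift under contamination can be sketched as follows. This is a minimal illustration, not the authors' pipeline code: the `Variant` record, the `mix_vafs` helper, the 2% calling threshold, and the toy genotypes are all assumptions, and mixing is done at the level of allele fractions rather than reads.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Variant:
    chrom: str       # chromosome name, e.g. "1" or "chrX"
    vaf: float       # variant allele fraction, 0..1
    in_dbsnp: bool   # site is present in dbSNP
    in_cosmic: bool  # site is present in COSMIC


def alpha_beta(variants: Iterable[Variant], tol: float = 1e-9) -> Tuple[int, int]:
    """Return (alpha, beta): filtered variants with VAF equal to 1, and all filtered variants."""
    alpha = beta = 0
    for v in variants:
        # Enrich for germ-line polymorphisms: in dbSNP, absent from COSMIC, not on chromosome X.
        if not v.in_dbsnp or v.in_cosmic or v.chrom.lower() in {"x", "chrx"}:
            continue
        if v.vaf <= 0.0:
            continue  # no nonreference allele observed at this site
        beta += 1
        if v.vaf >= 1.0 - tol:
            alpha += 1
    return alpha, beta


def mix_vafs(target: List[float], contaminant: List[float],
             fraction: float, min_vaf: float = 0.02) -> List[float]:
    """Blend per-site VAFs; sites below min_vaf (assumed calling threshold) count as uncalled."""
    mixed = []
    for t, c in zip(target, contaminant):
        vaf = (1.0 - fraction) * t + fraction * c
        mixed.append(vaf if vaf >= min_vaf else 0.0)
    return mixed


def as_variants(vafs: List[float]) -> List[Variant]:
    """Wrap toy VAFs as autosomal dbSNP sites that are absent from COSMIC."""
    return [Variant("1", v, True, False) for v in vafs]


# Toy genotypes at six sites (1.0 homozygous variant, 0.5 heterozygous, 0.0 reference).
target = [1.0, 1.0, 1.0, 0.5, 0.0, 0.0]
contaminant = [1.0, 0.5, 0.0, 0.5, 1.0, 0.5]

pure = alpha_beta(as_variants(target))
mixed = alpha_beta(as_variants(mix_vafs(target, contaminant, fraction=0.10)))
print("unmixed (alpha, beta):", pure)
print("10% mixed (alpha, beta):", mixed)
```

In this toy example, homozygous sites where the contaminant differs fall below a variant allele fraction of 1, lowering α, while sites where only the contaminant carries a variant become newly called, raising β.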
Representative plot of dbSNP variants across a targeted gene panel, with variant allele fractions depicted on the y-axis. A, A specimen without contamination. B, The same specimen on a next-generation sequencing run with known contamination.
Clinical Evaluation of Contaminated Specimens
Specimens flagged as contaminated were assessed by a team of clinical scientists working with a molecular pathologist. The source of contamination was investigated in 2 ways. Patient medical records were reviewed for history of allogeneic organ transplantation, and batch-level sequencing results were reviewed for shared variants with other specimens in the same sequencing run. The pathologist then made a clinical assessment based on the predicted contamination fraction, the tumor purity of the test specimen, and whether the source of contamination could be identified. The pathologist would then choose to report results without modification, results with modification by suppressing low–variant allele fraction events, or no results because of failed quality control. The pathologist could also request to repeat sample DNA extraction and sequencing before making a final assessment.
RESULTS
Establishing Reference Range and Predicting Contamination Using In Silico and In Vitro Mixtures
A, In silico mixtures of Centre d'Etude du Polymorphism Humain reference DNA with 0% to 10% contamination and effect on contamination-associated variable α/β, which is fitted to a statistical model to predict contamination (solid line). B, Contamination prediction applied to unmixed and in vitro mixed specimens. Each marker represents 1 replicate.
In vitro mixture studies were performed using 3 nonneoplastic liver specimens mixed with 20% DNA from unrelated individuals. Both unmixed and mixed specimens were sequenced in triplicate, and the median predicted contamination in the unmixed samples was 0.005% (range, 0.0001%–0.01%). The median predicted contamination in the 20% mixed sample was 22.0% (range, 12.6%–31.1%; Figure 2, B).
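One way to turn an observed α/β ratio into a predicted contamination fraction is to invert a fitted 1-phase decay curve. The sketch below assumes the standard parameterization Y = (Y0 − Plateau)·exp(−K·X) + Plateau, with X as percent contamination and Y as α/β; the parameter values are hypothetical placeholders, not the fitted values from this study, and the clamping at the curve's limits is a choice made for this illustration.

```python
import math


def one_phase_decay(x: float, y0: float, plateau: float, k: float) -> float:
    """Expected alpha/beta ratio at x percent contamination."""
    return (y0 - plateau) * math.exp(-k * x) + plateau


def predict_contamination(ratio: float, y0: float, plateau: float, k: float) -> float:
    """Invert the decay curve to estimate percent contamination from an observed ratio."""
    if ratio >= y0:
        return 0.0        # at or above the uncontaminated ratio
    if ratio <= plateau:
        return math.inf   # beyond the range the curve can resolve
    return -math.log((ratio - plateau) / (y0 - plateau)) / k


# Hypothetical curve parameters, for illustration only.
PARAMS = {"y0": 0.60, "plateau": 0.10, "k": 0.30}

for true_pct in (0.0, 1.0, 5.0, 10.0):
    ratio = one_phase_decay(true_pct, **PARAMS)
    est = predict_contamination(ratio, **PARAMS)
    print(f"true {true_pct:4.1f}%  ratio {ratio:.3f}  predicted {est:.3f}%")
```

Because the curve flattens toward its plateau, small changes in α/β at high contamination correspond to large changes in the estimate, which may contribute to wider spread in predictions for heavily contaminated specimens.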
The reference range was validated using a panel of data from 108 nonneoplastic specimens, which included blood or formalin-fixed, paraffin-embedded blocks with abundant tissue that were at low risk for allogeneic contamination (Figure 3, A). The algorithm consistently predicted less than 1% contamination across these 108 specimens (median, 0.002%; maximum, 0.61%). To assess reproducibility, we analyzed data from a single nonneoplastic liver specimen tested 316 times across 101 sequencing runs (Figure 3, B). Across 316 replicates, this specimen demonstrated a median predicted contamination fraction of 0.0002%. One replicate was an outlier at 19.7% predicted contamination, consistent with an isolated incident of true contamination due to laboratory error. The rate of contamination introduced in the molecular laboratory during library preparation and sequencing was estimated to be 0.3% (1 of 316).
A, Predicted contamination across the panel of 108 nonneoplastic specimens. B, Predicted contamination for 316 replicates of a single nonneoplastic liver control specimen.
Identification of Adapter Contamination
During the development of this algorithm, our laboratory received a new batch of indexed adapters. This new set of reagents was used on a set of 40 CEPH DNA samples, which passed our quality control standards at the time, and subsequently on 16 sequencing runs before we recognized an increased number of shared variants between specimens of the same run. This finding was worrisome for cross contamination. Further analysis determined that the pattern of contamination depended on the position of the specimen on the plate during library preparation, which corresponded to the index bar code linked to patient identity. These findings implicated the new adapters as the source of contamination.
We hypothesized that quality control using a contamination detection method could have prospectively identified the adapter contamination issue. In the evaluation of control CEPH DNA, the contamination algorithm predicted contamination involving specimens in row D (median, 12.0% contamination) compared with the rest of the plate (median, 0.004% contamination; Figure 4, A). Similar findings were observed in 419 clinical specimens across the affected sequencing runs (1 representative plate shown in Figure 4, B). Of specimens that were indexed in row D, the median predicted contamination was 8.3%, and 31 of 50 specimens (62%) were predicted to have greater than 5% contamination (Figure 4, C). Of specimens that were indexed in any other row, the median predicted contamination was 0.006%, and 9 of 369 specimens (2.4%) were predicted to have greater than 5% contamination (Fisher exact test, P < .001; Figure 4, D).
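The row D comparison can be rechecked from the reported counts with a two-sided Fisher exact test. The sketch below is a plain standard-library implementation written for illustration (the function name is hypothetical); it sums the hypergeometric probabilities of all tables no more probable than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over tables that are no more probable than the observed table.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Counts from the text: row D, 31 of 50 above 5%; other rows, 9 of 369 above 5%.
p_value = fisher_exact_two_sided(31, 50 - 31, 9, 369 - 9)
print(f"P = {p_value:.3g}")
```

With these counts the P value falls well below .001, in line with the reported result.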
Predicted contamination with respect to specimen position during library preparation for 40 dual-indexed Centre d'Etude du Polymorphism Humain DNA control samples (A) and a representative run with clinical sequencing samples (B). Adapter impurity results in apparent contamination of specimens in row D. The distribution of contamination is shown in affected sequencing runs, comparing clinical specimens indexed in row D (C) with specimens indexed in any other row (D).
Feasibility in Limited Gene Panels
To assess the feasibility of applying our approach to smaller gene panels, we analyzed data from sequencing runs with adapter contamination, comparing variants from the full 447-gene panel to variants limited to 122 of 125 genes referenced in Vogelstein et al5 or to 50 genes included in the Ion AmpliSeq Cancer Hotspot Panel version 2 (Thermo Fisher Scientific, Waltham, Massachusetts).
Comparison of α/β, the parameter used for contamination estimation, showed linear correlation between variants identified in the full gene panel and limited gene panels (R² = 0.62 for 122 genes, R² = 0.48 for 50 genes; Figure 5, A and B). We then evaluated the ability of the α/β ratio from limited gene panels to detect specimens with greater than 5% contamination as defined by the full panel via receiver-operating characteristic curves (Figure 5, C and D). The limited panels showed an area under the curve of 0.982 for the 122-gene panel and 0.966 for the 50-gene panel. Finally, we plotted the α/β ratio from limited gene panels with respect to the row of the specimen during library preparation (Figure 5, E and F). As expected, specimens prepared in row D showed a lower α/β ratio, corresponding to a higher contamination fraction, compared with other rows both for variants derived from 122 genes and 50 genes (t test P < .001 for both gene sets). These results demonstrate that systemic contamination would have been detectable in cancer NGS data from limited panels with as few as 50 genes.
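For readers who want to reproduce this kind of comparison, the area under the receiver-operating characteristic curve can be computed directly as a rank statistic. The sketch below uses made-up α/β ratios rather than study data; because a lower ratio indicates more contamination, the ratio is negated to form the score. The function name is hypothetical.

```python
from typing import Sequence


def roc_auc(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative (ties count one half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Made-up limited-panel alpha/beta ratios, labeled by the full-panel call (>5% or not).
contaminated_ratios = [0.20, 0.25, 0.31]
clean_ratios = [0.30, 0.45, 0.50, 0.52]

# Lower ratio means more contamination, so use the negated ratio as the score.
auc = roc_auc([-r for r in contaminated_ratios], [-r for r in clean_ratios])
print(f"AUC = {auc:.3f}")
```

The pairwise form is quadratic in the number of specimens, which is acceptable for a few hundred specimens; a rank-based implementation would scale better for larger cohorts.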
Assessment of contamination in limited gene panels shows correlation in contamination-associated variable α/β between the full 447-gene panel and a simulated 122-gene panel (A) and 50-gene panel (B). Receiver operating characteristic curves show the ability of limited gene panels to detect specimens with greater than 5% contamination defined by the full panel, with an area under the curve of 0.982 for the 122-gene panel (C) and 0.966 for the 50-gene panel (D). Adapter contamination involving specimens indexed in row D during library preparation can be identified using only variants from the 122-gene panel (E) or the 50-gene panel (F).
Application to Cancer NGS
We applied the contamination algorithm to a retrospective cohort of 1485 cancer specimens that underwent clinical NGS to assess the rate of allogeneic contamination in our laboratory (Figure 6, A). These specimens demonstrated a median contamination of 0.004%. Greater than 5% contamination was observed in 32 of 1485 specimens (2.2%). We then introduced this method as a routine component of our quality control program for cancer NGS specimens and prospectively applied the algorithm to an additional 7571 consecutive sequencing specimens (Figure 6, B). Greater than 5% contamination was observed in 262 of 7571 specimens (3.5%). Of the specimens with greater than 5% contamination, 12 were explained by a clinical history of allogeneic transplantation. One was a gestational trophoblastic tumor, in which the appearance of contamination was expected because of paternal DNA contribution in tumor cells admixed with maternal-only DNA in nonneoplastic cells. The contaminating source was identified on the same sequencing run in 9 cases (9 of 7571; 0.1%). The source of contamination was not identified in the other 240 cases, likely because of allogeneic contamination from tissue processing prior to arrival in the molecular laboratory. This rate of allogeneic contamination would be compatible with a reported 8% rate of contamination during histologic slide preparation.6
Clinical application of contamination detection method in a retrospective cohort of 1485 cancer next-generation sequencing specimens (A) and a prospective cohort of 7571 specimens (B). *Height of bar is beyond chart limit and is indicated by numeric value.
We further evaluated how the detection of contamination affected reporting in 262 cases with greater than 5% contamination (Figure 7). A total of 147 cases (147 of 7571; 1.9%) were reported without modification. A total of 89 cases (1.2%) were reported with modification by suppressing low–variant allele fraction events. A total of 8 cases (0.1%) were sequenced after repeat DNA extraction, and the results were reported after passing quality control on the second sequencing attempt. There were 18 cases (0.2%) that were considered failed and did not have additional tissue available for resequencing. These outcomes show that although the detection of contamination can impact clinical reporting in a subset of cases, the overall failure rate due to allogeneic contamination is low.
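The reporting outcomes can be tallied as a quick consistency check. The short sketch below uses only the counts stated in the text; the dictionary labels and the helper name are shorthand introduced here.

```python
TOTAL_SPECIMENS = 7571  # prospective cohort size

# Outcomes among specimens with greater than 5% predicted contamination.
outcomes = {
    "reported without modification": 147,
    "reported with low-VAF events suppressed": 89,
    "reported after repeat extraction and sequencing": 8,
    "failed, no tissue for resequencing": 18,
}


def percent_of_total(count: int, total: int = TOTAL_SPECIMENS) -> float:
    """Percentage of the whole cohort, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


flagged = sum(outcomes.values())
print(f"flagged: {flagged} ({percent_of_total(flagged)}%)")
for label, count in outcomes.items():
    print(f"{label}: {count} ({percent_of_total(count)}%)")
```

The four outcome categories sum to the 262 flagged specimens, and each percentage is expressed relative to the full prospective cohort rather than to the flagged subset.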
Clinical evaluation and outcome of 7571 cancer next-generation sequencing specimens with prospective evaluation by the contamination assessment algorithm.
DISCUSSION
Allogeneic DNA contamination is a potential cause of preanalytic error in the clinical reporting of cancer NGS specimens. Contamination may be derived from a variety of sources. In the anatomic pathology laboratory, contaminating tissue may be present in the paraffin block or on the glass slide.6 In the molecular laboratory, contamination can occur during the complex process of multiplexed library preparation, during either manual or automated pipetting, or it can result from reagent impurity, such as the adapter contamination issue in our laboratory. Some massively parallel sequencing technologies are prone to index switching and cross-library contamination.7
The Association for Molecular Pathology and the College of American Pathologists have recognized the value of identifying sample contamination as a part of a consensus guideline for NGS bioinformatics validation:
Identification of possible sample contamination or cross contamination is another quality metric that is particularly valuable for germline tests of medium- to large-sized panels. For example, one can look for low-frequency alleles at otherwise homozygous variants, particularly benign SNPs. Although these methods can be adapted to somatic tests, establishing criteria is more difficult in this setting because sample heterogeneity is expected and test targets can be small.8
The most widespread tools used for contamination assessment, ContEst and VerifyBamID, use statistical approaches to model contamination using single-nucleotide polymorphisms.9,10 Both tools are designed to be paired with microarray data, and in the absence of known polymorphisms of an individual, these methods do not account for the loss of heterozygosity frequently observed in cancer specimens. Conpair is designed to detect contamination in tumor specimens but requires matched normal sequencing information, which is not universally available in clinical cancer sequencing protocols.11
Our method assumes a reference distribution of single-nucleotide polymorphisms and models expected changes to this distribution in the setting of allogeneic contamination. The algorithm uses variant calls from a standard informatics pipeline, allowing for easy implementation. Because all clinical tumor specimens include a mixture of neoplastic and nonneoplastic cells, somatic loss of heterozygosity is not expected to lead to an increase or decrease in the total count of germ-line variants identified in a specimen. Therefore, the calculated contamination estimate is independent of somatic loss of heterozygosity. The development of a methodology that is applicable to tumor-only sequencing specimens is particularly valuable because most molecular laboratories do not use paired normal specimens for clinical sequencing because of cost and logistic constraints. We validated this method across a variety of samples on our targeted NGS platform, including CEPH DNA with in silico mixtures, in vitro DNA mixtures, and a panel of normal specimens. Finally, we used this method to help resolve an adapter contamination issue involving clinical NGS runs and applied this method to estimate contamination rates in a large cohort of clinical NGS specimens. To our knowledge, this is the first description of a quality control program that includes a prospective assessment of allogeneic contamination in cancer NGS specimens.
We have implemented this contamination assessment method as a key component of the overall quality control program for our cancer NGS assay. In addition to evaluating individual specimens, we monitor trends in the rate of contamination over time to ensure the quality of laboratory reagents and standard operating procedures. In our experience, routine assessment for allogeneic contamination as part of a comprehensive quality control program has led to increased confidence in our laboratory's ability to interpret complex NGS results for oncology patient care.
Author notes
The authors have no relevant financial interest in the products or companies described in this article.