Next-generation sequencing performed in a clinical environment must meet clinical standards, which requires reproducibility of all aspects of the testing. Clinical-grade genomic databases (CGGDs) are required to classify a variant and to assist in the professional interpretation of clinical next-generation sequencing. Applying quality laboratory standards to the reference databases used for sequence-variant interpretation presents a new challenge for validation and curation.
This article aims to define CGGDs and the categories of information they contain and to frame recommendations for the structure and use of these databases in clinical patient care.
Members of the College of American Pathologists Personalized Health Care Committee reviewed the literature and existing state of genomic databases and developed a framework for guiding CGGD development in the future.
Clinical-grade genomic databases may provide different types of information. This work group defined 3 layers of information in CGGDs: clinical genomic variant repositories, genomic medical data repositories, and genomic medicine evidence databases. The layers are differentiated by the types of genomic and medical information contained and the utility in assisting with clinical interpretation of genomic variants. Clinical-grade genomic databases must meet specific standards regarding submission, curation, and retrieval of data, as well as the maintenance of privacy and security.
These organizing principles for CGGDs should serve as a foundation for future development of specific standards that support the use of such databases for patient care.
Next-generation sequencing (NGS) technology is now affordable for clinical laboratories, and many are implementing NGS tests for patient care. Because of major differences in how NGS data are produced and analyzed, compared with other laboratory tests, pathologists and other laboratory professionals are faced with a new set of challenges in analyzing, interpreting, and reporting NGS test results.1,2 A fundamental requirement of clinical laboratory testing is reproducibility and accuracy of results within and among laboratories. Next-generation sequencing performed in a clinical environment must meet this same reproducibility standard for all aspects of testing, including generation of raw NGS sequence data, data analysis using multiple bioinformatics software packages to align sequence reads and to detect sequence variants (ie, bioinformatics pipelines), and final clinical interpretation. The final interpretation of the clinical relevance of a patient's NGS test results should be based on the highest possible levels of medical evidence according to standards for evaluating such evidence.3 Similarly, use of standard reporting elements, including use of standard nomenclature to describe and categorize variants, is critical to avoiding clinically significant omissions, to eliminating confusion with other variants, and to ensuring data integrity.4–6
Unlike data produced in other areas of the laboratory, NGS data go through iterations of analysis using multiple software packages to transition from raw-sequence data to a final report. This so-called bioinformatics pipeline uses algorithms to align multiple copies of overlapping raw sequences to a human reference sequence and then uses other algorithms to detect where the patient's DNA differs from the reference sequence. These tools, if improperly designed or used, can introduce errors into the analysis.1,7–10 One major problem with the use of multiple bioinformatics software packages is the differences in sensitivities and specificities for detection of different types of DNA sequence variants.1,11,12 Types of sequence variants that can be detected by NGS are listed in Table 1.
Beyond the variability of the bioinformatics pipeline, the interpretation of the significance of a specific variant can be unique to the laboratory that performed the testing (ie, nonreproducible among laboratories). Variability in interpretation for sequence variants is due, in part, to the lack of professionally curated information to support clinical decision making, combined with the amount of information typically generated by such analyses. Without database support, a single pathologist or other laboratory professional cannot understand the significance of all possible variants that could be generated. Currently, investigation of multiple databases is required to assess the potential significance of even one sequence variant, and that is a cumbersome, time-consuming, and increasingly unfeasible process because the scope of NGS testing continues to increase in the clinical environment.13–17 Adding to that complexity, not all databases contain accurate information, and a single database may have variability in the quality of its information for different variants. For example, laboratories that fail to follow strict quality control and quality assurance practices may submit inaccurate sequences to the databases and, therefore, may confound genotype-phenotype correlations and the interpretation of the clinical significance of specific variants.18 “Clinical grade” databases (that is, databases of NGS results generated under clinical quality standards, which can be used to identify which variants confer risk of disease, to guide diagnosis, to predict prognosis, and/or to indicate a potential therapeutic target) are needed for broad and effective clinical use of NGS in clinical laboratories. Lack of clinical grade, evidence-based databases poses risks to patient care because use of less-than-sufficient or inaccurate evidence may lead to interpretation error and patient harm.
Several groups have looked at standards for genomic variation databases. However, those groups have not focused on the issues of reproducibility, quality, and clinical laboratory standards for the data being submitted, issues that are required for clinical patient care.19–22 Standards for data submission, data curation, and data retrieval have been proposed, but those standards have generally focused on research use. Quality control by the laboratory generating and analyzing the data has received little attention in the literature to date, despite laboratories being required to ensure that the data are correct throughout the process, from specimen collection to data submission to later retrieval.1,11
To help address these issues, the College of American Pathologists (Northfield, Illinois) introduced an NGS testing section to the Molecular Pathology Checklist in the College of American Pathologists Laboratory Accreditation Program.7 The College of American Pathologists developed NGS clinical laboratory accreditation requirements, and several groups are working on reference standards and proficiency testing materials. The US National Institute of Standards and Technology (Gaithersburg, Maryland)23 is developing NGS reference standards. Horizon Discovery (Cambridge, United Kingdom), AcroMetrix (Life Technologies, Benicia, California), the College of American Pathologists, and others have developed or are developing proficiency testing modules or controls that will assess the reproducibility in variant detection among laboratories performing clinical NGS testing. These proficiency test materials will assess both the data-generation components of NGS tests and the bioinformatics pipelines that are used to align the sequence reads and to detect the sequence variants. To achieve a clinical grade database, data quality from the laboratory that initially reports a variant must be high and must follow standards for data submission, retrieval, and curation. Phenotype data also require standardization in terminology and completeness of the observations submitted.24 Additionally, data security and individual privacy must be addressed for all aspects of the data submitted to the database (see the special section on “Data Security and Privacy” below).
Therefore, a work group of the College of American Pathologists Personalized Health Care Committee examined challenges specifically related to evidence-based resources available to assist with the interpretation of hereditary and acquired sequence variants in the clinical setting. Early in the process, the work group realized that the definition of a clinical grade genomic database (CGGD) needed to be standardized and categorized by function. This is because CGGDs may provide different types of information about and surrounding the variants they describe. The clinical utility of the database is driven, in part, by the type of information contained therein. Therefore, the purpose of this article is to describe the definition of a CGGD as well as the various categories of information that may be included within it. In addition, this article frames recommendations for the structure and use of such databases in the clinical patient-care setting. These organizing principles for CGGDs should serve as a foundation for future development of specific standards that support the use of such databases for patient care.
CLINICAL GRADE GENOMIC DATABASES
A CGGD is a clinical decision-support tool that can be used in the interpretation of human sequence variants for clinical use. Clinical decision-support tools provide evidence and support for decision making, but they do not mandate or require decisions. The final interpretation is dependent on the specific patient for whom testing was performed and the pathologist or other laboratory professionals examining the case, as well as clinical discussions with other health care providers. For a database to be used as a CGGD, the database must contain sequences and/or variants that have been produced from human samples in a laboratory that meets clinical quality standards for the analysis that generates the sequence and/or the variant (the so-called high-quality human sequence/variant [HQHSV]). In the United States, a laboratory certified under the Clinical Laboratory Improvement Amendments of 1988 (CLIA) and accredited by CLIA or a CLIA-deemed organization for high-complexity testing meets high clinical quality standards.25 Similar standards in other countries may include the International Organization for Standardization (Geneva, Switzerland)26 or the UK National External Quality Assessment Service (Sheffield, United Kingdom).27 A noncertified laboratory can meet the same quality standards but would need a mechanism to document and define their quality standards to ensure that their data are generated under the same quality standards as certified and accredited clinical laboratories. Regardless, applying those standards to reference databases for sequence variants presents a new challenge for validation and curation. If sequence data from laboratories not meeting these standards are included in a CGGD, then a mechanism for identifying the data generated under clinical standards must be available in the search criteria, as well as the determination of whether recommendations are made based on clinical quality data only or on all data.
To achieve practical utility, CGGDs should be easy to access. Given the cost to develop and maintain such databases, a fee may be required for access and use. The CGGDs may be developed by an institution or by a group of institutions, a commercial enterprise, the government, or through a public-private partnership. Currently, multiple databases are cross referenced for interpretation of NGS data. Use of multiple databases for interpretation of clinical case material presents certain challenges as well as opportunities. Even if a database contains HQHSVs and medical evidence of the highest quality, it is possible for different databases to have disparate information for the same variant. In addition, some databases may have sequences and variants from selected populations, which may not be representative of the reference population for the clinical case being interpreted. Understandably, international efforts at development and integration of CGGDs may yield the most-complete variant database from a “population” perspective.28 However, although different databases present challenges, competition among databases may ultimately lead to better products. Some databases may focus on sequence and variants in particular areas of the genome or in reference to a particular disease or group of diseases. These smaller, less-comprehensive efforts may provide in-depth knowledge for the area of interest.
Therefore, in a CGGD, the breadth of the human genome covered (eg, genome, exome, gene, or single variant) may vary, provided that the database contains HQHSVs. In addition, the types of information available in a CGGD may also vary. Specifically, some databases are simply repositories of human sequence and variants, whereas other databases contain large amounts of sound medical evidence related to the clinical significance of variants. In discussing the latter, the work group identified 3 different classes of information for CGGDs. We chose to call these different classes “layers,” because their value content is applied differently in interpretation of clinical case material.
Layers of CGGDs
The work group defined 3 layers of information for CGGDs: clinical genomic variant repositories (CGVRs), genomic medical data repositories (GMDRs), and genomic medicine evidence databases (GMEDs). The layers are differentiated by the types of genomic and medical information contained and the utility in assisting with clinical interpretation of genomic variants.
To be designated a CGGD, the database must contain information from at least one of the layers that meets the criteria described below. However, a single database may include information from 2 or all 3 layers. In most database constructs, the types of information are additive in sequential order, but that may not always be the case because layer-3 GMED databases may contain little if any genomic sequence data. The names and the general proportion of sequence data are given in Table 2 and are shown graphically in Figure 1. The types of information needed for clinical interpretation are heavily dependent on the type of tissue being analyzed, the population being studied, and the clinical questions being asked. As such, CGGDs are intended to be high-quality clinical tools for assisting with interpretation of genomic test results in the context of these factors and the clinical judgment of the pathologist or other laboratory professional.
Layer 1: CGVRs
Clinical genomic variant repository databases are aggregations of sequence data, which may range from raw sequence reads to a list of identified sequence variants (compared with a reference sequence). The CGVRs must contain HQHSVs. The work group could not agree on whether CGVRs could contain nonhuman sequences or sequences not generated using clinical quality standards, so our compromise position was as follows: Some CGVRs may contain sequences from nonhumans or human sequences from noncertified laboratories, and there is no doubt that sequence information from nonhuman model systems (eg, yeast and mice) and from basic science investigations of human disease (eg, cell lines) can provide extremely useful information regarding the potential effect of a sequence change. Nonetheless, a CGVR must have functionality that allows end users to restrict their examination of sequences to HQHSVs only.
Although CGVRs are generally of limited use in clinical interpretation, they may be used to identify the frequency of a given variant in the human population (minor allele frequency) or to provide insight into the amount of data supporting identification of a novel and/or pathogenic sequence variant. The CGVRs that are restricted to sequence information do not have associated clinical information about the patients from whom the HQHSVs were derived, so the minor allele frequency may not be representative of the population of interest or of the population as a whole. In addition, they do not have information on the clinical significance of the HQHSV. Similarly, CGVRs can be helpful in determining whether a variant has been previously described, but the clinical implications of the variant (benign versus pathogenic, hereditary versus acquired) can only be determined in the context of other information (as illustrated in Figure 2). Examples of existing databases that contain the types of information found in a layer-1 database include the 1000 Genomes Project and the Exome Sequencing Project (US National Heart, Lung, and Blood Institute, Bethesda, Maryland).17,29,30 However, these databases lack assurance regarding the quality under which sequences were generated and, therefore, are not currently consistent with a CGGD.
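As a rough illustration of how a layer-1 repository might report the frequency of a variant, the following Python sketch counts variant alleles among the samples in which a genotype was called. The function name and the input layout (one variant-allele count per sample) are assumptions made for this example and are not taken from any existing database.

```python
from typing import Iterable, Optional


def allele_frequency(alt_allele_counts: Iterable[Optional[int]], ploidy: int = 2) -> Optional[float]:
    """Fraction of called alleles that carry the variant.

    Each element is the number of variant alleles observed in one sample
    (0, 1, or 2 at a diploid locus), or None when no genotype was called.
    Uncalled samples are excluded from both numerator and denominator.
    """
    called = [count for count in alt_allele_counts if count is not None]
    if not called:
        return None  # no genotyped samples, so no frequency can be reported
    return sum(called) / (ploidy * len(called))


# Four called diploid samples (one heterozygote, one homozygote) and one no-call.
frequency = allele_frequency([0, 1, 2, 0, None])
```

Here 3 variant alleles are observed among 8 called alleles. Because a repository limited to sequence data knows nothing about who the sampled individuals were, the resulting number describes only the submitted samples, which is the limitation noted above.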
Layer 2: GMDRs
Genomic medical data repository databases are CGVRs plus associated clinical information of significance to genomic test interpretation. These include, but are not limited to, the race, ethnicity, phenotype, and diagnoses for the patient from whom the specimen was collected. In addition, information regarding the specimen source and associated findings (presence of tumor, percentage of tumor, tumor type, histologic grade, confirmed somatic, among others) is present. Clinical information may include treatment and outcome data, if available and submitted. Outcome data for an individual patient's sequence may be included in GMDRs, but that requires the information to be updated as the patient's condition and treatment changes for it to be useful. The GMDRs are useful for generating allele frequencies in specific populations (such as patients with breast cancer or hereditary ataxia) and across populations. However, GMDRs remain largely descriptive databases because synthesis of variant occurrence and outcomes across patients with the same variant or variants is not included (see Figure 2). An example database that includes the type of information contained in both layers 1 and 2 is the Catalogue of Somatic Mutations in Cancer (COSMIC; Wellcome Trust Sanger Institute, Hinxton, Cambridge, England) database, but this database currently lacks assurances regarding the quality under which the sequences were generated and the certainty of the associated findings and is, therefore, not consistent with a GMDR.31,32
Layer 3: GMEDs
Genomic medicine evidence databases build on the variant and clinical information from the first 2 layers but are distinct because they only house information related to variants whose clinical significance has been proven in a sound and reproducible, scientific manner. These databases describe the classification, disease-association, or therapy association for a given variant, with citations of the evidence supporting that information. The content of layers 1 and 2 serves as a foundation upon which the medical evidence for layer 3 is either derived or built. Such medical evidence may include functional genomic studies and clinical trial reports. Layer-3 databases may also include data on successful treatment of lesions associated with specific variants. Of all the layers of databases, a GMED is the most useful clinical decision-support tool for laboratory professionals and pathologists in the interpretation of NGS results (Figure 2). ClinVar (US National Center for Biotechnology Information, Bethesda, Maryland) is an example of a database that includes information from all 3 layers. However, ClinVar currently lacks assurances regarding the quality under which sequences were generated, and the standards used for the medical evidence of clinical usefulness are not yet defined.13,33,34 Although it remains a useful tool, it cannot currently be classified as a GMED under the requirements listed above.
To different degrees, each database layer is useful for the interpretation of a sequence variant for a specific patient. However, all layers, even GMEDs, must be used in the context of the patient being examined, along with other evidence, professional standards, practice guidelines, and clinical judgment for accurate interpretation.
RECOMMENDED STANDARDS FOR EACH CGGD LAYER
Each layer of a CGGD must meet specific standards. Most standards will apply to all 3 layers of a CGGD, whereas other standards are unique to 1 or 2 layers. Standards are categorized by activity, and activities include submission, curation, and retrieval of data, as well as implementation and maintenance of privacy and security. Within the submission, curation, and retrieval activities, there are shared subcategories, which include authorization, selection, transmission, verification, annotation, attestation, notification, and publication. Implementation and maintenance of privacy and security are activities performed by database programmers and administrators rather than end users, so those subcategories do not apply.
Standards for submission must describe who is allowed to submit data to the CGGD, what data to submit, the format of the data, and the transfer of data, and they must ensure the accuracy of the data after transfer (Table 3). Standards for curating the database include the selection, organization, and maintenance of the data within the database over time, including how frequently data are reviewed and updated, and the tracking of that information (ie, versions of the database) (Table 4). Submissions should be held to a scheduled review, with e-mail reminders or other forms of communication. Standards for data retrieval and use include determining who can access the database, the process for database queries, the tools for filtering query results, the format of the data for viewing, and the transfer of data (Table 5). Data security and privacy issues include patient consent, controls regarding access to data, and the security of the data at rest and during transmission.
The standards that are layer specific include how and what clinical or phenotypic data to submit (GMDRs and GMEDs), the definition of variant classification (GMEDs), and the levels of evidence used to determine classification, disease-association, or therapy association (GMEDs).
Recommended Standards Common to All CGGDs
The standards in this section are discussed in the context of a CGVR (layer 1) but are applicable to GMDRs (layer 2) and GMEDs (layer 3) as well. No standards are unique to CGVRs (layer 1).
Standards for data submission include what data to submit, who is allowed to submit data, the quality of the data, the format of the data, and the transfer of data. Submitted data include the sequencing data and metadata. Sequencing data could include raw sequencing reads (eg, FASTQ-formatted files), sequence data after alignment with quality metrics (eg, binary alignment map [BAM] files), or identified variants compared with a specified reference sequence or genome (eg, variant call format [VCF] files) (Table 6). If only identified variants are submitted, all variants identified (including benign variants, variants of unknown significance, and presumed pathogenic variants) should be included in the database so that population-based information is more readily available. The genome reference sequence must also be specified.
There are pros and cons to the different options above. For example, raw-sequencing reads require large amounts of storage but allow reanalysis by others using different bioinformatics pipelines. Because clinical laboratories do not have a gold standard for actual sequence, reanalysis allows research on the comparison of different bioinformatics pipelines for assessment of different types of variants.
Metadata provides information about one or more aspects of the data, such as who, what, when, where, and how data were generated and the quality standards used by the laboratory generating the data. Metadata regarding the specimen, the laboratory testing process, and bioinformatics process should be included in all CGGDs. Metadata about the specimen may include specimen type (blood, prostate, kidney, lung, and other types), specimen preparation (fresh; frozen; formalin-fixed, paraffin-embedded, among others), the species from which the specimen was obtained, the time between collection and fixation or processing, and the percentage of reads with a variant. Metadata about the testing process may include extraction methods, library preparation methods, target-enrichment methodology (if used), instrument platform, instrument software version, quality scores, depth of coverage (total and variant), date of sequencing and analysis, and the submitting institution or laboratory identifier with contact information. Bioinformatics metadata includes the algorithm(s) used, the filters applied, and the reference genome. Each database will have to make decisions on who is allowed to submit data and exactly what data are submitted; that information should be transparent. Contact information for the submitter is useful if the users of the database need to contact the submitter with questions regarding specific data. Inclusion of data lacking one or more required annotations is not recommended because that would compromise the value of the database for clinical use.
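The recommendation against accepting data that lack required annotations can be sketched as a simple completeness check at submission time. The Python example below is only illustrative: the class name, the function name, and the particular subset of fields treated as required are assumptions for this sketch, and each database would define its own list.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class SubmissionMetadata:
    """Hypothetical subset of the specimen, testing, and bioinformatics metadata."""
    specimen_type: Optional[str] = None
    specimen_preparation: Optional[str] = None
    species: Optional[str] = None
    instrument_platform: Optional[str] = None
    depth_of_coverage: Optional[int] = None
    reference_genome: Optional[str] = None
    submitter_contact: Optional[str] = None


def missing_annotations(record: SubmissionMetadata) -> List[str]:
    """Return the names of fields that were left empty, in declaration order."""
    return [f.name for f in fields(record) if getattr(record, f.name) in (None, "")]


# All values below are placeholders chosen for illustration.
partial = SubmissionMetadata(specimen_type="blood", species="Homo sapiens")
problems = missing_annotations(partial)
```

A database could decline any submission for which the returned list is not empty and report the missing field names back to the submitter.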
Different aspects of data quality and integrity to be considered in CGVRs include the technical quality, the variant call quality, and minimizing data input/transfer errors. Standardized data submission is important to ensure the quality of the data in the database and to avoid errors during the transfer of data. Standards for data quality should include not only the quality data for the sequencing itself but also some assurance that there has been quality control during the entire process from patient identification and specimen collection to DNA extraction to bioinformatics analysis. Performance of the testing in a high-complexity, clinically certified laboratory is one mechanism for ensuring minimal quality standards for the entire testing process. Laboratories without clinical accreditation/certification would need a process to demonstrate equivalent validation, quality management, and test-performance standards, with a mechanism for noting that in the database. Alternatively, a database might include that information as a field that can be filtered (see section on “Data Retrieval”).
Data should be in a standard format for submission to a database. Variant calls should use standard nomenclature, including the official HUGO Gene Nomenclature Committee (HGNC, European Bioinformatics Institute, Hinxton, Cambridge, England) gene names, standardized variant nomenclature (eg, http://www.hgvs.org/mutnomen/),35 and unambiguous location coordinates (either chromosomal coordinates and/or complementary DNA [cDNA] coordinates with a reference transcript).36,37 Such metadata should have a defined format.38
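To make the idea of a defined format concrete, the following Python sketch stores a variant call as a gene symbol, a versioned reference transcript, and a cDNA-level description. The class, its field names, and the pattern are assumptions for illustration only; the pattern recognizes simple single-nucleotide substitutions and is not a validator for the full variant nomenclature. The example values describe the well-known BRAF c.1799T>A substitution.

```python
import re
from dataclasses import dataclass

# Matches only simple coding-DNA substitutions such as "c.1799T>A".
SIMPLE_SUBSTITUTION = re.compile(r"^c\.\d+[ACGT]>[ACGT]$")


@dataclass(frozen=True)
class VariantCall:
    gene_symbol: str           # official gene symbol
    reference_transcript: str  # versioned transcript accession
    cdna_change: str           # cDNA-level description of the change

    def is_simple_substitution(self) -> bool:
        """True when the cDNA description matches the narrow pattern above."""
        return SIMPLE_SUBSTITUTION.match(self.cdna_change) is not None

    def label(self) -> str:
        """Combine transcript, gene, and change into one unambiguous string."""
        return f"{self.reference_transcript}({self.gene_symbol}):{self.cdna_change}"


call = VariantCall("BRAF", "NM_004333.4", "c.1799T>A")
```

Keeping the transcript version in the record matters because cDNA coordinates are meaningful only relative to a specific reference transcript.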
Use of automated data transfer is preferred to eliminate errors from manual entry. Automated data transfer will require standard messaging formats and interoperable communication systems. These standard messaging formats are still in their infancy and are not completely developed. Even with automated transfer, some metadata may still need to be entered manually. Who submits or annotates those data and ensures the accuracy of the manually entered data fields is important. Trained annotators generally provide higher quality and more-uniform data with fewer mistakes,39 and clinical laboratories that use manual entry most often have a second check of manually entered information before release to ensure accuracy and to reduce errors. Quick, easy, and automated data transfer will improve submission rates, although incentives for data submission may be required.40
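One common way to confirm that a file arrived unchanged after automated transfer is to compare a cryptographic digest computed by the submitter with one computed by the receiving database. The Python sketch below uses SHA-256 from the standard library; the function names and the sample payload are hypothetical, and the choice of SHA-256 is an assumption of this example rather than a requirement stated in this article.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Hexadecimal SHA-256 digest of the submitted bytes."""
    return hashlib.sha256(payload).hexdigest()


def transfer_is_intact(received: bytes, expected_digest: str) -> bool:
    """Compare the digest of the received bytes with the submitter's digest."""
    return sha256_hex(received) == expected_digest


# The submitter computes the digest before sending; the database recomputes it on receipt.
submitted = b"example variant file contents"
digest_from_submitter = sha256_hex(submitted)
intact = transfer_is_intact(submitted, digest_from_submitter)
```

A mismatch indicates that the received file differs from what was sent and should prompt resubmission rather than loading of the data.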
Curation of information for all CGGD databases ensures that information is complete and accurate at the time of submission. Curators also monitor changes to data and correct errors. Mechanisms for feedback to the database managers regarding errors or other problems identified by users accessing the database should be established. Given the size of these databases, some automation of this process will likely be required.37 Lastly, updates need to be tracked, and each version should have a publication or certification date.
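Tracking updates so that each version carries a publication date could be modeled as an append-only history per entry, as in the Python sketch below. The class names, fields, classifications, and dates are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass(frozen=True)
class EntryVersion:
    version: int
    classification: str
    published: date
    note: str


@dataclass
class CuratedEntry:
    variant: str
    history: List[EntryVersion] = field(default_factory=list)

    def revise(self, classification: str, published: date, note: str) -> EntryVersion:
        """Append a new version; earlier versions are never overwritten."""
        new_version = EntryVersion(len(self.history) + 1, classification, published, note)
        self.history.append(new_version)
        return new_version

    def current(self) -> EntryVersion:
        """Most recent version (raises IndexError if nothing has been published)."""
        return self.history[-1]


entry = CuratedEntry("example variant")
entry.revise("variant of unknown significance", date(2015, 1, 15), "initial submission")
entry.revise("presumed pathogenic", date(2015, 6, 1), "curator review of new evidence")
```

Because earlier versions are retained, a laboratory can determine which classification was in effect on the date a report was issued.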
Data retrieval should use standard formats. If databases contain data from nonhuman species, those data should be able to be filtered for human entries. Likewise, if a database contains data both from laboratories that have used stringent validation and quality practices (certification to perform high-complexity testing in a clinical patient-care environment) and from laboratories that have not, the database should allow an end user to filter the data to only HQHSVs. The ability to query by either gene or variant will be necessary. A Web-based interface would be optimal. Once relevant data are retrieved, the display and manipulation of the data will need to be user friendly and relatively intuitive. Tools for data analysis or manipulation may be part of the database, or the format of the retrieved data may be compatible with third-party tools. Having a standardized format that allows interaction with other databases or tools is also desirable.20,39 The data should also be displayed in a format that is easy to interpret and to visualize. The database should track database queries, and access to that information by database users may be useful.
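The retrieval requirements described here (query by gene or by variant, with the option to restrict results to HQHSVs) can be sketched in a few lines of Python. The record layout, the `clinical_grade` flag, and the `query` function are assumptions made for illustration and do not describe an existing interface.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class RepositoryEntry:
    gene: str
    variant: str
    species: str
    clinical_grade: bool  # True when generated under clinical quality standards


def query(entries: Iterable[RepositoryEntry],
          gene: Optional[str] = None,
          variant: Optional[str] = None,
          hqhsv_only: bool = True) -> List[RepositoryEntry]:
    """Filter by gene and/or variant; by default keep only human, clinical-grade entries."""
    results = []
    for entry in entries:
        if hqhsv_only and not (entry.species == "Homo sapiens" and entry.clinical_grade):
            continue
        if gene is not None and entry.gene != gene:
            continue
        if variant is not None and entry.variant != variant:
            continue
        results.append(entry)
    return results


# Placeholder entries: one clinical-grade human record, one research-grade, one nonhuman.
entries = [
    RepositoryEntry("GENE1", "c.100A>G", "Homo sapiens", True),
    RepositoryEntry("GENE1", "c.100A>G", "Homo sapiens", False),
    RepositoryEntry("GENE1", "c.100A>G", "Mus musculus", False),
]
clinical_hits = query(entries, gene="GENE1")
all_hits = query(entries, gene="GENE1", hqhsv_only=False)
```

Defaulting the filter to HQHSVs only means that a user must deliberately opt in to seeing nonhuman or noncertified data, which fits the clinical purpose of the database.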
Data Security and Privacy
Data security and privacy are applicable to all 3 layers of CGGDs. There are no security or privacy standards that are unique to any of the layers; therefore, security and privacy will only be discussed in this section.
By definition, CGGDs are repositories of human genetic information. During the past 10 years, the US Federal Government has enacted increasingly protective laws for the privacy and security of health information, especially genetic information. A list of the major US federal statutes regarding privacy and security of health information is provided in Table 7, along with a brief description of the pertinent information related to genetic data. Many current databases containing genetic/genomic information were developed without realizing the requirements for compliance with federal law in the United States or other countries. The collection, manipulation, management, storage, and retrieval of sensitive and individually identifiable health information must comply with applicable privacy laws at the national, regional, and local levels. Given the cloud-based nature of many of these databases, both users and managers of the database must be aware of the country in which the data are housed and ensure that its location is compliant with applicable regulations. For example, if protected health information from US patients is stored in non-US databases, those databases have no requirement to comply with US privacy and security laws. In that situation, the individual or entity who submits the data to a noncompliant database may be legally liable for security breaches under the US Health Information Technology for Economic and Clinical Health Act of 2009 (Title XIII of Pub L 111-5) and the US Health Insurance Portability and Accountability Act of 1996 (HIPAA, Pub L 104–191). Conversely, if sensitive and protected information on a patient from a non-US country is stored in a database located within the United States, those data are subject to the US Patriot Act of 2001 (Pub L 107–56).41 Access to those data by representatives of the US federal government may violate the privacy laws of the country in which the specimen was collected and analyzed.
Human genetic information as defined by the Genetic Information Nondiscrimination Act of 2008 (GINA, Pub L 110–233; see Table 7) should only be submitted to a CGGD if the laboratory has informed consent from the patient. Although it is certainly true that short sequences containing commonly identified variants are not individually identifiable, those sequences and variants are included in the definition of genetic information under GINA. The HIPAA Omnibus Rule (effective date: September 23, 2013) specifically references the definition of genetic information under GINA when it updated the definition of protected health information to include genetic information. No clarification on this significant conundrum for both patient care and research has been provided by the federal government, and laboratories should consult with an attorney before submitting any sequence or other data from a human patient to a database without informed consent. To support the requirement for informed consent before submission of genetic information, a CGGD must have a place for a laboratory to attest that the laboratory has obtained informed consent from the patient. This work group does not recommend the actual consent be uploaded to the CGGD because that would release unnecessary additional patient identifiers (name, date of birth, and other such information) and violate the minimum necessary rule under HIPAA.
In addition to informed consent, many other privacy and security measures exist and require compliance by the managers of a CGGD. In the United States, the CGGD must comply with the Final Security Rule of HIPAA. For an extensive discussion of the requirements of the Final Security Rule of HIPAA, the reader is referred to several references.42,43 This rule requires administrative (assessments, policies, and procedures), physical (hardware), and technical (software) safeguards for protected health information. Databases in the cloud that contain protected health information (including genetic information as defined by GINA) must satisfy the same requirements as data that are not in the cloud, specifically regarding the HIPAA Final Security Rule. Many cloud services, for example, state that they are compliant with HIPAA, but that compliance typically extends only to some of the administrative and most, if not all, of the physical safeguards. Those cloud services cannot comply with the technical safeguards if they do not support or maintain the actual software being used by the end user. Software used to deploy CGGDs must enable controls for person authentication, access, auditing, data integrity, and transmission security. Person authentication means that the CGGD has confirmed the identity of the user who desires access to the database before granting logon. Access controls require that the individual user have a unique user identifier and password, as well as automatic logoff. Authorization of the individual to view the data has been previously described in the submission, curation, and retrieval activities but is also included among the various privacy and security measures that must be implemented. Audit controls must be enabled so that events in which software users view or alter an individual patient's protected health information (in the case of a CGGD, genetic or medical information) are recorded.
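As a minimal sketch of the audit and automatic logoff controls described above, the following Python example records every view or alteration of a record together with the user identifier and a timestamp. The class and function names and the 15-minute idle limit are assumptions made for illustration; they are not taken from HIPAA or from any existing CGGD, and a production system would also need authentication, authorization, and tamper-resistant log storage.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AuditEvent:
    user_id: str         # unique identifier of the authenticated user
    action: str          # "view" or "alter"
    record_id: str       # which record was touched
    timestamp: datetime  # when the event occurred (UTC)


@dataclass
class AuditedStore:
    """Hypothetical in-memory store that logs every access to a record."""
    records: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def _log(self, user_id: str, action: str, record_id: str) -> None:
        event = AuditEvent(user_id, action, record_id, datetime.now(timezone.utc))
        self.audit_log.append(event)

    def view(self, user_id: str, record_id: str):
        # The attempt is logged before the lookup, so failed views are recorded too.
        self._log(user_id, "view", record_id)
        return self.records[record_id]

    def alter(self, user_id: str, record_id: str, value) -> None:
        self._log(user_id, "alter", record_id)
        self.records[record_id] = value


def session_expired(last_activity: datetime, now: datetime,
                    idle_limit: timedelta = timedelta(minutes=15)) -> bool:
    """Automatic logoff check; the 15-minute default is an illustrative assumption."""
    return now - last_activity > idle_limit
```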
A quality program must ensure that the integrity of the data is not compromised. Data encryption is recommended, although not mandated, during transmission to and from the database, as well as within the database itself (at rest). Entities in the United States that meet or exceed the encryption standards set forth by federal guidelines both at rest44 and during transmission45–47 are exempt from complying with the Security Breach Notification Rule48,49 in the event of breach, theft, or loss of data. Before submission of genetic information to any database, documented privacy and security practices should be reviewed by a knowledgeable individual to determine whether the database is qualified to house such information.
Recommended Standards for Layer-2 GMDRs
The submission of clinical and/or phenotypic data for a GMDR (layer 2) creates opportunity for understanding clinical findings based on molecular information. Required information depends on whether the database is to be used primarily for cancer, infectious disease, or constitutional changes, including inherited conditions and polymorphisms that may affect drug therapy (ie, pharmacogenetics). Important information for a cancer database includes certain patient demographics (eg, age, race, ethnicity), tumor type and stage, previous therapeutic intervention, and response to therapy (either before or after performance of cancer genomics). Important information for an inherited-disease database includes patient demographics, clinical phenotype, family history and pedigree, segregation studies (if available), and, in some cases, clinical laboratory and/or radiologic studies. Demographic information such as race, age, and gender is essential for both cancer and inherited-disease databases. Phenotypic terminology and formatting should be standardized for databases.36 In the case of cancer, the submitter should be required to state whether the corresponding healthy and tumor tissue were tested. Metadata should include the percentage of tumor in the analyzed specimen (taking into account any microdissection that may have taken place). Some tools and standardized formats have been developed by the International Standards for Cytogenomic Arrays (National Center for Biotechnology Information) Consortium for their chromosome microarray database.24 Systematized Nomenclature of Medicine–Clinical Terms (SNOMED-CT) codes or International Classification of Diseases (ICD)-9/ICD-10 codes could be considered for standardizing disease information because they are widely used and usually available, and because transfer of data could potentially be automated; however, they have limitations, mostly related to the incompleteness and inaccuracy of the clinical description.50
Curation of GMDRs (layer 2) includes points discussed for CGVRs (layer 1) above, plus mechanisms to ensure the integrity of the clinical information that has been submitted. Some of the clinical information may not be available at initial sequence-data submission, so mechanisms that allow additional information, such as evolution of symptoms or clinical outcomes, to be added at a later time without compromising patient confidentiality are critically important to the utility of a GMDR. In the case of constitutional (heritable disorder) testing, follow-up information from biological relatives is also essential. Furthermore, incentives for submitters to provide follow-up data may increase submissions and the completeness of the clinical data, and therefore, the clinical usefulness of the CGGD.40
The ability to query by phenotype (eg, tissue of origin, tumor type and subtype, clinical signs or symptoms, among others) and gene or mutation will be necessary. Standardization of entry (eg, cancer type) is imperative to facilitate data retrieval. The same data retrieval considerations for CGVRs (layer 1) also apply to GMDRs (layer 2).
Recommended Standards Unique to Layer-3 GMEDs
Submission of data to layer-3 databases includes submitting data that specifically support the biological classification of a particular variant.51 Although layer 1 and layer 2 databases must contain sequences and/or variants derived from human samples, layer 3 databases may contain medical evidence pertaining only to previously identified variants. Such medical evidence may include the source of peer-review information, the quality and extent of the evidence supporting a clinical association or causal clinical linkage, reference to causal information (eg, pharmacogenomic studies), or follow-up from an author of a published study regarding RNA or protein studies, segregation analysis, or outcome data after treatment. Evidence could also be based on the analysis of data within a CGGD, with comments and edits from users of the database, as long as the lack of peer review is a searchable field. Standard classifications of the clinical significance of variants with definitions of evidence levels for each must be used for a GMED. Incentives for submitters to provide additional data may be necessary.40
Although curation is important for all 3 layers, it is most critical for a GMED (layer 3). Curation of a GMED has unique issues, including the definition of clinical relevance, the levels of evidence required for classification of a variant, the frequency at which data are reviewed and updated, and the tracking of database modifications. The definition of clinical relevance of a variant and levels of evidence may differ for inherited diseases and cancer. For inherited diseases, a variant that is demonstrated to be disease-causing in some patients may not have complete penetrance or may not cause disease in other patients or family members, with the hypothesis that the effect of the variant is modified by other variants in other genes/proteins. For cancer, a clinically relevant variant may not be disease causing but may determine diagnosis, classification, prognosis, or therapeutic effectiveness. Standards for determining the reliability and significance of information in the literature are important. Evidence synthesis and review must adhere to evidentiary standards for determining the quality of a specific article, such as those developed by the Evaluation of Genomic Applications in Practice and Prevention (EGAPP) group.52 Such evidence syntheses must account for the different significances of a variant in different disease processes or tissues. For example, a BRAF c.1799T>A, p.Val600Glu (V600E) mutation in melanoma has therapeutic implications, but the therapeutic implications for the same mutation in colorectal carcinoma or hairy cell leukemia are not known at this time.
Standards for interpreting and reporting variants as deleterious, benign, or uncertain in inherited diseases have been recommended by the American College of Medical Genetics (Bethesda, Maryland); however, the amount of evidence required to determine whether a variant is benign or pathogenic is less clear.5 Levels of evidence refers to a ranking system describing the strength of results measured in a research study or clinical trial. Examples of levels of evidence are provided by the University of Oxford Centre for Evidence-Based Medicine (Oxford, England).53 However, those levels of evidence may not translate well for rare diseases where the only studies are case reports or small case series. Inherited diseases may require their own levels of evidence that include functional studies of an affected protein. Possible sources of evidence include patterns emerging from GMDRs (layer 2), functional studies, clinical trials, publications, guidelines, or in silico predictions.
Levels of evidence for reporting a variant in cancer are more complex. A variant in cancer can have different clinical uses (diagnosis, prognosis, therapy selection, or genetic predisposition for cancer). Each of those would have its own level of evidence for the variant classification. Furthermore, evidence for the significance of a particular variant may be excellent in one tissue or tumor type but not in others. It is challenging to have adequate studies that cover every possible tissue/tumor type, especially for less-common tumors. The National Comprehensive Cancer Network (Fort Washington, Pennsylvania), for example, has a categorization based on the degree of evidence, as well as expert consensus.54
As our understanding of the clinical significance of human genetic variation in health care increases, the classification of a variant may change over time. A variant that has unknown significance on submission to a CGGD may be proven later to be benign or pathogenic. In addition, the development of new, targeted therapies may make a variant important for therapeutic reasons. As medical knowledge evolves, GMEDs (layer 3) will need to be updated. The date of the update, the version of the data, and the data content of that version need to be tracked and be referenced within each query or use of the database. In addition, notification of updates or changes in classification for a given variant could be considered but would be logistically challenging.
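The tracking requirement described above (date of update, version, and content of each version) could be sketched as follows in Python. The class names and the append-only design are assumptions made for illustration, not a description of how any existing GMED is built; the sketch also assumes that updates are recorded in chronological order.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ClassificationVersion:
    version: int          # sequential version number, starting at 1
    updated: date         # date on which this classification took effect
    classification: str   # eg, "uncertain significance", "pathogenic"


class VariantHistory:
    """Hypothetical append-only history of one variant's classification."""

    def __init__(self, variant: str):
        self.variant = variant
        self._versions = []

    def update(self, classification: str, updated: date) -> ClassificationVersion:
        # Earlier versions are never overwritten, so past reports stay traceable.
        if self._versions and updated < self._versions[-1].updated:
            raise ValueError("updates must be recorded in chronological order")
        entry = ClassificationVersion(len(self._versions) + 1, updated, classification)
        self._versions.append(entry)
        return entry

    def current(self) -> ClassificationVersion:
        """Latest version; the caller can cite its version number and date."""
        return self._versions[-1]

    def as_of(self, when: date):
        """Version that was in force on a given date, or None if none existed yet."""
        in_force = [v for v in self._versions if v.updated <= when]
        return in_force[-1] if in_force else None
```

Because each query result carries a version number and date, a patient report generated from it can state which version of the classification was used, and `as_of` allows an earlier interpretation to be reconstructed.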
Two unique issues should be noted regarding data retrieval from GMEDs (layer 3). The first is to automate correlation of the sequence information in the database with data from the laboratory information system or electronic health record. The second issue is to provide a mechanism for reviewing and assessing the relevance of the evidence supporting a given variant classification. If a database will be used to provide variant classification of sequencing results from an individual patient, automated retrieval of information for all variants detected in a patient is required. Ideally, a system will be in place so that the patient file automatically queries the database for all variants and receives the classification results. These results should easily link to the evidence supporting the interpretation, such as a link to references or specific databases.
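The automated retrieval described above might look like the following minimal Python sketch, in which all variants detected in a patient are looked up in one pass and each result carries the classification, the supporting evidence links, and the database version queried. The dictionary layout, the function name `classify_variants`, and the "not in database" label are hypothetical; a real system would query a GMED through its own interface and would exchange data with the laboratory information system or electronic health record.

```python
def classify_variants(patient_variants, database, db_version):
    """Look up every detected variant and return one result per variant.

    `database` is assumed to map a variant description to a dict with the
    keys "classification" and "evidence" (a list of reference identifiers).
    """
    results = []
    for variant in patient_variants:
        entry = database.get(variant)
        results.append({
            "variant": variant,
            # Variants absent from the database are reported explicitly
            # rather than silently dropped.
            "classification": entry["classification"] if entry else "not in database",
            "evidence": list(entry["evidence"]) if entry else [],
            # The version queried is attached to every result for traceability.
            "database_version": db_version,
        })
    return results
```

Returning an explicit entry for variants that are absent from the database is a design choice in this sketch; it keeps the list of results aligned with the list of variants detected in the patient.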
Although the primary use of a CGGD is to guide care of individual patients, a further consideration in data retrieval from GMEDs (layer 3) is whether the patient's clinical phenotype should be layered across the full database of patient phenotypes. In essence, a GMED (layer 3) should provide functionality for data retrieval based on (1) individual patient demographics (race, sex, age), (2) genotypic findings, and/or (3) clinical phenotypic findings. Such analytics may help individual health care providers gain insight into the potential relevance of their patient's identified sequence variants related to specific phenotypic characteristics and therapeutic responses and may further the process of discovery as well as provide new differential diagnoses for specific patients. Whether nonhuman data should be included in GMEDs (layer 3) is problematic. We recognize that some novel variants will lack any human data regarding classification and that, in those cases, data from animal models, conservation across species, and/or cell-line data may be used to interpret whether a novel variant might be responsible for disease and warrant further consideration. However, those data cannot be considered clinical grade and, therefore, require both a notation in the database as to the limitation of the sources used for classification and a notation in the patient report about the limits of the data used for interpretation. This issue is related to the levels of evidence supporting variant classification, as discussed above.
Discussions about clinical-grade genomic databases will have different emphases depending on what layers of information are included in a given database. We have identified 3 layers of information that may be contained in a database: the sequence data (CGVR, layer 1), the clinical and phenotypic information (GMDR, layer 2), and the classification or association information (GMED, layer 3). The composition of a given database will affect the structure, function, and clinical usefulness of the database. Information from CGVRs and GMDRs (layers 1 and 2, respectively) can be used to expand our knowledge of the significance of different variants and may identify needed scientific functional studies to determine the clinical relevance of specific variants or genes in a disease, as well as allow compilation of the rarer occurrences of variants in a specific disease across multiple testing sites to begin to identify new variant-disease associations requiring additional study. The GMEDs (layer 3) will be useful for the interpretation of patient NGS test results.
Regardless of the composition of a database, if it will be used for clinical care, all data must be high quality. Quality is important at all levels, including how the data are produced, transmitted, organized, retrieved, filtered, and interpreted. In addition, a database must be usable, be kept up to date, allow data to be added later, and still provide data-security measures to protect patient confidentiality. Documentation of all procedures and versions of the database as updates and submissions occur is essential. Databases composed of quality clinical-grade genomic data will be important for the advancement of our understanding of the clinical significance of genetic variants and for more reliable and reproducible interpretation of patient NGS test results.
The goal of this article is to provide an overview of the issues and standards to be considered when creating a CGGD. National and international efforts are underway to create comprehensive genomic databases, such as ClinGen and the Human Variome Project International (Victoria, Australia).55,56 Because these databases will contain information obtained from clinical laboratories and/or be used for patient care, we encourage consideration of these standards when making decisions about the structure and function of the databases as CGGDs. We have not defined the specific requirements for these standards. Those specific requirements, such as the level of evidence required to classify a variant, will require input from multiple stakeholders. Although creation of a CGGD is a daunting task, it is achievable. Clinical-grade databases exist for other types of genetic data, for example, the Database of Genomic Variation and Phenotype in Humans Using Ensembl Resources (DECIPHER, Wellcome Trust Sanger Institute) for copy number variation, translocations, and inversions associated with inherited syndromes.57 That project has tackled some of the issues discussed in this article, including patient privacy, data security, and curation issues, such as updating clinical information. A CGGD for sequence variants can be built using the lessons learned by the databases that have come before.
The authors have no relevant financial interest in the products or companies described in this article.
Supplemental digital content is available for this article at www.archivesofpathology.org in the November 2015 table of contents.