Context.—Next-generation sequencing (NGS) is revolutionizing the discipline of laboratory medicine, with a deep and direct impact on patient care. Although it empowers clinical laboratories with unprecedented genomic sequencing capability, NGS has brought with it substantial informatics challenges. Bioinformatics and clinical informatics are separate disciplines that traditionally overlap only to a small degree, but they have been brought together by the enthusiastic adoption of NGS in clinical laboratories. The result has been a collaborative environment for the development of novel informatics solutions. Sustaining NGS-based testing in a regulated clinical environment requires institutional support to build and maintain a practical, robust, scalable, secure, and cost-effective informatics infrastructure.
Objective.—To discuss the novel NGS informatics challenges facing pathology laboratories today and offer solutions and future developments to address these obstacles.
Data Sources.—The published literature pertaining to NGS informatics was reviewed. The coauthors, experts in the fields of molecular pathology, precision medicine, and pathology informatics, also contributed their experiences.
Conclusions.—The boundary between bioinformatics and clinical informatics has significantly blurred with the introduction of NGS into clinical molecular laboratories. Next-generation sequencing technology and the data derived from these tests, if managed well in the clinical laboratory, will redefine the practice of medicine. In order to sustain this progress, adoption of smart computing technology will be essential. Computational pathologists will be expected to play a major role in rendering diagnostic and theranostic services by leveraging “Big Data” and modern computing tools.
Next-generation sequencing (NGS) technology, using a massively parallel sequencing paradigm, has markedly altered the landscape of genomic medicine. The high-throughput capabilities of NGS systems have resulted in an exponential accumulation of sequence data that exceeds our current technologic capacity to comprehensively manage and interpret genomic information. The rapidly decreasing cost of sequencing per base, in conjunction with the introduction of cost-effective benchtop laboratory sequencers, has sparked growing demand in the field of personalized medicine for incorporation of discrete NGS data within the clinical arena.1 Although NGS platforms are still technically maturing, the technology has progressively infiltrated the clinical molecular laboratory, driven by the enthusiasm of physicians, pathologists, molecular biologists, scientists, and clinical administrators. Even patients are aware that high-throughput genomic analysis will generate answers for many more variants at a lower cost than would be achieved if each variant were studied individually or in small panels. With the current clinical informatics infrastructure, the rapid adoption of NGS creates significant operational disequilibrium. Because this disruptive technology is undeniably the gateway to the new era of clinical genomic medicine, the operational gap between NGS bioinformatics and clinical informatics must be bridged in order to provide a long-term sustainable environment for the next generation of molecular diagnostics.
Gullapalli and colleagues1,2 published two peer-reviewed articles in 2012, describing the then-contemporary NGS technologies and highlighting some of the informatics issues related to the challenges in clinical implementation. Modern technologies, such as virtualization and cloud computing, were briefly discussed with respect to NGS analytics. However, in recent years NGS technology has significantly evolved through new advances that have also introduced completely new challenges to NGS data management. Moreover, analytics and cloud computing have undergone significant advancement, becoming an invaluable resource for high-throughput genomics. In a recently published book, Next-Generation DNA Sequencing Informatics, the authors present a comprehensive introduction to NGS technology. However, this text does not address the operational aspects of NGS in a clinical environment.3 More recently, the Next Generation Sequencing: Standardization of Clinical Testing II (Nex-StoCT II) informatics workgroup published principles and guidelines for clinical NGS informatics detailing the design, optimization, validation, and implementation of bioinformatics pipelines for detection of germ line sequence variants. This guideline, however, did not address the laboratory workflow and informatics infrastructure needed to support the implementation of bioinformatics pipelines.4
As a result, there is a need for a contemporary guide to help pathology informaticists cope with NGS in the clinical laboratory environment. This review not only outlines the current state of sequencing technology and related bioinformatics, but also provides a detailed discussion about the various aspects of laboratory workflow informatics and data management challenges that pertain to the deployment and maintenance of NGS-based molecular testing. Potential solutions for existing bottlenecks and scope for future development are summarized, based on published scientific data and the personal experience of the authors dealing with these issues at their respective institutions. This review is not intended to serve as a technical manual for construction and/or troubleshooting a bioinformatics pipeline.
SEQUENCING TECHNOLOGIES: EVOLUTION AND CURRENT STATE
Frederick Sanger introduced DNA sequencing in 1977 when he described a chain-termination method of replicating the nucleotide sequence of a single-stranded DNA fragment (500–1000 bases).5 To replicate the original DNA sequence, Sanger used a combination of DNA polymerase, a short oligonucleotide primer, chain-extending and chain-terminating nucleotides, chemically modified nucleotide bases, polyacrylamide gels, and radioactive labeling. Subsequent improvements in sequencing chemistry and methodology—such as incorporation of fluorophore-labeled dideoxynucleotides for dye terminator sequencing (chain-termination method), cycled sequencing reactions catalyzed by thermostable DNA polymerases, automated capillary electrophoresis instruments, and laser detection methods—enhanced assay sensitivity.6–8 After electrophoretic separation, the replicated DNA fragments generated a series of fluorophore signal peaks comprising the nucleotide sequence of interest, recorded in a chromatogram or trace file that was converted to a FASTA or FASTQ file format, the latter including base quality scores (Phred quality scores). Downstream analysis required alignment of these fragment reads to a reference genome in order to define their genomic origin and identify variants. The Sanger method remains the benchmark in the field for accurately determining DNA sequence, with an error rate of less than 1 in 10 000 bases.
Next-generation sequencing platforms offer the innovative capacity to perform synthesis of many overlapping short DNA fragments (50–400 bases) from prepared libraries of target DNA by spatially segregating them on beads or arrays and replicating them in parallel.1,9–11 Simultaneous monitoring of nucleotide incorporation for each replicating DNA strand over the length of the sequencing reaction generates millions to billions of short stretches of DNA sequence (reads). In this manner, each base of target DNA is synthesized multiple times. The number of times that each analyzed base is independently represented in overlapping reads is referred to as the depth of coverage. Assembly of these short reads in register with a reference genome allows the original DNA sequence to be re-created, ranging from a small region to the entire genome. Although a single synthesized strand of DNA using an NGS method has a higher rate of error (1 error in 1000 bases) than Sanger sequencing (1 in 10 000 bases), a high depth of coverage combined with a minimum threshold requirement for calling sequence variants can yield a lower overall risk of error than Sanger sequencing. It is this ability to preserve discrete information on the content (sequence) and redundancy (depth of coverage) of the sequenced genome that underlies the high-throughput nature of this technology and endows the generated data with unprecedented biologic insight. The transition from single-tube chemistry (Sanger) to massively parallel chemistry (NGS) has also opened the way to exponential decreases in the minimal “size” of individual sequencing reactions. Analogous to the revolution in semiconductor computer chips during the past 30 years with decreasing component sizes, this has reduced the time and cost of genomic sequencing by several orders of magnitude, thereby fueling genomic research and offering sufficient fidelity and acuity for use in clinical diagnostic applications.
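To illustrate why high depth of coverage compensates for the higher per-read error rate, the short calculation below uses a simplified model that assumes independent, random sequencing errors, a hypothetical 500× depth, and a 5% variant allele fraction calling threshold; it estimates the chance that random miscalls alone would mimic a true variant at a single position.

```python
from math import comb

def prob_at_least(n, k, p):
    """Probability of observing at least k successes in n independent trials with rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

depth = 500                    # hypothetical depth of coverage at one position
per_read_error = 1e-3          # roughly 1 miscall per 1000 bases for a single NGS read
p_same_wrong_base = per_read_error / 3   # chance a miscall produces one particular wrong base
min_alt_reads = int(0.05 * depth)        # 5% variant allele fraction threshold -> 25 reads

p_false_call = prob_at_least(depth, min_alt_reads, p_same_wrong_base)
print(f"P(>= {min_alt_reads} identical random miscalls among {depth} reads) ~ {p_false_call:.1e}")
print("For comparison, Sanger sequencing miscalls an individual base about 1 in 10,000 times.")
```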
Since the completion of the Human Genome Project in 2003, advances in NGS technology have been sponsored predominantly by major commercial entities, including 454 Life Sciences Inc (Roche Applied Science, Branford, Connecticut), Illumina Inc (San Diego, California), and ThermoFisher Scientific Inc (Waltham, Massachusetts). These commercial NGS platforms are distinguished predominantly by different sequencing chemistries (synthesis versus ligation), methods for clonal polymerase chain reaction (PCR) amplification of DNA fragments (bead-based emulsion PCR versus flow cell bridge PCR), and targeted approach (hybrid capture versus PCR amplification). The Roche 454 platform carries out sequencing by synthesis, detecting pyrophosphate release upon nucleotide incorporation. Illumina also uses sequencing by synthesis but uses a fluorescently labeled reversible terminator to detect nucleotide base incorporation during replicative synthesis. In contrast, the ThermoFisher SOLiD platform performs sequencing by ligation using 4 fluorescent dibase probes to interrogate the first and second bases incorporated during each sequential ligation reaction. Both the 454 and SOLiD systems decorate beads with monoclonal DNA fragments that are amplified during emulsion PCR as the substrate for sequencing. In contrast, the Illumina technology immobilizes DNA fragments by attachment to adaptors distributed across the surface of a flow cell, and the amplified sequence binds to an adjacent primer, forming a bridge that generates a clonal cluster with successive PCR cycles.9,10,12
The development of affordable and compact laboratory benchtop sequencers has made it possible to bring NGS technology into the clinical diagnostic laboratory. The Roche 454 GS Junior and the Illumina MiSeq models were derived by miniaturization of the original technology developed for their flagship whole-genome sequencers. In contrast, ThermoFisher (formerly Life Technologies Inc) entered this market through the acquisition of Ion Torrent technology, which incorporates the use of nonoptical, single-nucleotide, semiconductor-based sequencing (Ion Torrent Personal Genome Machine, Ion Proton, and Ion S5). This technology combines aspects of parallel sequencing, including bead-based emulsion PCR, but employs a complementary metal oxide semiconductor chip with microwells that serve as pH-sensitive pixels to detect the release of a hydrogen ion (registered as an electrical signal) when a nucleotide is incorporated during sequencing by synthesis. This approach obviates the need for chemiluminescent dyes, serial optical image acquisition, a motorized camera stage, and extensive storage of preanalytic files for subsequent processing.9,10,12,13 DNA bar codes are unique short sequences of nucleotide bases that are incorporated into target DNA fragments of interest adjacent to the adapter sequences during the library preparation step. These DNA bar codes are of fixed length and must uniquely identify the source (patient) of the target DNA being analyzed in the assay run. Use of these bar codes allows multiple samples to be multiplexed in a single sequencing reaction, thereby making NGS a cost-effective technology.
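As an illustration of how DNA bar codes enable sample multiplexing, the sketch below assigns reads to patient samples by matching the leading bases of each read against a bar code table, tolerating a single mismatch. The 8-base bar codes, sample names, and mismatch tolerance are hypothetical; production demultiplexers also handle dual indices, adapter trimming, and quality filtering.

```python
# Hypothetical bar code -> sample assignments; clinical assays use validated index sets.
BARCODES = {
    "ACGTACGT": "patient_A",
    "TGCATGCA": "patient_B",
    "GATCGATC": "patient_C",
}
BARCODE_LEN = 8

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign_sample(read_seq, max_mismatches=1):
    """Return the sample whose bar code best matches the start of the read, or None."""
    prefix = read_seq[:BARCODE_LEN]
    best = min(BARCODES, key=lambda bc: hamming(bc, prefix))
    return BARCODES[best] if hamming(best, prefix) <= max_mismatches else None

# Example reads: bar code followed by target-derived sequence.
for read in ["ACGTACGTTTGACCAGT", "TGCATGCCAGGTTCAAT", "NNNNNNNNGGCCTTAAC"]:
    print(read[:BARCODE_LEN], "->", assign_sample(read))
```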
A major limitation of the current NGS workflow is the laborious and time-consuming preparation of enriched DNA fragment or amplicon libraries for clonal isolation and amplification on beads or flow cell clusters. Advances in engineering and automation promise to reduce the labor, time, and complexity of this process to meet clinical requirements for accurate, expeditious results. In that context, it is important to recognize that emerging third-generation sequencing technologies bypass this requirement by direct detection of DNA molecules through the use of nanoscale chambers or pores. Pacific Biosciences Inc (Menlo Park, California) developed a platform that performs direct sequencing by synthesis of DNA strands in a nanophotonic chamber (zero-mode waveguide), detecting incorporation of individual phospho-linked fluorescent nucleotides in real time as they replicate the original sequence.14 Alternatively, Oxford Nanopore (Oxford, United Kingdom) has pioneered a system that reads individual DNA strands as they are incrementally conveyed through a protein nanopore, where alterations in current accurately predict the identity of the DNA bases passing through the pore.15 Although the accuracy of single-molecule detection systems does not yet match that of NGS using template libraries, the ability to read long DNA strand lengths and directly detect epigenetic modifications of DNA with these systems confers important advantages to these approaches.
GENERAL SCHEMA OF BIOINFORMATICS WORKFLOW FOR PROCESSING NGS DATA
The most popular NGS technologies are designed to yield millions of relatively short sequence reads (50–400 base pairs) redundantly overlapping a specified genomic region of interest (targeted sequencing) or potentially extending across the whole genome (whole-genome sequencing). The portion of the targeted genome for which reads are actually generated during sequencing represents the extent of coverage provided by a sequencing run. A bioinformatics pipeline (Figure 1) refers to a series of complex and computationally expensive data analysis processes that derive a list of genomic alterations from raw NGS signal output subsequent to signal processing and alignment against a reference genome. A typical bioinformatics pipeline begins with proprietary, platform-specific algorithms that generate sequential base calls from primary fluorescent, chemiluminescent, or electrical current signals. Each of the predicted nucleotide bases is assigned a quality score (Phred-like score or Q score), which reflects the degree of statistical confidence that the base call is correct. The sequence reads generated during this process are stored in one of several file formats (FASTQ, XSEQ, unaligned BAM, or FASTA) with or without the base quality score information. Because of the platform-specific and proprietary nature of this portion of the pipeline, these quality scores (Q scores) are not comparable across different sequencing systems.
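For orientation, the sketch below parses a single, hypothetical FASTQ record and converts its quality string into Phred scores using the common ASCII offset of 33 (Q = -10 log10 of the estimated error probability); the exact encoding and the calibration of the scores are, as noted above, platform specific.

```python
# A hypothetical 4-line FASTQ record: identifier, called bases, separator, quality string.
record = [
    "@READ_001",
    "GATTACAGATTACA",
    "+",
    "IIIIHHHHGGGG#!",
]

sequence, quality = record[1], record[3]
for base, q_char in zip(sequence, quality):
    q = ord(q_char) - 33          # Phred+33 encoding used by Sanger/Illumina 1.8+ FASTQ files
    p_error = 10 ** (-q / 10)     # Q = -10 * log10(probability the base call is wrong)
    print(f"{base}  Q={q:2d}  P(error)={p_error:.4f}")
```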
General schema for bioinformatics workflow for next-generation sequencing (NGS) testing. The bioinformatics algorithms employed in the signal processing and quality control (QC) steps are usually platform specific and often proprietary. The variant detection process is variable, and a variety of software applications are available for different requirements. Abbreviations: CNV, copy number variation; SAM/BAM, sequence alignment/map format (BAM is a binary format of SAM); SNV, single-nucleotide variant; VCF, variant call format.
Subsequent analysis involves performing a quality control (QC) check on the sequenced data (typically FASTQ) to assess read length distribution, quality scores, GC content, overrepresented sequences, and k-mer content.16 The purpose of these checks is to determine if the sequences generated have indicators of poor sequence quality (eg, if the GC base content of the region is at a higher or lower percentage than what is expected). FastQC is one of the popular tools used for this process.16 Additional steps, such as adapter and poor-quality sequence trimming, may be required depending on the QC results and pipeline configuration. The QC check phase is followed by alignment of the overlapping reads in a FASTQ file against a reference human genome (eg, GRCh37). Good alignment algorithms are designed to overcome the ambiguities created by repetitive sequences and sequencing errors. Several alignment programs (open source and proprietary) have been developed that vary in their performance specifications and impact the ability to subsequently detect sequence variations or specific genomic alterations, such as short insertions/deletions (Indels) and large structural variants.17,18 The most widely accepted alignment algorithm is the Burrows-Wheeler Aligner, which is designed for short reads. Hash table-based methods (eg, Sequence Search and Alignment by Hashing Algorithm, and BLAT) and the Smith-Waterman alignment algorithm are suited for longer reads.19,20 Popular short-read aligners include Bowtie,21 Burrows-Wheeler Aligner,20 SOAP2,22 MAQ,23 Novoalign,24 and mrFAST,25 among others. The aligned reads, tagged with metadata such as alignment scores, are output in the sequence alignment map (SAM) format or its binary form (BAM). For certain platforms, such as Ion Torrent and SOLiD, the reads are aligned using a customized alignment algorithm optimized for the platform chemistry and error profile (TMAP for Ion Torrent and LifeScope for SOLiD).
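To show how these individual steps are typically chained together, the sketch below wraps a read-level QC check and short-read alignment in a simple Python driver. It assumes FastQC, the Burrows-Wheeler Aligner (bwa), and samtools are installed and that the reference FASTA and FASTQ file exist at the hypothetical paths shown; a production pipeline would add trimming, duplicate marking, error handling, and logging of versions and parameters.

```python
import subprocess

REFERENCE = "/refs/GRCh37.fa"          # hypothetical bwa-indexed reference genome
FASTQ = "/data/sample01.fastq.gz"      # hypothetical demultiplexed reads for one sample
PREFIX = "/data/sample01"

def run(cmd, **kwargs):
    """Run an external command and fail loudly if it returns a nonzero exit status."""
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True, **kwargs)

# 1. Read-level QC: length distribution, per-base quality, GC content, overrepresented sequences.
run(["fastqc", FASTQ, "-o", "/data/qc"])

# 2. Align reads to the reference with BWA-MEM; SAM output is written via stdout redirection.
with open(f"{PREFIX}.sam", "w") as sam_out:
    run(["bwa", "mem", "-t", "4", REFERENCE, FASTQ], stdout=sam_out)

# 3. Produce a coordinate-sorted, indexed BAM file for downstream variant calling.
run(["samtools", "sort", "-o", f"{PREFIX}.sorted.bam", f"{PREFIX}.sam"])
run(["samtools", "index", f"{PREFIX}.sorted.bam"])
```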
The aligned reads in a SAM/BAM file form an input for a variety of applications that detect single-nucleotide variants (SNVs), short Indels, large structural alterations, copy number variations, and gene fusions. A variety of open source and commercial applications are available, ranging from individual variant callers to solutions offering the entire pipeline from alignment through variant calling. Popular variant callers include Genome Analysis ToolKit (GATK),26 VarScan2,27 Atlas2,28 and MuTect.29 The performance of these algorithms varies widely depending on both the type of variant and the sequencing platform.30 Comprehensive analysis of the most widely used variant detection algorithms for identification of SNV and Indel variants revealed high discordance among all callers tested, including GATK (Unified Genotyper), mpileup (SAMtools), Ion Torrent (torrent suite variant caller [TSVC]), glftools, Atlas2, MuTect, and VarScan2.17,30 Generally, local sequence realignment, duplicate marking, and adjusted base quality score thresholds are employed to enhance the discrimination of variant callers directed at SNVs and Indels. However, structural variants and copy number aberrations require different processing pipelines that are still undergoing substantial development and testing.31,32 Current recommendations suggest using a battery of algorithms to capture the broadest range of variants and then selecting calls that meet a minimum consensus threshold, followed by secondary validation. This approach is applicable to the discovery of novel variants, but validated standards to calibrate and optimize the sensitivity of callers will be necessary to perform rapid screening of established variants. The list of sequence variants is typically rendered in one of several variant call formats, such as the variant call format (VCF), genomic VCF (gVCF), and general feature format (GFF3). The Centers for Disease Control and Prevention and a number of partners are working on a clinical grade version of the VCF file standard (http://vcfclin.org/; accessed December 15, 2015). Of the 3 file types generated in NGS (FASTA/FASTQ, SAM/BAM, and VCF), the VCF occupies the least storage space and is the easiest to parse computationally. Therefore, it may be the ideal candidate file to use to transmit and store genomic data on patients between NGS instruments, laboratory information systems (LISs), and electronic health records (EHRs).
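To make the variant-calling step concrete, the sketch below implements a deliberately naive pileup-based SNV caller over a toy reference and a few already-aligned reads. It ignores CIGAR strings, indels, base qualities, and strand bias, which real callers such as GATK or VarScan2 must model, and the depth and allele-fraction thresholds are illustrative only; the final print statements emit a minimal VCF-like tabulation.

```python
from collections import Counter, defaultdict

REFERENCE = "ACGTACGTACGTACGT"   # toy reference sequence, 1-based positions

# Toy aligned reads as (1-based start position, read sequence); real input would be SAM/BAM.
# Three of the four reads carry a T at reference position 7 (reference base G).
reads = [
    (1, "ACGTACGTAC"),
    (3, "GTACTTACGT"),
    (5, "ACTTACGTAC"),
    (2, "CGTACTTACG"),
]

MIN_DEPTH = 3            # minimum coverage required to evaluate a position
MIN_ALT_FRACTION = 0.3   # minimum variant allele fraction required to call an SNV

pileup = defaultdict(Counter)
for start, seq in reads:
    for offset, base in enumerate(seq):
        pileup[start + offset][base] += 1

print("#CHROM\tPOS\tREF\tALT\tDEPTH\tVAF")
for pos in sorted(pileup):
    counts = pileup[pos]
    depth = sum(counts.values())
    ref_base = REFERENCE[pos - 1]
    alt, alt_count = max(
        ((b, c) for b, c in counts.items() if b != ref_base),
        key=lambda item: item[1],
        default=(None, 0),
    )
    if alt and depth >= MIN_DEPTH and alt_count / depth >= MIN_ALT_FRACTION:
        print(f"toy_chr\t{pos}\t{ref_base}\t{alt}\t{depth}\t{alt_count / depth:.2f}")
```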
The detected variants are further annotated with metadata, such as Human Genome Variation Society annotations, predicted effect on transcription and translation, pathway analysis, genotype-phenotype correlation data, clinical features, therapeutic trials, and outcome data. Public databases, such as the Database of Single Nucleotide Polymorphisms (dbSNP; http://www.ncbi.nlm.nih.gov/SNP/; accessed December 15, 2015), Online Mendelian Inheritance in Man (OMIM; http://omim.org/; accessed December 15, 2015), ClinVar (Sequence Variation Related to Human Health; http://www.ncbi.nlm.nih.gov/clinvar/; accessed December 15, 2015), and Catalogue Of Somatic Mutations In Cancer (COSMIC; http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/; accessed December 15, 2015), provide a source of valuable annotations.
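As a simple illustration of the annotation step, the sketch below joins variant calls against a small in-memory lookup table standing in for a locally maintained extract of resources such as dbSNP, ClinVar, or COSMIC. The table contents, keys, and clinical labels are hypothetical; production systems rely on versioned database downloads or Web service queries and record the database version used for each report.

```python
# Hypothetical local annotation extract keyed by (chromosome, position, ref, alt).
ANNOTATIONS = {
    ("7", 140453136, "A", "T"): {"gene": "BRAF", "protein": "p.V600E", "significance": "pathogenic"},
    ("17", 41245466, "G", "A"): {"gene": "BRCA1", "protein": None, "significance": "uncertain"},
}

def annotate(variant_calls):
    """Attach annotation metadata to each variant call when a matching record exists."""
    for call in variant_calls:
        key = (call["chrom"], call["pos"], call["ref"], call["alt"])
        call["annotation"] = ANNOTATIONS.get(key, {"significance": "not in local extract"})
    return variant_calls

calls = [
    {"chrom": "7", "pos": 140453136, "ref": "A", "alt": "T", "vaf": 0.31},
    {"chrom": "12", "pos": 25398284, "ref": "C", "alt": "T", "vaf": 0.12},
]
for call in annotate(calls):
    print(call["chrom"], call["pos"], f'{call["ref"]}>{call["alt"]}', call["annotation"])
```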
CLINICAL NGS INFORMATICS WORKFLOW
The primary and secondary analyses of a bioinformatics pipeline remain an integral part of NGS testing in both research and clinical laboratories. Data analysis algorithms continue to be refined in concert with improved sequencing hardware and chemistry. Sequencing instruments are typically accompanied by desktop servers with preconfigured and optimized bioinformatics pipelines that attempt to make the user experience as seamless as possible. Having been around in research and open source domains for a significant amount of time before entry into the clinical laboratory, primary and secondary bioinformatics processes have had the opportunity for substantial improvement to achieve clinical-grade levels of precision and accuracy. As with any other laboratory test, introduction of a new test into a clinical environment requires significant interoperability with the existing informatics infrastructure for successful implementation. However, benchtop NGS sequencers entered clinical laboratories so rapidly that little time was devoted to developing informatics solutions able to accommodate the magnitude and complexity of NGS data or to integrate those data seamlessly into the existing clinical informatics infrastructure (eg, LISs and EHRs). Amid the excitement of rapidly embracing NGS technology and the ability of NGS sequencing systems to generate human-readable variant files, little attention was paid to the downstream aspects of laboratory workflow, such as interoperability with existing information systems, patient and sample information integration with sequence data, report synthesis, data storage and transmission, and quality assurance (QA). These elements, which are both upstream and downstream of an NGS sequencing run, are crucial for the successful establishment of NGS-based testing in a clinical environment. The accentuated gap between clinical informatics and NGS technology is partly attributed to the inability of existing LISs and EHRs to import, transmit, and store any NGS data, including the small VCF file. The current gap, however, is also due in part to the inability of NGS platforms to accept, use, and transmit health data using standard electronic messaging protocols (eg, health level 7 [HL7]) in a secure manner.
NGS CLINICAL INFORMATICS WORKFLOW: KEY ELEMENTS, BOTTLENECKS, AND IMPLEMENTATION STRATEGIES
The clinical informatics workflow in a laboratory performing NGS-based molecular testing shares features with any standard laboratory workflow but also has unique attributes. Figure 2 shows a schematic of information flow in a clinical environment, superimposed with the unique requirements and resources, mostly outside the scope of a conventional LIS, that are necessary to successfully implement NGS testing. The discussion below is centered on data management related to NGS testing with reference to preanalytic, analytic, and postanalytic components of laboratory workflow.
Schematic diagram showing the Molecular Laboratory Information System (M-LIS), indicating workflow around next-generation sequencing (NGS) testing and desired features to optimize efficiency. The central box represents a conceptual space for information management systems (IMS) development that incorporates the following desired modules: (1) core LIS modules for performing standard functions, currently offered by most LIS vendors; (2) a whole-slide imaging (WSI) module to bridge the gap between digital slides and molecular testing; (3) workflow and process monitoring for NGS data analysis, which will provide real-time status updates and error reporting; (4) a data-parsing module to enable parsing of data files, such as variant call files, sequence alignment files, and other downstream analysis files in a variety of formats into the LIS internal database, allowing integration of genomic sequencing results with patient sample identifiers for reporting; (5) a data backup module, synchronized to a large institutional data center for redundancy; and (6) a data analytic module that will allow real-time access to various types of statistical, workflow, and patient-related data on current and archived test results. Abbreviations: CRM, customer relationship management; EHR, electronic health record; HL7, health level 7; JSON, JavaScript object notation; OR, operating room; QA, quality assurance; QC, quality control; R&D, research and development; WES, whole-exome sequencing; WGS, whole-genome sequencing; XML, Extensible Markup Language.
Data Acquisition
In the context of the test life cycle, data acquisition is the process of acquiring information from different nodes (points of information transfer) of a laboratory's workflow and integrating it with other data points for a sample or patient. For traditional clinical laboratory tests, the bulk of data acquisition has been associated with the preanalytic phase of an assay (eg, order entry, sample receipt, etc). However, the elaborate and complex workflow of NGS testing generates a wealth of valuable QC and interpretive data throughout the process, necessitating accurate and continuous acquisition of these data points. This section will discuss such aspects of data acquisition across the entire workflow of an NGS assay (Figure 2).
Order Receipt and Specimen Accession
Like any other test, clinical NGS testing starts with the ordering process, along with receipt of the specimen for testing. Orders may be received electronically from EHRs (internal or external), from external LISs, or, in many instances, by paper requisition. The last method is inherently inefficient for many reasons. These include the inability to incorporate all of the details inherent to a complex NGS-based test menu, the inability to track orders from bedside to the laboratory, delays in receipt, errors due to both handwritten labels and illegible handwriting, overwritten transcripts, and physical damage. With the wide repertoire of molecular tests based on multigene NGS panels, interactive test menus with evidence-based clinical decision support are highly desirable for improved customer experience and better use of this relatively expensive testing. When deployed in the computerized physician order entry components of an EHR, especially in large hospital networks, test ordering can be significantly streamlined and utilization improved. Large molecular laboratories, in addition to receiving physician test orders, also handle a large volume of orders from other client laboratories that may or may not use an electronic ordering process. A Web-based client portal for managing such test orders is becoming a popular feature in some of the newer laboratory information management systems (LIMSs). Clinical decision support algorithms can also detect and alert both providers and laboratory staff to possible unnecessary duplicate test orders. Potential future functions of computerized physician order entry could include the ability to perform checks at the time a drug order is placed that alert the provider if adverse pharmacogenetic variants are known to be harbored by the patient.
Specimen Bar Codes
Unlike the DNA bar codes discussed in the previous section, physical bar codes allow for the accurate identification and tracking of specimens during testing. Given some of the issues reported with linear (1-dimensional) bar codes,33 compounded with the small size of most molecular microfuge tubes, 2-dimensional bar codes work best in the clinical and particularly in the molecular environment. Manual handwritten labeling, accessioning, and tracking of each sample throughout the testing process are time-consuming and create fertile ground for identification errors, sample mix-ups, and performance of the wrong test. Use of a physical bar code to label samples and aliquots provides a major solution to this problem. Bar coding provides a high level of accuracy and data integrity in clinical laboratories. Physical bar codes facilitate complete automation of clinical laboratory workflow in clinical chemistry and hematology laboratories. A 2-dimensional bar code system can incorporate more information within a smaller label, which is ideal for the small vials, tubes, and flow cells used in molecular laboratories. Bar codes can be used during the entire testing and reporting process, especially if the LIS and test instruments support them. Multipoint data acquisition requirements for supporting the NGS testing workflow can be greatly facilitated by the use of bar codes.
Workflow Management and Sample Tracking
The heart and soul of a laboratory lie in workflow logistics. A well-designed and tightly monitored laboratory workflow ensures efficiency and accuracy in test results. As in any other clinical laboratory, implementation of NGS testing requires a well-designed workflow with an informatics infrastructure to closely monitor the processes. The level of workflow process monitoring in the setting of NGS is granular and extensive. Several workflow nodes can potentially contribute data points to the monitoring system, such as order receipt and specimen accessioning, sample assessment, nucleic acid extraction, library preparation, chip/flow cell loading, sequencing reaction, bioinformatics analysis, interpretation, reporting, and QA documentation. Each node in the workflow can be a potential source of error, and such errors may be propagated and amplified downstream in the workflow. Given the terabytes of data points generated from an NGS run, finding such errors can be akin to “finding a needle in a haystack.” Therefore, a highly reliable workflow management and tracking system is critical to the successful implementation of clinical NGS testing. Although workflow tracking solutions are currently available from vendors of NGS instruments, their scope is limited to the sequencing reaction steps and related analytics. Upstream sample and patient information management, presequencing wet bench processes, and downstream variant annotation, interpretation, and reporting are typically not addressed by these vendor solutions. In fact, most of the complex wet bench processes prior to sequencing involve a significant amount of manual data transfer. Some large academic and commercial molecular laboratories have implemented custom solutions that manage the upstream and/or downstream processes in their laboratory. As a result, the laboratory has to deal with multiple pieces of middleware, which typically do not interoperate with each other or the LIS, resulting in a fragmented workflow and increased risk of error. The ideal monitoring solution, preferably part of a LIS, should be able to procure real-time data on the test status for samples in different workflows across the laboratory. Not surprisingly, this will require significant interoperability between the monitoring system and the various instruments/workstations in the laboratory, using appropriate messaging protocols. Electronic audit trails will also be a critical aspect of workflow management in order to facilitate troubleshooting in case of assay failure. Fortunately, NGS instruments and their accompanying analysis servers allow some level of data communication, unlike traditional instruments such as a conventional PCR machine.
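As one possible approach, not drawn from any specific commercial system, the sketch below models a sample moving through the workflow nodes named above and records a timestamped audit trail for each transition so that a failed run can be traced back to a specific step and operator; the accession number, user names, and node list are hypothetical.

```python
from datetime import datetime, timezone

WORKFLOW_NODES = [
    "accessioned", "sample_assessment", "nucleic_acid_extraction", "library_preparation",
    "sequencing", "bioinformatics_analysis", "interpretation", "reported",
]

class TrackedSample:
    """Minimal workflow tracker that enforces node order and keeps an electronic audit trail."""

    def __init__(self, accession_id):
        self.accession_id = accession_id
        self.audit_trail = []
        self._next_node = 0

    def advance(self, user, note=""):
        if self._next_node >= len(WORKFLOW_NODES):
            raise ValueError("Sample has already completed the workflow")
        node = WORKFLOW_NODES[self._next_node]
        self.audit_trail.append({
            "node": node,
            "user": user,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "note": note,
        })
        self._next_node += 1
        return node

sample = TrackedSample("MOL-2015-00123")              # hypothetical accession number
sample.advance("tech_jdoe", "FFPE block received")
sample.advance("tech_jdoe", "tumor cellularity 40%")
for event in sample.audit_trail:
    print(event["timestamp"], sample.accession_id, event["node"], event["user"], event["note"])
```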
Workflow QC
Monitoring and QC of the complex NGS testing workflow are an essential part of molecular laboratories offering this type of testing. Because there are several nodes in the workflow, it is important to optimize the process to achieve maximal efficiency in a very busy laboratory. Process improvement methodologies, such as Lean and Six Sigma, are valuable tools for optimizing laboratory workflow and have been shown to be effective in other anatomic and clinical pathology laboratory settings.34–37 Specifically in molecular laboratory settings, a Lean process was used to improve the service quality of the molecular diagnostic laboratory at Henry Ford Hospital.38 Similarly, a Lean process was applied to the molecular microbiology laboratory at the Mayo Clinic to improve overall laboratory performance and achieve higher cost efficiency.39 Next-generation sequencing testing workflow is significantly more complex than conventional molecular testing, and in order to take advantage of these process improvement strategies, systematic and discrete data acquisition is critical.
Data Validation
Next-generation sequencing testing involves a series of steps from library preparation to detection of genomic variations in a given sample. It is important to recognize the occurrence of rare but systematic errors that may be seen in raw signal outputs. These errors are attributable to sample and library preparation, as well as to the underlying sequencing chemistry of each platform. For example, sequencing artifacts (C>T; G>A) caused by cytosine deamination are detected when low amounts of template DNA from formalin-fixed, paraffin-embedded specimens are used. Similarly, sequencing errors may be introduced during amplification-based target capture, library preparation, and cluster generation processes that are attributable to PCR infidelity (error rate <0.3% per base). It has also been demonstrated that the various DNA polymerases used in amplification of these DNA libraries can introduce systematic coverage biases related to the G-C content and the length of the input libraries generated for sequencing.13 Platform-specific nucleotide base misincorporation also occurs during sequencing by synthesis and sequencing by ligation reactions, particularly in certain regions of the genome with homopolymer sequences, high GC content, inverted repeat sequences, and palindromic sequences. The Illumina sequencing platform exhibits miscalls immediately after a triplet of identical base calls, believed to be associated with hairpin formation at inverted repeats that blocks nucleotide addition.40 Both Roche 454 and Ion Torrent sequencing platforms exhibit base calling errors that increase with the length of homopolymers and are believed to be associated with inaccurate flow calls.41 Illumina sequencers produce more substitution miscalls, whereas Roche 454 and Ion Torrent sequencers are biased toward reporting single base deletions, particularly within the context of trinucleotide repeats and homopolymers.42 It is important to recognize, however, that platform providers are continuously improving the quality and performance of their assays while developing filter thresholds to remove a large portion of miscalled reads prior to assembly.
The above error profiles can be attributed to both the “wet” and “dry” sequencing components of NGS. Wet bench components include several processes, such as specimen handling and preservation, nucleic acid extraction, amplification, library preparation, chip/flow cell loading, and generation of sequence reads. The dry bench refers to computational and bioinformatics analyses. In order to preemptively monitor potential errors during the sequencing workflow, it is necessary to critically examine the QC metrics of each step. For example, flow cell or semiconductor chip loading metrics, total number of reads, and mapped reads on target reflect the quality of library preparation. The bioinformatics algorithms involved in base calling, read alignment, and variant calling generate probabilistic scores for each nucleotide base, aligned read, and variant. These quality scores reflect the accuracy of the individual processes and contribute to the overall quality of the generated results. Therefore, in order to ensure a high reliability of test results, the quality scores for every NGS run need to be recorded and analyzed. Capturing and monitoring these QC metrics, however, represents a major informatics challenge because of the sheer volume and complexity of the generated data; there are millions of data points in a typical NGS run. Regardless, real-time monitoring of these data points can detect subtle changes in trends of QC parameters and alert the laboratory to potential errors, enabling early corrective actions and minimizing assay failure rates. Metrics such as average base quality scores, mapping quality score thresholds for variant calling, and coverage and variant frequency cutoffs are employed by clinical laboratories to enforce QC. For example, Levey-Jennings plots for numeric QC metrics, frequently used in automated chemistry and hematology laboratories, can be used to identify problematic trends in NGS sequencing runs, triggering an appropriate corrective action. A vigilant, ongoing QC process will help in detecting and troubleshooting errors not detected during initial assay validation.
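The sketch below shows one simple way to automate Levey-Jennings-style monitoring of a numeric, run-level QC metric, flagging any run that falls more than 2 standard deviations from the historical mean. The metric, historical values, and control limits are hypothetical; a laboratory would substitute its own validated metrics and rules (eg, Westgard multirules).

```python
from statistics import mean, stdev

# Hypothetical historical values for one run-level QC metric: percentage of reads on target.
historical_pct_on_target = [92.1, 93.4, 91.8, 92.7, 93.0, 92.5, 91.9, 92.8, 93.2, 92.3]

center = mean(historical_pct_on_target)
spread = stdev(historical_pct_on_target)
lower, upper = center - 2 * spread, center + 2 * spread

def check_run(run_id, value):
    """Flag runs whose QC metric falls outside the +/- 2 SD control limits."""
    status = "IN CONTROL" if lower <= value <= upper else "REVIEW REQUIRED"
    print(f"{run_id}: {value:.1f}% on target ({status}; limits {lower:.1f}-{upper:.1f})")

check_run("RUN_2015_118", 92.9)   # within limits
check_run("RUN_2015_119", 88.4)   # outside limits; triggers documentation and corrective action
```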
In addition to monitoring the quality of NGS runs, data validation is required for data transmission, electronic messaging, and data storage processes in the molecular laboratory. One such example is the use of digital checksums. Checksums are small data values computed from a given data source (a file or data stream) that are used to verify the integrity of data files during transfer and storage. When various analysis and result files are created during the process of NGS testing, each file is assigned a checksum value (MD5 being one of the most popular algorithms). After these files are moved or copied from the laboratory's server to the centralized data center, or are shared across other laboratories, the checksum is recomputed on the received file and compared with the originally recorded value to verify data integrity.
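The checksum workflow described above maps directly onto standard library tools, as in the sketch below; the file paths are hypothetical, and a laboratory could substitute a stronger digest (eg, SHA-256) where policy requires.

```python
import hashlib

def md5_of_file(path, chunk_size=8 * 1024 * 1024):
    """Compute the MD5 digest of a potentially very large file by reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical paths: the BAM as written by the pipeline and its archived copy in the data center.
source_checksum = md5_of_file("/lab_server/runs/RUN_2015_118/sample01.bam")
archived_checksum = md5_of_file("/data_center/archive/RUN_2015_118/sample01.bam")

if source_checksum == archived_checksum:
    print("Transfer verified:", source_checksum)
else:
    raise RuntimeError("Checksum mismatch: the archived BAM may be corrupted or incomplete")
```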
One of the major challenges in validating data output from NGS sequencers is the heterogeneity of quality scoring systems across different vendors. Overall, Q scores as well as individual QC metrics from each vendor-specific NGS pipeline are based on algorithms (often proprietary) that are tailored to the platform chemistry. These algorithms are optimized to increase the accuracy and sensitivity of variant detection. As a consequence, thresholds for NGS QC metrics are not comparable across NGS platforms. This poses a challenge in the event of disagreement of results during the process of cross-platform validation, interlaboratory exchange, or proficiency testing. As of 2015, there are no software utilities that allow comparison of different quality scores. It is desirable and expected that standard metrics to address this problem will be developed for implementation across different platforms.
Data Analytics
As described above, data analysis forms a core component of NGS-based testing. Typically, these resource-intensive processes involve several bioinformatics steps to convert raw sequence data into appropriately aligned sequences, from which a list of variant calls is generated. Hardware requirements to perform these computational tasks are exceptional in terms of the central processing unit (CPU) cores needed, as well as demands for available memory and low-latency input/output (“hot”) storage. Interestingly, in addition to raw sequence data, there are numerous nodes in the informatics workflow that generate tremendous volumes of data in the background (nonsequence data). Examples of such data include test orders, preprocessing and nucleic acid extraction information, QC metrics generated during the entire testing cycle, orthogonal variant confirmation, repeat analysis logs, test results, pathologic interpretations, semistructured and unstructured pathology reports, and billing information. Taken together with the inherent volume of variant information available on the Internet, the velocity with which it is produced, and the variability of the data and their clinical significance, interpreting NGS data is a Big Data problem. Although there is no consensus and/or quantitative definition of Big Data, the core concepts that are shared among the various definitions from different industry leaders include the three Vs (volume of data, variety of data formats with input by multiple users, and velocity with which the data are amassed). Analysis often requires alternative data storage approaches (such as NoSQL databases) and powerful data processing (distributed analytics) to gain meaningful insight. The inability of conventional computational systems to manage such large-scale data is the underlying concept of the National Institute of Standards and Technology (NIST) definition of Big Data. Although Oracle's (Redwood Shores, California) definition emphasizes the unstructured (nonrelational) nature of Big Data, Intel's definition focuses more on the volume of data.43 Undoubtedly, clinical NGS testing data (sequence and nonsequence) fit the profile of Big Data in the context of conventional computational capacity in the health care industry.
In addition to the challenges in interpreting variants in the context of myriad online databases, another aspect of this problem that will require Big Data analytic solutions is sequence variant reinterpretation in the context of constantly evolving knowledge and the clinical outcomes of patients. Our understanding of the molecular basis of diseases is constantly improving and changing. Based on ongoing research, new information is available almost every day. As a consequence of this explosion in the knowledge base, the clinical interpretation of genomic alterations in the context of a given disease can potentially change over time with new biologic insights. New data are also generated as a patient's disease responds to treatment regimens. Such temporal outcome data are immensely valuable for understanding the clinical significance and biologic implications of genomic variants, especially variants of unknown significance. This brings up the challenge of revisiting previously analyzed cases and reinterpreting genomic alterations in the context of updated information. Undeniably, with NGS testing this can quickly become overwhelming. As of 2015, there are no regulations or recommendations on requirements for reinterpretation of genomic variants. Before such mandates become effective, it will be important to design the appropriate analytic solutions to support the reinterpretive process by pathologists.
It is important to recognize the potential benefits of tapping into Big Data for extracting knowledge. For a molecular laboratory, Big Data analytics that have been well designed and thoroughly validated translate into improved and efficient patient care, as well as financial growth through business intelligence (ie, laboratory decision support systems). For example, it is usually an arduous task for a molecular laboratory to extract test volume data and perform analyses for annual or semiannual review. This undertaking often involves significant manual work, generating several spreadsheets and requiring hours of time. Use of spreadsheets and Access databases for such activities is time-consuming and often fraught with data modeling issues. Although such tools simplify the data extraction process to some extent, they do not scale realistically to a large-volume testing environment. Similarly, reviewing other data points for laboratory QA/QC activities, analyzing test ordering practices, and performing cost-benefit analyses in real time is challenging. High-volume laboratories and health care institutions are turning to business intelligence systems, which are appropriately configured and allow analysis of large-scale data sets for mining usable knowledge. Such systems can streamline test ordering, avoid duplicate testing, and provide high-quality results at a lower operational cost, in turn allowing a higher return on investment for high-throughput molecular testing. Because most LISs do not have this capability as part of their native product, laboratories may wish to consider purchasing a third-party system for this type of analysis. Hence, it is important that vendors developing the next generation of LISs take this into design consideration.
In the era of personalized, cost-efficient, competitive, and value-based health care, customer relationship management is becoming a critical component of laboratory operations. Customer relationship management is not merely releasing high-quality molecular pathology reports in a timely manner, but rather refers to a package of services provided by the laboratory to the patient and health care provider. These include several components, such as providing test details and specimen requirements prior to testing, providing dedicated personnel to answer questions related to testing, providing on-the-fly professional consultation for molecular testing and result interpretation, maintaining the shortest possible turnaround time without compromising quality, and offering pretest and posttest genetic counseling, as well as critical value and delayed test alerts. Customer relationship management is important for the public image of the molecular laboratory, and reputation is closely tied to it. Again, real-time and advanced analytic solutions can provide the required support for executing these processes. Robust decision support systems facilitate customer relationship management, which in turn is based on real-time data analytics.
Data Reporting
Clinical interpretation and reporting of NGS test results is a complex process subject to marked variability across different laboratories and institutions. The complexity of variant interpretation and reporting has escalated because of the abundance of variants detected by NGS testing, many of which are beyond our current understanding of molecular pathology. The sheer novelty of genomic alterations seen in NGS testing requires extensive literature review, historic review of variant data, genotype-phenotype correlations, in silico variant effect predictions, and review of clinical trials. Details of variant interpretation, including analytic tools and algorithms, are beyond the scope of this review article. Instead, this discussion will focus on the clinical informatics aspect of reporting NGS data and tools that may facilitate this process.
Genomic data interpretation and report generation are closely related processes, and therefore it is critical that informatics solutions provide a seamless integration of these tasks. Currently, variant interpretation involves the use of many software solutions to retrieve annotations and visualize genomic data. Several of these steps are performed manually, including the interrogation of references in the literature and searches of historic test data. Analytic tools, including in silico risk prediction algorithms, exist on different Web resources that the pathologist must access individually to retrieve information. The subsequent step of report generation, with rare exceptions, remains a predominantly manual process that is both time consuming and error prone. To obviate this problem, LISs will need to be developed that can handle native genomic data and synthesize readable reports. Additional challenges arise when complex reports need to be transmitted to an EHR from the LIS. As of 2015, there are few informatics solutions that provide variable reporting capabilities for NGS test results (Table), some of which are interoperable with the EHR or LIS.
INTEROPERABILITY
In addition to challenges related to data storage and information technology (IT) infrastructure, the molecular laboratory must be cognizant of interoperability needs and emerging data-sharing standards. Interoperability refers to the capacity of different information systems (eg, LIS, EHR) and software applications (eg, instrument firmware, middleware) to communicate with other systems and exchange data. Interoperability is a serious issue with NGS because most library preparation devices, sequencing instruments, and software tools were not designed to work in a clinical networked environment. The challenge of achieving “plug-and-play” interoperability is compounded by the ever-increasing rate of technologic advancement and software upgrades. Ideally, molecular information systems and related devices should be designed to support syntactic interoperability, such as using HL7 messaging for communication with the LIS and EHR (Figure 3).
Schematic diagram depicting inbound (a) and outbound (b) interoperability between a molecular laboratory information system (M-LIS) and different health information systems (HIS). Middleware (connectors, filters, and transformers) enables the conversion of LIS native data (rich text format, XML, JSON) to health level 7 message, and vice versa. Abbreviation: EPP, external program plugin. Reprinted with permission from GenoLogics Life Sciences Software, An Illumina Company. GenoLogics Clarity LIMS Working with HL7 Connected Systems, A white paper. http://learn.genologics.com/HL7TechnicalNote_LandingPG.html. Accessed December 15, 2015.
HL7 is the single most commonly used messaging standard for transmitting electronic health information, including clinical laboratory data, between hospital and laboratory information systems. Although HL7 remains the de facto standard for messaging clinical and anatomic pathology information, it is not fully optimized for handling specific information classes, such as molecular and digital imaging data. This is a particular problem when dealing with large-scale and highly heterogeneous NGS data. The HL7 Clinical Genomics Working Group has made some progress in this area, but much remains to be done in conjunction with the clinical-grade VCF file to make HL7 completely suitable for this task.44 Other concerns regarding the HL7 Clinical Genomics standard relate to its heavy dependence on LOINC (Logical Observation Identifiers Names and Codes), which may partially work for describing a few genetic tests but is not well suited to reporting of genomic results. These issues are contributing to the lack of HL7 interfacing capability in most NGS instruments and related software solutions. In an attempt to address the complex data representation problem, HL7 version 3 (HL7v3) was created in 2005. This version of HL7 represents health information using Extensible Markup Language (XML) and the XML schema definition (XSD) document structuring convention, a popular approach used in application programming interfaces (APIs) and Web applications. Although more verbose than version 2 of HL7, it is computer readable and allows representation of complex and hierarchical data sets. XML is used to represent and share genomic data through Web services by several national and international genomic data repositories, such as the National Center for Biotechnology Information; the European Bioinformatics Institute; NIST; the University of California, Santa Cruz; and others. Ongoing efforts by the HL7 Clinical Genomics workgroup are aimed at establishing standards for discrete genomic data representation and transmission across disparate research and clinical information systems.45,46 In addition, the Electronic Medical Records and Genomics (eMERGE) consortium, a national network organized and funded by the National Human Genome Research Institute, in collaboration with the Pharmacogenomics Research Network, is spearheading the integration of genomic data into clinical systems.46 The outcome of these national and international efforts is expected to streamline the interoperability of clinical genomic data with existing clinical systems in the near future.
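To illustrate how a structured genomic result can be represented for exchange, the sketch below builds a small XML document with the Python standard library; the element names and attributes form a hypothetical, simplified structure rather than an HL7 Clinical Genomics or vendor-defined schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML structure for a single reported variant.
report = ET.Element("GenomicTestResult", attrib={"accession": "MOL-2015-00123"})
ET.SubElement(report, "Patient", attrib={"mrn": "00000000"})
variant = ET.SubElement(report, "Variant")
ET.SubElement(variant, "Gene").text = "BRAF"
ET.SubElement(variant, "ProteinChange").text = "p.V600E"
ET.SubElement(variant, "Chromosome").text = "7"
ET.SubElement(variant, "Position").text = "140453136"
ET.SubElement(variant, "Interpretation").text = "Pathogenic"

print(ET.tostring(report, encoding="unicode"))
```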
JavaScript Object Notation (JSON) is a relatively new but very popular messaging format used widely in Web APIs. It is lightweight and allows representation of structured as well as dynamic (unstructured) data using key-value pairs. This format is particularly suitable for representing and messaging genomic data using representational state transfer (REST) interfaces,47–49 and therefore offers a viable solution for clinical NGS interoperability challenges.
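A minimal sketch of the same kind of result expressed as a JSON payload is shown below, prepared for submission to a placeholder REST endpoint; the URL, field names, and omitted authentication are hypothetical and are shown only to indicate how key-value representation maps onto genomic results.

```python
import json
import urllib.request

payload = {
    "accession": "MOL-2015-00123",
    "assay": "solid_tumor_panel_v2",        # hypothetical assay name
    "variants": [
        {
            "gene": "BRAF",
            "protein_change": "p.V600E",
            "chromosome": "7",
            "position": 140453136,
            "vaf": 0.31,
            "interpretation": "pathogenic",
        }
    ],
}

body = json.dumps(payload).encode("utf-8")
request = urllib.request.Request(
    "https://lis.example.org/api/v1/ngs-results",   # placeholder endpoint, not a real service
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# A production interface would authenticate and submit over the secure hospital network:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
print(body.decode("utf-8"))
```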
Although messaging interfaces are being optimized for genomic data, early adopters of clinical NGS sequencing continue to face interoperability challenges, which has led to custom and commercial NGS software development as an alternative, albeit short-term, solution. The general core principles of these applications focus on importing patient identifiers from the LIS or EHR and integrating them with sequencing data, annotation, prioritization, filtering of false positives, interpretation, and comprehensive reporting of genomic data that is eventually incorporated back into the LIS or EHR, typically by “copy and paste” actions of an appropriately formatted text report. Optimization of interoperability in the clinical NGS environment is critically important and is being addressed by only a few LIS vendors. One such LIMS solution is Clarity LIMS from GenoLogics, which is designed as an open system that enables a significant level of interoperability, as desired by the end user, between the LIMS, NGS-based and non–NGS-based instruments, and other information systems in the laboratory through extensive use of APIs, HL7, and custom scripting. This LIMS predominantly addresses the preanalytic and analytic parts of the workflow. However, it provides minimal, if any, support for the postanalytic component (ie, test result annotation to reporting) of the workflow. The Clinical Genomicists Workstation (PierianDx, St Louis, Missouri) is another comprehensive NGS solution that provides a start-to-finish workflow management system with a particular emphasis on genomic test results and knowledgebase management, and also addresses reporting of NGS test results, as well as integration with the EHR and/or LIS using HL7.
DATA STEWARDSHIP
Ownership
Data generated by a molecular laboratory may be stored for internal laboratory use (eg, QA, future test validation, knowledge database), in the EHR, or externally, such as in an enterprise data warehouse or public database. The pathology informaticist and/or the expert in computational pathology50 who is responsible for overseeing Big Data needs to be aware of the stakeholders involved and the related data ownership issues.51 The stakeholders include the patients who underwent genetic testing, the providers who ordered these tests, the laboratory generating the data, the institutions or enterprises housing the data, and the researchers who want to access and analyze these data. Stewardship of Big Data brings with it power, but this responsibility can also be politically taxing. Data ownership includes concerns related to information control (eg, data access, modification, deidentification) and the right to assign these access privileges to others. Data sharing has administrative, institutional, legal, and ethical implications.52 The molecular laboratory may need to follow processes that comply with internal and external (eg, Institutional Review Board, privacy board, National Institutes of Health) data-sharing policies and state/federal law. Information architecture, analytics tools, and security measures ideally should satisfy the needs of all key players. Ownership of genomic data may also encompass issues related to patient consent, privacy, and the risk of genetic discrimination.
Data Storage
A uniform theme for NGS testing is the generation of colossal amounts of sequence data per run, which may quickly constrain storage resources. The amount of data grows steeply as the scope of sequencing expands from targeted panels toward the whole genome. For example, the average size of a sequence alignment (BAM) file is approximately 2 GB for a small- to moderate-sized gene panel, in contrast to approximately 10 GB and 150 GB for whole-exome and whole-genome analyses, respectively. Most clinical molecular laboratories entering this arena do not have the infrastructure to efficiently and securely store such large volumes of data. The sequential processes of a typical NGS analysis pipeline also generate several intermediate files of significant size that, in addition to the raw sequence files, further add to the total storage requirement for a given case or run (Figure 1). Downstream analyses, such as variant annotation and data management, add significant storage overhead. Although storage is steadily becoming cheaper, consistent with Kryder's law,53 the rate of production of new sequence data is growing faster, with a doubling time of approximately 5 months.54
As an initial buffer, some benchtop sequencers come bundled with a dedicated analysis server (discrete or onboard) offering roughly 7 to 11 TB of storage space, which is sufficient to hold the results of several small- to medium-sized gene panel runs. Portable or external storage devices are also frequently used in the clinical laboratory to back up large test result files. Although local storage provides a short-term solution for managing test data, it clearly lacks the critical features needed for secure and reliable long-term data storage, such as redundancy, automatic scheduling, tiered storage, an off-site backup location, and disaster recovery protocols. Arguably, portable server racks or redundant array of independent disks (RAID) clusters installed in the laboratory environment can provide some of the above features. However, this approach is expensive and requires significant informatics expertise and dedicated full-time employees for maintenance, consuming valuable laboratory resources. Cloud storage, provided by enterprise data centers or commercial service providers, is currently the most appropriate solution for archiving and managing large-scale NGS data. The logistics of efficiently managing cloud-based data are best handled by dedicated IT professionals in data centers, allowing the clinical laboratory to focus instead on optimizing testing and improving patient care.
Despite cheaper and more efficient storage, capacity is always finite. Because NGS data comprise a vast array of file types, it is important for laboratories to triage the types of files that need to be stored, from both technical and financial standpoints. For example, archiving gigantic raw data files for all cases is impractical because it quickly overwhelms storage capacity. In addition, the bioinformatics components involved in raw data processing and base calling are comparatively stable and rarely need to be repeated, so storing these very large files offers little benefit for the incurred cost of storage space. In contrast, intermediate files (eg, BAM, FASTQ) are considerably smaller than raw data files, and these formats are platform-independent entry points into common bioinformatics tools; they therefore offer much higher utility for the incurred cost of storage. Of particular note, BAM files include all of the information present in FASTQ files, with the exception of unaligned/filtered reads, and FASTQ files can be regenerated from BAM files if necessary. It is also important to consider the length of storage. Unlike intermediate and downstream result files, raw data, if archived, should be stored for a limited time period as determined to be appropriate by the molecular laboratory. At the Molecular and Genomic Pathology Laboratories at the University of Pittsburgh Medical Center, raw data files are automatically purged after 3 months, unless otherwise indicated.55 For appropriate scalability of storage capacity, it is important that molecular laboratories make annual projections of test volumes and forecast storage requirements in consultation with their clinical IT team.
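As a rough aid to such forecasting, the sketch below projects annual storage needs from the approximate BAM sizes cited above; the case volumes and the overhead multiplier for intermediate and annotation files are assumptions that each laboratory should replace with its own projections.

# Assumed annual case volumes and overhead multiplier; BAM sizes follow the
# approximate figures cited in the text (2 GB panel, 10 GB exome, 150 GB genome).
bam_gb = {"panel": 2, "exome": 10, "genome": 150}
annual_cases = {"panel": 3000, "exome": 200, "genome": 20}
overhead = 1.5  # assumed allowance for intermediate, annotation, and QC files

total_gb = sum(bam_gb[t] * annual_cases[t] for t in bam_gb) * overhead
print(f"Projected annual storage: {total_gb / 1000:.1f} TB")  # about 16.5 TB here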
Accidental loss of mission-critical clinical data is damaging to an institution's business and growth. Therefore, as part of the preparatory work for setting up NGS testing, the laboratory should formulate a backup and disaster recovery plan with the help of its IT support team to ensure seamless recovery of clinical test data after a catastrophic event. Off-site data archiving in enterprise data centers is ideal whenever such an option is available.
Data Security and Integrity
One of the formidable challenges in implementing clinical NGS testing with cloud computing technology is data privacy. It is important to recognize that all clinical NGS testing must be performed under a Clinical Laboratory Improvement Amendments license and must also meet the additional requirements specific to molecular genetic and NGS-based testing (eg, as specified by the College of American Pathologists56 or the New York State Department of Health57). In the United States, the Health Insurance Portability and Accountability Act (HIPAA) sets regulations for the security of electronic protected health information and requires compliance by all covered entities.58 With the Health Information Technology for Economic and Clinical Health Act (HITECH)59 and the subsequent HIPAA Omnibus Rule,60 covered entities and business associates are now required to report breaches of protected health information; such breaches not only harm patients but may also result in significant loss of reputation and revenue for an institution. Genomic data pose a unique challenge because it is increasingly recognized that large genomic data sets (eg, whole genome, whole exome) may uniquely identify individuals, particularly in today's interconnected and information-rich world of high-throughput sequencing systems.61 As a proof of concept, Gymrek and colleagues62 demonstrated the ability to extract personal identifying information from a combination of publicly available genomic repositories and recreational genetic genealogy databases. The authors expressed concern about the potential for complete identification of individuals as silos of genomic data grow, even without nongenomic identifiers being present. These evolving concepts around large genomic data sets call into question the currently accepted notion of complete deidentification of a patient's genomic data.61 To this effect, the Genetic Information Nondiscrimination Act (GINA),63 in effect since 2008, prevents insurance providers and employers from discriminating against individuals on the basis of their genetic findings. Drawing on GINA and reinforcing the HIPAA Privacy Rule, the 2013 Omnibus Final Rule acknowledged genetic information as identifiable patient information. This raises fundamental questions about contemporary practices of data management, access, and security in clinical and research settings.
Although cloud computing offers the most comprehensive data management solution for clinical NGS testing, it presents novel challenges in the context of protected health information, as alluded to above. The primary concerns revolve around security, control, and liability.64 Most large-scale cloud service providers (CSPs) have significant resources dedicated to maintaining state-of-the-art physical and electronic security for their data centers; such centers may even be better protected than institutional data centers. However, the application of this technology to clinical genomic and health care data is relatively new, and with limited experience, users remain reluctant to rely on CSPs for fear of unforeseen security vulnerabilities. In addition, CSPs typically operate multiple data centers in diverse geographic locations around the globe. This is even more complex with "cloud-stacking," where multiple CSPs provide different layers of support for a given service.64 Because national health care policies and data privacy rules do not apply across international borders, hosting clinical NGS data in the cloud may violate the privacy regulations of a given region. Conversely, such regulations may become unenforceable for data stored in a public cloud, because control over where the hosted data physically reside is lost once they are distributed seamlessly and unpredictably across geographic locations.
In order to improve customer satisfaction, particularly in alleviating security concerns, CSPs are seeking compliance with nationally and internationally recognized security and auditing standards, such as ISO/IEC 27001, SOC 1/SSAE 16/ISAE 3402, HIPAA, the DoD Information Assurance Certification and Accreditation Process (DIACAP), the Federal Information Security Modernization Act (FISMA), and Cloud Security Alliance (CSA) certification. More recently, CSPs have become willing to sign a Business Associate Agreement (BAA) with clients for specific services provided. It is strongly recommended that organizations or individuals considering cloud computing technology review and understand the CSP's technology, data statements, and provider disclosures in order to make an informed decision. The CSP's service reliability, outage and data loss policies, and data deletion policies after discontinuation of service are other important aspects that users must take into consideration.
INFORMATION TECHNOLOGY INFRASTRUCTURE
Network (Data Communication)
The critical need for off-site storage of clinical genomics data was discussed above in the Data Storage section. In addition, the network bandwidth required to move massive genomic data sets to off-site archives is a common informatics bottleneck. The typical network bandwidth in many health care systems is in the range of 10 to 100 Mbps, which is suboptimal for large data set traffic and poses a high risk of network congestion and consequent malfunction of health information systems. High-bandwidth network infrastructure using fiber-optic technology, which can support peak speeds of up to 10 Gbps, offers a scalable solution for high-volume NGS data transfers to off-site locations, particularly in a complex, interconnected clinical environment. It is preferable, whenever possible, that such network resources be dedicated to clinical NGS laboratories to avoid interference with routine EHR and LIS transactional events. Such networking infrastructure is expensive; alternative strategies, such as monitored scheduled backups during off-peak hours over slower networks, may be reasonable. It is also important to note that data transfer, regardless of the infrastructure, requires appropriate data security protocols.55
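The simple arithmetic below illustrates why bandwidth matters; it assumes ideal sustained throughput, whereas real transfers are slower because of protocol overhead, encryption, and competing traffic on a shared network.

# Idealized transfer-time estimate; real-world throughput will be lower.
def transfer_hours(file_gb: float, link_mbps: float) -> float:
    megabits = file_gb * 8 * 1000  # GB -> megabits (decimal units)
    return megabits / link_mbps / 3600

for link_mbps in (100, 1000, 10_000):
    hours = transfer_hours(150, link_mbps)  # ~150 GB whole-genome BAM
    print(f"{link_mbps:>6} Mbps: {hours:5.2f} h")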
Computational Requirements
Genomic data, with or without integrated medical record metadata, are complex and voluminous, require high-throughput processing, and are continually evolving. As a result, these data pose the characteristic challenges of Big Data analytics, and appropriate computational resources are becoming a necessity for clinical NGS testing. Harmony between continuous technologic advancement and the regulatory framework of a clinical molecular laboratory is particularly difficult to achieve without a robust IT infrastructure that can offer high-performance computation in a versioned and secure manner. Fortunately, the application of innovative computing technologies for Big Data management has unveiled a promising landscape for enabling full-scale implementation of high-throughput genomics in clinical laboratories.
Cluster Computing
The core principle of cluster computing is harnessing multiple commodity off-the-shelf computers in a networked environment to perform resource-intensive data processing, such that the operational efficiency and capacity are greater than those of a single powerful workstation. An individual computer in a cluster is commonly referred to as a node. For example, a computationally intensive process, such as sequence alignment, can be split into multiple small parallel processes (scattered) by a master node and distributed to individual slave nodes within the cluster. After completion of all distributed jobs, the results of the individual analyses are aggregated (gathered) and returned to the user via the master node. Modern high-performance clusters are composed of blades (computer hardware) stacked in racks with high-bandwidth interconnectivity (eg, InfiniBand) that provides massive computing power. One of the biggest advantages of cluster computing is scalability: increasing compute power simply involves adding more nodes to the cluster without any change to the existing architecture. Hadoop and MapReduce are popular frameworks that facilitate fault-tolerant Big Data analytics on complex data sets (including genomics) using specifically engineered file systems and a simple programming model.45,65,66 Analytic advantage is gained by minimizing movement of large data sets across the cluster. Bioinformatics applications, such as GATK from the Broad Institute,26 are popular components of many NGS analysis pipelines in clinical laboratories and can leverage the power of cluster computing using the Queue tool. For clinical laboratories, this potentially translates into shortened and well-controlled turnaround times for molecular test results.
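The toy sketch below mimics the scatter/gather pattern on a single machine with Python's multiprocessing pool. The per-region task is a placeholder; a real cluster would distribute such work units across nodes through a job scheduler or a MapReduce-style framework rather than this code.

# Toy scatter/gather: the per-region task is a placeholder, not real alignment.
from multiprocessing import Pool

def count_reads(region: str) -> tuple[str, int]:
    # Stand-in for a per-region job such as alignment or variant calling.
    return region, len(region) * 1000

if __name__ == "__main__":
    regions = [f"chr{i}" for i in range(1, 23)]       # scatter units
    with Pool(processes=4) as pool:
        results = pool.map(count_reads, regions)      # scatter to workers
    total = sum(count for _, count in results)        # gather and aggregate
    print(f"{len(results)} regions processed, {total} simulated reads")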
Grid and Cloud Computing
Grid computing provides enhanced, federated computing power by employing networked computers and clusters over a wide geographic area. It also introduces the concept of compute resource provisioning, in which multiple data analysis jobs are handled such that individual computing resources are optimally used and turned on or off as needed. The popular concept of cloud computing further enables "on demand" or "on the fly" provisioning of compute resources and services. NIST defines cloud computing as "a model for enabling ubiquitous, convenient and on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction."67 Indeed, cloud computing offers different services (storage, compute, application, hardware) in a rapidly available, highly scalable, and configurable format, and it supports a range of different computing requirements. Services are typically tiered according to the end user's requirement for software (SaaS), platform (PaaS), or infrastructure (IaaS) as a service (Figure 4). For example, a vendor can deploy a LIMS in the cloud and make it available to its users via SaaS. An Amazon (Seattle, Washington) EC2 instance, on the other hand, provides a framework to completely configure the hardware, operating system (OS), and software settings of a virtual machine (ie, IaaS). Next-generation sequencing testing requirements, such as data storage and analytics, are best addressed by using cloud services. For management of genomic data, cloud computing provides numerous advantages, such as on-demand scalability, load balancing and dynamic resource allocation, virtualization solutions, multiple operating system deployment, highly reliable data backup, and shared resources in a multitenant environment. The pay-per-use model, popular among many CSPs, eliminates upfront capital investments and can be a cost-effective long-term option for NGS data analysis and storage. It virtually eliminates the need for on-site (within the laboratory's physical space) hardware purchases and maintenance.
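As one hedged illustration of programmatic IaaS provisioning, the sketch below uses the AWS SDK for Python (boto3). The AMI ID is a placeholder, the instance type is merely a plausible choice for alignment workloads, and running the code requires configured credentials and incurs charges; it is shown only to convey the idea of "on the fly" provisioning of compute for an NGS pipeline.

# Sketch only: the AMI ID is a placeholder; credentials and billing are assumed.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder image with a validated pipeline
    InstanceType="r5.4xlarge",        # memory-oriented node for alignment jobs
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])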
Figure 4. This figure illustrates a simplified model of the service architecture of a typical cloud computing environment. Service-oriented architecture is the principal benefit of using cloud computing. As shown, users have the flexibility to choose only the resources that they want in a pay-as-you-go model, without significant capital investment. It also relieves users of the responsibility of maintaining on-premises hardware and software.
Cloud computing may be deployed in one of the 4 described models: public cloud, private cloud, hybrid cloud, and community cloud (Figure 5). Private cloud is designed to serve a closed user group. Large academic institutions often deploy this model (institutional cloud), which resides inside the institutional firewall. Public cloud, in contrast, is made available to all users worldwide, typically with a fee for usage. In a hybrid model, the data center core resides within the corporate (institutional) firewall but provides a subset of cloud services to users outside the institution. Community cloud architecture is very similar to public cloud; however, services are available to a restricted group or community of users who are bound by a business agreement or contract.
Figure 5. Cloud computing infrastructure can be deployed as a private cloud, in which computing resources as well as data remain within an institution's firewall; this is in sharp contrast to public cloud models, where the converse is true. Hybrid cloud, a more recent concept, allows institutional control over the data while leveraging public computing resources; this is the most challenging infrastructure to set up, especially from a security perspective. In addition, a community cloud model is established by mutual agreement among a group of institutions to share computing resources and data in a federated environment. Although private and community cloud infrastructures require significant capital investment, the latter model is more cost effective for each of the participating institutions. Abbreviation: PHI, protected health information.
Investigators in the biomedical research domain and academic bioinformaticians using NGS technology have been early adopters of cloud computing. More recently, bioinformatics solutions for analyzing NGS data have been developed and deployed entirely on cloud infrastructure.68–71 Some commercial LIS/LIMS developers in the genomics space offer their cloud-based informatics solutions as SaaS. Virtual machine instances in the cloud containing a laboratory's validated NGS analytic pipelines can easily be shared with collaborators or other laboratories without significant configuration or infrastructure requirements. Conceptually, federated cloud storage could also be leveraged by regulatory organizations to administer proficiency testing that involves sharing large genomic files (eg, FASTA, BAM) across participating clinical laboratories. Despite these advantages, privacy and reliability concerns have limited the use of cloud computing in clinical molecular diagnostics and patient care,64 largely because of data security issues (see the Data Security and Integrity section for details).
Virtualization
Virtualization is a software technology that abstracts the OS from the underlying hardware, typically using a hypervisor that enables efficient use and management of server hardware (Figure 6). The hypervisor communicates the necessary instructions between the OS and the underlying hardware in such a way that multiple different operating systems can run independently on the same physical hardware. This technology is one of the core components of cloud computing, allowing racks of server hardware and high-performance clusters to be virtualized into one giant hardware resource pool. This has enabled cloud services to be more cost effective and ecofriendly.64 Sophisticated software systems assist with hardware resource utilization and the management of virtual machines in the cloud. With technological advancements, virtualization can be achieved at multiple levels above the OS, which has simplified the provisioning of cloud computing services to the extent that end users can configure their own virtual environment through a Web interface without the assistance of the service provider. Complex, resource-intensive NGS analysis pipelines can be provisioned in minutes by creating a single large machine or an array of virtual machines (a virtual cluster) from a CSP. Such a virtual computing environment can be shared among users, fostering a collaborative environment for genomics research. Virtualization technology also provides tools that can be leveraged to lock down an NGS pipeline after clinical validation and to perform version control for subsequent refinements to the pipeline.
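One minimal way to support such lockdown and version control, sketched below under the assumption that the pipeline's executables (eg, bwa, samtools) are on the PATH and its scripts live in a pipeline/ directory, is to write a manifest of tool versions and file checksums at validation time so that any later drift can be detected before a clinical run.

# Minimal sketch: record tool versions and script checksums into a manifest.
# Tool names, paths, and the version string are examples, not a standard.
import hashlib
import json
import subprocess
from pathlib import Path

def tool_version(cmd: list[str]) -> str:
    try:
        out = subprocess.run(cmd, capture_output=True, text=True)
        lines = (out.stdout or out.stderr).splitlines()
        return lines[0] if lines else "unknown"
    except FileNotFoundError:
        return "not installed"

manifest = {
    "pipeline_version": "1.3.0",  # assigned at clinical validation
    "tools": {
        "bwa": tool_version(["bwa"]),                  # bwa prints usage to stderr
        "samtools": tool_version(["samtools", "--version"]),
    },
    "scripts": {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path("pipeline").glob("*.py"))
    },
}
Path("pipeline_manifest.json").write_text(json.dumps(manifest, indent=2))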
Figure 6. Virtualization technology enables highly efficient use of computing resources, thereby allowing economies of scale for developing massive data centers and cloud computing infrastructure. In contrast to the typical single-server architecture (top left), virtualization (top right) allows multiple guest operating system (OS) environments to run on the same hardware in a noninterfering manner. This is made possible by specialized software, the hypervisor. A hypervisor abstracts the underlying hardware from the guest OSs, providing a paravirtualized environment for each of the OSs to effectively use the pool of hardware resources. An additional level of scalability is possible by virtualization of multiple servers (bottom) into a single pool of hardware resources that is completely abstracted (dashed line) from the software layer. Multiple virtual machines can then be furnished from this common scalable resource pool to provision the required architecture for cloud computing.
Besides on-demand high-performance computing, the virtual desktop is another popular application of this technology. As with provisioning virtual servers, a graphical user interface–rich environment can be configured, providing an experience similar to that of a traditional desktop computer. However, the user interacts with the virtual desktop through a secure thin client, and all resources, including data storage, are provided by the virtualized environment. For clinical molecular laboratories, this translates into a low-cost and controlled work environment that ensures data integrity and security. For large health care enterprises offering NGS testing, it helps enforce data security by eliminating local storage of sensitive data on users' computers.
MOLECULAR LIS AND FUTURE PERSPECTIVE
In the preceding sections, we reviewed the unique milieu and requirements of NGS testing in the clinical molecular laboratory. Several aspects of NGS testing ultimately require use of a unified information management system (IMS) to ensure reliable, secure, efficient, and productive laboratory workflow. An IMS for NGS testing should have several unique features, many of which are not available in contemporary commercial LIS/LIMS products. Traditionally, the LIMS was designed to manage core laboratory workflow operations, such as sample tracking, inventory management, test menu management, and protocol design tools, and to support limited reporting capabilities. The LIS, in addition, provides more comprehensive reporting capabilities (including faxing reports), interoperability with other electronic information systems (eg, EHR, other LIS/LIMS), handling of test orders, and billing information management. In research laboratories performing NGS-based experiments, a LIMS typically manages workflow. With NGS testing performed in clinical laboratories, the additional requirements for data security, billing, formal reporting, and interoperability with other clinical systems require the IMS to exhibit features of both a LIMS and an LIS. Such unique needs justify the concept of a molecular LIS (M-LIS) to handle integrated genomic and health information data. The Table lists some of the newly developed commercial molecular NGS information management systems that have incorporated dual functionalities to variable extents.
Much is expected of an M-LIS in terms of the novel workflow and information management requirements of modern molecular laboratories. A major challenge in the clinical setting is the need to support a dynamic workflow associated with constant growth of the laboratory's NGS test menu and expanding specimen volume. Laboratory QA is critical in this dynamic environment to enforce audit trails and ensure continued compliance with regulatory standards and recommended best practices. Quality assurance activity, in turn, is highly dependent on appropriate analysis of the data generated by the laboratory (eg, test results and workflow metadata). Ongoing changes in the laboratory's workflow, along with constant improvements in sequencing technology, require plasticity in the laboratory's informatics requirements. A "newly" implemented LIS/LIMS may quickly become obsolete unless active development is part of the package. The M-LIS vendor should provide ongoing improvements in a reasonable time frame based on the laboratory's dynamic needs, and/or enable an open software architecture that allows custom modification by the laboratory. Newer molecular LIS/LIMS products are designed using a modular approach in which active development is enabled in a "plug-and-play" fashion.
Modern Web technology has changed the fundamental basis for communication and data sharing through seamless access across a wide array of devices and platforms. Unlike traditional desktop software, Web-enabled tools offer enhanced user interaction and experience. In addition, Web application deployment and maintenance in an enterprise environment are significantly easier and more cost effective than for thick client applications. It is therefore not surprising that vendors developing NGS data management systems are increasingly leveraging Web technology to meet users' needs, permit data visualization, and facilitate application maintenance. The latter is particularly useful for large health care organizations, where maintenance of clinical applications can be a daunting task and downtime in the clinical laboratory can have a significant impact on operations and patient care. Several of the novel features, transactions, and interfaces expected of the M-LIS are shown in Figure 5. Arguably, depending on the hardware and software platforms selected, database type and schema, deployment environment, and capital resources, the M-LIS can be designed to incorporate or delegate functions within and among a group of distributed hardware and software services. Because resource provisioning is highly variable across institutions, the M-LIS should be designed to accommodate this variability. Another novel but important feature expected from the new breed of M-LIS is the ability to perform advanced data analytics that provide real-time insights from integrated genomic, pathologic, and clinical data to improve patient care. Diverse data science technologies are being explored to mine genomic and big health care data.
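As a small, hedged illustration of the kind of analytics such a system might support, the sketch below uses pandas to summarize per-gene variant detection rates and median turnaround time from a hypothetical, deidentified results extract; the file name and column names are invented for the example.

# Hypothetical, deidentified results extract; file and column names are invented.
import pandas as pd

results = pd.read_csv("ngs_results.csv")  # columns: case_id, gene, variant_detected, tat_days

summary = (
    results.groupby("gene")
    .agg(cases=("case_id", "nunique"),
         detection_rate=("variant_detected", "mean"),
         median_tat_days=("tat_days", "median"))
    .sort_values("detection_rate", ascending=False)
)
print(summary.head(10))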
CONCLUSIONS
The boundary between bioinformatics and clinical informatics has significantly blurred with the introduction of NGS into clinical molecular laboratories. The utility of NGS-based testing for precision medicine has led to its widespread adoption. Informatics challenges continue to emerge as the technology in this field evolves, including new analytic pipelines, image management systems, patient privacy concerns, and laboratory regulations. Next-generation sequencing technology and the data derived from these tests, if managed well in the clinical laboratory, will redefine the practice of medicine. In order to sustain the rapidly progressing field of clinical molecular diagnostics, adoption of smart computing technology is essential. The contemporary concept of "stacking and reclaiming" health care data needs to be succeeded by "managing and analyzing" unstructured data, transforming it into usable knowledge for real-time patient care. In the near future, computational pathologists will be expected to routinely work with Big Data and modern computing tools to render diagnostic and theranostic services.50
References
Competing Interests
Dr Nikiforov and Dr Nikiforova are consultants for Quest Diagnostics. Dr Nagarajan is the chief informatics officer and founder of PierianDx, with stock options. The other authors have no relevant financial interest in the products or companies described in this article.