The healthcare landscape is rapidly evolving through the integration of diverse data sources such as electronic health records, omics, and genomic data into patient profiles, enhancing personalized medicine and system interoperability. However, this transformation faces challenges in data integration and analysis, compounded by technologic advancements and the increasing volume of health data.
This study introduces a novel hybrid edge-cloud framework designed to manage the surge of multidimensional genomic and omics data in the healthcare sector. It combines the localized processing capabilities of edge computing with the scalable resources of cloud computing. Evaluations used both simulated and real patient cytometry datasets to demonstrate the architecture's effectiveness.
The implementation of the hybrid edge-cloud framework demonstrated improvements in key performance metrics. Network efficiency was enhanced by reducing data transfer latency through localized edge processing. Operational costs were minimized using advanced compression techniques, with the Zstandard (ZSTD) codec significantly reducing data size and improving upload times. The framework also ensured enhanced data privacy by leveraging edge-based anonymization techniques, which process sensitive information locally before transfer to the cloud. These findings highlight the framework’s ability to optimize large-scale omics data management through innovative approaches, achieving significant gains in scalability and security.
Integrating edge computing into a cloud-based omics data management framework significantly enhances processing efficiency, reduces data size, and speeds up upload times. This approach offers transformative potential for omics and genomic data processing in healthcare, with a balanced emphasis on efficiency, cost, and privacy.
INTRODUCTION
The healthcare landscape is rapidly transforming owing to technologic advancements and the growing availability of health data, with an emphasis on integrating diverse data sources like electronic health records, genomic sequencing, and wearables into comprehensive patient profiles to improve care and facilitate evidence-based decisions.[1–5] This integration is crucial for personalized medicine, public health management, and enhancing interoperability across healthcare systems. However, realizing the full potential of health data integration presents several challenges. The multidimensional and complex nature of health data poses significant integration and analysis hurdles. These hurdles are compounded by the rapid pace of technologic advancement in healthcare, which accelerates the generation of new data types. The volume of stored data grew by 40 PB between 2016 and 2018 alone,[6] an exponential rather than linear increase that outpaces Moore's law. This rapid data growth, combined with the increasing diversity of data types, underscores the critical need for efficient data integration, consolidation, and management. The current landscape of health data tools is mostly characterized by a dichotomy between on-premises and cloud-based solutions, each presenting unique advantages and challenges in terms of control, cost, and convenience.
On-premises or local solutions such as AbioTrans,[7] compcodeR,[8] CANEapp,[9] BioAnalyzer,[10] and GNomEx[11] specifically address the intricate needs of genomic research and other omics data analysis. These platforms excel in omics analysis, alternative splicing, and single-cell RNA sequencing, among other functionalities, offering a blend of gene expression analysis, statistical distribution fitting, and correlation analysis. They present advantages in terms of predictable costs, enhanced data privacy, and faster data access speeds. However, they face limitations in scalability and collaboration and incur increased operational costs, posing challenges for large omics and genomic dataset analysis.[12]
In contrast, cloud-based architectures herald a new era of accessibility and scalability, characterized by both horizontal scaling, which adds more virtual machines (VMs) of the same size and type as needed to increase parallelism, and vertical scaling, which enhances the computational capabilities of existing VMs by upgrading their size/type within a series/family or by switching to a VM series/family with superior capabilities, such as those based on newer central processing units.[13] In health data analysis, the integration of cloud computing with distributed computing and machine learning has introduced innovative approaches to managing large-scale bioinformatics data.[14,15] Platforms such as the Cancer Genomics Cloud[16] and G-DOC Plus demonstrate the potential of cloud-based solutions in merging genomic and multi-omics data with electronic health record information and managing diverse biomedical data, respectively. The versatility of these solutions is further illustrated by OncDRS[17] and SparkSeq,[18] which facilitate precision medicine applications and the analysis of next-generation sequencing data.
Despite their substantial contributions, these frameworks require further exploration and enhancement to fully address scalability, collaboration, and the efficient management of large genomic and omics data volumes. Existing systems struggle to keep pace with the massive data volumes produced by next-generation sequencing techniques.[19] This issue is compounded by the dependency on consistent and reliable internet access. In regions where internet access is not universally available, this dependency could potentially limit the effectiveness and reach of cloud-based solutions.[12] The costs associated with cloud storage, while eliminating some upfront expenses, can lead to higher long-term costs due to ongoing subscription fees.[20] Furthermore, the regulatory landscape in healthcare poses considerable challenges. There are significant concerns regarding data security and the necessity to comply with various healthcare regulations.[21] Ensuring that cloud computing solutions adhere to these legal and ethical standards is essential for their effective and responsible use.
This article introduces a novel hybrid edge-cloud framework that addresses scalability and adaptability. Tailored specifically for genomic and other omics applications, our framework optimizes data processing for real-time and near real-time analytics by strategically combining the strengths of edge and cloud computing. Edge computing refers to the practice of processing data near the source of its generation rather than in a centralized data-processing warehouse, improving response times and saving bandwidth.[22] The focus on omics data is justified by its complexity, volume, and importance in personalized medicine, as it includes genomics, transcriptomics, proteomics, and metabolomics, all crucial for understanding disease mechanisms and developing targeted therapies. Its high dimensionality and privacy concerns present challenges for processing and management. To address these challenges, the framework emphasizes network efficiency, operational cost reduction, and data privacy, which are critical for managing large-scale omics data securely and efficiently. The specific objectives of this study are:
Present a novel framework that uniquely integrates edge computing optimized for near real-time omics data analytics processing with edge-based anonymization techniques, such as SHA-256 hashing,[23] to ensure data security before transmission to cloud storage. This dual approach significantly enhances network efficiency, operational cost-effectiveness, and data protection.
Use state-of-the-art compression codecs like Zstandard (ZSTD), a near real-time compression algorithm developed by Facebook, using LZ77 combined with fast Finite State Entropy and Huffman coding,[24] to reduce data storage and transfer costs while saving time.
Improve scalability and accessibility by processing data at the edge.
Network efficiency, operational cost reduction, and data privacy were selected as primary outcomes owing to their importance in securely and cost-effectively managing high-volume omics data. These metrics are critical for evaluating the practical feasibility and adoption of the framework in clinical and research settings. This study hypothesizes that the hybrid edge-cloud framework will (1) improve network efficiency for omics data processing, (2) reduce operational costs through effective data compression, and (3) enhance data privacy compliance by minimizing data exposure during processing.
METHODS
Our hybrid edge-cloud framework is designed for omics applications, leveraging established methodologies from traditional frameworks, as proposed by Mrozek [13] and Nguyen et al.[25] It uses edge-cloud computing to optimize data processing near the source, as shown in Figure 1. This approach minimizes response times and conserves bandwidth.[22] Furthermore, Figure 2 visually illustrates the main components and workflow of the hybrid edge-cloud framework. It provides a detailed graphical representation of the interactions between edge computing processes, cloud storage solutions, and data management techniques.
To evaluate the framework, we used both simulated and real cytometry datasets. Formal ethical approval was not required for this phase of the project as our focus was on developing systems architecture rather than analyzing personal or sensitive information. Furthermore, all data were de-identified, ensuring that there was no risk of re-identification.
Dataset Specifications
Simulated dataset: The simulated dataset mimics chronic lymphocytic leukemia (CLL) samples, including both CLL cases and normal controls. It encompasses measurements of 10 cellular markers and six scatter parameters crucial for cytometry analysis. The dataset varies in size, ranging from 50,000 to 10 million rows. This variety allows for rigorous testing under diverse load conditions (an illustrative generation sketch is provided after this list).
Real patient data: This study also used real patient cytometry data from B-cell acute lymphoblastic leukemia (B-ALL) and CLL to assess the framework’s effectiveness and computational efficiency:
○ Dataset 1 (B-ALL): Contains 6.3 million records (approximately 1.64 GB), featuring a broad spectrum of 11 cellular markers. This extensive data volume and marker variety provided a rich environment for rigorous analysis and performance testing.
○ Dataset 2 (CLL): Similar to dataset 1 but larger, with 12 markers and 41 million records (approximately 2.07 GB). Its massive scale offers an exceptional opportunity to evaluate the framework’s scalability and efficiency, particularly in edge-based cytometry data processing and analysis.
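For illustration, the following sketch shows one way such a simulated dataset could be generated. The marker names, distributions, and case/control split are assumptions made for this example and are not the study's actual simulation parameters.

```python
import numpy as np
import pandas as pd

def simulate_cytometry(n_rows: int, seed: int = 42) -> pd.DataFrame:
    """Generate a synthetic CLL-like cytometry table (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)
    markers = [f"Marker_{i + 1}" for i in range(10)]                   # 10 cellular markers
    scatter = ["FSC-A", "FSC-H", "FSC-W", "SSC-A", "SSC-H", "SSC-W"]   # six scatter parameters

    is_cll = rng.random(n_rows) < 0.5          # assumed even split of CLL cases and controls
    data = {}
    for marker in markers:
        base = rng.lognormal(mean=3.0, sigma=0.5, size=n_rows)         # assumed intensity model
        shift = np.where(is_cll, rng.normal(1.5, 0.3, n_rows), 0.0)    # assumed shift in CLL events
        data[marker] = base + shift
    for channel in scatter:
        data[channel] = rng.normal(50_000, 15_000, size=n_rows).clip(min=0)

    frame = pd.DataFrame(data)
    frame["label"] = np.where(is_cll, "CLL", "normal")
    return frame

# The study varied dataset size from 50,000 to 10 million rows.
events = simulate_cytometry(50_000)
```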
Implementation Environment
In this research, initial data processing within edge computing environments was performed using a standard laptop equipped with an 11th Gen Intel Core i7-1185G7 processor, 16.0 GB of RAM, and an internet connection speed of 20 Mbps. The operating system used was Windows 10 Pro (64-bit), and all algorithms were implemented using Python (version 3.8), with relevant packages such as ZSTD (version 0.23.0), SNAPPY (version 0.7.3), LZ4 (version 4.3.2), and GZIP (part of the standard Python library). The cloud component used Microsoft Azure services, integrating Azure Data Lakehouse for data storage. This setup ensures consistency and replicability across different environments.
Edge Computing and Initial Data Processing
Initial data processing within edge computing environments involved the following steps (illustrative code sketches for each step are provided after the list):
Data cleaning (Figure 2, Step 1.1): This critical phase starts with validating the accuracy and completeness of the dataset. It includes identifying and flagging duplicate entries for removal and pinpointing irrelevant information. A key technique in this phase is variance threshold–based selection. This approach helps in systematically identifying data that exhibit minimal variation and are thus considered redundant or irrelevant for the analysis. The identification and flagging process can be significantly automated and enhanced through advanced machine learning models. An exemplary tool in this regard is the FlowClean algorithm,[26] available as a package for the R programming language, which can be installed from the machine learning models marketplace.
Data anonymization (Figure 2, Step 1.2): To ensure privacy and security, SHA-256 hashing[23] is used to convert personal identifiers and sensitive data into unique hash values, effectively masking the original data while preserving its utility for analysis. Additionally, tokenization methods are used to replace sensitive data elements with nonsensitive equivalents, known as tokens, which can be mapped back to the original data only through a secured tokenization system. To enhance data privacy compliance, tokens matching anonymized data can be securely exported and stored locally on physical storage devices, remaining outside of cloud systems. This setup adheres to country-specific privacy laws and provides an additional layer of security, limiting access only to authorized personnel. These anonymization practices adhere to stringent data protection standards, including the Health Insurance Portability and Accountability Act (HIPAA) in the United States and the ISO/IEC 27001:2013 international standard for information security management. By conforming to these standards, organizations ensure the confidentiality, integrity, and security of sensitive information, safeguarding against unauthorized access and breaches.
Data compression and upload (Figure 2, Step 1.3): This module of the framework is optimized for preparing data for storage and access. The principal technique used in the module is an advanced compression strategy that incorporates both the Parquet file format and the ZSTD compression algorithm. The Parquet format excels in managing large datasets due to its efficient columnar storage capability, which significantly minimizes storage space while maximizing read and write speeds. Concurrently, ZSTD provides high compression rates,[27] further reducing the data footprint in storage. These technologies together aim to maximize storage efficiency and boost the speed of data retrieval, key factors for scalable and cost-effective data management.
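Step 1.1 (data cleaning): the sketch below drops exact duplicates and applies variance threshold–based selection to discard near-constant channels. The threshold value and column handling are illustrative assumptions; automated quality control with tools such as FlowClean would complement this step.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def clean_events(frame: pd.DataFrame, variance_cutoff: float = 1e-3) -> pd.DataFrame:
    """Remove duplicate events and near-constant channels (illustrative cutoff)."""
    frame = frame.drop_duplicates()                       # duplicate entries flagged for removal

    numeric = frame.select_dtypes(include="number")       # variance selection applies to numeric channels
    selector = VarianceThreshold(threshold=variance_cutoff)
    selector.fit(numeric)
    kept = list(numeric.columns[selector.get_support()])  # channels with variance above the cutoff

    non_numeric = [c for c in frame.columns if c not in numeric.columns]
    return frame[kept + non_numeric]
```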
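Step 1.2 (anonymization): a minimal sketch of salted SHA-256 hashing and tokenization using Python's standard library. The salt handling, token format, and local token-store path are illustrative assumptions, not the study's exact implementation.

```python
import hashlib
import json
import secrets
from pathlib import Path

def sha256_pseudonym(identifier: str, salt: str) -> str:
    """Convert a personal identifier into a salted SHA-256 hash value."""
    return hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()

def tokenize(identifier: str, token_store: dict) -> str:
    """Replace a sensitive value with a random token; the mapping stays local."""
    token = secrets.token_hex(16)
    token_store[token] = identifier
    return token

# Illustrative usage with a hypothetical identifier and local store path.
salt = secrets.token_hex(16)
token_store = {}
patient_id = "PAT-000123"
hashed_id = sha256_pseudonym(patient_id, salt)   # hashed value accompanies data to the cloud
token = tokenize(patient_id, token_store)        # token replaces the raw identifier

Path("local_secure_store").mkdir(exist_ok=True)
Path("local_secure_store/token_map.json").write_text(json.dumps(token_store))
```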
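Step 1.3 (compression and upload preparation): a minimal sketch of writing the processed table as a ZSTD-compressed Parquet file, assuming pandas with the pyarrow engine; the file name is a placeholder.

```python
import pandas as pd

def compress_to_parquet_zstd(frame: pd.DataFrame, out_path: str) -> None:
    """Write a cleaned, anonymized table as a ZSTD-compressed Parquet file."""
    # Parquet provides columnar storage; ZSTD further reduces the on-disk footprint.
    frame.to_parquet(out_path, engine="pyarrow", compression="zstd", index=False)

# Illustrative usage with a toy table (real inputs are the processed cytometry events).
toy = pd.DataFrame({"Marker_1": [1.2, 3.4], "FSC-A": [52_000.0, 47_000.0]})
compress_to_parquet_zstd(toy, "cll_sample_001.parquet")
```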
Cloud-Based Data Management Storage
Data management and storage are facilitated by using Azure and Azure Data Lakehouse. This approach integrates two key components:
Relational database (Figure 2, Step 2.1): Microsoft SQL Server 2022 databases are used for metadata management, reference data, schema, and access policies. This ensures structured storage, efficient data retrieval, and robust schema enforcement.
Data Lakehouse (Figure 2, Step 2.2): The Data Lakehouse stores the compressed data and can handle large volumes of unstructured omics data. Its scalability and support for diverse data ingestion are critical for managing complex datasets.
To ensure data availability and resilience, the cloud storage infrastructure uses geo-replication, automatically replicating data across multiple geographically distinct data centers. This approach ensures that even if one region experiences a failure, the system can seamlessly retrieve data from another location, maintaining uninterrupted access. In conjunction with edge devices’ local buffering, geo-replication enhances disaster recovery capabilities and minimizes data loss risks during cloud outages.
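As a sketch of how the compressed files could be pushed from the edge to the Data Lakehouse, the example below uses the azure-storage-file-datalake and azure-identity packages; the account URL, file system name, and paths are placeholders, and the study does not specify its exact upload mechanism. Geo-replication itself is configured on the storage account (for example, GRS redundancy) rather than in client code.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

def upload_compressed_file(local_path: str, remote_path: str) -> None:
    """Upload a ZSTD-compressed Parquet file to Azure Data Lake Storage Gen2."""
    service = DataLakeServiceClient(
        account_url="https://<storage-account>.dfs.core.windows.net",  # placeholder account
        credential=DefaultAzureCredential(),
    )
    file_system = service.get_file_system_client(file_system="omics-lakehouse")  # placeholder name
    file_client = file_system.get_file_client(remote_path)
    with open(local_path, "rb") as handle:
        file_client.upload_data(handle, overwrite=True)

upload_compressed_file("cll_sample_001.parquet", "cytometry/cll_sample_001.parquet")
```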
Testing Strategies
To evaluate our proposed edge-computing framework, we conducted two distinct case studies. The first used a simulated dataset with CLL characteristics, and the second leveraged real patient cytometry data from both B-ALL and CLL to assess the framework’s performance under varied complexities.
Both case studies focused on assessing the effectiveness of data compression, speed, and cost at the edge layer within the hybrid framework. The experimental setups for both studies were standardized to ensure consistency in testing conditions. Across the case studies, we compared the performance of four leading compression codecs—SNAPPY, LZ4, ZSTD, and GZIP—on key metrics relevant to edge computing, such as compression ratio, network bandwidth utilization, and storage cost impact.
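The codec comparison can be reproduced with a small benchmark such as the one below, assuming the zstandard, python-snappy, and lz4 packages listed in the implementation environment; the input file path, compression levels, and metric definitions (percentage size reduction and wall-clock compression time) are illustrative choices.

```python
import gzip
import time

import lz4.frame
import snappy
import zstandard as zstd

def benchmark_codecs(raw: bytes) -> dict:
    """Return percentage size reduction and compression time for each codec."""
    codecs = {
        "ZSTD": lambda data: zstd.ZstdCompressor(level=3).compress(data),
        "GZIP": lambda data: gzip.compress(data, compresslevel=6),
        "Snappy": snappy.compress,
        "LZ4": lz4.frame.compress,
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        compressed = compress(raw)
        elapsed = time.perf_counter() - start
        reduction = 100.0 * (1 - len(compressed) / len(raw))
        results[name] = {"reduction_pct": round(reduction, 2), "seconds": round(elapsed, 3)}
    return results

# Illustrative usage on a serialized cytometry table (path is a placeholder).
with open("cytometry_events.csv", "rb") as handle:
    print(benchmark_codecs(handle.read()))
```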
Data Analysis
This study used quantitative methods to analyze the efficiency of different compression codecs, specifically focusing on their ability to reduce data size and upload times. The evaluated codecs included ZSTD, Snappy, LZ4, and GZIP, selected for their known efficiency in data compression. The evaluation involved comparing these codecs, based on their compression ratios and speed of execution, aligning with the study’s objectives of reducing storage costs and optimizing upload times. The analysis calculated the percentage reduction in data size and the improvements in upload times after compression for each codec.
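The reported metrics reduce to simple arithmetic; the sketch below computes percentage reduction and a pooled-standard-deviation effect size (Cohen's d), which is presumably the statistic behind the d values in the Results (an assumption, as the article does not state the formula explicitly). The input numbers are illustrative, not the study's per-file measurements.

```python
import statistics

def pct_reduction(original: float, compressed: float) -> float:
    """Percentage reduction in data size (or upload time) after compression."""
    return 100.0 * (original - compressed) / original

def cohens_d(group_a: list, group_b: list) -> float:
    """Effect size using the pooled standard deviation (assumed definition of d)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Illustrative values only.
print(pct_reduction(3.71, 1.60))              # ~56.9% size reduction
zstd_rates = [55.1, 56.8, 56.4]               # hypothetical per-file compression rates
gzip_rates = [50.9, 51.5, 51.6]
print(cohens_d(zstd_rates, gzip_rates))
```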
Other Design Components
Open platform marketplace (Figure 2, Step 5): This component is envisioned as an innovative open marketplace designed to foster collaboration among researchers and developers by enabling the seamless exchange and application of machine learning models. Modeled after GitHub, but specifically tailored for scientific applications, this marketplace offers flexibility in infrastructure management. Users can choose a centrally managed option, maintained by a scientific consortium or academic institution, ensuring governance, sustainability, and compliance with privacy regulations. Alternatively, users can opt for a self-hosted version, allowing organizations to retain full control over their infrastructure. A key advantage of this setup is that it eliminates the need for data to be transferred to external servers, thereby significantly enhancing data privacy and reducing bandwidth usage. Additionally, models could be deployed directly at the network's edge, which would optimize data cleaning and analysis processes. This component was not evaluated in this study; it is part of the framework and workflow process, serving as a conceptual foundation for future development and implementation. By providing flexibility and leveraging a familiar platform structure such as GitHub, the marketplace supports diverse needs within the scientific community and minimizes costs.
AI-driven querying engine (Figure 2, Step 4): The AI-driven querying engine represents a conceptual leap in data retrieval technology, using generative AI to interpret natural language queries and convert them into precise database operations. This engine would enable users to access data simply by describing their needs in plain language, eliminating the necessity for complex query syntax. This approach significantly reduces the learning curve and speeds up the adoption process, making data access more democratic, intuitive, and user-friendly. Although promising, this component was not evaluated in this study; it is part of the framework and workflow process, serving as a conceptual foundation for future development and implementation.
Reasoning for Method Selection
These methods were chosen over alternative approaches owing to their compatibility with the high dimensionality of omics data and the scalability required for near real-time edge computing. Purely on-premises or entirely cloud-based solutions were found insufficient for balancing scalability, security, and cost-effectiveness. In contrast, our hybrid edge-computing approach offers significant advantages for real-time processing by leveraging both local and cloud resources efficiently. This study adheres to the STROBE reporting guidelines,[28] as outlined by the EQUATOR network, ensuring transparency and rigor in the methodologic process.
RESULTS
The study analyzed data from two distinct sources: a simulated dataset mimicking CLL characteristics and real patient cytometry data. The simulated CLL dataset comprised 10 cellular markers and six scatter parameters, and the real patient data included cytometry results from 47.3 million records covering patients with B-ALL (6.3 million records) and CLL (41 million records).
Both case studies demonstrated that ZSTD outperforms other codecs in terms of data reduction, achieving an average compression rate of 56.09% compared to GZIP’s 51.32%, Snappy’s 20.23%, and LZ4’s 23.78%. These average compression rates were calculated by summing the percentage compression achieved for each file within the datasets and dividing by the total number of files, providing a representative value for each codec’s performance. The analysis yielded effect sizes of d = 0.95 for ZSTD compared to GZIP, d = 7.17 for ZSTD compared to Snappy, and d = 6.46 for ZSTD compared to LZ4, indicating moderate to large improvements over GZIP and substantial improvements over Snappy and LZ4. All comparisons showed statistically significant differences (p < 0.05, 95% CI). This establishes ZSTD as the most effective codec, significantly reducing storage requirements. For example, it compressed a dataset of 47.3 million records from 3.71 GB to 1.60 GB, achieving an average data size reduction of 56.87%. This reduction translates to monthly cost savings based on Azure LRS Hot storage pricing, which ranges from $1.70 to $2.08 per GB per month. Figures 3 and 4 visually summarize our findings on compression efficiency for case studies 1 and 2. Additionally, our analysis revealed that using ZSTD resulted in a 57% reduction in data upload times due to its compression efficiency.
The effect size of d = 0.95 relative to GZIP, together with the statistically significant differences reported above, supports ZSTD's superior performance in data reduction. Assuming a dataset representative of large-scale genomic projects, such as 100 million genomic reads per sample and 1000 samples, the original data size could reach approximately 40 TB. This aligns with typical sizes for whole genome sequencing (100–200 GB per sample) as reported by projects like the 1000 Genomes Project[29] and The Cancer Genome Atlas.[30] Implementing ZSTD would reduce storage needs to about 16 TB, resulting in estimated monthly cost savings of $41,779.20 to $51,148.80 in a cloud setting with standard pricing. This broader context underscores the potential impact of using ZSTD for large-scale data management in genomic research.
DISCUSSION
This study evaluated the efficacy of a hybrid edge-cloud framework in managing and analyzing cytometry data, with a particular focus on CLL and B-ALL. The experimental findings demonstrated that our framework, especially when using the ZSTD codec, substantially improved data compression and reduced upload times, affirming our hypothesis that edge computing can enhance data processing efficiency in genomic studies.
Our integration of edge computing within a hybrid framework for cytometry data is an area that has not been extensively explored in existing literature. The significant gains in data compression and upload speed are likely a result of processing data close to its generation sites, thereby reducing the latency and bandwidth usage often associated with centralized cloud processing. We demonstrated a capable framework that not only mitigates bandwidth and storage costs but also enhances the privacy and security of sensitive health data. Our findings emphasize the framework’s dual capacity to lower operational costs and safeguard sensitive health information, making it a compelling solution for handling complex datasets in high-stakes environments.
Our results align with advancements in cloud-based genomic solutions, such as the Cancer Genomics Cloud,[16] which similarly manage and analyze diverse data types. However, our hybrid framework differs significantly by introducing an edge component, enhancing local data processing capabilities, particularly in environments with intermittent or unreliable internet access. This contrasts with studies that focus solely on cloud-based processing, where upload times and compression efficiency may be affected by network latency and bandwidth limitations. By integrating edge computing, we reduce the dependency on stable internet connectivity, offering improved upload times and compression efficiency when compared to purely cloud-based models. This suggests that while existing cloud-based solutions achieve moderate efficiency, our edge-cloud hybrid model provides superior performance, especially in clinical settings with limited infrastructure. Furthermore, although other frameworks have adopted cloud-based models, few have embraced the hybrid approach because of infrastructure complexities and regulatory constraints. By addressing these challenges, our framework demonstrates its versatility in managing complex health data and improving operational efficiency without relying exclusively on cloud connectivity and without sacrificing compliance or performance.
Although ZSTD proved most effective for cytometry data compression, its efficacy may vary with other data types. As different genomic and clinical data formats emerge, exploring tailored codec options optimized for specific data types may be beneficial. Future research could help establish guidelines for codec selection, aligning data types with specific compression techniques to standardize and enhance processing efficiencies.
Although the proposed framework shows significant promise, several limitations must be acknowledged. Firstly, the reliance on cytometry datasets may not fully capture the complexities and variances encountered in genomics and omics data. Additionally, the initial setup and configuration of the hybrid edge-cloud infrastructure can be resource-intensive and may pose challenges for smaller organizations. The framework’s performance and scalability need further validation through real-world implementations and diverse datasets to ensure its robustness across different use cases. Another limitation is that the envisioned marketplace component for machine learning models and the AI-querying engine component have not been evaluated within this study.
Real-world implementation of the framework may encounter challenges such as the initial setup’s resource intensity, the need for technical expertise in edge-cloud integration, and ensuring compliance with privacy regulations like General Data Protection Regulation (GDPR) and HIPAA. For practical adoption, a gradual rollout in low-stakes environments is recommended to refine the configuration process while addressing privacy, security, and integration concerns. Such an approach allows for controlled testing and adjustments, ensuring that the framework meets regulatory standards and integrates smoothly with existing healthcare infrastructure.
Future work will focus first on optimizing data-cleaning techniques and refining compression processes at the edge to enhance real-time processing capabilities and reduce computational overhead. Following this, the development of a dynamic marketplace for machine learning models will be prioritized, allowing researchers to collaborate by submitting, evaluating, and using models, thereby fostering innovation and increasing the framework’s adaptability across various genomics projects. Additionally, we plan to explore dynamic data compression strategies tailored to diverse genomic and omics data types, which will enhance processing flexibility and efficiency as new data formats emerge. Finally, validating the framework’s scalability in diverse healthcare environments and data scenarios will be essential to ensure its robustness and practicality in real-world settings, particularly under varying infrastructure conditions. These steps are structured to streamline the framework’s real-world applicability and maximize its potential in advancing healthcare innovation.
CONCLUSION
This study demonstrates that a hybrid edge-cloud framework significantly optimizes omics data management, offering a transformative approach that balances efficiency, cost, and privacy factors critical for the advancement of future healthcare applications. The evaluation focused on cytometry data processing efficiency, specifically targeting CLL and B-ALL, and highlighted the advantages of integrating edge computing within this framework. A key finding is that this integration significantly reduces data size and upload times. Notably, the ZSTD compression codec was identified as a particularly effective tool, optimizing data-handling capacities in terms of both speed and cost. Although these results are promising, it is important to acknowledge that the study's findings are based on specific data types and settings. The demonstrated benefits highlight the potential of edge computing in omics research, suggesting a valuable direction for future investigations to expand on these findings. Future research will expand this work by developing a machine learning model marketplace for secure, edge-based deployment, enhancing collaboration while maintaining data privacy. Additionally, dynamic compression strategies tailored to evolving genomic and omics data types will be explored to optimize performance. The framework's scalability will be validated across broader genomic and omics applications, such as whole genome sequencing, and tested in diverse clinical environments, including those with limited technologic infrastructure, ensuring practical, real-world applicability.
References
Competing Interests
Sources of Support: None. Conflict of Interest: None.