A Deep Learning-Based Data Minimization Algorithm for Fast and Secure Transfer of Big Genomic Datasets
Author: | Mohammed Aledhari, Fahad Saeed, Mohamed Hefeida, Marianne Di Pierro |
---|---|
Year of publication: | 2021 |
Subject: |
0301 basic medicine
Information Systems and Management; File Transfer Protocol; Hypertext Transfer Protocol; 020205 medical informatics; business.industry; Computer science; computer.internet_protocol; Deep learning; Bandwidth (signal processing); Code word; 02 engineering and technology; computer.software_genre; 03 medical and health sciences; 030104 developmental biology; 0202 electrical engineering electronic engineering information engineering; Artificial intelligence; Minification; Data mining; business; computer; Information Systems; Data transmission; Communication channel |
Source: | IEEE Transactions on Big Data. 7:271-284 |
ISSN: | 2372-2096 |
DOI: | 10.1109/tbdata.2018.2805687 |
Description: | In the age of big genomic data, institutions such as the National Human Genome Research Institute (NHGRI) are challenged in their efforts to share large volumes of data between researchers, a process that has been plagued by unreliable transfers and slow speeds. These problems stem from the throughput bottlenecks of traditional transfer technologies. Two factors that affect the efficiency of data transmission are the channel bandwidth and the amount of data. Increasing the bandwidth is one way to transmit data efficiently, but it might not always be possible due to resource limitations. Another way to maximize channel utilization is to decrease the number of bits needed to transmit a dataset. Traditionally, big genomic data is transmitted between two geographical locations using general-purpose protocols, such as hypertext transfer protocol (HTTP) and file transfer protocol (FTP) secure. In this paper, we present a novel deep learning-based data minimization algorithm that 1) minimizes the datasets during transfer over the carrier channels; and 2) protects the data from man-in-the-middle (MITM) and other attacks by changing the binary representation (content-encoding) several times for the same dataset: we assign different codewords to the same character in different parts of the dataset. Our data minimization strategy exploits the limited alphabet of DNA sequences and modifies the binary representation (codeword) of dataset characters using a deep learning-based convolutional neural network (CNN), so that the high-frequency characters are assigned the shortest codewords in each time slot of the transfer. This algorithm ensures transmission of big genomic DNA datasets with minimal bits and latency and yields an efficient and expedient process. 
Our tested heuristic model, simulation, and real implementation results indicate that the proposed data minimization algorithm is up to 99 times faster and more secure than the content-encoding scheme currently used in HTTP, and 96 times faster than FTP, on the tested datasets. The protocol, developed in C#, will be available to the wider genomics community and domain scientists. |
Database: | OpenAIRE |
External link: |
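
The description's idea of assigning different codewords to the same character in different parts of the dataset can be illustrated with a small Python sketch. This is not the authors' algorithm: the CNN is replaced here by plain per-block frequency counting, and the codeword set, block size, and function names (`CODEWORDS`, `block_table`, `encode`, `decode`) are hypothetical choices made only for illustration.

```python
from collections import Counter

# Hypothetical prefix-free codewords, shortest first (not taken from the paper).
CODEWORDS = ["0", "10", "110", "111"]


def block_table(block):
    """Map each character of a block to a codeword; the most frequent gets the shortest."""
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(Counter(block).items(), key=lambda kv: (-kv[1], kv[0]))
    if len(ranked) > len(CODEWORDS):
        raise ValueError("block uses more than four distinct characters")
    return {ch: CODEWORDS[i] for i, (ch, _) in enumerate(ranked)}


def encode(sequence, block_size):
    """Encode each block with its own table; return a list of (table, bits) pairs."""
    encoded = []
    for start in range(0, len(sequence), block_size):
        block = sequence[start:start + block_size]
        table = block_table(block)
        encoded.append((table, "".join(table[ch] for ch in block)))
    return encoded


def decode(encoded):
    """Invert encode() by reading each bit string with that block's table."""
    chars = []
    for table, bits in encoded:
        inverse = {code: ch for ch, code in table.items()}
        current = ""
        for bit in bits:
            current += bit
            if current in inverse:  # prefix-free, so the first match is the codeword
                chars.append(inverse[current])
                current = ""
    return "".join(chars)


if __name__ == "__main__":
    sequence = "AAAACGTTTTTGCA"
    encoded = encode(sequence, block_size=7)
    for table, bits in encoded:
        print(table, bits)
    print(decode(encoded) == sequence)
```

In this toy input, `A` is the most frequent character in the first block and `T` in the second, so the codeword for `A` changes between blocks. A receiver would also need each block's table (or a way to derive it), which this sketch simply passes along in the clear, so it makes no security claim of its own.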