Author: |
Saja Taha Ahmed, Loay E. George |
Language: |
English |
Year of publication: |
2022 |
Subject: |
|
Source: |
Journal of King Saud University: Computer and Information Sciences, Vol 34, Iss 7, Pp 4669-4678 (2022) |
Document type: |
article |
ISSN: |
1319-1578 |
DOI: |
10.1016/j.jksuci.2021.04.005 |
Description: |
Data reduction has gained growing emphasis due to the rapid, unsystematic increase in digital data and has become a sensible approach in big data systems. Data deduplication is a technique for optimizing storage requirements and plays a vital role in eliminating redundancy in large-scale storage. Although deduplication is robust at finding suitable chunk-level breakpoints for redundancy elimination, it faces three key problems: (1) low chunking performance, which makes the chunking stage a bottleneck; (2) a large variation in chunk size, which reduces deduplication efficiency; and (3) hash-computing overhead. To handle these challenges, this paper proposes a technique for finding proper cut-points among chunks using a set of commonly repeated patterns (CRP): it picks the most frequent sequences of adjacent bytes (i.e., contiguous segments of bytes) as breakpoints. In addition, a scalable lightweight triple-leveled hashing function (LT-LH) is proposed to mitigate the processing and storage overhead of the hashing function; the number of hash levels used in the tests was three, a number that depends on the size of the data to be deduplicated. To evaluate the performance of the proposed technique, a set of tests was conducted to analyze the dataset characteristics and choose a near-optimal length for the byte sequences used as divisors to produce chunks. The performance assessment also covers determining the system parameter values that improve the deduplication ratio and reduce the system resources needed for data deduplication. The results demonstrate that the CRP algorithm is 15 times faster than the basic sliding window (BSW) approach and about 10 times faster than two thresholds two divisors (TTTD). The proposed LT-LH is five times faster than Secure Hash Algorithm 1 (SHA1) and Message-Digest Algorithm 5 (MD5), with better storage saving. |
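The core idea described in the abstract, cutting chunks immediately after an occurrence of a frequent byte pattern and then deduplicating chunks by hash, can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the pattern set, the minimum/maximum chunk sizes, and the use of SHA-1 in place of the proposed LT-LH function are all assumptions made here for demonstration.

```python
import hashlib


def chunk_by_patterns(data: bytes, patterns: set, pat_len: int,
                      min_size: int = 64, max_size: int = 8192) -> list:
    """Split data into chunks, cutting just after any occurrence of a
    frequent byte pattern (a stand-in for the paper's CRP divisors)."""
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + max_size, len(data))
        cut = end  # fall back to the maximum chunk size if no pattern hits
        # enforce a minimum chunk size before searching for a breakpoint
        j = start + min_size
        while j + pat_len <= end:
            if data[j:j + pat_len] in patterns:
                cut = j + pat_len  # breakpoint right after the pattern
                break
            j += 1
        chunks.append(data[start:cut])
        start = cut
    return chunks


def dedupe(chunks):
    """Keep one stored copy per distinct chunk, keyed by its digest
    (SHA-1 here, purely as a placeholder for the proposed LT-LH)."""
    store = {}
    for c in chunks:
        store.setdefault(hashlib.sha1(c).hexdigest(), c)
    return store
```

Because identical byte runs produce identical breakpoints, repeated content yields repeated chunks that hash to the same key, which is what allows the duplicate copies to be dropped.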
Database: |
Directory of Open Access Journals |
External link: |
|