Popis: |
Most oil companies hold a substantial volume of physical subsurface data generated during exploration studies. This data has been collected over many decades and exists in various formats, such as tapes, cartridges, CDs, DVDs, and paper media comprising maps, technical well reports, and seismic logs. These items are usually stored in large offsite repositories across the globe and are maintained by third-party vendors. Access to this historical data is crucial for oil companies because it helps identify potential prospects for oil extraction, which would otherwise require an exploratory study by geologists using satellite imagery, surface rocks, terrain, and seismology. Storing large volumes of technical data in offsite repositories also poses key challenges, such as high storage cost, long retrieval times, and inaccessibility of information. To address these challenges, companies are digitizing the physical data and complementing it with rich metadata extracted by Optical Character Recognition (OCR). This introduces further technical challenges in dealing with low dots-per-inch (DPI) scans, poor-quality scans, and very large files. Several frameworks have been developed that store the data in local repositories, but they are limited in the number of documents they can process, the file sizes they can handle, and storage scalability. To address these problems, we present a cloud-based high-performance computing framework that stores the digitized data in the cloud, enriches metadata through OCR combined with image enhancement by a series of Image Processing (IP) techniques, and provides high data availability to users through cloud-based search. We have tested this framework at large scale on data from a major oil and gas company, and the results are encouraging. Although this paper addresses a problem in the oil industry, the proposed framework can be applied to other domains that hold large volumes of physical data. |