Author: |
Mukaddem KT; Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K., Beard EJ; Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.; ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K., Yildirim B; Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K., Cole JM; Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.; ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.; Department of Chemical Engineering and Biotechnology, University of Cambridge, West Cambridge Site, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K. |
Abstract: |
The rise of data science is leading to new paradigms in data-driven materials discovery. This carries an essential notion that large data sources containing chemical structure and property information can be mined in a fashion that detects and exploits structure-property relationships, such that chemicals can be predicted to suit a given material application. The success of material predictions is predicated on these large data sources of chemical structure and property information being suited to a target application. Microscopy is commonly used to characterize chemical structure, especially in fields such as nanotechnology where material properties are highly dependent on the size and shape of nanoparticles. Large data sources of nanoparticle information stemming from microscopy images would thus be highly beneficial. Millions of microscopy images exist, but they lie fragmented across the literature, typically presented individually within an article and usually in a qualitative fashion, even though they harbor a wealth of numeric information. We present the ImageDataExtractor toolkit, which automatically identifies and extracts microscopy images from scientific documents and then autonomously analyzes each image to produce quantitative particle size and shape information about its subject material. Each image is quantified by decoding its scale bar information using optical character recognition, with help from super-resolution convolutional neural networks where required. Individual particles are detected and profiled using various thresholding, segmentation, polygon fitting, and edge correction routines. The high-throughput operational capability of ImageDataExtractor means that it can be used to generate large data sources of particle information for data-driven materials discovery. The evaluation metrics, precision and recall, are greater than 80% for the majority of the image processing steps, and precision is above 80% for all critical steps. 
The ImageDataExtractor tool is released under the MIT license and is available to download from http://www.imagedataextractor.org. |
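
The idea of quantifying particles from a scale bar, as described in the abstract, can be illustrated with a minimal sketch. This is not ImageDataExtractor's code or API: the function names (`threshold`, `label_particles`, `particle_areas_nm2`) are hypothetical, and a fixed intensity cutoff plus a simple flood fill stand in for the toolkit's thresholding and segmentation routines. The scale bar length in nanometres and in pixels is assumed to be already known (in the toolkit it would be decoded by optical character recognition).

```python
from collections import deque


def threshold(image, cutoff):
    """Binarize a grayscale image: True where intensity exceeds the cutoff."""
    return [[px > cutoff for px in row] for row in image]


def label_particles(mask):
    """Group foreground pixels into 4-connected regions (one per particle)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    particles = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            particles.append(pixels)
    return particles


def particle_areas_nm2(image, cutoff, scale_bar_nm, scale_bar_px):
    """Convert each particle's pixel count to an area in square nanometres."""
    nm_per_px = scale_bar_nm / scale_bar_px  # assumed known from the scale bar
    mask = threshold(image, cutoff)
    return [len(p) * nm_per_px ** 2 for p in label_particles(mask)]


# Toy 5x6 "micrograph" with two bright particles on a dark background.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0, 0],
    [0, 9, 9, 0, 8, 0],
    [0, 0, 0, 0, 8, 0],
    [0, 0, 0, 0, 0, 0],
]

# A 100 nm scale bar spanning 50 px gives 2 nm per pixel.
areas = particle_areas_nm2(image, cutoff=5, scale_bar_nm=100, scale_bar_px=50)
print(areas)  # [16.0, 8.0]
```

In this toy case the 4-pixel and 2-pixel particles map to 16 and 8 square nanometres. A real pipeline would also need the polygon fitting and edge correction steps mentioned in the abstract, which this sketch omits.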