Popis: |
The amount of online text available is continuously growing and has reached hundreds of billions of words. A great deal of research has used this data to improve results on a variety of problems, yet the algorithms involved are typically optimized, tested, and compared after training on corpora of only one million words or fewer. Most research focuses on the accuracy of the results these algorithms generate, often overlooking their running time or the cost of running them. The main goal of this paper is to show the effect that large data has on the running time and performance of such algorithms in Natural Language Processing. To achieve this goal, three Named Entity Recognition (NER) tools were selected, and we evaluated the trade-off between quality and running time, and the effect of increasing data size on performance, across a selection of leading tools in the NER domain. The results show that the existing tools cannot cope with increasing data size: as the data size grows, quality improves but performance degrades, rendering the existing tools inefficient. By optimizing these tools, large data sizes can be processed; unfortunately, latency remains high.