Source: |
Grover, C, Matheson, C, Mikheev, A & Moens, M 2000, LT TTT - A Flexible Tokenisation Tool. in Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000, 31 May - 2 June 2000, Athens, Greece. https://doi.org/10.1.1.43.4952 |
Description: |
We describe LT TTT, a recently developed software system which provides tools to perform text tokenisation and mark-up. The system includes ready-made components to segment text into paragraphs, sentences, words and other kinds of token but, crucially, it also allows users to tailor rule-sets to produce mark-up appropriate for particular applications. We present three case studies of our use of LT TTT: named-entity recognition (MUC-7), citation recognition and mark-up, and the preparation of a corpus in the medical domain. We conclude with a discussion of the use of browsers to visualise marked-up text. |