@PhilosTEI: Building Corpora for Philosophers
Author: | Betti, A., Reynaert, M., Berg, H. van den, Odijk, J., Hessen, A. van |
---|---|
Contributors: | Odijk, J., Hessen, A. van, ILLC (FGw), Logic and Language (ILLC, FNWI/FGw), ILLC (FNWI/FGw), Creative Computing |
Year of publication: | 2017 |
Subject: |
OCR post-correction;
Matching (statistics); Information retrieval; History and philosophy of science and technology; @PhilosTEI; business.industry; Computer science; Software for humanities; Field (computer science); Language & Communication; Moment (mathematics); Software; Workflow; Textual and linguistic corpora; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; Tesseract; History of ideas and intellectual history; Language & Speech Technology; business; TICCL |
Source: | Odijk, J.; Hessen, A. van (eds.), CLARIN in the Low Countries, pp. 379-392. London: Ubiquity Press |
Description: | For philosophers to be able to take a computational turn in their field, especially if that field relies heavily on historical material, it is crucial to be able to build high-quality, easily and freely accessible corpora in a sustainable format, composed of multi-language, multi-script books from different historical periods. At the moment, corpora matching these needs are virtually non-existent. Within the CLARIN-NL project @PhilosTEI, we have addressed the problem of building corpora of this kind by developing an open-source, web-based, user-friendly workflow from textual images to TEI, based on the state-of-the-art open-source OCR software Tesseract and a multi-language version of TICCL, a powerful OCR post-correction tool. We have demonstrated the utility of the @PhilosTEI tool by applying it to a multilingual, multi-script corpus of important 18th- to 20th-century European philosophical texts. |
Database: | OpenAIRE |
External link: |