La production de corpus d'occitan médiéval et prémoderne: problèmes et perspectives de travail
Author: | Jean-Baptiste Camps, Gilles Guilhem Couffignal |
---|---|
Contributors: | Centre Jean Mabillon (CJM), École nationale des chartes (ENC), Université Paris sciences et lettres (PSL), Patrimoine, Littérature, Histoire (PLH), Université Toulouse - Jean Jaurès (UT2J), Association internationale d'études occitanes (AIEO) |
Language: | French |
Year of publication: | 2017 |
Subject: | FOS: Computer and information sciences; Artificial intelligence; Xml-tei corpora; [SHS.LITT]Humanities and Social Sciences/Literature; Computer Vision and Pattern Recognition (cs.CV); Philologie romane; Computer Science - Computer Vision and Pattern Recognition; reconnaissance des écritures manuscrites; [INFO.INFO-CV]Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV]; Occitan; [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing; handwritten text recognition; Optical character recognition; OCR; Reconnaissance optique de caractères; Romance philology; Intelligence artificielle; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; Lemmatisation; [SHS.HIST]Humanities and Social Sciences/History |
Source: | Actes du XIIe Congrès de l’Association internationale d’études occitanes Albi, 2017, Association internationale d'études occitanes (AIEO), Jul 2017, Albi, France; HAL |
Description: | At a time when the quantity of (more or less freely) available data is increasing significantly, thanks to digital corpora, editions and libraries, the development of data mining tools and deep learning methods allows researchers to build a study corpus tailored to their research, to enrich their data and to exploit them. Open optical character recognition (OCR) tools can be adapted to old prints, incunabula or even manuscripts, with usable results, allowing the rapid creation of textual corpora. Alternating training and correction phases makes it possible to improve the quality of the results by rapidly accumulating raw text data. These data can then be structured, for example in XML/TEI, and enriched. The enrichment of the texts with graphic or linguistic annotations can also be automated. These processes, well known to linguists and effective for modern languages, present difficulties for languages such as Medieval Occitan, due in part to the absence of sufficiently large lemmatized corpora. Suggestions for the creation of tools adapted to the considerable spelling variation of ancient languages will be presented, as well as experiments in the lemmatization of Medieval and Premodern Occitan. These techniques open the way to many uses. The much-desired increase in the amount of available quality texts and data makes it possible to improve digital philology methods, provided everyone takes the trouble to make their data freely available online and reusable. By presenting different technical solutions and some micro-analyses as examples, this paper aims to show part of what digital philology can offer to researchers in the Occitan domain, while recalling the ethical considerations that underpin such practices. Comment: in French. Actes du XIIe Congrès de l'Association internationale d'études occitanes Albi, 2017, Association internationale d'études occitanes (AIEO), Jul 2017, Albi, France |
Database: | OpenAIRE |
External link: |