On the Strength of Character Language Models for Multilingual Named Entity Recognition
Authors: | Mark Sammons, Dan Roth, Stephen Mayhew, Xiaodong Yu |
---|---|
Year of publication: | 2018 |
Subject: |
FOS: Computer and information sciences
Computer Science - Computation and Language; Computer science; Property (programming); business.industry; Character (computing); 020206 networking & telecommunications; 02 engineering and technology; computer.software_genre; Computer Science - Information Retrieval; Task (project management); Named entity; Set (abstract data type); Named-entity recognition; Simple (abstract algebra); 0202 electrical engineering electronic engineering information engineering; 020201 artificial intelligence & image processing; Artificial intelligence; Language model; business; Computation and Language (cs.CL); computer; Information Retrieval (cs.IR); Natural language processing |
Source: | EMNLP |
DOI: | 10.18653/v1/d18-1345 |
Description: | Character-level patterns have been widely used as features in English Named Entity Recognition (NER) systems. However, to date there has been no direct investigation of the inherent differences between name and non-name tokens in text, nor whether this property holds across multiple languages. This paper analyzes the capabilities of corpus-agnostic Character-level Language Models (CLMs) in the binary task of distinguishing name tokens from non-name tokens. We demonstrate that CLMs provide a simple and powerful model for capturing these differences, identifying named entity tokens in a diverse set of languages at close to the performance of full NER systems. Moreover, by adding very simple CLM-based features we can significantly improve the performance of an off-the-shelf NER system for multiple languages. (5 pages, EMNLP 2018 short paper.) |
Database: | OpenAIRE |