Deep Learning Based Unsupervised POS Tagging for Sanskrit
Authors: | Vrashabh Prasad Jain, Kushal Chauhan, Joydip Dhar, Deepanshu Aggarwal, Prakhar Srivastava, Anupam Shukla |
---|---|
Year of publication: | 2018 |
Subject: | Computer science; Deep learning; Dimensionality reduction; Autoencoder; Robustness (computer science); Word2vec; Artificial intelligence; Sanskrit; Cluster analysis; Natural language processing; Vector space; business.industry; computer.software_genre; language.human_language; 02 engineering and technology; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; 03 medical and health sciences; 0302 clinical medicine; 030221 ophthalmology & optometry |
Source: | Proceedings of the 2018 International Conference on Algorithms, Computing and Artificial Intelligence. |
DOI: | 10.1145/3302425.3302487 |
Abstract: | In this paper, we present a deep learning based approach to assigning POS tags to the words of an input text. We propose an unsupervised approach owing to the lack of a large annotated Sanskrit corpus, and use the untagged Sanskrit corpus prepared by JNU for this purpose. The only tagged corpus for Sanskrit, also created by JNU, contains 115,000 words, which is not sufficient for supervised deep learning approaches. We use this tagged corpus to assign tags and to measure model accuracy. We explore various methods by which each Sanskrit word can be represented as a point in a multi-dimensional vector space whose position captures the word's meaning and its associated semantic information. We also explore other data sources to improve the performance and robustness of the vector representations. We use these rich vector representations and explore autoencoder based approaches to dimensionality reduction, compressing them into encodings that are suitable for clustering in the vector space. We experiment with different dimensions for these compressed representations and present the one found to offer the best clustering performance. To model the word sequence and preserve semantic information, we feed these embeddings to a bidirectional LSTM autoencoder. We assign a POS tag to each of the obtained clusters and produce our result by testing the model on the tagged corpus. |
Database: | OpenAIRE |
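The last step described in the abstract, assigning a POS tag to each cluster and scoring the result against the tagged corpus, could be sketched roughly as follows. The record does not say how the cluster-to-tag mapping is chosen; this sketch assumes a simple majority vote over the gold tags of each cluster's members. The function names, the toy cluster ids, and the tag labels (`NOUN`, `VERB`, `ADJ`) are hypothetical and are not taken from the paper or from the JNU tagset.

```python
from collections import Counter, defaultdict


def assign_cluster_tags(cluster_ids, gold_tags):
    """Map each cluster id to the most frequent gold tag among its members.

    Assumption: majority vote; the paper's actual assignment rule is not
    given in this record.
    """
    counts = defaultdict(Counter)
    for cluster, tag in zip(cluster_ids, gold_tags):
        counts[cluster][tag] += 1
    return {
        cluster: tag_counts.most_common(1)[0][0]
        for cluster, tag_counts in counts.items()
    }


def tagging_accuracy(cluster_ids, gold_tags, cluster_to_tag):
    """Fraction of words whose cluster's assigned tag equals the gold tag."""
    correct = sum(
        cluster_to_tag[cluster] == tag
        for cluster, tag in zip(cluster_ids, gold_tags)
    )
    return correct / len(gold_tags)


# Toy data (hypothetical): one cluster id and one gold tag per word.
cluster_ids = [0, 0, 0, 1, 1, 2]
gold_tags = ["NOUN", "NOUN", "VERB", "VERB", "VERB", "ADJ"]

mapping = assign_cluster_tags(cluster_ids, gold_tags)
accuracy = tagging_accuracy(cluster_ids, gold_tags, mapping)
print(mapping)
print(accuracy)
```

In a real setting the cluster ids would come from clustering the compressed word encodings, and the mapping would typically be fitted on one portion of the tagged corpus and evaluated on another to avoid an optimistic accuracy estimate.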