Deep Learning Based Unsupervised POS Tagging for Sanskrit

Authors: Vrashabh Prasad Jain, Kushal Chauhan, Joydip Dhar, Deepanshu Aggarwal, Prakhar Srivastava, Anupam Shukla
Publication year: 2018
Subject:
Source: Proceedings of the 2018 International Conference on Algorithms, Computing and Artificial Intelligence.
DOI: 10.1145/3302425.3302487
Description: In this paper, we present a deep learning based approach for assigning POS tags to the words of an input text. We propose an unsupervised approach because no large annotated Sanskrit corpus is available, and we use the untagged Sanskrit corpus prepared by JNU for this purpose. The only tagged corpus for Sanskrit, also created by JNU, contains 115,000 words, which is not sufficient for supervised deep learning approaches. We use this tagged corpus for tag assignment and for determining model accuracy. We explore various methods for representing each Sanskrit word as a point in a multi-dimensional vector space whose position accurately captures the word's meaning and associated semantic information. We also explore other data sources to improve the performance and robustness of the vector representations. We use these rich vector representations and explore autoencoder based approaches for dimensionality reduction, compressing them into encodings that are suitable for clustering in the vector space. We experiment with different dimensions for these compressed representations and present the one found to offer the best clustering performance. To model the sequence while preserving semantic information, we feed these embeddings to a bidirectional LSTM autoencoder. We assign a POS tag to each of the resulting clusters and produce our results by testing the model on the tagged corpus.
Database: OpenAIRE
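
The final step in the description, assigning a POS tag to each cluster and scoring the result against the tagged corpus, can be illustrated with a minimal sketch. The abstract does not state how tags are assigned to clusters, so majority voting over the gold tags is an assumption here, and the function names, cluster ids, and toy tags are hypothetical rather than taken from the paper or the JNU tagset.

```python
from collections import Counter, defaultdict


def map_clusters_to_tags(cluster_ids, gold_tags):
    """Give each cluster the most frequent gold tag among its members.

    Assumption: majority voting; the paper may use a different rule.
    """
    votes = defaultdict(Counter)
    for cluster, tag in zip(cluster_ids, gold_tags):
        votes[cluster][tag] += 1
    return {cluster: counts.most_common(1)[0][0]
            for cluster, counts in votes.items()}


def tagging_accuracy(cluster_ids, gold_tags, cluster_to_tag):
    """Fraction of words whose cluster's assigned tag matches the gold tag."""
    correct = sum(cluster_to_tag.get(cluster) == tag
                  for cluster, tag in zip(cluster_ids, gold_tags))
    return correct / len(gold_tags)


# Hypothetical toy data: one cluster id and one gold tag per word.
cluster_ids = [0, 0, 1, 1, 1, 2]
gold_tags = ["NOUN", "NOUN", "VERB", "VERB", "NOUN", "ADJ"]

mapping = map_clusters_to_tags(cluster_ids, gold_tags)
accuracy = tagging_accuracy(cluster_ids, gold_tags, mapping)
print(mapping, accuracy)
```

In a real setting the cluster ids would come from clustering the compressed word encodings, and the mapping would ideally be derived from one portion of the tagged corpus and scored on another.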