Support Vector Machine based Named Entity Recognition for Sinhala

Mallikarachchi, P.S.; Lorensuhewa, S.A.S.; Kalyani, M.A.L.

Files

Page 113 - IRCUWU2021-246 -Mallikarachchi- Support Vector Machine Based Named Entity Recognition for Sinhala.pdf (149.97 KB)

Date

2021

Authors

Mallikarachchi, P.S.

Lorensuhewa, S.A.S.

Kalyani, M.A.L.

Publisher

Uva Wellassa University of Sri Lanka

Abstract

Named Entity Recognition (NER) is the task of identifying Named Entities (NEs) in human language text and classifying them. NER is a fundamental subtask that facilitates more complex tasks such as automatic text summarization and question answering. Automated language tools are largely a solved problem for resource-rich languages such as English, but for Sinhala, a low-resourced South Asian language, only a few prior works exist. Systems developed for English cannot be used directly for Indo-Aryan languages. Prior attempts at Sinhala NER have used only Conditional Random Fields (CRF) and Maximum Entropy (ME). For other low-resourced Indo-Aryan languages, several other algorithms have been used, and among them Support Vector Machines (SVM) have given more prominent results. In this paper, we present a novel NER system for Sinhala using SVM. We consider only the PER (person), LOC (location) and ORG (organization) tags. Since this is a data-driven approach, preprocessing of the training data is a crucial task. The most suitable format for the training data is the word-per-line format (CoNLL-2002). For a more extended classification task, the Beginning-Inside-Outside (BIO) tagging scheme was followed, increasing the total number of tags to 7. The dataset consisted of 100,000 tokens. We first observed that performance increases with the size of the training data. Since prior works have shown the effect of language features, we next observed the behavior of different feature combinations and found that gazetteers, clue words, word length and Part-of-Speech features are the most effective for PER and LOC. The same features, excluding word length, are the best for ORG. Both sets of tags demonstrated the effect of gazetteers with SVM. Next, we set up experiments to observe the impact of word lengths of 4, 5, 6 and 7; lengths of 4 and 5 were best suited to the purpose of this work. As future work, we plan to experiment with the influence of varying the kernels, context and degree while expanding the training data.

Keywords: SVM; NER; NLP; BIO
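
The word-per-line format and the 7-tag BIO scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the whitespace-separated "token tag" column layout, and the English placeholder tokens are all assumptions made for illustration, and the actual dataset may carry additional columns (for example Part-of-Speech tags).

```python
# Illustrative sketch only; names, layout and sample tokens are hypothetical.
ENTITY_TYPES = ["PER", "LOC", "ORG"]


def bio_tagset(entity_types):
    """Build the BIO tag set: B- and I- for each entity type, plus O."""
    tags = [f"{prefix}-{etype}" for etype in entity_types for prefix in ("B", "I")]
    return tags + ["O"]


def parse_word_per_line(text):
    """Parse word-per-line text into sentences of (token, tag) pairs.

    Assumes one 'token tag' pair per line and a blank line between sentences.
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.rsplit(maxsplit=1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence):
    """Group B-/I- tagged tokens into (entity text, entity type) spans."""
    entities, tokens, etype = [], [], None
    for token, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != etype
        )
        if starts_new:
            if tokens:
                entities.append((" ".join(tokens), etype))
            tokens, etype = [token], tag[2:]
        elif tag.startswith("I-"):
            tokens.append(token)
        else:  # an O tag closes any open entity
            if tokens:
                entities.append((" ".join(tokens), etype))
            tokens, etype = [], None
    if tokens:
        entities.append((" ".join(tokens), etype))
    return entities


# Hypothetical placeholder data; the real corpus contains Sinhala tokens.
SAMPLE = """\
Nimal B-PER
Perera I-PER
visited O
Colombo B-LOC

Uva B-ORG
Wellassa I-ORG
University I-ORG
"""

sentences = parse_word_per_line(SAMPLE)
print(bio_tagset(ENTITY_TYPES))
print(extract_entities(sentences[0]))
```

Three entity types with a B- and an I- variant each, plus the single O tag, give the 7 tags mentioned in the abstract.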

Keywords

Computing and Information Science, Language, Information Science

URI

http://www.erepo.lib.uwu.ac.lk/bitstream/handle/123456789/9578/Page%20113%20-%20IRCUWU2021-246%20-Mallikarachchi-%20Support%20Vector%20Machine%20Based%20Named%20Entity%20Recognition%20for%20Sinhala.pdf?sequence=1&isAllowed=y

Collections

International Research Conference of UWU-2021
