Browsing by Author "Kalyani, M.A.L."
Item: Support Vector Machine based Named Entity Recognition for Sinhala (Uva Wellassa University of Sri Lanka, 2021)
Authors: Mallikarachchi, P.S.; Lorensuhewa, S.A.S.; Kalyani, M.A.L.

Named Entity Recognition (NER) is the task of identifying Named Entities (NEs) in natural-language text and classifying them. NER is a fundamental subtask that supports more complex tasks such as automatic text summarisation and question answering. Automated language tools are now largely a solved problem for resource-rich languages like English, but for Sinhala, a low-resourced South Asian language, only a few prior works exist. Unfortunately, systems developed for English cannot be used directly for Indo-Aryan languages. Among prior attempts at Sinhala NER, only Conditional Random Fields (CRF) and Maximum Entropy (ME) have been used, whereas for other low-resourced Indo-Aryan languages several other algorithms have been applied, among which Support Vector Machines (SVM) have given the most prominent results. In this paper we present a novel NER system for Sinhala using SVM, considering only the PER (person), LOC (location) and ORG (organization) tags. Since this is a data-driven approach, preprocessing of the training data is a crucial task; the most suitable format for the training data is the word-per-line (CoNLL-2002) format. For a more fine-grained classification task, the Beginning-Inside-Outside (BIO) tagging scheme was followed, increasing the total number of tags to seven. The dataset consisted of 100,000 tokens. First, we observed that performance increases with the size of the training data. As prior works have shown the effect of language features, we next examined the behaviour of different feature combinations and found gazetteers, clue words, word length and Part-of-Speech to be the most effective features for PER and LOC.
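The BIO expansion described above, where the three entity classes PER, LOC and ORG become seven tags once B-/I- prefixes and the O tag are added, can be sketched as follows. This is a hypothetical helper, not the authors' code, and it assumes annotations arrive as (token, entity-or-None) pairs:

```python
def to_bio(tokens_with_entities):
    """Convert (token, entity_or_None) pairs to BIO tags.

    Consecutive tokens sharing the same entity label are treated as
    one named entity (adjacent distinct entities of the same class
    would be merged; a real annotation format carries span boundaries).
    """
    tags = []
    prev = None
    for token, ent in tokens_with_entities:
        if ent is None:
            tags.append((token, "O"))        # outside any entity
        elif ent == prev:
            tags.append((token, "I-" + ent))  # inside a running entity
        else:
            tags.append((token, "B-" + ent))  # beginning of an entity
        prev = ent
    return tags
```

Writing each resulting (token, tag) pair on its own line then yields the word-per-line CoNLL-2002 format the abstract identifies as most suitable for training.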
Excluding word length from the above-mentioned features, the remaining set is the best for ORG. Ultimately, both tag sets confirmed the effectiveness of gazetteers with SVM. We then set up experiments to observe the impact of word lengths of 4, 5, 6 and 7; lengths of 4 and 5 were best matched to the purpose of this work. As future work, we plan to experiment with the influence of varying the kernel, context and degree while expanding the training data.

Keywords: SVM; NER; NLP; BIO

Item: Word Embedding as Word Representations for Clustering Sinhala News Documents (Uva Wellassa University of Sri Lanka, 2021)
Authors: Weerasiri, R.I.; Lorensuhewa, S.A.S.; Kalyani, M.A.L.

A huge number of text documents, including news articles, is created daily and added to many sources such as the internet, and manually clustering or classifying these documents into related fields has become an impossible task. Therefore, finding similarities between documents has turned into a very active topic, since it saves time when searching for specific articles. We evaluated the applicability of word embedding mechanisms such as fastText to increasing accuracy in the classification process, and explored the feasibility of word embedding models such as fastText and doc2vec as word representation methodologies compared with frequency-based methods such as Term Frequency–Inverse Document Frequency (TF-IDF). The research evaluates the performance of different word representations for clustering and classification of Sinhala news documents. Initially, about 10,000 Sinhala news documents were collected from different news websites by a scraping algorithm; they were then cleaned and preprocessed to remove irrelevant characters and words.
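The cleaning step mentioned above can be illustrated with a minimal stdlib sketch. The abstract does not specify what counts as an irrelevant character, so this assumes anything outside the Sinhala Unicode block (U+0D80 to U+0DFF) is dropped; a real pipeline would also handle zero-width joiners, punctuation policy and stop words:

```python
import re

# Assumption: "irrelevant characters" = anything outside the Sinhala
# Unicode block; real Sinhala text also uses ZWJ (U+200D) in conjuncts,
# which this crude filter would strip.
NON_SINHALA = re.compile(r"[^\u0D80-\u0DFF\s]+")

def clean(text):
    """Drop non-Sinhala characters, then normalise whitespace."""
    text = NON_SINHALA.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```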
Each model was checked for accuracy while varying the number of documents. The resulting word representations were evaluated for both clustering and classification, using models such as k-means for clustering and k-nearest neighbours and support vector machines for classification. We tested the accuracies of various word representations, namely Term Frequency–Inverse Document Frequency, doc2vec and fastText, and upon experimentation found that fastText models as word representations give the best results for both clustering and classification. Therefore, using fastText word embedding models to represent documents for classification and clustering purposes will increase accuracy.

Keywords: Clustering; Classification; Word embedding; FastText; Sinhala documents
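The TF-IDF baseline that the comparison above starts from can be sketched with the standard library. The helper names are hypothetical, toy English tokens stand in for Sinhala ones, and the study's fastText and doc2vec models come from their respective libraries and are not reproduced here:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: weight} dict per doc."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (c / len(doc)) * math.log(n / df[t])
                     for t, c in tf.items()})
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Under this weighting, documents sharing distinctive terms score higher than unrelated ones, which is the similarity signal a clustering algorithm such as k-means exploits; the embedding-based representations replace these sparse vectors with dense ones.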