Word Embedding as Word Representations for Clustering Sinhala News Documents

dc.contributor.author: Weerasiri, R.I.
dc.contributor.author: Lorensuhewa, S.A.S.
dc.contributor.author: Kalyani, M.A.L.
dc.date.accessioned: 2022-09-01T09:41:10Z
dc.date.available: 2022-09-01T09:41:10Z
dc.date.issued: 2021
dc.description.abstract: The number of news articles and other text documents published each day, on the internet and elsewhere, has grown to the point where clustering or classifying them by hand is no longer feasible, so automated methods are needed. Finding similarities between documents has therefore become an important research topic, and it saves time by letting readers search for specific articles. This research evaluates different word representations for clustering and classifying Sinhala news documents. We explored the feasibility of using word embedding models such as fastText and doc2vec as the representation, compared with frequency-based methods such as Term Frequency–Inverse Document Frequency (TF-IDF), and evaluated the resulting accuracies. Initially, about 10,000 Sinhala news documents were collected from different news websites using a scraping algorithm. They were cleaned and preprocessed to remove irrelevant characters and words. Each representation was then used as input to k-means for clustering and to k-nearest neighbours and support vector machines for classification, and accuracy was measured for each model while varying the number of documents. Of the representations tested (TF-IDF, doc2vec and fastText), fastText gave the best results for both clustering and classification. We therefore conclude that representing documents with fastText word embeddings improves accuracy in classification and clustering relative to the other representations tested.
Keywords: Clustering; Classification; Word embedding; FastText; Sinhala documents [en_US]
dc.identifier.isbn: 978-624-5856-04-6
dc.identifier.uri: http://www.erepo.lib.uwu.ac.lk/handle/123456789/9587
dc.language.iso: en [en_US]
dc.publisher: Uva Wellassa University of Sri Lanka [en_US]
dc.subject: Journalism [en_US]
dc.subject: Computing and Information Science [en_US]
dc.subject: Communication Technology [en_US]
dc.subject: Classification [en_US]
dc.title: Word Embedding as Word Representations for Clustering Sinhala News Documents [en_US]
dc.title.alternative: International Research Conference 2021 [en_US]
dc.type: Other [en_US]
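
Illustrative sketch (not part of the deposited record): the abstract names TF-IDF as the baseline document representation and k-means as the clustering method. The following is a minimal, standard-library-only sketch of that baseline, written under several assumptions: the toy English tokens stand in for preprocessed Sinhala text, the smoothed IDF formula is one common variant, and the deterministic farthest-first initialisation is chosen here only for reproducibility. The function names (tfidf_vectors, farthest_first_init, kmeans) are hypothetical and do not come from the paper, which does not describe its implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for tokenised docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumed variant) so that no weight is zero.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_init(vectors, k):
    """Deterministic initial centroids: start at the first vector, then
    repeatedly add the vector farthest from all centroids chosen so far."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        nxt = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(nxt))
    return centroids


def kmeans(vectors, k, n_iter=20):
    """Plain Lloyd-style k-means; returns one cluster label per vector."""
    centroids = farthest_first_init(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy, already-tokenised documents (hypothetical placeholders).
docs = [
    ["cricket", "match", "team", "win"],
    ["election", "parliament", "vote", "minister"],
    ["cricket", "team", "score", "match"],
    ["parliament", "minister", "election", "budget"],
]
vocab, vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
print(labels)
```

On this toy input the two sports-like documents land in one cluster and the two politics-like documents in the other. To compare representations as the abstract describes, the TF-IDF vectors would be swapped for document vectors derived from a trained embedding model (for example, averaged word vectors), with the clustering step left unchanged.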
Files
Original bundle
Name: Page 122 - IRCUWU2021-344 -Weerasiri- Word Embedding as Word Representations for Clustering Sinhala News Documents.pdf
Size: 223.88 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission