Short Text Topic Modelling using Non-negative Matrix Factorization with Neighbourhood-based Assistance
| dc.contributor.author | Athukorala, W.S. | |
| dc.contributor.author | Mohotti, W.A. | |
| dc.date.accessioned | 2022-09-01T04:57:20Z | |
| dc.date.available | 2022-09-01T04:57:20Z | |
| dc.date.issued | 2021 | |
| dc.description.abstract | A massive number of short texts are generated every day in the form of tweets, news headlines, questions, and answers. Analyzing short texts is an effective way to acquire valuable insights from these online archives, with diverse applications in community detection, trend analysis, classification, and summarization. Topic modeling is a widely used technique for this purpose, as it is capable of discovering latent topics and finding relationships among terms, topics, and text documents. When discovering thematic structure in collections of texts, a large number of terms appear in the document × term matrix representation, and the associated sparseness creates issues for distance-based and density-based document similarity calculations. This phenomenon is known as distance concentration, where the differences in distance between points become negligible due to sparseness in high dimensions. Additionally, short texts are much shorter than conventional documents. As a result, they produce extremely sparse, high-dimensional representations, which makes it challenging to find documents that share the same topic structure. Non-negative Matrix Factorization (NMF), which is aligned with the natural non-negativity of text data, is proposed as an effective technique that handles high-dimensional representations through lower-dimensional projection. However, this higher-to-lower dimensional projection results in information loss. This paper proposes Neighbourhood-based assistance to compensate for this loss. Neighbourhood information among documents is captured using Jaccard similarity over the term sets of the documents. We couple a symmetric document × document matrix that carries this neighbourhood information with the document × term matrix using NMF to identify the lower-order topic × document matrix. This unsupervised method learns a dense lower-order topic representation by minimizing the encoding error of the factor matrices.
We empirically evaluate the effectiveness of the method against state-of-the-art short text topic modeling methods belonging to the probabilistic and matrix factorization categories. Experimental results on three Twitter datasets show that the proposed approach is able to deal with the information loss associated with matrix factorization of high-dimensional short text and attains higher accuracy than the relevant benchmark methods. Keywords: Topic Modelling; Short Text; Non-negative Matrix Factorization; Neighbourhood-based Assistance | en_US |
| dc.identifier.isbn | 978-624-5856-04-6 | |
| dc.identifier.uri | http://www.erepo.lib.uwu.ac.lk/bitstream/handle/123456789/9582/Page%20117%20-%20IRCUWU2021-310%20-Athukorala-%20Short%20Text%20Topic%20Modelling%20using%20Non-negative%20Matrix%20Factorization%20with%20Neighbourhood.pdf?sequence=1&isAllowed=y | |
| dc.language.iso | en | en_US |
| dc.publisher | Uva Wellassa University of Sri Lanka | en_US |
| dc.subject | Computing and Information Science | en_US |
| dc.subject | Computer Science | en_US |
| dc.subject | Language | en_US |
| dc.subject | Education | en_US |
| dc.title | Short Text Topic Modelling using Non-negative Matrix Factorization with Neighbourhood-based Assistance | en_US |
| dc.title.alternative | International Research Conference 2021 | en_US |
| dc.type | Other | en_US |
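
The abstract describes building a symmetric document × document matrix from Jaccard similarity over term sets and coupling it with the document × term matrix through NMF. The record does not give the exact objective function, so the following is only a minimal sketch under stated assumptions: the two matrices are coupled by sharing one non-negative document factor `W`, the fit uses standard multiplicative updates for squared error, and tokens are split on whitespace. The function names, the weight `alpha`, and the toy documents are all hypothetical and are not taken from the paper.

```python
import numpy as np


def doc_term_matrix(docs):
    """Build a document x term count matrix (whitespace tokens, an assumption)."""
    vocab = sorted({t for d in docs for t in d.lower().split()})
    index = {t: j for j, t in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for t in d.lower().split():
            X[i, index[t]] += 1.0
    return X, vocab


def jaccard_matrix(docs):
    """Symmetric document x document matrix of Jaccard similarities of term sets."""
    term_sets = [set(d.lower().split()) for d in docs]
    n = len(term_sets)
    S = np.eye(n)  # each document is treated as fully similar to itself
    for i in range(n):
        for j in range(i + 1, n):
            union = term_sets[i] | term_sets[j]
            if union:
                S[i, j] = S[j, i] = len(term_sets[i] & term_sets[j]) / len(union)
    return S


def coupled_nmf(X, S, k, alpha=1.0, n_iter=200, seed=0, eps=1e-9):
    """Fit X ~= W @ H and alpha * S ~= W @ G with a shared non-negative W.

    Assumed coupling: factorize the concatenation [X | alpha * S] with
    multiplicative updates. W is document x topic, H is topic x term.
    """
    rng = np.random.default_rng(seed)
    A = np.hstack([X, alpha * S])
    W = rng.random((A.shape[0], k)) + eps
    B = rng.random((k, A.shape[1])) + eps
    for _ in range(n_iter):
        B *= (W.T @ A) / (W.T @ W @ B + eps)
        W *= (A @ B.T) / (W @ B @ B.T + eps)
    H, G = B[:, : X.shape[1]], B[:, X.shape[1]:]
    return W, H, G


# Toy corpus (hypothetical), two loose themes.
docs = [
    "nmf topic model short text",
    "short text topic model twitter",
    "football match goal score",
    "goal score football league",
]
X, vocab = doc_term_matrix(docs)
S = jaccard_matrix(docs)
W, H, G = coupled_nmf(X, S, k=2)
labels = W.argmax(axis=1)  # dominant topic index per document
print(labels)
```

Transposing `W` gives a topic × document matrix of the kind the abstract refers to. The weight `alpha` controls how strongly the neighbourhood matrix influences the shared factor relative to the term counts; setting it to zero reduces the sketch to plain NMF on the document × term matrix.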
Files
Original bundle
- Name: Page 117 - IRCUWU2021-310 -Athukorala- Short Text Topic Modelling using Non-negative Matrix Factorization with Neighbourhood.pdf
- Size: 147.12 KB
- Format: Adobe Portable Document Format