A Comparative Study: Best Machine Learning Algorithm for Social Media Sentiment Analysis

dc.contributor.authorManthrirathna, M.A.L.
dc.contributor.authorWeerakoon, W.M.H.G.T.C.K.
dc.contributor.authorRathnayaka, R.M.K.T.
dc.date.accessioned2021-02-01T06:35:28Z
dc.date.available2021-02-01T06:35:28Z
dc.date.issued2020
dc.description.abstractSentiment analysis is a field of study that aims to derive the sentiment or opinion expressed in a text using natural language processing techniques. Sentiment analysis of Twitter data has many applications, including stock market price prediction and product recommendation. It can be performed with lexicon-based, machine learning-based, or hybrid approaches. K Nearest Neighbor, Support Vector Machine, Logistic Regression, Naïve Bayes, K Means Clustering, Decision Trees, and Random Forest are among the most popular machine learning algorithms. This study compares K Nearest Neighbor, Support Vector Machine, Logistic Regression, and Multinomial Naïve Bayes, each combined with the SentiWordNet lexicon, to determine which provides the best accuracy in sentiment classification of Twitter data. A data set of 1028 tweets was acquired using the Twitter Standard Search API (Application Programming Interface) and the Tweepy Python library. The name of a popular brand of mobile phones was used as the search term. After duplicate removal and cleaning, 570 tweets remained. The remaining tweets were labeled as positive, negative, or neutral using the SentiWordNet lexicon and used to train the selected machine learning algorithms. 80% of the data was used for training and 20% for testing. Word counts in the tweets were used as features. Multinomial Naïve Bayes was the best-performing algorithm with a model accuracy of 74.56%, and K Nearest Neighbor (k=3) was the worst-performing with an accuracy of 54.38%. Logistic Regression and Support Vector Machine (linear kernel) had accuracies of 72.80% and 70.17%, respectively. The results show that Multinomial Naïve Bayes performs relatively better in Twitter sentiment analysis than K Nearest Neighbor, Support Vector Machine, and Logistic Regression.
This is because the two basic assumptions of the Multinomial Naïve Bayes algorithm, feature independence and a multinomial distribution, are well satisfied by the features selected for this study. Multinomial Naïve Bayes also performs well with high-dimensional data such as tweet text. The poor performance of K Nearest Neighbor, on the other hand, stems from that same high dimensionality: K Nearest Neighbor does not handle a large number of features well. Keywords: Sentiment analysis, Twitter, Hybrid approach, Machine learning algorithms, Comparative analysis.en_US
dc.identifier.isbn9789550481293
dc.identifier.urihttp://www.erepo.lib.uwu.ac.lk/bitstream/handle/123456789/5716/proceeding_oct_08-193.pdf?sequence=1&isAllowed=y
dc.language.isoenen_US
dc.publisherUva Wellassa University of Sri Lankaen_US
dc.relation.ispartofseries;International Research Conference
dc.subjectComputer Scienceen_US
dc.subjectSocial Mediaen_US
dc.subjectInformation Scienceen_US
dc.subjectComputing and Information Managementen_US
dc.titleA Comparative Study: Best Machine Learning Algorithm for Social Media Sentiment Analysisen_US
dc.title.alternativeInternational Research Conference 2020en_US
dc.typeOtheren_US
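
The abstract describes training Multinomial Naïve Bayes on word counts from tweets labeled positive, negative, or neutral. The following is a minimal, standard-library-only sketch of that idea, not the authors' implementation. The tokenizer, the Laplace smoothing value (alpha=1.0), the class name MultinomialNB, and the toy tweets are all assumptions made for illustration; the record does not state which library, smoothing, or preprocessing the study used.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for the study's cleaning step)."""
    return text.lower().split()


class MultinomialNB:
    """Multinomial Naive Bayes over word counts, with Laplace smoothing (assumed)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength; value is an assumption

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # number of documents per class
        self.word_counts = defaultdict(Counter)  # per-class word frequencies
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, doc):
        # Word-count features; words never seen in training are ignored.
        counts = Counter(t for t in tokenize(doc) if t in self.vocab)
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # log prior + sum over words of count * log likelihood
            score = math.log(n_docs / self.total_docs)
            total_words = sum(self.word_counts[label].values())
            for word, n in counts.items():
                p = (self.word_counts[label][word] + self.alpha) / (
                    total_words + self.alpha * vocab_size
                )
                score += n * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the given label."""
    correct = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return correct / len(labels)


# Toy data invented for this sketch; it is not taken from the study's 570 tweets.
train_docs = [
    "love this phone great camera",
    "great battery love it",
    "terrible screen hate it",
    "hate the battery terrible phone",
    "phone arrived today",
    "the phone is a phone",
]
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

model = MultinomialNB().fit(train_docs, train_labels)
test_docs = ["love the great camera", "hate this terrible screen"]
test_labels = ["positive", "negative"]
print(model.predict("phone arrived today"))
print(accuracy(model, test_docs, test_labels))
```

In a comparison like the one described in the abstract, the same word-count features and the same held-out 20% would be passed to each classifier, so that accuracy differences reflect the algorithm rather than the data split.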
Files
Original bundle
Name: proceeding_oct_08-193.pdf
Size: 31.22 KB
Format: Adobe Portable Document Format