Toxic Comment Classification Using Machine Learning

dc.contributor.author: Pramodya, L.A.S.
dc.contributor.author: Rathnayaka, R.M.G.U.
dc.contributor.author: Lahiru, K.K.S.
dc.contributor.author: Thambawita, D.R.V.L.B.
dc.date.accessioned: 2019-04-06T06:49:08Z
dc.date.available: 2019-04-06T06:49:08Z
dc.date.issued: 2019-02
dc.description.abstract: Comment classification models are available today for "flagging" comments. However, determining whether or not a comment should be flagged is difficult and time-consuming. Another major problem is the lack of sufficient data for training such models, and the available datasets have issues because they are annotated by human raters whose annotations depend on personal beliefs. The lack of a multi-label comment classification model contributes to unchecked abusive behavior. This paper presents models for multi-label text classification that identify the different levels of toxicity within a comment. We use Wikipedia comments, labeled by human raters for toxic behavior, provided by Kaggle. Comments are categorized into six categories: toxic, severe-toxic, obscene, threat, insult, and identity-hate. The dataset contains 159,572 comments. For data analysis we use the Python seaborn and matplotlib libraries. The dataset is highly skewed: most comments belong to none of the six categories. The researchers undersampled the majority class to correct this bias in the original dataset. We tested three models: a feed-forward neural network with Keras and word embeddings, a Naive Bayes model with scikit-learn, and LightGBM with 4-fold cross-validation. The neural network took 3.5 hours to train on an Nvidia GeForce 840M with 384 CUDA cores, the Naive Bayes model with scikit-learn took 3 hours, and LightGBM with k-fold cross-validation took 4 hours. The researchers ran 100 epochs for each model. At the end of 100 epochs, the neural network gave 0.9930 validation accuracy with a loss of 0.2714; the Naive Bayes model gave 0.9556 validation accuracy with a loss of 0.4121; and LightGBM with k-fold gave 0.9000 accuracy with a validation loss of 0.4263. The neural network gave the best accuracy at the end of the 100th epoch.
dc.identifier.isbn: 9789550481255
dc.identifier.uri: http://www.erepo.lib.uwu.ac.lk/bitstream/handle/123456789/112/73.pdf?sequence=1&isAllowed=y
dc.language.iso: en
dc.publisher: Uva Wellassa University of Sri Lanka
dc.subject: Computer Science
dc.subject: Information Science
dc.subject: Computing and Information Science
dc.title: Toxic Comment Classification Using Machine Learning
dc.title.alternative: International Research Conference 2019
dc.type: Other
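The abstract notes that the dataset is highly skewed and that the researchers undersampled the majority class. A minimal sketch of that idea with pandas is shown below; the tiny inline frame and column names are illustrative stand-ins for the Kaggle toxic-comment CSV, not the paper's actual code.

```python
# Sketch: undersample the majority ("clean") class so it matches the
# number of toxic comments. Data and column names are illustrative.
import pandas as pd

df = pd.DataFrame({
    "comment_text": ["nice work", "great edit", "thanks!",
                     "you idiot", "I will hurt you"],
    "toxic":        [0, 0, 0, 1, 1],
})

clean = df[df["toxic"] == 0]   # majority class
toxic = df[df["toxic"] == 1]   # minority class

# Randomly keep only as many clean comments as there are toxic ones.
balanced = pd.concat([
    clean.sample(n=len(toxic), random_state=42),
    toxic,
]).reset_index(drop=True)

print(balanced["toxic"].value_counts())
```

In the real dataset a comment would count as "clean" only if all six label columns are zero; the same `sample`-and-`concat` pattern applies.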
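The abstract's scikit-learn Naive Bayes baseline for multi-label toxicity can be sketched as a one-vs-rest ensemble of binary classifiers over TF-IDF features, one per label. This is a minimal illustration under that assumption, not the paper's implementation; the toy comments and targets stand in for the Kaggle Wikipedia data.

```python
# Sketch: multi-label toxic-comment classification with scikit-learn,
# one binary Naive Bayes classifier per label (one-vs-rest).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Toy training comments; each row of y is the six binary labels for one comment.
comments = [
    "you are a wonderful person",
    "I will hurt you, watch out",
    "what an idiot, total idiot",
    "thanks for the helpful edit",
]
y = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
])

model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(MultinomialNB()),
)
model.fit(comments, y)

# Predict all six labels at once for a new comment.
pred = model.predict(["what an idiot"])
print(dict(zip(LABELS, pred[0])))
```

The one-vs-rest wrapper keeps each label's decision independent, which matches the multi-label framing in the abstract: a comment may be, say, both "toxic" and "insult" at the same time.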