Comparison of Traditional and Ensemble Machine Learning in Classifying Hate Speech Sentences

Authors

  • Ridwan Duan
  • Riyan Latifahul Hasanah Universitas Nusa Mandiri, Jakarta
  • Eni Heni Hermaliani Universitas Nusa Mandiri, Jakarta

DOI:

https://doi.org/10.33506/insect.v8i2.2321

Keywords:

hate speech, Logistic Regression, ensemble, Adaptive Boosting

Abstract

Hate speech is a criminal act directed at individuals or groups in the form of insults or slander related to race, religion, culture, and similar attributes. It is often conveyed through social media such as Twitter. To help curb the spread of hate speech, this study analyzes the classification of hate speech sentences using machine learning. The text is first pre-processed by removing punctuation, lowercasing, tokenizing, filtering, and stemming. Because the dataset has an imbalanced class distribution, SMOTE (Synthetic Minority Over-sampling Technique) is applied, followed by feature engineering with TF-IDF (Term Frequency-Inverse Document Frequency). The traditional algorithms Logistic Regression, Decision Tree, and Naïve Bayes are then compared with the ensemble methods Adaptive Boosting (AdaBoost) and Random Forest. Logistic Regression achieves the best accuracy, 91.40, outperforming the other algorithms.
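The two data-preparation steps named in the abstract, TF-IDF weighting and SMOTE oversampling, can be illustrated with a small standard-library sketch. This is not the authors' implementation: the toy documents, the labels, the function names `tfidf` and `smote`, the smoothed IDF variant, and the choice to vectorize before oversampling (SMOTE interpolates numeric vectors) are all assumptions made for illustration. The classification stage is left out.

```python
import math
import random
from collections import Counter

def tfidf(docs):
    """Return (vocab, rows) using raw term counts and a smoothed IDF,
    idf(t) = ln((1 + N) / (1 + df(t))) + 1 (one common variant, assumed here)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        rows.append([tf[t] * idf[t] for t in vocab])
    return vocab, rows

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic vectors, each on the line segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Hypothetical, already pre-processed toy corpus (1 = hate speech, 0 = not).
docs = [
    "insult insult group",
    "insult slur",
    "hello friend",
    "nice day friend",
    "good morning",
    "thanks friend",
]
labels = [1, 1, 0, 0, 0, 0]

vocab, X = tfidf(docs)

# Oversample the minority class until both classes have the same size.
minority = [row for row, y in zip(X, labels) if y == 1]
n_missing = labels.count(0) - labels.count(1)
synthetic = smote(minority, n_missing)

balanced_X = X + synthetic
balanced_y = labels + [1] * len(synthetic)
print(Counter(balanced_y))
```

In practice a library implementation of both steps would normally be used, with oversampling applied only to the training split so that synthetic samples do not leak into evaluation.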

Published

2023-03-31

How to Cite

Ridwan, Hasanah, R. L., & Hermaliani, E. H. (2023). Comparison of Traditional and Ensemble Machine Learning in Classifying Hate Speech Sentences. Insect (Informatics and Security): Jurnal Teknik Informatika, 8(2), 121–131. https://doi.org/10.33506/insect.v8i2.2321

Section

Articles