With the growing use of social networking platforms in recent years, hate speech among users has risen sharply. Governments and social media platforms therefore face considerable responsibility and challenges in detecting, controlling and eliminating rapidly growing hateful content as early as possible, in order to prevent future criminal acts such as cyber violence and real-life hate crimes. Because Twitter is used globally by people of various backgrounds and nationalities, the platform contains tweets posted in different languages, including code-mixed language, namely Hindi-English. The informal format of tweets, with variations in spelling and grammar, makes hate speech detection challenging, especially in code-mixed text that combines different languages. In this paper, we tackle the critical issue of hate speech on social media, focusing on a mix of English and Hindi-English (code-mixed) tweets on Twitter. We perform hate speech classification using character-level embedding representations of tweets and Deep Neural Networks (DNN). We built two architectures, a Convolutional Neural Network (CNN) and a combination of CNN and Long Short-Term Memory (LSTM), both with character-level embedding, as an improvement over the work of Elouali et al. (2020). Both models were trained on the original (imbalanced) version and on an oversampled (balanced) version of the training dataset, and were evaluated on the test set. We conducted extensive experiments by tuning the hyperparameters of our models and evaluating their performance in terms of accuracy, efficiency (runtime) and scalability in detecting whether a tweet is hate speech or non-hate. Compared with the model of Elouali et al. (2020), our method achieves improved accuracy and a significantly improved runtime, and it is scalable.
Among our best-performing models, CNN-LSTM performed slightly better than CNN, with an accuracy of 88.97%.
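The two preprocessing ideas named in the abstract, character-level representation of tweets and oversampling of the training set, can be illustrated with a minimal sketch. It is not the thesis's implementation: the reserved indices `PAD` and `UNK`, the fixed length `max_len=20`, the use of plain random oversampling (the abstract does not say which oversampling method was used), and the four sample tweets and their labels are all assumptions made for illustration. The resulting integer sequences are the kind of input a character-embedding layer in a CNN or CNN-LSTM would consume.

```python
import random
from collections import Counter

# Assumed reserved indices (not from the thesis): 0 pads, 1 marks unseen characters.
PAD, UNK = 0, 1


def build_char_vocab(texts):
    """Map every lowercase character seen in the texts to an integer index >= 2."""
    chars = sorted({ch for text in texts for ch in text.lower()})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode(text, vocab, max_len=20):
    """Turn a tweet into a fixed-length list of character indices.

    Longer tweets are truncated to max_len; shorter ones are right-padded with PAD.
    """
    ids = [vocab.get(ch, UNK) for ch in text.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for label, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == label]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(label)
    return out_x, out_y


# Hypothetical toy data: 0 = non-hate, 1 = hate (placeholder text only).
tweets = [
    "yeh movie bahut acchi hai",
    "what a great day",
    "nice match today",
    "tum log bekaar ho",
]
labels = [0, 0, 0, 1]

vocab = build_char_vocab(tweets)
encoded = [encode(t, vocab) for t in tweets]
balanced_x, balanced_y = random_oversample(encoded, labels)
```

Working at the character level means misspellings and romanized Hindi words still map to known indices, which is one plausible reason such representations suit code-mixed text. Oversampling should be applied to the training split only, so that the test set keeps its original class distribution.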
Professional Studies (MS)
Department, Program, or Center
Graduate Programs & Research (Dubai)
Khalil Al Hussaeni
Sameer, Mohamed, "Hate Speech Detection in a mix of English and Hindi-English (Code-Mixed) Tweets" (2022). Thesis. Rochester Institute of Technology. Accessed from