Abstract

With the growing use of social networking platforms in recent years, hate speech among users has risen sharply. Governments and social media platforms therefore face the responsibility and the challenge of detecting, controlling and eliminating rapidly growing hateful content as early as possible, to prevent later criminal acts such as cyber violence and real-life hate crimes. Because Twitter is used globally by people of various backgrounds and nationalities, the platform contains tweets posted in different languages, including code-mixed language, namely Hindi-English. The informal format of tweets, with variations in spelling and grammar, makes hate speech detection challenging, especially in code-mixed text that combines different languages. In this paper, we tackle the critical issue of hate speech on social media, focusing on a mix of English and Hindi-English (code-mixed) tweets on Twitter. We perform hate speech classification using character-level embedding representations of tweets and Deep Neural Networks (DNN). We built two architectures, a Convolutional Neural Network (CNN) and a combination of CNN and Long Short-Term Memory (LSTM), both with character-level embedding, as an improvement over the work of Elouali et al. (2020). Both models were trained on the imbalanced (original) version and on an oversampled (balanced) version of the training dataset, and were evaluated on the test set. We carried out extensive experiments, tuning the hyperparameters of our models and evaluating their performance in terms of accuracy, efficiency (runtime) and scalability in detecting whether a tweet is hate speech or non-hate. Compared with the model of Elouali et al. (2020), our method achieves improved accuracy and a significantly improved runtime, and is scalable. Among our best-performing models, CNN-LSTM performed slightly better than CNN, with an accuracy of 88.97%.
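
As a rough illustration of two steps named in the abstract, the sketch below turns tweets into fixed-length sequences of character ids (the kind of input a character-level embedding layer consumes) and balances the classes by random oversampling. It is not the thesis code: the function names, the toy tweets and labels, the maximum length of 24 and the choice of random duplication as the oversampling method are all assumptions made for this example, since the abstract does not specify them.

```python
import random
from collections import Counter


def build_char_vocab(texts):
    """Map each distinct character to an integer id.

    Id 0 is reserved for padding and id 1 for unknown characters
    (an assumed convention, not taken from the thesis).
    """
    chars = sorted({ch for text in texts for ch in text.lower()})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode_chars(text, vocab, max_len):
    """Encode a tweet as a list of character ids, truncated or zero-padded to max_len."""
    ids = [vocab.get(ch, 1) for ch in text.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels


# Invented toy data: 1 = hate, 0 = non-hate (labels are hypothetical).
tweets = ["yeh bahut bura hai", "have a nice day", "kya scene hai bro", "good morning sab ko"]
labels = [1, 0, 0, 0]

vocab = build_char_vocab(tweets)
encoded = [encode_chars(t, vocab, max_len=24) for t in tweets]
balanced_x, balanced_y = random_oversample(encoded, labels)
```

Working at the character level means that spelling variants common in romanized Hindi still share most of their input ids, which is one plausible reason to prefer it over word-level tokens for code-mixed text. In a setup like the one described, only the training split would be oversampled, leaving the test set untouched.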

Publication Date

5-10-2022

Document Type

Master's Project

Student Type

Graduate

Degree Name

Professional Studies (MS)

Department, Program, or Center

Graduate Programs & Research (Dubai)

Advisor

Sanjay Modak

Advisor/Committee Member

Khalil Al Hussaeni

Recommended Citation

Sameer, Mohamed, "Hate Speech Detection in a mix of English and Hindi-English (Code-Mixed) Tweets" (2022). Thesis. Rochester Institute of Technology. Accessed from
https://repository.rit.edu/theses/11160

Campus

RIT Dubai
