eprintid: 15444
rev_number: 9
eprint_status: archive
userid: 2
dir: disk0/00/01/54/44
datestamp: 2024-11-28 23:30:18
lastmod: 2024-11-28 23:30:20
status_changed: 2024-11-28 23:30:18
type: article
metadata_visibility: show
creators_name: Ashiq, Waqar
creators_name: Kanwal, Samra
creators_name: Rafique, Adnan
creators_name: Waqas, Muhammad
creators_name: Khurshaid, Tahir
creators_name: Caro Montero, Elizabeth
creators_name: Bustamante Alonso, Alicia
creators_name: Ashraf, Imran
creators_id: 
creators_id: 
creators_id: 
creators_id: 
creators_id: 
creators_id: elizabeth.caro@uneatlantico.es
creators_id: alicia.bustamante@uneatlantico.es
creators_id: 
title: Roman urdu hate speech detection using hybrid machine learning models and hyperparameter optimization
ispublished: pub
subjects: uneat_eng
divisions: uneatlantico_produccion_cientifica
divisions: uninimx_produccion_cientifica
divisions: uninipr_produccion_cientifica
divisions: unic_produccion_cientifica
divisions: uniromana_produccion_cientifica
full_text_status: public
keywords: Hate speech detection, Deep learning, Model optimization, Urdu text classification
abstract: With the rapid increase in social media users, cyberbullying and hate speech problems have arisen in recent years. Automatic hate speech detection (HSD) from text is an emerging research problem in natural language processing (NLP). Researchers have developed various approaches to the automatic hate speech detection problem using different corpora in various languages; however, research on the Urdu language is rather scarce. This study aims to address the HSD task on Twitter using Roman Urdu text. The contribution of this research is the development of a hybrid model for Roman Urdu HSD, which has not been previously explored. The novel hybrid model integrates deep learning (DL) and transformer models for automatic feature extraction, combined with machine learning algorithms (MLAs) for classification. To further enhance model performance, we employ several hyperparameter optimization (HPO) techniques, including Grid Search (GS), Randomized Search (RS), and Bayesian Optimization with Gaussian Processes (BOGP). Evaluation is carried out on two publicly available benchmark Roman Urdu corpora: the HS-RU-20 corpus and the RUHSOLD hate speech corpus. Results demonstrate that the Multilingual BERT (MBERT) feature learner, paired with a Support Vector Machine (SVM) classifier and optimized using RS, achieves state-of-the-art performance. On the HS-RU-20 corpus, this model attained an accuracy of 0.93 and an F1 score of 0.95 for the Neutral-Hostile classification task, and an accuracy of 0.89 with an F1 score of 0.88 for the Hate Speech-Offensive task. On the RUHSOLD corpus, the same model achieved an accuracy of 0.95 and an F1 score of 0.94 for the Coarse-grained task, alongside an accuracy of 0.87 and an F1 score of 0.84 for the Fine-grained task. These results demonstrate the effectiveness of our hybrid approach for Roman Urdu hate speech detection.
date: 2024-11
publication: Scientific Reports
volume: 14
number: 1
id_number: doi:10.1038/s41598-024-79106-7
refereed: TRUE
issn: 2045-2322
official_url: http://doi.org/10.1038/s41598-024-79106-7
access: open
language: en
citation:   Ashiq, Waqar; Kanwal, Samra; Rafique, Adnan; Waqas, Muhammad; Khurshaid, Tahir; Caro Montero, Elizabeth; Bustamante Alonso, Alicia and Ashraf, Imran (2024) Roman urdu hate speech detection using hybrid machine learning models and hyperparameter optimization. Scientific Reports, 14 (1). ISSN 2045-2322
document_url: http://repositorio.unic.co.ao/id/eprint/15444/1/s41598-024-79106-7.pdf