relation: http://repositorio.unic.co.ao/id/eprint/8800/
canonical: http://repositorio.unic.co.ao/id/eprint/8800/
title: Real Word Spelling Error Detection and Correction for Urdu Language
creator: Aziz, Romila
creator: Anwar, Muhammad Waqas
creator: Jamal, Muhammad Hasan
creator: Bajwa, Usama Ijaz
creator: Kuc Castilla, Ángel Gabriel
creator: Uc-Rios, Carlos
creator: Bautista Thompson, Ernesto
creator: Ashraf, Imran
subject: Engineering
description: Spelling errors generally fall into two types: non-word errors and real-word errors. Non-word errors are misspelled words that do not exist in the lexicon, while real-word errors are misspelled words that exist in the lexicon but are used out of context in a sentence. The lexicon-based lookup approach is widely used for non-word errors, but it cannot handle real-word errors because they require contextual information. In contrast to English, real-word error detection and correction for low-resourced languages like Urdu is an unexplored area. This paper presents a real-word spelling error detection and correction approach for the Urdu language. We develop an extensive lexicon of 593,738 words and use this lexicon to develop a dataset for real-word errors comprising 125,562 sentences and 2,552,735 words. Based on the developed lexicon and dataset, we then develop a contextual spell checker that detects and corrects real-word errors. For the real-word error detection phase, word-gram features are used along with five machine learning classifiers, achieving a precision, recall, and F1-score of 0.84, 0.79, and 0.81, respectively. We also test the proposed approach with a 40% error density. For real-word error correction, the Damerau-Levenshtein distance is used along with the n-gram model for further ranking of the suggested candidate words, achieving an accuracy of up to 83.67%.
date: 2023-09
type: Article
type: PeerReviewed
format: text
language: en
rights: cc_by_nc_nd_4
identifier: http://repositorio.unic.co.ao/id/eprint/8800/1/Real_Word_Spelling_Error_Detection_and_Correction_for_Urdu_Language.pdf
identifier: Aziz, Romila; Anwar, Muhammad Waqas; Jamal, Muhammad Hasan; Bajwa, Usama Ijaz; Kuc Castilla, Ángel Gabriel; Uc-Rios, Carlos; Bautista Thompson, Ernesto and Ashraf, Imran (2023) Real Word Spelling Error Detection and Correction for Urdu Language. IEEE Access. p. 1. ISSN 2169-3536
relation: http://doi.org/10.1109/ACCESS.2023.3312730
relation: doi:10.1109/ACCESS.2023.3312730