TY  - JOUR
JF  - IEEE Access
ID  - unic8800
A1  - Aziz, Romila
A1  - Anwar, Muhammad Waqas
A1  - Jamal, Muhammad Hasan
A1  - Bajwa, Usama Ijaz
A1  - Kuc Castilla, Ángel Gabriel
A1  - Uc-Rios, Carlos
A1  - Bautista Thompson, Ernesto
A1  - Ashraf, Imran
UR  - http://doi.org/10.1109/ACCESS.2023.3312730
SN  - 2169-3536
AV  - public
Y1  - 2023/09//
N2  - Spelling errors are generally of two types: non-word errors and real-word errors. Non-word errors are misspelled words that do not exist in the lexicon, while real-word errors are misspelled words that exist in the lexicon but are used out of context in a sentence. The lexicon-based lookup approach is widely used for non-word errors, but it cannot handle real-word errors because they require contextual information. In contrast to English, real-word error detection and correction for low-resourced languages like Urdu is an unexplored area. This paper presents a real-word spelling error detection and correction approach for the Urdu language. We develop an extensive lexicon of 593,738 words and use this lexicon to develop a dataset for real-word errors comprising 125,562 sentences and 2,552,735 words. Based on the developed lexicon and dataset, we then develop a contextual spell checker that detects and corrects real-word errors. For the real-word error detection phase, word-gram features are used along with five machine learning classifiers, achieving a precision, recall, and F1-score of 0.84, 0.79, and 0.81, respectively. We also test the proposed approach with a 40% error density. For real-word error correction, the Damerau-Levenshtein distance is used along with the n-gram model for further ranking of the suggested candidate words, achieving an accuracy of up to 83.67%.
TI  - Real Word Spelling Error Detection and Correction for Urdu Language
KW  - Real-word errors
KW  - spelling correction
KW  - spelling detection
KW  - spell checker
SP  - 1
EP  - 1
ER  -