Hamza (U+0621) is wrongly considered a diacritic in Arabic
Created by: mzeidhassan
Hi all,
I am trying to normalize some Arabic text. The following code works fine and removes all diacritics, but it also removes the Arabic character hamza (U+0621), which shouldn't be considered a diacritic. Is there a way to keep the hamza when stripping diacritics from Arabic text?
Here is the function I am using:
from icu import UnicodeString, Transliterator, UTransDirection

def normalize_string(s):
    u = UnicodeString(s)
    # Decompose, drop every combining mark (general category M), recompose
    t = Transliterator.createInstance("NFD; [:M:] Remove; NFC", UTransDirection.FORWARD)
    t.transliterate(u)
    normalized = str(u)
    return normalized

normalize_string('صِراطَ الَّذينَ أَنعَمتَ عَلَيهِم غَيرِ المَغضوبِ عَلَيهِم وَلَا الضّالّينَ')
The output is:
'صراط الذين انعمت عليهم غير المغضوب عليهم ولا الضالين'
Thanks, Mohamed Zeid
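
A possible explanation, sketched with Python's standard unicodedata module instead of ICU: the input contains أ (U+0623, alef with hamza above), not a standalone hamza (U+0621). NFD splits U+0623 into alef (U+0627) plus the combining hamza above (U+0654), which is a nonspacing mark, so a filter on [:M:] removes it. The sketch below keeps a small set of combining marks while stripping the rest. The function name strip_marks_keep_hamza and the KEEP set are illustrative assumptions, not part of ICU or of the code above.

```python
import unicodedata

# Assumption: maddah above, hamza above and hamza below are the combining
# marks worth keeping; adjust this set to suit the text being normalized.
KEEP = {"\u0653", "\u0654", "\u0655"}

def strip_marks_keep_hamza(text):
    """Drop combining marks after NFD, except those in KEEP, then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed
        # Categories Mn, Mc and Me all start with "M"
        if not unicodedata.category(ch).startswith("M") or ch in KEEP
    )
    # NFC recombines alef + hamza above back into U+0623
    return unicodedata.normalize("NFC", kept)

# U+0623 decomposes into U+0627 followed by U+0654 under NFD
print([hex(ord(c)) for c in unicodedata.normalize("NFD", "\u0623")])

# The standalone hamza U+0621 is a letter (Lo), so [:M:] never matches it
print(unicodedata.category("\u0621"))

print(strip_marks_keep_hamza("أَنعَمتَ"))
```

If this reading is right, the same idea might carry over to the ICU transform by narrowing the filter, for example "NFD; [[:M:]-[\u0653-\u0655]] Remove; NFC", though that ID has not been verified here.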