Hamza (U+0621) is wrongly considered a diacritic in Arabic

Created by: mzeidhassan

Hi all,

I am trying to normalize some Arabic text. The following code works fine and removes all diacritics, but it also removes the Arabic character hamza (U+0621), which shouldn't be considered a diacritic. Is there a way to keep the hamza when removing diacritics from Arabic text?

Here is the function that I am using:

    from icu import UnicodeString, Transliterator, UTransDirection

    def normalize_string(s):
        u = UnicodeString(s)
        t = Transliterator.createInstance("NFD; [:M:] Remove; NFC", UTransDirection.FORWARD)
        t.transliterate(u)
        normalized = str(u)
        return normalized

    normalize_string('صِراطَ الَّذينَ أَنعَمتَ عَلَيهِم غَيرِ المَغضوبِ عَلَيهِم وَلَا الضّالّينَ')

The output is:

'صراط الذين انعمت عليهم غير المغضوب عليهم ولا الضالين'
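A note on what seems to be happening: the letter that changes in the output is أ (U+0623, alef with hamza above), which becomes ا (U+0627). Under NFD, U+0623 decomposes into U+0627 plus the combining hamza above (U+0654), and U+0654 is a nonspacing mark, so `[:M:] Remove` deletes it. The standalone hamza U+0621 is a letter and should not be touched by that rule. The sketch below illustrates one possible workaround using Python's standard `unicodedata` module rather than ICU; the function name `strip_diacritics_keep_hamza` and the `KEEP` set are hypothetical, chosen only for illustration, and the choice of which marks to preserve is an assumption.

```python
import unicodedata

# Assumption: these combining marks are treated as part of the letter,
# not as vocalization, so they are kept.
KEEP = {
    "\u0653",  # ARABIC MADDAH ABOVE
    "\u0654",  # ARABIC HAMZA ABOVE
    "\u0655",  # ARABIC HAMZA BELOW
}

def strip_diacritics_keep_hamza(s):
    # Decompose so that every combining mark becomes its own code point.
    decomposed = unicodedata.normalize("NFD", s)
    # Drop all marks (general category M*) except the ones listed in KEEP.
    kept = "".join(
        ch for ch in decomposed
        if ch in KEEP or not unicodedata.category(ch).startswith("M")
    )
    # Recompose, so alef + hamza above becomes U+0623 again.
    return unicodedata.normalize("NFC", kept)

print(strip_diacritics_keep_hamza("أَنعَمتَ"))
```

With ICU, the same idea could probably be expressed by excluding those code points from the set in the transliterator ID, for example `"NFD; [[:M:]-[\u0653-\u0655]] Remove; NFC"`, though that variant is untested here.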

Thanks, Mohamed Zeid