Wrong UnicodeString to unicode conversion for surrogate pair type characters

Created by: ohadshany

If a UnicodeString object contains characters that UTF-16 represents as a surrogate pair (e.g. u'\U000216d5'), then direct conversion to unicode produces an invalid value. For example:

import icu
print unicode(icu.UnicodeString(u'\U000216d5'))

Results in: u'\ud845\uded5'

This is wrong: the correct value is u'\U000216d5'.
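For reference, the two values in the wrong result are exactly the UTF-16 surrogate halves of U+216D5, which suggests the UTF-16 code units are being copied one by one instead of being combined into a single code point. The sketch below only illustrates the standard UTF-16 arithmetic; it does not use PyICU, and the helper names to_surrogates and from_surrogates are made up for this example.

```python
def to_surrogates(code_point):
    """Split a supplementary code point (above U+FFFF) into a UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogates(high, low):
    """Combine a UTF-16 surrogate pair back into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x216D5)
print("%04x %04x" % (high, low))             # d845 ded5
print("%x" % from_surrogates(high, low))     # 216d5
```

A correct conversion would be expected to perform the from_surrogates step, so that the resulting unicode object holds the single character U+216D5.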

You can also see that icu.UnicodeString(unicode(icu.UnicodeString(u'\U000216d5'))) fails because of the incorrect intermediate value.

A quick workaround is to convert through an explicit encoding, like this: unicode(icu.UnicodeString(u'\U000216d5').encode('utf8'), 'utf8')

I've encountered this problem on Linux with Python 2.7, using PyICU 1.9.3 with ICU 57 and also PyICU 1.8 with ICU 53.