Wrong UnicodeString to unicode conversion for surrogate pair characters
Created by: ohadshany
If a UnicodeString object contains characters that are represented as UTF-16 surrogate pairs (e.g. u'\U000216d5'), then direct conversion to unicode produces an invalid value. For example:
import icu
print unicode(icu.UnicodeString(u'\U000216d5'))
This results in: u'\ud845\uded5'
which is wrong; the correct value is u'\U000216d5'.
You can also see that icu.UnicodeString(unicode(icu.UnicodeString(u'\U000216d5')))
fails because of the incorrect value.
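The two code units in the wrong result appear to be exactly the UTF-16 surrogate pair for U+216D5, which suggests (this is a guess, not something confirmed from the PyICU source) that the conversion copies UTF-16 code units one by one instead of combining them. A minimal sketch of the arithmetic, using a hypothetical helper named to_surrogate_pair and no PyICU calls:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (above U+FFFF) into UTF-16 surrogates.

    Hypothetical helper for illustration only; it is not part of PyICU.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x216D5)
print("%04x %04x" % (high, low))  # d845 ded5
```

The printed values match the u'\ud845\uded5' shown above.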
A quick workaround is to round-trip through an explicit encoding, like this:
unicode(icu.UnicodeString(u'\U000216d5').encode('utf8'), 'utf8')
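If the broken value has already been produced, another option might be to repair it after the fact by joining the surrogate pairs in plain Python. The sketch below is an alternative to the encode/decode round trip, not PyICU's own method; join_surrogates is a hypothetical name, and the unichr fallback assumes either Python 3 or a wide (UCS-4) Python 2 build:

```python
try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3: chr() already handles all code points


def join_surrogates(text):
    """Combine each high/low surrogate pair in text into one code point.

    Hypothetical helper for illustration; unpaired surrogates are left as-is.
    """
    out = []
    i = 0
    while i < len(text):
        high = ord(text[i])
        if 0xD800 <= high <= 0xDBFF and i + 1 < len(text):
            low = ord(text[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                # Reverse of the UTF-16 split: 10 bits from each surrogate.
                code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
                out.append(unichr(code_point))
                i += 2
                continue
        out.append(text[i])
        i += 1
    return u"".join(out)


fixed = join_surrogates(u'\ud845\uded5')
print(fixed == u'\U000216d5')
```

Characters outside the surrogate range pass through unchanged, so the helper should be safe to apply to ordinary text as well.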
I've encountered this problem on Linux with Python 2.7, using PyICU 1.9.3 with ICU 57 and also PyICU 1.8 with ICU 53.