SystemError: invalid maximum character passed to PyUnicode_New
It looks like the conversion from icu.UnicodeString (UTF-16) to a Python str via PyUnicode_New breaks when a PUA-B (plane 16) character appears in the same string as a character from any plane other than the BMP (plane 0) or PUA-B (plane 16).
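To make the triggering condition concrete, here is a small pure-Python sketch that flags strings mixing a plane 16 character with one from planes 1 to 15. The helper names `plane` and `mixes_plane16_with_other_supplementary` are made up for illustration and are not part of PyICU.

```python
def plane(ch: str) -> int:
    """Return the Unicode plane (0..16) of a single character."""
    return ord(ch) >> 16


def mixes_plane16_with_other_supplementary(text: str) -> bool:
    """True if text holds a plane-16 character and one from planes 1..15."""
    planes = {plane(ch) for ch in text}
    return 16 in planes and any(1 <= p <= 15 for p in planes)


# The three inputs used in the repro; only the last one mixes the planes.
for sample in (
    "Hello \U00010001",
    "Hello \U00100010",
    "Hello \U00010001\U00100010",
):
    print(ascii(sample), mixes_plane16_with_other_supplementary(sample))
```

Only the third sample satisfies the predicate, which matches the one input that raises in each repro.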
How to reproduce:
In [120]: icu.Transliterator.createInstance("NFC").transliterate("Hello \U00010001")
Out[120]: 'Hello 𐀁'
In [121]: icu.Transliterator.createInstance("NFC").transliterate("Hello \U00100010")
Out[121]: 'Hello \U00100010'
In [122]: icu.Transliterator.createInstance("NFC").transliterate("Hello \U00010001\U00100010")
Traceback (most recent call last):
File "/main_instance_shell/besfahbod/venv3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3397, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-122-f76c5ffe0345>", line 1, in <cell line: 1>
icu.Transliterator.createInstance("NFC").transliterate("Hello \U00010001\U00100010")
SystemError: invalid maximum character passed to PyUnicode_New
---------------------------------------------------------------------------
SystemError Traceback (most recent call last)
Input In [122], in <cell line: 1>()
----> 1 icu.Transliterator.createInstance("NFC").transliterate("Hello \U00010001\U00100010")
SystemError: invalid maximum character passed to PyUnicode_New
Or, with a regex replacement:
In [127]: icu.RegexPattern.compile("\p{N}").matcher("Hello \U00010001").replaceAll("")
Out[127]: 'Hello 𐀁'
In [128]: icu.RegexPattern.compile("\p{N}").matcher("Hello \U00100010").replaceAll("")
Out[128]: 'Hello \U00100010'
In [129]: icu.RegexPattern.compile("\p{N}").matcher("Hello \U00010001\U00100010").replaceAll("")
Traceback (most recent call last):
File "/main_instance_shell/besfahbod/venv3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3397, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-129-4515d7975282>", line 1, in <cell line: 1>
icu.RegexPattern.compile("\p{N}").matcher("Hello \U00010001\U00100010").replaceAll("")
SystemError: invalid maximum character passed to PyUnicode_New
---------------------------------------------------------------------------
SystemError Traceback (most recent call last)
Input In [129], in <cell line: 1>()
----> 1 icu.RegexPattern.compile("\p{N}").matcher("Hello \U00010001\U00100010").replaceAll("")
SystemError: invalid maximum character passed to PyUnicode_New
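One guess that would fit all six results above (an assumption on my part, not something checked against the PyICU source) is that the maxchar value handed to PyUnicode_New is accumulated by OR-ing the code points together rather than taking their true maximum. PyUnicode_New rejects any maxchar above 0x10FFFF, and OR-ing a plane 16 code point with one from planes 1 to 15 always overshoots that limit. The sketch below only models that arithmetic in pure Python; `or_folded_max` and `true_max` are hypothetical names.

```python
MAX_CODE_POINT = 0x10FFFF  # largest maxchar that PyUnicode_New accepts


def or_folded_max(text: str) -> int:
    """Hypothetical estimate: bitwise OR of every code point in text."""
    acc = 0
    for ch in text:
        acc |= ord(ch)
    return acc


def true_max(text: str) -> int:
    """Actual largest code point in text (0 for an empty string)."""
    return max(map(ord, text), default=0)


for sample in (
    "Hello \U00010001",
    "Hello \U00100010",
    "Hello \U00010001\U00100010",
):
    folded = or_folded_max(sample)
    # Report whether each estimate would be rejected as out of range.
    print(
        ascii(sample),
        hex(folded),
        "too large" if folded > MAX_CODE_POINT else "ok",
        hex(true_max(sample)),
    )
```

Under this model only the mixed string yields an out-of-range value (0x11007f), while its true maximum (0x100010) is valid, so computing a real maximum would avoid the error if the guess is right.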
Versions
In [130]: icu.VERSION
Out[130]: '2.9'
In [131]: icu.ICU_VERSION
Out[131]: '71.1'