# README file for PyICU

## Welcome

Welcome to PyICU, a Python extension wrapping the ICU C++ libraries.

ICU stands for "International Components for Unicode".
These are the i18n libraries of the Unicode Consortium.
They implement much of the Unicode Standard,
many of its companion Unicode Technical Standards,
and much of Unicode CLDR.

The PyICU source code is hosted at https://gitlab.pyicu.org/main/pyicu.

The ICU homepage is http://site.icu-project.org/

See also the CLDR homepage at http://cldr.unicode.org/

## Installing PyICU

PyICU is a Python extension implemented in C++ that wraps the C/C++ ICU library.
It is known to also work as a [PyPy](https://www.pypy.org/) extension.
Unless ``pkg-config`` and the ICU libraries and headers are already installed,
building PyICU from the sources on [PyPI](https://pypi.org/project/PyICU/)
involves more than just a ``pip`` call. Many operating systems distribute
pre-built binary packages of ICU and PyICU; see below.

  - Mac OS X
    - Ensure ICU is installed and can be found by `pkg-config` (as `icu-config` was [deprecated](http://userguide.icu-project.org/howtouseicu#TOC-C-Makefiles) as of ICU 63.1), either by following [ICU build instructions](https://unicode-org.github.io/icu/userguide/icu4c/build.html), or by using Homebrew:
      ```sh
      # install libicu (keg-only)
      brew install pkg-config icu4c

      # let setup.py discover keg-only icu4c via pkg-config
      export PATH="/usr/local/opt/icu4c/bin:/usr/local/opt/icu4c/sbin:$PATH"
      export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:/usr/local/opt/icu4c/lib/pkgconfig"
      ```
    - Install PyICU **with the same C++ compiler as your Python distribution**
      ([more info](https://gitlab.pyicu.org/main/pyicu/merge_requests/140#issuecomment-782283491)):
      ```sh
      # EITHER - when using a gcc-built CPython (e.g. from Homebrew)
      export CC="$(which gcc)" CXX="$(which g++)"
      # OR - when using system CPython or another clang-based CPython, ensure system clang is used (for proper libstdc++ https://gitlab.pyicu.org/main/pyicu/issues/5#issuecomment-291631507):
      unset CC CXX

      # avoid wheels from previous runs or PyPI
      pip install --no-binary=pyicu pyicu
      ```

    - ICU and PyICU binaries are both available via [MacPorts](https://www.macports.org/) as well. The same limitations about mixing binaries may apply.
      ```sh
      # see versions available
      /opt/local/bin/port search pyicu
      sudo /opt/local/bin/port install ...
      ```

  - Debian
    ```sh
    apt-get update
    
    # EITHER - from apt directly https://packages.debian.org/source/stable/pyicu
    apt-get install python3-icu
    # OR - from source
    apt-get install pkg-config libicu-dev
    pip install --no-binary=pyicu pyicu
    ```

  - Ubuntu: similar to Debian, there is a pyicu
    [package](https://packages.ubuntu.com/source/xenial/python/pyicu)
    available via ``apt``.

  - Alpine Linux: there is a pyicu
    [package](https://pkgs.alpinelinux.org/package/edge/community/x86/py3-icu)
    available via ``apk``.

  - NetBSD: there is a pyicu [package](https://pkgsrc.se/textproc/py-ICU)
    available via ``pkg_add``.

  - OpenBSD: there is a pyicu [package](https://openports.se/textproc/py-ICU)
    available via ``pkg_add``.

  - Other operating systems: see below.

## Building PyICU

Before building PyICU the ICU libraries must be built and installed. Refer
to each system's [instructions](https://unicode-org.github.io/icu/userguide/icu4c/build.html) for more information.

PyICU is built from sources with ``setuptools`` or with ``build`` and ``pip``:

   - verify that ``pkg-config`` is available (the ``icu-config`` program is
     [deprecated](http://userguide.icu-project.org/howtouseicu#TOC-C-Makefiles)
     as of ICU 63.1)
     ```sh
     pkg-config --cflags --libs icu-i18n
     ```
     If this command returns an error or doesn't return the paths expected
     then ensure that the ``INCLUDES``, ``LFLAGS``, ``CFLAGS`` and ``LIBRARIES``
     dictionaries in ``setup.py`` contain correct values for your platform.
     Starting with ICU 60, ``-std=c++11`` must appear in your CFLAGS or be the
     default for your C++ compiler.

   - **either** build and install PyICU with ``setuptools``
     ```sh
     python setup.py build
     sudo python setup.py install
     ```

   - **or** build PyICU with ``build`` and install it with ``pip``
     ```sh
     python -m build
     sudo python -m pip install dist/PyICU-<version>-<platform>.whl
     ```

   - **either** test PyICU with ``setuptools``
     ```sh
     python setup.py test
     ```

   - **or** test PyICU with ``pytest``
     ```sh
     python -m pytest
     ```
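
The ``pkg-config`` check from the first step can also be scripted before starting a build. The following is a minimal sketch using only the Python standard library; the helper name ``icu_pkg_config_flags`` is hypothetical and is not part of PyICU or its ``setup.py``:

```python
import shutil
import subprocess


def icu_pkg_config_flags(package="icu-i18n"):
    """Return pkg-config's compiler and linker flags for ICU, or None."""
    exe = shutil.which("pkg-config")
    if exe is None:
        # pkg-config itself is not installed or not on PATH
        return None
    result = subprocess.run(
        [exe, "--cflags", "--libs", package],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # pkg-config ran but could not find the requested package
        return None
    return result.stdout.strip()


flags = icu_pkg_config_flags()
print(flags if flags is not None else "pkg-config could not locate icu-i18n")
```

A ``None`` result suggests that either ``PKG_CONFIG_PATH`` needs adjusting or the dictionaries in ``setup.py`` need to be edited by hand, as described above.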


## Running PyICU

  - Mac OS X
    Make sure that ``DYLD_LIBRARY_PATH`` contains paths to the directory(ies)
    containing the ICU libs.

  - Linux & Solaris
    Make sure that ``LD_LIBRARY_PATH`` contains paths to the directory(ies)
    containing the ICU libs or that you added the corresponding ``-rpath``
    argument to ``LFLAGS``.

  - Windows
    Make sure that ``PATH`` contains paths to the directory(ies)
    containing the ICU DLLs.


## What's available

See the [CHANGES](https://gitlab.pyicu.org/main/pyicu/blob/main/CHANGES) file
for an up-to-date log of changes and additions.


## API Documentation

There is no API documentation for PyICU. The API for ICU is documented at
https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ and the
following patterns can be used to translate from the C++ APIs to the
corresponding Python APIs.

### strings

The ICU string type, [UnicodeString](https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1UnicodeString.html), is a type pointing at a mutable array of [UChar](https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/umachine_8h.html#a6bb9fad572d65b305324ef288165e2ac) Unicode 16-bit wide characters and is described [here](https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1UnicodeString.html#details). The Python 3 [str](https://docs.python.org/3/library/stdtypes.html#str) type is described [here](https://docs.python.org/3/library/stdtypes.html#index-26) and [here](https://docs.python.org/3/howto/unicode.html). The Python 2 [unicode](https://docs.python.org/2.7/reference/datamodel.html#index-23) type is described [here](https://docs.python.org/2.7/library/stdtypes.html#sequence-types-str-unicode-list-tuple-bytearray-buffer-xrange).

Because of their differences, ICU's and Python's string objects are not merged
into the same type when crossing the C++ boundary but converted.
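
One such difference is how length is counted: a ``UnicodeString`` stores 16-bit ``UChar`` units, while a Python 3 ``str`` is a sequence of code points. The following minimal sketch uses only the standard library to show where the two counts diverge; ``utf16_length`` is a hypothetical helper, not a PyICU function:

```python
def utf16_length(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


bmp = "Brasil"         # every character fits in a single 16-bit unit
astral = "\U0001F600"  # outside the BMP, stored as a surrogate pair

print(len(bmp), utf16_length(bmp))        # 6 6
print(len(astral), utf16_length(astral))  # 1 2
```

Text containing characters outside the Basic Multilingual Plane can therefore be expected to report different lengths and indexes on the two sides of the conversion.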

ICU APIs taking ``UnicodeString`` arguments have been overloaded to also
accept arguments that are Python 3 ``str`` or Python 2 ``unicode`` objects.
Python 2 ``str`` objects are auto-decoded into ICU strings using the ``utf-8``
encoding.

To convert a Python 3 ``bytes`` or a Python 2 ``str`` object encoded in an
encoding other than ``utf-8`` to an ICU ``UnicodeString`` use the
``UnicodeString(str, encodingName)`` constructor.

ICU's C++ APIs accept and return ``UnicodeString`` arguments in several
ways: by value, by pointer or by reference.
When an ICU C++ API is documented to accept a ``UnicodeString`` reference
parameter, it is safe to assume that there are several corresponding
PyICU Python APIs making it accessible in simpler ways.

For example, the ``'UnicodeString &Locale::getDisplayName(UnicodeString &)'``
API, documented
[here](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1Locale.html#a61def321a9cfd9904b59e3f1897f835e),
can be invoked from Python in several ways:

1. The ICU way

        >>> from icu import UnicodeString, Locale
        >>> locale = Locale('pt_BR')
        >>> string = UnicodeString()
        >>> name = locale.getDisplayName(string)
        >>> name
        <UnicodeString: 'Portuguese (Brazil)'>
        >>> name is string
        True                  <-- string arg was returned, modified in place

2. The Python way

        >>> from icu import Locale
        >>> locale = Locale('pt_BR')
        >>> name = locale.getDisplayName()
        >>> name
        'Portuguese (Brazil)'

    A ``UnicodeString`` object was allocated and converted to a Python
    ``str`` object.

A UnicodeString can be converted to a Python unicode string with Python 3's
``str()`` or Python 2's ``unicode()`` constructor. The usual ``len()``,
comparison, ``[]`` and ``[:]`` operators are all available, with the additional
twists that slicing is not read-only and that ``+=`` is also available since a
UnicodeString is mutable. For example:

    >>> name = locale.getDisplayName()
    >>> name
    'Portuguese (Brazil)'
    >>> name = UnicodeString(name)
    >>> name
    <UnicodeString: 'Portuguese (Brazil)'>
    >>> str(name)
    'Portuguese (Brazil)'
    >>> len(name)
    19
    >>> str(name)
    'Portuguese (Brazil)'
    >>> name[3]
    't'
    >>> name[12:18]
    <UnicodeString: 'Brazil'>
    >>> name[12:18] = 'the country of Brasil'
    >>> name
    <UnicodeString: 'Portuguese (the country of Brasil)'>
    >>> name += ' oh joy'
    >>> name
    <UnicodeString: 'Portuguese (the country of Brasil) oh joy'>

### error reporting

The C++ ICU library does not use C++ exceptions to report errors. ICU
C++ APIs return errors via a ``UErrorCode`` reference argument. All such
APIs are wrapped by Python APIs that omit this argument and raise an
``ICUError`` Python exception instead. The same is true for ICU APIs
taking both a ``ParseError`` and a ``UErrorCode``: both arguments are
omitted from the Python call.

For example, the ``'UnicodeString &DateFormat::format(const Formattable &, UnicodeString &, FieldPosition &, UErrorCode &)'`` API, documented [here](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1DateFormat.html#aae63209f1202550c91e2beed5691b062) is invoked from Python with:

    >>> from icu import DateFormat, Formattable
    >>> df = DateFormat.createInstance()
    >>> df
    <SimpleDateFormat: M/d/yy h:mm a>
    >>> f = Formattable(940284258.0, Formattable.kIsDate)
    >>> df.format(f)
    '10/18/99 3:04 PM'

Of course, the simpler ``'UnicodeString &DateFormat::format(UDate, UnicodeString &)'`` documented [here](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1DateFormat.html#a5940ccf5676d3fa043d8255c55b7ddd1) can be used too:

    >>> from icu import DateFormat
    >>> df = DateFormat.createInstance()
    >>> df
    <SimpleDateFormat: M/d/yy h:mm a>
    >>> df.format(940284258.0)
    '10/18/99 3:04 PM'
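
On the failure path, the ``ICUError`` exception can be caught like any other Python exception. The following is a minimal sketch, assuming that ICU rejects an unterminated ``UnicodeSet`` pattern; ``parse_set`` is a hypothetical helper, and the import guard is there only to keep the sketch runnable on machines without PyICU:

```python
try:
    from icu import ICUError, UnicodeSet
except ImportError:
    # PyICU is not installed; the helper below then simply returns None.
    ICUError = UnicodeSet = None


def parse_set(pattern):
    """Return a UnicodeSet for pattern, or None if it cannot be built."""
    if UnicodeSet is None:
        return None
    try:
        # The C++ constructor takes a UErrorCode &; the Python call omits it.
        return UnicodeSet(pattern)
    except ICUError as error:
        print("ICU rejected", repr(pattern), "with", error)
        return None


parse_set('[a-z')  # missing closing bracket, assumed to be rejected by ICU
```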

### dates

ICU uses a double floating point type called ``UDate`` that represents the
number of milliseconds elapsed since 1970-jan-01 UTC for dates.

In Python, the value returned by the ``time`` module's ``time()``
function is the number of seconds since 1970-jan-01 UTC. Because of this
difference, floating point values are multiplied by 1000 when passed to
APIs taking ``UDate`` and divided by 1000 when returned as ``UDate``.

Python's ``datetime`` objects, with or without timezone information, can
also be used with APIs taking ``UDate`` arguments. The ``datetime``
objects get converted to ``UDate`` when crossing into the C++ layer.
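
The scaling described above happens inside PyICU, so callers keep working in seconds. As an illustration of the arithmetic only, here is a minimal sketch using the standard library; ``to_udate`` and ``from_udate`` are hypothetical helper names, not PyICU functions:

```python
from datetime import datetime, timezone


def to_udate(seconds):
    """Seconds since the epoch (as from time.time()) to ICU milliseconds."""
    return seconds * 1000.0


def from_udate(milliseconds):
    """ICU milliseconds since the epoch back to Python seconds."""
    return milliseconds / 1000.0


stamp = 940284258.0  # the value used in the DateFormat examples
udate = to_udate(stamp)
print(udate)  # 940284258000.0

# A timezone-aware datetime maps to the same instant through timestamp().
moment = datetime.fromtimestamp(from_udate(udate), tz=timezone.utc)
print(moment.isoformat())            # 1999-10-18T22:04:18+00:00
print(to_udate(moment.timestamp()))  # 940284258000.0
```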

### arrays

Many ICU APIs take array arguments. From Python, pass a list whose elements
are of the array's element type.

### StringEnumeration

An ICU ``StringEnumeration`` has three ``next`` methods: ``next()`` which
returns ``str`` objects, ``unext()`` which returns ``str`` objects in Python 3
or ``unicode`` objects in Python 2 and ``snext()`` which returns
``UnicodeString`` objects. Any of these methods can be used as an iterator,
using the Python built-in ``iter`` function.

For example, let ``e`` be a ``StringEnumeration`` instance:

```python
from icu import TimeZone

e = TimeZone.createEnumeration()
[s for s in e] # a list of 'str' objects
[s for s in iter(e.unext, '')] # a list of 'str' or 'unicode' objects
[s for s in iter(e.snext, '')] # a list of 'UnicodeString' objects
```

### timezones

The ICU ``TimeZone`` type may be wrapped with an ``ICUtzinfo`` type for
usage with Python's ``datetime`` type. For example:

```python
from datetime import datetime
from icu import ICUtzinfo, TimeZone

tz = ICUtzinfo(TimeZone.createTimeZone('US/Mountain'))
datetime.now(tz)
```

or, even simpler:

```python
tz = ICUtzinfo.getInstance('Pacific/Fiji')
datetime.now(tz)
```

To get the default time zone use:

```python
defaultTZ = ICUtzinfo.getDefault()
```

To get the time zone's id, use the ``tzid`` attribute or coerce the time
zone to a string:

```python
ICUtzinfo.getInstance('Pacific/Fiji').tzid  # 'Pacific/Fiji'
str(ICUtzinfo.getInstance('Pacific/Fiji'))  # 'Pacific/Fiji'
```

## Further Reading

The [unit tests](https://gitlab.pyicu.org/main/pyicu/-/tree/main/test) have
more examples of actual PyICU usage.

There are also a few
[samples](https://gitlab.pyicu.org/main/pyicu/-/tree/main/samples) ported from
ICU C/C++.