# README file for PyICU

## Welcome

Welcome to PyICU, a Python extension wrapping the ICU C++ libraries.

ICU stands for "International Components for Unicode".
These are the i18n libraries of the Unicode Consortium.
They implement much of the Unicode Standard,
many of its companion Unicode Technical Standards,
and much of Unicode CLDR.

The PyICU source code is hosted at https://gitlab.pyicu.org/main/pyicu.

The ICU homepage is http://site.icu-project.org/

See also the CLDR homepage at http://cldr.unicode.org/

## Installing PyICU

PyICU is a Python extension implemented in C++ that wraps the C/C++ ICU library.
It is known to also work as a [PyPy](https://www.pypy.org/) extension.
Unless ``pkg-config`` and the ICU libraries and headers are already installed,
building PyICU from the sources on [PyPI](https://pypi.org/project/PyICU/)
involves more than just a ``pip`` call. Many operating systems distribute
pre-built binary packages of ICU and PyICU; see below.

  - Mac OS X
    - Ensure ICU is installed and can be found by `pkg-config` (as `icu-config` was [deprecated](http://userguide.icu-project.org/howtouseicu#TOC-C-Makefiles) as of ICU 63.1), either by following [ICU build instructions](https://unicode-org.github.io/icu/userguide/icu4c/build.html), or by using Homebrew:
      ```sh
      # install libicu (keg-only)
      brew install pkg-config icu4c

      # let setup.py discover keg-only icu4c via pkg-config
      export PATH="/usr/local/opt/icu4c/bin:/usr/local/opt/icu4c/sbin:$PATH"
      export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:/usr/local/opt/icu4c/lib/pkgconfig"
      ```
    - Install PyICU **with the same C++ compiler as your Python distribution**
      ([more info](https://gitlab.pyicu.org/main/pyicu/merge_requests/140#issuecomment-782283491)):
      ```sh
      # EITHER - when using a gcc-built CPython (e.g. from Homebrew)
      export CC="$(which gcc)" CXX="$(which g++)"
      # OR - when using system CPython or another clang-based CPython, ensure system clang is used (for proper libstdc++ https://gitlab.pyicu.org/main/pyicu/issues/5#issuecomment-291631507):
      unset CC CXX

      # avoid wheels from previous runs or PyPI
      pip install --no-binary=pyicu pyicu
      ```

    - ICU and PyICU binaries are both available via [MacPorts](https://www.macports.org/) as well. The same limitations about mixing binaries may apply.
      ```sh
      # see versions available
      /opt/local/bin/port search pyicu
      sudo /opt/local/bin/port install ...
      ```

  - Debian
    ```sh
    apt-get update
    
    # EITHER - from apt directly https://packages.debian.org/source/stable/pyicu
    apt-get install python3-icu
    # OR - from source
    apt-get install pkg-config libicu-dev
    pip install --no-binary=pyicu pyicu
    ```

  - Ubuntu: similar to Debian, there is a pyicu
    [package](https://packages.ubuntu.com/source/xenial/python/pyicu)
    available via ``apt``.

  - Alpine Linux: there is a pyicu
    [package](https://pkgs.alpinelinux.org/package/edge/community/x86/py3-icu)
    available via ``apk``.

  - NetBSD: there is a pyicu [package](https://pkgsrc.se/textproc/py-ICU)
    available via ``pkg_add``.

  - OpenBSD: there is a pyicu [package](https://openports.se/textproc/py-ICU)
    available via ``pkg_add``.

  - Other operating systems: see below.
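Whichever route is used, a quick import check confirms the result. The sketch below relies only on the module-level attributes ``VERSION``, ``ICU_VERSION`` and ``UNICODE_VERSION`` exposed by ``icu``; the ``describe_pyicu`` helper name is made up for illustration and is not part of PyICU.

```python
def describe_pyicu():
    """Return PyICU's version details, or None when it cannot be imported."""
    try:
        import icu
    except ImportError:
        # Either PyICU is not installed or the ICU shared libraries
        # could not be loaded by this interpreter.
        return None
    return {
        "PyICU": icu.VERSION,
        "ICU": icu.ICU_VERSION,
        "Unicode": icu.UNICODE_VERSION,
    }


details = describe_pyicu()
print(details if details else "PyICU is not importable from this interpreter")
```

Run it with the same interpreter that ``pip`` or the system package installed into.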

## Building PyICU

*Please refer to the [next section](#building-pyicu-python-3-and-icu-from-sources) for building Python, ICU and PyICU from sources.
The current section is about building only PyICU from sources, with all dependencies such as Python and ICU already present.*

Before building PyICU the ICU libraries must be built and installed. Refer
to each system's [instructions](https://unicode-org.github.io/icu/userguide/icu4c/build.html) for more information.

PyICU is built from sources with ``setuptools`` or with ``build`` and ``pip``:

   - verify that ``pkg-config`` is available (the ``icu-config`` program is
     [deprecated](http://userguide.icu-project.org/howtouseicu#TOC-C-Makefiles)
     as of ICU 63.1)
     ```sh
     pkg-config --cflags --libs icu-i18n
     ```
     If this command returns an error or doesn't return the expected paths,
     ensure that the ``INCLUDES``, ``LFLAGS``, ``CFLAGS`` and ``LIBRARIES``
     dictionaries in ``setup.py`` contain correct values for your platform.
     Starting with ICU 60, ``-std=c++11`` must appear in your CFLAGS or be the
     default for your C++ compiler.

   - **either** build and install PyICU with ``setuptools``
     ```sh
     python setup.py build
     sudo python setup.py install
     ```

   - **or** build PyICU with ``build`` and install it with ``pip``
     ```sh
     python -m build
     sudo python -m pip install dist/PyICU-<version>-<platform>.whl
     ```

   - **either** test PyICU with ``setuptools``
     ```sh
     python setup.py test
     ```

   - **or** test PyICU with ``pytest``
     ```sh
     python -m pytest
     ```
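The ``pkg-config`` check from the first step can also be scripted, for instance ahead of an automated build. This is a minimal sketch using only the Python standard library; the ``icu_build_flags`` helper is hypothetical and simply runs the same ``pkg-config --cflags --libs icu-i18n`` command shown above.

```python
import shutil
import subprocess


def icu_build_flags(module="icu-i18n"):
    """Return the flags reported by pkg-config, or None if unavailable."""
    if shutil.which("pkg-config") is None:
        return None  # pkg-config itself is not on PATH
    completed = subprocess.run(
        ["pkg-config", "--cflags", "--libs", module],
        capture_output=True,
        text=True,
    )
    if completed.returncode != 0:
        return None  # pkg-config does not know about this ICU module
    return completed.stdout.strip()


flags = icu_build_flags()
print(flags if flags is not None else "ICU was not found via pkg-config")
```

A ``None`` result corresponds to the case where the dictionaries in ``setup.py`` need to be checked by hand.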


## Building PyICU, Python 3 and ICU from sources

The instructions at [note_855](https://gitlab.pyicu.org/main/pyicu/-/issues/153#note_855) contain the complete steps for building everything from sources into
a self-contained directory, without modifying any system directories. They were
made and tested on an M1 Mac but they can be modified and reused for any unix
environment. In particular, they outline how to build PyICU from sources
without icu-config or pkg-config being present.


## Running PyICU

  - Mac OS X
    Make sure that ``DYLD_LIBRARY_PATH`` contains paths to the directory(ies)
    containing the ICU libs.

  - Linux & Solaris
    Make sure that ``LD_LIBRARY_PATH`` contains paths to the directory(ies)
    containing the ICU libs or that you added the corresponding ``-rpath``
    argument to ``LFLAGS``.

  - Windows
    Make sure that ``PATH`` contains paths to the directory(ies)
    containing the ICU DLLs.


## What's available

See the [CHANGES](https://gitlab.pyicu.org/main/pyicu/blob/main/CHANGES) file
for an up-to-date log of changes and additions.


## API Documentation

There is no API documentation for PyICU. The API for ICU is documented at
https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ and the
following patterns can be used to translate from the C++ APIs to the
corresponding Python APIs.

### strings

The ICU string type, [UnicodeString](https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1UnicodeString.html), is a type pointing at a mutable array of [UChar](https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/umachine_8h.html#a6bb9fad572d65b305324ef288165e2ac) Unicode 16-bit wide characters and is described [here](https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1UnicodeString.html#details). The Python 3 [str](https://docs.python.org/3/library/stdtypes.html#str) type is described [here](https://docs.python.org/3/library/stdtypes.html#index-26) and [here](https://docs.python.org/3/howto/unicode.html). The Python 2 [unicode](https://docs.python.org/2.7/reference/datamodel.html#index-23) type is described [here](https://docs.python.org/2.7/library/stdtypes.html#sequence-types-str-unicode-list-tuple-bytearray-buffer-xrange).

Because of these differences, ICU's and Python's string objects are not merged
into a single type; they are converted when crossing the C++ boundary.

ICU APIs taking ``UnicodeString`` arguments have been overloaded to also
accept arguments that are Python 3 ``str`` or Python 2 ``unicode`` objects.
Python 2 ``str`` objects are auto-decoded into ICU strings using the ``utf-8``
encoding.

To convert a Python 3 ``bytes`` or a Python 2 ``str`` object encoded in an
encoding other than ``utf-8`` to an ICU ``UnicodeString`` use the
``UnicodeString(str, encodingName)`` constructor.
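As a sketch of that constructor under Python 3, the snippet below decodes Latin-1 bytes through ICU. The ``decode_with_icu`` helper is hypothetical, the encoding name ``iso-8859-1`` is assumed to be one ICU recognizes, and the import guard is only there so the snippet degrades gracefully where PyICU is not installed.

```python
def decode_with_icu(data, encoding):
    """Decode bytes via ICU's UnicodeString; None when PyICU is missing."""
    try:
        from icu import UnicodeString
    except ImportError:
        return None  # PyICU is not installed
    # UnicodeString(bytes, encodingName) converts using the named encoding;
    # str() then converts the ICU string to a Python 3 str.
    return str(UnicodeString(data, encoding))


text = decode_with_icu(b"caf\xe9", "iso-8859-1")
print(text)
```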

ICU's C++ APIs accept and return ``UnicodeString`` arguments in several
ways: by value, by pointer or by reference.
When an ICU C++ API is documented to accept a ``UnicodeString`` reference
parameter, it is safe to assume that there are several corresponding
PyICU Python APIs that make it accessible in simpler ways.

For example, the ``'UnicodeString &Locale::getDisplayName(UnicodeString &)'``
API, documented
[here](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1Locale.html#a61def321a9cfd9904b59e3f1897f835e),
can be invoked from Python in several ways:

1. The ICU way

        >>> from icu import UnicodeString, Locale
        >>> locale = Locale('pt_BR')
        >>> string = UnicodeString()
        >>> name = locale.getDisplayName(string)
        >>> name
        <UnicodeString: 'Portuguese (Brazil)'>
        >>> name is string
        True                  <-- string arg was returned, modified in place

2. The Python way

        >>> from icu import Locale
        >>> locale = Locale('pt_BR')
        >>> name = locale.getDisplayName()
        >>> name
        'Portuguese (Brazil)'

    A ``UnicodeString`` object was allocated and converted to a Python
    ``str`` object.

A UnicodeString can be converted to a Python unicode string with Python 3's
``str()`` or Python 2's ``unicode()`` constructor. The usual ``len()``,
comparison, ``[]`` and ``[:]`` operators are all available, with the additional
twists that slicing is not read-only and that ``+=`` is also available since a
UnicodeString is mutable. For example:

    >>> from icu import Locale, UnicodeString
    >>> locale = Locale('pt_BR')
    >>> name = locale.getDisplayName()
    >>> name
    'Portuguese (Brazil)'
    >>> name = UnicodeString(name)
    >>> name
    <UnicodeString: 'Portuguese (Brazil)'>
    >>> str(name)
    'Portuguese (Brazil)'
    >>> len(name)
    19
    >>> name[3]
    't'
    >>> name[12:18]
    <UnicodeString: 'Brazil'>
    >>> name[12:18] = 'the country of Brasil'
    >>> name
    <UnicodeString: 'Portuguese (the country of Brasil)'>
    >>> name += ' oh joy'
    >>> name
    <UnicodeString: 'Portuguese (the country of Brasil) oh joy'>

### error reporting

The C++ ICU library does not use C++ exceptions to report errors. ICU
C++ APIs return errors via a ``UErrorCode`` reference argument. All such
APIs are wrapped by Python APIs that omit this argument and throw an
``ICUError`` Python exception instead. The same is true for ICU APIs
taking both a ``ParseError`` and a ``UErrorCode``: both arguments are
omitted.

For example, the ``'UnicodeString &DateFormat::format(const Formattable &, UnicodeString &, FieldPosition &, UErrorCode &)'`` API, documented [here](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1DateFormat.html#aae63209f1202550c91e2beed5691b062) is invoked from Python with:

    >>> from icu import DateFormat, Formattable
    >>> df = DateFormat.createInstance()
    >>> df
    <SimpleDateFormat: M/d/yy h:mm a>
    >>> f = Formattable(940284258.0, Formattable.kIsDate)
    >>> df.format(f)
    '10/18/99 3:04 PM'

Of course, the simpler ``'UnicodeString &DateFormat::format(UDate, UnicodeString &)'`` documented [here](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1DateFormat.html#a5940ccf5676d3fa043d8255c55b7ddd1) can be used too:

    >>> from icu import DateFormat
    >>> df = DateFormat.createInstance()
    >>> df
    <SimpleDateFormat: M/d/yy h:mm a>
    >>> df.format(940284258.0)
    '10/18/99 3:04 PM'
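On the failure side, the omitted ``UErrorCode`` surfaces as an ``ICUError`` that can be caught like any other Python exception. The sketch below assumes that parsing unparsable text with ``SimpleDateFormat.parse()`` is reported this way, as ICU's ``DateFormat::parse(const UnicodeString &, UErrorCode &)`` sets a failure status in that case. The ``try_parse`` helper and its return strings are made up for illustration, and the import guard only keeps the snippet runnable where PyICU is absent.

```python
try:
    from icu import ICUError, SimpleDateFormat
except ImportError:
    # PyICU is missing: the sketch then only reports that fact.
    ICUError = SimpleDateFormat = None


def try_parse(text, pattern="yyyy-MM-dd"):
    """Describe the outcome of parsing text with an ICU date pattern."""
    if SimpleDateFormat is None:
        return "PyICU not installed"
    try:
        # No UErrorCode argument is passed; failures raise ICUError.
        seconds = SimpleDateFormat(pattern).parse(text)
    except ICUError as error:
        return "ICUError: %s" % error
    return "parsed: %s" % seconds


result = try_parse("not a date")
print(result)
```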

### dates

ICU uses a double floating point type called ``UDate`` that represents the
number of milliseconds elapsed since 1970-jan-01 UTC for dates.

In Python, the value returned by the ``time`` module's ``time()``
function is the number of seconds since 1970-jan-01 UTC. Because of this
difference, floating point values are multiplied by 1000 when passed to
APIs taking ``UDate`` and divided by 1000 when returned as ``UDate``.

Python's ``datetime`` objects, with or without timezone information, can
also be used with APIs taking ``UDate`` arguments. The ``datetime``
objects get converted to ``UDate`` when crossing into the C++ layer.
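The unit conversion itself is plain arithmetic. The following standard-library sketch shows it outside of PyICU; the helper names are hypothetical and only illustrate the factor of 1000 described above, not PyICU's internal code.

```python
from datetime import datetime, timezone

MILLIS_PER_SECOND = 1000.0


def seconds_to_udate(seconds):
    """Python epoch seconds -> ICU UDate (epoch milliseconds)."""
    return seconds * MILLIS_PER_SECOND


def udate_to_seconds(udate):
    """ICU UDate (epoch milliseconds) -> Python epoch seconds."""
    return udate / MILLIS_PER_SECOND


stamp = 940284258.0  # the timestamp used in the formatting examples
udate = seconds_to_udate(stamp)
moment = datetime.fromtimestamp(stamp, tz=timezone.utc)
print(udate, moment.isoformat())
```

When calling PyICU there is no need to do this by hand: pass seconds or a ``datetime`` and the wrapper performs the conversion.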

### arrays

Many ICU APIs take array arguments. From Python, pass a list whose elements
are of the array's element type.

### StringEnumeration

An ICU ``StringEnumeration`` has three ``next`` methods: ``next()`` which
returns ``str`` objects, ``unext()`` which returns ``str`` objects in Python 3
or ``unicode`` objects in Python 2 and ``snext()`` which returns
``UnicodeString`` objects. Any of these methods can be used as an iterator,
using the Python built-in ``iter`` function.

For example, let ``e`` be a ``StringEnumeration`` instance:

```python
from icu import TimeZone

e = TimeZone.createEnumeration()
# each line below is an alternative way of consuming the enumeration
[s for s in e] # a list of 'str' objects
[s for s in iter(e.unext, '')] # a list of 'str' or 'unicode' objects
[s for s in iter(e.snext, '')] # a list of 'UnicodeString' objects
```

### timezones

The ICU ``TimeZone`` type may be wrapped with an ``ICUtzinfo`` type for
usage with Python's ``datetime`` type. For example:

```python
from datetime import datetime
from icu import ICUtzinfo, TimeZone

tz = ICUtzinfo(TimeZone.createTimeZone('US/Mountain'))
datetime.now(tz)
```

or, even simpler:

```python
tz = ICUtzinfo.getInstance('Pacific/Fiji')
datetime.now(tz)
```

To get the default time zone use:

```python
defaultTZ = ICUtzinfo.getDefault()
```

To get the time zone's id, use the ``tzid`` attribute or coerce the time
zone to a string:

```python
ICUtzinfo.getInstance('Pacific/Fiji').tzid  # 'Pacific/Fiji'
str(ICUtzinfo.getInstance('Pacific/Fiji'))  # 'Pacific/Fiji'
```

## Further Reading

The [unit tests](https://gitlab.pyicu.org/main/pyicu/-/tree/main/test) have
more examples of actual PyICU usage.

There are also a few
[samples](https://gitlab.pyicu.org/main/pyicu/-/tree/main/samples) ported from
ICU C/C++.