Python and Unicode Panjabi (ਪੰਜਾਬੀ)

Written by Arvinder Singh / July 20, 2012 / 8 mins read / Filed under Python, Unicode, Panjabi

Documenting the weird ways of Python 2.7 vs Python 3 when dealing with Unicode characters.

Examples in Punjabi text. Python 3 is a liberator, but still far away. Come follow along.

> python

In [0]:# Let's bhangra!

In [1]: unicode('panjab')
Out[1]: u'panjab'

In [2]: unicode('ਪੰਜਾਬ')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
/Users/askang/Desktop/<ipython-input-2-ef57dc919a30> in <module>()
----> 1 unicode('ਪੰਜਾਬ')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

In [3]: unicode(u'ਪੰਜਾਬ')
Out[3]: u'\u0a2a\u0a70\u0a1c\u0a3e\u0a2c'
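The failure in In [2] comes from Python 2 quietly decoding a byte string with the ASCII codec. A minimal sketch of that step made explicit, written for Python 3 (an assumption; the session above is Python 2.7), where the decode has to be spelled out:

```python
# Python 3 sketch: make the implicit ASCII decode from In [2] explicit.
word = '\u0a2a\u0a70\u0a1c\u0a3e\u0a2c'   # ਪੰਜਾਬ
raw = word.encode('utf-8')                # the bytes a UTF-8 terminal would send

try:
    raw.decode('ascii')                   # roughly what unicode('ਪੰਜਾਬ') attempts
    error = None
except UnicodeDecodeError as exc:
    error = exc

decoded = raw.decode('utf-8')             # naming the real encoding succeeds

print(error)
print(decoded == word)
```

The first byte of the UTF-8 form is 0xe0, which is outside the 0 to 127 range that ASCII covers, hence the error at position 0.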

In [4]: # String vs Unicode

In [5]: s = 'punjab'

In [6]: type(s)
Out[6]: str

In [7]: s = u'punjab'

In [8]: s
Out[8]: u'punjab'

In [9]: type(s)
Out[9]: unicode

In [10]: # Western Bhangra Fusion

In [13]: s =  u'panjab ਪੰਜਾਬ'

In [14]: type(s)
Out[14]: unicode

In [15]: s = u'panjab ਪੰਜਾਬ ਇੰਟਰਨੈਸ਼ਨਲ'

In [16]: s
Out[16]: u'panjab \u0a2a\u0a70\u0a1c\u0a3e\u0a2c \u0a07\u0a70\u0a1f\u0a30\u0a28\u0a48\u0a36\u0a28\u0a32'

In [17]: type(s)
Out[17]: unicode

In [17]: # Thus a Python 2.* unicode object is a sequence of code points, and its repr shows every non-ASCII one as a \u escape. Drawing the glyphs is someone else's headache (the terminal and the font).
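To see the code points without relying on the repr, here is a small sketch. It assumes Python 3, where `ascii()` gives the escaped form that Python 2 prints by default; the variable names are only illustrative.

```python
# Python 3 sketch: look at the code points behind a mixed string.
s = 'panjab \u0a2a\u0a70\u0a1c\u0a3e\u0a2c'   # 'panjab ਪੰਜਾਬ'

code_points = ['U+%04X' % ord(ch) for ch in s]
escaped = ascii(s)          # escaped view, similar to the Python 2 repr

print(len(s))               # counts code points, not rendered glyphs
print(code_points[7:])      # the Gurmukhi part only
print(escaped)
```

Note that ਪੰ is drawn as one unit on screen but counts as two code points (PA followed by TIPPI).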

In [18]: s.count('a')
Out[18]: 2

In [19]: s.count(u'a')
Out[19]: 2

In [20]: s.count(u'ੲ')
Out[20]: 0

In [21]: s.count(u'ਇ')
Out[21]: 1

In [22]: s.count(u'ਇੰ')
Out[22]: 1

In [23]: # ੲ (GURMUKHI IRI, \u0A72) and ਇ (GURMUKHI LETTER I, \u0A07) are separate code points: a vowel bearer vs an independent vowel. In my opinion this is a poor choice, because we are going to run into issues when mining data.
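A sketch of why this matters for text mining, assuming Python 3 and the standard `unicodedata` module. It compares ਇ with the sequence IRI + VOWEL SIGN I, which may render similarly in some fonts, and checks whether any Unicode normalization form treats them as the same text.

```python
# Python 3 sketch: ੲ + ਿ and ਇ stay distinct under every normalization form.
import unicodedata

IRI = '\u0a72'        # ੲ GURMUKHI IRI
LETTER_I = '\u0a07'   # ਇ GURMUKHI LETTER I
SIHARI = '\u0a3f'     # ਿ GURMUKHI VOWEL SIGN I

typed_as_sequence = IRI + SIHARI

results = {}
for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    a = unicodedata.normalize(form, typed_as_sequence)
    b = unicodedata.normalize(form, LETTER_I)
    results[form] = (a == b)

print(unicodedata.name(IRI), '|', unicodedata.name(LETTER_I))
print(results)
```

Since normalization does not unify the two spellings, a search or count over scraped text would need its own mapping if both forms are expected to appear.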

In [24]: unichr(38677)
Out[24]: u'\u9715'

In [25]: ord(u'\u9715')
Out[25]: 38677

In [26]: # ord() returns the code point of a unicode character as an integer, and unichr() is the inverse. Code points start at zero (U+0000, NULL, a control character).
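The same round trip in Python 3 (an assumption; the session above is 2.7), where `chr()` takes over the job of `unichr()`:

```python
# Python 3 sketch: ord() and chr() are inverses over code points.
import unicodedata

cp = ord('\u9715')          # character -> integer code point
ch = chr(38677)             # integer code point -> character

pa = ord('\u0a2a')          # ਪ GURMUKHI LETTER PA
first_category = unicodedata.category(chr(0))   # category of U+0000

print(cp, hex(cp))
print(ch == '\u9715')
print(pa, first_category)
```

The category check shows that U+0000 is a control character (`Cc`) rather than an unassigned code point (`Cn`).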

In [27]: s = u'punjab'

In [28]: s = u'punjab ਪੰਜਾਬ'

In [29]: type(s)
Out[29]: unicode

In [30]: s
Out[30]: u'punjab \u0a2a\u0a70\u0a1c\u0a3e\u0a2c'

In [31]: s.encode('utf-8')
Out[31]: 'punjab \xe0\xa8\xaa\xe0\xa9\xb0\xe0\xa8\x9c\xe0\xa8\xbe\xe0\xa8\xac'

In [32]: se = s.encode('utf-8')

In [33]: type(se)
Out[33]: str

In [34]: # encode() converts a unicode object into a byte string (str in Python 2).

In [35]: sd = se.decode('utf-8')

In [36]: sd
Out[36]: u'punjab \u0a2a\u0a70\u0a1c\u0a3e\u0a2c'

In [37]: type(sd)
Out[37]: unicode

In [38]: # decode() converts a byte string back into unicode, reading the bytes as UTF-8 in this case.
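For comparison, a sketch of the same round trip in Python 3 (an assumption; the session above is 2.7). There the two types are named `str` and `bytes`, which makes the direction of each call easier to remember.

```python
# Python 3 sketch: encode() goes str -> bytes, decode() goes bytes -> str.
text = 'punjab \u0a2a\u0a70\u0a1c\u0a3e\u0a2c'   # 'punjab ਪੰਜਾਬ'

data = text.encode('utf-8')        # bytes
round_trip = data.decode('utf-8')  # str again

print(type(text).__name__, len(text))   # counts code points
print(type(data).__name__, len(data))   # counts bytes
print(round_trip == text)
```

Each Gurmukhi code point here takes three bytes in UTF-8, so the byte string is longer than the text.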

Unicode literals in source code

Try running the following script -

#!/usr/bin/env python

s = u'Panjab International ਪੰਜਾਬ ਇੰਟਰਨੈਸ਼ਨਲ'
print 'Unicode String: ', s

we get

$ ❯ python python_text.py
  File "python_text.py", line 4
SyntaxError: Non-ASCII character '\xe0' in file python_text.py on line 4,
but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details

Python 2 assumes the source file is ASCII unless an encoding is declared (see PEP 263). Let's declare the encoding as UTF-8.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

s = u'Panjab International ਪੰਜਾਬ ਇੰਟਰਨੈਸ਼ਨਲ'
print 'Unicode String: ', s

returns

$ ❯ python python_text.py
Unicode String:  Panjab International ਪੰਜਾਬ ਇੰਟਰਨੈਸ਼ਨਲ
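The coding line can also be inspected programmatically. A minimal sketch, assuming Python 3 and its standard `tokenize.detect_encoding`, which applies the PEP 263 rules to the first two lines of a source file; the script contents below are a shortened, hypothetical stand-in for `python_text.py`.

```python
# Python 3 sketch: ask the tokenizer which source encoding it would use.
import io
import tokenize

# Hypothetical script body, stored as the UTF-8 bytes an editor would save.
body = "s = u'\u0a2a\u0a70\u0a1c\u0a3e\u0a2c'\n".encode('utf-8')
with_cookie = b"#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\n" + body
without_cookie = b"#!/usr/bin/env python\n\n" + body


def source_encoding(source_bytes):
    """Return the encoding Python 3's tokenizer would pick for these bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


declared = source_encoding(with_cookie)
assumed = source_encoding(without_cookie)
print(declared, assumed)
```

Python 3 falls back to UTF-8 when no declaration is present, which is why the undeclared script only fails under Python 2.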

Code point properties

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import unicodedata

s = u'Panjab International ਪੰਜਾਬ ਇੰਟਰਨੈਸ਼ਨਲ'

print "====================================================="
print "index", '\t', 'ord(c)', '\t', "category", '\t', "name"
print "====================================================="

for index, char in enumerate(s):
    print index, '\t', ord(char), '\t', unicodedata.category(char), '\t', '\t', unicodedata.name(char)

returns

$❯ python python_text.py
=====================================================
index 	ord(c) 	category 	name
=====================================================
0 	80 	Lu 		LATIN CAPITAL LETTER P
1 	97 	Ll 		LATIN SMALL LETTER A
2 	110 	Ll 		LATIN SMALL LETTER N
3 	106 	Ll 		LATIN SMALL LETTER J
4 	97 	Ll 		LATIN SMALL LETTER A
5 	98 	Ll 		LATIN SMALL LETTER B
6 	32 	Zs 		SPACE
7 	73 	Lu 		LATIN CAPITAL LETTER I
8 	110 	Ll 		LATIN SMALL LETTER N
9 	116 	Ll 		LATIN SMALL LETTER T
10 	101 	Ll 		LATIN SMALL LETTER E
11 	114 	Ll 		LATIN SMALL LETTER R
12 	110 	Ll 		LATIN SMALL LETTER N
13 	97 	Ll 		LATIN SMALL LETTER A
14 	116 	Ll 		LATIN SMALL LETTER T
15 	105 	Ll 		LATIN SMALL LETTER I
16 	111 	Ll 		LATIN SMALL LETTER O
17 	110 	Ll 		LATIN SMALL LETTER N
18 	97 	Ll 		LATIN SMALL LETTER A
19 	108 	Ll 		LATIN SMALL LETTER L
20 	32 	Zs 		SPACE
21 	2602 	Lo 		GURMUKHI LETTER PA
22 	2672 	Mn 		GURMUKHI TIPPI
23 	2588 	Lo 		GURMUKHI LETTER JA
24 	2622 	Mc 		GURMUKHI VOWEL SIGN AA
25 	2604 	Lo 		GURMUKHI LETTER BA
26 	32 	Zs 		SPACE
27 	2567 	Lo 		GURMUKHI LETTER I
28 	2672 	Mn 		GURMUKHI TIPPI
29 	2591 	Lo 		GURMUKHI LETTER TTA
30 	2608 	Lo 		GURMUKHI LETTER RA
31 	2600 	Lo 		GURMUKHI LETTER NA
32 	2632 	Mn 		GURMUKHI VOWEL SIGN AI
33 	2614 	Lo 		GURMUKHI LETTER SHA
34 	2600 	Lo 		GURMUKHI LETTER NA
35 	2610 	Lo 		GURMUKHI LETTER LA
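A sketch of the same report for Python 3 (an assumption; the script above uses Python 2 print statements). The helper name `describe` is made up for this example; it returns the rows instead of printing them so they can be reused, for instance to count letters and marks.

```python
# Python 3 sketch: per-character properties as reusable rows.
import unicodedata


def describe(text):
    """Return (index, code point, category, name) for each character."""
    return [
        (i, ord(ch), unicodedata.category(ch), unicodedata.name(ch, '<unnamed>'))
        for i, ch in enumerate(text)
    ]


rows = describe('\u0a2a\u0a70\u0a1c\u0a3e\u0a2c')   # ਪੰਜਾਬ
for row in rows:
    print('{}\t{}\t{}\t{}'.format(*row))

# Lo = letter; Mn and Mc = combining marks attached to the preceding letter.
letters = sum(1 for _, _, cat, _ in rows if cat.startswith('L'))
marks = sum(1 for _, _, cat, _ in rows if cat.startswith('M'))
print(letters, marks)
```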

Python 3

In Python 3, all strings are sequences of Unicode characters.

(py3)Desktop  python
Python 3.2.3 (default, Jul 18 2012, 15:57:31)
Type \e to get an external editor.

>>> s = 'punjab'
>>> type(s)
<class 'str'>
>>> s = "ਪੰਜਾਬ"
>>> type(s)
<class 'str'>
>>> s
'ਪੰਜਾਬ'

From Dive Into Python 3:

In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.
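A short sketch of what that passage means in practice, assuming Python 3. The same string produces different bytes under different encodings, and text cannot be mixed with bytes by accident.

```python
# Python 3 sketch: one string, several byte representations.
text = '\u0a2a\u0a70\u0a1c\u0a3e\u0a2c'   # ਪੰਜਾਬ

utf8 = text.encode('utf-8')        # 3 bytes per Gurmukhi code point
utf16 = text.encode('utf-16-le')   # 2 bytes per Gurmukhi code point

# Both byte sequences decode back to the same text.
same_text = utf8.decode('utf-8') == utf16.decode('utf-16-le') == text

try:
    text + utf8                    # str + bytes is refused outright
    mixing_error = None
except TypeError as exc:
    mixing_error = exc

print(len(text), len(utf8), len(utf16))
print(same_text)
print(type(mixing_error).__name__)
```

The string itself has no encoding; an encoding only appears at the point where it is turned into bytes.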

Other Resources