Python and Unicode Panjabi(ਪੰਜਾਬੀ)
Written by Arvinder Singh / July 20, 2012 / Filed under Python, Unicode, Panjabi
Documenting the weird ways of Python 2.7 vs Python 3 when dealing with Unicode characters.
Examples in Punjabi text. Python 3 is a liberator, but still far away. Come follow along.
> python
In [0]: # Let's bhangra!
In [1]: unicode('panjab')
Out[1]: u'panjab'
In [2]: unicode('ਪੰਜਾਬ')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
/Users/askang/Desktop/<ipython-input-2-ef57dc919a30> in <module>()
----> 1 unicode('ਪੰਜਾਬ')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
In [3]: unicode(u'ਪੰਜਾਬ')
Out[3]: u'\u0a2a\u0a70\u0a1c\u0a3e\u0a2c'
In [4]: # String vs Unicode
In [5]: s = 'punjab'
In [6]: type(s)
Out[6]: str
In [7]: s = u'punjab'
In [8]: s
Out[8]: u'punjab'
In [9]: type(s)
Out[9]: unicode
In [10]: # Western Bhangra Fusion
In [13]: s = u'panjab ਪੰਜਾਬ'
In [14]: type(s)
Out[14]: unicode
In [15]: s = u'panjab ਪੰਜਾਬ ਇੰਟਰਨੈਸ਼ਨਲ'
In [16]: s
Out[16]: u'panjab \u0a2a\u0a70\u0a1c\u0a3e\u0a2c \u0a07\u0a70\u0a1f\u0a30\u0a28\u0a48\u0a36\u0a28\u0a32'
In [17]: type(s)
Out[17]: unicode
In [17]: # Thus Python 2.* shows every non-ASCII character of a unicode object as a \uXXXX code point escape when it echoes the value. Drawing the glyphs is someone else's headache.
In [18]: s.count('a')
Out[18]: 2
In [19]: s.count(u'a')
Out[19]: 2
In [20]: s.count(u'ੲ')
Out[20]: 0
In [21]: s.count(u'ਇ')
Out[21]: 1
In [22]: s.count(u'ਇੰ')
Out[22]: 1
In [23]: # ੲ (GURMUKHI IRI, U+0A72) and ਇ (GURMUKHI LETTER I, U+0A07) are separate code points: the bare vowel carrier vs. the independent vowel. A poor choice in my opinion, because a search for ੲ will not find ਇ, and we are going to run into issues when mining data.
In [24]: unichr(38677)
Out[24]: u'\u9715'
In [25]: ord(u'\u9715')
Out[25]: 38677
In [25]: # ord() returns the integer code point of a character (38677 is 0x9715 in hexadecimal) and unichr() is its inverse. Code points are numbered from zero, the first entry of the Unicode table (U+0000).
In [27]: s = u'punjab'
In [28]: s = u'punjab ਪੰਜਾਬ'
In [29]: type(s)
Out[29]: unicode
In [30]: s
Out[30]: u'punjab \u0a2a\u0a70\u0a1c\u0a3e\u0a2c'
In [31]: s.encode('utf-8')
Out[31]: 'punjab \xe0\xa8\xaa\xe0\xa9\xb0\xe0\xa8\x9c\xe0\xa8\xbe\xe0\xa8\xac'
In [32]: se = s.encode('utf-8')
In [33]: type(se)
Out[33]: str
In [33]: # encode() converts a unicode object into a byte string (a plain str in Python 2).
In [35]: sd = se.decode('utf-8')
In [36]: sd
Out[36]: u'punjab \u0a2a\u0a70\u0a1c\u0a3e\u0a2c'
In [37]: type(sd)
Out[37]: unicode
In [38]: # decode() converts the byte string back into a unicode object, reading the bytes as utf-8 in this case
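The ੲ versus ਇ counts and the ord()/unichr() pair from the session above can be replayed in Python 3. This is only a sketch: it assumes Python 3 (where unichr() is spelled chr()), writes the sample string with escapes so the exact code points are unambiguous, and uses variable names that are mine, not from the session.

```python
import unicodedata

# "panjab ਪੰਜਾਬ ਇੰਟਰਨੈਸ਼ਨਲ", spelled with escapes so the code points are explicit
s = ("panjab \u0a2a\u0a70\u0a1c\u0a3e\u0a2c "
     "\u0a07\u0a70\u0a1f\u0a30\u0a28\u0a48\u0a36\u0a28\u0a32")

iri = "\u0a72"       # ੲ
letter_i = "\u0a07"  # ਇ

# str.count() compares code points, so the two characters never match each other.
iri_count = s.count(iri)
letter_i_count = s.count(letter_i)

# ord() and chr() are inverses of each other.
round_trip = chr(ord(letter_i))

for ch in (iri, letter_i):
    print("U+%04X" % ord(ch), unicodedata.name(ch))
print(iri_count, letter_i_count, round_trip == letter_i)
```

Anyone mining Panjabi text would probably want to decide up front which of the two characters a query should match, since the string methods will not treat them as related.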
Unicode literals in source code
Try running the following script -
#!/usr/bin/env python
s = u'Panjab International ਪੰਜਾਬ ਇੰਟਰਨੈਸ਼ਨਲ'
print 'Unicode String: ', s
we get
$ ❯ python python_text.py
File "python_text.py", line 4
SyntaxError: Non-ASCII character '\xe0' in file python_text.py on line 4,
but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
Python 2 assumes a source file is ASCII unless an encoding is declared (see PEP 263). Let's declare the encoding as utf-8.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
s = u'Panjab International ਪੰਜਾਬ ਇੰਟਰਨੈਸ਼ਨਲ'
print 'Unicode String: ', s
returns
$ ❯ python python_text.py
Unicode String: Panjab International ਪੰਜਾਬ ਇੰਟਰਨੈਸ਼ਨਲ
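To see how the coding line is picked up, here is a small sketch using tokenize.detect_encoding from the Python 3 standard library. It is an illustration under assumptions, not what the Python 2 interpreter runs internally: the helper name declared_encoding and the three sample sources are made up for this example, and note that Python 3 falls back to utf-8 where Python 2 falls back to ASCII.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding Python 3's tokenizer detects for raw bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# Hypothetical source files, kept ASCII-only so the bytes are easy to read.
utf8_cookie = b"#!/usr/bin/env python\n# -*- coding: utf-8 -*-\ns = 1\n"
latin1_cookie = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\ns = 1\n"
no_cookie = b"#!/usr/bin/env python\ns = 1\n"

for sample in (utf8_cookie, latin1_cookie, no_cookie):
    print(declared_encoding(sample))
```

The declaration only counts on the first or second line of the file, which is why it sits right under the shebang in the script above. latin-1 is reported under its normalised name, iso-8859-1.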
Code point properties
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import unicodedata
s = u'Panjab International ਪੰਜਾਬ ਇੰਟਰਨੈਸ਼ਨਲ'
print "====================================================="
print "index", '\t', 'ord(c)', '\t', "category", '\t', "name"
print "====================================================="
for index, char in enumerate(s):
    print index, '\t', ord(char), '\t', unicodedata.category(char), '\t', '\t', unicodedata.name(char)
returns
$❯ python python_text.py
=====================================================
index ord(c) category name
=====================================================
0 80 Lu LATIN CAPITAL LETTER P
1 97 Ll LATIN SMALL LETTER A
2 110 Ll LATIN SMALL LETTER N
3 106 Ll LATIN SMALL LETTER J
4 97 Ll LATIN SMALL LETTER A
5 98 Ll LATIN SMALL LETTER B
6 32 Zs SPACE
7 73 Lu LATIN CAPITAL LETTER I
8 110 Ll LATIN SMALL LETTER N
9 116 Ll LATIN SMALL LETTER T
10 101 Ll LATIN SMALL LETTER E
11 114 Ll LATIN SMALL LETTER R
12 110 Ll LATIN SMALL LETTER N
13 97 Ll LATIN SMALL LETTER A
14 116 Ll LATIN SMALL LETTER T
15 105 Ll LATIN SMALL LETTER I
16 111 Ll LATIN SMALL LETTER O
17 110 Ll LATIN SMALL LETTER N
18 97 Ll LATIN SMALL LETTER A
19 108 Ll LATIN SMALL LETTER L
20 32 Zs SPACE
21 2602 Lo GURMUKHI LETTER PA
22 2672 Mn GURMUKHI TIPPI
23 2588 Lo GURMUKHI LETTER JA
24 2622 Mc GURMUKHI VOWEL SIGN AA
25 2604 Lo GURMUKHI LETTER BA
26 32 Zs SPACE
27 2567 Lo GURMUKHI LETTER I
28 2672 Mn GURMUKHI TIPPI
29 2591 Lo GURMUKHI LETTER TTA
30 2608 Lo GURMUKHI LETTER RA
31 2600 Lo GURMUKHI LETTER NA
32 2632 Mn GURMUKHI VOWEL SIGN AI
33 2614 Lo GURMUKHI LETTER SHA
34 2600 Lo GURMUKHI LETTER NA
35 2610 Lo GURMUKHI LETTER LA
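For comparison, roughly the same table in Python 3 might look like the sketch below. The describe() helper is a name I made up, the "<unnamed>" fallback is my own choice for characters without a Unicode name, and the sample word is written with escapes to pin down the code points.

```python
import unicodedata


def describe(text):
    """Return (index, code point, category, name) for each character in text."""
    return [
        (i, ord(ch), unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
    ]


rows = describe("\u0a2a\u0a70\u0a1c\u0a3e\u0a2c")  # ਪੰਜਾਬ
for index, codepoint, category, name in rows:
    print("%d\t%d\t%s\t%s" % (index, codepoint, category, name))
```

No coding declaration and no u prefix are needed here: Python 3 reads source files as UTF-8 by default and every str is already Unicode.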
Python 3
In Python 3, all strings are sequences of Unicode characters.
(py3) Desktop ❯ python
Python 3.2.3 (default, Jul 18 2012, 15:57:31)
Type \e to get an external editor.
>>> s = 'punjab'
>>> type(s)
<class 'str'>
>>> s = "ਪੰਜਾਬ"
>>> type(s)
<class 'str'>
>>> s
'ਪੰਜਾਬ'
From Dive into Python 3
In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.
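The bytes-versus-characters split in that passage can be sketched in a few lines of Python 3. The variable names are mine, and utf-8 is just the encoding used throughout this post; any other codec that can represent Gurmukhi would work the same way.

```python
text = "\u0a2a\u0a70\u0a1c\u0a3e\u0a2c"  # ਪੰਜਾਬ as a str: five code points

encoded = text.encode("utf-8")     # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

# The same bytes are not valid ASCII, so an ASCII decode fails.
try:
    encoded.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(type(text).__name__, len(text))
print(type(encoded).__name__, len(encoded))
print(decoded == text, ascii_ok)
```

Each Gurmukhi character takes three bytes in utf-8, so the lengths of the str and the bytes object differ even though they stand for the same word.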
Other Resources