One big difference between Python2 and Python3 is how they handle strings. I have been bitten by this several times. The motivation for this blog is to clarify some basic questions.
Code Point
We all know computers can only understand 0s and 1s. To represent a character or a string, we need some kind of binary representation. A well-known code is ASCII, for example:
a = 01100001
The problem with ASCII is that it is a 7-bit code, so there are only 128 combinations, and even a full 8-bit byte allows only 256. Of course there are other encodings, but they are beyond the scope of this blog.
In a nutshell, a code point is an integer that represents a character.
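To make the "integer behind a character" idea concrete, here is a minimal Python3 sketch using the built-ins ord and chr. The variable name code_point is only illustrative.

```python
# A code point is just an integer assigned to a character.
code_point = ord("a")               # character -> integer
print(code_point)                   # 97
print(format(code_point, "08b"))    # 01100001, the bit pattern shown above
print(chr(code_point))              # a, integer -> character
```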
Unicode code point
The difference between a code point and a Unicode code point is that Unicode code points are usually written in hexadecimal with a U+ prefix; this is just the convention. So the Unicode code points for the letters a, b and c are:
a = U+0061
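The U+ notation can be reproduced in Python3 with a small helper. This is only a sketch, and unicode_notation is a hypothetical name chosen for this example, not a standard function.

```python
def unicode_notation(char):
    """Return the U+XXXX notation for a single character."""
    return "U+{:04X}".format(ord(char))


for letter in "abc":
    print(letter, "=", unicode_notation(letter))
# a = U+0061
# b = U+0062
# c = U+0063

print(unicode_notation("\u00e9"))   # U+00E9, the code point of é
```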
Text vs Bytes
When we talk about strings, we usually mean the text you see on the screen, like this blog; let’s call it human text. What a machine understands is binary data, so bytes in Python3, or str in Python2, is used to represent the machine text, the binary data.
The following is an example with the text café:
# Python3 version
If you copy the code above into Python2, you will see:
s = 'café'
We will dive into this error later and find a solution for it.
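As a reference point for the Python3 side, here is a minimal sketch of the text and bytes views of café. It is an illustration written for this section, not the original snippet; the names text and data are assumptions.

```python
text = "caf\u00e9"               # human text: str in Python3 (\u00e9 is é)
data = text.encode("utf-8")      # machine text: bytes

print(type(text).__name__, len(text))   # str 4   (four characters)
print(type(data).__name__, len(data))   # bytes 5 (é takes two bytes in UTF-8)
print(data)                             # b'caf\xc3\xa9'
```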
Encode vs Decode
I was confused about the encode and decode steps until I found a good analogy for them.
- Encode / Encrypt
Think of the encode step as encryption: we translate human text into machine text, or bytes, which can only be understood by machines, that is, café -> caf\xc3\xa9. The first three characters are still represented by themselves for convenience, but the é is represented as two bytes.
Without knowing the encoding method, caf\xc3\xa9 can be interpreted as a different string:
b'caf\xc3\xa9'.decode("gb2312")
The code above explains the analogy between encode and encrypt.
- Decode / Decrypt
Given the context of encode, we naturally arrive at the analogy between decode and decrypt. To correctly “decrypt” the bytes, we need to know the encoding charset.
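The "wrong key" effect can be sketched in Python3 as follows. latin-1 and ascii are used here only as illustrative wrong charsets; they are assumptions of this example, not part of the analogy itself.

```python
data = b"caf\xc3\xa9"                  # UTF-8 bytes for café

right = data.decode("utf-8")           # correct charset
wrong = data.decode("latin-1")         # wrong charset: no error, wrong text

print(right == "caf\u00e9")            # True: café
print(wrong == "caf\u00c3\u00a9")      # True: cafÃ©, the bytes read as two characters

try:
    data.decode("ascii")               # a charset with no meaning for byte 0xc3
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)
```

A wrong charset does not always raise an error, which is why a silent mis-decode is the more dangerous case.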
str vs unicode vs bytes vs bytearray
After talking about encode and decode, let’s take a look at the types for strings.
There are two built-in string types in both Python2 and Python3: str and unicode in Python2, and bytes and str in Python3.
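On the Python3 side, the two types can be told apart with a short sketch like the one below. The Python2 counterparts, noted in the comments, are unicode for text and str for binary data.

```python
text = "abc"     # str in Python3: a sequence of code points (unicode in Python2)
data = b"abc"    # bytes in Python3: a sequence of integers below 256 (str in Python2)

print(type(text).__name__, type(data).__name__)   # str bytes
print(text == data)                               # False: text never equals bytes
print(text.encode("ascii") == data)               # True once the text is encoded
print(data.decode("ascii") == text)               # True once the bytes are decoded
```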
Solve the code snippet issue
Let’s answer why the very first code snippet only works in Python3, and how to write the Python2 code.
- Analysis
In Python2, when you define a string with r'' or single quotes or double quotes or triple quotes, it is a str in Python2, which is the machine text, binary data. In Python3, a string defined with r'' or single quotes or double quotes or triple quotes is also a str, but it is HUMAN TEXT, so you can encode it.
- Solution
Add from __future__ import unicode_literals. This makes the variable s a unicode in Python2 when you create it, which is equivalent to str in Python3. Refer to this link for the pros and cons of this statement.
We will cover bytes and bytearray in the Compatibility section.
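A minimal sketch of the solution, meant to run unchanged on both Python2 and Python3. The \u00e9 escape is used instead of a literal é so that no source-encoding declaration is needed, and print_function is added only so that print behaves the same on both versions; both choices are assumptions of this example.

```python
# __future__ imports must come before any other statement in the file.
from __future__ import print_function, unicode_literals

s = "caf\u00e9"                  # text on both versions: unicode in Python2, str in Python3
encoded = s.encode("utf-8")      # text -> bytes works the same way on both

print(repr(encoded))             # Python3 shows b'caf\xc3\xa9'
print(len(s), len(encoded))      # 4 5
```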
Compatibility
- bytes in Python2
In Python2, bytes is just an alias for str. You can simply check the source code here and here, or search for &PyString_Type to confirm it.
- bytearray in Python2
The bytearray type was brought to Python2 after bytes was made immutable (PEP 3137). It is just a mutable array of bytes; the only caveats are:
- The slice of a bytearray is still a bytearray.
- Indexing a bytearray returns an integer.
- Indexing other Python sequences returns the element itself (for a str, a one-character str); bytearray is the exception.
- Use from __future__ import unicode_literals in Python2 code (mind the pros and cons), so that you get the unicode type in Python2 when you create strings.
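The bytearray caveats listed above can be checked with a short sketch. It is shown for Python3; the variable names are illustrative only.

```python
buf = bytearray(b"abc")

print(buf[0])        # 97: indexing gives an integer
print(buf[0:1])      # bytearray(b'a'): slicing gives a bytearray

buf[0] = ord("A")    # unlike bytes, a bytearray can be changed in place
print(buf)           # bytearray(b'Abc')

text = "abc"
print(text[0])       # a: indexing a str gives a one-character str
```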
best practice
- Use from __future__ import unicode_literals wisely, weighing the pros and cons.
- Always specify the encoding; “utf8” is strongly recommended.
- When reading text, always decode it to human text first, then do whatever processing you like, and translate it back to bytes to persist the data.
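The decode, process, encode pattern from the last bullet might look like the sketch below. The scratch file demo.txt and the upper-casing step are hypothetical stand-ins for real input and real processing.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")   # hypothetical scratch file

# Prepare some input: text -> bytes, with the encoding stated explicitly.
with io.open(path, "wb") as f:
    f.write("caf\u00e9".encode("utf-8"))

# Read: turn bytes into human text as early as possible.
with io.open(path, "rb") as f:
    text = f.read().decode("utf-8")

# Process: work on human text only.
shouted = text.upper()

# Persist: turn text back into bytes at the boundary.
with io.open(path, "wb") as f:
    f.write(shouted.encode("utf-8"))

with io.open(path, "rb") as f:
    print(f.read())    # b'CAF\xc3\x89'
```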
bonus knowledge
- If you are given a sequence of bytes without the encoding, you can guess the encoding with chardet.
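A hedged sketch of guessing an encoding: it calls chardet.detect when the third-party chardet package is installed, and otherwise falls back to trial decoding. The helper names first_decodable and guess_encoding and the candidate list are assumptions made for this example, not part of chardet.

```python
def first_decodable(data, candidates=("utf-8", "gb2312", "latin-1")):
    """Return the first candidate encoding that decodes data without error."""
    for name in candidates:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


def guess_encoding(data):
    """Ask chardet for a guess; fall back to trial decoding if it is missing."""
    try:
        import chardet                 # third-party: pip install chardet
    except ImportError:
        return first_decodable(data)
    # detect() returns a dict; its "encoding" entry may be None
    return chardet.detect(data)["encoding"] or first_decodable(data)


print(first_decodable(b"caf\xc3\xa9"))   # utf-8
print(guess_encoding(b"caf\xc3\xa9"))    # only a guess, least reliable on short inputs
```

Either way the result is a guess, so it is still better to record the encoding alongside the data whenever you can.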