One big difference between Python2 and Python3 is how they handle strings. I have been bitten by this several times. The motivation for this blog is to clarify some basic questions.
Code Point
We all know computers can only understand 0s and 1s. To represent a character or a string, we need some kind of binary representation. A well-known code is ASCII, for example:
a = 01100001
The problem with ASCII is that it only uses 7 bits, which gives 128 combinations, and even a full 8-bit byte gives only 256. That is far too few for all the characters people write. Of course there are other encodings, but they are beyond the scope of this blog.
In a nutshell, a code point is an integer that represents a character.
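To make this concrete, here is a minimal sketch using Python's built-in `ord` and `chr`. The variable names are only illustrative.

```python
# A code point is just an integer that stands for a character.
code_point = ord("a")              # character -> integer
bits = format(code_point, "08b")   # the same integer written as 8 binary digits
char = chr(code_point)             # integer -> character

print(code_point, bits, char)
```

The binary digits printed here are the same ones shown for `a` in the ASCII example above.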
Unicode code point
The difference between a code point and a Unicode code point is only a matter of notation: by convention, Unicode code points are written in hexadecimal with a U+ prefix. The Unicode code points for the letters a, b and c are:
a = U+0061
b = U+0062
c = U+0063
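The U+ notation can be reproduced from any character with a few lines of Python. This is a small sketch, and `unicode_notation` is a hypothetical helper name, not something from the standard library.

```python
def unicode_notation(char):
    """Return the U+XXXX notation for a single character (hypothetical helper)."""
    # ord() gives the code point; format it as at least 4 uppercase hex digits.
    return "U+{:04X}".format(ord(char))


for letter in "abc":
    print(letter, "=", unicode_notation(letter))
```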
Text vs Bytes
When talking about strings, we usually mean the text you see on the screen, like this blog. Let's call it human text. What a machine understands is binary data, so bytes (in Python3) or str (in Python2) is used to represent the machine text, that is, binary data.
The following is an example using the text café:
# Python3 version
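The snippet above appears to be cut off, so here is only a sketch of what a Python3 version might look like, assuming UTF-8 as the encoding. The variable names are illustrative.

```python
# Python3: a plain string literal is human text (str).
s = 'café'
encoded = s.encode('utf-8')   # human text -> machine text (bytes)

print(type(s).__name__, len(s))              # type and number of characters
print(type(encoded).__name__, len(encoded))  # type and number of bytes
print(encoded)
```

The text has 4 characters but the encoded form has 5 bytes, because é needs two bytes in UTF-8.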
If you copy the code above into Python2, you will see:
s = 'café'
We will dive into this error later and find a solution for it.
Encode vs Decode
I was confused about the encode and decode steps until I found a good analogy for them.
- Encode / Encrypt
Think of the encode step as an encryption: we translate human text into machine text, or bytes, which can only be understood by machines, that is café -> caf\xc3\xa9. The first three characters are still represented by themselves for convenience, but the é is represented as two bytes.
Without knowing the encoding method, caf\xc3\xa9 can be interpreted as a different string:
b'caf\xc3\xa9'.decode("gb2312")
The code above explains the analogy between encode and encrypt.
- Decode / Decrypt
Given the context of encode, we naturally arrive at the analogy between decode and decrypt. To correctly "decrypt" the bytes, we need to know the encoding charset.
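The round trip can be sketched as follows. UTF-8 is assumed as the "key", and latin-1 is used only as an example of a wrong charset.

```python
text = u"café"
data = text.encode("utf-8")     # "encrypt": human text -> bytes

right = data.decode("utf-8")    # "decrypt" with the matching charset
wrong = data.decode("latin-1")  # wrong charset: no error, but each byte becomes its own character

print(right == text)  # the original text is recovered
print(wrong == text)  # the wrong charset silently gives different text
print(len(text), len(wrong))
```

Note that decoding with the wrong charset does not always raise an error. It can quietly return garbled text, which is why the charset has to be known rather than assumed.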
str vs unicode vs bytes vs bytearray
After talking about encode and decode, let's take a look at the types for strings.
There are two built-in types in both Python2 and Python3, one for human text and one for machine text: unicode and str in Python2, str and bytes in Python3.
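A quick way to see which type you are holding is to print the type names. This sketch uses the `u''` prefix so that the same lines are valid on Python2.7 and Python3.3+. The comments on Python2 behaviour describe what I would expect there, not something this sketch verifies.

```python
text = u"café"               # human text: str on Python3 (unicode on Python2)
data = text.encode("utf-8")  # machine text: bytes on Python3 (str on Python2)

print(type(text).__name__)
print(type(data).__name__)

# Only human text can be encoded, and only machine text can be decoded.
print(hasattr(text, "encode"), hasattr(data, "decode"))
```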
Solve the code snippet issue
Let’s answer why the very first code snippet only works in Python3, and how to write the Python2 code.
- Analysis
In Python2, when you define a string with r'', single quotes, double quotes or triple quotes, you get a Python2 str, which is machine text (binary data). In Python3, a string defined the same way is also called str, but it is HUMAN TEXT, so you can encode it.
- Solution
Add from __future__ import unicode_literals. This makes the variable s a unicode in Python2 when you create it, which is equivalent to str in Python3. Refer to this link for the pros and cons of this statement.
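A minimal sketch of the fix, assuming the source file is saved as UTF-8. The `__future__` import has to come before any other statement in the file, and on Python3 it is accepted but changes nothing.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # only changes literals on Python2

s = 'café'                   # human text on both versions
encoded = s.encode('utf-8')  # human text -> bytes

print(repr(encoded))
```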
We will cover bytes and bytearray in the Compatibility section.
Compatibility
- bytes in Python2
In Python2, bytes is just an alias for str. You can check the source code here and here, or search for &PyString_Type to confirm it.
- bytearray in Python2
The bytearray was backported to Python2 after bytes was made immutable (PEP 3137). It is just a mutable array of bytes, with a few caveats:
  - The slice of a bytearray is still a bytearray
  - The indexing of a bytearray gives an integer
  - Indexing any other Python sequence returns the type of the element itself; bytearray is the exception.
- use from __future__ import unicode_literals in Python2 code (see the pros and cons), so that you will get the unicode type in Python2 when you create strings.
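The bytearray caveats listed above can be checked with a short sketch. The variable names are illustrative.

```python
buf = bytearray(b"cafe")

piece = buf[0:1]   # slicing keeps the bytearray type
first = buf[0]     # indexing gives an integer, not a one-byte string
buf[0] = ord("C")  # unlike bytes, a bytearray can be changed in place

print(type(piece).__name__, type(first).__name__)
print(buf)
```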
best practice
- use from __future__ import unicode_literals wisely (see the pros and cons)
- always specify the encoding, and "utf8" is strongly recommended.
- when reading text, always decode it to human text, then do whatever processing you like, and translate it to bytes to persist the data.
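The last two points can be sketched as a "decode on the way in, encode on the way out" pattern. This is only an illustration: the file name `menu.txt` and the uppercase step are made up, and `io.open` is used because it accepts an `encoding` argument on both Python2 and Python3.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "menu.txt")  # hypothetical file

# Create some input, stating the encoding explicitly.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"café\n")

# 1. Reading: decode to human text right away.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

# 2. Processing: work on human text only.
processed = text.strip().upper()

# 3. Persisting: translate back to bytes, again with an explicit encoding.
payload = processed.encode("utf-8")
with io.open(path, "wb") as f:
    f.write(payload)

print(processed, payload)
```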
bonus knowledge
- if you are given a sequence of bytes without the encoding, you can guess the encoding with chardet
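A rough sketch of guessing with chardet follows. chardet is a third-party package, so the import is guarded, and `guess_encoding` is a hypothetical helper name. The result is only a guess and can be wrong, especially for short inputs.

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None


def guess_encoding(data):
    """Return chardet's best guess for the encoding of `data`, or None (hypothetical helper)."""
    if chardet is None:
        return None
    # chardet.detect returns a dict that includes 'encoding' and 'confidence'.
    return chardet.detect(data)["encoding"]


sample = u"café au lait, s'il vous plaît".encode("utf-8")
print(guess_encoding(sample))
```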