SAS to Python: Different Length of Special Characters
In this article I would like to share a difference between SAS and Python in how they handle special characters. Please note that this is not a comprehensive discussion of Unicode; it focuses on one small issue.
Special characters have different lengths in SAS and Python. For example, the length of ö is 3 using the length function in SAS, but 2 using the len function in Python. Another example is ♞, whose length is 3 in SAS and 1 in Python.
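The two counts can be reproduced in Python alone by comparing len on the string with len on its UTF-8 encoding. This is only a sketch: the names samples and lengths are made up for illustration, and the ö is written with escapes as the letter o plus a combining diaeresis, which is an assumption about how the character is stored in the data.

```python
# Compare Python's character count with the UTF-8 byte count.
# Assumption: the ö is stored as "o" + combining diaeresis (U+0308).
samples = {
    "o_umlaut": "o\u0308",
    "knight": "\u265e",  # the chess knight character
}

# Map each sample name to (len of the str, number of UTF-8 bytes).
lengths = {
    name: (len(text), len(text.encode("utf-8")))
    for name, text in samples.items()
}

for name, (n_chars, n_bytes) in lengths.items():
    print(f"{name}: len() = {n_chars}, UTF-8 bytes = {n_bytes}")
```

The byte counts match the lengths reported by SAS, and the len values match the ones reported by Python.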
Why does it matter?
In my case, I need to replicate results produced in SAS. The SAS code takes the first 5 characters of a string, ‘a♞dicefkdl’ for example. Since the length of ♞ is 3 in SAS, substr('a♞dicefkdl', 1, 5) gives ‘a♞d’. If I use string slicing in Python, 'a♞dicefkdl'[:5] gives ‘a♞dic’, which is different from the result in SAS.
Why is it different?
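It turns out that SAS calculates the length of a string from its UTF-8 encoding, that is, in bytes. The UTF-8 encoding of ♞ is ‘\xe2\x99\x9e’, which is three bytes, so the length is 3 in SAS. Python's len counts Unicode code points instead, so ♞ has length 1. The ö has length 2 in Python most likely because it is stored in decomposed form, the letter o followed by a combining diaeresis (U+0308): two code points, which take 1 + 2 = 3 bytes in UTF-8. The precomposed ö (U+00F6) would have length 1 in Python and 2 in SAS.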
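Whether a given ö is one code point or two can be checked with the standard unicodedata module. A small sketch, assuming the ö in question is the letter o plus a combining diaeresis; the helper describe is hypothetical and exists only for this illustration.

```python
import unicodedata

decomposed = "o\u0308"                               # "o" + combining diaeresis
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00F6


def describe(text):
    """Return (code point names, len in code points, len in UTF-8 bytes)."""
    names = [unicodedata.name(ch) for ch in text]
    return names, len(text), len(text.encode("utf-8"))


print(describe(decomposed))
print(describe(composed))
```

The two strings look identical on screen, yet they differ in both len and byte count, so it is worth checking which form the data uses before comparing lengths with SAS.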
How to solve it?
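Method 1: encode the whole string as UTF-8 in Python first, then take the first 5 bytes, and then decode the result back to a string. Note that if the cut lands in the middle of a multi-byte character, the decode step raises an error unless the incomplete bytes are handled.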
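A minimal sketch of Method 1. The function name sas_like_substr is made up for this example, and the use of errors="ignore" for a cut that lands inside a character is my own choice, not something taken from SAS.

```python
def sas_like_substr(text, n_bytes):
    """Keep the first n_bytes bytes of the UTF-8 encoding of text."""
    raw = text.encode("utf-8")[:n_bytes]
    # Assumption: if the cut lands inside a multi-byte character, the
    # incomplete trailing bytes are silently dropped when decoding.
    return raw.decode("utf-8", errors="ignore")


text = "a\u265edicefkdl"            # a, chess knight, then "dicefkdl"
print(sas_like_substr(text, 5))     # byte-based cut, as in SAS
print(text[:5])                     # character-based cut, plain Python slicing
```

The first call returns the three characters that fit in 5 bytes, while the plain slice returns five characters.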
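Method 2: adjust the slicing length first. Find the difference between the length of the UTF-8 encoding and the length of the Python string, then subtract that difference from the requested length before slicing. For ‘a♞dicefkdl’, the UTF-8 encoding is 12 bytes long and the string is 10 characters long, so the difference is 2 and 'a♞dicefkdl'[:5 - 2] gives the correct answer, ‘a♞d’.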
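Subtracting the difference measured over the whole string only works when every multi-byte character falls inside the part being kept. The sketch below is a variation on Method 2 rather than the original code: it walks the string and adds up each character's UTF-8 width until the byte budget is used up, which yields the adjusted slicing length directly. The name chars_within_bytes is hypothetical.

```python
def chars_within_bytes(text, n_bytes):
    """Count how many leading characters of text fit in n_bytes UTF-8 bytes."""
    used = 0
    for count, ch in enumerate(text):
        width = len(ch.encode("utf-8"))  # 1 to 4 bytes per code point
        if used + width > n_bytes:
            return count                 # the next character would not fit
        used += width
    return len(text)                     # the whole string fits


text = "a\u265edicefkdl"
n_chars = chars_within_bytes(text, 5)
print(n_chars, text[:n_chars])
```

For the example string the adjusted length is 3, the same value the subtraction gives, and a string such as 'abcdefgh♞' keeps all 5 of its leading ASCII characters because the ♞ lies beyond the cut.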
Unicode
Python's documentation (the Unicode HOWTO) says:
[Unicode](https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code.
Python uses the Unicode standard to represent characters: a str is a sequence of Unicode code points, so len counts code points rather than bytes.
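The relationship between a character, its code point, and its UTF-8 bytes can be inspected with the built-ins ord, chr, and str.encode. A short sketch using ♞; the variable names are only for illustration.

```python
knight = "\u265e"               # the chess knight character

code_point = ord(knight)        # the character's Unicode code point as an int
print(f"U+{code_point:04X}")    # code point in the usual U+XXXX notation
print(chr(code_point) == knight)
print(knight.encode("utf-8"))   # the bytes that SAS counts
```

One code point explains the length of 1 in Python, and the three encoded bytes explain the length of 3 in SAS.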