Unicode is tough

So today I dared to open a bug against Python. In hindsight I should feel mortified, since it proved that I had misunderstood what a character is.

The point was, in Python 3.2:
foo⋅bar=42
#  File "stdin", line 1
#    foo⋅bar=42
#            ^
#SyntaxError: invalid character in identifier
### This is another bug that is not in the scope of the post
### http://bugs.python.org/issue2382
print(ord("foo⋅bar"[3]))
# 8901  (U+22C5 DOT OPERATOR)
foo·bar = 42
print(ord("foo·bar"[3]))
# 183  (U+00B7 MIDDLE DOT)

A dot is a punctuation mark, no? And variable names shouldn't contain punctuation.
Plus the two look the same, so shouldn't they be treated the same?

So I opened a bug, and I was very kindly pointed to the fact that the Unicode character "MIDDLE DOT" is indeed punctuation, but it also has the Unicode property Other_ID_Continue. As stated in Python's rules for identifiers, that makes it perfectly legitimate inside an identifier (anywhere after the first character).
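The difference between the two dots can be checked from the interpreter. This is only a minimal sketch, assuming Python 3 and nothing beyond the standard unicodedata module and str.isidentifier(); the helper name describe is made up for illustration. Note that unicodedata exposes the general category but not the Other_ID_Continue property itself, so isidentifier() is what reflects the identifier rules here.

```python
import unicodedata

MIDDLE_DOT = "\u00b7"    # the character Python accepted in foo·bar
DOT_OPERATOR = "\u22c5"  # the look-alike that raised SyntaxError


def describe(char):
    """Return (code point, Unicode name, general category) for one character.

    Hypothetical helper for this post, not part of any library.
    """
    return ord(char), unicodedata.name(char), unicodedata.category(char)


for char in (MIDDLE_DOT, DOT_OPERATOR):
    candidate = "foo" + char + "bar"
    # isidentifier() applies the same rules the parser uses for names.
    print(describe(char), candidate.isidentifier())
# (183, 'MIDDLE DOT', 'Po') True
# (8901, 'DOT OPERATOR', 'Sm') False
```

So the general category alone ("Po", other punctuation) would have misled me just as the glyph did: what matters for identifiers is the extra property.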

That is the point where you actively search for good documentation to understand what malfunctioned in your brain. Then a Perl coder pointed me to Perl Unicode Essentials by Tom Christiansen. Even though the first third is about Perl, it is the best presentation on Unicode I have read so far.


And then I understood my mistakes:
  • I (visually) confused a glyph with a character: the same glyph can be used for different characters;
  • Unicode is much more than simply extending the set of usable glyphs (that I knew, but I had not grasped that I knew so little).
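To make the first mistake concrete, here is a small sketch (assuming Python 3 and the standard unicodedata module; the variable names are mine) with three capital letters that most fonts draw with the same glyph, yet which are three distinct characters:

```python
import unicodedata

# Three characters that are usually rendered with an identical glyph.
lookalikes = ["\u0041", "\u0391", "\u0410"]

for char in lookalikes:
    # Same shape on screen, different code point and different name.
    print(char, hex(ord(char)), unicodedata.name(char))

# Comparison works on characters (code points), not on glyphs.
all_distinct = len(set(lookalikes)) == len(lookalikes)
print(all_distinct)
```

The names printed are LATIN CAPITAL LETTER A, GREEK CAPITAL LETTER ALPHA and CYRILLIC CAPITAL LETTER A, and the final line prints True: as far as Python is concerned, these are simply three unrelated characters.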

By the way, if you need a reason to switch to the current production version, 3.3.0,
remember that Py3.3 is still improving its Unicode support.

py3.2:
"ß".upper()
# ß  

which is the wrong result, while in py3.3:

"ß".upper()
# SS  

