I've been busting my head over this for days, trying to figure out why some links that include Greek (UTF-8) characters in the URL don't work correctly in my blog.
Eventually, I found it. Follow me, it’s an interesting and educational journey :-)
$ mkdir test
$ cd test
$ touch άέίόύή
$ ls
$ ls | hexdump
If you run the above commands in Linux, the last one will give you
0000000 acce adce afce 8ccf 8dcf aece 000a
000000d
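That’s expected, isn’t it? We created a filename with 6 Unicode characters, each of which takes two bytes in UTF-8, and that’s what hexdump shows. By default it prints 16-bit words, so on a little-endian machine the bytes appear swapped: “acce” is really the byte pair ce ac. The trailing 000a is the newline (LF) that ls appends, and 000000d is not data at all, just the total length in hex: 13 bytes.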
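If you want to double-check those bytes without hexdump, here is a small Python 3 sketch. It assumes the name was typed as precomposed characters (which is what the Linux dump suggests); the variable names are just for illustration.

```python
# The six precomposed Greek letters: ά έ ί ό ύ ή
name = "\u03ac\u03ad\u03af\u03cc\u03cd\u03ae"

# ls prints the name followed by a newline, so add one here too
encoded = name.encode("utf-8") + b"\n"

# Show the bytes in file order (not swapped into 16-bit words)
print(" ".join(f"{b:02x}" for b in encoded))

# Total length in hex, the same figure hexdump prints on its last line
print(hex(len(encoded)))
```

Each letter comes out as two bytes, followed by a single 0a for the newline.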
Now, go through the exact sequence in OS X. You will get...
0000000 ce b1 cc 81 ce b5 cc 81 ce b9 cc 81 ce bf cc 81
0000010 cf 85 cc 81 ce b7 cc 81 0a
0000019
I’ll tell you what. For some strange reason, HFS+ stores accented characters in filenames as character+accent. You typed “touch άέίόύή” and OS X decided to create a file named “α’ε’ι’ο’υ’η’”. Notice the repeated “cc 81” sequence in the hexdump: it’s the UTF-8 encoding of U+0301, the combining acute accent.
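You can reproduce that byte sequence in Python 3 by decomposing the name yourself. This is only a sketch: HFS+ uses its own variant of decomposition, and I'm assuming that standard NFD gives the same result for these six letters (it does match the dump above).

```python
import unicodedata

# The six precomposed Greek letters: ά έ ί ό ύ ή
name = "\u03ac\u03ad\u03af\u03cc\u03cd\u03ae"

# Split every accented letter into base letter + combining accent
decomposed = unicodedata.normalize("NFD", name)

# One line per code point: number, official name, UTF-8 bytes
for ch in decomposed:
    utf8 = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch):<30} {utf8}")
```

Six characters go in, twelve code points come out, and every second one is COMBINING ACUTE ACCENT with the bytes cc 81.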
And it turns out that this is not “some strange reason”.
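This is one of the two ways a system can represent an accented Unicode character: precomposed, as a single code point (NFC), or decomposed, as a base character followed by a combining accent (NFD). HFS+ uses a variant of the latter. (Read about NFC vs NFD in the Wikipedia entry for “Unicode equivalence”.)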
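The practical catch is that the two forms look identical on screen but do not compare equal as strings. A minimal Python 3 sketch of the problem and the usual workaround; same_text is a made-up helper name, not a library function.

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both to the same normal form.

    Hypothetical helper, for illustration only.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "\u03ac"          # ά as one code point
decomposed = "\u03b1\u0301"  # α followed by a combining acute accent

print(composed == decomposed)           # plain comparison
print(same_text(composed, decomposed))  # comparison after normalizing
```

The first comparison fails and the second succeeds, which is exactly the kind of mismatch that makes a lookup by name or URL miss.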
But wait. There’s hope.
If you are working in Python, consider unicodedata.normalize. I guess other languages will have something similar too.
The following code should give you an example of how and why you should use unicodedata.normalize:
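# -*- coding: utf-8 -*-
# Python 2 script: run it in an empty directory
import os
import unicodedata

orig = u'ά'
print '\nOriginal = ', repr(orig)

# Create an empty file whose name is the precomposed character
f = open(orig, 'w')
f.close()

print '\nPass #1'
# Names exactly as the filesystem hands them back
for f in os.listdir('.'):
    print f, repr(f.decode('utf8'))

print '\nPass #2'
# The same names, normalized to NFC
for f in os.listdir('.'):
    print f, repr(unicodedata.normalize('NFC', f.decode('utf8')))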
The result is something like this in Linux (no change between pass #1 and #2)
Original = u'\u03ac'

Pass #1
ά u'\u03ac'

Pass #2
ά u'\u03ac'
and something like this in OS X
Original = u'\u03ac'

Pass #1
ά u'\u03b1\u0301'

Pass #2
ά u'\u03ac'
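Back to the links that started all this: the same trick can be applied to a URL path before you compare or look it up. This is a rough Python 3 sketch, not the actual fix from my blog engine. It assumes paths arrive as percent-encoded UTF-8, and normalize_path is a hypothetical helper name.

```python
import unicodedata
from urllib.parse import quote, unquote

def normalize_path(path):
    """Return the path re-encoded with its text normalized to NFC.

    Hypothetical helper; assumes percent-encoded UTF-8 input.
    """
    text = unquote(path)                       # %CE%B1%CC%81 -> α + accent
    text = unicodedata.normalize("NFC", text)  # α + accent -> ά
    return quote(text)                         # back to percent-encoding

# The same link, once in decomposed form and once precomposed
from_os_x = normalize_path("/blog/%CE%B1%CC%81")
from_linux = normalize_path("/blog/%CE%AC")
print(from_os_x == from_linux)
```

If every path goes through something like this on the way in, it no longer matters which form the filesystem or the browser happened to produce.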