HFS+ and UTF-8 accented characters

I've been busting my head over this for days, trying to figure out why some links that include Greek (UTF-8) characters in the URL don't work correctly in my blog.

Eventually, I found it. Follow me, it’s an interesting and educational journey :-)

$ mkdir test
$ cd test
$ touch άέίόύή
$ ls
$ ls | hexdump

If you run the above commands in Linux, the last one will give you

0000000 acce adce afce 8ccf 8dcf aece 000a     
000000d

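That’s expected, isn’t it? We created a filename with 6 Unicode characters, each taking 2 bytes in UTF-8, and that’s what hexdump shows: 12 bytes plus a trailing 0a for the LF that ls prints. (On a little-endian machine hexdump’s default format prints 16-bit words with the bytes swapped, so “acce” is really the bytes ce ac. The final 000000d is not data, it’s the total byte count: 13.)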

Now, go through the exact same sequence on OS X. You will get...

0000000 ce b1 cc 81 ce b5 cc 81 ce b9 cc 81 ce bf cc 81
0000010 cf 85 cc 81 ce b7 cc 81 0a                     
0000019

WTF???

I’ll tell you what. For some strange reason, HFS+ stores accented characters in filenames as character+accent. You typed “touch άέίόύή” and OS X decided to create a file named “α´ε´ι´ο´υ´η´”. Notice the multiple occurrences of the “cc 81” sequence in the hexdump: that’s the UTF-8 encoding of U+0301 (combining acute accent), i.e. the accent. That is also why the name grew from 12 bytes to 24.
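To check the two dumps without needing both operating systems, here is a small sketch that reproduces the bytes from a single string. It is written for Python 3 (unlike the Python 2 script further down), and the helper name utf8_hex is just something made up for this illustration.

```python
import unicodedata


def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join("{:02x}".format(b) for b in text.encode("utf-8"))


# The six accented Greek letters from the touch command, written as
# escapes so that the editor cannot silently change their form.
name = "\u03ac\u03ad\u03af\u03cc\u03cd\u03ae"

composed = unicodedata.normalize("NFC", name)    # one code point per letter
decomposed = unicodedata.normalize("NFD", name)  # base letter + U+0301

print(utf8_hex(composed))    # same bytes as the Linux dump, in file order
print(utf8_hex(decomposed))  # same bytes as the OS X dump
```

The first line printed is the Linux output with the byte pairs un-swapped; the second one contains “cc 81” after every letter, exactly like the OS X output.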

And it turns out that this is not “some strange reason”.

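This is one of the two ways a system can represent an accented Unicode character: as a single precomposed code point (NFC), which is what we typed and what Linux stored as-is, or as a base letter followed by a combining accent (NFD), which is (with minor variations) what HFS+ stores. (Read about NFC vs NFD in the Wikipedia entry for “Unicode equivalence”.)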
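The catch is that the two forms look identical on screen but are different strings to a program. A minimal Python 3 sketch of that, using only the standard unicodedata module (the variable names are mine):

```python
import unicodedata

composed = "\u03ac"          # GREEK SMALL LETTER ALPHA WITH TONOS
decomposed = "\u03b1\u0301"  # GREEK SMALL LETTER ALPHA + COMBINING ACUTE ACCENT

# Same glyph, different code point sequences: a naive comparison fails.
naive_equal = composed == decomposed
print(naive_equal)                      # False
print(len(composed), len(decomposed))   # 1 2

# Normalizing both sides to the same form makes them comparable.
same_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_nfd = unicodedata.normalize("NFD", composed) == decomposed
print(same_nfc, same_nfd)               # True True
```

This is exactly the kind of mismatch that can bite when a URL carries one form and the filesystem hands back the other.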

Fuck.

But wait. There’s hope.

If you are working in Python, consider unicodedata.normalize. I guess other languages will have something similar too.
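For the broken-links case, the idea would be to normalize both the requested name and the names coming from the filesystem before comparing them. Here is a rough Python 3 sketch of that. The function names nfc and find_entry and the sample listing are hypothetical, made up for illustration; this is not what my blog engine actually runs.

```python
import unicodedata


def nfc(text):
    """Normalize text to the precomposed (NFC) form."""
    return unicodedata.normalize("NFC", text)


def find_entry(requested, entries):
    """Return the entry that equals requested after normalization, or None.

    The original (un-normalized) entry is returned, so it can be passed
    straight back to the filesystem.
    """
    wanted = nfc(requested)
    for entry in entries:
        if nfc(entry) == wanted:
            return entry
    return None


# Hypothetical directory listing, with one name in decomposed form,
# the way HFS+ would hand it back.
listing = ["\u03b1\u0301.html", "about.html"]

# The URL carries the precomposed form, yet the lookup still succeeds.
match = find_entry("\u03ac.html", listing)
print(match is not None)
```

Normalizing to NFC on both sides is a choice, not a requirement. NFD would work just as well, as long as both sides use the same form.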

The following code (Python 2) should give you an example of how and why you should use unicodedata.normalize:

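# -*- coding: utf-8 -*-
# Python 2: shows how a filename comes back from the filesystem.
import unicodedata
import os

orig = u'ά'
print '\nOriginal = ', repr(orig)

# Create an empty file whose name is the single precomposed character.
f = open(orig, 'w')
f.close()

# Pass #1: decode the names exactly as the filesystem returns them.
print '\nPass #1'
for f in os.listdir('.'):
        print f, repr(f.decode('utf8'))

# Pass #2: decode, then normalize to NFC.
print '\nPass #2'
for f in os.listdir('.'):
        print f, repr( unicodedata.normalize('NFC', f.decode('utf8')) )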

The result is something like this in Linux (no change between pass #1 and #2)

Original =  u'\u03ac'

Pass #1
ά u'\u03ac'

Pass #2
ά u'\u03ac'

and something like this in OS X

Original =  u'\u03ac'

Pass #1
ά u'\u03b1\u0301'

Pass #2
ά u'\u03ac'