Unison is a pretty awesome file synchronization utility. It's free, open source, highly customizable and scriptable. It does, however, have one big flaw: it doesn't support Unicode. As long as you synchronize between file systems that use the same encoding, this doesn't matter. Unfortunately, Windows, Linux and Mac OS X all use different file name encodings by default.
My setup synchronizes files between three OS X machines using a Windows server as the central node. File names containing non-ASCII characters like ÅÄÖ get messed up when transferred, e.g. the OS X file räksmörgås.txt will appear as raÌˆksmoÌˆrgaÌŠs.txt on the Windows machine.
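To see where that garbage comes from, here is a small Python sketch. It assumes the Windows side reads the incoming UTF-8 bytes as Windows-1252 (the usual Western "latin1" code page); that is my guess based on the garbled name above, not something I have checked in Unison itself.

```python
import unicodedata

# Written with escapes so the example does not depend on how this page is encoded.
name = "r\u00e4ksm\u00f6rg\u00e5s.txt"  # räksmörgås.txt

# OS X stores the name decomposed: base letter followed by a combining mark.
nfd_name = unicodedata.normalize("NFD", name)
utf8_bytes = nfd_name.encode("utf-8")

# Assumption: the receiving side interprets those bytes as Windows-1252.
garbled = utf8_bytes.decode("cp1252")
print(garbled)
```

Each combining mark is two bytes in UTF-8, and each of those bytes turns into its own character on the other side, which is why every accented letter becomes a plain letter followed by two junk characters.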
This is very annoying. I really like my synchronization setup, and this is the only problem I have with it. What to do? Windows uses latin1 encoding for file names, and OSX uses utf8. What if you could trick Windows into using utf8 as well? Linux supports utf8 file names, so maybe Cygwin can help. Nope, turns out Cygwin does not support Unicode... I googled "cygwin unicode" and found a hack for Cygwin which enables Unicode and utf8 support for file names. My hopes were rising as räksmörgås.txt seemed to appear correctly on the Windows side. Yes, I had done it! I ran Unison again to double check, and the file was now for some reason flagged as new on the Windows side. The whole operation then failed when Unison tried to copy the file back to the OSX side and discovered that the file was already there.
So, it turns out that there is such a thing as Unicode Normalization. Short story: The same character can be represented in different ways in Unicode, namely composed or decomposed. And, to make matters worse, OSX uses the decomposed form (NFD), and Windows/hacked Cygwin uses the composed form (NFC). So even though the file is called räksmörgås.txt on both machines, the exact byte representation of the name is different. If I had used a Unicode-aware program, this wouldn't have been a problem and the file names would have been recognized as identical. But as I said, Unison is NOT such a program...
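The difference is easy to show with Python's standard unicodedata module. This is only an illustration of the two forms, not anything taken from Unison:

```python
import unicodedata

composed = "r\u00e4ksm\u00f6rg\u00e5s.txt"               # one code point per letter
decomposed = unicodedata.normalize("NFD", composed)       # letter + combining mark

print(len(composed), len(decomposed))                     # 14 17
print(composed == decomposed)                             # False
print(composed.encode("utf-8") == decomposed.encode("utf-8"))  # False

# Normalizing both to the same form makes them equal again.
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
```

Both strings render as räksmörgås.txt on screen, but a program that compares them code point by code point (or byte by byte) sees two different names.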
I've done some research (i.e., googled), and there don't seem to be any plans to incorporate Unicode support in Unison. It turns out Unison is written in OCaml, which doesn't natively support Unicode, so adding support for this would, according to Unison's developers, be pretty hard.
But how hard can it really be? I just need to make sure that both file names are normalized before they are compared. And there are third-party libraries that enable Unicode support in OCaml. So I went off and downloaded the Unison source code, the OCaml binaries, and the Unicode library (Camomile). It was pretty easy to locate the piece of code where the normalization should, or at least could, be done. Only one problem remains: Camomile is very poorly documented, and comes with absolutely no example code! Right, two problems: OCaml is a functional language (like Haskell), and it turns out I hate functional languages!
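The idea itself, normalize both names and then compare, is tiny. Here is a rough sketch in Python rather than OCaml, since that is what I can actually read. The helper names (normalized, same_file_name, match_listings) are made up for this example and have nothing to do with Unison's or Camomile's real code:

```python
import unicodedata


def normalized(name, form="NFC"):
    """Return the name in one fixed normalization form (NFC chosen arbitrarily)."""
    return unicodedata.normalize(form, name)


def same_file_name(a, b):
    """Compare two file names while ignoring composed/decomposed differences."""
    return normalized(a) == normalized(b)


def match_listings(local_names, remote_names):
    """Pair up names from two directory listings that are equal after normalization."""
    remote_by_key = {normalized(n): n for n in remote_names}
    return [
        (n, remote_by_key[normalized(n)])
        for n in local_names
        if normalized(n) in remote_by_key
    ]


osx_name = unicodedata.normalize("NFD", "r\u00e4ksm\u00f6rg\u00e5s.txt")
windows_name = "r\u00e4ksm\u00f6rg\u00e5s.txt"

print(osx_name == windows_name)                    # plain comparison
print(same_file_name(osx_name, windows_name))      # normalized comparison
print(len(match_listings([osx_name, "notes.txt"], [windows_name])))
```

Which form you normalize to doesn't matter for the comparison, as long as both sides use the same one.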
To be continued (hopefully)...
UPDATE: Problem kind of solved!