Sunday, April 12, 2009

Unison Unicode problems

Unison is a pretty awesome file synchronizing utility. It's free, open source, highly customizable and scriptable. It does, however, have one big flaw: it doesn't support Unicode. As long as you synchronize between file systems that use the same encoding, that doesn't matter. Unfortunately, Windows, Linux and MacOSX all use different encodings by default.

My setup synchronizes files between three different OSX machines using a Windows server as the central node. File names containing non-ASCII characters like ÅÄÖ get messed up when transferred, e.g. the OSX file räksmörgås.txt will appear as räksmörgaÌŠs.txt on the Windows machine.
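The garbled name is consistent with UTF-8 bytes being read back through a single-byte Windows code page. Here is a minimal Python sketch of that mechanism. It assumes the OSX name arrives in decomposed form and that the Windows side decodes it as Windows-1252; both are assumptions made for illustration, not something Unison reports.

```python
import unicodedata

# The file name from the example, written with escapes so the source
# file's own encoding cannot change the result.
name = "r\u00e4ksm\u00f6rg\u00e5s.txt"

# Assumption: OSX hands the name over in decomposed form (base letter
# followed by a combining mark), encoded as UTF-8.
raw = unicodedata.normalize("NFD", name).encode("utf-8")

# Assumption: the receiving side interprets those bytes as Windows-1252.
garbled = raw.decode("cp1252")

print(garbled)
```

Under these assumptions the last letter comes out as aÌŠ, the same garbling as in the example above.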

This is very annoying. I really like my synchronization setup, and this is the only problem I have with it. What to do? Windows uses latin1 encoding for file names, and OSX uses utf8. What if you could trick Windows into using utf8 as well? Linux supports utf8 file names, so maybe Cygwin can help. Nope, turns out Cygwin does not support Unicode... I googled "cygwin unicode" and found a hack for Cygwin which enables Unicode and utf8 support for file names. My hopes rose as räksmörgås.txt seemed to appear correctly on the Windows side. Yes, I had done it! I ran Unison again to double check, and the file was now for some reason flagged as new on the Windows side. The whole operation failed when Unison tried to copy the file back to the OSX side and discovered that the file was already there.

So, it turns out that there is such a thing as Unicode Normalization. Short story: the same character can be represented in different ways in Unicode, namely composed or decomposed. And, to make matters worse, OSX uses the decomposed form (NFD), and Windows/hacked Cygwin uses the composed form (NFC). So even though the file is called räksmörgås.txt on both machines, the exact byte representation of the name is different. If I had used a Unicode-aware program, this wouldn't have been a problem and the file names would have been recognized as identical. But as I said, Unison is NOT such a program...
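To make the difference concrete, here is a small Python sketch using the standard unicodedata module. It only illustrates the two normalization forms; it says nothing about how OSX, Windows or Cygwin produce them internally.

```python
import unicodedata

# Escapes keep the example independent of how this file itself is saved.
name = "r\u00e4ksm\u00f6rg\u00e5s.txt"

composed = unicodedata.normalize("NFC", name)    # one code point per letter
decomposed = unicodedata.normalize("NFD", name)  # base letter + combining mark

# Same text to a human, different code points and different bytes.
print(len(composed), len(decomposed))                          # 14 17
print(composed == decomposed)                                  # False
print(composed.encode("utf-8") == decomposed.encode("utf-8"))  # False
```

The three accented letters each become two code points in the decomposed form, which is why a byte-for-byte comparison treats the two names as different files.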

I've done some research (i.e., googled), and there don't seem to be any plans to incorporate Unicode support in Unison. It turns out Unison is written in OCaml, which doesn't natively support Unicode, so according to Unison's developers, adding support for it would be pretty hard.

But how hard can it really be? I just need to make sure that both file names are normalized before they are compared. And there are third-party libraries that enable Unicode support in OCaml. So I went off and downloaded the Unison source code, the OCaml binaries, and the Unicode library (Camomile). It was pretty easy to locate the piece of code where the normalization should, or at least could, be done. Only one problem remains: Camomile is very poorly documented, and comes with absolutely no example code! Right, two problems: OCaml is a functional language (like Haskell), and it turns out I hate functional languages!
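The idea itself, normalizing both names before comparing them, is small. Here is a sketch in Python rather than OCaml, purely to illustrate the logic. The helper names canonical and same_file_name are hypothetical, and this is neither Unison nor Camomile code.

```python
import unicodedata


def canonical(name: str) -> str:
    """Return a comparison key: the name in composed form (NFC)."""
    return unicodedata.normalize("NFC", name)


def same_file_name(a: str, b: str) -> bool:
    """Compare two file names after bringing both to the same form."""
    return canonical(a) == canonical(b)


# Simulated names as each side might report them (an assumption).
osx_name = unicodedata.normalize("NFD", "r\u00e4ksm\u00f6rg\u00e5s.txt")
windows_name = unicodedata.normalize("NFC", "r\u00e4ksm\u00f6rg\u00e5s.txt")

print(osx_name == windows_name)                # False
print(same_file_name(osx_name, windows_name))  # True
```

Which form is used as the key does not matter much, as long as both sides are converted to the same one before the comparison.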

To be continued (hopefully)...

UPDATE: Problem kind of solved!
