Saturday, April 18, 2009

I can haz success! Unison hack to enable Unicode normalization of filenames

NOTE: The latest development version of Unison now has built-in Unicode support. Check this post for how to compile and use it!

DISCLAIMER: This is a very ugly hack! It's been tested to work in MY setup, but might not work in yours. I really don't know OCaml, or makefiles for that matter. You have been warned!

After much agony I've finally managed to build a hacked version of Unison to make my file sync setup work. The problem, as explained earlier, is that Unison doesn't support Unicode, and that I have to synchronize files between Mac OSX machines (using UTF8 NFD-normalized filenames) and Windows machines (using latin1 or UTF8 NFKC-normalized filenames). For filenames containing non-ASCII characters to transfer correctly, some kind of conversion has to be made, and as of now Unison does not support this.

In my file sync setup, I have three OSX machines synchronizing files using a Windows server as the central node (all OSX machines sync with the Windows machine). Synchronization is always initiated from one of the OSX-machines. What I have done is to install Cygwin on the Windows machine, and also install a hack for Cygwin which enables UTF8 support.

When I first did this I thought it would be enough, but since Windows/Cygwin and OSX use different Unicode normalization forms (NFKC and NFD), the bit-by-bit representation of the filenames is different. This is what I set out to fix. I have inserted a few lines of code in the function that preprocesses filenames before comparison is done in Unison. Those lines use the Camomile Unicode library to normalize the filename to NFKC, so when the OSX and Windows filenames are compared a little bit later they will be bit-wise identical.
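To illustrate the idea, here is a minimal sketch in Python (not the OCaml/Camomile code used in the actual hack). The function name `normalize_name` and the sample filenames are made up for this example; it only shows why the raw bytes differ and why normalizing both names before comparing makes them match.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Normalize a filename to NFKC before it is compared."""
    return unicodedata.normalize("NFKC", name)


# Decomposed form (NFD): "e" followed by a combining acute accent.
osx_name = "cafe\u0301.txt"
# Precomposed form: a single "é" code point.
win_name = "caf\u00e9.txt"

# The UTF-8 bytes differ, so a byte-wise comparison sees two different names.
print(osx_name.encode("utf-8") == win_name.encode("utf-8"))  # False

# After normalizing both sides, the names are identical.
print(normalize_name(osx_name) == normalize_name(win_name))  # True
```

Both names look the same on screen, which is exactly why the mismatch is so confusing when it shows up during a sync.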

This is DEFINITELY not the best way to do this, and it is far from fixing all of Unison's encoding problems. What one should do is to rewrite all of the filename handling to support Unicode and also other encodings. But I don't know OCaml very well, in fact I find it quite confusing and frustrating, so for the moment this will have to do for me.

And it seems this is enough to fix my problems. The hack only needs to be applied to the OSX-side of Unison to work, even though it would probably be better if it was applied to both sides (but I'm WAY too lazy to try to compile Unison in Cygwin if it seems I don't have to :P).

So, if anyone needs to sync an OSX machine with a Windows machine, or perhaps with a Linux machine with a UTF8 filesystem, this could perhaps be of some help to you. (Note that while OSX and Windows/Cygwin enforce NFD and NFKC respectively, Linux does NOT. So in Linux it would be possible to have two different files with seemingly identical names, but with different normalization. This would obviously not work well with this hack, but that would probably be a less than ideal situation anyway.)
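If you want to check a Linux directory for this situation before syncing, something like the following Python sketch could help. It is not part of the hack; the function name `find_normalization_collisions` is invented here, and it simply groups names that become identical after NFKC normalization.

```python
import unicodedata
from collections import defaultdict


def find_normalization_collisions(names):
    """Return groups of names that are identical after NFKC normalization."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFKC", name)].append(name)
    # Only groups with more than one member are real collisions.
    return [group for group in groups.values() if len(group) > 1]


# Two spellings of the same visible name, plus an unrelated file.
sample = ["cafe\u0301.txt", "caf\u00e9.txt", "notes.txt"]
collisions = find_normalization_collisions(sample)
print(len(collisions))  # 1
```

For a real directory you could pass in `os.listdir(path)` instead of the sample list.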

Quick install:

This is the quick install for people who don't want to compile stuff.
  1. Download my precompiled (OSX Leopard) Unison binary here: unison-unicode.zip (600KB, based on Unison 2.27). You only need the modified binary on the OSX side (as long as synchronization is initiated from that side), but all other machines must use the same version of Unison (2.27).

  2. Download the Camomile data files (5MB). These files must be extracted into /usr/local/share/camomile on your OSX machine (hardcoded, sorry!).

Build yourself:


These are instructions for how to build the modified Unison version yourself (for OSX, but might work on other architectures as well):

  1. Download and install OCaml.
  2. Download and install/build Camomile (follow instructions and use the default installation directory).
  3. Check out a version of Unison with Subversion (I'm using /branches/2.27, but I think it will work with the latest beta version as well).
  4. Replace the files src/case.ml and src/Makefile.OCaml with these files.
  5. Compile using "make UISTYLE=text".
  6. The new Unison binary will be at src/unison. I would recommend you rename it to unison-unicode or something to tell it apart from your regular Unison version.

Your modified binary (from either the quick or full install) will enable you to synchronize files with Unicode filenames between an OSX machine and another machine with a UTF8 filesystem (for example Linux). If you want to sync with Windows you need to install Cygwin (make sure to select the unison package during installation) and the Cygwin UTF8 hack as well. Make sure it's the Cygwin Unison binary that is being used during synchronization by passing the parameter "-servercmd /usr/bin/unison".

Note that this version of Unison requires that the two file systems being synchronized are UTF8. If it encounters a filename that is not valid UTF8 it will probably crash!
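To find problem files before running a sync, you could scan the directory tree for names that are not valid UTF8. The Python sketch below is only a suggestion and not part of the hack; `is_valid_utf8` and `find_invalid_names` are names made up for this example. It walks the tree using byte paths so that undecodable names are seen exactly as they are stored.

```python
import os


def is_valid_utf8(raw_name: bytes) -> bool:
    """Return True if the raw filename bytes decode cleanly as UTF-8."""
    try:
        raw_name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def find_invalid_names(root: bytes):
    """Yield byte paths under root whose last component is not valid UTF-8."""
    # Passing a bytes root makes os.walk return raw bytes names.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                yield os.path.join(dirpath, name)
```

Calling `list(find_invalid_names(b"/path/to/sync/root"))` (the path is a placeholder) gives you the files to rename before syncing. A latin1-encoded name such as `b"caf\xe9.txt"` would be reported, while its UTF8 equivalent would not.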

If anyone actually tries this, please post your comments below! Thanks ;)

10 comments:

  1. Wow~! Great Tips for multiple languages.
    I've done the Mac Server and will test with PC/windows. thanks very much.

  2. BurningSnowman, 20 August, 2009 21:56

    Glad someone has made some effort to at least work round this, thanks for posting it. I'm having trouble with it though. :'(

    I don't really know what I'm doing, at all, but thought I had followed the quick install okay. I'm hoping to use a Windows client with OS X server. But when just running the unicode Unison from a bash shell to check it's working I get "Bus error". Any ideas why? I'm using Tiger, and I've just put the binary in /usr/bin and the Camomile files into /usr/local/share/camomile as above.

  3. The binary might only be working on Leopard, I've never tried it on Tiger. Have you tried downloading the source and compiling yourself? I guess that could be worth a shot.

    Also I should mention that now there is more serious work being done to make Unison Unicode-compatible. I haven't been monitoring the mailing lists very closely lately, but I think we can expect at least some beta or development version pretty soon. Check out the unison-hackers (http://www.cis.upenn.edu/~bcpierce/unison/lists.html) list if you want to find out more.

  4. BurningSnowman, 25 August, 2009 23:44

    Okay, thanks for the reply. I didn't try compiling it, but I've now reworked things to use rsync, since 'proper' bi-directionality wasn't essential for my setup. In light of your comments, I'll definitely keep glancing back at Unison in hope of native support though. :)

  5. Hi people,

    Like LAV reports, Unicode support is now completely implemented in the development version of Unison. If you want to experiment with it, you need to download the latest source and compile unison yourself.

    FYI: for me this development version works fine.

  6. Hi everybody,

    an easy alternative for UTF8 support is to use samba for the translation. I just synchronize to a local samba folder which is mounted with iocharset=iso8859-15 and everything works perfectly.
    I still hope for a UTF8 capable stable version soon.

    Greetings
    Alex

  7. Alex is right! But that method will only work well on either small amounts of data or on a local network. The problem is that the whole contents of the files have to be transferred over the network to be compared, while if you sync over SSH the files can be hashed locally on each machine and unison can then just compare the hashes (please correct me if I'm wrong!).


  8. Koen wrote:
    "Hi people,
    Like LAV reports, Unicode support is now completely implemented in the development version of Unison. If you want to experiment with it, you need to download the latest source and compile unison yourself.
    FYI: for me this development version works fine."

    Did you download one of the unison-2.32.xx source tarballs or check out the latest source from the svn repository? I'm confused because there's nothing mentioned about Unicode support in the 2.32 change log.
