Unihandecode

unihandecode original tree


Welcome to the Unihandecode project.

Unihandecode is a fork of unidecode that transliterates Unicode text according to its readings in each native language, in a Python environment. The documentation of the original Unidecode (http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm) says:

"It often happens that you have non-Roman text data in Unicode, but you can't display it -- usually because you're trying to show it to a user via an application that doesn't support Unicode, or because the fonts you need aren't accessible. You could represent the Unicode characters as "???????" or "\15BA\15A0\1610...", but that's nearly useless to the user who actually wants to read what the text says."

What unihandecode provides is a decode(...) function that takes Unicode data and tries to represent it in US-ASCII characters. There is a simple but big problem with Chinese, Japanese and Korean characters. For unfortunate historical reasons, CJK characters in Unicode share the same code points even when the characters are only similar (and differ in shape, pronunciation and meaning). This is why I want to add a feature to unidecode that recognizes the user's preferred language and transliterates text based on its readings in that language.
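To make the problem concrete, here is a minimal sketch of language-dependent transliteration. It is a toy, not Unihandecode's implementation or API: the function name toy_decode, the READINGS table and its two entries are all hypothetical and written by hand for illustration. It shows only that the same code point needs a different ASCII reading depending on the chosen language.

```python
# Toy illustration only: toy_decode and READINGS are hypothetical names,
# and the table below is hand-written, not Unihandecode's data.

# Per-language readings for two shared CJK code points:
# U+660E (bright) and U+5929 (sky/heaven).
READINGS = {
    "zh": {"\u660e": "Ming", "\u5929": "Tian"},     # Mandarin (pinyin, no tones)
    "ja": {"\u660e": "Mei", "\u5929": "Ten"},       # Japanese on'yomi
    "ko": {"\u660e": "Myeong", "\u5929": "Cheon"},  # Korean
}


def toy_decode(text, lang="zh"):
    """Return an ASCII rendering of text using the table for lang.

    ASCII characters pass through unchanged; characters missing from
    the table become "?".
    """
    table = READINGS[lang]
    pieces = []
    for ch in text:
        if ord(ch) < 128:
            pieces.append(ch)
        else:
            pieces.append(table.get(ch, "?"))
    # Separate readings with a single space so they stay legible.
    return " ".join(pieces)


if __name__ == "__main__":
    sample = "\u660e\u5929"
    for lang in ("zh", "ja", "ko"):
        print(lang, toy_decode(sample, lang))
```

A per-character table like this cannot capture readings that depend on the surrounding word, which is common in Japanese, so it should be read as a picture of the problem rather than a solution.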

Sean M. Burke, the original author of unidecode, said:

"Unidecode, in other words, is quick and dirty. Sometimes the output is not so dirty at all... But sometimes the output is very dirty: Unidecode does quite badly on Japanese and Thai."

I am Japanese and am dissatisfied with the output of unidecode because of the limitations Sean describes. Unihandecode builds on the unidecode code base and provides good results even for Japanese, Korean, Thai and more.

There are only Python bindings for now. It is based on the Python port of unidecode (http://pypi.python.org/pypi/Unidecode).

The first target application is 'calibre' (http://calibre-ebook.com), which uses unidecode to generate filenames from an ebook's title and author.

Release 0.2x is licensed under the GPLv3/Perl license. Starting with release 0.3, it is licensed under GPLv3 because it includes logic from KAKASI (GPLv2 or later).