
Languages, Unicode and Charset

If your application needs to support multiple languages, or languages with different character sets such as simplified Chinese (GB2312, GBK, GB18030, HZ, ...) or traditional Chinese (BIG5, HKSCS, EUC-TW), you’ll need to familiarize yourself with Unicode and the different character sets.

In this article, we’ll focus on introducing character sets, manipulating and converting charsets, and the challenges you may encounter while handling Unicode text files.

If you plan to support multiple languages, you’ll also have to internationalize your application, for example by using PO files for the different languages and a PO file editor, and possibly have the translations done in Launchpad if your project is open source. But that is another subject.

Go for Unicode

If you are building a new application, make sure its structure is based on Unicode (UTF-8, UCS-2, UTF-16 or UTF-32), since these encodings can handle most written languages (UTF stands for Unicode Transformation Format, UCS for Universal Character Set) and are widely used. UTF-8 is the most widely used, since it is compatible with ASCII, followed by UTF-16. There is rarely a valid reason to use another charset unless you need to be backward compatible or your system must import files that may come in different charsets (e.g. subtitles).

UTF-8 vs. UTF-16

You will most likely use UTF-8; to help you make your choice, here are some useful characteristics of UTF-8 and UTF-16.

UTF-8:

  • ASCII compatible
  • Variable character byte length (1 to 4 bytes)
  • Compatible with string.h functions
  • Larger size for Chinese and Japanese text
  • Examples: HTML and XML documents, this blog :), most Unicode-capable applications

UTF-16 / UCS-2:

  • Binary format (not ASCII compatible)
  • UTF-16: each character is 2 or 4 bytes long; UCS-2: each character is 2 bytes long
  • Incompatible with string.h functions
  • 2 bytes used for most Chinese/Japanese ideograms
  • Examples: Qualcomm BREW, .NET Framework, Cocoa, Qt, Symbian OS, Joliet file system, Java

The main advantages of UTF-8 are that it is compatible with ASCII and that you can use the string functions (strncpy, strncmp, ...) safely, since there won’t be any null byte (0x00) in the stream. The main advantages of UCS-2 are that you can easily find the number of characters (e.g. required by the Microwindows API) and that it may use less space than UTF-8 for East Asian languages. However, you won’t be able to use the functions in string.h, since 0x00 bytes may appear in a UTF-16 buffer.

Converting character sets with iconv

You may need to convert charsets to UTF-8 or UTF-16 in your application. The ideal tool for that is iconv or its library, libiconv. iconv is usually present in all Linux distributions; if you don’t have it, you can download iconv and build it. You can also build iconv for your target board to add charset (a.k.a. encoding) conversion to your program.

You can list all encodings supported by iconv with the following command:

iconv -l

Converting a file from one encoding to another is straightforward. For example, here’s the command line to convert a file encoded in GB2312 (gb2312.txt) to UTF-8 (utf-8.txt):

iconv -f GB2312 -t UTF-8 gb2312.txt > utf-8.txt

For some encodings, you may also have to take the endianness into account: for UTF-16, for example, use UTF-16BE (big endian) or UTF-16LE (little endian).

If you need to convert encodings in your source code, you may use the following function:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <iconv.h>

/* Worst case: one input byte may expand to a 4-byte code unit (e.g. UTF-32) */
#define MAX_CHAR_LENGTH 4

int ConvertEncodings(const char* destcode, const char* sourcecode,
                     char* start, const char* end,
                     char** resultp, size_t* lengthp) {
    char *source_buf;
    char *pointer = NULL;
    size_t inbytes, outbytes;
    size_t ret;
    iconv_t cd;

    *resultp = NULL;
    if (!sourcecode || !destcode || (end - start) <= 0) {
        /* sanity check */
        fprintf(stderr, "Convert string: sanity check failed\n");
        return -1;
    }
    cd = iconv_open(destcode, sourcecode);
    /* Did we get a conversion descriptor? */
    if (cd == (iconv_t)(-1)) {
        fprintf(stderr, "Convert string: iconv_open error=<%s> from=<%s> to=<%s>\n",
                strerror(errno), sourcecode, destcode);
        return -1;
    }

    source_buf = start;
    inbytes = end - start;
    /* Allocate a maximum-sized buffer, assuming the target encoding may use
       4 bytes per character */
    outbytes = inbytes * MAX_CHAR_LENGTH;
    *resultp = malloc(outbytes + 1);
    if (*resultp == NULL) {
        fprintf(stderr, "Failed to malloc %zu bytes for encoding conversion\n",
                outbytes + 1);
        iconv_close(cd);
        return -1;
    }
    memset(*resultp, 0x00, outbytes + 1);
    pointer = *resultp;

    /* iconv advances both pointers and decrements both byte counters */
    ret = iconv(cd, &source_buf, &inbytes, &pointer, &outbytes);

    iconv_close(cd);
    if (ret == (size_t)-1) {
        fprintf(stderr, "Convert string: iconv from=<%s> to=<%s> error\n",
                sourcecode, destcode);
        free(*resultp);
        *resultp = NULL;
        return -1;
    }
    /* Number of bytes actually written to the output buffer */
    *lengthp = (size_t)(pointer - *resultp);
    return 0;
}

where destcode and sourcecode are iconv encoding names (such as GB2312 or UTF-8), start and end are respectively the start and end of the source buffer, resultp is a pointer to the converted string (allocated by the function, to be freed by the caller) and lengthp is a pointer to the length in bytes of the converted string.

Difficulties in handling Unicode streams

If you use UTF-8, you should not face too many problems, as handling it is virtually the same as handling ASCII. However, you should be aware that strlen will return the number of bytes, not the number of characters, if your string contains non-ASCII characters.

If you use UTF-16, UCS-2 or UTF-32, you’ll have to consider the following:

  • You now manipulate data buffers instead of strings, so any code that uses string functions (strcpy, strcmp, strstr, ...) will have to be rewritten with memcpy, memcmp, etc.
  • While parsing or creating Unicode files, you’ll have to handle files with and without a BOM (Byte Order Mark). The BOM, which is optional, is used to determine whether the file is little or big endian. For UTF-16, the BOM is the character U+FEFF stored in the first 2 bytes of the file: the byte sequence 0xFF 0xFE indicates little endian, and 0xFE 0xFF indicates big endian.
    If you do not take this into account while parsing a file, you may get gibberish (mojibake) when you display the string, and if you do not write the BOM before saving a file, it may not be properly recognized by some programs.

Detecting character set / encodings with Enca

In some cases, you may need to automatically detect the encoding used in a file. Programs like Firefox have such capabilities, but the method used may not be appropriate for embedded systems, and/or it may be difficult to extract the code from Firefox to reuse in your application.

So if you need to detect encodings, you may use Enca (Extremely Naive Charset Analyser), currently at version 1.13, or its library, libenca.

This is a library written in C that will help detect encodings for the following languages:

  • Belarusian
  • Bulgarian
  • Chinese
  • Croatian
  • Czech
  • Estonian
  • Hungarian
  • Latvian
  • Lithuanian
  • Polish
  • Russian
  • Slovak
  • Slovene
  • Ukrainian
  • Some multibyte encodings independently of the language (e.g. Unicode)

Have fun!
