Previous Topic

Book Contents

Book Index

Next Topic

Unicode

Unlike single-byte character sets, Unicode can intermix different character sets (e.g. German with Arabic).

Unicode can be implemented by different character encodings, e.g. UTF-8, UTF-16 or UCS-2. Depending on the encoding, one character may take from 1 up to 4 bytes.

Example:

  • Windows 1252: 1 byte per character

    Help Image

  • UTF-8: 1 to 4 bytes per character

    Help Image

  • UTF-16: 2 or 4 bytes per character

    Help Image

  • UCS-2: 2 bytes per character

    Help Image

    UCS-2 encoding in this example is the same as UTF-16 but as it is limited to 2 bytes only, it can not display all characters of UTF-16. (UCS-2 is limited to the 'Basic multilingual plane' - the main Unicode code area.)

When Unicode is beneficial or necessary:

  • Use of multiple languages from different character sets
  • Use of Asian languages

The conversion of database columns to Unicode data types may cause:

  • Database size increase by 1/3 to 1/2 of the original size.
  • The DB engine and JDBC driver may have some overhead with Unicode data types (e.g. searching and sorting, transfer of data between the database and application)
  • Problems with possible exceeding the limit on the total length of columns used in index (on MS SQL)

See Also

Possible Types of Textual Data

Single-byte Character Sets