Unicode Compliance for Text and Database

Unicode Compliance for Text

  • Note on converting Unicode properties files to ASCII: after converting the properties files with native2ascii, open each output file and check its beginning. If any extra characters have been added there, remove them.
  • Note: when verifying that Unicode text is displayed correctly in the browser, select View -> Encoding -> Unicode (UTF-8).

Unicode Compliance for Database

  • All tables need ENGINE=InnoDB, CHARACTER SET utf8, and a suitable collation (utf8_general_ci or utf8_unicode_ci). Some Mifos tables have no character set defined or are defined as latin1, and some tables define ENGINE=MyISAM.

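The SQL to convert a table to the Unicode character set is:

ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci, ENGINE=InnoDB;

or

ALTER TABLE tbl_name CHARSET=utf8, ENGINE=InnoDB;

The first form also converts the existing character columns of the table. The second form only changes the table's default character set, which applies to columns added later.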
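To find which tables still need converting, the information_schema catalog can be queried. The following is a minimal sketch, assuming the schema is named 'Mifos'; the schema name and the column alias conversion_sql are illustrative and should be adapted to the actual installation.

```sql
-- Assumption: the schema is named 'Mifos'; substitute your own schema name.
-- Note: newer MySQL releases may report these collations with a utf8mb3_ prefix.

-- List base tables whose engine or default collation still needs changing.
SELECT TABLE_NAME, ENGINE, TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'Mifos'
  AND TABLE_TYPE = 'BASE TABLE'
  AND (ENGINE <> 'InnoDB'
       OR TABLE_COLLATION NOT IN ('utf8_general_ci', 'utf8_unicode_ci'));

-- Generate one conversion statement for each of those tables.
SELECT CONCAT('ALTER TABLE `', TABLE_NAME,
              '` CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci, ENGINE=InnoDB;')
       AS conversion_sql
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'Mifos'
  AND TABLE_TYPE = 'BASE TABLE'
  AND (ENGINE <> 'InnoDB'
       OR TABLE_COLLATION NOT IN ('utf8_general_ci', 'utf8_unicode_ci'));
```

The generated statements can be reviewed and then run one at a time. Back up the database first, since converting a table rewrites its data.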

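  • The SQL to change the database's default character set to Unicode is shown below. It affects only tables created afterwards; existing tables keep their current character set until they are converted individually.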

ALTER DATABASE Mifos charset=utf8;
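When only the character set is given, MySQL uses that character set's default collation, which for utf8 is utf8_general_ci. If utf8_unicode_ci is wanted as the database default, it can be named explicitly. A small sketch, again assuming the database is called Mifos:

```sql
-- Set both the default character set and the default collation explicitly.
ALTER DATABASE Mifos CHARACTER SET utf8 COLLATE utf8_unicode_ci;

-- Check the defaults that newly created tables will inherit.
SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME = 'Mifos';
```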

  • The main differences between utf8_unicode_ci and utf8_general_ci are:
    1. utf8_unicode_ci supports so-called expansions and ligatures, for example:

German letter ß (U+00DF LATIN SMALL LETTER SHARP S) is sorted near "ss".
Letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".

utf8_general_ci does not support expansions or ligatures. It sorts all these letters as single characters, and sometimes in the wrong order.

    2. utf8_unicode_ci is generally more accurate for all scripts.

For example, in the Cyrillic block, utf8_unicode_ci works well for all of these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. utf8_general_ci works well only for the Russian and Bulgarian subset of Cyrillic; the extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are not sorted well.

The disadvantage of utf8_unicode_ci is that it is a little slower than utf8_general_ci.
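The difference between the two collations can be observed directly on a test server. The sketch below assumes a client that sends UTF-8 and a server that still accepts the utf8 character set name; the table name collation_demo and the sample words are made up for illustration.

```sql
-- Assumption: the client sends UTF-8 encoded text.
SET NAMES utf8;

-- Expansion: utf8_unicode_ci is documented to compare 'ß' as equal to 'ss',
-- so this is expected to return 1.
SELECT 'ß' = 'ss' COLLATE utf8_unicode_ci AS unicode_ci_equal;

-- utf8_general_ci compares character by character (it treats 'ß' like 's'),
-- so this is expected to return 0.
SELECT 'ß' = 'ss' COLLATE utf8_general_ci AS general_ci_equal;

-- Hypothetical scratch table for comparing sort order.
CREATE TEMPORARY TABLE collation_demo (
  word VARCHAR(20)
) CHARACTER SET utf8;

INSERT INTO collation_demo (word)
VALUES ('sr'), ('ß'), ('st'), ('OD'), ('Œ'), ('OF');

-- Run both queries and compare where 'ß' and 'Œ' land in each result.
SELECT word FROM collation_demo ORDER BY word COLLATE utf8_unicode_ci;
SELECT word FROM collation_demo ORDER BY word COLLATE utf8_general_ci;

DROP TEMPORARY TABLE collation_demo;
```

Because COLLATE can be applied per expression, a table stored with one collation can still be sorted with the other in a specific query when the ordering matters more than speed.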