Understanding Sinhala/Tamil UNICODE

Post by **Neo** » Sat Apr 10, 2010 3:21 am

There have been many questions about using Sinhala/Tamil UNICODE during the past few days. Two of them were raised by members of our Promo Team, so I thought it would be better to write a short article about Sinhala/Tamil UNICODE.

Unicode: A character encoding standard developed by the Unicode Consortium. By using more than one byte to represent each character, Unicode enables almost all of the written languages in the world to be represented by using a single character set.

UNICODE files/strings
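UNICODE files are files that contain UNICODE characters. A file saved by Notepad as "Unicode" is encoded as UTF-16 little-endian and starts with a special two-byte signature (FF FE), the Unicode Byte Order Mark (BOM). The BOM tells the reader which byte order the file uses. A string held in memory does not need these two bytes, but a UNICODE string written to a file or stream normally starts with them.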

Consider the following example. (If one of our Tamil brothers can change this example to Tamil in a following post, it will be really useful).

We write the word සමනලයා in Sinhala UNICODE.
Now copy and paste this into Notepad and save it as a UNICODE file.
Then open this file in a HEX editor.
You will see these bytes.

FF FE C3 0D B8 0D B1 0D BD 0D BA 0D CF 0D

Note that the file starts with FF FE. That is the BOM in UNICODE that I have explained above.
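
The same byte sequence can be reproduced without Notepad or a HEX editor. The following is a minimal Python sketch, assuming that Notepad's "Unicode" option means UTF-16 little-endian with a BOM. The variable names are my own, and the word is written as code point escapes so the snippet does not depend on how your editor saves it.

```python
import codecs

# The six code points of the example word, written as escapes.
word = "\u0dc3\u0db8\u0db1\u0dbd\u0dba\u0dcf"

# BOM (FF FE) followed by the UTF-16 little-endian bytes of the word.
data = codecs.BOM_UTF16_LE + word.encode("utf-16-le")

# Format as space-separated upper-case hex, the way a HEX editor shows it.
hex_dump = " ".join(f"{b:02X}" for b in data)
print(hex_dump)  # FF FE C3 0D B8 0D B1 0D BD 0D BA 0D CF 0D
```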

Then you will see the rest of the bytes as C3 0D B8 0D B1 0D BD 0D BA 0D CF 0D.
When a 16-bit value (two bytes) is written from memory to disk on a little-endian machine such as an ordinary PC, the low byte is stored first. This byte order (endianness) is why the two bytes of each character look swapped in the saved file.

Since the UNICODE characters in this file are stored as 16-bit values, this affects UNICODE strings and files as well. So when the above stream is read into memory, you will see it as below.

0D C3      0D B8      0D B1      0D BD      0D BA      0D CF
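
The byte swap can be checked by hand or with a few lines of code. This is only a sketch in Python (the names `raw`, `payload` and `as_hex` are mine): it skips the two BOM bytes and then combines each pair of bytes with the low byte first.

```python
# The bytes exactly as they appear in the HEX editor.
raw = bytes.fromhex("FF FE C3 0D B8 0D B1 0D BD 0D BA 0D CF 0D")

# Skip the 2-byte BOM; the rest is the text itself.
payload = raw[2:]

# Combine each pair of bytes as a little-endian 16-bit value.
units = [
    int.from_bytes(payload[i:i + 2], "little")
    for i in range(0, len(payload), 2)
]

as_hex = [f"{u:04X}" for u in units]
print(as_hex)  # ['0DC3', '0DB8', '0DB1', '0DBD', '0DBA', '0DCF']
```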

Now, let's see what these characters are. I'm sure you are quite interested to know this part.

First, let's look at the mapping of Unicode character planes.
Unicode code points can be logically divided into 17 planes, each with 65,536 (= 2^16) code points, although currently only a few planes are used:

Plane 0 (0000–FFFF): Basic Multilingual Plane (BMP).
This is the plane containing most of the character assignments so far. A primary objective for the BMP is to support the unification of prior character sets as well as characters for writing systems in current use.
Plane 1 (10000–1FFFF): Supplementary Multilingual Plane (SMP).
Plane 2 (20000–2FFFF): Supplementary Ideographic Plane (SIP).
Planes 3 to 13 (30000–DFFFF): unassigned.
Plane 14 (E0000–EFFFF): Supplementary Special-purpose Plane (SSP).
Plane 15 (F0000–FFFFF): reserved for the Private Use Area (PUA).
Plane 16 (100000–10FFFF): reserved for the Private Use Area (PUA).

For the Sinhala and Tamil languages, we are interested in the first plane (i.e. Plane 0), the Basic Multilingual Plane.
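
Because every plane holds 65,536 (hex 10000) code points, the plane number is just the code point shifted right by 16 bits. A small Python sketch of that arithmetic follows; `plane_of` is a helper name I made up for illustration.

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0 to 16) that a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a valid Unicode code point")
    # Each plane covers 0x10000 code points, so drop the low 16 bits.
    return code_point >> 16


print(plane_of(0x0DC3))    # 0  (a Sinhala letter, in the BMP)
print(plane_of(0x0B95))    # 0  (a Tamil letter, in the BMP)
print(plane_of(0x10FFFF))  # 16 (the last code point of all)
```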

On this plane, each script has a reserved range of code points.
For Sinhala, it is 0D80–0DFF. For Tamil, it is 0B80–0BFF.
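
These two ranges make it easy to test which script a character belongs to. The sketch below is one simple way to do it in Python; `script_block` and the two constants are hypothetical names, and only the Sinhala and Tamil ranges given above are checked.

```python
# Code point ranges from the text (the end of a Python range is exclusive).
SINHALA = range(0x0D80, 0x0DFF + 1)
TAMIL = range(0x0B80, 0x0BFF + 1)


def script_block(ch: str) -> str:
    """Classify a single character as Sinhala, Tamil or other."""
    cp = ord(ch)
    if cp in SINHALA:
        return "Sinhala"
    if cp in TAMIL:
        return "Tamil"
    return "other"


print(script_block("\u0dc3"))  # Sinhala
print(script_block("\u0ba4"))  # Tamil
print(script_block("A"))       # other
```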

Let's analyse the character string again. Consider the tables in the Sinhala script and Tamil script articles.

0D C3      0D B8      0D B1      0D BD      0D BA      0D CF

ස	0DC3
ම	0DB8
න	0DB1
ල	0DBD
ය	0DBA
ා	0DCF
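
If you do not have the tables at hand, the same lookup can be done in code. This is a rough Python sketch using the standard `unicodedata` module; the names it prints come from the Unicode database shipped with your Python version, and `rows` is just my own variable name.

```python
import unicodedata

# The six 16-bit values read from the file, after the byte swap.
code_points = [0x0DC3, 0x0DB8, 0x0DB1, 0x0DBD, 0x0DBA, 0x0DCF]

rows = []
for cp in code_points:
    ch = chr(cp)
    # unicodedata.name() returns the official character name.
    rows.append((ch, f"{cp:04X}", unicodedata.name(ch)))

for ch, hex_code, name in rows:
    print(ch, hex_code, name, sep="\t")

# Joining the characters gives back the original word.
word = "".join(ch for ch, _, _ in rows)
```

Note that the last code point, 0DCF, is a vowel sign that attaches to the letter before it, so it may look odd when printed on its own.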

You can get the full list of characters from the following links.

Sinhala
Tamil

Now you understand that if you write a file or string according to the above convention, it will be treated as a UNICODE file or string.
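
As a quick check of this convention, the sketch below writes the BOM and the little-endian bytes to a file by hand and then reads the file back as UNICODE text. It is only an illustration in Python; the file name `word.txt` and the temporary folder are my own choices.

```python
import codecs
import os
import tempfile

word = "\u0dc3\u0db8\u0db1\u0dbd\u0dba\u0dcf"

# Hypothetical output location, created only for this example.
path = os.path.join(tempfile.mkdtemp(), "word.txt")

# Write the file by hand: BOM first, then each character low byte first.
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE)
    f.write(word.encode("utf-16-le"))

# The "utf-16" codec reads the BOM, picks the byte order from it,
# and removes it from the decoded text.
with open(path, "r", encoding="utf-16") as f:
    round_trip = f.read()

print(round_trip == word)     # True
print(os.path.getsize(path))  # 14 (2 BOM bytes + 6 characters x 2 bytes)
```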

UNICODE Fonts
These are fonts whose characters are arranged according to the UNICODE character set. These fonts respect the character range assigned to each language, which makes it possible for the characters of several languages to co-exist in a single TTF file. Sarasavi.TTF is one of the well-known UNICODE fonts used on the internet.
Windows XP uses the font called Arial UNICODE to draw characters. If you open this font in Character Map, you will see that Sinhala is missing, while some languages such as Arabic are supported. So if you are using Windows XP, you need to install a Sinhala/Tamil font separately.

If you want to choose the Sinhala UNICODE font you prefer, read How to set UNICODE fonts for native languages in Windows.

Add your comments about this article.