Unicode: A character encoding standard developed by the Unicode Consortium. By using more than one byte to represent each character, Unicode enables almost all of the written languages in the world to be represented by using a single character set.
UNICODE files/strings
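UNICODE files are files that contain UNICODE characters. What Windows Notepad calls a "Unicode" file is encoded as UTF-16 little-endian and starts with a special two-byte signature (FF FE), the Unicode Byte Order Mark (BOM). A UNICODE string held in memory normally does not carry these two bytes; they matter when the text is saved to or read from a file.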
Consider the following example. (If one of our Tamil brothers can change this example to Tamil in a following post, it will be really useful).
We write the word සමනලයා in Sinhala UNICODE.
Now copy and paste this into Notepad and save it as a UNICODE file.
Then open this file in a hex editor.
You will see these bytes.
FF FE C3 0D B8 0D B1 0D BD 0D BA 0D CF 0D
The first two bytes, FF FE, are the BOM. The remaining bytes, C3 0D B8 0D B1 0D BD 0D BA 0D CF 0D, are the characters of the word.
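The byte sequence above can be reproduced without Notepad. The following is a minimal Python sketch, assuming the six code points of the example word are 0DC3, 0DB8, 0DB1, 0DBD, 0DBA and 0DCF (they are written as escapes so the snippet does not depend on how this page was saved). The variable names are only illustrative.

```python
import codecs

# The example word, written as code point escapes.
word = "\u0dc3\u0db8\u0db1\u0dbd\u0dba\u0dcf"

# The "utf-16-le" codec does not emit a BOM, so it is prepended explicitly.
data = codecs.BOM_UTF16_LE + word.encode("utf-16-le")

# Format the bytes the way a hex editor would show them.
hex_dump = " ".join(f"{b:02X}" for b in data)
print(hex_dump)  # FF FE C3 0D B8 0D B1 0D BD 0D BA 0D CF 0D
```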
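In this file format, each 16-bit value is stored with its low byte first. This convention is called little-endian byte order, and the BOM FF FE tells the reader that it was used. Because of this, the two bytes of every pair appear interchanged on disk compared with the way the number is normally written.
Each of the characters here occupies one 16-bit unit, so this affects UNICODE strings and files as well. When the above stream is read into memory as 16-bit values, you will see it as below.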
0D C3 0D B8 0D B1 0D BD 0D BA 0D CF
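To check this conversion by hand, the bytes after the BOM can be unpacked as little-endian 16-bit values. This is a small Python sketch using the standard `struct` module; the names `raw` and `units` are just for illustration.

```python
import struct

# The bytes that follow the BOM in the file, as seen in the hex editor.
raw = bytes.fromhex("C3 0D B8 0D B1 0D BD 0D BA 0D CF 0D")

# "<" means little-endian, "H" means unsigned 16-bit; one "H" per byte pair.
units = struct.unpack(f"<{len(raw) // 2}H", raw)

print(" ".join(f"{u:04X}" for u in units))  # 0DC3 0DB8 0DB1 0DBD 0DBA 0DCF
```

Changing `<` to `>` would read the same bytes as big-endian and give C30D, B80D and so on, which is why the reader needs the BOM to know the byte order.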
First, let's look at the mapping of Unicode character planes.
Unicode code points can be logically divided into 17 planes, each with 65,536 (= 2^16) code points, although currently only a few planes are used:
- Plane 0 (0000–FFFF): Basic Multilingual Plane (BMP).
This is the plane containing most of the character assignments so far. A primary objective for the BMP is to support the unification of prior character sets as well as characters for writing systems in current use.
- Plane 1 (10000–1FFFF): Supplementary Multilingual Plane (SMP)
- Plane 2 (20000–2FFFF): Supplementary Ideographic Plane (SIP)
- Plane 3 (30000–3FFFF): Tertiary Ideographic Plane (TIP)
- Planes 4 to 13 (40000–DFFFF) are unassigned
- Plane 14 (E0000–EFFFF): Supplementary Special-purpose Plane (SSP)
- Plane 15 (F0000–FFFFF): reserved for the Private Use Area (PUA)
- Plane 16 (100000–10FFFF): reserved for the Private Use Area (PUA)
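Since each plane holds 2^16 code points, the plane number is simply the code point shifted right by 16 bits. A short Python sketch of that arithmetic follows; `plane_of` is a hypothetical helper name, not a library function.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Dropping the low 16 bits leaves the plane number.
    return code_point >> 16


print(plane_of(0x0DC3))    # 0  (a Sinhala letter, in the BMP)
print(plane_of(0x20000))   # 2  (start of the SIP)
print(plane_of(0x10FFFF))  # 16 (last code point)
```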
Within the BMP, each script has a reserved range of code points.
For Sinhala, it is 0D80–0DFF. For Tamil, 0B80–0BFF.
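These two ranges are enough to tell whether a given character is Sinhala or Tamil. The Python sketch below is only an illustration; the `BLOCKS` table and the `block_of` function are made-up names, and the table holds just the two ranges mentioned here.

```python
# Code point ranges taken from the text above (inclusive on both ends).
BLOCKS = {
    "Sinhala": (0x0D80, 0x0DFF),
    "Tamil": (0x0B80, 0x0BFF),
}


def block_of(ch: str):
    """Return the name of the block containing ch, or None if not listed."""
    cp = ord(ch)
    for name, (low, high) in BLOCKS.items():
        if low <= cp <= high:
            return name
    return None


print(block_of("\u0dc3"))  # Sinhala
print(block_of("\u0b85"))  # Tamil
print(block_of("A"))       # None
```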
Let's analyse the character string again. Consider the tables in the Sinhala script and Tamil script articles.
0D C3 0D B8 0D B1 0D BD 0D BA 0D CF
ස 0DC3
ම 0DB8
න 0DB1
ල 0DBD
ය 0DBA
ා 0DCF
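If the script tables are not at hand, the same lookup can be done with Python's standard `unicodedata` module, which knows the official name of every assigned code point. This is only a sketch; the names `units` and `word` are illustrative, and only the code point and name are printed because showing the characters themselves needs a terminal and font that support Sinhala.

```python
import unicodedata

# The six 16-bit values read from the file.
units = [0x0DC3, 0x0DB8, 0x0DB1, 0x0DBD, 0x0DBA, 0x0DCF]

for u in units:
    # chr() turns a code point into a one-character string.
    print(f"U+{u:04X}", unicodedata.name(chr(u)))

# Joining the characters gives back the original word.
word = "".join(chr(u) for u in units)
```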
Now you can see that a file or string written according to the above convention will be treated as a UNICODE file or string.
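As a rough check of this, the Python sketch below builds such a file by hand, byte by byte, and then lets Python's own "utf-16" codec read it back. The file name `word.txt` and the variable names are arbitrary choices for the example.

```python
import os
import tempfile

# Code points of the example word.
units = [0x0DC3, 0x0DB8, 0x0DB1, 0x0DBD, 0x0DBA, 0x0DCF]

# The BOM first, then each value as two bytes with the low byte first.
payload = b"\xff\xfe" + b"".join(u.to_bytes(2, "little") for u in units)

path = os.path.join(tempfile.mkdtemp(), "word.txt")
with open(path, "wb") as f:
    f.write(payload)

# The "utf-16" codec reads the BOM to find the byte order and then drops it.
with open(path, encoding="utf-16") as f:
    text = f.read()

print([f"{ord(c):04X}" for c in text])
```

The file written this way should also open correctly in Notepad, since it follows the same layout that Notepad produced earlier.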
UNICODE Fonts
These are fonts whose glyphs are mapped to UNICODE code points. They respect the code point range reserved for each script, which makes it possible for the glyphs of several scripts to coexist in a single TTF file. Sarasavi.TTF is one of the well-known UNICODE fonts used on the internet.
Windows XP uses the font called Arial UNICODE to draw characters. If you open this font in Character Map, you will see that Sinhala is missing, while some scripts such as Arabic are supported. So if you are using Windows XP, you need to install a Sinhala/Tamil font separately.
If you want to choose which Sinhala UNICODE font is used, read How to set UNICODE fonts for native languages in Windows.