Encoding script-specific writing rules based on the Unicode character set
(long version)
Malek Boualem & Mark Leisher
CRL (Computing Research Laboratory), New Mexico State University,
Box 30001, Dept 3CRL, Las Cruces, NM 88003, USA
E-mail: malek@crl.nmsu.edu, mleisher@crl.nmsu.edu
http://crl.nmsu.edu
Abstract
The World Wide Web is now the primary means of information interchange, and most of that information is represented as text. However, programs that create and view these texts generally do not adequately support non-Latin scripts, particularly right-to-left scripts. Unicode, as a universal character set, solves the encoding problems of multilingual texts. It provides abstract character codes but does not offer methods for rendering text on screen or paper. An abstract character such as « ARABIC LETTER BEH », which has the code value U+0628, can have different visual representations (called shapes or glyphs) on screen or paper, depending on context. The scripts covered by Unicode can have different rules for rendering glyphs, composite characters, ligatures, and other script-specific features. In this paper we present a general approach to encoding script-specific rendering rules based on the Unicode character set and using finite-state transducers. The proposed formalism for character classification and writing rules is modular and easy for average users to read and modify. In addition, it is based on the most stable font structure defined in the Unicode Standard, so it should be reusable by other environments that support fonts from the Unicode Standard. Moreover, the associated program is written in Java, which makes it portable across many environments. This approach is demonstrated with writing rules for some languages that use the Arabic script.
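To make the distinction between an abstract character and its contextual glyphs concrete, the following is a minimal sketch in Java. It is not the formalism or the program described in this paper: the class name ContextualShapingSketch, the method shape, and the table layout are hypothetical, only three letters are covered, and ligatures (for example LAM followed by ALEF) are not handled. As an assumption made for illustration, the code points of the Unicode Arabic Presentation Forms-B block stand in for glyphs.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a hand-written table, not the rule formalism of the paper.
class ContextualShapingSketch {

    // Presentation-form code points per letter, in the order
    // isolated, final, initial, medial. A right-joining letter such as ALEF
    // connects only to the preceding letter, so it has just the first two forms.
    private static final Map<Character, char[]> FORMS = new HashMap<>();
    static {
        FORMS.put('\u0627', new char[] {'\uFE8D', '\uFE8E'});                     // ALEF
        FORMS.put('\u0628', new char[] {'\uFE8F', '\uFE90', '\uFE91', '\uFE92'}); // BEH
        FORMS.put('\u0644', new char[] {'\uFEDD', '\uFEDE', '\uFEDF', '\uFEE0'}); // LAM
    }

    // A dual-joining letter can connect to both the preceding and the following letter.
    private static boolean isDualJoining(char c) {
        char[] forms = FORMS.get(c);
        return forms != null && forms.length == 4;
    }

    // Maps each abstract character (in logical order) to a contextual form.
    // Characters missing from the table are copied through unchanged.
    static String shape(String logicalText) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < logicalText.length(); i++) {
            char c = logicalText.charAt(i);
            char[] forms = FORMS.get(c);
            if (forms == null) {
                out.append(c);
                continue;
            }
            // Joins backward if the preceding letter can connect forward.
            boolean joinsPrev = i > 0 && isDualJoining(logicalText.charAt(i - 1));
            // Joins forward if this letter can connect forward and a known letter follows.
            boolean joinsNext = isDualJoining(c)
                    && i + 1 < logicalText.length()
                    && FORMS.containsKey(logicalText.charAt(i + 1));
            int index;
            if (joinsPrev && joinsNext) {
                index = 3; // medial
            } else if (joinsNext) {
                index = 2; // initial
            } else if (joinsPrev) {
                index = 1; // final
            } else {
                index = 0; // isolated
            }
            out.append(forms[index]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // BEH ALEF BEH in logical order.
        String shaped = shape("\u0628\u0627\u0628");
        StringBuilder hex = new StringBuilder();
        for (char ch : shaped.toCharArray()) {
            hex.append(String.format("%04X ", (int) ch));
        }
        System.out.println(hex.toString().trim()); // prints "FE91 FE8E FE8F"
    }
}
```

In this sketch the same abstract character U+0628 comes out as the initial form U+FE91 at the start of the word and as the isolated form U+FE8F at the end, because the preceding ALEF does not connect forward. A rule-based formalism such as the one proposed here would express this kind of context dependency declaratively rather than in hard-coded conditionals.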