Like most modules in Libmba, the ideas and code can be extracted or adapted to meet the needs of your application and do not bind you to a particular "environment" or library. Although the text.h typedefs and macros are used throughout the libmba package, users can simply choose to ignore them and pass char * to functions that accept tchar * (or wchar_t * if libmba has been compiled with -DUSE_WCHAR).
Consider the Russian character CYRILLIC CAPITAL LETTER GHE, which looks like an upside-down 'L' and has the Unicode value U+0413. This character's numeric value differs depending on which charset is being discussed, but Unicode is the international standard superset of virtually all charsets, so unless otherwise specified Unicode will be used to describe characters throughout this documentation. The number of bytes that the U+0413 character occupies depends on the character encoding being used. Notice that a charset and a character encoding are different things.
Charsets

A charset (or character set) is a map that defines which numeric value represents a particular character. In the Unicode charset, CYRILLIC CAPITAL LETTER GHE is the numeric value 0x0413 (written U+0413 in the Unicode standard convention). Some example charsets are ISO-8859-1, CP1251, and GB2312.
Character Encodings

A character encoding defines how the numeric values representing characters are serialized into a sequence of bytes so that they can be operated on in memory, stored on a disk, or transmitted over a network. For example, at least two bytes of memory would be required to hold the value 0x0413. If this value were stored simply as the byte 0x04 followed by the byte 0x13, this encoding would be UCS-2BE, meaning Unicode Character Set, 2 bytes, big-endian. Other examples of Unicode character encodings are UTF-8, UTF-16LE, UTF-16BE, and UCS-4.
For example, the sequence of bytes representing the Unicode character U+0413 in UTF-8 is 0xD0 followed by 0x93. This multi-byte sequence was determined by using a hexedit program to create a file called ucs2.bin containing two bytes, 0x04 followed by 0x13 (which, as described above, is the UCS-2BE encoding), and then converting the file from UCS-2BE to UTF-8 with:

$ iconv -f UCS-2BE -t UTF-8 ucs2.bin > utf8.bin

The result was then viewed with hexedit again.
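For code points in the two-byte range, the UTF-8 bytes can also be computed directly from the bits of the character value. The following sketch (the utf8_encode2 name is invented here for illustration; this is not a complete UTF-8 encoder) reproduces the 0xD0 0x93 result for U+0413:

```c
/* Encode a Unicode code point in the range U+0080..U+07FF as two
 * UTF-8 bytes with the bit layout 110xxxxx 10xxxxxx.
 * Illustrative sketch only; a real encoder must handle all ranges.
 */
void
utf8_encode2(unsigned int wc, unsigned char out[2])
{
    out[0] = 0xC0 | (wc >> 6);   /* leading byte: top 5 bits  */
    out[1] = 0x80 | (wc & 0x3F); /* continuation: low 6 bits  */
}
```

Applying this to 0x0413 yields 0xC0|0x10 = 0xD0 and 0x80|0x13 = 0x93, matching the iconv output above.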
UTF-8 is the premier character encoding used on Unix and Unix-like platforms. For a complete description of UTF-8 read the UTF-8 and Unicode FAQ for Unix/Linux.
#ifdef USE_WCHAR
typedef wchar_t tchar;
#else
typedef unsigned char tchar;
#endif
char *strncpy(char *dest, const char *src, size_t n);
wchar_t *wcsncpy(wchar_t *dest, const wchar_t *src, size_t n);

From the above signatures it can be seen that the only difference is the character type. The number, order, and meaning of the parameters are the same. This permits the function to be abstracted with macros as follows:
#ifdef USE_WCHAR
#define tcsncpy wcsncpy
#else
#define tcsncpy strncpy
#endif

Using this function is now a matter of substituting all instances of strncpy or wcsncpy with tcsncpy. Depending on how the program is compiled, code that uses these functions will support wide character or multi-byte strings (but not both at the same time). See the Text Module API Documentation for a complete list of macros in text.h.
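A minimal sketch of the abstraction in use. The stand-in definitions below mirror the spirit of text.h but are simplifications made so the fragment stands alone: the TEXT() literal macro is an assumption (not necessarily the name used by text.h), and plain char is used rather than unsigned char to keep the narrow path cast-free:

```c
#include <string.h>
#include <wchar.h>

/* Stand-in tchar definitions; the real ones live in mba/text.h. */
#ifdef USE_WCHAR
typedef wchar_t tchar;
#define TEXT(s) L##s          /* wide string literal */
#define tcslen  wcslen
#define tcsncpy wcsncpy
#else
typedef char tchar;
#define TEXT(s) s             /* narrow string literal */
#define tcslen  strlen
#define tcsncpy strncpy
#endif

/* Copy a user name into a fixed buffer, narrow or wide depending
 * on how the program was compiled.
 */
void
copy_name(tchar *dst, size_t n, const tchar *src)
{
    tcsncpy(dst, src, n);
    dst[n - 1] = TEXT('\0');  /* strncpy does not always terminate */
}
```

Compiling with -DUSE_WCHAR switches every call site to the wide character functions without touching the code itself.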
There are of course many other functions that operate on strings. Fortunately, most standard C library functions have wide character versions with reasonably consistent identifier names. An identifier that begins with str will likely have a wide character version that begins with wcs. Other functions like vswprintf are not so obvious, and depending on the system being used there will certainly be omissions or incompatibilities (e.g. the wide character counterpart of vsnprintf is vswprintf, without the n, even though it accepts an n parameter). If a function does not have a man page, or if the compiler issues a warning, it does not necessarily mean the function does not exist on your system. For example, with the GNU C library it may be necessary to specify C99 or define _XOPEN_SOURCE=500 to indicate that a UNIX98 environment is desired. Check your C library documentation (e.g. /usr/include/features.h) and the POSIX documentation on the OpenGroup website. On my RedHat Linux 7.3 system, wcstol and several other conversion functions are not documented; it is necessary to specify -std=c99 or define -D_ISOC99_SOURCE with gcc to get it to export those symbols.
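For example, on such a system wcstol is available but may only be declared when the right feature-test macro is defined before any headers are included (a sketch; the exact macro needed depends on your C library and compiler flags, and modern libraries declare it by default):

```c
#define _ISOC99_SOURCE   /* may be required by older glibc for wcstol */
#include <wchar.h>

/* Parse a wide character string as a decimal long integer.
 * (Trivial wrapper, for illustration only.)
 */
long
wcs_to_long(const wchar_t *s)
{
    return wcstol(s, NULL, 10);
}
```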
A good example of when UTF-8 string handling requires special care is when each character needs to be examined individually. Consider the example of caseless comparison of two strings. They cannot simply be compared element by element. Each character must be decoded to its wide character value and converted to upper or lower case for the comparison to be valid. Below is just such a function:
/* Case insensitive comparison of two UTF-8 strings
 */
int
utf8casecmp(const unsigned char *str1, const unsigned char *str1lim,
        const unsigned char *str2, const unsigned char *str2lim)
{
    int n1, n2;
    wchar_t ucs1, ucs2;
    int ch1, ch2;
    mbstate_t ps1, ps2;

    memset(&ps1, 0, sizeof(ps1));
    memset(&ps2, 0, sizeof(ps2));

    while (str1 < str1lim && str2 < str2lim) {
        if ((*str1 & 0x80) && (*str2 & 0x80)) {       /* both multibyte */
            if ((n1 = mbrtowc(&ucs1, (const char *)str1, str1lim - str1, &ps1)) < 0 ||
                    (n2 = mbrtowc(&ucs2, (const char *)str2, str2lim - str2, &ps2)) < 0) {
                PMNO(errno);
                return -1;
            }
            if (ucs1 != ucs2 && (ucs1 = towupper(ucs1)) != (ucs2 = towupper(ucs2))) {
                return ucs1 < ucs2 ? -1 : 1;
            }
            str1 += n1;
            str2 += n2;
        } else {                                      /* neither or one multibyte */
            ch1 = *str1;
            ch2 = *str2;
            if (ch1 != ch2 && (ch1 = toupper(ch1)) != (ch2 = toupper(ch2))) {
                return ch1 < ch2 ? -1 : 1;
            } else if (ch1 == '\0') {
                return 0;
            }
            str1++;
            str2++;
        }
    }
    return 0;
}
This is a fairly pathological example; in practice it is probably as difficult as it gets. For example, if the objective is to search for a certain ASCII character such as a space or the '\0' terminator, it is not necessary to decode Unicode values at all. It might even be reasonable to use isspace and similar functions (but probably not ispunct, for example). This will require some experimenting and research.
Another example of when a variable width encoding requires special handling in your code is when calculating the number of bytes from a string required to occupy at most a certain number of display positions in a terminal window. In this case it is necessary to convert each character to its Unicode value and then use the wcwidth(3) function. When the total of the values returned by wcwidth(3) equals or exceeds the desired number of columns, the number of bytes traversed in the substring is known.
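The calculation just described might be sketched as follows. The utf8_fit_columns name is a hypothetical helper, not part of libmba, and for UTF-8 input the caller must first have selected a UTF-8 locale with setlocale(3):

```c
#define _XOPEN_SOURCE 700   /* for wcwidth(3) */
#include <string.h>
#include <wchar.h>

/* Return the number of bytes from the string [s, lim) whose characters
 * occupy at most maxcols display columns, or (size_t)-1 on a decoding
 * error.  The bytes are decoded in the current LC_CTYPE locale.
 */
size_t
utf8_fit_columns(const char *s, const char *lim, int maxcols)
{
    const char *p = s;
    mbstate_t ps;
    int cols = 0;

    memset(&ps, 0, sizeof ps);
    while (p < lim) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, p, lim - p, &ps);
        int w;

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;  /* invalid or truncated sequence */
        if (n == 0)
            break;              /* embedded L'\0' terminator */
        if ((w = wcwidth(wc)) < 0)
            w = 0;              /* nonprintable: occupies no column */
        if (cols + w > maxcols)
            break;              /* next character would not fit */
        cols += w;
        p += n;
    }
    return p - s;
}
```

Note that for East Asian double-width characters wcwidth(3) returns 2, which is why the column total rather than the character count is compared against maxcols.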
Ultimately, if the target code is reading and writing plain text to sockets or files on the filesystem, the text will probably need to be converted to and from a well defined encoding such as the locale dependent encoding with wcsrtombs(3) and mbsrtowcs(3). Currently the libmba text module does not define macros for the wide character I/O functions, but that may change in the future. See src/cfg.c for a good example of converting between wide character strings and the multi-byte encoding in files and the environment.
The source below illustrates how a wide character pathname can be converted to the multi-byte encoding for passing to fopen(3).
/* Open a file using a wide character path name.
 */
FILE *
wcsfopen(const wchar_t *path, const char *mode)
{
    char dst[PATH_MAX + 1];
    size_t n;
    mbstate_t mb;

    memset(&mb, 0, sizeof(mbstate_t));
    if ((n = wcsrtombs(dst, &path, PATH_MAX, &mb)) == (size_t)-1) {
        return NULL;
    }
    if (n >= PATH_MAX) {
        errno = E2BIG;
        return NULL;
    }
    return fopen(dst, mode);
}
Currently, libmba modules that support the tchar abstraction do not accept wide character pathnames but that may change in the future.
Note that Unicode pathnames are supported by Unix and Unix-like systems that support the UTF-8 multi-byte encoding. Just call setlocale(LC_CTYPE, "en_US.UTF-8") first, or export LC_CTYPE=en_US.UTF-8 in the environment and call setlocale(LC_CTYPE, ""). To test such a program it will be necessary to see I18N text printed somewhere. The following is a worthwhile exercise:
$ wget http://www.columbia.edu/kermit/utf8.html
$ xterm -u8 -fn '-misc-fixed-*-*-*-*-20-*-*-*-*-*-iso10646-*'
$ LANG=en_US.UTF-8 cat utf8.html
This downloads a file with a wide range of UTF-8 encoded text in it, launches an xterm in UTF-8 mode with a Unicode font, and runs cat in the UTF-8 locale to print the contents of utf8.html to the terminal window. Some newer Linux systems use the UTF-8 locale by default now so the above setup may not be necessary.
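A program might request a UTF-8 capable locale at startup like this (a sketch; the explicit locale name used as a fallback is an assumption, and which locales are installed varies by system):

```c
#include <locale.h>

/* Select a locale for character classification and conversion,
 * preferring the environment's setting (LC_CTYPE, LANG, etc.) and
 * falling back to an explicitly named UTF-8 locale.
 * Returns nonzero on success.
 */
int
try_utf8_locale(void)
{
    if (setlocale(LC_CTYPE, "") != NULL)
        return 1;
    /* "en_US.UTF-8" is an assumed name; it may not be installed */
    return setlocale(LC_CTYPE, "en_US.UTF-8") != NULL;
}
```

After this call, functions such as mbrtowc(3) and wcsrtombs(3) will use the selected multi-byte encoding.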
Not every difference between the narrow and wide APIs can be hidden behind a simple macro. Format strings must also change (%s becomes %ls and string literals need the L prefix), which forces conditional compilation at the call site:

#if defined(USE_WCHAR)
if ((n = swprintf(path, sizeof path / sizeof *path, L"/var/spool/mail/%ls", username)) == -1) {
#else
if ((n = snprintf(path, sizeof path, "/var/spool/mail/%s", username)) == -1) {
#endif
Accidentally mixing narrow and wide character types will usually provoke compiler warnings such as:

tests/TcharAll.c:100: warning: comparison of distinct pointer types lacks a cast
tests/TcharAll.c:161: warning: passing arg 2 of `strtod' from incompatible pointer type
Also remember that pointer arithmetic on tchar pointers yields a count of elements, not bytes; when computing a buffer size in bytes, multiply by the element size:

siz = (src - start + 1) * sizeof *src;