

1. Character encodings

Computers can't actually store letters and symbols; they only store numbers. There are innumerable ways to represent human language characters (like the letter A, the plus sign, etc.) as numbers, and the fact that so many different ways were actually implemented has led to chaos.

Early computers (at least in the United States) converged on two standards for mapping US English characters to numbers and back again: ASCII and EBCDIC. The latter had mostly died out by the end of the 20th century, leaving ASCII (the American Standard Code for Information Interchange) as the primary standard.

The problem with ASCII is that it's too narrow for languages other than English, or even for certain English words which, depending on the writer's stylistic preferences, may be spelled with diacritics (e.g., rôle, naïve). ASCII only covers the 26 letters of the English alphabet (capital and lower-case), the digits 0 through 9, and some basic punctuation -- typically, what you see on a US computer keyboard. On Unix machines, you can type man ascii to see a table.

Most computers use an 8-bit byte as their unit of storage (a range from 0 to 255 when represented as nonnegative integers). ASCII defines characters using only 7 bits (0 to 127), a legacy of days when long-distance data communications were considerably slower and more error-prone. Since ASCII uses only half the range of a byte, this left space for people to define their own sets of characters within a single byte.
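To make the "characters are really numbers" idea concrete, here is a small sketch using the standard od utility. The helper name bytes_of is made up for this page; it just prints the unsigned decimal value of each byte in its argument.

```shell
# bytes_of is a hypothetical helper, not a standard command.
# Print the decimal value of every byte in the first argument.
bytes_of() {
    # od pads its columns with spaces; word splitting collapses them.
    set -- $(printf '%s' "$1" | od -An -tu1)
    echo "$*"
}

bytes_of A      # 65
bytes_of Hi     # 72 105
```

Every value printed for plain ASCII text falls in the range 0 to 127, which is why the upper half of the byte was free for other uses.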

(Many Microsoft DOS/Windows users believe ASCII covers the entire range from 0 to 255, with smiley faces and line-drawing characters and so on. This is incorrect. The well-known DOS character set is actually IBM code page 437, which is one of many supersets of ASCII. ASCII itself stops at character 127.)

Computer users in countries outside the US were unable to represent their written languages using ASCII, so while US programmers were using the space from 128 to 255 to make line-drawing characters and mathematical symbols, Europeans were using it to add their extra letters, and their accented letters. This led to more chaos (not surprisingly), out of which another set of standards evolved: ISO 8859. Note that these are supersets of ASCII. ISO-8859-1, also known as Latin-1, became the dominant standard for Unix workstations in North America and western Europe.

However, this still left some unresolved issues. First, there were still competing standards; eastern Europe has very different alphabets than western Europe does, and the various ISO 8859 standards are incompatible. Second, Asian countries have radically different ways of writing compared to European/American countries, and their character sets don't even fit within a single byte (which only allows 256 different symbols).
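The incompatibility is easy to see with iconv, assuming it is installed and knows both encodings (true on most glibc-based systems, but not guaranteed). This sketch decodes the same byte, 0xE9, under two different parts of ISO 8859 and prints the UTF-8 bytes of the result in hex. The helper name decode_as is hypothetical.

```shell
# decode_as ENCODING: read bytes on stdin, interpret them as ENCODING,
# and print the UTF-8 bytes of the result in hex.
# Assumes an iconv that supports the named encoding.
decode_as() {
    set -- $(iconv -f "$1" -t UTF-8 | od -An -tx1)
    echo "$*"
}

printf '\351' | decode_as ISO-8859-1    # c3 a9  (U+00E9, e with acute accent)
printf '\351' | decode_as ISO-8859-5    # d1 89  (U+0449, a Cyrillic letter)
```

One byte, two entirely different characters: without knowing which encoding a file was written in, there is no way to tell which one was meant.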

The Unicode standard tries to address this: instead of defining only 256 symbols, it defines more than a hundred thousand, with room for over a million. If a computer were to represent each symbol of a document using raw Unicode code points, it would require at least three bytes per symbol (and in practice four are normally used), making simple English documents take three or four times as much space as they did before, with most of that space being occupied by zeroes.

So, to attempt to preserve some efficiency (as well as some compatibility with existing data files), various encodings of Unicode characters were created; of these, currently the most popular (among English speakers, at least) is UTF-8, which is a variable-width encoding. A simple ASCII document is also a valid UTF-8 document; single-byte characters from ASCII are represented using the same byte in UTF-8. However, UTF-8 also offers multi-byte sequences capable of representing all of Unicode (using up to four bytes per character in some cases).
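The variable width is easy to observe by counting bytes. In this sketch the characters are written as octal escapes so that the example does not depend on the encoding of your terminal or editor; count_bytes is a made-up helper name.

```shell
# count_bytes is a hypothetical helper: number of bytes read from stdin.
# (Some wc implementations pad the count with spaces, hence the tr.)
count_bytes() {
    wc -c | tr -d ' '
}

printf 'A'                | count_bytes   # 1  (U+0041, same byte as in ASCII)
printf '\303\251'         | count_bytes   # 2  (U+00E9)
printf '\342\202\254'     | count_bytes   # 3  (U+20AC, the euro sign)
printf '\360\237\230\200' | count_bytes   # 4  (U+1F600)
```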

As of 2009, UTF-8 is the emerging standard for Linux distributions, although there are still many problems with implementations.

2. Locales

So, what's a "locale"? Since there are so many standards out there, and so many different types of computers, some of which only support some of the standards, it's important to be able to say which standard you're working with. This is where locales come in.

A locale is a set of rules determining how information is presented and processed, with respect to human beings. It covers character encodings (which we've talked about in the first part of this page), as well as the order in which those characters are sorted, the format for displaying dates and times, the rules for representing large numbers and numbers with a decimal component, etc.

Examples: an American might write "the third day of January, A.D. 2009" as 1/3/09, while an Englishman may write the same date as 3/1/09. A computer programmer would probably use 2009-01-03. Meanwhile, our American friend writes the number "ten thousand and one one-hundredth" as 10,000.01, much to the distress of his German colleague, who writes 10 000,01 instead.

A Unix system has a command named locale which is used to show which locale a user (or more precisely, a process) is using at the moment, and to list all available locales. For example,

imadev:~$ locale

This shows the locale which is currently in use. To see which ones might be chosen instead:

imadev:~$ locale -a

At this point, the reader should appreciate why the first part of this page was devoted to character set encodings. Without understanding what "iso885915" means, this list would be somewhat cryptic.

A locale name has three components. The first component, which is two lower-case letters, shows the language being used. en, for example, means English; es is Spanish; de is German; and so on. These are language codes (from ISO 639), not country codes, although they often resemble the code of the country most closely associated with the language.

The second component (after the underscore) is the actual country the user is in (or whose locale rules the user wants enforced), and is primarily used for different dialects of a language. en_US and en_GB have a few differences in spelling, different currency symbols, and so forth.

The third component (after the period) is the character encoding. Note that the spelling of the encoding name is not quite standardized across systems. iso885915 is the normal spelling for an ISO-8859-15 encoding, but other systems may require ISO8859-15 (for example). You must use locale -a to see what is available, and how it's spelled, on your system.
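The three components can be pulled apart with ordinary parameter expansion. This is only a sketch: parse_locale is a hypothetical helper, and it assumes names of the form language_COUNTRY.encoding. It also strips a trailing @modifier (such as @euro), which some systems append and which is not covered on this page.

```shell
# parse_locale NAME: split a locale name into its components.
# Hypothetical helper; missing components are reported as empty.
parse_locale() {
    local name=$1 language territory= codeset=
    name=${name%%@*}                 # drop any @modifier
    case $name in
        *.*) codeset=${name#*.}; name=${name%%.*} ;;
    esac
    case $name in
        *_*) territory=${name#*_}; name=${name%%_*} ;;
    esac
    language=$name
    printf 'language=%s territory=%s codeset=%s\n' \
        "$language" "$territory" "$codeset"
}

parse_locale en_US.iso88591   # language=en territory=US codeset=iso88591
parse_locale de_DE            # language=de territory=DE codeset=
```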

The special names C and POSIX are an exception to this. They are required everywhere, and synonymous; they mean (basically) "ASCII, US English, don't apply any special rules". Output under this locale typically conforms to ISO and RFC standards for dates/times/etc., omits thousands separators entirely, uses the actual ASCII encoding values for sorting characters, and so on (generally defaulting to "traditional US computing rules").
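Because spellings vary between systems, a script that wants a particular locale may want to check for it first. A minimal sketch, assuming the locale command is present; have_locale is a made-up name, and the match is deliberately exact (no attempt is made to treat iso88591 and ISO8859-1 as the same thing).

```shell
# have_locale NAME: succeed if NAME appears, spelled exactly that way,
# in the output of locale -a. Hypothetical helper.
have_locale() {
    locale -a 2>/dev/null | grep -Fxq -- "$1"
}

if have_locale en_US.iso88591; then
    echo "en_US.iso88591 is available"
else
    echo "en_US.iso88591 is not available here; C always should be"
fi
```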

You specify which locale you want to use by setting environment variables. (See DotFiles for a discussion of how and where to set environment variables for your interactive sessions.) The various LC_* variables, if set, define specific rules to follow; the LANG variable defines the fallback for whichever LC_* variables aren't set. In the most common cases, you will only set LANG.

To get the settings we saw on our example system, we might use something like:

imadev:~$ cat .profile
LANG=en_US.iso88591; LC_TIME=POSIX
export LANG LC_TIME

(Note: .profile is only the correct file for certain types of logins. See DotFiles if you don't know which file you need to edit or create.)

This gives us the "US English, with ISO-8859-1 encoding" rules for most things, but the POSIX rules for displaying dates and times.

Since these are just environment variables, we can explore what happens when we change things.

imadev:~$ LC_TIME=POSIX date
Thu Apr 16 10:32:03 EDT 2009
imadev:~$ LC_TIME=en_US.iso88591 date
Thu, Apr 16, 2009 10:32:13 AM

For details of what your system does with locales, you'll need to check your manuals (such things are very much open to interpretation by implementors). Debian systems have locale(7) (type man 7 locale to read it); HP-UX has lang(5); and so on.

Once you've decided how you want your session to work, and where you need to put variables, just set things however you prefer.

3. Writing locale-aware programs

When writing programs -- particularly shell scripts, but this applies to other forms of programming as well -- one must be aware of the potentially differing behavior of the target system based on locale selection.

We've already seen how the date command on one system changes its output in response to locales, with fields moved around, extra commas inserted, and a 12-hour clock used instead of a 24-hour clock. (Yours may not be quite as radical, or it could be even more so.) Error messages from other programs or from system libraries may be translated into other languages.
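When a script only needs a timestamp in a predictable shape, one option is to ask date for purely numeric fields, which do not change with the language. The following is a sketch, not the only way to do it; iso_date is a hypothetical name, and forcing LC_ALL=C and UTC are choices made for this example.

```shell
# iso_date: print today's date as YYYY-MM-DD, independent of LC_TIME.
# Hypothetical helper; %Y, %m and %d are numeric, unlike %a or %b.
iso_date() {
    LC_ALL=C date -u '+%Y-%m-%d'
}

iso_date
```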

If you rely on the output of a program or library call to be in a standard format, you should override the locale environment variables, setting the locale to C, for the parts that require consistency. The LC_ALL variable has priority over the individual LC_* variables, which in turn have priority over LANG. Thus, you can get the behavior you expect by forcing LC_ALL=C at critical points.


imadev:~$ echo Hello World | tr A-Z a-z
imadev:~$ echo Hello World | LC_ALL=C tr A-Z a-z
hello world

(That's one of my favorite examples, ever.)
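The precedence described above (LC_ALL first, then the specific LC_* variable, then LANG) can be written out as a small function. This is a sketch of the lookup order only; effective_locale is a made-up helper, and the fallback to C when nothing is set is an assumption, since the real default is up to the implementation. On systems that have it, the locale command reports the real answer.

```shell
# effective_locale CATEGORY: print the locale name that would apply to
# CATEGORY (for example LC_TIME), following the usual precedence.
# Hypothetical helper; CATEGORY must be a valid variable name.
effective_locale() {
    local category=$1 value=
    if [ -n "${LC_ALL:-}" ]; then
        value=$LC_ALL                    # LC_ALL beats everything
    else
        eval "value=\${$category:-}"     # then the specific variable
        [ -n "$value" ] || value=${LANG:-}   # then LANG
    fi
    printf '%s\n' "${value:-C}"          # assumed default: C
}

# Subshells keep the experiments from changing the current session.
( unset LC_ALL; LANG=C; LC_TIME=POSIX; effective_locale LC_TIME )     # POSIX
( unset LC_ALL LC_COLLATE; LANG=POSIX; effective_locale LC_COLLATE )  # POSIX
( LC_ALL=C; LANG=POSIX; LC_TIME=POSIX; effective_locale LC_TIME )     # C
```

Note that a variable which is set but empty is treated the same as one that is unset.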

Many commands offer locale-aware methods of replicating traditional behaviors. For example, tr has [:upper:] to replace A-Z, and so on. These should be preferred where available. Consider that [:upper:] may include things like Á which would not be in the C locale's A-Z. But in the end, as the programmer, you bear the responsibility for choosing what is most appropriate for your project.
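For instance, the case conversion shown earlier can be written with character classes instead of ranges. A minimal sketch; to_lower is a hypothetical name, and how characters outside ASCII are handled still depends on the locale in effect and on your tr implementation (not every tr copes with multi-byte encodings).

```shell
# to_lower: convert stdin to lower case using character classes,
# so the result does not depend on how A-Z happens to be collated.
to_lower() {
    tr '[:upper:]' '[:lower:]'
}

echo Hello World | to_lower    # hello world
```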

The behavior of globs is also locale-dependent; the LC_COLLATE variable defines the order in which names are sorted. The ls command also sorts its output by default, using the same locale-dependent ordering. Unfortunately, there is no standard way to learn what the ordering is within a given locale. One must resort to brute-force tricks. For instance,

imadev:/tmp/greg$ for i in {1..255}; do eval touch \$\'\\x$(printf %02x $i)\'; done
touch: cannot change times on /
imadev:/tmp/greg$ ls -b
   8  Ä  C  É  G  Î  M  Ò  P  t  ü  ý  {  -  ;  ¶  ¥  _  \002  \014  \026  \200  \212  \224  \236
   9  ä  c  é  g  î  m  ò  p  U  V  ÿ  }  ×  :  §  ¤  ­­­   \003  \015  \027  \201  \213  \225  \237
0  A  Å  Ç  È  H  Ï  N  Ô  Q  u  v  Z  «  ÷  "  @  µ  ª  \004  \016  \030  \202  \214  \226  \177
1  a  å  ç  è  h  ï  n  ô  q  Ú  W  z  »  ±  ¿  &  ^  º  \005  \017  \031  \203  \215  \227
2  Á  Ã  D  Ê  I  J  Ñ  Ö  R  ú  w  Þ  <  ¬  ?  °  ~  ¹  \006  \020  \032  \204  \216  \230
3  á  ã  d  ê  i  j  ñ  ö  r  Ù  X  þ  >  ¼  ¡  %  ´  ²  \007  \021  \033  \205  \217  \231
4  À  Æ  Ð  Ë  Í  K  O  Õ  S  ù  x  (  `  ½  !  #  ¨  ³  \010  \022  \034  \206  \220  \232
5  à  æ  ð  ë  í  k  o  õ  s  Û  Y  )  '  ¾  \  $  ¸  ©  \011  \023  \035  \207  \221  \233
6  Â  B  E  F  Ì  L  Ó  Ø  ß  û  y  [  =  *  |  ¢  ·  ®  \012  \024  \036  \210  \222  \234
7  â  b  e  f  ì  l  ó  ø  T  Ü  Ý  ]  +  ,  ¦  £  ¯  \001  \013  \025  \037  \211  \223  \235

This omits the / character, as we cannot create a file with that name, but it does show us the ordering of all the other characters in the HP-UX 10.20 implementation of en_US.iso88591, except for the one that I'm unable to paste into this web browser's textarea (the blank spot to the left of \003). Of course, attempting this on a multi-byte encoding like UTF-8 poses a few logistical problems. (It's probably best explored in segments.)

Since sorting is affected by locale, you may consider overriding LC_COLLATE if you require traditional "ASCIIbetical" order; but you should generally respect the user's locale choices whenever possible.
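If you do need byte-value ordering for one step, the override can be limited to that one command. The sketch below uses LC_ALL rather than LC_COLLATE, on the assumption that you want the override to hold even when the user has LC_ALL set (LC_COLLATE=C alone would be ignored in that case). ascii_sort is a made-up name.

```shell
# ascii_sort: sort in plain byte order regardless of the user's locale.
# Hypothetical helper; arguments are passed through to sort.
ascii_sort() {
    LC_ALL=C sort "$@"
}

printf '%s\n' b A a B | ascii_sort
# A
# B
# a
# b
```

The rest of the script keeps running under whatever locale the user chose.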


2012-07-01 04:11