What does Unicode conformance require?

Chapter 3 discusses this in detail. Here's a very informal version:

  • Unicode characters don't fit in 8 bits; deal with it.

  • Byte order is only an issue in I/O.

  • If you don't know, assume big-endian.

  • Loose surrogates have no meaning.

  • Neither do U+FFFE and U+FFFF.

  • Leave the unassigned codepoints alone.

  • It's OK to be ignorant about a character, but not plain wrong.

  • Subsets are strictly up to you.

  • Canonical equivalence matters.

  • Don't garble what you don't understand.

  • Process UTF-* by the book.

  • Ignore illegal encodings.

  • Right-to-left scripts have to go by bidi rules.
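
A few of the points above can be made concrete in code. What follows is only a rough sketch in Python, not anything taken from Chapter 3: the function names (decode_utf16, has_loose_surrogate, canonically_equal, is_well_formed_utf8) are made up for illustration, and the choice of NFD for comparison is an assumption (NFC would serve equally well). It shows the big-endian default when no byte order mark is present, a check for loose surrogates, a comparison that respects canonical equivalence, and by-the-book UTF-8 validation.

```python
import codecs
import unicodedata


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes read from I/O (hypothetical helper).

    A leading byte order mark decides the byte order; without one,
    big-endian is assumed.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-be")  # no BOM: assume big-endian


def has_loose_surrogate(text: str) -> bool:
    """Return True if text contains a surrogate code point (hypothetical helper).

    A Python str holds code points, not UTF-16 code units, so any
    surrogate found in one is unpaired by definition.
    """
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings up to canonical equivalence (hypothetical helper).

    NFD is an arbitrary choice here; normalizing both sides to NFC
    gives the same answer.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper).

    Python's strict decoder rejects overlong forms, encoded surrogates,
    and truncated sequences.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

For instance, canonically_equal("\u00e9", "e\u0301") holds even though the two strings differ code point by code point, and is_well_formed_utf8(b"\xc0\x80") fails because C0 80 is an overlong encoding of U+0000.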
