Introducción

Ejercicio 2.8.1   Antes de comenzar esta sección, lea los siguientes documentos:

La siguiente introducción esta tomada de la sección la sección 'Unicode' en perluniintro:

Unicode is a character set standard which plans to codify all of the writing systems of the world, plus many other symbols.

Unicode and ISO/IEC 10646 are coordinated standards that provide code points for characters in almost all modern character set standards, covering more than 30 writing systems and hundreds of languages, including all commercially-important modern languages.

All characters in the largest Chinese, Japanese, and Korean dictionaries are also encoded. The standards will eventually cover almost all characters in more than 250 writing systems and thousands of languages. Unicode 1.0 was released in October 1991, and 4.0 in April 2003.

A Unicode character is an abstract entity. It is not bound to any particular integer width, especially not to the C language char .

Unicode is language-neutral and display-neutral: it does not encode the language of the text and it does not generally define fonts or other graphical layout details. Unicode operates on characters and on text built from those characters.

Unicode defines characters like LATIN CAPITAL LETTER A or GREEK SMALL LETTER ALPHA and unique numbers for the characters, in this case 0x0041 and 0x03B1, respectively. These unique numbers are called code points.

The Unicode standard prefers using hexadecimal notation for the code points.

The Unicode standard uses the notation U+0041 LATIN CAPITAL LETTER A, to give the hexadecimal code point and the normative name of the character.

Unicode also defines various Unicode properties for the characters, like uppercase or lowercase, decimal digit, or punctuation; these properties are independent of the names of the characters.

Furthermore, various operations on the characters like uppercasing, lowercasing, and collating (sorting) are defined.

A Unicode character consists either of a single code point, or a base character (like LATIN CAPITAL LETTER A ), followed by one or more modifiers (like COMBINING ACUTE ACCENT ). This sequence of base character and modifiers is called a combining character sequence.

Whether to call these combining character sequences "characters" depends on your point of view. If you are a programmer, you probably would tend towards seeing each element in the sequences as one unit, or "character". The whole sequence could be seen as one "character", however, from the user's point of view, since that's probably what it looks like in the context of the user's language.

With this "whole sequence" view of characters, the total number of characters is open-ended. But in the programmer's "one unit is one character" point of view, the concept of "characters" is more deterministic.

In this document, we take that second point of view: one "character" is one Unicode code point, be it a base character or a combining character.

For some combinations, there are precomposed characters. LATIN CAPITAL LETTER A WITH ACUTE , for example, is defined as a single code point.

These precomposed characters are, however, only available for some combinations, and are mainly meant to support round-trip conversions between Unicode and legacy standards (like the ISO 8859).

In the general case, the composing method is more extensible. To support conversion between different compositions of the characters, various normalization forms to standardize representations are also defined.

Because of backward compatibility with legacy encodings, the "a unique number for every character" idea breaks down a bit: instead, there is "at least one number for every character".

The same character could be represented differently in several legacy encodings.

The converse is also not true: some code points do not have an assigned character.

A common myth about Unicode is that it would be "16-bit", that is, Unicode is only represented as 0x10000 (or 65536) characters from 0x0000 to 0xFFFF . This is untrue. Since Unicode 2.0 (July 1996), Unicode has been defined all the way up to 21 bits (0x10FFFF ), and since Unicode 3.1 (March 2001), characters have been defined beyond 0xFFFF . The first 0x10000 characters are called the Plane 0, or the Basic Multilingual Plane (BMP). With Unicode 3.1, 17 (yes, seventeen) planes in all were defined-but they are nowhere near full of defined characters, yet.

Another myth is that the 256-character blocks have something to do with languages-that each block would define the characters used by a language or a set of languages. This is also untrue. The division into blocks exists, but it is almost completely accidental-an artifact of how the characters have been and still are allocated. Instead, there is a concept called scripts, which is more useful: there is
and so on. Scripts usually span varied parts of several blocks. For further information see Unicode::UCD:

pl@nereida:~/Lperltesting$ perl5.10.1 -wdE 0
main::(-e:1):   0
  DB<1> use Unicode::UCD qw{charinfo charscripts}
  DB<2> x charinfo(0x41)
0  HASH(0xc69a88)
   'bidi' => 'L'
   'block' => 'Basic Latin'
   'category' => 'Lu'
   'code' => 0041
   'combining' => 0
   'comment' => ''
   'decimal' => ''
   'decomposition' => ''
   'digit' => ''
   'lower' => 0061
   'mirrored' => 'N'
   'name' => 'LATIN CAPITAL LETTER A'
   'numeric' => ''
   'script' => 'Latin'
   'title' => ''
   'unicode10' => ''
   'upper' => ''
  DB<3> x @{charscripts()->{Greek}}[0..3]
0  ARRAY(0xd676a8)
   0  880
   1  883
   2  'Greek'
1  ARRAY(0xd86300)
   0  885
   1  885
   2  'Greek'
2  ARRAY(0xd6c718)
   0  886
   1  887
   2  'Greek'
3  ARRAY(0xd6c790)
   0  890
   1  890
   2  'Greek'

The Unicode code points are just abstract numbers. To input and output these abstract numbers, the numbers must be encoded or serialised somehow. Unicode defines several character encoding forms, of which UTF-8 is perhaps the most popular. UTF-8 is a variable length encoding that encodes Unicode characters as 1 to 6 bytes (only 4 with the currently defined characters). Other encodings include UTF-16 and UTF-32 and their big- and little-endian variants (UTF-8 is byte-order independent) The ISO/IEC 10646 defines the UCS-2 and UCS-4 encoding forms.

Casiano Rodríguez León
Licencia de Creative Commons
Principios de Programación Imperativa, Funcional y Orientada a Objetos Una Introducción en Perl/Una Introducción a Perl
por Casiano Rodríguez León is licensed under a Creative Commons Reconocimiento 3.0 Unported License.

Permissions beyond the scope of this license may be available at http://campusvirtual.ull.es/ocw/course/view.php?id=43.
2012-06-19