PERLUNICODE(1)	 Perl Programmers Reference Guide  PERLUNICODE(1)

NAME
       perlunicode - Unicode support in Perl (EXPERIMENTAL,
       subject to change)

DESCRIPTION
       Important Caveat

	   WARNING:  As of the 5.6.1 release, the implementation of Unicode
	   support in Perl is incomplete, and continues to be highly experimental.

       The following areas need further work.  They are being
       rapidly addressed in the 5.7.x development branch.

       Input and Output Disciplines
	   There is currently no easy way to mark data read from
	   a file or other external source as being utf8.  This
	   will be one of the major areas of focus in the near
	   future.

       Regular Expressions
	   The existing regular expression compiler does not
	   produce polymorphic opcodes.  This means that the
	   determination of whether to match Unicode characters
	   is made when the pattern is compiled, based on whether
	   the pattern contains Unicode characters, and not when
	   the matching happens at run time.  This needs to be
	   changed to adaptively match Unicode if the string to
	   be matched is Unicode.

       "use utf8" still needed to enable a few features
	   The "utf8" pragma implements the tables used for
	   Unicode support.  These tables are automatically
	   loaded on demand, so the "utf8" pragma need not
	   normally be used.

	   However, as a compatibility measure, this pragma must
	   be explicitly used to enable recognition of UTF-8
	   encoded literals and identifiers in the source text.
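
       The effect of the pragma on literals can be sketched as
       follows.  This is a minimal illustration, not part of the
       original page: it assumes the script file itself is saved
       as UTF-8 and that it runs on a Perl recent enough to
       support "no utf8" in a nested block.  The variable names
       are invented for the example.

```perl
use strict;
use warnings;

my ($with_pragma, $without_pragma);

{
    use utf8;                        # literals in this block are decoded as UTF-8
    $with_pragma = length("☺");      # one character, U+263A
}
{
    no utf8;                         # literal is taken as the raw bytes in the file
    $without_pragma = length("☺");   # the three bytes E2 98 BA
}

print "with utf8: $with_pragma, without utf8: $without_pragma\n";
```

       The same literal is counted as one character inside the
       "use utf8" block and as three bytes outside it.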

       Byte and Character semantics

       Beginning with version 5.6, Perl uses logically wide
       characters to represent strings internally.  This internal
       representation of strings uses the UTF-8 encoding.

       In future, Perl-level operations can be expected to work
       with characters rather than bytes, in general.

       However, as strictly an interim compatibility measure,
       Perl v5.6 aims to provide a safe migration path from byte
       semantics to character semantics for programs.  For
       operations where Perl can unambiguously decide that the
       input data is characters, Perl now switches to character
       semantics.  For operations where this determination cannot
       be made without additional information from the user, Perl
       decides in favor of compatibility, and chooses to use byte
       semantics.

       This behavior preserves compatibility with earlier
       versions of Perl, which allowed byte semantics in Perl
       operations, but only as long as none of the program's
       inputs are marked as being a source of Unicode character
       data.  Such data may come from filehandles, from calls to
       external programs, from information provided by the system
       (such as %ENV), or from literals and constants in the
       source text.

       If the "-C" command line switch is used (or the
       ${^WIDE_SYSTEM_CALLS} global flag is set to "1"), all
       system calls will use the corresponding wide character
       APIs.  This is currently only implemented on Windows.

       Regardless of the above, the "bytes" pragma can always be
       used to force byte semantics in a particular lexical
       scope.  See the bytes manpage.

       The "utf8" pragma is primarily a compatibility device that
       enables recognition of UTF-8 in literals encountered by
       the parser.  It may also be used for enabling some of the
       more experimental Unicode support features.  Note that
       this pragma is only required until a future version of
       Perl in which character semantics will become the default.
       This pragma may then become a no-op.  See the utf8
       manpage.

       Unless mentioned otherwise, Perl operators will use
       character semantics when they are dealing with Unicode
       data, and byte semantics otherwise.  Thus, character
       semantics for these operations apply transparently; if the
       input data came from a Unicode source (for example, by
       adding a character encoding discipline to the filehandle
       whence it came, or a literal UTF-8 string constant in the
       program), character semantics apply; otherwise, byte
       semantics are in effect.  To force byte semantics on
       Unicode data, the "bytes" pragma should be used.
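
       As a minimal sketch of the lexical scoping described
       above (the variable names are illustrative, and the
       output shown in the comments assumes the string is held
       internally as UTF-8, as this page describes):

```perl
use strict;
use warnings;

# One character, U+263A, stored internally as three UTF-8 bytes.
my $smiley = "\x{263A}";

my $chars = length($smiley);          # character semantics

my $bytes;
{
    use bytes;                        # byte semantics, inside this block only
    $bytes = length($smiley);
}

my $chars_again = length($smiley);    # character semantics again

print "characters: $chars, bytes: $bytes, characters again: $chars_again\n";
```

       The pragma affects only the enclosing block, so the
       first and last calls to "length()" both count characters.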

       Under character semantics, many operations that formerly
       operated on bytes change to operating on characters.  For
       ASCII data this makes no difference, because UTF-8 stores
       ASCII in single bytes, but for any character greater than
       "chr(127)", the character may be stored in a sequence of
       two or more bytes, all of which have the high bit set.
       But by and large, the user need not worry about this,
       because Perl hides it from the user.  A character in Perl
       is logically just a number ranging from 0 to 2**32 or so.
       Larger characters encode to longer sequences of bytes
       internally, but again, this is just an internal detail
       which is hidden at the Perl level.
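
       The relationship between a character's ordinal value and
       the length of its internal encoding can be inspected with
       a short sketch like the one below.  It is an illustration
       only: "internal_bytes" is a hypothetical helper, and the
       "bytes::length" and "utf8::encode" functions it relies on
       are assumed to be available (they are in Perl releases
       later than the one this page documents).  Code points
       128 to 255 are deliberately left out, because later Perls
       may keep such strings in a one-byte form.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without switching on byte semantics

# Hypothetical helper: number of bytes in the internal form of one character.
sub internal_bytes {
    my ($code_point) = @_;
    return bytes::length(chr($code_point));
}

for my $cp (0x41, 0x100, 0x263A, 0x10000) {
    printf "U+%04X: 1 character, %d byte(s)\n", $cp, internal_bytes($cp);
}

# Every byte of a multi-byte sequence has the high bit set.
my $encoded = chr(0x263A);
utf8::encode($encoded);                       # now a string of UTF-8 bytes
my @octets   = unpack("C*", $encoded);
my @high_bit = grep { $_ & 0x80 } @octets;
printf "%d of %d bytes have the high bit set\n",
    scalar(@high_bit), scalar(@octets);
```

       At the Perl level each of these values is still a single
       character; only the internal byte count grows.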

       Effects of character semantics

       Character semantics have the following effects:

	  Strings and patterns may contain characters that have
	   an ordinal value larger than 255.

	   Presuming you use a Unicode editor to edit your
	   program, such characters will typically occur directly
	   within the literal strings as UTF-8 characters, but
	   you can also specify a particular character with an
	   extension of the "\x" notation.  UTF-8 characters are
	   specified by putting the hexadecimal code within
	   curlies after the "\x".  For instance, a Unicode
	   smiley face is "\x{263A}".

	  Identifiers within the Perl script may contain Unicode
	   alphanumeric characters, including ideographs.  (You
	   are currently on your own when it comes to using the
	   canonical forms of characters--Perl doesn't (yet)
	   attempt to canonicalize variable names for you.)

	  Regular expressions match characters instead of bytes.
	   For instance, "." matches a character instead of a
	   byte.  (However, the "\C" pattern is provided to force
	   a match of a single byte ("char" in C, hence "\C").)

	  Character classes in regular expressions match
	   characters instead of bytes, and match against the
	   character properties specified in the Unicode
	   properties database.  So "\w" can be used to match an
	   ideograph, for instance.

	  Named Unicode properties and block ranges may be used
	   as character classes via the new "\p{}" (matches
	   property) and "\P{}" (doesn't match property)
	   constructs.  For instance, "\p{Lu}" matches any
	   character with the Unicode uppercase property, while
	   "\p{M}" matches any mark character.  Single letter
	   properties may omit the braces, so the latter can also
	   be written "\pM".  Many predefined character classes
	   are available, such as "\p{IsMirrored}" and
	   "\p{InTibetan}".

	  The special pattern "\X" matches any extended Unicode
	   sequence (a "combining character sequence" in
	   Standardese), where the first character is a base
	   character and subsequent characters are mark
	   characters that apply to the base character.  It is
	   equivalent to "(?:\PM\pM*)".

	  The "tr///" operator translates characters instead of
	   bytes.  Note that the "tr///CU" functionality has been
	   removed, as the interface was a mistake.  For similar
	   functionality see pack('U0', ...) and pack('C0', ...).

	  Case translation operators use the Unicode case
	   translation tables when provided character input.
	   Note that "uc()" translates to uppercase, while
	   "ucfirst" translates to titlecase (for languages that
	   make the distinction).  Naturally the corresponding
	   backslash sequences have the same semantics.

	  Most operators that deal with positions or lengths in
	   the string will automatically switch to using
	   character positions, including "chop()", "substr()",
	   "pos()", "index()", "rindex()", "sprintf()",
	   "write()", and "length()".  Operators that
	   specifically don't switch include "vec()", "pack()",
	   and "unpack()".  Operators that really don't care
	   include "chomp()", as well as any other operator that
	   treats a string as a bucket of bits, such as "sort()",
	   and the operators dealing with filenames.

	  The "pack()"/"unpack()" letters "c" and "C" do not
	   change, since they're often used for byte-oriented
	   formats.  (Again, think "char" in the C language.)
	   However, there is a new "U" specifier that will
	   convert between UTF-8 characters and integers.  (It
	   works outside of the utf8 pragma too.)

	  The "chr()" and "ord()" functions work on characters.
	   This is like "pack("U")" and "unpack("U")", not like
	   "pack("C")" and "unpack("C")".  In fact, the latter
	   are how you now emulate byte-oriented "chr()" and
	   "ord()" under utf8.

	  The bit string operators "& | ^ ~" can operate on
	   character data.  However, for backward compatibility
	   reasons (bit string operations when the characters all
	   are less than 256 in ordinal value) one cannot mix "~"
	   (the bit complement) and characters both less than 256
	   and equal to or greater than 256.  Most importantly,
	   De Morgan's laws ("~($x|$y) eq ~$x&~$y", "~($x&$y) eq
	   ~$x|~$y") won't hold.  Another way to look at this is
	   that the complement cannot return both the 8-bit
	   (byte) wide bit complement, and the full character
	   wide bit complement.

	  And finally, "scalar reverse()" reverses by character
	   rather than by byte.
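
       Several of the effects listed above can be checked with a
       short script.  What follows is a sketch rather than a
       definitive test: the variable names are invented for the
       illustration, the sample code points (Greek alpha, the
       U+01C4..U+01C6 digraph, a combining acute accent) were
       chosen for the example, and the script assumes a Perl in
       which the Unicode property and case tables are available.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";              # the smiley from the text, via \x{}
my $text   = "a${smiley}b";

# Position and length operators count characters, not bytes.
my $len   = length($text);            # 3
my $where = index($text, "b");        # 2
my $mid   = substr($text, 1, 1);      # the smiley itself

# chr()/ord() work on characters, like pack("U")/unpack("U").
my $ord      = ord($smiley);
my $packed   = pack("U", 0x263A);
my $unpacked = unpack("U", $smiley);

# "." matches one whole character.
my ($dot) = $text =~ /^a(.)b$/;

# Property classes: \p{Lu} (uppercase letter) and \pM (mark).
my $is_upper = "\x{391}" =~ /^\p{Lu}$/ ? 1 : 0;   # GREEK CAPITAL ALPHA
my $is_mark  = "\x{301}" =~ /^\pM$/    ? 1 : 0;   # COMBINING ACUTE ACCENT

# \X matches a base character plus its combining marks as one unit.
my $one_sequence = "e\x{301}" =~ /^\X$/ ? 1 : 0;

# tr/// translates characters.
(my $translated = $text) =~ tr/\x{263A}/!/;

# uc() gives uppercase, ucfirst() gives titlecase.
my $upper = uc("\x{1C6}");            # U+01C4
my $title = ucfirst("\x{1C6}");       # U+01C5

# scalar reverse() reverses by character.
my $reversed = scalar reverse($text);

printf "length=%d index=%d ord=U+%04X upper=U+%04X title=U+%04X\n",
    $len, $where, $ord, ord($upper), ord($title);
```

       Only numbers are printed, so the script does not depend
       on how the output filehandle treats wide characters.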

       Character encodings for input and output

       [XXX: This feature is not yet implemented.]

CAVEATS
       As of yet, there is no method for automatically coercing
       input and output to some encoding other than UTF-8.  This
       is planned in the near future, however.

       Whether an arbitrary piece of data will be treated as
       "characters" or "bytes" by internal operations cannot be
       divined at the current time.

       Use of locales with utf8 may lead to odd results.
       Currently there is some attempt to apply 8-bit locale info
       to characters in the range 0..255, but this is
       demonstrably incorrect for locales that use characters
       above that range (when mapped into Unicode).  It will also
       tend to run slower.  Avoidance of locales is strongly
       encouraged.

SEE ALSO
       the bytes manpage, the utf8 manpage, the section on
       "${^WIDE_SYSTEM_CALLS}" in the perlvar manpage

2001-03-18		   perl v5.6.1		   PERLUNICODE(1)