CONVCS(2)CONVCS(2)NAME
Convcs, Btos, Stob - character set conversion suite
SYNOPSIS
include "convcs.m";
convcs := load Convcs Convcs->PATH;
Btos: module {
init: fn(arg: string): string;
btos: fn(s: Convcs->State, b: array of byte, nchars: int)
: (Convcs->State, string, int);
};
Stob: module {
init: fn (arg: string): string;
stob: fn(s: Convcs->State, str: string)
: (Convcs->State, array of byte);
};
Convcs: module {
State: type string;
Startstate: con "";
init: fn(csfile: string): string;
getbtos: fn(cs: string): (Btos, string);
getstob: fn(cs: string): (Stob, string);
enumcs: fn(): list of (string, string, int);
aliases: fn(cs: string): (string, list of string);
};
DESCRIPTION
The Convcs suite is a collection of modules for converting various
standard coded character sets and character encoding schemes to and
from the Limbo strings.
The Convcs module provides an entry point to the suite, mapping charac‐
ter set names and aliases to their associated converter implementation.
The Convcs module
init(csfile)
Init should be called once to initialise the internal state of
Convcs. The csfile argument specifies the path of the converter
mapping file. If this argument is nil, the default mapping file
/lib/convcs/charsets is used.
getbtos(cs)
Getbtos returns an initialised Btos module for converting from
the requested encoding.
The return value is a tuple, holding the module reference and an
error string. If any errors were encountered in locating, load‐
ing or initialising the requested converter, the module refer‐
ence will be nil and the string will contain an explanation.
The character set name, cs, is normalised by mapping all upper-
case latin1 characters to lower-case before comparison with
character set names and aliases from the charsets file.
getstob(cs)
Getstob returns an initialised Stob module for converting from
strings to the requested encoding. Apart from the different
module type, the return value and semantics are the same as for
getbtos().
enumcs()
Returns a list describing the set of available converters. The
list is comprised of (name, desc, mode) tuples.
Name is the standard name of the character set.
Desc is a more friendly description taken directly from the desc
attribute given in the charsets file. If the attribute is not
given then desc is set to name.
Mode is a bitmap that details the converters available for the
given character set. Valid mode bits are Convcs->BTOS and Con‐
vcs->STOB. For convenience, Convcs->BOTH is also defined.
aliases(cs)
Returns the aliases for the character set name cs. The return
value is the tuple (desc, aliases).
Desc is the descriptive text for the character set, as returned
by enumcs().
Aliases lists all the known aliases for cs. The first name in
the list is the default character set name - the name that all
the other aliases refer to. If the aliases list is nil then
there was an error in looking up the character set name and desc
will detail the error.
Using a Btos converter
The Btos module returned by getbtos() is already initialised and is
ready to start the conversion. Conversions can be made on a individual
basis, or in a `streamed' mode.
converter->btos(s, b, nchars)
Converts raw byte codes of the character set encoding to a Limbo
string.
The argument s is a converter state as returned from the previ‐
ous call to btos on the same input stream. The first call to
btos on a particular input stream should give Convcs->Startstate
(or nil) as the value for s. The argument b is the bytes to be
converted. The argument nchars is the maximal length of the
string to be returned. If this argument is -1 then as much of b
will be consumed as possible. A value of 0 indicates to the
converter that there is no more data and that any pending state
should be flushed.
The return value of btos is the tuple (state, str, nbytes) where
state is the new state of the converter, str is the converted
string, and nbytes is the number of bytes from b consumed by the
conversion.
The same converter module can be used for multiple conversion
streams by maintaining a separate state variable for each
stream.
Using an Stob converter
The Stob module returned by getstob() is already initialised and is
ready to start the conversion.
converter->stob(s, str)
Converts the limbo string str to the raw byte codes of the char‐
acter set encoding. The argument s represents the converter
state and is treated in the same way as for converter->btos()
described above. To terminate a conversion stob should be
called with an empty string in order to flush the converter
state.
The return value of stob is the tuple (state, bytes) where state
is the new state of the converter and bytes is the result of the
conversion.
Conversion errors
When using converter->btos() to convert data to Limbo strings, any byte
sequences that are not valid for the specific character encoding scheme
will be converted to the Unicode error character 16rFFFD.
When using converter->stob() to convert Limbo strings, any Unicode
characters that can not be mapped into the character set will normally
be substituted by the US-ASCII code for `?'. Note that this may be
inappropriate for certain conversions, such converters will use a suit‐
able error character for their particular character set and encoding
scheme.
Charset file format
The file /lib/convcs/charsets provides the mapping between character
set names and their implementation modules. The file format conforms
to that supported by cfg(2). The following description relies on terms
defined in the cfg(2) manual page.
Each record name defines a character set name. If the primary value of
the record is non-empty then the name is an alias, the value being the
real name. An alias record must point to an actual converter record,
not to another alias, as Convcs only follows one level of aliasing.
Each converter record consists of a set of tuples with the following
primary attributes:
desc A more descriptive name for the character set encoding. This
attribute is used for the description returned by enumcs().
btos The path to the Btos module implementation of this converter.
stob The path to the Stob module implementation of this converter.
Both the btos and stob tuples can have an optional arg attribute which
is passed to the init() function of the converter when initialised by
Convcs. If a converter record has neither an stob nor a btos tuple,
then it is ignored.
The following example is an extract from the standard Inferno charsets
file:
cp866=ibm866
866=ibm866
ibm866=
desc='Russian MS-DOS CP 866'
stob=/dis/lib/convcs/cp_stob.dis arg=/lib/convcs/ibm866.cp
btos=/dis/lib/convcs/cp_btos.dis arg=/lib/convcs/ibm866.cp
This entry defines Stob and Btos converters for the character set
called ibm866. The converters are actually the generic codepage con‐
verters cp_stob and cp_btos paramaterized with a codepage file. The
entry also defines the aliases cp866 and 866 for the name ibm866.
FILES
/lib/convcs/charsets
The default mapping between character set names and their imple‐
mentation modules.
/lib/convcs/csname.cp
Codepage files for use by the generic codepage converters
cp_stob and cp_btos. Each file consists of 256 unicode runes
mapping codepage byte values to unicode by their index. The
runes are stored in the utf-8 encoding.
SOURCE
/appl/lib/convcs/convcs.b
Implementation of the Convcs module.
/appl/lib/convcs/cp_btos.b
Generic Btos codepage converter.
/appl/lib/convcs/cp_stob.b
Generic Stob codepage converter.
SEE ALSOtcs(1), cfg(2)CONVCS(2)