LOGIWEB(5) File formats and protocols LOGIWEB(5)NAMElogiweb - Logiweb file formats and protocols
DESCRIPTION
Logiweb is a system for storing, locating, and transmitting Logiweb
pages. Logiweb pages may contain free text mixed with machine intelli‐
gible objects like computer programs, testsuites, and formal proofs.
Logiweb defines a referencing scheme in which each Logiweb page has a
unique Logiweb reference. A Logiweb reference is typically around 30
bytes long. A Logiweb reference contains, among other, a RIPEMD-160
hash key of the referenced page.
The Logiweb standard comprises the following file formats and network
protocols.
The Logiweb vector format
The Logiweb rack format
The Logiweb protocol
The vector format is a condensed representation suited for transmission
over the Internet. Furthermore, the authenticity of a Logiweb page in
vector format can be verified using the RIPEMD-160 hash function. Logi‐
web vectors can be retrieved from untrusted repositories and transmit‐
ted over untrusted networks.
The rack format is a comprehensive, precompiled representation suited
for storing on a local disk. Furthermore, the rack format is a good
starting point for conversion to other formats. The Logiweb Abstract
Machine lgwam(1) needs pages in rack format in order to execute pro‐
grams defined on the pages.
Earlier versions of Logiweb supported XML, MathML, Lisp S-expressions
and several other formats. The current version leaves such support to
the user through the capabilities for user defined rendering of the
Logiweb compiler lgc(1). Only support for latex(1), bibtex(1), makein‐
dex(1), dvipdfm(1), and cc(1) has survived in the present version.
The Logiweb protocol allows to locate a Logiweb page given its Logiweb
reference. More specificially, the Logiweb protocol allows to translate
a Logiweb reference into an http URL. Then http may be used for
retrieving the referenced page and RIPEMD-160 may be used for verifying
the authenticity of page.
Logiweb version 0.1.1 to 0.1.10 included a Logiweb server which imple‐
mented the Logiweb protocol. From Version 1.0.1, the Logiweb server is
excluded from the distribution. That is because installation of the
Logiweb server requires considerable privileges and network insight.
For Version 1.0.1 a Logiweb server does not even exist, and users of
the Logiweb compiler lgc(1) are refered to PATH based translation of
references to http URLs. Furthermore, the Logiweb protocol will change.
That is because it has proved to be too difficult to get an assigned
UDP port and thus the protocol will be modified to be channeled via
http. That will reduce efficiency but also has some benefits.
The Logiweb server will be distributed separately at a later stage.
When Logiweb servers are installed in the future, repositories which
run a server will appear to lgc(1) users to contain all Logiweb pages
in the world. The present but obsolete Logiweb protocol is documented
in logiweb(7).
BITS, BYTES, AND VECTORS
The file formats and protocols are byte oriented, so we first define
bytes and vectors (where 'vector' means 'byte vector').
bit ::= "0" | "1"
byte ::= bit*8
vector ::= byte*
A byte denotes a value in 0..255. We refer to the least significant bit
of a byte as bit zero and to the most significant bit as bit seven.
Bytes of vectors are given in network order which is identical to the
order in which they are stored on disk. We refer to the first byte of a
vector as byte zero.
We write vectors in mixed endian binary: each byte is written with the
most significant bit first and bytes a written in network order.
We refer to bit m of byte n of a vector as bit m+8*n of the vector.
Thus, the bits of a two byte vector are given in this order: 7, 6, 5,
4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8. This order gives rise to
the term 'mixed endian'.
Mixed endian notation resembles Internet dot notation. Internet dot
notation specifies bytes to be written in network order and each byte
to be written with the most significant digit first.
As an example, mixed endian binary 0000 0010 0000 0011 denotes a two
byte followed by a three byte.
CARDINALS AND STRINGS
We shall refer to non-negative integers as cardinals. We represent car‐
dinals and byte strings thus:
septet ::= bit*7
middle-septet ::= "1" septet
end-septet ::= "0" septet
cardinal ::= middle-septet* end-septet
string ::= length vector
length ::= cardinal
A middle or end septet denotes a value in 0..127. As examples, the mid‐
dle septet 1000 0010 and the end septet 0000 0010 both denote two.
Cardinals are expressed little endian base 128. As an example,
1000 0011 0000 0010
denotes 3+128*2 = 3+256 = 259.
A string encodes a sequence of bytes. The length field of the string
indicates the number of bytes in the string.
The grammar is *not* context free because the strings cannot be
expressed in a context free manner.
VECTOR FORMAT
The vector format of Logiweb pages has the following syntax:
pagevector ::= bibliography dictionary body
bibliography ::= reference reference* zero
reference ::= string ; the string must be non-empty
zero ::= cardinal ; the cardinal must denote zero
dictionary ::= entry* zero
entry ::= index arity
index ::= nonzero
nonzero ::= cardinal ; the cardinal must be non-zero
arity ::= cardinal
body ::= vector
The indexes of the dictionary must be strictly decreasing.
We refer to the first reference of the bibliography as reference zero.
Reference zero must be the reference of the page itself.
Above, references are described as strings. The following definition is
more precise:
reference ::= length scheme ripemd timestamp
scheme ::= "0000 0001"
ripemd ::= byte*20
timestamp ::= mantissa exponent
mantissa ::= cardinal
exponent ::= cardinal
The length must equal the total number of bytes in scheme, ripemd, and
timestamp.
Whenever a Logiweb page is published, the time of publication is
included as a timestamp in the reference of the page. The timestamp is
given in Logiweb time as M*10^(-E) where M and E are the values of man‐
tissa and exponent, respectively.
Logiweb time indicates the number of seconds which have elapsed accord‐
ing to International Atomic Time (TAI) since TAI:00:00:00 of Modified
Julian Day (MJD) Zero. MJD-0 corresponds to the Gregorian date
GRD-1858-11-17. In 1858, UTC lacked 10 seconds behind TAI (even thought
they did not know at that time). Hence, Logiweb time measures the num‐
ber of seconds according to International Atomic Time since
UTC:23:59:50 of GRD-1858-11-16.
The ripemd code consists of 20 bytes which corresponds to 160 bits. The
ripemd code of a page must be computed using the RIPEMD-160 hash func‐
tion applied to all bytes following ripemd, including the timestamp of
the page itself.
The body represents a parse tree whose nodes may be Logiweb symbols or
strings. A Logiweb symbol comprises a Logiweb reference and an index
where the index is a cardinal. To be a valid symbol, the index of the
symbol must occur in the dictionary of the referenced page. The arity
of a symbol is the arity assigned to the symbol by that dictionary.
Transformation of the body given as a vector to a parse tree is called
'unpacking'. Reference one of a page may define an unpacker, i.e. a
user defined unpacking function. See the 'base' and 'lgc' Logiweb pages
which come with the distribution for details on this.
If reference one of a page does not define an unpacker then the body is
in default format which is as follows:
body ::= node*
node* ::= symbol | zero string
symbol ::= nonzero
The nodes of the body are given in Polish prefix. The arities of each
symbol is taken from dictionaries of the relevant page. Strings have
arity zero.
Each symbol has value 1+R+n*i where i is the index of the symbol, n is
the number of references in the bibliography, and R is the relative
reference of the symbol. The relative reference R if between 0 and n-1.
The reference of a symbol is the R'th element of the bibliography.
RACK FORMAT
The rack of a page is a structure which represents a compiled version
of the page. See the 'base' and 'lgc' pages which come with the distri‐
bution for more on this.
A rack is an inhomogeneous array, i.e. and array whose elements may
have different types. Indexes of a rack are called hooks. Racks and
hooks are to Logiweb what records and record fields are to languages
like C.
Racks may reside in memory or may be stored to disk. Racks in memory
contain more data than racks stored to disk. Hence, racks are pruned
when written to disk and the missing parts have to be reconstructed
when reading the racks back into memory.
Pruning affects three hooks of a rack.
First, a rack in memory has a code hook which contains compiled ver‐
sions of value definitions. To keep disk racks architecture indepen‐
dent, the code hook is removed when writing a rack to disk and value
definitions have to be recompiled when reading the rack back. Using
racks still represent a great saving in time since no parsing, unpack‐
ing, and macro expansion is needed when loading a rack.
Second, a rack in memory has a cluster hook which contains racks of all
transitively referenced pages. This hook is removed since it contains a
lot of data which is fast and easy to reconstruct from the disk ver‐
sions of transitively referenced pages.
Third and finaly, a rack in memory has a diagnose hook. The value of
that hook is 'maptagged'. A maptag allows to delay computation of the
diagnose in an otherwise eager process. When writing a rack to disk,
computation of the diagnose is forced and the maptag is removed. When
reading the rack back from disk, a maptag is added to the diagnose
hook. Again, see the 'base' and 'lgc' pages that come with the distri‐
bution for details.
Once a rack is pruned for code hook, cluster hook, and diagnose maptag,
it is a data structure which is built up from only three data types:
cons, cardinals, and a special value named T which corresponds to NULL
in C.
Pruned racks are stored in the following format:
rack ::= racknode* count
count ::= cardinal ; three plus the number of nodes
racknode ::= card1 | card2 | cons
card1 ::= zero cardinal
card2 ::= two string
two ::= cardinal ; the cardinal must represent 2
cons ::= head tail
head ::= cardinal
tail ::= cardinal
A rack with no nodes represents the value T. Racks with at least one
node represents the value represented by the last node. Each node rep‐
resents a value as indicated in the following.
A card1 node represents the given cardinal.
A card2 node represents a cardinal as follows: Take the given string.
Discard the length field to obtain a vector. Append a byte with value
one to the vector. Then interpret the vector as a little endian cardi‐
nal base 256.
A cons represents the pair of the given head and tail. A head whose
value is one represents T. A head whose value H is greater than 2 rep‐
resents the value of node H-3. The tail is interpretted similarly. The
head and tail must equal one or reference nodes that appear before the
cons.
RACK WRITING CONVENTIONS
As can be seen, cardinals can be represented in two ways. All cardinals
can be represented as a card1 whereas only some cardinals can be repre‐
sented as a card2. When writing racks to disk, one should use the card2
representation when possible and the card1 representation otherwise.
The representation allows to represent sharing in that a node can be
referenced from more than one cons node. When writing a rack to disk,
one should traverse the rack left-to-right, root-to-leaves, and depth
first. One should maximize sharing such that no two nodes represent the
same value.
If followed, the rack writing conventions ensure that the same rack
will be written the same way on distinct systems. That simplifies test‐
ing but is not necessary for reading racks back to memory.
AUTHOR
Klaus Grue, http://logiweb.eu/
SEE ALSOlgc(1), lgwam(1), lgc(5), lgc.conf(5), logiweb(7)Logiweb JULY 2009 LOGIWEB(5)