VENTI(6)
NAME
venti - archival storage server
DESCRIPTION
Venti is a block storage server intended for archival data. In a Venti
server, the SHA1 hash of a block's contents acts as the block identi‐
fier for read and write operations. This approach enforces a write-
once policy, preventing accidental or malicious destruction of data.
In addition, duplicate copies of a block are coalesced, reducing the
consumption of storage and simplifying the implementation of clients.
This manual page documents the basic concepts of block storage using
Venti as well as the Venti network protocol.
Venti(1) documents some simple clients. Vac(1) and vacfs(4) are more
complex clients.
Venti(2) describes a C library interface for accessing Venti servers
and manipulating Venti data structures.
Venti(8) describes the programs used to run a Venti server.
Scores
The SHA1 hash that identifies a block is called its score. The score
of the zero-length block is called the zero score.
Scores may have an optional label: prefix, typically used to describe
the format of the data. For example, vac(1) uses a vac: prefix.
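
As an illustration of the label convention, the following C sketch
splits a printed score into its optional label and its 20 raw bytes.
It assumes that scores are written as 40 hexadecimal digits, which
this page does not itself state; the names parse_score and hexval are
hypothetical and are not part of venti(2).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { ScoreSize = 20 };	/* a SHA1 hash is 20 bytes */

/* Return the value of one hex digit, or -1 if c is not a hex digit. */
static int hexval(int c)
{
	if (c >= '0' && c <= '9') return c - '0';
	if (c >= 'a' && c <= 'f') return c - 'a' + 10;
	if (c >= 'A' && c <= 'F') return c - 'A' + 10;
	return -1;
}

/*
 * Parse "label:hex" or plain "hex" into score[ScoreSize].
 * The label (without the colon) is copied into label, or label is
 * set to "" when there is none.  Assumes 40 hex digits per score.
 * Returns 0 on success, -1 on malformed input.
 */
int parse_score(const char *s, char *label, size_t labelsize,
                unsigned char score[ScoreSize])
{
	const char *hex = s;
	const char *colon = strchr(s, ':');
	size_t n = 0;
	int i;

	assert(labelsize > 0);
	if (colon != NULL) {
		n = (size_t)(colon - s);
		if (n >= labelsize)
			return -1;
		hex = colon + 1;
	}
	memcpy(label, s, n);
	label[n] = '\0';

	if (strlen(hex) != 2 * ScoreSize)
		return -1;
	for (i = 0; i < ScoreSize; i++) {
		int hi = hexval((unsigned char)hex[2*i]);
		int lo = hexval((unsigned char)hex[2*i + 1]);
		if (hi < 0 || lo < 0)
			return -1;
		score[i] = (unsigned char)(hi << 4 | lo);
	}
	return 0;
}
```

A caller would typically check the returned label against the format
it expects (for example vac) before interpreting the block.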
Files and Directories
Venti accepts blocks up to 56 kilobytes in size. By convention, Venti
clients use hash trees of blocks to represent arbitrary-size data
files. The data to be stored is split into fixed-size blocks and writ‐
ten to the server, producing a list of scores. The resulting list of
scores is split into fixed-size pointer blocks (using only an integral
number of scores per block) and written to the server, producing a
smaller list of scores. The process continues, eventually ending with
the score for the hash tree's top-most block. Each file stored this
way is summarized by a VtEntry structure recording the top-most score,
the depth of the tree, the data block size, and the pointer block size.
One or more VtEntry structures can be concatenated and stored as a spe‐
cial file called a directory. In this manner, arbitrary trees of files
can be constructed and stored.
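
The depth recorded for a file follows from its size and the two block
sizes.  The sketch below counts how many levels of pointer blocks are
needed to reduce the data blocks to a single top-most score.  The
function name tree_depth is hypothetical, and the treatment of an
empty or single-block file as depth 0 is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

enum { ScoreSize = 20 };	/* bytes per SHA1 score */

/*
 * Number of pointer-block levels above the data blocks for a file
 * of size bytes, with data blocks of dsize bytes and pointer blocks
 * of psize bytes.
 */
int tree_depth(uint64_t size, uint32_t dsize, uint32_t psize)
{
	uint64_t nblocks, perblock;
	int depth = 0;

	assert(dsize > 0);
	perblock = psize / ScoreSize;	/* integral number of scores per block */
	assert(perblock >= 2);

	nblocks = (size + dsize - 1) / dsize;	/* data blocks, rounded up */
	while (nblocks > 1) {
		/* each level packs perblock scores into one pointer block */
		nblocks = (nblocks + perblock - 1) / perblock;
		depth++;
	}
	return depth;
}
```

With 8192-byte data and pointer blocks, a pointer block holds 409
scores, so a second level of pointer blocks becomes necessary once a
file needs more than 409 data blocks.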
Scores passed between programs conventionally refer to VtRoot blocks,
which contain descriptive information as well as the score of a direc‐
tory block containing a small number of directory entries.
Conventionally, programs do not mix data and directory entries in the
same file. Instead, they keep two separate files, one with directory
entries and one with metadata referencing those entries by position.
Keeping this parallel representation is a minor annoyance but makes it
possible for general programs like venti/copy (see venti(1)) to tra‐
verse the block tree without knowing the specific details of any par‐
ticular program's data.
Block Types
To allow programs to traverse these structures without needing to
understand their higher-level meanings, Venti tags each block with a
type. The types are:
    VtDataType     000  data
    VtDataType+1   001  scores of VtDataType blocks
    VtDataType+2   002  scores of VtDataType+1 blocks
    ...
    VtDirType      010  VtEntry structures
    VtDirType+1    011  scores of VtDirType blocks
    VtDirType+2    012  scores of VtDirType+1 blocks
    ...
    VtRootType     020  VtRoot structure
The octal numbers listed are the type numbers used by the commands
below. (For historical reasons, the type numbers used on disk and on
the wire are different from the above. They do not distinguish
VtDataType+n blocks from VtDirType+n blocks.)
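
The octal layout of the table lets a generic traversal program decide
what a block contains from its type alone.  The following sketch
shows one way to do so.  It redefines the three base constants
locally with the values from the table; the helper names are
hypothetical, and the sketch deals only with the type numbers listed
above, not with the different numbers used on disk and on the wire.

```c
#include <assert.h>

/* Type numbers as listed in the table (octal). */
enum {
	VtDataType = 000,
	VtDirType  = 010,
	VtRootType = 020
};

/* Base type: VtDataType, VtDirType, or VtRootType. */
int type_base(int type) { return type & ~07; }

/* The n in VtDataType+n or VtDirType+n; 0 for leaf blocks. */
int type_depth(int type) { return type & 07; }

/* A block holds scores (is a pointer block) when its depth is nonzero. */
int is_pointer_block(int type)
{
	assert(type >= 0 && type <= VtRootType);
	return type_base(type) != VtRootType && type_depth(type) != 0;
}

/* Type of the blocks named by the scores inside a pointer block. */
int child_type(int type)
{
	assert(is_pointer_block(type));
	return type - 1;
}
```

A traversal would follow child_type repeatedly until it reaches depth
0, at which point the block holds either file data or VtEntry
structures, depending on the base type.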
Zero Truncation
To avoid storing the same short data blocks padded with differing num‐
bers of zeros, Venti clients working with fixed-size blocks convention‐
ally `zero truncate' the blocks before writing them to the server. For
example, if a 1024-byte data block contains the 11-byte string `hello
world' followed by 1013 zero bytes, a client would store only the
11-byte block. When the client later read the block from the server,
it would append zero bytes to the end as necessary to reach the
expected size.
When truncating pointer blocks (VtDataType+n and VtDirType+n blocks),
trailing zero scores are removed instead of trailing zero bytes.
Because of the truncation convention, any file consisting entirely of
zero bytes, no matter what its length, will be represented by the zero
score: the data blocks contain all zeros and are thus truncated to the
empty block, and the pointer blocks contain all zero scores and are
thus also truncated to the empty block, and so on up the hash tree.
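
The truncation rules can be sketched in C as follows.  The function
names are hypothetical (venti(2) provides its own routines for this
purpose), and the zero score constant is the SHA1 hash of the empty
string.  Ignoring a partial score at the end of a pointer block is a
choice made by this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { ScoreSize = 20 };

/* SHA1 of the zero-length block: the zero score. */
static const unsigned char zeroscore[ScoreSize] = {
	0xda, 0x39, 0xa3, 0xee, 0x5e, 0x6b, 0x4b, 0x0d, 0x32, 0x55,
	0xbf, 0xef, 0x95, 0x60, 0x18, 0x90, 0xaf, 0xd8, 0x07, 0x09
};

/* Length of a data block after dropping trailing zero bytes. */
size_t truncate_data(const unsigned char *buf, size_t n)
{
	while (n > 0 && buf[n-1] == 0)
		n--;
	return n;
}

/* Length of a pointer block after dropping trailing zero scores. */
size_t truncate_pointer(const unsigned char *buf, size_t n)
{
	n -= n % ScoreSize;	/* ignore any partial score at the end */
	while (n >= ScoreSize &&
	       memcmp(buf + n - ScoreSize, zeroscore, ScoreSize) == 0)
		n -= ScoreSize;
	return n;
}

/* Pad a data block of n bytes read from the server to want bytes. */
void zero_extend_data(unsigned char *buf, size_t n, size_t want)
{
	assert(n <= want);
	memset(buf + n, 0, want - n);
}
```

Applied to the example above, truncate_data returns 11 for the
1024-byte `hello world' block, and 0 for a block of all zeros.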
Network Protocol
A Venti session begins when a client connects to the network address
served by a Venti server; the conventional address is tcp!server!venti
(the venti port is 17034). Both client and server begin by sending a
version string of the form venti-versions-comment\n. The versions
field is a list of acceptable versions separated by colons. The proto‐
col described here is version 02. The client is responsible for choos‐
ing a common version and sending it in the VtThello message, described
below.
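
A client choosing a common version needs to check whether the version
it speaks appears in the server's list.  The sketch below does this
for a single version string.  The function name version_offered is
hypothetical, and the sketch assumes that the versions field ends at
the first `-' after the venti- prefix.

```c
#include <assert.h>
#include <string.h>

/*
 * Report whether line, of the form "venti-versions-comment\n",
 * lists the version want (for example "02").
 * Returns 1 if listed, 0 if not, -1 if the line is malformed.
 */
int version_offered(const char *line, const char *want)
{
	const char *p, *end;
	size_t wlen = strlen(want);

	assert(wlen > 0);
	if (strncmp(line, "venti-", 6) != 0)
		return -1;
	p = line + 6;
	end = strchr(p, '-');	/* versions end at the next '-' */
	if (end == NULL)
		return -1;
	while (p < end) {
		/* each version runs up to the next ':' or the end of the field */
		const char *q = memchr(p, ':', (size_t)(end - p));
		if (q == NULL)
			q = end;
		if ((size_t)(q - p) == wlen && memcmp(p, want, wlen) == 0)
			return 1;
		p = q + 1;
	}
	return 0;
}
```

For instance, given the made-up string "venti-04:02-example\n", the
function reports that both 04 and 02 are offered.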
After the initial version exchange, the client transmits requests (T-
messages) to the server, which subsequently returns replies (R-mes‐
sages) to the client. The combined act of transmitting (receiving) a
request of a particular type, and receiving (transmitting) its reply is
called a transaction of that type.
Each message consists of a sequence of bytes. Two-byte fields hold
unsigned integers represented in big-endian order (most significant
byte first). Data items of variable lengths are represented by a one-
byte field specifying a count, n, followed by n bytes of data. Text
strings are represented similarly, using a two-byte count with the text
itself stored as a UTF-encoded sequence of Unicode characters (see
utf(6)). Text strings are not NUL-terminated: n counts the bytes of
UTF data, which include no final zero byte. The NUL character is ille‐
gal in text strings in the Venti protocol. The maximum string length
in Venti is 1024 bytes.
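
The text string encoding might be written as in the following sketch.
The name put_string is hypothetical.  The sketch relies on the input
being a C string, which by construction contains no NUL character,
and it does not check that the bytes are well-formed UTF.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { MaxStringSize = 1024 };	/* maximum string length in Venti */

/*
 * Write s to buf as a two-byte big-endian count followed by the
 * bytes of the string, with no terminating NUL.
 * Returns the number of bytes written, or -1 if s is too long or
 * does not fit in bufsize bytes.
 */
long put_string(unsigned char *buf, size_t bufsize, const char *s)
{
	size_t n = strlen(s);

	assert(buf != NULL);
	if (n > MaxStringSize || n + 2 > bufsize)
		return -1;
	buf[0] = (unsigned char)(n >> 8);	/* most significant byte first */
	buf[1] = (unsigned char)(n & 0xff);
	memcpy(buf + 2, s, n);
	return (long)(n + 2);
}
```

Encoding the 9-byte string "anonymous" this way produces 11 bytes:
the count 00 09 followed by the text.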
Each Venti message begins with a two-byte size field specifying the
length in bytes of the message, not including the length field itself.
The next byte is the message type, one of the constants in the enumera‐
tion in the include file <venti.h>. The next byte is an identifying
tag, used to match responses to requests. The remaining bytes are
parameters of different sizes. In the message descriptions, the number
of bytes in a field is given in brackets after the field name. The
notation parameter[n] where n is not a constant represents a variable-
length parameter: n[1] followed by n bytes of data forming the parame‐
ter. The notation string[s] (using a literal s character) is shorthand
for s[2] followed by s bytes of UTF-8 text. The notation parameter[]
where parameter is the last field in the message represents a variable-
length field that comprises all remaining bytes in the message.
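
The framing just described can be sketched as a single C function.
The name frame_message is hypothetical.  The message type is passed
in by the caller, since the numeric values of the constants in
<venti.h> are not given on this page.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Build a complete message in out: size[2] type[1] tag[1] params.
 * The size field counts every byte after itself.
 * Returns the total number of bytes written, or -1 if the message
 * is too large for the size field or for out.
 */
long frame_message(unsigned char *out, size_t outsize,
                   unsigned char type, unsigned char tag,
                   const unsigned char *params, size_t nparams)
{
	size_t size = 2 + nparams;	/* type + tag + parameters */

	assert(out != NULL);
	if (size > 0xffff || size + 2 > outsize)
		return -1;
	out[0] = (unsigned char)(size >> 8);	/* big-endian size */
	out[1] = (unsigned char)(size & 0xff);
	out[2] = type;
	out[3] = tag;
	if (nparams > 0)
		memcpy(out + 4, params, nparams);
	return (long)(size + 2);
}
```

A message with no parameters therefore occupies four bytes on the
wire, with a size field of 2.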
All Venti RPC messages are prefixed with a field size[2] giving the
length of the message that follows (not including the size field
itself). The message bodies are:
VtThello tag[1] version[s] uid[s] strength[1] crypto[n] codec[n]
VtRhello tag[1] sid[s] rcrypto[1] rcodec[1]
VtTping tag[1]
VtRping tag[1]
VtTread tag[1] score[20] type[1] pad[1] count[2]
VtRread tag[1] data[]
VtTwrite tag[1] type[1] pad[3] data[]
VtRwrite tag[1] score[20]
VtTsync tag[1]
VtRsync tag[1]
VtRerror tag[1] error[s]
VtTgoodbye tag[1]
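
As an example of reading the list above, the following sketch packs
the body of a VtTread message exactly as listed, starting at the tag
field.  The name pack_tread is hypothetical, and the caller is
assumed to supply a type value that has already been converted to the
number used in the protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum {
	ScoreSize = 20,
	/* tag[1] score[20] type[1] pad[1] count[2] */
	TreadBodySize = 1 + ScoreSize + 1 + 1 + 2
};

/* Fill body with the fields of a VtTread message, in wire order. */
void pack_tread(unsigned char body[TreadBodySize], unsigned char tag,
                const unsigned char score[ScoreSize],
                unsigned char type, unsigned count)
{
	unsigned char *p = body;

	assert(count <= 0xffff);	/* count is a two-byte field */
	*p++ = tag;
	memcpy(p, score, ScoreSize);
	p += ScoreSize;
	*p++ = type;
	*p++ = 0;				/* pad */
	*p++ = (unsigned char)(count >> 8);	/* count, big-endian */
	*p++ = (unsigned char)(count & 0xff);
	assert(p - body == TreadBodySize);
}
```

The resulting 25 bytes still need the size and message type fields in
front of them before they can be sent.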
Each T-message has a one-byte tag field, chosen and used by the client
to identify the message. The server will echo the request's tag field
in the reply. Clients should arrange that no two outstanding messages
have the same tag field so that responses can be distinguished.
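
One simple way for a client to keep outstanding tags distinct is a
table with one entry per possible tag value, as in the sketch below.
The TagPool type and its functions are hypothetical and are not taken
from venti(2).

```c
#include <assert.h>
#include <string.h>

enum { NTag = 256 };	/* a one-byte tag allows 256 distinct values */

typedef struct TagPool TagPool;
struct TagPool {
	unsigned char inuse[NTag];	/* nonzero while a request is outstanding */
};

void tagpool_init(TagPool *tp)
{
	memset(tp->inuse, 0, sizeof tp->inuse);
}

/* Reserve a tag not used by any outstanding request; -1 if none is free. */
int tag_alloc(TagPool *tp)
{
	int i;

	for (i = 0; i < NTag; i++) {
		if (!tp->inuse[i]) {
			tp->inuse[i] = 1;
			return i;
		}
	}
	return -1;
}

/* Release a tag once the reply carrying it has arrived. */
void tag_free(TagPool *tp, int tag)
{
	assert(tag >= 0 && tag < NTag && tp->inuse[tag]);
	tp->inuse[tag] = 0;
}
```

When tag_alloc returns -1, the client has 256 requests in flight and
must wait for a reply before sending another.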
The type of an R-message will either be one greater than the type of
the corresponding T-message or VtRerror, indicating that the request
failed. In the latter case, the error field contains a string
describing the reason for failure.
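
A client can apply this rule to the type byte of each reply before
decoding the rest of the message.  In the sketch below the function
name classify_reply and the ReplyOK, ReplyError, and ReplyUnexpected
constants are hypothetical.  The numeric value of the error message
type is passed in by the caller rather than assumed.

```c
#include <assert.h>

enum { ReplyOK, ReplyError, ReplyUnexpected };

/*
 * Classify the type byte rtype of an R-message received for a
 * request of type ttype.  rerror is the numeric value of the error
 * message type from <venti.h>.
 */
int classify_reply(int ttype, int rtype, int rerror)
{
	assert(ttype >= 0 && ttype < 255);
	if (rtype == rerror)
		return ReplyError;	/* the error[s] field explains why */
	if (rtype == ttype + 1)
		return ReplyOK;
	return ReplyUnexpected;	/* neither the matching reply nor an error */
}
```

Treating any other reply type as a protocol error is a choice made by
this sketch; the text above does not say how a client should react.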
Venti connections must begin with a hello transaction. The VtThello
message contains the protocol version that the client has chosen to
use. The fields strength, crypto, and codec could be used to add
authentication, encryption, and compression to the Venti session but
are currently ignored. The rcrypto and rcodec fields in the VtRhello
response are similarly ignored. The uid and sid fields are intended to
be the identity of the client and server but, given the lack of authen‐
tication, should be treated only as advisory. The initial hello should
be the only hello transaction during the session.
The ping message has no effect and is used mainly for debugging.
Servers should respond immediately to pings.
The read message requests a block with the given score and type. Use
vttodisktype and vtfromdisktype (see venti(2)) to convert between a
block type enumeration value (VtDataType, etc.) and the type used on
disk and in the protocol. The count field specifies the maximum
expected size of the block. The data in the reply is the block's
contents.
The write message writes a new block of the given type with contents
data to the server. The response includes the score to use to read the
block, which should be the SHA1 hash of data.
The Venti server may buffer written blocks in memory, waiting until
after responding to the write message before writing them to permanent
storage. The server will delay the response to a sync message until
after all blocks in earlier write messages have been written to perma‐
nent storage.
The goodbye message ends a session. There is no VtRgoodbye: upon
receiving the VtTgoodbye message, the server hangs up the connection.
SEE ALSO
venti(1), venti(2), venti(8)
Sean Quinlan and Sean Dorward, ``Venti: a new approach to archival
storage'', Usenix Conference on File and Storage Technologies, 2002.