NAME
grio2 - Guaranteed Rate I/O Version 2
DESCRIPTION
Guaranteed-Rate I/O (GRIO) refers to a guarantee made by the system to a
user process that it will deliver data from a storage device at a
predefined rate regardless of any other I/O activity on the system or on
other nodes within its cluster. If a process issues I/O faster than its
reserved rate, GRIO throttles that I/O as necessary so that the process
does not exceed its reservation.
Terminology
The term reservation is used to refer to the set of Quality-of-Service
(QoS) parameters (bandwidth, reservation interval) requested by an
application. Reservation requests are forwarded to the GRIO bandwidth
management daemon ggd2(1M). If the request is granted then the
application is said to have received a guarantee from the system that its
QoS requirements will be met. Within the kernel an object is instantiated
that encodes the requested QoS parameters and maintains the necessary
scheduling and monitoring state. This object is referred to as a GRIO
stream. The stream ID is returned to the user application. Stream IDs
are unique across reservations and across the cluster.
This manual page describes the second version of the GRIO product. Where
it is necessary to distinguish between this release and the previous
release, the terms GRIOv2 and GRIOv1 are used. Where the term GRIO is used
without qualification it refers to the second version of the product.
BACKGROUND
GRIOv1 was designed for use with tightly controlled, locally attached
storage devices. It depends on detailed performance data for every piece
of hardware in the I/O path, including the storage devices themselves,
the SCSI and Fibre Channel buses, and system interconnects and bridges. It
only works with the XLV volume manager and does not support shared CXFS
filesystems.
Modern storage systems are moving towards large interconnected Storage
Area Networks (SANs) in which heterogeneous systems and storage devices
are connected via a dedicated high-speed network. In this model, large
storage resources, such as multi-terabyte RAID devices, are shared
amongst a number of clients using a shared filesystem such as CXFS.
GRIOv2 has been created to broaden the GRIO QoS framework to this next
generation of storage architectures.
Its key features are as follows:
1. Support for shared filesystems and clustered heterogeneous
operation.
GRIOv2 has been designed from the outset to work with the XVM volume
manager and fully supports guaranteed-rate I/O to both local XFS and
shared CXFS filesystems. It is designed to manage I/O from multiple
heterogeneous nodes and to ensure that a GRIO reservation on one
node is not affected by I/O elsewhere in a cluster.
grio2(5)grio2(5)
2. A new filesystem-level performance qualification model.
GRIOv1 uses a complicated per-device qualification model, in which
the maximum sustainable bandwidth for each component in the I/O
path, from disk device to memory, is qualified separately. A
synthetic benchmark grio_bandwidth(1M) is used to profile individual
storage devices.
GRIOv1 depends on this information being complete and accurate. This
approach is appropriate for the tightly controlled environment of a
locally attached filesystem. However, as storage networks become
increasingly heterogeneous and topologies increasingly complex, this
approach becomes impractical.
As a result, GRIOv2 has moved to a filesystem-level qualification
model in which the maximum sustainable bandwidth is measured across
the entire filesystem under a realistic application workload.
Empirical measurement of actual filesystem performance is used to
determine the QoS parameters that can be delivered in practice by a
particular configuration. This is referred to as the qualified
bandwidth for the filesystem (and the XVM volume on which it
resides).
For local volumes the qualified bandwidth is stored in /etc/griotab;
for shared volumes it is stored in the cluster configuration
database (CDB). Refer to the GRIO Version 2 Guide, ggd2(1M) and
griotab(4) for more information on measuring and setting the
qualified bandwidth for a filesystem.
3. Comprehensive QoS Monitoring.
GRIOv2 provides comprehensive tools for measuring and monitoring
delivered QoS levels. This includes in-kernel collection of per-
stream performance metrics. Refer to grioqos(1M) for further
information.
The information provided by the QoS facilities can be used to help
choose the tradeoff between resource utilisation and delivered I/O
performance that is most appropriate for a given application mix,
workload, and production environment.
4. Cluster-wide encapsulation and control of non-GRIO I/O.
When GRIOv2 begins managing an XVM volume, every node with access to
that volume is notified. From that point on, all user and system I/O
that doesn't have an explicit GRIO reservation is encapsulated. This
means that all non-GRIO I/O is automatically associated with a
system managed nongrio kernel stream.
The central bandwidth management component of GRIOv2, ggd2, allocates
otherwise unused filesystem bandwidth to these streams, allowing
non-GRIO I/O to be processed even when there are active reservations
in the system. ggd2 dynamically adjusts the amount of bandwidth
allocated for this purpose based on monitoring of filesystem demand
and utilisation. In addition to this Dynamic Bandwidth Allocation,
an administrator can reserve bandwidth at the node level for use by
all non-GRIO applications running on that node; this is referred to
as a Static Bandwidth Allocation. Refer to ggd2(1M) and
grioadmin(1M) for more information.
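The relationship between qualified bandwidth, reservations, and the
bandwidth left over for non-GRIO I/O can be illustrated with a toy ledger.
This is only a sketch under stated assumptions: it is not the algorithm
used by ggd2, the class and method names are hypothetical, the units are
arbitrary, and it ignores the demand-based adjustment that ggd2 performs
for the Dynamic Bandwidth Allocation.

```python
class FilesystemBudget:
    """Toy bandwidth ledger for one filesystem. Not ggd2's real algorithm."""

    def __init__(self, qualified_bw, static_nongrio=0):
        if static_nongrio > qualified_bw:
            raise ValueError("static allocation exceeds qualified bandwidth")
        self.qualified_bw = qualified_bw      # hypothetical units, e.g. MB/s
        self.static_nongrio = static_nongrio  # Static Bandwidth Allocation
        self.reservations = {}                # stream ID -> reserved bandwidth
        self._next_id = 1

    def reserved(self):
        """Total bandwidth held by active reservations."""
        return sum(self.reservations.values())

    def available(self):
        """Bandwidth that could still be granted to new reservations."""
        return self.qualified_bw - self.static_nongrio - self.reserved()

    def reserve(self, bw):
        """Grant a reservation and return a stream ID, or None if refused."""
        if bw <= 0 or bw > self.available():
            return None
        stream_id = self._next_id
        self._next_id += 1
        self.reservations[stream_id] = bw
        return stream_id

    def release(self, stream_id):
        """Return a stream's bandwidth to the pool."""
        self.reservations.pop(stream_id, None)

    def nongrio_bandwidth(self):
        """Static allocation plus bandwidth no reservation currently holds."""
        return self.qualified_bw - self.reserved()


fs = FilesystemBudget(qualified_bw=200, static_nongrio=20)
a = fs.reserve(100)  # granted
b = fs.reserve(100)  # refused: only 80 remain after the static allocation
print(a, b, fs.available(), fs.nongrio_bandwidth())
```

In this model a request is refused rather than partially granted, which
matches the all-or-nothing sense of a guarantee described under
Terminology.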
USAGE RESTRICTIONS
In order to utilize a GRIO reservation a file must be read or written
using direct I/O. The open(2) manual page describes the use and buffer
alignment restrictions of the direct I/O interface. A GRIO reservation
can be made for any file within an XFS or CXFS filesystem created on an
XVM volume.
In some applications more deterministic performance can be achieved by
creating files on a dedicated real-time subvolume. To allocate a file on
the real-time subvolume of an XFS or CXFS filesystem the fcntl(2)
F_FSSETXATTR command must be used to set the XFS_XFLAG_REALTIME flag.
This can only be issued on a newly created file. It is not possible to
mark a file as real-time once non-real-time data blocks have been
allocated to it.
SOFTWARE COMPONENTS
GRIOv2 functionality is distributed between three main components: the
new guarantee-granting daemon ggd2; the userspace library libgrio2 and
command line utilities; and the kernel.
ggd2(1M) is a user level process started at system boot. It is
responsible for activating and deactivating the GRIOv2 kernel scheduler,
processing client requests to reserve and release bandwidth, tracking
bandwidth utilisation, managing unallocated bandwidth, and enforcing the
GRIOv2 software licenses.
grioadmin(1M) is used to perform node-level administration tasks for XFS
and CXFS filesystems including: querying available bandwidth, listing
active GRIO reservations, and creating, modifying and releasing node-
level static bandwidth allocations.
grioqos(1M) is used to extract and report the QoS metrics that GRIO
maintains for each stream.
libgrio2 implements the GRIOv2 userspace API. User processes communicate
with the daemon using the following core API calls:
grio_avail() - get available bandwidth for a filesystem
grio_reserve() - reserve bandwidth from a filesystem
grio_reserve_fd() - reserve bandwidth and bind a file descriptor
grio_bind() - bind a file descriptor to a stream
grio_unbind() - unbind a file descriptor
grio_modify() - modify an existing stream
grio_get_stream() - map a bound file descriptor to its stream ID
grio_release() - signal that a stream should be reclaimed
The process that initially reserves bandwidth with a call to grio_reserve
or grio_reserve_fd is referred to as the owning process. Any streams not
already released when their owning process exits will be automatically
released. Streams can be shared between processes. The ownership of a
GRIOv2 stream is non-transferable.
GRIOv2 functionality in the kernel includes stream management, the I/O
scheduler, cluster integration and messaging.
DEPLOYMENT CONSIDERATIONS
There are two important constraints that must be observed when setting up
GRIOv2 filesystems:
1. If any of the luns on a particular device will be managed as GRIO
volumes, then all of the luns should be managed as GRIO volumes.
Typically there will be hardware contention between separate luns,
both in the SAN and within the storage device. If only a subset of
the luns are managed, I/O to the unmanaged luns could still cause
oversubscription of the device and in turn violate GRIO rate
guarantees on the managed volumes.
2. For a similar reason, a storage device containing GRIO-managed
volumes should not be shared between clusters. The GRIO daemons
running within different clusters are not coordinated, and unmanaged
I/O from one cluster can cause GRIO rate guarantees in the other
cluster to be violated.
It may be appropriate to relax these constraints if a storage device can
be configured such that there is no internal or external contention
between independent luns.
DATA LAYOUT & EXAMPLE
This section provides some tips on how to set up a filesystem on a RAID
device to achieve correct filesystem device alignment and maximise I/O
performance. There are three steps that are essential to ensuring correct
filesystem alignment:
1. Ensuring that each data partition is correctly aligned with the
internal disk layout of its lun.
2. Setting XVM stripe parameters correctly.
3. Passing correct volume geometry (stripe unit and width) to
mkfs_xfs(1M).
These three issues are demonstrated with an example.
Consider a RAID device with 28 disks arranged as 4 volume groups, with 7
disks per volume group, with each volume group configured as 6+1 RAID 5
(6 data disks, 1 parity disk). These are mapped directly to 4 luns - 1
lun per volume group.
If the back end transfer size of the RAID device is 128 KB (i.e. the size
of transfers between the RAID controllers and individual disks), then
each lun will have an aligned transfer size of 6*128 KB which is 768 KB
or 1536 filesystem blocks (512 bytes each).
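The arithmetic above can be checked with a short sketch. The function name
and its parameters are illustrative and not part of GRIO; the 512-byte
block size is the one stated above.

```python
KIB = 1024
BLOCK_SIZE = 512  # bytes per filesystem block, as stated above


def aligned_transfer_blocks(data_disks, backend_transfer_kib,
                            block_size=BLOCK_SIZE):
    """Return the aligned transfer size of a lun in filesystem blocks.

    This is one back end transfer on every data disk of the volume
    group; the parity disk carries no user data and is not counted.
    """
    total_bytes = data_disks * backend_transfer_kib * KIB
    if total_bytes % block_size:
        raise ValueError("transfer size is not a whole number of blocks")
    return total_bytes // block_size


# 6+1 RAID 5 with a 128 KB back end transfer size
lun_blocks = aligned_transfer_blocks(data_disks=6, backend_transfer_kib=128)
print(lun_blocks)  # 1536
```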
The first step is to ensure that the raw data partitions are correctly
aligned with the start of their corresponding luns (i.e. the first disk
in the volume group). In this case luns are 1536 blocks wide, so the
start of the data partition should be a multiple of this number. As we
already require space at the start of the lun for the volume header (e.g.
4096 blocks by default for XLV/XVM) a good choice would be to move the
start of the data partition to 4*1536 or 6144 blocks.
GRIOv2 can only be used with XVM volumes, so xvm(1M) is used to partition
each lun. The location of the data partition is controlled by adjusting
the size of the volume header and the size of the XVM volume label. In this
case, the following options are passed to the label command:
xvm> label -volhdrblks 5120 -xvmlabelblks 1024 <devname>
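The choice of 6144 blocks and the label options above can be reproduced
with a short sketch. It assumes, as the example suggests but this page does
not state outright, that the data partition begins immediately after the
volume header and the XVM label, so that its start is the sum of the two
sizes. The helper name is illustrative.

```python
def first_aligned_start(min_reserved_blocks, alignment_blocks):
    """Smallest multiple of alignment_blocks >= min_reserved_blocks."""
    multiples = -(-min_reserved_blocks // alignment_blocks)  # ceiling division
    return multiples * alignment_blocks


LUN_WIDTH = 1536   # aligned transfer size of one lun, in blocks
MIN_VOLHDR = 4096  # default volume header size quoted above
XVM_LABEL = 1024   # -xvmlabelblks value from the label command

# Assumption: data starts right after the volume header and XVM label.
start = first_aligned_start(MIN_VOLHDR + XVM_LABEL, LUN_WIDTH)

# Grow the volume header so that header + label ends exactly at `start`.
volhdrblks = start - XVM_LABEL

print(start, volhdrblks)  # 6144 5120
```

Under this assumption 3*1536 (4608 blocks) is too small to hold both the
default volume header and the label, which is why the next multiple is
used.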
The luns are then arranged into a stripe. The stripe unit must match the
aligned transfer size of the luns (or a multiple thereof). This is
specified in the stripe subcommand as follows:
xvm> stripe -unit 1536 <slices ...>
Now a filesystem is created on the XVM volume. If the stripe is used as
the data subvolume the following command creates a filesystem with the
correct alignment:
mkfs_xfs -d sunit=1536,swidth=6144 <xvm_devname>
As there are four luns in total the stripe width swidth is four times the
aligned transfer size of the individual luns. Specifying the stripe unit
and width to mkfs_xfs allows it to ensure that key internal regions of
the filesystem are correctly aligned with the underlying volume
structure.
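The stripe unit and width used in the two commands above follow directly
from the lun geometry. The sketch below derives them and prints the
commands as they appear in this example; the function name and the
unit_multiple parameter are illustrative.

```python
def stripe_geometry(lun_blocks, lun_count, unit_multiple=1):
    """Return (stripe unit, stripe width) in 512-byte blocks.

    The stripe unit is the aligned transfer size of a lun, or a
    multiple of it; the stripe width covers every lun in the stripe.
    """
    if lun_count < 1 or unit_multiple < 1:
        raise ValueError("lun_count and unit_multiple must be at least 1")
    sunit = lun_blocks * unit_multiple
    swidth = sunit * lun_count
    return sunit, swidth


sunit, swidth = stripe_geometry(lun_blocks=1536, lun_count=4)
print(f"xvm> stripe -unit {sunit} <slices ...>")
print(f"mkfs_xfs -d sunit={sunit},swidth={swidth} <xvm_devname>")
```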
If the stripe is used as the realtime subvolume then the realtime extent
size should be set to a multiple of the volume stripe width. This extent
size also becomes the optimal I/O size that should be used by
applications doing I/O to the filesystem. The following command sets the
extent size to the stripe width (note that the 'b' suffix is required to
specify filesystem blocks):
mkfs_xfs -r extsize=6144b <xvm_devname>
This will optimize the filesystem for I/Os spanning the entire disk
array.
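Because the realtime extent size is also the optimal application I/O size,
it can be useful to convert it to bytes and to check that a planned request
is aligned to it. The sketch below does both for the geometry in this
example; the helper names are illustrative and the 512-byte block size is
the one used throughout this section.

```python
BLOCK_SIZE = 512  # bytes per filesystem block, as used in this example


def realtime_extent_blocks(stripe_width_blocks, multiple=1):
    """Return a realtime extent size that is a multiple of the stripe width."""
    if multiple < 1:
        raise ValueError("multiple must be at least 1")
    return stripe_width_blocks * multiple


def is_extent_aligned(offset_bytes, length_bytes, extent_blocks):
    """True if a request starts on, and spans whole, realtime extents."""
    extent_bytes = extent_blocks * BLOCK_SIZE
    return (length_bytes > 0
            and offset_bytes % extent_bytes == 0
            and length_bytes % extent_bytes == 0)


extsize = realtime_extent_blocks(6144)
io_bytes = extsize * BLOCK_SIZE  # 3145728 bytes, i.e. 3 MiB

print(f"mkfs_xfs -r extsize={extsize}b <xvm_devname>")
print(io_bytes)
```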
Note that if a non-GRIO XFS filesystem were created directly on one of
these luns, the fx(1M) command would be used to partition the disk and move
the start of the data partition. For example, the following sequence of
commands:
fx -x -d <devname>
fx> repartition
fx/repartition> optiondrive
...
fx/repartition> expert -b
...
will partition a drive as an option drive and then allow the layout of
the partitions to be adjusted interactively (-b specifies input values
are in filesystem blocks). The data partition should be selected and its
first block moved to 6144 - placing the start of the data partition on
the first disk of the lun.
Remember, however, that an XFS filesystem must be made on an XVM volume
if it is to be managed by GRIO.
FILES
/etc/griotab
SEE ALSO
ggd2(1M), grioadmin(1M), griotab(4), grio_avail(3X), grio_bind(3X),
grio_modify(3X), grio_release(3X), grio_reserve(3X), grio_reserve_fd(3X),
grio_unbind(3X), grioqos(1M)