prof_intro(1)
NAME
prof_intro - Introduction to application profilers, profiling,
optimization, and performance analysis
DESCRIPTION
Tru64 UNIX supports four approaches to performance improvement:
  o  Automatic and profile-directed optimizations. For example:
         pixie -update a.out data/*
         cc -non_shared -O3 -spike -feedback a.out *.c
  o  Manual design and code optimizations. For example:
         hiprof -all -display program data/* | more
         hiprof -flat -all -display program data/* | more
         uprofile -heavy program data/* | more
  o  Minimizing system-resource usage. For example:
         third -display program data/* | more
  o  Verifying significance of test cases. For example:
         pixie -testcoverage program data/* | more
One approach might be enough, but combining several can help when no
single approach addresses all aspects of a program's performance. The
following sections describe each approach and the tools that Tru64 UNIX
provides to support it.
AUTOMATIC AND PROFILE-DIRECTED OPTIMIZATIONS
Techniques
Automatic and profile-directed optimizations are the simplest
approaches to improving application performance.
Some degree of automatic optimization can be achieved by using the com‐
piler's and linker's optimization options. These can help in the gener‐
ation of minimal instruction sequences that make best use of the CPU
architecture and cache memory.
However, the compiler and linker can improve their optimizations if
they are given information on which instructions are executed most
often when the program is run with its normal input data and environ‐
ment. While the default optimizations give improved performance for
most common situations, the optimizers can do even better if they can
tune the program in favor of the heavily used instruction sequences as
determined from a sample run.
Tru64 UNIX helps you provide the optimizers with this information on
processing hot-spots by allowing a profiler's results to be fed back
into a recompilation. This customized, profile-directed optimization
can be used in conjunction with automatic optimization.
Tools and Examples
The cc compiler command's automatic optimization options are selected
with -O, -fast, -inline, -spike, and other related options. See cc(1)
for details and Chapter 10 of the Programmer's Guide for more informa‐
tion on the many options and tradeoffs available.
For example, this command selects a high degree of optimization in both
the compiler and the linker:
    cc -non_shared -O3 -spike *.c
The pixie profiler provides profile information that the cc command's
-feedback and -spike options can use to tune the generated instruction
sequences to the demands placed on the program by particular sets of
input data.
The steps, shown in the following example, consist of (1) preparing the
program for profile-directed optimization, (2) creating an instrumented
version of the program and running it to collect profiling statistics,
and (3) feeding that information back to the compiler and linker to
help them optimize the executable code:
    rm -f program
    cc -non_shared -feedback program -o program -O3 *.c
    pixie -update program
    cc -non_shared -feedback program -o program -O3 -spike *.c
To apply profile-directed optimizations to shared libraries, generate
profile data with an exerciser program, and store it in the shared
library prior to recompiling with that feedback. For example:
    rm -f libexample.so
    cc -feedback libexample.so -o libexample.so -shared -O3 lib*.c
    cc -o exerciser exerciser.c -L. -lexample
    pixie -L. -incobj libexample.so -run exerciser
    prof -pixie -update libexample.so exerciser.Counts
    cc -spike -feedback libexample.so -o libexample.so -shared -O3 lib*.c
MANUAL DESIGN AND CODE OPTIMIZATIONS
Techniques
The effectiveness of the automatic optimizations described previously
is limited by the efficiency of the algorithms that the program uses. A
program's performance can be further improved by manually optimizing
its algorithms and data structures. Such optimizations may include
reducing complexity from N-squared to log-N, avoiding copying of data,
and reducing the amount of data used. It may also extend to tuning the
algorithm to the architecture of the particular machine it will be run
on - for example, processing large arrays in small blocks such that
each block remains in the data cache for all processing, instead of the
whole array being read into the cache for each processing phase.
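The blocking idea mentioned above can be sketched in C as follows. This is a minimal illustration, not code from Tru64 UNIX: the two processing phases, the function names, and the BLOCK_ELEMS value are all hypothetical, and a suitable block size depends on the data cache of the target machine.

```c
#include <stddef.h>

/* Hypothetical block size in elements; choose it so that one block
   fits comfortably in the data cache of the target machine. */
#define BLOCK_ELEMS 1024

/* Two illustrative processing phases applied to a run of elements. */
static void scale_phase(double *a, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        a[i] *= 2.0;
}

static void offset_phase(double *a, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        a[i] += 1.0;
}

/* Unblocked: each phase walks the whole array, so a large array is
   read into the cache once per phase. */
void process_unblocked(double *a, size_t n)
{
    scale_phase(a, n);
    offset_phase(a, n);
}

/* Blocked: every phase runs on one block before moving to the next,
   so each block can stay in the cache for all of its processing. */
void process_blocked(double *a, size_t n)
{
    size_t start;
    for (start = 0; start < n; start += BLOCK_ELEMS) {
        size_t len = n - start;
        if (len > BLOCK_ELEMS)
            len = BLOCK_ELEMS;
        scale_phase(a + start, len);
        offset_phase(a + start, len);
    }
}
```

Both functions compute the same result; whether the blocked form is actually faster on a given machine is something a profile should confirm.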
Tru64 UNIX supports manual optimization with its profiling tools, which
identify the parts of the application that use the most CPU resources
(CPU cycles, cache misses, and so on). By evaluating different profiles
of a program, you can find those parts and then redesign or recode
their algorithms to use fewer resources. The profiles also make this
exercise more cost-effective by helping you to focus on the most
demanding code instead of the least demanding code.
Tools and Examples
(a) CPU-Time Profiling with Call-Graph
A call-graph profile shows how much CPU time is used by each procedure,
and how much is used by all of the other procedures that it calls. This
can show which phases or subsystems in a program spend most of the
total CPU time, which can help in gaining a general understanding of
the program's performance.
The hiprof profiler instruments the program and records a call graph
while the instrumented program executes. The hiprof profiler does not
require that the program be compiled in any particular way, but the
names of local (for example, static) procedures will be hidden if the
cc command's default -g0 option was used, and procedures will be hidden
if they are inlined. For example:
    cc -g1 -O2 -o program *.c
    hiprof -all -display program data/* | more
By default, hiprof uses a low-frequency sampling technique. It can pro‐
file all of the code executed by the program, including all selected
libraries, though its call graph excludes procedures in threads-related
system libraries. It can also provide detailed profiles at the level of
source lines or machine instructions.
For non-threaded programs, hiprof can alternatively count the number of
machine cycles used or page faults that occur during program execution.
In these modes, the CPU time or page-faults count reported for the
instrumented routines includes that for the uninstrumented routines
that they call. This can summarize the costs and reduce the run-time
overhead, but note that the machine-cycle counter wraps if no instru‐
mented procedure is called at least every few seconds.
The cc compiler's -pg option uses the same sampling technique as
hiprof. This technique is supported in a very similar way on different
vendors' UNIX systems. For example:
    cc -g1 -O2 -pg -o program *.c
    ./program data/*
    gprof program gmon.out | more
However, hiprof may be preferred because the -pg option has some
disadvantages:
  o  The program needs to be specially compiled with the -pg option.
  o  Only a few of the archive libraries that are provided with the
     operating system were compiled to generate a call-graph profile.
  o  Only the executable is profiled. Shared libraries are not.
The optional dxprof command provides a graphical display of various
call-graph profiles.
(b) CPU-Time/Event Profiles for Sourcelines/Instructions
A good performance-improvement strategy may start with a procedure-
level profile of the whole program (perhaps with a call graph too, to
give the big picture), but it will often progress to detailed profiling
of individual source-lines and instructions.
The uprofile profiler uses a sampling technique to generate a profile
of the CPU time or events such as cache misses associated with each
procedure or source-line or instruction. The sampling frequency depends
on the processor type and the statistic being sampled, but for CPU time
it is on the order of a millisecond. The profiler achieves this with‐
out modifying the target program at all by using hardware counters that
are built into the Alpha CPU. Running the uprofile command with no
arguments yields a list of all the kinds of events that a particular
machine can profile, depending on the nature of its architecture. The
default is to profile machine cycles, resulting in a CPU-time profile.
The following example shows how to display a profile of the source
lines that experienced the top 90% of data cache misses on an EV56
Alpha:
    cc -g1 -O2 -o program *.c
    uprofile -h -q 90cum% dcacheldmisses program data/* | more
This technique has the advantage of very low run-time overhead. Also,
the detailed information it can provide on the costs of executing indi‐
vidual instructions or source lines is essential in identifying exactly
which operation in a procedure is slowing down the program.
The disadvantages of uprofile are:
  o  Only executables can be profiled.
  o  The results can be skewed unless all processors have the same
     cycle speed.
  o  Only one program can be profiled with the hardware counters at
     one time.
  o  Threads cannot be profiled individually.
  o  The Alpha EV6 architecture's execution of instructions out of
     sequence can significantly reduce the accuracy of fine-grained
     profiles.
If hiprof's -flat option is used, its default sampling technique can
provide the same fine-grain profiles (CPU time only) and low intrusive‐
ness as uprofile. Also, it is accurate even with mixed processor cycle
speeds, and it can profile all of a program's shared libraries as well
as its individual threads. For example:
    hiprof -flat -h -all program data/* | more
The cc compiler's -p option uses the same low-frequency sampling tech‐
nique as hiprof. It is common to many UNIX systems, and (on Tru64 UNIX)
it is able to profile all the shared libraries used by a program. The
program needs to be relinked with the -p option, but it does not need
to be recompiled from source, so long as the original compilation used
an acceptable debug level, such as the -g1 compiler option. For
example, to profile individual instructions of a program:
    cc -p -o program *.o
    setenv PROFFLAGS '-all -stride 1'
    ./program data/*
    prof -all -asm -quit 5% program mon.out | more
The pixie tool can also profile source lines and instructions (includ‐
ing shared libraries), but note that when it displays counts of
“Cycles”, it is actually reporting counts of instructions executed, not
machine cycles. For example:
    cc -g1 -O2 -o program *.c
    pixie -all -lines -quit 20 program data/* | more
The optional dxprof command provides a graphical display of profiles
collected by either pixie or the cc command's -p option.
MINIMIZING SYSTEM RESOURCE USAGE
Techniques
The preceding techniques can improve an application's use of just the
CPU. Further performance improvements can be made by improving the
efficiency with which the application uses the other components of the
computer system: heap memory, disk files, network connections, and so
on.
As with CPU profiling, the first phase of a resource usage improvement
process is to monitor how much memory, data I/O and disk space, elapsed
time, and so on, is used. Then the throughput of the computer can be
increased or tuned in ways that help the program, or the program's
design can be tuned to make better use of the computer resources that
are available. For example:
  o  Reduce the size of the data files that the program reads and
     writes.
  o  Use memory-mapped files instead of regular I/O.
  o  Allocate memory incrementally on demand instead of allocating at
     start-up the maximum that could be required.
  o  Fix heap leaks, and do not leave allocated memory unused.
See the System Configuration and Tuning manual for a broader discussion
of analyzing and tuning a Tru64 UNIX system.
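As an illustration of allocating memory incrementally on demand, here is a minimal C sketch of a buffer that starts empty and grows only when elements are appended. The int_buffer type, its function names, and the initial capacity of 16 are hypothetical choices made for this example, not part of any Tru64 UNIX interface.

```c
#include <stdlib.h>

/* Hypothetical growable buffer of ints. It holds no memory until the
   first element is appended. */
typedef struct {
    int *data;
    size_t len;   /* elements in use */
    size_t cap;   /* elements allocated */
} int_buffer;

void int_buffer_init(int_buffer *b)
{
    b->data = NULL;
    b->len = 0;
    b->cap = 0;
}

/* Append one value, doubling the capacity when the buffer is full.
   Returns 0 on success, or -1 if memory could not be allocated, in
   which case the buffer is left unchanged. */
int int_buffer_push(int_buffer *b, int value)
{
    if (b->len == b->cap) {
        size_t new_cap = b->cap ? b->cap * 2 : 16;
        int *p = realloc(b->data, new_cap * sizeof *p);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = new_cap;
    }
    b->data[b->len++] = value;
    return 0;
}

/* Release the memory and return the buffer to its empty state. */
void int_buffer_free(int_buffer *b)
{
    free(b->data);
    int_buffer_init(b);
}
```

Doubling trades some unused capacity for fewer reallocations; a program that must keep its heap tight could grow by a smaller increment instead.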
Tools and Examples
(a) System Monitors
The Tru64 UNIX base system commands ps u, swapon -s, and vmstat 3 can
show the currently active processes' usage of system resources such as
CPU time, physical and virtual memory, swap space, page faults, and so
on.
The optional pview command provides a graphical display of similar
information for the processes that comprise an application.
The time commands provided by the Tru64 UNIX system and command shells
provide an easy way to measure the total elapsed time and CPU time for
a program and its descendants.
The collect tool is an optional, low-overhead system performance
monitor.
Many other related commands are described in the System Configuration
and Tuning manual.
(b) Heap Memory Analyzers
The third command reports heap memory leaks in a program, by instru‐
menting it with the Third Degree memory-usage checker, running it, and
displaying a log of leaks detected at program exit. For example:
    third -display program data/* | more
If you are interested only in leaks occurring during the normal opera‐
tion of the program, not during startup or shutdown, you can specify
additional places to check for previously unreported leaks. For
example, the pre-shutdown leak report will give this information:
    third -display -after startup -before shutdown program data/* | more
Third Degree can also detect various kinds of bugs that may be affect‐
ing the correctness or performance of a program. See the Programmer's
Guide for further details on debugging and leak-detection.
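For readers unfamiliar with heap leaks, the following C sketch shows the kind of defect that a leak report is meant to surface, together with its fix. The functions are hypothetical and written only for this illustration; the sketch does not show Third Degree output.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated lowercase copy of s, or NULL if memory
   could not be allocated. The caller must free the result. */
static char *lower_copy(const char *s)
{
    size_t n = strlen(s);
    size_t i;
    char *copy = malloc(n + 1);
    if (copy == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        copy[i] = (char)tolower((unsigned char)s[i]);
    copy[n] = '\0';
    return copy;
}

/* Count the lowercase vowels in s. */
static int count_vowels_in(const char *s)
{
    int count = 0;
    for (; *s != '\0'; s++)
        if (strchr("aeiou", *s) != NULL)
            count++;
    return count;
}

/* Leaky version: the temporary copy is never freed, so each call
   loses strlen(s) + 1 bytes of heap memory. */
int count_vowels_leaky(const char *s)
{
    char *copy = lower_copy(s);
    if (copy == NULL)
        return -1;
    return count_vowels_in(copy);   /* copy becomes unreachable here */
}

/* Fixed version: the copy is released before returning. */
int count_vowels(const char *s)
{
    char *copy = lower_copy(s);
    int count;
    if (copy == NULL)
        return -1;
    count = count_vowels_in(copy);
    free(copy);
    return count;
}
```

Both versions return the same count; the difference shows up only in heap usage, which is why a leak of this kind is easy to miss without a tool.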
The optional dxheap command provides a graphical display of Third
Degree's heap and bug reports.
The optional mview command provides a graphical analysis of heap usage
over time. This view of a program's heap can clearly show the presence
(if not the cause) of significant leaks or other undesirable trends
such as wasted memory.
VERIFYING SIGNIFICANCE OF TEST CASES
Techniques
Most of the preceding profiling techniques are effective only if you
profile and optimize or tune the parts of the program that are executed
in the scenarios whose performance is important. Careful selection of
the data used for the profiled test runs is often sufficient, but you
may want a quantitative analysis of which code was and was not executed
in a given set of tests.
Tools and Examples
The pixie command's -t[estcoverage] option reports lines of code that
were not executed in a given test run. For example:
    pixie -t program data/* | more
Conversely, pixie's -p[rocedure], -h[eavy], and -a[sm] options show
which procedures, source lines, and instructions were executed.
If multiple test runs are needed to build up a typical scenario, the
prof command can be run separately on a set of profile data files:
    pixie -pids program
    ./program.pixie data1/*
    ./program.pixie data2/*
    prof -pixie -t program program.Counts.*
SEE ALSO
Optimizing: cc(1), spike(1)
Profiling: hiprof(1), pixie(1), third(1), uprofile(1)
System Monitoring: collect(8), ps(1), swapon(1), vmstat(1)
Graphical tools, available from the Graphical Program Analysis subset
of the Tru64 UNIX Associated Products installation media, or as part of
the Enterprise Toolkit for Windows/NT desktops with Microsoft's Visual
Studio 97: dxheap(1), dxprof(1), mview(1), pview(1)
Programmer's Guide
System Configuration and Tuning