Cool Tools - BIT
Binary Optimization with the Binary Improvement Tool (BIT) Software
Improving binary performance is a frequent request from customers.
These requests usually come from end customers of Sun systems or even
performance, benchmarking and production groups of large independent
software vendors (ISVs). The common theme is the non-availability of
the original source code. Without a re-compile, it is usually a hard,
time-consuming and costly endeavor to meaningfully improve binary
performance. Sometimes system tweaks to a non-optimized system will do
the trick, but often a complete system upgrade is necessary.
The Binary Optimizer is a tool that improves binary performance,
without the need for system changes or upgrades. This tool modifies the
binary by updating the binary instructions to generate more optimal
code. It can also instrument the binary for profile
collection. When data from such a profile training run is fed back to
the Binary Optimizer, significant performance improvements may be
achieved. This is especially true for binaries that were not built with
high levels of optimizations, or were built without profile data, or
even built with profile data that is not representative of the end
customer's unique workload.
What is the Binary Optimizer?
The Binary Optimizer is a static SPARC optimizer that
accepts a binary as input and creates an optimized binary as the
output. We define a binary as either an executable or a shared object.
The availability of the original source code is not a pre-requisite for
using this tool. It can optimize binaries irrespective of the original
source language (C, C++, FORTRAN, or a mix of those languages). The
binaries can be compiled and linked by either the Sun Studio
11 ("Studio") compiler or the GCC for SPARC Systems ("GCC") compiler.
A Quick-Start Guide
Without going into details about optimizations, command line options,
and debugging, here are the essential steps to optimize a binary.
- The binary must be compiled and linked with a special compiler option, -xbinopt=prepare. NOTE: When using the Studio compiler, optimizations (-O or -xOn) must also be on when compiling and linking the binary.
- The resulting binary should be instrumented for profile collection using the bit instrument subcommand.
- Run the application with one or more representative workloads.
- Optimize the binary with this profile data with the bit optimize subcommand.
Example:
% gcc -xbinopt=prepare -o myapp *.c
% bit instrument myapp
% myapp.instr < input_data
% bit optimize myapp
Why Use the Binary Optimizer?
The global optimizations performed by the Binary Optimizer usually show
greater performance improvements on large applications. We see the
following potential users of binary optimization technology:
End Users on SPARC Platforms:
Experienced users of Sun systems (for example, database administrators)
are often looking for ways to improve binary performance. For such users,
ready to go that extra mile to tune binaries they receive from software
vendors, the Binary Optimizer is an ideal tool.
For the software vendor, the only necessary step is to build the binary
with the -xbinopt=prepare compiler option.
For the end user, the following steps will optimize the binary for their
specific workload:
- Instrument the binary, using bit instrument.
- Run the instrumented binary on a representative workload.
- Use bit optimize to optimize the binary with the collected runtime data.
For example, the end user optimizes app using bit:
% bit instrument app
% app.instr < input_data
% bit optimize app
NOTE: Users should save a copy of their binary before optimizing,
because bit optimize replaces the original with
the optimized version.
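The backup advice in the note above can be wrapped in a small helper. This is only a sketch: the function name backup_binary and the ".orig" suffix are choices made for this example, not bit conventions.

```shell
#!/bin/sh
# Sketch: keep a pristine copy of a binary before bit optimize overwrites it.
# The ".orig" suffix is an arbitrary choice for this example, not a bit convention.

backup_binary() {
    bin=$1
    # Refuse to clobber an earlier backup, which may be the only unoptimized copy.
    if [ -e "$bin.orig" ]; then
        echo "backup $bin.orig already exists; leaving it alone" >&2
        return 1
    fi
    # -p preserves mode and timestamps so the copy stays executable.
    cp -p "$bin" "$bin.orig"
}

# Typical use, with "app" standing in for your binary:
#   backup_binary app && bit optimize app
```

Chaining with && means bit optimize only runs when the copy succeeded, so a failed backup never leaves you with just the optimized version.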
Software Vendors:
The Binary Optimizer performs optimizations that are not normally
performed by the compiler. Hence by including the Binary Optimizer
in the build process, a better performing production binary may be
obtained.
The steps necessary to create a production binary with the Binary Optimizer are:
- Compile the application with the -xbinopt=prepare flag.
- Instrument the resulting binary for profile collection using bit instrument.
- Run the instrumented application with one or more representative workloads.
- Optimize the binary using the collected profile data and bit optimize.
Example:
% cc -xO4 -xbinopt=prepare -o app *.c
% bit instrument app
% app.instr < input_data
% bit optimize app
It is important to note that if you are already using profile feedback
(-xprofile=collect|use compiler flags) to build the application,
it may be easier to use the -xlinkopt compiler flag in the build,
rather than using the Binary Optimizer, to obtain similar optimizations.
Performance With Binary Optimization
We see significant performance improvements on large applications when
the Binary Optimizer is used. This is especially true for applications
that are not built with profile feedback or are built with feedback
that does not truly represent the end customer's workload. In these
situations, a 10% or more performance gain is not unheard of.
The user must also be aware that using the Binary Optimizer causes
an increase in size of the binary. This is due to the fact that optimized
code is cached in a new segment in the binary. On large applications, an
increase in size of up to 1.8x is seen.
The Binary Optimizer runtime is usually a fraction of the build time
of the entire binary. For large applications, where the build time is
usually several hours, bit runtime can be measured in minutes. For
example, building a well known database application from source takes
over 5 hours. Performing binary optimizations on the resultant binary
takes 8 minutes.
Optimization Levels
The -O1 optimization level is the default level of optimization
for bit optimize. At this level, code ordering and control flow optimizations
are performed. While ordering code, functions may be split to optimize I-cache performance.
At the -O2 optimization level, data-flow information is constructed
and more aggressive optimizations are performed. These include inlining, address
simplification and load instruction optimizations. Usually better performance
is derived from using this higher level of optimization. The tradeoff is an
increased bit runtime.
At -O0, no optimizations are performed.
Profile Instrumentation
Collecting and using a profile of the execution characteristics of a binary
is crucial to making effective use of the Binary Optimizer. Instrumenting
a binary and executing a training run to collect the data is relatively easy
when using this tool. A single command line instruments the binary. The instrumented
binary may be freely copied to a potentially different run machine – it
is self contained, and no dependencies need to be maintained. It is
only necessary that the instrument, run and optimize phases be able
to find the instrdata file using the same path. If the binary is to
be run in more than one working directory, it is easiest to specify
an absolute path using the -d directory option during the
instrument and optimize phases, as shown in the example below. (See the
BIT User's Guide for more information.)
Accumulation of profile data from multiple training runs is another useful feature
– data is automatically accumulated until the binary is
reinstrumented or the instrdata file is removed.
When collecting a profile for applications that contain one or more executables
and/or shared objects, all binaries for which optimizations are planned need
to be instrumented. In the example below, the executable app
has a dependency on the shared object x.so. As demonstrated,
both binaries need to be instrumented and optimized
separately.
The -R flag causes the original binaries to
be overwritten with the instrumented or optimized code. We use this option
because, by default, the instrumented library would be saved as
x.so.instr, and the runtime linker wouldn't be able to
find it. With -R, x.so itself becomes instrumented.
Use caution when using the -R option: do not use it in a production
environment, make copies of the binaries before beginning, and make
sure no one else will be using the binaries. Otherwise, other users
might get an unpleasant surprise when the binary is instrumented and
the application slows down.
% bit instrument -R -d /home/me/prj app
% bit instrument -R -d /home/me/prj x.so
% app < input_data
% bit optimize -R -d /home/me/prj app
% bit optimize -R -d /home/me/prj x.so
Debugging
The Binary Optimizer maintains full compatibility with tools that statically
or dynamically examine a binary (analyzer(1), dbx(1),
pstack(1), etc.). The symbol tables are updated to reflect all
transformations. Mangled symbol names are often assigned (see the example
below), which are automatically de-mangled when displayed by the Studio tools.
If the prepared binary was built for debugging (with the -g
compiler option), debugging information is automatically propagated to the binary,
instead of leaving it in the object file by default. When such a binary is
optimized by bit, the available debugging information is updated to reflect the
transformations performed.
Example
Here is a small example to help understand how the Binary Optimizer transforms
the binary.
In the code below, there are three functions main(), add()
and sub(). The frequently executed parts of the code are denoted by the red
rectangles, while the less frequently executed code is colored green. The layout of the
optimized binary is shown on the right hand side. Here are some of the characteristics
of the new binary:
- The optimized code is placed in a new segment of the binary (named “Optimized code” in Figure 1 below).
- Functions may be split while laying out code (function main() is split; the hot fragment, which is not the entry point, is given the mangled name _$o1cexhO0.main()).
- The original functions are given new mangled names (_$r1.main(), _$b1.add(), _$b1.sub()).

Figure 1: Typical code layout from the binary optimizer
Additional Details
-xbinopt=prepare Considerations:
The -xbinopt=prepare compiler flag, when used to build a binary,
adds certain information to the binary that allows it to be transformed by
the Binary Optimizer. This information describes the location of the executable
code, points out control flow structures like function boundaries and switch
tables, and provides data flow information about the code. This data is stored
in a new ELF section named .annotate. This additional information in the
binary results in a 5% increase in size, on average. There is no noticeable
build time impact when this flag is used.
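Because the preparation data lives in the .annotate section, one way to check whether a binary was built with -xbinopt=prepare is to look for that section name in a section-header listing. The sketch below is an assumption-laden illustration, not part of bit: the helper name has_annotate_section is made up, and it reads the listing on standard input so that it does not depend on a particular ELF dumper.

```shell
#!/bin/sh
# Sketch: report whether a section-header listing mentions .annotate.
# Reads the listing on stdin so it works with any ELF dumper.
# Hypothetical helper name; not part of bit or the compilers.

has_annotate_section() {
    # -F: treat the pattern as literal text, -q: no output, status only
    grep -Fq '.annotate'
}

# Possible use (tool choice is an assumption; use whichever is installed):
#   elfdump -c app | has_annotate_section && echo "app is prepared"
#   readelf -S app | has_annotate_section && echo "app is prepared"
```

The check is a plain text match, so it would also accept any other section whose name happens to contain ".annotate".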
In addition, prepared binaries built for debugging (with the -g compiler
option), have an additional size increase due to the presence of debugging
information. The size increase varies depending on the compiler and
type of debugging information used.
Profile Instrumentation Considerations:
While doing a training run to collect binary profile information, the user
will notice a slowdown in application performance. This is to be expected
since there is an overhead associated with recording the execution count
profile of the executable code. Usually we see a 2.5 to 3x slowdown in
application performance.
There is also an increase in binary file size associated with adding instrumentation
code. We usually see a 2.5x increase in binary size due to profile instrumentation.
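For planning disk space, the typical factors quoted in this article (about 2.5x for an instrumented binary, up to 1.8x for an optimized one) can be turned into a rough estimate. The sketch below assumes both factors apply to the size of the prepared binary, which the article does not state explicitly; the function name estimate_sizes is made up for this example.

```shell
#!/bin/sh
# Sketch: rough disk-space budget from the typical factors quoted in this article.
# Assumption: both factors are applied to the size of the prepared binary.

estimate_sizes() {
    size=$1    # size of the prepared binary in bytes
    # Shell arithmetic is integer-only, so multiply first and divide by 10 after.
    echo "instrumented: $(( size * 25 / 10 )) bytes"
    echo "optimized (upper end): $(( size * 18 / 10 )) bytes"
}

# Example: a 40 MB (40000000-byte) binary
estimate_sizes 40000000
# prints:
#   instrumented: 100000000 bytes
#   optimized (upper end): 72000000 bytes
```

These are averages and upper bounds reported for large applications, so treat the output as a planning figure rather than a prediction.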
Finalization
As mentioned above, a binary that may be optimized by the Binary Optimizer
must be prepared using the -xbinopt=prepare compiler
flag. This results in additional information being placed in an ELF section
in the binary. When creating a final binary that is to be deployed on the
run systems, and on which no future optimizations are planned, the
-f option may be used to strip the -xbinopt=prepare
information from the resultant binary. This flag may be used to prevent users of
the binary from making any further modifications to it. For example:
% cc -xO4 -xbinopt=prepare -o app *.c
% bit instrument app
% app.instr < input_data
% bit optimize -f app
Handling Modules Not Built With -xbinopt=prepare
If the binary contains a combination of legacy code and newly created code
(built with -xbinopt=prepare), the Binary Optimizer may still be gainfully
employed. The Binary Optimizer optimizes only that code that was built with the
-xbinopt=prepare compiler option, leaving the legacy code untouched.
Conflicts
The Binary Optimizer has some restrictions. It will not optimize binaries built as follows:
- With the -xprofile=collect compiler option.
- With the -xlinkopt compiler option.
- When the binary has been stripped using the strip(1) tool, or the -s option of the Studio compiler.
- (The following two items apply only to versions of bit before 1.2.)
- Bit will not optimize that part of the code compiled with the -xF option of the Studio compiler.
- Bit will not optimize the template code portion of a C++ application compiled with the Studio compiler.
The Binary Optimizer also does not optimize those parts of the executable
code in a binary that were derived from assembly language files. As mentioned
earlier, code derived from object files compiled without the -xbinopt=prepare
flag is not optimized either. On the other hand, the presence of assembly
code or legacy object code in a binary does not prevent BIT
from optimizing the remainder of the binary.
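Of the restrictions above, a stripped binary is one that can be checked before running bit. The sketch below classifies a binary from the text printed by file(1). It assumes file(1) reports "not stripped" or "stripped" for ELF binaries, as common implementations do; confirm the wording on your system. The function name strip_status is made up for this example.

```shell
#!/bin/sh
# Sketch: classify a binary from the text printed by file(1).
# Assumption: file(1) says "not stripped" or "stripped" for ELF binaries;
# verify the exact wording on your system.

strip_status() {
    case $1 in
        # Test "not stripped" first, because it also contains "stripped".
        *"not stripped"*) echo "not stripped" ;;
        *stripped*)       echo "stripped" ;;
        *)                echo "unknown" ;;
    esac
}

# Possible use, with "app" standing in for your binary:
#   [ "$(strip_status "$(file app)")" = "stripped" ] && echo "bit will not optimize app"
```

A file name that itself contains the word "stripped" would confuse this simple match, so it is only suitable as a quick pre-flight check.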