Cool Tools - BIT

Binary Optimization with the Binary Improvement Tool (BIT) Software

Improving binary performance is a frequent request from customers. These requests usually come from end customers of Sun systems or even performance, benchmarking and production groups of large independent software vendors (ISVs). The common theme is the non-availability of the original source code. Without a re-compile, it is usually a hard, time-consuming and costly endeavor to meaningfully improve binary performance. Sometimes system tweaks to a non-optimized system will do the trick, but often a complete system upgrade is necessary.

The Binary Optimizer is a tool that improves binary performance without the need for system changes or upgrades. It modifies the binary by rewriting its instructions into more efficient code. It can also instrument the binary for profile collection. When data from such a profile training run is fed back to the Binary Optimizer, significant performance improvements may be achieved. This is especially true for binaries that were not built with high levels of optimization, were built without profile data, or were built with profile data that is not representative of the end customer's unique workload.

What is the Binary Optimizer?

The Binary Optimizer is a static SPARC optimizer that accepts a binary as input and creates an optimized binary as output. We define a binary as either an executable or a shared object. The availability of the original source code is not a prerequisite for using this tool. It can optimize binaries irrespective of the original source language (C, C++, FORTRAN, or a mix of those languages). The binaries can be compiled and linked by either the Sun Studio 11 ("Studio") compiler or the GCC for SPARC Systems ("GCC") compiler.

A Quick-Start Guide
Without going into details about optimizations, command line options, and debugging, here are the essential steps to optimize a binary.

  • The binary must be compiled and linked with a special compiler option, -xbinopt=prepare.
    NOTE: When using the Studio compiler, optimizations (-O or -xOn) must also be on when compiling and linking the binary.
  • The resulting binary should be instrumented for profile collection using the bit instrument subcommand.
  • Run the application with one or more representative workloads.
  • Optimize the binary with this profile data with the bit optimize subcommand.

Example:

% gcc -xbinopt=prepare -o myapp *.c
% bit instrument myapp
% myapp.instr < input_data
% bit optimize myapp

Why Use the Binary Optimizer?

The global optimizations performed by the Binary Optimizer usually show greater performance improvements on large applications. We see the following potential users of binary optimization technology:

End Users on SPARC Platforms:
Experienced users of Sun systems (for example, database administrators) are often looking for ways to improve binary performance. For such users, ready to go that extra mile to tune binaries they receive from software vendors, the Binary Optimizer is an ideal tool.

For the software vendor, the necessary step to follow is:

  • The vendor ships a binary, app, built with the -xbinopt=prepare flag
    % cc -O -xbinopt=prepare *.c -o app

For the end user, the following steps will optimize the binary for their specific workload:

  • Instrument the binary, using bit instrument.
  • Run the instrumented binary on a representative workload.
  • Use bit optimize to optimize the binary with the collected runtime data.

For example, the end user optimizes app using bit:

% bit instrument app
% app.instr < input_data
% bit optimize app

NOTE: Users should save a copy of their binary before optimizing, because bit optimize replaces the original with the optimized version.
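
The note above can be followed mechanically. A minimal sketch, assuming a POSIX shell: the function name optimize_with_backup and the .orig suffix are illustrative choices and not part of bit; only the `bit optimize <binary>` invocation is taken from this page.

```shell
# Hypothetical helper: keep a pristine copy, then optimize in place.
# Only `bit optimize <binary>` comes from this page; the rest is illustrative.
optimize_with_backup() {
    bin=$1
    # Refuse to clobber an earlier backup (the .orig suffix is an assumption).
    if [ -f "$bin.orig" ]; then
        echo "backup $bin.orig already exists" >&2
        return 1
    fi
    # Preserve mode and timestamps on the saved copy.
    cp -p "$bin" "$bin.orig" || return 1
    bit optimize "$bin"
}
```

Calling `optimize_with_backup app` would leave the untouched binary in app.orig, so the optimization can be repeated later with a different profile.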

Software Vendors:
The Binary Optimizer performs optimizations that are not normally performed by the compiler. Hence, by including the Binary Optimizer in the build process, a better-performing production binary may be obtained.

The steps necessary to create a production binary with the Binary Optimizer are:

  • Compile the application with the -xbinopt=prepare flag.
  • Instrument the resulting binary for profile collection using bit instrument.
  • Run the instrumented application with one or more representative workloads.
  • Optimize the binary using the collected profile data and bit optimize.

Example:

% cc -xO4 -xbinopt=prepare -o app *.c
% bit instrument app
% app.instr < input_data
% bit optimize app

It is important to note that if you are already using profile feedback (-xprofile=collect|use compiler flags) to build the application, it may be easier to use the -xlinkopt compiler flag in the build, rather than the Binary Optimizer, to obtain similar optimizations.

Performance With Binary Optimization

We see significant performance improvements on large applications when the Binary Optimizer is used. This is especially true for applications that are not built with profile feedback or are built with feedback that does not truly represent the end customer's workload. In these situations, a 10% or more performance gain is not unheard of.

The user must also be aware that using the Binary Optimizer increases the size of the binary, because the optimized code is placed in a new segment of the binary. On large applications, a size increase of up to 1.8x is seen.

The Binary Optimizer runtime is usually a fraction of the build time of the entire binary. For large applications, where the build time is usually several hours, bit runtime can be measured in minutes. For example, building a well-known database application from source takes over 5 hours, while performing binary optimizations on the resultant binary takes 8 minutes.

Optimization Levels

The -O1 optimization level is the default level of optimization for bit optimize. At this level, code ordering and control flow optimizations are performed. While ordering code, functions may be split to optimize I-cache performance.

At the -O2 optimization level, data-flow information is constructed and more aggressive optimizations are performed. These include inlining, address simplification and load instruction optimizations. Usually better performance is derived from using this higher level of optimization. The tradeoff is an increased bit runtime.

At -O0, no optimizations are performed.
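
Choosing a level can be wrapped in a small helper. This is a sketch under assumptions: the function name optimize_at_level is hypothetical, and passing the level as -O<n> directly after the subcommand is modelled on the other bit optimize examples on this page rather than confirmed by it.

```shell
# Hypothetical wrapper: run bit optimize at a chosen level (0, 1 or 2).
# Placing -O<n> right after the subcommand is an assumption modelled on
# the other option examples on this page.
optimize_at_level() {
    level=$1
    bin=$2
    case $level in
        0|1|2)
            bit optimize "-O$level" "$bin"
            ;;
        *)
            echo "unsupported level: $level (expected 0, 1 or 2)" >&2
            return 2
            ;;
    esac
}
```

For example, `optimize_at_level 2 app` would request the more aggressive data-flow optimizations at the cost of a longer bit runtime.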

Profile Instrumentation

Collecting and using a profile of the execution characteristics of a binary is crucial to making effective use of the Binary Optimizer. Instrumenting a binary and executing a training run to collect the data is relatively easy when using this tool. A single command line instruments the binary. The instrumented binary may be freely copied to a potentially different run machine; it is self-contained, and no dependencies need to be maintained. It is only necessary that the instrument, run, and optimize phases be able to find the instrdata file using the same path. If the binary is to be run in more than one working directory, it is easiest to specify an absolute path using the -d directory option during the instrument and optimize phases, as shown in the example below. (See the BIT User's Guide for more information.)

Accumulation of profile data from multiple training runs is another useful feature – data is automatically accumulated until the binary is reinstrumented or the instrdata file is removed.
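
Because profile data accumulates across runs, several workloads can simply be run one after another. A minimal sketch, assuming the application reads its workload from standard input as in the examples on this page; the function name train_runs and the input file names are hypothetical.

```shell
# Hypothetical helper: feed several workloads to an instrumented binary.
# Usage: train_runs ./app.instr input1 input2 ...
train_runs() {
    instr=$1
    shift
    for input in "$@"; do
        # Each run adds to the existing profile data; nothing is reset here.
        "$instr" < "$input" || return 1
    done
}
```

After `train_runs ./app.instr input_small input_large`, a single bit optimize run would see the combined profile of both workloads.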

When collecting a profile for an application that consists of one or more executables and/or shared objects, every binary for which optimization is planned needs to be instrumented. In the example below, the executable app has a dependency on the shared object x.so. As demonstrated, both binaries need to be instrumented and optimized separately.

The -R flag causes the original binaries to be overwritten with the instrumented or optimized code. We use this option because, by default, the instrumented library would be saved as x.so.instr, and the runtime linker wouldn't be able to find it. With -R, x.so itself becomes instrumented.

Use caution with the -R option: do not use it in a production environment, make copies of the binaries before beginning, and make sure no one else will be using the binaries. Otherwise, other users might get an unpleasant surprise when the binary is instrumented and the application slows down.

% bit instrument -R -d /home/me/prj app
% bit instrument -R -d /home/me/prj x.so
% app < input_data
% bit optimize -R -d /home/me/prj app
% bit optimize -R -d /home/me/prj x.so 
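
The same pattern extends to any number of binaries. The sketch below is illustrative only: bit_each is a hypothetical helper, and the only thing taken from the example above is the `bit <subcommand> -R -d <dir> <binary>` form. The same cautions about -R apply.

```shell
# Hypothetical helper: apply one bit subcommand to several binaries,
# sharing a single profile directory.
# Usage: bit_each <instrument|optimize> <profile-dir> <binary>...
bit_each() {
    subcmd=$1
    dir=$2
    shift 2
    for bin in "$@"; do
        # -R overwrites the binary in place, so work on copies.
        bit "$subcmd" -R -d "$dir" "$bin" || return 1
    done
}
```

With this helper, the sequence above becomes `bit_each instrument /home/me/prj app x.so`, then the training run, then `bit_each optimize /home/me/prj app x.so`.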

Debugging

The Binary Optimizer maintains full compatibility with tools that statically or dynamically examine a binary (analyzer(1), dbx(1), pstack(1), etc.). The symbol tables are updated to reflect all transformations. Mangled symbol names are often assigned (see the example below), which are automatically de-mangled when displayed by the Studio tools.

If the prepared binary was built for debugging (with the -g compiler option), debugging information is automatically propagated to the binary rather than being left in the object files, as it would be by default. When such a binary is optimized by bit, the available debugging information is updated to reflect the transformations performed.

Example

Here is a small example to help understand how the Binary Optimizer transforms the binary.

In the code below, there are three functions main(), add() and sub(). The frequently executed parts of the code are denoted by the red rectangles, while the less frequently executed code is colored green. The layout of the optimized binary is shown on the right hand side. Here are some of the characteristics of the new binary:

  • The optimized code is placed in a new segment of the binary (named “Optimized code” in Figure 1 below).
  • Functions may be split while laying out code (function main() is split, the hot fragment which is not the entry point is given the mangled name _$o1cexhO0.main()).
  • The original functions are given new mangled names (_$r1.main(), _$b1.add(), _$b1.sub()).

Figure 1: Typical code layout from the binary optimizer

 

Additional Details

-xbinopt=prepare Considerations:
The -xbinopt=prepare compiler flag, when used to build a binary, adds certain information to the binary that allows it to be transformed by the Binary Optimizer. This information describes the location of the executable code, points out control flow structures like function boundaries and switch tables, and provides data flow information about the code. This data is stored in a new ELF section named .annotate. This additional information in the binary results in a 5% increase in size, on average. There is no noticeable build time impact when this flag is used.

In addition, prepared binaries built for debugging (with the -g compiler option) have an additional size increase due to the presence of debugging information. The size increase varies depending on the compiler and the type of debugging information used.

Profile Instrumentation Considerations:
While doing a training run to collect binary profile information, the user will notice a slowdown in application performance. This is to be expected, since there is an overhead associated with recording the execution count profile of the executable code. Usually we see a 2.5x to 3x slowdown in application performance.

There is also an increase in binary file size associated with adding instrumentation code. We usually see a 2.5x increase in binary size due to profile instrumentation.
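
For planning disk space and training time, the typical figures quoted on this page can be turned into rough estimates. This sketch assumes all multipliers are applied to the prepared binary and to the normal run time of the workload, which the page does not state explicitly; estimate_overheads is a hypothetical name, and real results will vary.

```shell
# Rough planning estimates from the typical figures quoted on this page:
# ~2.5x instrumented size, up to 1.8x optimized size, 2.5x to 3x slowdown.
# Integer arithmetic only; every result is an approximation.
estimate_overheads() {
    size=$1      # size of the prepared binary, in bytes
    secs=$2      # normal run time of the workload, in seconds
    echo "instrumented size: ~$((size * 25 / 10)) bytes"
    echo "optimized size: up to $((size * 18 / 10)) bytes"
    echo "training run time: ~$((secs * 25 / 10)) to $((secs * 3)) seconds"
}
```

For instance, `estimate_overheads 1000000 60` gives a quick sense of how much room the instrumented copy and the training run are likely to need.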

Finalization
As mentioned above, a binary that is to be optimized by the Binary Optimizer must be prepared using the -xbinopt=prepare compiler flag. This results in additional information being placed in an ELF section in the binary. When creating a final binary that is to be deployed on the run systems, and on which no future optimizations are planned, the -f option of bit optimize may be used to strip the -xbinopt=prepare information from the resultant binary. This option may also be used to prevent users of the binary from making any further modifications to it. For example:

% cc -xO4 -xbinopt=prepare -o app *.c
% bit instrument app
% app.instr < input_data
% bit optimize -f app

Handling Modules Not Built With -xbinopt=prepare
If the binary contains a combination of legacy code and newly created code (built with -xbinopt=prepare), the Binary Optimizer may still be gainfully employed. The Binary Optimizer optimizes only the code that was built with the -xbinopt=prepare compiler option, leaving the legacy code untouched.

Conflicts
The Binary Optimizer has some restrictions. It will not optimize binaries that were:

  • Built with the -xprofile=collect compiler option.
  • Built with the -xlinkopt compiler option.
  • Stripped using the strip(1) tool or the -s option of the Studio compiler.

In addition, versions of bit before 1.2 will not optimize:

  • The part of the code compiled with the -xF option of the Studio compiler.
  • The template code portion of a C++ application compiled with the Studio compiler.

The Binary Optimizer also does not optimize those parts of the executable code in a binary that were derived from assembly language files. As mentioned earlier, code derived from object files compiled without the -xbinopt=prepare flag is not optimized either. On the other hand, the presence of assembly code or legacy object code in a binary does not prevent BIT from optimizing the remainder of the binary.
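
One of these restrictions, the stripped binary, can be checked before invoking bit. A minimal sketch, assuming file(1) reports "not stripped" for ELF binaries that still carry a symbol table; is_unstripped is a hypothetical name, and this check says nothing about the other restrictions listed above.

```shell
# Hypothetical pre-flight check: bit will not optimize a stripped binary.
# Relies on file(1) printing "not stripped" for unstripped ELF files.
is_unstripped() {
    file "$1" | grep 'not stripped' > /dev/null
}
```

A typical use would be `is_unstripped app && bit instrument app`, which skips the instrument step for a binary that bit would refuse anyway.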