By Darryl Gove, Senior Performance Engineer, Sun Microsystems.
This article suggests how to get the best performance from an UltraSPARC
processor running on the latest Solaris systems by
compiling with the best set of compiler options and the latest
GCC for SPARC Systems
compilers. These are suggestions of things you should try, but before
you release the final version of your program, you should understand
exactly what you have asked the compiler to do.
-
The fundamental questions
-
There are two questions that you need to ask when compiling your program:
1. What do I know about the platforms that this program will run on?
2. What do I know about the assumptions that are made in the code?
The answers to these two questions determine what compiler options you
should use.
-
The target platform
-
What platforms do you expect your code to run on? The choice of platform
determines:
1. 32-bit or 64-bit instruction set
2. Instruction set extensions the compiler can use
3. Instruction scheduling depending on instruction execution times
4. Cache configuration
The first three are often the most important ones.
-
32-bit versus 64-bit code
-
The UltraSPARC family of processors can run both 32-bit and
64-bit code. In general it is not possible to determine, without testing
the application, whether better performance will be obtained with 32-bit
or 64-bit code; there are several factors which influence performance:
-
When moving from 32-bit to 64-bit code, the memory footprint of
the application typically gets bigger, because long, unsigned
long, and pointers all change from 32 bits to 64 bits in size.
The larger footprint puts more pressure on the caches, so some
applications will run more slowly.
- The programming model for the UltraSPARC processor allows 32-bit
applications to use the same set of features as 64-bit
applications. As such there is often little to be gained by
targeting 64-bit.
- The primary reason to use 64-bit code is that the application
needs to address more data in memory than a 32-bit address space
(4GB) allows.
For additional details about migrating from 32-bit to 64-bit code, refer
to
Converting 32-bit Applications Into 64-bit Applications: Things to
Consider.
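The footprint growth described above can be seen with a small C sketch. The struct node type and the node_array_bytes function are hypothetical names invented for this illustration; the sizes quoted in the comments assume the usual 32-bit (ILP32) and 64-bit (LP64) data models.

```c
#include <stddef.h>

/* Hypothetical linked-list node, used only to illustrate footprint growth. */
struct node {
    struct node *next;   /* pointer: 4 bytes in 32-bit code, 8 bytes in 64-bit code */
    long         key;    /* long: 4 bytes in 32-bit code, 8 bytes in 64-bit code */
    int          flags;  /* int: 4 bytes in both data models */
};

/* Bytes needed to hold `count` nodes in a contiguous array. */
size_t node_array_bytes(size_t count)
{
    return count * sizeof(struct node);
}
```

Under these assumptions the node typically occupies 12 bytes in 32-bit code and 24 bytes in 64-bit code (including alignment padding), so the same data structure needs about twice the cache and memory.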
-
Specify the Target Platform and Architecture as Explicitly as
Possible
-
The target platform specifies the processor that the application is
expected to run on and the minimum processor that is required. With the
GCC for SPARC Systems compilers, that target is an UltraSPARC processor
for the SPARC architecture. It is always a good idea to specify the
target architecture explicitly, so that it cannot change unintentionally
when other compiler flags change.
There are a number of compiler flags that work together to specify the
target architecture. The flag -xtarget sets all the other flags
(-xarch, -xchip, and -xcache) to appropriate default values for
the given target processor. The flag -xarch sets the instruction set
that the processor supports, and the flag -xchip specifies how the
compiler should use these instructions. Finally, the flag -xcache
specifies the structure of the caches for this target (although this
flag makes no difference to many codes). As with all compiler flags, the
order is important: flags accumulate from left to right, and where
settings conflict, the flag on the right overrides the values of flags
specified earlier on the command line.
A point to be cautious of is that specifying a more recent hardware
target may mean that older hardware is no longer able to run the
application. In particular specifying the target as being an UltraSPARC
platform means that the application will no longer run on pre-UltraSPARC
processors (however UltraSPARC processors have been shipping for over 10
years).
-
Using -xtarget=generic
-
The compiler supports the options -xtarget=generic and
-xtarget=generic64. These options tell the compiler to produce code
which runs well on as wide a range of machines as possible. The compiler
evolves the meaning of 'generic' as new processors are introduced, so
the flag is the best option if the same binary has to be run over a
range of processors.
-
Specifying the target platform for the UltraSPARC-III family of
processors
-
Because -xtarget=generic favours code that runs well on a wide range
of processors rather than on a particular processor, there may be times
when it does not produce the best performance. Consequently it is worth
comparing the performance of the generic code with a build of the
application specifically targeted for a particular processor family.
For UltraSPARC processors, a generally good option pair to use is
-xtarget=ultra3 with -xarch=v8plusa. These options allow the
compiler to generate 32-bit code that can run on all the members of the
UltraSPARC family and their follow-ons (UltraSPARC I, UltraSPARC II,
UltraSPARC III, UltraSPARC IV). The compiler will also schedule the code
especially for the UltraSPARC III. These options represent a good
compromise, since code scheduled for the UltraSPARC III is better at
taking advantage of the new features of the UltraSPARC III architecture,
while still providing good performance on previous generations of
processors.
If the application needs 64-bit addressing, then the appropriate flags
to use are -xtarget=ultra3 -xarch=v9a, which add 64-bit addressing
while still targeting all the members of the UltraSPARC family of
processors.
|
Recommended compiler flags for the UltraSPARC platform
|
32-bit code: -xtarget=ultra3 -xarch=v8plusa
|
|
64-bit code: -xtarget=ultra3 -xarch=v9a
|
-
Optimization and debug
-
The optimization flags chosen alter three important characteristics: the
runtime of the compiled application, the length of time that the
compilation takes, and the amount of debugging that is possible with the
final binary. In general, the higher the level of optimization, the
faster the application runs (and the longer it takes to compile), but
the less debug information is available. The particular impact of
optimization levels will vary from application to application.
The easiest way of thinking about this is to consider three degrees of
optimization, as outlined in the following table.
| Purpose | Flags | Comments |
| Full debug | [no optimization flags] -g | The application will have full debug capabilities, but almost no optimization will be performed on the application, leading to lower performance. |
| Optimized | -g -O | The application will have good debug capabilities, and a reasonable set of optimizations will be performed on the application, typically leading to significantly better performance. |
| High optimization | -g -fast | The application will have good debug capabilities, and a large set of optimizations will be performed on the application, typically leading to higher performance. |
|
Suggestion
|
In general an optimization level of at least -O is suggested. The two
situations where lower levels might be considered are (i) where more
detailed debug information is required and (ii) where the semantics of
the program require that variables are treated as volatile, in which
case the optimization level should be lowered to -O1.
|
-
More details on debug information
-
The compiler will generate information for the debugger if the -g flag
is present. For lower levels of optimization, the -g flag disables
some minor optimizations (to make the generated code easier to debug).
At higher levels of optimization, the presence of the flag does not
alter the code generated (or its performance) -- but be aware that at
high levels of optimization it is not always possible for the debugger
to relate the disassembled code to the exact line of source, or for it
to determine the value of local variables held in registers rather than
stored to memory.
A very strong reason for compiling with the -g flag is that the Sun
Studio Performance Analyzer can then attribute time spent in the code
directly to lines of source code -- making the process of finding
performance bottlenecks considerably easier.
|
Suggestion
|
Always compile with -g since it should not make much (if any)
difference to performance. Your program will be easier to debug
and analyze.
|
-
Using the -fast Option
-
The compiler option -fast is a 'macro' option, meaning that it stands
for a number of options that generally give good performance on a range
of codes. But there are a number of pros and cons regarding -fast that
you should be aware of.
Pros:
-
-fast is easy to use.
-
-fast should give very good performance on most code.
-
-fast is a good starting point for determining the best set of
flags to build with.
Cons:
-
The -fast option lets the compiler assume that the target
platform the code will run on is the same platform on which it was
compiled (because it includes -xtarget=native). Therefore you
may need to explicitly set the target platform. For example:
-fast -xtarget=ultra3 -xarch=v8plusa
-
The meaning of the -fast option can change with compiler releases.
-
-fast allows the compiler to make floating-point arithmetic
simplifications (for example reordering floating point
expressions), so the resulting code is not IEEE-754 compliant.
-
While -fast gives good performance on most code, it might not be
the best set of options for your particular application.
|
Notes
|
-
Using -fast enables a number of optimizations. Be sure that you
understand all the optimizations that it uses.
-
Use the flag -v
to tell the compiler to list the components of -fast.
|
|
Suggestion
|
-fast is a good starting point when optimizing code. However, it
may not necessarily be the set of optimizations you want for the
finished program. It is a better idea to use the -v option to
print out the options that -fast includes, and to select the
appropriate ones for your application from this list.
|
-
The implications for floating-point arithmetic when using the
-fast option
-
One issue to be aware of is the inclusion of floating-point arithmetic
simplifications in -fast. In particular, the options -fns and
-fsimple=2 allow the compiler to do some optimizations that do not
comply with the IEEE-754 floating-point arithmetic standard, and also
allow the compiler to relax language standards regarding floating point
expression reordering.
With the flag -fns, subnormal numbers (that is, very small numbers
that are too small to be represented in normal form) are flushed to zero.
With -fsimple, the compiler can treat floating-point arithmetic the way
a mathematics textbook would. For example, it can assume that the order
in which additions are performed does not matter, and that it is safe to
replace a divide operation by multiplication by the reciprocal. These
kinds of transformations seem perfectly acceptable when performed on
paper, but they can result in a loss of precision when algebra becomes
real numerical computation with numbers of limited precision.
Also, -fsimple allows the compiler to make optimizations that assume
that the data used in floating-point calculations will not be NaNs
(Not a Number). Compiling with -fsimple is not recommended if you
expect computation with NaNs.
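The effect of reordering additions can be shown with a short C sketch. The function names sum_left_to_right and sum_right_to_left are hypothetical and exist only for this illustration; the volatile temporaries are there to keep each function's order of evaluation as written.

```c
/* Adds three values as (a + b) + c. */
double sum_left_to_right(double a, double b, double c)
{
    volatile double partial = a + b;   /* volatile pins the evaluation order */
    return partial + c;
}

/* Adds the same three values as a + (b + c). */
double sum_right_to_left(double a, double b, double c)
{
    volatile double partial = b + c;   /* volatile pins the evaluation order */
    return a + partial;
}
```

With IEEE-754 double precision and the arguments 1e20, -1e20, and 1.0, the first function returns 1.0 and the second returns 0.0, because 1.0 is too small to change -1e20 when the two are added first. A compiler that is allowed to choose the order of the additions is therefore allowed to change the answer.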
|
Notes
|
-
The use of the flags -fns and -fsimple can result in
significant performance gains. However, they may also result in a
loss of precision. Before committing to using them in production
code, it is best to evaluate the performance gain you get from
using the flags, and whether there is any difference in the
results of the application.
-
Avoid using -fsimple with applications that perform calculations
on NaNs.
-
For more information on floating-point computation, see the
Numerical Computation Guide.
|
-
Advanced compiler options: Data Prefetch
-
Often the biggest processor wait time for a code is the time taken to
fetch data from memory. The UltraSPARC architecture has
powerful hardware and software prefetch mechanisms. To get the most out
of this feature of the chip, the compiler needs to insert prefetch
instructions in the code.
Prefetch insertion is enabled by default. However, it is worth
discussing the two flags that control it. The flag -xprefetch tells the
compiler to insert prefetch instructions whenever appropriate. The flag
-xprefetch_level suggests to the compiler how aggressively it should
insert those prefetch instructions.
In general, prefetch will help codes that do a lot of floating-point
arithmetic, or where the data is fetched from memory in a predictable order.
Another flag that helps prefetch insertion is -xdepend. This flag
tells the compiler to analyze dependences between loop iterations, and
to determine the memory access pattern. This allows the compiler to do a
better job of analyzing which variables are fetched from memory, and
then more accurately predicting when variables should be prefetched.
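As an illustration of data fetched in a predictable order, the following C sketch walks two arrays with unit stride. The function name scale_and_add is hypothetical; the loop is simply the kind of code for which a compiler could plausibly work out the access pattern and insert prefetches, not a statement about what any particular compiler release does with it.

```c
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] for n elements.
 * Both arrays are read in ascending order with unit stride, so the
 * addresses needed a few iterations ahead are easy to predict. */
void scale_and_add(double a, const double *x, double *y, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

By contrast, a loop that follows pointers through a linked list gives the compiler far less to work with, because the next address is not known until the current element has been loaded.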
|
Suggestion
|
Test the performance of your application with the flag
-xprefetch along with -xdepend. These
flags are included in -fast for C on UltraSPARC
platforms. -fast for C++ does not include -xdepend.
|
-
Advanced compiler options: Assertions about C/C++ pointers
-
There are two flags that you can use to make assertions about the use of
pointers in your program. These flags will tell the compiler something
that it can assume about the use of pointers in your source. It does not
check to see if the assertion is ever violated, so if your code violates
the assertion, then your program might not behave in the way you
intended it to. Note that lint can help you do some validity checking
of the code at a particular -xalias_level. (See
Chapter 5 of the C User's Guide.)
The two assertions are:
-
-xrestrict
Asserts that all pointers passed into functions are restricted
pointers. This means that if a function gets two pointers passed
into it, under -xrestrict the compiler can assume that those two
pointers never point at overlapping memory.
-
-xalias_level
Indicates what assumptions can be made about the degree of
aliasing between two different pointers. -xalias_level can be
considered a statement about coding style -- you are telling the
compiler how you treat pointers in the coding style you use (for
example, you can tell the compiler that an int* will never point
to the same memory location as a float*).
A useful piece of terminology is the expression 'alias'. Two pointers
alias if they point to the same location in memory. The flags
-xrestrict and -xalias_level tell the compiler what degree of
aliasing to assume in the code. For the compiler, aliasing means that
stores to the memory addressed by one pointer may change the memory
addressed by the other pointer -- this means that the compiler has to be
very careful never to reorder stores and loads in expressions containing
pointers, and it may also have to reload the values of memory accessed
through pointers after new data is stored into memory.
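A minimal C sketch of why aliasing matters follows. The function names are hypothetical. The second version uses the C99 restrict qualifier, which makes for individual pointers the same kind of promise that the text above describes -xrestrict making for pointers passed into functions.

```c
/* Adds *increment to each of the n elements of values.
 * The compiler must assume that `increment` may point into `values`,
 * so *increment has to be reloaded after every store. */
void add_to_all(int *values, int n, const int *increment)
{
    int i;
    for (i = 0; i < n; i++) {
        values[i] += *increment;
    }
}

/* Same loop, but the caller promises that the two pointers never
 * overlap, so *increment may be loaded once and kept in a register. */
void add_to_all_restrict(int *restrict values, int n,
                         const int *restrict increment)
{
    int i;
    for (i = 0; i < n; i++) {
        values[i] += *increment;
    }
}
```

Calling add_to_all on the array {1, 2, 3} with increment pointing at its first element gives {2, 4, 5}, because the increment itself changes after the first store. Making the same call to add_to_all_restrict would break the promise, and the result could legitimately differ.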
The following table summarizes the options for -xalias_level:
| gcc -xalias_level= | Comment |
| any | Any pointers can alias (default) |
| basic | Basic types do not alias each other (for example, int* and float*) |
| weak | Structure pointers alias by offset. Structure members of the same type at the same offset (in bytes) from the structure pointer may alias. |
| layout | Structure pointers alias by common fields. If the first few fields of two structure pointers have identical types, then they may potentially alias. |
| strict | Pointers to structures with different variable types in them do not alias |
| std | Pointers to differently named structures do not alias (so even if all the elements in the structures have the same types, if they have different names, then the structures do not alias). |
| strong | There are no pointers to the interiors of structures and char* is considered a basic type (at lower levels char* is considered as potentially aliasing with any other pointers) |
|
Notes
|
-
Specifying -xrestrict and -xalias_level can lead to
significant performance gains. But if your code does not conform
to the requirements of the flags, then the results of running the
application may be unpredictable.
-
For C, -xalias_level=std means that pointers behave in the same
way as the 1999 ISO C standard suggests. Specified for
standard-conforming codes.
|
-
Advanced compiler options: Crossfile optimization
-
The -xipo option performs interprocedural optimizations over the whole
program at link time. This means that the object files are examined
again at link time to see if there are any further optimization
opportunities. The most common opportunity is to inline code from
one file into code from another file. The term inlining means that the
compiler replaces a call to a routine with the actual code from that
routine.
Inlining is good for two reasons. The most obvious is that it
eliminates the overhead of calling another routine. A second, less
obvious reason is that inlining may expose additional optimizations that
can now be performed on the object code. For example, imagine that a
routine calculates the color of a particular point in an image by taking
the x and y position of the point and calculating the location of the
point in the block of memory containing the image
(image_offset = y * row_length + x). By inlining that code in the
routine that works over all the pixels in the image, the compiler is
able to generate code that just adds one to the current offset to get to
the next point, instead of having to do a multiplication and an addition
to calculate the address of each point, resulting in a performance gain.
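The image example can be sketched in C as follows. All three function names are hypothetical, and the last function is a hand-written picture of what the loop can reduce to once the accessor is inlined, not actual compiler output.

```c
#include <stddef.h>

/* Accessor that might live in a separate source file:
 * one multiplication and one addition per pixel. */
unsigned char pixel_at(const unsigned char *image, size_t row_length,
                       size_t x, size_t y)
{
    size_t image_offset = y * row_length + x;
    return image[image_offset];
}

/* Sums every pixel by calling the accessor for each point. */
unsigned long sum_pixels_by_call(const unsigned char *image,
                                 size_t row_length, size_t rows)
{
    unsigned long total = 0;
    size_t x, y;
    for (y = 0; y < rows; y++) {
        for (x = 0; x < row_length; x++) {
            total += pixel_at(image, row_length, x, y);
        }
    }
    return total;
}

/* Equivalent loop after inlining: the offset simply advances by one. */
unsigned long sum_pixels_by_offset(const unsigned char *image,
                                   size_t row_length, size_t rows)
{
    unsigned long total = 0;
    size_t pixel_count = row_length * rows;
    size_t image_offset;
    for (image_offset = 0; image_offset < pixel_count; image_offset++) {
        total += image[image_offset];
    }
    return total;
}
```

Both loops visit the pixels in the same order and return the same total; the second form just makes explicit the address arithmetic that inlining lets the compiler remove.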
The downside of using -xipo is that it can significantly increase the
compile time of the application and may also increase the size of the
executable.
|
Suggestion
|
Try compiling with -xipo to see if the performance gain is worth
the increased compile time and executable size.
|
-
Advanced compiler options: Profile feedback
-
When compiling a program, the compiler takes a best guess at how the
flow of the program might go -- about which branches are taken and which
branches are untaken. For floating-point intensive code, this generally
gives good performance. But programs with many branching operations
might not obtain the best performance.
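As a small illustration of code whose branch behavior depends on the data, consider the C sketch below. The function name count_out_of_range is hypothetical. Nothing in the source tells the compiler whether out-of-range readings are rare or common, so it has to guess which path to favor.

```c
#include <stddef.h>

/* Counts the readings that fall outside the range [low, high].
 * How often each branch is taken depends entirely on the input data. */
size_t count_out_of_range(const int *readings, size_t n, int low, int high)
{
    size_t count = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        if (readings[i] < low) {
            count++;                 /* below range: rare or common? */
        } else if (readings[i] > high) {
            count++;                 /* above range: rare or common? */
        }
    }
    return count;
}
```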
Profile feedback assists the compiler in optimizing your application by
giving it real information about the paths actually taken by your
program. Knowing the critical routes through the code allows the
compiler to make sure these are the optimized ones.
Profile feedback requires that you compile a version of your
application with -xprofile=collect and then run it with
representative input data to collect a runtime performance profile. You
then recompile with -xprofile=use and the performance profile data
collected. The downside of doing this is that the build cycle can be
significantly longer (you are doing two compiles and a run of your
application), but the compiler can produce better-optimized execution
paths, which means a faster runtime.
A representative data set should be one that will exercise the code in
ways similar to the actual data that the application will see in
production; the program can be run multiple times with different
workloads to build up the representative data set. Of course, if the
training data exercises the code in ways which are not
representative of the real workloads, then performance may not be
optimal. However, it is often the case that the code is always executed
through similar routes, so even when the data is not fully
representative, the performance will usually improve.
|
Suggestions
|
-
Try compiling with profile feedback and see whether the
performance gain is worth the additional compile time.
-
Try compiling with profile feedback and -xipo, because the
profile information will also help the compiler make better
choices about inlining.
|
-
Advanced compiler options: Large pages for data
-
If the program manipulates large data sets, then it may be the case that
it would benefit from using large pages to hold the data. The idea of a
'page' is a region of contiguous physical memory; the processor deals in
virtual memory, which allows the processor the freedom to move the data
around in physical memory, or even store it to and load it from disk.
Since the processor deals with virtual memory, it has to look up virtual
addresses to find the physical location of that data in memory; in order
to do this it uses the concept of pages. Every time the processor needs
to access a different page in memory, it has to look up the physical
location of that page. This takes a small amount of time, but if it
happens often the time can become significant. The default size of these
pages is 8KB, however the processor can use a range of page sizes. The
advantage of using a large page size is that the processor will have to
perform fewer lookups, but the disadvantage is that the processor may
not be able to find a sufficiently large chunk of contiguous memory to
allocate the large page on (in which case a set of 8KB pages will be
allocated instead).
The compiler option which controls page size is -xpagesize=size. The
options for the size depend on the platform. On UltraSPARC processors,
typical sizes are 8K, 64K, 512K, or 4M. For example, changing the page
size from 8K (the default) to 64K means that the same region of memory
is covered by one eighth as many pages, which can reduce the number of
lookups by up to a factor of 8.
Operating system support for large pages became
available with the Solaris 9 OS release on SPARC platforms.
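The factor of 8 quoted above is simple arithmetic, sketched here in C. The function name pages_needed is hypothetical.

```c
#include <stddef.h>

/* Number of pages of size page_bytes needed to cover data_bytes,
 * rounding up to a whole page. */
size_t pages_needed(size_t data_bytes, size_t page_bytes)
{
    return (data_bytes + page_bytes - 1) / page_bytes;
}
```

For a 1MB data set (1048576 bytes), pages_needed returns 128 with 8K pages (8192 bytes) and 16 with 64K pages (65536 bytes), which is the factor of 8.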
-
A set of flags to try
-
The final thing to do is to pull all these points together to make a
suggestion for a good set of flags. Remember that this set of flags may
not actually be appropriate for your application, but it is hoped that
they will give you a good starting point. (Use of the flags in square
brackets, [..] depends on special circumstances.)
| Flags | Comment |
| -g | Generate debugging information |
| -fast | Aggressive optimization |
| -xtarget=ultra3 -xarch=v8plusa | Specify target platform |
| -xprefetch | Enable prefetch instructions (enabled by default in Sun Studio 9) |
| -xipo | Enable interprocedural optimization |
| -xprofile=collect, then -xprofile=use | Compile with profile feedback |
| [-fsimple=0 -fns=no] | No floating-point arithmetic optimizations. Use if IEEE-754 compliance is important |
| [-xalias_level=val] | Set level of pointer aliasing (for C and C++). Use only if you know the option to be safe for your program. |
| [-xrestrict] | Uses restricted pointers (for C). Use only if you know the option to be safe for your program. |
| -xpagesize=64K | Change the page size for data |
-
Final remarks
-
There are many other options that the compilers recognize. The ones
presented here probably give the most noticeable performance gains for
most programs and are relatively easy to use. When selecting the
compiler options for your program:
-
It is important to be aware of just what you are telling the
compiler to do. A program may have unpredictable results if it
does not conform to the requirements of the flags.
-
When using optimization you will often be trading increased
compile time for improved runtime performance.
-
This leads to the final suggestion: use only the flags which both
give you a performance benefit and make acceptable assertions
about the code.
-
Further reading
-
Techniques for Optimizing Applications
by Rajat
Garg and Ilya Sharapov is a great resource for finding out about
compiler optimizations plus many other ways of improving performance.
-
Memory Hierarchy In Cache-Based Systems
by Ruud van der Pas
This Sun BluePrints online article helps the reader understand the
architecture of modern microprocessors. The article introduces and
explains the most common terminology and addresses some of the
performance related aspects. (PDF)
-
Application Performance Optimization
by Börje Lindh
This Sun BluePrints online article provides a brief introduction
to optimization on the Solaris operating environment. (PDF)
------------------------------------------------------------------------
-
About the Author
-
Darryl Gove is a senior staff engineer in Compiler Performance
Engineering at Sun Microsystems Inc., analyzing and optimizing the
performance of applications on current and future UltraSPARC
systems. Darryl has an M.Sc. and Ph.D. in Operational Research from
the University of Southampton in the UK. Before joining Sun, Darryl
held various software architecture and development roles in the UK.
(Page last updated July 27, 2006)