Corestat for UltraSPARC T2
Overview
Understanding processor utilization is important for performance
analysis and capacity planning. With the launch of UltraSPARC T2-based
servers, I would like to revisit the topic of core utilization.
As we have seen earlier, for a Chip Multi-Threaded (CMT) processor like
UltraSPARC T1, the CPU utilization reported by conventional tools like
mpstat/vmstat and the core utilization reported using the hardware
performance counters in the processor are different metrics, and both
are equally important in performance analysis and tuning.
Before discussing core utilization on UltraSPARC T2 and the details of
corestat, let us take a quick look at what a core on UltraSPARC T2
looks like. UltraSPARC T2 extends the CMT architecture of T1. It
consists of eight cores, each with eight hardware threads. The hardware
threads within a core are grouped into two sets of four threads each.
There are two integer pipelines within a core, and each set of four
threads shares one integer pipeline. In this sense, the resources
within a core are doubled compared to UltraSPARC T1. It is worth
understanding that threads within a core do not switch pipelines: the
assignment of threads to a pipeline is fixed and hardwired.
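The fixed thread-to-pipeline grouping can be sketched in a few lines of Python. This is only an illustration: it assumes that virtual CPU ids are numbered contiguously (ids 0-7 on core 0, 8-15 on core 1, and so on) and that the lower four strands of a core use pipeline 0. Neither assumption is stated above, and the function name locate is made up for this example.

```python
CORES = 8               # from the text: eight cores
THREADS_PER_CORE = 8    # from the text: eight hardware threads per core
THREADS_PER_PIPE = 4    # from the text: two sets of four threads


def locate(cpu_id):
    """Return (core, integer pipeline) for a virtual CPU id.

    Assumption: CPU ids are contiguous per core and strands 0-3 of a
    core sit on pipeline 0, strands 4-7 on pipeline 1.
    """
    if not 0 <= cpu_id < CORES * THREADS_PER_CORE:
        raise ValueError("cpu_id out of range for an 8-core T2")
    core, strand = divmod(cpu_id, THREADS_PER_CORE)
    return core, strand // THREADS_PER_PIPE
```

Under these assumptions, locate(5) gives core 0, pipeline 1, and locate(8) gives core 1, pipeline 0.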
One more important addition to the resources within a core is a
Floating Point Unit (FPU). Each core of T2 includes an FPU shared by
all eight threads of that core. Other shared resources within a core
include the Level-1 Instruction (I) and Data (D) caches and the
Translation Lookaside Buffers (TLBs), i.e. the I-TLB and D-TLB. All
cores share a 4 MB Level-2 (L2) cache. These, together with the
features listed below, are the main reasons why both single-thread and
multi-thread performance of UltraSPARC T2 is better than T1.
A quick look at the UltraSPARC T2 architecture shows the following enhancements, which benefit single-thread performance :
- Increased frequency - 1400 MHz
- Lower instruction latencies
- Better floating point performance
- Hardware TLB miss handling for I-TLB and D-TLB
- Larger D-TLB size (128 entries vs. 64 entries)
- Larger L2 cache (4 MB vs. 3 MB)
- Full support for the VIS 2.0 instruction set, with no kernel emulation
Similarly, the following are some of the features of UltraSPARC T2 that benefit multi-thread performance :
- Two integer pipelines per core
- Twice the number of hardware threads (64 vs. 32)
- Higher L2 cache set associativity: 16-way compared to 12-way
- 8-way associative instruction cache compared to 4-way
- Dedicated floating point unit per core, shared by all 8 strands, for improved FP throughput
- Memory interface supports FBDIMMs for higher capacity and bandwidth
- Support for the shared context feature, where multiple contexts share
the same entry in the TLB for mappings to the same address segment
- Streaming Processing Unit (SPU) per core for on-chip encryption/decryption support
Now, let us look at the topic of core utilization. All the important
concepts, like thread scheduling, idle hardware threads, stalled
threads etc., were introduced in my earlier blog on T1.
Those concepts generally hold good for T2, but there are subtle
differences. For example, on T2 an idle integer pipeline does not mean
that the full core is idle. Both pipelines within a core can each
execute one instruction per cycle concurrently. Hence, at a frequency
of 1417 MHz, a core can execute a maximum of 2 x 1417 x 1000 x 1000
(about 2.8 billion) instructions/second.
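The arithmetic above, and the per-pipeline utilization figure that follows from it, can be sketched as below. This is a hedged illustration of the calculation only, not corestat's actual source; the function names are hypothetical.

```python
def peak_instr_per_sec(freq_mhz, pipes_per_core=2):
    """Upper bound on instructions/second for one core.

    Each integer pipeline issues at most one instruction per cycle,
    so the bound is pipes x frequency (MHz converted to Hz).
    """
    return pipes_per_core * freq_mhz * 1000 * 1000


def pipe_utilization_pct(instr_count, interval_sec, freq_mhz):
    """Share of one pipeline's issue slots that were actually used.

    instr_count is the number of instructions counted on that pipeline
    during interval_sec seconds.
    """
    cycles = freq_mhz * 1000 * 1000 * interval_sec
    return 100.0 * instr_count / cycles
```

For example, a pipeline that retires 708,500,000 instructions in one second at 1417 MHz is using half of its issue slots.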
Considering these differences, corestat for UltraSPARC T2 has been enhanced and can be obtained from here. The main enhancements are :
- It now reports the utilization of each pipeline separately. By default, only the integer pipeline utilization is reported.
- A new command line option "-g" reports the FPU utilization along with the integer pipeline utilization.
- Corestat detects the frequency of the target system at run time.
Corestat Features and Usage
While the usage remains the same, corestat for UltraSPARC T2 can be used in two modes :
- For online monitoring, which requires root privileges. This is the
default mode of operation. The default reporting interval is 10 sec
and the assumed frequency is 1417 MHz.
- For reporting core utilization by post-processing cpustat data that has already been sampled using the following command line :
cpustat -n -c pic0=Instr_cnt,pic1=Instr_FGU_arithmetic -c pic0=Instr_cnt,pic1=Instr_FGU_arithmetic,nouser,sys 1
$ corestat
Frequency = 1050 MHz
corestat : Permission denied. Needs root privilege...
Usage : corestat [-g] [-v] [[-f <infile>] [-i <interval>] [-r <freq>]]
Default mode : Report Integer Pipeline Utilization
  -g          : Report FPU usage
  -v          : Report version number
  -f infile   : Filename containing sampled cpustat data
  -i interval : Reporting interval in sec (default = 10 sec)
  -r freq     : Processor frequency in MHz (default = 1417 MHz)
# corestat -g
Core Utilization for Integer pipeline
Core,Int-pipe   %Usr    %Sys    %Usr+Sys
-------------   -----   -----   --------
0,0             0.00    0.19    0.20
0,1             0.00    0.01    0.01
1,0             0.00    0.03    0.03
1,1             0.00    0.01    0.01
2,0             1.15    0.02    1.16
2,1             0.00    0.01    0.01
3,0             0.02    0.02    0.04
3,1             0.00    0.01    0.01
4,0             0.00    0.02    0.03
4,1             0.00    0.01    0.01
5,0             0.02    0.01    0.03
5,1             0.00    0.01    0.01
6,0             0.05    0.03    0.08
6,1             0.00    0.01    0.01
7,0             0.00    0.03    0.03
7,1             0.00    0.01    0.01
-------------   -----   -----   --------
Avg             0.08    0.03    0.10

FPU Utilization
Core            %Usr    %Sys    %Usr+Sys
-------------   -----   -----   --------
0               0.02    0.01    0.03
1               0.02    0.01    0.03
2               0.01    0.01    0.03
3               0.01    0.01    0.03
4               0.02    0.01    0.04
5               0.02    0.02    0.04
6               0.02    0.02    0.04
7               0.02    0.02    0.04
-------------   -----   -----   --------
Avg             0.02    0.02    0.04
As far as interpretation of corestat data is concerned, all the points
mentioned in an earlier blog with respect to T1 still hold. Since core
saturation (measured using corestat) and virtual CPU saturation
(measured using vmstat/mpstat) are two different aspects, we need to
monitor both simultaneously in order to determine whether an
application is likely to saturate the cores with fewer application
threads than there are hardware threads. In such cases, increasing the
workload (e.g. by increasing the number of threads) may not yield any
more performance. On the other hand, most often we will see
applications with high Cycles Per Instruction (CPI) that are therefore
unable to saturate the cores fully before reaching 100% CPU
utilization.
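One rough way to put the two metrics side by side is to extrapolate where the cores would saturate if core utilization kept growing in proportion to CPU utilization. The sketch below assumes exactly that linear growth, which real workloads will not follow precisely, and the function name is hypothetical.

```python
def cpu_busy_at_core_saturation(cpu_busy_pct, core_util_pct):
    """Estimate the CPU busy % at which the cores would reach 100%.

    Assumption: core utilization scales linearly with CPU utilization
    as more application threads are added. A result below 100 suggests
    the cores saturate before all virtual CPUs are busy; a result above
    100 suggests the virtual CPUs run out first (high CPI).
    """
    if core_util_pct <= 0:
        raise ValueError("core utilization must be positive")
    return 100.0 * cpu_busy_pct / core_util_pct
```

With 25% CPU busy and 50% core utilization, the estimate is 50: the cores would saturate with about half of the virtual CPUs busy. With 100% CPU busy and 50% core utilization, the estimate is 200, i.e. the virtual CPUs are exhausted long before the cores are.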
While I make this new version of corestat available here, we are
already looking at a number of RFEs received as comments on my earlier
blog and via e-mail. Some of those points are being considered. Stay
tuned !!
Corestat for UltraSPARC T1
Overview
For the UltraSPARC T1 processor, a hardware thread being idle and a
core being idle are two different things and hence need to be
understood separately. Corestat is a tool for measuring core
utilization of the UltraSPARC T1 processor. By using corestat along
with other conventional tools like mpstat and vmstat, you can get a
better idea about any possible scaling issues. It can also be useful
for capacity planning.
Corestat Features and Usage
Corestat is a tool to monitor the utilization of UltraSPARC T1 cores.
Corestat reports aggregate core usage based on the instructions
executed by a core (i.e. by the available Virtual Processors sharing
the same core).
Corestat can be used in two modes :
- It can be used for online monitoring of core usage. Online
monitoring requires root privilege. This is the default mode of
operation. The default reporting interval is 10 sec. Corestat assumes
a processor frequency of 1200 MHz.
Usage :
$ corestat
corestat : Permission denied. Needs root privilege...
Usage : corestat [-v] [[-f <infile>] [-i <interval>] [-r <freq>]]
  -v          : Report version number
  -f infile   : Filename containing sampled cpustat data
  -i interval : Reporting interval in sec (default = 10 sec)
  -r freq     : Processor frequency in MHz (default = 1200 MHz.)
- It can be used to report core utilization by post-processing sampled
cpustat data. It assumes that the cpustat data was sampled with a
1 sec sampling interval and was collected for both user and system
mode.
The following is an example of a cpustat command used for collecting
data which can be post-processed by corestat.
cpustat -n -c pic0=L2_dmiss_ld,pic1=Instr_cnt \
  -c pic0=L2_dmiss_ld,pic1=Instr_cnt,nouser,sys 1
Note: use the -r option during offline processing of cpustat data if
the frequency is different from the default of 1200 MHz.
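To make the post-processing step concrete, here is a minimal Python sketch of how per-core utilization could be derived from such cpustat samples. It is not corestat's actual implementation. It assumes each data line has the form "time cpu tick pic0 pic1 # <event spec>", that the line from the second counter set carries "nouser" in its trailing spec, and that virtual CPU ids are contiguous with four threads per T1 core. The function name core_utilization is made up for this example.

```python
from collections import defaultdict

THREADS_PER_CORE = 4  # UltraSPARC T1: four hardware threads per core


def core_utilization(lines, freq_mhz=1200.0, interval_sec=1.0):
    """Return {(core, mode): utilization %} with mode 'usr' or 'sys'.

    Assumed line layout: time cpu event pic0 pic1 # event-spec,
    with Instr_cnt programmed on pic1 as in the command above.
    """
    instr = defaultdict(int)   # (core, mode) -> instructions over all samples
    ticks = defaultdict(int)   # (cpu, mode)  -> number of samples seen
    for line in lines:
        data, _, spec = line.partition("#")
        fields = data.split()
        if len(fields) != 5 or fields[2] != "tick":
            continue  # skip headers, totals and anything unexpected
        cpu, pic1 = int(fields[1]), int(fields[4])
        mode = "sys" if "nouser" in spec else "usr"
        instr[(cpu // THREADS_PER_CORE, mode)] += pic1
        ticks[(cpu, mode)] += 1

    cycles_per_sample = freq_mhz * 1e6 * interval_sec
    result = {}
    for (core, mode), total in instr.items():
        # Number of sampling intervals observed for this core and mode.
        samples = max(n for (cpu, m), n in ticks.items()
                      if m == mode and cpu // THREADS_PER_CORE == core)
        result[(core, mode)] = 100.0 * total / (samples * cycles_per_sample)
    return result
```

The freq_mhz parameter plays the role of the -r option: pass the real clock frequency when it differs from 1200 MHz, otherwise the percentages will be scaled incorrectly.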
Corestat Frequently Asked Questions :
Q: There are already vmstat and mpstat. Why do we need corestat ?
A: vmstat and mpstat report CPU utilization. Conventionally, if a
processor is not idle it is considered busy. A processor can be
stalled and hence not executing any instructions, yet it is still
reported as busy because the pipeline is not freed for other runnable
threads on the system. Hence conventional tools like vmstat and mpstat
report pipeline occupancy, which is the same as CPU utilization for
non-CMT processors. Idle time reported by mpstat can be used to decide
about adding more load on the system.
On UltraSPARC T1, there are two main reasons why we need to understand
core utilization separately from CPU utilization.
- Idle hardware threads :
On UltraSPARC T1, an idle h/w thread is parked by Solaris. It is taken
out of the mix of schedulable threads and its time slot is allocated
to the next hardware thread. This gets reported as idle under mpstat
for that hardware thread (i.e. CPU). However, since the pipeline is
shared by three other threads on the same core, that core can still
execute instructions, and hence an idle hardware thread is not the
same as an idle core.
- Stalled hardware threads :
On UltraSPARC T1, when a h/w thread stalls due to a long latency
instruction (such as a load), it is taken out of the mix of
schedulable threads, allowing the next chosen thread to use its time
slice. A stalled thread is reported as 100% busy by mpstat (similar to
non-CMT CPUs). However, it is not executing any instructions, and
during that time the pipeline is shared (time sliced) among the other
threads and hence can still execute instructions.
From corestat data we can get an idea about the headroom available for
performance. A high percentage of core usage means the processor has
less headroom available for processing more load. It also means that
the pipeline is being used effectively.
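The difference between what mpstat sees and what corestat sees can be illustrated with a deliberately simple model of one T1 core. It assumes every busy thread would, on its own, issue one instruction every cpi cycles and stall the rest of the time, and that the single-issue pipeline interleaves the busy threads perfectly. Both are simplifications and the function name is hypothetical.

```python
def toy_core_model(busy_threads, cpi, threads_per_core=4):
    """Return (mpstat-style CPU busy %, core utilization %) for one core.

    busy_threads: hardware threads running software (stalled or not).
    cpi: cycles per instruction of each thread when running alone.
    """
    if not 0 <= busy_threads <= threads_per_core:
        raise ValueError("busy_threads out of range")
    if cpi < 1:
        raise ValueError("a single-issue pipeline cannot have CPI < 1")
    # mpstat counts a stalled thread as busy, so only parked threads are idle.
    cpu_busy = 100.0 * busy_threads / threads_per_core
    # The pipeline issues at most one instruction per cycle in total.
    core_util = 100.0 * min(1.0, busy_threads / cpi)
    return cpu_busy, core_util
```

In this model, four busy threads with a CPI of 8 show up as 100% busy in mpstat but only 50% core utilization, while a single thread with a CPI of 2 shows 25% busy in mpstat and 50% core utilization.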
Q: How to use corestat along with conventional vmstat/mpstat tools ?
A: Corestat and mpstat (or vmstat) need to be used together to make
decisions about system utilization.
Here is an explanation for a few possible scenarios:
- Vmstat reports 75% idle and corestat reports 20% utilization :
Since vmstat reports a large idle time and the core usage is also low,
there is headroom for applying more load. Any performance gain from
increasing the load will depend on the characteristics of the
application.
- Vmstat reports 100% busy and corestat reports 50% utilization :
Since vmstat reports the CPUs as 100% busy, there is really no more
headroom to schedule any more software threads. Hence the system is at
its peak load. The low (i.e. 50%) core utilization indicates that the
application is only utilizing each core to 50% of its capacity and the
cores are not saturated.
- Vmstat reports 75% idle and corestat reports 50% utilization :
Since the core utilization (50%) is higher than the CPU utilization
reported by vmstat (25%), this is an indication that the processor can
get saturated with fewer software threads than the available hardware
threads, i.e. CPUs. It is also an indication of a "LOW CPI"
application. In this case, scalability will be limited by core
saturation, and adding more load after that point will not help
achieve any more performance.
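The three scenarios can be condensed into a small decision helper. This is only a sketch of the reasoning above: the 95% busy threshold and the returned labels are arbitrary choices made for the example, not values used by corestat, vmstat or mpstat.

```python
def interpret(vmstat_idle_pct, core_util_pct, busy_threshold=95.0):
    """Classify a (vmstat idle %, corestat utilization %) pair.

    busy_threshold is a hypothetical cut-off for 'CPUs fully busy'.
    """
    cpu_busy = 100.0 - vmstat_idle_pct
    if cpu_busy >= busy_threshold:
        # No free virtual CPUs left, whatever the core utilization is.
        return "cpu-saturated"
    if core_util_pct > cpu_busy:
        # Cores fill up faster than virtual CPUs: low-CPI workload.
        return "core-bound"
    return "headroom"
```

Applied to the scenarios above, interpret(75, 20) returns "headroom", interpret(0, 50) returns "cpu-saturated" and interpret(75, 50) returns "core-bound".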
Downloads
Corestat for T1
Corestat for T2
More Information
More information about understanding the utilization of UltraSPARC T1 is available here.