Callgrind: A call-graph generating Cache Simulator and Profiler
Last updated for Version 0.9.8
Callgrind (previously named Calltree) is a Valgrind Tool,
able to run applications under supervision
to generate profiling data.
Additionally, two command line tools
(PERL scripts) are provided:
- callgrind_annotate
This script reads the dumps produced by the profiler and outputs
sorted lists of functions, optionally with source annotation.
- callgrind_control
This tool enables you to interactively observe and control applications
currently running under Callgrind's supervision. You can
get statistics and the current stack trace, and request the
zeroing of counters and the dumping of profiles.
To use this tool, you must specify --tool=callgrind
on the Valgrind command line, or use the supplied script callgrind.
This tool is heavily based on the Cachegrind tool of the Valgrind package.
Read the documentation of Cachegrind first;
this page only describes the features supported in addition
to Cachegrind's features.
Detailed technical documentation on how Callgrind works is available
here. If you want to know how
to use it, you only need to read this page.
1. Purpose
1.1 Profiling as Part Of Application Development
When you develop a program, one of the last steps is usually to make it as
fast as possible (but still correct). You don't want to waste your time
optimizing rarely used functions, so you need to know in which parts of your
program most of the time is spent.
This is done with a technique called profiling. The program is run under
control of a profiling tool, which shows you the distribution of run time among
the functions executed in the run. After examining the program's profile,
you know where to optimize, and afterwards you verify the success of the
optimization with another profile run.
1.2 Profiling Tools
Best known is the GNU profiling tool GProf: you need to compile your
program with the option "-pg"; running the program generates a file "gmon.out",
which can be transformed into human-readable form with the command line
tool "gprof". A disadvantage is the compilation step needed to prepare the
executable, which also has to be statically linked.
Another profiling tool is Cachegrind, part of Valgrind. It uses the
processor emulation of Valgrind to run the executable, and catches all
memory accesses for the trace. The user program does not need to be
recompiled; it can use shared libraries and plugins, and the profile
measuring itself doesn't influence the trace results. The trace includes the
number of instruction/data memory accesses and 1st/2nd level cache misses,
and relates them to the source lines and functions of the run program.
A disadvantage is the slowdown involved in the processor emulation:
around 50 times slower than native execution.
Cachegrind can only deliver a flat profile. No call relationships
among the functions of an application are stored. Thus, inclusive costs, i.e.
costs of a function including the cost of all functions called from there,
can't be calculated. Callgrind extends Cachegrind by recording call
relationships and the exact event counts spent while doing each call.
Because Callgrind is based on simulation, the slowdown due to the
preprocessing of events during collection does not influence the results.
See the next chapter for more details on the possibilities.
2. Usage
2.1 Basics
To start a profile run for a program, execute
callgrind [options] program [program options]
After program termination, a profile dump file named "callgrind.out.pid"
is generated with pid being the process ID number of the profile run.
This will collect information
- on the memory accesses of your program, and whether an access can be satisfied by the 1st/2nd
level cache,
- on the calls made among the functions executed in your program.
If you are only interested in the first item, it's enough to use Cachegrind from
Valgrind. If you are only interested in the second item, use Callgrind with the option
"--simulate-cache=no". This will only count
events of type Instruction Read Access,
but it significantly speeds up the profiling, typically by a factor of 2 or 3.
If the program section you want to
profile is somewhere in the middle of the run, it is
beneficial to fast-forward to this section
without any profiling at all, and switch profiling on later.
This is achieved by using "--instr-atstart=no"
and interactively running "callgrind_control -i on"
just before the interesting code section is about to be
executed.
2.2 Multiple dumps from one program run
Often, you aren't interested in time characteristics of a full program run, but
only of a small part of it (e.g. execution of one algorithm). If there are multiple
algorithms or one algorithm running with different input data, it's even useful
to get different profile information for multiple parts of one program run.
The generated dump files are named
callgrind.out.pid[.part][-threadID]
where pid is the PID of the running program, part is a number incremented
on each dump (".part" is skipped for the dump at program termination), and
threadID is a thread identifier ("-threadID" is only used if you request dumps
for individual threads).
There are different ways to generate multiple profile dumps while a program is running
under supervision of Callgrind. Still, all methods trigger the same action
"dump all profile information since last dump or program start, and zero cost counters
afterwards". To allow for zeroing cost counters without dumping, there exists a second
action "zero all cost counters now". The different methods are:
- Dump on program termination. This method is the standard way and doesn't need any
special action from your side.
- Spontaneous, interactive dumping. Use
callgrind_control -dump [hint [PID/Name]]
to request the dumping of profile information of the supervised application
with the given PID or name. hint is an arbitrary string you can optionally
specify to be able to distinguish profile dumps later.
The control program will not terminate before the dump
is completely written.
Note that the application must be actively running for the dump command to be
detected. So, for a GUI application, resize the window; for a server, send it a request.
If you are using KCachegrind for browsing of profile information, you can use the toolbar
button "Force dump". This will create the file "cachegrind.cmd" and will trigger a reload after
the dump is written.
- Periodic dumping after execution of a specified number of basic blocks. For this,
use the command line option --dumps=count. The resolution of the internal
basic block counter of Valgrind is only rough, so you should specify an interval of
at least 50000 basic blocks.
- Dumping at enter/leave of all functions whose name starts with funcprefix. Use the options
--dump-before=funcprefix and --dump-after=funcprefix.
To zero cost counters before entering a function, use --zero-before=funcprefix.
The prefix method for specifying function names was chosen to ease the use with C++:
you don't have to specify full signatures.
You can specify these options multiple times for different function prefixes.
- Program controlled dumping. Put "#include <valgrind/callgrind.h>" into your
source and add "CALLGRIND_DUMP_STATS;" where you want a dump to happen. Use
"CALLGRIND_ZERO_STATS;" to only zero the cost counters.
In Valgrind terminology, these macros are called "client requests". They generate a
special instruction pattern that has no effect at all (i.e. it is a NOP). Only when run
under Valgrind does the CPU simulation engine detect the special instruction pattern
and trigger special actions like the ones described above.
If you are running a multi-threaded application and specify the command line option
"--dump-threads=yes", every thread will be profiled on its own and will
create its own profile dump. Thus, the last two methods will only generate one dump of
the currently running thread. With the other methods, you will get multiple dumps
(one for each thread) on a dump request.
2.3 Limiting range of event collection
You can control for which part of your program you want to collect event costs
by using --toggle-collect=funcprefix. This will toggle the collection state
on entering and leaving a function. When this option is specified, the default
collecting state at program start is "off". Thus, only events happening while
running inside of funcprefix will be collected. Recursive calls of
funcprefix don't influence the collection state at all.
2.4 Avoiding cycles
A cycle is a group of functions in which any two members are connected by a call
chain from one to the other. E.g. with A calling B, B calling C, and C calling A,
the three functions A, B and C form one cycle.
If a call chain goes around a cycle multiple times, you can't distinguish costs
coming from the first round from those of the second. Thus, it makes
no sense to attach any cost to a call among functions in one cycle: if "A > B" appears
multiple times in a call chain, you have no way to partition the single big sum over all
appearances of "A > B".
Thus, for profile data presentation, all functions of a cycle are seen as one big virtual
function.
Unfortunately, if you have an application using some callback mechanism
(like any GUI program), or even with normal polymorphism (as in OO languages like C++),
it's quite possible to get large cycles.
As it is often impossible to say anything about performance behaviour inside of cycles,
it is useful to introduce mechanisms that avoid cycles in call graphs altogether.
This is done either by treating the same function as different functions depending on the
current execution context, giving them different names, or by ignoring calls to certain
functions completely.
There is an option to ignore calls to a function:
"--fn-skip=funcprefix". E.g., you usually don't want to see the trampoline
functions in the PLT sections used for calls to functions in shared libraries. You can
see the difference if you profile with "--skip-plt=no". If a call is ignored,
the cost events happening in it will be attached to the enclosing function.
If you have a recursive function, you can distinguish the first 10 recursion
levels by specifying "--fn-recursion10=funcprefix", or for all functions
with "--fn-recursion=10", but the latter will give you much bigger profile dumps.
In the profile data, you will see the recursion levels of "func" as the
different functions with names "func", "func'2", "func'3" and so on.
If you have the call chains "A > B > C" and "A > C > B" in your program, you
usually get a "false" cycle "B <> C". Use "--fn-caller2=B --fn-caller2=C",
and the functions "B" and "C" will be treated as different functions depending on
their direct caller. Using the apostrophe for appending this "context" to the
function name, you get "A > B'A > C'B" and "A > C'A > B'C", and there will be
no cycle. Use "--fn-caller=2" to get this two-caller dependency for all
functions. Again, this will multiply the profile data size.
3. Command line option reference
--base=<prefix>
Specify another base name for the dump file names. This defaults to "cachegrind.out".
To distinguish different profile runs of the same application, there is ".<pid>"
appended to the base dump file name with <pid> being the process ID of the profile
run (with multiple dumps happening, the file name is modified further; see below).
This option is especially useful if your application changes its working directory.
Usually, the dump file is generated in the current working directory of the application
at program termination. By giving an absolute path with the base specification, you can
force a fixed directory for the dump files.
--simulate-cache=yes|no
Specify if you want to do full cache simulation. Default is yes. If you say no,
only instruction read accesses will be profiled. This typically makes the execution
at least twice as fast.
Note, however, that you cannot estimate how much real time your program will need
from the instruction read counts alone. Use this option if you want to find out how
many times different functions are called and their call relations.
--instr-atstart=yes|no
Specify if you want Callgrind to start simulation and profiling from the beginning
of the program run. If not, Callgrind is not able to collect any information, including
calls, but it has at most a slowdown of around 4, which is the minimum Valgrind overhead.
Instrumentation can be interactively switched on via "callgrind_control -i on".
Note that the resulting call graph will most probably not contain main, but only
the functions executed after instrumentation was switched on.
Instrumentation can also be switched on/off programmatically. See the Callgrind include file
<callgrind.h> for the macros you have to use in your source code.
For cache simulation, results will be a little bit off when switching on instrumentation
later in the program run, as the simulator starts with an empty cache at that moment.
To cope with this error, switch on event collection a little later still.
--collect-atstart=yes|no
Specify whether event collection is switched on at the beginning of the profile run.
This defaults to yes.
To only look at parts of your program, you have two possibilities:
- Zero event counters before entering the program part you want to profile,
and dump the event counters to a file after leaving that program part.
- Switch on/off collection state as needed to only see event counters happening
while inside of the program part you want to profile.
The second possibility can be used if the program part you want to profile is called
many times; option 1, i.e. creating a lot of dumps, is not practical here.
Collection state can be toggled at entering and leaving of a given function with the
option --toggle-collect=<function>. For this, collection state should be
switched off at the beginning. Note that the specification of --toggle-collect
implicitly sets --collect-atstart=no.
Collection state can also be toggled by using a Valgrind client request in your application.
For this, include valgrind/callgrind.h and use the macro
CALLGRIND_TOGGLE_COLLECT at the needed positions. This only has an effect
when run under supervision of the Callgrind tool.
--skip-plt=no|yes
Ignore calls to/from PLT sections. Defaults to yes.
--fn-skip=<function>
Ignore calls to/from a given function. E.g. if you have a call chain A > B > C, and
you specify function B to be ignored, you will only see A > C.
This is very convenient for skipping functions handling callback behaviour. E.g. for the
signal/slot mechanism in Qt, you only want to see the function emitting a signal calling
the slots connected to that signal. First determine the real call chain to see which
functions need to be skipped, then use this option.
--fn-group<number>=<function>
Put a function into separation group <number>.
--fn-recursion<number>=<function>
Separate <number> recursions for <function>
--fn-caller<number>=<function>
Separate <number> callers for <function>
--dump-before=<function>
Dump when entering <function>
--zero-before=<function>
Zero all costs when entering <function>
--dump-after=<function>
Dump when leaving <function>
--toggle-collect=<function>
Toggle collection on enter/leave <function>
--fn-recursion=<level>
Separate function recursions, maximal <level> [2]
--fn-caller=<callers>
Separate functions by callers [0]
--mangle-names=no|yes
Mangle separation into names? [yes]
--dump-threads=no|yes
Dump traces per thread? [no]
--compress-strings=no|yes
Compress strings in profile dump? [yes]
--dump-bbs=no|yes
Dump basic block info? [no]. This needs an update of the KCachegrind importer!
--dumps=<count>
Dump trace each <count> basic blocks [0=never]
--dump-instr=no|yes
This specifies the granularity of the profile information.
Note that if you dump at instruction level, callgrind_annotate currently
is not able to show you the data. You have to use KCachegrind to
get annotated disassembled code. [no]
--trace-jump=no|yes
This specifies whether information for (conditional) jumps
should be collected. As above, callgrind_annotate currently
is not able to show you the data. You have to use KCachegrind to
get jump arrows in the annotated code. [no]
4. Profile data file format
The header consists of an arbitrary number of lines of the format
"key: value". After the header, position specifications
"spec=position" and cost lines can appear. A cost line starts
with one or more position numbers (as given by the
"positions:" header field), followed by
space-separated cost numbers.
Empty lines are always allowed.
Possible key values for the header are:
- version: major.minor [Callgrind]
This is used to distinguish future trace file formats.
A major version of 0 or 1 is supposed to be upwards
compatible with the Cachegrind 1.0.x format.
The line is optional; if it does not appear, the original
Cachegrind 1.0.x format is assumed.
Otherwise, this has to be the first header line.
- pid: process id [Callgrind]
This specifies the process ID of the supervised
application for which this profile was generated.
- cmd: program name + args [Cachegrind]
This specifies the full command line of the supervised
application for which this profile was generated.
- part: number [Callgrind]
This specifies a sequentially incremented number for
each dump generated, starting at 1.
- desc: type: value [Cachegrind]
This specifies various information for this dump.
For some types the semantics are defined, but any
description type is allowed; unknown types should
be ignored.
The types "I1 cache", "D1 cache" and "L2 cache"
specify parameters used for the cache simulator.
These are the only types originally used by Cachegrind.
Additionally, Callgrind uses the following types:
"Timerange" gives a rough range of the basic
block counter for which the cost of this dump
was collected; "Trigger" states the reason
why this dump was generated,
e.g. program termination or a forced interactive dump.
- positions: [instr] [line] [Callgrind]
For cost lines, this defines the semantics of the leading
numbers. Any combination of "instr", "bb" and
"line" is allowed, but the order given here has to
correspond to the order of the position numbers at the
start of the cost lines later in the file.
If "instr" is specified, the position is the address
of an instruction whose execution raised the events
given later on the line. This address is relative to
the offset of the binary/shared library file to not
have to specify relocation info.
For "line", the position is the line number of a
source file, which is responsible for the events
raised. Note that the mapping of "instr" and "line"
positions are given by the debugging line information
produced by the compiler.
This field is optional. If it is not specified, "line"
alone is assumed.
- events: event type abbreviations [Cachegrind]
A list of short names of the event types logged in this
file. The order is the same as in cost lines.
The first event type is the second or third number in a
cost line, depending on the value of "positions".
Callgrind does not add additional cost types.
Specify exactly once.
Cost types from original Cachegrind are
- Ir
Instruction read access
- I1mr
Instruction Level 1 read cache miss
- I2mr
Instruction Level 2 read cache miss
- ...
- summary: costs [Callgrind]
- totals: costs [Cachegrind]
The value is the total cost, i.e. the sum over all events covered by
this trace file.
Both keys have the same meaning, but the "totals:" line appears
at the end of the file, while "summary:" appears in the header.
"summary:" was added to allow postprocessing tools to know the total
cost in advance. The two lines always give the same cost counts.
As said above, there are also lines of the form "spec=position".
The values for position specifications are arbitrary strings.
When a value starts with "(" followed by a digit, it is a string in
compressed format; otherwise it is the real position string. This
works because file and symbol names, used as position strings, never
start with "(" followed by a digit.
The compressed format is either "(" number ")" space position,
or only "(" number ")".
The first form relates the position to the number, in
the context of the given format specification, from this line to the end
of the file; it makes "(number)" an alias for position.
The compressed format is always optional.
Position specifications allowed:
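To illustrate the pieces described above (header keys, position compression, and cost lines), here is a small hand-written dump fragment. The file and function names are invented, and "fl=" and "fn=" are assumed here as the usual position specifications for source file and function:

```
version: 1
pid: 4711
cmd: ./myprog
part: 1
positions: line
events: Ir I1mr I2mr

fl=(1) myprog.c
fn=(1) main
10 8 1 1
11 4 0 0

fn=(2) helper
20 100 2 0

totals: 112 3 1
```

Here "(1) myprog.c" makes (1) an alias for the file name, and each cost line gives one source line number (because "positions: line" was specified) followed by the three event counts listed in the "events:" header; the "totals:" line is the per-event sum over all cost lines.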