Viewing file: cg_main.html (32.24 KB) -rw-r--r-- Select action/file-type: (+) | (+) | (+) | Code (+) | Session (+) | (+) | SDB (+) | (+) | (+) | (+) | (+) | (+) |
Cachegrind: a cache-miss profiler
4 Cachegrind: a cache-miss profiler
To use this tool, you must specify --tool=cachegrind
on the Valgrind command line.
Detailed technical documentation on how Cachegrind works is available
here. If you want to know how
to use it, you only need to read this page.
4.1 Cache profiling
Cachegrind is a tool for doing cache simulations and annotating your source
line-by-line with the number of cache misses. In particular, it records:
- L1 instruction cache reads and misses;
- L1 data cache reads and read misses, writes and write misses;
- L2 unified cache reads and read misses, writes and writes misses.
On a modern x86 machine, an L1 miss will typically cost around 10 cycles,
and an L2 miss can cost as much as 200 cycles. Detailed cache profiling can be
very useful for improving the performance of your program.
Also, since one instruction cache read is performed per instruction executed,
you can find out how many instructions are executed per line, which can be
useful for traditional profiling and test coverage.
Any feedback, bug-fixes, suggestions, etc, welcome.
4.2 Overview
First off, as for normal Valgrind use, you probably want to compile with
debugging info (the -g flag). But by contrast with normal
Valgrind use, you probably do want to turn optimisation on, since you
should profile your program as it will be normally run.
The two steps are:
- Run your program with
valgrind --tool=cachegrind in front of
the normal command line invocation. When the program finishes,
Cachegrind will print summary cache statistics. It also collects
line-by-line information in a file
cachegrind.out.pid , where pid
is the program's process id.
This step should be done every time you want to collect
information about a new program, a changed program, or about the
same program with different input.
- Generate a function-by-function summary, and possibly annotate
source files, using the supplied
cg_annotate program. Source files to annotate can be
specified manually, or manually on the command line, or
"interesting" source files can be annotated automatically with
the --auto=yes option. You can annotate C/C++
files or assembly language files equally easily.
This step can be performed as many times as you like for each
Step 2. You may want to do multiple annotations showing
different information each time.
The steps are described in detail in the following sections.
4.3 Cache simulation specifics
Cachegrind uses a simulation for a machine with a split L1 cache and a unified
L2 cache. This configuration is used for all (modern) x86-based machines we
are aware of. Old Cyrix CPUs had a unified I and D L1 cache, but they are
ancient history now.
The more specific characteristics of the simulation are as follows.
- Write-allocate: when a write miss occurs, the block written to
is brought into the D1 cache. Most modern caches have this
property.
- Bit-selection hash function: the line(s) in the cache to which a
memory block maps is chosen by the middle bits M--(M+N-1) of the
byte address, where:
- line size = 2^M bytes
- (cache size / line size) = 2^N bytes
- Inclusive L2 cache: the L2 cache replicates all the entries of
the L1 cache. This is standard on Pentium chips, but AMD
Athlons use an exclusive L2 cache that only holds blocks evicted
from L1. Ditto AMD Durons and most modern VIAs.
The cache configuration simulated (cache size, associativity and line size) is
determined automagically using the CPUID instruction. If you have an old
machine that (a) doesn't support the CPUID instruction, or (b) supports it in
an early incarnation that doesn't give any cache information, then Cachegrind
will fall back to using a default configuration (that of a model 3/4 Athlon).
Cachegrind will tell you if this happens. You can manually specify one, two or
all three levels (I1/D1/L2) of the cache from the command line using the
--I1 , --D1 and --L2 options.
Other noteworthy behaviour:
If you are interested in simulating a cache with different properties, it is
not particularly hard to write your own cache simulator, or to modify the
existing ones in vg_cachesim_I1.c , vg_cachesim_D1.c ,
vg_cachesim_L2.c and vg_cachesim_gen.c . We'd be
interested to hear from anyone who does.
4.4 Profiling programs
To gather cache profiling information about the program ls -l ,
invoke Cachegrind like this:
valgrind --tool=cachegrind ls -l
The program will execute (slowly). Upon completion, summary statistics
that look like this will be printed:
==31751== I refs: 27,742,716
==31751== I1 misses: 276
==31751== L2 misses: 275
==31751== I1 miss rate: 0.0%
==31751== L2i miss rate: 0.0%
==31751==
==31751== D refs: 15,430,290 (10,955,517 rd + 4,474,773 wr)
==31751== D1 misses: 41,185 ( 21,905 rd + 19,280 wr)
==31751== L2 misses: 23,085 ( 3,987 rd + 19,098 wr)
==31751== D1 miss rate: 0.2% ( 0.1% + 0.4%)
==31751== L2d miss rate: 0.1% ( 0.0% + 0.4%)
==31751==
==31751== L2 misses: 23,360 ( 4,262 rd + 19,098 wr)
==31751== L2 miss rate: 0.0% ( 0.0% + 0.4%)
Cache accesses for instruction fetches are summarised first, giving the
number of fetches made (this is the number of instructions executed, which
can be useful to know in its own right), the number of I1 misses, and the
number of L2 instruction (L2i ) misses.
Cache accesses for data follow. The information is similar to that of the
instruction fetches, except that the values are also shown split between reads
and writes (note each row's rd and wr values add up
to the row's total).
Combined instruction and data figures for the L2 cache follow that.
4.5 Output file
As well as printing summary information, Cachegrind also writes
line-by-line cache profiling information to a file named
cachegrind.out.pid . This file is human-readable, but is
best interpreted by the accompanying program cg_annotate ,
described in the next section.
Things to note about the cachegrind.out.pid file:
- It is written every time Cachegrind
is run, and will overwrite any existing
cachegrind.out.pid in the current directory (but
that won't happen very often because it takes some time for process ids
to be recycled).
- It can be huge:
ls -l generates a file of about
350KB. Browsing a few files and web pages with a Konqueror
built with full debugging information generates a file
of around 15 MB.
Note that older versions of Cachegrind used a log file named
cachegrind.out (i.e. no .pid suffix).
The suffix serves two purposes. Firstly, it means you don't have to
rename old log files that you don't want to overwrite. Secondly, and
more importantly, it allows correct profiling with the
--trace-children=yes option of programs that spawn child
processes.
4.6 Cachegrind options
Cache-simulation specific options are:
--I1=<size>,<associativity>,<line_size>
--D1=<size>,<associativity>,<line_size>
--L2=<size>,<associativity>,<line_size>
[default: uses CPUID for automagic cache configuration]
Manually specifies the I1/D1/L2 cache configuration, where
size and line_size are measured in bytes. The
three items must be comma-separated, but with no spaces, eg:
valgrind --tool=cachegrind --I1=65535,2,64
You can specify one, two or three of the I1/D1/L2 caches. Any level not
manually specified will be simulated using the configuration found in the
normal way (via the CPUID instruction, or failing that, via defaults).
4.7 Annotating C/C++ programs
Before using cg_annotate , it is worth widening your
window to be at least 120-characters wide if possible, as the output
lines can be quite long.
To get a function-by-function summary, run cg_annotate
--pid in a directory containing a
cachegrind.out.pid file. The --pid
is required so that cg_annotate knows which log file to use when
several are present.
The output looks like this:
--------------------------------------------------------------------------------
I1 cache: 65536 B, 64 B, 2-way associative
D1 cache: 65536 B, 64 B, 2-way associative
L2 cache: 262144 B, 64 B, 8-way associative
Command: concord vg_to_ucode.c
Events recorded: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Events shown: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Threshold: 99%
Chosen for annotation:
Auto-annotation: on
--------------------------------------------------------------------------------
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
--------------------------------------------------------------------------------
27,742,716 276 275 10,955,517 21,905 3,987 4,474,773 19,280 19,098 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function
--------------------------------------------------------------------------------
8,821,482 5 5 2,242,702 1,621 73 1,794,230 0 0 getc.c:_IO_getc
5,222,023 4 4 2,276,334 16 12 875,959 1 1 concord.c:get_word
2,649,248 2 2 1,344,810 7,326 1,385 . . . vg_main.c:strcmp
2,521,927 2 2 591,215 0 0 179,398 0 0 concord.c:hash
2,242,740 2 2 1,046,612 568 22 448,548 0 0 ctype.c:tolower
1,496,937 4 4 630,874 9,000 1,400 279,388 0 0 concord.c:insert
897,991 51 51 897,831 95 30 62 1 1 ???:???
598,068 1 1 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__flockfile
598,068 0 0 299,034 0 0 149,517 0 0 ../sysdeps/generic/lockfile.c:__funlockfile
598,024 4 4 213,580 35 16 149,506 0 0 vg_clientmalloc.c:malloc
446,587 1 1 215,973 2,167 430 129,948 14,057 13,957 concord.c:add_existing
341,760 2 2 128,160 0 0 128,160 0 0 vg_clientmalloc.c:vg_trap_here_WRAPPER
320,782 4 4 150,711 276 0 56,027 53 53 concord.c:init_hash_table
298,998 1 1 106,785 0 0 64,071 1 1 concord.c:create
149,518 0 0 149,516 0 0 1 0 0 ???:tolower@@GLIBC_2.0
149,518 0 0 149,516 0 0 1 0 0 ???:fgetc@@GLIBC_2.0
95,983 4 4 38,031 0 0 34,409 3,152 3,150 concord.c:new_word_node
85,440 0 0 42,720 0 0 21,360 0 0 vg_clientmalloc.c:vg_bogus_epilogue
First up is a summary of the annotation options:
- I1 cache, D1 cache, L2 cache: cache configuration. So you know the
configuration with which these results were obtained.
- Command: the command line invocation of the program under
examination.
- Events recorded: event abbreviations are:
Ir : I cache reads (ie. instructions executed)
I1mr : I1 cache read misses
I2mr : L2 cache instruction read misses
Dr : D cache reads (ie. memory reads)
D1mr : D1 cache read misses
D2mr : L2 cache data read misses
Dw : D cache writes (ie. memory writes)
D1mw : D1 cache write misses
D2mw : L2 cache data write misses
Note that D1 total accesses is given by D1mr +
D1mw , and that L2 total accesses is given by
I2mr + D2mr + D2mw .
- Events shown: the events shown (a subset of events gathered). This can
be adjusted with the
--show option.
- Event sort order: the sort order in which functions are shown. For
example, in this case the functions are sorted from highest
Ir counts to lowest. If two functions have identical
Ir counts, they will then be sorted by I1mr
counts, and so on. This order can be adjusted with the
--sort option.
Note that this dictates the order the functions appear. It is not
the order in which the columns appear; that is dictated by the "events
shown" line (and can be changed with the --show option).
- Threshold:
cg_annotate by default omits functions
that cause very low numbers of misses to avoid drowning you in
information. In this case, cg_annotate shows summaries the
functions that account for 99% of the Ir counts;
Ir is chosen as the threshold event since it is the
primary sort event. The threshold can be adjusted with the
--threshold option.
- Chosen for annotation: names of files specified manually for annotation;
in this case none.
- Auto-annotation: whether auto-annotation was requested via the
--auto=yes option. In this case no.
Then follows summary statistics for the whole program. These are similar
to the summary provided when running valgrind --tool=cachegrind .
Then follows function-by-function statistics. Each function is
identified by a file_name:function_name pair. If a column
contains only a dot it means the function never performs
that event (eg. the third row shows that strcmp()
contains no instructions that write to memory). The name
??? is used if the the file name and/or function name
could not be determined from debugging information. If most of the
entries have the form ???:??? the program probably wasn't
compiled with -g . If any code was invalidated (either due to
self-modifying code or unloading of shared objects) its counts are aggregated
into a single cost centre written as (discarded):(discarded) .
It is worth noting that functions will come from three types of source files:
- From the profiled program (
concord.c in this example).
- From libraries (eg.
getc.c )
- From Valgrind's implementation of some libc functions (eg.
vg_clientmalloc.c:malloc ). These are recognisable because
the filename begins with vg_ , and is probably one of
vg_main.c , vg_clientmalloc.c or
vg_mylibc.c .
There are two ways to annotate source files -- by choosing them
manually, or with the --auto=yes option. To do it
manually, just specify the filenames as arguments to
cg_annotate . For example, the output from running
cg_annotate concord.c for our example produces the same
output as above followed by an annotated version of
concord.c , a section of which looks like:
--------------------------------------------------------------------------------
-- User-annotated source: concord.c
--------------------------------------------------------------------------------
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
[snip]
. . . . . . . . . void init_hash_table(char *file_name, Word_Node *table[])
3 1 1 . . . 1 0 0 {
. . . . . . . . . FILE *file_ptr;
. . . . . . . . . Word_Info *data;
1 0 0 . . . 1 1 1 int line = 1, i;
. . . . . . . . .
5 0 0 . . . 3 0 0 data = (Word_Info *) create(sizeof(Word_Info));
. . . . . . . . .
4,991 0 0 1,995 0 0 998 0 0 for (i = 0; i < TABLE_SIZE; i++)
3,988 1 1 1,994 0 0 997 53 52 table[i] = NULL;
. . . . . . . . .
. . . . . . . . . /* Open file, check it. */
6 0 0 1 0 0 4 0 0 file_ptr = fopen(file_name, "r");
2 0 0 1 0 0 . . . if (!(file_ptr)) {
. . . . . . . . . fprintf(stderr, "Couldn't open '%s'.\n", file_name);
1 1 1 . . . . . . exit(EXIT_FAILURE);
. . . . . . . . . }
. . . . . . . . .
165,062 1 1 73,360 0 0 91,700 0 0 while ((line = get_word(data, line, file_ptr)) != EOF)
146,712 0 0 73,356 0 0 73,356 0 0 insert(data->;word, data->line, table);
. . . . . . . . .
4 0 0 1 0 0 2 0 0 free(data);
4 0 0 1 0 0 2 0 0 fclose(file_ptr);
3 0 0 2 0 0 . . . }
(Although column widths are automatically minimised, a wide terminal is clearly
useful.)
Each source file is clearly marked (User-annotated source ) as
having been chosen manually for annotation. If the file was found in one of
the directories specified with the -I /--include
option, the directory and file are both given.
Each line is annotated with its event counts. Events not applicable for a line
are represented by a `.'; this is useful for distinguishing between an event
which cannot happen, and one which can but did not.
Sometimes only a small section of a source file is executed. To minimise
uninteresting output, Valgrind only shows annotated lines and lines within a
small distance of annotated lines. Gaps are marked with the line numbers so
you know which part of a file the shown code comes from, eg:
(figures and code for line 704)
-- line 704 ----------------------------------------
-- line 878 ----------------------------------------
(figures and code for line 878)
The amount of context to show around annotated lines is controlled by the
--context option.
To get automatic annotation, run cg_annotate --auto=yes .
cg_annotate will automatically annotate every source file it can find that is
mentioned in the function-by-function summary. Therefore, the files chosen for
auto-annotation are affected by the --sort and
--threshold options. Each source file is clearly marked
(Auto-annotated source ) as being chosen automatically. Any files
that could not be found are mentioned at the end of the output, eg:
--------------------------------------------------------------------------------
The following files chosen for auto-annotation could not be found:
--------------------------------------------------------------------------------
getc.c
ctype.c
../sysdeps/generic/lockfile.c
This is quite common for library files, since libraries are usually compiled
with debugging information, but the source files are often not present on a
system. If a file is chosen for annotation both manually and
automatically, it is marked as User-annotated source .
Use the -I/--include option to tell Valgrind where to look for
source files if the filenames found from the debugging information aren't
specific enough.
Beware that cg_annotate can take some time to digest large
cachegrind.out.pid files, e.g. 30 seconds or more. Also
beware that auto-annotation can produce a lot of output if your program is
large!
4.8 Annotating assembler programs
Valgrind can annotate assembler programs too, or annotate the
assembler generated for your C program. Sometimes this is useful for
understanding what is really happening when an interesting line of C
code is translated into multiple instructions.
To do this, you just need to assemble your .s files with
assembler-level debug information. gcc doesn't do this, but you can
use the GNU assembler with the --gstabs option to
generate object files with this information, eg:
as --gstabs foo.s
You can then profile and annotate source files in the same way as for C/C++
programs.
4.9 cg_annotate options
--pid
Indicates which cachegrind.out.pid file to read.
Not actually an option -- it is required.
-h, --help
-v, --version
Help and version, as usual.
--sort=A,B,C [default: order in
cachegrind.out.pid ]
Specifies the events upon which the sorting of the function-by-function
entries will be based. Useful if you want to concentrate on eg. I cache
misses (--sort=I1mr,I2mr ), or D cache misses
(--sort=D1mr,D2mr ), or L2 misses
(--sort=D2mr,I2mr ).
--show=A,B,C [default: all, using order in
cachegrind.out.pid ]
Specifies which events to show (and the column order). Default is to use
all present in the cachegrind.out.pid file (and use
the order in the file).
--threshold=X [default: 99%]
Sets the threshold for the function-by-function summary. Functions are
shown that account for more than X% of the primary sort event. If
auto-annotating, also affects which files are annotated.
Note: thresholds can be set for more than one of the events by appending
any events for the --sort option with a colon and a number
(no spaces, though). E.g. if you want to see the functions that cover
99% of L2 read misses and 99% of L2 write misses, use this option:
--sort=D2mr:99,D2mw:99
--auto=no [default]
--auto=yes
When enabled, automatically annotates every file that is mentioned in the
function-by-function summary that can be found. Also gives a list of
those that couldn't be found.
--context=N [default: 8]
Print N lines of context before and after each annotated line. Avoids
printing large sections of source files that were not executed. Use a
large number (eg. 10,000) to show all source lines.
-I=<dir>, --include=<dir>
[default: empty string]
Adds a directory to the list in which to search for files. Multiple
-I/--include options can be given to add multiple directories.
4.10 Warnings
There are a couple of situations in which cg_annotate issues warnings.
- If a source file is more recent than the
cachegrind.out.pid file. This is because the
information in cachegrind.out.pid is only recorded
with line numbers, so if the line numbers change at all in the source
(eg. lines added, deleted, swapped), any annotations will be
incorrect.
- If information is recorded about line numbers past the end of a file.
This can be caused by the above problem, ie. shortening the source file
while using an old
cachegrind.out.pid file. If this
happens, the figures for the bogus lines are printed anyway (clearly
marked as bogus) in case they are important.
4.11 Things to watch out for
Some odd things that can occur during annotation:
- If annotating at the assembler level, you might see something like this:
1 0 0 . . . . . . leal -12(%ebp),%eax
1 0 0 . . . 1 0 0 movl %eax,84(%ebx)
2 0 0 0 0 0 1 0 0 movl $1,-20(%ebp)
. . . . . . . . . .align 4,0x90
1 0 0 . . . . . . movl $.LnrB,%eax
1 0 0 . . . 1 0 0 movl %eax,-16(%ebp)
How can the third instruction be executed twice when the others are
executed only once? As it turns out, it isn't. Here's a dump of the
executable, using objdump -d :
8048f25: 8d 45 f4 lea 0xfffffff4(%ebp),%eax
8048f28: 89 43 54 mov %eax,0x54(%ebx)
8048f2b: c7 45 ec 01 00 00 00 movl $0x1,0xffffffec(%ebp)
8048f32: 89 f6 mov %esi,%esi
8048f34: b8 08 8b 07 08 mov $0x8078b08,%eax
8048f39: 89 45 f0 mov %eax,0xfffffff0(%ebp)
Notice the extra mov %esi,%esi instruction. Where did this
come from? The GNU assembler inserted it to serve as the two bytes of
padding needed to align the movl $.LnrB,%eax instruction on
a four-byte boundary, but pretended it didn't exist when adding debug
information. Thus when Valgrind reads the debug info it thinks that the
movl $0x1,0xffffffec(%ebp) instruction covers the address
range 0x8048f2b--0x804833 by itself, and attributes the counts for the
mov %esi,%esi to it.
- Inlined functions can cause strange results in the function-by-function
summary. If a function
inline_me() is defined in
foo.h and inlined in the functions f1() ,
f2() and f3() in bar.c , there will
not be a foo.h:inline_me() function entry. Instead, there
will be separate function entries for each inlining site, ie.
foo.h:f1() , foo.h:f2() and
foo.h:f3() . To find the total counts for
foo.h:inline_me() , add up the counts from each entry.
The reason for this is that although the debug info output by gcc
indicates the switch from bar.c to foo.h , it
doesn't indicate the name of the function in foo.h , so
Valgrind keeps using the old one.
- Sometimes, the same filename might be represented with a relative name
and with an absolute name in different parts of the debug info, eg:
/home/user/proj/proj.h and ../proj.h . In this
case, if you use auto-annotation, the file will be annotated twice with
the counts split between the two.
- Files with more than 65,535 lines cause difficulties for the stabs debug
info reader. This is because the line number in the
struct
nlist defined in a.out.h under Linux is only a 16-bit
value. Valgrind can handle some files with more than 65,535 lines
correctly by making some guesses to identify line number overflows. But
some cases are beyond it, in which case you'll get a warning message
explaining that annotations for the file might be incorrect.
- If you compile some files with
-g and some without, some
events that take place in a file without debug info could be attributed
to the last line of a file with debug info (whichever one gets placed
before the non-debug-info file in the executable).
This list looks long, but these cases should be fairly rare.
Note: stabs is not an easy format to read. If you come across bizarre
annotations that look like might be caused by a bug in the stabs reader,
please let us know.
4.12 Accuracy
Valgrind's cache profiling has a number of shortcomings:
- It doesn't account for kernel activity -- the effect of system calls on
the cache contents is ignored.
- It doesn't account for other process activity (although this is probably
desirable when considering a single program).
- It doesn't account for virtual-to-physical address mappings; hence the
entire simulation is not a true representation of what's happening in the
cache.
- It doesn't account for cache misses not visible at the instruction level,
eg. those arising from TLB misses, or speculative execution.
- Valgrind's custom threads implementation will schedule threads
differently to the standard one. This could warp the results for
threaded programs.
- The instructions
bts , btr and btc
will incorrectly be counted as doing a data read if both the arguments
are registers, eg:
btsl %eax, %edx
This should only happen rarely.
- FPU instructions with data sizes of 28 and 108 bytes (e.g.
fsave ) are treated as though they only access 16 bytes.
These instructions seem to be rare so hopefully this won't affect
accuracy much.
Another thing worth nothing is that results are very sensitive. Changing the
size of the valgrind.so file, the size of the program being
profiled, or even the length of its name can perturb the results. Variations
will be small, but don't expect perfectly repeatable results if your program
changes at all.
While these factors mean you shouldn't trust the results to be super-accurate,
hopefully they should be close enough to be useful.
4.13 Todo
- Program start-up/shut-down calls a lot of functions that aren't
interesting and just complicate the output. Would be nice to exclude
these somehow.
|