John Levon. Copyright © 2000-2004 Victoria University of Manchester, John Levon and others. Table of Contents
This manual applies to OProfile version 0.8.2cvs. OProfile is a profiling system for Linux 2.2/2.4/2.6 systems on a number of architectures. It is capable of profiling all parts of a running system, from the kernel (including modules and interrupt handlers) to shared libraries to binaries. It runs transparently in the background, collecting information at low overhead. These features make it ideal for profiling entire systems to determine bottlenecks in real-world systems. Many CPUs provide "performance counters", hardware registers that can count "events"; for example, cache misses, or CPU cycles. OProfile provides profiles of code based on the number of these events occurring: repeatedly, every time a certain (configurable) number of events has occurred, the PC value is recorded. This information is aggregated into profiles for each binary image. Some hardware setups do not allow OProfile to use performance counters: in these cases, no events are available, and OProfile operates in timer/RTC mode, as described in later chapters. OProfile is useful in a number of situations. You might want to use OProfile when you :
OProfile is not a panacea. OProfile might not be a complete solution when you :
First you need to build OProfile and install it. ./configure , make , make install is often all you need, but note these arguments to ./configure :
For 2.4 kernels, you need a configured kernel source tree for the running kernel in order to build the module. Since distributions provide many different kernels, it is unlikely that the running kernel matches the configured source you installed; the safest way is to compile your own kernel, run it, and then compile OProfile against it. On a uniprocessor machine, it is also recommended that you enable local APIC / IO_APIC support in your kernel (this is automatically enabled for SMP kernels). With many BIOSes, a kernel >= 2.6.9, and a UP kernel, it is not sufficient to enable the local APIC; you must also turn it on explicitly at boot time by passing the "lapic" option to the kernel. On machines with power management, such as laptops, power management must be turned off when using OProfile with 2.4 kernels: the power management software in the BIOS cannot handle the non-maskable interrupts (NMIs) used by OProfile for data collection. If you use the NMI watchdog, be aware that the watchdog is disabled when profiling starts, and not re-enabled until the OProfile module is removed (or, in 2.6, when OProfile is not running). If you compile OProfile for a 2.2 kernel, you must be root to compile the module. If you are using 2.6 kernels or higher, you do not need kernel source, as long as the OProfile driver is enabled; additionally, you should not need to disable power management. Please note that you must save or have available the vmlinux file generated during a kernel compile, as OProfile needs it (you can use --no-vmlinux , but this will prevent kernel profiling).
Before you can use OProfile, you must set it up. The minimum setup required for this is to tell OProfile where the vmlinux file corresponding to the running kernel is, for example :
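For instance (the vmlinux path here is an assumption; substitute the location of your own kernel's uncompressed image):

```shell
# Point OProfile at the uncompressed kernel image for the running kernel.
# The path below is an assumed example; adjust it to where your vmlinux lives.
opcontrol --vmlinux=/boot/vmlinux-`uname -r`
```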
If you don't want to profile the kernel itself, you can tell OProfile you don't have a vmlinux file :
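For example:

```shell
# No kernel profiling: tell OProfile there is no vmlinux to examine.
opcontrol --no-vmlinux
```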
Now we are ready to start the daemon ( oprofiled ) which collects the profile data :
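For example:

```shell
# Load the driver if necessary, start oprofiled, and begin collecting samples.
opcontrol --start
```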
When I want to stop profiling, I can do so with :
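For example:

```shell
# Stop collection, flush remaining profile data, and kill the daemon.
opcontrol --shutdown
```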
Note that unlike gprof , no instrumentation ( -pg and -a options to gcc ) is necessary. Periodically (or on opcontrol --shutdown or opcontrol --dump ) the profile data is written out into the /var/lib/oprofile/samples directory. These profile files cover shared libraries, applications, the kernel (vmlinux), and kernel modules. You can clear the profile data (at any time) with opcontrol --reset . You can get summaries of this data in a number of ways at any time. To get a summary of data across the entire system for all of these profiles, you can do :
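The simplest invocation is:

```shell
# Summarise samples for every profiled binary on the system.
opreport
```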
Or to get a more detailed summary, for a particular image, you can do something like :
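A sketch (the image path is an assumption; any profiled binary or the kernel image works here):

```shell
# Per-symbol summary for one image; the vmlinux path is an assumed example.
opreport -l /boot/vmlinux-`uname -r`
```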
There are also a number of other ways of presenting the data, as described later in this manual. Note that OProfile will choose a default profiling setup for you. However, there are a number of options you can pass to opcontrol if you need to change something, also detailed later. This section gives a brief description of the available OProfile utilities and their purpose.
In this section we describe the configuration and control of the profiling system with opcontrol in more depth. The opcontrol script has a default setup, but you can alter this with the options given below. In particular, if your hardware supports performance counters, you can configure them. There are a number of counters (for example, counter 0 and counter 1 on the Pentium III). Each of these counters can be programmed with an event to count, such as cache misses or MMX operations. The event chosen for each counter is reflected in the profile data collected by OProfile: functions and binaries at the top of the profiles reflect that most of the chosen events happened within that code. Additionally, each counter has a "count" value: this corresponds to how detailed the profile is. The lower the value, the more frequently profile samples are taken. A counter can choose to sample only kernel code, user-space code, or both (both is the default). Finally, some events have a "unit mask" - this is a value that further restricts the types of event that are counted. The event types and unit masks for your CPU are listed by opcontrol --list-events . The opcontrol script provides the following actions :
There are a number of possible settings, of which only --vmlinux (or --no-vmlinux ) is required. These settings are stored in ~/.oprofile/daemonrc .
Here, we have a Pentium III running at 800MHz, and we want to look at where data memory references are happening most, and also get results for CPU time.
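One plausible setup for this scenario (the reset counts of 30000 are assumptions; tune them for your workload):

```shell
# Pentium III: count data memory references and unhalted CPU cycles.
# The reset counts (30000) and vmlinux path are assumed example values.
opcontrol --event=DATA_MEM_REFS:30000 --event=CPU_CLK_UNHALTED:30000 \
          --vmlinux=/boot/vmlinux-`uname -r`
opcontrol --start
```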
Here, we have an Intel laptop without support for performance counters, running on 2.4 kernels.
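A sketch of the RTC fallback on such a machine (the sampling rate and vmlinux path are assumptions):

```shell
# No performance counters available: use RTC mode on a 2.4 kernel.
# 1024 interrupts/sec and the vmlinux path are assumed example values.
opcontrol --vmlinux=/boot/vmlinux-`uname -r` --event=RTC_INTERRUPTS:1024
opcontrol --start
```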
If we're running 2.6 kernels, we can use --start-daemon to avoid the profiler startup affecting results.
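For example (`./my_workload` is a hypothetical program standing in for whatever you want to profile):

```shell
# Launch the daemon first, then start and stop sampling around the workload,
# so daemon startup does not pollute the results.
opcontrol --start-daemon --no-vmlinux
opcontrol --start
./my_workload          # hypothetical program to be profiled
opcontrol --stop
```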
Here, we want to see a profile of the OProfile daemon itself, including when it was running inside the kernel driver, and its use of shared libraries.
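One way to set this up (the vmlinux path is an assumption):

```shell
# Separate kernel samples per application, so the daemon's time in the
# driver and in shared libraries is attributed to oprofiled itself.
opcontrol --separate=kernel --vmlinux=/boot/vmlinux-`uname -r`
opcontrol --start
# ... let it run for a while, then inspect the daemon's own profile:
opreport -l `which oprofiled`
```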
It can often be useful to split up profiling data into several different time periods. For example, you may want to collect data on an application's startup separately from the normal runtime data. You can use the simple command opcontrol --save to do this. For example :
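For example ("startup" is an arbitrary session name chosen for this illustration):

```shell
# Archive the samples collected so far under a named session.
opcontrol --save=startup
```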
will create a sub-directory in /var/lib/oprofile/samples containing the samples collected up to that point (the current session's sample files are moved into this directory). You can then pass this session name as a parameter to the post-profiling analysis tools, to get data only up to the point you named the session. If you do not want to save a session, you can do rm -rf /var/lib/oprofile/samples/sessionname or, for the current session, opcontrol --reset . The --event option to opcontrol takes a specification that indicates how the details of each hardware performance counter should be set up. If you want to revert to OProfile's default setting ( --event is strictly optional), use --event=default . You can pass multiple event specifications. OProfile will allocate hardware counters as necessary. Note that some combinations are not allowed by the CPU; running opcontrol --list-events gives the details of each event. The event specification is a colon-separated string of the form name : count : unitmask : kernel : user as described in this table: Note: For the PowerPC platforms, all events specified must be in the same group; i.e., the group number appended to the event name (e.g. <some-event-name>_GRP9 ) must be the same.
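As a sketch of the full form, the following counts unhalted CPU cycles with a zero unit mask, in both kernel and user space (the reset count of 100000 is an assumed example value):

```shell
# Full form:  name:count:unitmask:kernel:user
opcontrol --event=CPU_CLK_UNHALTED:100000:0:1:1
```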
The last three values are optional; if you omit them (e.g. --event=DATA_MEM_REFS:30000 ), they will be set to the default values (a unit mask of 0, and profiling both kernel and userspace code). Note that some events require a unit mask. If OProfile is using RTC mode, and you want to alter the default counter value, you can use something like --event=RTC_INTERRUPTS:2048 . Note that the last three values here are ignored. If OProfile is using timer-interrupt mode, there is no configuration possible. The table below lists the events selected by default ( --event=default ) for the various computer architectures:
The oprof_start application provides a convenient way to start the profiler. Note that oprof_start is just a wrapper around the opcontrol script, so it does not provide more services than the script itself. After oprof_start is started you can select the event type for each counter; the sampling rate and other related parameters are explained in Section 1 . The "Configuration" section allows you to set general parameters such as the buffer size, kernel filename etc. The counter setup interface should be self-explanatory; Section 3.1 and related links contain information on using unit masks. A status line shows the current status of the profiler: how long it has been running, and the average number of interrupts received per second and the total, over all processors. Note that quitting oprof_start does not stop the profiler. Your configuration is saved in the same file as opcontrol uses; that is, ~/.oprofile/daemonrc . Note: Your CPU type may not include the requisite support for hardware performance counters, in which case you must use OProfile in RTC mode in 2.4 (see Section 3.2 ), or timer mode in 2.6 (see Section 3.3 ). You do not really need to read this section unless you are interested in using events other than the default event chosen by OProfile. The Intel hardware performance counters are detailed in the Intel IA-32 Architecture Manual, Volume 3, available from http://developer.intel.com/ . The AMD Athlon/Duron implementation is detailed in http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf . For PowerPC64 processors in IBM iSeries and pSeries systems, contact IBM for more information. These processors are capable of delivering an interrupt when a counter overflows. This is the basic mechanism on which OProfile is based. The delivery mode is NMI, so blocking interrupts in the kernel does not prevent profiling.
When the interrupt handler is called, the current PC value and the current task are recorded into the profiling structure. This allows the overflow event to be attached to a specific assembly instruction in a binary image. The daemon receives this data from the kernel, and writes it to the sample files. If we use an event such as CPU_CLK_UNHALTED or INST_RETIRED ( GLOBAL_POWER_EVENTS or INSTR_RETIRED , respectively, on the Pentium 4), we can use the overflow counts as an estimate of actual time spent in each part of code. Alternatively we can profile interesting data such as the cache behaviour of routines with the other available counters. However there are several caveats. First, there are those issues listed in the Intel manual. There is a delay between the counter overflow and the interrupt delivery that can skew results on a small scale - this means you cannot rely on the profiles at the instruction level as being perfectly accurate. If you are using an "event-mode" counter such as the cache counters, a count registered against an instruction doesn't mean that instruction is responsible for the event; rather, it implies that the counter overflowed in the dynamic vicinity of that instruction, to within a few instructions. Further details on this problem can be found in Chapter 5 and also in the Digital paper "ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors". Each counter has several configuration parameters. First, there is the unit mask: this simply further specifies what to count. Second, there is the counter value, discussed below. Third, there is a parameter controlling whether counts are incremented whilst in kernel or user space. You can configure these separately for each counter. After each overflow event, the counter will be re-initialized such that another overflow will occur after this many events have been counted. Thus, higher values mean less-detailed profiling, and lower values mean more detail, but higher overhead.
Picking a good value for this parameter is, unfortunately, somewhat of a black art. It is of course dependent on the event you have chosen. Specifying too large a value will mean not enough interrupts are generated to give a realistic profile (though this problem can be ameliorated by profiling for longer). Specifying too small a value can lead to high performance overhead. Note: This section applies to 2.2/2.4 kernels only. Some CPU types do not provide the needed hardware support to use the hardware performance counters. This includes some laptops, classic Pentiums, and other CPU types not yet supported by OProfile (such as Cyrix). On these machines, OProfile falls back to using the real-time clock interrupt to collect samples. This interrupt is also used by the rtc module: you cannot use OProfile's RTC mode with the rtc module loaded or with rtc support compiled into the kernel. RTC mode is less capable than the hardware counters mode; in particular, it is unable to profile sections of the kernel where interrupts are disabled. There is just one available event, "RTC interrupts", and its value corresponds to the number of interrupts generated per second (that is, a higher number means better profiling resolution, and higher overhead). The current implementation of the real-time clock supports only power-of-two sampling rates from 2 to 4096 per second. Other values within this range are rounded to the nearest power of two. Setting the value from the GUI should be straightforward. On the command line, you need to specify the event to opcontrol , e.g. : opcontrol --event=RTC_INTERRUPTS:256 Note: This section applies to 2.6 kernels and above only. In 2.6 kernels on CPUs without OProfile support for the hardware performance counters, the driver falls back to using the timer interrupt for profiling. Like the RTC mode in 2.4 kernels, this is not able to profile code that has interrupts disabled.
Note that there are no configuration parameters for setting this, unlike the RTC and hardware performance counter setup. You can force use of the timer interrupt by using the timer=1 module parameter (or oprofile.timer=1 on the boot command line if OProfile is built-in). The Pentium 4 / Xeon performance counters are organized around 3 types of model specific registers (MSRs): 45 event selection control registers (ESCRs), 18 counter configuration control registers (CCCRs) and 18 counters. ESCRs describe a particular set of events which are to be recorded, and CCCRs bind ESCRs to counters and configure their operation. Unfortunately the relationship between these registers is quite complex; they cannot all be used with one another at any time. There is, however, a subset of 8 counters, 8 ESCRs, and 8 CCCRs which can be used independently of one another, so OProfile only accesses those registers, treating them as a bank of 8 "normal" counters, similar to those in the P6 or Athlon families of CPU. There is currently no support for Precision Event-Based Sampling (PEBS), nor any advanced uses of the Debug Store (DS). Current support is limited to the conservative extension of OProfile's existing interrupt-based model described above. Performance monitoring hardware on Pentium 4 / Xeon processors with Hyperthreading enabled (multiple logical processors on a single die) is not supported in 2.4 kernels (you can use OProfile if you disable hyper-threading, though). The Itanium 2 performance monitoring unit (PMU) organizes the counters as four pairs of performance event monitoring registers. Each pair is composed of a Performance Monitoring Configuration (PMC) register and Performance Monitoring Data (PMD) register. The PMC selects the performance event being monitored and the PMD determines the sampling interval. The IA64 Performance Monitoring Unit (PMU) triggers sampling with maskable interrupts. 
Thus, samples will not occur in sections of the IA64 kernel where interrupts are disabled. None of the advanced features of the Itanium 2 performance monitoring unit, such as opcode matching, address range matching, or precise event sampling, are supported by this version of OProfile. The Itanium 2 support only maps OProfile's existing interrupt-based model to the PMU hardware. The performance monitoring unit (PMU) for the PowerPC 64-bit processors consists of between 6 and 8 counters (depending on the model), plus three special purpose registers used for programming the counters -- MMCR0, MMCR1, and MMCRA. Advanced features such as instruction matching and thresholding are not supported by this version of OProfile. OProfile is a low-level profiler which allows continuous profiling at low overhead. If too low a count reset value is set for a counter, the system can become overloaded with counter interrupts, making it seem as if the system has frozen. Whilst some validation is done, it is not foolproof. Note: This can happen as follows: When the profiler count reaches zero an NMI handler is called which stores the sample values in an internal buffer, then resets the counter to its original value. If the count is very low, a pending NMI can be sent before the NMI handler has completed. Due to the priority of the NMI, the local APIC delivers the pending interrupt immediately after completion of the previous interrupt handler, and control never returns to other parts of the system. In this way the system seems to be frozen. If this happens, it will be impossible to bring the system back to a workable state. There is no way to provide real security against this happening, other than making sure to use a reasonable value for the counter reset. For example, setting the CPU_CLK_UNHALTED event type with a ridiculously low reset count (e.g. 500) is likely to freeze the system. In short : Don't try a foolish sample count value .
Unfortunately the definition of a foolish value is really dependent on the event type - if ever in doubt, e-mail the OProfile list. There are situations where you are only interested in the profiling results of a particular running process, or process tty group. You can set the pid/pgrp values via the --pid-filter and --pgrp-filter options to opcontrol , which will make the daemon ignore samples for processes that don't match the filter. These options are not available in 2.6 and above kernels. Note: This section applies to 2.2/2.4 kernels only; OProfile in 2.6 can be unloaded safely. The kernel module can be unloaded, but it is designed to take very little memory when profiling is not underway, so there is no need to unload the module between profiler runs. lsmod and similar utilities will still show the module's use count as -1 . However, this is not to be relied on - the module will become unloadable some short time after stopping profiling. Note that by default module unloading is disabled when used on SMP systems. This is because of a small chance that a module unload race could crash the kernel. As the race is very small, you can re-enable module unloading by specifying the "allow_unload" parameter to the module : modprobe oprofile allow_unload=1 This option can be DANGEROUS and should only be used on non-production systems. OK, so the profiler has been running, but it's not much use unless we can get some data out. Fairly often, OProfile does a little too good a job of keeping overhead low, and no data reaches the profiler. This can happen on lightly-loaded machines. Remember you can force a dump at any time with : opcontrol --dump Remember to do this before complaining there is no profiling data ! Now that we've got some data, it has to be processed. That's the job of opreport , opannotate , opstack or opgprof . The opreport utility is the primary utility you will use for getting formatted data out of OProfile.
It produces two types of data: image summaries and symbol summaries. An image summary lists the number of samples for individual binary images such as libraries or applications. Symbol summaries provide per-symbol profile data. In the following example, we're getting an image summary for the whole system:
If we had specified --symbols in the previous command, we would have gotten a symbol summary of all the images across the entire system. We can restrict this to only part of the system profile; for example, below is a symbol summary of the OProfile daemon. Note that as we used opcontrol --separate=kernel , symbols from images that oprofiled has used are also shown.
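Assuming the daemon binary is on your PATH, the command for such a summary is something like:

```shell
# Per-symbol summary restricted to the OProfile daemon binary.
opreport -l image:`which oprofiled`
```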
These are the two basic ways you are most likely to use regularly, but opreport can do a lot more than that. For more details, see Section 7 . The opannotate utility generates annotated source files or assembly listings, optionally mixed with source. If you want to see the source file, the profiled application needs to have debug information, and the source must be available through this debug information. For GCC, you must use the -g option when you are compiling. If the binary doesn't contain sufficient debug information, you can still use opannotate --assembly to get annotated assembly. Note that for the reason explained in Section 3.1 the results can be inaccurate. The debug information itself can add other problems; for example, the line number for a symbol can be incorrect. Assembly instructions can be re-ordered and moved by the compiler, and this can lead to crediting source lines with samples not really "owned" by this line. Also see Chapter 5 . You can output the annotation to one single file, containing all the source found, using the --source option. You can use this in conjunction with --assembly to get combined source/assembly output. You can also output a directory of annotated source files that maintains the structure of the original sources. Each line in the annotated source is prepended with the samples for that line. Additionally, each symbol is annotated giving details for the symbol as a whole. An example:
Line numbers are maintained in the source files, but each file has a footer appended describing the profiling details. The actual annotation looks something like this :
The first number on each line is the number of samples, whilst the second is the relative percentage of total samples. There are a number of options for specifying the output; for more details, see Section 8 . Of course, opannotate needs to be able to locate the source files for the binary image(s) in order to produce output. Some binary images have debug information where the given source file paths are relative, not absolute. You can specify search paths to look for these files (similar to gdb 's dir command) with the --search-dirs option. Sometimes you may have a binary image which gives absolute paths for the source files, but you have the actual sources elsewhere (commonly, you've installed an SRPM for a binary on your system and you want annotation from an existing profile). You can use the --base-dirs option to redirect OProfile to look somewhere else for source files. For example, imagine we have a binary generated from a source file that is given in the debug information as /tmp/build/libfoo/foo.c , and you have the source tree matching that binary installed in /home/user/libfoo/ . You can redirect OProfile to find foo.c correctly like this :
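A sketch of such a command (the library path /lib/libfoo.so is an illustrative assumption):

```shell
# Debug info records /tmp/build/libfoo/foo.c; the sources actually live
# under /home/user/libfoo/. The library path is an assumed example.
opannotate --source --base-dirs=/tmp/build/libfoo/ \
           --search-dirs=/home/user/libfoo/ /lib/libfoo.so
```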
You can specify multiple (comma-separated) paths to both options. The opstack utility generates a call-graph profile at the symbol level. It's able to traverse shared library boundaries, so you can trace calls into an application's loaded libraries. You can also get kernel-based call-graph profiles; currently OProfile cannot trace across a system call boundary. For example, consider the following C program:
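The original listing is missing from this copy; the following is a reconstruction consistent with the call structure described below ( a() called twice, once from main() and once from b() ). The busy loops are placeholder bodies that merely give the profiler something to sample:

```c
#include <stdio.h>

/* Reconstructed example: a() is called from both main() and b(). */
static volatile int sink;

static void a(void)
{
    int i;
    for (i = 0; i < 1000000; i++)
        sink += i;         /* placeholder work for the profiler to sample */
}

static void b(void)
{
    a();
}

int main(void)
{
    a();
    b();
    puts("done");
    return 0;
}
```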
Here we can see the logical structure is that a() is called twice, once from main() , and once from b() . Let's look at a portion of the output from opstack :
The output is similar to the output of GNU gprof . Each section refers to one function; here we have shown only one section for clarity, which focuses on the function b() . We say that (for example) main() is a caller of b() , and conversely b() is a callee of main() . Functions listed above the non-indented line in each section are callers of the function; functions listed below the non-indented line are direct callees. Note that functions are only listed if samples were attributed against them in the call-graph. Let's go through this section line by line.
The function main() is a caller of b() . No samples were taken inside main itself, but all samples (2053) were taken by functions called by main() . Note this number refers to all callees of main() , not just the one for this section.
This is the function that's the focus of this section (it is not indented). We can see that there were 408 samples inside b() . The percentage figure refers to the relative percentage of sample count for the entire program: here, we spent 19% of our time in b() itself. Additionally, there were 715 samples inside functions that b() called. In this case, there is only one such function - a() . The percentage has the same meaning - of all the samples taken in the program, 34% of them were spent in a() when it was called by b() .
And here we have a() , which is indented and below b() , meaning that b() called a() , as you can see in the source code above. The report shows that a() received 1645 samples in total (whether called by b() or not). The percentage shows that of all the callees of b() , 100% of the samples were in a() . This is to be expected, since b() only calls one function. See Section 9 . If you would like to use call-graph profiling, you need to be running on an x86 machine with a 2.6 kernel. You must also apply a kernel patch to generate the data. If you're familiar with the output produced by GNU gprof , you may find opgprof useful. It takes a single binary as an argument, and produces a gmon.out file for use with gprof -p . Note that only a flat profile is included; OProfile cannot produce call graphs. For example:
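A sketch ( ./myprog is a hypothetical profiled binary):

```shell
# Produce a gmon.out for a binary, then view the flat profile with gprof.
opgprof ./myprog
gprof -p ./myprog gmon.out
```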
A few options to opgprof are available; see Section 10 . The oparchive utility generates a directory populated with executable, debug, and OProfile sample files. This directory can be moved to another machine via tar and analyzed without further use of the data collection machine. The following command would collect the sample files, the executables associated with the sample files, and the debuginfo files associated with the executables, and copy them into /tmp/current_data :
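A minimal sketch:

```shell
# Gather sample files, executables and debuginfo into /tmp/current_data
# for offline analysis on another machine.
oparchive -o /tmp/current_data
```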
All of the analysis tools take a profile specification . This is a set of definitions that describe which actual profiles should be examined. The simplest profile specification is empty: this will match all the available profile files for the current session (this is what happens when you do opreport ). Specification parameters are of the form name:value[,value] . For example, if I wanted to get a combined symbol summary for /bin/myprog and /bin/myprog2 , I could do opreport -l image:/bin/myprog,/bin/myprog2 . As a special case, you don't actually need to specify the image: part here: anything left on the command line is assumed to be an image: name. Similarly, if no session: is specified, then session:current is assumed ("current" is a special name of the current / last profiling session). In addition to the comma-separated list shown above, some of the specification parameters can take glob -style values. For example, if I want to see image summaries for all binaries profiled in /usr/bin/ , I could do opreport image:/usr/bin/\* . Note the necessity to escape the special character from the shell. Image summaries for all profiles with DATA_MEM_REFS samples in the saved session called "stresstest" :
Symbol summary for the application called "test_sym53c8xx,9xx". Note the escaping is necessary as image: takes a comma-separated list.
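For instance (the installation path is an illustrative assumption):

```shell
# The comma in the image name must be escaped, since image: takes a
# comma-separated list. The path is an assumed example.
opreport -l image:/usr/local/bin/test_sym53c8xx\,9xx
```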
Image summaries for all binaries in the test directory, excepting boring-test :
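For example (the directory path is an illustrative assumption):

```shell
# All binaries under the test directory, excluding boring-test.
# Note the glob is escaped from the shell.
opreport image:./test/\* image-exclude:./test/boring-test
```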
Each session's sample files can be found in the /var/lib/oprofile/samples/ directory. These are used, along with the binary image files, to produce human-readable data. In some circumstances (kernel modules in an initrd, or modules on 2.6 kernels), OProfile will not be able to find the binary images. All the tools have an --image-path option to which you can pass a comma-separated list of alternate paths to search. For example, I can let OProfile find my 2.6 modules by using --image-path /lib/modules/2.6.0/kernel/ . It is your responsibility to ensure that the correct images are found when using this option. Note that if a binary image changes after the sample file was created, you won't be able to get useful symbol-based data out. This situation is detected for you. If you replace a binary, you should make sure to save the old binary if you need to do comparative profiles.
What this is saying is that the profile specification you passed in, when matched against the available sample files, resulted in no matches. There are a number of reasons this might happen:
If you have used one of the --separate= options whilst profiling, there can be several separate profiles for a single binary image within a session. Normally the output will keep these images separated (so, for example, the image summary output shows library image summaries on a per-application basis, when using --separate=lib ). Sometimes it can be useful to merge these results back together before getting results. The --merge option allows you to do that.
If you have used multiple events when profiling, by default you get side-by-side results of each event's sample values from opreport . You can restrict which events to list by appropriate use of the event: profile specifications, etc.
The standard caveats of profiling apply in interpreting the results from OProfile: profile realistic situations, profile different scenarios, profile for as long a time as possible, avoid system-specific artifacts, don't trust the profile data too much. Also bear in mind the comments on the performance counters above - you cannot rely on totally accurate instruction-level profiling. However, for almost all circumstances the data can be useful. Ideally a utility such as Intel's VTUNE would be available to allow careful instruction-level analysis; go hassle Intel for this, not me ;) This is an example of how the latency of delivery of profiling interrupts can impact the reliability of the profiling data. This is pretty much a worst-case-scenario example: these problems are fairly rare.
Here the last instruction of the loop is very costly, and you would expect the results to reflect that - but (cutting the instructions inside the loop):
The problem comes from the x86 hardware; when the counter overflows the IRQ is asserted, but the hardware has features that can delay the NMI interrupt: x86 hardware is synchronous (i.e. it cannot interrupt during an instruction); there is also a latency when the IRQ is asserted; and the multiple execution units and the out-of-order model of modern x86 CPUs also cause problems. This is the same function, with annotation :
The conclusion: don't trust samples coming at the end of a loop, particularly if the last instruction generated by the compiler is costly. This case can also occur for branches. Always bear in mind that samples can be delayed by a few cycles from their real position. That's a hardware problem and OProfile can do nothing about it. OProfile uses non-maskable interrupts (NMI) on the P6 generation, Pentium 4, Athlon and Duron processors. These interrupts can occur even in sections of the Linux kernel where interrupts are disabled, allowing collection of samples in virtually all executable code. The RTC, timer interrupt mode, and Itanium 2 collection mechanisms use maskable interrupts. Thus, the RTC and Itanium 2 data collection mechanisms have "sample shadows", or blind spots: regions where no samples will be collected. Typically, the samples will be attributed to the code immediately after the interrupts are re-enabled. Your kernel is likely to support halting the processor when a CPU is idle. As the typical hardware events like CPU_CLK_UNHALTED do not count when the CPU is halted, the kernel profile will not reflect the actual amount of time spent idle. You can change this behaviour by booting with the idle=poll option, which uses a different idle routine. This will appear as poll_idle() in your kernel profile. The internal implementation of the 2.6 OProfile code means that tasks that are within the kernel do_exit() routine cannot be profiled. OProfile profiles kernel modules by default. However, there are a couple of problems you may have when trying to get results. First, you may have booted via an initrd; this means that the actual path for the module binaries cannot be determined automatically. To get around this, you can use the -p option to the profiling tools to specify where to look for the kernel modules. In 2.6, the information on where kernel module binaries are located has been removed. This means OProfile needs guiding with the -p option to find your modules.
Normally, you can just use your standard module top-level directory for this. Note that, due to this problem, OProfile cannot check that the modification times match; it is your responsibility to make sure you do not modify a binary after a profile has been created. If you have run insmod or modprobe to insert a module from a particular directory, it is important that you specify this directory with the -p option first, so that it overrides an older module binary that might exist in other directories you've specified with -p. It is up to you to make sure that these values are correct: 2.6 kernels simply do not provide enough information for OProfile to work this out itself.

Sometimes the results from call-graph profiles may be different to what you expect to see. The first thing to check is whether the target binaries were compiled with frame pointers enabled (if a binary was compiled with gcc's -fomit-frame-pointer option, you will not get meaningful results). Note that as of this writing, the GCC developers plan to disable frame pointers by default. The Linux kernel is built without frame pointers by default; there is a configuration option you can use to turn them on under the "Kernel Hacking" menu. Like the rest of OProfile, call-graph profiling uses a statistical approach; this means that sometimes a backtrace sample is truncated, or even partially wrong. Bear this in mind when examining results.

The compiler can introduce some pitfalls in the annotated source output. The optimizer can move pieces of code around in such a way that two lines of code are interleaved (instruction scheduling). Also, the debug info generated by the compiler can show strange behaviour. This is especially true for complex expressions, e.g. inside an if statement:
Here the problem comes from the position of the line numbers. The available debug info does not give enough detail for the if condition, so all samples are accumulated at the position of the right brace of the expression. Using opannotate -a can help to show the real samples at the assembly level.

The compiler generally needs to generate "glue" code across function calls, dependent on the particular function calling conventions used. Additionally, other things need to happen, like stack-pointer adjustment for the local variables; this code is known as the function prologue. Similar code is needed at function return, and is known as the function epilogue. This will show up in annotations as samples at the very start and end of a function, where there is no apparent executable code in the source.

You may see that a function is credited with a certain number of samples, but the listing does not add up to the correct total. To pick a real example:
Here, the function is credited with 1,882 samples, but the annotations below do not account for this. This is usually because of inline functions - the compiler marks such code with debug entries for the inline function definition, and this is where opannotate annotates such samples. In the case above, memset is the most likely candidate for this problem. Examining the mixed source/assembly output can help identify such results. When running opannotate , you may get a warning "some functions compiled without debug information may have incorrect source line attributions". In some rare cases, OProfile is not able to verify that the derived source line is correct (when some parts of the binary image are compiled without debugging information). Be wary of results if this warning appears. Furthermore, for some languages the compiler can implicitly generate functions, such as default copy constructors. Such functions are labelled by the compiler as having a line number of 0, which means the source annotation can be confusing. Depending on your compiler you can fall into the following problem:
Compiled with gcc 3.0.4, the annotated source is clearly inaccurate:
The problem here is distinct from the IRQ latency problem; the debug line-number information is simply not precise enough. Again, looking at the output of opannotate -as can help.
So here it's clear that the copying is correctly credited with the samples, but the line number information is misplaced. objdump -dS exposes the same problem. Note that maintaining accurate debug information when the compiler is optimizing is difficult, so this problem is not surprising. The accuracy of the debug information also depends on the binutils version used; some BFD library versions contain a work-around for known problems of gcc, others do not. This is unfortunate, but we must live with it, since profiling is pointless when you disable optimisation (which would give better debugging entries).

Often the assembler cannot generate debug information automatically. This means that you cannot get a source report unless you manually define the necessary debug information; read your assembler documentation for how you might do that. The only debugging info currently needed by OProfile is the line-number/filename-to-VMA association. When profiling assembly without debugging info you can always get a report for symbols, and optionally for VMAs, through opreport -l or opreport -d, but this works only for symbols with the right attributes. For gas you can get this by marking the symbol with a .type directive (for example, .type foo,@function), whilst for nasm you must use the function attribute of the global directive (for example, global foo:function).
Note that OProfile does not need the global attribute, only the function attribute.

6. Other discrepancies

Another cause of apparent problems is the hidden cost of instructions. A very common example is two memory reads, one from L1 cache and the other from main memory: the second read is likely to accumulate more samples. There are many other causes of hidden instruction costs. A non-exhaustive list: mis-predicted branches, TLB misses, partial register stalls, partial register dependencies, memory mismatch stalls, re-executed µops. If you want to write programs at the assembly level, be sure to take a look at the Intel and AMD documentation at http://developer.intel.com/ and http://www.amd.com/products/cpg/athlon/techdocs/.

Thanks to (in no particular order): Arjan van de Ven, Rik van Riel, Juan Quintela, Philippe Elie, Phillipp Rumpf, Tigran Aivazian, Alex Brown, Alisdair Rawsthorne, Bob Montgomery, Ray Bryant, H.J. Lu, Jeff Esper, Will Cohen, Graydon Hoare, Cliff Woolley, Alex Tsariounov, Al Stone, Jason Yeh, Randolph Chung, Anton Blanchard, Richard Henderson, Andries Brouwer, Bryan Rittmeyer, Richard Reich (rreich@rdrtech.com), Zwane Mwaikambo, Dave Jones, Charles Filtness; and finally Pulp, for "Intro".