Random notes on learning ftrace...


Introduction

ftrace is an efficient and safe tracing mechanism in the Linux kernel. It can trace the entire system without installing any extra debugging tools, and "the entire system" here means everything from the kernel all the way to user space!

ftrace requires the debug file system, which can be mounted by hand:

# mount -t debugfs none /sys/kernel/debug

That said, most current distros mount it automatically, so there is usually nothing to worry about:

# mount | grep debugfs
debugfs on /sys/kernel/debug type debugfs (rw,relatime)

ftrace's main control interface is a collection of files, located here:

# ls /sys/kernel/debug/tracing
available_events            dyn_ftrace_total_info  kprobe_profile   saved_cmdlines       set_graph_function  trace            tracing_max_latency
available_filter_functions  enabled_functions      max_graph_depth  saved_cmdlines_size  set_graph_notrace   trace_clock      tracing_on
available_tracers           events                 options          set_event            snapshot            trace_marker     tracing_thresh
buffer_size_kb              free_buffer            per_cpu          set_ftrace_filter    stack_max_size      trace_options    uprobe_events
buffer_total_size_kb        instances              printk_formats   set_ftrace_notrace   stack_trace         trace_pipe       uprobe_profile
current_tracer              kprobe_events          README           set_ftrace_pid       stack_trace_filter  tracing_cpumask

available_events lists the predefined event types that can be traced, for example:

# grep open available_events 
hda_controller:azx_pcm_open
syscalls:sys_exit_perf_event_open
syscalls:sys_enter_perf_event_open
syscalls:sys_exit_openat
syscalls:sys_enter_openat
syscalls:sys_exit_open
syscalls:sys_enter_open
syscalls:sys_exit_open_by_handle_at
syscalls:sys_enter_open_by_handle_at
syscalls:sys_exit_mq_open
syscalls:sys_enter_mq_open

Before starting a trace, do some preparation: turn the tracer off first, so earlier records don't get in the way:

# echo 0 >tracing_on

Clear the log. The trace file is backed by a ring buffer in the kernel; when space runs out, old entries are dropped:

# echo >trace

Say we want to trace the open() syscall; add syscalls:sys_exit_open to set_event:

# echo syscalls:sys_exit_open >set_event

Start tracing:

# echo 1 >tracing_on

The log can be read from trace_pipe:

# cat trace_pipe

It can also be read from trace. The difference is that reading from trace does not affect the contents of the ring buffer, while reading from trace_pipe removes whatever has been read.
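
A quick way to see the difference (a minimal sketch; it assumes tracing is on and a few events have already been recorded):

# cat trace | wc -l          # count buffered entries; nothing is consumed
# cat trace | wc -l          # roughly the same count again
# cat trace_pipe | head -5   # prints five entries and removes them from the buffer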


ftrace is a convenient, easy-to-use tracing mechanism in the kernel. The Linux kernel adds a call to mcount() in each function's prologue so that trace information can be gathered, and those mcount() call sites are patched into nop instructions at boot, so probe points that are not enabled have almost no impact on system performance. With a mechanism like ftrace you can safely observe a system in a production environment without installing any extra packages, which used to be very hard to do.
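
You can get a rough feel for this mechanism from the tracing directory itself (a sketch; the exact output format varies by kernel version):

# cat dyn_ftrace_total_info   # how many mcount call sites have been converted to nops and can be traced
# cat enabled_functions       # call sites that currently have a tracer callback attached (empty when nothing is enabled)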

ftrace requires the debug file system, which can be mounted by hand:

$ sudo mount -t debugfs none /debug

Most Linux systems these days mount it automatically, at the following location:

/sys/kernel/debug

To use ftrace, first tell it what you want to trace. The choices are:

  • event: static probe points predefined by kernel programmers, grouped by subsystem. The full list is in available_events; to add or remove events, write to the set_event file.
  • functions: the names of the functions that can be traced. The full list is in available_filter_functions; to add or remove traced functions, write to the set_ftrace_filter file.
  • kprobe: adds probe points on demand at almost any location in the kernel; driven mainly through kprobe_events and kprobe_profile (a small example follows this list).
  • uprobe: adds probe points on demand in user space; driven mainly through uprobe_events and uprobe_profile.
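
For instance, a kprobe event can be defined and enabled like this (a minimal sketch run from the tracing directory; my_open is just a made-up name, and do_sys_open is only an example symbol that may differ on your kernel):

# echo 'p:my_open do_sys_open' >>kprobe_events     # place a probe at the entry of do_sys_open
# echo 1 >events/kprobes/my_open/enable            # turn the new event on
# cat trace_pipe                                   # watch it fire
# echo 0 >events/kprobes/my_open/enable            # turn it off again
# echo '-:my_open' >>kprobe_events                 # remove the probe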

After telling ftrace what to trace, run the following to start collecting results:

echo 1 >tracing_on

Run the following to stop tracing:

echo 0 >tracing_on

The trace output is recorded in the trace file, which is a ring buffer; when the buffer is full, old entries are dropped. Reading from trace does not remove anything from the kernel, and it does not block when there is nothing to read. If you need either of those behaviors, read from trace_pipe instead.

Let's start with a small example that traces the read() system call:

grep 'sys.*read$' available_events 
syscalls:sys_exit_read
syscalls:sys_enter_read

The read() event comes in enter and exit flavors, fired on entering and leaving the syscall respectively. Next, set up the event:

echo syscalls:sys_enter_read >set_event

Use the nop tracer, which simply prints the process ID, timestamp, event name, and arguments whenever the event fires:

echo nop >current_tracer

Start tracing:

echo 1 >tracing_on

To see the log so far, run:

cat trace_pipe
 Chrome_IOThread-21144 [001] .... 11025.353781: sys_read(fd: 36, buf: 7fa83b7d49ff, count: 1)
          chrome-21113 [000] .... 11025.354260: sys_read(fd: 11, buf: 7ffc159c446e, count: 2)
            Xorg-2073  [001] .... 11025.357337: sys_read(fd: 9, buf: 7ffc585abbf0, count: 400)
          chrome-21113 [000] .... 11025.358437: sys_read(fd: 11, buf: 7ffc159c446e, count: 2)
          chrome-21113 [000] .... 11025.358455: sys_read(fd: 11, buf: 7ffc159c446e, count: 2)
          task56-25096 [000] .... 11025.372179: sys_read(fd: 3b, buf: 7f15d8023f30, count: 10000)
 Chrome_IOThread-9964  [000] .... 11025.372465: sys_read(fd: 2c, buf: 7f0e2864c9ff, count: 1)
          chrome-9932  [002] .... 11025.372480: sys_read(fd: 11, buf: 7fffaad5b66e, count: 2)
          chrome-21113 [000] .... 11025.374779: sys_read(fd: 11, buf: 7ffc159c446e, count: 2)
          chrome-21113 [000] .... 11025.391043: sys_read(fd: 11, buf: 7ffc159c446e, count: 2)
          chrome-21113 [000] .... 11025.391051: sys_read(fd: 11, buf: 7ffc159c446e, count: 2)
          chrome-21113 [000] .... 11025.393090: sys_read(fd: 11, buf: 7ffc159c446e, count: 2)
          chrome-21113 [000] .... 11025.393108: sys_read(fd: 11, buf: 7ffc159c446e, count: 2)
 Chrome_IOThread-21144 [001] .... 11025.393329: sys_read(fd: 36, buf: 7fa83b7d49ff, count: 1)
          chrome-21221 [000] .... 11025.393422: sys_read(fd: a, buf: 7ffe54df360e, count: 2)
          chrome-21113 [000] .... 11025.393563: sys_read(fd: 11, buf: 7ffc159c446e, count: 2)
 Chrome_IOThread-21144 [001] .... 11025.393844: sys_read(fd: 36, buf: 7fa83b7d49ff, count: 1)
          chrome-21221 [001] .... 11025.393905: sys_read(fd: a, buf: 7ffe54df360e, count: 2)
 Chrome_ChildIOT-21256 [000] .... 11025.395063: sys_read(fd: 14, buf: 7fbeb47e4a2f, count: 1)
          chrome-21113 [000] .... 11025.395128: sys_read(fd: 11, buf: 7ffc159c446e, count: 2)
 Chrome_IOThread-21144 [000] .... 11025.395159: sys_read(fd: 36, buf: 7fa83b7d49ff, count: 1)
 Chrome_IOThread-21144 [000] .... 11025.395247: sys_read(fd: 36, buf: 7fa83b7d49ff, count: 1)
 Chrome_IOThread-21144 [000] .... 11025.395328: sys_read(fd: 36, buf: 7fa83b7d49ff, count: 1)
 Chrome_ChildIOT-21256 [001] .... 11025.395342: sys_read(fd: 14, buf: 7fbeb47e4a2f, count: 1)
...

Stop tracing:

echo 0 >tracing_on

If you keep a tail on the trace file running, you will notice that even though tracing has stopped, it keeps printing log lines; looking at the timestamps shows it is just re-printing the same information from the ring buffer over and over. To avoid this, read from trace_pipe instead: once entries are read from it, they are removed from the kernel ring buffer.
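
A common pattern is to drain trace_pipe into a file in the background while the workload runs (a sketch; /tmp/ftrace.log is just an arbitrary file name):

# cat trace_pipe >/tmp/ftrace.log &   # keep draining the buffer into a file
  ... run the workload being traced ...
# kill %1                             # stop the background reader when done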

execve

During system boot there can be a lot of fork/exec activity. If those processes are short-lived, the overhead is relatively high and it is hard to see where the time is going, so tracing exec is a useful tool for optimizing the boot process.

The simplest approach is to trace sched:sched_process_exec:

# echo 0 >tracing_on
# echo >trace
# echo sched:sched_process_exec >set_event
# echo 1 >tracing_on

Tracing is now running; read the output:

# cat trace_pipe 
              ls-28787 [002] .... 23132.309240: sched_process_exec: filename=/bin/ls pid=28787 old_pid=28787
  chrome-sandbox-28807 [003] .... 23138.852845: sched_process_exec: filename=/opt/google/chrome/chrome-sandbox pid=28807 old_pid=28807
  chrome-sandbox-28920 [000] .... 23198.872393: sched_process_exec: filename=/opt/google/chrome/chrome-sandbox pid=28920 old_pid=28920
              sh-28949 [000] .... 23214.097790: sched_process_exec: filename=/bin/sh pid=28949 old_pid=28949
       run-parts-28950 [000] .... 23214.099009: sched_process_exec: filename=/bin/run-parts pid=28950 old_pid=28950
  chrome-sandbox-29052 [001] .... 23259.216646: sched_process_exec: filename=/opt/google/chrome/chrome-sandbox pid=29052 old_pid=29052

fork & exec
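
To watch forks and execs together, one option (a sketch; it assumes sched:sched_process_fork is available alongside the sched:sched_process_exec event used above) is to enable both events side by side:

# echo 0 >tracing_on
# echo >trace
# echo sched:sched_process_fork >set_event
# echo sched:sched_process_exec >>set_event
# echo 1 >tracing_on
# cat trace_pipe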

ftrace_dump_on_oops

There are times that the kernel will crash and examining the memory and state of the crash is more of a CSI science than a program debugging science. Using kdump/kexec with the crash utility is a valuable way to examine the state of the system at the point of the crash, but it does not let you see what has happened prior to the event that caused the crash.

Having Ftrace configured and enabling ftrace_dump_on_oops in the kernel boot parameters, or by echoing a "1" into /proc/sys/kernel/ftrace_dump_on_oops, will enable Ftrace to dump to the console the entire trace buffer in ASCII format on oops or panic. Having the console output to a serial log makes debugging crashes much easier. You can now trace back the events that led up to the crash.
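
For example, to turn it on at runtime as described above (a sketch):

[tracing]# echo 1 > /proc/sys/kernel/ftrace_dump_on_oops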

Dumping to the console may take a long time since the default Ftrace ring buffer is over a megabyte per CPU. To shrink the size of the ring buffer, write the number of kilobytes you want the ring buffer to be to buffer_size_kb. Note that the value is per CPU, not the total size of the ring buffer.

[tracing]# echo 50 > buffer_size_kb

The above will shrink the Ftrace ring buffer down to 50 kilobytes per CPU.
You can also trigger a dump of the Ftrace buffer to the console with sysrq-z.

To choose a particular location for the kernel dump, the kernel may call ftrace_dump() directly. Note, this may permanently disable Ftrace and a reboot may be necessary to enable it again. This is because ftrace_dump() reads the buffer. The buffer is made to be written to in all contexts (interrupt, NMI, scheduling) but the reading of the buffer requires locking. To be able to perform ftrace_dump() the locking is disabled and the buffer may end up being corrupted after the output.

/*
 * The following code will lock up the box, so we dump out the
 * trace before we hit that location.
 */
ftrace_dump();

/* code that locks up */

Stack Tracing

The final topic to discuss is the ability to examine the size of the kernel stack and how much stack space each function is using. Enabling the stack tracer (CONFIG_STACK_TRACER) will show where the biggest use of the stack takes place.

The stack tracer is built from the function tracer infrastructure. It does not use the Ftrace ring buffer, but it does use the function tracer to hook into every function call. Because it uses the function tracer infrastructure, it does not add overhead when not enabled. To enable the stack tracer, echo 1 into /proc/sys/kernel/stack_tracer_enabled. To see the max stack size during boot up, add "stacktrace" to the kernel boot parameters.

The stack tracer checks the size of the stack at every function call. If it is greater than the last recorded maximum, it records the stack trace and updates the maximum with the new size. To see the current maximum, look at the stack_max_size file.

[tracing]# echo 1 > /proc/sys/kernel/stack_tracer_enabled
[tracing]# cat stack_max_size
2928
[tracing]# cat stack_trace
        Depth    Size   Location    (34 entries)
        -----    ----   --------
  0)     2952      16   mempool_alloc_slab+0x15/0x17
  1)     2936     144   mempool_alloc+0x52/0x104
  2)     2792      16   scsi_sg_alloc+0x4a/0x4c [scsi_mod]
  3)     2776     112   __sg_alloc_table+0x62/0x103
[...]
 13)     2072      48   __elv_add_request+0x98/0x9f
 14)     2024     112   __make_request+0x43e/0x4bb
 15)     1912     224   generic_make_request+0x424/0x471
 16)     1688      80   submit_bio+0x108/0x115
 17)     1608      48   submit_bh+0xfc/0x11e
 18)     1560     112   __block_write_full_page+0x1ee/0x2e8
 19)     1448      80   block_write_full_page_endio+0xff/0x10e
 20)     1368      16   block_write_full_page+0x15/0x17
 21)     1352      16   blkdev_writepage+0x18/0x1a
 22)     1336      32   __writepage+0x1a/0x40
 23)     1304     304   write_cache_pages+0x241/0x3c1
 24)     1000      16   generic_writepages+0x27/0x29
[...]
 30)      424      64   bdi_writeback_task+0x3f/0xb0
 31)      360      48   bdi_start_fn+0x76/0xd7
 32)      312     128   kthread+0x7f/0x87
 33)      184     184   child_rip+0xa/0x20

Not only does this give you the size of the maximum stack found, it also shows the breakdown of the stack sizes used by each function. Notice that write_cache_pages had the biggest stack with 304 bytes being used, followed by generic_make_request with 224 bytes of stack.

To reset the maximum, echo "0" into the stack_max_size file.

[tracing]# echo 0 > stack_max_size

Keeping this running for a while will show where the kernel is using a bit too much stack. But remember that the stack tracer only has no overhead when it is not enabled. When it is running you may notice a bit of a performance degradation.

Note that the stack tracer will not trace the max stack size when the kernel is using a separate stack. Because interrupts have their own stack, it will not trace the stack usage there. The reason is that currently there is no easy way to quickly see what the top of the stack is when the stack is something other than the current task's stack. When using split stacks, a process stack may be two pages but the interrupt stack may only be one. This may be fixed in the future, but keep this in mind when using the stack tracer.

Function filtering

Running the function tracer can be overwhelming. The amount of data may be vast, and very hard to get a hold of by the human brain. Ftrace provides a way to limit what functions you see. Two files exist that let you limit what functions are traced:

set_ftrace_filter
set_ftrace_notrace

These filtering features depend on the CONFIG_DYNAMIC_FTRACE option. As explained in the previous articles, when this configuration is enabled all of the mcount caller locations are stored and at boot time are converted into NOPs. These locations are saved and used to enable tracing when the function tracer is activated. But this also has a nice side effect: not all functions must be enabled. The above files determine which functions get enabled and which do not.

When any function is listed in the set_ftrace_filter, only those functions will be traced. This will help the performance of the system when the trace is active. Tracing every function incurs a large overhead, but when using the set_ftrace_filter, only those functions listed in that file will have the NOPs changed to call the tracer. Depending on which functions are being traced, just having a couple of hundred functions enabled is hardly noticeable.

The set_ftrace_notrace file is the opposite of set_ftrace_filter. Instead of limiting the trace to a set of functions, functions listed in set_ftrace_notrace will not be traced. Some functions show up quite often and not only does tracing these functions slow down the system, they can fill up the trace buffer and make it harder to analyze the functions you care about. Functions such as rcu_read_lock() and spin_lock() fall into this category.
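
For example, to keep the lock functions mentioned above out of the trace while still tracing everything else (a sketch; only functions present in available_filter_functions can be filtered this way):

[tracing]# echo '*spin_lock*' > set_ftrace_notrace
[tracing]# echo function > current_tracer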

Wildcard filters

A list of functions that can be added to the filter files is shown in the available_filter_functions file. This list of functions was derived from the list of stored mcount callers previously mentioned.

[tracing]# cat available_filter_functions | head -8
_stext
do_one_initcall
run_init_process
init_post
name_to_dev_t
create_dev
T.627
set_personality_64bit

You can grep this file and redirect the result into one of the filter files:

[tracing]# grep sched available_filter_functions > set_ftrace_filter
[tracing]# cat set_ftrace_filter | head -8
save_stack_address_nosched
mce_schedule_work
smp_reschedule_interrupt
native_smp_send_reschedule
sys32_sched_rr_get_interval
sched_avg_update
proc_sched_set_task
sys_sched_get_priority_max

Unfortunately, adding lots of functions to the filtering files is slow and you will notice that the above grep took several seconds to execute. This is because each function name written into the filter file will be processed individually. The above grep produces over 300 function names. Each of those 300 names will be compared (using strcmp()) against every function name in the kernel, which is quite a lot.

[tracing]# wc -l  available_filter_functions 
24331 available_filter_functions

So the grep above caused set_ftrace_filter to generate over 300 * 24331 (7,299,300) comparisons!

Fortunately, these files also take wildcards; the following glob expressions are valid:

  • value* - Select all functions that begin with value.
  • *value* - Select all functions that contain the text value.
  • *value - Select all functions that end with value.

The kernel contains a rather simple parser, and will not process value*value in the expected way. It will ignore the second value and select all functions that start with value regardless of what it ends with. Wildcards passed to the filter files are processed directly for each available function, which is much faster than passing in individual functions in a list.
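
For example, the grep-based sched filter above can be expressed as a single wildcard, which is processed far more quickly (a sketch):

[tracing]# echo '*sched*' > set_ftrace_filter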

Tracing functions in a specific kernel module

The filters can also select only those functions that belong to a specific module by using the 'mod' command in the input to the filter file:

[tracing]# echo ':mod:tg3' > set_ftrace_filter
[tracing]# cat set_ftrace_filter |head -8
tg3_write32
tg3_read32
tg3_write_flush_reg32
tw32_mailbox_flush
tg3_write32_tx_mbox
tg3_read32_mbox_5906
tg3_write32_mbox_5906
tg3_disable_ints

This is very useful if you are debugging a single module, and only want to see the functions that belong to that module in the trace.

Trigger

In the earlier articles, enabling and disabling recording to the ring buffer was done using the tracing_on file and the tracing_on() and tracing_off() kernel functions. But if you do not want to recompile the kernel, and you want to stop the tracing at a particular function, set_ftrace_filter has a method to do so.
The format of the command to have the function tracer enable or disable the ring buffer is as follows:

function:command[:count]

This will execute the command at the start of the function. The command is either traceon or traceoff, and an optional count can be added to have the command only execute a given number of times. If the count is left off (including the leading colon) then the command will be executed every time the function is called.
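
For example, to stop recording the ring buffer the first time schedule() is hit (a sketch):

[tracing]# echo 'schedule:traceoff:1' > set_ftrace_filter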

A while back, I was debugging a change to the kernel I made that was causing a segmentation fault to some programs. I was having a hard time catching the trace, because by the time I was able to stop the trace after seeing the segmentation fault, the data had already been overwritten. But the backtrace on the console showed that the function __bad_area_nosemaphore was being called. I was then able to stop the tracer with the following command:

[tracing]# echo '__bad_area_nosemaphore:traceoff' > set_ftrace_filter
[tracing]# cat set_ftrace_filter
#### all functions enabled ####
__bad_area_nosemaphore:traceoff:unlimited
[tracing]# echo function > current_tracer

Notice that functions with commands do not affect the general filters. Even though a command has been added to __bad_area_nosemaphore, the filter still allowed all functions to be traced. Commands and filter functions are separate and do not affect each other. With the above command attached to the function __bad_area_nosemaphore, the next time the segmentation fault occurred, the trace stopped and contained the data I needed to debug the situation.

Removing functions from the filters

As stated earlier, echoing in nothing with '>' will clear the filter file. But what if you only want to remove a few functions from the filter?

[tracing]# cat set_ftrace_filter > /tmp/filter
[tracing]# grep -v lock /tmp/filter > set_ftrace_filter

The above works, but as mentioned, it may take a while to complete if there were several functions already in set_ftrace_filter. The following does the same thing but is much faster:

[tracing]# echo '!*lock*' >> set_ftrace_filter

The '!' symbol will remove functions listed in the filter file. As shown above, the '!' works with wildcards, but could also be used with a single function. Since '!' has special meaning in bash it must be wrapped with single quotes or bash will try to execute what follows it. Also note the '>>' is used. If you make the mistake of using '>' you will end up with no functions in the filter file.

Because the commands and filters do not interfere with each other, clearing the set_ftrace_filter will not clear the commands. The commands must be cleared with the '!' symbol.

[tracing]# echo 'sched*' > set_ftrace_filter
[tracing]# echo 'schedule:traceoff' >> set_ftrace_filter
[tracing]# cat set_ftrace_filter | tail -5
schedule_console_callback
schedule_bh
schedule_iso_resource
schedule_reallocations
schedule:traceoff:unlimited
[tracing]# echo > set_ftrace_filter
[tracing]# cat set_ftrace_filter
#### all functions enabled ####
schedule:traceoff:unlimited
[tracing]# echo '!schedule:traceoff' >> set_ftrace_filter
[tracing]# cat set_ftrace_filter
#### all functions enabled ####
[tracing]#

This may seem awkward, but having the '>' and '>>' only affect the functions to be traced and not the function commands, actually simplifies the control between filtering functions and adding and removing commands.

Tracing a specific process

Perhaps you only need to trace a specific process, or set of processes. The file set_ftrace_pid lets you specify specific processes that you want to trace. To just trace the current thread you can do the following:

[tracing]# echo $$ > set_ftrace_pid

The above will set the function tracer to only trace the bash shell that executed the echo command. If you want to trace a specific process, you can create a shell script wrapper program.

[tracing]# cat ~/bin/ftrace-me
#!/bin/sh
DEBUGFS=`grep debugfs /proc/mounts | awk '{ print $2; }'`
echo $$ > $DEBUGFS/tracing/set_ftrace_pid
echo function > $DEBUGFS/tracing/current_tracer
exec $*
[tracing]# ~/bin/ftrace-me ls -ltr

Note, you must clear the set_ftrace_pid file if you want to go back to generic function tracing after performing the above.

[tracing]# echo -1 > set_ftrace_pid

What calls a specific function?

Sometimes it is useful to know what is calling a particular function. The immediate predecessor is helpful, but an entire backtrace is even better. The function tracer contains an option that will create a backtrace in the ring buffer for every function that is called by the tracer. Since creating a backtrace for every function has a large overhead, which could live lock the system, care must be taken when using this feature. Imagine the timer interrupt on a slower system where it is run at 1000 HZ. It is quite possible that having every function that the timer interrupt calls produce a backtrace could take 1 millisecond to complete. By the time the timer interrupt returns, a new one will be triggered before any other work can be done, which leads to a live lock.

To use the function tracer backtrace feature, it is imperative that the functions being called are limited by the function filters. The option to enable the function backtracing is unique to the function tracer and activating it can only be done when the function tracer is enabled. This means you must first enable the function tracer before you have access to the option:

[tracing]# echo kfree > set_ftrace_filter
[tracing]# cat set_ftrace_filter
kfree
[tracing]# echo function > current_tracer
[tracing]# echo 1 > options/func_stack_trace
[tracing]# cat trace | tail -8
 => sys32_execve
 => ia32_ptregs_common
             cat-6829  [000] 1867248.965100: kfree <-free_bprm
             cat-6829  [000] 1867248.965100: <stack trace>
 => free_bprm
 => compat_do_execve
 => sys32_execve
 => ia32_ptregs_common
[tracing]# echo 0 > options/func_stack_trace
[tracing]# echo > set_ftrace_filter

Notice that I was careful to cat the set_ftrace_filter before enabling the func_stack_trace option to ensure that the filter was enabled. At the end, I disabled the options/func_stack_trace before disabling the filter. Also note that the option is non-volatile, that is, even if you enable another tracer plugin in current_tracer, the option will still be enabled if you re-enable the function tracer.

The function_graph tracer

The function tracer is very powerful, but it may be difficult to understand the linear format that it produces. Frederic Weisbecker has extended the function tracer into the function_graph tracer. The function_graph tracer piggy-backs off of most of the code created by the function tracer, but adds its own hook in the mcount call. Because it still uses the mcount calling methods most of the function filtering explained above also applies to the function_graph tracer, with the exception of the traceon/traceoff commands and set_ftrace_pid (although the latter may change in the future).
The function_graph tracer was also explained in the previous articles, but the set_graph_function file was not described. The func_stack_trace used in the previous section can see what might call a function, but set_graph_function can be used to see what a function calls:

[tracing]# echo kfree > set_graph_function
[tracing]# echo function_graph > current_tracer
[tracing]# cat trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 0)               |  kfree() {
 0)               |    virt_to_cache() {
 0)               |      virt_to_head_page() {
 0)   0.955 us    |        __phys_addr();
 0)   2.643 us    |      }
 0)   4.299 us    |    }
 0)   0.855 us    |    __cache_free();
 0)   ==========> |
 0)               |    smp_apic_timer_interrupt() {
 0)               |      apic_write() {
 0)   0.849 us    |        native_apic_mem_write();
 0)   2.853 us    |      }
[tracing]# echo > set_graph_function

This displays the call graph performed only by kfree(). The "==========>" shows that an interrupt happened during the call. The trace records all functions within the kfree() block, even those functions called by an interrupt that triggered while in the scope of kfree().

The function_graph tracer shows the time a function took in the duration field. In the previous articles, it was mentioned that only the leaf functions, the ones that do not call other functions, have an accurate duration, since the duration of parent functions also includes the overhead of the function_graph tracer calling the child functions. By using the set_ftrace_filter file, you can force any function into becoming a leaf function in the function_graph tracer, and this will allow you to see an accurate duration of that function.

[tracing]# echo smp_apic_timer_interrupt > set_ftrace_filter
[tracing]# echo function_graph > current_tracer
[tracing]# cat trace | head
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 1)   ==========> |
 1) + 16.433 us   |  smp_apic_timer_interrupt();
 1)   ==========> |
 1) + 25.897 us   |  smp_apic_timer_interrupt();
 1)   ==========> |
 1) + 24.764 us   |  smp_apic_timer_interrupt();

The above shows that the timer interrupt takes between 16 and 26 microseconds to complete.

Function profiling

oprofile and perf are very powerful profiling tools that take periodic samples of the system and can show where most of the time is spent. With the function profiler, it is possible to take a good look at the actual function execution and not just samples. If CONFIG_FUNCTION_GRAPH_TRACER is configured in the kernel, the function profiler will use the function graph infrastructure to record how long a function has been executing. If just CONFIG_FUNCTION_TRACER is configured, the function profiler will just count the functions being called.

[tracing]# echo nop > current_tracer
[tracing]# echo 1 > function_profile_enabled
[tracing]# cat trace_stat/function0 | head
  Function                               Hit    Time            Avg
  --------                               ---    ----            ---
  schedule                             22943    1994458706 us     86931.03 us 
  poll_schedule_timeout                 8683    1429165515 us     164593.5 us 
  schedule_hrtimeout_range              8638    1429155793 us     165449.8 us 
  sys_poll                             12366    875206110 us     70775.19 us 
  do_sys_poll                          12367    875136511 us     70763.84 us 
  compat_sys_select                     3395    527531945 us     155384.9 us 
  compat_core_sys_select                3395    527503300 us     155376.5 us 
  do_select                             3395    527477553 us     155368.9 us 

The above also includes the times that a function has been preempted or schedule() was called and the task was swapped out. This may seem useless, but it does give an idea of what functions get preempted often. Ftrace also includes options that allow you to have the function graph tracer ignore the time the task was scheduled out.

[tracing]# echo 0 > options/sleep-time
[tracing]# echo 0 > function_profile_enabled
[tracing]# echo 1 > function_profile_enabled
[tracing]# cat trace_stat/function0  | head
  Function                               Hit    Time            Avg
  --------                               ---    ----            ---
  default_idle                          2493    6763414 us     2712.962 us  
  native_safe_halt                      2492    6760641 us     2712.938 us  
  sys_poll                              4723    714243.6 us     151.226 us   
  do_sys_poll                           4723    692887.4 us     146.704 us   
  sys_read                              9211    460896.3 us     50.037 us    
  vfs_read                              9243    434521.2 us     47.010 us    
  smp_apic_timer_interrupt              3940    275747.4 us     69.986 us    
  sock_poll                            80613    268743.2 us     3.333 us    

Note that the sleep-time option contains a "-" and is not sleep_time.

Disabling the function profiler and then re-enabling it causes the numbers to reset. The list is sorted by the Avg times, but using scripts you can easily sort by any of the numbers. The trace_stat/function0 only represents CPU 0. There exists a trace_stat/function# for each CPU on the system. All functions that are traced and have been hit are in this file.
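
For example, to sort CPU 0's profile by hit count instead of average time (a sketch; the first two lines of the file are headers and whitespace-separated column 2 is Hit):

[tracing]# tail -n +3 trace_stat/function0 | sort -k2 -nr | head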

[tracing]# cat trace_stat/function0  | wc -l
2978

Functions that were not hit are not listed. The above shows that 2978 functions were hit since I started the profiling.

Another option that affects profiling is graph-time (again with a "-"). By default it is enabled. When enabled, the times for a function include the times of all the functions that were called within the function. As you can see from the output in the above examples, several system calls are listed with the highest average. When disabled, the times only include the execution of the function itself, and do not contain the times of functions that are called from the function:

[tracing]# echo 0 > options/graph-time
[tracing]# echo 0 > function_profile_enabled
[tracing]# echo 1 > function_profile_enabled
[tracing]# cat trace_stat/function0  | head
  Function                               Hit    Time            Avg
  --------                               ---    ----            ---
  mwait_idle                           10132    246835458 us    24361.96  us 
  tg_shares_up                        154467    389883.5 us     2.524 us     
  _raw_spin_lock_irqsave              343012    263504.3 us     0.768 us     
  _raw_spin_unlock_irqrestore         351269    175205.6 us     0.498 us    
  walk_tg_tree                         14087    126078.4 us     8.949 us    
  __set_se_shares                     274937    88436.65 us     0.321 us    
  _raw_spin_lock                      100715    82692.61 us     0.821 us    
  kstat_irqs_cpu                      257500    80124.96 us     0.311 us    

Note that both sleep-time and graph-time also affect the duration times displayed by the function_graph tracer.

Kprobe-based Event Tracer