BCC

BCC Python 开发者教程

本教程是关于使用 Python 界面开发BCC工具和程序。有两个部分：可观察性和网络。片段取自BCC的各种程序：请参阅它们的文件以获取许可证。

另请参阅 bcc 开发人员的 reference_guide.md 和工具最终用户的教程：tutorial.md。还有一个用于 bcc 的 lua 接口。

可观察性

本可观察性教程包含 17 节课程，列举了 46 个需要学习的内容。

Hello World

首先运行examples/hello_world.py，同时在另一个会话中运行一些命令（例如“ls”）。
它应该打印 “Hello, World!” 对于新流程。如果没有，请首先修复密件抄送：请参阅 INSTALL.md。

使用示例

# ./examples/hello_world.py
            bash-13364 [002] d... 24573433.052937: : Hello, World!
            bash-13364 [003] d... 24573436.642808: : Hello, World!
[...]

python代码

这是 hello_world.py 的代码：

from bcc import BPF
BPF(text='int kprobe__sys_clone(void *ctx) { bpf_trace_printk("Hello, World!\\n"); return 0; }').trace_print()

注释说明

注释版 - 前端部分：

from bcc import BPF
BPF(text='...')			# `text='...'` ：这定义了一个内联的 BPF 程序。该程序是用 C 编写的
	.trace_print()		# 读取trace_pipe并打印输出的bcc例程

注释版 - 内核部分：

/**
 * @functionName 通过 kprobes 进行内核动态跟踪的快捷方式。\
 *               如果 C 函数以 `kprobe__` 开头，则其余部分将被视为要检测的内核函数名称, 在本例中为 `sys_clone()`
 * @args ctx 有参数，但由于我们在这里不使用它们，因此我们将其转换为 `void *`
 */
int kprobe__sys_clone(void *ctx) {
    // 一个简单的内核工具，用于 printf() 到公共trace_pipe  (/sys/kernel/debug/tracing/trace_pipe)。
    // 这对于一些快速示例来说是可以的，但有限制：
    //     最多 3 个参数，仅限 1 %s，并且 trace_pipe 是全局共享的，因此并发程序将产生冲突的输出。
    // 更好的接口是通过 BPF_PERF_OUTPUT()，稍后介绍。
	bpf_trace_printk("Hello, World!\\n");
    // 必要的手续（如果你想知道原因，请参阅 [#139](https://github.com/iovisor/bcc/issues/139)）
	return 0;
}

sys_sync()

动手课：

编写一个跟踪 sys_sync() 内核函数的程序。要求：

运行时打印“sys_sync() called”。通过在跟踪时在另一个会话中运行 sync 进行测试。 hello_world.py 程序拥有您所需的一切。
通过打印“Tracing sys_sync()... Ctrl-C to end”来改进它。
当程序第一次启动时。提示：这只是Python。

hello_fields.py

使用示例

该程序位于 examples/tracing/hello_fields.py 中。示例输出（在另一个会话中运行命令）

分别打印了时间、命令?、进程id、信息

# ./examples/tracing/hello_fields.py
TIME(s)            COMM             PID    MESSAGE
24585001.174885999 sshd             1432   Hello, World!
24585001.195710000 sshd             15780  Hello, World!
24585001.991976000 systemd-udevd    484    Hello, World!
24585002.276147000 bash             15787  Hello, World!

python代码

from bcc import BPF
from bcc.utils import printb

# define BPF program
prog = """
int hello(void *ctx) {
    bpf_trace_printk("Hello, World!\\n");
    return 0;
}
"""

# load BPF program
b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "MESSAGE"))

# format output
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        continue
    except KeyboardInterrupt:
        exit()
    printb(b"%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))

注释说明

注释版 - 前端部分：

from bcc import BPF
from bcc.utils import printb

# 定义BPF程序
# 这次我们将C程序声明为变量，稍后再引用它。如果您想根据命令行参数添加一些字符串替换，这非常有用。
prog = '...'

# 加载BPF程序
b = BPF(text=prog)
# attack_kprobe 为内核克隆系统调用函数创建一个kprobe，它将执行我们定义的hello()函数。
# 您可以多次调用 attach_kprobe()，并将 C 函数附加到多个内核函数。
b.attach_kprobe(
    event=b.get_syscall_fnname("clone"),
    fn_name="hello"
)

# 头部打印：时间、命令、进程id、信息
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "MESSAGE"))

# 格式化输出
while 1:
    try:
        # b.trace_fields()：从trace_pipe 返回一组固定的字段
        # 其与 trace_print() 类似，这对于黑客攻击来说很方便，但对于真正的工具，我们应该切换到BPF_PERF_OUTPUT()
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        continue
    except KeyboardInterrupt:
        exit()
    printb(b"%-18.9f %-16s %-6d %s" % (ts, task, pid, msg)) # 格式化输出

注释版 - 内核部分：

/**
 * 现在我们只是声明一个 C 函数，而不是 `kprobe__` 快捷方式。我们稍后会提到这一点。
 *
 * - 探测器程序：BPF 程序中声明的所有 C 函数都希望在探测器上执行，因此它们都需要将 `pt_reg* ctx` 作为第一个参数。
 * - 非探测程序：如果您需要定义一些不会在探测器上执行的辅助函数，则需要将它们定义为 `static inline` 以便编译器内联。有时您还需要为其添加 `_always_inline` 函数属性。
 */
int hello(void *ctx) {
    bpf_trace_printk("Hello, World!\\n");
    return 0;
}

sync_timing.py

使用示例

背景：
还记得系统管理员在 reboot 之前在慢速控制台上键入 sync 三次以给出第一个异步同步完成时间的日子吗？然后有人认为 sync;sync;sync 很聪明，将它们全部运行在一条线上，尽管违背了最初的目的，但这成为了行业惯例！然后sync变成了同步，所以更多的原因是它是愚蠢的。
上面这机翻有点怪，可能这里也需要一些背景知识，补充一下：
sync命令：sync 是一个Linux和类Unix系统中的命令，用于将数据从内存缓冲区同步到硬盘上。这样，即使在突然断电的情况下，操作系统也能尽可能保证数据的完整性。
慢速控制台：在这里，较慢的控制台指的是以前计算机硬件性能相对较差的时代
重复输入是次：在一个较慢的控制台上输入sync命令三次，然后再输入reboot。这样做是为了确保第一个异步的同步操作有足够的时间完成。
sync同步执行：最初，sync 命令是异步执行的，意味着在输入sync命令后，系统将继续执行其他任务，同时让数据从内存缓冲区同步到硬盘。但随着技术发展，sync命令变成了同步执行，即在同步完成之前，系统不会执行其他任务。因此，现如今，在执行reboot之前，不再需要多次输入sync命令来确保同步完成，输入一次就足够了。
大致意思：随着时间的推移，计算机性能加强，sync命令变得可以同步执行，这使得这种多次输入的做法变得更加荒谬、过时了。

以下示例对 do_sync 函数的调用速度进行计时，如果最近调用该函数的时间超过一秒，则打印输出。
在另一个终端执行 sync;sync;sync，运行sync_timing.py的终端将打印第二次和第三次同步的输出：

$ ./examples/tracing/sync_timing.py
Tracing for quick sync's... Ctrl-C to end
At time 0.00 s: multiple syncs detected, last 95 ms ago
At time 0.10 s: multiple syncs detected, last 96 ms ago

# 笔记者：上面是他那里的demo的数据，我电脑这里更快
$ ./sync_timing.py
Tracing for quick sync's... Ctrl-C to end
At time 0.00 s: multiple syncs detected, last 2 ms ago
At time 0.00 s: multiple syncs detected, last 1 ms ago

python代码

该程序是 examples/tracing/sync_timing.py：

from __future__ import print_function
from bcc import BPF
from bcc.utils import printb

# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>

BPF_HASH(last);

int do_trace(struct pt_regs *ctx) {
    u64 ts, *tsp, delta, key = 0;

    // attempt to read stored timestamp
    tsp = last.lookup(&key);
    if (tsp != NULL) {
        delta = bpf_ktime_get_ns() - *tsp;
        if (delta < 1000000000) {
            // output if time is less than 1 second
            bpf_trace_printk("%d\\n", delta / 1000000);
        }
        last.delete(&key);
    }

    // update stored timestamp
    ts = bpf_ktime_get_ns();
    last.update(&key, &ts);
    return 0;
}
""")

b.attach_kprobe(event=b.get_syscall_fnname("sync"), fn_name="do_trace")
print("Tracing for quick sync's... Ctrl-C to end")

# format output
start = 0
while 1:
    try:
        (task, pid, cpu, flags, ts, ms) = b.trace_fields()
        if start == 0:
            start = ts
        ts = ts - start
        printb(b"At time %.2f s: multiple syncs detected, last %s ms ago" % (ts, ms))
    except KeyboardInterrupt:
        exit()

注释说明

python部分：

from __future__ import print_function
from bcc import BPF
from bcc.utils import printb

# 加载BPF程序
b = BPF(text="...")

b.attach_kprobe(event=b.get_syscall_fnname("sync"), fn_name="do_trace")
print("跟踪快速sync命令 ... 按Ctrl-C结束追踪")

# 格式化输出
start = 0
while 1:
    try:
        (task, pid, cpu, flags, ts, ms) = b.trace_fields()
        if start == 0:
            start = ts
        ts = ts - start
        printb(b"在时间 %.2f s: 检测到多个sync命令, 上一次在 %s ms 前" % (ts, ms))
    except KeyboardInterrupt:
        exit()

BPF C 部分：

#include <uapi/linux/ptrace.h>

BPF_HASH(last); // 创建一个 BPF 映射对象，它是一个哈希（关联数组），称为“last”。我们没有指定任何进一步的参数，因此它默认为 u64 的键和值类型

int do_trace(struct pt_regs *ctx) {
    u64 ts, *tsp, delta, key = 0; // key=0：我们只会在此哈希中存储一个键/值对，其中键被硬连线为零

    // 尝试读取存储的时间戳
    tsp = last.lookup(&key); // 在哈希中查找键，如果存在则返回指向其值的指针，否则返回 NULL。我们将密钥作为地址传递给指针
    if (tsp != NULL) { // 验证程序要求必须检查从映射查找派生的指针值是否为空值，然后才能取消引用和使用它们
        delta = bpf_ktime_get_ns() - *tsp;
        if (delta < 1000000000) {
            // 当时间小于1秒时输出。这里是为了打印三次连续sync命令的两次间隔时间
            bpf_trace_printk("%d\\n", delta / 1000000);
        }
        last.delete(&key); // 从哈希中删除密钥。由于 `.update()` 中的内核错误（在 4.8.10 中已修复），目前需要这样做
    }

    // 更新存储的时间戳
    ts = bpf_ktime_get_ns(); // 返回以纳秒为单位的时间
    last.update(&key, &ts); // 将第二个参数中的值与键关联，覆盖任何先前的值。这记录了时间戳
    return 0;
}

sync_count.py

实践课，要求：

修改sync_timing.py程序（之前的课程）以存储所有内核同步系统调用（快的和慢的）的计数，并将其与输出一起打印。
通过向现有哈希添加新的键索引，可以在 BPF 程序中记录此计数。

disksnoop.py

使用示例

浏览 examples/tracing/disksnoop.py 程序以查看新增内容。这是一些示例输出：

$ ./disksnoop.py
TIME(s)            T  BYTES    LAT(ms)
16458043.436012    W  4096        3.13
16458043.437326    W  4096        4.44
16458044.126545    R  4096       42.82
16458044.129872    R  4096        3.24
[...]

另外：我这里的环境有个BUG：运行该BCC程序出错，运行该路径下的其他python bcc程序则是正常的。这里另外提供一下这个bug的修复：

BUG报错：

$ python ./disksnoop.py
Traceback (most recent call last):
  File "./disksnoop.py", line 46, in <module>
    if BPF.get_kprobe_functions(b'blk_start_request'):
  File "/usr/lib/python3/dist-packages/bcc-0.28.0+18b00a90-py3.8.egg/bcc/__init__.py", line 698, in get_kprobe_functions
  File "/usr/lib/python3/dist-packages/bcc-0.28.0+18b00a90-py3.8.egg/bcc/__init__.py", line 694, in get_kprobe_functions
FileNotFoundError: [Errno 2] No such file or directory: '/sys/kernel/debug/kprobes/blacklist'

解决：

$ nano /etc/fstab # 确保您的系统已经加载了debugfs。可以通过在/etc/fstab文件中搜索debugfs来检查。如果没有，则需要添加以下行到/etc/fstab：
debugfs /sys/kernel/debug debugfs defaults 0 0
$ mount -a # 然后，加载debugfs

python片段

[...]
REQ_WRITE = 1		# from include/linux/blk_types.h

# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>
#include <linux/blk-mq.h>

BPF_HASH(start, struct request *);

void trace_start(struct pt_regs *ctx, struct request *req) {
	// stash start timestamp by request ptr
	u64 ts = bpf_ktime_get_ns();

	start.update(&req, &ts);
}

void trace_completion(struct pt_regs *ctx, struct request *req) {
	u64 *tsp, delta;

	tsp = start.lookup(&req);
	if (tsp != 0) {
		delta = bpf_ktime_get_ns() - *tsp;
		bpf_trace_printk("%d %x %d\\n", req->__data_len,
		    req->cmd_flags, delta / 1000);
		start.delete(&req);
	}
}
""")
if BPF.get_kprobe_functions(b'blk_start_request'):
        b.attach_kprobe(event="blk_start_request", fn_name="trace_start")
b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_start")
if BPF.get_kprobe_functions(b'__blk_account_io_done'):
    b.attach_kprobe(event="__blk_account_io_done", fn_name="trace_completion")
else:
    b.attach_kprobe(event="blk_account_io_done", fn_name="trace_completion")
[...]

注释说明

python代码

[...]
# 我们在 Python 程序中定义一个 `内核常量`，因为稍后我们将在那里使用它。
# 如果我们在 BPF 程序中使用 REQ_WRITE，它应该可以与适当的 `#includes` 一起工作（无需定义）。
REQ_WRITE = 1		# from include/linux/blk_types.h

# 加载BPF程序
b = BPF(text="...")
# 将BPF程序挂载到多个位置上
if BPF.get_kprobe_functions(b'blk_start_request'):
        b.attach_kprobe(event="blk_start_request", fn_name="trace_start")
b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_start")
if BPF.get_kprobe_functions(b'__blk_account_io_done'):
    b.attach_kprobe(event="__blk_account_io_done", fn_name="trace_completion")
else:
    b.attach_kprobe(event="blk_account_io_done", fn_name="trace_completion")
[...]

BPF C 代码

#include <uapi/linux/ptrace.h>
#include <linux/blk-mq.h>

BPF_HASH(start, struct request *);

/**
 * 此函数稍后将附加到 kprobes。 kprobe 函数的参数是 `struct pt_regs *ctx` ，用于寄存器和 BPF 上下文.
 *
 * @param 然后是函数的实际参数。我们将其附加到 blk_start_request()。其中第一个参数是 `struct request *`
 */
void trace_start(struct pt_regs *ctx, struct request *req) {
	// 通过请求PTR存储开始时间戳
	u64 ts = bpf_ktime_get_ns();
	/// 我们使用指向请求结构的指针作为哈希中的键。神奇吧，这在追踪中很常见
    /// 事实证明，指向结构的指针是很好的键，因为它们是唯一的：两个结构不能具有相同的指针地址（只要小心它何时被释放和重用即可）
    /// 
    /// 因此，我们真正要做的是用我们自己的时间戳标记描述磁盘 I/O  的请求结构，以便我们可以对其进行计时。
    /// 有两个常用键用于存储时间戳：指向结构的指针和线程 ID（用于计时函数条目返回）
	start.update(&req, &ts);
}

void trace_completion(struct pt_regs *ctx, struct request *req) {
	u64 *tsp, delta;

	tsp = start.lookup(&req);
	if (tsp != 0) {
		delta = bpf_ktime_get_ns() - *tsp;
		bpf_trace_printk(
            "%d %x %d\\n", 
            /// req->__data_len：我们取消引用 `struct request` 的成员。请参阅内核源代码中的定义以了解其中的成员。
            /// bcc 实际上将这些表达式重写为一系列 `bpf_probe_read_kernel()` 调用。有时 bcc 无法处理复杂的取消引用，您需要直接调用 `bpf_probe_read_kernel()`
            req->__data_len,
		    req->cmd_flags, 
            delta / 1000
        );
		start.delete(&req);
	}
}

这是一个非常有趣的程序，如果您能够理解所有代码，您就会理解许多重要的基础知识。

我们仍在使用 bpf_trace_printk() hack，下一节让我们解决这个问题，使用 BPF_PERF_OUTPUT() 接口。

hello_perf_output.py

使用示例

让我们最终停止使用 bpf_trace_printk() 并使用正确的 BPF_PERF_OUTPUT() 接口。这也意味着我们将停止获取免费的 trace_field() 成员（例如 PID 和时间戳），并且需要直接获取它们。在另一个会话中运行命令时的示例输出：

# ./hello_perf_output.py
TIME(s)            COMM             PID    MESSAGE
0.000000000        bash             22986  Hello, perf_output!
0.021080275        systemd-udevd    484    Hello, perf_output!
0.021359520        systemd-udevd    484    Hello, perf_output!
0.021590610        systemd-udevd    484    Hello, perf_output!
[...]

python片段

代码为 examples/tracing/hello_perf_output.py

from bcc import BPF

# define BPF program
prog = """
#include <linux/sched.h>

// define output data structure in C
struct data_t {
    u32 pid;
    u64 ts;
    char comm[TASK_COMM_LEN];
};
BPF_PERF_OUTPUT(events);

int hello(struct pt_regs *ctx) {
    struct data_t data = {};

    data.pid = bpf_get_current_pid_tgid();
    data.ts = bpf_ktime_get_ns();
    bpf_get_current_comm(&data.comm, sizeof(data.comm));

    events.perf_submit(ctx, &data, sizeof(data));

    return 0;
}
"""

# load BPF program
b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "MESSAGE"))

# process event
start = 0
def print_event(cpu, data, size):
    global start
    event = b["events"].event(data)
    if start == 0:
            start = event.ts
    time_s = (float(event.ts - start)) / 1000000000
    print("%-18.9f %-16s %-6d %s" % (time_s, event.comm, event.pid,
        "Hello, perf_output!"))

# loop with callback to print_event
b["events"].open_perf_buffer(print_event)
while 1:
    b.perf_buffer_poll()

注释说明

python代码

from bcc import BPF

# 定义BPF程序
prog = "..."

# 加载BPF程序
b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")

# 打印表格头
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "MESSAGE"))

# 程序事件
start = 0
def print_event(cpu, data, size): # 定义一个 Python 函数，用于处理来自 `events` 流的读取事件
    global start
    event = b["events"].event(data) # 现在以 Python 对象的形式获取事件，从 C 声明自动生成
    if start == 0:
    	start = event.ts
    time_s = (float(event.ts - start)) / 1000000000
    print("%-18.9f %-16s %-6d %s" % (time_s, event.comm, event.pid,
        "Hello, perf_output!"))

# 循环回调函数来打印事件
b["events"].open_perf_buffer(print_event) # 将 Python `print_event` 函数与 `events` 流关联
while 1:
    b.perf_buffer_poll() # 阻塞等待事件

BPF C 代码

#include <linux/sched.h>

// 用C定义输出数据结构。这定义了我们将用来将数据从内核传递到用户空间的 C 结构体
struct data_t {
    u32 pid;
    u64 ts;
    char comm[TASK_COMM_LEN];
};
BPF_PERF_OUTPUT(events); // 这将我们的输出通道命名为 `events`。这里相较于之前的代码相比，使用了新的更安全和可靠的输出接口

int hello(struct pt_regs *ctx) {
    struct data_t data = {}; // 创建一个空的 data_t 结构，然后我们将填充该结构

    /// 内核和用户上的 `ID` 和 `UID` 的概念有所不同
    ///     返回低32位的进程 ID（PID, 内核视图的PID = 用户空间中的线程ID），以及高32位的线程组ID（TGID，内核TGID = 用户空间的PID）
    ///     通过直接将其设置为u32，我们丢弃了高32位
    ///
    /// 您应该提供 PID 还是 TGID ?
    ///     对于多线程应用程序，TGID将是相同的，因此您需要PID来区分它们（如果您想要的话）。这也是最终用户的期望问题。
    data.pid = bpf_get_current_pid_tgid();
    data.ts = bpf_ktime_get_ns();
    bpf_get_current_comm(&data.comm, sizeof(data.comm)); // 用当前进程名称填充第一个参数地址

    events.perf_submit(ctx, &data, sizeof(data)); // 提交事件供用户空间通过 perf 环形缓冲区读取

    return 0;
}

disksnoop.py fixed

strlen_count.py

nodejs_http_server.py

task_switch.c

Further Study

联网

原文的这里为 TODO，应该不是没做，应该是还没写这部分的文档

链接到当前文件 0

没有文件链接到当前文件

BCC

BCC

目录

BCC Python 开发者教程

可观察性

Hello World

使用示例

python代码

注释说明

sys_sync()

hello_fields.py

使用示例

python代码

注释说明

sync_timing.py

使用示例

python代码

注释说明

sync_count.py

disksnoop.py

使用示例

python片段

注释说明

hello_perf_output.py

使用示例

python片段

注释说明

sync_perf_output.py

bitehist.py

disklatency.py

vfsreadlat.py

urandomread.py

disksnoop.py fixed

strlen_count.py

nodejs_http_server.py

task_switch.c

Further Study

联网