千里之行始于足下

CephFS debug方法

Posted on By Peter Yang

Intro

本文介绍cephfs 常用debug方法。主要针对 kernel client (以下简称 kcephfs),环境是

centos 7.6, kernel version 3.10.0-957.el7

以及ceph-mds服务端组件的debug

kernel client

相关内核模块介绍

kernel默认包括的ceph kernel 模块如下

# find /lib/modules/3.10.0-957.27.2.el7.x86_64/kernel/ -name "*rbd*" -or -name "*ceph*"
/lib/modules/3.10.0-957.27.2.el7.x86_64/kernel/net/ceph/libceph.ko.xz
/lib/modules/3.10.0-957.27.2.el7.x86_64/kernel/drivers/block/rbd.ko.xz
/lib/modules/3.10.0-957.27.2.el7.x86_64/kernel/fs/ceph/ceph.ko.xz

其中 rbd 是krbd的主要模块, libceph 是网络通信的主要模块,而 ceph 是 kcephfs的主要模块。 本文主要关注 ceph 这个模块。

内核模块在一定程度上可以动态加载生效。因此,当某个版本 kernel 不能满足要求时,可以手动编译并替代原生内核模块来解决。

需要注意的是,内核模块与内核版本强相关,必须要拿到版本完全一致的内核代码才能修改和编译。

如何编译内核模块不在本文讨论范围内。

如何debug kernel

不管什么debug,最主要的是源码和log。

内核源码

获取 centos 对应版本内核代码,参考 github repo

下面以 centos7.6.1810 对应的 repo 为例。

下载源码

git clone https://github.com/kernerist/RHEL_7.6.1810_3.10.0-957.27.2.el7.git

kcephfs相关的代码在 fs/ceph 下。

debug log

kcephfs 可以按 kernel debug 的标准方式打开 debug log。

kernel log level 定义

#define KERN_EMERG 0    /*紧急事件消息,系统崩溃之前提示,表示系统不可用*/
#define KERN_ALERT 1    /*报告消息,表示必须立即采取措施*/
#define KERN_CRIT 2     /*临界条件,通常涉及严重的硬件或软件操作失败*/
#define KERN_ERR 3      /*错误条件,驱动程序常用KERN_ERR来报告硬件的错误*/
#define KERN_WARNING 4  /*警告条件,对可能出现问题的情况进行警告*/
#define KERN_NOTICE 5   /*正常但又重要的条件,用于提醒。常用于与安全相关的消息*/
#define KERN_INFO 6     /*提示信息,如驱动程序启动时,打印硬件信息*/
#define KERN_DEBUG 7   /*调试级别的消息*/

具体命令如下:

查看当前debug log level

# cat /proc/sys/kernel/printk
	4       4       1       7
	current	default	minimum	boot-time-default

其中 current 是指当前的 console log level,我们主要关注这个。

修改debug log level

# echo 8 > /proc/sys/kernel/printk
# cat /proc/sys/kernel/printk
8 4 1 7

打开 dynamic debug log

上面的修改方法只能打开 KERN_INFO 以上的log,对于 KERN_DEBUG 或者通过pr_debug()语句打印的log,我们需要另一种方法来打开,即Dynamic debug

前提条件

需要确保内核开启了 CONFIG_DYNAMIC_DEBUG=y,检查方法如下

# sudo cat /boot/config-`uname -r` | grep DYNAMIC_DEBUG

再查看下是否挂载了debugfs,没有的话尝试挂载

mount -o rw.remount -t debugfs none /sys/kernel/debug/
开启dynamic debug

dynmaic debug 支持非常丰富的log 过滤选项,可按照模块、文件、代码行等粒度选择要开启的debug log。 详细参考Dynamic debug

下面仅就开启ceph内核模块为例,命令格式如下

echo "module ceph [+-=][pflmt_]" >/sys/kernel/debug/dynamic_debug/control

其中 +-=含义如下

-    remove the given flags
+    add the given flags
=    set the flags to the given flags

pflmt_含义如下

p    enables the pr_debug() callsite.
f    Include the function name in the printed message
l    Include line number in the printed message
m    Include module name in the printed message
t    Include thread ID in messages not generated from interrupt context
_    No flags are set. (Or'd with others on input)

具体示例如下

# // 开启 ceph 模块的函数名,行号,内核线程名
# echo "module ceph +pflt" >/sys/kernel/debug/dynamic_debug/control

dynamic debug影响系统性能,切记debug完要关闭!

查看debug log

kernel log 输出有几种方式,最简单的是使用 dmesg

# dmesg -w | tee debug.log

示例如下

[626187.960015] __ceph_do_getattr:2233: ceph:  do_getattr result=0
[626189.846853] delayed_work:3535: ceph:  mdsc delayed_work
[626189.853734] ceph_check_delayed_caps:3938: ceph:  check_delayed_caps
[626189.857563] __ceph_lookup_mds_session:434: ceph:  lookup_mds_session ffff9af013abb000 1
[626189.861862] get_session:403: ceph:  mdsc get_session ffff9af013abb000 1 -> 2
[626189.865894] get_session:403: ceph:  mdsc get_session ffff9af013abb000 2 -> 3
[626189.869876] con_get:4039: ceph:  mdsc con_get ffff9af013abb000 ok (3)
[626189.873709] ceph_put_mds_session:414: ceph:  mdsc put_session ffff9af013abb000 3 -> 2
[626189.878029] con_put:4050: ceph:  mdsc con_put ffff9af013abb000 (1)
[626189.881837] ceph_put_mds_session:414: ceph:  mdsc put_session ffff9af013abb000 2 -> 1
[626190.093320] get_session:403: ceph:  mdsc get_session ffff9af013abb000 1 -> 2
[626190.100630] con_get:4039: ceph:  mdsc con_get ffff9af013abb000 ok (2)
[626190.104545] con_put:4050: ceph:  mdsc con_put ffff9af013abb000 (1)
[626190.108318] ceph_put_mds_session:414: ceph:  mdsc put_session ffff9af013abb000 2 -> 1

抓取到需要的log后,打开debug.log分析即可

debugfs

除了kernel debug log之外,ceph自身还提供一些常用的metric供分析问题,类似于server端的admin socket

位置及内容如下


#  ~ ls -l /sys/kernel/debug/ceph 
总用量 0
drwxr-xr-x 2 root root 0 Apr 28 23:25 c6eed356-0973-4ec1-8adb-6c76dacf3178.client4464
drwxr-xr-x 2 root root 0 Apr 28 23:33 c6eed356-0973-4ec1-8adb-6c76dacf3178.client4473

#  ~ ls -l /sys/kernel/debug/ceph/c6eed356-0973-4ec1-8adb-6c76dacf3178.client4464 
总用量 0
lrwxrwxrwx 1 root root 0 Apr 28 23:25 bdi -> ../../bdi/ceph-12
-r-------- 1 root root 0 Apr 28 23:25 caps
-r-------- 1 root root 0 Apr 28 23:25 client_options
-r-------- 1 root root 0 Apr 28 23:25 dentry_lru
-r-------- 1 root root 0 Apr 28 23:25 mdsc
-r-------- 1 root root 0 Apr 28 23:25 mdsmap
-r-------- 1 root root 0 Apr 28 23:25 mds_sessions
-r-------- 1 root root 0 Apr 28 23:25 monc
-r-------- 1 root root 0 Apr 28 23:25 monmap
-r-------- 1 root root 0 Apr 28 23:25 osdc
-r-------- 1 root root 0 Apr 28 23:25 osdmap
-rw------- 1 root root 0 Apr 28 23:25 writeback_congestion_kb

如上所示,在/sys/kernel/debug/ceph/下,每个client连接建立一个子目录。怎么知道是哪个client呢?有一个方法是通过mds的admin socket,查到连接到该mds的clientid以及对应的ip。示例如下

# ceph daemon /var/run/ceph/bj.kunlun.test.cluster2-mds.bj-cluster2-node4-test.asok session ls
[
    {
        "id": 4464,  // 4464 就是客户端的client id
        "num_leases": 0,
        "num_caps": 5,
        "state": "open",
        "request_load_avg": 0,
        "uptime": 51693.690616,
        "replay_requests": 0,
        "completed_requests": 1,
        "reconnecting": false,
        "inst": "client.4464 <ip>:0/2210658890", // <ip> 就是客户端的ip
        "client_metadata": {
            "entity_id": "admin",
            "hostname": "localhost",
            "kernel_version": "3.10.0-957.27.2.el7.x86_64",
            "root": "/"
        }
    }
]

解决了客户端定位问题之后,我们再来看看详细的指标含义,如下

bdi: BDI info about the Ceph system (blocks dirtied, written, etc)

caps: counts of file “caps” structures in-memory and used

client_options: dumps the options provided to the CephFS mount

dentry_lru: Dumps the CephFS dentries currently in-memory

mdsc: Dumps current requests to the MDS

mdsmap: Dumps the current MDSMap epoch and MDSes

mds_sessions: Dumps the current sessions to MDSes

monc: Dumps the current maps from the monitor, and any “subscriptions” held

monmap: Dumps the current monitor map epoch and monitors

osdc: Dumps the current ops in-flight to OSDs (ie, file data IO)

osdmap: Dumps the current OSDMap epoch, pools, and OSDs

ceph-mds debug

ceph-mds 可以能过ceph的admin socket机制,实时打开debug选项,或者查看内存中主要的数据结构。

slow request 为例,可以查看

# ceph daemon /var/run/ceph/<mds>.asok dump_ops_in_flight
{
    "ops": [],
    "num_ops": 0
}

其它选项,可以通过help自行研究,还是比较全面的。

ceph daemon /var/run/ceph/<mds>.asok help
{
    "cache status": "show cache status",
    "config diff": "dump diff of current config and default config",
    "config diff get": "dump diff get <field>: dump diff of current and default config setting <field>",
    "config get": "config get <field>: get the config value",
    "config help": "get config setting schema and descriptions",
    "config set": "config set <field> <val> [<val> ...]: set a config variable",
    "config show": "dump current config settings",
    "dirfrag ls": "List fragments in directory",
    "dirfrag merge": "De-fragment directory by path",
    "dirfrag split": "Fragment directory by path",
    "dump cache": "dump metadata cache (optionally to a file)",
    "dump loads": "dump metadata loads",
    "dump tree": "dump metadata cache for subtree",
    "dump_blocked_ops": "show the blocked ops currently in flight",
    "dump_historic_ops": "show slowest recent ops",
    "dump_historic_ops_by_duration": "show slowest recent ops, sorted by op duration",
    "dump_mempools": "get mempool stats",
    "dump_ops_in_flight": "show the ops currently in flight",
    "export dir": "migrate a subtree to named MDS",
    "flush journal": "Flush the journal to the backing store",
    "flush_path": "flush an inode (and its dirfrags)",
    "force_readonly": "Force MDS to read-only mode",
    "get subtrees": "Return the subtree map",
    "get_command_descriptions": "list available commands",
    "git_version": "get git sha1",
    "help": "list available commands",
    "log dump": "dump recent log entries to log file",
    "log flush": "flush log entries to log file",
    "log reopen": "reopen log file",
    "objecter_requests": "show in-progress osd requests",
    "ops": "show the ops currently in flight",
    "osdmap barrier": "Wait until the MDS has this OSD map epoch",
    "perf dump": "dump perfcounters value",
    "perf histogram dump": "dump perf histogram values",
    "perf histogram schema": "dump perf histogram schema",
    "perf reset": "perf reset <name>: perf reset all or one perfcounter name",
    "perf schema": "dump perfcounters schema",
    "scrub_path": "scrub an inode and output results",
    "session evict": "Evict a CephFS client",
    "session ls": "Enumerate connected CephFS clients",
    "status": "high-level status of MDS",
    "tag path": "Apply scrub tag recursively",
    "version": "get ceph version"
}

参考资料