Extracting the crushmap from an osdmap

Objective

  • To verify the conjecture in the 大话RBD article about horizontally shifting the crushmap.
  • To see whether all PG and object mappings can be reproduced from a dead cluster.
  • This article extracts the whole cluster's CrushMap from any one of the many osdmaps stored on a single OSD, and uses it to reproduce every object mapping of the original cluster.
  • It also presents a simple way to export the crushmap.

Environment

To show that this experiment applies generally, you can perform it in the OSD directory of any cluster. Here I used data from a production cluster, since it is more representative.
The cluster looks like this:

[root@yd1st003 ~]# ceph -s
cluster 3727c106-0ac9-420d-99a9-4218ea4e099f
health HEALTH_OK
monmap e3: 3 mons at {ceph-1=233.233.233.231:6789/0,ceph-2=233.233.233.232:6789/0,ceph-3=233.233.233.233:6789/0}
election epoch 150, quorum 0,1,2 ceph-1,ceph-2,ceph-3
osdmap e5166: 20 osds: 20 up, 20 in
flags sortbitwise
pgmap v8337878: 1036 pgs, 4 pools, 3259 GB data, 549 kobjects
9721 GB used, 64663 GB / 74384 GB avail
1036 active+clean

This cluster's osdmap epoch is 5166, i.e. 5166 versions of the osdmap have been produced so far.

Procedure

First, a quick note on why osdmaps are generated. When a cluster is first created, its osdmap is e1, i.e. version 1. After that, whenever an OSD is added or removed, or any OSD transitions between the states [in | up | down | out], the epoch is incremented (usually by a single-digit step, at least 1) to record the change, and a file holding the new version is written under current/meta in the OSD data directory. Since the current epoch is 5166, we look in the osd.0 directory for an osdmap file that is not too recent; here we take epoch 5000.
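
If you want to watch the epoch advance on a live cluster, one way is to flap an OSD and dump the map before and after. A sketch only, for a test cluster that can tolerate a brief flap of osd.0:

# print the current osdmap epoch
ceph osd dump | grep '^epoch'
# mark osd.0 down; the monitors commit a new osdmap
ceph osd down 0
# the epoch has advanced by at least 1
ceph osd dump | grep '^epoch'

Back to the experiment: we search osd.0's meta directory for epoch 5000.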

[root@yd1st003 ~]# cd /var/lib/ceph/osd/ceph-0/current/meta/
[root@yd1st003 meta]# find . |grep 5000
./DIR_7/DIR_D/inc\uosdmap.5000__0_A66D13D7__none
./DIR_8/DIR_6/osdmap.5000__0_0A038C68__none

Of course, if you run find . in the meta directory you will turn up a pile of similarly named files, osdmap.NUM__..., where NUM is the osdmap epoch. I chose 5000 because I am pretending this cluster is already dead and all I have left is the osd.0 disk; if epoch 5000 works, the files for other epochs near 5000 will work just as well.
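
As an aside, here is one way to list every full-map epoch an OSD still holds, oldest first; a sketch that assumes the default filestore layout under /var/lib/ceph/osd/ceph-0:

# extract the NUM part of every osdmap.NUM__... filename and sort it
find /var/lib/ceph/osd/ceph-0/current/meta -name 'osdmap.*' \
    | sed 's/.*osdmap\.\([0-9]*\)__.*/\1/' | sort -n | head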

Copy this file out for further processing, and take a look at its contents:

[root@yd1st003 meta]# cp ./DIR_8/DIR_6/osdmap.5000__0_0A038C68__none /root/osdmap
[root@yd1st003 ~]# hexdump -Cv /root/osdmap
00000000 08 07 62 71 00 00 03 01 ac 47 00 00 37 27 c1 06 |..bq.....G..7'..|
00000010 0a c9 42 0d 99 a9 42 18 ea 4e 09 9f 88 13 00 00 |..B...B..N......|
00000020 4b 03 78 57 a1 97 c6 2c fe 66 fd 57 73 cb 99 28 |K.xW...,.f.Ws..(|
00000030 04 00 00 00 01 00 00 00 00 00 00 00 18 05 e5 00 |................|
...
000003f0 76 6f 6c 75 6d 65 73 02 00 00 00 00 00 00 00 06 |volumes.........|
00000400 00 00 00 69 6d 61 67 65 73 05 00 00 00 00 00 00 |...images.......|
00000410 00 0b 00 00 00 76 6f 6c 75 6d 65 73 5f 73 73 64 |.....volumes_ssd|
00000420 06 00 00 00 00 00 00 00 06 00 00 00 64 6f 63 6b |............dock|
00000430 65 72 06 00 00 00 00 80 01 00 16 00 00 00 16 00 |er..............|
...
00004340 00 00 00 01 00 15 04 00 00 00 00 01 00 10 00 00 |................|
00004350 00 01 00 00 00 16 00 00 00 04 00 00 00 ff ff ff |................|
00004360 ff 0a 00 04 00 fc ff 49 00 04 00 00 00 fb ff ff |.......I........|
00004370 ff fe ff ff ff fd ff ff ff fc ff ff ff ff 7f 12 |................|
00004380 00 00 00 01 00 ff 7f 12 00 00 00 01 00 ff 7f 12 |................|
00004390 00 00 00 01 00 ff 7f 12 00 00 00 01 00 04 00 00 |................|
000043a0 00 fe ff ff ff 01 00 04 00 ff 7f 12 00 05 00 00 |................|
000043b0 00 07 00 00 00 08 00 00 00 09 00 00 00 0a 00 00 |................|
000043c0 00 0b 00 00 00 33 b3 03 00 00 00 01 00 33 b3 03 |.....3.......3..|
000043d0 00 00 00 01 00 33 b3 03 00 00 00 01 00 33 b3 03 |.....3.......3..|
000043e0 00 00 00 01 00 33 b3 03 00 00 00 01 00 04 00 00 |.....3..........|
000043f0 00 fd ff ff ff 01 00 04 00 ff 7f 12 00 05 00 00 |................|
00004400 00 0c 00 00 00 0d 00 00 00 0e 00 00 00 00 00 00 |................|
00004410 00 01 00 00 00 33 b3 03 00 00 00 01 00 33 b3 03 |.....3.......3..|
00004420 00 00 00 01 00 33 b3 03 00 00 00 01 00 33 b3 03 |.....3.......3..|
00004430 00 00 00 01 00 33 b3 03 00 00 00 01 00 04 00 00 |.....3..........|
00004440 00 fc ff ff ff 01 00 04 00 ff 7f 12 00 05 00 00 |................|
00004450 00 11 00 00 00 12 00 00 00 13 00 00 00 14 00 00 |................|
00004460 00 15 00 00 00 33 b3 03 00 00 00 01 00 33 b3 03 |.....3.......3..|
00004470 00 00 00 01 00 33 b3 03 00 00 00 01 00 33 b3 03 |.....3.......3..|
00004480 00 00 00 01 00 33 b3 03 00 00 00 01 00 04 00 00 |.....3..........|
00004490 00 fb ff ff ff 01 00 04 00 ff 7f 12 00 05 00 00 |................|
000044a0 00 04 00 00 00 05 00 00 00 06 00 00 00 02 00 00 |................|
000044b0 00 03 00 00 00 33 b3 03 00 00 00 01 00 33 b3 03 |.....3.......3..|
000044c0 00 00 00 01 00 33 b3 03 00 00 00 01 00 33 b3 03 |.....3.......3..|
000044d0 00 00 00 01 00 33 b3 03 00 00 00 01 00 00 00 00 |.....3..........|
000044e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000044f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00004500 00 00 00 00 00 00 00 00 00 01 00 00 00 03 00 00 |................|
00004510 00 00 01 01 0a 01 00 00 00 ff ff ff ff 00 00 00 |................|
00004520 00 06 00 00 00 00 00 00 00 01 00 00 00 04 00 00 |................|
00004530 00 00 00 00 00 00 00 00 00 0b 00 00 00 00 00 00 |................|
00004540 00 03 00 00 00 6f 73 64 01 00 00 00 04 00 00 00 |.....osd........|
00004550 68 6f 73 74 02 00 00 00 07 00 00 00 63 68 61 73 |host........chas|
00004560 73 69 73 03 00 00 00 04 00 00 00 72 61 63 6b 04 |sis........rack.|
00004570 00 00 00 03 00 00 00 72 6f 77 05 00 00 00 03 00 |.......row......|
00004580 00 00 70 64 75 06 00 00 00 03 00 00 00 70 6f 64 |..pdu........pod|
00004590 07 00 00 00 04 00 00 00 72 6f 6f 6d 08 00 00 00 |........room....|
000045a0 0a 00 00 00 64 61 74 61 63 65 6e 74 65 72 09 00 |....datacenter..|
000045b0 00 00 06 00 00 00 72 65 67 69 6f 6e 0a 00 00 00 |......region....|
000045c0 04 00 00 00 72 6f 6f 74 19 00 00 00 fb ff ff ff |....root........|
000045d0 08 00 00 00 79 64 31 73 74 30 30 31 fc ff ff ff |....yd1st001....|
000045e0 08 00 00 00 79 64 31 73 74 30 30 34 fd ff ff ff |....yd1st004....|
000045f0 08 00 00 00 79 64 31 73 74 30 30 33 fe ff ff ff |....yd1st003....|
00004600 08 00 00 00 79 64 31 73 74 30 30 32 ff ff ff ff |....yd1st002....|
00004610 07 00 00 00 64 65 66 61 75 6c 74 00 00 00 00 05 |....default.....|
00004620 00 00 00 6f 73 64 2e 30 01 00 00 00 05 00 00 00 |...osd.0........|
00004630 6f 73 64 2e 31 02 00 00 00 05 00 00 00 6f 73 64 |osd.1........osd|
00004640 2e 32 03 00 00 00 05 00 00 00 6f 73 64 2e 33 04 |.2........osd.3.|
00004650 00 00 00 05 00 00 00 6f 73 64 2e 34 05 00 00 00 |.......osd.4....|
00004660 05 00 00 00 6f 73 64 2e 35 06 00 00 00 05 00 00 |....osd.5.......|
00004670 00 6f 73 64 2e 36 07 00 00 00 05 00 00 00 6f 73 |.osd.6........os|
...
00004720 00 00 00 6f 73 64 2e 32 31 01 00 00 00 00 00 00 |...osd.21.......|
00004730 00 12 00 00 00 72 65 70 6c 69 63 61 74 65 64 5f |.....replicated_|
00004740 72 75 6c 65 73 65 74 00 00 00 00 00 00 00 00 32 |ruleset........2|
00004750 00 00 00 01 00 00 00 01 01 16 00 00 00 00 01 00 |................|
00004760 00 00 07 00 00 00 64 65 66 61 75 6c 74 04 00 00 |......default...|
00004770 00 01 00 00 00 6b 01 00 00 00 32 01 00 00 00 6d |.....k....2....m|

The dump above picks out the useful parts of this file. Here is my own conjectural summary of where an osdmap stores its crushmap:

  • The CrushMap starts with 0x00 0x00 0x01, followed by many runs that read like 3....3..3 in the ASCII column. In practice, first locate the OSD entries, then scan upward: the 0x00 0x00 0x01 just before those 3..3 runs marks the start of the crushmap — the 00004340 row above, for example.
  • The CrushMap ends with 0x01 0x01 0x16 0x00 0x00 0x00 0x00, after the last OSD entries — here just below the 00004740 row, usually a little after the ruleset name.

Now we can carve this segment out. The start is at offset 0x4349 = 17225, the (exclusive) end at 0x475e = 18270, so the length is 18270 - 17225 = 1045 bytes.
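
If you would rather let the shell do the hex conversion, bash's printf and arithmetic expansion confirm the numbers:

# hex-to-decimal for the carve: start, end, length
printf '%d %d %d\n' 0x4349 0x475e $((0x475e - 0x4349))
# prints: 17225 18270 1045

With the offsets confirmed, dd does the carving: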

[root@yd1st003 ~]# dd if=/root/osdmap skip=17225 bs=1 count=1045 of=/root/crushmap iflag=skip_bytes
1045+0 records in
1045+0 records out
1045 bytes (1.0 kB) copied, 0.00290987 s, 359 kB/s
[root@yd1st003 ~]# hexdump -Cv crushmap
00000000 00 00 01 00 10 00 00 00 01 00 00 00 16 00 00 00 |................|
00000010 04 00 00 00 ff ff ff ff 0a 00 04 00 fc ff 49 00 |..............I.|
00000020 04 00 00 00 fb ff ff ff fe ff ff ff fd ff ff ff |................|
00000030 fc ff ff ff ff 7f 12 00 00 00 01 00 ff 7f 12 00 |................|
...
000003f0 69 63 61 74 65 64 5f 72 75 6c 65 73 65 74 00 00 |icated_ruleset..|
00000400 00 00 00 00 00 00 32 00 00 00 01 00 00 00 01 01 |......2.........|
00000410 16 00 00 00 00 |.....|

Now for the magic:

[root@yd1st003 ~]# crushtool -d crushmap -o crushmap.txt
[root@yd1st003 ~]# cat crushmap.txt
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
...
# rules
rule replicated_ruleset {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}
# end crush map

We take the crushmap file we just carved out, decompile it with crushtool, and inspect the result: it is exactly identical to what ceph osd getcrushmap returns. With that, we have extracted the cluster's CrushMap from an arbitrary osdmap. One more point needs stating, though. As I understand it, an osdmap stores the crushmap that was current at its epoch, and under normal operation we rarely modify the crushmap, so a contiguous run of osdmap epochs should all carry the same crushmap. That is the one assumption this experiment rests on: the crushmap was not modified during the period in question.
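
If the monitors happen to still be reachable, the identity claim is easy to check for yourself. A minimal verification, assuming crushmap.txt is the decompiled carve-out from above:

# fetch the live crushmap, decompile it, and diff against the carved one
ceph osd getcrushmap -o /tmp/crushmap.live
crushtool -d /tmp/crushmap.live -o /tmp/crushmap.live.txt
diff /tmp/crushmap.live.txt crushmap.txt && echo identical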

So what can we do now that we have the crushmap? My original plan was to read all the OSD weights and all the buckets out of the crushmap, build a brand-new cluster whose crushmap matched the old one exactly, and then run commands such as ceph osd map against it to obtain object and PG placements. Testing confirmed that the command output matches the original cluster exactly. During the experiment, however, I came across the osdmaptool utility, which produces the same kind of output directly.
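
For completeness, the original plan boils down to two commands; a sketch, assuming a scratch cluster whose pools (pg_num, replica size, rulesets) have been recreated to match the old one:

# inject the recovered binary crushmap into the scratch cluster...
ceph osd setcrushmap -i /root/crushmap
# ...then ask it where an object would land, as on the old cluster
ceph osd map volumes rbd_data.f7ae7f1458600.0000000000000000

With osdmaptool, though, none of that is necessary; it reads the osdmap file directly: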

[root@yd1st003 ~]# osdmaptool --print osdmap
osdmaptool: osdmap file 'osdmap'
epoch 5000
fsid 3727c106-0ac9-420d-99a9-4218ea4e099f
created 2016-07-03 02:09:15.751212
modified 2016-10-12 06:26:06.681167
flags sortbitwise
pool 1 'volumes' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 700 pgp_num 700 last_change 4483 flags hashpspool stripe_width 0
removed_snaps [1~1]
...
max_osd 22
osd.0 up in weight 1 up_from 4935 up_thru 4996 down_at 4930 last_clean_interval [4760,4929) 172.19.48.203:6800/2537077 3.3.4.3:6800/2537077 3.3.4.3:6801/2537077 172.19.48.203:6801/2537077 exists,up ecb520b4-58c7-4ff9-9cb4-d715ce458be4
osd.1 up in weight 1 up_from 4934 up_thru 4999 down_at 4932 last_clean_interval [4758,4929) 172.19.48.203:6802/2537081 3.3.4.3:6802/2537081 3.3.4.3:6803/2537081 172.19.48.203:6803/2537081 exists,up d623aef1-6297-4e19-8a77-b2afccc9a6e3
...

With this tool, we can get the crushmap we just extracted by hand in a single step:

[root@yd1st003 ~]# osdmaptool osdmap --export-crush i-am-crush
osdmaptool: osdmap file 'osdmap'
osdmaptool: exported crush map to i-am-crush
[root@yd1st003 ~]# crushtool -d i-am-crush -o /tmp/i-am-crush.txt
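
As a sanity check, one might confirm that the exported map matches the hand-carved one, assuming both files are still in the working directory:

# decompile the hand-carved blob and diff it against osdmaptool's export
crushtool -d crushmap -o /tmp/hand-carved.txt
diff /tmp/i-am-crush.txt /tmp/hand-carved.txt && echo identical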

The experiment above established that the osdmap contains the crushmap; in addition, the output of osdmaptool --print gives every OSD's weight and the full details of every pool. That information is enough to map out every PG and object:

[root@yd1st003 ~]# osdmaptool --test-map-pg 1.0 osdmap
osdmaptool: osdmap file 'osdmap'
parsed '1.0' -> 1.0
1.0 raw ([6,10,21], p6) up ([6,10], p6) acting ([6,10], p6)
[root@yd1st003 ~]# osdmaptool --test-map-pg 1.0 osdmap --mark-up-in
osdmaptool: osdmap file 'osdmap'
marking all OSDs up and in
parsed '1.0' -> 1.0
1.0 raw ([6,10,21], p6) up ([6,10,21], p6) acting ([6,10,21], p6)
[root@yd1st003 ~]# ceph pg map 1.0
osdmap e5166 pg 1.0 (1.0) -> up [6,10,21] acting [6,10,21]
[root@yd1st003 ~]# osdmaptool --test-map-object rbd_data.f7ae7f1458600.0000000000000000 --pool 1 osdmap --mark-up-in
osdmaptool: osdmap file 'osdmap'
marking all OSDs up and in
object 'rbd_data.f7ae7f1458600.0000000000000000' -> 1.1d6 -> [0,5,18]
[root@yd1st003 ~]# ceph osd map volumes rbd_data.f7ae7f1458600.0000000000000000
osdmap e5166 pool 'volumes' (1) object 'rbd_data.f7ae7f1458600.0000000000000000' -> pg 1.cea4bfd6 (1.1d6) -> up ([0,5,18], p0) acting ([0,5,18], p0)

The --mark-up-in flag just used is worth a mention: as you can see, the output of --test-map-pg 1.0 differs with and without it. In osdmap epoch 5000, osd.21 was in the down state, so the computed acting set leaves out 21; the flag marks every OSD as up & in before mapping.
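
You can confirm osd.21's state at this epoch straight from the same file; a one-liner, assuming the osdmap file sits in the current directory:

# show osd.21's recorded state in epoch 5000 (the state field should read down)
osdmaptool --print osdmap | grep '^osd\.21 '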

With that, a single osdmap is enough to locate any placement information you care about.

Conclusion

Although this experiment took a winding mountain road to export the crushmap, I still think it was worth doing. At the very least it pins down one fact: the osdmap contains the crushmap, and from a single osdmap we can obtain the placement of every PG and object. Perhaps you cannot yet see the benefit, but if the machine room is hit by an earthquake and all that survives are the OSD disks, the information in that file is what you will use to rebuild the cluster. In other words, even for a dead cluster, we can derive a great deal of useful placement information without issuing any commands.
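
To close, here is the whole offline pipeline from this article in one place; a recap that assumes only an OSD disk survives, reusing the epoch-5000 file from above (substitute your own object name and pool id):

# 1. grab a full osdmap from the surviving OSD's meta directory
cp /var/lib/ceph/osd/ceph-0/current/meta/DIR_8/DIR_6/osdmap.5000__0_0A038C68__none ./osdmap
# 2. export and decompile the crushmap embedded in it
osdmaptool osdmap --export-crush crush.bin
crushtool -d crush.bin -o crush.txt
# 3. map any PG or object without a running cluster
osdmaptool --test-map-pg 1.0 osdmap --mark-up-in
osdmaptool --test-map-object <object-name> --pool <pool-id> osdmap --mark-up-in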