Ceph Experiments

Experiment Environment List

Each of the experiments below uses one of the environments listed here; instead of repeating the details every time, an environment is referred to simply as env-X.

env-1

A single-node cluster with a mon deployed on the node and three 2T disks attached:

[root@ceph-1 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 128G 0 disk
├─sda1 8:1 0 500M 0 part /boot
└─sda2 8:2 0 127.5G 0 part
├─centos-root 253:0 0 50G 0 lvm /
├─centos-swap 253:1 0 2G 0 lvm [SWAP]
└─centos-home 253:2 0 75.5G 0 lvm /home
sdb 8:16 0 2T 0 disk
sdc 8:32 0 2T 0 disk
sdd 8:48 0 2T 0 disk
sr0 11:0 1 1024M 0 rom
[root@ceph-1 ~]# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
[root@ceph-1 ~]# ceph -v
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
[root@ceph-1 cluster]# cat /etc/ceph/ceph.conf
[global]
fsid = 99fcd5bc-f4ec-4419-88b5-0a1921b90e77
public_network = 192.168.57.0/24
mon_initial_members = ceph-1
mon_host = 192.168.57.241
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_pool_default_size = 1
osd_pool_default_min_size = 1
osd_crush_chooseleaf_type = 0 # from OSD

env-2

A three-node cluster with one mon deployed per node and three 2T disks attached to each; the nodes are named ceph-1, ceph-2, and ceph-3:

[root@ceph-1 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 128G 0 disk
├─sda1 8:1 0 500M 0 part /boot
└─sda2 8:2 0 127.5G 0 part
├─centos-root 253:0 0 50G 0 lvm /
├─centos-swap 253:1 0 2G 0 lvm [SWAP]
└─centos-home 253:2 0 75.5G 0 lvm /home
sdb 8:16 0 2T 0 disk
sdc 8:32 0 2T 0 disk
sdd 8:48 0 2T 0 disk
sr0 11:0 1 1024M 0 rom
[root@ceph-1 ~]# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
[root@ceph-1 ~]# ceph -v
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
[root@ceph-1 ~]# cat /etc/ceph/ceph.conf
[global]
fsid = 58f3771b-0fd0-4042-a174-a5a2c36c4dbc
public_network = 192.168.57.0/24
mon_initial_members = ceph-1, ceph-2, ceph-3
mon_host = 192.168.57.241,192.168.57.242,192.168.57.243
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

exp-1: Aligning SSD partitions for use as journals

Objective

In an earlier setup, each OSD's journal was symlinked to a 5G file on the SSD, an approach that takes no account of SSD partition alignment. If SSD partitions are not aligned correctly, read/write performance drops sharply. This experiment shows a sound way to align them.

Environment

  • env-1
  • Assume /dev/sdb is the SSD and /dev/sdc and /dev/sdd are SATA disks.
  • Assume each journal uses 20G, which requires adding osd_journal_size = 20480 to ceph.conf (see the snippet after this list).
  • /dev/sdb1 serves as the journal for /dev/sdc, and /dev/sdb2 as the journal for /dev/sdd.
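
For reference, the change is a single line; a minimal sketch, appended to the [global] section of the env-1 ceph.conf (the value is in MB, matching the 20G partitions created below):

# /etc/ceph/ceph.conf, [global] section
osd_journal_size = 20480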

Procedure

  • Partition /dev/sdb; the first two partitions are 20G each.
[root@ceph-1 ~]# fdisk /dev/sdb
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.
欢迎使用 fdisk (util-linux 2.23.2)。
更改将停留在内存中,直到您决定将更改写入磁盘。
使用写入命令前请三思。
命令(输入 m 获取帮助):m
命令操作
d delete a partition
g create a new empty GPT partition table
G create an IRIX (SGI) partition table
l list known partition types
m print this menu
n add a new partition
o create a new empty DOS partition table
q quit without saving changes
s create a new empty Sun disklabel
w write table to disk and exit
命令(输入 m 获取帮助):g
Building a new GPT disklabel (GUID: 9A5D6B86-5C14-462E-8350-3B95BDDF312F)
命令(输入 m 获取帮助):n
分区号 (1-128,默认 1):1
第一个扇区 (2048-4294965214,默认 2048):2048
Last sector, +sectors or +size{K,M,G,T,P} (2048-4294965214,默认 4294965214):+20G
已创建分区 1
命令(输入 m 获取帮助):w
The partition table has been altered!
Calling ioctl() to re-read partition table.
正在同步磁盘。
[root@ceph-1 ~]# fdisk /dev/sdb
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.
欢迎使用 fdisk (util-linux 2.23.2)。
更改将停留在内存中,直到您决定将更改写入磁盘。
使用写入命令前请三思。
命令(输入 m 获取帮助):n
分区号 (2-128,默认 2):2
第一个扇区 (41945088-4294965214,默认 41945088):
Last sector, +sectors or +size{K,M,G,T,P} (41945088-4294965214,默认 4294965214):+20G
已创建分区 2
命令(输入 m 获取帮助):w
The partition table has been altered!
Calling ioctl() to re-read partition table.
正在同步磁盘。

Check the new partition table:

[root@ceph-1 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 128G 0 disk
├─sda1 8:1 0 500M 0 part /boot
└─sda2 8:2 0 127.5G 0 part
├─centos-root 253:0 0 50G 0 lvm /
├─centos-swap 253:1 0 2G 0 lvm [SWAP]
└─centos-home 253:2 0 75.5G 0 lvm /home
sdb 8:16 0 2T 0 disk
├─sdb1 8:17 0 20G 0 part
└─sdb2 8:18 0 20G 0 part
sdc 8:32 0 2T 0 disk
sdd 8:48 0 2T 0 disk
sr0 11:0 1 1024M 0 rom

Create new OSDs with their journals pointing at the sdb partitions:

[root@ceph-1 cluster]# ceph --show-config|grep osd_journal_size
osd_journal_size = 20480
[root@ceph-1 cluster]# ceph-deploy osd prepare ceph-1:/dev/sdc:/dev/sdb1 ceph-1:/dev/sdd:/dev/sdb2 --zap-disk
[root@ceph-1 cluster]# ceph-deploy osd activate ceph-1:/dev/sdc1 ceph-1:/dev/sdd1

Verify that the journal link took effect:

[root@ceph-1 cluster]# ll /var/lib/ceph/osd/ceph-0/
总用量 40
-rw-r--r-- 1 root root 192 8月 4 15:07 activate.monmap
-rw-r--r-- 1 root root 3 8月 4 15:07 active
-rw-r--r-- 1 root root 37 8月 4 15:06 ceph_fsid
drwxr-xr-x 36 root root 565 8月 4 15:08 current
-rw-r--r-- 1 root root 37 8月 4 15:06 fsid
lrwxrwxrwx 1 root root 9 8月 4 15:06 journal -> /dev/sdb1
-rw------- 1 root root 56 8月 4 15:07 keyring
-rw-r--r-- 1 root root 21 8月 4 15:06 magic
-rw-r--r-- 1 root root 6 8月 4 15:07 ready
-rw-r--r-- 1 root root 4 8月 4 15:07 store_version
-rw-r--r-- 1 root root 53 8月 4 15:07 superblock
-rw-r--r-- 1 root root 0 8月 4 15:07 sysvinit
-rw-r--r-- 1 root root 2 8月 4 15:07 whoami

Create a deliberately misaligned partition by choosing a first sector one greater than the default:

[root@ceph-1 cluster]# fdisk /dev/sdb
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.
欢迎使用 fdisk (util-linux 2.23.2)。
更改将停留在内存中,直到您决定将更改写入磁盘。
使用写入命令前请三思。
命令(输入 m 获取帮助):n
分区号 (3-128,默认 3):
第一个扇区 (83888128-4294965214,默认 83888128):83888129
Last sector, +sectors or +size{K,M,G,T,P} (83888129-4294965214,默认 4294965214):+20G
已创建分区 3
命令(输入 m 获取帮助):w
The partition table has been altered!

Sometimes you will get the following warning; run partprobe to tell the kernel to re-read the partition table:

WARNING: Re-reading the partition table failed with error 16: 设备或资源忙.
The kernel still uses the old table. The new table will be used at
the next reboot or after you run partprobe(8) or kpartx(8)
[root@ceph-1 ~]# partprobe

Check whether /dev/sdb[1,2,3] are aligned:

[root@ceph-1 cluster]# parted /dev/sdb
GNU Parted 3.1
使用 /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) align-check optimal 1
1 aligned
(parted) align-check optimal 2
2 aligned
(parted) align-check optimal 3
3 not aligned

As shown, /dev/sdb3 is not aligned. The simplest fix is to delete it and partition again; I have not found a better method.
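
One alternative to fdisk is to recreate the partition with parted using MiB units, which keeps the start on a 1MiB boundary; a sketch, with the boundaries derived from the default start sector 83888128 (= 40961MiB) shown above:

# delete the misaligned partition and recreate it on a 1MiB boundary
parted -s /dev/sdb rm 3
parted -s /dev/sdb mkpart primary 40961MiB 61441MiB
parted /dev/sdb align-check optimal 3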

Conclusion

After partitioning an SSD, always check that the partitions are aligned; otherwise the SSD's read/write performance will suffer.
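
A quick manual check, if parted is not available: with 512-byte sectors, a partition start is 1MiB-aligned when its start sector is a multiple of 2048. Using the partitions above as an example:

cat /sys/block/sdb/sdb3/start
# 83888129 -> 83888129 % 2048 = 1, so sdb3 is not aligned
# sdb1 starts at sector 2048, a multiple of 2048, so sdb1 is aligned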

exp-2: Replacing the journal

Objective

In day-to-day operation, the SSD used for journals may fail. After replacing it with a new SSD, the existing journals must be re-pointed at the new disk. The same procedure applies when simply swapping out the journal disk.

Environment

  • env-1 & exp-1
  • Delete the misaligned /dev/sdb3 from exp-1 and add two new partitions, /dev/sdb[3,4].

Procedure

The goal is to move osd.0's journal to /dev/sdb3 and osd.1's journal to /dev/sdb4. The journal disks are currently used as follows:

[root@ceph-1 cluster]# ll /var/lib/ceph/osd/ceph-0/journal
lrwxrwxrwx 1 root root 9 8月 4 15:06 /var/lib/ceph/osd/ceph-0/journal -> /dev/sdb1
[root@ceph-1 cluster]# ll /var/lib/ceph/osd/ceph-1/journal
lrwxrwxrwx 1 root root 9 8月 4 15:07 /var/lib/ceph/osd/ceph-1/journal -> /dev/sdb2

Set the noout flag so the cluster does not start recovering data:

[root@ceph-1 cluster]# ceph osd set noout
set noout
[root@ceph-1 cluster]# ceph -s
cluster 99fcd5bc-f4ec-4419-88b5-0a1921b90e77
health HEALTH_WARN
noout flag(s) set
monmap e1: 1 mons at {ceph-1=192.168.57.241:6789/0}
election epoch 2, quorum 0 ceph-1
osdmap e10: 2 osds: 2 up, 2 in
flags noout
pgmap v13: 64 pgs, 1 pools, 0 bytes data, 0 objects
68152 kB used, 4093 GB / 4093 GB avail
64 active+clean

Stop the OSD process:

[root@ceph-1 cluster]# service ceph stop osd.0

Flush the journal down into the OSD:

[root@ceph-1 cluster]# ceph-osd -i 0 --flush-journal
2016-08-04 16:27:07.531321 7ffbc0876880 -1 flushed journal /var/lib/ceph/osd/ceph-0/journal for object store /var/lib/ceph/osd/ceph-0

The current link is /var/lib/ceph/osd/ceph-0/journal -> /dev/sdb1.
Linking by device name carries some risk: if the disk is unplugged and re-inserted (unlikely, but possible), its device name may change, whereas the partition's UUID is unique and survives replugging. It is therefore better to point the journal link at the UUID.

Look up the partition UUID of /dev/sdb3:

[root@ceph-1 cluster]# ll /dev/disk/by-partuuid/ |grep sdb3
lrwxrwxrwx 1 root root 10 8月 4 14:34 4472e58f-37ae-4277-bd24-1cc759cd5a51 -> ../../sdb3

Remove the old journal link and link the new journal in its place:

[root@ceph-1 cluster]# rm -rf /var/lib/ceph/osd/ceph-0/journal
[root@ceph-1 cluster]# ln -s /dev/disk/by-partuuid/4472e58f-37ae-4277-bd24-1cc759cd5a51 /var/lib/ceph/osd/ceph-0/journal
[root@ceph-1 cluster]# chown ceph:ceph /var/lib/ceph/osd/ceph-0/journal
[root@ceph-1 cluster]# echo 4472e58f-37ae-4277-bd24-1cc759cd5a51 > /var/lib/ceph/osd/ceph-0/journal_uuid

With the link in place, create the new journal and start the OSD:

[root@ceph-1 cluster]# ceph-osd -i 0 --mkjournal
2016-08-04 16:55:06.381554 7f07012aa880 -1 journal check: ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected fae5d972-cb4c-46e1-a620-9407002556ba, invalid (someone else's?) journal
2016-08-04 16:55:06.387479 7f07012aa880 -1 created new journal /var/lib/ceph/osd/ceph-0/journal for object store /var/lib/ceph/osd/ceph-0
[root@ceph-1 cluster]# service ceph start osd.0
=== osd.0 ===
=== osd.0 ===
create-or-move updated item name 'osd.0' weight 2 at location {host=ceph-1,root=default} to crush map
Starting Ceph osd.0 on ceph-1...
Running as unit ceph-osd.0.1470301052.328264448.service.

Clear the noout flag and check the journal state:

[root@ceph-1 cluster]# ceph osd unset noout
unset noout
[root@ceph-1 cluster]# ll /var/lib/ceph/osd/ceph-0/
总用量 44
-rw-r--r-- 1 root root 192 8月 4 16:24 activate.monmap
-rw-r--r-- 1 root root 3 8月 4 16:24 active
-rw-r--r-- 1 root root 37 8月 4 16:23 ceph_fsid
drwxr-xr-x 36 root root 565 8月 4 16:24 current
-rw-r--r-- 1 root root 37 8月 4 16:23 fsid
lrwxrwxrwx 1 root root 58 8月 4 16:53 journal -> /dev/disk/by-partuuid/4472e58f-37ae-4277-bd24-1cc759cd5a51
-rw-r--r-- 1 root root 37 8月 4 16:54 journal_uuid
..

Conclusion

As noted above, linking the journal by device name is not safe. Linking by UUID is recommended: even if the disk is unplugged and re-inserted, the journal can still be found and the OSD will not fail to start because of it.

exp-3: Recovering client.admin.keyring

Prerequisite:
The files inside store.db under the monitor directory have not been damaged or deleted.

Objective

Regain access to the cluster after all of its keys (mon.keyring / client.admin.keyring / bootstrap-[osd/mds/rgw].keyring) have been lost.

Principle

Ceph clients access the cluster through rados. When we run ceph -s, a number of defaults are filled in behind the scenes: ceph --conf=/etc/ceph/ceph.conf --name=client.admin --keyring=/etc/ceph/ceph.client.admin.keyring -s. In other words, the client.admin user and its keyring are used to authenticate against rados. At this point that keyring is lost, and the cluster cannot be reached to fetch it. The most privileged user, mon., keeps its keyring at /var/lib/ceph/mon/ceph-ceph-1/keyring; even if that keyring is lost as well, we can create a new mon. keyring and use it to access the cluster.
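
Spelled out, the two invocations below are equivalent; the second merely makes the defaults explicit:

ceph -s
ceph --conf=/etc/ceph/ceph.conf --name=client.admin --keyring=/etc/ceph/ceph.client.admin.keyring -s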

Environment

  • env-1

Procedure

If there are multiple monitors and the surviving one is stuck in the probing state, refer to exp-4 and modify the monmap so the cluster believes there is only one mon.
Create an arbitrary keyring (the key value can be anything) and save it in the monitor directory:

[root@ceph-1 ceph-ceph-1]# cat /var/lib/ceph/mon/ceph-ceph-1/keyring
[mon.]
key = AQDPraFXlH/4MBAA0ozKh9l9jKKp/5ofE/Xjsw==
caps mon = "allow *"
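
If you prefer a properly generated key over a hand-typed one, ceph-authtool can create it; a sketch that writes the keyring shown above:

ceph-authtool --create-keyring /var/lib/ceph/mon/ceph-ceph-1/keyring --gen-key -n mon. --cap mon 'allow *'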

Start the monitor:

[root@ceph-1 ceph-ceph-1]# service ceph start mon
=== mon.ceph-1 ===
Starting Ceph mon.ceph-1 on ceph-1...
Running as unit ceph-mon.ceph-1.1470305528.606084264.service.
Starting ceph-create-keys on ceph-1...

There are two ways to retrieve all the keyrings from here:

  1. Disable cephx authentication and simply run ceph auth list (a sketch follows this list).
  2. To understand auth more deeply, access the cluster as the mon. user; this is the method described below.
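
For method 1, a minimal sketch: set the auth options in /etc/ceph/ceph.conf to none, restart the monitor, then list the keys (and switch the options back afterwards):

# /etc/ceph/ceph.conf, [global] section
auth_cluster_required = none
auth_service_required = none
auth_client_required = none

service ceph restart mon
ceph auth list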

Right now the only user in the cluster that we know is mon., along with its keyring. Use these two parameters to access the cluster:

[root@ceph-1 ceph-ceph-1]# ceph auth list --name=mon. --keyring=/var/lib/ceph/mon/ceph-ceph-1/keyring --conf=/etc/ceph/ceph.conf
installed auth entries:
osd.0
key: AQC5+6JXR07eCBAASy1ZFWT0LT5/gguBGqyjVw==
caps: [mon] allow profile osd
caps: [osd] allow *
osd.1
key: AQDC+6JXL7uGCBAArNnHZCuBb3FsxlIGJzwdwg==
caps: [mon] allow profile osd
caps: [osd] allow *
client.admin
key: AQBx+6JXp5KhCBAAb77YZqcnK23qx3p8J9OD6A==
caps: [mds] allow
caps: [mon] allow *
caps: [osd] allow *
client.bootstrap-mds
key: AQBx+6JXrpSHIRAA8wMo9pDF+jGCkWtN6z4znA==
caps: [mon] allow profile bootstrap-mds
client.bootstrap-osd
key: AQBx+6JX2a+PFBAAbpCCWi5vtA0UNgdWEv9PPg==
caps: [mon] allow profile bootstrap-osd
client.bootstrap-rgw
key: AQB0+6JX2/fqHRAArBHMhgW0exYiv1oAp4yEQg==
caps: [mon] allow profile bootstrap-rgw

This shows the client.admin user's keyring. You can copy it out by hand, or export it with the command below; since get-or-create is used, the admin user is created if it does not exist, otherwise its existing keyring is simply fetched:

[root@ceph-1 ceph-ceph-1]# ceph --cluster=ceph --name=mon. --keyring=/var/lib/ceph/mon/ceph-ceph-1/keyring auth get-or-create client.admin mon 'allow *' osd 'allow *' mds 'allow' -o /etc/ceph/ceph.client.admin.keyring

On the jewel release, the mds cap may need to be mds 'allow *'.

With ceph.client.admin.keyring in hand, you can do whatever you like.

An additional test:
Copy ceph.conf and /var/lib/ceph/mon/ceph-ceph-1/keyring to another node, e.g. into ceph-2:/root/ceph-1/; the following command then accesses the ceph-1 cluster:

[root@ceph-2 ceph-1]# ceph --conf=/root/ceph-1/ceph.conf --name=mon. --keyring=/root/ceph-1/keyring -s
cluster 99fcd5bc-f4ec-4419-88b5-0a1921b90e77
health HEALTH_WARN
64 pgs stale
64 pgs stuck stale
2/2 in osds are down
monmap e1: 1 mons at {ceph-1=192.168.57.241:6789/0}
election epoch 1, quorum 0 ceph-1
osdmap e15: 2 osds: 0 up, 2 in
pgmap v23: 64 pgs, 1 pools, 0 bytes data, 0 objects
67568 kB used, 4093 GB / 4093 GB avail
64 stale+active+clean
[root@ceph-2 ceph-1]# ls
ceph.conf keyring

Conclusion

The idea behind this experiment is to use the mon. user, whose keyring we can define ourselves, to start a monitor and access the cluster, after which any operation is possible. It follows that none of the cluster's keyrings is irreplaceable, and since the mon. keyring is not encrypted or otherwise protected, in practice any keyring can be created, deleted, modified, or read.

exp-4: Collapsing 3 mons into 1 mon

Objective

Suppose that in the worst case two of the three monitors are destroyed in a failure and their disks cannot be recovered. The lone surviving monitor cannot bring the cluster up by itself: it stays stuck in the probing state, waiting for at least one other monitor to come online. This experiment simulates starting the cluster with only the surviving mon, on the premise that this node's monitor store.db is not damaged.

Principle

Modify the monmap, remove the two damaged mons, and make the cluster believe there is only one mon... no, no, no: that is the method others describe. Modifying the monmap means exporting it first, and with the cluster down the export command cannot run. This experiment instead creates a brand-new monmap and injects it into the cluster in order to regain access.

Environment

  • env-2

Procedure

Before the experiment, the cluster has three mons and three OSDs, in the following state:

[root@ceph-1 cluster]# ceph -s
cluster 58f3771b-0fd0-4042-a174-a5a2c36c4dbc
health HEALTH_OK
monmap e1: 3 mons at {ceph-1=192.168.57.241:6789/0,ceph-2=192.168.57.242:6789/0,ceph-3=192.168.57.243:6789/0}
election epoch 28, quorum 0,1,2 ceph-1,ceph-2,ceph-3
osdmap e13: 3 osds: 3 up, 3 in
pgmap v19: 64 pgs, 1 pools, 0 bytes data, 0 objects
100 MB used, 6125 GB / 6126 GB avail
64 active+clean

Now stop the two mons on ceph-2 and ceph-3 and watch /var/log/ceph/ceph-mon.ceph-1.log: the remaining mon stays stuck in the probing state, and ceph commands no longer work:

[root@ceph-1 ~]# tail -f /var/log/ceph/ceph-mon.ceph-1.log
2016-08-05 13:20:35.491102 7f22a7e94700 0 -- 192.168.57.241:6789/0 >> 192.168.57.243:6789/0 pipe(0x4d4e000 sd=22 :6789 s=1 pgs=28 cs=1 l=0 c=0x46cb080).fault
2016-08-05 13:20:45.490826 7f22aa99c700 1 mon.ceph-1@0(leader).paxos(paxos active c 1..162) lease_ack_timeout -- calling new election
2016-08-05 13:21:16.856613 7f22aa99c700 0 mon.ceph-1@0(probing).data_health(30) update_stats avail 95% total 51175 MB, used 2295 MB, avail 48879 MB
2016-08-05 13:22:16.857007 7f22aa99c700 0 mon.ceph-1@0(probing).data_health(30) update_stats avail 95% total 51175 MB, used 2295 MB, avail 48879 MB
[root@ceph-1 cluster]# ceph mon getmap -o map
2016-08-05 13:25:13.135617 7f67f832d700 0 -- :/2409307395 >> 192.168.57.243:6789/0 pipe(0x7f67f4062550 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f67f405b450).fault
2016-08-05 13:25:16.136372 7f67f822c700 0 -- :/2409307395 >> 192.168.57.242:6789/0 pipe(0x7f67e8000c00 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f67e8004ef0).fault

Now stop the ceph-1 monitor and create a new monmap (the cluster's fsid is required), then inspect it. Don't worry about the epoch value, it has no effect; the code sets it to 0:

[root@ceph-1 cluster]# monmaptool --create --fsid 58f3771b-0fd0-4042-a174-a5a2c36c4dbc --add ceph-1 192.168.57.241 /tmp/monmap
monmaptool: monmap file /tmp/monmap
monmaptool: set fsid to 58f3771b-0fd0-4042-a174-a5a2c36c4dbc
monmaptool: writing epoch 0 to /tmp/monmap (1 monitors)
[root@ceph-1 cluster]# monmaptool --print /tmp/monmap
monmaptool: monmap file /tmp/monmap
epoch 0
fsid 58f3771b-0fd0-4042-a174-a5a2c36c4dbc
last_changed 2016-08-05 13:30:40.717762
created 2016-08-05 13:30:40.717762
0: 192.168.57.241:6789/0 mon.ceph-1

Inject it into the cluster:

ceph-mon -i ceph-1 --inject-monmap /tmp/monmap

Edit ceph.conf and remove the ceph-2 and ceph-3 hosts and their IPs.
Start the ceph-1 monitor:

[root@ceph-1 cluster]# service ceph start mon
=== mon.ceph-1 ===
Starting Ceph mon.ceph-1 on ceph-1...
Running as unit ceph-mon.ceph-1.1470375746.702089733.service.
Starting ceph-create-keys on ceph-1...
[root@ceph-1 cluster]# ceph -s
cluster 58f3771b-0fd0-4042-a174-a5a2c36c4dbc
health HEALTH_OK
monmap e2: 1 mons at {ceph-1=192.168.57.241:6789/0}
election epoch 1, quorum 0 ceph-1
osdmap e17: 3 osds: 3 up, 3 in
pgmap v32: 64 pgs, 1 pools, 0 bytes data, 0 objects
100 MB used, 6125 GB / 6126 GB avail
64 active+clean

The cluster is accessible again and now has a single monitor. Once the cluster is reachable, adding mons back is easy; just remember to wipe the new node's mon directory clean before adding it.
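
For example, re-adding a mon on ceph-2 with ceph-deploy might look like the sketch below, assuming /var/lib/ceph/mon/ceph-ceph-2 on ceph-2 has been wiped and ceph.conf has been updated to list it again:

# run from the deployment directory on the admin node
ceph-deploy --overwrite-conf mon add ceph-2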

If you simply want all three mons back, copy the surviving node's /var/lib/ceph/mon/ceph-ceph-1/store.db over the same directory on the other two nodes and then start all three mons; the cluster becomes accessible again, provided the other two nodes keep the same IPs and other settings as the old cluster. This has been verified to work.
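
A sketch of that copy for ceph-2 (assuming its mon is stopped first and the directory names follow the ceph-<hostname> convention; repeat for ceph-3):

# on ceph-2
service ceph stop mon
rm -rf /var/lib/ceph/mon/ceph-ceph-2/store.db
# on ceph-1
scp -r /var/lib/ceph/mon/ceph-ceph-1/store.db ceph-2:/var/lib/ceph/mon/ceph-ceph-2/
# on ceph-2
service ceph start mon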

Conclusion

From what this experiment shows, a further workable approach can be inferred:

  • If the disks of all three mons are lost (typically when an SSD system disk dies), try to salvage a complete copy of one store.db, place it on a healthy node, build a new monmap containing only that node's mon, inject it, and start the mon on that node. This is exp-5.

exp-5: Recovery when none of the 3 mons can start

Objective

A few scenarios where this may apply:

  • None of the three monitor hosts can be started again; a mon must be built on a new machine to access the original cluster.
  • The monitors are being relocated as a group and their IPs must change.

Environment

  • env-2
  • The new node is ceph-admin (192.168.57.227), a clean node.

Procedure

Copy the old node's /etc/ceph/ and /var/lib/ceph/mon/ceph-ceph-1/ directories to /etc/ceph/ and /var/lib/ceph/mon/ceph-ceph-admin/ on the new node, respectively:

[root@ceph-1 ~]# scp /etc/ceph/* ceph-admin:/etc/ceph/
root@ceph-admin's password:
ceph.client.admin.keyring 100% 63 0.1KB/s 00:00
ceph.conf 100% 231 0.2KB/s 00:00
[root@ceph-1 ~]# scp -r /var/lib/ceph/mon/ceph-ceph-1/* ceph-admin:/var/lib/ceph/mon/ceph-ceph-admin/
root@ceph-admin's password:
done 100% 0 0.0KB/s 00:00
keyring 100% 77 0.1KB/s 00:00
CURRENT 100% 16 0.0KB/s 00:00
LOG.old 100% 317 0.3KB/s 00:00
LOCK 100% 0 0.0KB/s 00:00
LOG 100% 1142 1.1KB/s 00:00
000037.log 100% 1366KB 1.3MB/s 00:00
MANIFEST-000035 100% 527 0.5KB/s 00:00
000038.sst 100% 2112KB 2.1MB/s 00:00
000039.sst 100% 2058KB 2.0MB/s 00:00
000040.sst 100% 17KB 17.1KB/s 00:00
sysvinit 100% 0 0.0KB/s 00:00

Create a monmap containing only the ceph-admin node and inject it:

[root@ceph-admin store.db]# monmaptool --create --fsid 58f3771b-0fd0-4042-a174-a5a2c36c4dbc --add ceph-admin 192.168.57.227 /tmp/monmap
monmaptool: monmap file /tmp/monmap
monmaptool: set fsid to 58f3771b-0fd0-4042-a174-a5a2c36c4dbc
monmaptool: writing epoch 0 to /tmp/monmap (1 monitors)
[root@ceph-admin store.db]# ceph-mon --inject-monmap /tmp/monmap -i ceph-admin

Edit /etc/ceph/ceph.conf:

[root@ceph-admin store.db]# cat /etc/ceph/ceph.conf
[global]
fsid = 58f3771b-0fd0-4042-a174-a5a2c36c4dbc
public_network = 192.168.57.0/24
mon_initial_members = ceph-admin
mon_host = 192.168.57.227
..

Start the mon and check the cluster state:

[root@ceph-admin store.db]# service ceph start mon
=== mon.ceph-admin ===
Starting Ceph mon.ceph-admin on ceph-admin...
Running as unit ceph-mon.ceph-admin.1470381040.185790242.service.
Starting ceph-create-keys on ceph-admin...
[root@ceph-admin store.db]# ceph -s
cluster 58f3771b-0fd0-4042-a174-a5a2c36c4dbc
health HEALTH_OK
monmap e4: 1 mons at {ceph-admin=192.168.57.227:6789/0}
election epoch 1, quorum 0 ceph-admin
osdmap e25: 3 osds: 3 up, 3 in
pgmap v48: 64 pgs, 1 pools, 0 bytes data, 0 objects
101 MB used, 6125 GB / 6126 GB avail
64 active+clean

The cluster can be accessed again and the monmap now points at the new node's IP. You still need to go to each of the other nodes, update their conf files, and restart all services. Without that, the HEALTH_OK shown here is an illusion: the existing OSDs are still reporting to the old monitors, so update the conf and restart.
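
On each OSD node this amounts to something like the following sketch (using the sysvinit service that appears throughout these notes; the mon_initial_members/mon_host values come from this experiment):

sed -i -e 's/^mon_initial_members.*/mon_initial_members = ceph-admin/' \
       -e 's/^mon_host.*/mon_host = 192.168.57.227/' /etc/ceph/ceph.conf
service ceph restart osd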

Conclusion

Both exp-4 and exp-5 rest on the premise that the monitor database store.db is not corrupted. As long as the database files are intact, a monitor can be brought up in this way.

exp-6: Building a local Ceph repository

Objective

To deploy Ceph quickly, of course.

Environment

env-1; the repository is hosted on the ceph-admin node.

Procedure

Install httpd and the createrepo tool on the repository node:

yum install httpd createrepo -y
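
The repository will be served over HTTP, so make sure httpd is running; on CentOS 7:

systemctl start httpd
systemctl enable httpd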

Create the repository directory and download all the packages.

For 0.94.7:

mkdir -p /var/www/html/ceph/0.94.7
cd /var/www/html/ceph/0.94.7
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/ceph-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/ceph-common-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/ceph-devel-compat-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/ceph-fuse-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/ceph-libs-compat-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/ceph-radosgw-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/ceph-test-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/cephfs-java-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/libcephfs1-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/libcephfs1-devel-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/libcephfs_jni1-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/libcephfs_jni1-devel-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/librados2-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/librados2-devel-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/libradosstriper1-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/libradosstriper1-devel-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/librbd1-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/librbd1-devel-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/python-ceph-compat-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/python-cephfs-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/python-rados-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/python-rbd-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/rbd-fuse-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/rest-bench-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/noarch/ceph-deploy-1.5.34-0.noarch.rpm

For 10.2.2:

mkdir -p /var/www/html/ceph/10.2.2
cd /var/www/html/ceph/10.2.2
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-base-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-common-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-devel-compat-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-fuse-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-libs-compat-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-mds-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-mon-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-osd-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-radosgw-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-selinux-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-test-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/cephfs-java-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/libcephfs1-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/libcephfs1-devel-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/libcephfs_jni1-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/libcephfs_jni1-devel-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/librados2-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/librados2-devel-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/libradosstriper1-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/libradosstriper1-devel-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/librbd1-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/librbd1-devel-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/librgw2-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/librgw2-devel-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/python-ceph-compat-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/python-cephfs-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/python-rados-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/python-rbd-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/rbd-fuse-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/rbd-mirror-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/rbd-nbd-10.2.2-0.el7.x86_64.rpm

Create the repo metadata:

createrepo /var/www/html/ceph/0.94.7
Spawning worker 0 with 3 pkgs
Spawning worker 1 with 2 pkgs
Spawning worker 2 with 2 pkgs
Spawning worker 3 with 2 pkgs
Spawning worker 4 with 2 pkgs
Spawning worker 5 with 2 pkgs
Spawning worker 6 with 2 pkgs
Spawning worker 7 with 2 pkgs
Spawning worker 8 with 2 pkgs
Spawning worker 9 with 2 pkgs
Spawning worker 10 with 2 pkgs
Spawning worker 11 with 2 pkgs
Workers Finished
Saving Primary metadata
Saving file lists metadata
Saving other metadata
Generating sqlite DBs
Sqlite DBs complete

Create ceph.repo:

[root@ceph-admin ~]# cat /etc/yum.repos.d/ceph.repo
[ceph_local]
name=ceph
baseurl=http://192.168.57.227/ceph/0.94.7
gpgcheck=0
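
On each node that should install from this local repository, the client-side steps might look like this sketch (assuming any upstream ceph repo files are removed or disabled first):

scp /etc/yum.repos.d/ceph.repo ceph-1:/etc/yum.repos.d/
ssh ceph-1 'yum clean all && yum makecache && yum install -y ceph ceph-common'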

exp-7: ARM setup

Download the rpms:

wget ftp://195.220.108.108/linux/fedora-secondary/releases/23/Everything/aarch64/os/Packages/g/glibc-2.22-3.fc23.aarch64.rpm
wget ftp://195.220.108.108/linux/fedora-secondary/releases/23/Everything/aarch64/os/Packages/g/glibc-common-2.22-3.fc23.aarch64.rpm
wget ftp://195.220.108.108/linux/fedora-secondary/releases/22/Everything/aarch64/os/Packages/l/leveldb-1.12.0-6.fc21.aarch64.rpm
wget ftp://195.220.108.108/linux/fedora-secondary/releases/22/Everything/aarch64/os/Packages/l/leveldb-devel-1.12.0-6.fc21.aarch64.rpm
wget ftp://195.220.108.108/linux/fedora-secondary/releases/24/Everything/aarch64/os/Packages/l/libbabeltrace-1.2.4-4.fc24.aarch64.rpm
wget ftp://195.220.108.108/linux/fedora-secondary/releases/24/Everything/aarch64/os/Packages/l/lttng-ust-2.6.2-3.fc24.aarch64.rpm
wget ftp://195.220.108.108/linux/fedora-secondary/releases/24/Everything/aarch64/os/Packages/l/lttng-ust-devel-2.6.2-3.fc24.aarch64.rpm
wget ftp://195.220.108.108/linux/fedora-secondary/releases/24/Everything/aarch64/os/Packages/u/userspace-rcu-0.8.6-2.fc24.aarch64.rpm
wget ftp://195.220.108.108/linux/fedora-secondary/releases/24/Everything/aarch64/os/Packages/u/userspace-rcu-devel-0.8.6-2.fc24.aarch64.rpm

Install the rpm packages:

rpm -ivh glibc-* --replacefiles
rpm -ivh userspace-rcu-*
rpm -ivh lttng-ust-*
rpm -ivh libbabeltrace-*
rpm -ivh leveldb-*

Add ceph.repo with the following content:

[ceph]
name=ceph
baseurl=http://mirrors.aliyun.com/ceph/rpm-hammer/el7/aarch64/
gpgcheck=0
[ceph-noarch]
name=cephnoarch
baseurl=http://mirrors.aliyun.com/ceph/rpm-hammer/el7/noarch/
gpgcheck=0

Install ceph:

yum install -y ceph ceph-common ceph-radosgw

BlueStore configuration

To experiment with BlueStore (still experimental in these releases), add the following to ceph.conf on the OSD nodes:

keyvaluestore backend = rocksdb
filestore_omap_backend = rocksdb
enable experimental unrecoverable data corrupting features = rocksdb,bluestore
osd_objectstore = bluestore