Ceph Deployment, Complete Edition (el7 + jewel)

Final architecture

Environment

  1. Three hosts running CentOS 7, each with 5 disks (the virtual disks must be larger than 30G; see the VM setup article for how to build the VMs). Each node already has three 2T disks; now add a 240G disk per node to stand in for an SSD used for journals, and an 800G disk to stand in for an SSD used as an OSD. Also add one more CentOS 7 host to act as the NTP server and the local Ceph repo. Details below:
[root@ceph-1 ~]# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
[root@ceph-1 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 128G 0 disk
├─sda1 8:1 0 500M 0 part /boot
└─sda2 8:2 0 127.5G 0 part
├─centos-root 253:0 0 50G 0 lvm /
├─centos-swap 253:1 0 2G 0 lvm [SWAP]
└─centos-home 253:2 0 75.5G 0 lvm /home
sdb 8:16 0 2T 0 disk
sdc 8:32 0 2T 0 disk
sdd 8:48 0 2T 0 disk
sde 8:64 0 240G 0 disk
sdf 8:80 0 800G 0 disk
sr0 11:0 1 1024M 0 rom
[root@ceph-1 ~]# cat /etc/hosts
...
192.168.56.101 ceph-1
192.168.56.102 ceph-2
192.168.56.103 ceph-3
192.168.56.200 ceph-admin
  2. The cluster layout is as follows:
Host IP Role
ceph-1 192.168.56.101 deploy, mon, osd ×4
ceph-2 192.168.56.102 mon, osd ×4
ceph-3 192.168.56.103 mon, osd ×4
ceph-admin 192.168.56.200 NTP server, local Ceph repo

Environment cleanup!

If a previous deployment failed, there is no need to remove the Ceph packages or rebuild the VMs. Just run the following commands on every node to reset the environment to the state right after the Ceph packages were installed. It is strongly recommended to clean the environment thoroughly before deploying on top of an old cluster, otherwise all kinds of strange problems will follow.

ps aux|grep ceph |awk '{print $2}'|xargs kill -9
ps -ef|grep ceph
#Make sure all ceph processes are gone at this point!!! If some are still running, run the kill again.
umount /var/lib/ceph/osd/*
rm -rf /var/lib/ceph/osd/*
rm -rf /var/lib/ceph/mon/*
rm -rf /var/lib/ceph/mds/*
rm -rf /var/lib/ceph/bootstrap-mds/*
rm -rf /var/lib/ceph/bootstrap-osd/*
rm -rf /var/lib/ceph/bootstrap-rgw/*
rm -rf /var/lib/ceph/tmp/*
rm -rf /etc/ceph/*
rm -rf /var/run/ceph/*
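If you also want to wipe the old OSD disks themselves before redeploying, something like the following clears their partition tables. This is destructive and is my own addition, assuming sdb-sdd and sdf are the OSD data disks as in this lab:

for d in /dev/sdb /dev/sdc /dev/sdd /dev/sdf; do
    wipefs -a $d      # remove filesystem signatures
    sgdisk -Z $d      # zap GPT and MBR structures (needs the gdisk package)
done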

Setting up a local Ceph repo

This section serves several purposes:

  • For a cluster with no Internet access, you must build a Ceph repo inside the internal network.
  • If you are pinning an older minor release (e.g. 0.94.7 instead of the latest 0.94.9), a local repo makes it easy to add nodes with the same version later.
  • For lab work, a local repo makes rebuilding environments much faster.

At the time of writing, the latest Ceph release is 10.2.3; here we will build a local 10.2.2 repo on the ceph-admin node.

Install httpd and createrepo on the ceph-admin node:

yum install httpd createrepo -y

Create the repo directory and download all the files. The method is simple: go to the Aliyun Ceph mirror, pick the Ceph version you want to install (here /rpm-jewel/el7/x86_64/), and download every rpm that contains 10.2.2, except ceph-debuginfo, which is over 1G and can be skipped. Finally add the ceph-deploy rpm link. You can simply copy and paste the commands below:

mkdir -p /var/www/html/ceph/10.2.2
cd /var/www/html/ceph/10.2.2
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-base-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-common-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-devel-compat-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-fuse-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-libs-compat-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-mds-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-mon-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-osd-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-radosgw-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-selinux-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-test-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/cephfs-java-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/libcephfs1-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/libcephfs1-devel-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/libcephfs_jni1-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/libcephfs_jni1-devel-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/librados2-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/librados2-devel-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/libradosstriper1-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/libradosstriper1-devel-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/librbd1-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/librbd1-devel-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/librgw2-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/librgw2-devel-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/python-ceph-compat-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/python-cephfs-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/python-rados-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/python-rbd-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/rbd-fuse-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/rbd-mirror-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/rbd-nbd-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/noarch/ceph-deploy-1.5.36-0.noarch.rpm
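If you prefer, the same x86_64 packages can be fetched with a small loop instead of one wget per line; this is just a convenience sketch built from the list above (the ceph-deploy noarch rpm still needs the separate wget shown above):

base=http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64
for p in ceph ceph-base ceph-common ceph-devel-compat ceph-fuse ceph-libs-compat \
         ceph-mds ceph-mon ceph-osd ceph-radosgw ceph-selinux ceph-test cephfs-java \
         libcephfs1 libcephfs1-devel libcephfs_jni1 libcephfs_jni1-devel \
         librados2 librados2-devel libradosstriper1 libradosstriper1-devel \
         librbd1 librbd1-devel librgw2 librgw2-devel \
         python-ceph-compat python-cephfs python-rados python-rbd \
         rbd-fuse rbd-mirror rbd-nbd; do
    wget "$base/${p}-10.2.2-0.el7.x86_64.rpm"   # ceph-debuginfo is deliberately skipped
done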

Create the repo metadata and add ceph.repo:

createrepo /var/www/html/ceph/10.2.2
Spawning worker 0 with 3 pkgs
...
Spawning worker 11 with 2 pkgs
Workers Finished
Saving Primary metadata
Saving file lists metadata
Saving other metadata
Generating sqlite DBs
Sqlite DBs complete
[root@ceph-admin ~]# cat /etc/yum.repos.d/ceph.repo
[ceph_local]
name=ceph
baseurl=http://192.168.56.200/ceph/10.2.2
gpgcheck=0

Remember to start the httpd service:

systemctl start httpd

The 10.2.2 repo on the ceph-admin node is now ready; the other nodes (ceph-1/2/3) only need ceph.repo copied over and they can install Ceph from it. I have not found a simpler approach; my original reason for building a local repo was just to make installs inside the VMs faster.
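To confirm that httpd is actually serving the repo, a quick sanity check from any node (my own addition; the IP is the ceph-admin address from the table above):

curl -s http://192.168.56.200/ceph/10.2.2/repodata/repomd.xml | head -3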

Yum repos and Ceph installation

The following commands must be run on every node (ceph-1/2/3).
The following commands must be run on every node (ceph-1/2/3).
The following commands must be run on every node (ceph-1/2/3).

yum clean all
rm -rf /etc/yum.repos.d/*.repo
wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
wget -O /etc/yum.repos.d/epel.repo http://mirrors.aliyun.com/repo/epel-7.repo
sed -i '/aliyuncs/d' /etc/yum.repos.d/CentOS-Base.repo
sed -i 's/$releasever/7/g' /etc/yum.repos.d/CentOS-Base.repo
sed -i '/aliyuncs/d' /etc/yum.repos.d/epel.repo

Add the Ceph repo:

scp ceph-admin:/etc/yum.repos.d/ceph.repo /etc/yum.repos.d/

Install the Ceph packages and ntp:

yum makecache
yum install ceph ceph-radosgw ntp -y

Disable SELinux and firewalld:

sed -i 's/SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
setenforce 0
systemctl stop firewalld
systemctl disable firewalld

Configuring NTP

Here I run the NTP server on the ceph-admin node and make the three ceph-1/2/3 nodes NTP clients, so that time synchronization is solved once and for all. (I have not set up multiple servers for now.)

On the ceph-admin node

Edit /etc/ntp.conf, comment out the four default server lines, and add three lines as follows:

vim /etc/ntp.conf
###comment following lines:
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst
###add following lines:
server 127.127.1.0 minpoll 4
fudge 127.127.1.0 stratum 0
restrict 192.168.56.0 mask 255.255.255.0 nomodify notrap # adjust this line to match your clients' IP range.

Edit /etc/ntp/step-tickers as follows:

# List of NTP servers used by the ntpdate service.
# 0.centos.pool.ntp.org
127.127.1.0

Restart the ntp service and check that the server side is working; it is healthy when the bottom line of ntpq -p output starts with a *:

[root@ceph-admin ~]# systemctl enable ntpd
Created symlink from /etc/systemd/system/multi-user.target.wants/ntpd.service to /usr/lib/systemd/system/ntpd.service.
[root@ceph-admin ~]# systemctl restart ntpd
[root@ceph-admin ~]# ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
*LOCAL(0) .LOCL. 0 l - 16 1 0.000 0.000 0.000

The NTP server side is now configured; next, configure the clients.

On the three nodes ceph-1/ceph-2/ceph-3:
Edit /etc/ntp.conf, comment out the four server lines, and add a single server line pointing at ceph-admin:

vim /etc/ntp.conf
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst
server 192.168.56.200

Restart the ntp service and check that the client has connected to the server; as before, it is connected when the bottom line of ntpq -p starts with a *:

[root@ceph-1 ~]# systemctl enable ntpd
Created symlink from /etc/systemd/system/multi-user.target.wants/ntpd.service to /usr/lib/systemd/system/ntpd.service.
[root@ceph-1 ~]# systemctl restart ntpd
[root@ceph-1 ~]# ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
*ceph-admin .LOCL. 1 u 1 64 1 0.329 0.023 0.000

This does not take long; in production it reaches the * state within five minutes at most. The following shows the output of a client that has not connected correctly yet:

[root@ceph-1 ~]# ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
ceph-admin .INIT. 16 u - 64 0 0.000 0.000 0.000

Notice that there is no * in front of ceph-admin. Make sure every server and client has NTP correctly synced before continuing with the Ceph deployment!!!
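If you would rather script this check than keep re-running ntpq by hand, a simple polling loop works; this is just a convenience sketch of my own:

# wait up to ~5 minutes for ntpq to report a selected (*) peer
for i in $(seq 1 60); do
    ntpq -pn | grep -q '^\*' && { echo "NTP synced"; break; }
    sleep 5
done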

Configuring OS parameters

The reason for this section: methods you find online such as echo 4194303 > /proc/sys/kernel/pid_max do not survive a reboot, so here is a way that persists across reboots. Note that this only demonstrates the method; it does not mean production values must be exactly the ones shown here. And when building Ceph inside VMs you do not need any of this.

  • All parameters under /proc/sys/* (such as pid_max above) are configured the same way: write the value into /etc/sysctl.conf (see the note after this list for applying the settings without a reboot):
# /proc/sys/kernel/pid_max is configured like this:
echo 'kernel.pid_max = 4194303' >> /etc/sysctl.conf
# /proc/sys/fs/file-max is configured like this:
echo 'fs.file-max = 26234859' >> /etc/sysctl.conf
  • ulimit -n, i.e. open files, the maximum number of open file handles, defaults to 1024; here we raise it to 65536:
echo '* soft nofile 65536' >> /etc/security/limits.conf
echo '* hard nofile 65536' >> /etc/security/limits.conf
  • Change the disk I/O scheduler. Here I added /etc/udev/rules.d/99-disk.rules with the following content:
SUBSYSTEM=="block", ATTR{device/model}=="VBOX HARDDISK", ACTION=="add|change", KERNEL=="sd[a-h]", ATTR{queue/scheduler}="noop",ATTR{queue/read_ahead_kb}="8192"

To explain the parameters: the value VBOX HARDDISK comes from cat /sys/block/sda/device/model, and SSDs and SATA disks usually report different models. The rule above sets the scheduler to noop and read_ahead_kb to 8192 for every disk from sda to sdh whose model is VBOX HARDDISK. For different disks, you can add more rules to 99-disk.rules.
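To apply these settings without rebooting (the reboot suggested below also does the job), the kernel parameters can be loaded with sysctl and the udev rules re-triggered; a quick sketch using the device names from this lab (the limits.conf change only takes effect at the next login):

sysctl -p                                        # load the values added to /etc/sysctl.conf
udevadm control --reload-rules                   # pick up 99-disk.rules
udevadm trigger --type=devices --action=change   # re-apply rules to existing disks
cat /sys/block/sdb/queue/scheduler               # verify: should show [noop]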

Finally, finally, finally, the environment is ready. At this point I recommend rebooting the hosts.
Next, a brief description of the Ceph architecture we are going to build:

  • 240G SSD: journals for the three 2T SATA disks and for the 800G SSD.
  • 800G SSD: used as an OSD; the three of them (one per host) form the ssd-pool.
  • 2T SATA: used as OSDs; the nine of them (three per host) form the sata-pool.

Disk partitioning

Split the 240G disk into four 20G partitions to be used as journals:

[root@ceph-1 ~]# fdisk /dev/sde
Welcome to fdisk (util-linux 2.23.2).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.
Device does not contain a recognized partition table
Building a new DOS disklabel with disk identifier 0xca6640e5.
Command (m for help): g
Building a new GPT disklabel (GUID: 2828A364-1320-4DB0-9DF1-9CECADB7019D)
Command (m for help): n
Partition number (1-128, default 1):
First sector (2048-503306845, default 2048):
Last sector, +sectors or +size{K,M,G,T,P} (2048-503306845, default 503306845): +20G
Created partition 1
Command (m for help): n
Partition number (2-128, default 2):
First sector (41945088-503306845, default 41945088):
Last sector, +sectors or +size{K,M,G,T,P} (41945088-503306845, default 503306845): +20G
Created partition 2
Command (m for help): n
Partition number (3-128, default 3):
First sector (83888128-503306845, default 83888128):
Last sector, +sectors or +size{K,M,G,T,P} (83888128-503306845, default 503306845): +20G
Created partition 3
Command (m for help): n
Partition number (4-128, default 4):
First sector (125831168-503306845, default 125831168):
Last sector, +sectors or +size{K,M,G,T,P} (125831168-503306845, default 503306845): +20G
Created partition 4
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
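If you prefer to script this instead of answering fdisk prompts on every host, the same layout can be created non-interactively with sgdisk from the gdisk package; this is my own sketch, so double-check the device name before running it:

yum install -y gdisk
for i in 1 2 3 4; do
    sgdisk --new=${i}:0:+20G /dev/sde    # partition $i, starting at the next free sector, 20G long
done
sgdisk --print /dev/sde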

Update the partition table and check that the partitions are correct:

[root@ceph-1 ~]# partprobe
[root@ceph-1 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 100G 0 disk
├─sda1 8:1 0 500M 0 part /boot
└─sda2 8:2 0 99.5G 0 part
├─centos-root 253:0 0 50G 0 lvm /
├─centos-swap 253:1 0 2G 0 lvm [SWAP]
└─centos-home 253:2 0 47.5G 0 lvm /home
sdb 8:16 0 2T 0 disk
sdc 8:32 0 2T 0 disk
sdd 8:48 0 2T 0 disk
sde 8:64 0 240G 0 disk
├─sde1 8:65 0 20G 0 part
├─sde2 8:66 0 20G 0 part
├─sde3 8:67 0 20G 0 part
└─sde4 8:68 0 20G 0 part
sdf 8:80 0 800G 0 disk
sr0 11:0 1 1024M 0 rom

Partition the 240G disk on all three hosts in the same way, and then we can start deploying Ceph.

If you hit the error below, try rebooting the host; make sure partprobe produces no output:

[root@ceph-1 ~]# partprobe
Error: Error informing the kernel about modifications to partition /dev/sde1 -- Device or resource busy. This means Linux won't know about any changes you made to /dev/sde1 until you reboot -- so you shouldn't mount it or use it in any way before rebooting.
Error: Failed to add partition 1 (Device or resource busy)

Deploying Ceph

Install ceph-deploy on the deploy node (ceph-1); from here on, "deploy node" always refers to ceph-1:

[root@ceph-1 ~]# yum -y install ceph-deploy
[root@ceph-1 ~]# ceph-deploy --version
1.5.36
[root@ceph-1 ~]# ceph -v
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)

Create a deployment directory on the deploy node and start the deployment:

[root@ceph-1 ~]# cd
[root@ceph-1 ~]# mkdir cluster
[root@ceph-1 ~]# cd cluster/
[root@ceph-1 cluster]# ceph-deploy new ceph-1 ceph-2 ceph-3

Run ssh-copy-id from the deploy node to each of the other nodes.

[root@ceph-1 cluster]# ssh-copy-id ceph-2
[root@ceph-1 cluster]# ssh-copy-id ceph-3

If you have not run ssh-copy-id to the nodes beforehand, you will be prompted for passwords; the log looks like this:

[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.34): /usr/bin/ceph-deploy new ceph-1 ceph-2 ceph-3
[ceph_deploy.cli][INFO ] ceph-deploy options:
[ceph_deploy.cli][INFO ] username : None
[ceph_deploy.cli][INFO ] func : <function new at 0x7f91781f96e0>
[ceph_deploy.cli][INFO ] verbose : False
[ceph_deploy.cli][INFO ] overwrite_conf : False
[ceph_deploy.cli][INFO ] quiet : False
[ceph_deploy.cli][INFO ] cd_conf : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f917755ca28>
[ceph_deploy.cli][INFO ] cluster : ceph
[ceph_deploy.cli][INFO ] ssh_copykey : True
[ceph_deploy.cli][INFO ] mon : ['ceph-1', 'ceph-2', 'ceph-3']
..
..
[ceph_deploy.new][WARNIN] could not connect via SSH
[ceph_deploy.new][INFO ] will connect again with password prompt
The authenticity of host 'ceph-2 (192.168.57.223)' can't be established.
ECDSA key fingerprint is ef:e2:3e:38:fa:47:f4:61:b7:4d:d3:24:de:d4:7a:54.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ceph-2,192.168.57.223' (ECDSA) to the list of known hosts.
root
root@ceph-2's password:
[ceph-2][DEBUG ] connected to host: ceph-2
..
..
[ceph_deploy.new][DEBUG ] Resolving host ceph-3
[ceph_deploy.new][DEBUG ] Monitor ceph-3 at 192.168.57.224
[ceph_deploy.new][DEBUG ] Monitor initial members are ['ceph-1', 'ceph-2', 'ceph-3']
[ceph_deploy.new][DEBUG ] Monitor addrs are ['192.168.57.222', '192.168.57.223', '192.168.57.224']
[ceph_deploy.new][DEBUG ] Creating a random mon key...
[ceph_deploy.new][DEBUG ] Writing monitor keyring to ceph.mon.keyring...
[ceph_deploy.new][DEBUG ] Writing initial config to ceph.conf...

At this point the directory contains:

[root@ceph-1 cluster]# ls
ceph.conf ceph-deploy-ceph.log ceph.mon.keyring

Add a few options to ceph.conf (cluster_network is not configured here):

[root@ceph-1 cluster]# echo public_network=192.168.56.0/24 >> ceph.conf
[root@ceph-1 cluster]# echo 'osd_crush_update_on_start = false' >> ceph.conf
[root@ceph-1 cluster]# echo 'osd_journal_size = 20480' >> ceph.conf
[root@ceph-1 cluster]# cat ceph.conf
[global]
fsid = db18ec7d-9070-4a23-9fbc-36b3137d7587
mon_initial_members = ceph-1, ceph-2, ceph-3
mon_host = 192.168.56.101,192.168.56.102,192.168.56.103
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network=192.168.56.0/24
#The next option keeps OSDs from updating the CRUSH map automatically at startup, because we are going to customize CRUSH ourselves.
osd_crush_update_on_start = false
#The journal defaults to 5G; set it to 20G below.
osd_journal_size = 20480

Deploy the monitors:

[root@ceph-1 cluster]# ceph-deploy mon create-initial
..
..(assorted log output)
[root@ceph-1 cluster]# ls
ceph.bootstrap-mds.keyring ceph.bootstrap-rgw.keyring ceph.conf ceph.mon.keyring
ceph.bootstrap-osd.keyring ceph.client.admin.keyring ceph-deploy-ceph.log

Check the cluster status:

[root@ceph-1 cluster]# ceph -s
cluster db18ec7d-9070-4a23-9fbc-36b3137d7587
health HEALTH_ERR
no osds
monmap e1: 3 mons at {ceph-1=192.168.56.101:6789/0,ceph-2=192.168.56.102:6789/0,ceph-3=192.168.56.103:6789/0}
election epoch 4, quorum 0,1,2 ceph-1,ceph-2,ceph-3
osdmap e1: 0 osds: 0 up, 0 in
flags sortbitwise
pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
0 kB used, 0 kB / 0 kB avail
64 creating

Now deploy the OSDs. The data-disk to journal mapping is as follows:

OSD disk journal disk notes
/dev/sdb /dev/sde1 2T SATA
/dev/sdc /dev/sde2 2T SATA
/dev/sdd /dev/sde3 2T SATA
/dev/sdf /dev/sde4 800G SSD

A little habit of mine: deploy the SATA OSDs first and the SSD OSDs afterwards, so the OSD IDs end up contiguous.

The arguments to osd prepare have the form HOSTNAME:data-disk:journal-disk:

ceph-deploy --overwrite-conf osd prepare ceph-1:/dev/sdb:/dev/sde1 ceph-1:/dev/sdc:/dev/sde2 ceph-1:/dev/sdd:/dev/sde3 ceph-2:/dev/sdb:/dev/sde1 ceph-2:/dev/sdc:/dev/sde2 ceph-2:/dev/sdd:/dev/sde3 ceph-3:/dev/sdb:/dev/sde1 ceph-3:/dev/sdc:/dev/sde2 ceph-3:/dev/sdd:/dev/sde3

The arguments to osd activate have the form HOSTNAME:data-disk1; note that what was /dev/sdb in the prepare step must now be written as /dev/sdb1, since the disk has been partitioned.

Now activate the OSDs that were just prepared:

You will hit an error here, because the jewel release requires the journal devices to be owned by ceph:ceph. The error looks like this:

> [ceph-1][WARNIN] 2016-10-10 04:56:59.268469 7f47dd1d8800 -1 OSD::mkfs: ObjectStore::mkfs failed with error -13
> [ceph-1][WARNIN] 2016-10-10 04:56:59.268529 7f47dd1d8800 -1 ** ERROR: error creating empty object store in /var/lib/ceph/tmp/mnt.YmTO7V: (13) Permission denied
>

Running the following command fixes the error; ideally this step belongs in the disk partitioning section:

> chown ceph:ceph /dev/sde[1-4]
>

Do not forget to change the ownership of the journal partitions on every host.
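Also note that chown on a device node does not survive a reboot. One way to make it stick is a udev rule; a minimal sketch, assuming the journal partitions are sde1-sde4 on every host (the file name and match are my own choice, not from the original deployment):

cat > /etc/udev/rules.d/70-ceph-journal.rules <<'EOF'
KERNEL=="sde[1-4]", OWNER="ceph", GROUP="ceph", MODE="0660"
EOF
udevadm control --reload-rules
udevadm trigger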

ceph-deploy --overwrite-conf osd activate ceph-1:/dev/sdb1 ceph-1:/dev/sdc1 ceph-1:/dev/sdd1 ceph-2:/dev/sdb1 ceph-2:/dev/sdc1 ceph-2:/dev/sdd1 ceph-3:/dev/sdb1 ceph-3:/dev/sdc1 ceph-3:/dev/sdd1

I hit a small problem while deploying and one OSD did not come up (once all the other OSDs are deployed, re-deploying the problem OSD fixes it). If nothing goes wrong, the cluster status should look like this:

[root@ceph-1 cluster]# ceph -s
cluster db18ec7d-9070-4a23-9fbc-36b3137d7587
health HEALTH_ERR
64 pgs are stuck inactive for more than 300 seconds
64 pgs stuck inactive
too few PGs per OSD (7 < min 30)
monmap e1: 3 mons at {ceph-1=192.168.56.101:6789/0,ceph-2=192.168.56.102:6789/0,ceph-3=192.168.56.103:6789/0}
election epoch 4, quorum 0,1,2 ceph-1,ceph-2,ceph-3
osdmap e19: 9 osds: 9 up, 9 in
flags sortbitwise
pgmap v67: 64 pgs, 1 pools, 0 bytes data, 0 objects
294 MB used, 18421 GB / 18422 GB avail
64 creating

No need to wait: these 64 creating PGs will stay stuck. Look at ceph osd tree and you will understand what osd_crush_update_on_start = false does. Unlike the quick-deploy article, the cluster will not reach active+clean here; we will finish the CRUSH setup in the following sections to resolve this.

Deploy the three SSD OSDs in the same way:

ceph-deploy osd prepare ceph-1:/dev/sdf:/dev/sde4 ceph-2:/dev/sdf:/dev/sde4 ceph-3:/dev/sdf:/dev/sde4
ceph-deploy osd activate ceph-1:/dev/sdf1 ceph-2:/dev/sdf1 ceph-3:/dev/sdf1

After running these, ceph -s should show 12 OSDs up.

Customizing CRUSH

The last piece of work is to arrange the 12 freshly created OSDs into the structure shown in the figure below. The names root-sata, ceph-1-sata and so on are made up and have nothing to do with the hostnames, but the OSDs must be mapped in exactly the order in which they were deployed!

[Figure: the target Ceph CRUSH hierarchy]

First create the corresponding buckets. The command syntax is:

ceph osd crush add <osdname (id|osd.id)> <float[0.0-]> <args> [<args>...]
# add or update crushmap position and weight for <name> with <weight> and location <args>
ceph osd crush add-bucket <name> <type>
# add no-parent (probably root) crush bucket <name> of type <type>

So the corresponding commands are:

ceph osd crush add-bucket root-sata root
ceph osd crush add-bucket root-ssd root
ceph osd crush add-bucket ceph-1-sata host
ceph osd crush add-bucket ceph-2-sata host
ceph osd crush add-bucket ceph-3-sata host
ceph osd crush add-bucket ceph-1-ssd host
ceph osd crush add-bucket ceph-2-ssd host
ceph osd crush add-bucket ceph-3-ssd host
[root@ceph-1 cluster]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-9 0 host ceph-3-ssd
-8 0 host ceph-2-ssd
-7 0 host ceph-1-ssd
-6 0 host ceph-3-sata
-5 0 host ceph-2-sata
-4 0 host ceph-1-sata
-3 0 root root-ssd
-2 0 root root-sata
-1 0 root default
0 0 osd.0 up 1.00000 1.00000
1 0 osd.1 up 1.00000 1.00000
...

Next move each host bucket under its root, then add each OSD under the right host. Note that the weight used when adding an OSD is its real capacity (2 for a 2T disk, 0.8 for 800G); do not put in arbitrary numbers!!! The commands are:

ceph osd crush move ceph-1-sata root=root-sata
ceph osd crush move ceph-2-sata root=root-sata
ceph osd crush move ceph-3-sata root=root-sata
ceph osd crush move ceph-1-ssd root=root-ssd
ceph osd crush move ceph-2-ssd root=root-ssd
ceph osd crush move ceph-3-ssd root=root-ssd
ceph osd crush add osd.0 2 host=ceph-1-sata
ceph osd crush add osd.1 2 host=ceph-1-sata
ceph osd crush add osd.2 2 host=ceph-1-sata
ceph osd crush add osd.3 2 host=ceph-2-sata
ceph osd crush add osd.4 2 host=ceph-2-sata
ceph osd crush add osd.5 2 host=ceph-2-sata
ceph osd crush add osd.6 2 host=ceph-3-sata
ceph osd crush add osd.7 2 host=ceph-3-sata
ceph osd crush add osd.8 2 host=ceph-3-sata
ceph osd crush add osd.9 0.8 host=ceph-1-ssd # note: the weight is the real capacity
ceph osd crush add osd.10 0.8 host=ceph-2-ssd
ceph osd crush add osd.11 0.8 host=ceph-3-ssd

What if you add one by mistake:

ceph osd crush rm osd.0 # just remove it and add it again; and of course -h always helps
ceph osd crush -h

Now look at ceph osd tree; doesn't it look great?

[root@ceph-1 cluster]# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-3 2.39996 root root-ssd
-7 0.79999 host ceph-1-ssd
9 0.79999 osd.9 up 1.00000 1.00000
-8 0.79999 host ceph-2-ssd
10 0.79999 osd.10 up 1.00000 1.00000
-9 0.79999 host ceph-3-ssd
11 0.79999 osd.11 up 1.00000 1.00000
-2 18.00000 root root-sata
-4 6.00000 host ceph-1-sata
0 2.00000 osd.0 up 1.00000 1.00000
1 2.00000 osd.1 up 1.00000 1.00000
2 2.00000 osd.2 up 1.00000 1.00000
-5 6.00000 host ceph-2-sata
3 2.00000 osd.3 up 1.00000 1.00000
4 2.00000 osd.4 up 1.00000 1.00000
5 2.00000 osd.5 up 1.00000 1.00000
-6 6.00000 host ceph-3-sata
6 2.00000 osd.6 up 1.00000 1.00000
7 2.00000 osd.7 up 1.00000 1.00000
8 2.00000 osd.8 up 1.00000 1.00000
-1 0 root default

There is still a lonely root default at the bottom; just leave it there as a keepsake.

Finally, all that is left is to edit the CRUSH map.
I like to export the crushmap and work on it directly, which is convenient and saves memorizing commands.
Export the crushmap and add a rule:

ceph osd getcrushmap -o /tmp/crush
crushtool -d /tmp/crush -o /tmp/crush.txt
# -d means decompile; the exported crushmap is in binary format.
vim /tmp/crush.txt

Scroll to the bottom, where there is a # rules section:

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

Copy-paste to add a second rule and change both so they look like this:

# rules
rule rule-sata { # renamed so it is easy to recognize
ruleset 0
type replicated
min_size 1
max_size 10
step take root-sata # changed to the SATA root from the diagram
step chooseleaf firstn 0 type host
step emit
}
rule rule-ssd { # renamed
ruleset 1 # bump the ruleset number
type replicated
min_size 1
max_size 10
step take root-ssd # changed to the SSD root from the diagram
step chooseleaf firstn 0 type host
step emit
}

Save and quit, compile the edited crush.txt, and inject it into the cluster:

[root@ceph-1 cluster]# crushtool -c /tmp/crush.txt -o /tmp/crush.bin
[root@ceph-1 cluster]# ceph osd setcrushmap -i /tmp/crush.bin
set crush map

Check ceph -s:

[root@ceph-1 cluster]# ceph -s
cluster db18ec7d-9070-4a23-9fbc-36b3137d7587
health HEALTH_WARN
too few PGs per OSD (16 < min 30)
monmap e1: 3 mons at {ceph-1=192.168.56.101:6789/0,ceph-2=192.168.56.102:6789/0,ceph-3=192.168.56.103:6789/0}
election epoch 4, quorum 0,1,2 ceph-1,ceph-2,ceph-3
osdmap e59: 12 osds: 12 up, 12 in
flags sortbitwise
pgmap v321: 64 pgs, 1 pools, 0 bytes data, 0 objects
405 MB used, 20820 GB / 20820 GB avail
64 active+clean

To clear the warning, just increase the PG count of the rbd pool:

ceph osd pool set rbd pg_num 256
ceph osd pool set rbd pgp_num 256
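For reference, the commonly quoted rule of thumb (my own addition, not from the original article) is roughly (number of OSDs × 100) / replica count PGs per pool, rounded to a power of two:

# 12 OSDs * 100 / 3 replicas = 400  ->  256 or 512 are both reasonable; 256 is used here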

Wait a moment and you will see HEALTH_OK. The benefit of this architecture is not visible yet, though; first look at the rbd pool:

[root@ceph-1 cluster]# ceph osd pool ls detail
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 62 flags hashpspool stripe_width 0

The key item is crush_ruleset 0: the rbd pool is using the rule-sata rule we just created, so all of its data lives on osd.0 through osd.8:

[root@ceph-1 cluster]# ceph pg map 0.0
osdmap e66 pg 0.0 (0.0) -> up [6,1,5] acting [6,1,5]

For the concepts behind PGs, see the (Chinese) article 大话ceph–PG那点事儿.

Now set the rbd pool's crush_ruleset to 1, i.e. the tree made of osd.9-osd.11, and all of the pool's data will be placed on those three SSDs:

[root@ceph-1 cluster]# ceph osd pool set rbd crush_ruleset 1
set pool 0 crush_ruleset to 1
[root@ceph-1 cluster]# ceph pg map 0.0
osdmap e68 pg 0.0 (0.0) -> up [9,10,11] acting [9,10,11]

In practice it is used like this:

[root@ceph-1 cluster]# ceph osd pool create sata-pool 256 rule-sata
pool 'sata-pool' created
[root@ceph-1 cluster]# ceph osd pool create ssd-pool 256 rule-ssd
pool 'ssd-pool' created
[root@ceph-1 cluster]# ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
20820G 20820G 452M 0
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 0 0 0 799G 0
sata-pool 1 0 0 6140G 0
ssd-pool 2 0 0 799G 0
[root@ceph-1 cluster]# ceph osd pool ls detail
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 1 object_hash rjenkins pg_num 256 pgp_num 256 last_change 67 flags hashpspool stripe_width 0
pool 1 'sata-pool' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 69 flags hashpspool stripe_width 0
pool 2 'ssd-pool' replicated size 3 min_size 2 crush_ruleset 1 object_hash rjenkins pg_num 256 pgp_num 256 last_change 79 flags hashpspool stripe_width 0

The point of this section is to separate the SATA and SSD disks within the same hosts, building two pools with different speeds out of one batch of identical machines. The customizability of CRUSH really gives Ceph enormous flexibility.

Closing words

Apart from skipping concrete ceph.conf parameters and OS tuning values, this article should be enough to build a cluster fit for production use. It also settles something that had been on my mind: the quick-deploy article was rather rough, and this one took quite a bit of effort to lay the whole workflow out properly.

Enjoy it!