A Summary of Commonly Used Ceph Tables

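All of the text here is taken from Red Hat's official documentation and from the slides of Ceph Day Beijing. It is shared purely for educational purposes.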

OSD Flags

| Flag | Description | Use Cases | Notes |
| :-- | :-- | :-- | :-- |
| noin | Prevents OSDs from being treated as in the cluster. | Commonly used with noout to address flapping OSDs. | Usually set together with noout to stop OSDs from flapping between up and down. |
| noout | Prevents OSDs from being treated as out of the cluster. | If the mon osd report timeout is exceeded and an OSD has not reported to the monitor, the OSD will get marked out. If this happens erroneously, you can set noout to prevent the OSD(s) from getting marked out while you troubleshoot the issue. | After 300 seconds (mon_osd_down_out_interval) the MON automatically marks a down OSD as out, and data starts migrating as soon as the OSD is out. Set this flag while you handle a failure to avoid that migration. |
| noup | Prevents OSDs from being treated as up and running. | Commonly used with nodown to address flapping OSDs. | Usually set together with nodown to deal with OSDs flapping between up and down. |
| nodown | Prevents OSDs from being treated as down. | Networking issues may interrupt Ceph ‘heartbeat’ processes, and an OSD may be up but still get marked down. You can set nodown to prevent OSDs from getting marked down while troubleshooting the issue. | Network problems can disrupt the heartbeats between Ceph daemons. Sometimes an OSD process is still alive but gets reported by its peer OSDs and marked down, which causes needless churn. If you are sure the OSD process stays healthy, set nodown to keep it from being wrongly marked down. |
| full | Makes a cluster appear to have reached its full_ratio, and thereby prevents write operations. | If a cluster is reaching its full_ratio, you can pre-emptively set the cluster to full and expand capacity. NOTE: Setting the cluster to full will prevent write operations. | If the cluster is nearly full you can mark it full ahead of time; note that this stops writes. (Whether it actually takes effect needs to be verified by testing.) |
| pause | Ceph will stop processing read and write operations, but will not affect OSD in, out, up or down statuses. | If you need to troubleshoot a running Ceph cluster without clients reading and writing data, you can set the cluster to pause to prevent client operations. | This flag stops all client reads and writes while the cluster itself keeps running normally. |
| nobackfill | Ceph will prevent new backfill operations. | If you need to take an OSD or node down temporarily (e.g., upgrading daemons), you can set nobackfill so that Ceph will not backfill while the OSD(s) is down. | |
| norebalance | Ceph will prevent new rebalancing operations. | Usually set together with nobackfill and norecover. When you work on the cluster (taking down an OSD or a whole node) and do not want data to recover or migrate during the work, set this flag, and remember to unset it afterwards. | |
| norecover | Ceph will prevent new recovery operations. | If you need to replace an OSD disk and don’t want the PGs to recover to another OSD while you are hotswapping disks, you can set norecover to prevent the other OSDs from copying a new set of PGs to other OSDs. | Likewise used to keep data from recovering while you work on disks. |
| noscrub | Ceph will prevent new scrubbing operations. | If you want to prevent scrubbing (e.g., to reduce overhead during high loads, recovery, backfilling, rebalancing, etc.), you can set noscrub and/or nodeep-scrub to prevent the cluster from scrubbing OSDs. | |
| nodeep-scrub | Ceph will prevent new deep scrubbing operations. | Scrubbing can hurt recovery performance while the cluster is recovering; set this together with noscrub to stop scrubbing. Leaving these flags set is generally not recommended. | |
| notieragent | Ceph will disable the process that is looking for cold/dirty objects to flush and evict. | If you want to stop the tier agent process from finding cold objects to flush to the backing storage tier, you may set notieragent. | Stops the tier agent from looking for cold data and flushing it to the backing storage. |
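
As an illustration of the "set before maintenance, unset afterwards" pattern described for noout, nobackfill, norebalance and norecover, here is a minimal Python sketch that only builds the `ceph osd set` / `ceph osd unset` command lines and does not run them. The helper names and the choice of which flags form a "maintenance set" are assumptions made for this example, not something the source prescribes.

```python
import shlex

# Flags named in the table above. Grouping them as a "maintenance set"
# is an assumption for illustration only.
MAINTENANCE_FLAGS = ("noout", "nobackfill", "norebalance", "norecover")


def flag_commands(flags, action="set"):
    """Return one `ceph osd set|unset <flag>` command line per flag."""
    if action not in ("set", "unset"):
        raise ValueError("action must be 'set' or 'unset'")
    return [shlex.join(["ceph", "osd", action, flag]) for flag in flags]


def maintenance_plan(flags=MAINTENANCE_FLAGS):
    """Commands to run before and after maintenance (unset in reverse order)."""
    before = flag_commands(flags, "set")
    after = flag_commands(list(reversed(flags)), "unset")
    return before, after


if __name__ == "__main__":
    before, after = maintenance_plan()
    print("\n".join(before + ["# ... do the maintenance work ..."] + after))
```

Printing the plan rather than executing it keeps the sketch safe to try on any machine; the resulting lines can be reviewed and pasted into a shell on an admin node.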

Auth Capabilities (Caps)

| Capability | Description |
| :-- | :-- |
| allow | Precedes access settings for a daemon. |
| r | Gives the user read access. Required with monitors to retrieve the CRUSH map. |
| w | Gives the user write access to objects. |
| x | Gives the user the capability to call class methods (i.e., both read and write) and to conduct auth operations on monitors. |
| class-read | Gives the user the capability to call class read methods. Subset of x. |
| class-write | Gives the user the capability to call class write methods. Subset of x. |
| * | Gives the user read, write and execute permissions for a particular daemon/pool, and the ability to execute admin commands. |
| profile osd | Gives a user permissions to connect as an OSD to other OSDs or monitors. Conferred on OSDs to enable OSDs to handle replication heartbeat traffic and status reporting. |
| profile bootstrap-osd | Gives a user permissions to bootstrap an OSD. Conferred on deployment tools such as ceph-disk, ceph-deploy, etc. so that they have permissions to add keys, etc. when bootstrapping an OSD. |
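
To show how these capabilities are combined per daemon type, the following Python sketch assembles a `ceph auth get-or-create` command line from a mapping of daemon type to capability string. The entity name `client.demo`, the pool name `data`, and the helper name are hypothetical; the sketch only builds the string and does not contact a cluster.

```python
import shlex


def auth_get_or_create(entity, caps):
    """Build a `ceph auth get-or-create` command line.

    `caps` maps a daemon type (e.g. "mon", "osd") to a capability string
    such as "allow r" or "allow rw pool=data".
    """
    argv = ["ceph", "auth", "get-or-create", entity]
    for daemon, cap in caps.items():
        # Every capability string in the table starts with "allow" or "profile".
        if not cap.startswith(("allow", "profile")):
            raise ValueError(f"capability for {daemon!r} must start with 'allow' or 'profile'")
        argv += [daemon, cap]
    return shlex.join(argv)


if __name__ == "__main__":
    # Hypothetical client: read-only on monitors, read/write on one pool.
    print(auth_get_or_create("client.demo", {"mon": "allow r", "osd": "allow rw pool=data"}))
```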

PG States

This section is long, but the explanations are thorough, so it is worth reading from start to finish.

State Description
Creating When you create a pool, it will create the number of placement groups you specified. Ceph will echo creating when it is creating one or more placement groups. Once they are created, the OSDs that are part of a placement group’s Acting Set will peer. Once peering is complete, the placement group status should be active+clean, which means a Ceph client can begin writing to the placement group.

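When you create a pool, Ceph creates its PGs (loosely speaking, it creates directories on the OSDs), and a PG that is still being created is marked creating. Once the PG exists, the OSDs in its acting set peer with each other. For example, ceph pg map 1.0 returns osdmap e9395 pg 1.0 (1.0) -> up [27,4,10] acting [27,4,10]: the three replicas of pg 1.0 live on osd.27, osd.4 and osd.10, so those three OSDs talk to each other about pg 1.0 to make sure their state is consistent. When peering completes, that is, when the three replicas agree on the state, the PG becomes active+clean and clients can start writing to it.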
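
The ceph pg map output quoted above can be picked apart with a small parser. This is a minimal sketch that assumes the single-line format shown in the example; other Ceph releases may format the line differently, and the function name is made up for illustration.

```python
import re

# Matches the one-line output quoted above, e.g.
#   osdmap e9395 pg 1.0 (1.0) -> up [27,4,10] acting [27,4,10]
PG_MAP_RE = re.compile(
    r"osdmap e(?P<epoch>\d+) pg (?P<pgid>\S+) \([^)]*\) -> "
    r"up \[(?P<up>[\d,]*)\] acting \[(?P<acting>[\d,]*)\]"
)


def parse_pg_map(line):
    """Return the osdmap epoch, PG id, up set and acting set from one output line."""
    match = PG_MAP_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognised pg map line: {line!r}")

    def to_ids(text):
        return [int(part) for part in text.split(",") if part]

    return {
        "epoch": int(match["epoch"]),
        "pgid": match["pgid"],
        "up": to_ids(match["up"]),
        "acting": to_ids(match["acting"]),
    }


if __name__ == "__main__":
    print(parse_pg_map("osdmap e9395 pg 1.0 (1.0) -> up [27,4,10] acting [27,4,10]"))
```

The first OSD in the acting set (osd.27 here) is the primary for the PG.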
Peering When Ceph is Peering a placement group, Ceph is bringing the OSDs that store the replicas of the placement group into agreement about the state of the objects and metadata in the placement group. When Ceph completes peering, this means that the OSDs that store the placement group agree about the current state of the placement group. However, completion of the peering process does NOT mean that each replica has the latest contents.

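Peering is essentially the process in which the OSDs that hold replicas of the same PG negotiate agreement on the state of the objects and metadata each of them stores. Completing peering does not mean that every replica holds the latest data.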
Authoritative History
Ceph will NOT acknowledge a write operation to a client, until all OSDs of the acting set persist the write operation. This practice ensures that at least one member of the acting set will have a record of every acknowledged write operation since the last successful peering operation.
With an accurate record of each acknowledged write operation, Ceph can construct and disseminate a new authoritative history of the placement group: a complete and fully ordered set of operations that, if performed, would bring an OSD’s copy of a placement group up to date.

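Ceph does not tell the client that a write is complete until every OSD in the acting set has persisted it. This guarantees that at least one member of the acting set has a record of every acknowledged write since the last successful peering. From that record Ceph can build a new authoritative history for the PG, an ordered list of operations that brings any out-of-date copy back up to date.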
Active Once Ceph completes the peering process, a placement group may become active. The active state means that the data in the placement group is generally available in the primary placement group and the replicas for read and write operations.

Once a PG finishes peering it becomes active, which means the PG on both the primary and the replica OSDs can serve reads and writes.
Clean When a placement group is in the clean state, the primary OSD and the replica OSDs have successfully peered and there are no stray replicas for the placement group. Ceph replicated all objects in the placement group the correct number of times.

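In this state the primary and replica OSDs have peered successfully and there are no stray replicas; the PG has as many healthy replicas as the pool's replica count requires.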
Degraded When a client writes an object to the primary OSD, the primary OSD is responsible for writing the replicas to the replica OSDs. After the primary OSD writes the object to storage, the placement group will remain in a degraded state until the primary OSD has received an acknowledgement from the replica OSDs that Ceph created the replica objects successfully.

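When a client writes an object to the primary OSD, the primary is responsible for writing the remaining replicas to the replica OSDs. After the primary has written the object, the PG stays degraded until the replica OSDs acknowledge the write to the primary.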
The reason a placement group can be active+degraded is that an OSD may be active even though it doesn’t hold all of the objects yet. If an OSD goes down, Ceph marks each placement group assigned to the OSD as degraded. The OSDs must peer again when the OSD comes back online. However, a client can still write a new object to a degraded placement group if it is active.

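A PG can be active+degraded because an OSD can be active even though its copy of the PG does not hold every object yet. When an OSD goes down, Ceph marks every PG on that OSD as degraded, and the OSDs must peer again once it comes back online. Even so, clients can still write to an active+degraded PG.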
If an OSD is down and the degraded condition persists, Ceph may mark the down OSD as out of the cluster and remap the data from the down OSD to another OSD. The time between being marked down and being marked out is controlled by mon osd down out interval, which is set to 300 seconds by default.

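Five minutes after an OSD goes down, the cluster automatically marks it out and remaps the missing PG replicas to other OSDs for recovery, so that the replica count is restored. The five-minute window is set by mon osd down out interval, which defaults to 300 seconds.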
A placement group can also be degraded, because Ceph cannot find one or more objects that Ceph thinks should be in the placement group. While you cannot read or write to unfound objects, you can still access all of the other objects in the degraded placement group.

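Ceph also marks a PG as degraded when objects that should be in it cannot be found. You can keep accessing the objects that are still there, but you cannot read or write the unfound ones.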
Let’s say there are 9 OSDs with size = 3 (three copies of objects). If OSD number 9 goes down, the PGs assigned to OSD 9 go in a degraded state. If OSD 9 doesn’t recover, it goes out of the cluster and the cluster rebalances. In that scenario, the PGs are degraded and then recover to an active state.

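Suppose there are 9 OSDs with three replicas, and OSD 9 goes down. Every PG on OSD 9 is marked degraded. If OSD 9 never rejoins the cluster, the cluster automatically recovers that OSD's data elsewhere; in this scenario the PGs are degraded at first and become active again once recovery finishes.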
Recovering Ceph was designed for fault-tolerance at a scale where hardware and software problems are ongoing. When an OSD goes down, its contents may fall behind the current state of other replicas in the placement groups. When the OSD is back up, the contents of the placement groups must be updated to reflect the current state. During that time period, the OSD may reflect a recovering state.

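Ceph was designed from the start for fault tolerance against hardware and software errors. When an OSD goes down, the replicas it holds fall behind the other replicas; when it comes back up, its data has to be brought up to the current state. During that period the PGs on that OSD are marked recovering.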
Recovery isn’t always trivial, because a hardware failure might cause a cascading failure of multiple OSDs. For example, a network switch for a rack or cabinet may fail, which can cause the OSDs of a number of host machines to fall behind the current state of the cluster. Each one of the OSDs must recover once the fault is resolved.

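Recovery should not be underestimated, because a single hardware fault can set off a chain of problems across many OSDs. For example, if the switch for a rack or cabinet fails, a large batch of OSDs falls behind, and each of them has to recover once the fault is fixed and it is back online.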
Ceph provides a number of settings to balance the resource contention between new service requests and the need to recover data objects and restore the placement groups to the current state. The osd recovery delay start setting allows an OSD to restart, re-peer and even process some replay requests before starting the recovery process. The osd recovery threads setting limits the number of threads for the recovery process (1 thread by default). The osd recovery thread timeout sets a thread timeout, because multiple OSDs may fail, restart and re-peer at staggered rates. The osd recovery max active setting limits the number of recovery requests an OSD will entertain simultaneously to prevent the OSD from failing to serve. The osd recovery max chunk setting limits the size of the recovered data chunks to prevent network congestion.

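Ceph provides several settings that balance client requests against data-recovery requests; they are the osd recovery settings named in the paragraph above.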
Backfilling When a new OSD joins the cluster, CRUSH will reassign placement groups from OSDs in the cluster to the newly added OSD. Forcing the new OSD to accept the reassigned placement groups immediately can put excessive load on the new OSD. Back filling the OSD with the placement groups allows this process to begin in the background. Once backfilling is complete, the new OSD will begin serving requests when it is ready. During the backfill operations, you may see one of several states: backfill_wait indicates that a backfill operation is pending, but isn’t underway yet; backfill indicates that a backfill operation is underway; and, backfill_too_full indicates that a backfill operation was requested, but couldn’t be completed due to insufficient storage capacity. When a placement group can’t be backfilled, it may be considered incomplete. Ceph provides a number of settings to manage the load spike associated with reassigning placement groups to an OSD (especially a new OSD). By default, osd_max_backfills sets the maximum number of concurrent backfills to or from an OSD to 10. The osd backfill full ratio enables an OSD to refuse a backfill request if the OSD is approaching its full ratio (85%, by default). If an OSD refuses a backfill request, the osd backfill retry interval enables an OSD to retry the request (after 10 seconds, by default). OSDs can also set osd backfill scan min and osd backfill scan max to manage scan intervals (64 and 512, by default).

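When a new OSD joins the cluster, CRUSH reassigns PGs and moves some of them from the other OSDs to the new OSD. Forcing the new OSD to accept all of the incoming PGs at once would greatly increase its load. Backfilling lets the process run in the background, and once it completes the new OSD starts serving I/O requests. During backfill you may see the following states:

backfill_wait: a backfill operation is pending and has not started yet.
backfill: a backfill operation is in progress.
backfill_too_full: a backfill was requested but could not be completed because the OSD does not have enough free capacity.
incomplete: a PG that cannot be backfilled may be considered incomplete.
Ceph also provides a set of parameters to throttle backfill, including osd_max_backfills, the maximum number of concurrent backfills to or from an OSD; osd_backfill_full_ratio, which makes an OSD refuse backfill requests once it approaches its full ratio (85% by default); and osd_backfill_retry_interval, how long an OSD waits before retrying a refused request (10 seconds by default).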
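
As a rough illustration of how the backfill settings above would look in a configuration file, this Python sketch renders them as an [osd] section. The option names and values are the ones quoted in the text; defaults differ between Ceph releases and some options may have been renamed or removed in newer ones, so treat every value as an assumption to check against your own version. The helper name is hypothetical.

```python
import configparser
import io

# Option names and defaults as quoted in the text above.
# Assumption: verify each one against the Ceph release you actually run.
BACKFILL_SETTINGS = {
    "osd max backfills": "10",
    "osd backfill full ratio": "0.85",
    "osd backfill retry interval": "10",
    "osd backfill scan min": "64",
    "osd backfill scan max": "512",
}


def render_osd_section(settings):
    """Render the settings as an INI-style [osd] section and return the text."""
    parser = configparser.ConfigParser()
    parser["osd"] = settings
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_osd_section(BACKFILL_SETTINGS))
```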
Remapped When the Acting Set that services a placement group changes, the data migrates from the old acting set to the new acting set. It may take some time for a new primary OSD to service requests. So it may ask the old primary to continue to service requests until the placement group migration is complete. Once data migration completes, the mapping uses the primary OSD of the new acting set.

When the acting set that serves a PG changes, data migrates from the old acting set to the new one. This can take a while, and the new primary OSD cannot serve requests until the migration finishes, so it asks the old primary to keep serving until the PG migration is complete. Once the data has migrated, the new primary takes over.
Stale While Ceph uses heartbeats to ensure that hosts and daemons are running, the ceph-osd daemons may also get into a stuck state where they aren’t reporting statistics in a timely manner (e.g., a temporary network fault). By default, OSD daemons report their placement group, up thru, boot and failure statistics every half second (i.e., 0.5), which is more frequent than the heartbeat thresholds. If the Primary OSD of a placement group’s acting set fails to report to the monitor or if other OSDs have reported the primary OSD down, the monitors will mark the placement group stale.

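Ceph uses heartbeats to make sure hosts and daemons are running, but an OSD daemon can also get stuck and fail to report statistics on time. By default an OSD reports its PG, up thru, boot and failure statistics every half second, which is more frequent than the heartbeat thresholds. If the primary OSD cannot report to the MON, or other OSDs report the primary as down, the monitors mark the PGs whose primary is that OSD as stale.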
When you start your cluster, it is common to see the stale state until the peering process completes. After your cluster has been running for awhile, seeing placement groups in the stale state indicates that the primary OSD for those placement groups is down or not reporting placement group statistics to the monitor.

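Right after the cluster starts, PGs commonly show as stale until peering completes. If PGs are stuck in stale after the cluster has been running for a while, the primary OSD for those PGs is down or is not reporting PG statistics to the MON.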
Misplaced There are some temporary backfilling scenarios where a PG gets mapped temporarily to an OSD. When that temporary situation should no longer be the case, the PGs might still reside in the temporary location and not in the proper location. In which case, they are said to be misplaced. That’s because the correct number of extra copies actually exist, but one or more copies is in the wrong place.

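In some backfill scenarios a PG is temporarily mapped to an OSD. Once that temporary arrangement should have ended, the PG may still sit in the temporary location rather than the proper one; such a PG is misplaced. The correct number of copies exists, but one or more of them is in the wrong place.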
Let's say there are 3 OSDs: 0,1,2 and all PGs map to some permutation of those three. If you add another OSD (OSD 3), some PGs will now map to OSD 3 instead of one of the others. However, until OSD 3 is backfilled, the PG will have a temporary mapping allowing it to continue to serve I/O from the old mapping. During that time, the PG is misplaced (because it has a temporary mapping) but not degraded (since there are 3 copies).
Example:
pg 1.5: up=acting: [0,1,2]
(add osd 3)
pg 1.5: up: [0,3,1] acting: [0,1,2]
Here, [0,1,2] is a temporary mapping, so the up set is not equal to the acting set and the PG is misplaced but not degraded since [0,1,2] is still three copies.
pg 1.5: up=acting: [0,3,1]
OSD 3 is now backfilled and the temporary mapping is removed, so the PG is neither degraded nor misplaced.
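
The pg 1.5 example can be replayed with a toy model. The sketch below is a deliberate simplification that mirrors only this example: it calls a PG misplaced when the up set differs from the acting set, and degraded when the acting set has fewer OSDs than the pool size. Real Ceph tracks this per object and reports more states; the function name and the "ok" label are made up for illustration.

```python
def classify_pg(up, acting, size):
    """Toy classification mirroring the pg 1.5 example above.

    - "degraded": fewer copies in the acting set than the pool size
    - "misplaced": a temporary mapping is in effect (up != acting)
    - "ok": neither of the above (a label invented for this sketch)
    """
    states = []
    if len(acting) < size:
        states.append("degraded")
    if list(up) != list(acting):
        states.append("misplaced")
    return states or ["ok"]


if __name__ == "__main__":
    steps = [
        ("before adding osd 3", [0, 1, 2], [0, 1, 2]),
        ("osd 3 added, backfilling", [0, 3, 1], [0, 1, 2]),
        ("backfill finished", [0, 3, 1], [0, 3, 1]),
    ]
    for label, up, acting in steps:
        print(label, classify_pg(up, acting, size=3))
```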
Incomplete A PG goes into an incomplete state when there is incomplete content and peering fails, i.e., when there are no complete OSDs which are current enough to perform recovery.

A PG marked incomplete has incomplete content and has failed to peer, for example because no OSD holds a copy complete enough to recover from.
Let's say [1,2,3] is an acting OSD set and it switches to [1,4,3], then osd.1 will request a temporary acting set of [1,2,3] while backfilling 4. During this time, if 1,2,3 all go down, osd.4 will be the only one left which might not have fully backfilled. At this time, the PG will go incomplete indicating that there are no complete OSDs which are current enough to perform recovery.
Alternately, if osd.4 is not involved and the acting set is simply [1,2,3] when 1,2,3 go down, the PG would likely go stale indicating that the mons have not heard anything on that PG since the acting set changed. The reason is that there are no OSDs left to notify the new OSDs.
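
PG states are reported combined with "+", as in active+clean or active+degraded. The following Python sketch splits such a string and applies a simple health check. The set of states treated as problems is an assumption chosen from the states discussed in this section, not an official classification, and the function names are hypothetical.

```python
# States from this section that this sketch treats as needing attention.
# Assumption: the grouping is illustrative, not an official Ceph list.
PROBLEM_STATES = {"degraded", "stale", "incomplete", "backfill_too_full"}


def split_pg_state(state):
    """Split a combined state such as 'active+clean' into its parts."""
    return [part for part in state.split("+") if part]


def looks_healthy(state):
    """True when the PG is active and clean and shows no problem state."""
    parts = set(split_pg_state(state))
    return {"active", "clean"} <= parts and not (parts & PROBLEM_STATES)


if __name__ == "__main__":
    for state in ("active+clean", "active+degraded", "stale+active+clean"):
        print(state, looks_healthy(state))
```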

RBD Image Features

This setting only applies to format 2 images.

| Feature | Bit | Description |
| :-- | :-- | :-- |
| Layering | 1 | Layering enables you to use cloning. |
| Striping v2 | 2 | Striping spreads data across multiple objects. Striping helps with parallelism for sequential read/write workloads. |
| Exclusive locking | 4 | When enabled, it requires a client to get a lock on an object before making a write. |
| Object map | 8 | Block devices are thin provisioned, meaning they only store data that actually exists. Object map support helps track which objects actually exist (have data stored on a drive). Enabling object map support speeds up I/O operations for cloning, or importing and exporting a sparsely populated image. |
| Fast-diff | 16 | Fast-diff support depends on object map support and exclusive lock support. It adds another property to the object map, which makes it much faster to generate diffs between snapshots of an image and to calculate the actual data usage of a snapshot. |
| Deep-flatten | 32 | Deep-flatten makes rbd flatten work on all the snapshots of an image, in addition to the image itself. Without it, snapshots of an image will still rely on the parent, so the parent will not be delete-able until the snapshots are deleted. Deep-flatten makes a parent independent of its clones, even if they have snapshots. |
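
Because each feature is a single bit, a set of enabled features can be expressed as the sum of the bit values in the table. The Python sketch below encodes and decodes such a number using only the six features listed here; the function names are made up for this example, and any bits outside the table are reported as unknown rather than guessed.

```python
# Bit values copied from the table above.
RBD_FEATURES = {
    1: "layering",
    2: "striping v2",
    4: "exclusive locking",
    8: "object map",
    16: "fast-diff",
    32: "deep-flatten",
}
KNOWN_MASK = sum(RBD_FEATURES)  # 63: every bit listed in the table


def decode_features(mask):
    """Return the names of the features whose bits are set in `mask`."""
    names = [name for bit, name in RBD_FEATURES.items() if mask & bit]
    unknown = mask & ~KNOWN_MASK
    if unknown:
        names.append(f"unknown bits 0x{unknown:x}")
    return names


def encode_features(names):
    """Return the numeric mask for a list of feature names from the table."""
    by_name = {name: bit for bit, name in RBD_FEATURES.items()}
    return sum(by_name[name] for name in names)


if __name__ == "__main__":
    print(decode_features(61))
    print(encode_features(["layering", "deep-flatten"]))
```

For example, 61 = 1 + 4 + 8 + 16 + 32 decodes to every feature in the table except striping v2.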

MON and OSD Hardware Reference

Minimum configuration (from Red Hat)

For OSDs:

| Criteria | Minimum Recommended |
| :--: | :-- |
| Processor | 1x AMD64 or Intel 64 |
| RAM | 2 GB of RAM per daemon |
| Volume Storage | 1x storage drive per daemon |
| Journal | 1x SSD partition per daemon (optional) |
| Network | 2x 1 Gb Ethernet NICs |

For MONs:

| Criteria | Minimum Recommended |
| :--: | :-- |
| Processor | 1x AMD64 or Intel 64 |
| RAM | 1 GB of RAM per daemon |
| Disk Space | 10 GB per daemon |
| Network | 2x 1 Gb Ethernet NICs |

Big-spender configuration (from Intel)

Brace yourself!
[Image: high-end configuration recommended by Intel]

Common Ceph Benchmarking Tools

| Tool Name | Testing Scenario | Command line / GUI | OS Support | Popularity | Reference |
| :-- | :-- | :-- | :-- | :-- | :-- |
| FIO (Flexible I/O Tester) | Mainly block-level storage, e.g., SAN, DAS | Command line | Linux / Windows | High | fio github |
| IOmeter | Mainly block-level storage, e.g., SAN, DAS | GUI / Command line | Linux / Windows | High | Iometer and IOzone |
| iozone | File-level storage, e.g., NAS | GUI / Command line | Linux / Windows | High | IOzone Filesystem Benchmark |
| dd | File-level storage, e.g., NAS | Command line | Linux / Windows | High | dd over NFS testing |
| rados bench | Ceph RADOS | Command line | Linux only | Normal | Benchmark a Ceph Storage Cluster |
| cosbench | Cloud object storage service | GUI / Command line | Linux / Windows | High | COSBench - Cloud Object Storage Benchmark |