|noin||Prevents OSDs from being treated as in the cluster.||Commonly used with noout to address flapping OSDs.
|noout||Prevents OSDs from being treated as out of the cluster.||If the mon osd report timeout is exceeded and an OSD has not reported to the monitor, the OSD will get marked out. If this happens erroneously, you can set noout to prevent the OSD(s) from getting marked out while you troubleshoot the issue.
|noup||Prevents OSDs from being treated as up and running.||Commonly used with nodown to address flapping OSDs.
|nodown||Prevents OSDs from being treated as down.||Networking issues may interrupt Ceph ‘heartbeat’ processes, and an OSD may be up but still get marked down. You can set nodown to prevent OSDs from getting marked down while troubleshooting the issue.
|full||Makes a cluster appear to have reached its full_ratio, and thereby prevents write operations.||If a cluster is reaching its full_ratio, you can pre-emptively set the cluster to full and expand capacity. NOTE: Setting the cluster to full will prevent write operations.
|pause||Ceph will stop processing read and write operations, but will not affect OSD in, out, up or down statuses.||If you need to troubleshoot a running Ceph cluster without clients reading and writing data, you can set the cluster to pause to prevent client operations.
|nobackfill||Ceph will prevent new backfill operations.||If you need to take an OSD or node down temporarily (e.g., to upgrade daemons), you can set nobackfill so that Ceph will not backfill while the OSD(s) is down.
|norebalance||Ceph will prevent new rebalancing operations.||Typically set together with the nobackfill and norecover flags above during troubleshooting or maintenance, so that data does not rebalance while OSDs are down.
|norecover||Ceph will prevent new recovery operations.||If you need to replace an OSD disk and don’t want the PGs to recover to another OSD while you are hotswapping disks, you can set norecover to prevent the other OSDs from copying a new set of PGs to other OSDs.
|noscrub||Ceph will prevent new scrubbing operations.||If you want to prevent scrubbing (e.g., to reduce overhead during high loads, recovery, backfilling, rebalancing, etc.), you can set noscrub and/or nodeep-scrub to prevent the cluster from scrubbing OSDs.
|nodeep-scrub||Ceph will prevent new deep scrubbing operations.||Scrubbing can sometimes impair performance while the cluster is recovering; set this flag together with noscrub above to also suspend deep scrubbing.
|notieragent||Ceph will disable the process that is looking for cold/dirty objects to flush and evict.||If you want to stop the tier agent process from finding cold objects to flush to the backing storage tier, you may set notieragent.
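A minimal sketch of toggling these flags from an admin node (noout is just one example; any flag above can be substituted):

```bash
# Prevent OSDs from being marked out while troubleshooting
ceph osd set noout

# ...troubleshoot the flapping OSDs...

# Restore normal behavior once finished
ceph osd unset noout
```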
|allow||Precedes access settings for a daemon.|
|r||Gives the user read access. Required with monitors to retrieve the CRUSH map.|
|w||Gives the user write access to objects.|
|x||Gives the user the capability to call class methods (i.e., both read and write) and to conduct auth operations on monitors.|
|class-read||Gives the user the capability to call class read methods. Subset of x.|
|class-write||Gives the user the capability to call class write methods. Subset of x.|
|*||Gives the user read, write and execute permissions for a particular daemon/pool, and the ability to execute admin commands.|
|profile osd||Gives a user permissions to connect as an OSD to other OSDs or monitors. Conferred on OSDs to enable OSDs to handle replication heartbeat traffic and status reporting.|
|profile bootstrap-osd||Gives a user permissions to bootstrap an OSD. Conferred on deployment tools such as ceph-disk, ceph-deploy, etc. so that they have permissions to add keys, etc. when bootstrapping an OSD.|
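For example, the capabilities above combine when creating a keyring. A minimal sketch; `client.example` and `mypool` are placeholder names:

```bash
# Create a user that can read the CRUSH map from monitors
# and read/write objects in a single pool
ceph auth get-or-create client.example \
    mon 'allow r' \
    osd 'allow rw pool=mypool'

# Verify the capabilities that were granted
ceph auth get client.example
```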
|Creating||When you create a pool, it will create the number of placement groups you specified. Ceph will echo creating when it is creating one or more placement groups. Once they are created, the OSDs that are part of a placement group’s Acting Set will peer. Once peering is complete, the placement group status should be active+clean, which means a Ceph client can begin writing to the placement group.
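As a sketch of the above (the pool name and PG count are placeholders):

```bash
# Create a pool with 128 placement groups; its PGs show 'creating'
# until creation and peering complete
ceph osd pool create mypool 128

# Watch the PGs progress toward active+clean
ceph -w
```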
|Peering||When Ceph is Peering a placement group, Ceph is bringing the OSDs that store the replicas of the placement group into agreement about the state of the objects and metadata in the placement group. When Ceph completes peering, this means that the OSDs that store the placement group agree about the current state of the placement group. However, completion of the peering process does NOT mean that each replica has the latest contents.
Ceph will NOT acknowledge a write operation to a client until all OSDs of the acting set persist the write operation. This practice ensures that at least one member of the acting set will have a record of every acknowledged write operation since the last successful peering operation.
With an accurate record of each acknowledged write operation, Ceph can construct and disseminate a new authoritative history of the placement group—a complete, and fully ordered set of operations that, if performed, would bring an OSD’s copy of a placement group up to date.
|Active||Once Ceph completes the peering process, a placement group may become active. The active state means that the data in the placement group is generally available in the primary placement group and the replicas for read and write operations.
|Clean||When a placement group is in the clean state, the primary OSD and the replica OSDs have successfully peered and there are no stray replicas for the placement group. Ceph replicated all objects in the placement group the correct number of times.
|Degraded||When a client writes an object to the primary OSD, the primary OSD is responsible for writing the replicas to the replica OSDs. After the primary OSD writes the object to storage, the placement group will remain in a degraded state until the primary OSD has received an acknowledgement from the replica OSDs that Ceph created the replica objects successfully.
The reason a placement group can be active+degraded is that an OSD may be active even though it doesn’t hold all of the objects yet. If an OSD goes down, Ceph marks each placement group assigned to the OSD as degraded. The OSDs must peer again when the OSD comes back online. However, a client can still write a new object to a degraded placement group if it is active.
If an OSD is down and the degraded condition persists, Ceph may mark the down OSD as out of the cluster and remap the data from the down OSD to another OSD. The time between being marked down and being marked out is controlled by mon osd down out interval, which is set to 300 seconds by default.
A placement group can also be degraded, because Ceph cannot find one or more objects that Ceph thinks should be in the placement group. While you cannot read or write to unfound objects, you can still access all of the other objects in the degraded placement group.
Let’s say there are 9 OSDs with size = 3 (three copies of objects). If OSD number 9 goes down, the PGs assigned to OSD 9 go into a degraded state. If OSD 9 doesn’t recover, it goes out of the cluster and the cluster rebalances. In that scenario, the PGs are degraded and then recover to an active state.
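To see which placement groups are degraded and which OSDs are down, a minimal check might look like this:

```bash
# Summarize cluster health, including degraded PG details
ceph health detail

# Show the OSD hierarchy and which OSDs are up or down
ceph osd tree
```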
|Recovering||Ceph was designed for fault-tolerance at a scale where hardware and software problems are ongoing. When an OSD goes down, its contents may fall behind the current state of other replicas in the placement groups. When the OSD is back up, the contents of the placement groups must be updated to reflect the current state. During that time period, the OSD may reflect a recovering state.
Recovery isn’t always trivial, because a hardware failure might cause a cascading failure of multiple OSDs. For example, a network switch for a rack or cabinet may fail, which can cause the OSDs of a number of host machines to fall behind the current state of the cluster. Each one of the OSDs must recover once the fault is resolved.
Ceph provides a number of settings to balance the resource contention between new service requests and the need to recover data objects and restore the placement groups to the current state. The osd recovery delay start setting allows an OSD to restart, re-peer and even process some replay requests before starting the recovery process. The osd recovery threads setting limits the number of threads for the recovery process (1 thread by default). The osd recovery thread timeout sets a thread timeout, because multiple OSDs may fail, restart and re-peer at staggered rates. The osd recovery max active setting limits the number of recovery requests an OSD will entertain simultaneously to prevent the OSD from failing to serve. The osd recovery max chunk setting limits the size of the recovered data chunks to prevent network congestion.
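These recovery settings can also be changed at runtime with injectargs; the values below are illustrative throttles, not recommendations:

```bash
# Reduce the number of simultaneous recovery operations per OSD
ceph tell osd.* injectargs '--osd-recovery-max-active 1'

# Delay the start of recovery after an OSD restarts and re-peers
ceph tell osd.* injectargs '--osd-recovery-delay-start 10'
```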
|Backfilling||When a new OSD joins the cluster, CRUSH will reassign placement groups from OSDs in the cluster to the newly added OSD. Forcing the new OSD to accept the reassigned placement groups immediately can put excessive load on it. Backfilling the OSD with the placement groups allows this process to begin in the background. Once backfilling is complete, the new OSD will begin serving requests when it is ready.
During the backfill operations, you may see one of several states: backfill_wait indicates that a backfill operation is pending but isn’t underway yet; backfill indicates that a backfill operation is underway; and backfill_too_full indicates that a backfill operation was requested but couldn’t be completed due to insufficient storage capacity. When a placement group can’t be backfilled, it may be considered incomplete.
Ceph provides a number of settings to manage the load spike associated with reassigning placement groups to an OSD (especially a new OSD). By default, osd_max_backfills sets the maximum number of concurrent backfills to or from an OSD to 10. The osd backfill full ratio enables an OSD to refuse a backfill request if the OSD is approaching its full ratio (85%, by default). If an OSD refuses a backfill request, the osd backfill retry interval enables an OSD to retry the request (after 10 seconds, by default). OSDs can also set osd backfill scan min and osd backfill scan max to manage scan intervals (64 and 512, by default).
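As with the recovery settings, the backfill settings can be adjusted at runtime; the value below is only an example:

```bash
# Limit the number of concurrent backfills to or from each OSD
ceph tell osd.* injectargs '--osd-max-backfills 2'
```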
|Remapped||When the Acting Set that services a placement group changes, the data migrates from the old acting set to the new acting set. It may take some time for a new primary OSD to service requests, so it may ask the old primary to continue servicing requests until the placement group migration is complete. Once data migration completes, the mapping uses the primary OSD of the new acting set.
|Stale||While Ceph uses heartbeats to ensure that hosts and daemons are running, the ceph-osd daemons may also get into a stuck state where they aren’t reporting statistics in a timely manner (e.g., a temporary network fault). By default, OSD daemons report their placement group, up thru, boot and failure statistics every half second (i.e., 0.5), which is more frequent than the heartbeat thresholds. If the Primary OSD of a placement group’s acting set fails to report to the monitor or if other OSDs have reported the primary OSD down, the monitors will mark the placement group stale.
When you start your cluster, it is common to see the stale state until the peering process completes. After your cluster has been running for awhile, seeing placement groups in the stale state indicates that the primary OSD for those placement groups is down or not reporting placement group statistics to the monitor.
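Placement groups stuck in the stale state can be listed directly:

```bash
# List PGs that have been stale longer than the stuck threshold
ceph pg dump_stuck stale
```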
|Misplaced||There are some temporary backfilling scenarios where a PG is mapped temporarily to an OSD. When that temporary situation no longer applies, the PGs might still reside in the temporary location rather than the proper one; in that case, they are said to be misplaced. The correct number of copies exists, but one or more of those copies is in the wrong place.
Let’s say there are 3 OSDs: 0, 1, 2, and all PGs map to some permutation of those three. If you add another OSD (OSD 3), some PGs will now map to OSD 3 instead of one of the others. However, until OSD 3 is backfilled, each such PG will have a temporary mapping that allows it to continue to serve I/O from the old mapping. During that time, the PG is misplaced (because it has a temporary mapping) but not degraded (since there are still 3 copies).
pg 1.5: up=acting: [0,1,2]
<add osd 3>
pg 1.5: up: [0,3,1] acting: [0,1,2]
Here, [0,1,2] is a temporary mapping, so the up set is not equal to the acting set and the PG is misplaced but not degraded since [0,1,2] is still three copies.
pg 1.5: up=acting: [0,3,1]
OSD 3 is now backfilled and the temporary mapping has been removed; the PG is neither degraded nor misplaced.
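The up and acting sets of a placement group can be compared directly; pg 1.5 continues the example above:

```bash
# Show the current up set and acting set for one placement group
ceph pg map 1.5
```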
|Incomplete||A PG goes into an incomplete state when there is incomplete content and peering fails, i.e., when there are no complete OSDs that are current enough to perform recovery.
Let’s say [1,2,3] is an acting OSD set and it switches to [1,4,3]; osd.1 will request a temporary acting set of [1,2,3] while backfilling osd.4. During this time, if osds 1, 2, and 3 all go down, osd.4 will be the only one left, and it might not have fully backfilled. At this point, the PG will go incomplete, indicating that there are no complete OSDs which are current enough to perform recovery.
Alternately, if osd.4 is not involved and the acting set is simply [1,2,3] when osds 1, 2, and 3 go down, the PG would likely go stale, indicating that the mons have not heard anything about that PG since the acting set changed; there are no OSDs left to notify the new OSDs.
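A per-PG query is often the quickest way to see why a PG is incomplete or otherwise stuck; the PG id here is a placeholder:

```bash
# Dump the peering history and recovery state of one placement group
ceph pg 1.5 query
```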
These feature settings apply only to format 2 images.
|Layering||1||Layering enables you to use cloning.|
|Striping v2||2||Striping spreads data across multiple objects. Striping helps with parallelism for sequential read/write workloads.|
|Exclusive locking||4||When enabled, it requires a client to get a lock on an object before making a write.|
|Object map||8||Block devices are thin provisioned—meaning, they only store data that actually exists. Object map support helps track which objects actually exist (have data stored on a drive). Enabling object map support speeds up I/O operations for cloning, or importing and exporting a sparsely populated image.|
|Fast-diff||16||Fast-diff support depends on object map support and exclusive lock support. It adds another property to the object map, which makes generating diffs between snapshots of an image, and computing the actual data usage of a snapshot, much faster.|
|Deep-flatten||32||Deep-flatten makes rbd flatten work on all the snapshots of an image, in addition to the image itself. Without it, snapshots of an image will still rely on the parent, so the parent will not be deletable until the snapshots are deleted. Deep-flatten makes a parent independent of its clones, even if they have snapshots.|
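Features can be enabled by name when an image is created; the sum of their values (e.g., 1 + 4 + 8 + 16 = 29) can likewise be used with the rbd_default_features setting. A sketch with placeholder pool and image names:

```bash
# Create a format 2 image with layering, exclusive locking,
# object map, and fast-diff enabled
rbd create mypool/myimage --size 1024 --image-format 2 \
    --image-feature layering --image-feature exclusive-lock \
    --image-feature object-map --image-feature fast-diff
```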
| Criteria | Minimum Recommended |
| :-- | :-- |
| Processor | 1x AMD64 or Intel 64 |
| RAM | 2 GB of RAM per daemon |
| Volume Storage | 1x storage drive per daemon |
| Journal | 1x SSD partition per daemon (optional) |
| Network | 2x 1 Gb Ethernet NICs |
| Criteria | Minimum Recommended |
| :-- | :-- |
| Processor | 1x AMD64 or Intel 64 |
| RAM | 1 GB of RAM per daemon |
| Disk Space | 10 GB per daemon |
| Network | 2x 1 Gb Ethernet NICs |
|Tool Name||Testing Scenario||Command line / GUI||OS Support||Popularity||Reference|
|FIO (Flexible I/O Tester)||Mainly block-level storage (e.g., SAN, DAS)||Command line||Linux / Windows||High||fio github|
|IOmeter||Mainly block-level storage (e.g., SAN, DAS)||GUI / Command line||Linux / Windows||High||Iometer and IOzone|
|IOzone||File-level storage (e.g., NAS)||GUI / Command line||Linux / Windows||High||IOzone Filesystem Benchmark|
|dd||File-level storage (e.g., NAS)||Command line||Linux / Windows||High||dd over NFS testing|
|rados bench||Ceph RADOS||Command line||Linux only||Normal||Benchmark a Ceph Storage Cluster|
|COSBench||Cloud object storage services||GUI / Command line||Linux / Windows||High||COSBench - Cloud Object Storage Benchmark|
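As a quick illustration of the rados bench row above, a RADOS-level benchmark against a placeholder pool:

```bash
# Write objects for 60 seconds with the default 16 concurrent ops,
# keeping the objects so they can be read back afterwards
rados bench -p testpool 60 write --no-cleanup

# Sequential read test against the objects written above
rados bench -p testpool 60 seq

# Remove the benchmark objects when done
rados -p testpool cleanup
```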