Setup
Hardware Configuration
We have our OpenStack hardware to play with, which for Ceph means:
- nodes 1 & 2 each have an NVMe boot disk, a 1TB SSD and a 1TB HDD
- node 3 has an NVMe boot disk and a couple of 4TB HDDs
We also have a few collected PCs lying about on which we can run the monitor VMs.
The NVMe disks have 64GB boot partitions and the rest is split into journal partitions for the OSD. The SSDs and HDDs are dedicated to Ceph.
Warning
Using a small (128GB) NVMe drive as a journal might not have been the smartest choice as you can read in NVMe Burnout. Essentially, after a few months of VMs chuntering away the system had written 23TB of data to the 64GB of journal space which was "over" the manufacturer's lifetime expectancy. It died.
Setup
We're following the manual deployment docs. They have an admin node but we'll use one of the monitors.
This will probably seem overly complicated but at heart there are a few things going on:
- we need to bootstrap the live cluster map
- Ceph uses some very simple ACLs which we need to keep in mind
- keep your ducks in a row. We'll use some variables to help show what's going on
- check whether things are owned by ceph or not -- you can accidentally leave things owned by root, which doesn't work
Networking
We'll use 192.168.8.0/24 for the Ceph public network -- that's the one the clients use to talk to the monitors and storage nodes.
We'll also use 192.168.9.0/24 for the Ceph cluster network. This is entirely optional and is used by the storage nodes to copy blobs between themselves. If it isn't set up they'll use the public network.
Software Install
Ceph and its dependencies are covered by the CentOS OpenStack repos:
yum install centos-release-openstack-stein
yum install ceph
Note
Do the above (the OpenStack SIG release) rather than yum install centos-release-ceph-nautilus (the Ceph Storage SIG release) because the former covers the dependencies.
This is the same on all Ceph nodes (storage and monitors).
Bootstrap
We'll do this on our initial monitor node which is, for some reason, mon3.
Let's set up a few variables for the rest of the setup. We'll use the short hostname where applicable.
One thing you will want to consider is the number of placement groups, 128 in the example below. The number you need depends on the number of pools (of data), their replication number and the number of OSDs. There is a calculator.
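As a rough sketch of what the calculator does (assuming, hypothetically, 6 OSDs, 3-way replication and the usual target of about 100 placement groups per OSD):

# hypothetical back-of-the-envelope version of the PG calculator
OSDS=6
REPLICAS=3
echo $(( OSDS * 100 / REPLICAS ))   # 200: round to a power of two (128 or 256) and share across your pools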
HOSTNAME=$(uname -n)
HOST=${HOSTNAME%%.*}
MON_IP=192.168.8.3
CN=ceph
FSID=$(uuidgen)
Note
The cluster's name, CN, defaults to ceph; however, we'll use the variable to show where it is being used.
ceph.conf
cat <<EOF > /etc/ceph/${CN}.conf
[global]
fsid = ${FSID}
mon initial members = ${HOST}
mon host = ${MON_IP}
public network = 192.168.8.0/24
cluster network = 192.168.9.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 3
osd pool default min size = 2
osd pool default pg num = 128
osd pool default pgp num = 128
# 1 normally, 0 for a single machine hack
osd crush chooseleaf type = 1

[mon.${HOST}]
mon allow pool delete = true
EOF

chown ceph:ceph /etc/ceph/${CN}.conf
Replication Ratio
Two parameters govern the replication ratio: osd pool default size is the actual replication ratio and osd pool default min size is what we're prepared to live with (temporarily). The cluster will be degraded if a pool loses an OSD and drops below the default size (3) but won't panic. I think there is a tacit assumption that the OSD has gone away temporarily (the host has been rebooted, say) and will be back soon enough.
Of course, the state of the OSD is verified when it does come back -- Ceph is reliable!
If the OSD doesn't come back for a while then Ceph will make some attempt to move data to an alternate OSD (if one is available).
If you do drop below the minimum size then Ceph will mark the pool as read-only. This is probably fatal if that pool is the backing store for your VMs!
If, as in our minimum viable product scenario, two of your three storage nodes have rebooted then there's not much Ceph can do anyway, and when the two come back there'll be plenty of checking going on!
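For what it's worth, you can inspect (and tweak) these per pool once the pools exist, for example for the cinder-volumes pool we create later:

ceph osd pool get cinder-volumes size
ceph osd pool get cinder-volumes min_size
ceph osd pool set cinder-volumes min_size 2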
Monitor
Again, this looks complicated but it's mostly about creating ACLs for various purposes.
Keyrings bind a name (-n *name*) with a set of capabilities (--cap *thing* *ACL*).
Create a temporary keyring for the monitors to use -- the monitors only need to see the, er, monitors:
sudo -u ceph ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
Note
The name mon. is used to retrieve the monitor key when we add more monitors.
Create a permanent keyring for admins (us!) to use -- we need access to everything:
sudo ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' --cap mgr 'allow *'
Create a permanent keyring for bootstrapping OSDs:
sudo ceph-authtool --create-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring --gen-key -n client.bootstrap-osd --cap mon 'profile bootstrap-osd'
Add the second two permanent keyrings (admins, bootstrap OSDs) to the first temporary keyring (monitors):
sudo ceph-authtool /tmp/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
sudo ceph-authtool /tmp/ceph.mon.keyring --import-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
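If you want to check what ended up in the combined keyring, ceph-authtool can list it:

sudo ceph-authtool -l /tmp/ceph.mon.keyring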
Create the initial monitor map with this host in it using the temporary keyring:
monmaptool --create --add ${HOST} ${MON_IP} --fsid ${FSID} /tmp/monmap
Create the working directory for the monitor daemon on this host:
sudo -u ceph mkdir /var/lib/ceph/mon/${CN}-${HOST}
Start the monitor with the initial monitor map:
sudo -u ceph ceph-mon --mkfs -i ${HOST} --monmap /tmp/monmap --keyring /tmp/ceph.mon.keyring
Note
The monitor is now running. We've just run the above command in the foreground and, given that nothing else is running in our cluster, we can ^C it and restart it with systemd in two ticks.
The instructions say "create the done file" so, er, we do it:
sudo -u ceph touch /var/lib/ceph/mon/${CN}-${HOST}/done
Finally, kick off the monitor daemon on this host:
sudo systemctl enable ceph-mon@${HOST}
sudo systemctl start ceph-mon@${HOST}
That's it, our monitor daemon is now live. Remember, the monitor map is live. That initial monitor map we created remains just that, a point in time instance of the map. If we rebooted now the monitor would revert to that old map. Which might not be ideal.
How do we know it is running?
# ceph -s
You should get something back even if it reports some errors about the cluster state. We only have a monitor node so it shouldn't be too surprising that the cluster is in ill-health.
You can also run a long term variant of the above:
ceph -w
which prints out the same initial status then hangs about printing any state changes. Most of the time nothing much happens other than Ceph complaining that time is out of sync (for reasons unknown as these systems are generally sub-millisecond in sync). However, if you restart an OSD you'll get some more interesting messages.
Firewall
Just to tidy things up, open up the firewall to allow other nodes to talk to us:
# port 6789
firewall-cmd --add-service=ceph-mon --permanent
In addition (beyond the official docs) I believe you need to open up the general Ceph ports (6800-7300) as the monitors maintain connections to all the storage nodes:
firewall-cmd --add-service=ceph --permanent
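Note that --permanent on its own only updates the saved configuration; to affect the running firewall either repeat the commands without --permanent or reload:

firewall-cmd --reload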
Manager
You can (should?) run a manager on each monitor node. The manager is used to collate metrics so it's not required but interesting.
The instructions kick off the manager daemon interactively:
(
  umask 007
  mkdir /var/lib/ceph/mgr/${CN}-${HOST}
  umask 077
  ceph auth get-or-create mgr.${HOST} mon 'allow profile mgr' osd 'allow *' mds 'allow *' > /var/lib/ceph/mgr/${CN}-${HOST}/keyring
  chown -R ceph:ceph /var/lib/ceph/mgr/${CN}-${HOST}
  ceph-mgr -i ${HOST}
)
So you might want to prep the manager daemon to start automatically for next time:
systemctl enable ceph-mgr@${HOST}
Firewall
# mgr!
firewall-cmd --add-port=6810/tcp --permanent
Dashboard
The Ceph Manager supports a decent Web UI dashboard:
ceph mgr module enable dashboard
(you only need to do this once per cluster)
and open the firewall on each manager node:
# mgr dashboard
firewall-cmd --add-port=8443/tcp --permanent
One nice aspect is that you can bookmark any of the monitor/manager nodes in your browser and you will be automatically redirected to the current live one.
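Depending on your Nautilus point release the dashboard may also insist on an SSL certificate and an admin user before it will let you in. Something like the following (command names from the Nautilus dashboard docs; check ceph dashboard -h for the exact form on your version, and the user/password here are just placeholders):

# once per cluster
ceph dashboard create-self-signed-cert
ceph dashboard ac-user-create admin SomePassword administrator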
Other Nodes
Once we have one node up and running and the monitor is live we can start adding more.
Installing the software is the same. You then need to bootstrap the new node from the old. We're copying the vital bootstrap config. So, from the original node targeting our second node, ceph2:
cd /etc/ceph
scp ceph.client.admin.keyring ${CN}.conf ceph2:$PWD
cd /var/lib/ceph/bootstrap-osd
scp ceph.keyring ceph2:$PWD
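A quick sanity check that the new node can talk to the cluster with the copied config and keyring:

ssh ceph2 ceph -s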
Other Monitors
Other monitors are a little more involved as the cluster map is live. We need to get a copy of the live map, add our new monitor and then kick off the daemon.
Having installed the software and copied the config as above:
Log into the new monitor and prep the working directory:
ssh mon2
HOSTNAME=$(uname -n)
HOST=${HOSTNAME%%.*}
mkdir /var/lib/ceph/mon/ceph-${HOST}
Collect the mon. key and the current cluster map:
ceph auth get mon. -o key
ceph mon getmap -o map
The mon. key is the one shared by all monitors -- the name we used when bootstrapping the first monitor.
Kick off the new monitor using the key and map we just saved:
ceph-mon -i ${HOST} --mkfs --monmap map --keyring key
Note
The cluster map is live so the act of this new monitor starting (with the right credentials and latest map) means the cluster immediately has an extra monitor. Nothing more to do here.
Ready this system to start the monitor on reboot:
chown -R ceph:ceph /var/lib/ceph/mon/ceph-${HOST}
systemctl enable ceph-mon@${HOST}
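You can confirm the live monitor map now includes the new monitor:

ceph mon stat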
OSDs
Adding storage with ceph-volume is very easy. We're complicating things a bit by using an NVMe partition as the journal.
Actually, we've created an LVM volume group on the NVMe partition and we'll use a logical volume in the volume group for the journal.
If you recall we want something like 4% of the OSD for the journal. Our OSDs are 1TB (and 4TB) so we need something like 40GB (and 160GB) of space per OSD. Here I've used 80GB per OSD on this example host for reasons I don't recall:
# pvs /dev/nvme0n1p4
  PV             VG      Fmt  Attr PSize   PFree
  /dev/nvme0n1p4 ceph-db lvm2 a--  173.27g 13.27g
# lvs ceph-db
  LV      VG      Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  db-sda1 ceph-db -wi-ao---- 80.00g
  db-sdb1 ceph-db -wi-ao---- 80.00g
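The ceph-db volume group isn't shown being created anywhere above; it would have been something like this (a hypothetical reconstruction, with names chosen to match the listing):

pvcreate /dev/nvme0n1p4
vgcreate ceph-db /dev/nvme0n1p4
lvcreate -L 80G -n db-sda1 ceph-db
lvcreate -L 80G -n db-sdb1 ceph-db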
Creating the OSD is simple, here for partition 1 on sdb:
ceph-volume lvm create --data /dev/sdb1 --block.db ceph-db/db-sdb1
The OSD will be given an ID, a small integer starting at 0.
ceph osd tree
At any point you can run ceph osd tree to see your accumulated set of OSDs:
# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME    STATUS REWEIGHT PRI-AFF
 -1       9.70517 root default
 -2       1.54099     host c1c
  3   hdd 0.77049         osd.3    up  1.00000 1.00000
  2   ssd 0.77049         osd.2    up  1.00000 1.00000
-10       1.81839     host c2c
  4   hdd 0.90919         osd.4    up  1.00000 1.00000
  5   ssd 0.90919         osd.5    up  1.00000 1.00000
 -5       6.34579     host c3c
  0   hdd 3.17290         osd.0    up  1.00000 1.00000
  1   hdd 3.17290         osd.1    up  1.00000 1.00000
Here I don't have the full 1TB/4TB (essentially the weight column) as I've chewed off disk partitions for an OpenStack Swift implementation.
OSD Class
Note that Ceph has recognised SSDs and HDDs and designated the appropriate class. It doesn't differentiate NVMes (at the time of writing) and marks them as SSDs. You can revisit this if you desire with something like:
ceph osd crush rm-device-class ${osd_id}
ceph osd crush set-device-class nvme ${osd_id}
Does the class do anything? Well, not immediately (or at least not immediately obviously). The device class is used (or at least available to be used) in placement decision making. We can force it to be used for particular pools.
OSD removal
You can "cleanly" remove an OSD if you decide to repurpose your disk. Suppose we want to remove sdb1 then first figure out which OSD is using sdb1:
ceph-volume lvm list
Now you can mark the corresponding osd.N as down:
systemctl stop ceph-osd@N
and now actually zap the disk:
ceph-volume lvm zap /dev/sdb1
which may get so far and then fail: it hasn't figured out the volume groups. You might want to make some appropriate decision about who's using what, then manually remove the volume group (and contained logical volume) and re-run ceph-volume lvm zap. Something like:
pvs /dev/sdb1
to figure out the volume group used by sdb1 then:
vgremove -y $UUID
and finally repeat:
ceph-volume lvm zap /dev/sdb1
What we're aiming for is to have the OSD ID freed up. The ultimate test of the success of that is that a subsequent ceph-volume lvm create will re-use that freed up OSD ID.
If it didn't then you may need to remove references to the OSD ID from various places:
ceph osd rm osd.$ID
ceph osd crush rm osd.$ID
ceph auth rm osd.$ID
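More recent releases also have a single command that rolls those three up (with the OSD stopped first):

ceph osd purge $ID --yes-i-really-mean-it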
OSD Usage
You can get a view of how your OSDs are being used, something like:
# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META     AVAIL   %USE  VAR  PGS STATUS
 3   hdd 0.77049  1.00000 789 GiB 375 GiB 294 GiB 100 KiB 1024 MiB 414 GiB 47.55 2.78 102     up
 2   ssd 0.77049  1.00000 789 GiB 130 GiB  49 GiB  20 KiB 1024 MiB 659 GiB 16.52 0.97  90     up
 4   hdd 0.90919  1.00000 931 GiB 289 GiB 288 GiB  28 KiB  1.2 GiB 642 GiB 31.06 1.82  97     up
 5   ssd 0.90919  1.00000 931 GiB  57 GiB  56 GiB  24 KiB 1024 MiB 874 GiB  6.07 0.36  95     up
 0   hdd 3.17290  1.00000 3.2 TiB 406 GiB 155 GiB 176 KiB 1024 MiB 2.8 TiB 12.50 0.73  90     up
 1   hdd 3.17290  1.00000 3.2 TiB 440 GiB 189 GiB  96 KiB 1024 MiB 2.7 TiB 13.53 0.79 102     up
                    TOTAL 9.7 TiB 1.7 TiB 1.0 TiB 445 KiB  6.2 GiB 8.0 TiB 17.08
Although I'm not sure you can really tell much from that as a user. Notice that the larger disks (IDs 0 and 1) don't necessarily have much more data on them. We're back to that balancing across failure points (hosts) issue.
Pools
Our manual installation of Ceph nautilus does not automatically create a pool called rbd which is a bit unfortunate as several commands, notably the rbd command itself, use it as the fallback/default pool if you don't explicitly pass one.
We're targeting OpenStack which will have pools something like: glance-images, cinder-volumes and nova-vms.
When we create a pool we indicate the number of placement groups the pool should use. Wait, we indicate the number? We don't know what these placement groups are or what they mean or anything! That's true but we still have to do it!
Ceph now starts to get a bit unreasonable. ceph -s may complain about too few PGs per OSD (X < min Y), for example 12 < min 30, for whatever numbers X and Y it happens to report.
We made an effort in Bootstrap to use the calculator to figure out the number of placement groups we need. The act of creating pools will under- or over-use the number of placement groups we defined in the monitor setup and Ceph will whinge. Grr! If this was the first of several pools to be added then add the other pools to increase the total number of placement groups and therefore the number of placement groups per OSD!
If you still get some whinging then you may require some judicious modification of the number of placement groups to appease it.
Leaving you to fiddle with parameters, here for the nova-vms pool:
ceph osd pool set nova-vms pg_num 512
ceph osd pool set nova-vms pgp_num 512
Pool Creation
Pool creation is easy enough, barring this unclear decision about placement groups:
ceph osd pool create glance-images 64
where 64 is the number of placement groups to use. If you decide you're unhappy with something about your pool you can fiddle with its parameters:
ceph osd pool [get|set] cinder-volumes parameter [...]
If you omit the parameter you'll get a list of possible parameters.
At this point we can do a simple list of the pools:
# ceph osd lspools
1 glance-images
2 cinder-volumes
3 nova-vms
although a rados df is more illuminating about the overall state of play (rados is a command for interacting with a Ceph object storage cluster):
# rados df
POOL_NAME      USED    OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS    RD      WR_OPS    WR      USED COMPR UNDER COMPR
cinder-volumes  13 GiB    1134      0   3402                  0       0        0  55083257  49 GiB   3440171 119 GiB        0 B         0 B
glance-images  706 GiB   30019      0  90057                  0       0        0  54244420 230 GiB    238858 329 GiB        0 B         0 B
nova-vms       315 GiB   15452      0  46356                  0       0        0 384551762 676 GiB 385853274  13 TiB        0 B         0 B

total_objects    46605
total_used       1.7 TiB
total_avail      8.0 TiB
total_space      9.7 TiB
Remember that the usage is going to reflect your replication ratio.
Pool Destruction
Getting rid of pools isn't quite so easy. We need to be allowed to do it and we need to express our determination to do so.
Firstly, we need to be allowed to delete a pool at all. You'll need to edit /etc/ceph/ceph.conf:
[mon.mon3]
mon allow pool delete = true
and then restart the daemon:
systemctl restart ceph-mon@mon3
The actual delete command requires extra effort too:
# ceph osd pool delete foo
Error EPERM: WARNING: this will *PERMANENTLY DESTROY* all data stored in pool foo.
If you are *ABSOLUTELY CERTAIN* that is what you want, pass the pool name *twice*,
followed by --yes-i-really-really-mean-it.
OK:
# ceph osd pool delete foo foo --yes-i-really-really-mean-it
pool 'foo' removed
which is trivially scripted.
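Something like the following, for a couple of hypothetical throwaway pools:

for pool in foo bar; do
    ceph osd pool delete ${pool} ${pool} --yes-i-really-really-mean-it
done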
OSD CRUSH Rule
We can create specialised rules to use, say, only specific classes of OSD. Suppose we want to have a "fast" pool that only uses SSDs.
The key command is to create a rule that uses a specific class:
ceph osd crush rule create-replicated *rule-name* default host *osd-class*
Note
default refers to the root of the OSD tree and host refers to the differentiator for replication. Here we say replicas should appear on different hosts.
An example for creating a CRUSH rule using SSDs only:
ceph osd crush rule create-replicated fast-ssd default host ssd
If we wanted to create a pool up front using the specific CRUSH rule then we need to supply the number of placement groups for placement (pgp_num -- use the same value as for pg_num) as well as the new rule name:
ceph osd pool create cinder-volumes-fast 512 512 fast-ssd
Alternatively, you can direct a pool to use the new CRUSH rule after it is up and running:
ceph osd pool set cinder-volumes crush_rule fast-ssd
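To see what rules exist and which one a pool is actually using:

ceph osd crush rule ls
ceph osd pool get cinder-volumes crush_rule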