Using Ceph
Our main use of Ceph is detailed in OpenStack Integration, next, but we can do some other things first.
Block Devices
We can create RADOS Block Devices with the rbd command:
rbd create na1 --size 1024
will create a 1 GiB block device in the default rbd pool -- which we don't have (shakes fist at assumed default values).
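You can check which pools do exist (no rbd among them here) with:

ceph osd lspools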
So, for the sake of demonstration, let's create a foo pool and our na1 device in it:
ceph osd pool create foo 64
rbd create foo/na1 --size 1024
Which we can see by listing the pool contents:
# rbd ls -l foo
NAME SIZE  PARENT FMT PROT LOCK
na1  1 GiB        2
Now we can map that to get a device name. This is a bit like using losetup (or lofiadm or ...) to turn a .iso file into a mountable device:
# rbd map foo/na1
rbd: sysfs write failed
RBD image feature set mismatch. You can disable features unsupported by the kernel with "rbd feature disable foo/na1 object-map fast-diff deep-flatten".
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (6) No such device or address
Bah! This kernel, 3.10.0-1062.9.1.el7.x86_64, doesn't support those features. Let's do what it says and try again:
# rbd feature disable foo/na1 object-map fast-diff deep-flatten
# rbd map foo/na1
/dev/rbd0
If we forgot what the mapping was:
# rbd device ls
id pool namespace image snap device
0  foo            na1   -    /dev/rbd0
Cool, we have a device we can manipulate in the usual way:
# mkfs.xfs /dev/rbd0
meta-data=/dev/rbd0              isize=512    agcount=8, agsize=32768 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=262144, imaxpct=25
         =                       sunit=1024   swidth=1024 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
# mount /dev/rbd0 /mnt
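When we're finished, the reverse steps release the device:

# umount /mnt
# rbd unmap /dev/rbd0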
Filesystems
We can create a single mountable filesystem, for which there is a three-step process: run a metadata server, create a user, then create (and mount) the filesystem itself.
We'll just repeat that: you can create one filesystem. Ceph doesn't seem designed for filesystem usage.
Metadata Server
This is the server that manages the filesystem's metadata (clients actually mount via a monitor). Should it run on a monitor node, a storage node or a dedicated metadata server node?
We're not going to make heavy use of the filesystem so we'll use a monitor node.
Continuing with the variables from setup:
Note
The MDS_ID must not start with a number.
MDS_ID=${HOST}

mkdir /var/lib/ceph/mds/${CN}-${MDS_ID}

cat <<EOF >> /etc/ceph/ceph.conf
[mds.${MDS_ID}]
host = ${MON_IP}
EOF

(
  umask 077
  ceph auth get-or-create mds.${MDS_ID} mon 'profile mds' mgr 'profile mds' mds 'allow *' osd 'allow *' > /var/lib/ceph/mds/${CN}-${MDS_ID}/keyring
  chown -R ceph:ceph /var/lib/ceph/mds/${CN}-${MDS_ID}
)

systemctl enable ceph-mds@${MDS_ID}
systemctl start ceph-mds@${MDS_ID}
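At this point the MDS should have registered with the cluster; with no filesystem yet it will sit in standby, something along these lines:

# ceph mds stat
 1 up:standby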
Create a User
We can get a list of the current users with:
ceph auth ls
With our NFV hats on we'll create a user called client.nfv who'll have permission to access the nfv_data and nfv_metadata pools:
(
  umask 077
  ceph auth get-or-create client.nfv mon 'allow r' mds 'allow r' osd 'allow rw pool nfv_data,allow rw pool nfv_metadata' -o /etc/ceph/nfv.keyring
)
If you mess things up, check and repair with:
ceph auth get client.nfv
ceph auth caps client.nfv mon 'allow r' mds 'allow r' osd 'allow pool nfv_data rw, allow pool nfv_metadata rw'
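ceph auth get shows what is currently stored; expect something like this (key elided):

# ceph auth get client.nfv
exported keyring for client.nfv
[client.nfv]
        key = AQ...
        caps mds = "allow r"
        caps mon = "allow r"
        caps osd = "allow rw pool nfv_data,allow rw pool nfv_metadata"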
Save the key (not the keyring) to be used when mounting the filesystem:
(
  umask 077
  ceph auth print-key client.nfv > /etc/ceph/nfv.key
)
Create the Filesystem
A Ceph filesystem comprises both a data and a metadata component which get bound together:
# FS=nfv
# ceph osd pool create ${FS}_data 90
pool 'nfv_data' created
# ceph osd pool create ${FS}_metadata 90
pool 'nfv_metadata' created
# ceph fs new ${FS} ${FS}_metadata ${FS}_data
new fs with metadata pool 6 and data pool 5
Check we're running:
# ceph mds stat
nfv:1 {0=ceph3=up:active}
Mount the Filesystem
# mount -t ceph ${MON_IP}:6789:/ /mnt -o name=nfv,secretfile=/etc/ceph/nfv.key
mount error 1 = Operation not permitted
Doh! I forgot to save the key into /etc/ceph/nfv.key as instructed above. Do that and all is well:
# df -h | grep mnt
192.168.8.3:6789:/  2.0T     0  2.0T   0% /mnt
Notice a few things about that:
- the new filesystem type, ceph
- we've used a single server but you can specify multiple servers (using the default port): $IP1,$IP2:/ (see the example after this list)
- the port number, 6789, needs to be open through the firewall for others to use
- name=nfv means the user client.nfv -- Ceph does some implied name munging (the default name is guest)
- we need to specify the secretfile containing the key we saved to grant us permission; you can pass secret=... but then it's not very secret...
- the server's export point, here /, is relative to the root of the filesystem, just as you would expect; had we created something in the filesystem we could have said ... $IP:/foo/bar /here/there
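Putting a few of those together, a mount naming two monitors and a sub-directory of the filesystem might look like this (addresses and paths purely illustrative):

# mount -t ceph 192.168.8.3,192.168.8.4:/foo/bar /here/there -o name=nfv,secretfile=/etc/ceph/nfv.key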
More details in the CephFS docs or man mount.ceph.
Unmount
umount /mnt
as you would expect.
Removing Filesystems
Again, Ceph is quite defensive (and not massively well documented). Something like:
ceph fs fail ${FS}
ceph fs rm ${FS} --yes-i-really-mean-it
ceph osd pool rm ${FS}_data ${FS}_data --yes-i-really-really-mean-it
ceph osd pool rm ${FS}_metadata ${FS}_metadata --yes-i-really-really-mean-it
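Note that, with the default settings, the pool removals will be refused outright: you first have to tell the monitors that pool deletion is acceptable, with something like:

ceph config set mon mon_allow_pool_delete true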
Multiple Filesystems
If you try to create a second filesystem you'll get:
Error EINVAL: Creation of multiple filesystems is disabled. To enable this experimental feature, use 'ceph fs flag set enable_multiple true'
If you do go down that route then those CephFS docs suggest you differentiate between filesystems with mds_namespace=${FS} -- although that option is not documented in mount.ceph(8).
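For what it's worth, a sketch of that route, combining the command from the error message with the suggested mount option (untested here, so treat with caution):

# ceph fs flag set enable_multiple true --yes-i-really-mean-it
# mount -t ceph ${MON_IP}:6789:/ /mnt -o name=nfv,secretfile=/etc/ceph/nfv.key,mds_namespace=${FS}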
Troubleshooting
Degraded pgs
# ceph -s
  ...
  data:
    pools:   5 pools, 192 pgs
    objects: 45.91k objects, 344 GiB
    usage:   1.8 TiB used, 9.0 TiB / 11 TiB avail
    pgs:     456/137727 objects degraded (0.331%)
             191 active+clean
             1   active+undersized+degraded
Can we find out more?
# ceph health detail
...
PG_DEGRADED Degraded data redundancy: 456/137727 objects degraded (0.331%), 1 pg degraded, 1 pg undersized
    pg 1.7 is stuck undersized for 56676.916030, current state active+undersized+degraded, last acting [1,4]
Can we find out even more?
Note
ceph pg dump produces a table around 330 columns wide.
# ceph pg dump | grep ^1.[56789]
dumped all
1.9 512 0   0 0 0 4286578706 0   0  1484 1484 active+clean               2020-01-09 00:56:54.828879 2064'1484  3142:37220 [4,1,3] 4 [4,1,3] 4 2064'1484  2020-01-09 00:56:54.828760 2064'1484  2020-01-07 14:47:23.346632 0
1.8 456 0   0 0 0 3808428578 414 27 3081 3081 active+clean               2020-01-09 00:17:48.201274 2199'4581  3142:58139 [1,3,4] 1 [1,3,4] 1 2199'4581  2020-01-08 10:03:25.518633 2199'4581  2020-01-05 01:07:02.541860 0
1.7 456 0 456 0 0 3816816640 414 27 3060 3060 active+undersized+degraded 2020-01-08 18:12:22.521247 3142'10460 3142:77602 [1,4]   1 [1,4]   1 2344'10448 2020-01-08 17:05:04.844260 2344'10415 2020-01-06 03:38:42.922036 0
1.6 458 0   0 0 0 3833593856 414 27 3010 3010 active+clean               2020-01-09 00:19:48.401463 3142'11600 3142:56322 [3,4,0] 3 [3,4,0] 3 2344'11579 2020-01-08 10:12:12.636661 2344'11551 2020-01-07 02:34:07.901581 0
1.5 490 0   0 0 0 4093640738 0   0  3000 3000 active+clean               2020-01-09 00:56:08.441742 2064'7484  3142:61277 [3,0,4] 3 [3,0,4] 3 2064'7484  2020-01-09 00:56:08.441696 2064'7484  2020-01-04 00:21:46.179037 0
What can we read into that? Well, the UP and ACTING columns for our undersized+degraded pg have two elements (OSD IDs 1 and 4) whereas everything else has three. We would expect three as that is our replication ratio. ceph pg map can save us a grep:
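As a sanity check, a pool's replication size should confirm the expected three (using the foo pool from earlier as an example):

# ceph osd pool get foo size
size: 3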
# ceph pg map 1.7
osdmap e3142 pg 1.7 (1.7) -> up [1,4] acting [1,4]
We can dump just the degraded (or undersized or ...) pgs:
# ceph pg dump_stuck degraded
ok
PG_STAT STATE                      UP    UP_PRIMARY ACTING ACTING_PRIMARY
1.7     active+undersized+degraded [1,4] 1          [1,4]  1
ceph pg ls produces slightly less output and takes an optional state:
# ceph pg ls degraded
PG  OBJECTS DEGRADED MISPLACED UNFOUND BYTES      OMAP_BYTES* OMAP_KEYS* LOG  STATE                      SINCE VERSION    REPORTED   UP      ACTING  SCRUB_STAMP                DEEP_SCRUB_STAMP
1.7 456     456      0         0       3816816640 414         27         3060 active+undersized+degraded 16h   3142'10460 3142:77606 [1,4]p1 [1,4]p1 2020-01-08 17:05:04.844260 2020-01-06 03:38:42.922036
The first timestamp in the output is the time of the last scrub. Let's force the issue:
# ceph pg scrub 1.7
instructing pg 1.7 on osd.1 to scrub
*twiddles thumbs* I'm not seeing this updated with any urgency; maybe there's a scheduler. Let's try a repair:
# ceph pg repair 1.7
instructing pg 1.7 on osd.1 to repair
Hmm, equally unenthusiastic. Why would you have a scheduler for repair?
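Seemingly there is a scheduler: repair is implemented as a flavour of scrub, and scrub work is queued on the primary OSD subject to load and time-of-day constraints. You can peek at some of the knobs, for instance (assuming the defaults are in play):

# ceph config get osd osd_scrub_begin_hour
0
# ceph config get osd osd_scrub_end_hour
24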
Version Differences
If you were of a mind to mix'n'match various operating systems in your Ceph cluster you're likely to come up short. Ceph is particularly fussy -- as you might hope -- about versions of things. But not in a particularly tolerant manner.
Take, for example, my original Ceph cluster on hosts running CentOS 7.7, which means the Ceph instances are 14.2.1 (run ceph versions for a list). I then tried to add a Fedora 31 OSD node into the mix. Fedora 31's Ceph version is 14.2.7 and it turns out those six point releases are really quite different.
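For the record, ceph versions reports per-daemon counts in JSON, something of this shape (hashes elided, counts illustrative):

# ceph versions
{
    "mon": {
        "ceph version 14.2.1 (...) nautilus (stable)": 3
    },
    "osd": {
        "ceph version 14.2.1 (...) nautilus (stable)": 5
    },
    ...
    "overall": {
        "ceph version 14.2.1 (...) nautilus (stable)": 12
    }
}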
You get a slew of:
log_channel(cluster) log [WRN] : failed to encode map e4404 with expected crc
messages and after ten minutes your FC31 OSDs refuse to join.
Rummaging about, the Ceph upgrade process involves upgrading mon nodes first. Unfortunately, my mon nodes are also running 14.2.1 and there appears to be no enthusiasm [1] to build a newer release of Ceph for CentOS 7. CentOS 8 has a 14.2.7 release but not CentOS 7.
Of interest, supposing I decided to go all-Fedora on my network, I see that Fedora Rawhide (i.e. Fedora next) is already running Ceph Octopus, 15.1.0.
The fix, noted in this Upgrade and Scale-Out Ceph In Production presentation, is to create new FC31 monitor nodes, add them to the cluster and phase out the CentOS 7 monitor nodes. (In fact, in the presentation, which was Jewel to Luminous, they had to add the monitor nodes as Jewel instances then upgrade them to Luminous in situ as the Jewel cluster didn't like Luminous nodes being added directly. Nautilus seems OK with different versions of Nautilus, at least.)
[1] Ah! Hold that thought. After much aggravation and following the failed link on the Upgrading Ceph page you can reach http://download.ceph.com/rpm-nautilus/el7/x86_64/ which you can reverse engineer into a repo such as:

# cat /etc/yum.repos.d/ceph.repo
[ceph]
name=Ceph Packages and Backports $basearch
baseurl=http://download.ceph.com/rpm-nautilus/el7/$basearch
enabled=1
gpgcheck=1
gpgkey=https://download.ceph.com/keys/release.asc

In other words, Ceph themselves have built some RPMs; CentOS have moved on.
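With that repo in place a regular update should then offer the newer builds (assuming the dependencies resolve on CentOS 7):

# yum clean metadata
# yum update 'ceph*'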