Using Ceph
Our main use of Ceph is detailed in OpenStack Integration, next, but we can do some other things first.
Block Devices
We can create RADOS Block Devices with the rbd command:
rbd create na1 --size 1024
will create a 1 GiB block device in the default rbd pool -- which we don't have (shakes fist at assumed default values).
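You can check which pools do exist (no rbd among them here) with:

ceph osd lspools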
So, for the sake of demonstration, let's create a foo pool and our na1 device in it:
ceph osd pool create foo 64
rbd create foo/na1 --size 1024
Which we can see by listing the pool contents:
# rbd ls -l foo
NAME SIZE  PARENT FMT PROT LOCK
na1  1 GiB        2
Now we can map that to get a device name. This is a bit like using losetup (or lofiadm or ...) to turn a .iso file into a mountable device:
# rbd map foo/na1
rbd: sysfs write failed
RBD image feature set mismatch. You can disable features unsupported by the kernel with "rbd feature disable foo/na1 object-map fast-diff deep-flatten".
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (6) No such device or address
Bah! This kernel, 3.10.0-1062.9.1.el7.x86_64, doesn't support those features. Let's do what it says and try again:
# rbd feature disable foo/na1 object-map fast-diff deep-flatten
# rbd map foo/na1
/dev/rbd0
If we forgot what the mapping was:
# rbd device ls
id pool namespace image snap device
0  foo            na1   -    /dev/rbd0
Cool, we have a device we can manipulate in the usual way:
# mkfs.xfs /dev/rbd0
meta-data=/dev/rbd0              isize=512    agcount=8, agsize=32768 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=262144, imaxpct=25
         =                       sunit=1024   swidth=1024 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
# mount /dev/rbd0 /mnt
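When we're finished, the reverse steps release the device:

# umount /mnt
# rbd unmap /dev/rbd0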
Filesystems
We can create a single mountable filesystem, for which there is a three-step process: run a metadata server, create a user, then create (and mount) the filesystem itself.
We'll just repeat that: you can create one filesystem. Ceph doesn't seem designed for filesystem usage.
Metadata Server
This is the server that manages the filesystem's metadata (clients actually mount via a monitor). Should it run on a monitor node, a storage node or a dedicated metadata server node?
We're not going to make heavy use of the filesystem so we'll use a monitor node.
Continuing with the variables from setup:
Note
The MDS_ID must not start with a number.
MDS_ID=${HOST}

mkdir /var/lib/ceph/mds/${CN}-${MDS_ID}

cat <<EOF >> /etc/ceph/ceph.conf
[mds.${MDS_ID}]
host = ${MON_IP}
EOF

(
  umask 077
  ceph auth get-or-create mds.${MDS_ID} mon 'profile mds' mgr 'profile mds' mds 'allow *' osd 'allow *' > /var/lib/ceph/mds/${CN}-${MDS_ID}/keyring
  chown -R ceph:ceph /var/lib/ceph/mds/${CN}-${MDS_ID}
)

systemctl enable ceph-mds@${MDS_ID}
systemctl start ceph-mds@${MDS_ID}
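At this point the MDS should have registered with the cluster; with no filesystem yet it will sit in standby, something along these lines:

# ceph mds stat
 1 up:standby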
Create a User
We can get a list of the current users with:
ceph auth ls
With our NFV hats on we'll create a user called client.nfv who'll have permission to access the nfv_data and nfv_metadata pools:
(
  umask 077
  ceph auth get-or-create client.nfv mon 'allow r' mds 'allow r' osd 'allow rw pool nfv_data,allow rw pool nfv_metadata' -o /etc/ceph/nfv.keyring
)
If you mess things up, check and repair with:
ceph auth get client.nfv
ceph auth caps client.nfv mon 'allow r' mds 'allow r' osd 'allow pool nfv_data rw, allow pool nfv_metadata rw'
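ceph auth get shows what is currently stored; expect something like this (key elided):

# ceph auth get client.nfv
exported keyring for client.nfv
[client.nfv]
        key = AQ...
        caps mds = "allow r"
        caps mon = "allow r"
        caps osd = "allow rw pool nfv_data,allow rw pool nfv_metadata"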
Save the key (not the keyring) to be used when mounting the filesystem:
(
  umask 077
  ceph auth print-key client.nfv > /etc/ceph/nfv.key
)
Create the Filesystem
A Ceph filesystem comprises both a data and a metadata component which get bound together:
# FS=nfv
# ceph osd pool create ${FS}_data 90
pool 'nfv_data' created
# ceph osd pool create ${FS}_metadata 90
pool 'nfv_metadata' created
# ceph fs new ${FS} ${FS}_metadata ${FS}_data
new fs with metadata pool 6 and data pool 5
Check we're running:
# ceph mds stat
nfv:1 {0=ceph3=up:active}
Mount the Filesystem
# mount -t ceph ${MON_IP}:6789:/ /mnt -o name=nfv,secretfile=/etc/ceph/nfv.key
mount error 1 = Operation not permitted
Doh! I forgot to save the key into /etc/ceph/nfv.key as instructed above. Do that and all is well:
# df -h | grep mnt
192.168.8.3:6789:/  2.0T     0  2.0T   0% /mnt
Notice a few things about that:
- the new filesystem type, ceph
- we've used a single server but you can specify multiple servers (using the default port): $IP1,$IP2:/ (see the example after this list)
- the port number, 6789, needs to be open through the firewall for others to use
- name=nfv means the user client.nfv -- Ceph does some implied name munging (the default name is guest)
- we need to specify the secretfile containing the key we saved to grant us permission; you can pass secret=... but then it's not very secret...
- the server's export point, here /, is relative to the root of the filesystem, just as you would expect; had we created something in the filesystem we could have said ... $IP:/foo/bar /here/there
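Putting a few of those together, a mount naming two monitors and a sub-directory of the filesystem might look like this (addresses and paths purely illustrative):

# mount -t ceph 192.168.8.3,192.168.8.4:/foo/bar /here/there -o name=nfv,secretfile=/etc/ceph/nfv.key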
More details in the CephFS docs or man mount.ceph.
Unmount
umount /mnt
as you would expect.
Removing Filesystems
Again, Ceph is quite defensive (and not massively well documented). Something like:
ceph fs fail ${FS}
ceph fs rm ${FS} --yes-i-really-mean-it
ceph osd pool rm ${FS}_data ${FS}_data --yes-i-really-really-mean-it
ceph osd pool rm ${FS}_metadata ${FS}_metadata --yes-i-really-really-mean-it
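Note that, with the default settings, the pool removals will be refused outright: you first have to tell the monitors that pool deletion is acceptable, with something like:

ceph config set mon mon_allow_pool_delete true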
Multiple Filesystems
If you try to create a second filesystem you'll get:
Error EINVAL: Creation of multiple filesystems is disabled. To enable this experimental feature, use 'ceph fs flag set enable_multiple true'
If you do go down that route then those CephFS docs suggest you differentiate between filesystems with mds_namespace=${FS} -- although that option is not documented in mount.ceph(8).
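For what it's worth, a sketch of that route, combining the command from the error message with the suggested mount option (untested here, so treat with caution):

# ceph fs flag set enable_multiple true --yes-i-really-mean-it
# mount -t ceph ${MON_IP}:6789:/ /mnt -o name=nfv,secretfile=/etc/ceph/nfv.key,mds_namespace=${FS}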
Troubleshooting
Degraded pgs
# ceph -s
  ...
  data:
    pools:   5 pools, 192 pgs
    objects: 45.91k objects, 344 GiB
    usage:   1.8 TiB used, 9.0 TiB / 11 TiB avail
    pgs:     456/137727 objects degraded (0.331%)
             191 active+clean
             1   active+undersized+degraded
Can we find out more?
# ceph health detail
...
PG_DEGRADED Degraded data redundancy: 456/137727 objects degraded (0.331%), 1 pg degraded, 1 pg undersized
    pg 1.7 is stuck undersized for 56676.916030, current state active+undersized+degraded, last acting [1,4]
Can we find out even more?
Note
ceph pg dump produces a table around 330 columns wide.
# ceph pg dump | grep ^1.[56789]
dumped all
1.9 512 0   0 0 0 4286578706 0   0  1484 1484 active+clean               2020-01-09 00:56:54.828879 2064'1484  3142:37220 [4,1,3] 4 [4,1,3] 4 2064'1484  2020-01-09 00:56:54.828760 2064'1484  2020-01-07 14:47:23.346632 0
1.8 456 0   0 0 0 3808428578 414 27 3081 3081 active+clean               2020-01-09 00:17:48.201274 2199'4581  3142:58139 [1,3,4] 1 [1,3,4] 1 2199'4581  2020-01-08 10:03:25.518633 2199'4581  2020-01-05 01:07:02.541860 0
1.7 456 0 456 0 0 3816816640 414 27 3060 3060 active+undersized+degraded 2020-01-08 18:12:22.521247 3142'10460 3142:77602 [1,4]   1 [1,4]   1 2344'10448 2020-01-08 17:05:04.844260 2344'10415 2020-01-06 03:38:42.922036 0
1.6 458 0   0 0 0 3833593856 414 27 3010 3010 active+clean               2020-01-09 00:19:48.401463 3142'11600 3142:56322 [3,4,0] 3 [3,4,0] 3 2344'11579 2020-01-08 10:12:12.636661 2344'11551 2020-01-07 02:34:07.901581 0
1.5 490 0   0 0 0 4093640738 0   0  3000 3000 active+clean               2020-01-09 00:56:08.441742 2064'7484  3142:61277 [3,0,4] 3 [3,0,4] 3 2064'7484  2020-01-09 00:56:08.441696 2064'7484  2020-01-04 00:21:46.179037 0
What can we read into that? Well, the UP and ACTING columns for our undersized+degraded pg have two elements (OSD IDs 1 and 4) whereas everything else has three. We would expect three as that is our replication ratio. ceph pg map can save us a grep:
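As a sanity check, a pool's replication size should confirm the expected three (using the foo pool from earlier as an example):

# ceph osd pool get foo size
size: 3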
# ceph pg map 1.7
osdmap e3142 pg 1.7 (1.7) -> up [1,4] acting [1,4]
We can dump just the degraded (or undersized or ...) pgs:
# ceph pg dump_stuck degraded
ok
PG_STAT STATE                      UP    UP_PRIMARY ACTING ACTING_PRIMARY
1.7     active+undersized+degraded [1,4] 1          [1,4]  1
ceph pg ls produces slightly less output and takes an optional state:
# ceph pg ls degraded
PG  OBJECTS DEGRADED MISPLACED UNFOUND BYTES      OMAP_BYTES* OMAP_KEYS* LOG  STATE                      SINCE VERSION    REPORTED   UP      ACTING  SCRUB_STAMP                DEEP_SCRUB_STAMP
1.7 456     456      0         0       3816816640 414         27         3060 active+undersized+degraded 16h   3142'10460 3142:77606 [1,4]p1 [1,4]p1 2020-01-08 17:05:04.844260 2020-01-06 03:38:42.922036
The first timestamp in the output is the time of the last scrub. Let's force the issue:
# ceph pg scrub 1.7
instructing pg 1.7 on osd.1 to scrub
*twiddles thumbs* I'm not seeing this updated with any urgency; maybe there's a scheduler. Let's try a repair:
# ceph pg repair 1.7
instructing pg 1.7 on osd.1 to repair
Hmm, equally unenthusiastic. Why would you have a scheduler for repair?
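Seemingly there is a scheduler: repair is implemented as a flavour of scrub, and scrub work is queued on the primary OSD subject to load and time-of-day constraints. You can peek at some of the knobs, for instance (assuming the defaults are in play):

# ceph config get osd osd_scrub_begin_hour
0
# ceph config get osd osd_scrub_end_hour
24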
Version Differences
If you were of a mind to mix'n'match various operating systems in your Ceph cluster you're likely to come up short. Ceph is particularly fussy -- as you might hope -- about versions of things. But not in a particularly tolerant manner.
Take, for example, my original Ceph cluster on hosts running CentOS 7.7, which means the Ceph instances are 14.2.1 (run ceph versions for a list). I then tried to add a Fedora 31 OSD node into the mix. Fedora 31's Ceph version is 14.2.7 and it turns out those six point releases are really quite different.
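For the record, ceph versions reports per-daemon counts in JSON, something of this shape (hashes elided, counts illustrative):

# ceph versions
{
    "mon": {
        "ceph version 14.2.1 (...) nautilus (stable)": 3
    },
    "osd": {
        "ceph version 14.2.1 (...) nautilus (stable)": 5
    },
    ...
    "overall": {
        "ceph version 14.2.1 (...) nautilus (stable)": 12
    }
}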
You get a slew of:
log_channel(cluster) log [WRN] : failed to encode map e4404 with expected crc
messages and after ten minutes your FC31 OSDs refuse to join.
Rummaging about, the Ceph upgrade process involves upgrading mon nodes first. Unfortunately, my mon nodes are also running 14.2.1 and there appears to be no enthusiasm [1] to build a newer release of Ceph for CentOS 7. CentOS 8 has a 14.2.7 release but not CentOS 7.
Of interest, supposing I decided to go all-Fedora on my network, I see that Fedora Rawhide (i.e. Fedora next) is already running Ceph Octopus, 15.1.0.
The fix, noted in this Upgrade and Scale-Out Ceph In Production presentation, is to create new FC31 monitor nodes, add them to the cluster and phase out the CentOS 7 monitor nodes. (In fact, in the presentation, which was Jewel to Luminous, they had to add the monitor nodes as Jewel instances then upgrade them to Luminous in situ as the Jewel cluster didn't like Luminous nodes being added directly. Nautilus seems OK with different versions of Nautilus, at least.)
[1] Ah! Hold that thought. After much aggravation and following the failed link on the Upgrading Ceph page you can reach http://download.ceph.com/rpm-nautilus/el7/x86_64/ which you can reverse engineer into a repo such as:

# cat /etc/yum.repos.d/ceph.repo
[ceph]
name=Ceph Packages and Backports $basearch
baseurl=http://download.ceph.com/rpm-nautilus/el7/$basearch
enabled=1
gpgcheck=1
gpgkey=https://download.ceph.com/keys/release.asc

In other words, Ceph themselves have built some RPMs; CentOS have moved on.
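With that repo in place a regular update should then offer the newer builds (assuming the dependencies resolve on CentOS 7):

# yum clean metadata
# yum update 'ceph*'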