CEPH: Architecture Overview
| Documentation | |
|---|---|
| Name: | CEPH: Architecture Overview |
| Description: | This document is the starting point for understanding this technology |
| Modification date: | 11/02/2019 |
| Owner: | jholgado |
| Notify changes to: | jholgado |
| Tags: | ceph, oss |
| Escalate to: | Sistemas |
Initial concepts
Ceph is an advanced Object Storage Service (OSS).
It provides easy-to-use, robust methods to store huge numbers of objects of any kind and expose them in multiple ways: S3/Swift, NFS, or as a block device.
It's fast and scales out easily.
You can read/watch a basic introduction here: https://ceph.com/ceph-storage/
Basic architecture
VERY BASIC (2-minute approach)
From a very simple point of view, Ceph acts as a disk.
With the right library, much like the kernel module is for “ext4”/“btrfs”, you'll be able to read and write to it directly.
This library is called librados, and it interacts with RADOS, the Reliable Autonomic Distributed Object Store, which is the object store itself.
Continuing with this simple view: just as ext4 has data blocks and journal blocks to maintain consistency, Ceph has OSD (Object Storage Daemon) and MON (monitor) nodes:
- OSD: serves data
- MON: keeps track of OSD nodes; DOES NOT SERVE DATA
So you'll have something like this:
BASIC Architecture
Going deeper, you'll find that data placement across OSD nodes is calculated by an algorithm called CRUSH (Controlled Replication Under Scalable Hashing), which:
- Is pseudo-random, very fast, repeatable and deterministic
- Produces a uniform data distribution
- Results in a stable mapping of data (with very limited data migration on changes)
- Has a rule-based configuration: adjustable replication, weights…
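The properties above can be illustrated with a toy placement function. The sketch below is NOT the real CRUSH algorithm (which walks a weighted hierarchy of buckets described by the CRUSH map); it is a simple rendezvous-hashing stand-in, with made-up names, that shares the listed properties: pseudo-random, deterministic, uniform, and stable under cluster changes.

```python
import hashlib

def toy_place(obj_name, osds, replicas=3):
    """Toy stand-in for CRUSH: rank every OSD by a pseudo-random score
    derived from the (object, OSD) pair, then take the top `replicas`.
    Deterministic: the same inputs always give the same placement."""
    def score(osd):
        return hashlib.sha256(f"{obj_name}:{osd}".encode()).hexdigest()
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = [f"osd.{i}" for i in range(5)]
print(toy_place("img-0001", osds))        # same result on every call
# Adding an OSD only relocates the objects whose ranking it enters:
print(toy_place("img-0001", osds + ["osd.5"]))
```

Because each client can compute this ranking locally, no central lookup service is needed to find an object, which is the key idea behind CRUSH.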
So when a client wants to write data to the Ceph cluster through RADOS, librados on the client side invokes CRUSH to calculate which of the available OSDs the data should be written to.
This results in a very strong architecture with no single point of failure, because there is no node (or set of nodes) in the data path taking care of metadata.
It's also really fast: you'll have N OSD servers performing reads and writes in parallel.
It's robust: if any OSD node fails, the data is replicated N times (where N is a config option) across other OSDs and remains accessible through the CRUSH calculation.
Also, if any OSD fails, the MONitors will re-map the cluster and the OSDs will re-replicate the data so that N copies exist again in the cluster.
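The re-mapping after a failure can be sketched with the same kind of toy placement function (a rendezvous-hashing stand-in for CRUSH, not the real algorithm): removing the failed OSD from the candidate list leaves the surviving replicas exactly where they were and deterministically picks one replacement.

```python
import hashlib

def toy_place(obj_name, osds, replicas=3):
    # Toy CRUSH stand-in: deterministic pseudo-random ranking of OSDs
    # per object; the top `replicas` hold the copies.
    def score(osd):
        return hashlib.sha256(f"{obj_name}:{osd}".encode()).hexdigest()
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = [f"osd.{i}" for i in range(5)]
before = toy_place("img-0001", osds)
failed = before[0]                       # simulate losing one replica's OSD
after = toy_place("img-0001", [o for o in osds if o != failed])
print(sorted(set(before) & set(after)))  # the 2 surviving replicas stay put
print(set(after) - set(before))          # exactly 1 replacement OSD is chosen
```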
In a graph
CEPH as REST Object Storage
The only difference in this case is that there's a new component involved: the gateway, which translates HTTP/REST into librados calls.
That's all.
You'll pay a noticeable overhead/performance cost using the gateway instead of using librados directly…
So if you take the previous graph, simplified, you'll have:
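As a rough sketch of the gateway's job, the toy function below maps an S3-style HTTP PUT onto a librados-level operation. It is purely illustrative: the real gateway (radosgw) also handles authentication, bucket metadata, multipart uploads and much more, and the pool naming here is invented.

```python
def rest_put_to_rados(path, body):
    """Toy sketch of what the gateway does: turn an HTTP/REST request
    (e.g. PUT /photos/cat.jpg) into a librados object write."""
    bucket, _, key = path.lstrip("/").partition("/")
    # A real gateway would now call librados, roughly:
    #   ioctx = cluster.open_ioctx(pool_for(bucket))
    #   ioctx.write_full(key, body)
    return {"pool": f"rgw.bucket.{bucket}", "object": key, "size": len(body)}

op = rest_put_to_rados("/photos/cat.jpg", b"JPEGDATA")
print(op)
```

This translation layer is exactly where the extra overhead comes from compared to talking to RADOS directly.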
CEPH as Filesystem Architecture
Again, the difference when using CEPH as a “filesystem” is that there's another component: the “Metadata Server” (MDS).
The Metadata Server has a role similar to the Monitor:
- It DOES NOT SERVE DATA
- It holds the metadata database, which is the equivalent of the “inodes” in a common filesystem.
- The Metadata nodes also replicate and vote among themselves.
- The filesystem tree is split dynamically between all the metadata nodes.
Real Life Cluster Example
Node list:
- Admin:
- ACCLM-OSADM-001
- OSD:
- ACCLM-OSD-001
- ACCLM-OSD-002
- ACCLM-OSD-003
- ACCLM-OSD-004
- ACCLM-OSD-005
- Gateways:
- ACCLM-OSGW-001
- ACCLM-OSGW-002
- Monitors (yes, one more is needed):
- ACCLM-OSM-001
- ACCLM-OSADM-001