====== CEPH: Architecture Overview ======

^ Documentation ^|
^Name:| CEPH: Architecture Overview |
^Description:| This document is the starting point for understanding this technology |
^Modification date :|11/02/2019|
^Owner:|dodger|
^Notify changes to:|dodger|
^Tags:| ceph, oss |
^Scalate to:|The_fucking_bofh|

====== Initial concepts ======

What it is:
  * Ceph is an advanced Object Storage Service (OSS).
  * An easy way to store tons of objects of any kind and access them in multiple ways: S3/Swift, NFS, block device.
  * It's fast, scale-out and fault tolerant.

What it is NOT:
  * A database
  * A cache

\\
You can read/watch a basic introduction here: [[https://ceph.com/ceph-storage/]]
\\

====== Basic architecture ======

===== VERY BASIC (2 minutes approach) =====

From a very simple point of view, Ceph acts as a disk.\\
With the right library (just as the kernel module does for "ext4"/"btrfs"), you'll be able to read/write directly.\\
This library is called ''librados''.\\
It talks to ''RADOS'', the //reliable autonomic distributed object store//, which is the object store itself.\\
Continuing with this simple view: just as ext4 has data blocks and journal blocks to maintain consistency, Ceph has OSD (object storage daemon) and MON (monitor) nodes:
  * OSD: serves data
  * MON: keeps track of the OSD nodes, **DOES NOT SERVE DATA**

\\
So you'll have something like this:

digraph G { compound=true; subgraph clusterR { label = "RADOS"; style=filled; fillcolor=gray; osd0 [shape=cylinder,style=filled,fillcolor=coral]; osd1 [shape=cylinder,style=filled,fillcolor=coral]; osd2 [shape=cylinder,style=filled,fillcolor=coral]; mon0 [shape=box3d,style=filled,fillcolor=lightblue]; osd3 [shape=cylinder,style=filled,fillcolor=coral]; osd4 [shape=cylinder,style=filled,fillcolor=coral]; osd5 [shape=cylinder,style=filled,fillcolor=coral]; mon1 [shape=box3d,style=filled,fillcolor=lightblue]; } subgraph clusterC { label = "CLIENT"; librados [shape=Mdiamond, style=filled, fillcolor=lightpink]; }
librados->osd1 [dir=both,label=data]; librados->osd3 [dir=both,label=data]; }

===== BASIC Architecture =====

Going deeper, you'll find that data placement across the OSD nodes is calculated by an algorithm called CRUSH (//Controlled Replication Under Scalable Hashing//), which:
  * Is pseudo-random, very fast, repeatable and deterministic
  * Produces a uniform data distribution
  * Results in a stable mapping of data (with very limited data migration on changes)
  * Has a rule-based configuration: adjustable replication, weights ...

\\
So when a client wants to write data into the CEPH cluster through RADOS, ''librados'' on the client side invokes CRUSH to calculate which of the available OSDs the data should be written to.\\
\\
This results in a very strong architecture with no single point of failure, because there is no dedicated node (or set of nodes) taking care of placement metadata.\\
It's also really fast: every OSD server you add is one more server performing reads/writes.\\
It's robust: if any OSD node fails, the data is replicated //N// times (where //N// is a config option) across other OSDs and remains accessible through the CRUSH calculation.\\
Also, if an OSD fails, the MONitors will re-map the cluster and the OSDs will re-replicate the data so that //N// copies exist in the cluster again.
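The properties above can be sketched in code. This is **not** the real CRUSH algorithm (which walks a weighted hierarchy described by the cluster map, honouring failure domains like host/rack/room); it is a minimal rendezvous-hashing sketch in Python that shows the same key properties: deterministic, repeatable placement computed by any client with no central lookup, and limited data migration when OSDs are added. All names here are illustrative.

```python
import hashlib

def place(obj: str, osds: list, replicas: int = 3) -> list:
    """Toy stand-in for CRUSH: deterministically map an object name to
    `replicas` distinct OSDs by ranking every (object, osd) pair with a
    hash (rendezvous hashing). No metadata server is consulted: any
    client holding the same OSD list computes the same placement."""
    ranked = sorted(
        osds,
        key=lambda osd: hashlib.sha256(f"{obj}:{osd}".encode()).hexdigest(),
    )
    return ranked[:replicas]

osds = [f"osd{i}" for i in range(6)]

# Repeatable and deterministic: the same input always maps to the same OSDs.
assert place("my-object", osds) == place("my-object", osds)

# Stable mapping: adding one OSD moves at most one of the three replicas.
before = place("my-object", osds)
after = place("my-object", osds + ["osd6"])
assert len(set(before) & set(after)) >= 2
```

Real CRUSH additionally takes weights and the rule-based configuration into account, so that, for example, no two replicas ever land in the same failure domain.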
==== In a graph ====

digraph G { compound=true; subgraph clusterR { label = "RADOS"; subgraph clusterOSD{ style=filled; fillcolor=gray; label = "OSD side"; osd0 [shape=cylinder,style=filled,fillcolor=coral]; osd1 [shape=cylinder,style=filled,fillcolor=coral]; osd2 [shape=cylinder,style=filled,fillcolor=coral]; osd3 [shape=cylinder,style=filled,fillcolor=coral]; osd4 [shape=cylinder,style=filled,fillcolor=coral]; osd5 [shape=cylinder,style=filled,fillcolor=coral]; } subgraph clusterMON{ style=filled; fillcolor=gray; label = "MON side"; mon0 [shape=box3d,style=filled,fillcolor=lightblue]; mon1 [shape=box3d,style=filled,fillcolor=lightblue]; mon0 -> mon1 [dir=both, label="replication/voting"]; { rank=same; mon0 mon1} } status [shape=Mdiamond, label="osd status", style=filled, fillcolor=lightpink]; osd0 -> status [dir=both]; osd1 -> status [dir=both]; osd2 -> status [dir=both]; osd3 -> status [dir=both]; osd4 -> status [dir=both]; osd5 -> status [dir=both]; status->mon0; status->mon1; } subgraph clusterC { label = "CLIENT"; data [shape=oval,style=filled, fillcolor=darkorange]; subgraph clusterLIBRADOS { style=filled; fillcolor=gray; label = "LIBRADOS"; crush [shape=Mdiamond, style=filled, fillcolor=lightpink]; } data->crush [lhead=clusterLIBRADOS]; } crush->osd1 [dir=both,label="data CRUSHed"]; crush->osd3 [dir=both,label="data CRUSHed"]; }

====== Usage Cases ======

===== CEPH as REST Object Storage =====

The **only** difference in this case is that there's a new component involved, the **gateway**, which translates HTTP/REST into ''librados'' calls.\\
That's all.\\
Be aware that the gateway adds a fair amount of overhead compared to using ''librados'' directly...\\
So if you take the previous graph, simplified, you'll have:

digraph G { compound=true; rankdir=LR; subgraph clusterCLIENT { label = "CLIENT"; data [shape=oval,style=filled, fillcolor=darkorange]; } subgraph clusterG { label = "GATEWAY"; subgraph clusterLIBRADOS { style=filled; fillcolor=gray; label = "LIBRADOS"; crush
[shape=Mdiamond, style=filled, fillcolor=lightpink]; } } data->crush [label="HTTP/REST",lhead=clusterG]; subgraph clusterR { label = "RADOS"; style=filled; fillcolor=gray; ceph [shape=cylinder,label="CEPH cluster", style=filled, fillcolor=coral]; } crush->ceph; }

===== CEPH as Filesystem Architecture =====

**Official documentation: [[https://docs.ceph.com/docs/master/cephfs/]]**\\
Again, the difference when using CEPH as a "filesystem" is that there's one more component: the "Metadata server" (MDS).\\
The Metadata server has a role similar to the Monitor:
  * It **DOES NOT SERVE DATA**
  * It holds the metadata database, analogous to the "inodes" of a conventional filesystem.
  * The Metadata nodes also replicate and vote among themselves.
  * The filesystem tree is split **dynamically** across all the metadata nodes.

digraph G { compound=true; subgraph clusterR { label = "RADOS"; style=filled; fillcolor=gray; subgraph clusterOSD{ label = "OSD pool"; osd0 [label="osd0..osdN",shape=cylinder,style=filled,fillcolor=brown]; } subgraph clusterMON{ label = "MON pool"; mon0 [label="mon0..monN",shape=box3d,style=filled,fillcolor=lightblue]; } subgraph clusterMETA{ label = "METADATA pool"; meta0 [shape=box3d,style=filled,fillcolor=green]; meta1 [shape=box3d,style=filled,fillcolor=green]; meta0->meta1 [dir=both,label="replication"]; { rank=same; meta0 meta1} } } subgraph clusterC { label = "CLIENT"; file [label="dodger@server ~ $ cp file /mnt/ceph/path/to/new.file";shape=none]; file->cephFS; subgraph clusterLIBRADOS { style=filled; fillcolor=gray; label = "LIBRADOS/CEPHFS"; cephFS [label="/mnt/cephfs",shape=Mdiamond, style=filled, fillcolor=lightpink]; } } cephFS->osd0 [label=data]; cephFS->meta0 [label="file metadata"]; osd0->mon0 [label="status"]; meta0->mon0 [label="status"]; meta1->mon0 [label="status"]; }

====== Real Life ======

===== ePayments PRO/PRE =====

Node list:
  * Admin:
    * ACCLM-OSADM-001
  * OSD:
    * ACCLM-OSD-001
    * ACCLM-OSD-002
    * ACCLM-OSD-003
    * ACCLM-OSD-004
    * ACCLM-OSD-005
  * Gateways:
    * ACCLM-OSGW-001
    * ACCLM-OSGW-002
  * Monitors:
    * ACCLM-OSM-001
    * ACCLM-OSADM-001

digraph G { compound=true; subgraph clusterR { label = "CEPH"; subgraph clusterOSD{ style=filled; fillcolor=gray; label = "OSDs"; osd1 [label="ACCLM-OSD-001", shape=cylinder,style=filled,fillcolor=coral]; osd2 [label="ACCLM-OSD-002", shape=cylinder,style=filled,fillcolor=coral]; osd3 [label="ACCLM-OSD-003", shape=cylinder,style=filled,fillcolor=coral]; osd4 [label="ACCLM-OSD-004", shape=cylinder,style=filled,fillcolor=coral]; osd5 [label="ACCLM-OSD-005", shape=cylinder,style=filled,fillcolor=coral]; } subgraph clusterMON{ style=filled; fillcolor=gray; label = "MONs"; mon0 [label="ACCLM-OSM-001", shape=box3d,style=filled,fillcolor=lightblue]; mon1 [label="ACCLM-OSADM-001", shape=box3d,style=filled,fillcolor=lightblue]; { rank=same; mon0 mon1} } subgraph clusterG { label = "GATEWAYs"; style=filled; fillcolor=gray; gw01 [label="ACCLM-OSGW-001", shape=Mdiamond, style=filled, fillcolor=lightpink]; gw02 [label="ACCLM-OSGW-002", shape=Mdiamond, style=filled, fillcolor=lightpink]; } } subgraph clusterCLIENT { label = "CLIENT"; data [shape=oval,style=filled, fillcolor=darkorange]; } dns [label="Balanced through DNS", shape=note]; dns->gw01 [color=firebrick4]; dns->gw02 [color=firebrick4]; data->gw01 [label="HTTP/REST",lhead=clusterG, dir=both]; data->gw02 [label="HTTP/REST",lhead=clusterG, dir=both]; gw01->osd1 [dir=both]; gw01->osd2 [dir=both]; gw01->osd3 [dir=both]; gw01->osd4 [dir=both]; gw01->osd5 [dir=both]; gw02->osd1 [dir=both]; gw02->osd2 [dir=both]; gw02->osd3 [dir=both]; gw02->osd4 [dir=both]; gw02->osd5 [dir=both]; osd1->mon0 [dir=both]; osd1->mon1 [dir=both]; osd2->mon0 [dir=both]; osd2->mon1 [dir=both]; osd3->mon0 [dir=both]; osd3->mon1 [dir=both]; osd4->mon0 [dir=both]; osd4->mon1 [dir=both]; osd5->mon0 [dir=both]; osd5->mon1 [dir=both]; mon0->mon1 [dir=both]; }

===== Clover schematics (here comes the monster) =====

HAproxies:
  * AVMLP-OSLB-001
  * AVMLP-OSLB-002
Object Gateways:
  * AVMLP-OSGW-001
  * AVMLP-OSGW-002
  * AVMLP-OSGW-003
  * AVMLP-OSGW-004

Monitors:
  * AVMLP-OSM-001
  * AVMLP-OSM-002
  * AVMLP-OSM-003
  * AVMLP-OSM-004
  * AVMLP-OSM-005
  * AVMLP-OSM-006

Metadata servers:
  * AVMLP-OSFS-001
  * AVMLP-OSFS-002
  * AVMLP-OSFS-003
  * AVMLP-OSFS-004

Data servers:
  * AVMLP-OSD-001
  * AVMLP-OSD-002
  * AVMLP-OSD-003
  * AVMLP-OSD-004
  * AVMLP-OSD-005
  * AVMLP-OSD-006
  * AVMLP-OSD-007
  * AVMLP-OSD-008
  * AVMLP-OSD-009
  * AVMLP-OSD-010
  * AVMLP-OSD-011
  * AVMLP-OSD-012
  * AVMLP-OSD-013
  * AVMLP-OSD-014
  * AVMLP-OSD-015
  * AVMLP-OSD-016
  * AVMLP-OSD-017
  * AVMLP-OSD-018
  * AVMLP-OSD-019
  * AVMLP-OSD-020

\\
\\

== Clover ==

[[https://www.imdb.com/title/tt1060277/?ref_=fn_al_tt_1|{{:documentation:linux:ceph:start_here:clover.gif?nolink|}}]]

==== As object gateway ====

digraph G { compound=true; node [shape=record]; subgraph clusterClover { label = "Clover Cluster"; clover [label="{ {{mon layer | nuciberterminal | {MON-001 | MON-003 }|nuciberterminal2 |{ MON-002 | MON-004 } | nudev | { MON-005 | MON-006 } }|{ object gateway layer | nuciberterminal | { OSGW-001 | OSGW-003 } | nuciberterminal2 |{ OSGW-002 | OSGW-004 } | nudev | { - }}|{mds (metadata) layer | nuciberterminal | { OSFS-001 | OSFS-003 } | nuciberterminal2 |{ OSFS-002 | OSFS-004 } | nudev | { - }}} | osd layer | nuciberterminal | {OSD-001 | OSD-003 | OSD-005 | OSD-007 | OSD-009 | OSD-011 | OSD-013 | OSD-015 | OSD-017 | OSD-019 }|nuciberterminal2 |{ OSD-002 | OSD-004 | OSD-006 | OSD-008 | OSD-010 | OSD-012 | OSD-014 | OSD-016 | OSD-018 | OSD-020 } }"]; } subgraph clusterHA { label = "Load balancer cluster (ceph independent)"; ha01 [label="AVMLP-OSLB-001", shape=Mdiamond, style=filled, fillcolor=lightpink]; ha02 [label="AVMLP-OSLB-002", shape=Mdiamond, style=filled, fillcolor=lightpink]; { rank=same; ha01 ha02 } ha01 -> ha02 [style=dashed, color=grey, dir=both, label="Keepalived VIP 10.20.54.0"] ; } subgraph clusterCLIENT1 { label = "Http (s3 api) Client"; data [label="http request
data",shape=oval,style=filled, fillcolor=darkorange]; } data->ha02 [style=dashed, color=grey, lhead=clusterHA] ; data->ha01 [label="http", lhead=clusterHA] ; ha01->clover:gwproxy [label="http"] ; ha02->clover:gwproxy [label="http"] ; } ==== As Filesystem (cephfs) ==== digraph G { compound=true; node [shape=record]; subgraph clusterCLIENT2 { label = "Cephfs/librados Client"; mount [shape=rectangle,label="mount -t ceph OSM-001,OSM-002,OSM-003,OSM-004,OSM-005,OSM-006:6789:/ /mnt/cephfs -o name=cephfs-ftp,secretfile=/etc/ceph/client.secret"]; cephfs [shape=folder,style=filled, fillcolor=azure]; mount->cephfs; { rank=same ; mount cephfs } } subgraph clusterClover { label = "Clover Cluster"; clover [label="{ {{ mon layer | nuciberterminal | {MON-001 | MON-003 }|nuciberterminal2 |{ MON-002 | MON-004 } | nudev | { MON-005 | MON-006 } }|{ object gateway layer | nuciberterminal | { OSGW-001 | OSGW-003 } | nuciberterminal2 |{ OSGW-002 | OSGW-004 } | nudev | { - }}|{ mds (metadata) layer | nuciberterminal | { OSFS-001 | OSFS-003 } | nuciberterminal2 |{ OSFS-002 | OSFS-004 } | nudev | { - }}} | osd layer | nuciberterminal | {OSD-001 | OSD-003 | OSD-005 | OSD-007 | OSD-009 | OSD-011 | OSD-013 | OSD-015 | OSD-017 | OSD-019 }|nuciberterminal2 |{ OSD-002 | OSD-004 | OSD-006 | OSD-008 | OSD-010 | OSD-012 | OSD-014 | OSD-016 | OSD-018 | OSD-020 } }"]; } mount->clover:monpointer [label="mount with librados/cephfs"] ; cephfs -> clover:osdpointer [label="direct access"] ; clover:mdspointer -> cephfs [label="metadata info"] ; } \\ \\ ==== Public ceph schema ==== digraph G { compound=true; subgraph clusterR { label = "VOXEL INFRASTRUCTURE"; subgraph clusterGW1{ style=filled; fillcolor=gray; label = "Ceph Gateways - NUVOXEL&NUVOXEL2"; gateway1 [label="OSGW-001", shape=Mdiamond,style=filled,fillcolor=coral]; gateway3 [label="OSGW-003", shape=Mdiamond,style=filled,fillcolor=coral]; gateway2 [label="OSGW-002", shape=Mdiamond,style=filled,fillcolor=coral]; gateway4 [label="OSGW-004", 
shape=Mdiamond,style=filled,fillcolor=coral]; } subgraph clusterG1 { label = "DMZ - NUVOXEL&NUVOXEL2"; style=filled; fillcolor=gray; ngx01 [label="NGINX-001", shape=Mdiamond, style=filled, fillcolor=lightpink]; ngx02 [label="NGINX-002", shape=Mdiamond, style=filled, fillcolor=lightpink]; { rank=same; ngx01 ngx02 } ngx01 -> ngx02 [style=dashed, color=green, dir=both, label="Keepalived VIP"] ; } ceph [label="clover", shape=cylinder,style=filled,fillcolor=lightblue]; } subgraph clusterCLIENT { label = "CLIENT"; data [shape=oval,style=filled, fillcolor=darkorange]; } data->ngx01 [label="HTTP/GET",lhead=clusterG, dir=back]; ngx01->gateway1 ; ngx01->gateway2 ; ngx01->gateway3 ; ngx01->gateway4 ; ngx02->gateway1 ; ngx02->gateway2 ; ngx02->gateway3 ; ngx02->gateway4 ; gateway1->ceph; gateway2->ceph; gateway3->ceph; gateway4->ceph; }

\\
\\

====== Considerations for newcomers ======

When requesting access to **any** of our object storages, or if you're a newcomer, you should know that:
  * The_fucking_bofh team will give you the "access_key" and "secret_key" to access the object storage, **but also** the name of the user. This username is **unique** and will be something related to the name of the project.
  * The_fucking_bofh team will **not** create the bucket; you can easily do it yourself :-)
  * Bucket names are **unique** across ONE object storage instance: if you create the bucket "test", no one else can create another bucket named "test" in the same object storage. Keep this in mind, since you're sharing ceph with other users and with yourself (for PRO/PRE/DEV/TEST/STAGING/DEMO/SERVERLESS)! So don't use "ciberterminal" as a bucket name :-)
  * What you'll need to connect to the object storage is:
    * Endpoint url: for example https://clover.ciberterminal.net
    * access_key
    * secret_key
    * knowledge of how to connect.
  * Please specify where you want your user to be created, for example: PRO/PRE/DEV/TEST/STAGING/DEMO/SERVERLESS

\\
\\
Here you have a template to request a new user for the object storage:

Good morning #infrastructure,
We're facing a new project that involves storing tons of objects, and we want to use our incredible Ceph installation.
Please provide us with a new user so we can store all the data from this project.
Name of the project: "This_template_sucks"
Environment: DEVELOPMENT
Expected number of buckets: 666
Thanks for your effort, best regards!
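For reference, once The_fucking_bofh team has sent you the keys, a client such as ''s3cmd'' only needs the endpoint and the key pair. A minimal ''~/.s3cfg'' sketch (all values are placeholders, not real credentials; the endpoint is the example one from above):

```ini
[default]
# Placeholder values -- use the endpoint and keys you were given.
host_base = clover.ciberterminal.net
host_bucket = clover.ciberterminal.net/%(bucket)
access_key = YOUR_ACCESS_KEY
secret_key = YOUR_SECRET_KEY
use_https = True
```

With that in place, creating your (uniquely named!) bucket and uploading to it is just ''s3cmd mb s3://my-project-dev'' followed by ''s3cmd put file.bin s3://my-project-dev/''.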