====== CEPH: Architecture Overview ======

^ Documentation ^^
^Name:| CEPH: Architecture Overview |
^Description:| |
^Modification date:| |
^Owner:| |
^Notify changes to:| |
^Tags:| ceph, oss |
^Escalate to:| |

====== Initial concepts ======

Ceph is an advanced Object Storage Service (OSS). \\
It provides easy-to-use and robust methods to store huge numbers of objects of any kind and expose them in multiple ways: S3/Swift, NFS, block device.\\
It's fast and scales out easily.\\

You can read/watch a basic introduction here: [[https://

====== Basic architecture ======

===== VERY BASIC (2 minutes approach) =====

From a very simple point of view, Ceph acts as a disk.\\
With the correct OS library (such as the kernel module for the block device) any host can use the Ceph cluster as if it were a local disk.\\
This client-side library is called ''librados''.\\
It interacts with ''RADOS'', the object store at the core of the cluster.\\
Continuing with this simple view: just as ext4, for example, has data blocks and journal blocks to maintain consistency, a Ceph cluster has two basic node types:
  * OSD: serves data
  * MON: keeps track of the OSD nodes, **DOES NOT SERVE DATA**
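This split of roles can be sketched in a few lines of Python. Everything below (class names, the placement rule) is illustrative and not the Ceph API; the point is only that monitors hand out the cluster map while reads and writes go straight to the OSDs.

```python
import zlib

class Monitor:
    """Tracks which OSDs exist; it never stores or serves object data."""
    def __init__(self, osd_ids):
        self.cluster_map = {"osds": list(osd_ids)}

    def get_cluster_map(self):
        return self.cluster_map

class OSD:
    """Stores and serves object data."""
    def __init__(self):
        self.store = {}

    def write(self, name, blob):
        self.store[name] = blob

    def read(self, name):
        return self.store[name]

class Client:
    """Fetches the map from a monitor once, then talks to OSDs directly."""
    def __init__(self, monitor, osds):
        self.osds = osds
        self.cluster_map = monitor.get_cluster_map()

    def put(self, name, blob):
        # Deterministic placement computed from the map
        # (a crude stand-in for CRUSH, described below).
        osd_id = zlib.crc32(name.encode()) % len(self.cluster_map["osds"])
        self.osds[osd_id].write(name, blob)
        return osd_id
```

Note that after the initial map fetch, the monitor is out of the data path entirely.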
So you'll have something like this:

<graphviz>
digraph G {
    compound=true;
    subgraph clusterR {
        label = "Ceph cluster";
        style=filled;
        fillcolor=gray;
        osd0 [shape=cylinder, label="osd0"];
        osd1 [shape=cylinder, label="osd1"];
        osd2 [shape=cylinder, label="osd2"];
        mon0 [shape=box3d, label="mon0"];
        osd3 [shape=cylinder, label="osd3"];
        osd4 [shape=cylinder, label="osd4"];
        osd5 [shape=cylinder, label="osd5"];
        mon1 [shape=box3d, label="mon1"];
    }
    subgraph clusterC {
        label = "Client";
        librados [shape=Mdiamond, label="librados"];
    }
    librados->osd0 [lhead=clusterR];
    librados->mon0;
}
</graphviz>

===== BASIC Architecture =====

Going deeper, you'll find that data placement across the OSD nodes is calculated by an algorithm called CRUSH (//Controlled Replication Under Scalable Hashing//), which:
  * Is pseudo-random but deterministic: every client computes the same placement
  * Makes a uniform data distribution
  * Results in a stable mapping of data (with very limited data migration on changes)
  * Has a rule-based configuration: replica counts and failure domains are defined by policy
\\
So when a client wants to write data to the Ceph cluster through RADOS, librados on the client side invokes CRUSH to calculate on which of the available OSDs to write the data.\\
\\
This results in a very strong architecture with no single point of failure, because you won't have one or more dedicated nodes taking care of metadata.\\
It's also really fast: you'll have //n// OSD servers performing read/write operations in parallel.\\
It's robust: if any OSD node fails, the data is replicated //N// times (where //N// is a config option) across other OSDs and will still be reachable through the CRUSH calculation.\\
Also, if any OSD fails, the monitors will re-map the cluster and the OSDs will re-replicate the data so it is again stored //N// times in the cluster.
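The placement properties above can be demonstrated with a small experiment. Real CRUSH is rule-based and hierarchy-aware; the sketch below substitutes plain rendezvous (highest-random-weight) hashing purely to show deterministic pseudo-random placement and the limited migration when an OSD disappears.

```python
import hashlib

def place(obj_name, osds, n_replicas=2):
    """Pick n_replicas OSDs for an object, deterministically."""
    def score(osd):
        # Every client hashing the same (object, osd) pair gets the same score,
        # so every client computes the same placement with no coordination.
        digest = hashlib.sha256(f"{obj_name}:{osd}".encode()).digest()
        return int.from_bytes(digest, "big")
    return sorted(osds, key=score, reverse=True)[:n_replicas]

osds = [f"osd{i}" for i in range(6)]
before = {f"obj{i}": place(f"obj{i}", osds) for i in range(1000)}

# Drop one OSD: only objects that had a replica on it get re-mapped.
survivors = [o for o in osds if o != "osd3"]
after = {name: place(name, survivors) for name in before}
moved = sum(before[n] != after[n] for n in before)
print(f"{moved} of 1000 placements changed")
```

With 6 OSDs and 2 replicas, roughly a third of the placements involve the removed OSD, so roughly a third are re-mapped; the rest are untouched, which is the "stable mapping" property from the list above.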
==== In a graph ====

<graphviz>
digraph G {
    compound=true;
    subgraph clusterR {
        label = "Ceph cluster";
        subgraph clusterOSD{
            style=filled;
            fillcolor=gray;
            label = "OSD side";
            osd0 [shape=cylinder, label="osd0"];
            osd1 [shape=cylinder, label="osd1"];
            osd2 [shape=cylinder, label="osd2"];
            osd3 [shape=cylinder, label="osd3"];
            osd4 [shape=cylinder, label="osd4"];
            osd5 [shape=cylinder, label="osd5"];
        }
        subgraph clusterMON{
            style=filled;
            fillcolor=gray;
            label = "MON side";
            mon0 [shape=box3d, label="mon0"];
            mon1 [shape=box3d, label="mon1"];
            mon0 -> mon1 [dir=both, label="vote"];
            { rank=same; mon0 mon1}
        }
        status [shape=Mdiamond, label="cluster status"];
        osd0 -> status [dir=both];
        osd1 -> status [dir=both];
        osd2 -> status [dir=both];
        osd3 -> status [dir=both];
        osd4 -> status [dir=both];
        osd5 -> status [dir=both];
        status->mon0;
        status->mon1;
    }
    subgraph clusterC {
        label = "Client";
        data [shape=oval, label="data"];
        subgraph clusterLIBRADOS {
            style=filled;
            fillcolor=gray;
            label = "librados";
            crush [shape=Mdiamond, label="CRUSH"];
        }
        data->crush;
    }
    crush->osd1 [lhead=clusterOSD];
    crush->osd4;
}
</graphviz>

===== CEPH as REST Object Storage =====

The **only** difference in this case is that there's a gateway (the RADOS Gateway, ''radosgw'') between the clients and the cluster, translating S3/Swift REST calls into RADOS operations.\\
That's all.\\
You'll have some overhead/latency added by the HTTP layer, but any plain REST client can use the cluster.\\
So if you take the previous graph, simplified, you'll have:

<graphviz>
digraph G {
    compound=true;
    rankdir=LR;
    subgraph clusterCLIENT {
        label = "Client";
        data [shape=oval, label="REST data"];
    }
    subgraph clusterG {
        label = "Gateway";
        subgraph clusterLIBRADOS {
            style=filled;
            fillcolor=gray;
            label = "librados";
            crush [shape=Mdiamond, label="CRUSH"];
        }
    }
    data->crush [lhead=clusterG];
    subgraph clusterR {
        label = "Ceph cluster";
        style=filled;
        fillcolor=gray;
        ceph [shape=cylinder, label="OSD/MON nodes"];
    }
    crush->ceph [lhead=clusterR];
}
</graphviz>
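The gateway's translation step can be illustrated with a toy request handler. This is not the radosgw code path, and all names below are invented for the sketch; it only shows REST verbs being mapped onto get/put operations against a RADOS-like pool.

```python
# Toy stand-in for the gateway role: translate "VERB /bucket/key"
# requests into object operations on a RADOS-like pool.
class ToyRadosPool:
    def __init__(self):
        self.objects = {}

class ToyGateway:
    """Maps REST-style requests onto get/put on the backing pool."""
    def __init__(self, pool):
        self.pool = pool

    def handle(self, verb, path, body=None):
        # "/bucket/key" becomes object name "bucket/key" in the pool.
        name = path.lstrip("/")
        if verb == "PUT":
            self.pool.objects[name] = body
            return 201, None
        if verb == "GET":
            if name not in self.pool.objects:
                return 404, None
            return 200, self.pool.objects[name]
        return 405, None
```

The client never sees librados or CRUSH; it only speaks HTTP, which is exactly the trade-off described above.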


===== CEPH as Filesystem Architecture =====

Again, the difference when using Ceph as a "filesystem" is the addition of a new node type: the Metadata Server (MDS).\\
The Metadata Server has a role similar to the Monitor:
  * it **DOES NOT SERVE DATA**
  * It holds the metadata database (directory tree, names, permissions); file data still flows directly between clients and OSDs
  * The Metadata nodes also replicate and vote between them.
  * The filesystem tree is split **dynamically** between all the metadata nodes.

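The dynamic split of the tree can be sketched as follows. The real MDS balancer is load-driven and far more sophisticated; this illustrative model (all names hypothetical) only shows that each subtree has exactly one authoritative MDS and that authority can move without touching file data on the OSDs.

```python
class MDSCluster:
    def __init__(self, mds_names):
        self.mds = list(mds_names)
        self.owner = {}   # subtree path -> owning MDS

    def authority(self, path):
        """Which MDS handles metadata operations for this path?"""
        subtree = "/" + path.strip("/").split("/")[0]
        if subtree not in self.owner:
            # Trivial placement: spread new subtrees round-robin.
            self.owner[subtree] = self.mds[len(self.owner) % len(self.mds)]
        return self.owner[subtree]

    def migrate(self, subtree, target):
        """Re-home a hot subtree; file data on the OSDs is untouched."""
        self.owner[subtree] = target
```

Every path under the same top-level directory resolves to the same MDS, and moving a subtree is a metadata-only operation.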
<graphviz>
digraph G {
    compound=true;
    subgraph clusterR {
        label = "Ceph cluster";
        style=filled;
        fillcolor=gray;
        subgraph clusterOSD{
            label = "OSD pool";
            osd0 [label="OSDs", shape=cylinder];
        }
        subgraph clusterMON{
            label = "MON pool";
            mon0 [label="MONs", shape=box3d];
        }
        subgraph clusterMETA{
            label = "Metadata pool";
            meta0 [shape=box3d, label="mds0"];
            meta1 [shape=box3d, label="mds1"];
            meta0->meta1 [dir=both, label="vote"];
            { rank=same; meta0 meta1}
        }
    }
    subgraph clusterC {
        label = "Client";
        file [label="file", shape=oval];
        file->cephFS;
        subgraph clusterLIBRADOS {
            style=filled;
            fillcolor=gray;
            label = "librados";
            cephFS [label="/mnt/cephfs", shape=Mdiamond];
        }
    }
    cephFS->osd0 [lhead=clusterOSD];
    cephFS->meta0 [lhead=clusterMETA];
    osd0->cephFS;
    meta0->cephFS;
    meta1->cephFS;
}
</graphviz>

====== Real Life Cluster: ePayments PRO/ ======

Node list:
  * Admin:
    * ACCLM-OSADM-001
  * OSD:
    * ACCLM-OSD-001
    * ACCLM-OSD-002
    * ACCLM-OSD-003
    * ACCLM-OSD-004
    * ACCLM-OSD-005
  * Gateways:
    * ACCLM-OSGW-001
    * ACCLM-OSGW-002
  * Monitors:
    * ACCLM-OSM-001
    * ACCLM-OSADM-001
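A client reaches the two gateway nodes above via DNS round-robin, and can fail over if one gateway is down. The sketch below is illustrative (the class and the ''send'' callable are hypothetical); only the node names are taken from the list above.

```python
class GatewayClient:
    """Rotates requests across gateways and fails over on connection errors."""
    def __init__(self, gateways):
        self.gateways = list(gateways)
        self._next = 0

    def request(self, send):
        # send(host) performs the actual HTTP call; a ConnectionError
        # makes us retry against the next gateway in the rotation.
        for _ in range(len(self.gateways)):
            host = self.gateways[self._next]
            self._next = (self._next + 1) % len(self.gateways)
            try:
                return send(host)
            except ConnectionError:
                continue
        raise ConnectionError("all gateways unreachable")

client = GatewayClient(["ACCLM-OSGW-001", "ACCLM-OSGW-002"])
```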

<graphviz>
digraph G {
    compound=true;
    subgraph clusterR {
        label = "Ceph cluster";
        subgraph clusterOSD{
            style=filled;
            fillcolor=gray;
            label = "OSD nodes";
            osd1 [label="ACCLM-OSD-001", shape=cylinder];
            osd2 [label="ACCLM-OSD-002", shape=cylinder];
            osd3 [label="ACCLM-OSD-003", shape=cylinder];
            osd4 [label="ACCLM-OSD-004", shape=cylinder];
            osd5 [label="ACCLM-OSD-005", shape=cylinder];
        }
        subgraph clusterMON{
            style=filled;
            fillcolor=gray;
            label = "Monitors";
            mon0 [label="ACCLM-OSM-001", shape=box3d];
            mon1 [label="ACCLM-OSADM-001", shape=box3d];
            { rank=same; mon0 mon1}
        }
        subgraph clusterG {
            label = "Gateways";
            style=filled;
            fillcolor=gray;
            gw01 [label="ACCLM-OSGW-001"];
            gw02 [label="ACCLM-OSGW-002"];
        }
    }
    subgraph clusterCLIENT {
        label = "Client";
        data [shape=oval, label="data"];
    }
    dns [label="DNS round-robin"];
    dns->gw01;
    dns->gw02;
    data->dns;
    data->gw01 [lhead=clusterG];
    gw01->osd1;
    gw01->osd2;
    gw01->osd3;
    gw01->osd4;
    gw01->osd5;
    gw02->osd1;
    gw02->osd2;
    gw02->osd3;
    gw02->osd4;
    gw02->osd5;
    osd1->mon0;
    osd1->mon1;
    osd2->mon0;
    osd2->mon1;
    osd3->mon0;
    osd3->mon1;
    osd4->mon0;
    osd4->mon1;
    osd5->mon0;
    osd5->mon1;
    mon0->mon1 [dir=both, label="vote"];
}
</graphviz>