====== Modifying CRUSH map ======
^ Documentation ^|
^Name:| Modifying CRUSH map |
^Description:| How to modify CRUSH map |
^Modification date:|11/06/2019|
^Owner:|dodger|
^Notify changes to:|Owner|
^Tags:| ceph, object storage|
^Escalate to:|The_fucking_bofh|
====== Information ======
The described process is an advanced one; proceed with care.
====== Pre-Requirements ======
* Knowledge of Ceph management
* Understanding of CRUSH
* Understanding of CRUSH mapping options
====== Instructions ======
===== Review the current CRUSH tree =====
ceph osd crush tree
Example:
avmlp-osm-001 ~ # ceph osd crush tree
ID CLASS WEIGHT TYPE NAME
-1 39.97986 root default
-3 1.99899 host avmlp-osd-001
0 hdd 1.99899 osd.0
-5 1.99899 host avmlp-osd-002
1 hdd 1.99899 osd.1
-7 1.99899 host avmlp-osd-003
2 hdd 1.99899 osd.2
-9 1.99899 host avmlp-osd-004
3 hdd 1.99899 osd.3
-11 1.99899 host avmlp-osd-005
4 hdd 1.99899 osd.4
-13 1.99899 host avmlp-osd-006
5 hdd 1.99899 osd.5
-15 1.99899 host avmlp-osd-007
6 hdd 1.99899 osd.6
-17 1.99899 host avmlp-osd-008
7 hdd 1.99899 osd.7
-19 1.99899 host avmlp-osd-009
8 hdd 1.99899 osd.8
-21 1.99899 host avmlp-osd-010
9 hdd 1.99899 osd.9
-23 1.99899 host avmlp-osd-011
10 hdd 1.99899 osd.10
-25 1.99899 host avmlp-osd-012
11 hdd 1.99899 osd.11
-27 1.99899 host avmlp-osd-013
12 hdd 1.99899 osd.12
-29 1.99899 host avmlp-osd-014
13 hdd 1.99899 osd.13
-31 1.99899 host avmlp-osd-015
14 hdd 1.99899 osd.14
-33 1.99899 host avmlp-osd-016
15 hdd 1.99899 osd.15
-35 1.99899 host avmlp-osd-017
16 hdd 1.99899 osd.16
-37 1.99899 host avmlp-osd-018
17 hdd 1.99899 osd.17
-39 1.99899 host avmlp-osd-019
18 hdd 1.99899 osd.18
-41 1.99899 host avmlp-osd-020
19 hdd 1.99899 osd.19
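As a sanity check, the root weight should be close to the sum of the host weights. CRUSH stores weights internally as fixed-point values, so the displayed root weight (39.97986) may differ slightly from the naive sum:

```shell
# 20 hosts x 1.99899 each; compare with the root weight shown above
# (small fixed-point rounding differences are expected).
awk 'BEGIN { printf "%.5f\n", 20 * 1.99899 }'
```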
===== Get the current map =====
This command exports the map from the running cluster; the resulting file is **binary**:
ceph osd getcrushmap -o crushmap.bin
===== Convert the map to text =====
crushtool -d crushmap.bin -o crushmap.txt
===== The map file =====
The file describes the hierarchy of the tree as follows:
* root
* datacenter (if defined)
* rack (if defined)
* host
\\
Each item is defined as in the following examples:
^ root |
root default {
# id -1 # do not change unnecessarily
# id -2 class hdd # do not change unnecessarily
# weight 39.980
alg straw2
hash 0 # rjenkins1
item itconic weight 0.000
item mediacloud weight 0.000
}
|
^ datacenter |
datacenter itconic {
# id -44 # do not change unnecessarily
# id -46 class hdd # do not change unnecessarily
# weight 0.000
alg straw2
hash 0 # rjenkins1
item nuciberterminal weight 0.000
}
|
So each "node" defines its child "nodes" with the ''item'' keyword.
Negative ids on entities **MUST BE REMOVED!!!** \\
That's why they are left commented out in the examples:
# id -44 # do not change unnecessarily
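Removing the negative ids by hand is error-prone on a large map. A minimal sed sketch, assuming the decompiled map uses the usual uncommented ''id -N'' lines (the bucket below is a hypothetical demo, not the real map):

```shell
# Build a small demo bucket (hypothetical names) to illustrate the edit.
cat > /tmp/bucket_demo.txt <<'EOF'
datacenter itconic {
	id -44		# do not change unnecessarily
	id -46 class hdd	# do not change unnecessarily
	alg straw2
	hash 0	# rjenkins1
	item nuciberterminal weight 0.000
}
EOF
# Delete every line that assigns a negative bucket id; crushtool
# assigns fresh ids when the map is recompiled.
sed -E '/^[[:space:]]*id[[:space:]]+-[0-9]+/d' /tmp/bucket_demo.txt
```

Run the same sed (with ''-i'') against the real ''crushmap.txt'' once the output looks right.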
Sample CRUSH map:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root
# buckets
host nuciberterminal_cluster {
# weight 19.990
alg straw2
hash 0 # rjenkins1
item osd.0 weight 1.999
item osd.2 weight 1.999
item osd.4 weight 1.999
item osd.6 weight 1.999
item osd.8 weight 1.999
item osd.10 weight 1.999
item osd.12 weight 1.999
item osd.14 weight 1.999
item osd.16 weight 1.999
item osd.18 weight 1.999
}
host nuciberterminal2_cluster {
# weight 19.990
alg straw2
hash 0 # rjenkins1
item osd.1 weight 1.999
item osd.3 weight 1.999
item osd.5 weight 1.999
item osd.7 weight 1.999
item osd.9 weight 1.999
item osd.11 weight 1.999
item osd.13 weight 1.999
item osd.15 weight 1.999
item osd.17 weight 1.999
item osd.19 weight 1.999
}
root default {
# weight 0.000
alg straw2
hash 0 # rjenkins1
item nuciberterminal_cluster weight 1.000
item nuciberterminal2_cluster weight 1.000
}
# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 3 type host
step emit
}
rule ciberterminalRule {
id 1
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 1 type host
step emit
}
# end crush map
===== Important CRUSH parameters =====
* ''step chooseleaf firstn'': from what I've observed, this value corresponds to the depth of your CRUSH tree:
* ''0'' Single Node Cluster
* ''1'' for a multi node cluster in a single rack
* ''2'' for a multi node, multi chassis cluster with multiple hosts in a chassis
* ''3'' for a multi node cluster with hosts across racks, etc.
* ''id'': should be removed from "buckets", not from "rules"
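For instance, if ''rack'' buckets were defined in the hierarchy, a rule spreading replicas across racks could look like the sketch below (not part of the map above; ''firstn 0'' means "choose as many as the pool's replica count"):

```
rule replicated_rack {
	id 2
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type rack
	step emit
}
```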
===== Compile the new map =====
crushtool -c crushmap.txt -o crushmap_new.bin
===== Checking map before applying =====
This step is **MANDATORY**!!!
Perform a test of OSD utilization for a given rule (''--rule <ruleId>'') and number of replicas (''--num-rep=<replicas>''):
crushtool --test -i crushmap_new.bin --show-utilization --rule <ruleId> --num-rep=<replicas>
Sample:
# crushtool --test -i crushmap_new.map --show-utilization --rule 1 --num-rep=4
rule 1 (ciberterminalRule), x = 0..1023, numrep = 4..4
rule 1 (ciberterminalRule) num_rep 4 result size == 1: 1024/1024
device 0: stored : 58 expected : 51.2
device 1: stored : 45 expected : 51.2
device 2: stored : 46 expected : 51.2
device 3: stored : 62 expected : 51.2
device 4: stored : 45 expected : 51.2
device 5: stored : 40 expected : 51.2
device 6: stored : 47 expected : 51.2
device 7: stored : 53 expected : 51.2
device 8: stored : 36 expected : 51.2
device 9: stored : 46 expected : 51.2
device 10: stored : 43 expected : 51.2
device 11: stored : 68 expected : 51.2
device 12: stored : 53 expected : 51.2
device 13: stored : 46 expected : 51.2
device 14: stored : 52 expected : 51.2
device 15: stored : 60 expected : 51.2
device 16: stored : 50 expected : 51.2
device 17: stored : 59 expected : 51.2
device 18: stored : 52 expected : 51.2
device 19: stored : 63 expected : 51.2
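The ''expected'' column is simply the total number of placements divided by the number of devices: 1024 inputs, a result size of 1 (this rule emits a single replica, per the ''result size == 1'' line above), spread over 20 OSDs:

```shell
# 1024 inputs x result size 1, over 20 OSDs = 51.2 placements per device.
awk 'BEGIN { printf "%.1f\n", (1024 * 1) / 20 }'
```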
Perform a CRUSH algorithm check, displaying how many retries are needed to place each object:
crushtool --test -i crushmap_new.bin --show-choose-tries --rule <ruleId> --num-rep=<replicas>
Sample (here, with rule "1", every object finds a placement on the first attempt, i.e. 0 retries):
# crushtool --test -i crushmap_new.map --show-choose-tries --rule 1 --num-rep=4
0: 2048
1: 0
2: 0
3: 0
4: 0
5: 0
6: 0
7: 0
8: 0
9: 0
10: 0
11: 0
12: 0
13: 0
14: 0
15: 0
16: 0
17: 0
18: 0
19: 0
20: 0
21: 0
22: 0
23: 0
24: 0
25: 0
26: 0
27: 0
28: 0
29: 0
30: 0
31: 0
32: 0
33: 0
34: 0
35: 0
36: 0
37: 0
38: 0
39: 0
40: 0
41: 0
42: 0
43: 0
44: 0
45: 0
46: 0
47: 0
48: 0
49: 0
Sample (here, with rule "0", CRUSH needs more than one pass through the algorithm to place some objects):
avmlp-osm-001 ~/crushmaps/crush_20190725 # crushtool --test -i crushmap_new.map --show-choose-tries --rule 0 --num-rep=4
0: 3580
1: 251
2: 134
3: 76
4: 28
5: 15
6: 7
7: 2
8: 1
9: 1
10: 0
11: 0
12: 1
13: 0
14: 0
15: 0
16: 0
17: 0
18: 0
19: 0
20: 0
21: 0
22: 0
23: 0
24: 0
25: 0
26: 0
27: 0
28: 0
29: 0
30: 0
31: 0
32: 0
33: 0
34: 0
35: 0
36: 0
37: 0
38: 0
39: 0
40: 0
41: 0
42: 0
43: 0
44: 0
45: 0
46: 0
47: 0
48: 0
49: 0
===== Apply the new map =====
ceph osd setcrushmap -i crushmap_new.bin
===== Review the Map =====
ceph osd crush tree
Sample:
avmlp-osm-001 ~/crushmaps/crush_20190725 # ceph osd crush tree
ID CLASS WEIGHT TYPE NAME
-3 2.00000 root default
-2 1.00000 host nuciberterminal2_cluster
1 hdd 1.99899 osd.1
3 hdd 1.99899 osd.3
5 hdd 1.99899 osd.5
7 hdd 1.99899 osd.7
9 hdd 1.99899 osd.9
11 hdd 1.99899 osd.11
13 hdd 1.99899 osd.13
15 hdd 1.99899 osd.15
17 hdd 1.99899 osd.17
19 hdd 1.99899 osd.19
-1 1.00000 host nuciberterminal_cluster
0 hdd 1.99899 osd.0
2 hdd 1.99899 osd.2
4 hdd 1.99899 osd.4
6 hdd 1.99899 osd.6
8 hdd 1.99899 osd.8
10 hdd 1.99899 osd.10
12 hdd 1.99899 osd.12
14 hdd 1.99899 osd.14
16 hdd 1.99899 osd.16
18 hdd 1.99899 osd.18
In graphic mode:\\
digraph G {
compound=true;
default [shape=rectangle,style=filled,fillcolor=coral];
cluster_1 [shape=rectangle,style=filled,fillcolor=coral];
osd1 [shape=cylinder,style=filled,fillcolor=coral];
osd3 [shape=cylinder,style=filled,fillcolor=coral];
osd5 [shape=cylinder,style=filled,fillcolor=coral];
osd7 [shape=cylinder,style=filled,fillcolor=coral];
osd9 [shape=cylinder,style=filled,fillcolor=coral];
cluster_2 [shape=rectangle,style=filled,fillcolor=coral];
osd0 [shape=cylinder,style=filled,fillcolor=coral];
osd2 [shape=cylinder,style=filled,fillcolor=coral];
osd4 [shape=cylinder,style=filled,fillcolor=coral];
osd6 [shape=cylinder,style=filled,fillcolor=coral];
osd8 [shape=cylinder,style=filled,fillcolor=coral];
default->cluster_1;
default->cluster_2;
cluster_1->osd1;
cluster_1->osd3;
cluster_1->osd5;
cluster_1->osd7;
cluster_1->osd9;
cluster_2->osd0;
cluster_2->osd2;
cluster_2->osd4;
cluster_2->osd6;
cluster_2->osd8;
}
===== Check pg location =====
This will show the placement of each PG in ''[OSDx,OSDy...]'':
ceph pg dump | egrep "^[0-9]" | awk '{print $17}'|less
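Column positions in ''ceph pg dump'' vary between Ceph releases, so check the header line first. The awk extraction can be sanity-checked against a canned, fabricated dump line (hypothetical values; here field 17 holds the OSD set):

```shell
# One fabricated "ceph pg dump" line to show what the awk extraction
# returns; the real column order depends on the Ceph release.
printf '1.0 50 0 0 0 0 209715200 0 0 3005 3005 active+clean 2019-07-25 12:00:00.000000 1234 5678 [3,7,12] 3 [3,7,12] 3\n' \
  | awk '{print $17}'
```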
====== External documentation ======
* [[http://www.sebastien-han.fr/blog/2012/12/07/ceph-2-speed-storage-with-crush/]]
* [[https://swamireddy.wordpress.com/2016/01/16/ceph-crush-map-for-host-and-rack/]]
* [[http://cephnotes.ksperis.com/blog/2015/02/02/crushmap-example-of-a-hierarchical-cluster-map/]]
* [[https://ceph.com/community/new-luminous-crush-device-classes/]]
* [[https://alanxelsys.com/ceph-hands-on-guide/]]