====== Modifying CRUSH map ======

^ Documentation ^^
^ Name: | Modifying CRUSH map |
^ Description: | How to modify the CRUSH map |
^ Modification date: | 11/06/2019 |
^ Owner: | dodger |
^ Notify changes to: | Owner |
^ Tags: | ceph, object storage |
^ Escalate to: | The_fucking_bofh |

====== Information ======

The process described here is an advanced one, proceed with care.

====== Pre-Requirements ======

  * Knowledge of Ceph management
  * Understanding of CRUSH
  * Understanding of CRUSH map options

====== Instructions ======

===== Review the current CRUSH tree =====

<code>
ceph osd crush tree
</code>

Example:

<code>
avmlp-osm-001 ~ # ceph osd crush tree
ID  CLASS WEIGHT   TYPE NAME
 -1       39.97986 root default
 -3        1.99899     host avmlp-osd-001
  0   hdd  1.99899         osd.0
 -5        1.99899     host avmlp-osd-002
  1   hdd  1.99899         osd.1
 -7        1.99899     host avmlp-osd-003
  2   hdd  1.99899         osd.2
 -9        1.99899     host avmlp-osd-004
  3   hdd  1.99899         osd.3
-11        1.99899     host avmlp-osd-005
  4   hdd  1.99899         osd.4
-13        1.99899     host avmlp-osd-006
  5   hdd  1.99899         osd.5
-15        1.99899     host avmlp-osd-007
  6   hdd  1.99899         osd.6
-17        1.99899     host avmlp-osd-008
  7   hdd  1.99899         osd.7
-19        1.99899     host avmlp-osd-009
  8   hdd  1.99899         osd.8
-21        1.99899     host avmlp-osd-010
  9   hdd  1.99899         osd.9
-23        1.99899     host avmlp-osd-011
 10   hdd  1.99899         osd.10
-25        1.99899     host avmlp-osd-012
 11   hdd  1.99899         osd.11
-27        1.99899     host avmlp-osd-013
 12   hdd  1.99899         osd.12
-29        1.99899     host avmlp-osd-014
 13   hdd  1.99899         osd.13
-31        1.99899     host avmlp-osd-015
 14   hdd  1.99899         osd.14
-33        1.99899     host avmlp-osd-016
 15   hdd  1.99899         osd.15
-35        1.99899     host avmlp-osd-017
 16   hdd  1.99899         osd.16
-37        1.99899     host avmlp-osd-018
 17   hdd  1.99899         osd.17
-39        1.99899     host avmlp-osd-019
 18   hdd  1.99899         osd.18
-41        1.99899     host avmlp-osd-020
 19   hdd  1.99899         osd.19
</code>

===== Get the current map =====

This command exports the map from the running cluster; the resulting file is **binary**:

<code>
ceph osd getcrushmap -o crushmap.bin
</code>

===== Convert the map to text =====

<code>
crushtool -d crushmap.bin -o crushmap.txt
</code>

===== The map file =====

The file describes the hierarchy of the tree as follows:

  * root
    * datacenter (if defined)
      * rack (if defined)
        * host

Each item is defined as in these examples.

A ''root'' bucket:

<code>
root default {
        # id -1           # do not change unnecessarily
        # id -2 class hdd # do not change unnecessarily
        # weight 39.980
        alg straw2
        hash 0  # rjenkins1
        item itconic weight 0.000
        item mediacloud weight 0.000
}
</code>

A ''datacenter'' bucket:

<code>
datacenter itconic {
        # id -44           # do not change unnecessarily
        # id -46 class hdd # do not change unnecessarily
        # weight 0.000
        alg straw2
        hash 0  # rjenkins1
        item nuciberterminal weight 0.000
}
</code>

Each "node" defines its child "nodes" with the ''item'' keyword.
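A ''rack'' bucket (when defined) follows exactly the same pattern: it lists its member hosts as ''item'' entries and is itself referenced as an ''item'' in its parent bucket. The following is only an illustrative sketch; the rack name is made up, and the host names and weights are borrowed from the tree above:

<code>
# Illustrative only: "rack1" is a hypothetical rack; hosts are taken from the tree above
rack rack1 {
        # id -50           # do not change unnecessarily (commented out, see note below)
        # weight 3.998
        alg straw2
        hash 0  # rjenkins1
        item avmlp-osd-001 weight 1.999
        item avmlp-osd-002 weight 1.999
}

# ...and the parent bucket references the rack the same way:
root default {
        # weight 3.998
        alg straw2
        hash 0  # rjenkins1
        item rack1 weight 3.998
}
</code>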
Negative ''id''s of the entities **MUST BE REMOVED!!!** \\
That's why I've left them commented out in the examples above:

<code>
        # id -44           # do not change unnecessarily
</code>

Sample CRUSH map:

<code>
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host nuciberterminal_cluster {
        # weight 19.990
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 1.999
        item osd.2 weight 1.999
        item osd.4 weight 1.999
        item osd.6 weight 1.999
        item osd.8 weight 1.999
        item osd.10 weight 1.999
        item osd.12 weight 1.999
        item osd.14 weight 1.999
        item osd.16 weight 1.999
        item osd.18 weight 1.999
}
host nuciberterminal2_cluster {
        # weight 19.990
        alg straw2
        hash 0  # rjenkins1
        item osd.1 weight 1.999
        item osd.3 weight 1.999
        item osd.5 weight 1.999
        item osd.7 weight 1.999
        item osd.9 weight 1.999
        item osd.11 weight 1.999
        item osd.13 weight 1.999
        item osd.15 weight 1.999
        item osd.17 weight 1.999
        item osd.19 weight 1.999
}
root default {
        # weight 0.000
        alg straw2
        hash 0  # rjenkins1
        item nuciberterminal_cluster weight 1.000
        item nuciberterminal2_cluster weight 1.000
}

# rules
rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 3 type host
        step emit
}
rule ciberterminalRule {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 1 type host
        step emit
}

# end crush map
</code>

===== Important CRUSH parameters =====

  * ''step chooseleaf firstn'': what I've found is that this value corresponds to the depth of your CRUSH tree:
    * ''0'': single-node cluster
    * ''1'': multi-node cluster in a single rack
    * ''2'': multi-node, multi-chassis cluster with multiple hosts per chassis
    * ''3'': multi-node cluster with hosts across racks, etc.
  * ''id'': should be removed from the "buckets", not from the "rules"

===== Compile the new map =====

<code>
crushtool -c crushmap.txt -o crushmap_new.bin
</code>

===== Checking map before applying =====

This step is **MANDATORY**!!!!
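A quick optional sanity check before running the placement tests is to decompile the freshly compiled binary and diff it against the edited text file; this sketch only reuses the files created in the previous steps:

<code>
# Decompile the newly compiled map and compare it with the edited text version;
# the buckets and rules should match what you wrote (whitespace may differ).
crushtool -d crushmap_new.bin -o crushmap_new.txt
diff crushmap.txt crushmap_new.txt
</code>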
Perform a test of OSD utilization with a given rule (''--rule <rule_id>'') and number of replicas (''--num-rep=<replicas>''):

<code>
crushtool --test -i crushmap_new.bin --show-utilization --rule <rule_id> --num-rep=<replicas>
</code>

Sample:

<code>
# crushtool --test -i crushmap_new.map --show-utilization --rule 1 --num-rep=4
rule 1 (ciberterminalRule), x = 0..1023, numrep = 4..4
rule 1 (ciberterminalRule) num_rep 4 result size == 1:  1024/1024
  device 0:   stored : 58   expected : 51.2
  device 1:   stored : 45   expected : 51.2
  device 2:   stored : 46   expected : 51.2
  device 3:   stored : 62   expected : 51.2
  device 4:   stored : 45   expected : 51.2
  device 5:   stored : 40   expected : 51.2
  device 6:   stored : 47   expected : 51.2
  device 7:   stored : 53   expected : 51.2
  device 8:   stored : 36   expected : 51.2
  device 9:   stored : 46   expected : 51.2
  device 10:  stored : 43   expected : 51.2
  device 11:  stored : 68   expected : 51.2
  device 12:  stored : 53   expected : 51.2
  device 13:  stored : 46   expected : 51.2
  device 14:  stored : 52   expected : 51.2
  device 15:  stored : 60   expected : 51.2
  device 16:  stored : 50   expected : 51.2
  device 17:  stored : 59   expected : 51.2
  device 18:  stored : 52   expected : 51.2
  device 19:  stored : 63   expected : 51.2
</code>

Perform a CRUSH algorithm check that displays how many tries were needed to place the test objects:

<code>
crushtool --test -i crushmap_new.bin --show-choose-tries --rule <rule_id> --num-rep=<replicas>
</code>

Sample (with rule "1", all objects find a placement on the first try, i.e. 0 retries):

<code>
# crushtool --test -i crushmap_new.map --show-choose-tries --rule 1 --num-rep=4
 0: 2048
 1: 0
 2: 0
 3: 0
 4: 0
 5: 0
 6: 0
 7: 0
 8: 0
 9: 0
10: 0
11: 0
12: 0
13: 0
14: 0
15: 0
16: 0
17: 0
18: 0
19: 0
20: 0
21: 0
22: 0
23: 0
24: 0
25: 0
26: 0
27: 0
28: 0
29: 0
30: 0
31: 0
32: 0
33: 0
34: 0
35: 0
36: 0
37: 0
38: 0
39: 0
40: 0
41: 0
42: 0
43: 0
44: 0
45: 0
46: 0
47: 0
48: 0
49: 0
</code>

Sample (with rule "0", CRUSH needs more than one pass through the algorithm to place some of the objects):

<code>
avmlp-osm-001 ~/crushmaps/crush_20190725 # crushtool --test -i crushmap_new.map --show-choose-tries --rule 0 --num-rep=4
 0: 3580
 1: 251
 2: 134
 3: 76
 4: 28
 5: 15
 6: 7
 7: 2
 8: 1
 9: 1
10: 0
11: 0
12: 1
13: 0
14: 0
15: 0
16: 0
17: 0
18: 0
19: 0
20: 0
21: 0
22: 0
23: 0
24: 0
25: 0
26: 0
27: 0
28: 0
29: 0
30: 0
31: 0
32: 0
33: 0
34: 0
35: 0
36: 0
37: 0
38: 0
39: 0
40: 0
41: 0
42: 0
43: 0
44: 0
45: 0
46: 0
47: 0
48: 0
49: 0
</code>

===== Apply the new map =====

<code>
ceph osd setcrushmap -i crushmap_new.bin
</code>

===== Review the map =====

<code>
ceph osd crush tree
</code>

Sample:

<code>
avmlp-osm-001 ~/crushmaps/crush_20190725 # ceph osd crush tree
ID CLASS WEIGHT  TYPE NAME
-3       2.00000 root default
-2       1.00000     host nuciberterminal2_cluster
 1   hdd 1.99899         osd.1
 3   hdd 1.99899         osd.3
 5   hdd 1.99899         osd.5
 7   hdd 1.99899         osd.7
 9   hdd 1.99899         osd.9
11   hdd 1.99899         osd.11
13   hdd 1.99899         osd.13
15   hdd 1.99899         osd.15
17   hdd 1.99899         osd.17
19   hdd 1.99899         osd.19
-1       1.00000     host nuciberterminal_cluster
 0   hdd 1.99899         osd.0
 2   hdd 1.99899         osd.2
 4   hdd 1.99899         osd.4
 6   hdd 1.99899         osd.6
 8   hdd 1.99899         osd.8
10   hdd 1.99899         osd.10
12   hdd 1.99899         osd.12
14   hdd 1.99899         osd.14
16   hdd 1.99899         osd.16
18   hdd 1.99899         osd.18
</code>

In graphic mode: \\

<code>
digraph G {
    compound=true;
    default   [shape=rectangle,style=filled,fillcolor=coral];
    cluster_1 [shape=rectangle,style=filled,fillcolor=coral];
    osd1 [shape=cylinder,style=filled,fillcolor=coral];
    osd3 [shape=cylinder,style=filled,fillcolor=coral];
    osd5 [shape=cylinder,style=filled,fillcolor=coral];
    osd7 [shape=cylinder,style=filled,fillcolor=coral];
    osd9 [shape=cylinder,style=filled,fillcolor=coral];
    cluster_2 [shape=rectangle,style=filled,fillcolor=coral];
    osd0 [shape=cylinder,style=filled,fillcolor=coral];
    osd2 [shape=cylinder,style=filled,fillcolor=coral];
    osd4 [shape=cylinder,style=filled,fillcolor=coral];
    osd6 [shape=cylinder,style=filled,fillcolor=coral];
    osd8 [shape=cylinder,style=filled,fillcolor=coral];
    default->cluster_1;
    default->cluster_2;
    cluster_1->osd1;
    cluster_1->osd3;
    cluster_1->osd5;
    cluster_1->osd7;
    cluster_1->osd9;
    cluster_2->osd0;
    cluster_2->osd2;
    cluster_2->osd4;
    cluster_2->osd6;
    cluster_2->osd8;
}
</code>
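Besides the tree, the rules defined in the map can also be double-checked from the running cluster. A small sketch, assuming the rule name used in the sample map above:

<code>
# List the CRUSH rules the cluster currently knows about
ceph osd crush rule ls

# Dump the full definition (id, type, steps) of a single rule
ceph osd crush rule dump ciberterminalRule
</code>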
===== Check PG location =====

This will show the placement of each PG as ''[OSDx,OSDy,...]'' (the column number may vary between Ceph versions):

<code>
ceph pg dump | egrep "^[0-9]" | awk '{print $17}' | less
</code>

====== External documentation ======

  * [[http://www.sebastien-han.fr/blog/2012/12/07/ceph-2-speed-storage-with-crush/]]
  * [[https://swamireddy.wordpress.com/2016/01/16/ceph-crush-map-for-host-and-rack/]]
  * [[http://cephnotes.ksperis.com/blog/2015/02/02/crushmap-example-of-a-hierarchical-cluster-map/]]
  * [[https://ceph.com/community/new-luminous-crush-device-classes/]]
  * [[https://alanxelsys.com/ceph-hands-on-guide/]]