Modifying CRUSH map

Documentation
Name: Modifying CRUSH map
Description: How to modify CRUSH map
Modification date: 11/06/2019
Owner: dodger
Notify changes to: Owner
Tags: ceph, object storage
Escalate to: The_fucking_bofh

Information

The described process is an advanced one; proceed with care.

Prerequisites

  • Knowledge of Ceph management
  • Understanding of CRUSH
  • Understanding of CRUSH map options

Instructions

Review the current CRUSH tree

ceph osd crush tree

Example:

avmlp-osm-001 ~ # ceph osd crush tree
ID  CLASS WEIGHT   TYPE NAME              
 -1       39.97986 root default           
 -3        1.99899     host avmlp-osd-001 
  0   hdd  1.99899         osd.0          
 -5        1.99899     host avmlp-osd-002 
  1   hdd  1.99899         osd.1          
 -7        1.99899     host avmlp-osd-003 
  2   hdd  1.99899         osd.2          
 -9        1.99899     host avmlp-osd-004 
  3   hdd  1.99899         osd.3          
-11        1.99899     host avmlp-osd-005 
  4   hdd  1.99899         osd.4          
-13        1.99899     host avmlp-osd-006 
  5   hdd  1.99899         osd.5          
-15        1.99899     host avmlp-osd-007 
  6   hdd  1.99899         osd.6          
-17        1.99899     host avmlp-osd-008 
  7   hdd  1.99899         osd.7          
-19        1.99899     host avmlp-osd-009 
  8   hdd  1.99899         osd.8          
-21        1.99899     host avmlp-osd-010 
  9   hdd  1.99899         osd.9          
-23        1.99899     host avmlp-osd-011 
 10   hdd  1.99899         osd.10         
-25        1.99899     host avmlp-osd-012 
 11   hdd  1.99899         osd.11         
-27        1.99899     host avmlp-osd-013 
 12   hdd  1.99899         osd.12         
-29        1.99899     host avmlp-osd-014 
 13   hdd  1.99899         osd.13         
-31        1.99899     host avmlp-osd-015 
 14   hdd  1.99899         osd.14         
-33        1.99899     host avmlp-osd-016 
 15   hdd  1.99899         osd.15         
-35        1.99899     host avmlp-osd-017 
 16   hdd  1.99899         osd.16         
-37        1.99899     host avmlp-osd-018 
 17   hdd  1.99899         osd.17         
-39        1.99899     host avmlp-osd-019 
 18   hdd  1.99899         osd.18         
-41        1.99899     host avmlp-osd-020 
 19   hdd  1.99899         osd.19
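
The same information can also be obtained in machine-readable form, which is easier to parse in scripts:

ceph osd crush dump      # full CRUSH map (buckets, rules, tunables) as JSON
ceph osd crush rule ls   # list only the rule names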

Get the current map

This command exports the map from the running cluster; the output file is binary:

ceph osd getcrushmap -o crushmap.bin

Convert the map to text

crushtool  -d crushmap.bin -o crushmap.txt
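
A typical working sequence keeps the binary dump, the decompiled text and an untouched copy to diff against after editing (the dated directory is just the convention used in the samples below):

mkdir -p ~/crushmaps/crush_$(date +%Y%m%d)
cd ~/crushmaps/crush_$(date +%Y%m%d)
ceph osd getcrushmap -o crushmap.bin          # binary dump of the live map
crushtool -d crushmap.bin -o crushmap.txt     # decompile to editable text
cp crushmap.txt crushmap.txt.orig             # pristine copy to diff against later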

The map file

The file has the hierarchy of the tree as follows:

  • root
  • datacenter (if defined)
  • rack (if defined)
  • host


Each item is defined as in the following examples:

root
root default {
#        id -1           # do not change unnecessarily
#        id -2 class hdd         # do not change unnecessarily
        # weight 39.980
        alg straw2
        hash 0  # rjenkins1
        item itconic weight 0.000
        item mediacloud weight 0.000
}
datacenter
datacenter itconic {
#        id -44          # do not change unnecessarily
#        id -46 class hdd                # do not change unnecessarily
        # weight 0.000
        alg straw2
        hash 0  # rjenkins1
        item nuciberterminal weight 0.000
}

So each “node” defines its child “nodes” with the item keyword.
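
For reference, the same hierarchy can also be adjusted at runtime with the CLI instead of editing the text file; the bucket name dc1 below is only a placeholder:

ceph osd crush add-bucket dc1 datacenter            # create an empty datacenter bucket
ceph osd crush move dc1 root=default                # place it under the root
ceph osd crush move avmlp-osd-001 datacenter=dc1    # re-parent a host under the new datacenter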

The negative ids of the entities MUST BE REMOVED!!!
That's why they have been left commented out:

#        id -44          # do not change unnecessarily
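
A possible one-liner to comment out every bucket id line in one go (assumes GNU sed; review the result before compiling — the positive ids inside rules are left alone):

sed -i.bak 's/^\([[:space:]]*id -\)/#\1/' crushmap.txt   # prefix "id -N" lines with '#', keeping a .bak copy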

Sample CRUSH map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
 
# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
 
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root
 
# buckets
 
host nuciberterminal_cluster {
        # weight 19.990
        alg straw2
        hash 0  # rjenkins1
 
        item osd.0 weight 1.999
        item osd.2 weight 1.999
        item osd.4 weight 1.999
        item osd.6 weight 1.999
        item osd.8 weight 1.999
        item osd.10 weight 1.999
        item osd.12 weight 1.999
        item osd.14 weight 1.999
        item osd.16 weight 1.999
        item osd.18 weight 1.999
}
 
host nuciberterminal2_cluster {
        # weight 19.990
        alg straw2
        hash 0  # rjenkins1
 
        item osd.1 weight 1.999
        item osd.3 weight 1.999
        item osd.5 weight 1.999
        item osd.7 weight 1.999
        item osd.9 weight 1.999
        item osd.11 weight 1.999
        item osd.13 weight 1.999
        item osd.15 weight 1.999
        item osd.17 weight 1.999
        item osd.19 weight 1.999
 
}
 
root default {
        # weight 0.000
        alg straw2
        hash 0  # rjenkins1
        item nuciberterminal_cluster weight 1.000
        item nuciberterminal2_cluster weight 1.000
}
 
# rules
rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 3 type host
        step emit
}
rule ciberterminalRule {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 1 type host
        step emit
}
 
# end crush map

Important CRUSH parameters

  • step chooseleaf firstn: What I've found is that this value corresponds to the DEPTH of your CRUSH tree:
    • 0 for a single node cluster
    • 1 for a multi node cluster in a single rack
    • 2 for a multi node, multi chassis cluster with multiple hosts in a chassis
    • 3 for a multi node cluster with hosts across racks, etc.
  • id: should be removed from “buckets”, not from “rules”
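
Building on the rules of the sample map above, a sketch of an additional rule that keeps every copy inside the itconic datacenter bucket (hypothetical: the bucket must exist in your map and the rule id must be unused) could be appended to the text map like this:

cat >> crushmap.txt << 'EOF'
rule itconic_only_rule {
        id 2
        type replicated
        min_size 1
        max_size 10
        step take itconic
        step chooseleaf firstn 3 type host
        step emit
}
EOF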

Compile the new map

crushtool -c crushmap.txt -o crushmap_new.bin
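
Optionally, decompile the freshly compiled map again to check what crushtool generated (typically only the auto-assigned ids should differ from your edited file):

crushtool -d crushmap_new.bin -o crushmap_new.txt
diff crushmap.txt crushmap_new.txt   # review the ids and weights that crushtool filled in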

Checking map before applying

This step is MANDATORY!!!!

Perform a test of OSD utilization, giving the rule id (--rule <ID>) and the number of replicas (--num-rep=<NUM_REP>):

crushtool --test -i crushmap_new.bin --show-utilization --rule <ID> --num-rep=<NUM_REP>

Sample:

# crushtool --test -i crushmap_new.map --show-utilization --rule 1 --num-rep=4
rule 1 (ciberterminalRule), x = 0..1023, numrep = 4..4
rule 1 (ciberterminalRule) num_rep 4 result size == 1:  1024/1024
  device 0:              stored : 58     expected : 51.2
  device 1:              stored : 45     expected : 51.2
  device 2:              stored : 46     expected : 51.2
  device 3:              stored : 62     expected : 51.2
  device 4:              stored : 45     expected : 51.2
  device 5:              stored : 40     expected : 51.2
  device 6:              stored : 47     expected : 51.2
  device 7:              stored : 53     expected : 51.2
  device 8:              stored : 36     expected : 51.2
  device 9:              stored : 46     expected : 51.2
  device 10:             stored : 43     expected : 51.2
  device 11:             stored : 68     expected : 51.2
  device 12:             stored : 53     expected : 51.2
  device 13:             stored : 46     expected : 51.2
  device 14:             stored : 52     expected : 51.2
  device 15:             stored : 60     expected : 51.2
  device 16:             stored : 50     expected : 51.2
  device 17:             stored : 59     expected : 51.2
  device 18:             stored : 52     expected : 51.2
  device 19:             stored : 63     expected : 51.2

Perform a CRUSH algorithm check, displaying how many attempts are needed to find a placement for each object:

crushtool --test -i crushmap_new.bin --show-choose-tries --rule <ID> --num-rep=<NUM_REP>

Sample (with rule “1”, every object finds a placement with 0 retries):

# crushtool --test -i crushmap_new.map --show-choose-tries --rule 1 --num-rep=4
 0:      2048
 1:         0
 2:         0
 3:         0
 4:         0
 5:         0
 6:         0
 7:         0
 8:         0
 9:         0
10:         0
11:         0
12:         0
13:         0
14:         0
15:         0
16:         0
17:         0
18:         0
19:         0
20:         0
21:         0
22:         0
23:         0
24:         0
25:         0
26:         0
27:         0
28:         0
29:         0
30:         0
31:         0
32:         0
33:         0
34:         0
35:         0
36:         0
37:         0
38:         0
39:         0
40:         0
41:         0
42:         0
43:         0
44:         0
45:         0
46:         0
47:         0
48:         0
49:         0

Sample (with rule “0”, CRUSH needs more than one pass through the algorithm for some placements):

avmlp-osm-001 ~/crushmaps/crush_20190725 # crushtool --test -i crushmap_new.map --show-choose-tries --rule 0 --num-rep=4
 0:      3580
 1:       251
 2:       134
 3:        76
 4:        28
 5:        15
 6:         7
 7:         2
 8:         1
 9:         1
10:         0
11:         0
12:         1
13:         0
14:         0
15:         0
16:         0
17:         0
18:         0
19:         0
20:         0
21:         0
22:         0
23:         0
24:         0
25:         0
26:         0
27:         0
28:         0
29:         0
30:         0
31:         0
32:         0
33:         0
34:         0
35:         0
36:         0
37:         0
38:         0
39:         0
40:         0
41:         0
42:         0
43:         0
44:         0
45:         0
46:         0
47:         0
48:         0
49:         0

Apply the new map

ceph osd setcrushmap -i crushmap_new.bin
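
A cautious variant keeps a snapshot of the live map first, so the change can be rolled back if needed (file names are illustrative):

ceph osd getcrushmap -o crushmap_backup.bin   # snapshot of the map currently in use
ceph osd setcrushmap -i crushmap_new.bin      # push the new map
# roll back if the result is not as expected:
# ceph osd setcrushmap -i crushmap_backup.bin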

Review the Map

ceph osd crush tree

Sample:

avmlp-osm-001 ~/crushmaps/crush_20190725 # ceph osd crush tree
ID CLASS WEIGHT  TYPE NAME                 
-3       2.00000 root default              
-2       1.00000     host nuciberterminal2_cluster 
 1   hdd 1.99899         osd.1             
 3   hdd 1.99899         osd.3             
 5   hdd 1.99899         osd.5             
 7   hdd 1.99899         osd.7             
 9   hdd 1.99899         osd.9             
11   hdd 1.99899         osd.11            
13   hdd 1.99899         osd.13            
15   hdd 1.99899         osd.15            
17   hdd 1.99899         osd.17            
19   hdd 1.99899         osd.19            
-1       1.00000     host nuciberterminal_cluster  
 0   hdd 1.99899         osd.0             
 2   hdd 1.99899         osd.2             
 4   hdd 1.99899         osd.4             
 6   hdd 1.99899         osd.6             
 8   hdd 1.99899         osd.8             
10   hdd 1.99899         osd.10            
12   hdd 1.99899         osd.12            
14   hdd 1.99899         osd.14            
16   hdd 1.99899         osd.16            
18   hdd 1.99899         osd.18     


Check pg location

This shows the placement of each PG as a list of OSDs ([OSDx,OSDy,…]):

ceph pg dump | egrep "^[0-9]" | awk '{print $17}'|less
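
The awk column number depends on the Ceph release, so these alternatives avoid counting columns (the PG id 1.0 is just an example):

ceph pg dump pgs_brief | less   # per-PG listing including the UP/ACTING OSD sets
ceph pg map 1.0                 # placement of a single PG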

