On this page I describe how, in my view, Ganglia is best used on IBM POWER5/6/7 systems.
Some things you should consider before you start:
- Hostnames
- To Ganglia a new hostname is a new machine
- It has to resolve IP addresses so please use DNS
- Stable IP addresses
- Make sure you are not going to change IP addresses
- Time and date
- Make sure the time zone, time and date are consistent on all machines in a cluster
- Use of NTP is highly recommended
- In short
- These are normal requirements on production machines
- On prototype and test systems, get this right before starting Ganglia
- Read the simple Ganglia How-To available for people setting up their first Ganglia system at:
Preferred setup
- Define each System p machine with all its LPARs as a separate cluster
- Use Unicast for network communication
- Define at least two LPARs per System p machine as gmond hosts for gmetad
- One would be sufficient; however, two are better for high availability
- Define those two LPARs in <confdir>/gmetad.conf as the information brokers for that machine
- From gmetad: Don’t poll the gmond hosts more frequently than every 15 seconds
- Know upfront what time intervals to use for sampling ("RRAs" stanza in <confdir>/gmetad.conf), see below
- Use my extensions for
- Ethernet adapters (including Shared Ethernet Adapters)
- Fibre Channel adapters
- Web interface
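The gmetad side of this setup can be sketched as a small Python helper that builds one data_source line per System p machine. The data_source syntax (cluster name, optional polling interval in seconds, then the host:port entries that gmetad tries in order) is the standard gmetad.conf form; the machine and LPAR names below are hypothetical, and the 15-second floor simply encodes the recommendation above.

```python
def data_source_line(cluster, hosts, poll_seconds=15, port=8649):
    """Build a gmetad.conf data_source line for one cluster.

    gmetad tries the listed hosts in order, so naming two LPARs
    gives it a fallback if the first one is unreachable.
    """
    if poll_seconds < 15:
        raise ValueError("polling more often than every 15 seconds is not recommended")
    if not hosts:
        raise ValueError("at least one gmond host is required")
    targets = " ".join(f"{host}:{port}" for host in hosts)
    return f'data_source "{cluster}" {poll_seconds} {targets}'


# Hypothetical machine and LPAR names, for illustration only.
print(data_source_line("p570-A", ["lpar1.example.com", "lpar2.example.com"]))
```

Each machine (cluster) gets its own line, so a site with ten System p machines would end up with ten data_source lines in gmetad.conf.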
Ganglia sampling intervals
Important to know:
- The sampling interval is defined in <confdir>/gmetad.conf.
- The "RRAs" stanza is used to define individual settings.
- The sampling settings are global.
- If no "RRAs" stanza is defined, a default configuration is used.
- For historic reasons all values are specified in intervals of 15 seconds.
Example: Default settings in Ganglia
RRAs "RRA:AVERAGE:0.5:1:240" "RRA:AVERAGE:0.5:24:240" "RRA:AVERAGE:0.5:168:240" "RRA:AVERAGE:0.5:672:240" "RRA:AVERAGE:0.5:5760:370"
Translation:
- Take 240 samples at 1 × 15 seconds intervals (used for display of hour)
- Take 240 samples at 24 × 15 seconds (= 6 minutes) intervals (used for display of day)
- Take 240 samples at 168 × 15 seconds (= 42 minutes) intervals (used for display of week)
- Take 240 samples at 672 × 15 seconds (= 168 minutes) intervals (used for display of month)
- Take 370 samples at 5760 × 15 seconds (= 24 hours) intervals (used for display of year)
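The translation above can be reproduced mechanically. The following is a minimal Python sketch, not part of Ganglia, that splits each RRA definition into its fields and derives the interval per stored value and the total time span covered; the function name is made up for illustration.

```python
BASE_STEP_SECONDS = 15  # all RRA step counts are multiples of this interval


def describe_rra(rra):
    """Return (seconds per row, number of rows, total seconds covered)."""
    keyword, _cf, _xff, steps, rows = rra.split(":")
    if keyword != "RRA":
        raise ValueError(f"not an RRA definition: {rra!r}")
    seconds_per_row = int(steps) * BASE_STEP_SECONDS
    return seconds_per_row, int(rows), seconds_per_row * int(rows)


defaults = [
    "RRA:AVERAGE:0.5:1:240",
    "RRA:AVERAGE:0.5:24:240",
    "RRA:AVERAGE:0.5:168:240",
    "RRA:AVERAGE:0.5:672:240",
    "RRA:AVERAGE:0.5:5760:370",
]
for rra in defaults:
    per_row, rows, span = describe_rra(rra)
    print(f"{rra}: {rows} rows, one every {per_row} s, covering {span / 86400:.2f} days")
```

The spans come out as 1 hour, 1 day, 7 days, 28 days and 370 days, which is why these five archives back the hour, day, week, month and year views.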
Example: 1-minute sampling for one year
RRAs "RRA:AVERAGE:0.5:4:525600"
Translation:
- Take 525600 samples at 4 × 15 seconds (= 1 minute) intervals
- 525600 = 60 (samples/hour) × 24 (hours) × 365 (days) × 1 (year)
Example: 1-minute sampling for 6 months, 5-minute sampling for 2 years
RRAs "RRA:AVERAGE:0.5:4:259200" \
"RRA:AVERAGE:0.5:20:210240"
Translation:
- Take 259200 samples at every 4 × 15 seconds (= 1 minute) intervals
- 259200 = 60 (samples/hour) × 24 (hours) × 30 (days) × 6 (months)
- Take 210240 samples at every 20 × 15 seconds (= 5 minutes) intervals
- 210240 = 12 (samples/hour) × 24 (hours) × 365 (days) × 2 (years)
Example: 15-second sampling for 1 day, 1-minute sampling for 2 months, 10-minute sampling for 1 year
RRAs "RRA:AVERAGE:0.5:1:5760" \
"RRA:AVERAGE:0.5:4:86400" \
"RRA:AVERAGE:0.5:40:52560"
Translation:
- Take 5760 samples at every 1 × 15 seconds intervals
- 5760 = 4 (samples/minute) × 60 (minutes/hour) × 24 (hours)
- Take 86400 samples at every 4 × 15 seconds (= 1 minute) intervals
- 86400 = 60 (samples/hour) × 24 (hours) × 30 (days) × 2 (months)
- Take 52560 samples at every 40 × 15 seconds (= 10 minutes) intervals
- 52560 = 6 (samples/hour) × 24 (hours) × 365 (days) × 1 (year)
Example: 1-minute sampling for 2 months, 5-minute sampling for 6 months, 15-minute sampling for 3 years
RRAs "RRA:AVERAGE:0.5:4:86400" \
"RRA:AVERAGE:0.5:20:51840" \
"RRA:AVERAGE:0.5:60:105120"
Translation:
- Take 86400 samples at every 4 × 15 seconds (= 1 minute) intervals
- 86400 = 60 (samples/hour) × 24 (hours) × 30 (days) × 2 (months)
- Take 51840 samples at every 20 × 15 seconds (= 5 minutes) intervals
- 51840 = 12 (samples/hour) × 24 (hours) × 30 (days) × 6 (months)
- Take 105120 samples at every 60 × 15 seconds (= 15 minutes) intervals
- 105120 = 4 (samples/hour) × 24 (hours) × 365 (days) × 3 (years)
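Going the other way, from a desired resolution and retention period to an RRA definition, is the same arithmetic in reverse. This is a minimal Python sketch with a hypothetical helper name; it assumes the AVERAGE consolidation function and the 0.5 factor used in all examples on this page, and it counts a month as 30 days as above.

```python
BASE_STEP_SECONDS = 15  # RRA steps are expressed in units of 15 seconds


def make_rra(resolution_seconds, retention_days, cf="AVERAGE", xff=0.5):
    """Build an RRA definition keeping one value per resolution_seconds
    for retention_days days."""
    if resolution_seconds <= 0 or resolution_seconds % BASE_STEP_SECONDS != 0:
        raise ValueError("resolution must be a positive multiple of 15 seconds")
    steps = resolution_seconds // BASE_STEP_SECONDS
    rows = retention_days * 86400 // resolution_seconds
    return f"RRA:{cf}:{xff}:{steps}:{rows}"


# 1-minute sampling for 2 months, 5-minute for 6 months, 15-minute for 3 years
rras = [make_rra(60, 60), make_rra(300, 180), make_rra(900, 3 * 365)]
print("RRAs " + " ".join(f'"{r}"' for r in rras))
```

The printed line matches the last example above, which is a quick way to double-check hand-computed row counts before editing gmetad.conf.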
Default ports
Ganglia by default uses the following ports:
- 8649
- The port gmond uses for
- Sending to other gmonds via UDP (udp_send_channel in <confdir>/gmond.conf)
- Receiving from other gmonds via UDP (udp_receive_channel in <confdir>/gmond.conf)
- Sending an XML description of the state of the cluster (tcp_accept_channel in <confdir>/gmond.conf)
- 8651
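- The port on which gmetad answers requests for the complete XML (xml_port in <confdir>/gmetad.conf).
- 8652
- The port on which gmetad answers interactive queries for XML (interactive_port in <confdir>/gmetad.conf).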
- This facility allows simple subtree and summation views of the XML tree.
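A quick way to check that a port is answering is to read the XML it serves. The sketch below is a minimal Python example with a hypothetical function name; it relies on gmond (port 8649) and gmetad (port 8651) writing their XML to any accepted TCP connection and then closing it. The host name in the commented-out call is a placeholder.

```python
import socket


def fetch_ganglia_xml(host, port=8649, timeout=10.0):
    """Connect over TCP and read until the peer closes the connection.

    Returns the raw bytes so the caller can honour the encoding
    declared in the XML header.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(65536)
            if not data:  # peer closed the connection: document complete
                break
            chunks.append(data)
    return b"".join(chunks)


# Requires a reachable gmond; the host name is a placeholder.
# xml = fetch_ganglia_xml("lpar1.example.com")
# print(xml[:200])
```

Port 8652 behaves differently: it expects a query to be sent first, so this read-only sketch does not apply to it.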
Shared Ethernet Adapter statistics
Question: How to monitor SEA statistics on the VIO server?
- The AIX libperfstat library does not seem to report any statistics about Ethernet adapters if there are no interfaces defined on that adapter.
- Interfaces are only seldom defined on SEAs.
- The AIX command 'entstat', however, provides these statistics.
Solution: Extension through gmetric via a shell script
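The actual extension is a shell script; the sketch below uses Python instead, purely to illustrate the idea of parsing entstat output and handing each value to gmetric. The sample text is an assumed excerpt of the two-column entstat layout (transmit on the left, receive on the right), the metric names are made up, and the commands are only printed, not executed. The gmetric options used (--name, --value, --type, --units) are standard ones.

```python
import re
import shlex

# Assumed excerpt of `entstat ent0` output; real output contains many more lines.
SAMPLE = """\
Transmit Statistics:                          Receive Statistics:
--------------------                          -------------------
Packets: 1200                                 Packets: 3400
Bytes: 560000                                 Bytes: 7800000
"""

PAIR = re.compile(r"^(Packets|Bytes):\s+(\d+)\s+(Packets|Bytes):\s+(\d+)\s*$")


def parse_entstat(text):
    """Extract transmit/receive packet and byte counters from entstat text."""
    counters = {}
    for line in text.splitlines():
        m = PAIR.match(line)
        if m and m.group(1) == m.group(3):
            name = m.group(1).lower()
            counters[f"tx_{name}"] = int(m.group(2))
            counters[f"rx_{name}"] = int(m.group(4))
    return counters


def gmetric_commands(adapter, counters):
    """Build one gmetric command line (as an argument list) per counter."""
    return [
        ["gmetric", "--name", f"{adapter}_{key}", "--value", str(value),
         "--type", "double", "--units", key.split("_")[1]]
        for key, value in counters.items()
    ]


for cmd in gmetric_commands("ent0", parse_entstat(SAMPLE)):
    print(shlex.join(cmd))  # hypothetical metric names; not executed here
```

This sketch reports raw counters to keep it short; a real script would more likely report rates computed from two successive readings.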
Fibre Channel statistics
Question: How to monitor Fibre Channel statistics on the VIO server?
- The AIX libperfstat library does not seem to report any statistics about Fibre Channel adapters if there are no disks attached to the adapter.
- Tapes, for instance, would be left out.
- The AIX command 'fcstat', however, provides these statistics.
Solution: Extension through gmetric via a shell script
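A script of this kind usually has to turn counter readings into rates before calling gmetric. The following minimal Python sketch shows that step only; it assumes the byte counters printed by fcstat are cumulative totals, which is an assumption here, and the function name and the sample numbers are made up.

```python
def rate_per_second(previous, current, interval_seconds):
    """Turn two readings of a cumulative counter into a per-second rate.

    Returns None when the counter went backwards (for example after an
    adapter reset), so the caller can skip that sample.
    """
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    delta = current - previous
    if delta < 0:
        return None
    return delta / interval_seconds


# Two hypothetical readings of an input-bytes counter taken 60 seconds apart.
print(rate_per_second(1_000_000, 1_600_000, 60))
```

Skipping a sample after a reset avoids a large bogus spike in the graphs; the next reading pair gives a valid rate again.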
Enhanced web interface
Please take a look at http://www.perzl.org/ganglia/webinterface.html.