Heartbeat reporting

Dejan Muhamedagic
<dmuhamedagic@suse.de>
v1.0

hb_report is a utility to collect all information relevant to Heartbeat over
a given period of time.

Quick start

Run hb_report on one of the nodes or on the host which serves as a central log
server. Run hb_report without parameters to see usage.

A few examples:

 1. Last night during the backup several warnings were encountered (logserver
    is the log host):

    logserver# hb_report -f 3:00 -t 4:00 /tmp/report

    collects everything from all nodes from 3am to 4am last night. The files
    are stored in /tmp/report and compressed to a tarball /tmp/report.tar.gz.

 2. Just found a problem during testing:

    node1# date                          # note the current time
    node1# /etc/init.d/heartbeat start
    node1# nasty_command_that_breaks_things
    node1# sleep 120                     # wait for the cluster to settle
    node1# hb_report -f time /tmp/hb1    # replace "time" with the time noted


Introduction

Managing clusters is cumbersome. Heartbeat v2 with its numerous configuration
files and multi-node clusters just adds to the complexity. No wonder then that
most problem reports were less than optimal. This is an attempt to rectify that
situation and make life easier for both the users and the developers.

On security

hb_report is a fairly complex program. As some of you are probably going to
run it as root, let us state a few important things you should keep in mind:

 1. Don't run hb_report as root! It is fairly simple to set things up in such
    a way that root access is not needed. I won't go into details, just
    stress that all information collected should be readable by accounts
    belonging to the haclient group.

 2. If you still have to run this as root, then at least don't use the -C
    option.

 3. Of course, every possible precaution has been taken not to disturb
    processes, or to touch or remove files outside the given destination
    directory. If you (by mistake) specify an existing directory, hb_report
    will bail out early. Specifying a relative path won't work either.

The final product of hb_report is a tarball. However, the destination
directory is not removed on any node unless the user specifies -C. If you're
too lazy to clean up after the previous run, do yourself a favour and just
supply a new destination directory. You've been warned. If you worry about
the space used, just put all your directories under /tmp and set up a cron
job to remove those directories once a week:

        for d in /tmp/*; do
                # skip anything which is not a directory
                test -d "$d" ||
                        continue
                # consider only directories which look like hb_report output
                test -f "$d/description.txt" || test -f "$d/.env" ||
                        continue
                grep -qs 'By: hb_report' "$d/description.txt" ||
                        grep -qs '^UNIQUE_MSG=Mark' "$d/.env" ||
                        continue
                rm -r "$d"
        done
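
If you save that loop as a script, a weekly cron job can run it for you. A
sketch of the crontab entry (the script path is hypothetical):

        # run the report cleanup every Sunday at 4am
        0 4 * * 0    /usr/local/sbin/clean_hb_reports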

Mode of operation

Cluster data collection is straightforward: just run the same procedure on all
nodes and collect the reports. There is, apart from many small ones, one large
complication: a central syslog destination. So, in order to allow collection
to be fully automated, we sometimes have to run the procedure on the log host
too. In fact, if there is a log host, then the best way is to run hb_report
there.

We use ssh for the remote program invocation. Even though it is possible to run
hb_report without ssh by doing a more menial job, the overall user experience
is much better if ssh works. Anyway, how else do you manage your cluster?
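
Conceptually, the automated collection works something like the sketch below.
This is a simplification, not hb_report's actual code; the node names and the
from_time variable are placeholders:

        # run the single-node collector on every node, then fetch the
        # resulting tarballs back to the central host
        for node in node1 node2 logserver; do
                ssh "$node" "hb_report -DS -f '$from_time' /tmp/report"
                scp "$node:/tmp/report.tar.gz" "/tmp/report-$node.tar.gz"
        done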

Another ssh-related point: in case your security policy proscribes
loghost-to-cluster ssh communication, you'll have to copy the log file to one
of the nodes and point hb_report to it.

Prerequisites

 1. ssh

    This is not strictly required, but you won't regret having password-less
    ssh. It is not too difficult to set up and will save you a lot of time.
    If you can't have it, for example because your security policy does not
    allow such a thing, or you just prefer menial work, then you will have to
    resort to the semi-manual, semi-automated report generation. See below
    for instructions.

    If you need to supply a password for your passphrase/login, then please
    use the -u option.

 2. Times

    In order to find files and messages in the given period and to parse the
    -f and -t options, hb_report uses perl and one of the Date::Parse or
    Date::Manip perl modules. Note that you need only one of the two.
    Furthermore, on nodes which have no logs and where you don't run
    hb_report directly, no date parsing is necessary. In other words, if you
    run this on a log host, then you don't need these perl modules on the
    cluster nodes.

    On rpm-based distributions you can find Date::Parse in perl-TimeDate,
    and on Debian and its derivatives in libtimedate-perl.

 3. Core dumps

    To backtrace core dumps, gdb is needed along with the Heartbeat packages
    containing the debugging info. The debug info packages may be installed
    at the time the report is created; see the sketch just after this list.
    Let's hope that you will need this really seldom.
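
Producing such a backtrace by hand looks roughly like the following sketch
(the binary and core file paths are examples and vary by platform and
release):

        # print backtraces for all threads found in the core dump
        gdb -batch -ex 'thread apply all bt' \
                /usr/lib/heartbeat/crmd \
                /var/lib/heartbeat/cores/hacluster/core.1234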

What is in the report

 1. Heartbeat related

      □ heartbeat version/release information

      □ heartbeat configuration (CIB, ha.cf, logd.cf)

      □ heartbeat status (output from crm_mon, crm_verify, ccm_tool)

      □ pengine transition graphs (if any)

      □ backtraces of core dumps (if any)

      □ heartbeat logs (if any)

 2. System related

      □ general platform information (uname, arch, distribution)

      □ system statistics (uptime, top, ps, netstat -i, arp)

 3. User created :)

      □ problem description (template to be edited)

 4. Generated

      □ problem analysis (generated)

It is preferred that Heartbeat is running at the time of the report, but it
is not absolutely required. hb_report will also do a quick analysis of the
collected information.

Times

Specifying times can at times be a nuisance. That is why we have chosen to use
one of the perl modules: they do allow a certain freedom when talking dates.
You can either read the instructions at the Date::Parse examples page, or just
rely on common sense and try stuff like:

3:00          (today at 3am)
15:00         (today at 3pm)
2007/9/1 2pm  (September 1st at 2pm)

hb_report will (probably) complain if it can't figure out what you mean.
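
If in doubt, you can check beforehand how a date string parses. With
Date::Parse, for instance, a quick test looks like this; it prints a Unix
timestamp on success and an empty line on failure:

# perl -MDate::Parse -le 'print str2time("2007/9/1 2pm")'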

Try to delimit the event as closely as possible in order to reduce the size
of the report, but still leave a minute or two around it for good measure.

Note that the -f option is not optional. And don't forget to quote dates when
they contain spaces.

It is also possible to extract a CTS test. Just prefix the test number with
cts: in the -f option.
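
For example, to extract the period covering CTS test 42 (the test number is,
of course, just an example):

# hb_report -f cts:42 /tmp/cts42_report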

Should I send all this to the rest of the Internet?

We make an effort to remove sensitive data from the Heartbeat configuration
(CIB, ha.cf, and transition graphs). However, you have to tell us what is
sensitive! Use the -p option to specify additional regular expressions to match
variable names which may contain information you don't want to leak. For
example:

# hb_report -f 18:00 -p "user.*" -p "secret.*" /var/tmp/report

By default, we look for variable names matching "pass.*" and for the
stonith_host ha.cf directive.

Logs and other files are not filtered. Please filter them yourself if
necessary.

Logs

It may be tricky to find syslog logs. The scheme used is to log a unique
message on all nodes and then look it up in the usual syslog locations. This
procedure is not foolproof, in particular if the syslog files are in a
non-standard directory. We look in /var/log, /var/logs, /var/syslog,
/var/adm, /var/log/ha, and /var/log/cluster. In case we can't find the logs,
please supply their location:

# hb_report -f 5pm -l /var/log/cluster1/ha-log -S /tmp/report_node1

If you have different log locations on different nodes, well, perhaps you'd
like to make them the same and make life easier for everybody.

The log files are collected from all hosts where they are found. In case your
syslog is configured to log both to the log server and to local files, and
hb_report is run on the log server, you will end up with multiple logs with
the same content.

Files starting with "ha-" are preferred: in case syslog sends messages to
more than one file and one of them is named ha-log or ha-debug, those will be
favoured over syslog or messages.

If there is no separate log for Heartbeat, possibly unrelated messages from
other programs are included. We don't filter logs, just pick a segment for the
period you specified.

NB: Don't have a central log host? Read the CTS README and set one up.
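
As a rough illustration, with the classic syslogd this amounts to a single
line in /etc/syslog.conf on every cluster node (the syntax differs for
rsyslog or syslog-ng; logserver is an example host name):

        # forward all messages to the central log host
        *.*             @logserver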

Manual report collection

So, your ssh doesn't work. In that case, you will have to run this procedure on
all nodes. Use -S so that we don't bother with ssh:

# hb_report -f 5:20pm -t 5:30pm -S /tmp/report_node1

If you also have a log host which is not in the cluster, then you'll have to
copy the log to one of the nodes and tell us where it is:

# hb_report -f 5:20pm -t 5:30pm -l /var/tmp/ha-log -S /tmp/report_node1

Furthermore, to prevent hb_report from asking you to edit the report to
describe the problem on every node, use -D on all but one:

# hb_report -f 5:20pm -t 5:30pm -DS /tmp/report_node1
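
Finally, gather the tarballs from all nodes on one machine (any file transfer
will do; scp with a password, for instance) and send them along. The host and
file names here are examples:

# scp node2:/tmp/report_node2.tar.gz node3:/tmp/report_node3.tar.gz /tmp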

If you reconsider and want the ssh setup, take a look at the CTS README file
for instructions.

Analysis

The point of the analysis is to extract the most important information from
what is probably several thousand lines of text. Perhaps this would more
properly be named report review, as it is rather simple, but let's pretend
that we are doing something utterly sophisticated.

The analysis consists of the following:

  • compare files coming from different nodes; if they are equal, make one
    copy in the top level directory, remove the duplicates, and create soft
    links instead (see the sketch after this list)

  • print errors, warnings, and lines matching -L patterns from logs

  • report if there were coredumps and by whom

  • report crm_verify results
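
The first step works roughly like the sketch below. This is a simplification
of the idea, not hb_report's actual code; the file and node names are
examples:

        # if ha.cf is identical on both nodes, keep a single copy at the
        # top level and replace the per-node copies with soft links
        if cmp -s node1/ha.cf node2/ha.cf; then
                mv node1/ha.cf ha.cf
                rm node2/ha.cf
                ln -s ../ha.cf node1/ha.cf
                ln -s ../ha.cf node2/ha.cf
        fi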

The goods

 1. Common

      □ ha-log (if found on the log host)

      □ description.txt (template and user report)

      □ analysis.txt

 2. Per node

      □ ha.cf

      □ logd.cf

      □ ha-log (if found)

      □ cib.xml (cibadmin -Ql or cp if Heartbeat is not running)

      □ ccm_tool.txt (ccm_tool -p)

      □ crm_mon.txt (crm_mon -1)

      □ crm_verify.txt (crm_verify -V)

      □ pengine/ (only on DC, directory with pengine transitions)

      □ sysinfo.txt (static info)

      □ sysstats.txt (dynamic info)

      □ backtraces.txt (if coredumps found)

      □ DC (well…)

      □ RUNNING or STOPPED

Last updated 29-Nov-2007 16:12:02 CEST
