Add a guide to explain where to look for information when monitoring bazel

This is the correct fix for #114.

Change-Id: Ie2ab1e54beee6b4a11f102457842c36c4e715184

commit: 9584fe457a7580c3dd3072c6c1e15b4a034e2bb5
author: Damien Martin-Guillerez <dmarting@google.com> Wed Aug 23 18:16:53 2017 +0200
committer: Damien Martin-guillerez <dmarting@google.com> Fri Sep 01 10:37:02 2017 +0000
tree: 5b8695824b724b0ef17c46bdc5eeea2be86ab431
parent: 75b8520f441036e4f79e0478ea9a9365cc5e2125
diff --git a/README.md b/README.md
index 567a3e9..226bae6 100644
--- a/README.md
+++ b/README.md

@@ -13,7 +13,6 @@
 * [user](docs/user.md): explains how to use the CI system for a Bazel
   contributor.
 
-
 ## For maintainers of Bazel continuous integration system
 
 Make sure you have a Bazel installed with a recent enough version of

diff --git a/docs/bazel-monitoring.md b/docs/bazel-monitoring.md
new file mode 100644
index 0000000..605643f
--- /dev/null
+++ b/docs/bazel-monitoring.md

@@ -0,0 +1,56 @@
+# How to monitor for Bazel regressions
+
+This guide explains what the Bazel build sheriff should monitor to catch
+regressions in Bazel.
+
+# The dashboard
+
+A general dashboard giving a quick view of overall health is available
+at [http://ci.bazel.io/view/Dashboard/](http://ci.bazel.io/view/Dashboard/).
+This dashboard represents the health of all important builds that run on the CI
+system.
+
+There are 2 kinds of projects we monitor:
+
+  - Projects owned by the core Bazel team:
+    [bazel-tests](http://ci.bazel.io/job/bazel-tests),
+    [bazel-docker-tests](http://ci.bazel.io/job/bazel-docker-tests),
+    [Tutorial](http://ci.bazel.io/job/Tutorial),
+    [benchmark](http://ci.bazel.io/job/benchmark),
+    [nightly](http://ci.bazel.io/job/bazel/job/nightly) and
+    [release](http://ci.bazel.io/job/bazel/job/release)
+  - Projects built using Bazel such as repositories on the bazelbuild GitHub
+    organisation, TensorFlow or Gerrit.
+
+If projects owned by the Bazel team are not green, then the Bazel team needs to
+investigate and fix them as soon as possible to keep our build green.
+
+The health of the other projects depends on their owners. The Bazel team's
+responsibility is only to report issues and, if a build stays broken for too
+long (more than a week), to deactivate the project. Those projects are useful
+for the Bazel team to detect regressions in Bazel itself.
+
+# Triaging failures
+
+The build sheriff should monitor the output of the various types of jobs:
+
+  - [Global tests](user.md#global-tests) (e.g.
+    [nightly](http://ci.bazel.io/job/bazel/job/nightly) and
+    [release](http://ci.bazel.io/job/bazel/job/release)).
+    The [release](http://ci.bazel.io/job/bazel/job/release) job runs at every
+    push and is always green for non-release pushes. The
+    [nightly](http://ci.bazel.io/job/bazel/job/nightly) job runs every night
+    and can be re-run on demand using the run button in Jenkins (you need
+    to be logged in). See the [user guide](user.md#global-tests) on how to
+    interpret the results. A serious failure in a global test should be filed
+    to [bazelbuild/bazel](https://github.com/bazelbuild/bazel/issues/new) as
+    a breakage, and as a release blocker if it occurs on the release job.
+  - [benchmark](http://ci.bazel.io/job/benchmark) should be investigated
+    by looking at the output logs. If the job fails with a Java error or a
+    build error, an issue should be filed to
+    [bazelbuild/bazel](https://github.com/bazelbuild/bazel/issues/new); otherwise it
+    should be filed to [bazelbuild/continuous-integration](https://github.com/bazelbuild/continuous-integration/issues/new).
+  - [postsubmits](user.md#postsubmit), which are all the other monitored
+    jobs. A postsubmit failure should be reported to the project owning the
+    job. If a failure persists for too long, the job should be partially or totally
+    deactivated to maintain the clarity of global tests.

diff --git a/docs/user.md b/docs/user.md
index 0574096..ac58f8e 100644
--- a/docs/user.md
+++ b/docs/user.md

@@ -12,6 +12,7 @@
 tested on Bazel CI, go see the
 [project owner documentation](owner.md).
 
+<a name="postsubmit"></a>
 ## Postsubmit
 
 Every project that runs on Bazel CI is run on postsubmit. It is done
@@ -19,6 +20,36 @@
 [bazel-io](https://github.com/bazel-io) has write access to the
 repository.
 
+The result of a build can be either:
+
+  - Success (job is green).
+  - Unstable (job is yellow). Some tests failed. Blue Ocean View<sup>1</sup>
+    will show the failing platforms in the Pipeline view, and the list of failing tests
+    in the Tests view.
+  - Failed (job is red). Compilation failed, or configuration files are broken.
+    Blue Ocean View<sup>1</sup> will show the build breakage. If it does not,
+    fall back to the full console log.
+
+<sup>1</sup> Open Blue Ocean view with the button on the left of the job view.
+
+Tips:
+
+  - Test logs are available under the artifacts list (`<joburl>/artifact`, e.g.
+    http://ci.bazel.io/job/bazel-tests/lastCompletedBuild/artifact/).
+  - Flaky tests can be analyzed with the Test Results Analyzer (available in
+    the side menu of the normal job view), which shows history per test.
+  - The "Pipeline Steps" button in the side menu of a job view lets you examine
+    each step of the Jenkins pipeline one by one. Looking for the enclosing
+    workspace or node start step of another step gives you access to the
+    workspace of that step.
+
+Current limitations:
+
+  - The Jenkins Blue Ocean UI has no good way to mark an unstable step, so if a
+    platform stage fails without a clearly failing sub-step, look for the last shell
+    step in the platform stage view.
+  - Tests are not ordered by platform in the test view.
+
 ## Presubmit
 
 The Bazel CI is able to run presubmit tests of changes from GitHub and
@@ -47,18 +78,32 @@
 `Verified+1` or `Verified-1` depending on the result of the test. To
 retrigger a test, simply reset the `Presubmit-Ready` label.
 
+The output should be read the same way as the output of the [postsubmit](#postsubmit).
+
+<a name="global-tests"></a>
 ## Global tests
 
 In addition to pre- and postsubmit tests for an individual change, the
 Bazel CI performs a "global test" which builds Bazel from a branch, and
-uses that build of Bazel to run all the other jobs on the Bazel CI. It
-then produces a report comparing the global test results of this build
-of Bazel with the global test results from the latest release of
-Bazel.
+uses that build of Bazel to run all the other jobs on the Bazel CI.
 
-This report can be found at `http://ci.bazel.io/job/Global/job/pipeline/<buildNumber>/Downstream_projects/`,
-for instance for the last run it will be at
-[http://ci.bazel.io/job/Global/job/pipeline/lastBuild/Downstream_projects/](http://ci.bazel.io/job/Global/job/pipeline/lastBuild/Downstream_projects/).
+If it succeeds in building Bazel (that is, if it is not red), it produces a report
+comparing the global test results of this build of Bazel with the global
+test results from the latest release of Bazel.
+
+This report can be found at `http://ci.bazel.io/job/bazel/job/<nightly|release|presubmit>/<buildNumber>/Downstream_projects/`,
+for instance for the last nightly run it will be at
+[http://ci.bazel.io/job/bazel/job/nightly/lastBuild/Downstream_projects/](http://ci.bazel.io/job/bazel/job/nightly/lastBuild/Downstream_projects/).
+
+The way to read that report is:
+
+  - Every newly failing job is problematic and likely indicates a
+    failure due to a Bazel change. It causes the build to be unstable (yellow).
+  - Every already failing job means that the job result is no worse. It is
+    generally safe to ignore those failures, but we should aim at having 0 of
+    them to make sure we do not hide a problem (a build that now breaks because of
+    Bazel while it was already broken because of a project issue).
+  - Every passing job can be safely ignored.
+  - Every passing job can be safely ignored.
 
 ## Release process
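
Note on the report-reading rules added to `docs/user.md` in this change: they amount to a three-way classification of each downstream job, which can be sketched in a few lines. The Python below is only an illustration and is not part of this commit or of the Bazel CI. The names `Verdict`, `classify`, and `triage`, and the representation of results as maps from job name to a pass/fail boolean, are assumptions made for the example. Treating a job with no baseline result as "passed at release" is likewise an assumption, chosen so that its failure is flagged rather than hidden.

```python
from enum import Enum
from typing import Dict, List


class Verdict(Enum):
    NEWLY_FAILING = "newly failing"      # likely caused by a Bazel change
    ALREADY_FAILING = "already failing"  # no worse than at the latest release
    PASSING = "passing"                  # can be safely ignored


def classify(passed_now: bool, passed_at_release: bool) -> Verdict:
    """Classify one job by comparing this build with the latest release."""
    if passed_now:
        return Verdict.PASSING
    if passed_at_release:
        return Verdict.NEWLY_FAILING
    return Verdict.ALREADY_FAILING


def triage(current: Dict[str, bool], release: Dict[str, bool]) -> Dict[Verdict, List[str]]:
    """Group job names by verdict (hypothetical input shape: job name -> passed)."""
    groups: Dict[Verdict, List[str]] = {v: [] for v in Verdict}
    for job in sorted(current):
        # Assumption: a job absent from the baseline counts as having passed,
        # so a failure in it is reported as newly failing.
        baseline = release.get(job, True)
        groups[classify(current[job], baseline)].append(job)
    return groups


def build_is_unstable(groups: Dict[Verdict, List[str]]) -> bool:
    """Per the guide, newly failing jobs make the build unstable (yellow)."""
    return bool(groups[Verdict.NEWLY_FAILING])
```

Under these assumptions, a sheriff would look first at the `NEWLY_FAILING` group, and would track the `ALREADY_FAILING` group with the aim of bringing it down to zero.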