Add a guide to explain where to look for information when monitoring Bazel
This is the correct fix for #114.
Change-Id: Ie2ab1e54beee6b4a11f102457842c36c4e715184
diff --git a/README.md b/README.md
index 567a3e9..226bae6 100644
--- a/README.md
+++ b/README.md
@@ -13,7 +13,6 @@
* [user](docs/user.md): explains how to use the CI system for a Bazel
contributor.
-
## For maintainers of Bazel continuous integration system
Make sure you have a Bazel installed with a recent enough version of
diff --git a/docs/bazel-monitoring.md b/docs/bazel-monitoring.md
new file mode 100644
index 0000000..605643f
--- /dev/null
+++ b/docs/bazel-monitoring.md
@@ -0,0 +1,56 @@
+# How to monitor for Bazel regressions
+
+This guide describes what the Bazel build sheriff
+should monitor.
+
+# The dashboard
+
+A dashboard giving a quick view of the general health of the builds is available
+at [http://ci.bazel.io/view/Dashboard/](http://ci.bazel.io/view/Dashboard/).
+This dashboard represents the health of all important builds that run on the CI
+system.
+
+There are two kinds of projects we monitor:
+
+ - Projects owned by the core Bazel team:
+ [bazel-tests](http://ci.bazel.io/job/bazel-tests),
+ [bazel-docker-tests](http://ci.bazel.io/job/bazel-docker-tests),
+ [Tutorial](http://ci.bazel.io/job/Tutorial),
+ [benchmark](http://ci.bazel.io/job/benchmark),
+ [nightly](http://ci.bazel.io/job/bazel/job/nightly) and
+ [release](http://ci.bazel.io/job/bazel/job/release)
+ - Projects built using Bazel such as repositories on the bazelbuild GitHub
+ organisation, TensorFlow or Gerrit.
+
+If projects owned by the Bazel team are not green, the Bazel team needs to
+investigate and fix them as soon as possible to keep our build green.
+
+The health of the other projects depends on their owners. The Bazel team's
+responsibility is only to report issues and, if a build stays broken for too
+long (more than a week), to deactivate the project. Those projects are useful
+for the Bazel team to catch regressions in Bazel itself.
+
+# Triaging failures
+
+The build sheriff should monitor the output of the various types of jobs:
+
+ - [Global tests](user.md#global-tests) (e.g.
+ [nightly](http://ci.bazel.io/job/bazel/job/nightly) and
+ [release](http://ci.bazel.io/job/bazel/job/release)).
+ The [release](http://ci.bazel.io/job/bazel/job/release) job runs on every
+ push and is always green for non-release pushes. The
+ [nightly](http://ci.bazel.io/job/bazel/job/nightly) job runs every night
+ and can be re-run on demand using the run button in Jenkins (you need
+ to be logged in). See the [user guide](user.md#global-tests) on how to
+ interpret the results. A serious failure in a global test should be filed
+ to [bazelbuild/bazel](https://github.com/bazelbuild/bazel/issues/new) as
+ a breakage, and as a release blocker if it occurs on the release job.
+ - [benchmark](http://ci.bazel.io/job/benchmark) failures should be investigated
+ by looking at the output logs. If the job fails with a Java error or a
+ build error, an issue should be filed to
+ [bazelbuild/bazel](https://github.com/bazelbuild/bazel/issues/new); otherwise it
+ should be filed to [bazelbuild/continuous-integration](https://github.com/bazelbuild/continuous-integration/issues/new).
+ - [postsubmits](user.md#postsubmit), which are all the other monitored
+ jobs. A postsubmit failure should be reported to the project owning the
+ job. If a failure persists for too long, the job should be partially or totally
+ deactivated to maintain the clarity of global tests.
diff --git a/docs/user.md b/docs/user.md
index 0574096..ac58f8e 100644
--- a/docs/user.md
+++ b/docs/user.md
@@ -12,6 +12,7 @@
tested on Bazel CI, go see the
[project owner documentation](owner.md).
+<a name="postsubmit"></a>
## Postsubmit
Every project that runs on Bazel CI is run on postsubmit. It is done
@@ -19,6 +20,36 @@
[bazel-io](https://github.com/bazel-io) has write access to the
repository.
+The result of a build can be either:
+
+ - Success (job is green).
+ - Unstable (job is yellow). Some tests failed. Blue Ocean View<sup>1</sup>
+ will show the failing platforms in the Pipeline view, and the list of failing
+ tests in the Tests view.
+ - Failed (job is red). Compilation failed, or configuration files are broken.
+ Blue Ocean View<sup>1</sup> will show the build breakage. If it does not,
+ fall back to the full console log.
+
+<sup>1</sup> Open Blue Ocean view with the button on the left of the job view.
+
+Tips:
+
+ - Test logs are available under the artifacts list (`<joburl>/artifact`, e.g.
+ http://ci.bazel.io/job/bazel-tests/lastCompletedBuild/artifact/).
+ - Flaky tests can be analyzed with the Test Results Analyzer (available in
+ the normal job view on the side menu), which shows the history per test.
+ - The "Pipeline Steps" button on the side menu of a job view lets you examine
+ each step of the Jenkins pipeline one by one. Finding the enclosing
+ workspace or node start step of a given step gives you access to the
+ workspace of that step.
+
+Current limitations:
+
+ - The Jenkins Blue Ocean UI has no good way to mark an unstable step, so if a
+ platform stage fails without a clearly failing sub-step, look at the last shell
+ step in the platform stage view.
+ - Tests are not ordered by platform in the test view.
+
## Presubmit
The Bazel CI is able to run presubmit tests of changes from GitHub and
@@ -47,18 +78,32 @@
`Verified+1` or `Verified-1` depending on the result of the test. To
retrigger a test, simply reset the `Presubmit-Ready` label.
+The output should be read the same way as the output of a [postsubmit](#postsubmit).
+
+<a name="global-tests"></a>
## Global tests
In addition to pre- and postsubmit tests for an individual change, the
Bazel CI performs a "global test" which builds Bazel from a branch, and
-uses that build of Bazel to run all the other jobs on the Bazel CI. It
-then produces a report comparing the global test results of this build
-of Bazel with the global test results from the latest release of
-Bazel.
+uses that build of Bazel to run all the other jobs on the Bazel CI.
-This report can be found at `http://ci.bazel.io/job/Global/job/pipeline/<buildNumber>/Downstream_projects/`,
-for instance for the last run it will be at
-[http://ci.bazel.io/job/Global/job/pipeline/lastBuild/Downstream_projects/](http://ci.bazel.io/job/Global/job/pipeline/lastBuild/Downstream_projects/).
+If it succeeds in building Bazel (that is, if the job is not red), it produces
+a report comparing the global test results of this build of Bazel with the
+global test results from the latest release of Bazel.
+
+This report can be found at `http://ci.bazel.io/job/bazel/job/<nightly|release|presubmit>/<buildNumber>/Downstream_projects/`,
+for instance, for the last nightly run it will be at
+[http://ci.bazel.io/job/bazel/job/nightly/lastBuild/Downstream_projects/](http://ci.bazel.io/job/bazel/job/nightly/lastBuild/Downstream_projects/).
+
+The way to read that report is:
+
+ - Every newly failing job is problematic and likely indicates a
+ failure due to a Bazel change. It causes the build to be unstable (yellow).
+ - Every already failing job means that the job result is no worse. It is
+ generally safe to ignore those failures, but we should aim at having zero of
+ them to make sure we do not hide problems (a build that now breaks because of
+ Bazel while it was already broken because of a project issue).
+ - Every passing job can be safely ignored.
## Release process