When Blaze detects it's under memory pressure in the middle of a Skyframe evaluation, it drops all the temporary `SkyKeyComputeState` instances.

This fully addresses the performance problem that I described in the description of https://github.com/bazelbuild/bazel/commit/ed279ab4fa2d4356be00b54266f56edcc5aeae78 (I summarized this problem here in a comment in `SkyframeHighWaterMarkLimiter.java`), thus allowing us to make full use of `SkyKeyComputeState` in heavy hitters in the Blaze-on-Skyframe codebase and get strict CPU and wall time wins in all situations. Look forward to a future CL that does this for `ConfiguredTargetFunction` and `AspectFunction` :)
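
(For illustration, here's a rough sketch of the dropping mechanism. The names below are hypothetical and this is not the actual API of `SkyframeHighWaterMarkLimiter` or the `SkyKeyComputeState` machinery; it just shows the shape of the idea.)

```java
// Illustrative sketch only; class and method names are hypothetical.
final class HighWaterMarkLimiterSketch {
  private final int thresholdPercent; // e.g. the new --skyframe_high_water_mark_threshold value
  private final ComputeStateDropper dropper;

  HighWaterMarkLimiterSketch(int thresholdPercent, ComputeStateDropper dropper) {
    this.thresholdPercent = thresholdPercent;
    this.dropper = dropper;
  }

  /** Called with heap stats observed after a GC (e.g. from a GC notification listener). */
  void handleGc(long usedHeapBytes, long maxHeapBytes) {
    long usedPercent = (usedHeapBytes * 100) / maxHeapBytes;
    if (usedPercent >= thresholdPercent) {
      // Memory pressure: drop all temporary SkyKeyComputeState. SkyFunctions simply
      // reconstruct their state on their next restart, so this trades some CPU for heap.
      dropper.dropAllComputeState();
    }
  }

  /** Stand-in for the "remove everything" operation on the compute-state storage. */
  interface ComputeStateDropper {
    void dropAllComputeState();
  }
}
```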

Alternatives considered
-----

After reimplementing `ConfiguredTargetFunction` and `AspectFunction` to use `SkyKeyComputeState`, I tried several different approaches for addressing this problem. None of them worked.

0. Do nothing.

On various benchmarks where Blaze was not memory constrained, I was getting 3-5% reductions in both CPU time and wall time. Good! This was the original motivation for this project.

On various benchmarks where Blaze **was** memory constrained, I was getting 7-10% increases in both CPU time and wall time. Bad! This was unacceptable.

1. Have `SkyKeyComputeStateManager` use a bounded cache of `SkyKeyComputeState` instances (including having different per-`SkyFunctionName` bounds).

I spent a while on this. For a specific target, I was able to come up with precise bounds on the cache sizes that let me mitigate the regression when Blaze was memory constrained. There was still a regression though, and so this approach is definitely inferior to what I ended up doing in this CL.

There's also the massive issue of: How would we set the cache bounds to achieve good results in all situations? This is clearly impossible: Each combo of `(blaze invocation, blaze Xmx)` would want different choices for the cache bounds. Even if I were able to come up with some cache bounds that are good for common situations internally at Google, there's no way they would be good for arbitrary Bazel usage elsewhere in the world. And even ignoring that, any static cache bounds would definitely grow stale as Blaze's implementation changes and the code it's being asked to build changes.

Therefore, this approach was unacceptable.
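
(For illustration, the bounded cache shape described in (1) is essentially a strongly-held, size-bounded LRU map, roughly like the sketch below. This is not the code I benchmarked: the generic parameters stand in for `SkyKey` and `SkyKeyComputeState`, real usage would need to be thread-safe, and there'd be one such cache per `SkyFunctionName`.)

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only. The hard part isn't the data structure; it's choosing maxEntries
// well for every (blaze invocation, blaze Xmx) combination, as discussed above.
final class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  BoundedLruCache(int maxEntries) {
    super(/* initialCapacity= */ 16, /* loadFactor= */ 0.75f, /* accessOrder= */ true);
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    // Evict the least recently used entry once we exceed the bound.
    return size() > maxEntries;
  }
}
```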

2. Have `SkyKeyComputeStateManager` use a soft cache.

This sounds good in theory, since `SoftReference`s are collected only when the JVM thinks it's under memory pressure. But there are two problems:

  * There's a GC performance penalty to using `SoftReference`s since the JVM has to scan all of them to check reachability.
  * Some usages of Blaze inside Google tweak JVM settings controlling `SoftReference` collection behavior. All of those usages would have to be re-tweaked for this new use, and it may not be possible to reconcile the existing tuning with the new one. Also, some usages of Bazel might be similarly tweaking JVM settings. I didn't think it'd be good to tell Bazel users "Hi, Bazel now crucially uses SoftReference. You'll have to tweak JVM settings yourself to get decent performance".

In my benchmarks, this approach was a lot worse than (0) when Blaze was not memory constrained (something like a net reduction of only 1-2% in CPU and wall time), and it still yielded an overall regression when Blaze was memory constrained. So this was still unacceptable.
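
(For illustration, the "soft cache" in (2) amounts to a map whose values are held through `SoftReference`s, along the lines of this sketch; again, not the code I benchmarked.)

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only. Values are softly reachable, so the JVM may clear them whenever it
// decides it is under memory pressure, and callers must be prepared to recompute.
final class SoftValueCache<K, V> {
  private final ConcurrentHashMap<K, SoftReference<V>> map = new ConcurrentHashMap<>();

  void put(K key, V value) {
    map.put(key, new SoftReference<>(value));
  }

  V getIfPresent(K key) {
    SoftReference<V> ref = map.get(key);
    if (ref == null) {
      return null;
    }
    V value = ref.get();
    if (value == null) {
      // The GC cleared the referent; drop the now-useless entry.
      map.remove(key, ref);
    }
    return value;
  }
}
```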

3. Same as (2) but with `WeakReference`s.

The JVM clears `WeakReference`s even more aggressively than `SoftReference`s. So, while the GC performance penalty noted above doesn't apply, our usage would definitely suffer since `SkyKeyComputeStateManager` itself would effectively be thrashing: state would routinely get collected and then immediately need to be recomputed.

My benchmarks confirmed this intuition. The results weren't good enough.

4. Combo of (1) & (2)/(3). That is, when entries get evicted from the soft/weak cache(s), put them in strong bounded cache(s), respecting the LRU policy.

I spent a while on this and was able to get decent benchmark results, but nothing close to what I got with the approach I ended up doing in this CL.

5. Same as (4) but with "the approach in this CL" rather than (2)/(3).

I didn't try this because I was very pleased with the positive results of this CL. And I was still concerned about the downsides of (1).

If done well, though, I could imagine this being a further improvement in the future.

PiperOrigin-RevId: 418537075
diff --git a/src/main/java/com/google/devtools/build/lib/runtime/CommonCommandOptions.java b/src/main/java/com/google/devtools/build/lib/runtime/CommonCommandOptions.java
index 3e15816..b481b8c 100644
--- a/src/main/java/com/google/devtools/build/lib/runtime/CommonCommandOptions.java
+++ b/src/main/java/com/google/devtools/build/lib/runtime/CommonCommandOptions.java
@@ -337,6 +337,20 @@
   public int oomMoreEagerlyThreshold;
 
   @Option(
+      name = "skyframe_high_water_mark_threshold",
+      defaultValue = "85",
+      documentationCategory = OptionDocumentationCategory.BUILD_TIME_OPTIMIZATION,
+      effectTags = {OptionEffectTag.HOST_MACHINE_RESOURCE_OPTIMIZATIONS},
+      help =
+          "Flag for advanced configuration of Bazel's internal Skyframe engine. If Bazel detects"
+              + " its retained heap percentage usage is at least this threshold, it will drop"
+              + " unnecessary temporary Skyframe state. Tweaking this may let you mitigate wall"
+              + " time impact of GC thrashing, when the GC thrashing is (i) caused by the memory"
+              + " usage of this temporary state and (ii) more costly than reconstituting the state"
+              + " when it is needed.")
+  public int skyframeHighWaterMarkMemoryThreshold;
+
+  @Option(
       name = "heap_dump_on_oom",
       defaultValue = "false",
       documentationCategory = OptionDocumentationCategory.LOGGING,