commit	eb905bde4920954aefee0f018488d054cceaa772	[log] [tgz]
author	Xudong Yang <wyverald@gmail.com>	Fri May 03 17:03:40 2024 -0700
committer	Copybara-Service <copybara-worker@google.com>	Fri May 03 17:05:06 2024 -0700
tree	47d7250604aad238e664f0c428ab571ccf6a43ca
parent	9795f9a8594a00711d5c581c8ccd68ea61490eac [diff]

Use a Semaphore for signaling in repo fetching

I managed to reproduce some deadlocks during repo fetching with virtual worker threads. One notable trigger was some _other_ repo failing to fetch, which seems to cause Skyframe to try to interrupt other concurrent repo fetches. This _might_ be the cause for a deadlock where we submit a task to the worker executor service, but the task never starts running before it gets cancelled, which causes us to wait forever for a `DONE` signal that never comes. (The worker task puts a `DONE` signal in the queue in a `finally` block -- but we don't even enter the `try`.)

This PR improves the situation in various ways:

1. Instead of using a `SynchronousQueue` for the signal queue, we now use a Semaphore for signaling. Semaphores have the crucial property that releasing a permit (ie. incrementing the counter) does not block, and thus cannot be interrupted. This means that the worker thread can now reliably send signals the host thread, even when it's interrupted.

2. Instead of using two signals for `DONE` and `RESTART`, we just use the one semaphore for both signals, and rely on `workerFuture.isDone()` to tell whether the worker has finished or is waiting for a fresh Environment.

3. Instead of signaling `DONE` in a `finally` block, we now use a `ListenableFuture` and signal to the semaphore in the worker future's listener. This makes sure that the signaling is performed _after_ the worker future's status changes, and safeguards against the case where the submitted task never starts running before it gets cancelled.

4. Instead of waiting for a `DONE` signal (or, in the new setup, the signal semaphore) to make sure the worker thread has finished, we now hold on to the executor service, which offers a `close()` method that essentially uninterruptibly waits for any scheduled tasks to terminate, whether or not they have started running. (@justinhorvitz had suggested a similar idea before.) To make sure distinct repo fetches don't interfere with each other, we start a separate worker executor service for each repo fetch instead of making everyone share the same worker executor service. (This is recommended for virtual threads; see https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html#GUID-C0FEE349-D998-4C9D-B032-E01D06BE55F2 for example.)

And because I now create a separate worker executor service for each repo fetch, it doesn't really make sense to use this for platform threads anymore. So setting `--experimental_worker_for_repo_fetching` to any but `off` will cause us to use virtual threads.

Related: https://github.com/bazelbuild/bazel/issues/22003

Fixes https://github.com/bazelbuild/bazel/issues/21712.

Closes #22100.

PiperOrigin-RevId: 630534733
Change-Id: If989bf9cae76abb1579a2b1de896df8e5a63b88d

4 files changed

tree: 47d7250604aad238e664f0c428ab571ccf6a43ca

README.md

Bazel

{Fast, Correct} - Choose two

Build and test software of any size, quickly and reliably.

Speed up your builds and tests: Bazel rebuilds only what is necessary. With advanced local and distributed caching, optimized dependency analysis and parallel execution, you get fast and incremental builds.
One tool, multiple languages: Build and test Java, C++, Android, iOS, Go, and a wide variety of other language platforms. Bazel runs on Windows, macOS, and Linux.
Scalable: Bazel helps you scale your organization, codebase, and continuous integration solution. It handles codebases of any size, in multiple repositories or a huge monorepo.
Extensible to your needs: Easily add support for new languages and platforms with Bazel's familiar extension language. Share and re-use language rules written by the growing Bazel community.

Getting Started

Documentation

Reporting a Vulnerability

To report a security issue, please email security@bazel.build with a description of the issue, the steps you took to create the issue, affected versions, and, if known, mitigations for the issue. Our vulnerability management team will respond within 3 working days of your email. If the issue is confirmed as a vulnerability, we will open a Security Advisory. This project follows a 90 day disclosure timeline.

Contributing to Bazel

See CONTRIBUTING.md