Project: /_project.yaml
Book: /_book.yaml
# Dependency Management
{% include "_buttons.html" %}
In looking through the previous pages, one theme repeats over and over: managing
your own code is fairly straightforward, but managing its dependencies is much
more difficult. There are all sorts of dependencies: sometimes there's a
dependency on a task (such as "push the documentation before I mark a release as
complete"), and sometimes there's a dependency on an artifact (such as "I need
to have the latest version of the computer vision library to build my code").
Sometimes, you have internal dependencies on another part of your codebase, and
sometimes you have external dependencies on code or data owned by another team
(either in your organization or a third party). But in any case, the idea of "I
need that before I can have this" is something that recurs repeatedly in the
design of build systems, and managing dependencies is perhaps the most
fundamental job of a build system.
## Dealing with Modules and Dependencies
Projects that use artifact-based build systems like Bazel are broken into a set
of modules, with modules expressing dependencies on one another via `BUILD`
files. Proper organization of these modules and dependencies can have a huge
effect on both the performance of the build system and how much work it takes to
maintain.
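As a minimal sketch of what this looks like (the package paths and target names
here are hypothetical), each module declares what it builds and which other
modules it depends on in its `BUILD` file:

```python
# java/com/example/app/BUILD
java_library(
    name = "app",
    srcs = ["App.java"],
    deps = [
        # An internal dependency on a module defined elsewhere
        # in the same repository.
        "//java/com/example/util",
    ],
)
```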
## Using Fine-Grained Modules and the 1:1:1 Rule
The first question that comes up when structuring an artifact-based build is
deciding how much functionality an individual module should encompass. In Bazel,
a _module_ is represented by a target specifying a buildable unit like a
`java_library` or a `go_binary`. At one extreme, the entire project could be
contained in a single module by putting one `BUILD` file at the root and
recursively globbing together all of that project's source files. At the other
extreme, nearly every source file could be made into its own module, effectively
requiring each file to list in a `BUILD` file every other file it depends on.
Most projects fall somewhere between these extremes, and the choice involves a
trade-off between performance and maintainability. Using a single module for the
entire project might mean that you never need to touch the `BUILD` file except
when adding an external dependency, but it means that the build system must
always build the entire project all at once. This means that it won't be able to
parallelize or distribute parts of the build, nor will it be able to cache parts
that it's already built. One-module-per-file is the opposite: the build system
has the maximum flexibility in caching and scheduling steps of the build, but
engineers need to expend more effort maintaining lists of dependencies whenever
they change which files reference which.
Though the exact granularity varies by language (and often even within a
language), Google tends to favor significantly smaller modules than one might
typically write in a task-based build system. A typical production binary at
Google often depends on tens of thousands of targets, and even a moderate-sized
team can own several hundred targets within its codebase. For languages like
Java that have a strong built-in notion of packaging, each directory usually
contains a single package, target, and `BUILD` file (Pants, another build system
based on Bazel, calls this the 1:1:1 rule). Languages with weaker packaging
conventions frequently define multiple targets per `BUILD` file.
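For instance, a directory following the 1:1:1 rule might look like the following
sketch (paths are illustrative): one directory, one Java package, one target,
one `BUILD` file:

```python
# java/com/example/search/BUILD
# This directory contains exactly one Java package
# (com.example.search) and one buildable target.
java_library(
    name = "search",
    srcs = glob(["*.java"]),
    deps = [
        "//java/com/example/index",
    ],
)
```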
The benefits of smaller build targets really begin to show at scale because they
lead to faster distributed builds and a less frequent need to rebuild targets.
The advantages become even more compelling after testing enters the picture, as
finer-grained targets mean that the build system can be much smarter about
running only a limited subset of tests that could be affected by any given
change. Because Google believes in the systemic benefits of using smaller
targets, we've made some strides in mitigating the downside by investing in
tooling to automatically manage `BUILD` files to avoid burdening developers.
Some of these tools, such as `buildifier` and `buildozer`, are available with
Bazel in the [`buildtools`
directory](https://github.com/bazelbuild/buildtools){: .external}.
## Minimizing Module Visibility
Bazel and other build systems allow each target to specify a visibility: a
property that determines which other targets may depend on it. A private target
can only be referenced within its own `BUILD` file. A target may grant broader
visibility to the targets of an explicitly defined list of `BUILD` files, or, in
the case of public visibility, to every target in the workspace.
As with most programming languages, it is usually best to minimize visibility as
much as possible. Generally, teams at Google will make targets public only if
those targets represent widely used libraries available to any team at Google.
Teams that require others to coordinate with them before using their code will
maintain an allowlist of customer targets as their targets' visibility. Each
team's internal implementation targets will be restricted to only directories
owned by the team, and most `BUILD` files will have only one target that isn't
private.
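A hedged sketch of these conventions in a `BUILD` file (all target and package
names are hypothetical):

```python
# A widely used library that any target in the workspace may use.
java_library(
    name = "widget",
    srcs = ["Widget.java"],
    visibility = ["//visibility:public"],
)

# A library whose visibility is an explicit allowlist of customer
# packages that have coordinated with the owning team.
java_library(
    name = "widget_testing",
    srcs = ["WidgetTesting.java"],
    visibility = [
        "//java/com/example/client:__pkg__",
        "//java/com/example/partner:__subpackages__",
    ],
)

# An internal implementation detail. Private is the default
# visibility, but it can also be spelled out explicitly.
java_library(
    name = "widget_impl",
    srcs = ["WidgetImpl.java"],
    visibility = ["//visibility:private"],
)
```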
## Managing Dependencies
Modules need to be able to refer to one another. The downside of breaking a
codebase into fine-grained modules is that you need to manage the dependencies
among those modules (though tools can help automate this). Expressing these
dependencies usually ends up being the bulk of the content in a `BUILD` file.
### Internal dependencies
In a large project broken into fine-grained modules, most dependencies are
likely to be internal; that is, on another target defined and built in the same
source repository. Internal dependencies differ from external dependencies in
that they are built from source rather than downloaded as a prebuilt artifact
while running the build. This also means that there's no notion of version for
internal dependencies: a target and all of its internal dependencies are always
built at the same commit/revision in the repository. One issue that should be
handled carefully with regard to internal dependencies is how to treat
transitive dependencies (Figure 1). Suppose target A depends on target B, which
depends on a common library target C. Should target A be able to use classes
defined in target C?
[![Transitive
dependencies](/images/transitive-dependencies.png)](/images/transitive-dependencies.png)
**Figure 1**. Transitive dependencies
As far as the underlying tools are concerned, there's no problem with this; both
B and C will be linked into target A when it is built, so any symbols defined in
C are known to A. Bazel allowed this for many years, but as Google grew, we
began to see problems. Suppose that B was refactored such that it no longer
needed to depend on C. If B's dependency on C was then removed, A and any other
target that used C via a dependency on B would break. Effectively, a target's
dependencies became part of its public contract and could never be safely
changed. This meant that dependencies accumulated over time and builds at Google
started to slow down.
Google eventually solved this issue by introducing a strict transitive
dependency mode in Bazel. In this mode, Bazel detects whether a target tries to
reference a symbol without depending on it directly and, if so, fails with an
error and a shell command that can be used to automatically insert the
dependency. Rolling this change out across Google's entire codebase and
refactoring every one of our millions of build targets to explicitly list their
dependencies was a multiyear effort, but it was well worth it. Our builds are
now much faster given that targets have fewer unnecessary dependencies, and
engineers are empowered to remove dependencies they don't need without worrying
about breaking targets that depend on them.
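As a sketch of what this looks like in practice (target names are hypothetical):
if code in `:a` references a class defined in `:c`, strict transitive dependency
checking requires `:a` to list `:c` directly, even though `:c` is already
reachable through `:b`:

```python
java_library(
    name = "a",
    srcs = ["A.java"],  # A.java uses classes from both :b and :c.
    deps = [
        ":b",
        # Required under strict transitive dependencies; without this
        # line, the build fails even though :b already depends on :c.
        ":c",
    ],
)

java_library(
    name = "b",
    srcs = ["B.java"],
    deps = [":c"],
)

java_library(
    name = "c",
    srcs = ["C.java"],
)
```

With the direct edge in place, `:b` can later drop its dependency on `:c`
without breaking `:a`.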
As usual, enforcing strict transitive dependencies involved a trade-off. It made
build files more verbose, as frequently used libraries now need to be listed
explicitly in many places rather than pulled in incidentally, and engineers
needed to spend more effort adding dependencies to `BUILD` files. We've since
developed tools that reduce this toil by automatically detecting many missing
dependencies and adding them to `BUILD` files without any developer
intervention. But even without such tools, we've found the trade-off to be well
worth it as the codebase scales: explicitly adding a dependency to a `BUILD`
file is a one-time cost, but dealing with implicit transitive dependencies can
cause ongoing problems as long as the build target exists. Bazel [enforces strict
transitive
dependencies](https://blog.bazel.build/2017/06/28/sjd-unused_deps.html){: .external}
on Java code by default.
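For example, the `buildozer` tool mentioned earlier can apply such a fix
mechanically; a command along the lines of
`buildozer 'add deps //java/com/example:c' //java/com/example:a` (labels
hypothetical) inserts the missing dependency into the target's `BUILD` file.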
### External dependencies
If a dependency isn't internal, it must be external. External dependencies are
those on artifacts that are built and stored outside of the build system. The
dependency is imported directly from an artifact repository (typically accessed
over the internet) and used as-is rather than being built from source. One of
the biggest differences between external and internal dependencies is that
external dependencies have versions, and those versions exist independently of
the project's source code.
### Automatic versus manual dependency management
Build systems can allow the versions of external dependencies to be managed
either manually or automatically. When managed manually, the buildfile
explicitly lists the version it wants to download from the artifact repository,
often using a [semantic version string](https://semver.org/){: .external} such
as `1.1.4`. When managed automatically, the source file specifies a range of
acceptable versions, and the build system always downloads the latest one. For
example, Gradle allows a dependency version to be declared as "1.+" to specify
that any minor or patch version of a dependency is acceptable so long as the
major version is 1.
Automatically managed dependencies can be convenient for small projects, but
they're usually a recipe for disaster on projects of nontrivial size or that are
being worked on by more than one engineer. The problem with automatically
managed dependencies is that you have no control over when the version is
updated. There's no way to guarantee that external parties won't make breaking
updates (even when they claim to use semantic versioning), so a build that
worked one day might be broken the next with no easy way to detect what changed
or to roll it back to a working state. Even if the build doesn't break, there
can be subtle behavior or performance changes that are impossible to track down.
In contrast, because manually managed dependencies require a change in source
control, they can be easily discovered and rolled back, and it's possible to
check out an older version of the repository to build with older dependencies.
Bazel requires that versions of all dependencies be specified manually. At even
moderate scales, the overhead of manual version management is well worth it for
the stability it provides.
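A hedged sketch of manual version management in Bazel (the module name and
version below are illustrative, not a recommendation):

```python
# MODULE.bazel
module(name = "my_project")

# The desired version is written explicitly and lives in source
# control, so any upgrade is a reviewable, revertible change.
bazel_dep(name = "abseil-cpp", version = "20230125.1")
```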
### The One-Version Rule
Different versions of a library are usually represented by different artifacts,
so in theory there's no reason that different versions of the same external
dependency couldn't both be declared in the build system under different names.
That way, each target could choose which version of the dependency it wanted to
use. This causes a lot of problems in practice, so Google enforces a strict
[One-Version
Rule](https://opensource.google/docs/thirdparty/oneversion/){: .external} for
all third-party dependencies in our codebase.
The biggest problem with allowing multiple versions is the diamond dependency
issue. Suppose that target A depends on target B and on v1 of an external
library. If target B is later refactored to add a dependency on v2 of the same
external library, target A will break because it now depends implicitly on two
different versions of the same library. Effectively, it's never safe to add a
new dependency from a target to any third-party library with multiple versions,
because any of that target's users could already be depending on a different
version. Following the One-Version Rule makes this conflict impossible: if a
target adds a dependency on a third-party library, any existing dependencies
will already be on that same version, so they can happily coexist.
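To make the diamond concrete, here is a hedged sketch (all names hypothetical)
of what a build could express if two versions were allowed to coexist as
separately named dependencies:

```python
java_library(
    name = "a",
    srcs = ["A.java"],
    deps = [
        # :b internally depends on @ssl_v2//:ssl.
        ":b",
        # :a also depends on v1 directly, so two copies of the same
        # classes end up on the classpath, and which copy wins is
        # undefined. Under the One-Version Rule there is only a single
        # @ssl repository, so this conflict cannot be expressed.
        "@ssl_v1//:ssl",
    ],
)
```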
### Transitive external dependencies
Dealing with the transitive dependencies of an external dependency can be
particularly difficult. Many artifact repositories, such as Maven Central, allow
artifacts to specify dependencies on particular versions of other artifacts in
the repository. Build tools like Maven or Gradle often recursively download each
transitive dependency by default, meaning that adding a single dependency in
your project could potentially cause dozens of artifacts to be downloaded in
total.
This is very convenient: when adding a dependency on a new library, it would be
a big pain to have to track down each of that library's transitive dependencies
and add them all manually. But there's also a huge downside: because different
libraries can depend on different versions of the same third-party library, this
strategy necessarily violates the One-Version Rule and leads to the diamond
dependency problem. If your target depends on two external libraries that use
different versions of the same dependency, there's no telling which one you'll
get. This also means that updating an external dependency could cause seemingly
unrelated failures throughout the codebase if the new version begins pulling in
conflicting versions of some of its dependencies.
Historically, Bazel did not download transitive dependencies automatically: it
used a `WORKSPACE` file that required all transitive dependencies to be listed
explicitly, which led to a lot of pain when managing external dependencies.
Bazel has since added support for automatic transitive external dependency
management in the form of the `MODULE.bazel` file. See [external dependency
overview](/external/overview) for more details.
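As a hedged sketch of the `MODULE.bazel` approach (the module name and version
are illustrative): a single `bazel_dep` is enough, and Bazel resolves that
module's own dependencies transitively rather than requiring you to list them:

```python
# MODULE.bazel
bazel_dep(name = "protobuf", version = "21.7")
```

In recent Bazel versions, `bazel mod graph` prints the resolved transitive
dependency graph, which makes it easy to see everything a single declaration
pulls in.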
Yet again, the choice here is one between convenience and scalability. Small
projects might prefer not having to worry about managing transitive dependencies
themselves and might be able to get away with using automatic transitive
dependencies. This strategy becomes less and less appealing as the organization
and codebase grow, and conflicts and unexpected results become more and more
frequent. At larger scales, the cost of manually managing dependencies is much
less than the cost of dealing with issues caused by automatic dependency
management.
### Caching build results using external dependencies
External dependencies are most often provided by third parties that release
stable versions of libraries, perhaps without providing source code. Some
organizations might also choose to make some of their own code available as
artifacts, allowing other pieces of code to depend on them as third-party rather
than internal dependencies. This can theoretically speed up builds if artifacts
are slow to build but quick to download.
However, this also introduces a lot of overhead and complexity: someone needs to
be responsible for building each of those artifacts and uploading them to the
artifact repository, and clients need to ensure that they stay up to date with
the latest version. Debugging also becomes much more difficult because different
parts of the system will have been built from different points in the
repository, and there is no longer a consistent view of the source tree.
A better way to solve the problem of artifacts taking a long time to build is to
use a build system that supports remote caching, as described earlier. Such a
build system saves the resulting artifacts from every build to a location that
is shared across engineers, so if a developer depends on an artifact that was
recently built by someone else, the build system automatically downloads it
instead of building it. This provides all of the performance benefits of
depending directly on artifacts while still ensuring that builds are as
consistent as if they were always built from the same source. This is the
strategy used internally by Google, and Bazel can be configured to use a remote
cache.
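As a sketch, pointing Bazel at a shared cache is a matter of configuration; the
endpoint below is a placeholder:

```
# .bazelrc
build --remote_cache=grpcs://cache.example.com
```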
### Security and reliability of external dependencies
Depending on artifacts from third-party sources is inherently risky. There's an
availability risk if the third-party source (such as an artifact repository)
goes down, because your entire build might grind to a halt if it's unable to
download an external dependency. There's also a security risk: if the
third-party system is compromised by an attacker, the attacker could replace the
referenced artifact with one of their own design, allowing them to inject
arbitrary code into your build. Both problems can be mitigated by mirroring any
artifacts you depend on onto servers you control and blocking your build system
from accessing third-party artifact repositories like Maven Central. The
trade-off is that these mirrors take effort and resources to maintain, so the
choice of whether to use them often depends on the scale of the project. The
security issue can also be completely prevented with little overhead by
requiring the hash of each third-party artifact to be specified in the source
repository, causing the build to fail if the artifact is tampered with.

Another alternative that completely sidesteps the issue is to vendor your
project's dependencies. When a project vendors its dependencies, it checks them
into source control alongside the project's source code, either as source or as
binaries. This effectively means that all of the project's external dependencies
are converted to internal dependencies. Google uses this approach internally,
checking every third-party library referenced throughout Google into a
`third_party` directory at the root of Google's source tree. However, this works
at Google only because Google's source control system is custom built to handle
an extremely large monorepo, so vendoring might not be an option for all
organizations.
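A hedged sketch of the hash-pinning approach described above, using Bazel's
`http_archive` repository rule (the URL and hash are placeholders):

```python
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
    name = "some_library",
    urls = ["https://mirror.example.com/some_library-1.2.3.tar.gz"],
    # If the downloaded archive does not match this hash, the build
    # fails instead of silently using a tampered artifact.
    sha256 = "0000000000000000000000000000000000000000000000000000000000000000",
)
```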