---
layout: contribute
title: Why is it so difficult to write Bazel rules?
---

# Why is it difficult to write Bazel rules?

We have repeatedly heard from various people that Bazel rules are hard to
write. There is no single root cause; it is due to a combination of historical
circumstances and intrinsic complexity in the problem domain. This document
attempts to give a high-level overview of the specific issues that we believe
to be the main contributors.

* Assumption: Aim for Correctness, Throughput, Ease of Use & Latency
* Assumption: Large Scale Repositories
* Assumption: BUILD-like Description Language
* Intrinsic: Remote Execution and Caching are Hard
* Historic: Hard Separation between Loading, Analysis, and Execution is
  Outdated, but still affects the API
* Intrinsic: Using Change Information for Correct and Fast Incremental Builds
  requires Unusual Coding Patterns
* Intrinsic: Avoiding Quadratic Time and Memory Consumption is Hard

## Assumption: Aim for Correctness, Throughput, Ease of Use & Latency

We assume that the build system needs to be first and foremost correct with
respect to incremental builds, i.e., for a given source tree, the output of
the same build should always be the same, regardless of what the output tree
looks like. To a first approximation, this means Bazel needs to know every
single input that goes into a given build step, so that it can rerun that step
if any of the inputs change. There are limits to how correct Bazel can get, as
it leaks some information such as the date / time of the build, and ignores
certain kinds of changes, such as changes to file attributes. Sandboxing helps
ensure correctness by preventing reads of undeclared input files. Besides the
intrinsic limits of the system, there are a few known correctness issues, most
of which are related to Fileset or the C++ rules, which are both hard
problems. We have long-term efforts to fix these.

The second goal of the build system is to have high throughput; we are
permanently pushing the boundaries of what can be done within the current
machine allocation for a remote execution service. If the remote execution
service gets overloaded, nobody can get work done.

Ease of use comes next; i.e., of multiple correct approaches with the same (or
a similar) footprint on the remote execution service, we choose the one that
is easier to use.

For the purpose of this document, latency denotes the time it takes from
starting a build to getting the intended result, whether that is a test log
from a passing or failing test, or an error message that a BUILD file has a
typo.

Note that these goals often overlap; latency is as much a function of the
throughput of the remote execution service as correctness is a component of
ease of use.

## Assumption: Large Scale Repositories

The build system needs to operate at the scale of large repositories; large
scale here means that the repository does not fit on a single hard drive, so
it is impossible to do a full checkout on virtually all developer machines. A
medium-sized build will need to read and parse tens of thousands of BUILD
files, and evaluate hundreds of thousands of globs. While it is theoretically
possible to read all BUILD files on a single machine, we have not yet been
able to do so within a reasonable amount of time and memory. As such, it is
critical that BUILD files can be loaded and parsed independently.

## Assumption: BUILD-like Description Language

For the purpose of this document, we assume a configuration language that is
roughly similar to BUILD files, i.e., declarations of library and binary rules
and their interdependencies. BUILD files can be read and parsed independently,
and we avoid even looking at source files whenever we can (except to check for
their existence).

## Intrinsic: Remote Execution and Caching are Hard

Remote execution and caching improve build times in large repositories by
roughly two orders of magnitude compared to running the build on a single
machine. However, the scale at which it needs to perform is staggering:
Google's remote execution service is designed to handle a huge number of
requests per second, and the protocol carefully avoids unnecessary roundtrips
as well as unnecessary work on the service side.

At this time, the protocol requires that the build system know all inputs to a
given action ahead of time; the build system then computes a unique action
fingerprint, and asks the scheduler for a cache hit. If a cache hit is found,
the scheduler replies with the digests of the output files; the files
themselves are addressed by digest later on. However, this imposes
restrictions on the Bazel rules, which need to declare all input files ahead
of time.
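
To make the fingerprinting idea concrete, here is a minimal sketch (all
function names are hypothetical; the real protocol uses protobuf messages and
a content-addressed store): any change to the command line or to the content
of any declared input yields a new cache key, which is why every input must be
known up front.

```python
import hashlib

def file_digest(content):
    # Hex SHA-256 digest of one input file's content.
    return hashlib.sha256(content).hexdigest()

def action_fingerprint(command, inputs):
    # Hypothetical sketch: the fingerprint covers the command line plus the
    # path and digest of every declared input, so the action can only be
    # looked up in the cache if all inputs are declared ahead of time.
    h = hashlib.sha256()
    h.update("\0".join(command).encode())
    for path in sorted(inputs):
        h.update(path.encode())
        h.update(file_digest(inputs[path]).encode())
    return h.hexdigest()

# The cache maps action fingerprints to digests of the output files; the
# output files themselves are fetched by digest only when actually needed.
key = action_fingerprint(["javac", "Foo.java"], {"Foo.java": b"class Foo {}"})
```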


## Historic: Hard Separation between Loading, Analysis, and Execution is Outdated, but still affects the API

Technically, it is sufficient for a rule to know the input and output files of
an action just before the action is sent to remote execution. However, the
original Bazel code base had a strict separation between loading packages,
analyzing rules using a configuration (essentially, command-line flags), and
only then running any actions. This distinction is still part of the rules API
today, even though the core of Bazel no longer requires it (more details
below).

That means that the rules API requires a declarative description of the rule
interface (what attributes it has, and the types of those attributes). There
are some exceptions where the API allows custom code to run during the loading
phase to compute implicit names of output files and implicit values of
attributes. For example, a java_library rule named ‘foo’ implicitly generates
an output named ‘libfoo.jar’, which can be referenced from other rules in the
build graph.
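
A toy version of that implicit-output mapping (a hypothetical helper, not
Bazel's actual implementation) shows the key property: the output name is
computable from the rule's attributes alone, before any action runs.

```python
def java_library_implicit_outputs(name):
    # Derived purely from the rule name, during loading, so that other
    # rules can reference "libfoo.jar" before anything is built.
    return {"class_jar": "lib%s.jar" % name}

outputs = java_library_implicit_outputs("foo")
```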

Furthermore, the analysis of a rule cannot read any source files or inspect
the output of an action; instead, it needs to generate a partial directed
bipartite graph of build steps and output file names that is determined only
by the rule itself and its dependencies.
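
As an illustration, analysis of a hypothetical java_library rule might emit
the following kind of structure (all names below are made up for the sketch):
action nodes on one side, file names on the other, and no file contents read.

```python
class Action:
    def __init__(self, command, inputs, outputs):
        # Edges of the bipartite graph: an action node connects its input
        # file nodes to its output file nodes by *name* only.
        self.command = command
        self.inputs = inputs
        self.outputs = outputs

def analyze_java_library(name, srcs, dep_jars):
    # Hypothetical analysis: the action and output name are derived purely
    # from the rule's attributes and its dependencies' declared outputs.
    out_jar = "lib%s.jar" % name
    compile_action = Action(["javac"] + srcs, srcs + dep_jars, [out_jar])
    return [compile_action], out_jar

actions, jar = analyze_java_library("foo", ["Foo.java"], ["libbar.jar"])
```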


## Intrinsic: Using Change Information for Correct and Fast Incremental Builds requires Unusual Coding Patterns

Above, we argued that in order to be correct, Bazel needs to know all the
input files that go into a build step in order to detect whether that build
step is still up-to-date. The same is true for package loading and rule
analysis, and we have designed
[Skyframe](http://www.bazel.io/docs/skyframe.html) to handle this in general.
Skyframe is a graph library and evaluation framework that takes a goal node
(such as ‘build //foo with these options’), and breaks it down into its
constituent parts, which are then evaluated and combined to yield this result.
As part of this process, Skyframe reads packages, analyzes rules, and executes
actions.

At each node, Skyframe tracks exactly which nodes any given node used to
compute its own output, all the way from the goal node down to the input files
(which are also Skyframe nodes). Having this graph explicitly represented in
memory allows the build system to identify exactly which nodes are affected by
a given change to an input file (including creation or deletion of an input
file), and to do the minimal amount of work needed to restore the output tree
to its intended state.
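
A toy version of that reverse-dependency bookkeeping (not Skyframe's actual
API) shows why the explicit graph makes invalidation precise: changing one
input dirties exactly the nodes that transitively read it, and nothing else.

```python
class Graph:
    def __init__(self):
        self.rdeps = {}  # node -> set of nodes that read its value

    def record(self, node, deps):
        # Called after a node evaluates: remember which deps it read.
        for d in deps:
            self.rdeps.setdefault(d, set()).add(node)

    def dirty(self, changed):
        # Walk reverse edges from a changed input; only the returned
        # nodes need re-evaluation.
        stack, seen = [changed], set()
        while stack:
            n = stack.pop()
            for r in self.rdeps.get(n, ()):
                if r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen

g = Graph()
g.record("analyze //foo", ["BUILD(foo)"])
g.record("build //foo", ["analyze //foo", "foo.cc"])
g.record("build //bar", ["bar.cc"])
```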

As part of this, each node performs a dependency discovery process; i.e., each
node can declare dependencies, and then use the contents of those dependencies
to declare even further dependencies. In principle, this maps well to a
thread-per-node model. However, medium-sized builds contain hundreds of
thousands of Skyframe nodes, and that many threads isn’t easily possible with
current Java technology (and for historical reasons, we’re currently tied to
using Java, so no lightweight threads and no continuations).

Instead, Bazel uses a fixed-size thread pool. However, that means that if a
node declares a dependency that isn’t available yet, we may have to abort that
evaluation and restart it (possibly in another thread) when the dependency
becomes available. This, in turn, means that nodes should not do this
excessively; a node that declares N dependencies serially can potentially be
restarted N times, costing O(N^2) time. Instead, we aim for up-front bulk
declaration of dependencies, which sometimes requires reorganizing the code,
or even splitting a node into multiple nodes to limit the number of restarts.
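
The cost difference can be sketched with a toy evaluator (all names below are
hypothetical; real Skyframe is far more involved): requesting dependencies one
at a time re-runs the node function once per missing dependency, while a bulk
request re-runs it only once.

```python
class Restart(Exception):
    # Raised when a requested dependency has no value yet.
    pass

class Env:
    def __init__(self):
        self.done = {}

    def get(self, key):
        # Single lookup: schedules the missing dep (computed inline here,
        # standing in for a worker thread) and aborts the node function.
        if key not in self.done:
            self.done[key] = "value of " + key
            raise Restart()
        return self.done[key]

    def get_all(self, keys):
        # Bulk lookup: schedules *every* missing dep before aborting, so
        # at most one restart is ever needed.
        missing = [k for k in keys if k not in self.done]
        for k in missing:
            self.done[k] = "value of " + k
        if missing:
            raise Restart()
        return [self.done[k] for k in keys]

def evaluate(node_fn):
    # Re-run the node function until it finishes without a restart,
    # counting how many runs that takes.
    env, runs = Env(), 0
    while True:
        runs += 1
        try:
            return node_fn(env), runs
        except Restart:
            pass

deps = ["dep%d" % i for i in range(10)]
_, serial_runs = evaluate(lambda env: [env.get(d) for d in deps])
_, bulk_runs = evaluate(lambda env: env.get_all(deps))
```

With 10 dependencies, the serial style runs the node function 11 times (one
restart per missing dependency), while the bulk style runs it twice.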

Note that this technology isn’t currently available in the rules API; instead,
the rules API is still defined using the legacy concepts of loading, analysis,
and execution phases. However, a fundamental restriction is that all accesses
to other nodes have to go through the framework so that it can track the
corresponding dependencies. Regardless of the language in which the build
system is implemented or in which the rules are written (they don’t have to be
the same), rule authors must not use standard libraries or patterns that
bypass Skyframe. For Java, that means avoiding java.io.File as well as any
form of reflection, and any library that does either. Libraries that support
dependency injection of these low-level interfaces still need to be set up
correctly for Skyframe.

This strongly suggests avoiding exposing rule authors to a full language
runtime in the first place. The danger of accidental use of such APIs is just
too great: several Bazel bugs in the past were caused by rules using unsafe
APIs, even though the rules were written by the Bazel team, i.e., by the
domain experts.


## Intrinsic: Avoiding Quadratic Time and Memory Consumption is Hard

To make matters worse, apart from the requirements imposed by Skyframe, the
historical constraints of using Java, and the outdatedness of the rules API,
accidentally introducing quadratic time or memory consumption is a fundamental
problem in any build system based on library and binary rules. There are two
very common patterns that introduce quadratic memory consumption (and
therefore quadratic time consumption).

1.  Chains of Library Rules:
    Consider the case of a chain of library rules: A depends on B, which
    depends on C, and so on. Then, we want to compute some property over the
    transitive closure of these rules, such as the Java runtime classpath, or
    the C++ linker command line for each library. Naively, we might use a
    standard list implementation; however, this already introduces quadratic
    memory consumption: the first library contains one entry on the
    classpath, the second two, the third three, and so on, for a total of
    1+2+3+...+N = O(N^2) entries.

2.  Binary Rules Depending on the Same Library Rules:
    Consider the case where a set of binaries depends on the same library
    rules; for example, you might have a number of test rules that test the
    same library code. Let’s say that out of N rules, half are binary rules,
    and the other half are library rules. Now consider that each binary makes
    a copy of some property computed over the transitive closure of library
    rules, such as the Java runtime classpath, or the C++ linker command
    line. For example, it could expand the command-line string representation
    of the C++ link action. N/2 copies of N/2 elements is O(N^2) memory.
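
The first pattern is easy to reproduce with a deliberately naive sketch:
building each library's classpath as a flat copy of its dependency's classpath
stores 1+2+...+N = N(N+1)/2 entries in total.

```python
def naive_classpath_entries(n):
    # Library i depends on library i-1; each level copies its dependency's
    # classpath and appends its own jar, so total storage is O(N^2).
    classpaths = []
    for i in range(n):
        prev = classpaths[-1] if classpaths else []
        classpaths.append(prev + ["lib%d.jar" % i])
    return sum(len(cp) for cp in classpaths)
```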


### Custom Collections Classes to Avoid Quadratic Complexity

Bazel is heavily affected by both of these scenarios, so we introduced a set
of custom collection classes that effectively compress the information in
memory by avoiding the copy at each step. Almost all of these data structures
have set semantics, so we called the class NestedSet. The majority of changes
to reduce Bazel’s memory consumption over the past several years were changes
to use NestedSet instead of whatever was previously used.

Unfortunately, usage of NestedSet does not automatically solve all the issues;
in particular, even just iterating over a NestedSet in each rule re-introduces
quadratic time consumption. NestedSet also has some helper methods to
facilitate interoperability with normal collections classes; unfortunately,
accidentally passing a NestedSet to one of these methods leads to copying
behavior, and reintroduces quadratic memory consumption.
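
A minimal sketch of the idea (not Bazel's actual NestedSet implementation):
each set stores only its direct entries plus references to the sets of its
dependencies, so the chain from pattern 1 above takes O(N) memory in total,
and the O(N) flattening is deferred to the final consumer.

```python
class NestedSet:
    def __init__(self, direct=(), transitive=()):
        # Store direct entries plus *references* to child sets; nothing is
        # copied, so a chain of N sets uses O(N) memory overall.
        self.direct = tuple(direct)
        self.transitive = tuple(transitive)

    def to_list(self):
        # Flattening visits each set and entry once, dependencies first.
        # Doing this in every rule along a chain re-introduces quadratic
        # time, so it should happen only at the final consumer.
        out, seen_sets, seen_items = [], set(), set()
        def visit(s):
            if id(s) in seen_sets:
                return
            seen_sets.add(id(s))
            for child in s.transitive:
                visit(child)
            for item in s.direct:
                if item not in seen_items:
                    seen_items.add(item)
                    out.append(item)
        visit(self)
        return out

# The chain from pattern 1: each level adds one jar and one reference.
classpath = NestedSet(direct=["lib0.jar"])
for i in range(1, 100):
    classpath = NestedSet(direct=["lib%d.jar" % i], transitive=[classpath])
```

Calling to_list() in every rule along the chain would still be O(N^2) total
work, which is exactly the iteration pitfall described above.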