Blame - site/docs/skylark/depsets.md - bazel

blob: 61919dcf2dd784a170480b77b1be4551fabdc55c [file] [log] [blame] [view]

Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	1	---
				2	layout: documentation
				3	title: Depsets
				4	---
				5
				6	# Depsets
				7
laurentlb	377fbd0	2017-11-29 09:43:34 -0800	[diff] [blame]	8	[Depsets](lib/depset.html) are a specialized data structure for efficiently
				9	collecting data across a target’s transitive dependencies. Since this use case
				10	concerns the [analysis phase](concepts.md#evaluation-model), depsets are useful
				11	for authors of rules and aspects, but probably not macros.
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	12
				13	The main feature of depsets is that they support a time- and space-efficient
				14	merge operation, whose cost is independent of the size of the existing contents.
				15	Depsets also have well-defined ordering semantics.
				16
				17	Example uses of depsets include:
				18
				19	* storing the paths of all object files for a program’s libraries, which can
				20	then be passed to a linker action
				21
				22	* for an interpreted language, storing the transitive source files that will
				23	be included in an executable's runfiles
				24
laurentlb	377fbd0	2017-11-29 09:43:34 -0800	[diff] [blame]	25	If you don't need the merge operation, consider using another type, such as
				26	[list](lib/list.html) or [dict](lib/dict.html).
				27
dmarting	f8480fa	2017-09-15 13:15:43 +0200	[diff] [blame]	28	<!-- [TOC] -->
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	29
				30	## Full example
				31
Dmitry Lomov	9ee4073	2017-11-24 03:48:44 -0800	[diff] [blame]	32
laurentlb	0e8af49	2018-01-25 09:06:08 -0800	[diff] [blame]	33	This example is avalable at
				34	[https://github.com/bazelbuild/examples/tree/master/rules/depsets](https://github.com/bazelbuild/examples/tree/master/rules/depsets).
Dmitry Lomov	9ee4073	2017-11-24 03:48:44 -0800	[diff] [blame]	35
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	36	Suppose we have a hypothetical interpreted language Foo. In order to build each
laurentlb	377fbd0	2017-11-29 09:43:34 -0800	[diff] [blame]	37	`foo_binary` we need to know all the `*.foo` files that it directly or indirectly
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	38	depends on.
				39
				40	```python
Dmitry Lomov	9ee4073	2017-11-24 03:48:44 -0800	[diff] [blame]	41	# //depsets:BUILD
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	42
				43	load(":foo.bzl", "foo_library", "foo_binary")
				44
				45	# Our hypothetical Foo compiler.
				46	py_binary(
				47	name = "foocc",
				48	srcs = ["foocc.py"],
				49	)
				50
				51	foo_library(
				52	name = "a",
				53	srcs = ["a.foo", "a_impl.foo"],
				54	)
				55
				56	foo_library(
				57	name = "b",
				58	srcs = ["b.foo", "b_impl.foo"],
				59	deps = [":a"],
				60	)
				61
				62	foo_library(
				63	name = "c",
				64	srcs = ["c.foo", "c_impl.foo"],
				65	deps = [":a"],
				66	)
				67
				68	foo_binary(
				69	name = "d",
				70	srcs = ["d.foo"],
				71	deps = [":b", ":c"],
				72	)
				73	```
				74
				75	```python
Dmitry Lomov	9ee4073	2017-11-24 03:48:44 -0800	[diff] [blame]	76	# //depsets:foocc.py
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	77
				78	# "Foo compiler" that just concatenates its inputs to form its output.
				79	import sys
				80
				81	if __name__ == "__main__":
				82	assert len(sys.argv) >= 1
				83	output = open(sys.argv[1], "wt")
				84	for path in sys.argv[2:]:
				85	input = open(path, "rt")
				86	output.write(input.read())
				87	```
				88
laurentlb	377fbd0	2017-11-29 09:43:34 -0800	[diff] [blame]	89	Here, the transitive sources of the binary `d` are all of the `*.foo` files in
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	90	the `srcs` fields of `a`, `b`, `c`, and `d`. In order for the `foo_binary`
				91	target to know about any file besides `d.foo`, the `foo_library` targets need to
				92	pass them along in a provider. Each library receives the providers from its own
				93	dependencies, adds its own immediate sources, and passes on a new provider with
				94	the augmented contents. The `foo_binary` rule does the same, except that instead
				95	of returning a provider, it uses the complete list of sources to construct a
				96	command line for an action.
				97
				98	Here’s a complete implementation of the `foo_library` and `foo_binary` rules.
				99
				100	```python
Dmitry Lomov	9ee4073	2017-11-24 03:48:44 -0800	[diff] [blame]	101	# //depsets/foo.bzl
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	102
				103	# A provider with one field, transitive_sources.
laurentlb	0e8af49	2018-01-25 09:06:08 -0800	[diff] [blame]	104	FooFiles = provider(fields = ["transitive_sources"])
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	105
				106	def get_transitive_srcs(srcs, deps):
				107	"""Obtain the source files for a target and its transitive dependencies.
				108
				109	Args:
				110	srcs: a list of source files
				111	deps: a list of targets that are direct dependencies
				112	Returns:
				113	a collection of the transitive sources
				114	"""
Dmitry Lomov	9ee4073	2017-11-24 03:48:44 -0800	[diff] [blame]	115	return depset(
				116	srcs,
				117	transitive = [dep[FooFiles].transitive_sources for dep in deps])
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	118
				119	def _foo_library_impl(ctx):
				120	trans_srcs = get_transitive_srcs(ctx.files.srcs, ctx.attr.deps)
				121	return [FooFiles(transitive_sources=trans_srcs)]
				122
				123	foo_library = rule(
				124	implementation = _foo_library_impl,
				125	attrs = {
				126	"srcs": attr.label_list(allow_files=True),
				127	"deps": attr.label_list(),
				128	},
				129	)
				130
				131	def _foo_binary_impl(ctx):
				132	foocc = ctx.executable._foocc
				133	out = ctx.outputs.out
				134	trans_srcs = get_transitive_srcs(ctx.files.srcs, ctx.attr.deps)
				135	srcs_list = trans_srcs.to_list()
Dmitry Lomov	9ee4073	2017-11-24 03:48:44 -0800	[diff] [blame]	136	ctx.actions.run(executable = foocc,
				137	arguments = [out.path] + [src.path for src in srcs_list],
				138	inputs = srcs_list + [foocc],
				139	outputs = [out])
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	140
				141	foo_binary = rule(
				142	implementation = _foo_binary_impl,
				143	attrs = {
				144	"srcs": attr.label_list(allow_files=True),
				145	"deps": attr.label_list(),
Dmitry Lomov	9ee4073	2017-11-24 03:48:44 -0800	[diff] [blame]	146	"_foocc": attr.label(default=Label("//depsets:foocc"),
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	147	allow_files=True, executable=True, cfg="host")
				148	},
				149	outputs = {"out": "%{name}.out"},
				150	)
				151	```
				152
				153	You can test this by copying these files into a fresh package, renaming the
laurentlb	377fbd0	2017-11-29 09:43:34 -0800	[diff] [blame]	154	labels appropriately, creating the source `*.foo` files with dummy content, and
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	155	building the `d` target.
				156
				157	## Description and operations
				158
				159	Conceptually, a depset is a directed acyclic graph (DAG) that typically looks
				160	similar to the target graph. It is constructed from the leaves up to the root.
				161	Each target in a dependency chain can add its own contents on top of the
				162	previous without having to read or copy them.
				163
				164	Each node in the DAG holds a list of direct elements and a list of child nodes.
				165	The contents of the depset are the transitive elements, i.e. the direct elements
Dmitry Lomov	9ee4073	2017-11-24 03:48:44 -0800	[diff] [blame]	166	of all the nodes. A new depset can be created using the
				167	[depset](lib/globals.html#depset) constructor: it accepts a list of direct
				168	elemens and another list of child nodes.
Jon Brandvein	25aa033	2017-02-24 16:25:15 +0000	[diff] [blame]	169
				170	```python
				171	s = depset(["a", "b", "c"])
Dmitry Lomov	9ee4073	2017-11-24 03:48:44 -0800	[diff] [blame]	172	t = depset(["d", "e"], transitive = [s])
Jon Brandvein	25aa033	2017-02-24 16:25:15 +0000	[diff] [blame]	173
Dmitry Lomov	9ee4073	2017-11-24 03:48:44 -0800	[diff] [blame]	174	print(s) # depset(["a", "b", "c"])
				175	print(t) # depset(["d", "e", "a", "b", "c"])
Jon Brandvein	25aa033	2017-02-24 16:25:15 +0000	[diff] [blame]	176	```
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	177
				178	To retrieve the contents of a depset, use the
				179	[to_list()](lib/depset.html#to_list) method. It returns a list of all transitive
				180	elements, not including duplicates. There is no way to directly inspect the
				181	precise structure of the DAG, although this structure does affect the order in
				182	which the elements are returned.
				183
Jon Brandvein	25aa033	2017-02-24 16:25:15 +0000	[diff] [blame]	184	```python
				185	s = depset(["a", "b", "c"])
				186
				187	print("c" in s.to_list()) # True
				188	print(s.to_list() == ["a", "b", "c"]) # True
				189	```
				190
				191	The allowed items in a depset are restricted, just as the allowed keys in
				192	dictionaries are restricted. In particular, depset contents may not be mutable.
				193
				194	Depsets use reference equality: a depset is equal to itself, but unequal to any
				195	other depset, even if they have the same contents and same internal structure.
				196
				197	```python
				198	s = depset(["a", "b", "c"])
				199	t = s
				200	print(s == t) # True
				201
				202	t = depset(["a", "b", "c"])
				203	print(s == t) # False
				204
Jon Brandvein	25aa033	2017-02-24 16:25:15 +0000	[diff] [blame]	205	d = {}
				206	d[s] = None
				207	d[t] = None
				208	print(len(d)) # 2
				209	```
				210
				211	To compare depsets by their contents, convert them to sorted lists.
				212
				213	```python
				214	s = depset(["a", "b", "c"])
				215	t = depset(["c", "b", "a"])
				216	print(sorted(s.to_list()) == sorted(t.to_list())) # True
				217	```
				218
				219	There is no ability to remove elements from a depset. If this is needed, you
				220	must read out the entire contents of the depset, filter the elements you want to
				221	remove, and reconstruct a new depset. This is not particularly efficient.
				222
				223	```python
				224	s = depset(["a", "b", "c"])
				225	t = depset(["b", "c"])
				226
				227	# Compute set difference s - t. Precompute t.to_list() so it's not done
				228	# in a loop, and convert it to a dictionary for fast membership tests.
				229	t_items = {e: None for e in t.to_list()}
				230	diff_items = [x for x in s.to_list() if x not in t_items]
				231	# Convert back to depset if it's still going to be used for merge operations.
				232	s = depset(diff_items)
				233	print(s) # depset(["a"])
				234	```
				235
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	236	## Order
				237
				238	The `to_list` operation performs a traversal over the DAG. The kind of traversal
				239	depends on the order that was specified at the time the depset was
				240	constructed. It is useful for Bazel to support multiple orders because sometimes
				241	tools care about the order of their inputs. For example, a linker action may
				242	need to ensure that if `B` depends on `A`, then `A.o` comes before `B.o` on the
				243	linker’s command line. Other tools might have the opposite requirement.
				244
				245	Three traversal orders are supported: `postorder`, `preorder`, and
				246	`topological`. The first two work exactly like [tree
				247	traversals](https://en.wikipedia.org/wiki/Tree_traversal#Depth-first_search)
				248	except that they operate on DAGs and skip already visited nodes. The third order
				249	works as a topological sort from root to leaves, essentially the same as
				250	preorder except that shared children are listed only after all of their parents.
Jon Brandvein	25aa033	2017-02-24 16:25:15 +0000	[diff] [blame]	251	Preorder and postorder operate as left-to-right traversals, but note that within
				252	each node direct elements have no order relative to children. For topological
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	253	order, there is no left-to-right guarantee, and even the
				254	all-parents-before-child guarantee does not apply in the case that there are
				255	duplicate elements in different nodes of the DAG.
				256
Jon Brandvein	25aa033	2017-02-24 16:25:15 +0000	[diff] [blame]	257	```python
Dmitry Lomov	9ee4073	2017-11-24 03:48:44 -0800	[diff] [blame]	258	# This demonstrates different traversal orders.
Jon Brandvein	25aa033	2017-02-24 16:25:15 +0000	[diff] [blame]	259
				260	def create(order):
Dmitry Lomov	9ee4073	2017-11-24 03:48:44 -0800	[diff] [blame]	261	cd = depset(["c", "d"], order = order)
				262	gh = depset(["g", "h"], order = order)
				263	return depset(["a", "b", "e", "f"], transitive = [cd, gh])
Jon Brandvein	25aa033	2017-02-24 16:25:15 +0000	[diff] [blame]	264
				265	print(create("postorder").to_list()) # ["c", "d", "g", "h", "a", "b", "e", "f"]
				266	print(create("preorder").to_list()) # ["a", "b", "e", "f", "c", "d", "g", "h"]
				267	```
				268
				269	```python
				270	# This demonstrates different orders on a diamond graph.
				271
				272	def create(order):
				273	a = depset(["a"], order=order)
Dmitry Lomov	9ee4073	2017-11-24 03:48:44 -0800	[diff] [blame]	274	b = depset(["b"], transitive = [a], order = order)
				275	c = depset(["c"], transitive = [a], order = order)
				276	d = depset(["d"], transtive = [b, c], order = order)
Jon Brandvein	25aa033	2017-02-24 16:25:15 +0000	[diff] [blame]	277	return d
				278
				279	print(create("postorder").to_list()) # ["a", "b", "c", "d"]
				280	print(create("preorder").to_list()) # ["d", "b", "a", "c"]
				281	print(create("topological").to_list()) # ["d", "b", "c", "a"]
				282	```
				283
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	284	Due to how traversals are implemented, the order must be specified at the time
				285	the depset is created with the constructor’s `order` keyword argument. If this
				286	argument is omitted, the depset has the special `default` order, in which case
brandjon	cb68a96	2018-06-25 09:34:55 -0700	[diff] [blame]	287	there are no guarantees about the order of any of its elements (except that it
				288	is deterministic).
Jon Brandvein	25aa033	2017-02-24 16:25:15 +0000	[diff] [blame]	289
				290	For safety, depsets with different orders cannot be merged with the `+` operator
				291	unless one of them uses the default order; the resulting depset’s order is the
				292	same as the left operand. Note that when two depsets of different order are
				293	merged in this way, the child may appear to have had its elements rearranged
Googler	a1a1e61	2018-01-22 06:06:00 -0800	[diff] [blame]	294	when it is traversed via the parent. **The `+` operator is deprecated, anyway;
				295	use the `transitive` argument instead.**
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	296
				297	## Performance
				298
				299	To see the motivation for using depsets, consider what would have happened if we
				300	had implemented `get_transitive_srcs()` without them. A naive way of writing
				301	this function would be to collect the sources in a list.
				302
				303	```python
				304	def get_transitive_srcs(srcs, deps):
				305	trans_srcs = []
				306	for dep in deps:
				307	trans_srcs += dep[FooFiles].transitive_sources
				308	trans_srcs += srcs
				309	return trans_srcs
				310	```
				311
				312	However, this does not take into account duplicates, so the source files for `a`
				313	will appear twice on the command line and twice in the contents of the output
				314	file.
				315
				316	The next alternative is using a general set, which can be simulated by a
Googler	9594d5e	2018-06-21 10:58:15 -0700	[diff] [blame]	317	dictionary where the keys are the elements and all the keys map to `True`.
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	318
				319	```python
				320	def get_transitive_srcs(srcs, deps):
				321	trans_srcs = {}
				322	for dep in deps:
				323	for file in dep[FooFiles].transitive_sources:
Googler	9594d5e	2018-06-21 10:58:15 -0700	[diff] [blame]	324	trans_srcs[file] = True
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	325	for file in srcs:
Googler	9594d5e	2018-06-21 10:58:15 -0700	[diff] [blame]	326	trans_srcs[file] = True
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	327	return trans_srcs
				328	```
				329
				330	This gets rid of the duplicates, but it makes the order of the command line
				331	arguments (and therefore the contents of the files) unspecified, although still
				332	deterministic.
				333
Jon Brandvein	49fc305	2017-03-07 15:50:31 +0000	[diff] [blame]	334	Moreover, both this approach and the list-based one are asymptotically worse
				335	than the depset-based approach. Consider the case where there is a long chain of
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	336	dependencies on Foo libraries. Processing every rule requires copying all of the
				337	transitive sources that came before it into a new data structure. This means
				338	that the time and space cost for analyzing an individual library or binary
				339	target is proportional to its own height in the chain. For a chain of length n,
				340	foolib_1 ← foolib_2 ← … ← foolib_n, the overall cost is effectively the
				341	[triangle sum](https://en.wikipedia.org/wiki/Triangular_number) 1 + 2 + … + n,
				342	which is O(n^2). This cost is wasteful because the library rule’s behavior is
				343	not actually affected by the transitive sources.
				344
				345	Generally speaking, depsets should be used whenever you are accumulating more
				346	and more information through your transitive dependencies. This helps ensure
				347	that your build scales well as your target graph grows deeper. The exact
				348	advantage will depend on how deep the target graph is and how many elements per
				349	target are added.
				350
				351	To actually get the performance advantage, it’s important to not retrieve the
				352	contents of the depset unnecessarily in library rules. One call to `to_list()`
				353	at the end in a binary rule is fine, since the overall cost is just O(n). It’s
				354	when many non-terminal targets try to call `to_list()` that we start to get into
				355	quadratic behavior.
				356
				357	## Upcoming changes
				358
				359	The API for depsets is being updated to be more consistent. Here are some recent
				360	and/or upcoming changes.
				361
brandjon	be02b51	2018-08-28 08:15:18 -0700	[diff] [blame]	362	* When it's necessary to retrieve a depset's contents, this should be done by
				363	explicitly converting the depset to a list via its `to_list()` method. Do
				364	not iterate directly over the depset itself; direct iteration is deprecated
				365	and will be removed. For example, don't use `list(...)`, `sorted(...)`, or
				366	other functions expecting an iterable, on depsets. The rationale of this
				367	change is that iterating over depsets is generally expensive, and expensive
				368	operations should be made obvious in code.
Jon Brandvein	a2e5084	2017-02-23 20:40:21 +0000	[diff] [blame]	369
Jon Brandvein	25aa033	2017-02-24 16:25:15 +0000	[diff] [blame]	370	* Depset elements currently must have the same type, e.g. all ints or all
				371	strings. This restriction will be lifted.
				372
Googler	a1a1e61	2018-01-22 06:06:00 -0800	[diff] [blame]	373	* A merge operation should be done by using the `transitive` argument in the
				374	depset constructor. All other methods (`\|` and `+` operators, and the
				375	`union` method) are deprecated and will be going away.