Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 1 | --- |
| 2 | layout: documentation |
| 3 | title: Depsets |
| 4 | --- |
| 5 | |
| 6 | # Depsets |
| 7 | |
laurentlb | 377fbd0 | 2017-11-29 09:43:34 -0800 | [diff] [blame] | 8 | [Depsets](lib/depset.html) are a specialized data structure for efficiently |
| 9 | collecting data across a target’s transitive dependencies. Since this use case |
| 10 | concerns the [analysis phase](concepts.md#evaluation-model), depsets are useful |
| 11 | for authors of rules and aspects, but probably not macros. |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 12 | |
| 13 | The main feature of depsets is that they support a time- and space-efficient |
| 14 | merge operation, whose cost is independent of the size of the existing contents. |
| 15 | Depsets also have well-defined ordering semantics. |
| 16 | |
| 17 | Example uses of depsets include: |
| 18 | |
| 19 | * storing the paths of all object files for a program’s libraries, which can |
| 20 | then be passed to a linker action |
| 21 | |
| 22 | * for an interpreted language, storing the transitive source files that will |
| 23 | be included in an executable's runfiles |
| 24 | |
laurentlb | 377fbd0 | 2017-11-29 09:43:34 -0800 | [diff] [blame] | 25 | If you don't need the merge operation, consider using another type, such as |
| 26 | [list](lib/list.html) or [dict](lib/dict.html). |
| 27 | |
dmarting | f8480fa | 2017-09-15 13:15:43 +0200 | [diff] [blame] | 28 | <!-- [TOC] --> |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 29 | |
| 30 | ## Full example |
| 31 | |
Dmitry Lomov | 9ee4073 | 2017-11-24 03:48:44 -0800 | [diff] [blame] | 32 | |
laurentlb | 0e8af49 | 2018-01-25 09:06:08 -0800 | [diff] [blame] | 33 | This example is avalable at |
| 34 | [https://github.com/bazelbuild/examples/tree/master/rules/depsets](https://github.com/bazelbuild/examples/tree/master/rules/depsets). |
Dmitry Lomov | 9ee4073 | 2017-11-24 03:48:44 -0800 | [diff] [blame] | 35 | |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 36 | Suppose we have a hypothetical interpreted language Foo. In order to build each |
laurentlb | 377fbd0 | 2017-11-29 09:43:34 -0800 | [diff] [blame] | 37 | `foo_binary` we need to know all the `*.foo` files that it directly or indirectly |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 38 | depends on. |
| 39 | |
| 40 | ```python |
Dmitry Lomov | 9ee4073 | 2017-11-24 03:48:44 -0800 | [diff] [blame] | 41 | # //depsets:BUILD |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 42 | |
| 43 | load(":foo.bzl", "foo_library", "foo_binary") |
| 44 | |
| 45 | # Our hypothetical Foo compiler. |
| 46 | py_binary( |
| 47 | name = "foocc", |
| 48 | srcs = ["foocc.py"], |
| 49 | ) |
| 50 | |
| 51 | foo_library( |
| 52 | name = "a", |
| 53 | srcs = ["a.foo", "a_impl.foo"], |
| 54 | ) |
| 55 | |
| 56 | foo_library( |
| 57 | name = "b", |
| 58 | srcs = ["b.foo", "b_impl.foo"], |
| 59 | deps = [":a"], |
| 60 | ) |
| 61 | |
| 62 | foo_library( |
| 63 | name = "c", |
| 64 | srcs = ["c.foo", "c_impl.foo"], |
| 65 | deps = [":a"], |
| 66 | ) |
| 67 | |
| 68 | foo_binary( |
| 69 | name = "d", |
| 70 | srcs = ["d.foo"], |
| 71 | deps = [":b", ":c"], |
| 72 | ) |
| 73 | ``` |
| 74 | |
| 75 | ```python |
Dmitry Lomov | 9ee4073 | 2017-11-24 03:48:44 -0800 | [diff] [blame] | 76 | # //depsets:foocc.py |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 77 | |
| 78 | # "Foo compiler" that just concatenates its inputs to form its output. |
| 79 | import sys |
| 80 | |
| 81 | if __name__ == "__main__": |
| 82 | assert len(sys.argv) >= 1 |
| 83 | output = open(sys.argv[1], "wt") |
| 84 | for path in sys.argv[2:]: |
| 85 | input = open(path, "rt") |
| 86 | output.write(input.read()) |
| 87 | ``` |
| 88 | |
laurentlb | 377fbd0 | 2017-11-29 09:43:34 -0800 | [diff] [blame] | 89 | Here, the transitive sources of the binary `d` are all of the `*.foo` files in |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 90 | the `srcs` fields of `a`, `b`, `c`, and `d`. In order for the `foo_binary` |
| 91 | target to know about any file besides `d.foo`, the `foo_library` targets need to |
| 92 | pass them along in a provider. Each library receives the providers from its own |
| 93 | dependencies, adds its own immediate sources, and passes on a new provider with |
| 94 | the augmented contents. The `foo_binary` rule does the same, except that instead |
| 95 | of returning a provider, it uses the complete list of sources to construct a |
| 96 | command line for an action. |
| 97 | |
| 98 | Here’s a complete implementation of the `foo_library` and `foo_binary` rules. |
| 99 | |
| 100 | ```python |
Dmitry Lomov | 9ee4073 | 2017-11-24 03:48:44 -0800 | [diff] [blame] | 101 | # //depsets/foo.bzl |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 102 | |
| 103 | # A provider with one field, transitive_sources. |
laurentlb | 0e8af49 | 2018-01-25 09:06:08 -0800 | [diff] [blame] | 104 | FooFiles = provider(fields = ["transitive_sources"]) |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 105 | |
| 106 | def get_transitive_srcs(srcs, deps): |
| 107 | """Obtain the source files for a target and its transitive dependencies. |
| 108 | |
| 109 | Args: |
| 110 | srcs: a list of source files |
| 111 | deps: a list of targets that are direct dependencies |
| 112 | Returns: |
| 113 | a collection of the transitive sources |
| 114 | """ |
Dmitry Lomov | 9ee4073 | 2017-11-24 03:48:44 -0800 | [diff] [blame] | 115 | return depset( |
| 116 | srcs, |
| 117 | transitive = [dep[FooFiles].transitive_sources for dep in deps]) |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 118 | |
| 119 | def _foo_library_impl(ctx): |
| 120 | trans_srcs = get_transitive_srcs(ctx.files.srcs, ctx.attr.deps) |
| 121 | return [FooFiles(transitive_sources=trans_srcs)] |
| 122 | |
| 123 | foo_library = rule( |
| 124 | implementation = _foo_library_impl, |
| 125 | attrs = { |
| 126 | "srcs": attr.label_list(allow_files=True), |
| 127 | "deps": attr.label_list(), |
| 128 | }, |
| 129 | ) |
| 130 | |
| 131 | def _foo_binary_impl(ctx): |
| 132 | foocc = ctx.executable._foocc |
| 133 | out = ctx.outputs.out |
| 134 | trans_srcs = get_transitive_srcs(ctx.files.srcs, ctx.attr.deps) |
| 135 | srcs_list = trans_srcs.to_list() |
Dmitry Lomov | 9ee4073 | 2017-11-24 03:48:44 -0800 | [diff] [blame] | 136 | ctx.actions.run(executable = foocc, |
| 137 | arguments = [out.path] + [src.path for src in srcs_list], |
| 138 | inputs = srcs_list + [foocc], |
| 139 | outputs = [out]) |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 140 | |
| 141 | foo_binary = rule( |
| 142 | implementation = _foo_binary_impl, |
| 143 | attrs = { |
| 144 | "srcs": attr.label_list(allow_files=True), |
| 145 | "deps": attr.label_list(), |
Dmitry Lomov | 9ee4073 | 2017-11-24 03:48:44 -0800 | [diff] [blame] | 146 | "_foocc": attr.label(default=Label("//depsets:foocc"), |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 147 | allow_files=True, executable=True, cfg="host") |
| 148 | }, |
| 149 | outputs = {"out": "%{name}.out"}, |
| 150 | ) |
| 151 | ``` |
| 152 | |
| 153 | You can test this by copying these files into a fresh package, renaming the |
laurentlb | 377fbd0 | 2017-11-29 09:43:34 -0800 | [diff] [blame] | 154 | labels appropriately, creating the source `*.foo` files with dummy content, and |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 155 | building the `d` target. |
| 156 | |
| 157 | ## Description and operations |
| 158 | |
| 159 | Conceptually, a depset is a directed acyclic graph (DAG) that typically looks |
| 160 | similar to the target graph. It is constructed from the leaves up to the root. |
| 161 | Each target in a dependency chain can add its own contents on top of the |
| 162 | previous without having to read or copy them. |
| 163 | |
| 164 | Each node in the DAG holds a list of direct elements and a list of child nodes. |
| 165 | The contents of the depset are the transitive elements, i.e. the direct elements |
Dmitry Lomov | 9ee4073 | 2017-11-24 03:48:44 -0800 | [diff] [blame] | 166 | of all the nodes. A new depset can be created using the |
| 167 | [depset](lib/globals.html#depset) constructor: it accepts a list of direct |
| 168 | elemens and another list of child nodes. |
Jon Brandvein | 25aa033 | 2017-02-24 16:25:15 +0000 | [diff] [blame] | 169 | |
| 170 | ```python |
| 171 | s = depset(["a", "b", "c"]) |
Dmitry Lomov | 9ee4073 | 2017-11-24 03:48:44 -0800 | [diff] [blame] | 172 | t = depset(["d", "e"], transitive = [s]) |
Jon Brandvein | 25aa033 | 2017-02-24 16:25:15 +0000 | [diff] [blame] | 173 | |
Dmitry Lomov | 9ee4073 | 2017-11-24 03:48:44 -0800 | [diff] [blame] | 174 | print(s) # depset(["a", "b", "c"]) |
| 175 | print(t) # depset(["d", "e", "a", "b", "c"]) |
Jon Brandvein | 25aa033 | 2017-02-24 16:25:15 +0000 | [diff] [blame] | 176 | ``` |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 177 | |
| 178 | To retrieve the contents of a depset, use the |
| 179 | [to_list()](lib/depset.html#to_list) method. It returns a list of all transitive |
| 180 | elements, not including duplicates. There is no way to directly inspect the |
| 181 | precise structure of the DAG, although this structure does affect the order in |
| 182 | which the elements are returned. |
| 183 | |
Jon Brandvein | 25aa033 | 2017-02-24 16:25:15 +0000 | [diff] [blame] | 184 | ```python |
| 185 | s = depset(["a", "b", "c"]) |
| 186 | |
| 187 | print("c" in s.to_list()) # True |
| 188 | print(s.to_list() == ["a", "b", "c"]) # True |
| 189 | ``` |
| 190 | |
| 191 | The allowed items in a depset are restricted, just as the allowed keys in |
| 192 | dictionaries are restricted. In particular, depset contents may not be mutable. |
| 193 | |
| 194 | Depsets use reference equality: a depset is equal to itself, but unequal to any |
| 195 | other depset, even if they have the same contents and same internal structure. |
| 196 | |
| 197 | ```python |
| 198 | s = depset(["a", "b", "c"]) |
| 199 | t = s |
| 200 | print(s == t) # True |
| 201 | |
| 202 | t = depset(["a", "b", "c"]) |
| 203 | print(s == t) # False |
| 204 | |
Jon Brandvein | 25aa033 | 2017-02-24 16:25:15 +0000 | [diff] [blame] | 205 | d = {} |
| 206 | d[s] = None |
| 207 | d[t] = None |
| 208 | print(len(d)) # 2 |
| 209 | ``` |
| 210 | |
| 211 | To compare depsets by their contents, convert them to sorted lists. |
| 212 | |
| 213 | ```python |
| 214 | s = depset(["a", "b", "c"]) |
| 215 | t = depset(["c", "b", "a"]) |
| 216 | print(sorted(s.to_list()) == sorted(t.to_list())) # True |
| 217 | ``` |
| 218 | |
| 219 | There is no ability to remove elements from a depset. If this is needed, you |
| 220 | must read out the entire contents of the depset, filter the elements you want to |
| 221 | remove, and reconstruct a new depset. This is not particularly efficient. |
| 222 | |
| 223 | ```python |
| 224 | s = depset(["a", "b", "c"]) |
| 225 | t = depset(["b", "c"]) |
| 226 | |
| 227 | # Compute set difference s - t. Precompute t.to_list() so it's not done |
| 228 | # in a loop, and convert it to a dictionary for fast membership tests. |
| 229 | t_items = {e: None for e in t.to_list()} |
| 230 | diff_items = [x for x in s.to_list() if x not in t_items] |
| 231 | # Convert back to depset if it's still going to be used for merge operations. |
| 232 | s = depset(diff_items) |
| 233 | print(s) # depset(["a"]) |
| 234 | ``` |
| 235 | |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 236 | ## Order |
| 237 | |
| 238 | The `to_list` operation performs a traversal over the DAG. The kind of traversal |
| 239 | depends on the *order* that was specified at the time the depset was |
| 240 | constructed. It is useful for Bazel to support multiple orders because sometimes |
| 241 | tools care about the order of their inputs. For example, a linker action may |
| 242 | need to ensure that if `B` depends on `A`, then `A.o` comes before `B.o` on the |
| 243 | linker’s command line. Other tools might have the opposite requirement. |
| 244 | |
| 245 | Three traversal orders are supported: `postorder`, `preorder`, and |
| 246 | `topological`. The first two work exactly like [tree |
| 247 | traversals](https://en.wikipedia.org/wiki/Tree_traversal#Depth-first_search) |
| 248 | except that they operate on DAGs and skip already visited nodes. The third order |
| 249 | works as a topological sort from root to leaves, essentially the same as |
| 250 | preorder except that shared children are listed only after all of their parents. |
Jon Brandvein | 25aa033 | 2017-02-24 16:25:15 +0000 | [diff] [blame] | 251 | Preorder and postorder operate as left-to-right traversals, but note that within |
| 252 | each node direct elements have no order relative to children. For topological |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 253 | order, there is no left-to-right guarantee, and even the |
| 254 | all-parents-before-child guarantee does not apply in the case that there are |
| 255 | duplicate elements in different nodes of the DAG. |
| 256 | |
Jon Brandvein | 25aa033 | 2017-02-24 16:25:15 +0000 | [diff] [blame] | 257 | ```python |
Dmitry Lomov | 9ee4073 | 2017-11-24 03:48:44 -0800 | [diff] [blame] | 258 | # This demonstrates different traversal orders. |
Jon Brandvein | 25aa033 | 2017-02-24 16:25:15 +0000 | [diff] [blame] | 259 | |
| 260 | def create(order): |
Dmitry Lomov | 9ee4073 | 2017-11-24 03:48:44 -0800 | [diff] [blame] | 261 | cd = depset(["c", "d"], order = order) |
| 262 | gh = depset(["g", "h"], order = order) |
| 263 | return depset(["a", "b", "e", "f"], transitive = [cd, gh]) |
Jon Brandvein | 25aa033 | 2017-02-24 16:25:15 +0000 | [diff] [blame] | 264 | |
| 265 | print(create("postorder").to_list()) # ["c", "d", "g", "h", "a", "b", "e", "f"] |
| 266 | print(create("preorder").to_list()) # ["a", "b", "e", "f", "c", "d", "g", "h"] |
| 267 | ``` |
| 268 | |
| 269 | ```python |
| 270 | # This demonstrates different orders on a diamond graph. |
| 271 | |
| 272 | def create(order): |
| 273 | a = depset(["a"], order=order) |
Dmitry Lomov | 9ee4073 | 2017-11-24 03:48:44 -0800 | [diff] [blame] | 274 | b = depset(["b"], transitive = [a], order = order) |
| 275 | c = depset(["c"], transitive = [a], order = order) |
| 276 | d = depset(["d"], transtive = [b, c], order = order) |
Jon Brandvein | 25aa033 | 2017-02-24 16:25:15 +0000 | [diff] [blame] | 277 | return d |
| 278 | |
| 279 | print(create("postorder").to_list()) # ["a", "b", "c", "d"] |
| 280 | print(create("preorder").to_list()) # ["d", "b", "a", "c"] |
| 281 | print(create("topological").to_list()) # ["d", "b", "c", "a"] |
| 282 | ``` |
| 283 | |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 284 | Due to how traversals are implemented, the order must be specified at the time |
| 285 | the depset is created with the constructor’s `order` keyword argument. If this |
| 286 | argument is omitted, the depset has the special `default` order, in which case |
brandjon | cb68a96 | 2018-06-25 09:34:55 -0700 | [diff] [blame] | 287 | there are no guarantees about the order of any of its elements (except that it |
| 288 | is deterministic). |
Jon Brandvein | 25aa033 | 2017-02-24 16:25:15 +0000 | [diff] [blame] | 289 | |
| 290 | For safety, depsets with different orders cannot be merged with the `+` operator |
| 291 | unless one of them uses the default order; the resulting depset’s order is the |
| 292 | same as the left operand. Note that when two depsets of different order are |
| 293 | merged in this way, the child may appear to have had its elements rearranged |
Googler | a1a1e61 | 2018-01-22 06:06:00 -0800 | [diff] [blame] | 294 | when it is traversed via the parent. **The `+` operator is deprecated, anyway; |
| 295 | use the `transitive` argument instead.** |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 296 | |
| 297 | ## Performance |
| 298 | |
| 299 | To see the motivation for using depsets, consider what would have happened if we |
| 300 | had implemented `get_transitive_srcs()` without them. A naive way of writing |
| 301 | this function would be to collect the sources in a list. |
| 302 | |
| 303 | ```python |
| 304 | def get_transitive_srcs(srcs, deps): |
| 305 | trans_srcs = [] |
| 306 | for dep in deps: |
| 307 | trans_srcs += dep[FooFiles].transitive_sources |
| 308 | trans_srcs += srcs |
| 309 | return trans_srcs |
| 310 | ``` |
| 311 | |
| 312 | However, this does not take into account duplicates, so the source files for `a` |
| 313 | will appear twice on the command line and twice in the contents of the output |
| 314 | file. |
| 315 | |
| 316 | The next alternative is using a general set, which can be simulated by a |
Googler | 9594d5e | 2018-06-21 10:58:15 -0700 | [diff] [blame] | 317 | dictionary where the keys are the elements and all the keys map to `True`. |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 318 | |
| 319 | ```python |
| 320 | def get_transitive_srcs(srcs, deps): |
| 321 | trans_srcs = {} |
| 322 | for dep in deps: |
| 323 | for file in dep[FooFiles].transitive_sources: |
Googler | 9594d5e | 2018-06-21 10:58:15 -0700 | [diff] [blame] | 324 | trans_srcs[file] = True |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 325 | for file in srcs: |
Googler | 9594d5e | 2018-06-21 10:58:15 -0700 | [diff] [blame] | 326 | trans_srcs[file] = True |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 327 | return trans_srcs |
| 328 | ``` |
| 329 | |
| 330 | This gets rid of the duplicates, but it makes the order of the command line |
| 331 | arguments (and therefore the contents of the files) unspecified, although still |
| 332 | deterministic. |
| 333 | |
Jon Brandvein | 49fc305 | 2017-03-07 15:50:31 +0000 | [diff] [blame] | 334 | Moreover, both this approach and the list-based one are asymptotically worse |
| 335 | than the depset-based approach. Consider the case where there is a long chain of |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 336 | dependencies on Foo libraries. Processing every rule requires copying all of the |
| 337 | transitive sources that came before it into a new data structure. This means |
| 338 | that the time and space cost for analyzing an individual library or binary |
| 339 | target is proportional to its own height in the chain. For a chain of length n, |
| 340 | foolib_1 ← foolib_2 ← … ← foolib_n, the overall cost is effectively the |
| 341 | [triangle sum](https://en.wikipedia.org/wiki/Triangular_number) 1 + 2 + … + n, |
| 342 | which is O(n^2). This cost is wasteful because the library rule’s behavior is |
| 343 | not actually affected by the transitive sources. |
| 344 | |
| 345 | Generally speaking, depsets should be used whenever you are accumulating more |
| 346 | and more information through your transitive dependencies. This helps ensure |
| 347 | that your build scales well as your target graph grows deeper. The exact |
| 348 | advantage will depend on how deep the target graph is and how many elements per |
| 349 | target are added. |
| 350 | |
| 351 | To actually get the performance advantage, it’s important to not retrieve the |
| 352 | contents of the depset unnecessarily in library rules. One call to `to_list()` |
| 353 | at the end in a binary rule is fine, since the overall cost is just O(n). It’s |
| 354 | when many non-terminal targets try to call `to_list()` that we start to get into |
| 355 | quadratic behavior. |
| 356 | |
| 357 | ## Upcoming changes |
| 358 | |
| 359 | The API for depsets is being updated to be more consistent. Here are some recent |
| 360 | and/or upcoming changes. |
| 361 | |
brandjon | be02b51 | 2018-08-28 08:15:18 -0700 | [diff] [blame] | 362 | * When it's necessary to retrieve a depset's contents, this should be done by |
| 363 | explicitly converting the depset to a list via its `to_list()` method. Do |
| 364 | not iterate directly over the depset itself; direct iteration is deprecated |
| 365 | and will be removed. For example, don't use `list(...)`, `sorted(...)`, or |
| 366 | other functions expecting an iterable, on depsets. The rationale of this |
| 367 | change is that iterating over depsets is generally expensive, and expensive |
| 368 | operations should be made obvious in code. |
Jon Brandvein | a2e5084 | 2017-02-23 20:40:21 +0000 | [diff] [blame] | 369 | |
Jon Brandvein | 25aa033 | 2017-02-24 16:25:15 +0000 | [diff] [blame] | 370 | * Depset elements currently must have the same type, e.g. all ints or all |
| 371 | strings. This restriction will be lifted. |
| 372 | |
Googler | a1a1e61 | 2018-01-22 06:06:00 -0800 | [diff] [blame] | 373 | * A merge operation should be done by using the `transitive` argument in the |
| 374 | depset constructor. All other methods (`|` and `+` operators, and the |
| 375 | `union` method) are deprecated and will be going away. |