Ulf Adams | 81e0858 | 2015-03-13 12:07:01 +0000 | [diff] [blame] | 1 | # Skyframe - The parallel evaluation and incrementality model of Bazel |
| 2 | |
| 3 | ## Data model |
| 4 | |
| 5 | The data model consists of the following items: |
| 6 | |
| 7 | - `SkyValue`. Also called nodes. `SkyValues` are immutable objects that contain all the data built |
| 8 | over the course of the build and the inputs of the build. Examples are: input files, output |
| 9 | files, targets and configured targets. |
| 10 | - `SkyKey`. A short immutable name to reference a `SkyValue`, for example, `FILECONTENTS:/tmp/foo` |
| 11 | or `PACKAGE://foo`. |
| 12 | - `SkyFunction`. Builds nodes based on their keys and dependent nodes. |
| 13 | - Node graph. A data structure containing the dependency relationship between nodes. |
| 14 | - `Skyframe`. Code name for the incremental evaluation framework Bazel is based on. |
| 15 | |
| 16 | |
| 17 | ## Evaluation |
| 18 | |
| 19 | A build consists of evaluating the node that represents the build request (this is the state we are |
| 20 | striving for, but there is a lot of legacy code in the way). First its `SkyFunction` is found and |
| 21 | called with the key of the top-level `SkyKey`. The function then requests the evaluation of the |
| 22 | nodes it needs to evaluate the top-level node, which in turn result in other function invocations, |
| 23 | and so on, until the leaf nodes are reached (which are usually nodes representing input files in |
| 24 | the file system). Finally, we end up with the value of the top-level `SkyValue`, some side effects |
| 25 | (e.g. output files in the file system) and a directed acyclic graph of the dependencies between the |
| 26 | nodes that were involved in the build. |
| 27 | |
| 28 | A `SkyFunction` can request `SkyKeys` in multiple passes if it cannot tell in advance all of the |
| 29 | nodes it needs to do its job. A simple example is evaluating an input file node that turns out to |
| 30 | be a symlink: the function tries to read the file, realizes that it's a symlink, and thus fetches |
| 31 | the file system node representing the target of the symlink. But that itself can be a symlink, in |
| 32 | which case the original function will need to fetch its target, too. |
| 33 | |
| 34 | The functions are represented in the code by the interface `SkyFunction` and the services |
| 35 | provided to it by an interface called `SkyFunction.Environment`. These are the things functions can |
| 36 | do: |
| 37 | |
| 38 | - Request the evaluation of another node by way of calling `env.getValue`. If the node is |
| 39 | available, its value is returned, otherwise, `null` is returned and the function itself is |
| 40 | expected to return `null`. In the latter case, the dependent node is evaluated, and then the |
| 41 | original node builder is invoked again, but this time the same `env.getValue` call will return a |
| 42 | non-`null` value. |
| 43 | - Request the evaluation of multiple other nodes by calling `env.getValues()`. This does |
| 44 | essentially the same, except that the dependent nodes are evaluated in parallel. |
| 45 | - Do computation during their invocation |
| 46 | - Have side effects, for example, writing files to the file system. Care needs to be taken that two |
| 47 | different functions do not step on each other's toes. In general, write side effects (where |
| 48 | data flows outwards from Bazel) are okay, read side effects (where data flows inwards into Bazel |
| 49 | without a registered dependency) are not, because they are an unregistered dependency and as |
| 50 | such, can cause incorrect incremental builds. |
| 51 | |
| 52 | `SkyFunction` implementations should not access data in any other way than requesting dependencies |
| 53 | (e.g. by directly reading the file system), because that results in Bazel not registering the data |
| 54 | dependency on the file that was read, thus resulting in incorrect incremental builds. |
| 55 | |
| 56 | Once a function has enough data to do its job, it should return a non-`null` value indicating |
| 57 | completion. |
| 58 | |
| 59 | This evaluation strategy has a number of benefits: |
| 60 | |
| 61 | - Hermeticity. If functions only request input data by way of depending on other nodes, Bazel |
| 62 | can guarantee that if the input state is the same, the same data is returned. If all sky |
| 63 | functions are deterministic, this means that the whole build will also be deterministic. |
| 64 | - Correct and perfect incrementality. If all the input data of all functions is recorded, Bazel |
| 65 | can invalidate only the exact set of nodes that need to be invalidated when the input data |
| 66 | changes. |
| 67 | - Parallelism. Since functions can only interact with each other by way of requesting |
| 68 | dependencies, functions that do not depend on each other can be run in parallel and Bazel can |
| 69 | guarantee that the result is the same as if they were run sequentially. |
| 70 | |
| 71 | ## Incrementality |
| 72 | |
| 73 | Since functions can only access input data by depending on other nodes, Bazel can build up a |
| 74 | complete data flow graph from the input files to the output files, and use this information to only |
| 75 | rebuild those nodes that actually need to be rebuilt: the reverse transitive closure of the set of |
| 76 | changed input files. |
| 77 | |
| 78 | In particular, two possible incrementality strategies exist: the bottom-up one and the top-down one. |
| 79 | Which one is optimal depends on how the dependency graph looks like. |
| 80 | |
| 81 | - During bottom-up invalidation, after a graph is built and the set of changed inputs is known, |
| 82 | all the nodes are invalidated that transitively depend on changed files. This is optimal |
| 83 | if we know that the same top-level node will be built again. |
| 84 | Note that bottom-up invalidation requires running `stat()` on all input files of the previous |
| 85 | build to determine if they were changed. This can be improved by using `inotify` or a similar |
| 86 | mechanism to learn about changed files. |
| 87 | |
| 88 | - During top-down invalidation, the transitive closure of the top-level node is checked and only |
| 89 | those nodes are kept whose transitive closure is clean. This is better if we know that the |
| 90 | current node graph is large, but we only need a small subset of it in the next build: bottom-up |
| 91 | invalidation would invalidate the larger graph of the first build, unlike top-down invalidation, |
| 92 | which just walks the small graph of second build. |
| 93 | |
| 94 | We currently only do bottom-up invalidation. |
| 95 | |
| 96 | To get further incrementality, we use _change pruning_: if a node is invalidated, but upon rebuild, |
| 97 | it is discovered that its new value is the same as its old value, the nodes that were invalidated |
| 98 | due to a change in this node are "resurrected". |
| 99 | |
| 100 | This is useful, for example, if one changes a comment in a C++ file: then the `.o` file generated |
| 101 | from it will be the same, thus, we don't need to call the linker again. |
| 102 | |
| 103 | ## Incremental Linking / Compilation |
| 104 | |
| 105 | The main limitation of this model is that the invalidation of a node is an all-or-nothing affair: |
| 106 | when a dependency changes, the dependent node is always rebuilt from scratch, even if a better |
| 107 | algorithm would exist that would mutate the old value of the node based on the changes. A few |
| 108 | examples where this would be useful: |
| 109 | |
| 110 | - Incremental linking |
| 111 | - When a single `.class` file changes in a `.jar`, we could theoretically modify the `.jar` file |
| 112 | instead of building it from scratch again. |
| 113 | |
| 114 | The reason why Bazel currently does not support these things in a principled way (we have some |
| 115 | measure of support for incremental linking, but it's not implemented within Skyframe) is twofold: |
| 116 | we only had limited performance gains and it was hard to guarantee that the result of the mutation |
| 117 | is the same as that of a clean rebuild would be, and Google values builds that are bit-for-bit |
| 118 | repeatable. |
| 119 | |
| 120 | Until now, we could always achieve good enough performance by simply decomposing an expensive build |
| 121 | step and achieving partial re-evaluation that way: it splits all the classes in an app into |
| 122 | multiple groups and does dexing on them separately. This way, if classes in a group do not change, |
| 123 | the dexing does not have to be redone. |
| 124 | |
| 125 | ## Restarting SkyFunctions |
| 126 | |
| 127 | Another inefficiency is that, currently, if a `SkyFunction` implementation cannot complete its job |
| 128 | because one of its dependencies is missing, it needs to be completely restarted instead of resuming |
| 129 | where it left off. This is currently not a big problem because we usually learn all the |
| 130 | dependencies after a small amount of work. The only exceptions are package loading and execution of |
| 131 | actions; these are both external processes that are expensive to restart. We allow package loading |
| 132 | to proceed fully, store the loaded package away, record the dependencies in the graph, and on |
| 133 | re-execution of the function return the already loaded package. I.e., we allow the function to keep |
| 134 | state between executions. |
| 135 | |
| 136 | If this turns out to be a significant performance or code health problem, there are alternative ways |
| 137 | to add a more principled mechanism to keep state between executions: |
| 138 | |
| 139 | - Splitting each node into multiple ones so that each smaller node only has to do one round of |
| 140 | dependency discovery (effectively continuation passing); this requires explicit code. |
| 141 | - By reimplementing Skyframe on some sort of lightweight thread infrastructure (e.g. |
| 142 | [Quasar](http://docs.paralleluniverse.co/quasar/)) so that function execution can be suspended |
| 143 | and resumed without a large performance hit and without requiring this to be explicit in the |
| 144 | code. |
| 145 | - By maintaining state for each `SkyFunction` instance between restarts (this is the workaround we |
| 146 | are using for package loading, but is not implemented as a first-class feature of the evaluation |
| 147 | framework). |
| 148 | |
| 149 | ## Mapping to Bazel concepts |
| 150 | |
| 151 | This is a rough overview of some of the `SkyFunction` implementations Bazel uses to perform a build: |
| 152 | |
| 153 | - **FileStateValue**. The result of an `lstat()`. For existent files, we also compute additional |
| 154 | information in order to detect changes to the file. This is the lowest level node in the Skyframe |
| 155 | graph and has no dependencies. |
| 156 | - **FileValue**. Used by anything that cares about the actual contents and/or resolved path of a |
| 157 | file. Depends on the corresponding `FileStateValue` and any symlinks that need to be resolved |
| 158 | (e.g. the `FileValue` for `a/b` needs the resolved path of `a` and the resolved path of `a/b`). |
| 159 | The distinction between `FileStateValue` is important because in some cases (for example, |
| 160 | evaluating file system globs (e.g. `srcs=glob(["*/*.java"])`) the contents of the file are not |
| 161 | actually needed. |
| 162 | - **DirectoryListingValue**. Essentially the result of `readdir()`. Depends on the associated |
| 163 | `FileValue` associated with the directory. |
| 164 | - **PackageValue**. Represents the parsed version of a BUILD file. Depends on the `FileValue` of |
| 165 | the associated `BUILD` file, and also transitively on any `DirectoryListingValue` that is used |
| 166 | to resolve the globs in the package (the data structure representing the contents of a `BUILD` |
| 167 | file internally) |
| 168 | - **ConfiguredTargetValue**. Represents a configured target, which is a tuple of the set of actions |
| 169 | generated during the analysis of a target and information provided to configured targets that |
| 170 | depend on this one. Depends on the `PackageValue` the corresponding target is in, the |
| 171 | `ConfiguredTargetValues` of direct dependencies, and a special node representing the build |
| 172 | configuration. |
| 173 | - **ArtifactValue**. Represents a file in the build, be it a source or an output artifacts |
| 174 | (artifacts are almost equivalent to files, and are used to refer to files during the actual |
| 175 | execution of build steps). For source files, it depends on the `FileValue` of the associated |
| 176 | node, for output artifacts, it depends on the `ActionExecutionValue` of whatever action generates |
| 177 | the artifact. |
| 178 | - **ActionExecutionValue**. Represents the execution of an action. Depends on the `ArtifactValues` |
| 179 | of its input files. The action it executes is currently contained within its sky key, which is |
| 180 | contrary to the concept that sky keys should be small. We are working on solving this |
| 181 | discrepancy (note that `ActionExecutionValue` and `ArtifactValue` are unused if we do not run the |
| 182 | execution phase on Skyframe). |
| 183 | |