Googler | 2af641c | 2018-11-26 03:01:41 -0800 | [diff] [blame] | 1 | --- |
| 2 | layout: documentation |
Greg Estren | c081675 | 2020-02-20 13:04:29 -0800 | [diff] [blame] | 3 | title: Debugging remote cache hits for remote execution |
Googler | 2af641c | 2018-11-26 03:01:41 -0800 | [diff] [blame] | 4 | --- |
| 5 | |
Googler | a85685a | 2021-01-08 16:53:49 -0800 | [diff] [blame] | 6 | # Debugging Remote Cache Hits for Remote Execution |
Googler | 2af641c | 2018-11-26 03:01:41 -0800 | [diff] [blame] | 7 | |
| 8 | This page describes how to check your cache hit rate and how to investigate |
| 9 | cache misses in the context of remote execution. |
| 10 | |
| 11 | This page assumes that you have a build and/or test that successfully |
| 12 | utilizes remote execution, and you want to ensure that you are effectively |
| 13 | utilizing remote cache. |
| 14 | |
Googler | 2af641c | 2018-11-26 03:01:41 -0800 | [diff] [blame] | 15 | ## Checking your cache hit rate |
| 16 | |
| 17 | In the standard output of your Bazel run, look at the `INFO` line that lists |
| 18 | processes, which roughly correspond to Bazel actions. That line details |
| 19 | where the action was run. Look for the `remote` label, which indicates an action |
| 20 | executed remotely, `linux-sandbox` for actions executed in a local sandbox, |
| 21 | and other values for other execution strategies. An action whose result came |
| 22 | from a remote cache is displayed as `remote cache hit`. |
| 23 | |
| 24 | For example: |
| 25 | |
ecngtng | b35944e | 2021-10-18 02:33:09 -0700 | [diff] [blame] | 26 | INFO: 11 processes: 6 remote cache hit, 3 internal, 2 remote. |
Googler | 2af641c | 2018-11-26 03:01:41 -0800 | [diff] [blame] | 27 | |
ecngtng | b35944e | 2021-10-18 02:33:09 -0700 | [diff] [blame] | 28 | In this example there were 6 remote cache hits, and 2 actions did not have |
| 29 | cache hits and were executed remotely. The 3 internal part can be ignored. |
| 30 | It is typically tiny internal actions, such as creating symbolic links. Local |
| 31 | cache hits are not included in this summary. If you are getting 0 processes |
| 32 | (or a number lower than expected), run `bazel clean` followed by your build/test |
| 33 | command. |
Googler | 2af641c | 2018-11-26 03:01:41 -0800 | [diff] [blame] | 34 | |
| 35 | ## Troubleshooting cache hits |
| 36 | |
| 37 | If you are not getting the cache hit rate you are expecting, do the following: |
| 38 | |
Greg Estren | c081675 | 2020-02-20 13:04:29 -0800 | [diff] [blame] | 39 | ### Ensure re-running the same build/test command produces cache hits |
Googler | 2af641c | 2018-11-26 03:01:41 -0800 | [diff] [blame] | 40 | |
| 41 | 1. Run the build(s) and/or test(s) that you expect to populate the cache. The |
Googler | f86ad69 | 2021-03-08 11:18:47 -0800 | [diff] [blame] | 42 | first time a new build is run on a particular stack, you can expect no remote |
Googler | 2af641c | 2018-11-26 03:01:41 -0800 | [diff] [blame] | 43 | cache hits. As part of remote execution, action results are stored in the |
| 44 | cache and a subsequent run should pick them up. |
| 45 | |
| 46 | 2. Run `bazel clean`. This command cleans your local cache, which allows |
| 47 | you to investigate remote cache hits without the results being masked by |
| 48 | local cache hits. |
| 49 | |
| 50 | 3. Run the build(s) and test(s) that you are investigating again (on the same |
| 51 | machine). |
| 52 | |
Googler | 25a3c7d | 2021-01-22 18:38:49 +0100 | [diff] [blame] | 53 | 4. Check the `INFO` line for cache hit rate. If you see no processes except |
ecngtng | b35944e | 2021-10-18 02:33:09 -0700 | [diff] [blame] | 54 | `remote cache hit` and `internal`, then your cache is being correctly populated and |
Googler | 2af641c | 2018-11-26 03:01:41 -0800 | [diff] [blame] | 55 | accessed. In that case, skip to the next section. |
| 56 | |
| 57 | 5. A likely source of discrepancy is something non-hermetic in the build causing |
| 58 | the actions to receive different action keys across the two runs. To find |
| 59 | those actions, do the following: |
| 60 | |
| 61 | a. Re-run the build(s) or test(s) in question to obtain execution logs: |
| 62 | |
| 63 | bazel clean |
ecngtng | b35944e | 2021-10-18 02:33:09 -0700 | [diff] [blame] | 64 | bazel $YOUR_FLAGS build //your:target --execution_log_json_file=/tmp/exec1.json |
Googler | 2af641c | 2018-11-26 03:01:41 -0800 | [diff] [blame] | 65 | bazel clean |
ecngtng | b35944e | 2021-10-18 02:33:09 -0700 | [diff] [blame] | 66 | bazel $YOUR_FLAGS build //your:target --execution_log_json_file=/tmp/exec2.json |
Googler | 2af641c | 2018-11-26 03:01:41 -0800 | [diff] [blame] | 67 | |
| 68 | b. [Compare the execution logs](#comparing-the-execution-logs) between the |
| 69 | two runs. Ensure that the actions are identical across the two log files. |
| 70 | Discrepancies provide a clue about the changes that occurred between the |
| 71 | runs. Update your build to eliminate those discrepancies. |
| 72 | |
| 73 | If you are able to resolve the caching problems and now the repeated run |
| 74 | produces all cache hits, skip to the next section. |
| 75 | |
| 76 | If your action IDs are identical but there are no cache hits, then something |
| 77 | in your configuration is preventing caching. Continue with this section to |
| 78 | check for common problems. |
| 79 | |
| 80 | 5. Check that all actions in the execution log have `cacheable` set to true. If |
| 81 | `cacheable` does not appear in the execution log for a give action, that |
| 82 | means that the corresponding rule may have a `no-cache` tag in its |
| 83 | definition in the `BUILD` file. Look at the human-readable `progress_message` |
| 84 | field in the execution log to help determine where the action is coming from. |
| 85 | |
| 86 | 6. If the actions are identical and `cacheable` but there are no cache hits, it |
| 87 | is possible that your command line includes `--noremote_accept_cached` which |
| 88 | would disable cache lookups for a build. |
| 89 | |
| 90 | If figuring out the actual command line is difficult, use the canonical |
| 91 | command line from the |
| 92 | [Build Event Protocol](/build-event-protocol.html) |
| 93 | as follows: |
| 94 | |
| 95 | a. Add `--build_event_text_file=/tmp/bep.txt` to your Bazel command to get |
| 96 | the text version of the log. |
| 97 | |
| 98 | b. Open the text version of the log and search for the |
| 99 | `structured_command_line` message with `command_line_label: "canonical"`. |
| 100 | It will list all the options after expansion. |
| 101 | |
| 102 | c. Search for `remote_accept_cached` and check whether it's set to `false`. |
| 103 | |
| 104 | d. If `remote_accept_cached` is `false`, determine where it is being |
| 105 | set to `false`: either at the command line or in a |
| 106 | [bazelrc](/guide.html#where-are-the-bazelrc-files) file. |
| 107 | |
| 108 | ### Ensure caching across machines |
| 109 | |
| 110 | After cache hits are happening as expected on the same machine, run the |
| 111 | same build(s)/test(s) on a different machine. If you suspect that caching is |
| 112 | not happening across machines, do the following: |
| 113 | |
| 114 | 1. Make a small modification to your build to avoid hitting existing caches. |
| 115 | |
| 116 | 2. Run the build on the first machine: |
| 117 | |
| 118 | bazel clean |
ecngtng | b35944e | 2021-10-18 02:33:09 -0700 | [diff] [blame] | 119 | bazel ... build ... --execution_log_json_file=/tmp/exec1.json |
Googler | 2af641c | 2018-11-26 03:01:41 -0800 | [diff] [blame] | 120 | |
| 121 | 3. Run the build on the second machine, ensuring the modification from step 1 |
| 122 | is included: |
| 123 | |
| 124 | bazel clean |
ecngtng | b35944e | 2021-10-18 02:33:09 -0700 | [diff] [blame] | 125 | bazel ... build ... --execution_log_json_file=/tmp/exec2.json |
Googler | 2af641c | 2018-11-26 03:01:41 -0800 | [diff] [blame] | 126 | |
| 127 | 4. [Compare the execution logs](#comparing-the-execution-logs) for the two |
| 128 | runs. If the logs are not identical, investigate your build configurations |
| 129 | for discrepancies as well as properties from the host environment leaking |
| 130 | into either of the builds. |
| 131 | |
| 132 | ## Comparing the execution logs |
| 133 | |
| 134 | Execution logs contain records of all actions executed during the build. For |
| 135 | each action there is a |
| 136 | [SpawnExec](https://github.com/bazelbuild/bazel/blob/42389d9468a954f3793a19f8e026b022b39aefca/src/main/protobuf/spawn.proto#L67) |
| 137 | element containing all of the information from the action key, Thus, if the |
| 138 | logs are identical then so are the action cache keys. |
| 139 | |
| 140 | To compare logs for two builds that are not sharing cache hits as expected, |
| 141 | do the folowing: |
| 142 | |
ecngtng | b35944e | 2021-10-18 02:33:09 -0700 | [diff] [blame] | 143 | 1. Get the execution logs from each build and store them as `/tmp/exec1.json` and |
| 144 | `/tmp/exec2.json`. |
Googler | 2af641c | 2018-11-26 03:01:41 -0800 | [diff] [blame] | 145 | |
ecngtng | b35944e | 2021-10-18 02:33:09 -0700 | [diff] [blame] | 146 | 2. Use your favourite text differ to diff `/tmp/exec1.json` and |
| 147 | `/tmp/exec2.json`. |