Blame - site/docs/remote-execution-caching-debug.md - bazel

blob: ba92b69578d0d4b40e57c8d29b7f7adf43ec3f39 [file] [log] [blame] [view]

Googler	2af641c	2018-11-26 03:01:41 -0800	[diff] [blame]	1	---
				2	layout: documentation
Greg Estren	c081675	2020-02-20 13:04:29 -0800	[diff] [blame]	3	title: Debugging remote cache hits for remote execution
Googler	2af641c	2018-11-26 03:01:41 -0800	[diff] [blame]	4	---
				5
Googler	a85685a	2021-01-08 16:53:49 -0800	[diff] [blame]	6	# Debugging Remote Cache Hits for Remote Execution
Googler	2af641c	2018-11-26 03:01:41 -0800	[diff] [blame]	7
				8	This page describes how to check your cache hit rate and how to investigate
				9	cache misses in the context of remote execution.
				10
				11	This page assumes that you have a build and/or test that successfully
				12	utilizes remote execution, and you want to ensure that you are effectively
				13	utilizing remote cache.
				14
Googler	2af641c	2018-11-26 03:01:41 -0800	[diff] [blame]	15	## Checking your cache hit rate
				16
				17	In the standard output of your Bazel run, look at the `INFO` line that lists
				18	processes, which roughly correspond to Bazel actions. That line details
				19	where the action was run. Look for the `remote` label, which indicates an action
				20	executed remotely, `linux-sandbox` for actions executed in a local sandbox,
				21	and other values for other execution strategies. An action whose result came
				22	from a remote cache is displayed as `remote cache hit`.
				23
				24	For example:
				25
ecngtng	b35944e	2021-10-18 02:33:09 -0700	[diff] [blame]	26	INFO: 11 processes: 6 remote cache hit, 3 internal, 2 remote.
Googler	2af641c	2018-11-26 03:01:41 -0800	[diff] [blame]	27
ecngtng	b35944e	2021-10-18 02:33:09 -0700	[diff] [blame]	28	In this example there were 6 remote cache hits, and 2 actions did not have
				29	cache hits and were executed remotely. The 3 internal part can be ignored.
				30	It is typically tiny internal actions, such as creating symbolic links. Local
				31	cache hits are not included in this summary. If you are getting 0 processes
				32	(or a number lower than expected), run `bazel clean` followed by your build/test
				33	command.
Googler	2af641c	2018-11-26 03:01:41 -0800	[diff] [blame]	34
				35	## Troubleshooting cache hits
				36
				37	If you are not getting the cache hit rate you are expecting, do the following:
				38
Greg Estren	c081675	2020-02-20 13:04:29 -0800	[diff] [blame]	39	### Ensure re-running the same build/test command produces cache hits
Googler	2af641c	2018-11-26 03:01:41 -0800	[diff] [blame]	40
				41	1. Run the build(s) and/or test(s) that you expect to populate the cache. The
Googler	f86ad69	2021-03-08 11:18:47 -0800	[diff] [blame]	42	first time a new build is run on a particular stack, you can expect no remote
Googler	2af641c	2018-11-26 03:01:41 -0800	[diff] [blame]	43	cache hits. As part of remote execution, action results are stored in the
				44	cache and a subsequent run should pick them up.
				45
				46	2. Run `bazel clean`. This command cleans your local cache, which allows
				47	you to investigate remote cache hits without the results being masked by
				48	local cache hits.
				49
				50	3. Run the build(s) and test(s) that you are investigating again (on the same
				51	machine).
				52
Googler	25a3c7d	2021-01-22 18:38:49 +0100	[diff] [blame]	53	4. Check the `INFO` line for cache hit rate. If you see no processes except
ecngtng	b35944e	2021-10-18 02:33:09 -0700	[diff] [blame]	54	`remote cache hit` and `internal`, then your cache is being correctly populated and
Googler	2af641c	2018-11-26 03:01:41 -0800	[diff] [blame]	55	accessed. In that case, skip to the next section.
				56
				57	5. A likely source of discrepancy is something non-hermetic in the build causing
				58	the actions to receive different action keys across the two runs. To find
				59	those actions, do the following:
				60
				61	a. Re-run the build(s) or test(s) in question to obtain execution logs:
				62
				63	bazel clean
ecngtng	b35944e	2021-10-18 02:33:09 -0700	[diff] [blame]	64	bazel $YOUR_FLAGS build //your:target --execution_log_json_file=/tmp/exec1.json
Googler	2af641c	2018-11-26 03:01:41 -0800	[diff] [blame]	65	bazel clean
ecngtng	b35944e	2021-10-18 02:33:09 -0700	[diff] [blame]	66	bazel $YOUR_FLAGS build //your:target --execution_log_json_file=/tmp/exec2.json
Googler	2af641c	2018-11-26 03:01:41 -0800	[diff] [blame]	67
				68	b. [Compare the execution logs](#comparing-the-execution-logs) between the
				69	two runs. Ensure that the actions are identical across the two log files.
				70	Discrepancies provide a clue about the changes that occurred between the
				71	runs. Update your build to eliminate those discrepancies.
				72
				73	If you are able to resolve the caching problems and now the repeated run
				74	produces all cache hits, skip to the next section.
				75
				76	If your action IDs are identical but there are no cache hits, then something
				77	in your configuration is preventing caching. Continue with this section to
				78	check for common problems.
				79
				80	5. Check that all actions in the execution log have `cacheable` set to true. If
				81	`cacheable` does not appear in the execution log for a give action, that
				82	means that the corresponding rule may have a `no-cache` tag in its
				83	definition in the `BUILD` file. Look at the human-readable `progress_message`
				84	field in the execution log to help determine where the action is coming from.
				85
				86	6. If the actions are identical and `cacheable` but there are no cache hits, it
				87	is possible that your command line includes `--noremote_accept_cached` which
				88	would disable cache lookups for a build.
				89
				90	If figuring out the actual command line is difficult, use the canonical
				91	command line from the
				92	[Build Event Protocol](/build-event-protocol.html)
				93	as follows:
				94
				95	a. Add `--build_event_text_file=/tmp/bep.txt` to your Bazel command to get
				96	the text version of the log.
				97
				98	b. Open the text version of the log and search for the
				99	`structured_command_line` message with `command_line_label: "canonical"`.
				100	It will list all the options after expansion.
				101
				102	c. Search for `remote_accept_cached` and check whether it's set to `false`.
				103
				104	d. If `remote_accept_cached` is `false`, determine where it is being
				105	set to `false`: either at the command line or in a
				106	[bazelrc](/guide.html#where-are-the-bazelrc-files) file.
				107
				108	### Ensure caching across machines
				109
				110	After cache hits are happening as expected on the same machine, run the
				111	same build(s)/test(s) on a different machine. If you suspect that caching is
				112	not happening across machines, do the following:
				113
				114	1. Make a small modification to your build to avoid hitting existing caches.
				115
				116	2. Run the build on the first machine:
				117
				118	bazel clean
ecngtng	b35944e	2021-10-18 02:33:09 -0700	[diff] [blame]	119	bazel ... build ... --execution_log_json_file=/tmp/exec1.json
Googler	2af641c	2018-11-26 03:01:41 -0800	[diff] [blame]	120
				121	3. Run the build on the second machine, ensuring the modification from step 1
				122	is included:
				123
				124	bazel clean
ecngtng	b35944e	2021-10-18 02:33:09 -0700	[diff] [blame]	125	bazel ... build ... --execution_log_json_file=/tmp/exec2.json
Googler	2af641c	2018-11-26 03:01:41 -0800	[diff] [blame]	126
				127	4. [Compare the execution logs](#comparing-the-execution-logs) for the two
				128	runs. If the logs are not identical, investigate your build configurations
				129	for discrepancies as well as properties from the host environment leaking
				130	into either of the builds.
				131
				132	## Comparing the execution logs
				133
				134	Execution logs contain records of all actions executed during the build. For
				135	each action there is a
				136	[SpawnExec](https://github.com/bazelbuild/bazel/blob/42389d9468a954f3793a19f8e026b022b39aefca/src/main/protobuf/spawn.proto#L67)
				137	element containing all of the information from the action key, Thus, if the
				138	logs are identical then so are the action cache keys.
				139
				140	To compare logs for two builds that are not sharing cache hits as expected,
				141	do the folowing:
				142
ecngtng	b35944e	2021-10-18 02:33:09 -0700	[diff] [blame]	143	1. Get the execution logs from each build and store them as `/tmp/exec1.json` and
				144	`/tmp/exec2.json`.
Googler	2af641c	2018-11-26 03:01:41 -0800	[diff] [blame]	145
ecngtng	b35944e	2021-10-18 02:33:09 -0700	[diff] [blame]	146	2. Use your favourite text differ to diff `/tmp/exec1.json` and
				147	`/tmp/exec2.json`.