# `Unpin` for C++ Types

SUMMARY: A C++ type is `Unpin` if it is trivially relocatable (e.g., a trivial
type, or a nontrivial type which is `[[clang::trivial_abi]]`), and is `final`.
Any such type can be used by value or by plain reference/pointer in interop; all
non-`Unpin` types must instead be used behind pinned pointers and references.

A C++ type `T` is `Unpin` (always safe to manipulate through `&mut T`) if it is
known to be a trivially relocatable type (move+destroy is logically
equivalent to `memcpy`+release) with insignificant padding (it does not
matter if the padding is included in that `memcpy`).

`Unpin` C++ types can be used like any other normal Rust type: they are always
safe to access by reference or by value. Non-`Unpin` types, in contrast, can
only be accessed behind pins such as `Pin<&mut T>` or `Pin<Box<T>>`, because
they may not be safe to mutate directly. These types are never used directly by
value in Rust, because value-like assignment has incorrect semantics: it fails
to run C++ special members for non-trivially-relocatable types, and it can
overwrite padding for types with significant padding.

## Trivially Relocatable Types

In C++, moving a value between locations in memory involves executing code to
either initialize (move-construct) or overwrite (move-assign) the new location.
The old location still exists, but is in a moved-from state, and must still be
destroyed to release resources.

(For example, `std::string x = std::move(y);` will run the move constructor, so
that `x` contains the same value that `y` used to have before the move. The
variable `y` will still be a valid string, but might be empty, or might contain
some garbage value. The destructors for both `x` and `y` will run when they go
out of scope.)

Rust does not have move constructors or move assignment. In fact, there is no
way to customize what happens during moving or assignment: in Rust, moving or
swapping an object means changing its location in memory, as if by `memcpy`
without running the destructor logic in the old location. Another way of looking
at it is that it's as if an object moved around in memory over time: it is
constructed in one place, and then further operations and eventual destruction
might happen in other places. We call such a Rust-like move a "trivial
relocation" operation.

Despite C++ moves using explicit construction and destruction calls, many C++
types could also have used the Rust movement model. We call such types
trivially relocatable types.

For example, a C++ `std::unique_ptr`, implemented in the obvious way, is
trivially relocatable: its actual location in memory does not matter. In
contrast, a self-referential type is not trivially relocatable, because to
relocate it, you must also update the pointer it has to itself. This is done
inside the move constructor in C++, but cannot be done in the Rust model, where
the move operation is not customizable.

For more background, see
[P1144](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1144r5.html).
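To make the self-referential case concrete, here is a minimal C++ sketch. The
names `SelfRef` and `BytewiseCopyWouldDangle` are hypothetical and exist only
for illustration. The sketch never performs a bytewise copy of the object; it
only compares the pointer that such a copy would have carried along.

```c++
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical self-referential type: `cursor_` always points at this
// object's own `value_` member.
class SelfRef {
 public:
  explicit SelfRef(int value) : value_(value), cursor_(&value_) {}

  // The move constructor re-points `cursor_` at the new object's storage.
  SelfRef(SelfRef&& other) noexcept : value_(other.value_), cursor_(&value_) {}

  bool PointsToSelf() const { return cursor_ == &value_; }
  const int* cursor() const { return cursor_; }

 private:
  int value_;
  int* cursor_;
};

// The user-provided move constructor makes the type non-trivially-copyable.
static_assert(!std::is_trivially_copyable<SelfRef>::value,
              "SelfRef has a nontrivial move constructor");

// Returns true if a bytewise copy of `a` into `b` would have left `b`
// pointing at `a`'s storage instead of its own.
inline bool BytewiseCopyWouldDangle() {
  SelfRef a(42);
  SelfRef b(std::move(a));   // C++ move: runs the fix-up logic
  assert(b.PointsToSelf());  // so the invariant holds in the new location
  // A Rust-style relocation would have copied a.cursor() verbatim.
  return a.cursor() != b.cursor();
}
```

Because the invariant is restored only inside the move constructor, a type like
this must stay behind a pin on the Rust side.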

### Which types are trivially relocatable?

For the purpose of Rust/C++ interop, we define a type to be trivially
relocatable if, and only if, it is "trivial for calls" in Clang. That is,
either:

1. It is actually
   [trivial](https://en.cppreference.com/w/cpp/named_req/TrivialType), or
2. It uses
   [`[[clang::trivial_abi]]`](https://clang.llvm.org/docs/AttributeReference.html#trivial-abi)
   to make itself trivial for calls.

This definition is conservative: some types that could be considered trivially
relocatable are not trivial for calls. (For example, `std::unique_ptr` uses
`[[clang::trivial_abi]]` only in the unstable libc++ ABI; the stable libc++ ABI
predates this attribute, and adding it now is ABI-breaking.)

This definition is, however, sound: all types which are trivial for calls are
trivially relocatable, because a type which is trivial for calls is
trivially-relocated when passed by value as a function argument.
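The two cases can be sketched in C++ as follows. `Point`, `Token`,
`ConsumeToken`, and the `SKETCH_TRIVIAL_ABI` macro are hypothetical names made
up for this example. Standard C++ has no trait for "trivial for calls", so the
static assertions use `std::is_trivially_copyable` only to show which types have
nontrivial special members; the attribute is Clang-specific and is compiled out
elsewhere.

```c++
#include <cassert>
#include <type_traits>
#include <utility>

#if defined(__clang__)
#define SKETCH_TRIVIAL_ABI [[clang::trivial_abi]]
#else
#define SKETCH_TRIVIAL_ABI
#endif

// Case 1: an actually trivial type. No attribute is needed.
struct Point {
  int x;
  int y;
};

// Case 2: a type with nontrivial special members that opts in to being
// trivial for calls (on Clang) through the attribute.
class SKETCH_TRIVIAL_ABI Token {
 public:
  explicit Token(int id) : id_(id) {}
  Token(Token&& other) noexcept : id_(other.id_) { other.id_ = -1; }
  Token& operator=(Token&& other) noexcept {
    id_ = other.id_;
    other.id_ = -1;
    return *this;
  }
  ~Token() { id_ = -1; }

  int id() const { return id_; }

 private:
  int id_;
};

static_assert(std::is_trivially_copyable<Point>::value,
              "Point has only trivial special members");
static_assert(!std::is_trivially_copyable<Token>::value,
              "Token has a nontrivial move constructor and destructor");

// Passing by value is the operation the soundness argument relies on: for a
// type that is trivial for calls, the argument is relocated into the callee.
inline int ConsumeToken(Token token) { return token.id(); }

inline void CheckTrivialForCallsSketch() {
  assert(ConsumeToken(Token(7)) == 7);
}
```

Whether `Token` is really passed in registers depends on the compiler and ABI,
so the sketch asserts only on observable values, not on calling convention.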

### Expanding trivial relocatability

We are working to extend libc++ and Clang to trivially relocate these types in
even more circumstances, which would make `[[clang::trivial_abi]]` more
compelling and more widely used, enhancing both performance and
Rust-compatibility for our C++ core libraries.

* [[clang] Mark `trivial_abi` types as "trivially relocatable".](https://reviews.llvm.org/D114732)
* [Use trivial relocation operations in std::vector, by porting D67524 and
  part of D61761 to work on top of the changes in
  D114732.](https://reviews.llvm.org/D119385)

A future change to C++ or Clang in the vein of
[P1144](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1144r5.html)
could make types trivially relocatable without requiring ABI changes as
`[[clang::trivial_abi]]` does, although in the short term this doesn't seem very
likely.

## Insignificant Padding

If a type has padding, then even if the type is trivially relocatable and
therefore safe to write as if by `memcpy`, **Rust will `memcpy` an incorrect
number of bytes**: Rust will include the padding, though C++ would not.
Trivially relocatable types where the padding potentially has semantic meaning
can still be handled by value, but are `!Unpin`, and all mutable references Rust
receives from C++ must be `Pin<&mut T>`. Only trivially relocatable types where
the padding has no significance can be `Unpin` and safe to deal with via `&mut`.

Significant padding occurs via inheritance -- derived types may reuse the
padding for other objects -- and from the `[[no_unique_address]]` attribute
(which declares the padding to be reusable).

For the purposes of C++/Rust interop, `[[no_unique_address]]` is an unsafe
feature, and any type which cannot be inherited from (via e.g. `final`) is
considered to have insignificant padding.
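Putting the two conditions together, the rule can be approximated with standard
type traits. This is only a sketch under stated assumptions: `LooksUnpin` and
the example types are hypothetical names, and `std::is_trivially_copyable` is a
stand-in for "trivial for calls", so the sketch ignores
`[[clang::trivial_abi]]`. It is not how any binding generator is implemented.

```c++
#include <type_traits>

// Hypothetical approximation of the rule: no nontrivial special members
// (stand-in for "trivially relocatable") and not inheritable (stand-in for
// "insignificant padding").
template <typename T>
struct LooksUnpin
    : std::integral_constant<bool, std::is_trivially_copyable<T>::value &&
                                       std::is_final<T>::value> {};

// Trivial and final: usable by value or through `&mut` on the Rust side.
struct Celsius final {
  double degrees;
};

// Trivial but not final: a reference to it might name a base class
// subobject whose padding is reused, so it would need to be pinned.
struct Shape {
  int id;
};

// Final but with a nontrivial destructor: not trivially relocatable under
// this approximation.
class Logger final {
 public:
  ~Logger() {}
};

static_assert(LooksUnpin<Celsius>::value, "trivial and final");
static_assert(!LooksUnpin<Shape>::value, "not final");
static_assert(!LooksUnpin<Logger>::value, "nontrivial destructor");
```

Under this approximation, marking `Shape` as `final` would be enough to make it
qualify.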

### When is padding significant?

In C++, if you take a mutable reference to a base class subobject and pass it
around, this is ultimately pretty safe. If you assign to it, it is a bit bad --
it will assign to only the base class subobject (if assignment is nonvirtual),
not the whole derived object -- but it's possible for this to make sense, and if
it were truly dangerous, the authors would probably have deleted assignment or
not inherited from the base class.

In Rust, this is extremely dangerous, because the size of the base class
subobject can extend to include fields from the derived class. For example, take
this class hierarchy:

```c++
#include <cstdint>

class Base {
  // Private members mean `Base` is not POD for the purpose of layout.
  int64_t x_;
  int32_t y_;
  /* ...methods... */
};

class Derived : public Base {
  int32_t size_;  // May be placed in the tail padding of `Base`.
  char* data_;
  /* ...methods... */
};
```

Here we have a class `Derived` with some string data, which inherits from
`Base`. But something unfortunate happens: because `Base` has an extra 32 bits
of tail padding, and is not POD for the purpose of layout, the `size_` member of
the derived class is stored inside the tail padding for `Base`. This is allowed
by the C++ standard, and actually taken advantage of in the Itanium ABI.

In C++, this presents no problems, as C++ assignment doesn't do something like
`memcpy sizeof(x) bytes`, even when the class is trivially assignable. It only
copies the real data size, excluding padding. And so this code will not
accidentally overwrite the `size_` field:

```c++
Derived& d = ...;
Base& b1 = d;
Base& b2 = ...;
std::swap(b1, b2);
```

But the seemingly equivalent Rust code absolutely will:

```rs
let d: &mut Derived = ...;
let b1: &mut Base = d.into();
let b2: &mut Base = ...;
// This overwrites size_ from the derived class with uninitialized memory from
// b2.
std::mem::swap(b1, b2); // Catastrophically bad.
```

As a consequence, types like `Base` should not be exposed as `&mut` references:
they might refer to a base class subobject, in which case assignment in Rust
will do the wrong thing. Even if they are trivially relocatable and assignment
is equivalent to a `memcpy`, Rust will `memcpy` the wrong number of bytes.
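The difference between the two kinds of write can be reproduced entirely in
C++. The sketch below uses hypothetical stand-ins (`Header`, `Record`,
`kSizeLivesInPadding`, and the helper functions) for the hierarchy above, and
models a Rust-style write as a `memcpy` of `sizeof(Header)` zero bytes. Whether
the derived member actually lands in the tail padding depends on the ABI, so the
sketch detects that from `sizeof` instead of assuming it.

```c++
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for the `Base`/`Derived` hierarchy.
class Header {
 public:
  Header(std::int64_t x, std::int32_t y) : x_(x), y_(y) {}
  std::int64_t x() const { return x_; }
  std::int32_t y() const { return y_; }

 private:
  std::int64_t x_;
  std::int32_t y_;
};

class Record : public Header {
 public:
  Record(std::int64_t x, std::int32_t y, std::int32_t size)
      : Header(x, y), size_(size) {}
  std::int32_t size() const { return size_; }

 private:
  std::int32_t size_;
};

// True when the ABI stored `size_` inside the tail padding of `Header`
// (expected on Itanium-style ABIs such as x86-64 Linux).
constexpr bool kSizeLivesInPadding = sizeof(Record) == sizeof(Header);

// C++ assignment through a base reference copies members, not padding.
inline std::int32_t SizeAfterAssignment() {
  Record record(1, 2, 99);
  Header& base = record;
  base = Header(3, 4);
  return record.size();  // `size_` is untouched
}

// A Rust-style write copies sizeof(Header) bytes, padding included.
inline std::int32_t SizeAfterBytewiseWrite() {
  Record record(1, 2, 99);
  unsigned char zeros[sizeof(Header)] = {};
  Header* base = &record;
  std::memcpy(static_cast<void*>(base), zeros, sizeof(Header));
  return record.size();  // clobbered if `size_` lives in the padding
}

inline void CheckPaddingSketch() {
  assert(SizeAfterAssignment() == 99);
  assert(SizeAfterBytewiseWrite() == (kSizeLivesInPadding ? 0 : 99));
}
```

The bytewise write is exactly the operation the C++ standard does not promise
to be safe on a base class subobject, which is why `&mut Base` cannot be handed
to Rust.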

### Gaps

#### `[[no_unique_address]]`

The exact same behavior can occur with `[[no_unique_address]]`. There are three
options:

1. Live with the unsafety of `[[no_unique_address]]`, and make it buyer beware.
   This is similar to how we treat packed struct fields.

2. Forbid `[[no_unique_address]]` in the C++ style guide, except for zero-sized
   types (which we can probably handle fine).

3. Switch approaches: rather than only allowing it for `final` classes and the
   like, only allow it for classes whose data size is guaranteed to be the same
   as their stride, possibly using something like a `[[pod_layout]]` attribute.

For now, we take approach #1: `[[no_unique_address]]` is considered an unsafe
feature, which can render padding significant on any type which has padding.
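As an illustration of why `final` alone does not close this gap, the following
sketch compares two layouts of the same members. All names (`Stamp`, `Plain`,
`Overlapped`, `kFieldSharesPadding`, and the `SKETCH_NO_UNIQUE_ADDRESS` macro)
are hypothetical. The attribute is applied only when compiling as C++20 or
later on a non-MSVC toolchain, and the outcome is ABI-dependent, so the sketch
asserts only that the attribute never makes the struct larger.

```c++
#include <cstdint>
#include <type_traits>

#if __cplusplus >= 202002L && !defined(_MSC_VER)
#define SKETCH_NO_UNIQUE_ADDRESS [[no_unique_address]]
#else
#define SKETCH_NO_UNIQUE_ADDRESS
#endif

// A `final` class with tail padding on typical 64-bit targets.
class Stamp final {
 public:
  std::int64_t time() const { return time_; }
  std::int32_t zone() const { return zone_; }

 private:
  std::int64_t time_ = 0;
  std::int32_t zone_ = 0;
};

// Ordinary member: `flags` is placed after all of `stamp`, padding included.
struct Plain {
  Stamp stamp;
  std::int32_t flags;
};

// With the attribute, `flags` may be placed inside `stamp`'s tail padding.
struct Overlapped {
  SKETCH_NO_UNIQUE_ADDRESS Stamp stamp;
  std::int32_t flags;
};

// True when the compiler actually reused the padding of `stamp`.
constexpr bool kFieldSharesPadding = sizeof(Overlapped) < sizeof(Plain);

static_assert(std::is_final<Stamp>::value, "Stamp cannot be inherited from");
static_assert(sizeof(Overlapped) <= sizeof(Plain),
              "the attribute never grows the enclosing struct");
```

When `kFieldSharesPadding` is true, a Rust-side write of `size_of::<Stamp>()`
bytes through `&mut Stamp` would overwrite `flags`, even though `Stamp` is
`final`.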

#### Lambdas

TODO: implement this.

Lambdas are class types, are not `final`, and cannot be marked `final`. Most
likely, we need to simply pretend that they are `final` -- it is not very useful
to inherit from a lambda, and this should not break people in practice.

### How common is this?

Only ~4% of classes at Google are base classes to some other type.

This means the number of classes that should be pinned due to potentially
significant padding is low, and the number of classes that should be marked
`final` is high. Mixed blessings: more boilerplate in C++, but less annoyance in
Rust, as the vast majority of classes can be marked `final` via a large-scale
change (LSC).

However, 4% doesn't quite seem small enough that we can pretend the issue
doesn't exist.