Summary: We describe a static analysis that infers lifetimes in C++ function signatures.
NOTE: This document describes the approach we are currently pursuing but it is a) incomplete, and b) out of date. It has become clear that we are still making changes to the static analysis frequently enough that it does not seem worth updating a document in parallel with those changes. Once the static analysis appears reasonably stable, we plan to update this document to describe it.
Lifetime analysis has two goals:
To infer and verify lifetimes, we perform a pointer analysis[^1]. For each pointer or other reference-like type, a pointer analysis determines a points-to set consisting of the storage locations it may point to.
There are different approaches to pointer analysis that can be classified according to various properties. The pointer analysis we perform here has the following properties:
The pointer analysis we perform is relatively coarse-grained in that we do not distinguish between different storage locations with the same lifetime; equivalently, we can say that we identify a storage location merely by its lifetime.
A points-to set is therefore just a set of lifetimes; a reference-like object is also simply identified by its lifetime. The state that is tracked during the analysis is therefore just a mapping from a lifetime (identifying the reference-like object) to a set of lifetimes (identifying the storage locations it may point to).
This coarse-grained approach simplifies the analysis and is sufficient for our purposes because we are only attempting to infer and verify statements about lifetimes.
We analyze all functions in a translation unit for which we have a definition.
We attempt to analyze all of these functions in topological order so that callees are analyzed before callers. Where recursion makes this impossible, we analyze the functions that take part in the recursive cycle in arbitrary order. We accept that this may make it impossible to infer lifetimes for functions in a recursive cycle.
As explained in the introduction, we identify an object (often called a storage location in pointer analysis) merely by its lifetime.
A points-to set is therefore simply a set of lifetimes. It represents the set of objects that a reference-like type or glvalue can be referencing at some point of execution of the program. We will sometimes refer to the objects in a points-to set as pointees.
We associate each local variable in the function with a different local lifetime. This serves two purposes: a) It reflects the fact that local variables do, in general, have different lifetimes, and this is important for lifetime verification. b) It allows us to associate a different points-to set with different local variables of reference-like type, and this is required to make the analysis precise enough.
We perform a data-flow analysis using the Clang dataflow framework (documentation) to propagate points-to sets through the function. After the analysis is complete, we produce lifetime annotations from the points-to sets; if these lifetime annotations are different from existing annotations (ignoring pure renamings), we output the new annotations as suggested edits.
The data-flow analysis tracks the following state:
The join operation on points-to sets means taking the union of the two sets.
The initial state for the data flow analysis is produced as follows:
During the analysis, we propagate points-to sets through expressions and update the points-to sets of reference-like objects.
After the analysis is complete, we obtain lifetime annotations by examining the points-to sets of all parameters of reference-like type and the return value (if applicable), descending into pointees that are themselves of reference-like type.
For every points-to set, we look at the set of lifetimes of its pointees. If there are multiple lifetimes, they are substituted by a single lifetime. This lifetime then becomes the lifetime of the corresponding reference or pointer type in the signature.
Here are some examples:
void foo(int* from, int** to) { // from_pointee (int): '1 // to_pointee (int *): '2 // to_pointee_pointee (int): '3 // from: { from_pointee } // to: { to_pointee } // to_pointee: { to_pointee_pointee } *to = from; // to_pointee: { from, to_pointee_pointee } }
TODO: Explain. Also talk about why, after the assignment *to = from
, we keep to_pointee_pointee
in the points-to set and how we can, in some cases, eliminate it. (Distinguish between scalar and aggregate pointees -- the latter are arrays, for example. We can only delete existing pointees if *to has a single pointee and it's scalar.)
int* target(int* p1, int* p2) { // p1_pointee (int): '1 // p1: { p1_pointee } // p2_pointee (int): '2 // p2: { p2_pointee } int** pp; if (foo()) { pp = &p1; // pp: { p1 } } else { pp = &p2; // pp: { p2 } } // pp: { p1, p2 } int local = 42; *pp = &local; // glvalue on left side is { p1, p2 }, so: // p1: { p1_pointee, local } // p2: { p2_pointee, local } return p1; // rval: { p1_pointee, local } }
TODO: Explain. Also mention how this is an example where we have two pointees on the left hand side, so we can't eliminate existing pointees from p1
and p2
.
Here is how we handle function calls:
'l
in an invariant argument position, replace the argument's existing points-to set with the points-to set established for 'l
in Step 1.'l
, find the points-to set established for 'l
in Step 1; this becomes the points-to set for the call expression's value.If the 'static
lifetime occurs in output parameters (i.e. in invariant position) or in the return value, the callee may be returning references to pointees that do not occur as inputs to the callee. Therefore, when we encounter the 'static lifetime in these positions, we create new pointees for the corresponding outputs.
Here is an example of how this works:
void copy_ptr(int *'x from, int *'x *'y to) { *to = from; } int * get_lesser_of(int * arg1, int * arg2) { // arg1_pointee (int): '1 // arg2_pointee (int): '2 // arg1: { arg1_pointee } // arg2: { arg2_pointee } int* result = arg2; // result: { arg2_pointee } if (*arg1 < *arg2) { copy_ptr(arg1, &result); // &result: { result } // 'x pointees: { arg1_pointee, arg2_pointee } // result: { arg1_pointee, arg2_pointee } } return result; }
TODO: Continue exposition
Inferring lifetimes for virtual member functions is complicated by two factors:
For more details, see this section in the lifetime annotation specification.
We will describe an approach that can infer and update lifetimes for virtual member functions progressively, as each translation unit is processed.
If a translation unit contains definitions for multiple overrides, or if it contains the definition of the the base class function and at least one override, we analyze these definitions in topological order from base class to more derived class.
If the definitions are contained in different translation units, we effectively process them in the same order because we analyze dependencies of a library before analyzing the library itself, and libraries containing derived classes generally depend on the library containing the base class.
TODO: The description above implicitly assumes we're talking about the initial change where we add lifetimes across the codebase. Discuss also how this applies when people are editing code.
When we encounter the definition of a virtual member function (whether it is the base class implementation or an override), we first perform lifetime inference on its implementation, as for any other function, and update the declaration of the member function in its containing class.
If the function is an override, call it Derived::f
, we then update the lifetimes of every base class function Base::f
that it overrides. (There may be several if there is a chain of overrides.) We do so as follows:
Base::f
does not yet contain any lifetime annotations, annotate it with the lifetimes of Derived::f
. Because we process base class functions before derived class functions, this case can only occur if Base::f
is pure virtual.Base::f
are more permissive than the lifetimes inferred for Derived::f
, perform lifetime substitutions on the lifetimes of Base::f
until they are at most as permissive as those of Derived::f
.Base::f
at most as permissive as the lifetimes inferred for Derived::f
, do nothing.TODO: Can we ever get caught in a situation where neither the second nor the third point above applies? I think we‘ll always be able to restrict the lifetimes of Base::f
until they’re compatible with Derived::f
, but this needs a formal argument.
TODO: Discuss how the lifetime changes affect callers – may need to process them again.
TODO: Show an example
Templates pose a specific challenge to lifetime analysis:
The lifetime annotation specification defines what the semantics of lifetimes on templates should be but does not say how they should be implemented. That is the purpose of this section.
Before we discuss generally how we will analyze templates, let us look at some scenarios that may occur.
As an example of why we want to be able to analyze templates themselves, let’s take a look at part of a simplified implementation of std::vector
:
template <class T> class vector { public: vector(const vector& other); T* $a begin() $a { return data_; } T* $a end() $a { return data_ + size_; } private: T* data_; size_t size_; };
We should be able to infer the lifetimes of begin()
and end()
from the template itself. These member functions operate only on pointers to T
, and the lifetime behavior of a pointer to T
is independent of the type T
itself[^3].
On the other hand, we cannot infer the lifetimes of the copy constructor. It calls the copy constructor of T
, and as explained in the lifetime annotation specification, copy and move operations can have two different lifetime signatures.
Here is another example of how lifetimes can depend on a template argument:
template <int i> int* return_ith(int* i0, int* i1) { if (i == 0) { return i0; } else { return i1; } }
This example is contrived, but it is certainly not implausible that a trait argument could affect behavior in a similar way.
While these examples do show the limitations of lifetime analysis on templates, we likely won’t need to do anything subtle to detect them within the analysis. In the case of the copy constructor of vector
, we will notice when calling the copy constructor of T
that we’re doing member lookup on a dependent type and that we can’t continue the analysis. In the case of return_ith()
, we will be able to analyze the function, but we will conclude that the lifetimes of all pointers involved are the same. This is more restrictive than the result we would obtain if we analyzed a template instantiation, but this limitation may be acceptable.
The constraints described above imply that lifetime analysis of templates need to proceed in two phases:
Analysis of the template itself. We first attempt to infer lifetimes on the template itself, as well as any partial or full specializations, to the extent that the lifetimes do not depend on template arguments. If the inferred lifetimes are different from the function’s current (possibly elided) lifetimes, we generate a corresponding annotation. If we cannot infer lifetimes for the function, we annotate all lifetimes on the function as unsafe. This is required to distinguish this case from the situation where we were able to infer lifetimes and those lifetimes are elided.
TODO: Is there any alternative to marking the lifetimes unsafe? This isn‘t what we usually use unsafe lifetimes for, but I also don’t really want to invent yet another syntax.
Performing lifetime analysis on the template itself, rather than only on instantiations, serves two purposes: a) It documents the lifetimes in the code, and b) it saves us from having to analyze every instantiation in cases where the lifetimes don’t depend on template arguments.
Analysis of template instantiations. In the following situations, we infer lifetimes on a function template instantiation or member function of a class template instantiation that is called in the translation unit we are analyzing:
We use the inferred lifetimes when performing lifetime analysis on the callers of these functions, but we obviously cannot produce annotations for these inferred lifetimes.
As discussed in the lifetime annotation specification, any lifetimes in a template argument should be propagated to all uses of the argument. Clang does not provide a built-in mechanism for this, so this needs to be done in the lifetime analysis code.
TODO
If we detect that there is a lifetime error – either because a function is returning a reference to a local or because there is a lifetime error inside the function – we want to produce an easily comprehensible error message that explains the error.
TODO: Explain how
We previously considered an alternative approach that built a set of constraints between lifetimes involved in the function. Unfortunately, this approach produced wrong results on some fairly simple examples involving variable overwrites. A coworker identified a way to extend the approach in a way that overcame many of these limitations, but this extension introduced additional complexity. In the end, we decided that the approach based on points-to-sets was the simpler alternative.
The borrow checker in Rust, in addition to checking lifetimes, also enforces the exclusivity rule: at any given time the program may have either one mutable reference or any number of immutable references to the same storage location.
The exclusivity rule protects against certain kinds of memory safety errors. For example, if it was applied to C++, it would catch the use after free here:
int test() { std::vector<int> xs; xs.push_back(10); const int &x0 = xs[0]; // `x0` borrows `xs` here. xs.push_back(20); // exclusivity error: `xs` is mutably borrowed here, // overlapping with the `x0` borrow. return x0; // `xs` is borrowed by `x0` at least until here // because `x0` is used here. }
Most C++ iterator invalidation bugs could be prevented by enforcing exclusivity: while there are outstanding iterators that borrow the container, the container can't be mutated.
From our experience porting woff2 from C++ to Rust, adjusting existing code to follow the exclusivity rule is one of the most difficult steps in porting. Therefore, it makes sense to separate rolling out lifetime checking from exclusivity checking. Lifetime checks without exclusivity checks don't guarantee memory safety, but they catch memory safety issues on their own, and should not require many adjustments to C++ code.
Exclusivity checking could be rolled out in an optional second step. This would not only provide additional memory safety to the C++ code but would facilitate a manual or automatic conversion of C++ code to Rust.
Lifetime verification does not establish spatial memory safety, that is, it does not prove that all accesses are in bounds. Rust collections perform these checks at runtime.
[^1]: The following documents provide more background material: doc 1, doc 2. [^2]: Unless converting constructors and conversion constructors are used to simulate variance. [^3]: Note, however, that Clang is currently very conservative in assigning types to type-dependent expressions.