blob: a5ddfd222ee26f27671a5e04ce06de6f68675e19 [file] [log] [blame] [view]
# Static Analysis for C++ Lifetimes
[TOC]
Summary: We describe a static analysis that infers lifetimes in C++ function
signatures.
NOTE: This document describes the approach we are currently pursuing but it is
a) incomplete, and b) out of date. It has become clear that we are still making
changes to the static analysis frequently enough that it does not seem worth
updating a document in parallel with those changes. Once the static analysis
appears reasonably stable, we plan to update this document to describe it.
## Introduction
Lifetime analysis has two goals:
* Infer lifetime annotations to put in C++ function signatures, using the
attributes described in [this doc](lifetime_annotations_cpp.md).
* Verify lifetime-correctness of function bodies.
To infer and verify lifetimes, we perform a
[pointer analysis](https://en.wikipedia.org/wiki/Pointer_analysis)[^1]. For each
pointer or other reference-like type, a pointer analysis determines a points-to
set consisting of the storage locations it may point to.
There are different approaches to pointer analysis that can be classified
according to various properties. The pointer analysis we perform here has the
following properties:
* *Intraprocedural, context-insensitive*. We analyze each function
individually and do not take into account how it is called from different
callsites.
* *Array-insensitive*. We treat all elements in an array containing a
reference-like type as having the same lifetime.
* *Field-insensitive*. We treat member variables of reference-like type as
having the same lifetime as the object they are contained in (unless they
carry a lifetime annotation).
* *Flow-sensitive*. When analyzing a function, we take statement ordering and
control flow into account. We believe flow sensitivity is important to avoid
inferring overly restrictive lifetimes and emitting false positive errors.
The pointer analysis we perform is relatively coarse-grained in that we do not
distinguish between different storage locations with the same lifetime;
equivalently, we can say that we identify a storage location merely by its
lifetime.
A points-to set is therefore just a set of lifetimes; a reference-like object is
also simply identified by its lifetime. The state that is tracked during the
analysis is therefore just a mapping from a lifetime (identifying the
reference-like object) to a set of lifetimes (identifying the storage locations
it may point to).
This coarse-grained approach simplifies the analysis and is sufficient for our
purposes because we are only attempting to infer and verify statements about
lifetimes.
## Analysis of a translation unit
We analyze all functions in a translation unit for which we have a definition.
We attempt to analyze all of these functions in topological order so that
callees are analyzed before callers. Where recursion makes this impossible, we
analyze the functions that take part in the recursive cycle in arbitrary order.
We accept that this may make it impossible to infer lifetimes for functions in a
recursive cycle.
## Analysis of a function
As explained in the introduction, we identify an object (often called a storage
location in pointer analysis) merely by its lifetime.
A **points-to set** is therefore simply a set of lifetimes. It represents the
set of objects that a reference-like type or glvalue can be referencing at some
point of execution of the program. We will sometimes refer to the objects in a
points-to set as **pointees**.
We associate each local variable in the function with a different local
lifetime. This serves two purposes: a) It reflects the fact that local variables
do, in general, have different lifetimes, and this is important for lifetime
verification. b) It allows us to associate a different points-to set with
different local variables of reference-like type, and this is required to make
the analysis precise enough.
We perform a data-flow analysis using the
[Clang dataflow framework](https://github.com/llvm/llvm-project/tree/main/clang/include/clang/Analysis/FlowSensitive)
([documentation](https://clang.llvm.org/docs/DataFlowAnalysisIntro.html))
to propagate points-to sets through the function. After the analysis is
complete, we produce lifetime annotations from the points-to sets; if these
lifetime annotations are different from existing annotations (ignoring pure
renamings), we output the new annotations as suggested edits.
The data-flow analysis tracks the following state:
* For each reference-like object (identified by its lifetime), the points-to
set of that reference-like object
* For each expression of reference-like type, the points-to set of the
expression
* For each glvalue expression, the points-to set representing the glvalue’s
referent
* If the function’s return type is a reference or pointer type, a points-to
set for the return value
The join operation on points-to sets means taking the union of the two sets.
The initial state for the data flow analysis is produced as follows:
* Associate each parameter of reference-like type with a points-to set
containing a new unique regular lifetime representing the pointee.
* If the pointee is itself of reference-like type, recursively associate that
pointee with a points-to set containing a new regular lifetime, and so on.
During the analysis, we propagate points-to sets through expressions and update
the points-to sets of reference-like objects.
After the analysis is complete, we obtain lifetime annotations by examining the
points-to sets of all parameters of reference-like type and the return value (if
applicable), descending into pointees that are themselves of reference-like
type.
For every points-to set, we look at the set of lifetimes of its pointees. If
there are multiple lifetimes, they are substituted by a single lifetime. This
lifetime then becomes the lifetime of the corresponding reference or pointer
type in the signature.
Here are some examples:
```c++
void foo(int* from, int** to) {
// from_pointee (int): '1
// to_pointee (int *): '2
// to_pointee_pointee (int): '3
// from: { from_pointee }
// to: { to_pointee }
// to_pointee: { to_pointee_pointee }
*to = from;
// to_pointee: { from, to_pointee_pointee }
}
```
TODO: Explain. Also talk about why, after the assignment `*to = from`, we keep
`to_pointee_pointee` in the points-to set and how we can, in some cases,
eliminate it. (Distinguish between scalar and aggregate pointees -- the latter
are arrays, for example. We can only delete existing pointees if *to has a
single pointee and it's scalar.)
```c++
int* target(int* p1, int* p2) {
// p1_pointee (int): '1
// p1: { p1_pointee }
// p2_pointee (int): '2
// p2: { p2_pointee }
int** pp;
if (foo()) {
pp = &p1; // pp: { p1 }
} else {
pp = &p2; // pp: { p2 }
}
// pp: { p1, p2 }
int local = 42;
*pp = &local; // glvalue on left side is { p1, p2 }, so:
// p1: { p1_pointee, local }
// p2: { p2_pointee, local }
return p1; // rval: { p1_pointee, local }
}
```
TODO: Explain. Also mention how this is an example where we have two pointees on the left hand side, so we can't eliminate existing pointees from `p1` and `p2`.
### Function calls
Here is how we handle function calls:
1. **Create a mapping from callee lifetimes to points-to sets.** For each
variable lifetime that occurs in the callee's parameter list, find the union
of all points-to sets in those argument positions to yield a mapping from
lifetimes to points-to sets.
2. **Propagate points-to sets to output parameters.** For each lifetime `'l` in
an invariant argument position, replace the argument's existing points-to
set with the points-to set established for `'l` in Step 1.
3. **Step 3: Determine points-to set for the return value.** If the return
value is of reference-like type with lifetime `'l`, find the points-to set
established for `'l` in Step 1; this becomes the points-to set for the call
expression's value.
If the `'static` lifetime occurs in output parameters (i.e. in invariant
position) or in the return value, the callee may be returning references to
pointees that do not occur as inputs to the callee. Therefore, when we encounter
the 'static lifetime in these positions, we create new pointees for the
corresponding outputs.
Here is an example of how this works:
```c++
void copy_ptr(int *'x from, int *'x *'y to) {
*to = from;
}
int * get_lesser_of(int * arg1, int * arg2) {
// arg1_pointee (int): '1
// arg2_pointee (int): '2
// arg1: { arg1_pointee }
// arg2: { arg2_pointee }
int* result = arg2;
// result: { arg2_pointee }
if (*arg1 < *arg2) {
copy_ptr(arg1, &result);
// &result: { result }
// 'x pointees: { arg1_pointee, arg2_pointee }
// result: { arg1_pointee, arg2_pointee }
}
return result;
}
```
TODO: Continue exposition
### Virtual member functions
Inferring lifetimes for virtual member functions is complicated by two factors:
* The lifetimes of the base class member function
are constrained by the lifetimes of all of its overrides.
* The definitions of the overrides and the base class function (if it is not
pure virtual) are typically contained in different translation units, and we
plan to analyze each translation unit individually.
For more details, see [this section](lifetime_annotations_cpp.md#functions) in
the lifetime annotation specification.
We will describe an approach that can infer and update lifetimes for virtual
member functions progressively, as each translation unit is processed.
If a translation unit contains definitions for multiple overrides, or if it
contains the definition of the the base class function and at least one
override, we analyze these definitions in topological order from base class to
more derived class.
If the definitions are contained in different translation units, we effectively
process them in the same order because we analyze dependencies of a library
before analyzing the library itself, and libraries containing derived classes
generally depend on the library containing the base class.
TODO: The description above implicitly assumes we're talking about the initial
change where we add lifetimes across the codebase. Discuss also how this applies
when people are editing code.
When we encounter the definition of a virtual member function (whether it is the
base class implementation or an override), we first perform lifetime inference
on its implementation, as for any other function, and update the declaration of
the member function in its containing class.
If the function is an override, call it `Derived::f`, we then update the
lifetimes of every base class function `Base::f` that it overrides. (There may
be several if there is a chain of overrides.) We do so as follows:
* **If the declaration of `Base::f` does not yet contain any lifetime
annotations**, annotate it with the lifetimes of `Derived::f`. Because we
process base class functions before derived class functions, this case can
only occur if `Base::f` is pure virtual.
* **If the existing lifetimes of `Base::f` are more permissive than the
lifetimes inferred for `Derived::f`**, perform lifetime substitutions on the
lifetimes of `Base::f` until they are at most as permissive as those of
`Derived::f`.
* **If the existing lifetimes of `Base::f` at most as permissive as the
lifetimes inferred for `Derived::f`**, do nothing.
TODO: Can we ever get caught in a situation where neither the second nor the
third point above applies? I think we'll always be able to restrict the
lifetimes of `Base::f` until they're compatible with `Derived::f`, but this
needs a formal argument.
TODO: Discuss how the lifetime changes affect callers – may need to process them
again.
TODO: Show an example
### Templates
Templates pose a specific challenge to lifetime analysis:
* Reference-like types may occur in the template itself as well as in template
arguments and dependent types.
* For reference-like types that occur in the template, we wish to infer and
check lifetimes on the template itself to the greatest extent possible. This
reflects the fact that, even though C++ templates are not really generics,
they are often used as if they were. However, the semantics of C++ templates
pose two difficulties here:
* Templates may be specialized, and we must be careful not to apply the
lifetimes inferred on the primary template to the specialization.
* The inferred lifetimes and the lifetime correctness of a template may,
in general, depend on the template arguments, even if the template
arguments and dependent types do not contain any reference-like types.
We show an example of this below.
The lifetime annotation specification
[defines](lifetime_annotations_cpp.md#templates) what the semantics of lifetimes
on templates should be but does not say how they should be implemented. That is
the purpose of this section.
#### Example scenarios
Before we discuss generally how we will analyze templates, let us look at some
scenarios that may occur.
As an example of why we want to be able to analyze templates themselves, let’s
take a look at part of a simplified implementation of `std::vector`:
```c++
template <class T>
class vector {
public:
vector(const vector& other);
T* $a begin() $a { return data_; }
T* $a end() $a { return data_ + size_; }
private:
T* data_;
size_t size_;
};
```
We should be able to infer the lifetimes of `begin()` and `end()` from the
template itself. These member functions operate only on pointers to `T`, and the
lifetime behavior of a pointer to `T` is independent of the type `T` itself[^3].
On the other hand, we cannot infer the lifetimes of the copy constructor. It
calls the copy constructor of `T`, and as
[explained](lifetime_annotations_cpp.md#special-member-functions) in the
lifetime annotation specification, copy and move operations can have two
different lifetime signatures.
Here is another example of how lifetimes can depend on a template argument:
```c++
template <int i>
int* return_ith(int* i0, int* i1) {
if (i == 0) {
return i0;
} else {
return i1;
}
}
```
This example is contrived, but it is certainly not implausible that a trait
argument could affect behavior in a similar way.
While these examples do show the limitations of lifetime analysis on templates,
we likely won’t need to do anything subtle to detect them within the analysis.
In the case of the copy constructor of `vector`, we will notice when calling the
copy constructor of `T` that we’re doing member lookup on a dependent type and
that we can’t continue the analysis. In the case of `return_ith()`, we will be
able to analyze the function, but we will conclude that the lifetimes of all
pointers involved are the same. This is more restrictive than the result we
would obtain if we analyzed a template instantiation, but this limitation may be
acceptable.
#### General approach
The constraints described above imply that lifetime analysis of templates need
to proceed in two phases:
* **Analysis of the template itself**. We first attempt to infer lifetimes on
the template itself, as well as any partial or full specializations, to the
extent that the lifetimes do not depend on template arguments. If the
inferred lifetimes are different from the function’s current (possibly
elided) lifetimes, we generate a corresponding annotation. If we cannot
infer lifetimes for the function, we annotate all lifetimes on the function
as unsafe. This is required to distinguish this case from the situation
where we *were* able to infer lifetimes and those lifetimes are elided.
TODO: Is there any alternative to marking the lifetimes unsafe? This isn't
what we usually use unsafe lifetimes for, but I also don't really want to
invent yet another syntax.
Performing lifetime analysis on the template itself, rather than only on
instantiations, serves two purposes: a) It documents the lifetimes in the
code, and b) it saves us from having to analyze every instantiation in cases
where the lifetimes don’t depend on template arguments.
* **Analysis of template instantiations**. In the following situations, we
infer lifetimes on a function template instantiation or member function of a
class template instantiation that is called in the translation unit we are
analyzing:
* If the template itself contains reference-like types but does not
provide lifetimes for these.
* If the template arguments contain reference-like types.
We use the inferred lifetimes when performing lifetime analysis on the
callers of these functions, but we obviously cannot produce annotations for
these inferred lifetimes.
As discussed in the lifetime annotation specification, any lifetimes in a
template argument should be propagated to all uses of the argument. Clang does
not provide a built-in mechanism for this, so this needs to be done in the
lifetime analysis code.
### Verifying lifetime correctness
TODO
### Generating error messages
If we detect that there is a lifetime error – either because a function is
returning a reference to a local or because there is a lifetime error inside the
function – we want to produce an easily comprehensible error message that
explains the error.
TODO: Explain how
## Alternative considered
We previously considered an alternative approach that built a set of constraints
between lifetimes involved in the function. Unfortunately, this approach
produced wrong results on some fairly simple examples involving variable
overwrites. A coworker identified a way to extend the approach in a way that
overcame many of these limitations, but this extension introduced additional
complexity. In the end, we decided that the approach based on points-to-sets was
the simpler alternative.
## Differences from Rust
### The exclusivity rule
The borrow checker in Rust, in addition to checking lifetimes, also enforces the
exclusivity rule: at any given time the program may have either one mutable
reference or any number of immutable references to the same storage location.
The exclusivity rule protects against certain kinds of memory safety errors. For
example, if it was applied to C++, it would catch the use after free here:
```c++
int test() {
std::vector<int> xs;
xs.push_back(10);
const int &x0 = xs[0]; // `x0` borrows `xs` here.
xs.push_back(20); // exclusivity error: `xs` is mutably borrowed here,
// overlapping with the `x0` borrow.
return x0; // `xs` is borrowed by `x0` at least until here
// because `x0` is used here.
}
```
Most C++ iterator invalidation bugs could be prevented by enforcing exclusivity:
while there are outstanding iterators that borrow the container, the container
can't be mutated.
From our experience porting woff2 from C++ to Rust, adjusting existing code to
follow the exclusivity rule is one of the most difficult steps in porting.
Therefore, it makes sense to separate rolling out lifetime checking from
exclusivity checking. Lifetime checks without exclusivity checks don't guarantee
memory safety, but they catch memory safety issues on their own, and should not
require many adjustments to C++ code.
Exclusivity checking could be rolled out in an optional second step. This would
not only provide additional memory safety to the C++ code but would facilitate a
manual or automatic conversion of C++ code to Rust.
### Spatial memory safety
Lifetime verification does not establish spatial memory safety, that is, it does
not prove that all accesses are in bounds. Rust collections perform these checks
at runtime.
<!-- Footnotes themselves at the bottom. -->
## Notes
[^1]: The following documents provide more background material:
[doc 1](https://groups.seas.harvard.edu/courses/cs252/2011sp/slides/Lec06-PointerAnalysis.pdf),
[doc 2](https://yanniss.github.io/points-to-tutorial15.pdf).
[^2]: Unless converting constructors and conversion constructors are used to
simulate variance.
[^3]: Note, however, that Clang is currently very conservative in assigning
types to type-dependent expressions.