Inactivity leak under MaxEB (EIP-7251)

The beacon chain is designed to automatically recover finality if more than 1/3 of validators become permanently inactive (commonly referred to as surviving World War 3). Inactive validators are leaked to reduce their share of active stake and restore finality. To finalize again, the network must satisfy this inequality:

participating_stake / active_stake > 2/3

Assuming participating_stake is constant (no new deposits), the network needs to reduce active_stake. It can do so in two ways:

1) Forcefully ejecting non-participating stake (ejection condition)
2) Reducing the balance of active non-participating validators (inactivity leak)

To formalize 1), we must set a constant, EJECTION_BALANCE: a validator is ejected once its effective balance falls to this value or below. The issue with 1) is that ejections must go through the exit queue, and if more than 1/3 of validators need to be ejected, the exit queue will take months to clear. So in practice 2) dominates in restoring finality. Because of this, most offline validators will leak past the ejection balance; how far depends on what share of stake is permanently offline.
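
As a rough illustration of that last point, the sketch below solves the finality inequality (taking the threshold as 2/3) for the fraction of their balance that offline validators can retain. It is a simplification, not the simulation used later: it assumes no exits or deposits, equal starting balances of 32 ETH, and that online validators keep their full balance. The helper name `max_remaining_fraction` is hypothetical.

```python
from fractions import Fraction

def max_remaining_fraction(offline_share: Fraction) -> Fraction:
    """Upper bound on the fraction of balance offline validators may retain
    for participating_stake / active_stake to exceed 2/3.

    Hypothetical helper. Assumes no exits or deposits and that online
    validators keep their full balance.
    """
    if offline_share <= Fraction(1, 3):
        return Fraction(1)  # 2/3 or more of the stake is already participating
    online = 1 - offline_share
    # online / (online + r * offline) > 2/3  <=>  r < online / (2 * offline)
    return online / (2 * offline_share)

START_BALANCE_ETH = 32  # assumed equal starting balance
for pct in (40, 50, 60):
    r = max_remaining_fraction(Fraction(pct, 100))
    print(f"{pct}% offline: offline balances must leak below "
          f"{float(r * START_BALANCE_ETH):.2f} ETH")
```

Under these assumptions, with 50% of stake offline the threshold coincides with the 16 ETH ejection balance, and with 60% offline it sits at roughly 10.67 ETH, well below it.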

Inactivity leak epoch-by-epoch timeline

The ejection condition dictated by EJECTION_BALANCE can be misunderstood as a lower bound for validator balances. However, during an inactivity leak, most validators will leak well past EJECTION_BALANCE. Let's run through an example step by step to illustrate it.

Note: the charts below are created with a Jupyter Notebook implementing the Bellatrix spec with faster config variables. For detailed numbers refer to the next section which runs a full inactivity leak simulation.

The following plots track three key variables through an inactivity leak where 50% of the validators suddenly become offline.

🟥 Balance. Solid line: average of offline validators. Area: range of all values
🟦 Inactivity scores. Solid line: average of offline validators. Area: range of all values
🟩 Fraction of active validators

If finality is delayed for more than 4 epochs, the inactivity score of offline validators starts to increase linearly with time. Inactivity balance penalties are proportional to the inactivity score, so the balance decreases roughly quadratically with time.

simulation_example_0
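
The linear score growth and roughly quadratic balance loss can be reproduced with a few lines. This is a minimal sketch, not the notebook behind the charts: the constants are the Bellatrix values as far as I know them, only the inactivity penalty is applied (attestation penalties are ignored), and the actual balance stands in for the effective balance. The function name `leak_balance` is made up for illustration.

```python
# Constants believed to match the Bellatrix mainnet spec; treat as assumptions.
INACTIVITY_SCORE_BIAS = 4
INACTIVITY_PENALTY_QUOTIENT_BELLATRIX = 2**24
EPOCHS_PER_DAY = 225  # 32 slots of 12 seconds per epoch
GWEI = 10**9

def leak_balance(start_gwei: int, epochs: int) -> int:
    """Balance (Gwei) of a validator that is offline for `epochs` leak epochs.

    Simplifications: only the inactivity penalty is applied, and the actual
    balance is used in place of the effective balance.
    """
    balance, score = start_gwei, 0
    for _ in range(epochs):
        score += INACTIVITY_SCORE_BIAS  # score grows linearly while offline
        penalty = balance * score // (
            INACTIVITY_SCORE_BIAS * INACTIVITY_PENALTY_QUOTIENT_BELLATRIX
        )
        balance -= penalty  # penalty is proportional to the score
    return balance

for days in (1, 5, 10, 20):
    remaining = leak_balance(32 * GWEI, days * EPOCHS_PER_DAY)
    print(f"day {days:2d}: {remaining / GWEI:.3f} ETH")
```

Early in the leak, doubling the elapsed time roughly quadruples the loss, which is the quadratic behavior described above. Later the curve flattens because the penalty is taken on a shrinking balance.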

Eventually, the balance of inactive validators reaches EJECTION_BALANCE and those validators are all ejected at once. Note that this is an idealized example in which all validators start with the same balance. In a real scenario, there will be some offset between ejections. Also, validators will likely attempt to exit voluntarily as soon as possible, making exits more staggered. However, since this does not meaningfully alter the results, it is omitted for now.

simulation_example_1

Although all validators are ejected at once, the exit queue forces a constant rate of exits per unit of time. The validators at the start of the queue exit first, with a balance close to EJECTION_BALANCE. The rest, however, remain active and offline, so their inactivity penalties keep growing and they keep leaking balance.

simulation_example_2
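
To get a feel for how long the queue takes, here is a hedged back-of-the-envelope sketch. It assumes the pre-EIP-7251 churn rule with the mainnet constants as I understand them, and that all ejections are initiated in the same epoch, so the churn limit stays fixed at its value at that moment. The function names are illustrative.

```python
import math

# Constants believed to match the mainnet config; treat as assumptions.
MIN_PER_EPOCH_CHURN_LIMIT = 4
CHURN_LIMIT_QUOTIENT = 65536
EPOCHS_PER_DAY = 225

def churn_limit(active_validators: int) -> int:
    """Validators allowed to exit per epoch for a given active set size."""
    return max(MIN_PER_EPOCH_CHURN_LIMIT, active_validators // CHURN_LIMIT_QUOTIENT)

def days_to_exit(active_validators: int, to_exit: int) -> float:
    """Days for `to_exit` validators to clear the exit queue.

    Assumes every exit is initiated in the same epoch, so the churn limit is
    fixed at the value computed from `active_validators`.
    """
    epochs = math.ceil(to_exit / churn_limit(active_validators))
    return epochs / EPOCHS_PER_DAY

print(churn_limit(1_000_000))                      # exits per epoch
print(round(days_to_exit(1_000_000, 500_000), 1))  # days to drain the queue
```

With 1 million active validators and half of them ejected, this gives 15 exits per epoch and about 148 days to drain the queue, which is why the validators at the back of the queue leak so much further than those at the front.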

Eventually, thanks to the leak, the share of participating stake reaches 2/3 and the network finalizes. From then on, the inactivity scores of still-active validators decrease faster than they grew, but they take some time to reach 0. During this time, validators that have not yet exited keep leaking.

The result is an incentive to race to the exit, as the validators that can front-run the queue will suffer the smallest penalties. However, the majority of offline validators will end up with a balance far below EJECTION_BALANCE.

The continued leak post-finality is by design. If the network stopped leaking exactly when it reached 2/3 participation, it would be at risk of falling back into non-finality. This leaking "inertia" allows us to overshoot into higher participation regimes.
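
The length of this "inertia" can be estimated from the inactivity score update rule alone. The sketch below follows my reading of the Altair/Bellatrix `process_inactivity_updates` logic for a single validator; the constants are assumed mainnet values and the helper names are hypothetical.

```python
# Constants believed to match the mainnet config; treat as assumptions.
INACTIVITY_SCORE_BIAS = 4
INACTIVITY_SCORE_RECOVERY_RATE = 16

def next_score(score: int, participated: bool, in_leak: bool) -> int:
    """One epoch of the inactivity score update for a single validator."""
    if participated:
        score -= min(1, score)
    else:
        score += INACTIVITY_SCORE_BIAS
    if not in_leak:
        # Outside a leak, every validator's score also recovers
        score -= min(INACTIVITY_SCORE_RECOVERY_RATE, score)
    return score

def epochs_until_zero(score: int) -> int:
    """Epochs an offline validator keeps a non-zero score after the leak ends."""
    epochs = 0
    while score > 0:
        score = next_score(score, participated=False, in_leak=False)
        epochs += 1
    return epochs

leak_epochs = 4000                               # illustrative leak duration
score_at_finality = leak_epochs * INACTIVITY_SCORE_BIAS
print(epochs_until_zero(score_at_finality))      # epochs of continued leaking
```

An offline validator's score grows by 4 per epoch during the leak and falls by a net 12 per epoch afterwards, so under these assumptions it keeps paying inactivity penalties for about a third of the time the leak lasted.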

What's the effect of `EJECTION_BALANCE`?

EJECTION_BALANCE determines how early validators are placed in the exit queue during an inactivity leak. Does faster or slower ejection meaningfully affect finality recovery or bound validator losses?

The previous section shows an illustrative case of an inactivity leak. However, the Python spec is too slow to get exact numbers for a network of 1 million indexes: each run takes 10-30 minutes on moderate hardware 🫠. To iterate faster, I translated the simplified Bellatrix spec to a faster language and computed key metrics with different settings (source code).

To estimate the effects of the EJECTION_BALANCE constant, a new simulation was run to calculate two main variables as a function of inactive validator percent:

Inactivity Leak Stop - The time (in days) before the inactivity leak stops
Total Balance Burned - The total network-wide balance burned by the end of the simulation, which is reached when the maximum inactivity score of all active validators is 0

This simulation used Bellatrix constants, 1e6 initial active indexes, and an equal initial balance of 32 ETH for all validators. The results are shown in the following graphs:

inactivity_leak_stop

total_balance_burned

Source code and tabulated results

Except for impractically high values, EJECTION_BALANCE does not significantly influence our two main variables. In the simulated scenario, with most validators' initial balances at 32 ETH (mainnet today), the difference between the current ejection condition (EJECTION_BALANCE = 16 ETH) and no ejection condition at all (EJECTION_BALANCE = 0 ETH) is minimal.

Ejection condition under MaxEB (EIP-7251)

EIP-7251 increases MAX_EFFECTIVE_BALANCE, which extends the range of possible active balances a validator can have. Since genesis, the beacon chain has targeted an active balance range for all validators of between 16 ETH (EJECTION_BALANCE) and 32 ETH (MAX_EFFECTIVE_BALANCE).

balance_range_deneb

As we have seen before, EJECTION_BALANCE does not contribute meaningfully to finality recovery. So, why was it added in the first place? The beacon chain relies on validators having a sufficiently uniform balance to ensure that committees are majority honest. Since committee selection is not balance-based, the lower bound of 16 ETH ensures that a random selection of indexes is majority honest with very high probability.

With EIP-7251, the active balance range is extended to between 16 ETH (EJECTION_BALANCE) and 2048 ETH (MAX_EFFECTIVE_BALANCE_EIP7251).

balance_range_maxeb

Let's explore the implications of this extended range:

Does it affect finality recovery?

No. As we have seen above, reducing the ejection balance does not meaningfully affect the time to recovery. Increasing the active balance upper bound is conceptually equivalent to reducing the lower bound, i.e. reducing the ejection balance:

balance_range_deneb_low_ejection

Does it increase the balance leaked during non-finality?

During non-finality, the average balance leaked is not a function of ejection balance, so the % of leaked balance will remain roughly the same as today. Refer to the previous section for exact numbers.

Does it increase the time to eject offline validators during finality?

Yes. With EIP-7251, an offline validator can theoretically lose more balance during timely finality before being ejected. However, the penalties for being offline during finality are so small that it will take a validator decades to reach EJECTION_BALANCE, even starting from 32 ETH.

balance_offline_timely_finality
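
The order of magnitude can be checked with a rough estimate. This sketch is not the model behind the chart: it assumes a constant total active balance of 32M ETH (1e6 validators at 32 ETH), counts only the missed source and target attestation penalties with the Altair weights as I understand them, ignores sync committee penalties and hysteresis, and applies penalties one day at a time. The function name is hypothetical.

```python
from math import isqrt

# Constants believed to match the Altair/Bellatrix spec; treat as assumptions.
GWEI = 10**9
EFFECTIVE_BALANCE_INCREMENT = GWEI
BASE_REWARD_FACTOR = 64
TIMELY_SOURCE_WEIGHT = 14
TIMELY_TARGET_WEIGHT = 26
WEIGHT_DENOMINATOR = 64
EPOCHS_PER_DAY = 225

def years_offline_until(start_eth: int, target_eth: int, total_active_eth: int) -> float:
    """Years an offline validator takes to drop from start_eth to target_eth
    while the chain keeps finalizing (no inactivity leak).

    Simplifications: constant total active balance, effective balance taken as
    the balance rounded down to whole ETH, penalties applied in daily steps.
    """
    if target_eth < 1:
        raise ValueError("target_eth must be at least 1 (penalties reach 0 below 1 ETH)")
    per_increment = EFFECTIVE_BALANCE_INCREMENT * BASE_REWARD_FACTOR // isqrt(total_active_eth * GWEI)
    balance, target, days = start_eth * GWEI, target_eth * GWEI, 0
    while balance > target:
        base_reward = (balance // EFFECTIVE_BALANCE_INCREMENT) * per_increment
        # Missed source and target votes are penalized; a missed head vote is not
        penalty = (base_reward * TIMELY_SOURCE_WEIGHT // WEIGHT_DENOMINATOR
                   + base_reward * TIMELY_TARGET_WEIGHT // WEIGHT_DENOMINATOR)
        balance -= penalty * EPOCHS_PER_DAY
        days += 1
    return days / 365

print(round(years_offline_until(32, 16, 32_000_000), 1))
```

Under these assumptions the result is on the order of decades. A smaller total active balance raises the per-epoch penalty and shortens the time, so the figure should be read as an estimate only.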

Waiting 30 or 50 years to be ejected is a poor option. Instead, the owner of a perpetually offline validator (due to key loss, for example) should use EIP-7002 (execution layer triggerable exits) to exit the validator, suffering far smaller losses than by waiting to hit any safe EJECTION_BALANCE value. If chronically offline validators become a network-wide issue, there is plenty of time (years) to design and ship a solution.

Summary

Should the ejection mechanism be modified to accommodate EIP-7251?

We have established that:

EJECTION_BALANCE does not meaningfully contribute to finality recovery
EJECTION_BALANCE is not useful to clean up perpetually offline validators
With MaxEB, the network can handle wide ranges of active balances

Therefore, we should do nothing. Leave the parameter at EJECTION_BALANCE = 16 ETH, ship EIP-7002, and promote client diversity so stakers don't have to worry about the inactivity leak.