Eth2.0 Staking - Failover & Dangers of Redundancy


Validators and Validator Clients

A validator is a construct on the Beacon Chain that represents a single consensus participant. The validator is identified by its public key, and is assigned duties to perform towards validating the Beacon Chain. Performing these duties successfully leads to the validator being rewarded, whereas failing to perform them leads to penalties (in addition to the loss of potential rewards), and performing them in a way that violates the consensus rules leads to punitive slashing. In Phase 0, duties are limited to proposing blocks and making attestations on the Beacon Chain.

Validators use a piece of software called a validator client (VC) for performing their validator duties[1]. All Eth2.0 client teams have this component as a part of their Eth2.0 client suite.

Needless to say, a validator will want to have an active validator client at all times, so that it gains maximum rewards and avoids any penalties. The usual methods for achieving high uptime are redundancy & failover. This post will explore the dangers of redundancy, and a suggested failover protocol for Eth2.0 VCs.

Types of Losses/Penalties

There are 3 types of losses that a validator can experience, listed here in increasing order of severity:

  • Penalty:
    When a validator fails to perform a duty that it has been assigned, it is issued a penalty. The stake of the validator is diminished by the reward amount for the missed duty.
  • Inactivity Leak:
    When a validator fails to make attestations while finalization on the beacon chain has stalled[2], the stake of the validator is issued an "inactivity penalty".
  • Slashing Penalty:
    When a validator produces messages that violate the Casper FFG rules, the validator is immediately slashed - the validator's stake is reduced by a small fraction[3] and the validator is removed from participating in consensus (and hence loses out on all future rewards). Moreover, until the ability to transfer or withdraw staked ETH is introduced, the validator's stake will remain locked & unusable on the Beacon Chain.

The first two types of losses can be experienced if the validator's VC goes offline. The third type - slashing penalty - can only be experienced as a result of an incorrectly set up validator client (or an explicit attack on Eth2.0 consensus).

In the context of failovers & redundancy, the main priority should be to prevent slashing penalties at all costs, but also to improve uptime to reduce lost rewards and potential inactivity leaks.

Dangers of Redundant Validator Clients

Some stakers may think that running redundant active VCs insures against the failure of some of the VCs. But in fact, running redundant VCs is unsafe, and will almost certainly lead to the validator being slashed!

Let's look at a practical case where this happens:

Redundant VCs lead to slashings!

The validator is running two Eth2.0 client instances, C1 and C2, both with active VCs. Unstable network conditions (peering/connectivity issues, message propagation delay, partitioning, etc.) have caused the two client instances to disagree on the canonical chain. C1 and C2 see B1 and B2 respectively as the current head checkpoint (head checkpoint, not head block - this kind of head checkpoint discrepancy can happen without malicious behavior from any validator). When the validator executes its attestation duty, each client instance makes a vote with the target in the chain that it deems canonical. This results in two attestations that vote for different target checkpoints at the same height. This is a double vote, which violates the Casper FFG rules and leads to the validator being slashed.
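The check that catches this scenario can be sketched in a few lines of Python. This is a simplified illustration, not client code: the `AttestationData` class keeps only the fields that the Casper FFG slashing conditions inspect, and the names and example epochs are assumptions made for this post. Besides the double vote described above, Casper FFG also forbids surround votes, so the sketch checks both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttestationData:
    """Simplified attestation: only the fields the slashing rules look at."""
    source_epoch: int  # epoch of the source (justified) checkpoint
    target_epoch: int  # epoch ("height") of the target checkpoint
    target_root: str   # block root of the target checkpoint


def is_double_vote(a: AttestationData, b: AttestationData) -> bool:
    # Two different votes with the same target epoch.
    return a != b and a.target_epoch == b.target_epoch


def surrounds(a: AttestationData, b: AttestationData) -> bool:
    # Vote a surrounds vote b: a's source is older and a's target is newer.
    return a.source_epoch < b.source_epoch and b.target_epoch < a.target_epoch


def is_slashable_pair(a: AttestationData, b: AttestationData) -> bool:
    return is_double_vote(a, b) or surrounds(a, b) or surrounds(b, a)


# C1 and C2 attest in the same epoch but disagree on the target checkpoint.
vote_c1 = AttestationData(source_epoch=9, target_epoch=10, target_root="B1")
vote_c2 = AttestationData(source_epoch=9, target_epoch=10, target_root="B2")
print(is_slashable_pair(vote_c1, vote_c2))  # True
```

Note that neither VC did anything wrong in isolation. The pair only becomes slashable because both votes carry the same validator's signature.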

Failover Protocol for Validator Clients

Note: By "failover", I mean manually or automatically starting a new VC instance after the earlier one has stopped. However, I do not recommend automatic failover mechanisms because of the high risk of a faulty implementation, which can lead to two active VCs!

The take-away from the previous section is that the validator should only have one active validator client instance at any given time. But what happens when this instance fails? Stakers should plan for this situation in advance by defining their own failover protocol for starting the new validator client instance. Before making a safe & fool-proof failover protocol, let's explore an in-built safety feature of validator clients: the slashing protection mechanism.

Slashing Protection Mechanism

VC software from all Eth2.0 client teams has a slashing protection mechanism that serves as a fail-safe against any unexpected behavior. According to the slashing rules, only a pair of attestations/blocks can cause a slashing, and this can be checked by simply inspecting the pair of messages. The VC stores the set of all attestations and blocks that it has previously signed in a slashing protection database. Before signing any new attestation/block, the VC checks this new message against the entries in the slashing protection database, and signs the new message only if it does not produce a slashable pair.
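As a rough illustration of this check-before-sign flow, here is a minimal in-memory sketch in Python. The class and method names are hypothetical and do not correspond to any client's implementation; a real VC keeps this data on disk and must persist the new record before releasing the signature.

```python
from dataclasses import dataclass, field


@dataclass
class SlashingProtectionDB:
    # Each entry: (source_epoch, target_epoch, signing_root)
    attestations: list = field(default_factory=list)
    # slot -> signing_root of the block already signed for that slot
    blocks: dict = field(default_factory=dict)

    def attestation_is_safe(self, source: int, target: int, root: str) -> bool:
        for s, t, r in self.attestations:
            if t == target and r != root:
                return False  # double vote
            if s < source and target < t:
                return False  # an earlier vote surrounds the new one
            if source < s and t < target:
                return False  # the new vote surrounds an earlier one
        return True

    def block_is_safe(self, slot: int, root: str) -> bool:
        # Unsafe only if a different block was already signed for this slot.
        return self.blocks.get(slot, root) == root

    def sign_attestation(self, source: int, target: int, root: str) -> bool:
        if not self.attestation_is_safe(source, target, root):
            return False  # refuse to sign
        if (source, target, root) not in self.attestations:
            self.attestations.append((source, target, root))
        return True  # a real VC would produce the signature at this point

    def sign_block(self, slot: int, root: str) -> bool:
        if not self.block_is_safe(slot, root):
            return False  # refuse to sign
        self.blocks[slot] = root
        return True


db = SlashingProtectionDB()
print(db.sign_attestation(9, 10, "0xaa"))   # True: first vote for epoch 10
print(db.sign_attestation(9, 10, "0xbb"))   # False: double vote
print(db.sign_attestation(8, 11, "0xcc"))   # False: surrounds the first vote
print(db.sign_attestation(10, 11, "0xdd"))  # True
```

The check is only as good as the database behind it. Two VCs with separate databases each see a clean history, which is why this mechanism does not make redundant active VCs safe.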

Hence, VCs require 3 items for correct & safe operation:

  • fully synced Beacon Node (BN), to get information about the Beacon Chain
  • validator signing private key, to actually sign the messages
  • up-to-date slashing protection database, for the fail-safe slashing protection mechanism

Any good failover protocol should account for scenarios where any of these items are lost. The first two are easy to keep available - redundant BNs can be maintained so that the new VC can connect to one easily, and the signing key is a read-only file that can be copied from a backup location.

The last item - the up-to-date slashing protection database - is a big challenge to back up & keep available! There are many failure scenarios in which the database is lost completely: filesystem corruption, disk failure, hardware loss due to disaster, etc. Data backup & availability is a multi-billion dollar problem that has many existing solutions - block & file-level mirroring for backups, RAID for availability, etc. However, there's a simple trick we can use to rebuild the slashing protection database in the event of its loss! The database can be rebuilt in the minimal format, which does not require a complete history of all signed messages. A utility for this type of rebuild can be found in the adiasg/eth2-slashing-protection-rebuild repository.
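To see why a complete history is not needed, consider the following Python sketch of a watermark-style record. It is an illustration of the idea only, not the code or the exact method of the repository mentioned above; the class, the function and the choice of watermark values are assumptions. It relies on the old VC being definitely stopped and on the validator never having signed anything beyond the current epoch.

```python
from dataclasses import dataclass

SLOTS_PER_EPOCH = 32  # Phase 0 spec constant


@dataclass
class MinimalProtection:
    """Watermarks that stand in for the full history of signed messages."""
    max_source_epoch: int
    max_target_epoch: int
    max_block_slot: int

    def attestation_is_safe(self, source_epoch: int, target_epoch: int) -> bool:
        # Source must not go below the watermark (rules out surround votes),
        # and target must be strictly above it (rules out double votes).
        return (source_epoch >= self.max_source_epoch
                and target_epoch > self.max_target_epoch)

    def block_is_safe(self, slot: int) -> bool:
        # Only propose in slots strictly after the watermark.
        return slot > self.max_block_slot


def rebuild_from_current_epoch(current_epoch: int) -> MinimalProtection:
    # Hypothetical rebuild: treat everything up to the end of the current
    # epoch as "possibly already signed" and refuse to touch it again.
    last_slot_of_epoch = (current_epoch + 1) * SLOTS_PER_EPOCH - 1
    return MinimalProtection(current_epoch, current_epoch, last_slot_of_epoch)


protection = rebuild_from_current_epoch(100)
print(protection.attestation_is_safe(99, 101))   # False: source below watermark
print(protection.attestation_is_safe(100, 101))  # True
print(protection.block_is_safe(3231))            # False: slot is in epoch 100
print(protection.block_is_safe(3232))            # True: first slot of epoch 101
```

The trade-off is some downtime: with these watermarks the validator cannot attest until the chain's source checkpoint has caught up to the epoch at which the record was rebuilt. A few missed attestations are far cheaper than a slashing.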

Note: SD cards are unreliable storage devices, and stakers running their VC on a Raspberry Pi are especially prone to losing their slashing protection database!

Suggested Failover Protocol

A simple-but-effective setup can be achieved by maintaining a redundant BN, and keeping backup validator keys easily available.

Initial VC Setup & Failover in Various Scenarios

The various failures can be handled independently in this setup:

  • Failure of BN - Shift the VC to the redundant BN
  • Loss of Validator Keyfile - Copy from the backup keyfile in cold storage
  • Loss of Slashing Protection Database - Rebuild the DB or restore from a real-time backup

Secret-Shared Validators

(i.e., The Correct Way to Achieve Resilient Validators)

The ideal resilient setup can be achieved through secret-shared validators - a topic for a separate post. More information can be found in this online talk:

  1. This distinction between validator and validator client is also described by Jim McDonald in this blog. ↩︎

  2. As per current spec parameters, the inactivity penalties kick in if there is no finalized checkpoint in the last MIN_EPOCHS_TO_INACTIVITY_PENALTY = 4 epochs. ↩︎

  3. As per current spec parameters, this fraction is equal to 1/MIN_SLASHING_PENALTY_QUOTIENT = 1/128 of the effective balance of the validator. This amounts to 0.25 ETH if the validator has a balance of 32 ETH. ↩︎