ADR 004: Keyshare Backups


Changelog

  • 2022-07-15

Status

Accepted


Context

Vaults containing all network funds are composed of keyshares generated by the member nodes of an Asgard at each churn interval and stored on Bifrost's persistent disk. There are a number of factors that could result in the complete loss of this file, to name a few:

  1. Compromised (not necessarily malicious) infrastructure, tooling, operator machines
  2. Forced provider shutdown due to censorship, unpaid accounts, etc.
  3. Human error during operation

In order to ensure there is no period of time in which loss of keyshares would incur loss of network funds, operators must immediately back up their keyshares after each churn. Currently the official mechanism for this backup is the make backup utility command in the node-launcher repo, which copies the keyshares to the operator's local machine. This approach requires responsive and proactive node operators to continuously back up in order to protect the network, and there is no way for external parties to verify the existence of node backups.

Since moving away from Yggdrasil vaults in favor of a greater number of Asgards, this risk has been reduced, since loss of funds now requires losing the keyshares of a supermajority of a vault's members, but risk remains. In the ideal scenario, a node operator should be able to securely back up their mnemonic only once and leverage it to recover their node and any corresponding funds.


Decision

TBD - there have been many discussions around this, and the options listed under Alternative Approaches are still relevant.


Detailed Design

The proposed design extends the TssPool message sent after vault creation to include a keyshares_backup field, which contains the bytes of the newly created keyshares, compressed with lzma (to reduce chain bloat) and symmetrically encrypted using the node's mnemonic as the passphrase (the same mnemonic generated at node creation and used for the thornode private key). The initial pass of this implementation began before the introduction of the ADR process and is currently under review at https://gitlab.com/thorchain/thornode/-/merge_requests/2235. These keyshares will intentionally skip storage in a KV store in the thornode application state to avoid further bloat; instead, a CLI utility will be provided to pull and decrypt the latest keyshare backup for the node from an RPC endpoint, via tci nodes recover-keyshares --address <node-address>.
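
As a rough illustration of this flow, the following sketch compresses and encrypts a keyshare payload, assuming the github.com/ulikunitz/xz/lzma package for compression and the scrypt passphrase recipient from filippo.io/age for symmetric encryption; the function and field names are illustrative, and the actual implementation is the one under review in the merge request above:

    package keyshares

    import (
        "bytes"
        "io"

        "filippo.io/age"
        "github.com/ulikunitz/xz/lzma"
    )

    // EncryptKeyshares compresses the raw keyshare bytes with lzma and then
    // encrypts them symmetrically, using the node mnemonic as an scrypt
    // passphrase. The result is what would be carried in the keyshares_backup
    // field of the TssPool message.
    func EncryptKeyshares(keyshares []byte, mnemonic string) ([]byte, error) {
        // Compress first - age output is indistinguishable from random bytes
        // and would not compress afterwards.
        var compressed bytes.Buffer
        zw, err := lzma.NewWriter(&compressed)
        if err != nil {
            return nil, err
        }
        if _, err := zw.Write(keyshares); err != nil {
            return nil, err
        }
        if err := zw.Close(); err != nil {
            return nil, err
        }

        // Symmetric encryption with the mnemonic as the passphrase.
        recipient, err := age.NewScryptRecipient(mnemonic)
        if err != nil {
            return nil, err
        }
        var encrypted bytes.Buffer
        ew, err := age.Encrypt(&encrypted, recipient)
        if err != nil {
            return nil, err
        }
        if _, err := io.Copy(ew, &compressed); err != nil {
            return nil, err
        }
        if err := ew.Close(); err != nil {
            return nil, err
        }
        return encrypted.Bytes(), nil
    }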

Checks

Sanity checks against mnemonics before encryption:

  1. Validate BIP39 mnemonic.
  2. Validate that the entropy of the byte-wise probability distribution of the mnemonic is greater than the minimum observed across 1e8 randomly generated mnemonics (see the sketch after this list).
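
A hedged sketch of these two checks, assuming the github.com/tyler-smith/go-bip39 package for mnemonic validation; the entropy threshold below is a hypothetical stand-in for the precomputed minimum over 1e8 randomly generated mnemonics:

    package keyshares

    import (
        "errors"
        "math"

        bip39 "github.com/tyler-smith/go-bip39"
    )

    // minMnemonicEntropy stands in for the precomputed minimum byte-wise
    // Shannon entropy observed across 1e8 randomly generated mnemonics
    // (illustrative value, not the real threshold).
    const minMnemonicEntropy = 3.5

    // CheckMnemonic runs the pre-encryption sanity checks on the passphrase.
    func CheckMnemonic(mnemonic string) error {
        // 1. Must be a valid BIP39 mnemonic.
        if !bip39.IsMnemonicValid(mnemonic) {
            return errors.New("invalid BIP39 mnemonic")
        }

        // 2. Byte-wise Shannon entropy must clear the minimum threshold.
        counts := make(map[byte]float64)
        for i := 0; i < len(mnemonic); i++ {
            counts[mnemonic[i]]++
        }
        var entropy float64
        for _, c := range counts {
            p := c / float64(len(mnemonic))
            entropy -= p * math.Log2(p)
        }
        if entropy < minMnemonicEntropy {
            return errors.New("mnemonic entropy below minimum threshold")
        }
        return nil
    }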

Sanity checks against the encrypted payload before send (see the sketch after this list):

  1. Check that encrypted output is not equal to input.
  2. Check that decrypted output equals the input.
  3. Check that the output does NOT contain the input.
  4. Check that the output does NOT contain the passphrase.
  5. Check that the output does NOT contain any word of the passphrase.
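
A rough sketch of these checks (the function name and error messages are illustrative):

    package keyshares

    import (
        "bytes"
        "errors"
        "strings"
    )

    // CheckEncryptedPayload runs the post-encryption sanity checks before the
    // backup is attached to the TssPool message.
    func CheckEncryptedPayload(plaintext, encrypted, decrypted []byte, mnemonic string) error {
        switch {
        case bytes.Equal(encrypted, plaintext):
            return errors.New("encrypted output equals input")
        case !bytes.Equal(decrypted, plaintext):
            return errors.New("decrypted output does not equal input")
        case bytes.Contains(encrypted, plaintext):
            return errors.New("encrypted output contains input")
        case bytes.Contains(encrypted, []byte(mnemonic)):
            return errors.New("encrypted output contains passphrase")
        }
        for _, word := range strings.Fields(mnemonic) {
            if bytes.Contains(encrypted, []byte(word)) {
                return errors.New("encrypted output contains a passphrase word")
            }
        }
        return nil
    }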

Positives

  1. Publishing the encrypted keyshares to the chain allows anyone to verify that a sufficient number of keyshares have been preserved such that loss of funds is not possible, so long as NOs have backed up their mnemonic.
  2. Embedding the shares in the TssPool messages ensures that the shares are preserved immediately at the time of creation.

Negatives

  1. Although we compress the shares before encrypting to reduce their size, this still results in some bloat in chain state. The size depends on the number of members in an Asgard, but is on the order of 100 KB under current conditions - breaking the same set of nodes into more Asgards reduces the aggregate size of this bloat.
  2. Although we add a significant number of checks to prevent it, there is some risk in publishing these keyshares to a publicly visible location. Note that most of the attack vectors we consider (infrastructure, supply chain) would give a malicious actor access to the keyshares before they are encrypted and published anyway.

Potential Suggested Modifications

  1. Only back up some sample (e.g. 50%) of the keyshares in this form - this mitigates some of the unease in negative #2, and still provides a safety net that reduces the likelihood of losing funds if a large percentage of the network were lost.

Alternative Approaches

The main tradeoff is whether or not to publish the encrypted payload somewhere publicly visible - this is a positive since any person can verify and back up the encrypted keyshares of nodes, and a negative since publishing this data could carry some security risk and adds to chain bloat. We outline the alternatives under consideration below in two categories representing this tradeoff, and omit it from their individual positives and negatives - in all cases the backup is encrypted.

Alternative Approaches (Private Backup)

1. Bifrost Sends Encrypted Keyshare to NO-Configured Bucket/Email/Etc.

We could deploy a Postfix instance in the cluster to send an email with the encrypted shares to an address the NO configures, or have the NO pass in something like an S3 endpoint and auth token that would be used to push them to the target service.
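
As a rough illustration of the bucket variant, a sketch using the aws-sdk-go v1 S3 client against an operator-supplied endpoint; the endpoint, bucket, and key handling here are assumptions for illustration, not part of the ADR:

    package keyshares

    import (
        "bytes"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/s3"
    )

    // uploadBackup pushes the already-encrypted keyshare payload to an
    // operator-configured, S3-compatible bucket. Credentials are resolved by
    // the SDK's default chain (environment, shared config, etc).
    func uploadBackup(endpoint, region, bucket, key string, payload []byte) error {
        sess, err := session.NewSession(&aws.Config{
            Endpoint: aws.String(endpoint),
            Region:   aws.String(region),
        })
        if err != nil {
            return err
        }
        _, err = s3.New(sess).PutObject(&s3.PutObjectInput{
            Bucket: aws.String(bucket),
            Key:    aws.String(key),
            Body:   bytes.NewReader(payload),
        })
        return err
    }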

Negatives
  1. Additional setup and reliance on external services (the provider for the mail server, S3 API, etc).

2. Node Operators Get Slashed Until make backup Heartbeat

This would keep the current approach to backup creation and extend make backup to also send a transaction with a "heartbeat" message - after a certain number of blocks following the churn, nodes which have not sent the heartbeat will begin receiving slash points.

Negatives
  1. Requires active participation from node operators to secure backups; funds could still be lost if nodes were lost before a supermajority of all vaults had engaged.

3. Node Operators Manage Separate Cron Backup

This would basically require node operators to manage a machine that has persistent authorization to their Kubernetes cluster, and to add TC_NO_CONFIRM=true NAME=thornode make backup to a crontab.

Negatives
  1. The node operator must separately maintain, monitor, and secure the backup machine (it holds all the keys), since it cannot be on the same infrastructure provider as the node; it must also have persistent authorization to the cluster in order to create the backups, which creates additional security risk.

4. Bifrost Sends Encrypted Keyshare to Other Active Bifrosts

This would be similar to the proposed design, but Bifrost would be extended to handle distribution of the encrypted keyshares to other active nodes instead of posting them on chain. Recovery would require cooperation from the nodes that held the backup. There could be a variant of this approach to only send keyshares to a subset of other nodes - these nodes could be randomly selected or perhaps the other members of the same vault. An additional variant could extend this pattern with a verification message posted on chain, so that one node could signal to the network that it has persisted the encrypted keyshares of another node.

Negatives
  1. Additional complexity to add more P2P logic into Bifrost.

Alternative Approaches (Public Backup)

1. Bifrost Sends Encrypted Keyshare to IPFS

Same as the proposed design, but we push backups to IPFS and record the key in the TssPool message.

Negatives
  1. Additional dependency and complexity, and a new point of failure for backups, introduced by the IPFS integration.

Open Questions

The following questions are generally relevant for any approach taken.

  1. Symmetric encryption with mnemonic or asymmetric with key (generated from mnemonic)?

    Update: It seems devs are mostly satisfied with the currently proposed symmetric approach.

  2. In either case for #1, which encryption library to prefer (stdlib vs something like age)?

    Update: It seems devs are mostly satisfied with the currently proposed usage of age (see the sketch below).
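
For reference, a sketch of the recovery-side decryption (e.g. what a utility like tci nodes recover-keyshares could run after fetching the payload), assuming the scrypt passphrase identity from filippo.io/age; the function name is illustrative:

    package keyshares

    import (
        "bytes"
        "io"

        "filippo.io/age"
    )

    // DecryptKeyshares reverses the backup encryption: the node mnemonic is
    // the scrypt passphrase, and the caller is expected to lzma-decompress
    // the returned bytes to recover the original keyshares.
    func DecryptKeyshares(encrypted []byte, mnemonic string) ([]byte, error) {
        identity, err := age.NewScryptIdentity(mnemonic)
        if err != nil {
            return nil, err
        }
        r, err := age.Decrypt(bytes.NewReader(encrypted), identity)
        if err != nil {
            return nil, err
        }
        return io.ReadAll(r)
    }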


References

  • ...