Announcing our public testnet
We are now hosting publicly-accessible Lighthouse nodes and providing instructions for running your own beacon node and/or validator client. In other words, we have a public testnet that you can join today. Get started with our documentation.
Before you decide to run a node, I kindly ask you to read the rest of this section (feel free to skip the following technical section).
Our testnet has two notable characteristics:
- It uses the mainnet specification (with slight modifications to increase inactive validator churn and decrease lag when following the Eth1 chain).
- It has over 16,384 validators.
When launching a beacon chain testnet, you can pick and choose your spec (mainnet or minimal) and the number of validators (typically something above 64). If you want to accurately simulate the computational load of running in production, you need to choose the mainnet spec and have at least 16,384 validators (i.e., the minimum number of validators required to launch the Beacon Chain mainnet). If you're not concerned with simulating the actual computational load (e.g., you're demonstrating APIs) then you'll likely choose the minimal spec and a validator count in the tens or hundreds.
At the time of writing (and as far as we know), this is the first mainnet testnet with 16,384 validators. It has been a huge undertaking to get this testnet running and the Lighthouse team is proud of this achievement.
Choosing the mainnet spec means that the BeaconState object is much larger; some fields grow from a length of 64 to 8,192. Merkle hashing, serialization, database interactions and copying in memory become orders of magnitude more onerous. Additionally, choosing a higher validator count means even more Merkle hashing and more BLS signatures. BLS is a primary bottleneck for block and attestation verification.
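To put rough numbers on that growth: under the minimal spec, the per-slot root vectors in the state hold 64 entries, whereas under mainnet they hold 8,192. The sketch below is illustrative only (the constant names are made up for this example, not Lighthouse's type definitions) and shows the resulting difference in raw storage for a single vector of 32-byte roots.

```rust
// Illustrative constants only; the values follow the spec presets but the
// names are invented for this example.
const MINIMAL_SLOTS_PER_HISTORICAL_ROOT: usize = 64;
const MAINNET_SLOTS_PER_HISTORICAL_ROOT: usize = 8_192;

/// Rough size (in bytes) of one of the BeaconState root vectors
/// (e.g., block_roots or state_roots), where each entry is a 32-byte root.
fn root_vector_bytes(slots_per_historical_root: usize) -> usize {
    slots_per_historical_root * 32
}

fn main() {
    println!(
        "minimal: {} bytes, mainnet: {} bytes",
        root_vector_bytes(MINIMAL_SLOTS_PER_HISTORICAL_ROOT),
        root_vector_bytes(MAINNET_SLOTS_PER_HISTORICAL_ROOT),
    );
}
```

And that's before accounting for the larger validator registry and the extra Merkle hashing required to keep the roots of these bigger structures up to date.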
Taking these challenges into consideration, we ask that you bear with us whilst
we improve our sync times. We're syncing at about 4-8 blocks/sec on a consumer
laptop at the moment, but we've seen successive major improvements over the
past week as we focus on optimization. To give an idea of how fast we're
progressing, less than a week ago we were syncing at less than 0.2 blocks/sec.
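For a sense of what those rates mean in practice, here's a back-of-the-envelope calculation (the chain length is a made-up figure, purely for illustration):

```rust
// Rough sync-time estimate; the 100,000-block chain length is hypothetical.
fn sync_hours(blocks: f64, blocks_per_sec: f64) -> f64 {
    blocks / blocks_per_sec / 3600.0
}

fn main() {
    let chain_len = 100_000.0;
    println!("at 0.2 blocks/s: {:.1} hours", sync_hours(chain_len, 0.2)); // ~139 hours
    println!("at 6.0 blocks/s: {:.1} hours", sync_hours(chain_len, 6.0)); // ~4.6 hours
}
```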
Although sync is presently slow, we're comfortably running a Lighthouse node on an Amazon t2.medium that's managing 4,096 validators. Performance is reasonable once synced.
Additionally, we have our validators highly concentrated and we're expecting to deploy malicious nodes over the coming weeks. We're going to start trying to crash this testnet and I suspect we'll be successful. If you decide to run a node and contribute to the project by reporting bugs and making suggestions, we'll be very grateful. If we need to reboot the testnet, just reach out if you need more Goerli ETH.
The Technical Part
What's missing?
Although we're seeking to simulate the production beacon chain, we don't yet have all the features in place. Here's a list of how we diverge from what we can expect to run in production:
Attestation Aggregation
Presently we are not using the attestation aggregation scheme in the spec. Instead, we are running our validators across a handful of nodes and these nodes are aggregating the attestations (avoiding the need for a distributed aggregation scheme).
Once we implement the attestation aggregation scheme, we can expect to see an increase in network traffic, computational load and code complexity. Expect to see PRs for the aggregation scheme in the coming weeks.
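For the curious, the sketch below captures the essence of what those aggregating nodes do: attestations voting for identical data are merged by OR-ing their participation bitfields. The BLS signature aggregation step is omitted and the types are simplified stand-ins, not Lighthouse's (nor the spec's) actual structures.

```rust
use std::collections::HashMap;

// Heavily simplified attestation: real attestations carry AttestationData and
// a BLS signature; here the data is reduced to a u64 id and the signature is
// elided entirely.
#[derive(Clone)]
struct Attestation {
    data_id: u64,                // stand-in for the full AttestationData
    aggregation_bits: Vec<bool>, // one bit per validator in the committee
}

/// Merge attestations that vote for identical data by OR-ing their
/// participation bitfields. A real aggregator also aggregates the BLS
/// signatures; that step is omitted here.
fn aggregate(attestations: &[Attestation]) -> Vec<Attestation> {
    let mut by_data: HashMap<u64, Attestation> = HashMap::new();
    for att in attestations {
        by_data
            .entry(att.data_id)
            .and_modify(|agg| {
                for (bit, new_bit) in agg.aggregation_bits.iter_mut().zip(&att.aggregation_bits) {
                    *bit |= *new_bit;
                }
            })
            .or_insert_with(|| att.clone());
    }
    by_data.into_values().collect()
}

fn main() {
    let a = Attestation { data_id: 1, aggregation_bits: vec![true, false, false] };
    let b = Attestation { data_id: 1, aggregation_bits: vec![false, true, false] };
    let aggregated = aggregate(&[a, b]);
    assert_eq!(aggregated.len(), 1);
    assert_eq!(aggregated[0].aggregation_bits, vec![true, true, false]);
}
```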
Slashing Protection or Detection
Our validator client is presently without validator slashing protection. Whilst we have an implementation in this PR, we decided not to make it a priority for this testnet. We chose this because it's not expected to have a significant impact on computational load and it's also interesting to see how the network can survive validators casting conflicting votes.
Expect to see slashing protection in the master branch in the next two weeks.
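To illustrate what slashing protection actually guards against, here's a minimal in-memory sketch. The real implementation in the PR linked above persists its history and is considerably more careful; this only shows the basic checks.

```rust
use std::collections::HashSet;

// Minimal, in-memory sketch of validator-side slashing protection.
#[derive(Default)]
struct SigningHistory {
    signed_block_slots: HashSet<u64>,
    signed_attestations: Vec<(u64, u64)>, // (source_epoch, target_epoch)
}

impl SigningHistory {
    /// Refuse to sign a second block proposal at the same slot.
    fn safe_to_sign_block(&mut self, slot: u64) -> bool {
        self.signed_block_slots.insert(slot)
    }

    /// Refuse double votes (same target epoch) and surround votes.
    fn safe_to_sign_attestation(&mut self, source: u64, target: u64) -> bool {
        for &(prev_source, prev_target) in &self.signed_attestations {
            let double_vote = target == prev_target;
            let surrounds = source < prev_source && target > prev_target;
            let surrounded = source > prev_source && target < prev_target;
            if double_vote || surrounds || surrounded {
                return false;
            }
        }
        self.signed_attestations.push((source, target));
        true
    }
}

fn main() {
    let mut history = SigningHistory::default();
    assert!(history.safe_to_sign_block(42));
    assert!(!history.safe_to_sign_block(42)); // conflicting proposal at slot 42
    assert!(history.safe_to_sign_attestation(1, 2));
    assert!(!history.safe_to_sign_attestation(0, 3)); // surround vote
}
```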
Large Node Counts
Presently we have fewer than 10 nodes on the network at any given time. On the one hand, this makes problems with syncing much more obvious because there's no one else to fall back on. It also makes the network more unstable, which helps us detect bugs more easily. On the other hand, it fails to trigger a whole other set of bugs that arise from large DHTs, noisy networks, etc.
Our intention in the next few weeks is to use some cloud container service (e.g., AWS ECS) to spin up hundreds or thousands of nodes on this network and observe the results.
Optimizing State Transition
When moving over to the mainnet spec with 16k validators, we are primarily concerned with block import times (this involves verifying the block and storing it in the DB). Specifically, we are interested in the time it takes to process a block in two scenarios:
- When syncing from some old block (e.g., first boot).
- When following the head (i.e., when we've finished syncing).
They are different beasts: (1) involves importing lots of successive blocks very quickly, whilst (2) involves processing a block or two in short bursts. Surprisingly, Lighthouse can currently import blocks much faster in scenario (2) than in (1). We're using LevelDB and are seeing a more-than-10x slowdown in write times when importing multiple blocks in succession. We only identified this as an issue on Sunday and will work to solve it this week.
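One common way to tackle per-write overhead during bulk import is to batch many block writes into a single commit. The sketch below illustrates that idea against a hypothetical key-value store; it's illustrative only, not our actual database code, and not necessarily the fix we'll end up shipping.

```rust
use std::collections::HashMap;

// Hypothetical key-value store interface; LevelDB-like stores typically
// expose both single puts and atomic write batches.
trait Store {
    fn put(&mut self, key: Vec<u8>, value: Vec<u8>);
    fn write_batch(&mut self, batch: Vec<(Vec<u8>, Vec<u8>)>);
}

struct MemStore(HashMap<Vec<u8>, Vec<u8>>);

impl Store for MemStore {
    fn put(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.0.insert(key, value);
    }
    // A single batched commit amortises per-write costs across many blocks
    // instead of paying them once per block.
    fn write_batch(&mut self, batch: Vec<(Vec<u8>, Vec<u8>)>) {
        for (key, value) in batch {
            self.0.insert(key, value);
        }
    }
}

fn main() {
    let mut store = MemStore(HashMap::new());

    // Scenario (1): syncing. Accumulate verified blocks, flush in one batch.
    let synced_blocks = vec![
        (b"block:1".to_vec(), b"ssz bytes".to_vec()),
        (b"block:2".to_vec(), b"ssz bytes".to_vec()),
    ];
    store.write_batch(synced_blocks);

    // Scenario (2): following the head. A plain put per block is fine.
    store.put(b"block:3".to_vec(), b"ssz bytes".to_vec());
}
```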
When we first ran with the mainnet spec, we found that SSZ serialization was far too slow (multiple seconds). This was due to two main factors:
- We were decoding all BLS public keys into actual curve co-ordinates.
- We were storing our tree hash (Merkle root) caches to disk.
We solved (1) by simply treating public keys as byte arrays until the point where we need to do actual cryptographic operations (@protolambda and I have been talking about this for months). We solved (2) by no longer storing the tree-hash caches to disk; our in-memory caches are sufficient for normal operation. The trade-off here is that if someone builds off weird blocks from the past, we'll have to do more work to compute the tree hash (presently about 300ms). We can refine this approach in later releases.
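Here's a rough sketch of the "keep them as byte arrays" approach for (1): the compressed key bytes are carried around untouched and the expensive decompression only happens when a signature actually needs verifying. The types and the placeholder verification are illustrative, not our real BLS code.

```rust
// Compressed public-key bytes, as they appear in the BeaconState.
struct RawPublicKey([u8; 48]);

// Stand-in for a decompressed curve point.
struct DecompressedPublicKey {
    point: [u8; 48],
}

impl RawPublicKey {
    /// The expensive step (curve decompression plus subgroup checks in real
    /// BLS); only performed when a signature actually needs verifying.
    fn decompress(&self) -> DecompressedPublicKey {
        DecompressedPublicKey { point: self.0 }
    }
}

fn verify_signature(pubkey: &RawPublicKey, _message: &[u8], _signature: &[u8]) -> bool {
    // Pay the decompression cost here, not at SSZ-decode time.
    let key = pubkey.decompress();
    // Placeholder: a real implementation performs the BLS pairing check.
    key.point.len() == 48
}

fn main() {
    let key = RawPublicKey([0u8; 48]);
    assert!(verify_signature(&key, b"msg", b"sig"));
}
```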
Our "fork choice" times are also quite notable (hundreds of ms). This time not only involves running the fork choice algorithm but also updating the head, caches and persisting the current beacon chain state to disk (e.g., the canonical head, other known heads, the state of fork choice, etc.). We have a clear path forward to reduce these times:
- Store some ancestors for each block in our reduced-tree fork choice (this means fewer state reads).
- Be more granular in when we persist the beacon chain to disk (e.g., only store things that have changed, rather than everything every time); see the sketch after this list.
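As a sketch of the second point, one simple approach is to track which components are dirty and write only those; this is illustrative, not our planned implementation.

```rust
// Track which persisted components have changed since the last write.
#[derive(Default)]
struct PersistedChain {
    canonical_head: Option<Vec<u8>>,
    fork_choice: Option<Vec<u8>>,
    head_dirty: bool,
    fork_choice_dirty: bool,
}

impl PersistedChain {
    fn update_head(&mut self, head: Vec<u8>) {
        self.canonical_head = Some(head);
        self.head_dirty = true;
    }

    fn update_fork_choice(&mut self, state: Vec<u8>) {
        self.fork_choice = Some(state);
        self.fork_choice_dirty = true;
    }

    /// Collect only the dirty components for writing, rather than
    /// serialising everything on every fork-choice run.
    fn persist(&mut self) -> Vec<(&'static str, Vec<u8>)> {
        let mut writes = Vec::new();
        if self.head_dirty {
            if let Some(head) = self.canonical_head.clone() {
                writes.push(("canonical_head", head));
            }
            self.head_dirty = false;
        }
        if self.fork_choice_dirty {
            if let Some(fork_choice) = self.fork_choice.clone() {
                writes.push(("fork_choice", fork_choice));
            }
            self.fork_choice_dirty = false;
        }
        writes
    }
}

fn main() {
    let mut chain = PersistedChain::default();
    chain.update_head(b"head root".to_vec());
    chain.update_fork_choice(b"fork choice state".to_vec());
    assert_eq!(chain.persist().len(), 2); // both components are dirty

    chain.update_head(b"new head root".to_vec());
    assert_eq!(chain.persist().len(), 1); // only the head is re-written
}
```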
I'm confident there's still a lot of fat in block processing, and I think it's safe to expect another order of magnitude improvement in the coming weeks. Time will tell.
This marks the first official release of Lighthouse (v0.1.0).