Besu: Q4 2022 Stability and Performance Improvements

Stability

Some solutions already implemented 

  • Sync stalling 
    • Rules about what is considered our best peer post-merge are not implemented
    • Retrying the same peer over and over (addressed by shuffling the peers)
  • Invalid block errors 
    • Consensus layer is on a fork with bad data
    • When Besu hits a storage exception, it reports the block as invalid to the consensus layer, which sends us onto a wrong fork (potential fix by Justin; GH issue)
    • Other Besu internal errors that we don’t know about yet could also cause invalid block reports

Potential solution identified  

  • Worldstate root mismatch
    • Bonsai and snapshots
    • Solution: Confirmed working for many cases, needs more testing and handling of any corner cases

Issues around peering

  • Sometimes a restart is needed to find new peers during sync
    • Potentially because peers are not evaluated during and after sync
    • Solution: ??
  • Losing many peers
    • We lost many peers because threads were blocked
    • Vert.x, for example, uses a different approach to threading
    • Solution: ??

Issues on user experience 

  • Difficulty in reliably communicating with new/inexperienced users
    • Docs & lack of education 
    • More complicated setup post-merge
    • Solution: Write up ‘What to expect from staking at home’ and FAQ for Besu Docs
  • Out of Memory errors
    • Documentation on what kind of memory configuration is needed
    • No mechanism to detect memory leaks 
    • Potential mitigation: make deploying Besu easier by providing default configs 
    • Solution: ??
  • Users don’t know how far along syncing is
    • Insufficient logging & bad log UX
    • Solution/plan: ?? (a rough progress-logging sketch follows this list)
  • Users are hesitant to update or restart Besu to the latest version due to the impression that it is unstable
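One possible starting point for sync-progress visibility is a periodic log line that estimates completion from the current and target block numbers. The sketch below is illustrative only, assuming hypothetical suppliers for the current and target block; it does not use Besu's actual sync classes.

import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Illustrative sketch only: periodically log an estimated sync percentage.
// The suppliers for current/target block are placeholders, not Besu APIs.
public class SyncProgressLogger {
  private final LongSupplier currentBlock;   // e.g. latest imported block number
  private final LongSupplier targetBlock;    // e.g. chain head reported by peers
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public SyncProgressLogger(LongSupplier currentBlock, LongSupplier targetBlock) {
    this.currentBlock = currentBlock;
    this.targetBlock = targetBlock;
  }

  public void start(Duration interval) {
    scheduler.scheduleAtFixedRate(
        this::logProgress, 0, interval.toSeconds(), TimeUnit.SECONDS);
  }

  private void logProgress() {
    long current = currentBlock.getAsLong();
    long target = Math.max(targetBlock.getAsLong(), 1);
    double percent = Math.min(100.0, 100.0 * current / target);
    System.out.printf("Sync progress: block %d of ~%d (%.1f%%)%n", current, target, percent);
  }

  public void stop() {
    scheduler.shutdownNow();
  }
}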

Issues with RPC calls

  • Incompatibilities with the RPC spec / behavior that is not the same as Geth’s, causing crashes
    • Does not meet Chainlink’s and other organizations’ needs for RPC calls (accuracy, speed)
    • Do we implement all of the RPC interfaces that Geth does? E.g. logger, trace (all the methods)
    • Solution: ??
  • Some specific RPC calls (trace/debug) take a long time or OOM 
    • Lack of testing of large RPC calls
    • Might need to understand the root cause better 
    • Solution: ?

Performance

Staking Performance

  • Poor execution performance leading to missed attestations
    • More investigation is ongoing, and some user stories are being created
  • Poor block production
    • As we tweak the tx pool to build the best block for the user (valuable, DoS-resistant transactions), we need to ensure there is no performance hit to the client; block production uses a lot of CPU because we keep rebuilding the block until the CL asks for it
    • Late blocks could also cause import challenges and force block building to restart
    • Snapshots could help with concurrency in this case 
    • EIP-4844 could alter this process and requires good performance as well

I/O and Disk Performance 

  • Besu has problems with slow I/O/disks → Besu generates a lot of I/O
    • We are not using the flat DB during block processing, so we have to gather a lot of data from disk
    • Need caching in more areas: read/write caching (a rough caching sketch follows this list)
    • Doing less work, persisting less to disk, persisting trie logs but not the world state (Amez / Karim)
    • The first hotspot in Besu is reading data from RocksDB via the RocksDB.get method, mainly because we have to fetch most of the world state nodes from the Patricia Merkle Trie
    • Need to identify more areas where IO contention is commonplace 
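One way to cut repeated RocksDB.get calls for hot trie nodes is a small read-through cache in front of the key-value store. The sketch below is a generic illustration, not Besu's actual storage interface; KeyValueStore, the cache size, and the LRU policy are assumptions.

import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Generic read-through cache sketch in front of a key-value store.
// KeyValueStore is a placeholder for the underlying RocksDB access, not a Besu API.
interface KeyValueStore {
  Optional<byte[]> get(byte[] key);
}

class CachingKeyValueStore implements KeyValueStore {
  private final KeyValueStore backing;
  private final int maxEntries;
  // ByteBuffer keys give content-based equals/hashCode for byte[] keys;
  // an access-ordered LinkedHashMap gives simple LRU eviction.
  private final Map<ByteBuffer, byte[]> cache;

  CachingKeyValueStore(KeyValueStore backing, int maxEntries) {
    this.backing = backing;
    this.maxEntries = maxEntries;
    this.cache = new LinkedHashMap<>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<ByteBuffer, byte[]> eldest) {
        return size() > CachingKeyValueStore.this.maxEntries;
      }
    };
  }

  @Override
  public synchronized Optional<byte[]> get(byte[] key) {
    ByteBuffer cacheKey = ByteBuffer.wrap(key);
    byte[] cached = cache.get(cacheKey);
    if (cached != null) {
      return Optional.of(cached);               // cache hit: no disk read
    }
    Optional<byte[]> value = backing.get(key);  // cache miss: read from the store
    value.ifPresent(v -> cache.put(cacheKey, v));
    return value;
  }
}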

Trace Performance 

  • Poor performance of tracing of blocks / transactions
    • Not sure why we are slow 
    • Besu would often crash when tracing a full block
    • OOM errors
    • Short timeouts can cause issues
    • Is the DB tuned for tracing?
    • 3786
    • Will need good performance for any rollup use cases
    • Solution?: Instead of replaying the traces for every user request, why not save the trace results in a separate database or module, rather than only saving the block and the world state for each block? (a rough sketch follows this list)
    • Solution?: Separate tracing into a different microservice
    • Solution?: Separate the query part of Besu into a completely separate process. Queries asking Besu questions about the chain do not need to slow down the main flow.
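A minimal sketch of the "store trace results instead of re-tracing" idea, assuming a hypothetical store keyed by block hash; none of these types are Besu classes, and the trace result is simplified to a JSON string.

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch: compute the (expensive) trace once per block and reuse it for later
// requests, instead of re-executing the block every time. Placeholder types only.
public class BlockTraceStore {
  private final Map<String, String> tracesByBlockHash = new ConcurrentHashMap<>();

  // tracer re-executes the block and produces the trace JSON on first request only.
  public String getOrCompute(String blockHash, Function<String, String> tracer) {
    return tracesByBlockHash.computeIfAbsent(blockHash, tracer);
  }

  public Optional<String> getCached(String blockHash) {
    return Optional.ofNullable(tracesByBlockHash.get(blockHash));
  }
}

In a real implementation the map would be a persistent store (for example a dedicated database or column family) rather than an in-memory map, so traces survive restarts and do not grow the heap.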


Sync Performance

  • Poor syncing performance (still the case with 22.10.0?)
    • Do we need to verify the proof-of-work blocks on Mainnet?
    • Unnecessary conversion from bytes to RLP and back during sync (see the sketch after this list)
    • Sometimes we are stuck for some time during a snap sync (needs investigation)
  • Full sync / Forest performance (also snap sync for the block-downloading part)
    • Persisted world state changes may be able to help with full sync on Bonsai
    • For Forest, we need to determine where the areas for performance improvement are; some of the recent Bonsai improvements could be tweaked to suit the Forest use case (though this is unknown)
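To illustrate the decode/re-encode concern: if a block body arrives as RLP bytes and is ultimately persisted as RLP bytes, decoding it into objects only to re-encode the same data is wasted work. The sketch below is a generic illustration with placeholder encode/decode functions, not Besu's RLP implementation.

import java.util.function.Function;

// Generic illustration of avoiding a decode/re-encode round trip during sync.
// decode/encode stand in for RLP handling; they are placeholders, not Besu APIs.
public class RlpPassThrough {

  // Wasteful path: bytes -> object -> bytes, even though nothing changed.
  static byte[] storeAfterRoundTrip(byte[] rlpFromPeer,
                                    Function<byte[], Object> decode,
                                    Function<Object, byte[]> encode) {
    Object block = decode.apply(rlpFromPeer); // validation may only need parts of this
    return encode.apply(block);               // re-encoding recreates the same bytes
  }

  // Cheaper path: persist the original bytes and decode lazily only where needed.
  static byte[] storeRawBytes(byte[] rlpFromPeer) {
    return rlpFromPeer;
  }
}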

EVM Performance - Pending Amez Availability and IMAPP Testing 

  • We need analysis that tells us whether the gas cost of each operation corresponds to the algorithmic complexity of the Besu implementation. Bonsai might affect the algorithmic complexity of some operations. (A micro-benchmark sketch follows this list.)
    • IMAPP testing - we need an overall analysis (Matt has connected with this team for a profile)
    • We do not have a profile of Besu’s EVM performance (work with Danno?)
    • SLOAD, SSTORE - slowest vs gas cost?  
    • EVM performance improvements often appear without context and the broader team is unsure of how the optimizations are created. Is there a standard playbook of optimizations that we are running through, or are there EVM specific performance observations that we are reacting to?
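A minimal sketch of the kind of per-operation measurement such an analysis needs: time an operation, divide by its gas cost, and compare nanoseconds-per-gas across opcodes. The harness below is generic; the operation names, gas costs, and the Runnable standing in for an opcode implementation are all assumptions rather than Besu's EVM classes, and a serious version would use JMH instead of a hand-rolled loop.

// Rough nanoseconds-per-gas measurement sketch. The Runnable stands in for an
// opcode implementation; this is not Besu's EVM API, and a proper benchmark
// would use JMH to avoid JIT/warm-up pitfalls.
public class OpcodeCostProbe {

  static double nanosPerGas(String opName, long gasCost, Runnable operation, int iterations) {
    // Warm up so the JIT compiles the hot path before we measure.
    for (int i = 0; i < iterations; i++) {
      operation.run();
    }
    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      operation.run();
    }
    double nanosPerCall = (System.nanoTime() - start) / (double) iterations;
    double ratio = nanosPerCall / gasCost;
    System.out.printf("%s: %.1f ns/call, %d gas, %.3f ns/gas%n", opName, nanosPerCall, gasCost, ratio);
    return ratio;
  }

  public static void main(String[] args) {
    byte[] scratch = new byte[32];
    // Placeholder "operations" just to exercise the harness; real measurements
    // would call into the EVM operation implementations.
    nanosPerGas("ADD-like", 3, () -> { scratch[0] ^= 1; }, 1_000_000);
    nanosPerGas("KECCAK-like", 30, () -> java.util.Arrays.fill(scratch, (byte) 7), 1_000_000);
  }
}

Operations whose ns/gas ratio is far above the rest are the ones where gas underprices Besu's implementation cost (the SLOAD/SSTORE question above).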


Do we know how we can solve these problems?

  • Having automated tests on nightly/CI to detect regressions ASAP
  • Having more modularity
  • Having a tracing solution at the Java level to improve observability (a rough sketch follows this list)
  • Using torrents for downloading the blocks during the initial sync (archive)?
  • Separating the query part of Besu into a completely separate process; queries asking Besu questions about the chain do not need to slow down the main flow
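A minimal sketch of what "tracing at the Java level" could start from: a try-with-resources timer span that logs how long a named operation took. The names here are hypothetical, and a real setup would more likely build on an existing library such as OpenTelemetry.

// Minimal timing-span sketch for Java-level observability. A real setup would
// likely use an existing tracing library; this only shows the shape of the idea.
public final class TraceSpan implements AutoCloseable {
  private final String name;
  private final long startNanos = System.nanoTime();

  private TraceSpan(String name) {
    this.name = name;
  }

  public static TraceSpan start(String name) {
    return new TraceSpan(name);
  }

  @Override
  public void close() {
    long elapsedMicros = (System.nanoTime() - startNanos) / 1_000;
    System.out.printf("span=%s duration_us=%d%n", name, elapsedMicros);
  }

  public static void main(String[] args) throws InterruptedException {
    // Usage: wrap a suspected hotspot, e.g. block import or a trie lookup.
    try (TraceSpan ignored = TraceSpan.start("block-import")) {
      Thread.sleep(50); // stand-in for the real work being measured
    }
  }
}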

Process Improvements (Q1?) 

Performance Testing

  • Lack of performance testing, especially on RPC methods
    • How do we get alerted when there is an actual performance regression?
    • Automated performance testing on each release; nothing exists at the moment (a rough latency-check sketch follows this list)
    • Hive tests return the time the calls take, but only under a small load
    • Can we separate some RPC methods into separate microservices?
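One cheap starting point for regression alerts is a CI step that times a few RPC calls against a synced node and fails when latency exceeds a baseline. The sketch below uses the standard JDK HTTP client; the endpoint URL, method list, and latency budgets are assumptions, not agreed values.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Map;

// Sketch of a CI-style latency check against a node's JSON-RPC endpoint.
// The URL, methods, and thresholds are illustrative assumptions.
public class RpcLatencyCheck {
  private static final URI RPC_URL = URI.create("http://localhost:8545");
  private static final HttpClient CLIENT = HttpClient.newHttpClient();

  public static void main(String[] args) throws Exception {
    // Per-method latency budgets in milliseconds (hypothetical baselines).
    Map<String, Long> budgets = Map.of(
        "eth_blockNumber", 100L,
        "eth_getBlockByNumber", 500L);

    boolean regression = false;
    for (Map.Entry<String, Long> entry : budgets.entrySet()) {
      long elapsedMs = timeCall(entry.getKey());
      System.out.printf("%s took %d ms (budget %d ms)%n",
          entry.getKey(), elapsedMs, entry.getValue());
      if (elapsedMs > entry.getValue()) {
        regression = true;
      }
    }
    if (regression) {
      System.exit(1); // fail the CI job so the regression is noticed immediately
    }
  }

  private static long timeCall(String method) throws Exception {
    String body = "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"" + method
        + "\",\"params\":" + params(method) + "}";
    HttpRequest request = HttpRequest.newBuilder(RPC_URL)
        .timeout(Duration.ofSeconds(10))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
    long start = System.nanoTime();
    CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
    return (System.nanoTime() - start) / 1_000_000;
  }

  private static String params(String method) {
    return method.equals("eth_getBlockByNumber") ? "[\"latest\", true]" : "[]";
  }
}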

Slow Release / Testing Process (CPU, test bounding, process) 

  • Manual release process with a lot of wasted time waiting for builds that could be avoided
    • Waiting for multiple full builds to complete when merging a PR or changing the version number, which doesn’t need a full build
    • A full build should only be required when there are code changes
  • Make it easier/faster to run all the tests locally, or avoid the need to create a draft PR for running tests remotely
    • Support for many features makes the tests slow (ETC tests, Quorum tests, etc.) even when they are not needed for certain modifications

Issues with the process 

  • Late-discovered regressions
    • Need a more comprehensive testing strategy across contributors
    • Solution: ??

Establish a better process for responding to the problems we discover

  • Good case study: the Sepolia issue over the weekend of Oct 29/30