History, breaking it down into sub-classes of APIs

I have been thinking more about the “history” API situation and think we may need to split this category of API node into multiple subtypes. One generic type doesn’t accurately describe these services anymore, and we have entered a phase where the roles and responsibilities of a “history” server vary greatly from implementation to implementation.

This is probably true of “APIs” in general as well, since a robust infrastructure is going to have dedicated nodes for specific query types (push, tables, blocks, etc.), but for now I plan on talking about history specifically.

To start with, I think “history” could first be divided into:

  • Transaction History
  • Account/Contract History

These two API types cater to different use cases and audiences, and both serve valuable information. Many services, including many that we (Greymass) operate, only require the transaction history portion to be exposed via APIs and don’t need advanced querying mechanisms. However, a lot of other services and applications rely heavily on the ability to analyze the detailed history of an account/contract in order to get the data they need.

Transaction History

We have hit a point with our custom history APIs (which mirror the v1 specification, but don’t use the plugin at all) where I would consider them to be of the “transaction history” type. They are able to keep track of all transactions on the blockchain very effectively and respond within milliseconds when you ask about the state of any transaction. The reason is that it’s far easier to scale with this limited approach. We do this using the trace API as the backend (which is prunable) and a custom multithreaded indexer feeding into a database for API requests to consume.

Everything is stored on disk (normal SSDs), the RAM requirements are next to nothing (more helps with caching), and its multithreaded indexers chew through history no problem. I think our last replay took 5 days for 2+ years’ worth of EOS history. This approach scales incredibly well and serves an audience of API consumers who need to monitor specific transactions for finality.
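For a concrete picture of the indexing side, here’s a minimal Go sketch of such a loop (not our actual code: the /v1/trace_api/get_block endpoint is the standard nodeos trace_api_plugin; the node URL and block range are placeholders):

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
    )

    // getBlockTraces pulls one block's traces from the nodeos trace_api_plugin,
    // the same prunable data source described above.
    func getBlockTraces(baseURL string, blockNum uint32) (map[string]interface{}, error) {
        body, _ := json.Marshal(map[string]uint32{"block_num": blockNum})
        resp, err := http.Post(baseURL+"/v1/trace_api/get_block", "application/json", bytes.NewReader(body))
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        var trace map[string]interface{}
        err = json.NewDecoder(resp.Body).Decode(&trace)
        return trace, err
    }

    func main() {
        // Illustrative single-threaded loop; a real indexer would shard block
        // ranges across worker goroutines and write rows into a database.
        for num := uint32(1); num <= 1000; num++ {
            trace, err := getBlockTraces("http://localhost:8888", num)
            if err != nil {
                continue // block unavailable (e.g. pruned); skip it
            }
            fmt.Println("indexed block", num, "id:", trace["id"])
        }
    }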

Account/Contract History

The other two prevalent solutions are dfuse and Hyperion, which fit into the second category, account/contract history. They offer incredibly advanced ways of aggregating the historical data of the chain and are capable of serving a completely different subset of API consumers. They have evolved greatly over the past few years, but have challenges scaling to combat “spam”. I assume these scaling challenges are due to the increased number of indexes required to service the flexible queries they offer.

This increased demand requires RAM for effective index usage and similar (if not greater) disk space for bulk data storage, and the indexes take a significant amount of time to regenerate, since multiple indices require updating with every transaction on the network. It’s required, though: that advanced view into the historical data is incredibly valuable to the subset of API consumers who benefit from it.

Creating a split in API types

My thought behind splitting these two API types into separate categories falls in line with the adage of “do one thing, and do it well”.

It’s a principle we have applied to our overall API infrastructure by delegating roles to specific servers optimized for specific tasks. We have dedicated APIs that accept incoming transactions, dedicated APIs to serve out blocks, and dedicated APIs to service history. This same concept should also apply within history itself, to make “history” less confusing to the developers who need to utilize it.

One size does not fit all here, and if each type of API did what it does best, the effectiveness of each would likely improve.

To respond to the feature matrix Dan talked about in the BP channel, I thought it would be useful to describe what the dfuse stack does, and what it doesn’t do.

I hope Igor (or Rio :P) drops a note about what Hyperion does and doesn’t do.

dfuse Search offers a simple indexing of actions. Think of it as an Elasticsearch collection called actions, with those fields: “receiver, account, action, auth, scheduled, status, notif, input, event, ram.consumed, ram.released, db.key, db.table, data.account, data.active, data.active_key, data.actor, data.amount, data.auth, data.authority, data.bid, data.bidder, data.canceler, data.creator, data.executer, data.from, data.is_active, data.is_priv, data.isproxy, data.issuer, data.level, data.location, data.maximum_supply, data.name, data.newname, data.owner, data.parent, data.payer, data.permission, data.producer, data.producer_key, data.proposal_name, data.proposal_hash, data.proposer, data.proxy, data.public_key, data.producers, data.quant, data.quantity, data.ram_payer, data.receiver, data.requested, data.requirement, data.symbol, data.threshold, data.to, data.transfer, data.voter, data.voter_name, data.weight, data.abi, data.code”. You can then query on those fields to retrieve actions from history. The search engine will return the whole transaction, highlighting the actions that matched your query.
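To make that concrete, here’s a rough Go sketch of issuing one of those queries through the GraphQL-over-HTTP endpoint (the searchTransactionsForward query follows the dfuse docs as I recall them; the host, token handling, and field values are placeholders):

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "log"
        "net/http"
    )

    func main() {
        // Query over the indexed fields listed above: all eosio.token
        // transfers received by "myaccount" (account and limit are placeholders).
        graphql := `{
          searchTransactionsForward(query: "receiver:eosio.token action:transfer data.to:myaccount", limit: 5) {
            results { trace { id matchingActions { json } } }
          }
        }`
        body, _ := json.Marshal(map[string]string{"query": graphql})
        req, _ := http.NewRequest("POST", "https://mainnet.eos.dfuse.io/graphql", bytes.NewReader(body))
        req.Header.Set("Authorization", "Bearer "+"<API_TOKEN>") // assumed auth scheme
        req.Header.Set("Content-Type", "application/json")
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
        var out map[string]interface{}
        if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
            log.Fatal(err)
        }
        fmt.Println(out) // whole transactions, with the matching actions highlighted
    }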

The product itself is fork-aware and provides multiple guarantees (thanks to custom cursors) not present in other products. It is a distributed system, which can provide higher or lower replication factors for different parts of the chain, and reversible segments of the chain are queryable in tiny 1-block indexes. This means real-time querying of the reversible segment, in both directions (ASC or DESC), and it also means you can have streaming search (real-time listening) on incoming blocks. The solution also allows you to filter out what you don’t want when indexing, on two axes: only a part of the history (so time-wise), and/or by flushing out unwanted content (filtering). Note also that this software (like all dfuse components) is detached from nodeos execution: it can be used to re-index large networks in a matter of minutes (provided enough CPU power, of course :), without the need to replay the chain. All dfuse components are also designed with parallelism in mind, to allow those re-indexings to be done in parallel. dfuse Search feeds from the Firehose.

But mind you, this is an index of raw actions. It doesn’t provide the current state, or the past state (although actions do convey the state changes they caused, in the form of old_row and new_row). It does not do aggregation queries. It is not useful for getting all the latest token balances of an account. It is pretty much overkill for wallets that need a (often short) list of recent transactions for a new account they’re serving. It requires huge amounts of RAM, and of storage if you want to keep everything. It also requires a K/V store to be loaded with the actual transaction and block contents, since the indexes only contain that: indexes. It’s great for finding a needle in a haystack fast, but you’d better have a need for it, because the cost can be much higher (especially if you don’t filter anything :)

You can see it in action here. Here’s an architecture overview of its components. Search query docs.


The dfuse Firehose contains a stream of all the data. Think of it as a better SHIP: something that can be consumed online because it doesn’t hit a node. It can be filtered on demand when the user queries the service (instead of needing to configure a nodeos process to filter what it writes). It contains block state (with data to generate merkle proofs), all transaction traces, and rich data about actions (including RAM consumption/releases and their cause, and state deltas at the action level), all feature upgrade operations, both global and user-centric resource limit deltas, and all deferred transaction events: creations, cancellations, etc… (yeah, I know it’s deprecated, but it’s there). Basically, it lacks absolutely nothing you could desire if you squeezed a node executing transactions. It is backed by two things: 1) files that contain past blocks (and their traces, etc…), chunked by 100 blocks, usually stored in some object storage, shared disk or whatever; those files include all forks seen; 2) a live feed from one or more nodeos nodes (for high availability). This service is what feeds all the other higher-level services. It is fork-aware (it helps a consumer navigate forks), through a similar use of cursors as Search, for guaranteed linearity, across disconnections, etc…

It’s a very raw form of streaming blockchain data access, and it does that extremely well and reliably. It is also the fastest thing you’ll see, as nodes race to push out data to consumers (if there are 3 nodes in the cluster, the first to see a block pushes it out to consumers, and the other 2 will be dedup’d out).

The Firehose is currently served as a gRPC service, with data binary-packed in Protobuf. See GitHub - dfuse-io/playground-firehose-eosio-go: Playground to play with EOSIO Firehose service (with stats) for sample code and to start using it.
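As a sketch of what a consumer looks like (modeled on that playground repo; the generated stub names, import path, and endpoint address are assumptions that may differ between versions):

    package main

    import (
        "context"
        "fmt"
        "io"
        "log"

        pbbstream "github.com/dfuse-io/pbgo/dfuse/bstream/v1"
        "google.golang.org/grpc"
    )

    func main() {
        // Plaintext connection to a self-hosted firehose (address is an assumption);
        // the hosted service additionally requires per-RPC token credentials.
        conn, err := grpc.Dial("localhost:13035", grpc.WithInsecure())
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        client := pbbstream.NewBlockStreamV2Client(conn)
        stream, err := client.Blocks(context.Background(), &pbbstream.BlocksRequestV2{
            StartBlockNum: 100000000, // stream forward from this block
            ForkSteps: []pbbstream.ForkStep{
                pbbstream.ForkStep_STEP_NEW,
                pbbstream.ForkStep_STEP_UNDO, // fork-awareness: blocks being undone
            },
        })
        if err != nil {
            log.Fatal(err)
        }

        for {
            msg, err := stream.Recv()
            if err == io.EOF {
                break
            }
            if err != nil {
                log.Fatal(err)
            }
            // msg.Block is a packed Protobuf EOSIO block (with traces, deltas, etc.);
            // the cursor lets you reconnect with guaranteed linearity.
            fmt.Println(msg.Step, msg.Cursor)
        }
    }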

This service is not useful for querying a random transaction. It’s not made to query the current state, nor the past state (although actions will come with their state deltas). It is not fit for searching the history if you’re looking for a needle in the haystack (unless you want the system to process TBs of data, by opening all the files and parsing them, and taking a lot of time). It will certainly not help you list your current token balances.


The dfuse State DB is a purpose-built piece of software, a specialized database that provides a full snapshot of the whole state at each block. Of course, it does not make sense to clone 8GB of RAM to storage every 0.5s, and thankfully the whole 8GB doesn’t change at each block. State DB, backed solely by a K/V store, uses a special-purpose indexing strategy to allow quick querying of any state, at any block height. The service can also do on-the-fly decoding of ABIs, or provide the rows in binary. Its main purpose is to allow fetching of large tables in one consistent sweep, which would be impossible to do reliably/consistently by hitting /v1/.../get_table_rows: when iterating in chunks of 1000, at any moment a new block could come and invalidate what you already fetched.
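For illustration, a minimal Go sketch of fetching a whole table in one consistent pass, pinned at a block height, via the /v0/state/table REST endpoint referenced below (the host, token handling, and example table are assumptions):

    package main

    import (
        "fmt"
        "io"
        "log"
        "net/http"
    )

    func main() {
        // Fetch the full eosio.token balances table for one scope, pinned to a
        // specific block so the result is internally consistent.
        url := "https://mainnet.eos.dfuse.io/v0/state/table" +
            "?account=eosio.token&scope=myaccount&table=accounts" +
            "&block_num=100000000&json=true"
        req, _ := http.NewRequest("GET", url, nil)
        req.Header.Set("Authorization", "Bearer "+"<API_TOKEN>") // assumed auth scheme
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
        body, _ := io.ReadAll(resp.Body)
        fmt.Println(string(body)) // rows decoded to JSON via on-the-fly ABI decoding
    }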

The general purpose tech is GitHub - dfuse-io/fluxdb: A temporal database for blockchain state … applied in EOSIO as the StateDB here: dfuse-eosio/statedb at develop · dfuse-io/dfuse-eosio · GitHub It’s mostly exposed through REST today, but exists as a gRPC service defined here: proto-eosio/statedb.proto at master · dfuse-io/proto-eosio · GitHub (which supports streaming of the rows in the table, instead of a buffered dump).

This service also supports parallelized processing for extremely quick processing of large networks. However, by design it requires the linear history of row additions, updates and removals for a given table, so a first pass of parallel processing slices the full history into 100 full histories (each containing only a subset of tables). A second parallel operation can then insert those much smaller full-histories, tackling only 1/100th of the tables in each slice.

The live server is fork-aware, and allows querying of the reversible segment.

This service does not do any aggregation. It also does not support conditional filtering or pagination (although the latter two could eventually be implemented). Today, it does not implement secondary indexes either. It is used by other services that bootstrap from a snapshot and then stream changes onward (like the tokenmeta service).

See GET /v0/state/table | dfuse docs and other REST endpoints under /v0/state.


The dfuse Account History is yet another purpose-built database, also backed by a K/V store. It is designed to provide a fixed number of historical actions for each account, or for each contract+account tuple; say, 1000 transactions per account. The goal is to be able to provide the full history for accounts that don’t do crazy amounts of transactions, yet flush out those really spammy accounts that do a lot.

Again, this process was designed to be able to process the history in parallel, and to purge the extraneous data going forward. It is an autonomous service and runs independently of the other user-facing services. It feeds from the dfuse Firehose.

It is exposed through GraphQL. You can try a sample query here: GraphiQL: Discover the dfuse GraphQL interface
It also has a gRPC interface (not exposed publicly on our hosted version). You can find its gRPC definition here: proto-eosio/accounthist.proto at master · dfuse-io/proto-eosio · GitHub


tokenmeta is yet another specialized service that serves the token balances of users, and the token holders for a given contract, extremely fast and with a single query. It holds everything in memory, boots by fetching consistent snapshots from StateDB, and then stays up to date with the Firehose. It otherwise runs completely independently of dfuse Search and dfuse Account History.

It is accessible here through GraphQL: GraphiQL: Discover the dfuse GraphQL interface
It also has an internal gRPC interface defined here: proto-eosio/tokenmeta.proto at master · dfuse-io/proto-eosio · GitHub


Two other services are not directly exposed but exist in the dfuse stack:


More recently, we’ve released GitHub - dfuse-io/dkafka: kafka cloudevent integration, a Firehose-to-Kafka pipeline that deals with reorgs to give good guarantees to Kafka stream consumers. Not a data service you’d expose online, but it shows that people get creative when the Firehose is available :)


I think that covers most of the networked services… let me know if things aren’t clear.

Oh boy, I forgot two things!!


The famous “push guarantee” endpoint. A middleware sitting in front of the native /v1/chain/push_transaction of a regular node that will intercept the response (which includes traces) and do one of two things: 1) return the error if it’s an error, or 2) listen on the Firehose for when the submitted transaction is included in a block, at different confirmation levels (X-Eos-Push-Guarantee: in-block or handoff:1, handoff:2, …, irreversible), and return the traces of execution from the signed block. The endpoint also periodically resubmits transactions to nodes when they are not seen in the Firehose. This allows for greatly simplified client code.

It is a simple service: it feeds from the Firehose (to get the actually executed transaction traces) and simply reformats the traces as a standard nodeos response.

In its current form, it does not sign transactions nor affect the transaction content in any way.
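Client usage is just an extra header on the standard push call; a minimal Go sketch (the host and token handling are assumptions, the header values come from the description above):

    package main

    import (
        "bytes"
        "fmt"
        "io"
        "log"
        "net/http"
        "os"
    )

    func main() {
        // signed.json holds an already-signed transaction in the standard
        // push_transaction format; signing happens entirely client-side.
        signedTx, err := os.ReadFile("signed.json")
        if err != nil {
            log.Fatal(err)
        }
        req, _ := http.NewRequest("POST",
            "https://mainnet.eos.dfuse.io/v1/chain/push_transaction",
            bytes.NewReader(signedTx))
        req.Header.Set("Content-Type", "application/json")
        req.Header.Set("Authorization", "Bearer "+"<API_TOKEN>") // assumed auth scheme
        // Block the response until the transaction is seen in a block.
        req.Header.Set("X-Eos-Push-Guarantee", "in-block")
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
        body, _ := io.ReadAll(resp.Body)
        fmt.Println(string(body)) // standard nodeos-style response with executed traces
    }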


Last but not least is the Transaction Lifecycle, yet another service.

It is WebSocket-based (not gRPC yet), and offers a stream of such objects: TransactionLifecycle | dfuse docs … It allows a client to track all of the states a transaction can be in, and notifies you the moment there’s a change. Say a transaction is pending, then is executed in a block: you’ll receive another object with the details. It supports all the variations of state, like delayed, expired, soft_fail, hard_fail, pending, executed, … and also tracks the transaction that creates a deferred, the blocks in which the deferred are created and executed, etc… (yeah, I know it’s deprecated :P).

It is called with a transaction ID (see get_transaction_lifecycle | dfuse docs).

The service requires a loaded trxdb (which is a key/value store of raw protobuf transactions) to satisfy the fetch: true query, which then fetches from the database the state of any transaction in the history (well, at least those stored in trxdb, which can be partial).
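A minimal Go sketch of subscribing to a lifecycle over the WebSocket API (the message shape follows the get_transaction_lifecycle docs; the endpoint address and token handling are assumptions):

    package main

    import (
        "fmt"
        "log"

        "github.com/gorilla/websocket"
    )

    func main() {
        url := "wss://mainnet.eos.dfuse.io/v1/stream?token=" + "<API_TOKEN>"
        conn, _, err := websocket.DefaultDialer.Dial(url, nil)
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // fetch: true returns the current state from trxdb;
        // listen: true streams an update on every state change.
        sub := map[string]interface{}{
            "type":   "get_transaction_lifecycle",
            "req_id": "lifecycle-1",
            "fetch":  true,
            "listen": true,
            "data":   map[string]string{"id": "<TRANSACTION_ID>"},
        }
        if err := conn.WriteJSON(sub); err != nil {
            log.Fatal(err)
        }

        for {
            var msg map[string]interface{}
            if err := conn.ReadJSON(&msg); err != nil {
                log.Fatal(err)
            }
            fmt.Println(msg["type"]) // one message per lifecycle state change
        }
    }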

More details about the data artifacts of the different components can be seen here: Data Stores & Artifacts | dfuse docs (like the meaning of that trxdb), and an overview of components is available here: Understanding Components | dfuse docs


Currently, running dfuseeos allows you to run all of the above services in a single binary, all running at the same time, all served from your laptop, in a dev environment. Large prod environments need a more involved setup, but that is also documented in the Admin guide of the dfuse docs.


Ok, that’s really it :slight_smile: I think!

What does the required stack for the firehose look like? I think I’m most curious whether it’s possible to run a lightweight firehose on a blockchain like EOS without needing a massive server to do it.

I’ve been looking at the dfuse stack to see if there are small pieces we can make use of (instead of the entire stack) and was just curious.

Running dfuseeos start usually gets you the whole stack running, including a producing node, a “reading” node, and all the services. So deploying a firehose could be as simple as that.

The minimum setup you’d want for a firehose is much more lightweight than those dfuse setups you might have seen around. Things like dfuse Search, dfuse StateDB, and some other specialized services are not necessary to run a firehose.

A wrapped nodeos (with dfuseeos start mindreader) to produce the data is necessary, and then the firehose service (with dfuseeos start firehose) is sufficient to serve requests and prepare for scaling/decoupling. You could run dfuseeos start mindreader,firehose in a single process and it would work, but decoupling the generation of block data (the protobuf messages for each block) from serving requests is a good initial step. That’s the simple setup.

After that, you can start augmenting the High Availability properties of your setup. See the graphics in the doc here: RFC - Firehose

The Firehose, that simple pattern, has been brought into The Graph ecosystem, is going to be brought to all major chains, and will underpin all of The Graph’s indexing technology. I think it’d be well worth having a good grasp of it. Happy to jump on a call to dig into the details; I’m sure you’d be very excited at where it’s going :)