Bitcoin to ClickHouse

TL;DR
Loading real-time Bitcoin blockchain data (blocks, transactions) into ClickHouse for analysis is valuable but challenging due to duplicate records often generated during streaming. Traditional ClickHouse deduplication methods (like ReplacingMergeTree with FINAL) can be slow or inefficient for high-volume, real-time ingestion. This article explores how GlassFlow, an open-source stream processor, solves this by providing stateful auto-deduplication before data reaches ClickHouse. This ensures cleaner, more accurate data in ClickHouse, simplifies the ETL pipeline, and improves query performance without relying on costly database-side deduplication.
Introduction
ETL (Extract, Transform, Load) is the process of getting data from a source, processing it, and loading it into a data store. It is not only the processing, such as deriving metrics and cleaning the data, that is challenging, but also the extraction and loading of the data itself.
In this article we take a bird's-eye view of how to apply this process to the Bitcoin blockchain (the same steps apply to other blockchains as well) to suit any data engineering use case: what the challenges are, how to address them, and a few tips and tricks for optimising the process without breaking the bank. Interested? Stick around for the ride.
What are the benefits of having real-time Bitcoin data?
All Bitcoin data is stored on the blockchain. Data analysts, traders, market watchers and hobbyists can leverage it for their own use cases.
Simpler use cases:
- Whale Watching: Track large movements of Bitcoin to/from exchanges or between wallets, potentially indicating market sentiment shifts.
- Exchange Flow Monitoring: Observe inflows and outflows from exchanges, which can signal buying or selling pressure.
- Transaction Fee Analysis: Monitor mempool data to understand current network congestion and optimal transaction fees in real-time.
- Network Health Monitoring: Track block times, hash rate, and mempool size to gauge the overall health and security of the Bitcoin network.
More advanced use cases:
- Correlate real-time price and volume data with news feeds and social media sentiment to understand market reactions.
- Use real-time data streams to feed machine learning models that attempt to predict price movements, volatility, or network congestion.
- For businesses accepting Bitcoin: Monitor incoming transactions in real-time, track confirmations, and update order statuses.
E for Extract.
One option is to host our own VM and sync the Bitcoin blockchain to our own server. Depending on the server specs, this can take from days to weeks. You can read more about how to set up your own node here.
Others use public JSON-RPC endpoint services, where synced nodes are already up and running for you. However, these are paid services and can get expensive if you need data synced from the blockchain's inception.
The choice will most likely depend on the project's requirements and needs: home-lab users go with their own servers, while more serious projects go with paid services.
In the case of Bitcoin, extraction means pulling blocks and transactions and storing them in a datastore; the sketch below shows a minimal version of this step. Before loading anything, though, we must go over schemas and data duplication.
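To make the extraction step concrete, here is a minimal sketch that pulls the latest block, including its decoded transactions, from a Bitcoin Core node over JSON-RPC. The endpoint and credentials are assumptions about a locally hosted, fully synced node; adjust them to your own setup.

```python
import requests

# Assumed: a local Bitcoin Core node with JSON-RPC enabled. The URL and
# credentials are placeholders for whatever is set in your bitcoin.conf.
RPC_URL = "http://127.0.0.1:8332"
RPC_AUTH = ("rpcuser", "rpcpassword")

def rpc(method, params=None):
    """Minimal JSON-RPC helper for Bitcoin Core."""
    payload = {"jsonrpc": "1.0", "id": "etl", "method": method, "params": params or []}
    resp = requests.post(RPC_URL, json=payload, auth=RPC_AUTH, timeout=30)
    resp.raise_for_status()
    return resp.json()["result"]

# Latest block height -> block hash -> full block; verbosity level 2
# returns every transaction in decoded form.
height = rpc("getblockcount")
block_hash = rpc("getblockhash", [height])
block = rpc("getblock", [block_hash, 2])

print(block["height"], block["time"], len(block["tx"]))
```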
Why Real-Time Deduplication Is Important
There will be a point (a question of when, not if) when duplicates get stored in the database, and they can cause all sorts of issues, from bad decision making by data analysts to increased disk consumption on ClickHouse clusters.
- Compromised Data Accuracy: Analytical queries, such as calculating total transaction volumes, unique address activity, or network fees, will produce skewed and unreliable results.
- Inflated Storage Costs: Redundant data consumes unnecessary disk space, leading to higher operational costs.
- Degraded Query Performance: ClickHouse may need to scan and process a larger volume of data, including the duplicates, potentially slowing down query execution.
- Increased Complexity in Analytics: Analysts might need to implement complex query-side logic to attempt to filter out duplicates, adding an extra layer of effort and potential for error.
The most likely reasons for it include (but are not limited to):
- Request retries (which can resend identical data after transient failures)
- Failure of data extraction or data processing - necessitating re-runs that might re-ingest or re-create already handled records.
- Issues during insertion into ClickHouse - like retrying entire batches upon partial failure.
- Kafka topic retries - where consumers re-process messages due to errors or restarts without proper offset management.
These, alongside other unknowns, can all introduce duplicates if not meticulously handled with idempotency, or with at-least-once processing followed by deduplication. The sketch below illustrates one such failure mode.
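As a hedged illustration (the topic, table and column names here are hypothetical), consider a Kafka consumer that inserts into ClickHouse and only then commits its offset. A crash between the two steps means the same message is redelivered and inserted twice:

```python
import json
from confluent_kafka import Consumer
import clickhouse_connect

# Hypothetical topic, table and column names, purely for illustration.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "btc-loader",
    "enable.auto.commit": False,  # offsets are committed manually below
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["bitcoin.transactions"])
ch = clickhouse_connect.get_client(host="localhost")

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    tx = json.loads(msg.value())

    # Step 1: write the record to ClickHouse.
    ch.insert(
        "bitcoin_transactions",
        [[tx["txid"], tx["size"], tx["fee"]]],
        column_names=["txid", "size", "fee"],
    )

    # Step 2: commit the offset. If the process crashes between step 1 and
    # step 2, the message is redelivered on restart and inserted again:
    # an at-least-once duplicate that ClickHouse will happily store.
    consumer.commit(message=msg)
```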
Handling these potential sources of duplicates requires a solution, ideally one that deduplicates data before it even reaches ClickHouse. This is where GlassFlow can help. As an open-source stream processor specifically designed for scenarios like Kafka-to-ClickHouse pipelines, it acts as an intermediary processing layer. Setting up its deduplication capabilities is quite straightforward: it essentially adds a deduplication step *before* ingestion into ClickHouse, using stateful processing to identify and filter duplicates based on defined criteria (such as transaction or block hashes in our case). With a few clicks, you can configure it to ensure only unique records proceed downstream, preventing the duplicate problem at the source.
Bitcoin Schemas and Why They Are Important
Blocks
Blocks provide a chronological and ordered record of confirmed network activity for time-series analysis of transaction volume, fees, and mining statistics. Analyzing block metadata like difficulty, size, and time helps understand network health, congestion, and security trends over time. The linkage via previousblockhash ensures data integrity and allows for tracing the evolution of the entire ledger.
Transactions
Transactions are the fundamental units of value transfer, which enables the analysis of fund flows, economic activity, and user behavior. By examining inputs (vin) and outputs (vout), analysts can trace Bitcoin movements, identify significant transfers (e.g., "whale alerts"), and understand wallet interactions. Analyzing fee in relation to transaction size and network congestion provides insights into user urgency and market dynamics.
Each block can contain thousands of transactions - 12,195 SegWit transactions per block at the theoretical maximum, to be exact. A new block is produced approximately every 10 minutes, so there is a lot of data to process.
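To ground the schema discussion, here is an illustrative (and deliberately simplified) pair of ClickHouse tables for blocks and transactions, created through the clickhouse-connect Python client. The column selection is an assumption; real pipelines typically carry many more fields.

```python
import clickhouse_connect

# Assumed: a local ClickHouse instance with default credentials.
client = clickhouse_connect.get_client(host="localhost", username="default", password="")

# Block-level metadata for time-series analysis of network activity.
client.command("""
CREATE TABLE IF NOT EXISTS bitcoin_blocks (
    hash               String,
    height             UInt64,
    time               DateTime,
    previousblockhash  String,
    difficulty         Float64,
    size               UInt64,
    tx_count           UInt32
) ENGINE = MergeTree
ORDER BY (height, hash)
""")

# Transaction-level records for fund-flow and fee analysis.
client.command("""
CREATE TABLE IF NOT EXISTS bitcoin_transactions (
    txid        String,
    block_hash  String,
    block_time  DateTime,
    size        UInt32,
    fee         Float64,
    vin_count   UInt16,
    vout_count  UInt16
) ENGINE = MergeTree
ORDER BY (block_time, txid)
""")
```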
Deduplication Strategies in ClickHouse, or Why You Should Avoid Them in Real-Time Data Streaming
ReplacingMergeTree Engine
The ReplacingMergeTree table engine is designed to handle duplicates by replacing rows that share the same sorting key. During background data part merges, older versions of rows are discarded, leaving only one row. This happens whenever ClickHouse performs its routine merge operations; however, there are several downsides:
- Duplicates available at query time: Duplicates will exist in the table until a merge process completes. For real-time use, this delay means applications will query duplicate and inconsistent data.
- Compute Overhead: Merges are resource-intensive (CPU, I/O). Relying on them as the primary deduplication mechanism for high-volume inserts will increase load, impacting ClickHouse performance and ingestion rates. Relying solely on the merge process also results in additional infrastructure spend.
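For reference, this is roughly what the database-side approach looks like: a ReplacingMergeTree keyed on the transaction id (again an assumed, simplified schema). Duplicates are only collapsed whenever ClickHouse eventually merges the underlying parts.

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# Rows sharing the same ORDER BY key (txid) are collapsed only during
# background merges, keeping the row with the latest inserted_at value.
# Until a merge runs, the table can still return duplicates.
client.command("""
CREATE TABLE IF NOT EXISTS bitcoin_transactions_rmt (
    txid        String,
    block_hash  String,
    block_time  DateTime,
    fee         Float64,
    inserted_at DateTime DEFAULT now()
) ENGINE = ReplacingMergeTree(inserted_at)
ORDER BY txid
""")
```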
FINAL Modifier in Queries
To ensure queries against a ReplacingMergeTree table return fully deduplicated results, ClickHouse offers the FINAL modifier (e.g., SELECT ... FROM bitcoin_transactions FINAL). This performs all deduplication logic at query time, before returning results. But again, there are downsides:
- Severe Performance Impact: Using FINAL will dramatically slow down queries. It forces ClickHouse to read potentially more data from disk and perform computation on the fly to remove duplicates.
- Impractical for Real-Time Analytics: The penalty when using FINAL makes it unsuitable for applications that need frequent, low-latency queries, such as (near) real-time dashboards.
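The sketch below contrasts a plain query with a FINAL one against the hypothetical ReplacingMergeTree table from the previous example; the second returns deduplicated counts but pays the on-the-fly merge cost on every read.

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# May still count duplicate rows that have not been merged away yet.
fast = client.query("SELECT count() FROM bitcoin_transactions_rmt")

# Forces deduplication at read time: correct, but noticeably slower on
# large tables and unsuitable for low-latency dashboards.
exact = client.query("SELECT count() FROM bitcoin_transactions_rmt FINAL")

print(fast.result_rows[0][0], exact.result_rows[0][0])
```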
This makes us look for better deduplication handling, ideally before data is ingested into ClickHouse. That would bring several benefits:
- Cleaner data to be inserted.
- Reduced storage consumption.
- Consistently faster query performance.
Why GlassFlow Was the Right Choice
Given the challenges of duplicate records in ClickHouse for real-time scenarios, a solution capable of processing and filtering data before it reaches ClickHouse was required. GlassFlow has a specific set of features that directly addresses the challenge.
Stateful Processing
To determine whether an incoming transaction or block has been seen before, some part of our pipeline must remember processed data (or at least some part of it). GlassFlow is designed for stateful operations, which means it can retain information about hashes from the Bitcoin stream over a time window.
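Conceptually (this is only an illustrative sketch, not GlassFlow's actual API), such a stateful step boils down to remembering the keys seen within a time window and dropping repeats:

```python
import time

class WindowedDeduplicator:
    """Remembers keys (e.g. txids or block hashes) seen within the last
    `window_s` seconds and filters out records whose key was already observed."""

    def __init__(self, window_s: float = 3600.0):
        self.window_s = window_s
        self.seen = {}  # key -> last-seen timestamp

    def is_duplicate(self, key: str) -> bool:
        now = time.time()
        # Evict keys that have fallen out of the window (simple, unoptimized sketch).
        self.seen = {k: t for k, t in self.seen.items() if now - t <= self.window_s}
        if key in self.seen:
            return True
        self.seen[key] = now
        return False

dedup = WindowedDeduplicator(window_s=600)
incoming = ["tx_a", "tx_b", "tx_a"]  # hypothetical txids from the stream
unique = [tx for tx in incoming if not dedup.is_duplicate(tx)]
print(unique)  # ['tx_a', 'tx_b'] -- the repeated tx_a is dropped
```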
Auto Deduplication: Streamlined Duplicate Management
GlassFlow offers an "Auto deduplication" feature. This allows the stream processing pipeline to automatically identify and filter out duplicate records based on defined criteria. This means that for our use case all records with identical hashes arriving within a time period would be considered duplicates.
This provides a great and quick way to ensure that only unique Bitcoin block or transaction records are forwarded to ClickHouse.
This also brings the benefits discussed before: only unique records are inserted into ClickHouse, which reduces the load on the database, improves performance, and saves money down the line.
Cleaner Data and Simplified Pipeline
Implementing deduplication with GlassFlow translates to higher quality data within ClickHouse. The effect is a reduction in duplicate records within tables. By intercepting and filtering out redundant Bitcoin transactions or blocks before they reach the database, the dataset becomes inherently cleaner.
This pre-emptive cleansing ensures reliable data for analysis. Metrics derived from this blockchain data, such as transaction volumes, active address counts, or fee calculations, are more accurate, enabling better decision making based on those metrics. This also removes the need to resort to ClickHouse's ReplacingMergeTree and the FINAL query modifier to remove duplicates, as the responsibility for ensuring uniqueness is handled earlier in the pipeline.
Also, a cleaner base dataset allows additional compression and performance optimizations for ClickHouse to work more efficiently. For example, ClickHouse's compression codecs (such as ZSTD or LZ4HC) can achieve better ratios on string- or JSON-based columns.
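As one example of such an optimization (the table and column names are assumed from the earlier sketches), a per-column codec can be set or changed after the fact:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# Switch a large, text-heavy column to ZSTD compression; on deduplicated,
# repetitive data this typically compresses better than the default LZ4.
client.command("""
ALTER TABLE bitcoin_transactions
MODIFY COLUMN block_hash String CODEC(ZSTD(3))
""")
```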
Conclusion
Throughout this article, we've discussed the challenges of data duplication, a common pain point in streaming architectures.
We've gone over how to deduplicate by leveraging GlassFlow. GlassFlow filters the data stream, ensuring that only unique Bitcoin blocks and transactions proceed to ClickHouse.
This proactive cleansing not only streamlines the ETL pipeline but also directly translates into a more performant and cost-efficient ClickHouse setup.
The core takeaway: integrating stateful stream processing early in the data lifecycle removes a lot of headaches, saves money and time, and enables users to make more accurate, data-driven decisions with confidence.
This naturally leads to a few questions:
- Are there any opportunities to improve deduplication before data lands in your ClickHouse database?
- How do you deal with deduplication?