ClickHouse: MergeTree vs SummingMergeTree
When we started using ClickHouse, the MergeTree and SummingMergeTree concepts were very confusing. Not only should one understand what each means; it is also crucial to know when to use which, as picking the wrong one would either result in resolution/data loss or incur a huge amount of storage space.
The Problem
Assume CompanyX has a fleet of trucks running on the streets. The current speed of each truck is sent every second to a live server. For the problem's sake, say there are 100,000 trucks, a large number of which are active and keep sending in data.
The ingestion rate in this example is 100,000 data points per second (even the idle trucks are assumed to report their current speed, which is 0), and assume we are sending this data to something like Kafka. A consumer subscribed to Kafka reads this data in chunks/batches and writes it to our ClickHouse database.
The problem scenario so far looks something like this,
In order to understand the problem better, let's take a single truck's data for a given minute. In any minute, the number of rows written to ClickHouse is 60 (as there are 60 seconds in a minute). However, this level of granularity means that storage is going to be heavy, and our customer only cares about minute-level granularity. So all we need to store is the "average" speed for every minute.
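To make the goal concrete, here is a sketch of the minute-level average computed at query time over raw per-second rows. The table and column names (`truck_speed_raw`, `truck_id`, `ts`, `speed`) are hypothetical, not from the original article:

```sql
-- Hypothetical raw table: one row per truck per second.
-- The customer-facing number is the per-minute average:
SELECT
    truck_id,
    toStartOfMinute(ts) AS minute,
    avg(speed)          AS avg_speed
FROM truck_speed_raw
GROUP BY truck_id, minute;
```

Storing all 60 raw rows per truck per minute just to answer this query is exactly the storage cost we want to avoid.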
MergeTree
Let's assume our consumer reads the data in chunks, and in each chunk it reads, say, 15 records (15 seconds of speed) for a particular truck. With a MergeTree, this produces a total of 4 records per minute. The write cycle would look like this
As shown in the diagram above, the next batch would insert another record for the next 15 seconds, another one for the 15 seconds after that, and so on. Our final table would look like this.
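A minimal sketch of this setup, with illustrative names: the consumer pre-averages each 15-second batch and inserts one row per batch, so a plain MergeTree accumulates up to 4 rows per truck per minute and never collapses them:

```sql
CREATE TABLE truck_speed_minute_mt
(
    truck_id  UInt32,
    minute    DateTime,  -- timestamp truncated to the minute
    avg_speed Float32    -- average over one ~15-second batch
)
ENGINE = MergeTree
ORDER BY (truck_id, minute);

-- Each batch inserts its own row; rows sharing the same
-- (truck_id, minute) key sit side by side in the table:
INSERT INTO truck_speed_minute_mt VALUES (42, '2020-01-01 08:00:00', 54.2);
INSERT INTO truck_speed_minute_mt VALUES (42, '2020-01-01 08:00:00', 57.8);
```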
SummingMergeTree
Now, let's see how a SummingMergeTree would have handled this. Note that the "Summing" here really does mean summation: during background merges, ClickHouse replaces all rows that share the same sorting key with a single row containing the sums of their numeric columns.
The SummingMergeTree would do exactly what the MergeTree did for the first 15-second batch, i.e., insert one record with the aggregated speed. However, for the second stream of data, from 8:00:16 to 8:00:30, instead of keeping a separate record around forever, ClickHouse would collapse it into the previously inserted row during a background merge, summing the numeric columns of rows that share the same sorting key. That way the table converges towards a single record for each minute.
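A sketch with SummingMergeTree (again, names are illustrative). Because the engine sums numeric columns, storing a running sum of speeds plus a sample count lets the average be recovered at read time; and because merges happen in the background at an unspecified time, the query should still aggregate with GROUP BY rather than assume rows are already collapsed:

```sql
CREATE TABLE truck_speed_minute_smt
(
    truck_id  UInt32,
    minute    DateTime,
    speed_sum Float64,  -- sum of per-second speeds in the batch
    samples   UInt32    -- number of seconds covered by the batch
)
ENGINE = SummingMergeTree
ORDER BY (truck_id, minute);

-- Background merges collapse rows sharing (truck_id, minute) by
-- summing speed_sum and samples. Until a merge runs, aggregate at read time:
SELECT
    truck_id,
    minute,
    sum(speed_sum) / sum(samples) AS avg_speed
FROM truck_speed_minute_smt
GROUP BY truck_id, minute;
```

Storing the sum and the count, rather than a pre-computed average, is what keeps the average correct when batches of unequal length are merged together.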
In the worst case, where only 1 second is read per batch, the MergeTree risks storing 60 records per minute, whereas the SummingMergeTree eventually collapses them into a single record per minute once background merges have run.
This is a huge difference when it comes to storage space, and it is extremely important to know the distinction between these two engines.
When to use one over the other?
As shown in the example above, you can generally reach for SummingMergeTree when it meets the given requirements, as it is far more efficient in storage space. Even for more granular data, you could still use a SummingMergeTree with an appropriate definition of granularity. There is a huge difference between 1 GB and 60 GB, and in the world of time series, this is directly related to performance and cost.
Hope this article helped. Let me know if you have any feedback/questions.
Francium Tech is a technology company laser focused on delivering top quality software of scale at extreme speeds. Numbers and Size of the data excite us. If you have any requirements in building a large scale application or want a free health check of your systems or architecture, feel free to shoot an email to contact@francium.tech, we will get in touch with you!