If you have small volumes of data (say, under a terabyte), TimescaleDB is fine to use, provided you can live with slower query performance.
That typo is a great mashup of TimescaleDB and ClickHouse.
ClickHouse performance is better because it's truly column-oriented and has powerful partitioning tools.
However, ClickHouse has quirks and isn't great if you need low-latency updates or if your data is mutable.
It's like many popular data lake solutions, but it's open-source and written in Rust, which makes it fairly easy to extend for the many people who already know the language.
It looks like it has slightly worse on-disk data compression than ClickHouse, and slightly worse performance for some query types when the queried data isn't cached by the operating system's page cache, according to the link above (e.g. when you query terabytes of data that don't fit in RAM).
Are there additional features, other than S3 storage, that could convince a ClickHouse user to switch to Databend?
[1] https://docs.victoriametrics.com/victorialogs/
[2] https://docs.victoriametrics.com/victorialogs/data-ingestion...
[1] https://github.com/VictoriaMetrics/VictoriaMetrics/tree/mast...
https://clickhouse.com/blog/building-a-logging-platform-with...
(Full disclosure: I work for ClickHouse and love it here!)
But it didn't work out.
One of the reasons is that LogFire lets users fetch their service data with arbitrary SQL queries.
So they had to build their own backend in Rust, on top of DataFusion.
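For a rough idea of what "arbitrary SQL on top of DataFusion" looks like, here's a minimal sketch using DataFusion's Python bindings (LogFire's actual backend is Rust; the table name, path, and columns here are all made up):

    from datafusion import SessionContext

    # Sketch only: register a directory of Parquet files as a table,
    # then let users run whatever SQL they want against it.
    ctx = SessionContext()
    ctx.register_parquet("records", "data/records/")

    df = ctx.sql("""
        SELECT service_name, count(*) AS errors
        FROM records
        WHERE level = 'error'
        GROUP BY service_name
        ORDER BY errors DESC
    """)
    print(df.to_pandas())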
I used ClickHouse myself and it's been nice, but it's easy when you get to decide what schema you need yourself. For small-to-medium needs, ClickHouse plus Grafana works well.
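To give a concrete idea of the kind of schema decision I mean, here's a minimal sketch (not my actual setup; every name and the 30-day TTL are assumptions), created through the clickhouse-connect Python client:

    import clickhouse_connect

    # Hypothetical minimal logs table: partitioned by day,
    # ordered for per-service time-range queries, expired after 30 days.
    client = clickhouse_connect.get_client(host="localhost")
    client.command("""
    CREATE TABLE IF NOT EXISTS logs (
        timestamp DateTime64(3),
        service   LowCardinality(String),
        level     LowCardinality(String),
        message   String
    )
    ENGINE = MergeTree
    PARTITION BY toDate(timestamp)
    ORDER BY (service, timestamp)
    TTL toDateTime(timestamp) + INTERVAL 30 DAY
    """)

Grafana's ClickHouse data source can then chart queries against that table directly.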
But I must admit the plug-and-play aspect of great services like Sentry or LogFire makes them so easy to set up that it's tempting to skip self-hosting entirely. They are not that expensive (unlike Datadog), and maintaining your own observability code is not free.
E.g., ClickHouse's interval support, which is an important type for observability, was lacking. You couldn't subtract datetimes to get an interval. If you compared a 2-millisecond interval to a 1-second one, it wouldn't look at the units and would say 2 ms is bigger, etc. So he had to go to the dev team, and after enough back and forth, instead of fixing it they decided to return an error, and he had to insist for a long time before they actually implemented a proper solution.
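To make the quirk concrete, here's a hedged sketch of the kind of expressions at issue, sent through the clickhouse-connect client; what they return (a proper interval, a plain number, an error, or the old unit-blind comparison) depends on your ClickHouse version:

    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")

    # Datetime subtraction: the operation that reportedly couldn't
    # yield an interval; recent versions return the difference as a
    # plain number of seconds rather than a typed interval.
    print(client.query(
        "SELECT now() - toDateTime('2024-01-01 00:00:00')"
    ).result_rows)

    # Cross-unit interval comparison: the 2 ms vs 1 s case described
    # above. Under the old unit-blind behavior, 2 > 1 meant "2 ms is
    # bigger"; later versions error or compare correctly.
    print(client.query(
        "SELECT INTERVAL 2 MILLISECOND < INTERVAL 1 SECOND"
    ).result_rows)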
Quoting him: "But like these endless issues with ClickHouse's flavor of SQL were problematic."
Another problem seemed to be that benefiting from very large scale with things like Parquet data at rest plus a local cache basically meant handing all your money to AWS, because the self-hosted version didn't expose a way to do that yourself. ClickHouse scales fine at my size, so I can only take his word on that front since I'm nowhere near that big.
Funnily enough, after that they moved to Timescale, and the performance didn't work for their use case.
They landed on DataFusion after a lot of trial and error.
But it's a really interesting perspective on the whole thing; you can see he's kind of obsessed with user experience. The guy wrote a popular marshmallow alternative, two popular celery alternatives, and a popular watchdog alternative, all FOSS.
These kinds of people are the source of all imposter syndrome in the world.
I'll publish that video next week on Bite Code if I can. If I can't, it will have to wait three weeks because I'm leaving for a bit. But Charlie Marsh's interview (uv's author) is up, if you are into overachievers.
"But if you will use it, keep in mind that [InfluxDB] uses Bolt as its data backend."
Simply not true. The author seems to have confused the storage that Raft consensus uses for metadata with the storage used for the time series data. InfluxDB has its own custom storage layer for time series data, and has had one for many years. A simple glance at the InfluxDB docs would make this clear.
(I was once part of the core database team at InfluxDB and have edited my comment for clarity.)
Thanks for pointing out this part of the text.
Please note that English is clearly not my native language. Sometimes I may structure sentences in ambiguous ways due to grammatical or other errors in my writing; I'm not sure if that's what happened here.
To avoid further misinformation, let me elaborate on what I meant to say in the paragraph you mentioned. It's quoted below for other readers' convenience.
> But if you will use it, keep in mind that it uses Bolt as its data backend. The same Bolt used in the HashiCorp Vault integrated Raft storage and in etcd. Hence, it may need some maintenance too. More specifically, database compaction.
As per the InfluxDB documentation, "InfluxDB uses BoltDB to store data including organization and user information, UI data, REST resources, and other key value data" (see here: https://docs.influxdata.com/influxdb/v2/reference/config-opt...). To me, that's exactly what a data backend means. If I understand correctly, this piece of software uses different backends for different types of data.
Since I haven't had any experience with InfluxDB itself, it's not clear to me whether its BoltDB storage needs maintenance or not. In my opinion it's just a small detail that may be helpful for some people, so I mentioned it.
I could revise this part of the text for extra clarity, but the information by itself isn't important enough to make me do that.
> There is at least one basic factual error in this blog post, which makes me discount the whole thing.
Everyone makes mistakes; that's why all of us filter and interpret information based on the experience we've accumulated over years of living on this planet. Such a radical change in the perception of a large piece of writing over one small mistake isn't always warranted.
Again, thanks for being interested in making corrections! Have a great rest of your day!
Currently I'm collecting just exception data from services into GlitchTip (a Sentry fork); that seems the most valuable sysadmin-wise, while most security and similar concerns are outsourced to managed hosting companies.
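For anyone curious, wiring a service up to GlitchTip is just the stock Sentry SDK, since GlitchTip speaks the Sentry protocol; the DSN below is a placeholder:

    import sentry_sdk

    # Point the Sentry SDK at your GlitchTip project's DSN.
    sentry_sdk.init(
        dsn="https://examplePublicKey@glitchtip.example.com/1",
        send_default_pii=False,  # keep user data out of the events
    )

    try:
        1 / 0
    except ZeroDivisionError:
        sentry_sdk.capture_exception()  # arrives as an event in GlitchTip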
I was left curious what it would take to DIY the anomaly detection methods Elastic has built in <https://www.elastic.co/guide/en/machine-learning/current/ml-...> with data frame / statistics / ML libraries (Clojure's Noj).
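As a starting point, a rolling z-score is one of the simplest DIY stand-ins for Elastic's built-in anomaly jobs; sketched in Python for brevity (the window and threshold are arbitrary), though the same few lines translate readily to Noj:

    import numpy as np

    # Flag samples that deviate from the recent past by more than
    # `threshold` standard deviations, using a trailing window.
    def anomalies(values, window=60, threshold=3.0):
        values = np.asarray(values, dtype=float)
        flagged = []
        for i in range(window, len(values)):
            past = values[i - window:i]
            mu, sigma = past.mean(), past.std()
            if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
                flagged.append(i)  # index of the anomalous sample
        return flagged

    series = np.concatenate([np.random.normal(0.0, 1.0, 200), [9.0]])
    print(anomalies(series))  # typically flags index 200, the injected spike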