Building Observability with ClickHouse
59 points by valyala 8 days ago | 22 comments
  • zokier 4 days ago |
    I see a lot of hype around ClickHouse these days. A few years ago I remember TimescaleDB making the rounds, arguably the predecessor of this sort of "observability on SQL" thinking. The article has a short paragraph mentioning Timescale, but unfortunately it doesn't really compare it to ClickHouse. How does HN see the situation these days, is ClickHouse simply overtaking Timescale on all axes? That would be a bit of a shame; I have used Timescale a bit and enjoyed it, but only at such a small scale that its operational aspects never really came up.
    • valyala 4 days ago |
      ClockHouse outperforms TimescaleDB in every aspect on large volumes of data. https://benchmark.clickhouse.com/

      If you have small volumes of data (let's say less than a terabyte), then TimescaleDB is fine to use, as long as you can live with slower query performance.

      • xnx 4 days ago |
        > ClockHouse

        That typo is a great mashup of TimescaleDB and ClickHouse.

    • ekabod 4 days ago |
      Clickhouse has been popular for many years. Even before Timescale.
    • shin_lao 4 days ago |
      Timescale "doesn't scale" - in a nutshell.

      ClickHouse performance is better because it's truly column-oriented and it has powerful partitioning tools.

      However, ClickHouse has quirks and isn't great if you need low-latency data updates or if your data is mutable.

    • BiteCode_dev 4 days ago |
      I echo others' sentiment: ClickHouse is much more performant than Timescale.
  • k_bx 4 days ago |
    Another project I want to give a shout-out to is Databend. It's built around the idea of storing your data in S3-compatible storage as Parquet files and querying it with SQL or other protocols.

    It's like many popular data lake solutions, but it's open source and written in Rust, which makes it quite easy to extend for anyone who already knows the language.
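
    For a rough sense of the pattern (Parquet files at rest on object storage, plain SQL on top), here is a hedged sketch using the DataFusion Python bindings that come up elsewhere in this thread rather than Databend's own client; the file path, table name and columns are made up for illustration.

      # Rough sketch of the "Parquet at rest, SQL on top" idea, using DataFusion's
      # Python bindings rather than Databend itself; path and schema are made up.
      from datafusion import SessionContext

      ctx = SessionContext()
      ctx.register_parquet("events", "events.parquet")  # could equally live on S3
      df = ctx.sql("SELECT service, count(*) AS n FROM events GROUP BY service")
      df.show()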

    • valyala 4 days ago |
      Databend performance looks good! https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQWxsb3lEQi...

      It looks like it has slightly worse on-disk data compression than ClickHouse, and slightly worse performance for some query types when the queried data isn't cached in the operating system page cache, according to the link above (e.g. when you query terabytes of data that don't fit in RAM).

      Are there additional features, other than S3 storage, that could convince a ClickHouse user to switch to Databend?

    • Epa095 4 days ago |
      Interesting! Makes me wonder how it pans out compared to DataFusion, which seems to have a lot of traction.
  • dakiol 4 days ago |
    Such a PITA. Unless you have a dedicated team to handle observability, you are in for pain, no matter the tech stack you use.
    • valyala 4 days ago |
      That's not true. There are logging solutions that are very easy to set up and operate. For example, VictoriaLogs [1] (I'm its author). It is designed from the ground up to be easy to configure and use. It ships as a single self-contained executable without external dependencies, which runs optimally on any hardware, from a Raspberry Pi to a monster machine with hundreds of CPU cores and terabytes of RAM. It accepts logs over all the popular data ingestion protocols [2] and provides a very easy-to-use query language for typical log queries - LogsQL [3] (see the sketch after the links below).

      [1] https://docs.victoriametrics.com/victorialogs/

      [2] https://docs.victoriametrics.com/victorialogs/data-ingestion...

      [3] https://docs.victoriametrics.com/victorialogs/logsql/
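
      Here is the rough Python sketch referenced above: push a couple of JSON log lines and query them back over HTTP. The local host/port, field values and the `requests` dependency are just for illustration; the actual ingestion endpoints and LogsQL syntax are documented in [2] and [3].

        # Hedged sketch: ship two JSON log lines to a local VictoriaLogs instance
        # and query them back with LogsQL. Host, port and fields are assumptions.
        import json
        import requests

        BASE = "http://localhost:9428"  # assumed default local instance

        lines = [
            {"_msg": "user logged in", "level": "info", "service": "auth"},
            {"_msg": "connection refused", "level": "error", "service": "auth"},
        ]
        payload = "\n".join(json.dumps(line) for line in lines)
        requests.post(f"{BASE}/insert/jsonline", data=payload, timeout=5).raise_for_status()

        # LogsQL query: all error-level lines from the auth service.
        resp = requests.post(
            f"{BASE}/select/logsql/query",
            data={"query": "service:auth AND level:error"},
            timeout=5,
        )
        resp.raise_for_status()
        print(resp.text)  # one JSON log entry per line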

      • oulipo 4 days ago |
        Interesting, although the docs are not really user-friendly and don't show many screenshots of the UI to give a sense of what the product can do.
  • h1fra 4 days ago |
    Completely rewriting a system because you don't like JSON is a bit extreme
    • rducksherlock 2 days ago |
      I like extreme at work, it gives me adrenaline
  • ebfe1 4 days ago |
    ClickHouse + Grafana is definitely a fantastic choice. Here is another blog post from ClickHouse about dogfooding their own technology and saving millions:

    https://clickhouse.com/blog/building-a-logging-platform-with...

    (Full disclosure: I work for ClickHouse and love it here!)

  • BiteCode_dev 4 days ago |
    Interestingly, I recently interviewed Samuel Colvin, Pydantic's author, and he said that when designing his observability SaaS called LogFire, he tried multiple backends, including ClickHouse.

    But it didn't work out.

    One of the reasons is that LogFire lets users fetch their service data with arbitrary SQL queries.

    So they had to build their own backend in Rust, on top of DataFusion.

    I used ClickHouse myself and it's been nice, but it's easy when you get to decide what schema you need yourself. For small to medium needs, this plus Grafana works well.
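
    To give an idea of what deciding the schema yourself looks like in practice, here is a hedged sketch of a hand-rolled ClickHouse log table created through the clickhouse-connect Python client; the host, table layout and column names are my own assumptions, not anything from the article or from LogFire.

      # Rough sketch of a hand-rolled ClickHouse log table for a small setup.
      # The host, schema and column names are assumptions for illustration.
      import clickhouse_connect

      client = clickhouse_connect.get_client(host="localhost")  # assumed local server

      client.command("""
          CREATE TABLE IF NOT EXISTS logs (
              ts      DateTime64(3),
              service LowCardinality(String),
              level   LowCardinality(String),
              message String
          )
          ENGINE = MergeTree
          PARTITION BY toDate(ts)
          ORDER BY (service, ts)
      """)

      # Grafana (or anything else speaking SQL) can then query it directly, e.g.:
      rows = client.query(
          "SELECT level, count() FROM logs "
          "WHERE ts > now() - INTERVAL 1 HOUR GROUP BY level"
      )
      print(rows.result_rows)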

    But I must admit that the plug-and-play aspect of great services like Sentry or LogFire makes them so easy to set up that it's tempting to skip self-hosting entirely. They are not that expensive (unlike Datadog), and maintaining your own observability code is not free.

    • mjarrett 4 days ago |
      What kinds of SQL queries could ClickHouse not handle? Were the limitations about expressivity of queries, performance, or something else? I'm considering using CH for storing observability (particularly tracing) data, so I'm curious about any footguns or other reasons it wouldn't be a good fit.
      • BiteCode_dev 4 days ago |
        I'm editing the transcript right now, and he says it's more about exposing a nice API to the user.

        E.g., ClickHouse's interval support, which is an important type for observability, was lacking. You couldn't subtract datetimes to get an interval. If you compared a 2-millisecond interval to a 1-second one, it wouldn't look at the unit and would say 2 ms is bigger, etc. So he had to go to the dev team, and after enough back and forth, instead of fixing it they decided to return an error, and he had to insist for a long time until they actually implemented a proper solution.
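
        To make the unit issue concrete, here's a tiny Python sketch of the comparison semantics he was apparently asking for (my own illustration, not ClickHouse code): an interval is a (value, unit) pair, and comparing two of them only makes sense after normalising to a common unit.

          # Tiny illustration of unit-aware interval comparison (my own sketch,
          # not ClickHouse code): normalise to seconds before comparing.
          UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}

          def interval_seconds(value, unit):
              return value * UNIT_SECONDS[unit]

          # Comparing only the numbers would wrongly say a 2 ms interval > a 1 s one.
          print(2 > 1)                                                 # True, but misleading
          print(interval_seconds(2, "ms") > interval_seconds(1, "s"))  # False, as expected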

        Quoting him "But like these endless issues with ClickHouse's flavor of SQL were problematic."

        Another problem seemed to be that benefiting from very big scaling, with things like data in Parquet at rest + a local cache, basically meant leaking all your money to AWS, because the self-hosted version didn't expose a way to do that yourself. ClickHouse scales fine at my size, so I can only trust him on that front since I'm nowhere near that big.

        Funnily enough, after that they moved to Timescale, and the performance didn't work for their use case.

        They landed on DataFusion after a lot of trial and error.

        But it's a really interesting perspective on the whole thing; you can see he is kinda obsessed with the user experience. The guy wrote a popular marshmallow alternative, 2 popular celery alternatives and one popular watchdog alternative, all FOSS.

        These kinds of people are the source of all the imposter syndrome in the world.

        I'll publish that video next week on Bite Code if I can. If I can't, it will have to wait 3 weeks cause I'm leaving for a bit. But Charlie Marsh's one (uv's author) is up, if you are into overachievers.

  • otoolep 4 days ago |
    There is at least one basic factual error in this blog post, which makes me discount the whole thing.

    "But if you will use it, keep in mind that [InfluxDB] uses Bolt as its data backend."

    Simply not true. The author seems to have confused the storage that Raft consensus uses for metadata with that used for the time series data. InfluxDB has its own custom data storage layer for time series data, and has had one for many years. A simple glance at the InfluxDB docs would make this clear.

    (I was once part of the core database team at InfluxDB and have edited my comment for clarity.)

    • rducksherlock 2 days ago |
      Hey! The author of the article here.

      Thanks for pointing out this part of the text.

      Please note that English is clearly not my native language. Sometimes I may structure sentences in ambiguous ways due to grammatical or other errors in my writing; I'm not sure if this is the case here.

      To avoid further misinformation, let me elaborate on what I meant to say in the paragraph you mentioned. It's quoted below for other readers' convenience.

      > But if you will use it, keep in mind that it uses Bolt as its data backend. The same Bolt used in the HashiCorp Vault integrated Raft storage and in etcd. Hence, it may need some maintenance too. More specifically, database compaction.

      As per the InfluxDB documentation, "InfluxDB uses BoltDB to store data including organization and user information, UI data, REST resources, and other key value data" (see here: https://docs.influxdata.com/influxdb/v2/reference/config-opt...). To me, that's exactly what a data backend means. If I got it right, this piece of software uses different backends for different types of data.

      Since I haven't had any experience with InfluxDB itself, it's not clear to me whether its BoltDB storage may (or may not) need maintenance. In my opinion it's just a little detail that may be helpful for some people, so I mentioned it.

      I could edit this part of the text for more clarity, but the information by itself isn't important enough to make me do that.

      > There is at least one basic factual error in this blog post, which makes me discount the whole thing.

      Everyone makes mistakes; that's why all of us filter and interpret information based on the experience we've accumulated through years of living on this planet. Such a radical change in your perception of a large piece of writing because of one small mistake isn't always necessary.

      Again, thanks for being interested in making corrections! Have a great rest of your day!

  • j12a 4 days ago |
    Interesting read. I was comparing some of these tools earlier for small web shop use, though I haven't proceeded to set any of them up just yet. Demoed Elastic, SigNoz and Grafana Loki, of which Alloy+Loki seemed to make the most sense for my needs and didn't cause too much headache to set up on a tiny VM, just so I'd have collection going in the first place and a decent way to grep through it.

    Currently collecting just exception data from services into GlitchTip (a Sentry fork), which seems the most valuable sysadmin-wise, while most security etc. concerns are outsourced to managed hosting companies.

    Was left curious what it would take to DIY the anomaly detection methods Elastic has built in <https://www.elastic.co/guide/en/machine-learning/current/ml-...> with data frame / statistics / ML libraries (Clojure Noj).