If I want to do a bulk copy (say, nightly) with various transformations but not continually stream after, is that supported / a use case that’d be a good fit for your tool?
[1] https://github.com/shayonj/pg_flo?tab=readme-ov-file#streami...
There’s a setting that controls whether it will do a snapshot first. Turn it off and it will just start sending through new CDC entries.
> you must set the kafka retention time to infinity
Is this a new requirement? I’ve never had to do this.
Maybe a benefit, actually. Do you think we could use pg_flo with Postgres-as-a-service instances like Azure Postgres, Supabase, Neon, etc.? Since it just reads the WAL, there’d be no need to install an extension the vendor hasn’t approved.
Can you elaborate on failure modes? What happens if, e.g., the NATS server (or a worker/replicator node) dies?
In principle, how hard is it to move data from Postgres not to another PG but to, e.g., ElasticSearch/ClickHouse?
There are some plans to introduce metrics/instrumentation to stay on top of these things and make it more production-ready.
The replicator and worker can pause/resume the stream when shutting down, say during a deploy.
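Conceptually, the shutdown path is something like the sketch below. This is not the actual pg_flo code; `Stream`, `Pause`, and `Resume` are illustrative names:

```go
package lifecycle

import (
	"context"
	"log"
	"os/signal"
	"syscall"
)

// Stream is an illustrative stand-in for the replicator/worker's stream
// handle; the real implementation differs.
type Stream interface {
	Pause(ctx context.Context) error  // flush in-flight batches, stop pulling WAL
	Resume(ctx context.Context) error // pick up from the last confirmed LSN
}

// run blocks until a shutdown signal (e.g. from a rolling deploy) and
// pauses the stream cleanly instead of dropping in-flight changes.
func run(ctx context.Context, s Stream) error {
	ctx, stop := signal.NotifyContext(ctx, syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	<-ctx.Done() // wait for SIGTERM/SIGINT

	// Use a fresh context: the signal context is already cancelled.
	if err := s.Pause(context.Background()); err != nil {
		return err
	}
	log.Println("stream paused; safe to terminate")
	return nil
}
```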
re: moving data, I haven't attempted it, but I have seen PeerDB mentioned for moving data to ClickHouse and it seems quite nice.
I am starting w/ PostgreSQL <> PostgreSQL to mostly get the fundamentals right first. Would love to hear any use cases you have in mind.
1. Syncing data in Postgres to ElasticSearch/ClickHouse (which handle search/analytics on the data I store in PG)
2. Invoking my own workflow engine: I have built a system that allows end-users to define triggers that start workflows when data changes or is created. To determine whether I need to start a workflow, I need to inspect every CRUD operation and check it against the triggers users have defined.
I'm currently doing that in a duct-tape kind of way by publishing to SNS from my controllers and having SQS subscribers (mapped to Lambdas) that are responsible for different parts of my "pipeline". I don't like this system as it's not fault-tolerant; I'd prefer to do this by processing the WAL and explicitly acknowledging processed changes (rough sketch of what I mean below).
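For illustration, roughly what I'd want the consumer side to look like, given that pg_flo publishes to NATS JetStream. This is a sketch, not pg_flo's actual API: the subject name, payload shape, and evaluateTriggers are all made up.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Durable consumer with manual acks: a change is only marked
	// processed once every user-defined trigger has been evaluated.
	_, err = js.Subscribe("pgflo.changes.>", func(msg *nats.Msg) {
		if err := evaluateTriggers(msg.Data); err != nil {
			// No ack: JetStream redelivers, so the change isn't lost.
			log.Printf("trigger evaluation failed, will retry: %v", err)
			return
		}
		msg.Ack() // explicit acknowledgement after successful processing
	}, nats.Durable("workflow-engine"), nats.ManualAck())
	if err != nil {
		log.Fatal(err)
	}

	select {} // block forever
}

// evaluateTriggers is a placeholder for matching a CDC event against
// user-defined workflow triggers.
func evaluateTriggers(event []byte) error { return nil }
```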
re: 2. I have some plans around a control plane that lets users define these transformation rules and routing config and then take further actions based on the outcomes. If you are interested, feel free to sign up on the homepage. Also very happy to have some quick chats (shayonj at gmail). Thanks!
Are you OK with a NATS dependency? Happy to work with you on supporting a new destination like ES.
Also looking to make NATS optional for smaller/simpler setups (https://github.com/shayonj/pg_flo/issues/21)
Yes, it does use NATS JetStream.
At a high level, CDC is a complex state machine. Temporal helps build that state machine, taking care of auto-retries/idempotency at different failure points, and also aids in managing and observing it. This is very useful for identifying root causes when issues arise.
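As a minimal sketch of the pattern (Temporal's Go SDK, but not PeerDB's actual workflow code; the activity names are illustrative):

```go
package cdc

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// SyncFlowWorkflow sketches the CDC state machine: each step is an
// activity that Temporal retries and records independently, so a crash
// mid-sync resumes from the last completed step instead of starting over.
func SyncFlowWorkflow(ctx workflow.Context, flowName string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Minute,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
			MaximumAttempts:    10,
		},
	})

	// Illustrative activities: pull a batch of WAL changes, apply it to
	// the destination, then advance the confirmed LSN (idempotent steps).
	var batchID string
	if err := workflow.ExecuteActivity(ctx, "PullChanges", flowName).Get(ctx, &batchID); err != nil {
		return err
	}
	if err := workflow.ExecuteActivity(ctx, "ApplyBatch", batchID).Get(ctx, nil); err != nil {
		return err
	}
	return workflow.ExecuteActivity(ctx, "AdvanceLSN", batchID).Get(ctx, nil)
}
```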
Managing Temporal shouldn’t be complex. They offer a well-maintained, mature Docker container. From a user standpoint, the software is intuitive and easy to understand. We package the Temporal Docker container in our own Docker setup and have it integrated into our Helm charts. We have quite a few users smoothly running Enterprise (which we open-sourced recently) and the standard OSS!
https://github.com/PeerDB-io/peerdb/blob/main/docker-compose...
https://github.com/PeerDB-io/peerdb-enterprise
Let me know if there are any questions!
You can append a `template` argument when running the script files, which will generate a set of values files that you can then modify accordingly. There is a production guide for this, as we have customers running the provided charts in production with GitOps (ArgoCD) (https://github.com/PeerDB-io/peerdb-enterprise/blob/main/PRO...)
Given the acquisition by ClickHouse (congrats!), what can we expect for CDC to sinks other than CH? Do you plan to continue supporting different targets, or should we expect a CH-only focus?
Edit: also, any plans for supporting e.g. SNS/SQS/NATS or similar?
To elaborate more on the architecture (https://docs.peerdb.io/architecture), the flow-worker does most of the heavy lifting (actual data movement during initial load and CDC), while the other components are fairly lightweight. Allocating around 70-80% of provisioned resources to the flow-worker is a good estimate. For Temporal, you could allocate 20-25% of resources and distribute the rest to other components.
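As a rough worked example: on a 16 vCPU / 64 GB node, that comes out to about 12 vCPUs and ~48 GB for the flow-worker, 3-4 vCPUs and ~13-16 GB for Temporal, and whatever remains for the lighter components.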
Our open-source offering (https://github.com/PeerDB-io/peerdb) supports multiple connectors (CH and non-CH). Currently, there aren’t any plans to make changes on that front!
Question: Can this handle network disconnection/instability mid-copy?
Some more guardrails around network disruption and retries are on the way too.
With pg_flo, I want the tool to be fairly hands-off while staying declarative for data filtering, transformation, and re-routing tasks, starting with PostgreSQL as the destination for now. The less time you have to spend cleaning, moving, and syncing your data, the better. Also quite curious to see what other use cases folks find it useful for.
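For a rough idea of the rule shape I'm going for, something like the following (illustrative only, not a final schema):

```go
package rules

// Rule sketches the declarative shape I have in mind: each entry attaches
// a filter/transform/route action to a table's CDC stream.
// (Illustrative only; not pg_flo's final schema.)
type Rule struct {
	Type       string            `yaml:"type"`                 // "filter", "transform", or "route"
	Table      string            `yaml:"table"`                // source table the rule applies to
	Column     string            `yaml:"column,omitempty"`     // column the rule operates on
	Parameters map[string]string `yaml:"parameters,omitempty"` // e.g. operation: mask
}
```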
Replay/reusability has been on my mind, and I have an old PR open for it. Will give it a shot soon.
Open to any suggestions and feedback.
The work/effort I’d need to put into Kafka etc. with Debezium is a short-term effort, but I’m weighing the hassle.
Another solution being evaluated by my team is PeerDB - https://docs.peerdb.io/why-peerdb
I’m encountering a challenge due to two types of delete operations in the main database. The first delete operation is for data pruning, and I don’t want these deletions to be reflected in the replicated database. The second delete operation is for rollback purposes, which alters the data state, and these deletions should indeed be replicated to maintain consistency.
Is there a way to distinguish between these two delete operations so that only the rollback deletions are replicated? And would your tool support this kind of use case?
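To illustrate what I mean, here's the kind of filtering hook I'm imagining. This is purely hypothetical and assumes our pruning path tags rows (via a hypothetical delete_reason column) in the same transaction before deleting them:

```go
package filter

// ChangeEvent is an illustrative CDC event shape: for deletes, OldValues
// holds the row as it looked before deletion.
type ChangeEvent struct {
	Operation string // "insert", "update", "delete"
	Table     string
	OldValues map[string]string
}

// shouldReplicate drops pruning deletes and keeps rollback deletes,
// assuming the application sets the (hypothetical) delete_reason column
// in the same transaction before deleting the row.
func shouldReplicate(ev ChangeEvent) bool {
	if ev.Operation != "delete" {
		return true
	}
	return ev.OldValues["delete_reason"] == "rollback"
}
```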
If it's easier, happy to also chat more on an issue in the repo or via email (shayonj at gmail).
Looks like they are pivoting to LLM integration.
pg_flo can perform common DDLs for you (not something that's built into logical replication today, hopefully soon) [2]
[1] https://www.postgresql.org/docs/current/logical-replication-...
[2] https://github.com/shayonj/pg_flo/blob/0e8b6b9ca1caf768b71d3...