Databases in 2024: A Year in Review

619 points by avinassh 9 days ago | 209 comments

mihirrd 8 days ago |
Quite informative
badindentation 8 days ago |
The section on Larry Ellison is amusing.
epolanski 8 days ago |
I couldn't understand if it was satire or something.
Do they believe the guy marries a 30-year-old because she loves him?
In any case, who cares, how was that relevant..
masklinn 8 days ago |
It's definitely satire, how would you take this sentence as serious:
> I told Larry this especially means a lot to me because my former #1 ranked Ph.D. student is now a professor in Michigan's Computer Science department with their famous Database Group.
CRConrad 8 days ago |
> I couldn't understand if it was satire or something.
The quotes from last year's bit on Ellison in sibling comment https://news.ycombinator.com/item?id=42567484 might help you make up your mind.
leeoniya 8 days ago |
"Do not fall into the trap of anthropomorphising Larry Ellison."
https://m.youtube.com/watch?t=33m1s&v=-zRN7XLCRhc&feature=yo...
avinassh 8 days ago |
Larry always makes appearances in his reviews, lol
> But the real big news in 2023 was how Elon Musk personally helped reset Larry's Twitter password after he invested $1b in Musk's takeover of the social media company. And with this $1b password reset, we were graced in October 2023 with Larry's second-ever tweet and his first new one in over a decade.
https://www.cs.cmu.edu/~pavlo/blog/2024/01/2023-databases-re...
> These journalists made it sound like Larry was doing something nefarious or indecent, like the time he made his pregnant third wife sign a prenup two hours before their wedding. I can assure you that Larry was only trying to use his vast wealth as the 7th richest person in the world to help his country. His participation in this call is admirable and should be lauded. Free and fair elections are not a trivial affair, like a boat race where sometimes shenanigans are okay as long as you win. Larry has done other great things with his money that are overlooked, like spending $370m on anti-aging research so that he can live forever
https://www.cs.cmu.edu/~pavlo/blog/2022/12/2022-databases-re...
samanthasu 8 days ago |
would love to see Andy's take on GreptimeDB https://github.com/GreptimeTeam/greptimedb
leeoniya 8 days ago |
made same comment recently in https://news.ycombinator.com/item?id=42330055#42331927
samanthasu 8 days ago |
looking forward to the bonus content!
DennisZ_89 8 days ago |
Glad to see people start trying on GreptimeDB. We're committed to building a fully open-source and cost-effective, unified time series database for metrics, logs, and events. To this point, it may still be too small to be included in the year database summary. But hope we'll grow fast in 2025 :)
rozenmd 8 days ago |
I loved ottertune, it's a shame it died the way it did.
memhole 8 days ago |
Love the style! CMU making databases cool. Sorry to hear about OtterTune.
Beefin 8 days ago |
TL;DR SQL is king
m_ke 8 days ago |
Andy is a treasure, if only we had more professors like him
antirez 8 days ago |
Wow, the reasons why Redis commands API suck in Andy's video (linked in the post) are the weakest ever. It is possible to make a case against the Redis API (I would not agree of course but... it's totally legitimate), but you gotta have stronger arguments than those, particularly if you are a teacher of some kind. Especially: you need to be somewhat fluent in Redis and how developers use Redis in order to understand why so many people like it, and then explain what is wrong with it (if you believe there is something wrong). The video shows a general feeling of "I don't really use / know this, but I don't like how NON-SQL it is".
nojito 8 days ago |
SQL is king, and history has shown that non-SQL languages are not good, which causes many non-SQL DBMSs to adopt SQL eventually.
antirez 8 days ago |
Many non-SQL DBs had query languages that were broken Javascript-ish versions of SQL. Of course, this is wrong, and people will eventually adopt SQL instead. But if your data model isn't anything like relational DBs, non-SQL makes a ton of sense. OP seems to miss exactly this, that the Redis query language is shaped on the Redis data model, that is basically alien to the relational model.
The idea behind the Redis data model is that "describe the data" and then "query the data in arbitrary ways" is conceptually nice but in practice does not model many use cases very well. SQL databases plagued tech with performance issues for decades because of that. So Redis says instead: you need to store your data while thinking about fundamental things like data structures, access times, and the way you'll need to get the data back. And the API reflects this.
You don't have to automatically agree with that. But you have to understand that, then provide your "I'm against" arguments. Especially if you are in front of young people listening to you.
im_down_w_otp 8 days ago |
Agreed. Many noSQL-boom-era databases eventually bolted on a SQL-esque layer, but that was also because they were mostly also all targeting "enterprise database" use cases and customers who both expected that and whose use cases largely fit with it. So, there was a lot of pressure to conform to norms when the advantage of not doing so wasn't immediately self-evident.
We have a database [1] and query language [2] that's tailored to storing & querying trace/telemetry data produced by different layers and components of cyber-physical systems for systems engineers to analyze, verify, and validate what a complex system is doing. It's not quite a traditional relational problem. It's not quite a traditional time series problem. It's not quite a traditional graph problem.
Addressing the way that systems engineers think about their domain in an effective way required coming up with something different. Are there caveats and rough edges? Sure. But, they're a lot less pernicious and onerous than the alternative of trying to leverage a bunch of ill-fitting menageries of different solutions.
Redis is fit-for-purpose. So, it makes sense that its query interface would also express that.
[1] https://docs.auxon.io/modality/
[2] https://docs.auxon.io/speqtr/
nojito 8 days ago |
>But if your data model isn't anything like relational DBs, non-SQL makes a ton of sense. OP seems to miss exactly this, that the Redis query language is shaped on the Redis data model, that is basically alien to the relational model.
Sure...but all roads lead back to SQL eventually. Another recent example also mentioned in the OP is BigTable adopting SQL.
threeseed 8 days ago |
> but all roads lead back to SQL eventually
No, they don't. SQL is designed for relational databases.
Other data models, e.g. JSON, graph, and key/value, all use other query languages.
physicles 8 days ago |
If you write code that uses a hash map, would you insist on using SQL to query it? This makes no sense.
anovick 8 days ago |
SQL (and RDBMS in general) has its limitations, particularly with regards to recursive operations.
An extended Datalog[1] can provide performance optimizations not available to RDBMS.
[1]: https://dl.acm.org/doi/10.1145/3639271
KronisLV 8 days ago |
For what it's worth, SQL kind of sucks. It's just the de facto choice because it's extremely widespread and good enough for 80% of the use cases out there and what's missing can be kludged on top of it, either by specific DB vendors, or by various extensions.
It's not too hard to come up with alternatives that improve upon individual aspects of SQL like https://prql-lang.org/ but the barrier of entry is about as high as trying to make a huge social media network, most attempts will remain niche.
Then again, most software kind of sucks, it's just that some of it also works. For example, the Linux FHS reads like an overcomplicated structure that is the way it is for historical reasons, but works in practice.
fforflo 8 days ago |
I've been working on something Redis-y over the holidays, and it has reinforced my view that it's the epitome of a 20%- 80% tool. I've always used the 20%, but anything beyond that sounds useless unless you've encountered the requirement in a production environment. The challenges Redis has been solving for years, never really touched the research/academic community (even the 20%).
Even in the various taxonomies of DBS in the research literature, Redis was mentioned with a wave of the hand as an "in-memory" database, which undersells the important (for me) part of the "data structure" server.
Putting the "database" after Redis could be a marketing misstep. Because it puts you in the is-it-sql territory.
TL;DR: Redis is mostly appreciated by practitioners (web) developers. Academics find it lacking a theoretical foundation, so... meh.
Tanjreeve 8 days ago |
Developers know its limits. Or you have developers with vague "scaling issues" or "buggy caching" who don't understand why they have them or suddenly start suffering from them at inconvenient moments.
nicoritschel 8 days ago |
With all due respect, the linked video was pretty fair. It didn't imply not to use Redis, just not as a primary datastore.
I don't think folks work with Redis out of fondness for the model, but because it's the least worst datastore for caching, lightweight message broker, and simple realtime things like counters.
antirez 8 days ago |
Talking about the broken API argument here. Also Redis is particularly useful exactly in other situations compared to what OP says. Leaderboards style use cases with sorted sets are killer applications (super hard to model with SQL) of the data structure server thing. Apparently OP does not understand this and says "simple GET/SET" is what you should use Redis for.
Redis has probabilistic data structures, the ability to implement complex queueing patterns, and so forth. That's where the value is. Otherwise we would still be just with Memcached without caring about Redis. Another killer app was Twitter initial use case (then they used it for pretty much everything): to cache latest N Tweets, using capped lists. I could continue forever.
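The capped-list use case above (cache only the latest N items) can be sketched without a Redis server. This is a minimal in-process Python analogy, not Redis itself; in Redis the same effect is commonly obtained with LPUSH followed by LTRIM. The `push_tweet` helper, the tweet ids, and the cap of 3 are made up for illustration.

```python
from collections import deque

# Hypothetical cap: keep only the latest 3 items per timeline.
MAX_ITEMS = 3

# A deque with maxlen drops items from the opposite end once it is full,
# which mimics pushing at the head and then trimming the tail.
timeline = deque(maxlen=MAX_ITEMS)

def push_tweet(tweet_id):
    """Add the newest tweet at the head of the timeline."""
    timeline.appendleft(tweet_id)

for tweet_id in ["t1", "t2", "t3", "t4", "t5"]:
    push_tweet(tweet_id)

latest = list(timeline)  # newest first
print(latest)
```

The point of the pattern is that the write path enforces the cap, so a read of the "latest N" never has to scan or sort anything.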
So OP's argument is flawed IMHO, for the reasons above, and not fair. When you talk to students you need to do your homework: really understand the system you are talking about and provide a realistic picture of it. Then, yes, if you want, criticize it as much as you want, with grounded arguments.
You know what? I re-read this comment and it's embarrassing that I even have to write this, because after 15 years of Redis history at such scale and popularity, pretty much everybody who was seriously exposed to Redis knows this stuff. Is tech culture really degraded so much that we have to restate the obvious? Do I really need to explain that GET/SET is not exactly where Redis shines, after 15 years of half the Internet using all kinds of Redis patterns?
memhole 8 days ago |
What are your thoughts about Rails switching to SQLite from Redis? I've only used Redis to store session data and cache app data. So my opinions are pretty limited and mostly positive.
https://rubyonrails.org/2024/11/7/rails-8-no-paas-required
antirez 8 days ago |
My feeling is that for their use case, it makes sense to have something vertical that just covers the needs of Rails. AFAIK SQLite has a RAM backend, so you are still not going to hit disk. Seems like a good idea to reduce system complexity, to me.
sureglymop 8 days ago |
Maybe this is a weird question but, knowing only some math and not redis, what is a sorted set and how is it different than a list/tuple?
antirez 8 days ago |
Sorted sets are abstract data structures where you insert elements into a set, but every element is associated with a floating point score. Elements are kept ordered by score inside the sorted set, so you can ask for ranges, or a specific element's rank (position), and so forth. It sounds like the (many) cases where Redis is the best idea to get started and deliver (see for instance the Instagram case, that used Redis for years while becoming bigger and bigger). Then as you understand you are at scale and need just XYZ, you may choose to implement XYZ inside your system in other ways and that's it.
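To make the description above concrete, here is a toy sorted set in plain Python: unique members, each with a float score, kept in score order so that rank and range queries are cheap to express. It is only a sketch of the abstract data type; the class and method names are invented here, and a plain sorted list is used for brevity (inserts are O(n)), which is an assumption of this example and not a claim about how Redis stores them.

```python
import bisect

class SortedSet:
    """Toy sorted set: unique members, each with a float score."""

    def __init__(self):
        self._scores = {}  # member -> score
        self._order = []   # list of (score, member), always kept sorted

    def add(self, member, score):
        # Re-adding a member updates its score instead of duplicating it.
        if member in self._scores:
            self._order.remove((self._scores[member], member))
        self._scores[member] = float(score)
        bisect.insort(self._order, (float(score), member))

    def rank(self, member):
        """0-based position of the member in ascending score order."""
        return bisect.bisect_left(self._order, (self._scores[member], member))

    def range(self, start, stop):
        """Members between two non-negative ranks, inclusive."""
        return [m for _, m in self._order[start:stop + 1]]

    def range_by_score(self, lo, hi):
        """Members whose score lies in [lo, hi]."""
        return [m for s, m in self._order if lo <= s <= hi]

board = SortedSet()
board.add("alice", 120)
board.add("bob", 95)
board.add("carol", 150)
board.add("bob", 130)  # bob's score is updated, not duplicated

print(board.rank("bob"))
print(board.range(0, 1))
print(board.range_by_score(125, 200))
```

Compared with a list or tuple, the difference is that membership is unique (like a set) while position is derived from the score, not from insertion order.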
sureglymop 7 days ago |
Thank you for explaining it! I appreciate that.
nicoritschel 8 days ago |
I am grateful for Redis and I agree you pioneered a lot of data access patterns in production for a lot of people, myself included. I've used Redis for 10 years, at times for use cases as you mention, for real time feature engineering for ML as well.
The API is just different compared to SQL, which is a downside for many. There's modern advancements in the space with IVM and more databases are supporting probabilistic data structures.
daneel_w 8 days ago |
> Is tech culture really degraded so much that we have to restate the obvious?
Maybe, though the author of the article is known to be a little bit too opinionated, and unfortunately habitual with phrasing himself in a bombastic manner. The piece reads like a dramatic recap of the past year's sporting events, littered with irrelevant and disconnected references to lyrics and drama in the world of rap and hip hop. A "quirky and fun" journalistic abortion.
somat 8 days ago |
Now I have never used redis in this capacity but these sorted sets (a data structure that maintains its sort order as data is entered, I assume): how is that different from "create index on player_score (score)"? That is, an index on the score column, which will create a data structure that maintains its sorted nature as data is entered.
My naive view is that you create a sorted set every time you define an index. that is, the opposite of "super hard to model with SQL"
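For comparison, the indexed-table version of a leaderboard can be sketched with Python's built-in sqlite3 module. The `player_score` table and `score` column come from the comment above; the index name, sample rows, and queries are assumptions made for illustration, and the sketch does not try to settle which approach is easier or faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player_score (player TEXT PRIMARY KEY, score REAL)")
conn.execute("CREATE INDEX idx_score ON player_score (score)")
conn.executemany(
    "INSERT INTO player_score VALUES (?, ?)",
    [("alice", 120), ("bob", 130), ("carol", 150)],
)

# Top-N by score: an ordered read that the index on score can serve.
top_two = [row[0] for row in conn.execute(
    "SELECT player FROM player_score ORDER BY score DESC LIMIT 2")]

# Rank of one player: count how many players have a higher score.
bob_rank = conn.execute(
    "SELECT COUNT(*) FROM player_score "
    "WHERE score > (SELECT score FROM player_score WHERE player = ?)",
    ("bob",),
).fetchone()[0]

print(top_two, bob_rank)
```

Top-N falls out of the index directly; the rank query is where the two models differ most, since here it is phrased as a count over the rows above the player.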
mistrial9 8 days ago |
maybe it was a reference to two or three or N level deep nested tree structures. Those don't map as naturally using simple SQL relations.
to11mtm 8 days ago |
> I re-read this comment and it's embarrassing that I even have to write this, because after 15 years of Redis history at such scale and popularity, pretty much everybody who was seriously exposed to Redis knows this stuff. Is tech culture really degraded so much that we have to restate the obvious?
I know you wrote OG and did a lot of Redis but...
Yes, Tech culture is that fucked.
In my past job I was hired semi-specifically to deal with a concern, namely that their use case 'fit well with Kafka' but the latency for their case sucked at least as much as the API pain. (Yes I did something better. No, IDK if folks will ever see it).
Now?
Now I spend my days trying to 'paper-over' patterns that drive me to insanity just trying to make it work from a 'people need to learn why things work on a starship' level [0].
On a real level you didn't fail, Redis has lots of great patterns. On a -practical- level it's a shitshow because you now have lots of folks 'glue-gunning' the Redis API on use cases that probably need tweaking or aren't the right fit at all, alas they all worked off the same example on GH/SO/etc and then did their own "this wasn't even the right way to do this so I'm adding glue, what could possibly go wrong" case.
(That said, Nats has decent stuff for this in form of KV CompareExchange style APIs, and I see the inspiration there, so that's something to feel good about.)
[0] - Namely, if anyone has a good prompt for taking a photo of someone and doing img2img of 'Person in astronaut uniform preaching from an ivory tower', that would be a plus
cloverich 8 days ago |
Redis is stable, powerful, widely supported, and has been running strong... over a decade now? I've never heard it recommended as a primary datastore... why would someone do that? I've seen it used at scale for numerous businesses now and it's caused problems exactly never. People understand how to use it because it's relatively simple and provides the first things you need beyond the database. Do people complain about redis commonly? News to me.
nicoritschel 8 days ago |
Adtech/ML
tayo42 8 days ago |
> Andy's video (linked in the post)
Is there a "too long; didn't watch" summary anyone knows of? I hate videos, but am curious lol
jghn 8 days ago |
Right there with you. The trend towards video content instead of written sucks so much.
samanthasu 8 days ago |
same same
nodamage 8 days ago |
As far as I can tell the two main criticisms in the video are that:
1. The Redis API requires the developer to use different commands to retrieve/manipulate data depending on the type of data being stored. To retrieve a string you use GET, but if you want to retrieve a list it's LRANGE, for a set it's SMEMBERS, for a hash it's HGETALL. (As opposed to an API design which would allow you to call GET on all of the different data types and have it return the right thing.)
2. The lack of a predefined schema means you can overwrite values with different types. So you can create a list named "foo" and then overwrite it with a string named "foo" and then overwrite that with a hash named "foo" and Redis will happily do it, meaning the developer needs to keep track on their end of what type any given key actually holds.
To me these criticisms come across as essentially saying "Redis doesn't behave like a RDBMS" to which I suppose antirez's point is "well, yeah, it's not supposed to".
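Both criticisms can be illustrated with a small in-memory model of the behavior described: type-specific read commands that reject the wrong key type, and a SET that replaces whatever was stored before. This is a toy Python class written for this summary (the names `ToyStore` and `WrongType` are invented, and only `-1` or non-negative stop indexes are handled); it is not a Redis client and makes no claim about Redis internals.

```python
class WrongType(Exception):
    """Raised when a command is used against a key of another type."""

class ToyStore:
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # SET replaces whatever was stored, regardless of its previous type.
        self._data[key] = str(value)

    def rpush(self, key, *values):
        current = self._data.setdefault(key, [])
        if not isinstance(current, list):
            raise WrongType(key)
        current.extend(str(v) for v in values)

    def get(self, key):
        value = self._data.get(key)
        if value is not None and not isinstance(value, str):
            raise WrongType(key)
        return value

    def lrange(self, key, start, stop):
        value = self._data.get(key, [])
        if not isinstance(value, list):
            raise WrongType(key)
        end = len(value) if stop == -1 else stop + 1
        return value[start:end]

store = ToyStore()
store.rpush("foo", "a", "b")
items = store.lrange("foo", 0, -1)

try:
    store.get("foo")            # wrong command for a list
    get_on_list_failed = False
except WrongType:
    get_on_list_failed = True

store.set("foo", "bar")         # silently replaces the list with a string
value_after_overwrite = store.get("foo")
print(items, get_on_list_failed, value_after_overwrite)
```

The first half shows criticism 1 (each type has its own verb, and the wrong verb is an error); the last two lines show criticism 2 (nothing stops the key from changing type).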
tayo42 7 days ago |
thanks for saving me some time!
apavlo 8 days ago |
> Wow, the reasons why Redis commands API suck in Andy's video (linked in the post) are the weakest ever.
In my example, the API on a key changes based on its value type. And the same collection can have different value types mixed together. You've recreated the worst parts of IBM IMS from the 1960s. However, the original version of IMS only changed the API when a collection's backing data structure changed. Redis can change it on every key!
We didn't get into the semantics of Redis' MULTI...EXEC, which the documentation mischaracterizes as "Transactions". I'm happy that at least you didn't use BEGIN...COMMIT.
antirez 8 days ago |
You totally miss that Redis is more like a remote interpreter with a DSL that manipulates data structures stored at global variables (keys): you (hopefully) would never complain about languages having this semantics.
I don't think you understood how Redis collections work. The items are just strings; you can't mix, say, integers and strings together, nor can collections be nested.
The Redis commands do type checking to ensure the application is performing the right operation.
In your example, GET against a list, does not make sense because:
1. GET is the retrieve-the-key-of-string-type operation.
2. Having GET do something like LRANGE 0 -1 would have many side effects: fetching a huge list by mistake and returning a huge data set without any reason, creating latency issues. GET would also need options to provide ranges (the horror story of SQL-like query languages). And so forth.
So each "verb" should do a specific action in a given data type. Are you absolutely sure you were exposed enough to the Redis API, how it works, and so forth?
About MULTI/EXEC, when AOF with fsync configured correctly is used, MULTI/EXEC provides some of the transactional guarantees you think of when you hear "transaction", but in general the concept refers to the fact that commands inside MULTI/EXEC have an atomic effect from the point of view of an external observer AND point-in-time RDB files (and AOF as well). MULTI / INCR a / INCR a / EXEC will always result in the observer seeing either 2, 4, 6, 8, and so forth, and never 3 or 5.
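The observer guarantee described above can be mimicked in plain Python. In this sketch a lock stands in for the property that the queued commands run without other clients' commands interleaving; that stand-in, the function names, and the iteration count are all assumptions of the example, and it says nothing about durability, rollback, or what happens when a queued command fails.

```python
import threading

lock = threading.Lock()  # models "no other command runs in between"
counter = {"a": 0}
observed = []

def multi_exec_incr_twice():
    # Both increments are applied as one unit from a reader's point of view.
    with lock:
        counter["a"] += 1
        counter["a"] += 1

def writer(n):
    for _ in range(n):
        multi_exec_incr_twice()

def observer(stop):
    # Reads go through the same serialization point as the writes.
    while not stop.is_set():
        with lock:
            observed.append(counter["a"])

stop = threading.Event()
w = threading.Thread(target=writer, args=(1000,))
o = threading.Thread(target=observer, args=(stop,))
o.start()
w.start()
w.join()
stop.set()
o.join()

all_even = all(v % 2 == 0 for v in observed)
print(all_even, counter["a"])
```

Every value the observer records is even, which is the "never 3 or 5" property; the sketch deliberately stops there and does not model the parts of the transaction debate that concern failure handling.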
Anyway, I believe you didn't put enough effort into understanding how Redis really works. Yet you criticized it with weak arguments in front of the most precious good we have: students. This is the sole reason why I wrote my first comment; I believe this to be a wrong teaching approach.
zzzeek 8 days ago |
> You totally miss that Redis is more like a remote interpreter with a DSL that manipulates data structures stored at global variables (keys):
I think he makes the point that these "global variables" are dynamically typed; you can have "listX" and then write a non-list into that same name; statically typed systems would not allow this. He makes the fairly non-controversial point that a statically typed system (SQL, other than that of SQLite) adds a level of type safety that can guard against software bugs.
thayne 8 days ago |
> you can have "listX" and then write a non-list into that same name; statically typed systems would not allow this
Well, that depends. In most SQL databases there are many cases where supplying the wrong type of value will implicitly convert to the expected type, often in unexpected ways that can result in subtle bugs.
zzzeek 8 days ago |
PostgreSQL is very very good about really never doing this, and also a scalar vs. list is pretty much a PostgreSQL case since most other relational DBs don't have a native ARRAY type. I think you're mostly thinking of MySQL that has some int/string coercion cases which are to be clear bad, but not as egregious as "any arbitrary type goes right in with no checking whatsoever".
as mentioned, SQLite breaks all these rules and I think SQLite is very wrong on this.
jsnell 8 days ago |
> 1. GET is the retrieve-the-key-of-string-type operation.
That's a tautological argument. The question isn't what the definition of GET is, but whether the design is good.
> 2. Having GET do something like LRANGE 0 -1 would have many side effects: fetching a huge list by mistake and returning a huge data set without any reason, creating latency issues.
If this really were the reason, you'd have separate operations for tiny strings and huge strings. After all, by analogy having GET return a huge string "without any reason" would create latency issues.
But that's not how Redis works, right?
antirez 8 days ago |
The examples I made are just a subset of the protection that this provides. Similarly you can't LRANGE a set type, and so forth. So this in general makes certain errors evident ASAP (command mismatch with the key type).
This does not mean that Redis would not work with generic LEN, INSERT, RANGE commands. But such commands would also end up having type-specific options, which I have the feeling is not very clean. Anyway these are design tastes, but I don't think they dramatically change what Redis is or isn't. The interesting part is the data model, the idea of commands operating on abstract data structures, the memory-disk duality, and so forth. If one wants to analyze Redis, and understand merits and issues, a serious analysis should hardly focus on these kinds of small choices.
josephg 8 days ago |
Eh. What people are really arguing about here is redis’s type system. Redis’s approach has some pros and some cons. I think dismissing redis’s approach out of hand for its choices is too simple a treatment.
Most sql databases (like Postgres) require all types to be declared once, and then they do type checking on mutation. In that sense, sql is like a static language like C. But weirdly, the results returned from a sql query are always dynamically typed values, expressed in a table. Applications reading data from sql will still typically need to know what kind of data they expect back - but they usually do that type checking at runtime.
Redis flips both of those choices. It’s dynamically typed - so it won’t check your mutations. But also, you don’t need schema migrations and all the complexity they bring. And rather than having a single “table” type, redis queries can return scalar values, lists or maps. What kind of return value you get back depends on the query function. (Eg GET vs LRANGE).
If you think of a database as the foundation underneath your house, static typing & type checking is a wonderful way to make that foundation more stable. There’s a reason Postgres is the default, after all. But redis isn’t best used like that. Instead, it’s a Swiss Army knife which is best used in small, specific situations in which a real database would be complex overkill. Stuff like caching, event processing, data processing, task queues, configuration, and on and on. Places where you want some of the advantages of a database (fast, concurrent network-accessible storage) but you don’t want to stress about tables and schema migrations.
If you really hate redis, maybe say the same thing I say about Java when I teach it to my students. “I hate this, and I’ll tell you why. But there are smart people out there who disagree with me.”
If you ask me, I wish sql looked more like redis in some ways. I think it’s quite awkward that every sql query returns exactly one “table”. I’d much rather if queries could return scalar values or even multiple tables, depending on your query.
CRConrad 8 days ago |
> I’d much rather if queries could return scalar values
Since when can't they?
josephg 8 days ago |
I mean, they can - but they’re always wrapped up as pseudo-tables.
Not everything is best described as a table, y’know?
CRConrad 7 days ago |
> they’re always wrapped up as pseudo-tables.
Are they??? Not as I understand it.
josephg 6 days ago |
If you call “select 1;”, you get back a table with 1 row and 1 column.
CRConrad 5 days ago |
That's just because SQL clients present their results that way, AFAICS. If you use a sub-query like in, say,
select * from orders where custno = (select custno from customers where name = 'John Doe');
you'll get the same result as if you'd put that scalar in your query, like
select * from orders where custno = 123456; -- John Doe's customer number
Or maybe you're right, that to SQL databases scalar values are single-row single-column tables. But so what? In mathematics, isn't any number also the single-member set of numbers that contains only that number? Where's the harm in that? (And, hey, RDBMSes are founded on set theory...)
So I don't really see what the big problem is either way. Hoping I'm not being stupid AF, maybe you could explain further?
josephg 4 days ago |
Mathematically, they're equivalently expressive. But returning everything as a table has bad ergonomics.
Imagine the programming language equivalent. We could make a programming language where every function call returns a table. If you expect 1 return value from your function, the caller grabs the first row out of the return array, and the first column out of that row. It would absolutely work, and it's mathematically equivalent in some sense. But it would be confusing, computationally inefficient and error prone. What happens if there's more than 1 row in the table? Or more than 1 column? What happens if the type of the columns doesn't match what you expect? What happens if the table is empty? Or you want a function which returns two lists instead of one? We could write that programming language. But it would be pretty weird and frustrating to use.
This is the situation today with SQL. Every query returns a dynamically typed table. It's up to the caller to parse that table.
With redis, the caller expresses to the database what kind of value they expect in the query function name. (At least, list or scalar). The database guarantees that a GET request always returns a scalar value, and LRANGE always returns a list. I think this has better ergonomics because the types are more explicit.
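The "select 1" point from earlier in this exchange is easy to see from a client. The sketch below uses Python's built-in sqlite3 module as the example driver (other drivers and databases shape their results differently, so treat that as an assumption); the `scalar` helper is hypothetical and shows the unwrapping and shape checks that fall to the caller.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Even a query for a single value comes back as rows of columns.
rows = conn.execute("SELECT 1").fetchall()
print(rows)

def scalar(rows):
    """Unwrap a one-row, one-column result, rejecting any other shape."""
    if len(rows) != 1 or len(rows[0]) != 1:
        raise ValueError("expected exactly one row and one column")
    return rows[0][0]

value = scalar(rows)
print(value)
```

The empty-table, extra-row, and extra-column cases listed above are exactly the branches `scalar` has to reject by hand.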
gregw2 8 days ago |
Minor nit. Some SQL databases allow you to return multiple tables. IIRC, SQL Server stored procedures can do that. Agreed it's not a language feature of SQL.
DrBenCarson 8 days ago |
Just because there are reasons for why Redis sucks doesn’t mean it doesn’t suck
osigurdson 8 days ago |
>> stored at global variables
This is an interesting (and correct) perspective. Global variables scare us in software but we are ok with it when it comes to application state stored in a db.
josephg 8 days ago |
Global variables can definitely be overused, but in the right situations, they’re generally fine. After all, the filesystem is a big global variable too. So is any database. But people don’t complain too much about that.
The strongest argument against global variables is that they don’t show up in the parameter lists of functions. In that way, they’re sort of “spooky action at a distance”. And they encourage functions to be impure. But if this bothers you, you can always pass your database connection as an explicit parameter to any function which interacts with it.
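As a small illustration of the last suggestion, here is what passing the database handle explicitly can look like. The example uses Python's built-in sqlite3 module, and the `visits` table and both function names are invented for the sketch.

```python
import sqlite3

def record_visit(conn, page):
    """The database dependency is visible in the signature."""
    conn.execute("INSERT INTO visits (page) VALUES (?)", (page,))

def count_visits(conn, page):
    """Read back through the same explicitly passed connection."""
    row = conn.execute(
        "SELECT COUNT(*) FROM visits WHERE page = ?", (page,)
    ).fetchone()
    return row[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (page TEXT)")

record_visit(conn, "/home")
record_visit(conn, "/home")
home_visits = count_visits(conn, "/home")
print(home_visits)
```

The state is still global in the sense discussed above, but every function that touches it now says so in its parameter list, and tests can hand in a throwaway in-memory database.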
liontwist 8 days ago |
They aren’t scary. They are useful and you probably use them in many forms (lambdas capturing locals, logging, singletons).
This is yet another reason why single threaded should be the default assumption and multi-threaded require special consideration.
enugu 8 days ago |
There have been proposals to have global state in programming languages which function like databases with the advantages of monitoring/persistence/naming etc, but also retain the modularity of local state.
https://www.scattered-thoughts.net/writing/local-state-is-ha...
https://awelonblue.wordpress.com/2012/10/21/local-state-is-p...
zbentley 8 days ago |
> We didn't get into the semantics of Redis' MULTI...EXEC, which the documentation mischaracterizes as "Transactions". I'm happy that at least you didn't use BEGIN...COMMIT.
Hmmm, this is a subtler issue than you make it out to be, I think, though I generally agree with you. The quality issues with Redis's technical design here interrelate substantially with user expectations/perceptions/squishier stuff.
The term "transaction" is anchored in most users' minds to a typical RDBMS transactional model assuming: a) some amount of state capture (e.g. snapshot) at the beginning of the transaction and b) "atomicity" of commit being interpreted as "all requested changes are atomically performed (or none are)" rather than "all requested changes are atomically attempted".
Redis has issues with both of those, so I'm sympathetic to your statement that what they call "transactions" is mis-characterized and would be better described as "best-effort command batching".
It's poor naming/branding to call it "transactions", and I don't think it had to be this way: MULTI/EXEC "transactions" should have been deprecated long ago--in favor of Redis scripts and other changes that should have been made in the Redis engine.
First, a defense of scripts: Redis scripts are, to a certain variety of user who wants transaction-esque functionality, not ideal. Those users may be reluctant to engage with a full procedural programming language rather than the database's query language. However, there's substantial overlap between those users and the ones who will be extremely confused by and unhappy with the existing MULTI/EXEC model--they're the folks with the most specific (wrong, in Redis) assumptions of how transactions should work, and suffer the most from them not working that way. Lua scripts, unfamiliar or not, are likely less troublesome in the long run for this cohort. Specifically, requiring users to be explicit about failure behavior of specific commands via call() vs. pcall() would remove one of the worst sharp edges of the MULTI/EXEC world.
Scripts can't answer other transaction-related needs, though. Ideally, I would have preferred that Redis go in the direction of a uniform set of conditions that can be applied to most write commands. There already are conditions in Redis, but they're not uniformly available: SET + NX/XX conditions single-key writes; WATCH semantically/implicitly conditions later EXEC commands with "if version of $key matches the version retrieved by the WATCH statement", etc. If that type of functionality were made explicit and uniformly available to all or most write operations, a further chunk of transaction-related needs could be addressed. When making single commands conditional isn't enough, scripts used to atomically batch-attempt commands could be invoked with parameters used to conditionalize those scripts' internal commands, and so on.
A final simple affordance in support of transaction-ish behavior would be a connection-scoped value type: either a modifier for arbitrary commands to have them operate on an empty database scoped to the connection, or a simple list-like value for connections to "stash" arbitrary data. This wouldn't fundamentally change any semantics, but would, at the cost of some indirection, marginally reduce the need for client complexity when buffering conditions/commands for later flush via a pseudo-"commit"-script. This is somewhat hair-splitting, though: MULTI/EXEC is already such a connection-scoped buffer, just one that stages commands and not data. My hunch is that a data-only buffer to be consumed by scripts instead of "EXEC" would be an improvement here, but I may well be wrong.
Now, the system that results from these changes is still not as ergonomic/low friction as traditional transactions, and is especially unergonomic when users have to manually capture undo state and decide on rollback semantics during the failure of script execution. As Antirez mentioned in an adjacent comment, AOF can help ensure appropriate conistency in the face of database crashes during script execution, but database level reconciliation--aka "what is the equivalent of 'rollback' for a given script"--is still on the user to work out.
But that's what we're really talking about here, isn't it? That lack of undo (that is: the ability to capture and discard transactional state a la MVCC) is at the root of most of the weird and not-quite-transactional capabilities of Redis in this era.
Antirez is totally right that adding those capabilities would have substantially complicated the Redis engine, and I believe him when he says that made it not worth it to do so. Given that, I'd have vastly preferred a Redis which embraced providing tools that work in spite of/with full acknowledgement of that lack, rather than concealing it/confusing users by mis-branding MULTI/EXEC as "transactions".
brightball 8 days ago |
Yea, I always use Redis for very specialized purposes.
Like offloading a shared data structure between threads / processes / machines so that I don’t have to deal with thread safety issues.
Spivak 8 days ago |
I understand machines but threads?! Why introduce IPC overhead on the fastest/easiest way to share data? This is beyond a solved problem and your language probably has multiple ready-made battle tested solutions.
In Python you don't even need a lib, dict is thread safe even in nogil.
pdhborges 8 days ago |
> In Python you don't even need a lib, dict is thread safe even in nogil.
Is it? https://google.github.io/styleguide/pyguide.html#218-threadi...
Spivak 8 days ago |
Yep! Not every operation you can do on a dict is thread safe but if you find a situation where it isn't it's a bug.
https://github.com/python/cpython/issues/112075
Google is recommending people not rely on it because it does make dict subclasses not substitutable. It's easy enough to avoid the issue completely so in most cases you might as well do that.
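A small sketch of the distinction being drawn here, assuming plain CPython and the standard `threading` module. Single operations such as `d[k] = v` are atomic, but a compound read-modify-write like `d[k] += 1` is several steps, so the example guards it with an explicit lock rather than leaning on the dict itself. The names `counts` and `record_hits` are made up for illustration.

```python
import threading

counts = {"hits": 0}
lock = threading.Lock()


def record_hits(n):
    for _ in range(n):
        # Read-modify-write is several steps, so guard it explicitly
        # rather than relying on the dict's internal consistency.
        with lock:
            counts["hits"] += 1


threads = [threading.Thread(target=record_hits, args=(10_000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = counts["hits"]
```

With the lock in place the total is exact regardless of how the interpreter schedules the threads; without it, the result depends on the interpreter version and build.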
brightball 8 days ago |
Depends on the use case. Often if I’m sharing data between threads there’s a good chance I’ll want to scale it to another instance at some point so just going straight to Redis is pretty common.
Especially if there’s a chance I’d want that data to persist across restarts.
It’s one thing if I’m using a BEAM language, but otherwise I’ll usually reach for Redis.
codeulike 8 days ago |
Weird how SQL Server and its Azure variants get no mention. It dominates in certain sectors. DBEngines ranks it third most popular overall https://db-engines.com/en/ranking
patja 8 days ago |
The fawning over Larry Ellison is also weird.
cactusfrog 8 days ago |
The joke is that his greed/ unwillingness to squeeze margins has made the entire database company ecosystem possible.
RadiozRadioz 8 days ago |
Lots of people deliberately avoid Microsoft technologies and their whole ecosystem. There's of course interesting stuff happening there, but not enough for those outside the ecosystem to care.
It's more a cultural thing than anything else. HN for example largely leans away from MS. It's quite interesting how little overlap there is between the two worlds sometimes.
Speaking as one of those people, it's just not my thing, so it's not on my radar at all. There's enough stuff happening outside MS to keep me busy forever.
teej 8 days ago |
Are people choosing SQL Server independently of the Microsoft ecosystem? My understanding is that you typically use it because you’re forced to choose a MS product.
rawgabbit 8 days ago |
SQL Server is a terrific product. And I detest most things Microsoft.
breadwinner 8 days ago |
Except when you need to scale.
bdangubic 8 days ago |
this of course is false… it scales fine if you know what you are doing.
manquer 8 days ago |
This is true for most databases though.
The real question is how much works out of the box or through simple, easy-to-access configuration, rather than magic incantations that you need either expensive courses or years of battle-hardened experience to know.
breadwinner 7 days ago |
It is of course true... it is well known that SQL Server scales to department level, but Oracle scales to company level. This is true inside Microsoft and Oracle as well. Inside Microsoft, they have a bug database per division but Oracle has a single database for the entire company. Ask people who work at those companies.
See also scalability sections in these articles:
https://airbyte.com/data-engineering-resources/oracle-vs-sql...
https://futuramo.com/blog/oracle-vs-sql-server-head-to-head-...
bdangubic 7 days ago |
if it is good for SO it should be good for most :)
https://stackoverflow.blog/2008/09/21/what-was-stack-overflo...
breadwinner 7 days ago |
You don't believe that web visitors are directly querying SQL Server, right? I can believe they are storing their employee database in SQL Server... they have hundreds of employees.
bdangubic 7 days ago |
do some research and then come back here… coming with shit like “you don’t believe they are querying sql server directly” is childish and unprofessional.
adzm 8 days ago |
Their query optimizer is incredible. Unfortunately that lets people get away with truly horrifying queries or views nested a dozen layers deep until it falls over.
tormeh 8 days ago |
Is it, though? I worked with it tangentially and found it deficient compared to Postgres. Why pay for a product that's worse than the best free product? In the old days there was a question of who to pay for support that was easier to answer for proprietary DBs, but with cloud services that answer is "you already pay your cloud provider".
branko_d 8 days ago |
I wish Microsoft paid more attention to T-SQL though. It’s an atrociously primitive language in some ways. There is no “record” or “struct” type of any kind, table-valued functions are not composable, an error in one line may throw an exception or just continue execution to the next line depending on whether TRY..CATCH exists at some higher level… to name just a few grievances of many I accumulated over the years.
It can work well performance-wise and security-wise, but programming it can be quite a pain, and I feel that’s unnecessarily so, considering what resources Microsoft has at their disposal.
Tostino 8 days ago |
Agreed with the other person. It's a great database. I wouldn't choose it for a startup over Postgres, but it is extremely capable.
olavgg 8 days ago |
I would use it if it supported backup/restore over unix pipes / ssh.
emmelaich 8 days ago |
If it supports backup to a file, you can have it write to a named pipe and from there to wherever.
I used this hack for backing up Oracle 30 years ago.
Something like 'mknod p backup.dmp; oradump .... file=backup; dd if=backup.dmp | ssh othermachine receiver-process'
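The same trick sketched in Python, assuming a POSIX system where `os.mkfifo` is available. `fake_dump_tool` is a hypothetical stand-in for a backup program that only knows how to write to a file path, and the read loop stands in for whatever sits on the other end of the pipe (ssh in the command above).

```python
import os
import tempfile
import threading

workdir = tempfile.mkdtemp()
fifo_path = os.path.join(workdir, "backup.dmp")
os.mkfifo(fifo_path)  # POSIX only; creates a named pipe instead of a regular file

payload = b"row data\n" * 1000  # stand-in for what a dump tool would emit


def fake_dump_tool(path):
    # Hypothetical dump tool: it only knows how to write to a "file" path.
    # Opening a FIFO for writing blocks until a reader opens the other end.
    with open(path, "wb") as out:
        out.write(payload)


writer = threading.Thread(target=fake_dump_tool, args=(fifo_path,))
writer.start()

received = bytearray()
with open(fifo_path, "rb") as pipe:
    # In a real pipeline this side would forward the bytes to another machine.
    while chunk := pipe.read(4096):
        received.extend(chunk)

writer.join()
os.remove(fifo_path)
os.rmdir(workdir)
stream_ok = bytes(received) == payload
```

Nothing is ever stored at `backup.dmp`; the bytes go straight from the writer to the reader, which is why this only works for tools that write their output sequentially.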
thayne 8 days ago |
Not necessarily. That won't work if the backup uses apis that a pipe doesn't support, like seek or reading back from the file.
emmelaich 7 days ago |
Sure; does any backup actually do that? I guess it's possible.
Backups (at least db backups) used to be made with the assumption that the backup device is tape.
stackskipton 8 days ago |
SRE who deals with some .Net stuff that uses MSSQL but is converting to MySQL, so I feel somewhat qualified to talk about MSSQL. TL;DR: Nothing interesting going on.
There is nothing to talk about here. It's boring database engine that powers boring business applications. It's pretty efficient and can scale vertically pretty well. With state of modern hardware, that vertical limit is high enough most people won't encounter it.
It's also going the way of Windows Server which is to say, it's being sold but not a ton of work is being done on it. Companies that are still invested in it are likely because they don't care about cost ultimately or cost of switching is too high to greenlight the switch.
Anyone who does care about cost, like my current company, has switched to OSS solutions like PostGres/MySQL/$RandomNoSQLOSSOption. My company switched away when it turned into a SaaS business and those MSSQL server costs ate into the bottom line.
This has been happening throughout the ecosystem. Proget which is THE solution for .Net Artifacts is switching to PostGres: https://blog.inedo.com/inedo/so-long-sql-server-thanks-for-a...
Also, I saw this article from Brent Ozar, who I see as MSSQL smart person, which basically said if you have the option, just go with PostGres: https://www.brentozar.com/archive/2023/11/the-real-problem-w...
It's also worth noting that Microsoft even bought PostGres scaling solution called Citus so they read the writing on the wall: https://blogs.microsoft.com/blog/2019/01/24/microsoft-acquir...
rawgabbit 8 days ago |
I was a big proponent of MSSQL. It is still a good product but I see Microsoft constantly fumbling with new OLAP tools. It is a shame but it seems Microsoft is abandoning MSSQL.
datadrivenangel 8 days ago |
If it's any consolation, half of the new cloud OLAP tools are basically still MSSQL
mbreese 8 days ago |
> It's boring database engine that powers boring business applications
I'm taking that as a positive thing... it's boring and does its job with little fanfare. That's pretty much what I want out of a RDBMS. So long as it is "fast-enough" with enough features for the applications that use it, that seems like a good place for an RDBMS to be.
One could still argue about Windows and licensing fees, but from a technical point of view, for business customers, boring isn't necessarily a bad thing.
FridgeSeal 8 days ago |
There’s other boring databases that also reliably fill that job, and they also cost far less.
It can also be a bit of a pain outside the C# ecosystem, whereas every language ever has nice postgres drivers that don’t require us to download and set up ODBC. It runs on Linux as of a few years ago, but I also wouldn’t be surprised if many people didn’t realise that.
stackskipton 8 days ago |
I’ve run into MSSQL on Linux. Most DBAs know but their entire ecosystem is Windows Server so what’s another Windows Server is their thinking.
rplnt 8 days ago |
> It's boring database engine that powers boring business applications.
FWIW, it also powered the most popular (in terms of player base) MMORPG before WoW took over.
And I wouldn't be surprised to find it in aviation, railways, powerplants, grid control, etc...
Foobar8568 8 days ago |
Before WoW, there was either Lineage, and then EverQuest.
I guess it was Lineage, as Korea mainly used MSFT software?
rplnt 8 days ago |
Looking at some subscription charts I was able to quickly find now I see I made a mistake. I was thinking of Lineage 2 using mssql, while it was Lineage (1) that was the major one. I do not know anything about its backend and it would be hard to assume considering how much older it is.
1. https://ics.uci.edu/~wscacchi/GameIndustry/MMOGChart-July200...
greggyb 8 days ago |
I'll probably come across as a shill here, but there is a lot going on with SQL Server, all included in your license (Standard Edition has limitations on scaling).
Some of these things are merely passable, some are great, but it's all included. The key takeaway is that SQL Server is a full data platform, not just an RDBMS.
- RDBMS: very solid, competitive in features
- In-memory OLTP: (really a marketing name for a whole raft of algorithmic and architectural optimization) can support throughput that is an order of magnitude higher
- OLAP: Columnstore index in RDBMS, can support pure DW style workload or OLAP on transactional data for near-real-time analytics
- OLAP: SSAS: two different best-in-class OLAP engines: Multidimensional and Tabular for high-concurrency low-latency reporting/analytics query workloads
- SSIS: passable only, but tightly integrated ETL tool; admittedly in maintenance mode
- SSRS: dependable paginated / pixel-perfect reporting tool; similar to other offerings in this space
- Native replication / HA / DR (one of the only things actually gated behind Enterprise)
- Data virtualization: PolyBase
If you're just looking for a standard RDBMS, then there's little to justify the price tag for SQL Server. If you want to get value for money, you take advantage of the tight integration of many features.
There is value for having these things just work out of the box. How much value is up to you and your use cases.
stackskipton 8 days ago |
Yes, it has a ton going on but most of companies I've found using it are using primarily as RDBMS and thus MySQL/Postgres could replace it. Other stuff it did could be replaced by tools more geared towards specific function and most of time, at much lower cost.
Licensing isn't cheap. For anyone wondering, before discount, it's 876/yr per core for Standard and 3288/yr per core for Enterprise. Also note that Standard is limited to 24 cores and 128GB of RAM, if you want to unlock more of that, you must move to Enterprise.
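For a rough sense of scale, here is the arithmetic as a short Python sketch. It uses only the per-core list prices and the 24-core Standard cap quoted above, and it deliberately ignores the RAM cap, discounts, and any minimum-core licensing rules, so treat the output as a back-of-the-envelope figure. The function name is made up.

```python
# List prices per core per year as quoted above (before discount).
STANDARD_PER_CORE = 876
ENTERPRISE_PER_CORE = 3288
STANDARD_MAX_CORES = 24  # Standard's core cap, per the comment above


def annual_list_price(cores):
    """Return (edition, yearly cost) for the cheapest edition that allows `cores`."""
    if cores <= STANDARD_MAX_CORES:
        return "Standard", cores * STANDARD_PER_CORE
    return "Enterprise", cores * ENTERPRISE_PER_CORE


at_cap = annual_list_price(24)     # ("Standard", 21024)
just_over = annual_list_price(32)  # ("Enterprise", 105216)
```

Going from 24 to 32 cores adds a third more cores but roughly quintuples the yearly list price, because crossing the cap forces the edition change.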
codeulike 8 days ago |
There's still an Express version that's free to use but limits the database to 10 GB, for what it's worth.
greggyb 4 days ago |
My point was just that there’s a lot going on there, and value for those who want more than an RDBMS. I have no disagreement that the RDBMS on its own is not worth paying for for most, and especially not for any techish organizations.
I’d also note that most orgs and use cases probably don’t need more than 24 cores and 128GB RAM.
I think for an organization that wants a near-trivial out of the box experience with RDBMS, reporting, and analytics, Standard Edition is not a bad deal. Especially for the many organizations that are already using Microsoft as their identity provider and productivity suite.
datadrivenangel 8 days ago |
Microsoft is a decent deal if you go 100% in on Microsoft, but it's important to budget for additional support because actually using microsoft products has quite the learning curve.
emmelaich 8 days ago |
It's the Linux-isation of the db space. Once Linux was good enough for enterprise work, it massively reduced demand for Solaris/HP-UX/AIX/WindowsNT.
Same thing is happening now to Postgres vs enterprisey DBs.
nevf1 7 days ago |
I have used a lot of RDBMS vs NoSQL solutions and I love SQL Server. I have used and written services consuming/reporting/processing thousands of transactions per second and billions of euro per year.
The profiling abilities of SQL Server Management Studio (SSMS) and its query execution insights, the overall performance and scalability, T-SQL support, in-memory OLTP, and temporal tables - I just love SQL Server.
I'm not sure if it's just that I learned SQL Server better in college than MySQL, Mongo or Postgres but it's just been an amazing UX dev experience throughout the years.
Granted, there's some sticky things in SQL Server, like backups/restores aren't as simple as I'd like, things like distributed transactions aren't straightforward, and obviously the additional licensing cost is a burden particularly for newer/smaller projects, but the juice is worth the squeeze IMHO.
sigbottle 8 days ago |
These year in review posts are really neat, I liked the AI in review posts really well.
Maybe algorithms review or TCS review or some specific math topic review next?
CT4u8798 8 days ago |
I love SQL. I'm not a full-time developer but always use SQL over other abstractions, which I find extremely confusing and way more complicated than plain SQL.
skeeter2020 8 days ago |
I'm now firmly into management but the one skill I use very regularly is SQL. By far the best investment I made in my entire career was a little bit of relational algebra, some casual study of DBMS internals and a lot of hands-on SQL. The quasi-standards have also made it the easiest transfer across specific DBs and their flavours over the years.
PSA: Hi kids, here's a dinosaur with yet more free advice: put the tiniest bit of effort into SQL early on and watch the compound interest add up.
threeseed 8 days ago |
That's just because it is what you are comfortable with.
Many developers will jump straight for ORMs when given the chance.
Tostino 8 days ago |
Which for certain types of applications ORMs absolutely have their use.
dominicrose 8 days ago |
Some ORMs have weird design issues. Eloquent for example allows you to pull relationships lazily on single objects, so if you're in a loop that'll create a lot of queries. So much for laziness! I'm OK with this ability, but the API shouldn't encourage it by making it trivial.
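The query blow-up described here is usually called the N+1 problem. Below is a sketch in Python with the standard `sqlite3` module rather than Eloquent itself; the tables, the `run` helper, and the counter are all invented for illustration. It hand-writes the queries a lazy relationship would issue inside a loop and compares them with a single JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO posts VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 2, 'b1'), (4, 3, 'c1');
""")

query_count = 0


def run(sql, params=()):
    """Execute one query and count it, so the two styles can be compared."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


# Lazy style: one query for the parents, then one more per parent inside the loop.
lazy = {}
for author_id, name in run("SELECT id, name FROM authors ORDER BY id"):
    rows = run("SELECT title FROM posts WHERE author_id = ? ORDER BY id", (author_id,))
    lazy[name] = [title for (title,) in rows]
lazy_query_count = query_count

# Eager style: a single JOIN fetches the same data in one round trip.
query_count = 0
eager = {}
for name, title in run(
    "SELECT a.name, p.title FROM authors a JOIN posts p ON p.author_id = a.id "
    "ORDER BY a.id, p.id"
):
    eager.setdefault(name, []).append(title)
eager_query_count = query_count
```

With three authors the lazy loop issues four queries against one for the JOIN, and the gap grows linearly with the number of parent rows.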
wahnfrieden 8 days ago |
2024 was also the year that Realm died
travisgriggs 8 days ago |
The funding section had me thinking “one of these is not like the others”. Both the amount and count of successive rounds.
ksec 8 days ago |
>Six years after MySQL v8 went GA, the team turned v9 out on the streets. ......Oracle is putting all its time and energy into its proprietary MySQL Heatwave service.
Oracle actually released 9.1 already in 2024. [1] And expect another release this month, and every quarter. So I think MySQL continues to get new features, bug fixes, and support like it used to, contrary to what most people think, which is that it is all going to Heatwave. I just hope Vector will later be open sourced as an official part of MySQL rather than staying behind Heatwave.
[1] https://dev.mysql.com/doc/relnotes/mysql/9.1/en/news-9-1-0.h...
the_arun 8 days ago |
This person started with news on DB - reviewing all prominent DBs & finally ended talking about love of Larry Ellison. A perfect human in the days of LLMs. Amazing write up.
RedShift1 8 days ago |
I've been using plain postgres for over 5 years now, reading this I feel like I'm in the eye of a storm...
mebcitto 8 days ago |
A couple of spicy things:
> OtterTune. Dana, Bohan, and I worked on this research project and startup for almost a decade. And now it is dead. I am disappointed at how a particular company treated us at the end, so they are forever banned from recruiting CMU-DB students. They know who they are and what they did.
Ouch.
> Lastly, I want to give a shout-out to ByteBase for their article Database Tools in 2024: A Year in Review. In previous years, they emailed me asking for permission to translate my end-of-year database articles into Chinese for their blog. This year, they could not wait for me to finish writing this one, so they jocked my flow and wrote their own off-brand article with the same title and premise.
Also sounds like he's preparing a new company:
> I hope to announce our next start-up soon (hint: it’s about databases).
spprashant 8 days ago |
Anyone know what company he may be talking about?
C-programmer 8 days ago |
Inspect element on https://web.archive.org/web/20240827031455/https://ottertune...
For more context:
> I'm to sad to announce that @OtterTuneAI is officially dead. Our service is shutdown and we let everyone go today (1mo notice). I can't got into details of what happened but we got screwed over by a PE Postgres company on an acquisition offer. https://x.com/andy_pavlo/status/1801687420330770841
yencabulator 8 days ago |
Because that was a little too subtle and I was sufficiently curious:
view-source:https://web.archive.org/web/20240827031455/https://ottertune...
scroll until you see ASCII art
swyx 8 days ago |
needed that. https://en.wikipedia.org/wiki/EnterpriseDB note the PE's
spprashant 8 days ago |
Oh wow, didn't know pe-postgres-company had any negative rep.
Anyone care to explain how a company can screw another company via an acquisition offer?
senderista 8 days ago |
Well, there's a leading Postgres company which is owned by not one but two PE firms...
iso8859-1 8 days ago |
How do they enforce the ban? Do universities have non-compete clauses for PhD students?
mebcitto 8 days ago |
I assume it's not that kind of ban, but more like he'll recommend his students to avoid the company.
lmwnshn 8 days ago |
Pretty much. Plus, from my perspective - if a company is willing to screw over your advisor/professor, you know that they won't hesitate to screw you over too.
mastax 8 days ago |
I think that just means they aren’t allowed at career fairs etc.
mrtimo 8 days ago |
Enjoyed his roundup in the "Shoving Ducks into Everything" section.
DuckDB is a great tool. In April 2020, the creator of DuckDB gave a talk at CMU. In the beginning he makes a convincing argument (in 5 minutes) why data scientists don't use RDBMS and how this was the genesis of DuckDB. Here is a video that starts 3 minutes into the talk (where his argument starts): https://youtu.be/PFUZlNQIndo?si=ql9n2QuBlAEuGIqo&t=204
dig1 8 days ago |
> There was no major effort to fork off MongoDB, Neo4j, Kafka, or CockroachDB when they announced their license changes.
AFAIK people didn't take MongoDB seriously from the start, especially with the "web scale database" joke circulating. The Neo4j Community version has been under GPLv3 for quite some time, while the Enterprise version has always been somewhat closed, regardless of whether the source code was available on GitHub (the mentioned license change affected the Enterprise version).
Regarding CockroachDB, I must admit that I've only heard about it on HN and don't know anyone who seriously uses it. As for Kafka, there are two versions: Apache Kafka, the open-source version that almost everyone uses (under the Apache license), and Confluent Kafka, which is Apache Kafka enhanced with many additional features from Confluent, and the license change affected Confluent Kafka. In short, maybe the majority simply didn't care about these projects very much, so there is no major fork.
> It cannot be because the Redis and Elasticsearch install base is so much larger than these other systems, and therefore, there were more people upset by the change since the number of MongoDB and Kafka installations was equally as large when they switched their licenses.
I can’t speak for MongoDB, but the Confluent Kafka install base is significantly smaller than that of Apache Kafka, Redis and ES.
> Dana, Bohan, and I worked on this research project and startup for almost a decade. And now it is dead. I am disappointed at how a particular company treated us at the end, so they are forever banned from recruiting CMU-DB students. They know who they are and what they did.
Call me a skeptic, but I can't see this as a fair approach. If your company fails for whatever reasons, you should not recruit the university department/group/students against your peers (I can't find that CMU-DB was one of the founders of Ottertune).
Wrt Andy, here are [1] somehow interesting views from (presumably) previous employees.
[1] https://www.reddit.com/r/Database/comments/1dgaazw/comment/l...
paulddraper 8 days ago |
There are production uses of MongoDB (Stripe comes to mind).
But it is certainly not a popular choice there.
threeseed 8 days ago |
MongoDB does over $2b in revenue (and growing by 20%) each year.
There are a lot of production uses.
paulddraper 7 days ago |
In my experience, these are largely startups or non-production use cases.
Also, MongoDB charges an arm and a leg and does not make it particularly easy to self-host (and many newer features are limited to their hosting).
misiek08 8 days ago |
It’s sad, but there are a lot of companies using it. In all places I know (directly or friends) it’s being chosen as „easier”. Almost everyone ends up not knowing the real schema of data inside the database, but using Spring-magic they are already lost and not owning the data. Maybe the enterprise Mongo is not so widely spread (they earn much, because they are not cheap), but free version is IMO overused and popular :(
apavlo 8 days ago |
> Wrt Andy, here are [1] somehow interesting views from (presumably) previous employees.
I am only seeing this now and I take the complaints about being "slightly racist and offensive" very seriously. I am checking with investors, former HR people, and co-founders. I was not made aware of any issues. If anything, I was overly cautious at the company.
I was openly transparent with our employees about every direction the company was pursuing up until the very end. The complaint that "He thinks he knows everything about business" makes me believe this person is just trolling because I was always the first to admit in meetings that I was not an expert in how to run a business. We had to fire people because of inappropriate behavior, but not because I had strong disagreements with how to run the company.
moab 8 days ago |
FWIW, as someone who had multiple friends who worked closely with Andy over 5+ years and at all stages of their career (both BS / PhD) those comments reek of someone with an axe to grind. All of the many anecdotes I have about Andy paint a picture of a great advisor and mentor. I suppose I should say "shenanigans aside", but if you can't separate his jokes from his academic side you need to develop a sense of humor.
lmwnshn 8 days ago |
> ... you should not recruit the university department/group/students against your peers ...
As a student who chose to stay at CMU for a PhD because of this group, it is quite the opposite situation - you may also misunderstand the nature of the "ban" (students can still apply directly to the company).
From the student perspective, we benefit from knowing the reputation of potential employers. For example: CompanyX went back on their promises so don't trust them unless they give it to you right away, CompanyY has a culture of being stingy, the people who went to CompanyZ love it there, and so on.
So it's more like (1) providing additional data about the company's past behavior, and (2) not actively giving the company a platform. I personally find this great for students.
maeil 8 days ago |
Good read!
> Postgres' support for extensions and plugins is impressive. One of the original design goals of Postgres from the 1980s was to be extensible. The intention was to easily support new access methods and new data types and operations on those data types (i.e., object-relational). Since 2006, Postgres' "hook" API. Our research shows that Postgres has the most expansive and diverse extension ecosystem compared to every other DBMS.
Greenhorn developers don't even know that there are non-Postgres databases which have extensions too - such is the gap! I wouldn't be surprised if Postgres had as many as all others combined.
softwaredoug 8 days ago |
On the “Amazon can just offer your DB as a service”
Yes this can happen. But a lot of people don’t want an AWS managed service. They're like 30% cheaper for 30% less value. They can develop a bad reputation and feel like weird forks (kinesis vs Kafka) that have weird undocumented gotchas and edge cases that never get fixed. Many teams want to host on k8s anyway, and you’ll probably have better k8s support from the main project. Another example is the success of Flink over hosted Google Dataflow. Seems eventually the teams I know trend to the most mainstream OSS implementation over time, maybe after early prototyping on a managed system.
IMO it might not be the highest growth market anymore. Those who want to pay for a managed service will. But many are just figuring out a k8s based solution to their infra needs as k8s knowledge becomes more ubiquitous.
senderista 8 days ago |
Kinesis is not Kafka (at all), nor is Google Dataflow Flink (perhaps you're thinking of Apache Beam?).
wslh 8 days ago |
Great heads up. I wonder about graph databases. He mentioned <https://umbra-db.com/> and <https://cedardb.com/> both include the graph use case and I wonder how they compare to <https://neo4j.com/>.
Tanjreeve 8 days ago |
Umbra and cedar are both still relational databases. Afaik the jury is now out on if graph databases are better than modern relational databases for most graph queries especially ones with good query planners/compilers. The only time graph DBs seem to be consistently better is for very specialist many degrees traversals for small amounts of data.
wslh 8 days ago |
My point is that both Umbra and Cedar mentioned graph support, and I don't believe this was a coincidence.
Umbra highlights: "Groupjoins enable efficient computation of aggregates, worst-case optimal joins handle complex queries on graph structured data, and range joins efficiently evaluate queries with conditions on location or time intervals." while Cedar includes in the hero: "CedarDB is a relational-first database system that delivers best-in-class performance for all your workloads, from transactional to analytical to graph,..."
Tanjreeve 7 days ago |
I think that phrase of "relational first" explains it best. Anything that supports joins can support a graph. Whether it'll perform satisfactorily is going to depend on the compiler and use case/size.
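As a small illustration of the "joins can support a graph" point, here is a sketch using Python's bundled `sqlite3` and a recursive CTE. This is SQLite, not Umbra or CedarDB, and the `edges` table and its data are made up. It says nothing about performance, only that a reachability query can be expressed as a repeated self-join over an edge table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    INSERT INTO edges VALUES ('a', 'b'), ('b', 'c'), ('c', 'd'), ('a', 'e'), ('x', 'y');
""")

# Every node reachable from a start node, found by repeatedly joining
# the set of nodes found so far against the edge table.
reachable = [
    node
    for (node,) in conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT ?
            UNION
            SELECT e.dst FROM edges e JOIN reach r ON e.src = r.node
        )
        SELECT node FROM reach ORDER BY node
        """,
        ("a",),
    )
]
```

Using `UNION` rather than `UNION ALL` drops rows already seen, so the recursion also terminates on graphs with cycles. Whether this performs acceptably at scale is exactly the planner-dependent question raised above.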
atombender 8 days ago |
The article mentions Greenplum, but it's worth noting that when the code was closed, several of the original developers created an open-source fork, Cloudberry, which seems to be thriving. Cloudberry was accepted into the Apache project this year, and has synced with Postgres 14, whereas the closed-source Greenplum is still stuck on Postgres 12.
The architecture is quite ancient at this point, but I'm not sure it's completely outdated. It's single-master shared-nothing, with shards distributed among replicas, similar to Citus. But the GPORCA query planner is probably the most advanced distributed query planner in the open source world at this point. From what I know, Greenplum/Cloudberry can be significantly faster than Citus thanks to the planner being smarter about splitting the work across shards.
ledgerdev 8 days ago |
Thanks for the cloudberry mention, wasn't aware of it.
Icathian 8 days ago |
This is mostly correct, but it's worth mentioning that cloudberry substantially predates Greenplum going closed source. It just got quite a boost from that change happening. Different dev team too, afaik none of the original Greenplum team was involved with Cloudberry until very recently.
Also, Greenplum 7 tracks postgres 14. Which is still old at this point, but not so bad as 12....
I also don't think I'd call the architecture ancient. Just very tightly coupled to postgres' own (as a fork of postgres that tries to ingest new versions from upstream every year or two) and paying the overhead of that choice in the modern landscape.
Source: former member of the Greenplum Kernel team.
atombender 7 days ago |
Thanks for the context. In what way would you say Cloudberry lags behind Greenplum technology-wise? I see newer Greenplum versions have a lot of planner improvements.
Greenplum 7 is listed as tracking Postgres 12 in the release announcement [1], and the release notes for later 7.x versions don't mention anything. Is there a newer release with higher compatibility?
When I say ancient, I mean that it's a "classical" shared-nothing design where the database is partitioned and hosted as parallel, self-contained replica servers, where each node runs as a shard that could, in theory, be queried independently of the master database. This is in contrast to newer architectures where data is sharded at the heap level (e.g. Yugabyte, CockroachDB) and/or compute is separated from data (e.g. Aurora, ClickHouse, Neon, TiDB).
[1] https://greenplum.org/partition-in-greenplum-7-whats-new/
Icathian 6 days ago |
Cloudberry, last I checked, took their snapshot of all the Greenplum utilities way before the repos got archived and development went private. The backup/restore, DR, Upgrade, and other such seem to leave a lot on the table. I haven't checked in a bit, it's possible they've picked back up some of that progress.
You're completely right, I had the wrong PG version in my memory. Embarrassing, thanks for catching that.
dedekind1986 4 days ago |
All the Greenplum utilities you mentioned here are also open-sourced and available for Cloudberry, but some of them are not in the main repo of Apache Cloudberry (This is more a matter of adhering to the Apache Software Foundation's regulations than a technical limitation).
Here is the unofficial roadmap of Cloudberry:
1. Continuously upgrading the PostgreSQL core version, maintaining compatibility with Greenplum Database, and strengthening the product's stability.
2. End-to-end performance optimization to support near real-time analytics, including streaming ingestion, vectorized batch processing, JIT compilation, incremental materialized views, PAX storage format, etc.
3. Supporting lakehouse applications by fully integrating open data lake table formats represented by Apache Iceberg, Hudi, and Delta Lake.
4. Gradually transforming Cloudberry Database into a data foundation supporting AI/ML applications, based on Directory Table, pgvector, and PostgresML.
dedekind1986 4 days ago |
Delighted to see Greenplum mentioned in this article, and equally pleased to see Apache Cloudberry mentioned in the comments. Greenplum has been open-source for nearly a decade, forming a fairly mature global open-source ecosystem, with many core developers distributed around the world (they were not necessarily hired by Pivotal/VMware/Broadcom). The point of forking Greenplum as Cloudberry wasn't to outdo Greenplum Database, but to foster a more neutral and open community around an MPP database with a substantial global following. To that end, the project was donated to the Apache Software Foundation following Greenplum's decision to close source. Since the project is in its early stages within the Apache incubator, our immediate goal is to build a solid foundation that adheres to Apache standards. Instead of introducing extensive new features, we are concentrating on developing a stable and compatible open-source alternative to Greenplum.
PeterZaitsev 8 days ago |
I think one thing Andy misses about why people were pissed about Elastic and Redis, but not as much about MongoDB and some others, is the license and the size of the contributor community.
When original license is as restricted as AGPL it is unlikely there is much of embedded use... so less people are impacted in truly catastrophic way
Also if there is no contributor community to speak of... who is going to do the fork ?
I put some thoughts about it in my post about ScyllaDB https://peterzaitsev.com/thoughts-on-scylladb-license-change...
tayo42 8 days ago |
> In the case of Redis, I can only think that people perceive Redis Ltd. as unfairly profiting off others' work since the company's founders were not the system's original creators. An analysis of Redis' source code repository also shows that a sizable percentage of contributions to the DBMS comes from outside the company
He mentions this in "Andy’s Take" section btw
PeterZaitsev 8 days ago |
Yes. Not the license though.
rockwotj 8 days ago |
RE ScyllaDB, there will absolutely be no fork, and it’s very unlikely they have ever had a meaningful contribution to Scylla OSS (which is not changing, just going to bit rot, while the enterprise version, which was closed source, moved to source available). The reason is that the bar for contributions is very high. It’s C++ 20/23 without virtual memory and with a userland cooperative thread-per-core scheduler (this is the underlying Seastar framework). The skill level needed to do anything meaningful here is very high, and I don’t see the motivation for it: I would expect people with feature requests to be customers, to have an easier time extending Cassandra, or to find the feature already available in the Enterprise version.
menaerus 8 days ago |
> It’s C++ 20/23 without virtual memory
Not sure what you mean by this? Virtual memory is implied by the CPU MMU and consequently OS kernel. Perhaps you meant they use a lot of custom memory allocation schemes?
Otherwise, I agree that the bar is quite high since (1) the problem at hand is already too complex (scalable LSM), and (2) pretty much anything in the code is custom made, e.g. avoiding the OS kernel as much as possible. And they pay peanuts for the skills needed to do the job.
mike_hearn 7 days ago |
Seastar runs everything in kernel mode.
menaerus 7 days ago |
Makes no sense. Source?
rockwotj 7 days ago |
I mean memory is completely allocated up front and fragmentation is something for application developers to deal with.
menaerus 7 days ago |
I occasionally read their blogs, and I haven't looked into the source for a while, but application developers having to deal with OOM because one is using a custom memory allocation policy that doesn't deal with, or suffers from, fragmentation would be a strange design choice.
jemalloc IME works really well.
rockwotj 7 days ago |
It tries to bypass the kernel as much as possible. It runs in userspace.
mike_hearn 7 days ago |
I'm probably thinking of when it's combined with OSv or something. It's been years since I looked at this.
kwillets 8 days ago |
I spent the past year puzzling over the DB market as well, but I don't feel like I'm much closer to understanding it.
It appears that a lot of attention is now directed at the folks doing 100 MB queries, and the high end has moved past everybody's radar. My idea of an exciting product is Ocient, who have skipped over Cloud and gone for hyperscale on-prem hardware. Yellowbrick is also a contender here.
I have a lot of experience with Vertica, and they seem to have gotten stuck in this niche as well, with sales tilted towards big accounts, but less traction in smaller shops, and a difficult road to get a SaaS or similar easy-start offering.
There's a crossover point where self-managed is cheaper than cloud, but nobody seems to have any idea where it is. Snowflake will gladly tell you that your sub-$1M Vertica cluster should be replaced by $10M of sluggish SaaS, and that you are saving money by doing so. These decisions seem more in the realm of psychology or political science.
DHH's cloud exit was a refreshing take on the expense issue, even if it wasn't strictly in the database space -- the cost per VCPU and so forth that he documented is a good start for estimating savings, and he debunked a lot of the "hidden costs" that cloud maximalists claim.
In the business/financial space the biggest news to me was the correction in Snowflake's stock price, which seemed to indicate that investors were finally noticing metrics like price-performance, but they added a little more AI and went back into irrationality.
I'm heavily in favor of DuckDB, Hudi, Iceberg, S3 tables, and the like. Mixing high-end and low-end tools seems like the best strategy (although settling on one high-end DWH has also worked IME), and the low end is getting better and cheaper, squeezing out the mid-range SaaS vendors.
In research I found Goetz Graefe's work in offset-value coding exciting -- he's wired it into query operators in a way that saves a lot of CPU on sorting and joins/aggregation. This is a technique that I've applied favorably in string sorting, and it was discovered in the DB community decades ago but largely forgotten. (This work precedes 2024, but I'm a slow study.)
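For anyone who hasn't met the technique, here is a minimal Python sketch of offset-value coding for ascending sorts. It is my own illustration, not Graefe's implementation, and the helper names `ovc` and `compare_with_ovc` are hypothetical. Each row gets a code relative to a smaller base row: the first column where it differs, plus the value there. Two rows coded against the same base can often be ordered from the codes alone, and the real column comparisons start only after the shared prefix.

```python
def ovc(base, row):
    """Offset-value code of `row` relative to `base`, assuming base <= row.

    Returns (arity - offset, value), where offset is the first column in which
    the rows differ. Rows equal to the base get (0, None).
    """
    arity = len(row)
    for offset in range(arity):
        if base[offset] != row[offset]:
            return (arity - offset, row[offset])
    return (0, None)


def compare_with_ovc(a, code_a, b, code_b):
    """Compare rows a and b whose codes are relative to the same base.

    Returns (result, column_comparisons), where result is -1, 0, or 1.
    """
    if code_a[0] != code_b[0]:
        # The row sharing the longer prefix with the base sorts first,
        # and no column values need to be touched.
        return (-1 if code_a[0] < code_b[0] else 1), 0
    if code_a[0] == 0:
        return 0, 0  # both rows equal the base
    if code_a[1] != code_b[1]:
        return (-1 if code_a[1] < code_b[1] else 1), 0
    # Codes tie: fall back to comparing the remaining columns.
    start = len(a) - code_a[0] + 1
    comparisons = 0
    for col in range(start, len(a)):
        comparisons += 1
        if a[col] != b[col]:
            return (-1 if a[col] < b[col] else 1), comparisons
    return 0, comparisons


base = ("de", "berlin", 1)
rows = [("de", "berlin", 7), ("de", "munich", 2), ("de", "munich", 5)]
codes = [ovc(base, r) for r in rows]

first_vs_second = compare_with_ovc(rows[0], codes[0], rows[1], codes[1])
second_vs_third = compare_with_ovc(rows[1], codes[1], rows[2], codes[2])
```

The first comparison is settled by the offsets alone, and the second needs one column comparison instead of three. A full implementation would also update the loser's code relative to the winner after each comparison, which this sketch leaves out.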
downsplat 8 days ago |
> There's a crossover point where self-managed is cheaper than cloud
Single data point here: before cloud managed dbs were a thing our smallish startup was running mysql on virtual servers by installing it from the linux package manager. Always worked great, runs without needing manual attention for years at a time once set up, so I've never felt the need to change.
So at least in some cases the crossover point is "right from the start".
kwillets 7 days ago |
Zero is certainly on my scale. Bonus points if you build a server and keep it under your desk.
senderista 7 days ago |
I was seriously considering applying to Ocient (had an internal referral), but there's no way I could live on their salary ranges ($145K-185K quoted for senior SWE roles), given that I live in a HCOL area.
kwillets 7 days ago |
I don't know much about the financial side of the company, but it seems like a client-led effort by telcos etc. against the dreck that tech VCs keep pushing on them. That can't translate into decent salaries unfortunately.
Silicon Valley doesn't have a good record in the DB/DWH space; producing a fully-featured DBMS doesn't seem to fit the VC model.
bionhoward 8 days ago |
Redis is slow?
kermatt 8 days ago |
I wish there was more context around that statement in his post.
Redis while not having some of the features he mentions in [1] (i.e. SQL), when used for what it excels at is usually not considered "slow".
As an in-memory data structure server, a common use case is to use it for where some operations in a typical RDBMS are slow.
[1] https://youtu.be/fZbwD1gzjLk?t=2018
senderista 8 days ago |
Yes.
1. It is single-threaded, which severely limits throughput for a single instance.
2. All communication must go over a socket, which severely impacts latency for use cases where it could otherwise run in-process.
breadwinner 8 days ago |
What is the better alternative?
bdangubic 8 days ago |
depends on your use case. also, for most redis is fast enough which is why it is wildly popular
yla92 8 days ago |
Do check out https://www.dragonflydb.io . It's also mentioned in the review!
CRConrad 7 days ago |
I'm so tired of these Web pages that look like iPhone screens.
lmwnshn 8 days ago |
Slow is relative, but you might want to check out Garnet [0] for ideas. Previous discussion at [1], current compatibility at [2].
[0] https://www.microsoft.com/en-us/research/blog/introducing-ga...
[1] https://news.ycombinator.com/item?id=39752504
[2] https://microsoft.github.io/garnet/docs/commands/api-compati...
ak_111 8 days ago |
Wow his database startup that raised 12M died this year after only three years.
If anything this shows how insanely difficult it must be to succeed as a database startup (when was the most recent startup success in this space?), as the founding team is stellar.
On the other hand I am surprised it died this quick and interested to know if they did a proper postmortem. Not only did they raise way more than is needed to survive for three years but the idea is about utilising AI to improve DB performance and I find it hard to imagine they couldn't find more investors to lend them a lifeline with all the AI hype.
Tanjreeve 8 days ago |
No idea about internal workings but as a "DB optimisation" startup you're competing with
- most people don't need it
- People who do need it having DBAs/Operations people
- or consultancies
- Database vendors that have automatic optimisation as a feature
Ok "AI" in the name but I think for something as specific as DB optimisation AI jazz hands probably don't work as well. Writing it out it almost seems harder than being an actual DB vendor.
didgetmaster 8 days ago |
I thought the same thing. I have a personal project that started out as a file system replacement (object store), but also does some amazing DB operations and is useful for analytics. I have often thought about trying to turn it into a profitable business; but the thought of needing to raise insane amounts of capital just to get a few years of runway, seems daunting.
hipadev23 8 days ago |
Andy's ego is far too big to actually run a company and his focus is entirely on vanity metrics. You can see it in these database reviews where he spends substantial time pining over how much money CompanyX raised and shitting on the bread and butter tools used by those of us who don't have VC money to blow on whatever DBaaS is the current hype.
cmrdporcupine 8 days ago |
I mean it's also arguably the most difficult time for startups (in general) in the last 15-20 years. Esp if you got your initial investment and valuation before the end of ZIRP.
lvl155 8 days ago |
I highly recommend their Youtube series on databases. They have great guest speakers.
syspec 8 days ago |
Link?
nojito 7 days ago |
https://www.youtube.com/c/CMUDatabaseGroup
based2 8 days ago |
https://dbos-project.github.io news
refset 8 days ago |
More specifically, DBOS Inc. raised a $8.5 million seed round [0] and is backed by Michael Stonebraker (the creator of Postgres). I initially assumed Andy was alluding to this when he wrote "the most famous database octogenarian splashing cash" :)
[0] https://techcrunch.com/2024/03/12/new-startup-from-postgres-...
polishdude20 8 days ago |
After interviewing at OtterTune a while back and being bombarded with multiple rounds of leetcode questions, I somehow knew OtterTune wouldn't make it
roark_howard 8 days ago |
DuckDB dominating over DataFusion could fuel the ongoing language war with a great half baked argument!
yencabulator 8 days ago |
Every time I've tried to use DuckDB it has segfaulted on me, so yeah I'm betting on DataFusion..
1egg0myegg0 8 days ago |
That is very surprising to hear! If you can reproduce it, could you please file a bug report on GitHub? That would be a huge help!!
riku_iki 8 days ago |
I personally filed a bunch of bugs; they were mostly auto-closed because of no activity after 3 months. This is very discouraging, and I would rather invest effort in looking for workarounds in the future.
gigatexal 8 days ago |
This take screams less of technical criticism and more of something personal. “I'll be blunt: I don't care for Redis. It is slow, it has fake transactions, and its query syntax is a freakshow. Our experiments at CMU found Dragonfly to have much more impressive performance numbers (even with a single CPU core). In my database course, I use the Redis query language as an example of what not to do.” (From the article)
Of course it’s not to be used as a general purpose DB it’s keys and values. Used for caches and things like that. In my experience in real world scenarios and loads vanilla single threaded Redis is stable, fast, and nigh bulletproof.
rednafi 8 days ago |
Loved the overview. Hated the shade toward Redis. Redis has arguably the best key-value query syntax, and there’s a reason so many people swear by it. True, the decision-makers at Redis Ltd are absolute pieces of trash, but Redis itself is a delightful piece of engineering artifact.
I don’t care about the billion-dollar drama behind a piece of tech, but Redis defined the key-value query API for many similar databases. Trashing it just because it isn’t SQL-like feels unjustified.
quotemstr 8 days ago |
There's one QOL extension that I haven't seen anyone else implement: dimensional analysis. I can declare a column is an integer. Why not an integer that expresses feet? Why shouldn't I be able to write SELECT 1inch + 1cm and get a correctly computed length? Why can't the query parser help me avoid nonsense like SELECT 1kg + 1hr? All this stuff is pretty straightforward to add and would help avoid avoidable mistakes.
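To show how little machinery the basic check needs, here is a minimal Python sketch, not a proposal for any particular database's syntax. The `Quantity` class and the `UNITS` table are hypothetical: each value carries its dimension, addition converts through a base unit, and mixing dimensions raises an error. It covers only addition within one dimension, not derived units such as m/s.

```python
from dataclasses import dataclass

# Conversion factors to a base unit per dimension (metre, kilogram, second).
UNITS = {
    "inch": ("length", 0.0254),
    "cm": ("length", 0.01),
    "m": ("length", 1.0),
    "kg": ("mass", 1.0),
    "hr": ("time", 3600.0),
}


@dataclass(frozen=True)
class Quantity:
    dimension: str
    base_value: float  # magnitude expressed in the dimension's base unit

    @classmethod
    def of(cls, value: float, unit: str) -> "Quantity":
        dimension, factor = UNITS[unit]
        return cls(dimension, value * factor)

    def __add__(self, other: "Quantity") -> "Quantity":
        if self.dimension != other.dimension:
            raise TypeError(f"cannot add {self.dimension} to {other.dimension}")
        return Quantity(self.dimension, self.base_value + other.base_value)

    def to(self, unit: str) -> float:
        dimension, factor = UNITS[unit]
        if dimension != self.dimension:
            raise TypeError(f"cannot express {self.dimension} in {unit}")
        return self.base_value / factor


# The equivalent of SELECT 1inch + 1cm, reported in centimetres.
length = Quantity.of(1, "inch") + Quantity.of(1, "cm")
length_cm = length.to("cm")

# The equivalent of SELECT 1kg + 1hr is rejected instead of silently computed.
try:
    Quantity.of(1, "kg") + Quantity.of(1, "hr")
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

In a database the same check would presumably live in the type system, so the error would surface at query planning time rather than at execution.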
zhousun 8 days ago |
It's such an honor our https://pgmooncake.com/ is covered in the review!
A little sad Andy didn't share more of his thoughts on the intersection between Data and AI, and how that's going to evolve.
Upvoter33 8 days ago |
Pretty funny.
One factual issue: "The university had previously announced that this player was transferring from Louisiana State to Michigan." This is not true. Underwood had committed to LSU but then switched his commitment to Michigan. He was still in high school at the time, and has never attended LSU.
But, do you really expect a funny database prof to know much about football?
mpbart 8 days ago |
I never thought I’d see a discussion about the Underwood NIL drama on a databases blog post but here we are.
bcoates 8 days ago |
"I've never met anybody that used Alteryx"
I have! It's a pretty good no-code/minimal-code graphical ELT+Analytics in one tool. It's one of those alternate-universe tools that has its own way of doing things, different from everything else in the industry, but it’s pragmatic and the people who use it tend to love it.
The one thing that makes it viable is that it has/had (pre-acquisition) very aggressive compatibility with anything else that can hold data, so you can use it as a bolt-on to whatever other databases or files your company has.
Despite what the PE press release about the acquisition says, it has virtually nothing to do with AI, at least in the modern big NN sense.
If you're looking to fix your giant pile of alteryx workbooks or migrate them to something else, hmu
osigurdson 8 days ago |
More like: "Database license drama - a year in review".
pbrunoster 8 days ago |
Oracle, SQL Server, Db2, Informix, Teradata, for example, do not exist ... ok
nwatson 8 days ago |
Interesting seeing the death of blockchain-based AWS QLDB mentioned.
I worked at a company for a while that used QLDB as the primary system of record. The idea is great, but the problem is that due to performance and other QLDB limitations all data had to be mirrored to an RDBMS via a streaming/queuing system, and there were always programmatic errors in interpreting data arriving for import into the RDBMS: text field too long for the RDBMS field, wrong data type or overflowing integer, invalid text encoding, etc. These errors had to be noticed, debugged, and fixed, and the data had to be re-streamed. In the meantime official transactions were missing from the RDBMS side, which was used for reporting, driving the UI, deriving monetary obligations, etc. It was not worth the trouble. (I was lucky to not be involved in that design or implementation.)
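As an illustration of the failure classes listed above, here is a minimal Python sketch of a pre-import check. It is not the system we actually ran: the `Column` schema type and the `validate` function are hypothetical, and the target column is assumed to be a 32-bit integer. Records that fail are set aside for inspection rather than breaking the import.

```python
from dataclasses import dataclass

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


@dataclass(frozen=True)
class Column:
    name: str
    kind: str            # "text" or "int32"
    max_length: int = 0  # only used for text columns


def validate(record: dict, schema: list) -> list:
    """Return a list of human-readable problems; empty means safe to import."""
    problems = []
    for col in schema:
        value = record.get(col.name)
        if col.kind == "text":
            if isinstance(value, bytes):
                try:
                    value = value.decode("utf-8")
                except UnicodeDecodeError:
                    problems.append(f"{col.name}: invalid text encoding")
                    continue
            if not isinstance(value, str):
                problems.append(f"{col.name}: expected text")
            elif len(value) > col.max_length:
                problems.append(f"{col.name}: text too long")
        elif col.kind == "int32":
            # bool is a subclass of int in Python, so reject it explicitly.
            if not isinstance(value, int) or isinstance(value, bool):
                problems.append(f"{col.name}: wrong data type")
            elif not INT32_MIN <= value <= INT32_MAX:
                problems.append(f"{col.name}: integer overflow")
    return problems


schema = [Column("memo", "text", max_length=10), Column("amount", "int32")]
good = {"memo": "rent", "amount": 1200}
bad = {"memo": b"\xff\xfe", "amount": 2**40}

# Records with problems are held back for debugging and later re-streaming.
dead_letters = [r for r in (good, bad) if validate(r, schema)]
```

A check like this only moves the problem earlier. The held-back transactions are still missing from the reporting side until someone fixes and replays them.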
kopirgan 8 days ago |
Wow so much to learn reading this . Thanks
hdesh 8 days ago |
> The upcoming year is going to be the test of strength for many database startups. Nobody wants to be the next MariaDB Corporation,
The link for "MariaDB corporation" points to an empty image with white colour background. Can anyone explain the context here?
dandan7 8 days ago |
Thanks for the article. On the topic of redis: 3 executives from redis built FalkorDB (succeeded redisgraph) raising 3m to build a graphdb for better rag (ref:https://github.com/FalkorDB/GraphRAG-SDK)
uncomplexity_ 8 days ago |
thank you for your work andy!
for a moment i got reminded of the rap music in your courses
im glad that tigerbeetle got here, really impressive team they have.
there are a lot of other missing alien technologies i've discovered recently too like quickwit which is like elasticsearch but s3-compatible, and typesense which is like elasticsearch but memory-based
swyx 8 days ago |
> I need to figure out to juice my stats because in September 2024, Wikipedia removed the article about me over not having enough citations.
guys, what are we doing here. this is ridiculous. andy pavlo cannot get an article on wikipedia? have you seen his work?
phartenfeller 8 days ago |
There is a lot about Larry Ellison, but not a single word about this year's rather big Oracle release (23ai)?