So far my impression is that there are a whole lot of other things I need to worry about in Postgres before I spend time considering serial vs. random UUID vs. ordered UUID. Am I wrong here and this is something that really matters and you should invest more time in?
Postgres is happier with sequence IDs, but keeping Postgres happy isn't the only design goal. It does well enough for all practical purposes if you need randomness.
See Snowflake IDs for a scheme that gives you the benefits of random UUIDs but is strictly increasing. It's essentially UUIDv7, but it fits in your bigint column. No entropy required.
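For illustration, here's a minimal sketch of a Snowflake-style generator in PL/pgSQL. The names, the custom epoch, and the exact bit layout are all assumptions; a real implementation would also coordinate shard IDs and handle clock skew:

CREATE SEQUENCE snowflake_seq MINVALUE 0 MAXVALUE 4095 START 0 CYCLE;

CREATE FUNCTION snowflake_id(shard_id int DEFAULT 1) RETURNS bigint AS $$
DECLARE
    custom_epoch bigint := 1704067200000;  -- 2024-01-01 UTC in milliseconds (arbitrary)
    now_ms bigint := (extract(epoch FROM clock_timestamp()) * 1000)::bigint;
BEGIN
    -- 41 bits of milliseconds since the epoch | 10 bits of shard id | 12 bits of sequence
    RETURN ((now_ms - custom_epoch) << 22)
         | ((shard_id & 1023)::bigint << 12)
         | nextval('snowflake_seq');
END;
$$ LANGUAGE plpgsql;

The result is a 63-bit, time-ordered value that fits in a signed bigint for roughly 69 years from the chosen epoch.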
It literally is the one thing in the entire stack that must always be happy. Every stateful service likely depends on it. Sad DBs means higher latency for everyone, and grumpy DBREs getting paged.
There are some pragmatic differences I've found, though - generally, DBAs are less focused on things like IaC (though I know at least one who does), SLIs/SLOs, CI/CD, and the other things often associated with SRE. So DBRE is SRE + DBA, or a DB-focused SRE, if you'd rather.
I don’t think people grasp how far a single RDBMS server can take you. Hundreds of thousands of queries per second are well in reach of a well-configured MySQL or Postgres instance on modern hardware. This also has the terrific benefit of making reasoning about state and transactions much, much simpler.
Re: last bit of performance, it’s more than that. If you’re using Aurora, where you pay for every disk op, using UUIDv4 as PK in Postgres will approximately 7x your IOPS for SELECTs using them, and massively (I can’t quantify it on a general basis; it depends on the rest of the table, and your workload split) increase them for writes. That’s not free. On RDS, where you pay for disk performance upfront, you’re cutting into your available performance.
About the only place it effectively doesn’t matter except at insane scale is on native NVMe drives. If you saturate IOPS for one of those without first saturating the NIC, I would love to see your schema and queries.
Like any sort of optimization I believe this will depend on your workload and what's important.
For me when I switched to UUIDv7 a few months ago, it was basically no effort to switch v4 to v7 on a relatively new system. I was observing much higher batch insertion latencies than I expected, and producing inserts that touch fewer pages of the index's b-tree created a very noticeable reduction in insertion latencies. But my workloads and latencies may look nothing like yours. On Amazon RDS instances with EBS volumes and relatively low memory, insertion latency stood out, so using strategies that reduce the number of disk blocks needed has an outsized performance impact.
This of course would produce different results on different hardware / system sizing.
> Am I wrong here and this is something that really matters and you should invest more time in?
Specifically, no - you don't need to worry about it. Reconfiguring your tables to use a different style of unique identifier (assuming they already have one) is a bit of a pain, but no more so than any other column rename: to minimize downtime you add the new column, migrate data into it, deploy code that uses the new column, and finally retire the old column. Even if the previous version of the table lacked any sort of unique key, it is still possible to add one after the fact (keeping them in sync properly is technically a bit harder, but it can be done safely).
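A rough sketch of that dance, assuming a hypothetical "orders" table and gen_random_uuid() (built in since Postgres 13; older versions need pgcrypto):

ALTER TABLE orders ADD COLUMN public_id uuid;
UPDATE orders SET public_id = gen_random_uuid() WHERE public_id IS NULL;  -- backfill, in batches on big tables
ALTER TABLE orders ALTER COLUMN public_id SET NOT NULL;
ALTER TABLE orders ADD CONSTRAINT orders_public_id_key UNIQUE (public_id);
-- deploy code that uses public_id, then eventually retire the old column
-- (dropping a primary key column needs more care than a plain DROP COLUMN)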
It's just a question of the cost of doing so versus the benefits. I work in a system that exclusively uses integral keys, and our data is such that we don't really suffer any downsides from that choice. If you're working in a larger system with less confidence in the security practices of other teams, then avoiding sequential keys, so that you have obscurity to fall back on if someone really drops the ball on real security, isn't the worst idea... but I think the really compelling reason to prefer UUIDs is the power of distributed generation. That really only applies to inherently decentralized or astoundingly large products, though - and if your product eventually grows to astoundingly large, you'll have plenty of time to switch first (probably the wake-up call will be closing in on running out of 4-byte serial keys).
As a DBRE, I believe it always matters, and you should invest time in it. Pragmatically, it is unlikely to have noticeable effects until your tables are at least in the 1E5 rows range, if not higher. Unfortunately, by that point, it’s likely that you’ll find other things of higher importance to deal with, so the problem will only grow.
Using completely random UUIDs is IMO the worst choice. It’s fine right up until it isn’t, and then you are stuck in hell with no good way out.
Can't say that without context. I've worked in systems where even the version bits were randomized, and for good reason (I know: technically, that's no longer a UUID).
Most threat models for database IDs do not require their creation timestamp to be secret. Meanwhile every use case for database IDs requires them to be looked up in an index.
Of course production crashed when their latest SQL query did a multi-join on the real table with a few million rows. The UUIDs needed for the join filled the available RAM, everything slowed to a crawl, and the system collapsed.
The uuid as primary key can be seen as a risk multiplier: it will amplify any performance issue you may have, converting a temporary slowness into a full system stop.
TSID: > A Java library for generating Time-Sorted Unique Identifiers (TSID).
Wouldn't this TSID thing be more useful if it were implemented as a set of PostgreSQL stored procedures or something than a Java library? Not everyone uses Java.
- not everybody uses Postgres (either way, somebody will be left out)
In SQLite, my assumption was that the consensus was towards UUIDv4 rather than v7 because it meant less likelihood of page cache contention during transaction locks? Would that not also roughly map onto a Postgres-flavored system? Or does Postgres only have row-level locking?
- UUIDv4 causes page fragmentation (ideally, on disk pages of data would be stored in sorted fashion; HDDs still exist)
- if in your app fresh records have more reads, having them close will help with read speed (being cpu cache friendly)
Page fragmentation from UUIDv4 can be desirable for SQLite because repeatedly marking the same page dirty gives SQLite little opportunity to sync it to disk, causing contention. In this scenario your Write Ahead Log (WAL) file just grows and grows.
SQLite introduced WAL2 which means switching between 2 WAL files instead of one, helping to mitigate unbounded contention. But it's recent. Most tools don't have SQLite supporting that yet.
They often use tricks like including the MAC address of the generator machine and other ways to increase uniqueness assurances.
It was my understanding that uuids are simply very very unlikely to duplicate in situations with random generation.
> Implementations SHOULD NOT assume that UUIDs are hard to guess. For example, they MUST NOT be used as security capabilities (identifiers whose mere possession grants access). Discovery of predictability in a random number source will result in a vulnerability.
https://datatracker.ietf.org/doc/html/rfc9562#name-security-...
That is, if you aren't trying to have a collision and following the algorithm that has that aim, a collision can be exceedingly unlikely; but they can still be easy to guess if you are trying to predict someone else's assignment.
A sequence is easy to guess, but will never collide until it wraps back to 1.
n1=1,n2=2,…
(Caveat: if generated only by single node/thread)
A computer random number generator with a seed deterministically based on MAC address is another less trivial demonstration of the idea.
Cryptographic hashes are usually also dependent on RNGs having good properties (encryption has been broken due to bad RNGs too), but they are a bit more involved to figure out even if you know the input randomness, and are easier to invalidate (change the secret).
√(2^256)×1.1774÷8000000000÷3600÷24÷365 = 1,588,059,910,945,875,138,261
UUIDs are more likely to collide but still basically impossible. You'd have to generate √(2^122)×1.1774 = 2714899559048203259 to have 50% chance of a collision. Just to store the UUIDs for that database would take 39,506,988 TB of space. If you aren't thinking about your database not fitting on millions of drives, don't think about UUID collisions.
- https://alex7kom.github.io/nano-nanoid-cc/ - https://devina.io/collision-calculator
- https://commitfest.postgresql.org/48/4388/ (original patch created live https://www.youtube.com/watch?v=YPq_hiOE-N8)
- https://postgres.fm/episodes/uuid
- https://postgres.fm/episodes/partitioning-by-ulid
- https://gitlab.com/postgres-ai/postgresql-consulting/postgre...
Generally, the principles of separation of concerns and mechanical sympathy should be top of mind when designing a lasting and purposeful database schema.
Finally, since folks often say “I like stripe’s typed random IDs” in these kind of threads: Stripe are lying when they say their IDs are random. They have some random parts but when analyzed in sets, a large chunk of the binary layout is clearly metadata, including embedded timestamps, shard and reference keys, and versioning, in varying combinations depending on the service. I estimate they typically have 48-64 bits of randomness. That’s still plenty for most systems; you can do the same. Personally I am very fond of base58-encoded AES-encrypted bigserial+HMAC locators with a leading type prefix and a trailing metadata digit, and you can in a pinch even do this inside the database with plv8.
CPU works just as fast with both, however your CPU cache is limited and you'll put more 4-byte integers into your L1.
I don't really understand what you want to convey. The CPU is very fast with any integer size; there's no performance penalty for using a 1-byte integer compared to an 8-byte integer. There is a performance penalty when your L1, L2, or L3 cache is busted and you need to go to the next memory layer.
They don't. CPUs have as many issues with data alignment as with cache sizes.
As individual variables, they take exactly the same space as 64-bit integers, both in registers and in memory (i.e. in the stack used for automatic variables or in the heap used for dynamically-allocated variables), because of the alignment requirements.
Therefore it never makes sense to use other integers than 64-bit, except as members of aggregate data types.
A database table is an aggregate data type, so the use of small integers in some columns may be justified, but such types shall not be used for variables in the programs that process data from the tables, where misuse of the small integers may cause overflows and there is no advantage from using smaller integer sizes.
You’re going to know well in advance before hitting this limit becomes an issue, and you’ll have plenty of time to either take a bit of downtime and do a column conversion, or do an online migration.
I didn't get your point. At 2^31 rows you definitely need bigint already. When the table is much smaller, the overhead is also much smaller.
Also, per docs (https://www.postgresql.org/docs/current/storage-page-layout....), each postgres row has 23 bytes overhead, so your transition from 8->4 bytes will bring marginal improvement.
As to the maximal 8.6 GB mentioned, that's not nothing, _especially_ for RAM. Disk is cheap, but RAM isn't. If you have a smaller instance – say, an r6i.xlarge on AWS (4 vCPU, 32 GB RAM), that's 1/4 of the entire DB's memory.
[0]: https://www.postgresql.org/docs/current/storage-page-layout....
The fact that this overhead exists has nothing to do with the 8.6 GB of wasted space on the INTs.
I disagree that it's orthogonal, and explained why. I guess let's agree to disagree.
For example I have a table that has about a billion rows and uses bigserial, but that table references about 6 other much smaller tables that use serial. I'm saving 48 bytes per row, or 90GB in total. It's a fairly significant save, and that's just on this one table: I have a bunch of tables like this. If I had bigint'd everything my disk space would be roughly double. And all of that is not even considering the extra index size and memory requirements.
Using bigint here would add absolutely nothing. I'm never going to have billions of users. I'm never going to have billions of different operating systems. There will never be billions of countries. I can probably get away with smallint for some of those, but I felt the extra headroom for that was worth the extra bytes.
This is why "bigint by default" is just bad advice IMHO. You will always need to think about it. Of course you shouldn't prematurely optimize integer sizes, but it's also not true that it doesn't matter, because it does. "Better safe than sorry" also applies in the other direction: "zomg we're wasting tons of disk space and spend much more money on servers than we need to" just because someone unthinkingly applied some stupid dogma they read on HN and didn't spend 4 seconds thinking about it.
You're saving 24 bytes per row by downsizing 6 columns from 8 bytes to 4, which is a fraction of your table size. If your system is sensitive to such a change, you likely should optimize something else.
> Using bigint here would add absolutely nothing. I'm never going to have billions of users. I'm never going to have billions of different operating systems.
I think you cherry-picked some fictional examples which you won't combine in one table, and your real entities are different. It adds process safety: you don't need to play guessing games to estimate how the system can grow or change in N years.
What an incredibly rude and dismissive accusation. Here's my table: https://github.com/arp242/goatcounter/blob/master/db/schema.... – number of IDs is actually 7, not 6.
I can give a lot more details and context on all of that and why it works the way it works and the savings are certainly not insignificant and theoretical, but save me real money in server costs every month. But I have little interest in talking to you further since you're just going to insult everyone who tries to inject the slightest nuance into "always use bigserial"-extremism. So good day to you.
the point is that some/many of those ids are not timezone or country, and can potentially grow high in cardinality.
I said you described a fictional scenario, and it appears to be true: your table's ID fields are very different from timezone, country, and user_id.
Unrelated, I quite enjoyed reading your blog posts. Cheers!
This isn’t even optimization, it’s just understanding your tools and needs. It’s akin to profiling your app under load, seeing that at worst it needs 1 GiB of RAM allocated, and then giving it 8 EiB just in case.
By all means, if you can reasonably predict that a given table will near or pass 2^31 rows in the near future, just set it to BIGINT now. But as has been pointed out, most apps are never, ever going to have close to 2 billion users, or for that matter 2 billion anything. The only thing I can think of that might reasonably run up against that would be invoices for large scales, or object metadata for something like Notion.
all computers will be fine with 640kb of ram
Differences can appear in multicolumn indexes because two ints takes 8 bytes while two bigints takes 16, however the right layout of columns for an index is not always the layout that minimizes padding.
CREATE TABLE foo
(id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, iid INT NOT NULL);
CREATE TABLE bar
(id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, iid BIGINT NOT NULL);
CREATE TABLE baz
(id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, iid BIGINT NOT NULL);
-- fill each with 1,000,000 rows, then index
CREATE INDEX {foo,bar,baz}_iid_idx ON {foo,bar,baz}(iid);
SELECT table_name,
pg_size_pretty(pg_table_size(quote_ident(table_name))) "table_size",
pg_size_pretty(pg_indexes_size(quote_ident(table_name))) "index_size"
FROM information_schema.tables
WHERE table_schema = 'public';
table_name | table_size | index_size
------------+------------+------------
foo | 35 MB | 43 MB
bar | 42 MB | 43 MB
baz | 42 MB | 43 MB
`foo` has an INT followed by an INT, and its table size is 35 MB. `bar` has an INT followed by a BIGINT, and its table size is 42 MB; this is the same size as `baz`, despite `baz` being a BIGINT followed by a BIGINT.

As your example shows, there is no benefit in index size (e.g. for supporting FKs) in going from int to bigint for a single key. You end up with the same index size no matter what, not twice the size, which was what I took your original post to mean.
> IMO using bigserial by default is wrong. Use whatever data type is appropriate. Not every table will grow to 4 billion rows and not every table will grow to even 60k rows
The implication I took from that was that they were suggesting using serial over bigserial. My comment was pushing back on that.
Thus, not only would DB ids in principle never show in the UI, they wouldn't even be stable between sessions.
This would protect against a number of security attacks. (I don't know what I'd do about bookmarking, though I suspect most apps have a 80/20 solution they could deploy.) DB ids going out to a user really ought to be considered at least a smell, but it's pretty ingrained into our culture.
When anchors are needed I use a different column for that - not the primary index. (Usually some other unique, already-present value like the SKU or whatever.)
The security implications though are substantial. I don't leak primary key info even if the PK is serialized. (These days I'm all-in on UUIDs but I have older legacy systems to deal with too.)
It’s like hosting an internal app at a company that contains a SQL injection. “Well, if a hacker can access this app, then that’s a problem that needs addressing either way.” Sure, that may be true, but it’s also true that you’re not a good software engineer.
It's nothing like that at all because the wrong SQL injection can completely ruin people's lives due to leaking stuff it shouldn't whereas the worst an int exhaustion can do is bring some app offline. Whoopdie-doo. Okay, that's not brilliant, but it's not comparable at all.
And there's a reason there aren't tons of "int exhaustion attacks": because there's little point in doing so.
This does happen and break people. You usually don’t hear about it (except on the SRE side) because it is so obvious when it happens to someone they really don’t like talking about it.
"Denial of service" is just not a security bug on the same level as "zomg my credit card is stolen". I have argued this many times before, e.g. https://news.ycombinator.com/item?id=39377795 – It's easy to knock most sites offline regardless just by flooding it with traffic, don't need long-term plans like this (which are likely to be detected before it even has effect).
> This does happen and break people
I have never heard about a deliberate attack using this method.
Defending against denial of service attacks is a gradual, multifaceted problem that is pretty unlikely to hinge on the size of database keys.
If your system is dimensioned to serve a few hundred users then it's not going to be a whole lot more robust if it can theoretically create thousands of trillions of user records rather than just hundreds of millions.
In fact, infinite scalability is a risk in its own right because it can bankrupt you, which is arguably worse than a temporary outage.
That said, I tend to use 64 bit IDs by default as well unless there is a good reason not to (because, you know, what if I'm the next Google?!?).
All it requires is for there to be billions of rows. 32 bits is nutty small considering how little overhead there is in storing 64 bits instead.
Never again.
2 digit years don’t even support the use case of ‘store the current year’ fully.
This is especially true for PostgreSQL which increments sequences for upserts (even if no records are inserted).
That's how I've hit 32-bit limits on tables that had only a couple million rows.
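A quick illustration of that sequence burn, on a hypothetical table:

CREATE TABLE visits (id serial PRIMARY KEY, url text UNIQUE);
INSERT INTO visits (url) VALUES ('https://example.com');                                -- gets id 1
INSERT INTO visits (url) VALUES ('https://example.com') ON CONFLICT (url) DO NOTHING;   -- inserts nothing, but still advances the sequence
INSERT INTO visits (url) VALUES ('https://example.org');                                -- gets id 3, not 2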
---
I would only use 32-bit for very selective items that is used in a lot of FKs, like a tenant ID.
Secondly, it's not XXXk rows currently - you have to consider the complete lifetime of the table: when rows are deleted/created and how often.
So what you've said is true, but the set of cases where smallint is appropriate is much smaller than expected.
For queue tables you can even use `CYCLE` to do that automatically.
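For example, a sketch (only safe when old rows are gone by the time the sequence wraps around):

CREATE TABLE job_queue (
    id integer GENERATED ALWAYS AS IDENTITY (CYCLE) PRIMARY KEY,
    payload jsonb NOT NULL
);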
Outside of that, put absolutely no thought into it and pick a bigint/bigserial.
Trying to be clever about this ignores the fact that tables and their usage change, and what you believe now will not be correct later.
And when you realise you’ve been too clever, it’s 3am on a Sunday, your production database has exploded and nothing works. You now need to rewrite 30 tables and their foreign keys ASAP with upper management breathing down your neck.
Personally I would only use an int PK in a table where you know the bounds of the data with very high confidence. Like, at my last job the software involved working with our client's physical store locations. If we managed to capture the entire North American market for the type of business that we worked with, we'd be looking at about 50k stores.
Just goes to show you can’t inadvertently disclose anything these days.
You can infer the time the system took between two UUIDv7 IDs.
They can only be used if they're not shown to the user (so not in the form mysite.com/mypage?id=0190854d-7f9f-78fc-b9bc-598867ebf39a).
A big serial starting at a high number can't provide the time information.
If you use only uuid in your outwards facing api then you still have the problem of slow queries. Since you need them to find the object (as mentioned below)
UUIDv7 has a random part, can be created distributedly, and indexes well.
It’s the best choice for modern applications that support distributed data creation.
For me the bigger thing is the randomness. A UUID being random for a given row means the opposite is also true: any given index entry points to a completely random heap entry.
When backfilling this leads to massive write amplification. Consider a table with rows taking up 40 bytes, so roughly 200 entries per page. If I backfill 1k rows sorted by the id then under normal circumstances I'd expect to update 6-7 pages which is ~50kiB of heap writes.
Whereas if I do that sort of backfill with a UUID, I'd expect each row to land on a separate page. That means 1k rows backfilled is going to be around 8 MB of writes to the heap.
We do use a custom uuid generator that uses the timestamp as a prefix that rotates on a medium term scale. That ensures we get some degree of clustering for records based on insertion time, but you can't go backwards to figure out the actual time. It's still a problem when backfilling and is more about helping with live reads.
This is problematic if you try to depend on the ordering. Nothing is stopping some batch process that started an hour ago from committing a value 100k lower than where you thought the sequence was at. That's an extreme example but the consideration is the same when dealing with millisecond timeframes.
Also, the external thing isn't just for exposing it out to your own apps via APIs, but way more importantly for providing an unmistakable ID to store within external related systems. For example, in your Stripe metadata.
Doing this ensures that ID either exists in your own database or does not, regardless of database rollbacks, database inconsistencies etc. In those situations a numeric ID is a big question mark: Does this record correspond with the external system or was there a reuse of that ID?
I've been burnt taking over poorly managed systems that saved numeric IDs externally, and in trying to heal and migrate that data, ran into tons of problems because of ill-considered rollbacks of the database. At least after I leave the systems I build won't be subtly broken by such bad practices in the future.
Ordering over an insertion timestamp is not enough if two records may have the same timestamp: You may miss a record (or visit a record twice) across multiple queries.
The UUID sorting works in the common case, but if you happen to end your batch near the current time, you still run the risk of losing a few records if the insert frequency is sufficiently high. Admittedly this is only a problem when you are batching through all the way to current insertions.
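For the same-timestamp problem, the usual fix is keyset pagination with the id as a tie-breaker, roughly like this (hypothetical table and columns; $1/$2 are the last-seen values from the previous batch):

SELECT * FROM events
WHERE (created_at, id) > ($1, $2)
ORDER BY created_at, id
LIMIT 1000;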
Also, security can be built around not allowing querying records which are not yours.
I'm all for a little security through obscurity including UUIDs but it shouldn't be the sole thing. Easier to generate a UUID for the sequential and let the database do what it does best (relate many serials among each other).
The other part is being able to use what's built into the database out of the box without a lot more configuration.
Selfishly, I always appreciate learning more about Postgres though :)
You would be immediately leaking how many orders a day your business is getting with sequential id.
Which is fine for almost all of them. All brick and mortar stores "leak" this too; it's really not that hard to guess number of orders for most businesses, and it's not really a problem for the overwhelming majority.
And "Hi, this is Martin, I'd like to ask a question about order 2bf8aa01-6f4e-42ae-8635-9648f70a9a05" doesn't really work. Neither does "John, did you already pay order 2bf8aa01-6f4e-42ae-8635-9648f70a9a05" or "Alice, isn't 2bf8aa01-6f4e-42ae-8635-9648f70a9a05 the same as what we ordered with 7bb027c3-83ea-481a-bb1e-861be18d21ea?"
Especially for order IDs, UUIDs are a huge PITA because, unlike user IDs and other more "internal" IDs, people can and do want to talk about them. You will need some secondary human-friendly unique ID regardless (possibly obfuscated, if you really want to), and if you have that, then why bother giving UUIDs to people?
For extra spice, in some places this is legislated as a per-seller sequence, and in others as a per-customer sequence, so there’s no policy you can apply globally, and this once again highlights the separation of concerns between a primary key and a record locator/identifier.
Same in France. I thought it was a pretty common requirement.
That’s surprising. In Denmark gaps are not allowed. You have to account for all invoices and if you have an invoice numbered 50, then you have at least 50 invoices to account for.
However you are allowed to have multiple sequences, differentiated through a prefix, but all starting at 0. That’s what we recently did to switch invoice generation tools (we actually still run both of them alongside each other atm).
Of course you could still drop some invoices from the end when you do this, but I guess tax authorities accept that risk.
Are you then not doing security by randomness if that is the thing that worries you?
1. The UUIDs should be ordered internally, for B-tree performance
2. The UUIDs should not be ordered externally, for security reasons
Why not use encryption? The unencrypted ID is a sequential id, but as soon as it leaves the database, it's always encrypted. Like, when getting it out:
SELECT encrypt(id) FROM table WHERE something = whatever;
and when putting stuff in: UPDATE table SET something = whatever WHERE id = decrypt(<encrypted-key>)
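A minimal sketch of such helpers, assuming the pgcrypto extension and a placeholder key (single-block AES, so a given id always maps to the same token; the key obviously needs real management):

CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE FUNCTION encrypt_id(id bigint) RETURNS text AS $$
    SELECT encode(encrypt(int8send(id), 'sixteen byte key'::bytea, 'aes'), 'hex');
$$ LANGUAGE sql IMMUTABLE;

CREATE FUNCTION decrypt_id(token text) RETURNS bigint AS $$
    SELECT ('x' || encode(decrypt(decode(token, 'hex'), 'sixteen byte key'::bytea, 'aes'), 'hex'))::bit(64)::bigint;
$$ LANGUAGE sql IMMUTABLE;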
Seems like the best of both worlds, and you don't need to store separate things.

My take is it's rarely necessary to have a token, that you give to an external entity, that has any embedded metadata at all - 99.9% of apps aren't operating at a scale where even a million-key hashmap sitting in RAM and syncing changes to disk on update would cause any performance difference.
It seems to me: the actual value of knowing these ids/timestamps to a hacker is tiny, but it's not nothing (German tank problem and all that). Like, if a hacker was able to decode the timestamps, it's not ideal, but it's not like a catastrophe either (especially given that half the people in this thread thinks it has no security value at all). Given that threat model, a simple scheme like I suggested seems fine to me.
But it is.
If you have a password in a system, you want to rotate it regularly or at least have that ability (for example, when angry colleague leaves).
Because then you have a key management problem, which adds complexity.
The way I see it is that uuid v7 in itself is great for some use but not for all uses.
You always have to remember that a v7 always carries the id's creation time as metadata with it, whether you want it or not. And if you let external users get the v7, they can get that metadata.
I'm not a security expert but I know enough to know that you should only give the minimal data to a user.
My only guess is that v7 being so new, attacks aren't widespread for now, and I know why the author decided not to focus on "if UUID is the right format for a key", because the answer is no 99% of the time.
The use of UUIDs for documents may come from a no-sql background.
I use bigserials for relational data in relational databases, and if there is a unique document value needed, a UUID is good.
Why not just use the UUID type??
Depends on what you mean by ridiculous.
For example https://sqids.org/ ensures that there are no profanities in the generated ids. And it allows you to add additional words that you want to avoid.
You can still get vaguely offensive sequences like FKNNGR or BLKCNT, but at some point you have to put this down not to your software being offensive or hateful but to humans finding patterns in randomness.
Do you ensure that your software does not form ridiculous words in every language? Or just another US-centric thing?
The idea of avoiding identifiers to be ridiculous is ridiculous to me, honestly...
Postgres' UUID datatype: https://www.postgresql.org/docs/current/datatype-uuid.html#D...
django.db.models.fields.UUIDField: https://docs.djangoproject.com/en/5.0/ref/models/fields/#uui... :
> class UUIDField: A field for storing universally unique identifiers. Uses Python’s UUID class. When used on PostgreSQL and MariaDB 10.7+, this stores in a uuid datatype, otherwise in a char(32)
> [...] Lookups on PostgreSQL and MariaDB 10.7+: Using iexact, contains, icontains, startswith, istartswith, endswith, or iendswith lookups on PostgreSQL don’t work for values without hyphens, because PostgreSQL and MariaDB 10.7+ store them in a hyphenated uuid datatype type.
From the sqlalachemy.types.Uuid docs: https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqla... :
> Represent a database agnostic UUID datatype.
> For backends that have no “native” UUID datatype, the value will make use of CHAR(32) and store the UUID as a 32-character alphanumeric hex string.
> For backends which are known to support UUID directly or a similar uuid-storing datatype such as SQL Server’s UNIQUEIDENTIFIER, a “native” mode enabled by default allows these types will be used on those backends.
> In its default mode of use, the Uuid datatype expects Python uuid objects, from the Python uuid module
From the docs for the uuid Python module: https://docs.python.org/3/library/uuid.html :
> class uuid.SafeUUID: Added in version 3.7.
> safe: The UUID was generated by the platform in a multiprocessing-safe way
And there's not yet a uuid.uuid7() in the uuid Python module.
UUIDv7 leaks timing information ( https://news.ycombinator.com/item?id=40886496 ); which is ironic because uuids are usually used to avoid the "guess an autoincrement integer key" issue
> use “bigint generated always as identity” instead of bigserial.
The commenter you are replying to was not saying anything about whether to use UUIDs or not; they just said "if you are going to use bigserial, you should use bigint generated always as identity instead".
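i.e. something like this, on a hypothetical table:

CREATE TABLE accounts (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL
);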
Why does OT compare text and UUID instead of char(32) and UUID?
What advantage would there be for database abstraction libraries like SQLalchemy and Django to implement the UUID type with bigint or bigserial instead of the native pg UUID type?
See: https://wiki.postgresql.org/wiki/Don't_Do_This#Text_storage
Also, I think you're misunderstanding the article. They aren't talking about storing a uuid in a bigint. They're talking about having two different IDs. An incrementing bigint is used internally within the db for PKs and FKs. A separate uuid is used as an external identifier that's exposed by your API.
Chapter 8. Data Types > Table 8.2. Numeric Types: https://www.postgresql.org/docs/current/datatype-numeric.htm... :
> bigint: -9223372036854775808 to +9223372036854775807
> bigserial: 1 to 9223372036854775807
2^63 - 1 == 9223372036854775807
Todo UUID /? postgres bigint UUID: https://www.google.com/search?q=postgres+bigint+uuid :
- UUIDs are 128 bits, and they're unsigned, so: 2^128 possible values
- "UUID vs Bigint Battle!!! | Scaling Postgres 302" https://www.scalingpostgres.com/episodes/302-uuid-vs-bigint-...
"Reddit's photo albums broke due to Integer overflow of Signed Int32" https://news.ycombinator.com/item?id=33976355#33977924 re: IPv6 addresses having 64+64=128 bits
FWIW networkx has an in-memory Graph.relabel_nodes() method that assigns ints to unique node names in order to reduce RAM utilization for graph algorithms: https://networkx.org/documentation/stable/reference/generate...
I'm not quite sure what all the links have to do with the topic at hand.
Shouldn't we then link to the docs on how many bits wide db datatypes are, whether a datatype is prefix- or suffix-searchable, whether there's data leakage in UUID namespacing with the primary NIC MAC address and UUIDv7, and whether there will be overflow with a datatype less wasteful than the text datatype for uuids, when there is already a UUID datatype for uuids that one could argue to improve if there is a potential performance benefit?
For other readers: the two are almost the same in behaviour, they differ primarily in that identity columns don’t need a USAGE grant on their associated sequence, and the ALWAYS makes it harder (but despite the term not actually impossible) to override the default value on row insert.
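The override looks like this (hypothetical table):

CREATE TABLE t (id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, v text);
INSERT INTO t (id, v) OVERRIDING SYSTEM VALUE VALUES (42, 'explicitly chosen id');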
I don't know much about databases in general, BTW.
Edit: Can I use uuidv4 as PK since it doesn't leak timestamp data? Why bigserial + uuid7 is better?
Internally, your database looks like:
User
ID - uint128
external_id - UUID (of some sort)
name - string
Post
ID - uint128
UserId - uint128 (User.ID)
external_id - UUID
...
Then you have secondary indices on the external_id columns in both tables, and the UserId column on Post. You can then join from one to the other using the ID column. E.g.:
SELECT count(*) FROM
Post JOIN User
ON User.ID = Post.UserID
WHERE
User.external_id = <%USERID_FROM_REQUEST%>;
// Don't forget to bind USER_ID_FROM_REQUEST instead of
// concating the string, no Johny-Tables here!
There should be significant performance benefits from using ints for the joining keys (at least compared to strings), but my experience might be old.

Pick your poison: slower lookup or more disk usage.
In fact if your external ids are evenly distributed you can do that anyway, only index the few leading / trailing bytes of the id. That will increase lookup costs a bit but significantly decrease storage.
It’s “Bobby tables”: <https://xkcd.com/327/>
The problem is indexing and generation.
For example sqlite already generates sequential ids whether you use them or not.
Non sequential uuids suck for indexing as the table grows. There are some benchmarks and explanations to look up.
The benefit is you can merge aggregates from different DB instances without a fuss. And you can generate uuids without asking your DB.
Server 1: [1 4 7]
Server 2: [2 5 8]
Server 3: [3 6 9]
Or whatever is appropriate. You can even add/change this later on.

Again, I'm not saying it's necessarily the best solution, I'm just saying it's possible. I'm not really interested in a long discussion about uuid vs. serial because it's been done to death a million times. Previous poster said it's impossible and that's just flat-out wrong.
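A minimal sketch of that interleaving in Postgres, assuming three servers and a hypothetical users table (each server runs its own copy; adding a server later means bumping INCREMENT BY everywhere):

-- on server 1:
CREATE SEQUENCE users_id_seq START WITH 1 INCREMENT BY 3;
-- on server 2:
CREATE SEQUENCE users_id_seq START WITH 2 INCREMENT BY 3;
-- on server 3:
CREATE SEQUENCE users_id_seq START WITH 3 INCREMENT BY 3;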
I don't trust anyone who gives recommendations sight unseen.
That said, I've found that contexts can change. Personally I started with the database being local (single user). It then migrated to remote (web), after that to distributed, and from there to mobile.
As contexts have changed I've found we needed to switch to UUID (and the switch was a pain). A lot of our problems would simply not exist if we'd used uuid from the start.
So now my "default advice" is uuid, unless there's a -specific- reason not to. For me, performance is not that reason (your database outlives multiple generations of hardware, and faster hardware will dwarf optimisation-via-data-type.)
And don't get me started on natural versus surrogate identifiers- that ship sailed decades ago.)
Though I would wonder why go with UUIDs in that case at all?
# Postgres can Index Function Results (including what you described)
# Postgres does not magically know to USE those results. To use them you must e.g. JOIN (probably works for WHERE?) or ORDER or LIMIT by the same expression that went into the index (which is how it's matched up) - see the sketch below.
Generally, the Primary Key should either be 'small' (BigInt can count as such) and only used internally or be something strongly relevant to how the records in that table can be limited for most (or the most) queries.
In the old times, if you created a "partial index" with a condition of "field_foo IS TRUE" it would not match queries asking for "field_foo = TRUE" (it did actual text matching). Probably some of that is improved today.
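A sketch of both points, on hypothetical tables/columns (EXPLAIN shows whether the planner actually picks the index up):

CREATE INDEX users_lower_email_idx ON users (lower(email));
-- matched only when the query repeats the same expression:
EXPLAIN SELECT * FROM users WHERE lower(email) = 'foo@example.com';

CREATE INDEX users_active_created_idx ON users (created_at) WHERE active IS TRUE;
-- partial index: the query's predicate must imply the index's WHERE clause
EXPLAIN SELECT * FROM users WHERE active IS TRUE ORDER BY created_at DESC LIMIT 10;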
Could you share this analysis? Seems interesting.
Where does Stripe make that claim?
I think most of the value of Stripe-style IDs is in their other properties, like the information they bear, readability, copy-pasteability (which uuids lack - double-click to select often stops at hyphens), etc.
While knowing this information is useful, most services fail in different domains and problems way before you reach that point. I'm not sure people really comprehend how hard you can hit a single machine before you need to distribute a workload.
I call BS.
You can pay the cost for something upfront, plus the cost of maintaining it, and in the long term pay too much for something you didn't actually need.
Alternatively you can wait to pay it until you're certain you need it, but by then the work involved has become much more significant, in which case it can cost more than it would have to build and maintain it from the beginning.
Compounding the issue is the build-up-front scenario costs fade with time and you don't really think about them, but build-when-you-need-it always creates a stir even if the costs are less overall than build-up-front.
Either way something will go wrong no matter how many times you predict where the cards will fall.
And memory is the key factor of any sizable database.
If you are using PG, simply using its native UUID type instead of char(36) seems like a no-opportunity-cost obvious optimization choice at least though, if you have a choice?
was saying that even with that poor implementation we still were not having issues using uuids
Also, I have never seen devs (PMs, really – devs are the unfortunate souls slogging through tickets) suddenly care about performance-related tech debt. Why would they, when you can just click a button and double your DB’s hardware? Boom, problem solved… until it isn’t. Eventually, you run out of scaling, and since you probably don’t have a DBA/DBRE (else they’d have been screaming at you for months), it’s going to be extremely painful to solve now.
The bare minimum I’m asking – as a DBRE – is to use UUIDv7 and store them in Postgres’ native UUID type. That’s all. That’s an incredibly small amount of effort to put forth.
I would rather have secure data by default and opt in to optimise when it is clear this info is fine to leak.
It doesn't seem right to me to expose with such ease how many sales you are doing. It's definitely not intentional to expose it.
Random UUIDs shouldn't, but will, impact performance, which in most cases will have a more concrete impact on the business.
There is no such thing as a free lunch.
A string will use 36 bytes per row. bigserial will use 8 bytes per row. At 4 billion rows that's about 100G. Now imagine a row with 3 foreign keys to other tables with string UUIDs and you're wasting 300G (vs UUID type) or 400G (vs. bigserial), for no good reason. And doing things like "where id = ?" will be slower. You will be able to keep fewer rows cached in memory. Etc.
It's absolutely not a bikeshed. And migrating all of this later on can be a right pain, so it's worth getting it right up-front.
It's also not more effort to do things right: usually it's exactly the same effort as doing it wrong.
I've never had to move from uuids to integers. I've had to move from integers to uuids plenty of times though.
I'm not against UUIDs nor saying you should optimize everything, I'm just saying you should think about things, and that thinking about things really isn't that time-consuming or that much effort.
When UUIDs become your bottleneck you'll be celebrating for picking UUIDs, because now you can move to a distributed architecture and not worry about IDs.
If your dataset is small the overhead from UUIDs won't matter; if your dataset is large the randomness of UUIDs will save your ass when you migrate to a distributed solution.
Per the RFC:
If UUIDs are required for use with any security operation within an application context in any shape or form then [RFC4122] UUIDv4 SHOULD be utilized.
2. UUIDv7 allows to predict first half, if you know the timestamp.
3. UUIDv7 provides 62 bits of randomness compared to 122 bits for UUIDv4.
Whether that's a problem for your particular use-case or not, it's up for you to decide. I don't think that UUIDv7 is "insecure", it just provides different trade-offs and in some situations it might be less secure compared to UUIDv4, but I hardly see any attack vector where you could issue 2^62 requests to brute-force the ID.
> Timestamps embedded in the UUID do pose a very small attack surface. The timestamp in conjunction with an embedded counter does signal the order of creation for a given UUID and its corresponding data but does not define anything about the data itself or the application as a whole. If UUIDs are required for use with any security operation within an application context in any shape or form, then UUIDv4 (Section 5.4) SHOULD be utilized.
https://datatracker.ietf.org/doc/html/rfc9562#name-security-...
https://datatracker.ietf.org/doc/html/rfc9562
(which wasn't yet finished at the time of the article)
The size of that problem depends on your situation
Or, let's say you are a middleware company and you're trying to contract your services to two other companies. These companies are competitors of each other, but they're also the biggest in their space.
If you use, let's say, autoincrementing ints for orders or operations, then the companies could use the distance between order numbers as a way of calculating how much business the other business is doing.
If that functionality is possible / easy, that may be a reason for those customers to stop using you, and you've now lost one or two of your major contracts.
Apparently this has been brought up by others and is called the German Tank Problem.
And, it’s so random that if you ever do see a collision you should immediately start looking for a compromised system or bug. This is basically how GitHub discovered the Debian OpenSSL bug that had removed too much entropy from the RNG setup.
Read and take notes. This is crazy in untrusted environments.
You do not need to search. I have generated hundreds of millions and have never hit a duplicate. It's technically possible, but it's so vanishingly rare it will be something to brag about, not worry about. If you're really worried about it, and building something that cannot tolerate an error, put a try/catch around it and detect a PK failure and try again. But it will be a waste of time.
So if you are using it as a PK value, only ever insert and if there is a duplicate, blow up loudly.
Performance wise, that isn’t particularly impactful IMO.
So if there's a collision you'll get an error. You can write application code to handle the error (by re-generating a new ID and re-trying). Or you can figure it's so unlikely (googling for uuidv4 says "1 in 2.71 x 10^18", which is pretty huge; not sure for uuidv7 which will be somewhat more likely) that you aren't going to worry about it, and if an error happens, oh well, errors happen sometimes, depending on the domain you are in, which is probably what many apps do and do fine with it.
I would guess that pg's built in uuidv4-generating implementation might re-try on collision under-the-hood, but I haven't checked to see, and it may also just count on the improbability and raise an error if it happens!
I was hoping to see it in version 17, but can't get a really good read on what's going on with this feature.
For my workloads at least, I think it's worth the tradeoff of forcing Postgres to muster 15 random bytes.
I can absolutely see scenarios where it would be desirable or convenient to create these IDs in the application code. However, I can equally assert that you can, and often should, let the database be the source of truth for record identity when using database surrogate keys for that purpose... but I won't.
The truth is there are many kinds of architectures and many kinds of applications the details of which have a direct bearing on where or where not certain operations should be conducted. In many environments in which I work, the idea of "the application" can be murky. In many enterprise environments databases can be driven by and integrated into many applications, possibly from different software developers and vendors with varying degrees of influence in how they internally operate. In such environments the database can be the great unifier and placing substantial control of data in the database itself can make sense in these heterogenous application environments: but it's very much a circumstantial call.
I would expect if there really were one true way to develop application or how to use databases to achieve goals, we could well be asking why they built these functions in the first place... I suspect it's because there are a reasonable number of valid use cases to justify the lift. I'd like to think my use cases fit the envisioned valid use cases thus my interest and question.
I feel like I’m taking crazy pills. Isn’t the whole (or 99% of the) point of uuids that you can generate them client/application side for idempotency, because entity creation starts there and you want an id earlier than the first round trip completion? If they’re generated by the db (with world-knowledge) then an auto-incrementing 64-bit ID is significantly more performant.
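That idempotency pattern looks roughly like this (hypothetical table; $1 is the client-generated UUID, so a retried request becomes a no-op):

INSERT INTO payments (id, amount_cents)
VALUES ($1, 499)
ON CONFLICT (id) DO NOTHING;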
But no word on if it's delayed, or there's some problem, etc. It's still tagged as 17... and some of the related work has been committed, but not the UUIDv7 itself so it seems.
I suppose TSID works to this end, but certainly more complicated.
Traditional businesses can figure this out by sitting in the parking lot. Why SaaS has decided it’s a huge problem is beyond me.
It certainly helped the Allies in the war, as previous intelligence had the rate of tank production much higher, and they were expending a lot of effort trying to exceed that previous false number.
Information is power, etc. etc.
As to if it's worth it for your SaaS to mitigate is up to you of course. I know I notice when things like Invoice #'s, my userID, customerID, etc. get shown, and it's a small number, or small delta. And that impacts my behavior. Does it also impact someone like a VC’s behavior? Or a competitor? Unknown.
But Hell, a bunch of people right now are probably launching SaaS's with NO-OP auth flows for documents or images, or with trivial external SQL injection flaws, which is a way bigger problem.
But it doesn't make the vulnerability/problem non-existent. Also feel free to use 16 bit auto incrementing primary keys for all your core tables if you want.
But if we're in an engineering discussion on the topic, knowledgeable folks will bring up the issues with it, because that is what they do. Most would feel it is their duty to do it, even.
To be clear, this is not directed at you specifically, and I have no idea what your level of expertise is on anything.
In general, I’ve found HN commenters level of knowledge to be fairly bimodal. They’re either regurgitating things they read on a Medium blog, or they really know their shit.
Every time this topic comes up, people delightedly mention the German Tank Problem, but I have never, not once, seen anyone post an actual example of when a modern business got rekt by a competitor using knowledge gained from monotonic IDs.
Re: security viz. AuthZ, my stance is the same as SQL injection – it’s such a trivially easy problem to avoid that it shouldn’t be a consideration for harming the performance of the DB. (Not that SQL injection mitigations cause performance impacts; sorry, that analogy didn’t work well)
At $previous_job (payments provider), the sales engineers would often spot merchants using sequential order IDs and mention it to the account managers.
Rekt? I guess not, but knowing what percentage of their business we processed was extremely valuable information when it came to renegotiate the contract.
It requires physical access. Presence can be achieved by placing a dash camera or window camera strategically.
> and they were expending a lot of effort trying to exceed that previous false number.
The idea here being that if the allies just made more tanks than the axis they would automatically win? It's a lionized story of intelligence agency cleverness built squarely on the back of insane military "strategy."
No, it’s that they would set tank production quotas (and budgets) based on what they expect to need to produce to crush the enemy. Modulo whatever level of effectiveness had been apparent so far.
Which, for any given set of resources, would mean taking away resources from something else. Either manpower, steel, oil/gas, etc.
So overproducing tanks means fewer aircraft, or less artillery, etc.
Convincing the enemy you’re producing too much (or too little) of something to get them to waste their time and energy is a classic counter intelligence move.
WW2 was won largely with logistics and economic might + reasonable tactics. Applying the economic might effectively was part of that.
So much so that it calls into question the entire idea of clandestine wartime intelligence having any exceptional value.
> WW2 was won largely with logistics and economic might + reasonable tactics.
I think people who say this simply haven't counted the dead.
> Applying the economic might effectively was part of that.
Which had essentially no bearing on our problems with Japan.
The war in the pacific was started by, and almost entirely dominated by, economic matters. Japan attacked the US because of US embargoes on fuel, for instance.
And the creation of the atom bomb (which ended it) was only possible due to the insane logistical feats of the US - and its excess economic capacity and resources. Which is why Germany failed, and the USSR only got one after getting the ‘ingredients list’ post war - once the wartime resources had been freed up.
War is hell, so don’t start them. But also, don’t think modern war isn’t won (or lost) based on logistics and economic factors. Which intelligence can help with.
For me the priority is security. If I get a link (visible or invisible) that contains a numeric ID, there's the possibility to tweak that link with another number.
Ideally, the server treats that number as suspect. Every. Single. Time. In practice I only need one developer to miss the check in one place and I have a serious security issue. Like leaking my entire customer list serious. This has happened multiple times - that we know of.
Yes obscurity is not security. But sequential numbers are insecure by design, and it takes a lot of well-maintained walls to protect them. One crack leaves them very vulnerable.
I’m also going to use this as yet another example of why getting rid of QA in favor of Ship It Now was a bad idea.
See how big the delta is between the account IDs.
That gives you potential traction.
If you have access to revenue data (every investor is going to want this first thing), that gives you rough churn too.
Do the same for creating users, if you can see user Id’s.
Just hope you never have to merge tables from two databases together.
I know of only one person in my entire career that had to do this. And they managed it just fine despite working with auto-incrementing big ints.
Yet some folks advocate that all projects should pay an expensive insurance against this elusive event of two databases being merged.
I wonder how. I've had to do several big merges in my career, and it was always a nightmare because of all the external systems which were already referencing and storing those pre-existing ints. Sure, merging the databases is easy if you don't mind regenerating all the Id's, but it's not usually that simple.
They wrote a small script with the logic involved in the merging. PKs and FKs of only one database had to be incremented by an offset of max(table.pk) + safe margin.
They did this for each table.
Once this script was tested multiple times with subsets of each database, they stopped production and ran the script against it (with backup fallbacks). A small downtime window on a Sunday.
And that was it. The databases never had to pay the UUID tax, before or after.
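The core of such a script is just offsets chosen above the target's max ids, e.g. (hypothetical tables, with both datasets loaded as separate schemas; parents inserted before children so FKs stay valid):

INSERT INTO target.customers (id, name)
    SELECT id + 1000000, name FROM source.customers;
INSERT INTO target.orders (id, customer_id, total)
    SELECT id + 2000000, customer_id + 1000000, total FROM source.orders;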
Oh I see, we're talking about two entirely different worlds here, lol.
I've seen architecture astronauts make their business pay unreasonable tech insurances by adding complexity to avoid simply pausing production for some minutes when it could have been much cheaper this way.
And from my understanding, in the case I mentioned, they chose to stop production to simplify the process. But they didn't have to.
A mixture of replication plus code changes to write in two databases could also have solved the issue.
Most business die because they can't move fast enough. Not because their production database stopped for a few minutes.
"If your architecture can't withstand life threatening solar flares, third world war, sabotaging of undersea cables and 1 billion concurrent users can you even call yourself an engineer?"
Okay but why
"UUID - even though always looks similar - comes in multiple variants. Java's UUID.randomUUID() - returns UUID v4 - which is a pseudo-random value. For us the more interesting one is UUID v7 - which produces time-sorted values. It means that each time new UUID v7 is generated, a greater value it has. And that makes it a good fit for B-Tree index."
I've done a quick benchmark of /dev/urandom locally, and I get ~450 MB/s out, which is roughly 28M worth of UUIDs per second — on my Ryzen 6850U laptop. Even that amounts to sub-microsecond times.
I distinctly remember a time where you could only get a few MB/s, and external sources of true entropy were outright expensive for anything more than a few KB/s.
I personally prefer ULID since it is compat with a UUID type and you also get a timestamp lexicographically built into the ID so that sorting by ID also means sorting by timestamp. There are multiple PG extensions to make it easy to drop in and use.
I want to use something url friendly too since uuid sucks..
FWIW, I do expect UUIDv7 to take roughly half the time to generate if it has roughly half the random bits of UUIDv4 and an unprotected source of entropy is used.
If I don't want to leak the timestamp, I just use an auto generated integer along with it.
It is inspired by Stripe IDs and looks like that: user_2x4y6z8a0b1c2d3e4f5g6h7j8k
I've recently built a postgres extension for it that allows to use it like UUID: https://github.com/blitss/typeid-postgres
If your concern is globally unique identifiers (i.e. so that you can merge tables across multiple instances of your database), then UUID is exactly what you want. This is entirely what it is designed for.
If your concern is the privacy of sequence statistics, then UUID incidentally solves your problem. It may not be precisely what you want, and could continue to leak private information depending on the specific variant used. If you want privacy of sequence statistics, then I would suggest something like a sha256 hash of the primary key concatenated with a cryptographic salt stored in a separate column. These make excellent identifiers in places like APIs and URLs.
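A rough sketch of that, on a hypothetical table, assuming pgcrypto for digest() (the salt is a literal here for brevity but should be managed as a secret):

ALTER TABLE users ADD COLUMN public_id text;
UPDATE users SET public_id = encode(digest(id::text || 'per-table-secret-salt', 'sha256'), 'hex');
CREATE UNIQUE INDEX users_public_id_idx ON users (public_id);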
If you desire a unique identifier that has a high quality UX, then this is in addition to the above columns. This sequence generally has lower entropy than the cryptographic approach (or the UX would suffer), so additional measures should be taken to protect the privacy of the identifiers (e.g. expire redemption tokens after a period of time and reissue via email).
Autoincrementing integers are really nice if you don't actually need UUIDs. Lots of tricks you can apply with them.
Is there a reason not to use version 4 UUID and if time ordering is needed, save timestamp explicitly as another column?
I personally prefer the extra 128 bits of entropy, but I have no evidence that says it's more secure in practice.
Obviously, this has ramifications for data write perf, but may well be worth it depending on the use case. Technically, you could also make the integer alternate (and leave UUID as primary), but that may not de desirable for clustered / index-organized tables (for DBMSes that support them).
The dev experience of debugging with UUIDs involved degrades so much, it is depressing.
For example it's much harder to spot patterns and wrong IDs in SQL query results if you are looking at these giant blobs of random characters called UUIDs.
I'm skeptical that for the vast majority of uses, have time data in the IDs (and also not leaking it from actual timestamp fields) is an issue, and for the cases where it does, just use random IDs.
Even theoretically, is there a way to translate timestamps into a German tank problem? If I give you a sampled set of timestamps (with no other node identifiers), can you estimate the cardinality of the greater dataset?