(I’m only mostly joking)
GP's post is probably the "unityped" critique of dynamically typed languages by Robert Harper: https://existentialtype.wordpress.com/2011/03/19/dynamic-lan...
But dealing with Python infrastructure is so awful as to make the whole experience just bad.
uv fixes a lot of that, but I think it will be some time before it's used everywhere, and I have zero hope that the Python devs will ever do the right thing and officially replace Pip with uv.
> Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup.
They're not to be used as a crypto digest. The downside here, then, is that if -1 and -2 are used as dict keys, they'll end up bucketed together. Apparently that hasn't been enough of an issue over the years to bother changing it. One advantage is that those values are common enough that the collision might break code that incorrectly expects hash() to have predictable outputs; if an obscure value like 0xffffffff had been used instead, lots of busted tests might never stumble across it.
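For reference, in CPython both of those values really do come out of hash() identical:

>>> hash(-1), hash(-2)
(-2, -2)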
Note that they'll only end up bucketed together in the underlying hashmap. The dictionary will still disambiguate the keys.
>>> a = -1
>>> b = -2
>>> {a: a, b: b}
{-1: -1, -2: -2}
A mitigating factor is that dict keys have to be hashable, which implies immutability, which generally implies a small number of native, fast types like str, numbers, or simple data structures like tuples. You could have some frozendict monstrosity as keys, but 1) that's not something you see often, and 2) don't do that. If you must, define a fast __eq__. For example, an object mapping to a database row might look like:
def __eq__(self, other):
    if self.pk != other.pk:
        return False  # fail quickly on a primary-key mismatch
    # Only when the primary keys match do the deep comparison
    return self.tuple_of_values == other.tuple_of_values
But again, that's just not really something that's done.

It may be an argument for ensuring that absolutely everything that is an object can hash: the object hasher must not have error states. Nobody wants the overhead of a simple hash code being wrapped in a result type.
The pigeonhole principle says there are an infinite number of inputs with the same hash output. Trying to figure out how that happens can be fun and enlightening, but why? “Because it just does, that’s all.”
It's very much trying to go _under_ the abstraction layer to investigate its behavior. Because it's interesting.
This is very similar to how people investigate performance quirks or security issues.
I am not a Python expert, but...
Python is described as OO and I thought it was dynamically typed
Not quite "without any typing" but close
Strong versus weak is not super solidly defined as referring only to values and not variables.
Correct.
> Strong versus weak is not super solidly defined as referring only to values and not variables.
You're right. It's not super solidly defined as referring only to variables and not values, either. If dynamic typing were the same as weak typing, there'd be no point in having both phrases.
> Correct.
Is it still unclear with the clarification you didn't quote? I can't tell if I need to explain more or not.
> It's not super solidly defined as referring only to variables and not values, either.
I didn't mean to suggest it was.
> If dynamic typing is the same as weak typing, there'd be no point in having both phrases.
It's not the same. There's a bunch of things referred to as "weak".
I would call that weakly typed since there is no step before your code runs that validates (or even tries to validate) that your program satisfies a "type system". But since the definition of "strong" and "weak" typing has always been vague, I will just say that it is stupidly typed[1], and that I am a pedantic nerd.
1. https://danieltuveson.github.io/programming/languages/stupid...
Using both Python and mandatorily statically typed languages regularly and professionally (my main working languages being C#, Python, and a mix of JS and TS), that's not my experience at all.
(Of course, Python has a variety of optional static typecheckers with differing degrees of type inference, as well as supporting explicit type specifications; it's not untyped unless you choose to use it that way.)
I'm pretty competent in Python, C, Java, JavaScript, assembly, and a number of other languages†, and my experience in Python is nothing like what you're describing, although I do have my own frustrations with it. It's possible that when you're more experienced you'll have a different perspective, but it sounds like you're stuck inside a pretty small bubble right now.
______
† Last year I also programmed in C++, Lua, C#, bash, Tcl, Emacs Lisp, Forth, Golang, OCaml, Scheme, Perl, Common Lisp, and the ngspice scripting language.
Yea, nah!
You cannot have both.
Particularly, there is a common definition of “typed” which is exactly equivalent to “statically typed”, under which all so-called “dynamically typed” languages are actually “untyped”, and within that system strong and weak typing are either a meaningless distinction or one within the set of statically typed languages.
There's also, of course, a common usage within which “dynamically typed” is meaningful and strong vs. weak typing is a separate distinction, usually mostly discussed within languages with dynamic typing, though languages like C being both statically typed and weakly typed has been discussed.
>>> "abc" + 321
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "int") to str
I rest my case.

I program in assembly language. Memory is just a big array of bytes. Values don't have types, but each instruction chooses what type to interpret memory as - e.g. uint8, int16, float32, x86 real-mode pointer, long-mode pointer, etc.
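A rough Python analogue, using the struct module: the same eight bytes, read back as whatever "type" the format string picks (just an illustrative sketch):

>>> import struct
>>> raw = struct.pack("<d", 1.5)    # 8 bytes holding an IEEE-754 float64
>>> struct.unpack("<d", raw)[0]     # read back as float64
1.5
>>> struct.unpack("<q", raw)[0]     # same bytes read as int64
4609434218613702656
>>> list(raw)                       # same bytes read as uint8s
[0, 0, 0, 0, 0, 0, 248, 63]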
That would mean languages like Tcl. Or, for that matter, B.
print(hash(2**61 - 2))
2305843009213693950
print(hash(2**61 - 1))
0
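That comes straight from how CPython hashes ints: they're reduced modulo the Mersenne prime 2^61 - 1 on typical 64-bit builds, exposed as sys.hash_info.modulus, so the value just below the prime maps to itself and the prime itself wraps to 0:

>>> import sys
>>> sys.hash_info.modulus                                    # 2**61 - 1 on a 64-bit build
2305843009213693951
>>> hash(2**61 - 2) == (2**61 - 2) % sys.hash_info.modulus
True
>>> hash(2**61 - 1) == (2**61 - 1) % sys.hash_info.modulus   # both sides are 0
True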
I can’t think of any other language where this kind of thing happens, which means other developers won’t expect it either.
I can see the bug report now: “certain records cause errors, occurs only for about one in a few billion.”
If you're adding a new type to the core python language then you have to be aware of this, but if you're hacking the C implementation to change the core language then you're probably pretty well versed in the cpython internals, or at least surrounded by people that are.
>>> class Foo:
...     def __hash__(self):
...         return -1
...
>>> f = Foo()
>>> print(hash(f))
-2
So if you have a custom __hash__ function that you expect to return -1 for a certain value, and you then test the value of the hash of an object rather than testing the property of the object directly, this might bite you. But I cannot think of any reasonable case where you might want to do that.

The standard (not custom) hash() function in CPython, which calls your custom method and returns the actual value CPython uses for bucketing, will return -2 in that case, though.
Which is necessary so that, e.g., following the single mathematical function for hashing of numeric types published in the Python Language Reference preserves "x==y implies hash(x)==hash(y)" across custom and built-in numeric types in CPython, where hash(-1) does not follow that pattern.

Even though for -1 the divergence from the rule is in __hash__ itself. This is also consistent with the basic rule that dunder methods are for communication with the standard machinery, but external consumers should use the standard machinery, not the dunder method.
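A small sketch of that invariant in action (the MinusOne class here is made up for illustration): an object that compares equal to -1 and follows the documented rule by returning -1 from __hash__ still ends up interchangeable with the int -1, because hash() maps both to -2:

>>> class MinusOne:
...     def __eq__(self, other):
...         return other == -1
...     def __hash__(self):
...         return -1              # the documented rule, taken literally
...
>>> hash(MinusOne()) == hash(-1)   # both come out as -2 via hash()
True
>>> {-1: "found"}[MinusOne()]      # equal objects hash equal, so lookup works
'found'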
>>> -1 in d
False
What gives?
-- https://docs.python.org/3/reference/datamodel.html#object.__...
> The general contract of hashCode is:
> * If two objects are equal according to the equals method, then calling the hashCode method on each of the two objects must produce the same integer result.
> * It is not required that if two objects are unequal according to the equals method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.
-- https://docs.oracle.com/en/java/javase/23/docs/api/java.base...
Well, you can also use an errno-like system. It has its own set of drawbacks as well, but it removes the "I need to reserve a sentinel value" problem.
In those situations, the application must begin by clearing errno to zero.
Then, it checks for a certain error value or values from the function which are ambiguous: they could be legit or indicate an error.
If one of those values occurs, and if errno is nonzero, then the error occurred.
This is how you deal with, for instance the strtol (string to long int) function. If there is a range error, strtol returns LONG_MIN or LONG_MAX. Those are also valid values in the range of long, but when no error has occurred, they are produced without errno being touched.
strtol can also return 0 in another error case, when the input is such that no conversion can be performed. ISO C doesn't require errno to be set to anything in this case, unfortunately. The case is distinguished from a legitimate zero by the original pointer to the string being stored in *endptr (if the caller specifies endptr that is not null).
ISO C and POSIX library functions do not reset errno to zero. They either leave it alone or set it to a nonzero value.
If you need to use the above trick and are working inside a C library function, you have to save the original errno value before storing a zero in it, and then put that value back if no error has happened.
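Here's a sketch of that clear-then-check dance driven from Python through ctypes, just to make the steps concrete (assumes a Unix-like libc; parse_long is a made-up helper, not a real API):

import ctypes, ctypes.util, errno, os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.strtol.restype = ctypes.c_long
libc.strtol.argtypes = [ctypes.c_char_p, ctypes.POINTER(ctypes.c_char_p), ctypes.c_int]

def parse_long(text):
    ctypes.set_errno(0)                     # step 1: clear errno before the call
    end = ctypes.c_char_p()
    value = libc.strtol(text, ctypes.byref(end), 10)
    if ctypes.get_errno() == errno.ERANGE:  # step 2: disambiguate LONG_MIN/LONG_MAX
        raise OverflowError(os.strerror(errno.ERANGE))
    if end.value == text:                   # endptr unchanged: no conversion happened
        raise ValueError("no digits found")
    return value

print(parse_long(b"42"))   # 42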
This is the kind of crap ignorant C developers do and then pretend is the language's fault.
You can jump, you can return a reference to a complex type, a tuple, whatever.
This is just wrong. C doesn't feature 'error handling' as a dedicated form of branching the way many higher level languages do and you're in no way required to use return codes as an error signal. This is a case of bad API design and is entirely python's fault.
>>> class X:
... def __hash__(self): return -1
...
>>> hash(X())
-2
So, it ends up being a matter of optimizing for the common case, but still being reasonably performant for the worst case.
It just doesn't seem to have been identified as a real issue, probably because the most commonly hashed external data is character strings. (For those, there are countermeasures like a properly scrambled hashing function, modified by a seed that can be randomized.)
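You can watch that countermeasure in action: str hashes are randomized per interpreter process (see PEP 456), while int hashes are not. A quick sketch:

import subprocess, sys

cmd = [sys.executable, "-c", "print(hash('spam'), hash(12345))"]
print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())
print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())
# The str hash (first number) almost certainly differs between the two child runs,
# unless PYTHONHASHSEED is pinned in the environment; the int hash stays 12345.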
>>> import sys
>>> {i * sys.hash_info.modulus for i in range(5000)}    # ~0.40s
>>> {i * sys.hash_info.modulus for i in range(50000)}   # ~29.34s
>>> {i for i in range(50000)}                           # ~0.06s
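For context, every key in those first two comprehensions lands on the same hash value, which is what turns the set build quadratic:

>>> {hash(i * sys.hash_info.modulus) for i in range(5)}
{0}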
What else is unmitigated shit in Python? :)