In the article the author tried Tesseract, which uses ML and ships with neural network models, and also tried ChatGPT.
I have come to the same conclusion as the author when doing OCR that needed 100% accuracy.
When you know the font and spacing and the layout is fixed, old-school statistical analysis of the pixels works a treat.
It's a bit more effort up front, since you have to build the recognizer yourself, but at least it's done right.
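For a rough idea of what that can look like (a sketch only, not the author's actual setup; the cell size, threshold, glyph table and function names here are made up), something along these lines in Python with Pillow and NumPy goes a long way when the glyph grid is fixed:

    import numpy as np
    from PIL import Image

    CELL_W, CELL_H = 8, 12   # hypothetical fixed glyph cell size
    THRESHOLD = 128          # hypothetical binarization cutoff

    def binarize(img: Image.Image) -> np.ndarray:
        # Grayscale -> boolean array of "ink" pixels.
        return np.array(img.convert("L")) < THRESHOLD

    def read_line(screenshot: Image.Image, glyphs: dict[str, np.ndarray],
                  y: int, n_chars: int) -> str:
        # Known layout: characters sit on a fixed grid, so crop each cell
        # and pick the reference glyph that agrees on the most pixels.
        out = []
        for i in range(n_chars):
            cell = binarize(screenshot.crop((i * CELL_W, y,
                                             (i + 1) * CELL_W, y + CELL_H)))
            best = max(glyphs, key=lambda ch: (glyphs[ch] == cell).sum())
            out.append(best)
        return "".join(out)

Since the font never changes, the glyph dictionary can be built once from a labelled screenshot, and after that recognition is deterministic.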
Of course it'd be better than something that is intentionally limiting itself, but that doesn't say much on its own.
https://web.archive.org/web/20250106075631/https://nickfa.ro...
The service is indeed great; Mischa does an excellent job.
Yeah, PHP on httpd can be flaky; I wish there were a lighter solution for wikis.
To compare two images, i1 and i2:
l1 = length(gzip(i1))
l2 = length(gzip(i2))
l12 = length(gzip(concatenate(i1, i2)))
ncd = (l12 - min(l1, l2))/max(l1, l2)
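If you want to try it, here's a minimal runnable sketch of the same idea in Python (reading the inputs straight from files is just my assumption; any byte representation of the two objects works):

    import gzip

    def ncd(i1: bytes, i2: bytes) -> float:
        # Normalized compression distance: values near 0 mean the two
        # byte strings are very similar, values near 1 mean they share
        # little compressible structure.
        l1 = len(gzip.compress(i1))
        l2 = len(gzip.compress(i2))
        l12 = len(gzip.compress(i1 + i2))
        return (l12 - min(l1, l2)) / max(l1, l2)

    # Usage: compare two files by their raw bytes.
    with open("a.bin", "rb") as f1, open("b.bin", "rb") as f2:
        print(ncd(f1.read(), f2.read()))

One caveat: formats like PNG or JPEG are already compressed, so gzip can't do much more with them; feeding in raw pixel data tends to give more meaningful distances.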
Here is a nice article where I found out about this long ago: https://yieldthought.com/post/95722882055/machine-learning-t...
From the article:
"Basically it states that the degree of similarity between two objects can be approximated by the degree to which you can better compress them by concatenating them into one object rather than compressing them individually."
[1] https://en.wikipedia.org/wiki/Normalized_compression_distanc...
It probably would have added compression overhead, which in my case would have been detrimental.