I had a lot of trouble understanding what was going on from just the original publication [0].
Thanks for the write-up!
My favorite article on attention is Cosma Shalizi's excellent post showing that all "attention" is really doing is kernel smoothing [0]. Personally, having this 'click' was a bigger insight for me than walking through this post and implementing "Attention Is All You Need".
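To make the equivalence concrete, here is a minimal NumPy sketch (mine, not from Shalizi's post): scaled dot-product attention, computed the usual way, is numerically identical to a Nadaraya-Watson kernel smoother that uses an exponential kernel over query-key similarity.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Scaled dot-product attention as in Vaswani et al.
        d = Q.shape[-1]
        return softmax(Q @ K.T / np.sqrt(d)) @ V

    def kernel_smoother(Q, K, V):
        # Nadaraya-Watson regression: each output is a weighted average
        # of the values V, with weights from an exponential kernel
        # measuring how similar the query is to each key.
        d = Q.shape[-1]
        s = Q @ K.T / np.sqrt(d)
        w = np.exp(s - s.max(axis=-1, keepdims=True))  # kernel weights
        return (w / w.sum(axis=-1, keepdims=True)) @ V

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(4, 8))   # 4 queries
    K = rng.normal(size=(6, 8))   # 6 keys
    V = rng.normal(size=(6, 8))   # 6 values
    assert np.allclose(attention(Q, K, V), kernel_smoother(Q, K, V))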
In a very real sense, transformers are just performing compression and providing a soft lookup on top of an unimaginably large dataset (essentially the majority of human writing); see the toy sketch below the footnote. This understanding of LLMs helps to clarify both their limitations and their, IMHO untapped, usefulness.
0. http://bactra.org/notebooks/nn-attention-and-transformers.ht...
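On the "soft lookup" point: a toy sketch (the setup is hypothetical, not from the thread) of softmax attention over a key/value store acting as a differentiable dictionary. At high temperature the query retrieves a blend of stored values; as the kernel sharpens, it approaches an exact lookup.

    import numpy as np

    def soft_lookup(query, keys, values, temperature=1.0):
        # Softmax-weighted average of stored values; the temperature
        # controls how sharply the best-matching key dominates.
        scores = keys @ query / temperature
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ values

    keys = np.eye(3)                       # three orthogonal keys
    values = np.array([10.0, 20.0, 30.0])  # one stored value per key
    query = np.array([1.0, 0.1, 0.0])      # mostly matches key 0

    print(soft_lookup(query, keys, values, temperature=1.0))   # blend, pulled toward 10
    print(soft_lookup(query, keys, values, temperature=0.01))  # ~exact lookup: 10.0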
When transformers emerged, I had only minimal contact with the field. I was aware of the appearance of the Vaswani paper, but only now have I returned to the subject in a way that requires more rigour. And I stumbled over "attention" in the same way as the author; knowing more about the biological model did not help [1].
Yes, kernels. I keep asking myself what software implementations of the superior colliculus, or of retinal cell complexes like DS (direction-selective) or OMS (object-motion-sensitive) cells, could provide.
[1] for example: https://mitpress.mit.edu/9780262019163/the-new-visual-neuros...