I had a lot of trouble understanding what was going on from just the original publication [0].
Thanks for the write-up!
My favorite article on attention is Cosma Shalizi's excellent post showing that all "attention" is really doing is kernel smoothing [0]. Personally, having this 'click' was a bigger insight for me than walking through this post and implementing "Attention Is All You Need".
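To make the equivalence concrete, here is a minimal NumPy sketch (mine, not from Shalizi's post): scaled dot-product attention, computed the usual way, is numerically identical to a Nadaraya-Watson kernel smoother that uses an exponential kernel over query-key similarity.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Scaled dot-product attention as in Vaswani et al.
        d = Q.shape[-1]
        return softmax(Q @ K.T / np.sqrt(d)) @ V

    def kernel_smoother(Q, K, V):
        # Nadaraya-Watson regression: each output is a weighted average
        # of the values V, with weights from an exponential kernel
        # measuring how similar the query is to each key.
        d = Q.shape[-1]
        s = Q @ K.T / np.sqrt(d)
        w = np.exp(s - s.max(axis=-1, keepdims=True))  # kernel weights
        return (w / w.sum(axis=-1, keepdims=True)) @ V

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(4, 8))   # 4 queries
    K = rng.normal(size=(6, 8))   # 6 keys
    V = rng.normal(size=(6, 8))   # 6 values
    assert np.allclose(attention(Q, K, V), kernel_smoother(Q, K, V))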
In a very real sense, transformers are just performing compression and providing a soft lookup on top of an unimaginably large dataset (essentially the majority of human writing); see the toy sketch below the footnote. This understanding of LLMs helps to clarify both their limitations and their, IMHO untapped, usefulness.
0. http://bactra.org/notebooks/nn-attention-and-transformers.ht...
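On the "soft lookup" point: a toy sketch (the setup is hypothetical, not from the thread) of softmax attention over a key/value store acting as a differentiable dictionary. At high temperature the query retrieves a blend of stored values; as the kernel sharpens, it approaches an exact lookup.

    import numpy as np

    def soft_lookup(query, keys, values, temperature=1.0):
        # Softmax-weighted average of stored values; the temperature
        # controls how sharply the best-matching key dominates.
        scores = keys @ query / temperature
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ values

    keys = np.eye(3)                       # three orthogonal keys
    values = np.array([10.0, 20.0, 30.0])  # one stored value per key
    query = np.array([1.0, 0.1, 0.0])      # mostly matches key 0

    print(soft_lookup(query, keys, values, temperature=1.0))   # blend, pulled toward 10
    print(soft_lookup(query, keys, values, temperature=0.01))  # ~exact lookup: 10.0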
When transformers emerged, I had only minimal contact with the field. I was aware of the appearance of the Vaswani paper, but only now have I returned to the subject in a way that requires more rigour. And I stumbled over "attention" in the same way as the author; knowing more about the biological model did not help [1].
Yes, kernels. I keep asking myself what software implementations of the superior colliculus, or of retinal cell complexes like DS (direction-selective) or OMS (object-motion-sensitive) cells, could provide.
[1] for example: https://mitpress.mit.edu/9780262019163/the-new-visual-neuros...