Arguably the November 1996 launch of the 3dfx Voodoo kickstarted interest in GPUs and in OpenGL.
After reading that, it’s hard to take the author seriously on the rest of the claims.
If you are going to count the 3dfx Voodoo as a proper GPU and not just a rasterization accelerator, then you might as well go back further and count things like the SGI RealityEngine. Either way, 3dfx wasn't really first to anything meaningful.
The article is a very good historical one showing how three important things came together to make the current progress possible, viz.:
1) Geoffrey Hinton's back-propagation algorithm for deep neural networks
2) Nvidia's GPU hardware used via CUDA for AI/ML and
3) Fei-Fei Li's huge ImageNet database to train the algorithm on the hardware. Her team actually used Amazon Mechanical Turk (AMT) to label the massive dataset of 14 million images.
Excerpts:
“Pre-ImageNet, people did not believe in data,” Li said in a September interview at the Computer History Museum. “Everyone was working on completely different paradigms in AI with a tiny bit of data.”
“That moment was pretty symbolic to the world of AI because three fundamental elements of modern AI converged for the first time,” Li said in a September interview at the Computer History Museum. “The first element was neural networks. The second element was big data, using ImageNet. And the third element was GPU computing.”
“GeForce 256 was marketed as "the world's first 'GPU', or Graphics Processing Unit", a term Nvidia defined at the time as "a single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines that is capable of processing a minimum of 10 million polygons per second"”
They may have been the first to market with a product that fit that definition.
I don't think you can get a speedup by running neural networks on the GeForce 256, and the features listed there aren't really relevant (or arguably even present) in today's GPUs. As I recall, people were trying to figure out how to use GPUs to get faster processing in their Beowulfs in the late 90s and early 21st century, but it wasn't until about 02005 that anyone could actually get a speedup. The PlayStation 3's "Cell" was a little more flexible.
I mean, I certainly don't think NVIDIA invented the GPU—that's a clear error in an otherwise pretty decent article—but it was a pretty gradual process.
According to Wikipedia, Nvidia released its first product, the NV1, in November 1995, the same month 3dfx released its first Voodoo Graphics 3D chip. Is there reason to think the 3dfx card was more of a "true" GPU than the NV1? If not, I'd say Nvidia has as good a claim to inventing the GPU as 3dfx does.
3dfx Voodoo cards were initially more successful, but I don’t think anything not actually used for deep learning should count.
Study story problems and you end up with string theory. Study data computed from the endless world of stuff, and you find utility.
What a shock: building the bridge is more useful than a drawer full of bridge designs.
Here, opinions will differ.
I'd say theoretical proofs of impossibility tend to make valid logical deductions within the formal model they set up, but the issue is that model often turns out to be a deficient representation of reality.
For instance, Minsky and Papert's Perceptrons book, credited in part with prompting the 1980s AI winter, gives a valid mathematical proof about the inability of networks within their framework to represent the XOR function. This function is easily solved by multilayer neural networks, but Minsky/Papert considered those a "sterile" extension and believed neural networks trained by gradient descent would fail to scale up.
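A minimal sketch of that point (assuming scikit-learn; the layer size, seed, and iteration count are just illustrative choices): one hidden layer is enough for XOR, even though a single-layer perceptron within Minsky/Papert's framework provably cannot represent it.

```python
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]  # XOR

# One small hidden layer is all it takes to represent XOR.
mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                    solver="lbfgs", random_state=0, max_iter=2000)
mlp.fit(X, y)
print(mlp.predict(X))  # typically [0 1 1 0]
```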
Or, for a more contemporary example, Gary Marcus has been outspoken since 2012 that deep learning is hitting a wall - giving the example that a dense network trained on just `1000 -> 1000`, `0100 -> 0100`, `0010 -> 0010` can't then reliably predict `0001 -> 0001`, because the fourth output neuron was never activated in training. Similarly, this function is easily solved by transformers representing input/output as a sequence of tokens, thus not needing to light up an untrained neuron to give the answer (nor do humans when writing/speaking the answer).
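And a hedged sketch of the Marcus-style setup (again assuming scikit-learn; the layer size is arbitrary): train a small dense net on the first three one-hot identity pairs and ask it for the fourth - it typically produces nothing close to `0001`, since that output unit was never active during training.

```python
from sklearn.neural_network import MLPRegressor

# Train on the first three one-hot identity pairs only.
X_train = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
y_train = X_train  # identity mapping
net = MLPRegressor(hidden_layer_sizes=(16,), solver="lbfgs",
                   random_state=0, max_iter=5000)
net.fit(X_train, y_train)

# The held-out fourth pattern: the prediction is typically nowhere near [0 0 0 1].
print(net.predict([[0, 0, 0, 1]]).round(2))
```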
If I claimed that it was topologically impossible to drink a Capri-Sun, then someone comes along and punctures it with a straw (an unaccounted for advancement from the blindspot of my model), I could maybe cling on and argue that my challenge remains technically true and unsolved because that violates one of the constraints I set out - but at the very least the relevance of my proof to reality has diminished and it may no longer support the viewpoints/conclusions I intended it to ("don't buy Capri-Sun"). Not to say that theoretical results can't still be interesting in their own right - like the halting problem, which does not apply to real computers.
This is really, really common. And it's done both by mistake and in bad faith. In fact, it's a guarantee that once anybody tries anything different enough, they'll be constantly attacked this way.
When the first serious CUDA-based ML demos started appearing a decade or so ago, it was, at least to me, pretty clear that this would lead to AGI in 10-15 years - and here we are. It was the same sort of feeling as when I first saw the WWW aged 11, and knew that this was going to eat the world - and here we are.
The thing that flummoxes me is that now that we are so obviously on this self-reinforcing cycle, how many are still insistent that AI will amount to nothing.
I am reminded of how the internet was dismissed as just a fad - although this is going to have an even greater impact on how we live, and on our economies.
For me the current systems still clearly fall short of that goal.
It’s like jet engines and cheap intercontinental travel becoming an inevitability once the rubicon of powered flight is crossed - and everyone bitching about the peanuts while they cruise at inconceivable speed through the atmosphere.
Optimism is good, blind optimism isn't.
It isn’t a question of optimism - in fact, I am deeply pessimistic as to what ML will mean for humanity as a whole, at least in the short term - it’s a question of seeing the features of a confluence of technology, will, and knowledge that has in the past spurred technical revolution.
Newcomen was far from the first to develop a steam engine, but there was suddenly demand for such beasts, as shallow mines became exhausted, and everything else followed from that.
ML has been around in one form or another for decades now - however we are now at the point where the technology exists, insofar as modern GPUs exist, the will exists, insofar as trillions of dollars of investment flood into the space, and the knowledge exists, insofar as we have finally produced machine learning models which are non-trivial.
Just as with powered flight, the technology - the internal combustion engine - had to be in place, as did the will (the First World War), and the knowledge, which we had possessed for centuries but had no means or will to act upon. The idea was, in fact, ridiculous. Nobody could see the utility - until someone realised you could use them to drop ordnance on your enemies.
With supersonic flight - the technology is emerging, the will will be provided by the substantial increase in marginal utility provided by sub-hour transit compared to the relatively small improvement Concorde offered, and the knowledge, again, we already have.
So no, not optimism - just observance of historical forces. When you put these things together, there tend to be technical revolutions, and resultant societal change.
The cool thing for you is that this statement is unfalsifiable.
But what I really took issue with was your timeframe in the original comment. Your statements imply you fully expect AGI to be a thing within a couple of years. I do not.
Funny thing. I expect deep learning to have a really bad next decade, people betting on it to see quick bad results, and maybe even to disappear from the visible economy for a while, precisely because there have been hundreds of billions of dollars invested.
None of which is the fault of the technology, which I expect to have some usefulness in the long term. I expect a really bad reaction to it, coming exclusively from the excessive hype.
Again, I cannot understand for the life of me how people cannot see this.
Our labeled datasets were thousands of times too small.
Our computers were millions of times too slow.
We initialized the weights in a stupid way.
We used the wrong type of non-linearity.
I wonder what would have happened if we poured the same amount of money, talent and hardware into SVMs, random forests, KNN, etc.
I don't say that transformers, LLMs, deep learning and other great things that happened in the neural network space aren't very valuable, because they are.
But I think in the future we should also study other options which might be better suited than neural networks for some classes of problems.
Can a very large and expensive LLM do sentiment analysis or classification? Yes, it can. But so can simple SVMs and KNN, and sometimes they do it even better.
I saw some YouTube coders doing calls to OpenAI's o1 model for some very simple classification tasks. That isn't the best tool for the job.
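As a rough illustration (assuming scikit-learn; the toy texts and labels are placeholders for a real labeled corpus), a linear SVM over TF-IDF features handles this kind of classification in a few lines and trains in milliseconds on a CPU:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder data: swap in a real labeled sentiment corpus.
texts = ["great product, love it", "terrible, waste of money",
         "works as expected", "broke after one day"]
labels = ["pos", "neg", "pos", "neg"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)                         # trains almost instantly, no API calls
print(clf.predict(["absolutely love this"]))   # e.g. ['pos'] given enough real data
```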
I think neural networks are fundamental, and we will focus and experiment a lot more with architectures, layers, and the other parts involved, but emergent capabilities arise through scale.
As a simple example, if you ask a question and part of the answer is directly quoted from a book from memory, that text is not computed/reasoned by the AI and so doesn't have an "explanation".
But I also suspect that any AGI would necessarily produce answers it can't explain. That's called intuition.
What you'd get is an explanation saying "it quoted this verbatim", or possibly "the top neuron is used to output the word 'State' after the word 'Empire'".
You can try out a system here: https://monitor.transluce.org/dashboard/chat
Of course the AI could incorporate web search, but then what if the explanation is just "it did a web search and that was the first result"? It seems pretty difficult to recursively make every external tool also explainable…
The search is for something that can write an essay, drive a car, and cook lunch, so we need something new.
A Prolog query is explainable precisely because, by construction, it itself is the explanation. And you can go step by step and understand how you got a particular result, inspecting each variable binding and predicate call site in the process.
Despite all the billions being thrown at modern ML, no one has managed to create a model that does something like what Prolog does with its simple recursive backtracking.
So the moral of the story is that you can 100% trust the result of a Prolog query, but you can't ever trust the output of an LLM. Given that, which technology would you rather use to build software on which lives depend?
And which of the two methods is more "artificially intelligent"?
https://transluce.org/observability-interface
LLMs don't have enough self-awareness to produce really satisfying explanations though, no.
Simplest example: OCR. A network identifying digits can often be explained as recognizing lines, curves, numbers of segments, etc. That is an explanation, not "computer says it looks like an 8".
And is that explanation the way they obtained the "cat-ness" of the picture, or do they just see that it is a cat immediately and obviously, and when you ask them for an explanation they come up with some explaining noises until you are satisfied?
For cat vs pumpkin they will think you are making fun of them, but it very much is explainable. Though now I am picturing a puzzle about finding orange cats in a picture of a pumpkin field.
People did that to horses. No car resulted from it, just slightly better horses.
>I saw some YouTube coders doing calls to OpenAI's o1 model for some very simple classification tasks. That isn't the best tool for the job.
This "not best tool" is just there for the coders to call while the "simple SVMs and KNN" would require coding and training by those coders for the specific task they have at hand.
But that's backwards from how new techniques and progress are made. What actually happens is somebody (maybe a student at a university) has an insight or new idea for an algorithm that costs near $0 to implement as a proof of concept. Then everybody else notices the improvement, and then the extra millions/billions get directed toward it.
New ideas -- that didn't cost much at the start -- ATTRACT the follow on billions in investments.
This timeline of tech progress in computer science is the opposite of other disciplines such as materials science or the biomedical fields. Trying to discover the next superalloy or cancer drug requires expensive experiments. Manipulating atoms and molecules requires very expensive specialized equipment. In contrast, computer science experiments can be cheap. You just need a clever insight.
An example of that was the 2012 AlexNet image recognition algorithm that blew all the other approaches out of the water. Alex Krizhevsky had a new insight on a convolutional neural network to run on CUDA. He bought 2 NVIDIA cards (GTX 580 3GB GPUs) from Amazon. It didn't require NASA levels of investment at the start to implement his idea. Once everybody else noticed his superior results, the billions began pouring in to iterate/refine on CNNs.
Both the "attention mechanism" and the refinement of "transformer architecture" were also cheap to prove out at a very small scale. In 2014, Jakob Uszkoreit thought about an "attention mechanism" instead of RNN and LSTM for machine translation. It didn't cost billions to come up with that idea. Yes, ChatGPT-the-product cost billions but the "attention mechanism algorithm" did not.
>into SVMs, random forests, KNN, etc.
If anyone has found an unknown insight into SVM, KNN, etc that everybody else in the industry has overlooked, they can do cheap experiments to prove it. E.g. The entire Wikipedia text download is currently only ~25GB. Run the new SVM classification idea on that corpus. Very low cost experiments in computer science algorithms can still be done in the proverbial "home garage".
This falls apart for breakthroughs that are not zero cost to take to a proof of concept.
I think that is what the parent is referring to: other technologies might have more potential, but would take money to build out.
I thought the main insights were embeddings, positional encoding, and shortcuts through layers to improve backpropagation.
People don't even think of doing anything else, and those that might are paid to pursue research on LLMs.
Fact by definition
From my perspective, that is actually what happened between the mid-90s and 2015. Neural networks were dead in that period, but every other ML method was very, very hot.
That was the key takeaway for me from this article. I didn't know of Fei-Fei Li's ImageNet contribution, which actually gave all the other researchers the essential data to train with. Her intuition that more data would probably make the accuracy of existing algorithms better is, I think, very much underappreciated.
Key excerpt:
So when she got to Princeton, Li decided to go much bigger. She became obsessed with an estimate by vision scientist Irving Biederman that the average person recognizes roughly 30,000 different kinds of objects. Li started to wonder if it would be possible to build a truly comprehensive image dataset—one that included every kind of object people commonly encounter in the physical world.
The utility of big datasets was indeed surprising, but that skepticism came about from recognizing the scaling paradigm must be a dead end: vertebrates across the board require less data to learn new things, by several orders of magnitude. Methods to give ANNs “common sense” are essentially identical to the old LISP expert systems: hard-wiring the answers to specific common-sense questions in either code or training data, even though fish and lizards can rapidly make common-sense deductions about manmade objects they couldn’t have possibly seen in their evolutionary histories. Even spiders have generalization abilities seemingly absent in transformers: they spin webs inside human homes with unnatural geometry.
Again it is surprising that the ImageNet stuff worked as well as it did. Deep learning is undoubtedly a useful way to build applications, just like Lisp was. But I think we are about as close to AGI as we were in the 80s, since we have made zero progress on common sense: in the 80s we knew Big Data can poorly emulate common sense, and that’s where we’re at today.
Sometimes I wonder if it’s fair to say this.
Organisms have had billions of years of training. We might come online and succeed in our environments with very little data, but we can’t ignore the information that’s been trained into our DNA, so to speak.
What’s billions of years of sensory information that drove behavior and selection, if not training data?
This extends to all sorts of animal cognitive experiments: crows understand simple pulleys simply by inspecting them, but they couldn’t have evolved to use pulleys. Mice can quickly learn that hitting a button 5 times will give them a treat: does it make sense to say that they encountered a similar situation in their evolutionary past? It makes more sense to suppose that mice and crows have powerful abilities to reason causally about their actions. These abilities are more sophisticated than mere “Pavlovian” associative reasoning, which is about understanding stimuli. With AI we can emulate associative reasoning very well because we have a good mathematical framework for Pavlovian responses as a sort of learning of correlations. But causal reasoning is much more mysterious, and we are very far from figuring out a good mathematical formalism that a computer can make sense of.
I also just detest the evolution = training data metaphor because it completely ignores architecture. Evolution is not just glomming on data, it’s trying different types of neurons, different connections between them, etc. All organisms alive today evolved with “billions of years of training,” but only architecture explains why we are so much smarter than chimps. In fact I think the “evolution” framing preys on our misconception that humans are “more evolved” than chimps, when our common ancestor was more primitive than a chimp.
A recent paper tested both linguists and LLMs at learning a language with less than 200 speakers and therefore virtually no presence on the web. All from a few pages of explanations. The LLMs come close to humans.
https://arxiv.org/abs/2309.16575
Another example is the ARC-AGI benchmark, where the model has to learn from a few examples to derive the rule. AI models are closing the gap to human level: they are around 55%, while humans are at 80%. These tests were specifically designed to be hard for models and easy for humans.
Besides these examples of fast learning, I think the other argument about humans benefiting from evolution is also essential here. Similarly, we can't beat AlphaZero at Go, as it evolved its own Go culture and plays better than us. Evolution is powerful.
Then we compile and run that source code, and our individual lived experience is the training data for the instantiation of that architecture, i.e. our brain.
It's two different but interrelated training/optimization processes.
It's not just information (e.g. sets of innate smells and response tendencies), but it's also all of the advanced functions built into our brains (e.g. making sense of different types of input, dynamically adapting the brain to conditions, etc.).
Like, how good would LLMs be if their training set was built by humans responding with an intelligent signal at every crossroads?
Schooling/education feels much more like supervised training and reinforcement (and possibly just context).
I think it's dismissive to assume that evolution hasn't influenced how well you're able to pick up new behavior, because it's highly likely it's not entirely novel in the context of your ancestry, and the traits you have that have been selected for.
There's around 600MB in our DNA. Subtract this from the size of any LLM out there and see how much you get.
It's like a 600mb demoscene demo for Conway's game of life!
Our DNA also stores a lot of information, but it is not that much.
Our dogs can learn about things such as vehicles that they have not been exposed to for nearly long enough, evolution-wise. And so do crows, using cars to crack nuts and then waiting for red lights. And that's completely unsupervised.
We have a long way to go.
The human brain is absolutely inundated with data, especially from visual, audio, and kinesthetic mediums. The data is in a very different form from what one would use to train a CNN or LLM, but it is undoubtedly data. Newborns start out barely able to see, and they have to develop those neural pathways by taking in the "pixels" of the world for every millisecond of every day.
Also, separately, I'm only assuming, but it seems the reason you think these deductions are different from hard-wired answers is that their evolutionary lineage can't have had to make similar deductions. If that's your reasoning, it makes me wonder if you're using a systematic description of decisions and of the requisite data and reasoning systems to make those decisions, which would be interesting to me.
I have difficulty understanding why you could even believe in such a fallacy. Just look around you: most jobs that have to be done require barely any intelligence, and on the other hand, there exist a few jobs that do require an insane amount of intelligence.
That's baloney. The old ML adage "there's no data like more data" is as old as mankind itself.
Really? Did it take at least an entire rack to store?
My storage hierarchy goes:
1) 1 storage drive
2) 1 server maxed out with the biggest storage drives available
3) 1 rack filled with servers from 2
4) 1 data center filled with racks from 3
It's a terrible measurement because it's an irrelevant detail about how the data is stored - and if the data lives in a proprietary cloud, nobody actually knows that detail except the people who work there on that team.
So while someone could say they used a 10 TiB data set, or 10T parameters, how many "racks" of AWS S3 that is, is not known outside of Amazon.
The mid-90s had LeCun telling everyone that big neural nets were the future.
My only surprise is how long it took to get to ImageNet, but in retrospect, I appreciate that a number of conditions had to be met (much more data, much better algorithms, much faster computers). I also didn't recognize just how poorly suited MLPs were for sequence modelling, compared to RNNs and transformers.
I haven't invested the time to take the loss function from our paper and implement it in a modern framework, but IIUC, I wouldn't need to provide the derivatives manually. That would be a satisfying outcome (indicating I had wasted a lot of effort learning math that simply wasn't necessary, because somebody had automated it better than I could do manually, in a way I can understand more easily).
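For what it's worth, a minimal sketch (assuming PyTorch; the log-cosh loss below is just a stand-in for whatever the paper's actual loss was) of why the manual derivatives wouldn't be needed - autograd differentiates an arbitrary scalar loss for you:

```python
import torch

x = torch.randn(32, 10)                    # toy inputs
y = torch.randn(32, 1)                     # toy targets
w = torch.randn(10, 1, requires_grad=True) # parameters to optimize

pred = x @ w
# Hypothetical custom loss, written as ordinary tensor math.
loss = torch.log(torch.cosh(pred - y)).mean()

loss.backward()        # autograd computes d(loss)/d(w); no hand-derived gradients
print(w.grad.shape)    # torch.Size([10, 1])
```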
Most methods research went into ways of building beliefs about a domain into models as biases, so that they could be more accurate in practice with less data. (This describes a lot of PGM work). This was partly because there was still a tug of war between CS and traditional statistics communities on ML, and the latter were trained to be obsessive about model specification.
One result was that the models that were practical for production inference were often trained to the point of diminishing returns on their specific tasks. Engineers deploying ML weren't wishing for more training instances, but better data at inference time. Models that could perform more general tasks -- like differentiating 90k object classes rather than just a few -- were barely even on most people's radar.
Perhaps folks at Google or FB at the time have a different perspective. One of the reasons I went ABD in my program was that it felt like industry had access to richer data streams than academia. Fei-Fei Li's insistence on building an academic computer science career around giant data sets really was ingenious, and even subversive.
I definitely remember that bias about neural nets, to the point of my first grad ML class having us recreate proofs that you should never need more than two hidden layers (one can pick up the thread at [1]). Of all the ideas clunking around in the AI toolbox at the time, I don't really have background on why people felt the need to kill NN with fire.
[1] https://en.wikipedia.org/wiki/Universal_approximation_theore...
I've never heard this be put so succinctly! Thank you
Last week Hugging Face released SmolLM v2 1.7B, trained on 11T tokens - 3 orders of magnitude more training data for the same number of parameters with almost the same architecture.
So even back in 2019 we can say we were working with a tiny amount of data compared to what is routine now.
The same has always been true. There has never been a stance along the lines of "ah, let's not collect more data - it's not worth it!". It's always been other reasons, typically the lack of resources.
TinyLlama[1] was made by an individual on their own last year, training a 1.1B model on 3T tokens with just 16 A100-40G GPUs in 90 days. It was definitely within reach of any funded org in 2019.
In 2022 (IIRC), Google released the Chinchilla paper about the compute-optimal amount of data to train a given model, for a 1B model, the value was determined to be 20B tokens, which again is 3 orders of magnitude below the current state of the art for the same class of model.
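As a back-of-envelope sketch of that figure (the ~20 tokens per parameter ratio is the commonly quoted approximation; the paper's fitted coefficients differ slightly):

```python
# Chinchilla rule of thumb: compute-optimal training tokens ~ 20 x parameters.
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    return n_params * tokens_per_param

print(f"{chinchilla_optimal_tokens(1e9):.1e}")    # ~2.0e10 tokens for a 1B model
print(f"{chinchilla_optimal_tokens(1.7e9):.1e}")  # ~3.4e10, vs. the 11T used for SmolLM v2
```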
Until very recently (the first Llama paper IIRC, and people noticing that the 7B model showed no sign of saturation during its already very long training), the ML community vastly underestimated the amount of training data that was needed to make an LLM perform at its potential.
I started with ML in 1994, in a small, poor lab - so we didn't have state-of-the-art hardware. On the other hand, I think my experience is fairly representative. We worked with data sets on SPARC workstations that were stored in flat files and had thousands or sometimes tens of thousands of instances. We had problems keeping our data sets on the machines and often archived them to tape.
Data came from very deliberate acquisition processes. For example, I remember going to a field exercise with a particular device and directing its use over a period of days in order to collect the data that would be needed for a machine learning project.
Sometime in the 2000's data started to be generated and collected as "exhaust" from various processes. People and organisations became instrumented in the sense that their daily activities were necessarily captured digitally. For a time this data was latent, people didn't really think about using it in the way that we think about it now, but by about 2010 it was obvious that not only was this data available but we had the processing and data systems to use it effectively.
A model is grown, not crafted like a computer program, which makes it hard to predict. (More precisely, a big growth phase follows the crafting phase.)
DALL-E was the only big surprising consumer model - 2022 saw a sudden huge leap from "txt2img is kind of funny" to "txt2img is actually interesting". I would have assumed such a thing could only come in 2030 or later. But deep learning is full of counterintuitive results (like the no-free-lunch theorem not mattering, or ReLU being better than sigmoid).
But in hindsight, it was naive to think "this does not work yet" would get in the way of the products being sold and monetized.
I argue that Mikolov, with word2vec, was instrumental in the current AI revolution. It demonstrated the ease of extracting meaning from text in a mathematical way and led directly to all the advancements we have today with LLMs. And ironically, it didn't require a GPU.
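A minimal sketch of the idea (assuming gensim >= 4.0; the toy corpus is a placeholder - meaningful analogies like king - man + woman ≈ queen only emerge from a large real corpus), trained entirely on CPU:

```python
from gensim.models import Word2Vec

# Placeholder corpus: a real run would use something like a Wikipedia dump.
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "and", "a", "woman", "walk"],
]
model = Word2Vec(sentences=corpus, vector_size=50, window=3,
                 min_count=1, sg=1, workers=1)

# Vector arithmetic over embeddings: which words are closest to (king - man + woman)?
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```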
Jensen Huang, reasonably, was desperate for any market that could suck up more compute, which he could pivot to from GPUs for gaming when gaming saturated its ability to use compute. Screen resolutions and visible polygons and texture maps only demand so much compute; it's an S-curve like everything else. So from a marketing/market-development and capital investment perspective I do think he deserves credit. Certainly the Intel guys struggled to similarly recognize it (and to execute even on plain GPUs.)
But... the technical/academic insight of the CUDA/GPU vision in my view came from Ian Buck's "Brook" PhD thesis at Stanford under Pat Hanrahan (Pixar+Tableau co-founder, Turing Award Winner) and Ian promptly took it to Nvidia where it was commercialized under Jensen.
For a good telling of this under-told story, see one of Hanrahan's lectures at MIT: https://www.youtube.com/watch?v=Dk4fvqaOqv4
Corrections welcome.
By that time, GPGPU had been around for a long, long time. CUDA still doesn't have much to do with AI - sure, it supports AI usage, and even includes some AI-specific features (low- and mixed-precision blocked operations).
I don’t think they were thinking of LLMs in 2014, tbf.
TAM: Total Addressable Market
I often wonder how many useful technologies could exist if trends went a different way. Where would we be if neural nets hadn't caught on and SVMs and expert systems had.
Obvious contextual/attention caveats aside, a few thousand binary classifiers operating bitwise over the training & inference sets would get you enough bandwidth for a half-ass language model. 2^N can be a very big place very quickly.
The issue with SVMs is that they get intractably expensive for large datasets. The cost of training a neural network scales linearly with dataset size, while the kernel trick for SVMs scales with dataset size squared. You could never create an LLM-sized SVM.
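A back-of-envelope sketch of why: the Gram (kernel) matrix alone is n × n, so memory grows quadratically with the number of training examples.

```python
# Rough memory footprint of the full kernel (Gram) matrix for n training samples.
def kernel_matrix_bytes(n_samples: int, bytes_per_entry: int = 8) -> float:
    return n_samples ** 2 * bytes_per_entry

for n in (10_000, 1_000_000, 100_000_000):  # 100M samples is still tiny by LLM standards
    print(f"n={n:>11,}  kernel matrix ~ {kernel_matrix_bytes(n) / 1e12:,.3f} TB")
# n=     10,000  kernel matrix ~ 0.001 TB
# n=  1,000,000  kernel matrix ~ 8.000 TB
# n=100,000,000  kernel matrix ~ 80,000.000 TB
```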