Recurrent models with constant hidden state are naturally suited to streaming data, potentially opening the door to unexplored new use cases.
On a side note, and that's what led me to the link above, I wonder if it would be possible to chain N streaming LLMs in an agent workflow and get a final output stream almost instantaneously, without waiting for the N-1 upstream LLMs to complete their replies.
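Rough sketch of what that chaining could look like, using plain Python generators; `stream_llm` is a made-up stand-in for a streaming model call, not any real RWKV API. The point is that each stage only needs its running state, so the last stage can start emitting before the first one has finished.

    # Each stage consumes upstream tokens as they arrive and may start emitting
    # before the upstream stream is finished. `stream_llm` is hypothetical; a
    # recurrent model would just keep folding new tokens into a fixed-size state.
    from typing import Iterator

    def stream_llm(upstream: Iterator[str], role: str) -> Iterator[str]:
        state = []                     # stand-in for a fixed-size recurrent state
        for tok in upstream:
            state.append(tok)          # real model: state = rnn_step(state, tok)
            if len(state) % 4 == 0:    # emit incrementally instead of waiting for EOS
                yield f"[{role}:{len(state)}]"
        yield f"[{role}:done]"

    def source() -> Iterator[str]:
        yield from "please summarise this long document for me thanks".split()

    # Chain N stages; the final stream starts almost immediately.
    pipeline: Iterator[str] = source()
    for role in ["researcher", "critic", "writer"]:
        pipeline = stream_llm(pipeline, role)

    for tok in pipeline:
        print(tok, end=" ")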
The RWKV team's focus, however, is still first on the multilingual text space, then on the multi-modal space in the future.
It's one of those retrospectively obvious/genius insights that I wish I'd understood when I first met him.
This is a full drop-in replacement for any transformer model use case at sizes 32B and under, as it has equal performance to existing open 32B models in most benchmarks.
We are working on a 70B, which will be a full drop-in replacement for most text use cases.
It's definitely something we are tracking to do as well =)
2023: https://latent.space/p/rwkv
2024: https://www.youtube.com/watch?v=LPe6iC73lrc <- offers a bit of professional compare and contrast vs state space models.
I think it's cool that both RNNs and LSTMs (with xLSTM) now have modern attention-inspired variants that solve the previous issues. I wonder 1) whether it's possible to overcome the "hardware lottery" that transformers have now won, and 2) whether recurrent/selective state can do the kind of proper lookback on extremely long context that we will want it to do to compete with full attention (easy to say no, harder to propose what to do about it).
there's also Liquid AI, whatever it is that they do.
The hardware lottery, well... IMO it's really about leveraging fully parallel training to learn how to use a memory. Attention is quadratic, but it can be computed in parallel; it's an end-to-end learned memory. Getting that kind of pattern into RNNs won't be easy, but it's going to be crucial before we boil the ocean.
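To make that tradeoff concrete, here's a toy NumPy sketch (nothing RWKV-specific, shapes made up): attention does O(T^2) work but every position can be computed at once, while a vanilla RNN does O(T) work that has to run step by step.

    import numpy as np

    T, d = 8, 4
    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(3, T, d))

    # Attention: one batched computation over all T positions at once.
    scores = Q @ K.T / np.sqrt(d)                  # (T, T) -- the quadratic part
    mask = np.tril(np.ones((T, T), dtype=bool))    # causal mask
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    attn_out = weights @ V                         # (T, d), fully parallel over t

    # Vanilla RNN-style memory: constant-size state, strictly sequential loop.
    W_h, W_x = rng.normal(size=(2, d, d))
    h = np.zeros(d)
    rnn_out = []
    for t in range(T):
        h = np.tanh(W_h @ h + W_x @ V[t])          # each step needs the previous h
        rnn_out.append(h)
    rnn_out = np.stack(rnn_out)                    # (T, d)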
Arguably, along with other recurrent architectures (state space models, etc.) that have very different design implementations. The issue with the old recurrent design was just the way the LSTM was designed, not the recurrent nature itself.
In the recent RWKV7 incarnation, you could argue it's a type of linear RNN, but past versions took their previous state from a lower layer, which allowed for parallelism but made the computation closer to a convolution than to a true recurrence.
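A toy NumPy schematic of that distinction (not the actual RWKV kernels): pulling the "previous state" from the previous token of a lower layer is just a token shift and stays fully parallel, i.e. convolution-like, whereas a true recurrence feeds a layer's own previous output back into itself and has to be evaluated step by step (or with a specialised scan).

    import numpy as np

    T, d = 6, 4
    rng = np.random.default_rng(1)
    x = rng.normal(size=(T, d))        # lower-layer outputs for all tokens
    W = rng.normal(size=(d, d)) * 0.1

    # (a) "State" from the lower layer's previous token: a token shift,
    #     so every position is computed in one parallel matmul.
    x_prev = np.vstack([np.zeros((1, d)), x[:-1]])
    conv_like = np.tanh((x + x_prev) @ W)

    # (b) True recurrence: the layer consumes its OWN previous output.
    h = np.zeros(d)
    recurrent = []
    for t in range(T):
        h = np.tanh(x[t] @ W + h)      # depends on the last h of this same layer
        recurrent.append(h)
    recurrent = np.stack(recurrent)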
As for 1), I'd like to believe so, but it's hard to get people away from the addictive drug that is the easily parallelised transformer. As for 2), (actual) RNNs and attention mechanisms seem to me fairly powerful (expressivity-wise) and perhaps the most acceptable to the community.
It's possible we'll start seeing more blended RNN/attention architectures exploring different LLM properties.
In particular, the Aaren architecture in the former paper "can not only (i) be trained in parallel (like Transformers) but also (ii) be updated efficiently with new tokens, requiring only constant memory for inferences (like traditional RNNs)."
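For intuition, here's a toy sketch of that general pattern (not the Aaren architecture itself): a causal, associative aggregation that can be computed in parallel with cumulative sums at training time, and as a constant-memory running update at inference time, giving the same outputs either way.

    import numpy as np

    T, d = 5, 3
    rng = np.random.default_rng(2)
    V = rng.normal(size=(T, d))
    w = rng.normal(size=T)             # per-token scores (logits)

    # (i) Parallel form: causal softmax-weighted averages via cumulative sums.
    num = np.cumsum(np.exp(w)[:, None] * V, axis=0)   # running weighted sum
    den = np.cumsum(np.exp(w))[:, None]               # running normaliser
    parallel_out = num / den                          # output for every prefix

    # (ii) Recurrent form: same result, but only O(d) state carried forward.
    n, z = np.zeros(d), 0.0
    recurrent_out = []
    for t in range(T):
        n = n + np.exp(w[t]) * V[t]
        z = z + np.exp(w[t])
        recurrent_out.append(n / z)
    recurrent_out = np.stack(recurrent_out)

    assert np.allclose(parallel_out, recurrent_out)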
Personally I think it's important not to call some of these recent architectures RNNs because they have theoretical properties that do not match (read: they're worse) what we've "classically" called RNNs.
Ref: https://arxiv.org/abs/2404.08819
As a rule of thumb: you generally don't get parallelism for free; you pay for it with poorer expressivity.
Given limits on clock speed, massive parallelism is always going to be the way to approach brain-like levels of parallel computation, so any model architecture aspiring to human level AGI needs to be able to take advantage of that.
It's the same principle as open transformer models, where an adapter is used to generate the embeddings (rough sketch below).
However, the core team's current focus is on scaling the core text model, as this is the key performance driver, before adapting to multi-modal.
The tech is there; the base model needs to be better.
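For anyone picturing the adapter principle, a minimal NumPy sketch with made-up names and dimensions: a frozen encoder for the other modality produces features, and a small trained projection maps them into the text model's embedding space so they can be fed in alongside ordinary token embeddings.

    import numpy as np

    d_vision, d_model, n_patches, n_tokens = 768, 2048, 16, 10
    rng = np.random.default_rng(3)

    image_features = rng.normal(size=(n_patches, d_vision))  # from a frozen vision encoder
    W_adapter = rng.normal(size=(d_vision, d_model)) * 0.02  # the trainable adapter
    image_embeds = image_features @ W_adapter                # now in text-embedding space

    token_embeds = rng.normal(size=(n_tokens, d_model))      # ordinary text embeddings

    # The language model (transformer or RWKV alike) just sees a longer sequence
    # of embeddings; it does not care which modality they came from.
    inputs = np.concatenate([image_embeds, token_embeds], axis=0)   # (26, 2048)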
Not saying it's a good or bad idea, but pointing out that having a fixed state in between has interesting applications in this space.
I have a task (OCR cleaning) for which I'm evaluating faster options, and it looks like RWKV would be a nice alternative.
I keep seeing papers like “Repeat After Me” claiming serious weaknesses of state space vs transformer models. What are the current weaknesses of RWKV vs transformers? Have you mitigated them? If so, how?
The other issue is that file sharing being illegal, Wikipedia requiring derivatives to be copyleft, etc., means I can't legally train models on most data. Pre-1920s works in Project Gutenberg are totally public domain. Both the model and the training data would be 100% legal for reproducible research. Would your team be willing to train a 3B-7B model on only Gutenberg and release it to the public domain?
(Note: The Stack without GitHub Issues can be used for permissive code. However, there could be contamination issues like incorrect licenses, PII, etc. So, maybe at least one, 100% legal model. Maybe a second with Gutenberg and The Stack for coding research.)
Example use of Gutenberg:
That really depends on whether LLM pretraining ends up held as an infringing use. (Of course, it’ll take a while for the cases to work through the courts and for a body of jurisprudence to be developed on this subject.)
Making copies of and sharing copyrighted works without the authors' permission is already illegal, as proven in countless file-sharing cases. The AI trainers do this with data sets like Common Crawl, The Pile, and RefinedWeb. Just sharing them is illegal for most of the content in them.
I have ideas for how to deal with that in countries with TDM exceptions, like Singapore. For now, the only things we can share with others for model training are (a) public domain works and (b) content licensed for permissive use and sharing. Gutenberg entries before a certain year should be pretty risk-free.
P.S. Eugene, you should brag about that on the RWKV homepage.
Any kind of quickstart guide?
Also found/tried rwkv.cpp, and I can't seem to compile that either.
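Not an official quickstart, but one low-friction path is the Hugging Face `transformers` RWKV integration; the model id below is just an example, so check the hub and the RWKV repos for current checkpoints and exact usage.

    # Minimal sketch using Hugging Face transformers (model id is an example).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "RWKV/rwkv-4-169m-pile"       # small example checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    prompt = "In a shocking finding, scientists discovered"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(inputs["input_ids"], max_new_tokens=40)
    print(tokenizer.decode(output[0], skip_special_tokens=True))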