• K0balt 3 hours ago |
    So, in layman’s terms, LoRA appears to “traumatize” the model to some degree, connecting the vector space with strong “jumpers” (intruder dimensions) to change its behavior, instead of subtly conforming the entire model into a shape that accommodates the new data.

    These jumpers or shortcuts do create connections between the relevant new concepts in the model, but because they connect them directly instead of associating them through the existing network of concepts, nuance is lost and the bypassed areas become de-emphasized, leading to forgetting of previously held associations.

    Because of this, full fine-tuning generally produces better results than LoRA, especially when forgetting of existing training is detrimental.

    Or, to further oversimplify the issue in SE terms, LoRA == monkeypatching. (Is this a kind of intruder dimension?)
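
    For anyone who wants the mechanics behind the metaphor, here is a toy PyTorch sketch of what a LoRA update actually is (shapes and init are made up for illustration, not taken from the paper):

      import torch

      d, k, r = 4096, 4096, 8        # layer dims and a small LoRA rank (illustrative)
      W = torch.randn(d, k)          # stands in for a frozen pretrained weight
      A = torch.randn(r, k) * 0.01   # trainable low-rank factor
      B = torch.zeros(d, r)          # trainable low-rank factor, starts at zero

      # Full fine-tuning would nudge every entry of W; LoRA only learns the
      # low-rank correction B @ A, so the effective weight is W + B @ A.
      # The "intruder dimension" finding is that the singular directions this
      # correction introduces don't line up with W's existing ones.
      W_effective = W + B @ A        # shape (d, k), same as W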

    • Mockapapella an hour ago |
      Thank you for this layman explanation
    • ismailmaj 40 minutes ago |
      How does it compare to partially fine-tuning the model by freezing most of the network apart from the last few layers?
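
      Something like this, just to be concrete — freeze everything and unfreeze only the last block or head (a rough PyTorch sketch with a toy model, not any specific architecture):

        import torch
        import torch.nn as nn

        # toy stand-in for a pretrained network: a stack of blocks plus a head
        model = nn.Sequential(
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 10),            # the "last few layers" / head
        )

        for p in model.parameters():       # freeze everything...
            p.requires_grad = False
        for p in model[-1].parameters():   # ...then unfreeze only the final layer
            p.requires_grad = True

        # only the unfrozen parameters are handed to the optimizer
        opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)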
  • pwillia7 2 hours ago |
    This tracks with my experience making and using Stable Diffusion LoRAs and fine-tunes. Still, given how fast they are to train and use, LoRAs have worked for me in most use cases, and it hasn't been worth fine-tuning the entire model.
    • K0balt 2 hours ago |
      Yeah, it reflects the “feel” I get from LoRA as well, especially if I overdo it. The new data becomes the preferred output even for unrelated inputs. I always felt like it was bludgeoning the model to some extent vs fine-tuning.

      Also, LoRA-tuning an extensively tuned model occasionally provokes full-on delusional “insanity” or gibberish seizures.

      I have had really good luck, though, using a highly tuned model as the training basis for a LoRA and then applying that LoRA adapter to the base version of that model. I’m not sure why that seems to work better than training the same LoRA directly on the base model.
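
      Mechanically it is at least possible because the adapter is just an additive per-layer delta, so it can be merged into whichever checkpoint has matching shapes (toy sketch, not actual training code):

        import torch

        d, k, r = 4096, 4096, 8
        W_base  = torch.randn(d, k)      # weight from the original base checkpoint
        W_tuned = torch.randn(d, k)      # same weight in the heavily tuned checkpoint

        # LoRA factors learned while training on top of the *tuned* model
        B = torch.randn(d, r) * 0.01
        A = torch.randn(r, k) * 0.01
        delta = B @ A                    # the adapter is just an additive update

        W_merged_onto_tuned = W_tuned + delta   # the usual merge
        W_merged_onto_base  = W_base + delta    # what I'm describing above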

  • AstroJetson an hour ago |
    I was excited to click the link to see how fine-tuning the LoRa frequencies I was using on my mesh network would work.

    But no, another "AI model discussion". Y'all need to start picking names for things that don't collide with others'. For example: "Rapid Unique Sentence Training", for preloading language models with non-orthogonal sentences, or "Phillips-Young Training Hyper-Orthogonal Networks", using the work of Phillips and Young to restructure orthogonal networks to be hyper-dense.

    • rjsw 36 minutes ago |
      We could switch to just referring to everything as "the thing" or "the idea".
    • sorenjan 19 minutes ago |
      > I was excited to click the link to see how fine-tuning the LoRa frequencies I was using on my mesh network would work.

      You're thinking of LoRa radio, short for Long Range. There's one of you in each LoRA comment section; I have a hard time believing it's an actual mistake made in good faith anymore.

  • sorenjan 3 minutes ago |
    > We randomly initialize A such that it has singular values of 1, freeze it, and only train B. When we do this, we see a sharp reduction in high ranking intruder dimensions in comparison to those in normal LoRA

    This sounds interesting, but I can't see that they do much with this result. Are they saving it for a follow-up paper? If their whole paper is about a big problem with LoRA and they then find what looks like an easy solution to that problem, I would think that would warrant more than a single paragraph just before the conclusion.
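
    For what it's worth, a matrix with all singular values equal to 1 is just a (semi-)orthogonal matrix, so the setup they describe looks roughly like this (PyTorch sketch; the paper may construct A differently):

      import torch

      d, k, r = 4096, 4096, 8

      # Random A with unit singular values: take Q from the QR decomposition of
      # a Gaussian matrix (orthonormal columns) and transpose it.
      A = torch.linalg.qr(torch.randn(k, r)).Q.T    # shape (r, k), A @ A.T = I
      B = torch.zeros(d, r, requires_grad=True)     # only B is trained; A stays frozen

      # the adapted weight is still W + B @ A, but A's spectrum is pinned at 1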

    It would also have been interesting if they had included the DoRA method; they reference it briefly, and that paper claims its learning behavior resembles full fine-tuning.

    But perhaps this paper is focused on LoRA behavior, and a separate paper comparing various improvements is better.