Where are the control vectors and what do people use now?

#16
by rookaw - opened

I'm wondering why no new control vectors have been uploaded here in a long time. Are control vectors not effective? What are people doing instead? It seems like a good idea that could become even more powerful if we had more knobs to turn.

It's just that the only models getting released are:

  • Large MoE models which are good at writing but too large to easily create control vectors for.
  • Small models mainly trained on synthetic data that are benchmaxxed and terrible at writing.
  • Finetunes of older large dense models which the base models' control vectors work for anyway.

@jukofyork given that this is sort of a "current state of things" thread, I figured this might be the best place to ask.

First, let me thank you for all your work and for sharing it all!

I found your abliteration script on GitHub here: https://github.com/Sumandora/remove-refusals-with-transformers/issues/1#issuecomment-2156641126
And your explanation comment (which was also very helpful) here: https://github.com/FailSpy/abliterator/issues/10#issuecomment-2156659963

And so a question I have, if you don't mind, is what your current thoughts are on abliteration?
I know huihui-ai makes a lot of them now. I'm just wondering if you still think abliteration is worth it?

My second question is something I've been trying to figure out for a long time. I can't find where, but at some point you or someone else posted a comment along the lines of "layer x is the refusal layer".
What I don't understand is: why isn't that layer just omitted at runtime, i.e. simply not loaded, via a flag like --ignore-layer 4?

In other words, do you know why an entirely new model is uploaded for abliterated models?
I'm not concerned about the bandwidth but I feel like I'm missing something in understanding the bigger picture of why it's not done that way.

So I just wanted to ask those 2 questions, if you're not too busy and don't mind of course. Thank you again!

> And so a question I have, if you don't mind, is what your current thoughts are on abliteration?
> I know huihui-ai makes a lot of them now. I'm just wondering if you still think abliteration is worth it?

I haven't really used many abliterated models, so I can't say from personal experience, but I have read that it doesn't really help a lot of the recently released models much.

I suspect it's because they tend to have been trained on much more heavily filtered data, whereas the older models were trained on much more raw data and then aligned with "reinforcement learning from human feedback" (RLHF).

The RLHF likely creates a much more obvious single direction that can be removed:

https://thinkingmachines.ai/blog/lora/

whereas filtering the training data likely means that even if you remove the refusal-to-answer part, there is no knowledge in the LLM of how to answer anyway...
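
To make that concrete, here's a toy sketch of what "removing a direction" means in practice (illustrative names like `refusal_dir` and `make_ablation_hook` are my own, not from any actual script - see the scripts linked earlier in the thread for the real thing):

```python
# Toy sketch of directional ablation: if refusals really do live along
# one direction r in the residual stream, subtracting the projection
# onto r from every layer's hidden states suppresses them.
# `refusal_dir` is assumed to be a (hidden_size,) tensor found elsewhere.
import torch

def make_ablation_hook(refusal_dir: torch.Tensor):
    r = refusal_dir / refusal_dir.norm()  # unit vector

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # h <- h - (h . r) r  for every token position
        hidden = hidden - (hidden @ r).unsqueeze(-1) * r
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Hypothetical usage with a transformers-style decoder:
# for layer in model.model.layers:
#     layer.register_forward_hook(make_ablation_hook(refusal_dir))
```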

> My second question is something I've been trying to figure out for a long time. I can't find where, but at some point you or someone else posted a comment along the lines of "layer x is the refusal layer".
> What I don't understand is: why isn't that layer just omitted at runtime, i.e. simply not loaded, via a flag like --ignore-layer 4?

No, I don't think it really works like that - it's just that that layer probably has the clearest effect on the refusals, and they then take the PCA results from it and apply them to the other layers.
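
Roughly like this (a toy sketch of just the PCA step; the paired-prompt setup and variable names are my own illustration):

```python
# Collect last-token hidden states at the "refusal layer" for paired
# harmful/harmless prompts, then take the first principal component of
# the differences as the direction to ablate at *all* layers.
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    # Both inputs: (n_prompts, hidden_size), collected at the chosen layer.
    diffs = harmful_acts - harmless_acts
    diffs = diffs - diffs.mean(dim=0, keepdim=True)  # centre for PCA
    # First right-singular vector of the centred differences
    # = first principal component.
    _, _, vh = torch.linalg.svd(diffs, full_matrices=False)
    return vh[0]  # (hidden_size,)
```

And as far as I know, the usual next step is to bake that projection permanently into the weight matrices rather than hook anything at runtime - which is also why abliterated models end up being uploaded as whole new checkpoints instead of shipping as a flag.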

Thank you @jukofyork, I appreciate you taking the time to help me.

> but I have read that it doesn't really help a lot of the recently released models much

I vaguely remember reading that somewhere too; it makes me wonder if there will eventually be a way around that.

And thank you for explaining how the PCA results are applied too - that's very helpful.
