“The doom lies in yourself, not in your name.”

#15
by jukofyork - opened

Continuation of Wur doomed!.

For longer text chunks or stories, https://pastebin.com works great and helps prevent the thread from slowing down!

🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧
🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛🟧
🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧🟧
⬜🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧⬛🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜⬜🟧🟧⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜⬜🟧🟧🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛⬛🟧⬜
⬜🟧⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛🟧⬜
⬜🟧⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛🟧⬜
⬜🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧⬜

jukofyork pinned discussion

The doom is still buried within Command-A for sure.

Only another 38 days to go:

[image]

It's actually going really well and I'm pretty sure it will be mostly converged within another couple of days:

[image]

🤞

A step 601 preview - all with temperature = 0:

https://pastebin.com/GASKaHTk

https://pastebin.com/CRT81QLb

  • It's still messing up some line endings, but I can live with that if it works... it can likely be fixed later using the new class 0 random data if it turns out to be a problem.
  • The Grimdark story was noticeably (much!) better compared to the inverse.
  • The Battlestar Galactica story showed that even though Q8_0, F16 and BF16 all diverge slightly from F32, the divergence doesn't clearly make them any worse (I actually liked the Q8_0 story best!).
Size  Name
287M  command-a-03-2025-lora-Q8_0.gguf
541M  command-a-03-2025-lora-F16.gguf
541M  command-a-03-2025-lora-BF16.gguf
1.1G  command-a-03-2025-lora-F32.gguf
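
If anyone wants to try them, a LoRA GGUF like these can be applied on top of the base model at load time. A minimal sketch with llama-cpp-python is below (illustrative only: the base-model filename and settings are assumptions, and the plain llama.cpp CLI exposes the same thing via its --lora flag).

# Illustrative sketch: applying one of the LoRA GGUFs above on top of a base
# Command-A GGUF with llama-cpp-python. The base-model path is an assumption;
# substitute whichever quant of command-a-03-2025 you actually run.
from llama_cpp import Llama

llm = Llama(
    model_path="command-a-03-2025-Q4_K_M.gguf",    # assumed base quant
    lora_path="command-a-03-2025-lora-F16.gguf",   # adapter from the table above
    n_ctx=8192,
    n_gpu_layers=-1,                               # offload everything if VRAM allows
)

out = llm("Write the opening paragraph of a grimdark story.",
          max_tokens=256, temperature=0.0)
print(out["choices"][0]["text"])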

It still has a way to go before it starts to converge, but I would think by step 1000 it will be pretty close:

[image]

566 responses in the previous thread! In the future we may be the reason HF staff implement a multi-page view for discussions.

This was posted on Hacker News today:

https://outsidetext.substack.com/p/how-does-a-blind-model-see-the-earth?selection=5413dcae-b9f4-4adb-8826-d48e3908de2a#:~:text=Wow%2C%20best%20rendition%20of%20the%20Global%20West%20so%20far

Absolutely fascinating!

That was really cool. Thanks for sharing!

Yeah, and llama-3.1:405b doing so well was quite a surprise too (and it makes you a bit sad that everything seems to be moving away from large dense models).

@jukofyork Have you ruled out OLMo v2? It's a fully open language model and each branch is a checkpoint (lots of checkpoints available): https://huggingface.co/allenai/OLMo-2-0325-32B

I've been meaning to try it out, but just taking the earliest checkpoint I like from the initial stage of the pretraining pipeline. After reading https://arxiv.org/abs/2503.19206 (Overtrained Language Models Are Harder to Fine-Tune), I want to experiment with the fine-tuning results on this model at different checkpoints to see if it makes a big difference for writing.

Unfortunately the vocab size is 100k.

I've trained up a speculative decoding draft model for Mistral-Large-Instruct-2411 (and should work on Mistral-Large-Instruct-2407):

https://huggingface.co/jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0
https://huggingface.co/jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0-GGUF
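
For anyone unfamiliar with what the draft model buys you, here is a minimal, purely illustrative Python sketch of greedy (temperature = 0) speculative decoding. draft_next and target_next are hypothetical stand-ins for the small draft model and the big target model; real runtimes batch the verification step into a single target forward pass rather than looping as below.

# Toy sketch of greedy speculative decoding: the cheap draft model proposes a
# run of tokens, and the expensive target model verifies them, keeping the
# longest prefix it agrees with.
from typing import Callable, List

def speculative_generate(prompt: List[int],
                         draft_next: Callable[[List[int]], int],
                         target_next: Callable[[List[int]], int],
                         n_new: int, k: int = 5) -> List[int]:
    tokens = list(prompt)
    produced = 0
    while produced < n_new:
        # 1) The draft model proposes up to k tokens greedily.
        proposal = []
        for _ in range(min(k, n_new - produced)):
            proposal.append(draft_next(tokens + proposal))
        # 2) The target model verifies each position; on the first
        #    disagreement its own token is kept, so no target pass is wasted.
        accepted = []
        for t in proposal:
            verified = target_next(tokens + accepted)
            if verified == t:
                accepted.append(t)
            else:
                accepted.append(verified)
                break
        tokens.extend(accepted)
        produced += len(accepted)
    return tokens

# Toy usage: both "models" emit the same fixed cycle, so every proposal is accepted.
cycle = [1, 2, 3, 4]
toy_model = lambda ctx: cycle[len(ctx) % len(cycle)]
print(speculative_generate([0], toy_model, toy_model, n_new=8, k=4))

The draft only pays off when it agrees with the target often enough that the cheap draft passes cost less than the target passes they save, which is why even a tiny 0.4B draft model can give a worthwhile speedup at temperature 0.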

@jukofyork Have you ruled out OLMo v2? It's a fully open language model and each branch is a checkpoint (lots of checkpoints available): https://huggingface.co/allenai/OLMo-2-0325-32B

I've been meaning to try it out, but just taking the earliest checkpoint I like from the initial stage of the pretraining pipeline. After reading https://arxiv.org/abs/2503.19206 (Overtrained Language Models Are Harder to Fine-Tune), I want to experiment with the fine-tuning results on this model at different checkpoints to see if it makes a big difference for writing.

Unfortunately the vocab size is 100k.

Not yet, but I've never really had much luck with 30B and smaller models for writing - they seem to get confused very quickly :/

I've trained up a speculative decoding draft model for Mistral-Large-Instruct-2411 (and should work on Mistral-Large-Instruct-2407):

https://huggingface.co/jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0
https://huggingface.co/jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0-GGUF

Nice! I've used Mistral 7B for that in the past and it worked quite well - an easy 25% speedup at temp=0. It'll be interesting to see how this one compares.

@empeza

https://www.arxiv.org/abs/2510.13939

I haven't read it all yet, but one thing sticks out to me: none of those older models (GPT-4o, Gemini-1.5 and Claude-Sonnet 3.5) have the dreaded "Not X, Y" construct.

Also, it seems like ChatGPT started the annoying "dimly lit" slop propagation lol

@jukofyork

I'm beginning to think that older models with smaller vocabularies (eg: ~32k) and larger hidden states (eg: 8k or 12k) are likely to be the best targets for creative writing fine-tunes.
{ "hidden_size": 12288, "vocab_size": 32768 }

Interesting. I can't remember the source, but I saw speculation about the lack of SWA, and the ratio:

"num_attention_heads": 96,
"num_key_value_heads": 8

being relevant to this model.

Nice! I've used Mistral 7B for that

Yeah, that particular combo is the best draft model example I've seen (with ExllamaV2).

I see GLM-4-32B-Base-0414 in your base list now; a 32B dense model seems perfect for experiments.

@empeza

https://www.arxiv.org/abs/2510.13939

I haven't read it all yet, but one thing sticks out to me: none of those older models (GPT-4o, Gemini-1.5 and Claude-Sonnet 3.5) have the dreaded "Not X, Y" construct.

I'm noticing this more than anything else now 😱

I tried searching Kagi for pre-2020 discussions of this (if you search for it now on Google, you only find talk about its recent use by AI). The most interesting discussion is here:

https://github.com/UniversalDependencies/docs/issues/311

Turns out there are actually a few variations and it's interesting to read their discussion on it.

This discussion on "What is the logical operator for but?" is quite interesting too:

https://math.stackexchange.com/questions/64123/what-is-the-logical-operator-for-but

So maybe it's not only a result of Elara-like feedback, but also of all the code and STEM benchmaxxing.
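
Out of curiosity, the construct is easy enough to count in sampled outputs with a couple of rough regexes. The patterns below are illustrative only and will both over- and under-count; they're just for eyeballing relative frequency between models.

# Rough counter for a couple of "Not X, Y" / "It wasn't X, it was Y" variants
# in model output. Deliberately loose; only useful for comparing relative
# frequency between models, not for exact counts.
import re

PATTERNS = [
    # "not X, but Y"
    re.compile(r"\bnot [^.?!;]{1,60}, but ", re.IGNORECASE),
    # "It isn't/wasn't/'s not X, it's Y"
    re.compile(r"\b(?:isn't|wasn't|is not|was not|'s not) [^.?!;]{1,60}, (?:it|this|that|he|she|they)\b",
               re.IGNORECASE),
]

def contrast_negation_count(text: str) -> int:
    return sum(len(p.findall(text)) for p in PATTERNS)

sample = "It wasn't fear, it was something colder. Not a warning, but a promise."
print(contrast_negation_count(sample))  # prints 2

Running something like that over a fixed set of story prompts per model would make the "older models don't do this" observation quantifiable.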

Also, it seems like ChatGPT started the annoying "dimly lit" slop propagation lol

@jukofyork

I'm beginning to think that older models with smaller vocabularies (eg: ~32k) and larger hidden states (eg: 8k or 12k) are likely to be the best targets for creative writing fine-tunes.
{ "hidden_size": 12288, "vocab_size": 32768 }

Interesting. I can't remember the source, but I saw speculation about the lack of SWA, and the ratio:

"num_attention_heads": 96,
"num_key_value_heads": 8

being relevant to this model.

I'm taking a break from trying any more fine-tuning until I can see a good way to add an auxiliary loss to encourage less "one-sided" control adapters. I can see many ways to do it for full-batch learning, but not for mini-batch... Sadly, it may require some careful matching of opposing paragraphs using an embedding database to get it working.
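
To make the full-batch idea concrete, here is one purely illustrative way such a penalty could look (a sketch, not the actual plan; delta_pos and delta_neg are hypothetical tensors of the hidden-state shifts the adapter induces on positive-class and negative-class paragraphs).

# Illustrative PyTorch sketch only. delta_pos / delta_neg are hypothetical
# [n, hidden_size] tensors holding the hidden-state shifts the control adapter
# induces on positive-class and negative-class paragraphs respectively.
# A perfectly "two-sided" adapter would give mean_pos == -mean_neg, so the
# penalty is the squared norm of their sum.
import torch

def two_sided_penalty(delta_pos: torch.Tensor, delta_neg: torch.Tensor) -> torch.Tensor:
    mean_pos = delta_pos.mean(dim=0)
    mean_neg = delta_neg.mean(dim=0)
    return (mean_pos + mean_neg).pow(2).sum()

# total_loss = task_loss + aux_weight * two_sided_penalty(delta_pos, delta_neg)

The catch is exactly the mini-batch problem: unless each batch contains matched positive/negative paragraphs, the two means are estimated from unrelated content and the penalty mostly adds noise, hence the idea of pairing opposing paragraphs with an embedding database first.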

I'm noticing this more than anything else now 😱

I've started noticing it a lot in debates now. It seems to be a strawman technique.
"It's not that , it's <something moderately shitty that doesn't seem as bad by contrast>"

also of all the code and STEM benchmaxxing

You're probably onto something here. There's a clear correlation: models like Qwen3 are great for STEM, but they do this in almost every reply!

It's a long shot, but I'm training a control-vector on Kimi-K2 to try to remove this pattern 🤞
direct_statement_vs_contrast_negation
It's a really creative model, but coming back to it after GLM-4.6, I can't help but notice it doing this almost as often as Qwen3.
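
For anyone curious what that involves at its simplest, the textbook version of extracting a direction like direct_statement_vs_contrast_negation is just a difference of mean activations between two contrasting prompt sets. The sketch below is generic and illustrative; get_hidden_states is a hypothetical helper, and the tooling actually used for these vectors is more involved than this.

# Generic difference-of-means sketch for extracting a steering direction such
# as direct_statement_vs_contrast_negation. get_hidden_states is a hypothetical
# helper returning an [n, hidden_size] tensor of residual-stream activations at
# one chosen layer for a batch of texts.
import torch

def extract_direction(pos_texts, neg_texts, get_hidden_states, layer: int) -> torch.Tensor:
    h_pos = get_hidden_states(pos_texts, layer)   # e.g. direct-statement continuations
    h_neg = get_hidden_states(neg_texts, layer)   # e.g. "not X, but Y" style continuations
    direction = h_pos.mean(dim=0) - h_neg.mean(dim=0)
    return direction / direction.norm()           # unit direction; scaled at inference time

# At inference the scaled direction is added to (or subtracted from) the same
# layer's residual stream to push generations toward the "direct statement" side.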

It's not very good at coding, so I'm not sure how useful these will be:

https://huggingface.co/jukofyork/command-a-03-2025-DRAFT-0.8B-v3.0
https://huggingface.co/jukofyork/command-a-03-2025-DRAFT-0.8B-v3.0-GGUF

but they might be useful for somebody who wants to do continued fine-tuning for the stuff command-a is good at (like RAG, etc.).
