“The doom lies in yourself, not in your name.”
Continuation of Wur doomed!.
For longer text chunks or stories, https://pastebin.com works great and helps prevent the thread from slowing down!
🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧
🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛🟧
🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧🟧
⬜🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧⬛🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜⬜🟧🟧⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜⬜🟧🟧🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛⬛🟧⬜
⬜🟧⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛🟧⬜
⬜🟧⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛🟧⬜
⬜🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧⬜
The doom is still buried within Command-A for sure.
A step 601 preview - all with temperature = 0:
- It's still messing up some end-of-lines, but I can live with that if it works... It can likely be fixed later using the new class 0 random data if it turns out to be a problem.
- The Grimdark story was noticeably (much!) better compared to the inverse.
- The Battlestar Galactica story showed that even though `Q8_0`, `F16` and `BF16` all diverge slightly from `F32`, the divergence isn't clearly making them any worse (I actually liked the `Q8_0` story best!).
| Size | Name | 
|---|---|
| 287M | command-a-03-2025-lora-Q8_0.gguf | 
| 541M | command-a-03-2025-lora-F16.gguf | 
| 541M | command-a-03-2025-lora-BF16.gguf | 
| 1.1G | command-a-03-2025-lora-F32.gguf | 
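If anyone wants to try one of these adapter GGUFs, a minimal sketch with llama-cpp-python might look like the following - the base-model quant filename is just a placeholder, and I'm assuming its `lora_path` option handles this adapter/model combo:

```python
from llama_cpp import Llama

# Placeholder filenames: point these at your own base-model quant and one of the
# adapter GGUFs listed above.
llm = Llama(
    model_path="command-a-03-2025-Q4_K_M.gguf",
    lora_path="command-a-03-2025-lora-Q8_0.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,
)

out = llm.create_completion(
    "Write the opening paragraph of a grimdark story.",
    max_tokens=256,
    temperature=0.0,  # greedy decoding, to match the temperature = 0 previews
)
print(out["choices"][0]["text"])
```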
It still has a way to go before it starts to converge, but I would think by step 1000 it will be pretty close.
566 responses in the previous thread! In the future we may be the reason for HF staff to implement a multi-page view of discussions.
This was posted on Hacker News today:
Absolutely fascinating!
> This was posted on Hacker News today:
> Absolutely fascinating!

That was really cool. Thanks for sharing!
Yeah, and llama-3.1:405b doing so well was quite a surprise too (and makes you a bit sad everything seems to be moving away from large dense models).
@jukofyork Have you ruled out OLMo v2? It's a fully open language model and each branch is a checkpoint (lots of checkpoints available): https://huggingface.co/allenai/OLMo-2-0325-32B
I've been meaning to try it out by just taking the earliest checkpoint I like from the initial stage of the pretraining pipeline. After reading https://arxiv.org/abs/2503.19206 (Overtrained Language Models Are Harder to Fine-Tune) I want to experiment with fine-tuning results on this model at different checkpoints to see if it makes a big difference for writing.
Unfortunately vocab size is 100k.
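As a sketch, pulling a specific pretraining checkpoint from one of those branches looks something like this (the branch names are whatever the repo uses, so list them first):

```python
from huggingface_hub import list_repo_refs
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "allenai/OLMo-2-0325-32B"

# Each branch of the repo is a checkpoint; list them and pick an early one.
branches = [ref.name for ref in list_repo_refs(repo).branches]
print(branches)

ckpt = branches[0]  # or choose a specific early-stage checkpoint by name
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, revision=ckpt, torch_dtype="auto")
```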
I've trained up a speculative decoding draft model for Mistral-Large-Instruct-2411 (and should work on Mistral-Large-Instruct-2407):
https://huggingface.co/jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0
https://huggingface.co/jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0-GGUF
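With llama.cpp you just pass the GGUF as the draft model; as a rough sketch, the same idea via transformers' assisted generation would look something like this (assuming the draft shares the target's tokenizer, and that you have access to the gated target repo plus enough VRAM to load it):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "mistralai/Mistral-Large-Instruct-2411"  # gated repo
draft_id = "jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0"

tok = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.bfloat16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Write a short scene set on a generation ship."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(target.device)

# assistant_model enables speculative (assisted) decoding with the small draft model.
out = target.generate(**inputs, assistant_model=draft, do_sample=False, max_new_tokens=256)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```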
> @jukofyork Have you ruled out OLMo v2? It's a fully open language model and each branch is a checkpoint (lots of checkpoints available): https://huggingface.co/allenai/OLMo-2-0325-32B
Not yet, but I've never really had much luck with 30B and smaller models for writing - they seem to get confused very quickly :/
> I've trained up a speculative decoding draft model for Mistral-Large-Instruct-2411 (and should work on Mistral-Large-Instruct-2407):
> https://huggingface.co/jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0
> https://huggingface.co/jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0-GGUF
Nice! I've used Mistral 7B for that in the past and it worked quite well - got an easy 25% speedup with temp=0. Interested to see how this one compares to it.
I haven't read it all yet, but one thing sticks out to me: none of those older models (GPT-4o, Gemini-1.5 and Claude-Sonnet 3.5) have the dreaded "Not X, Y" construct.
Also, it seems like ChatGPT started the annoying "dimly lit" slop propagation lol
I'm beginning to think that older models with smaller vocabularies (eg: ~32k) and larger hidden states (eg: 8k or 12k) are likely to be the best targets for creative writing fine-tunes.
{ "hidden_size": 12288, "vocab_size": 32768 }
Interesting. I can't remember the source, but I saw speculation about the lack of SWA, and the ratio:
"num_attention_heads": 96,
"num_key_value_heads": 8
being relevant with this model.
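A quick way to check those fields yourself (just a sketch - the repo is gated, so you may need a token or a local copy of the model's config.json):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Mistral-Large-Instruct-2407")

print(cfg.hidden_size, cfg.vocab_size)                   # 12288, 32768
print(cfg.num_attention_heads, cfg.num_key_value_heads)  # 96, 8
```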
> Nice! I've used Mistral 7B for that
Yeah, that particular combo is the best draft model example I've seen (with ExllamaV2).
I see GLM-4-32B-Base-0414 in your base list now - a 32B dense model seems perfect for experiments.
> None of those older models (GPT-4o, Gemini-1.5 and Claude-Sonnet 3.5) have the dreaded "Not X, Y" construct.
I'm noticing this more than anything else now 😱
I tried searching Kagi for pre-2020 discussions around this (if you search for it now using Google, you only find talk about its recent use by AI). The most interesting discussion is here:
https://github.com/UniversalDependencies/docs/issues/311
Turns out there are actually a few variations and it's interesting to read their discussion on it.
This discussion on "What is the logical operator for but?" is quite interesting too:
https://math.stackexchange.com/questions/64123/what-is-the-logical-operator-for-but
So maybe it's not only a result of Elara-like feedback, but also of all the code and STEM benchmaxxing.
> I'm beginning to think that older models with smaller vocabularies (eg: ~32k) and larger hidden states (eg: 8k or 12k) are likely to be the best targets for creative writing fine-tunes.
I'm taking a break from trying any more fine-tuning until I can see a good way to add an auxiliary loss to encourage less "one-sided" control adapters. I can see many ways to do it for full-batch learning, but not for mini-batch... It may require some careful matching of opposing paragraphs using an embedding database to get it working, sadly.
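As a rough illustration of that matching step (the embedding model and the greedy nearest-neighbour pairing below are just placeholders):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

dark_paras = ["A grim paragraph...", "Another grim paragraph..."]         # one class
light_paras = ["A hopeful paragraph...", "Another hopeful paragraph..."]  # its inverse

dark = embedder.encode(dark_paras, normalize_embeddings=True)
light = embedder.encode(light_paras, normalize_embeddings=True)

# Cosine similarity matrix; pair each "dark" paragraph with its nearest "light" one so
# a mini-batch can contain matched opposing pairs rather than unrelated text.
sims = dark @ light.T
pairs = [(i, int(sims[i].argmax())) for i in range(len(dark_paras))]
print(pairs)
```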
> I'm noticing this more than anything else now 😱
I've started noticing it a lot in debates now. It seems to be a strawman technique.
"It's not that , it's <something moderately shitty that doesn't seem as bad by contrast>"
> also by all the code and STEM benchmaxxing
You're probably onto something here. There's a clear correlation: models like Qwen3 are great for STEM, but do this in almost every reply!
It's a long shot, but I'm training a control vector on Kimi-K2 to try to remove this pattern 🤞 (`direct_statement_vs_contrast_negation`).
It's a really creative model, but coming back to it after GLM-4.6, I can't help but notice it doing this almost as often as Qwen3.
It's not very good at coding, so not sure how useful these will be:
https://huggingface.co/jukofyork/command-a-03-2025-DRAFT-0.8B-v3.0
https://huggingface.co/jukofyork/command-a-03-2025-DRAFT-0.8B-v3.0-GGUF
but maybe useful for somebody who wants to do continued fine-tuning for stuff command-a is good at (like RAG, etc).