Yes. I meant the case where the parameters in the LLM do not match the retrieved knowledge.
Retrieved knowledge may say "the white of an egg is more beneficial", whereas the LLM may hold the opinion that it is the yolk. Eventually it may produce the yolk as the answer even though the context is full of "white".
Google has a benchmark for this, I think: "FACTS Grounding". I don't agree with its name choice, but it is relevant here.
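A minimal sketch of how you could probe this yourself, assuming an OpenAI-compatible API; the model name is a placeholder, and the point is just to put a counter-parametric claim in the context and see whether the answer follows it:

```python
# Minimal grounding probe: does the model answer from the provided context
# or fall back to its parametric "opinion"? Model name is a placeholder;
# assumes an OpenAI-compatible endpoint and an API key in the environment.
from openai import OpenAI

client = OpenAI()

context = "Recent research note: the white of an egg is more beneficial than the yolk."
question = "According to the context above, which part of the egg is more beneficial?"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; swap in the model under test
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"{context}\n\n{question}"},
    ],
    temperature=0,
)
answer = response.choices[0].message.content.lower()
print("grounded" if "white" in answer else "parametric override")
```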
I made an LLM act as an aggregator and combine the answers of several other LLMs, as in a mixture-of-agents scenario. The aggregator does not always produce the average or median answer. It brings its own opinion.
I think Google has a benchmark for this (sticking to the context and not bringing in its own words).
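Roughly how the aggregator setup looks, as a minimal sketch; the model ids and prompts here are placeholders, not the exact ones I used, and it assumes an OpenAI-compatible API:

```python
# Mixture-of-agents sketch: several "proposer" models answer, one aggregator
# model combines their answers into a final response.
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str, temperature: float = 0.7) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

question = "Which part of an egg is more beneficial, the white or the yolk?"
proposers = ["model-a", "model-b", "model-c"]  # placeholder model ids
candidates = [ask(m, question) for m in proposers]

aggregator_prompt = (
    "Combine the candidate answers below into a single answer. "
    "Stay faithful to the candidates; do not add your own opinion.\n\n"
    + "\n\n".join(f"Candidate {i + 1}: {c}" for i, c in enumerate(candidates))
    + f"\n\nQuestion: {question}"
)
final = ask("aggregator-model", aggregator_prompt, temperature=0)  # placeholder id
print(final)
```

Even with an instruction like the one above, the aggregator can still inject its own stance instead of faithfully merging the candidates.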
Kimi was OK until it started "thinking" ..
- Better signal thanks to new models like Enoch
- MoA of the top and bottom models of the current leaderboard to add more diverse inputs
- Better questions
- A faster and more precise measurement methodology
- Explanations on how each column is calculated
- More sample questions revealed
Current version: https://huggingface.co/blog/etemiz/aha-leaderboard
Tell me how I can improve it further.
CWClabs/CWC-Mistral-Nemo-12B-V2-q4_k_m
We are going to be using it as one of the ground truths for AHA Leaderboard 2.0 (the next version).
We will be able to generate RL datasets for folks to align their own LLMs with humanity. We will generate answers from the best and the worst models, do a mixture of agents that combines the answers, and publish the results as dataset(s). Things are looking bright!
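A minimal sketch of what one of these datasets could look like, as DPO-style prompt/chosen/rejected rows; the model ids and questions are placeholders, and it assumes an OpenAI-compatible API:

```python
# Sketch: build a preference dataset from a top-scoring vs a bottom-scoring
# leaderboard model. Model ids and the question list are placeholders.
import json
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str, temperature: float = 0.7) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

top_model = "top-model"        # highest AHA score (placeholder id)
bottom_model = "bottom-model"  # lowest AHA score (placeholder id)
questions = ["Which part of an egg is more beneficial?"]  # load the real question set here

with open("aha_preferences.jsonl", "w") as out:
    for q in questions:
        row = {
            "prompt": q,
            "chosen": ask(top_model, q),       # answer from the top of the leaderboard
            "rejected": ask(bottom_model, q),  # answer from the bottom of the leaderboard
        }
        out.write(json.dumps(row) + "\n")
```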
Working on a broader version of the AHA leaderboard. Follow for more quackery :)
This fine-tune would score 56 and place 1st on the leaderboard, but I didn't add it; I only include full trainings (or further tunings by the same company) in the leaderboard:
https://huggingface.co/CWClabs/CWC-Mistral-Nemo-12B-V2-q4_k_m
LLM builders in general are not doing a great job of making human-aligned models.
I don't want to say this is a proxy for p(doom)... But it could be if we are not careful.
The most probable cause is recklessly training LLMs on the outputs of other LLMs, not caring about dataset curation, and not asking "what is beneficial for humans?"...
huihui-ai/Huihui-GLM-4.5-Air-abliterated-GGUF
@huihui-ai
Our leaderboard can be used for human alignment in an RL setting. Ask the same question to the top models and the worst models; an answer from a top model can get a +1 score, and one from a bad model -1. Ask many times with a higher temperature to generate more answers. What do you think?
https://huggingface.co/blog/etemiz/aha-leaderboard
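A rough sketch of the +1/-1 idea with repeated higher-temperature sampling; the model ids are placeholders, and it assumes an OpenAI-compatible API:

```python
# Sketch of reward labeling: sample each question several times at a higher
# temperature, tag answers from a top model with +1 and answers from a
# bottom model with -1. Model ids are placeholders.
from openai import OpenAI

client = OpenAI()

def sample(model: str, prompt: str, n: int = 4, temperature: float = 1.0) -> list[str]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        n=n,  # several samples per question for more coverage
    )
    return [choice.message.content for choice in resp.choices]

question = "Which part of an egg is more beneficial?"
dataset = []
for model, reward in [("top-model", +1), ("bottom-model", -1)]:  # placeholder ids
    for answer in sample(model, question):
        dataset.append({"prompt": question, "completion": answer, "reward": reward})
print(len(dataset), "scored samples")
```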
It is weird that we can also understand humans better thanks to LLM research. Human behavior has parallels to this. When we optimize for short-term pleasure (high time preference), we end up feeding the beast in us, ending up misaligned with other humans. But if we care about other humans (low space preference), we are more aligned with them. Feeding the ego has parallels to reward hacking. Overcoming the ego can be described as having a high human alignment score.
I think the reverse is also true: a benevolent, properly aligned LLM can "subconsciously teach" another LLM, and proper alignment can spread like a virus.
https://x.com/OwainEvans_UK/status/1947689616016085210
Hi Doctor Chad, nice to see you too
Thanks for sharing,
I will test that. Yi 1.5 holds second place on my leaderboard!