Eval Request

#414
by Babsie - opened

Hello! Request for eval, when you have the time.
Babsie/Trickster-Theta-4-70B
and
Babsie/BrownLoafers-70B

By the way, thanks for all the work you put into the board. It's been invaluable to me as I learn to mangle models about and blow shit up. I haven't had this much fun in a long-ass time, and without your data things would be a lot more... ah... mysterious! With a lot more frustrating failures. So, thanks 👍
Still learning, and the data coming back on how these little buggers do will help me out. I should probably run some evals myself; better to learn (and be less of a nuisance) than to keep asking someone else to do it.

You wouldn't want to point me to the library where you got some of these evals, would you (I don't mean the actual questions, but where I could construct my own; I have a specific goal in mind around specific measurements)? A link would be very helpful. I get lost on Hugging Face. Attention span of a gnat.

I run local models using vLLM (for its batching), but the automated testing system I programmed pretty much from scratch. It's a lot of Python code shaped specifically around my benchmarks.
For automated testing you either need to instruct the model to give its answer in an easily parsable format, or have a separate LLM dedicated to parsing the responses for answers.
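
To make that concrete, here's a minimal sketch of the "parsable format" approach (this isn't my actual harness; the question list, model name, and regex are just placeholders):

```python
import re

from vllm import LLM, SamplingParams

# Placeholder multiple-choice items; a real benchmark would obviously be much larger.
questions = [
    {
        "prompt": (
            "Which planet is the largest?\nA) Earth\nB) Jupiter\nC) Mars\n"
            "Reply with the letter only, in the form 'Answer: X'."
        ),
        "correct": "B",
    },
]

llm = LLM(model="Babsie/Trickster-Theta-4-70B")  # any local model path works here
params = SamplingParams(temperature=0.0, max_tokens=16)

# vLLM batches the whole prompt list in one call, which is the main reason to use it.
outputs = llm.generate([q["prompt"] for q in questions], params)

score = 0
for q, out in zip(questions, outputs):
    text = out.outputs[0].text
    # The prompt forces a parsable format, so a regex is enough to pull the answer out.
    match = re.search(r"Answer:\s*([A-D])", text, re.IGNORECASE)
    if match and match.group(1).upper() == q["correct"]:
        score += 1

print(f"{score}/{len(questions)} correct")
```

The alternative (a separate parsing LLM) is the same loop, just with the regex step replaced by another model call that extracts the answer from a free-form response.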

DontPlanToEnd changed discussion status to closed

Yes, I'm starting to gather this as I read the boards in Spaces. I do this already when I test my models for "is this a shit model?" I have a gauntlet of things I run them through. It's just that they aren't scored on an "objective" ranking system that could be shown transparently, as in "this model scored n/10 on card adherence" or "n/10 on capacity to incorporate a large, layered, complex card," etc.
Grading systems are... insanely problematic (I have a research background in anthropology, studying primates). I have no idea how you managed it. Kudos.
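
If I ever get around to making those n/10 numbers objective, I suppose something like a judge-model rubric would do it. A rough sketch of the idea (the endpoint, judge model name, and rubric prompt are all made up, assuming vLLM's OpenAI-compatible server is running):

```python
import re

from openai import OpenAI

# vLLM can serve an OpenAI-compatible API, so the judge model gets queried the same way.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

RUBRIC = (
    "You are grading a roleplay response for adherence to the character card.\n"
    "Card:\n{card}\n\nResponse:\n{response}\n\n"
    "Give a score from 0 to 10 and nothing else, in the form 'Score: N'."
)

def judge_card_adherence(card: str, response: str) -> int | None:
    """Ask the judge model for a 0-10 card-adherence score and parse it back out."""
    reply = client.chat.completions.create(
        model="judge-model",  # whichever model is acting as the grader
        messages=[{"role": "user", "content": RUBRIC.format(card=card, response=response)}],
        temperature=0.0,
    )
    match = re.search(r"Score:\s*(\d+)", reply.choices[0].message.content)
    return int(match.group(1)) if match else None
```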
I'll have a rummage around. Thanks for your time.
