Evaluation reproducibility issue and temperature setting

#3
by Tae-Gyun

Hello, thank you for your great work.

I'm using the "Unbabel/M-Prometheus-7B" model to evaluate LLM responses via PrometheusEval.absolute_grade(). To improve evaluation reproducibility, I set the temperature parameter to a lower value (e.g., between 0 and 0.9). However, when I do so, I frequently see the message "Retrying failed batches" during grading.

This seems to happen more often when the temperature is low, and in some cases the evaluation fails entirely.
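For context, my grading call looks roughly like the sketch below (trimmed, with placeholder rubric and data). The wrapper setup follows the prometheus-eval README; passing the sampling settings through a params dict on absolute_grade() is my assumption about how temperature/top_p reach the underlying vLLM engine, so the exact keyword may not match my real code.

from prometheus_eval import PrometheusEval
from prometheus_eval.vllm import VLLM
from prometheus_eval.prompts import ABSOLUTE_PROMPT, SCORE_RUBRIC_TEMPLATE

# Judge model behind the prometheus-eval vLLM wrapper.
model = VLLM(model="Unbabel/M-Prometheus-7B")
judge = PrometheusEval(model=model, absolute_grade_template=ABSOLUTE_PROMPT)

# Placeholder rubric and data; in practice these come from my ~9K Korean samples.
rubric = SCORE_RUBRIC_TEMPLATE.format(
    criteria="Is the response helpful and factually correct?",
    score1_description="Not helpful at all.",
    score2_description="Mostly unhelpful.",
    score3_description="Partially helpful.",
    score4_description="Helpful with minor issues.",
    score5_description="Fully helpful and correct.",
)
instructions = ["Explain what a hash map is."]
responses = ["A hash map stores key-value pairs using a hash function."]
reference_answers = ["A hash map maps keys to values via hashing."]

# Batch absolute grading; `params` (keyword name assumed) carries the sampling
# settings I lower in the hope of getting reproducible scores.
feedbacks, scores = judge.absolute_grade(
    instructions=instructions,
    responses=responses,
    rubric=rubric,
    reference_answers=reference_answers,
    params={"temperature": 0.0, "top_p": 0.9, "max_tokens": 1024},
)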

Could you clarify:

  • Whether low temperature values are supported for deterministic scoring at inference time, and whether this failure is expected or avoidable.
  • Whether there are recommended inference settings (e.g., temperature, top_p) for stable and reproducible scoring with this model.

Additionally, I noticed that the retry logic for failed batches is capped at a maximum of 10 attempts. When evaluating around 9K Korean samples, 10 retries were not sufficient to complete all requests successfully. As a workaround, I modified the retry loop to allow unlimited retries, for example:

# Retry logic with progress bar (modified: the original cap of 10 attempts is removed)
retries = 0
while to_retry_inputs:  # keep retrying until no failed batches remain, instead of stopping at a fixed limit
    retries += 1
    print(f"Retrying failed batches: Attempt {retries}")
    ...

I have two follow-up questions here:

  • Is there a specific reason why the maximum retry count is limited to 10?
  • Would it be safe to raise the cap or allow unlimited retries as above, or could that lead to unintended issues (e.g., hanging processes, resource leaks)? A bounded variant I'm considering is sketched after this list.
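
For reference, this is the kind of bounded alternative I have in mind: a configurable, much higher cap plus a small backoff, so a permanently failing batch can't spin forever. It's a generic sketch around the same to_retry_inputs loop, not the library's actual code; process_batches is a hypothetical stand-in for whatever the retry body does.

import time

MAX_RETRIES = 200          # configurable ceiling instead of the hard-coded 10
BACKOFF_SECONDS = 2.0      # small pause between attempts to avoid hammering the engine

retries = 0
while to_retry_inputs and retries < MAX_RETRIES:
    retries += 1
    print(f"Retrying failed batches: Attempt {retries}")
    time.sleep(BACKOFF_SECONDS)
    # Hypothetical helper: re-runs the failed batches and returns those that still fail.
    to_retry_inputs = process_batches(to_retry_inputs)

if to_retry_inputs:
    print(f"Giving up on {len(to_retry_inputs)} batches after {MAX_RETRIES} attempts")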

Thanks again for your support and for providing such a useful evaluation framework!
