Evaluation reproducibility issue and temperature setting
Hello, thank you for your great work.
I'm using the `Unbabel/M-Prometheus-7B` model to evaluate LLM responses via `PrometheusEval.absolute_grade()`. To improve evaluation reproducibility, I set the temperature to a lower value (e.g., between 0 and 0.9). However, when I do so, I frequently run into the following issue: `Retrying failed batches`.
This seems to happen more often when the temperature is low, and in some cases the evaluation fails entirely.
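For reference, this is roughly how I set things up (a trimmed sketch: the rubric text and the single Korean sample are placeholders, and I am assuming the `params` dict passed to `absolute_grade()` is forwarded to vLLM's `SamplingParams`):

```python
from prometheus_eval import PrometheusEval
from prometheus_eval.prompts import ABSOLUTE_PROMPT, SCORE_RUBRIC_TEMPLATE
from prometheus_eval.vllm import VLLM

# Judge model (the multilingual checkpoint mentioned above).
judge = PrometheusEval(
    model=VLLM(model="Unbabel/M-Prometheus-7B"),
    absolute_grade_template=ABSOLUTE_PROMPT,
)

# Placeholder rubric; my real one is task-specific.
rubric = SCORE_RUBRIC_TEMPLATE.format(
    criteria="Is the response accurate and helpful for the given Korean instruction?",
    score1_description="Inaccurate and unhelpful.",
    score2_description="Mostly inaccurate or unhelpful.",
    score3_description="Partially accurate and somewhat helpful.",
    score4_description="Mostly accurate and helpful.",
    score5_description="Fully accurate and very helpful.",
)

# Single placeholder sample; in practice these lists hold ~9K Korean items.
instructions = ["김치의 역사를 간단히 설명해 주세요."]
responses = ["김치는 삼국시대부터 이어져 온 한국의 전통 발효 음식입니다."]
reference_answers = ["김치는 채소를 소금에 절여 발효시킨 한국 전통 음식으로, 역사가 매우 깁니다."]

# Sampling settings I lower for reproducibility; I assume this dict is
# forwarded to vLLM's SamplingParams.
params = {
    "temperature": 0.1,  # lowered from the default to reduce score variance
    "top_p": 0.9,
    "max_tokens": 1024,
}

feedbacks, scores = judge.absolute_grade(
    instructions=instructions,
    responses=responses,
    rubric=rubric,
    reference_answers=reference_answers,
    params=params,
)
```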
Could you clarify:
- Whether low temperature values are supported for deterministic scoring at inference time, and whether this issue is expected or avoidable?
- Whether there are recommended inference settings (e.g., temperature, top_p, etc.) for stable and reproducible scoring using the model?
Additionally, I noticed that the retry logic for failed batches is capped at a maximum of 10 attempts. When evaluating around 9K Korean samples, 10 retries were not sufficient to complete all requests successfully. As a workaround, I modified the retry loop to allow unlimited retries, for example:
```python
# Retry logic with progress bar
retries = 0
while to_retry_inputs:
    retries += 1
    print(f"Retrying failed batches: Attempt {retries}")
    ...
```
I have two follow-up questions here:
- Is there a specific reason why the maximum retry count is limited to 10?
- Would it be safe to raise the retry threshold or remove it entirely as above, or could that lead to unintended issues (e.g., hanging processes, resource leaks)? I sketch a bounded alternative below.
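If unlimited retries are indeed risky, a bounded but configurable loop is the kind of middle ground I had in mind. A rough sketch (not the library's code; `MAX_RETRIES`, `BACKOFF_SECONDS`, and `run_batch` are hypothetical names):

```python
import time

MAX_RETRIES = 50       # configurable, instead of the hard-coded 10
BACKOFF_SECONDS = 2.0  # short pause between attempts


def retry_failed_batches(to_retry_inputs, run_batch):
    """Retry failed batches up to MAX_RETRIES with a short backoff.

    `run_batch` is a hypothetical callable that takes the still-failing
    inputs and returns whatever is still failing afterwards (an empty
    list once everything has succeeded).
    """
    retries = 0
    while to_retry_inputs and retries < MAX_RETRIES:
        retries += 1
        print(f"Retrying failed batches: Attempt {retries}/{MAX_RETRIES}")
        to_retry_inputs = run_batch(to_retry_inputs)
        if to_retry_inputs:
            time.sleep(BACKOFF_SECONDS)
    if to_retry_inputs:
        raise RuntimeError(
            f"{len(to_retry_inputs)} batches still failing after {MAX_RETRIES} retries"
        )
```

That way the loop can never hang forever, but the cap is no longer a hard-coded magic number.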
Thanks again for your support and for providing such a useful evaluation framework!