InspectorRAGet

Welcome

Instructions

Use the Analytics platform to inspect and analyze LLM evaluation experiments. Each experiment is assumed to comprise the following (a sketch of the expected layout appears after the list):

  • A dataset of tasks, where each task has at least one triplet of context, grounding document, and response. A task may have multiple responses if multiple models are being evaluated simultaneously (at most 5 models are allowed).
  • Each response is evaluated on at least one metric. A metric may be categorical (yes/no, Likert scale) or numeric. One experiment may include any number of categorical or numeric metrics, though we strongly caution against including too many, as this makes instance-level analysis challenging.
  • There is at least one annotator/evaluator. An annotator may be a human or an algorithm (whether a defined quantitative metric, or an LLM). One experiment may include any number of human or algorithmic annotators.
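
For concreteness, the sketch below shows one way such an experiment could be laid out as a JSON document, written here as a Python literal. All field names (models, metrics, annotators, tasks, evaluations, and so on) are illustrative assumptions, not the platform's exact schema; the upload page on the next screen contains the authoritative example.

    # Illustrative only: a minimal experiment payload, assuming a JSON layout.
    # Field names are assumptions for illustration; consult the detailed
    # example on the upload page for the authoritative schema.
    experiment = {
        "name": "demo-experiment",
        "models": [  # at most 5 models may be evaluated simultaneously
            {"model_id": "model_a", "name": "Model A"},
        ],
        "metrics": [
            {
                "name": "faithfulness",
                "type": "categorical",    # categorical or numeric
                "values": ["yes", "no"],  # e.g. yes/no or a Likert scale
            },
        ],
        "annotators": [  # human or algorithmic (e.g. an LLM judge)
            {"annotator_id": "human_1", "type": "human"},
        ],
        "tasks": [
            {
                "task_id": "task_0",
                "context": "What does the contract say about renewal?",
                "document": "Section 4: The agreement renews annually...",
                "responses": [  # one per model being evaluated
                    {"model_id": "model_a", "text": "It renews annually."},
                ],
            },
        ],
        "evaluations": [  # each response scored on at least one metric
            {
                "task_id": "task_0",
                "model_id": "model_a",
                "annotations": {
                    "faithfulness": {"human_1": {"value": "yes"}},
                },
            },
        ],
    }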

Upload your experiment data on the following page, which contains a detailed example of the expected schema. You will need to provide sufficient metadata about the tasks, metrics, and annotators. If the uploaded file is not well-formed, you will see a verification error.
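
If you want to catch problems before uploading, a rough local sanity check is sketched below. It is a hedged sketch only: it tests the top-level keys and model limit of the illustrative layout above, not the platform's actual validator, which enforces the full schema.

    import json

    # Rough pre-upload well-formedness check. Mirrors only the illustrative
    # layout sketched earlier; it is NOT the platform's actual validator.
    REQUIRED_KEYS = {"name", "models", "metrics", "annotators", "tasks", "evaluations"}

    def precheck(path: str) -> list[str]:
        """Return a list of problems found; empty means the basic shape looks OK."""
        problems = []
        with open(path, encoding="utf-8") as f:
            data = json.load(f)  # raises an error if the file is not valid JSON
        missing = REQUIRED_KEYS - data.keys()
        if missing:
            problems.append(f"missing top-level keys: {sorted(missing)}")
        if len(data.get("models", [])) > 5:
            problems.append("more than 5 models are not allowed")
        if not data.get("metrics"):
            problems.append("at least one metric is required")
        if not data.get("annotators"):
            problems.append("at least one annotator is required")
        return problems

    # Example usage before uploading experiment.json:
    # for p in precheck("experiment.json"):
    #     print("precheck:", p)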