Rouge-L: Rouge-L score based on word overlap
Extractiveness: Extractiveness score based on passage overlap
Recall: Recall score based on word overlap
Length: Length in characters
Bert-KPrec: Bert-KPrec score based on word overlap
Answerability (Accuracy): Accuracy score for predicted vs. gold class for answerability
Fluency: The response is coherent, natural, and not dismissive.
Fluency Score Explanation: Annotator feedback explaining their score for fluency
Answer Relevance: The response provides an appropriate amount of useful information.
Answer Relevance Score Explanation: Annotator feedback explaining their score for answer relevance
Faithfulness: The response is faithful to and grounded in the context.
Faithfulness Score Explanation: Annotator feedback explaining their score for faithfulness
Win Rate (Head-to-Head): Number of times this model's response is preferred over other models' responses for the same task.
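
To make the Recall and Win Rate (Head-to-Head) entries more concrete, here is a minimal Python sketch of one way such numbers could be computed. It is an illustration under stated assumptions, not the scoring code behind these fields: whitespace tokenization, lowercasing, and the function names `word_overlap_recall` and `head_to_head_wins` are all hypothetical choices made for this example.

```python
from collections import Counter


def word_overlap_recall(response: str, reference: str) -> float:
    """Fraction of reference words (counted with multiplicity) found in the response.

    Assumption: words are lowercased and split on whitespace; the real
    metric may tokenize or normalize differently.
    """
    ref_counts = Counter(reference.lower().split())
    resp_counts = Counter(response.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref_counts & resp_counts).values())
    return overlap / total


def head_to_head_wins(preferred_models: list[str]) -> Counter:
    """Count how often each model was the preferred one.

    Assumption: each list item is the name of the model whose response
    an annotator preferred in one comparison.
    """
    return Counter(preferred_models)
```

For example, `word_overlap_recall("the cat sat", "the cat sat down")` covers three of the four reference words, and `head_to_head_wins(["model_a", "model_b", "model_a"])` records two wins for the hypothetical `model_a`.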