- # of evaluations
- # of annotators
- Metrics: Metric A, Metric B, Metric C
- Models: Model A, Model B
- Duration
[Interactive predictions table with sortable columns: Task, Targets, GPT-3.5 (Turbo) prediction, Llama (13b-chat) prediction, Mistral prediction, Reference prediction.]
| Cohen's kappa | Interpretation |
|---|---|
| 0 | No agreement |
| 0.10-0.20 | Slight agreement |
| 0.21-0.40 | Fair agreement |
| 0.41-0.60 | Moderate agreement |
| 0.61-0.80 | Substantial agreement |
| 0.81-0.99 | Near perfect agreement |
| 1 | Perfect agreement |
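These bands can be applied programmatically once kappa has been computed. Below is a minimal Python sketch, not part of the original tooling, that computes Cohen's kappa for two annotators with `scikit-learn` (assumed to be available) and maps the result onto the interpretation labels from the table; the example ratings are made up.

```python
from sklearn.metrics import cohen_kappa_score

def interpret_kappa(kappa: float) -> str:
    """Map a kappa value onto the agreement bands from the table above."""
    if kappa >= 1.0:
        return "Perfect agreement"
    if kappa >= 0.81:
        return "Near perfect agreement"
    if kappa >= 0.61:
        return "Substantial agreement"
    if kappa >= 0.41:
        return "Moderate agreement"
    if kappa >= 0.21:
        return "Fair agreement"
    if kappa >= 0.10:
        return "Slight agreement"
    return "No agreement"

# Hypothetical ratings from two annotators on the same set of responses.
annotator_a = [3, 4, 2, 5, 4, 3, 1, 4]
annotator_b = [3, 4, 3, 5, 4, 2, 1, 4]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f} -> {interpret_kappa(kappa)}")
```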
| Model | Fluency | Answer Relevance | Faithfulness | Win Rate (Head-to-Head) | Rank Sum |
|---|---|---|---|---|---|
| Reference | 3.96 ± 0.16 (1) | 3.73 ± 0.33 (2) | 3.39 ± 0.31 (1) | 64.56 ± 22.74 (1) | 5 (1) |
| GPT-3.5 (Turbo) | 3.96 ± 0.14 (1) | 3.75 ± 0.34 (1) | 2.78 ± 0.31 (3) | 60.78 ± 20.48 (2) | 7 (2) |
| Mistral | 3.91 ± 0.16 (2) | 3.57 ± 0.38 (3) | 2.70 ± 0.36 (4) | 51.89 ± 18.13 (3) | 12 (3) |
| Llama (13b-chat) | 3.64 ± 0.38 (3) | 3.19 ± 0.42 (4) | 3.01 ± 0.38 (2) | 37.67 ± 17.12 (4) | 13 (4) |
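The final column follows from the per-metric ranks shown in parentheses: each model is ranked on every metric and the ranks are summed, so a lower total indicates stronger overall performance. Below is a minimal Python sketch of that aggregation under those assumptions; the dictionary layout and function name are illustrative, not the tool's actual code.

```python
# Scores copied from the table above (rank shown in parentheses there).
human_eval = {
    "Reference":        {"Fluency": 3.96, "Answer Relevance": 3.73, "Faithfulness": 3.39, "Win Rate": 64.56},
    "GPT-3.5 (Turbo)":  {"Fluency": 3.96, "Answer Relevance": 3.75, "Faithfulness": 2.78, "Win Rate": 60.78},
    "Mistral":          {"Fluency": 3.91, "Answer Relevance": 3.57, "Faithfulness": 2.70, "Win Rate": 51.89},
    "Llama (13b-chat)": {"Fluency": 3.64, "Answer Relevance": 3.19, "Faithfulness": 3.01, "Win Rate": 37.67},
}

def rank_sums(scores: dict[str, dict[str, float]]) -> dict[str, int]:
    """Sum each model's dense rank over all metrics (higher score = better rank)."""
    metrics = next(iter(scores.values())).keys()
    totals = {model: 0 for model in scores}
    for metric in metrics:
        # Unique metric values, best first; tied models share the same rank.
        ordered = sorted({s[metric] for s in scores.values()}, reverse=True)
        for model, s in scores.items():
            totals[model] += ordered.index(s[metric]) + 1
    return totals

print(rank_sums(human_eval))
# {'Reference': 5, 'GPT-3.5 (Turbo)': 7, 'Mistral': 12, 'Llama (13b-chat)': 13}
```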
| Model | Recall | Rouge-L | Bert-KPrec | Answerability (Accuracy) | Extractiveness | Length | Rank Sum |
|---|---|---|---|---|---|---|---|
| Reference | 0.8 (1) | 0.8 (1) | 0.34 (3) | 1 (1) | 0.34 (1) | 203.97 (4) | 11 (1) |
| GPT-3.5 (Turbo) | 0.5 (3) | 0.35 (2) | 0.41 (1) | 0.92 (2) | 0.28 (3) | 415.78 (2) | 13 (2) |
| Llama (13b-chat) | 0.54 (2) | 0.3 (4) | 0.34 (3) | 0.91 (3) | 0.32 (2) | 485.4 (1) | 15 (3) |
| Mistral | 0.48 (4) | 0.32 (3) | 0.39 (2) | 0.89 (4) | 0.27 (4) | 410.24 (3) | 20 (4) |
- Majority of annotators selected different values for a given metric.
- Majority of annotators selected the same value for a given metric.
- Majority of annotators selected the same value for a given metric, and the most common and second most common values were less than 2 units apart.
- All annotators selected the same value for a given metric.
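These four levels can be assigned mechanically from the raw per-metric votes. The sketch below is an assumption about how such a classification could work, not the tool's actual logic: it counts the votes, checks whether a strict majority or all annotators picked one value, and measures the gap between the two most common values.

```python
from collections import Counter

def agreement_level(ratings: list[int]) -> str:
    """Classify annotator agreement for one metric into the four levels above."""
    counts = Counter(ratings).most_common()
    top_value, top_count = counts[0]
    if top_count == len(ratings):
        return "All annotators selected the same value"
    if top_count <= len(ratings) / 2:
        # No single value has a strict majority.
        return "Majority of annotators selected different values"
    if len(counts) > 1 and abs(top_value - counts[1][0]) < 2:
        return "Majority selected the same value; top two values less than 2 units apart"
    return "Majority of annotators selected the same value"

print(agreement_level([4, 4, 4]))  # all annotators agree
print(agreement_level([4, 4, 3]))  # majority agree, top two values 1 unit apart
print(agreement_level([4, 4, 1]))  # majority agree
print(agreement_level([2, 3, 4]))  # no majority
```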