- # of evaluations
- # of annotators
- Metrics: Metric A, Metric B, Metric C
- Models: Model A, Model B
- Duration
[Interactive predictions table with sortable columns: Task, Targets, GPT-3.5 (Turbo) prediction, Llama (13b-chat) prediction, Mistral prediction, Reference prediction.]
| Cohen's kappa | Interpretation |
|---|---|
| 0 | No agreement |
| 0.10-0.20 | Slight agreement |
| 0.21-0.40 | Fair agreement |
| 0.41-0.60 | Moderate agreement |
| 0.61-0.80 | Substantial agreement |
| 0.81-0.99 | Near perfect agreement |
| 1 | Perfect agreement |
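These bands can be applied programmatically once kappa has been computed. Below is a minimal Python sketch, not part of the original tooling, that computes Cohen's kappa for two annotators with `scikit-learn` (assumed to be available) and maps the result onto the interpretation labels from the table; the example ratings are made up.

```python
from sklearn.metrics import cohen_kappa_score

def interpret_kappa(kappa: float) -> str:
    """Map a kappa value onto the agreement bands from the table above."""
    if kappa >= 1.0:
        return "Perfect agreement"
    if kappa >= 0.81:
        return "Near perfect agreement"
    if kappa >= 0.61:
        return "Substantial agreement"
    if kappa >= 0.41:
        return "Moderate agreement"
    if kappa >= 0.21:
        return "Fair agreement"
    if kappa >= 0.10:
        return "Slight agreement"
    return "No agreement"

# Hypothetical ratings from two annotators on the same set of responses.
annotator_a = [3, 4, 2, 5, 4, 3, 1, 4]
annotator_b = [3, 4, 3, 5, 4, 2, 1, 4]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f} -> {interpret_kappa(kappa)}")
```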
| Model | Fluency | Answer Relevance | Faithfulness | Win Rate (Head-to-Head) | Rank Sum |
|---|---|---|---|---|---|
| Reference | 3.96 ± 0.16 (1) | 3.73 ± 0.33 (2) | 3.39 ± 0.31 (1) | 64.56 ± 22.74 (1) | 5 (1) |
| GPT-3.5 (Turbo) | 3.96 ± 0.14 (1) | 3.75 ± 0.34 (1) | 2.78 ± 0.31 (3) | 60.78 ± 20.48 (2) | 7 (2) |
| Mistral | 3.91 ± 0.16 (2) | 3.57 ± 0.38 (3) | 2.70 ± 0.36 (4) | 51.89 ± 18.13 (3) | 12 (3) |
| Llama (13b-chat) | 3.64 ± 0.38 (3) | 3.19 ± 0.42 (4) | 3.01 ± 0.38 (2) | 37.67 ± 17.12 (4) | 13 (4) |
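The final column follows from the per-metric ranks shown in parentheses: each model is ranked on every metric and the ranks are summed, so a lower total indicates stronger overall performance. Below is a minimal Python sketch of that aggregation under those assumptions; the dictionary layout and function name are illustrative, not the tool's actual code.

```python
# Scores copied from the table above (rank shown in parentheses there).
human_eval = {
    "Reference":        {"Fluency": 3.96, "Answer Relevance": 3.73, "Faithfulness": 3.39, "Win Rate": 64.56},
    "GPT-3.5 (Turbo)":  {"Fluency": 3.96, "Answer Relevance": 3.75, "Faithfulness": 2.78, "Win Rate": 60.78},
    "Mistral":          {"Fluency": 3.91, "Answer Relevance": 3.57, "Faithfulness": 2.70, "Win Rate": 51.89},
    "Llama (13b-chat)": {"Fluency": 3.64, "Answer Relevance": 3.19, "Faithfulness": 3.01, "Win Rate": 37.67},
}

def rank_sums(scores: dict[str, dict[str, float]]) -> dict[str, int]:
    """Sum each model's dense rank over all metrics (higher score = better rank)."""
    metrics = next(iter(scores.values())).keys()
    totals = {model: 0 for model in scores}
    for metric in metrics:
        # Unique metric values, best first; tied models share the same rank.
        ordered = sorted({s[metric] for s in scores.values()}, reverse=True)
        for model, s in scores.items():
            totals[model] += ordered.index(s[metric]) + 1
    return totals

print(rank_sums(human_eval))
# {'Reference': 5, 'GPT-3.5 (Turbo)': 7, 'Mistral': 12, 'Llama (13b-chat)': 13}
```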
| Model | Recall | Rouge-L | Bert-KPrec | Answerability (Accuracy) | Extractiveness | Length | Rank Sum |
|---|---|---|---|---|---|---|---|
| Reference | 0.8 (1) | 0.8 (1) | 0.34 (3) | 1 (1) | 0.34 (1) | 203.97 (4) | 11 (1) |
| GPT-3.5 (Turbo) | 0.5 (3) | 0.35 (2) | 0.41 (1) | 0.92 (2) | 0.28 (3) | 415.78 (2) | 13 (2) |
| Llama (13b-chat) | 0.54 (2) | 0.3 (4) | 0.34 (3) | 0.91 (3) | 0.32 (2) | 485.4 (1) | 15 (3) |
| Mistral | 0.48 (4) | 0.32 (3) | 0.39 (2) | 0.89 (4) | 0.27 (4) | 410.24 (3) | 20 (4) |
- Majority of annotators selected different values for a given metric.
- Majority of annotators selected the same value for a given metric.
- Majority of annotators selected the same value for a given metric, and the most common and second most common values were less than 2 units apart.
- All annotators selected the same value for a given metric.
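These four levels can be assigned mechanically from the raw per-metric votes. The sketch below is an assumption about how such a classification could work, not the tool's actual logic: it counts the votes, checks whether a strict majority or all annotators picked one value, and measures the gap between the two most common values.

```python
from collections import Counter

def agreement_level(ratings: list[int]) -> str:
    """Classify annotator agreement for one metric into the four levels above."""
    counts = Counter(ratings).most_common()
    top_value, top_count = counts[0]
    if top_count == len(ratings):
        return "All annotators selected the same value"
    if top_count <= len(ratings) / 2:
        # No single value has a strict majority.
        return "Majority of annotators selected different values"
    if len(counts) > 1 and abs(top_value - counts[1][0]) < 2:
        return "Majority selected the same value; top two values less than 2 units apart"
    return "Majority of annotators selected the same value"

print(agreement_level([4, 4, 4]))  # all annotators agree
print(agreement_level([4, 4, 3]))  # majority agree, top two values 1 unit apart
print(agreement_level([4, 4, 1]))  # majority agree
print(agreement_level([2, 3, 4]))  # no majority
```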