InspectorRAGet

Examples

RAG Model Performance (ClapNQ)

# of tasks

100

# of annotators

8

Metrics

The analytics platform supports three kinds of metrics

Go vs No-Go Rubric
Intuitive Rubric
Detailed Rubric

Reference

RougeL score based on word overlap

Extractiveness score based on passage overlap

Recall score based on word overlap

Length in characters

Bert-KPrec score based on word overlap

Accuracy score for predicted vs gold class for answerability

The response is coherent, natural, and not dismissive.

Annotator feedback explaining their score for fluency

The response provides an appropriate amount of useful information.

Annotator feedback explaining their score for answer relevance

The response is faithful to and grounded in the context.

Annotator feedback explaining their score for faithfulness

Number of times this model response is preferred over other model responses from the same task.

Models

GPT-3.5 (Turbo)

Llama (13b-chat)

Mistral

Reference
