G-Eval Prompt

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

The quality of texts generated by natural language generation (NLG) systems is hard to measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially

arxiv.org

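G-Eval is an evaluation method that uses GPT-4, rather than human annotators, as the judge. To use this method for evaluating LLMs, I collected the prompts that the paper publishes as examples. Below are the original prompts from the G-Eval paper, along with versions translated, in part, into Korean.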

1. Evaluate Coherence in the Summarization Task

1-1. Original prompt

You will be given one summary written for a news article.
Your task is to rate the summary on one metric.
Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:
Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby “the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to sentence to a coherent body of information about a topic.”

Evaluation Steps:
1. Read the news article carefully and identify the main topic and key points.
2. Read the summary and compare it to the news article. Check if the summary covers the main topic and key points of the news article, and if it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.

Example:
Source Text:
{{Document}}

Summary:
{{Summary}}

Evaluation Form (scores ONLY):
- Coherence:
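
The `{{Document}}` and `{{Summary}}` markers in the prompt above are slots to be filled for each example before the prompt is sent to the model. The following is a minimal sketch of that substitution step, not code from the paper: `COHERENCE_TEMPLATE` is an abbreviated stand-in for the full prompt, and `fill_template` and the sample strings are hypothetical names and data chosen for illustration.

```python
import re

# Abbreviated stand-in for the full prompt above (assumption: in practice
# you would paste the complete prompt text here).
COHERENCE_TEMPLATE = """Source Text:
{{Document}}

Summary:
{{Summary}}

Evaluation Form (scores ONLY):
- Coherence:"""

# Matches placeholders such as {{Document}} and captures the name inside.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace every {{Name}} placeholder with values["Name"]."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            # Fail loudly instead of sending a half-filled prompt.
            raise KeyError(f"no value supplied for placeholder {name!r}")
        return values[name]

    # A function replacement keeps backslashes in the values literal.
    return PLACEHOLDER.sub(substitute, template)


# Made-up sample data, for illustration only.
prompt = fill_template(
    COHERENCE_TEMPLATE,
    {
        "Document": "The council approved the new budget on Monday.",
        "Summary": "Council approves budget.",
    },
)
print(prompt)
```

Raising on a missing placeholder is a deliberate choice here: a prompt that still contains a literal `{{Summary}}` would otherwise be scored by the model without any warning.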

1-2. Korean translation

뉴스 기사 요약문이 하나 주어집니다.
당신의 과제는 이 요약문을 하나의 기준에 따라 평가하는 것입니다.
아래 설명을 잘 읽고 이해한 뒤, 평가하는 동안에도 이 문서를 참고해 주세요.

평가 기준:
일관성(Coherence, 1~5점)
요약문 전체의 문장들이 잘 이어지고 구조가 잘 짜여 있는지를 평가합니다.
단순히 관련 정보가 나열된 것이 아니라, 문장 간 연결이 자연스럽고
하나의 주제를 중심으로 잘 조직된 내용을 전달하는지 확인해 주세요.
(DUC 평가 기준에 기반)

평가 방법:
1. 뉴스 원문을 먼저 읽고, 주요 주제와 핵심 내용을 파악합니다.
2. 요약문을 읽고, 원문의 주요 내용이 잘 담겨 있는지, 그리고 논리적인 순서로 잘 설명되어 있는지를 확인합니다.
3. 일관성(Coherence)을 기준으로 1점부터 5점까지 점수를 매겨 주세요. (1점은 매우 부족함, 5점은 매우 우수함)

예시:
원문:
{{Document}}

요약:
{{Summary}}

평가 항목 (점수만 작성):
일관성(Coherence):

2. Evaluate Engagingness in the Dialogue Generation Task

2-1. Original prompt

You will be given a conversation between two individuals. You will then be given one potential response for the next turn in the conversation. The response concerns an interesting fact, which will be provided as well.
Your task is to rate the responses on one metric.
Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:
Engagingness (1-3) Is the response dull/interesting?

- A score of 1 (dull) means that the response is generic and dull.
- A score of 2 (somewhat interesting) means the response is somewhat interesting and could engage you in the conversation (e.g., an opinion, thought)
- A score of 3 (interesting) means the response is very interesting or presents an interesting fact

Evaluation Steps:
1. Read the conversation, the corresponding fact and the response carefully.
2. Rate the response on a scale of 1-3 for engagingness, according to the criteria above.
3. Provide a brief explanation for your rating, referring to specific aspects of the response and the conversation.

Example:
Conversation History:
{{Document}}

Corresponding Fact:
{{Fact}}

Response:
{{Response}}

Evaluation Form (scores ONLY):
- Engagingness:
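
Because the evaluation form asks for a score only, the model's reply has to be turned back into a number. The sketch below shows one way to do that for the 1-3 Engagingness scale. It is an illustration under stated assumptions, not the paper's implementation: `parse_score` and `aggregate` are hypothetical helpers, and averaging over several sampled replies is assumed here as a simple stand-in for the probability-weighted scoring that, as I understand it, the paper describes.

```python
import re
from statistics import mean
from typing import Optional


def parse_score(reply: str, low: int = 1, high: int = 3) -> Optional[int]:
    """Return the first integer in reply that lies in [low, high], else None."""
    for token in re.findall(r"\d+", reply):
        value = int(token)
        if low <= value <= high:
            return value
    return None


def aggregate(replies: list[str], low: int = 1, high: int = 3) -> Optional[float]:
    """Average the valid scores found in several sampled replies.

    Replies without a usable score are skipped; None is returned when
    no reply contains one.
    """
    parsed = (parse_score(reply, low, high) for reply in replies)
    scores = [score for score in parsed if score is not None]
    return mean(scores) if scores else None


# Made-up replies, for illustration only.
replies = ["- Engagingness: 3", "2", "Engagingness: 3", "n/a"]
print(aggregate(replies))
```

For the Coherence prompt, the same helpers would be called with `low=1, high=5`. Note that a decimal reply such as `2.5` is read as `2` by this sketch, so a stricter parser may be preferable when fractional answers are expected.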

2-2. Korean translation

두 사람 간의 대화가 주어집니다.
그 다음, 대화의 다음 차례로 이어질 수 있는 하나의 응답이 제시됩니다.
이 응답은 흥미로운 사실과 관련되어 있으며, 해당 사실도 함께 제공됩니다.

당신의 과제는 이 응답을 하나의 기준에 따라 평가하는 것입니다.
아래 설명을 주의 깊게 읽고, 평가하는 동안에도 이 문서를 열어두고 필요할 때마다 참고해 주세요.

평가 기준:
흥미도(Engagingness, 1~3점) - 응답이 얼마나 흥미로운가?

- 1점 (지루함): 응답이 일반적이고 지루하며 대화에 큰 기여를 하지 않음
- 2점 (다소 흥미로움): 약간 흥미롭고, 대화를 이어갈 수 있을 정도의 의견이나 생각을 포함
- 3점 (매우 흥미로움): 매우 흥미롭거나, 흥미로운 사실을 새롭게 제시함

평가 방법:
1. 대화 내용, 관련된 사실, 응답을 모두 주의 깊게 읽습니다.
2. 위 기준에 따라 응답의 흥미도를 1~3점 사이에서 평가합니다.
3. 평가한 이유를 간단히 작성합니다. 응답과 대화 내용의 구체적인 부분을 언급해 주세요.

예시:
대화 내용:
{{Document}}

관련 사실:
{{Fact}}

응답:
{{Response}}

평가 항목 (점수만 작성):
흥미도(Engagingness):
