G-Eval Prompt

Paper: G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (arxiv.org)

From the abstract: "The quality of texts generated by natural language generation (NLG) systems is hard to measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially ..."

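G-Eval is an evaluation method that uses GPT-4 as the judge instead of human annotators. To apply it to LLM evaluation, I collected the prompts that the paper publishes as examples. Below are the original prompts from the G-Eval paper, each followed by a Korean translation of part of the prompt.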

1. Evaluate Coherence in the Summarization Task

1-1. Original

You will be given one summary written for a news article.

Your task is to rate the summary on one metric.

Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:

Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby “the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to sentence to a coherent body of information about a topic.”

Evaluation Steps:

1. Read the news article carefully and identify the main topic and key points.

2. Read the summary and compare it to the news article. Check if the summary covers the main topic and key points of the news article, and if it presents them in a clear and logical order.

3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.

Example:

Source Text:

Summary:

Evaluation Form (scores ONLY):

- Coherence:
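The template above leaves "Source Text:" and "Summary:" empty so that each article and summary pair can be dropped in before the prompt is sent. A minimal sketch of that step follows. The `{source}` and `{summary}` placeholders and the function name `fill_coherence_prompt` are assumptions made for illustration, not something the paper defines, and the call to the model itself is left out.

```python
# Example section of the coherence prompt, copied from the template above.
# The {source} and {summary} placeholders are an assumption of this sketch.
EXAMPLE_SECTION = """Example:

Source Text:

{source}

Summary:

{summary}

Evaluation Form (scores ONLY):

- Coherence:"""


def fill_coherence_prompt(instructions: str, source: str, summary: str) -> str:
    """Join the fixed instructions with the example section filled in for one item."""
    example = EXAMPLE_SECTION.format(source=source.strip(), summary=summary.strip())
    return instructions.rstrip() + "\n\n" + example
```

Here `instructions` would hold everything from "You will be given one summary..." through the evaluation steps. Ending the prompt on "- Coherence:" nudges the model to reply with the score alone.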

1-2. Korean translation

평가 기준:

일관성 (1-5) - 모든 문장의 종합적인 품질. 이 차원은 구조와 일관성에 관한 DUC 품질 질문에 맞춘 것으로, 그 내용은 "요약은 잘 구조화되고 잘 조직되어야 합니다. 요약은 단순히 관련 정보를 쌓아 놓은 더미가 아니라, 문장에서 문장으로 이어지며 하나의 주제에 대한 일관된 정보 체계를 이루어야 합니다" 입니다.

평가 단계:

1. 뉴스 기사를 주의 깊게 읽고 주제와 주요 요점을 확인합니다.

2. 요약을 읽고 뉴스 기사와 비교합니다. 요약이 뉴스 기사의 주제와 주요 요점을 다루고 있는지, 명확하고 논리적인 순서로 제시되었는지 확인합니다.

3. 평가 기준에 따라 일관성 점수를 1부터 5까지의 척도로 매깁니다. 여기서 1은 가장 낮은 점수이고 5는 가장 높은 점수입니다.

예시:

원문:

요약:

평가 양식 (점수만):

- 일관성:
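With the translated form, the model may echo the Korean label ("- 일관성: 4") instead of returning a bare number. A small label-aware parser handles both the English and the Korean forms. This is only a sketch: the function name `extract_labeled_score` is hypothetical, and it assumes the reply contains the label followed by a colon and an integer.

```python
import re
from typing import Optional


def extract_labeled_score(reply: str, label: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Find '<label>: <integer>' in the reply and return the integer if it is in range."""
    match = re.search(re.escape(label) + r"\s*:\s*(\d+)", reply)
    if match is None:
        return None  # label not present, or no integer after it
    value = int(match.group(1))
    return value if low <= value <= high else None
```

For example, `extract_labeled_score("- 일관성: 4", "일관성")` and `extract_labeled_score("- Coherence: 4", "Coherence")` return the same score, while an out-of-range value such as 7 is rejected.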

2. Evaluate Engagingness in the Dialogue Generation Task

2-1. Original

You will be given a conversation between two individuals. You will then be given one potential response for the next turn in the conversation. The response concerns an interesting fact, which will be provided as well.

Your task is to rate the responses on one metric.

Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:

Engagingness (1-3) Is the response dull/interesting?

- A score of 1 (dull) means that the response is generic and dull.

- A score of 2 (somewhat interesting) means the response is somewhat interesting and could engage you in the conversation (e.g., an opinion, thought)

- A score of 3 (interesting) means the response is very interesting or presents an interesting fact

Evaluation Steps:

1. Read the conversation, the corresponding fact and the response carefully.

2. Rate the response on a scale of 1-3 for engagingness, according to the criteria above.

3. Provide a brief explanation for your rating, referring to specific aspects of the response and the conversation.

Example:

Conversation History:

Corresponding Fact:

Response:

Evaluation Form (scores ONLY):

- Engagingness:
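Because the form asks for a single integer from 1 to 3, many responses end up with the same score. The G-Eval paper addresses this by weighting each candidate score by its probability. When token probabilities are not available, one option is to sample the same prompt several times and average the results. The sketch below covers only that aggregation step, assuming the replies have already been parsed into integers. The function name `expected_score` is hypothetical, and how the samples are drawn is left out.

```python
from collections import Counter
from typing import Iterable


def expected_score(scores: Iterable[int], low: int = 1, high: int = 3) -> float:
    """Estimate sum(p(s) * s) from sampled integer scores, ignoring out-of-range values."""
    counts = Counter(s for s in scores if low <= s <= high)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid scores to aggregate")
    # Each score is weighted by its empirical frequency among the valid samples.
    return sum(score * n / total for score, n in counts.items())
```

The result is a continuous value between `low` and `high`, which separates responses that would otherwise tie on the integer scale.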

2-2. Korean translation

두 사람 사이의 대화가 주어집니다. 그런 다음 대화의 다음 차례에 올 수 있는 응답 하나가 주어집니다. 이 응답은 흥미로운 사실 하나와 관련되어 있으며, 그 사실도 함께 제공됩니다.

여러분의 과제는 하나의 지표로 응답을 평가하는 것입니다.

이 지침을 주의 깊게 읽고 이해하십시오. 검토하는 동안 이 문서를 열어두고 필요할 때 참조하십시오.

평가 기준:

흥미로움 (1-3) 응답이 지루한가, 흥미로운가?

- 1점 (지루함)은 응답이 일반적이고 지루함을 의미합니다.

- 2점 (다소 흥미로움)은 응답이 다소 흥미롭고 대화에 참여하게 만들 수 있음을 의미합니다 (예: 의견, 생각).

- 3점 (흥미로움)은 응답이 매우 흥미롭거나 흥미로운 사실을 제시함을 의미합니다.

평가 단계:

1. 대화, 해당 사실 및 응답을 주의 깊게 읽습니다.

2. 위의 기준에 따라 흥미로움을 1-3점 척도로 평가합니다.

3. 응답과 대화의 구체적인 측면을 참조하여 등급에 대한 간단한 설명을 제공합니다.

예시:

대화 내용:

해당 사실:

응답: