Evaluation method?

#3
by arthurkim - opened

https://huggingface.co/nayohan/llama3-instrucTrans-enko-8b#%EB%AA%A8%EB%8D%B8-%ED%8F%89%EA%B0%80%EB%B0%A9%EB%B2%95

Is there by any chance code available for evaluating the translation output of the instrucTrans model?

Is the evaluation based on similarity between texts?
For example, in the case below, would you evaluate the similarity between ko_ref and the InstrucTrans output?

"en_ref":"This controversy arose around a new advertisement for the latest iPad Pro that Apple released on YouTube on the 7th. The ad shows musical instruments, statues, cameras, and paints being crushed in a press, followed by the appearance of the iPad Pro in their place. It appears to emphasize the new iPad Pro's artificial intelligence features, advanced display, performance, and thickness. Apple mentioned that the newly unveiled iPad Pro is equipped with the latest 'M4' chip and is the thinnest device in Apple's history. The ad faced immediate backlash upon release, as it graphically depicts objects symbolizing creators being crushed. Critics argue that the imagery could be interpreted as technology trampling on human creators. Some have also voiced concerns that it evokes a situation where creators are losing ground due to AI."
"ko_ref":"이번 논란은 애플이 지난 7일 유튜브에 공개한 신형 아이패드 프로 광고를 둘러싸고 불거졌다. 해당 광고 영상은 악기와 조각상, 카메라, 물감 등을 압착기로 짓누른 뒤 그 자리에 아이패드 프로를 등장시키는 내용이었다. 신형 아이패드 프로의 인공지능 기능들과 진화된 디스플레이와 성능, 두께 등을 강조하기 위한 취지로 풀이된다. 애플은 이번에 공개한 아이패드 프로에 신형 'M4' 칩이 탑재되며 두께는 애플의 역대 제품 중 가장 얇다는 설명도 덧붙였다. 광고는 공개 직후 거센 비판에 직면했다. 창작자를 상징하는 물건이 짓눌려지는 과정을 지나치게 적나라하게 묘사한 점이 문제가 됐다. 기술이 인간 창작자를 짓밟는 모습을 묘사한 것으로 해석될 여지가 있다는 문제의식이다. 인공지능(AI)으로 인해 창작자가 설 자리가 줄어드는 상황을 연상시킨다는 목소리도 나왔다."
"InstrucTrans":"이번 논란은 애플이 지난 7일 유튜브에 공개한 최신 아이패드 프로 광고를 중심으로 불거졌다. 이 광고는 악기, 조각상, 카메라, 물감 등을 누르기 시작하는 장면과 함께 그 자리에 아이패드 프로가 등장하는 장면을 보여준다. 이는 새로운 아이패드 프로의 인공지능 기능, 고급 디스플레이, 성능, 두께를 강조하는 것으로 보인다. 애플은 이번에 공개한 아이패드 프로에 최신 'M4' 칩이 탑재됐으며, 애플 역사상 가장 얇은 기기라고 언급했다. 이 광고는 출시하자마자 크리에이터를 상징하는 물건이 파쇄되는 장면이 그대로 그려져 논란이 되고 있다. 비평가들은 이 이미지가 기술이 인간 크리에이터를 짓밟는다는 의미로 해석될 수 있다고 주장한다. 또한 AI로 인해 크리에이터들이 밀리고 있다는 상황을 연상시킨다는 우려의 목소리도 나온다."
Owner

Hello, and apologies for the late reply. For evaluation, we computed SacreBLEU between ko_ref and the model predictions.
I'm sharing the inference code and evaluation code used in the experiments. Thank you.

Owner

python inference_translation_eeve.py -g 3 -d "eval_dataset/flores.csv" -m "yanolja/EEVE-Korean-Instruct-10.8B-v1.0"
python inference_translation_seagull.py -g 3 -d "eval_dataset/flores.csv" -m "kuotient/Seagull-13b-translation"
python inference_translation_kullm.py -g 3 -d "eval_dataset/flores.csv" -m "nlpai-lab/KULLM3"
python inference_translation_synatra.py -g 3 -d "eval_dataset/flores.csv" -m "maywell/Synatra-7B-v0.3-Translation"

# python inference_translation_base.py
import os
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model_name", type=str, default="meta-llama/Meta-Llama-3-8B-Instruct")
parser.add_argument("-d", "--dataset_path", type=str,  default="gemini/ko-eng-dataset.csv")
parser.add_argument("-g", "--gpu_id", type=int,  default=0)
args = parser.parse_args()
print(args)
os.environ["CUDA_VISIBLE_DEVICES"]=str(args.gpu_id)

import torch
import evaluate
import pandas as pd

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(args.model_name)
# tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    args.model_name,
    # device_map="auto",
    torch_dtype=torch.bfloat16,
).to('cuda')
model.eval()

def apply_template(example):
    SYSTEM_PROMPT = "당신은 번역기입니다. 영어를 한국어로 번역하세요."  # "You are a translator. Translate English into Korean." (ours)
    conversation = {"messages": [
                        {'role': 'system', 'content': SYSTEM_PROMPT},
                        {'role': 'user', 'content':example["en_ref"]}
                    ]}
    return conversation

# datasets
tc_dataset = load_dataset("csv", data_files=args.dataset_path, split="train")
dataset = tc_dataset.map(apply_template, remove_columns=tc_dataset.features, batched=False, num_proc=64)
print(dataset)

# inference
output_list = []
for idx, data in enumerate(dataset):
    inputs = tokenizer.apply_chat_template(data['messages'],tokenize=True, add_generation_prompt=True, return_tensors='pt').to("cuda")
    # print(tokenizer.batch_decode(inputs))
    outputs = model.generate(inputs,
                             pad_token_id=tokenizer.eos_token_id,
                             max_new_tokens=512)
    output_decode = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
    print(f'{idx}:',output_decode)
    output_list.append(output_decode)

df = pd.DataFrame(tc_dataset)
df['ko_pred']=output_list
df = df[['ko_pred', 'ko_ref', 'en_ref', 'source']]

model_name = args.model_name.split('/')[-1]
output_path = 'inference_' + args.dataset_path.split('.')[-2]
print(output_path)
os.makedirs(output_path, exist_ok=True)
df.to_json(f'{output_path}/{model_name}_eval_result.json', lines=True, orient='records', force_ascii=False)
Owner
β€’
edited Jun 21

python eval_translation.py -i inference_eval_dataset/ko_news_eval40/nllb-finetuned-en2ko_eval_result.json
python eval_translation.py -i inference_eval_dataset/ko_news_eval40/EEVE-Korean-Instruct-10.8B-v1.0_eval_result.json
python eval_translation.py -i inference_eval_dataset/ko_news_eval40/Synatra-7B-v0.3-Translation_eval_result.json
python eval_translation.py -i inference_eval_dataset/ko_news_eval40/KULLM3_eval_result.json

# python eval_translation.py
import os
import argparse
import evaluate
import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("-i", "--inference_path", type=str,  default="result/nayohanllama3-8b-it-translation-271k_eval_result.json")
args = parser.parse_args()
print(args)

# evaluate sacrebleu
metric = evaluate.load("sacrebleu")
def compute_metrics(eval_preds):
    decoded_preds, decoded_labels = eval_preds
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    result = {k: round(v, 2) for k, v in result.items()}
    return result

# eval result to json
df = pd.read_json(args.inference_path, lines=True, orient='records')
result = []
for source in df['source'].unique():
    df_source = df[df['source']==source].reset_index(drop=True)
    eval_preds = [df_source['ko_pred'], df_source['ko_ref']]
    eval_result = compute_metrics(eval_preds)
    # print(eval_result)
    eval_result['source'] = source
    result.append(eval_result)

output_df = pd.DataFrame(result, columns=['source', 'bleu'])
output_df = output_df.sort_values(by=['source'])
print(output_df)
output_path = '/'.join(args.inference_path.split('/')[:-1]) + '/eval'
output_file = args.inference_path.split('/')[-1]
os.makedirs(output_path, exist_ok=True)
output_df.to_json(f'{output_path}/{output_file}', lines=True, orient='records', force_ascii=False)
Owner

make_eval_dataset.py

import pandas as pd
from datasets import load_dataset

# flores
eval_dataset = load_dataset('traintogpb/aihub-flores-koen-integrated-sparta-30k')
df = pd.DataFrame(eval_dataset['test'])
df = df.drop('ko_ref_xcomet', axis=1)
df.to_csv('eval_dataset/flores.csv', index=False)

# iwslt2023
iwlst_en_ko_ban = load_dataset('shreevigneshs/iwslt-2023-en-ko-train-val-split-0.1', split='f_test')
iwlst_en_ko_zon = load_dataset('shreevigneshs/iwslt-2023-en-ko-train-val-split-0.1', split='if_test')

df = iwlst_en_ko_ban.to_pandas()
df = df[["en", "ko"]]
df.columns=["en_ref", "ko_ref"]
df['source'] = 'iwlst_en_ko_ban'
df.to_csv('iwlst_en_ko_banmal.csv', index=False)#, encoding='utf-8-sig')
print(df)

df = iwlst_en_ko_zon.to_pandas()
df = df[["en", "ko"]]
df.columns=["en_ref", "ko_ref"]
df['source'] = 'iwlst_en_ko_zon'
df.to_csv('iwlst_en_ko_zondae.csv', index=False)#, encoding='utf-8-sig')
print(df)
