Chenxi Whitehouse committed on
Commit
eaaaf3d
1 Parent(s): 4400462

add src files

README.md CHANGED
@@ -47,7 +47,7 @@ The training and dev dataset can be found under [data](https://huggingface.co/ch

## Reproduce the baseline

- Below are the steps to reproduce the baseline results. The main difference from the reported results in the paper is that, instead of requiring direct access to the paid Google Search API, we provide such search results for up to 1000 URLs per claim using different queries, and the scraped text as a knowledge store for retrieval for each claim. This is aimed at reducing the overhead cost of participating in the Shared Task.
+ Below are the steps to reproduce the baseline results. The main difference from the results reported in the paper is that, instead of requiring direct access to the paid Google Search API, we provide the search results for up to 1000 URLs per claim (collected with different queries), together with the scraped text, as a knowledge store for retrieval for each claim. This is aimed at reducing the overhead cost of participating in the Shared Task. Another difference is that the knowledge store now also includes text scraped from PDF URLs.


### 0. Set up environment
@@ -93,28 +93,35 @@ python -m src.reranking.bm25_sentences
```

### 3. Generate question-answer pairs for the top sentences
- We use [BLOOM](https://huggingface.co/bigscience/bloom-7b1) to generate QA paris for each of the top 100 sentence, providing 10 closest claim-QA-pairs from the training set as in-context examples. See [question_generation_top_sentences.py](https://huggingface.co/chenxwh/AVeriTeC/blob/main/src/reranking/question_generation_top_sentences.py) for more argument options. We provide the output file for this step on the dev set [here](https://huggingface.co/chenxwh/AVeriTeC/blob/main/data_store/ddev_top_k_qa.json).
+ We use [BLOOM](https://huggingface.co/bigscience/bloom-7b1) to generate QA pairs for each of the top 100 sentences, providing the 10 closest claim-QA pairs from the training set as in-context examples (see the sketch after this diff). See [question_generation_top_sentences.py](https://huggingface.co/chenxwh/AVeriTeC/blob/main/src/reranking/question_generation_top_sentences.py) for more argument options. We provide the output file for this step on the dev set [here](https://huggingface.co/chenxwh/AVeriTeC/blob/main/data_store/dev_top_k_qa.json).
```bash
python -m src.reranking.question_generation_top_sentences
```

### 4. Rerank the QA pairs
- Using a pre-trained BERT model [bert_dual_encoder.ckpt](https://huggingface.co/chenxwh/AVeriTeC/blob/main/pretrained_models/bert_dual_encoder.ckpt), we rerank the QA paris and keep top 3 QA paris as evidence. We provide the output file for this step on the dev set [here]().
+ Using a pre-trained BERT model [bert_dual_encoder.ckpt](https://huggingface.co/chenxwh/AVeriTeC/blob/main/pretrained_models/bert_dual_encoder.ckpt), we rerank the QA pairs and keep the top 3 pairs as evidence. See [rerank_questions.py](https://huggingface.co/chenxwh/AVeriTeC/blob/main/src/reranking/rerank_questions.py) for more argument options. We provide the output file for this step on the dev set [here](https://huggingface.co/chenxwh/AVeriTeC/blob/main/data_store/dev_top_3_rerank_qa.json).
```bash
+ python -m reranking.rerank_questions
```


### 5. Veracity prediction
- Finally, given a claim and its 3 QA pairs as evidence, we use another pre-trained BERT model [bert_veracity.ckpt](https://huggingface.co/chenxwh/AVeriTeC/blob/main/pretrained_models/bert_veracity.ckpt) to predict the veracity label. The pre-trained model is provided . We provide the prediction file for this step on the dev set [here]().
+ Finally, given a claim and its top 3 QA pairs as evidence, we use another pre-trained BERT model [bert_veracity.ckpt](https://huggingface.co/chenxwh/AVeriTeC/blob/main/pretrained_models/bert_veracity.ckpt) to predict the veracity label. See [veracity_prediction.py](https://huggingface.co/chenxwh/AVeriTeC/blob/main/src/prediction/veracity_prediction.py) for more argument options. We provide the prediction file for this step on the dev set [here](https://huggingface.co/chenxwh/AVeriTeC/blob/main/data_store/dev_vericity_prediction.json).
```bash
+ python -m prediction.veracity_prediction
```
The results will be presented as follows:
- ```bash
+
+ ```
```

- We recommend using 0.25 as cut-off score for evaluating the relevance of the evidence. The result for dev and the test set below.
+ We recommend using 0.25 as the cut-off score for evaluating the relevance of the evidence. The results for the dev and test sets are shown below.

+ | Model             | Split | Q only | Q + A | Veracity @ 0.2 | @ 0.25 | @ 0.3 |
+ |-------------------|-------|--------|-------|----------------|--------|-------|
+ | AVeriTeC-BLOOM-7b | dev   |        |       |                |        |       |
+ | AVeriTeC-BLOOM-7b | test  |        |       |                |        |       |

## Citation
If you find AVeriTeC useful for your research and applications, please cite us using this BibTeX:
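Step 3 above relies on few-shot prompting: for each retrieved sentence, BLOOM is given the 10 closest claim-QA pairs from the training set as in-context examples and asked to produce a question-answer pair for the target claim. The generation script itself (question_generation_top_sentences.py) is not included in this commit, so the following is only a minimal sketch of that idea; the prompt template, decoding settings, and variable names are illustrative assumptions, not the repository's actual implementation.

```python
# Illustrative sketch only -- not the repository's question_generation_top_sentences.py.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "bigscience/bloom-7b1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)


def generate_qa_pair(claim, sentence, in_context_examples, max_new_tokens=64):
    """Generate one question-answer pair for (claim, sentence).

    in_context_examples is assumed to be a list of (claim, question, answer)
    triples, e.g. the 10 nearest training claims mentioned in the README.
    """
    # Build a few-shot prompt from the in-context examples, then append the new instance.
    prompt = ""
    for ex_claim, ex_question, ex_answer in in_context_examples:
        prompt += f"Claim: {ex_claim}\nQuestion: {ex_question}\nAnswer: {ex_answer}\n\n"
    prompt += f"Evidence: {sentence}\nClaim: {claim}\nQuestion:"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Return only the newly generated continuation (the question and, ideally, an answer).
    return tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```

In the pipeline, something along these lines would be applied to each of the top 100 sentences per claim to produce the dev_top_k_qa.json file linked in step 3.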
 
src/prediction/veracity_prediction.py ADDED
@@ -0,0 +1,121 @@
+ import argparse
+ import json
+ import tqdm
+ import torch
+ from transformers import BertTokenizer, BertForSequenceClassification
+ from data_loaders.SequenceClassificationDataLoader import (
+     SequenceClassificationDataLoader,
+ )
+ from models.SequenceClassificationModule import SequenceClassificationModule
+
+
+ LABEL = [
+     "Supported",
+     "Refuted",
+     "Not Enough Evidence",
+     "Conflicting Evidence/Cherrypicking",
+ ]
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser(
+         description="Given a claim and its 3 QA pairs as evidence, we use another pre-trained BERT model to predict the veracity label."
+     )
+     parser.add_argument(
+         "-i",
+         "--claim_with_evidence_file",
+         default="data/dev_top3_questions.json",
+         help="Json file with claim and top question-answer pairs as evidence.",
+     )
+     parser.add_argument(
+         "-o",
+         "--output_file",
+         default="data_store/dev_veracity.json",
+         help="Json file with the veracity predictions.",
+     )
+     parser.add_argument(
+         "-ckpt",
+         "--best_checkpoint",
+         type=str,
+         default="pretrained_models/bert_veracity.ckpt",
+     )
+     args = parser.parse_args()
+
+     with open(args.claim_with_evidence_file) as f:
+         examples = json.load(f)
+
+     bert_model_name = "bert-base-uncased"
+
+     tokenizer = BertTokenizer.from_pretrained(bert_model_name)
+     bert_model = BertForSequenceClassification.from_pretrained(
+         bert_model_name, num_labels=4, problem_type="single_label_classification"
+     )
+     device = "cuda:0" if torch.cuda.is_available() else "cpu"
+     trained_model = SequenceClassificationModule.load_from_checkpoint(
+         args.best_checkpoint, tokenizer=tokenizer, model=bert_model
+     ).to(device)
+
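+     # The data loader below is only used for its quadruple_to_string and
+     # tokenize_strings helpers, so the data_file argument is just a placeholder.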
+     dataLoader = SequenceClassificationDataLoader(
+         tokenizer=tokenizer,
+         data_file="this_is_discontinued",
+         batch_size=32,
+         add_extra_nee=False,
+     )
+
+     predictions = []
+
+     for example in tqdm.tqdm(examples):
+         example_strings = []
+         for evidence in example["evidence"]:
+             example_strings.append(
+                 dataLoader.quadruple_to_string(
+                     example["claim"], evidence["question"], evidence["answer"], ""
+                 )
+             )
+
+         if (
+             len(example_strings) == 0
+         ):  # If we found no evidence e.g. because google returned 0 pages, just output NEI.
+             example["label"] = "Not Enough Evidence"
+             continue
+
+         tokenized_strings, attention_mask = dataLoader.tokenize_strings(example_strings)
+         example_support = torch.argmax(
+             trained_model(tokenized_strings, attention_mask=attention_mask).logits,
+             axis=1,
+         )
+
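+         # Aggregate the per-evidence class predictions into one claim-level label:
+         # any prediction of class 2 or 3 forces Not Enough Evidence; otherwise
+         # all-supporting gives Supported, all-refuting gives Refuted, and a mix
+         # gives Conflicting Evidence/Cherrypicking.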
+         has_unanswerable = False
+         has_true = False
+         has_false = False
+
+         for v in example_support:
+             if v == 0:
+                 has_true = True
+             if v == 1:
+                 has_false = True
+             if v in (
+                 2,
+                 3,
+             ):  # TODO another hack -- we cant have different labels for train and test so we do this
+                 has_unanswerable = True
+
+         if has_unanswerable:
+             answer = 2
+         elif has_true and not has_false:
+             answer = 0
+         elif not has_true and has_false:
+             answer = 1
+         else:
+             answer = 3
+
+         json_data = {
+             "claim_id": example["claim_id"],
+             "claim": example["claim"],
+             "evidence": example["evidence"],
+             "label": LABEL[answer],
+         }
+         predictions.append(json_data)
+
+     with open(args.output_file, "w", encoding="utf-8") as output_file:
+         json.dump(predictions, output_file, ensure_ascii=False, indent=4)
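For reference, the argparse defaults above correspond to an invocation along these lines. This is only a sketch: it assumes the working directory makes the data_loaders and models packages importable (e.g. running from inside src/, matching the README's python -m prediction.veracity_prediction form). Note that the default input path differs from the rerank step's default output path, so -i will usually need to be set explicitly.

```bash
# Sketch of a run using the script's own defaults; adjust paths to your layout.
python -m prediction.veracity_prediction \
    -i data/dev_top3_questions.json \
    -o data_store/dev_veracity.json \
    -ckpt pretrained_models/bert_veracity.ckpt
```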
src/reranking/bm25_sentences.py CHANGED
@@ -30,7 +30,7 @@ def retrieve_top_k_sentences(query, document, urls, top_k):
if __name__ == "__main__":

    parser = argparse.ArgumentParser(
-         description="Get top 100 sentences for sentences in the knowledge store"
+         description="Get top 100 sentences with BM25 in the knowledge store."
    )
    parser.add_argument(
        "-k",
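This hunk only rewords the argparse description, but the function named in the hunk header, retrieve_top_k_sentences(query, document, urls, top_k), is what ranks knowledge-store sentences against the claim with BM25 before question generation. The following is a minimal sketch of such a function using the rank_bm25 package; the repository's actual tokenization, scoring, and return format may differ.

```python
# Illustrative sketch only -- not the repository's retrieve_top_k_sentences.
import numpy as np
from rank_bm25 import BM25Okapi


def retrieve_top_k_sentences(query, document, urls, top_k):
    """Rank the knowledge-store sentences against the claim and keep the top_k."""
    tokenized_docs = [sentence.lower().split() for sentence in document]
    bm25 = BM25Okapi(tokenized_docs)
    scores = bm25.get_scores(query.lower().split())
    # Highest-scoring sentences first, together with their source URLs.
    top_idx = np.argsort(scores)[::-1][:top_k]
    return [document[i] for i in top_idx], [urls[i] for i in top_idx]
```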
src/reranking/rerank_questions.py ADDED
@@ -0,0 +1,106 @@
+ import argparse
+ import json
+ import torch
+ import tqdm
+ from transformers import BertTokenizer, BertForSequenceClassification
+ from models.DualEncoderModule import DualEncoderModule
+
+
+ def triple_to_string(x):
+     return " </s> ".join([item.strip() for item in x])
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser(
+         description="Rerank the QA pairs and keep the top 3 QA pairs as evidence using a pre-trained BERT model."
+     )
+     parser.add_argument(
+         "-i",
+         "--top_k_qa_file",
+         default="data/dev_top_k_qa.json",
+         help="Json file with claim and top k generated question-answer pairs.",
+     )
+     parser.add_argument(
+         "-o",
+         "--output_file",
+         default="data/dev_top_3_rerank_qa.json",
+         help="Json file with the top 3 reranked questions.",
+     )
+     parser.add_argument(
+         "-ckpt",
+         "--best_checkpoint",
+         type=str,
+         default="pretrained_models/bert_dual_encoder.ckpt",
+     )
+     parser.add_argument(
+         "--top_n",
+         type=int,
+         default=3,
+         help="top_n question answer pairs as evidence to keep.",
+     )
+     args = parser.parse_args()
+
+     with open(args.top_k_qa_file) as f:
+         examples = json.load(f)
+
+     bert_model_name = "bert-base-uncased"
+
+     tokenizer = BertTokenizer.from_pretrained(bert_model_name)
+     bert_model = BertForSequenceClassification.from_pretrained(
+         bert_model_name, num_labels=2, problem_type="single_label_classification"
+     )
+     device = "cuda:0" if torch.cuda.is_available() else "cpu"
+     trained_model = DualEncoderModule.load_from_checkpoint(
+         args.best_checkpoint, tokenizer=tokenizer, model=bert_model
+     ).to(device)
+
+     with open(args.output_file, "w", encoding="utf-8") as output_file:
+         for example in tqdm.tqdm(examples):
+             strs_to_score = []
+             values = []
+
+             bm25_qau = example["bm25_qau"] if "bm25_qau" in example else []
+             claim = example["claim"]
+
+             for question, answer, url in bm25_qau:
+                 str_to_score = triple_to_string([claim, question, answer])
+
+                 strs_to_score.append(str_to_score)
+                 values.append([question, answer, url])
+
+             if len(bm25_qau) > 0:
+                 encoded_dict = tokenizer(
+                     strs_to_score,
+                     max_length=512,
+                     padding="longest",
+                     truncation=True,
+                     return_tensors="pt",
+                 ).to(device)
+
+                 input_ids = encoded_dict["input_ids"]
+                 attention_masks = encoded_dict["attention_mask"]
+
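+                 # Softmax over the two classes; the probability of class 1 (the
+                 # positive/relevant class) is used as the rerank score for each
+                 # claim-question-answer triple.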
+                 scores = torch.softmax(
+                     trained_model(input_ids, attention_mask=attention_masks).logits,
+                     axis=-1,
+                 )[:, 1]
+
+                 top_n = torch.argsort(scores, descending=True)[: args.top_n]
+                 evidence = [
+                     {
+                         "question": values[i][0],
+                         "answer": values[i][1],
+                         "url": values[i][2],
+                     }
+                     for i in top_n
+                 ]
+             else:
+                 evidence = []
+
+             json_data = {
+                 "claim_id": example["claim_id"],
+                 "claim": claim,
+                 "evidence": evidence,
+             }
+             output_file.write(json.dumps(json_data, ensure_ascii=False) + "\n")
+             output_file.flush()
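As with the other scripts, the argparse defaults above suggest an invocation like the one below; this is a sketch that assumes the models package is importable from the working directory (e.g. running from inside src/, matching the README's python -m reranking.rerank_questions form). Note that the output is written as one JSON object per line rather than a single JSON array.

```bash
# Sketch of a run using the script's own defaults; adjust paths to your layout.
python -m reranking.rerank_questions \
    -i data/dev_top_k_qa.json \
    -o data/dev_top_3_rerank_qa.json \
    -ckpt pretrained_models/bert_dual_encoder.ckpt \
    --top_n 3
```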
src/retrieval/scraper_for_knowledge_store.py CHANGED
@@ -46,7 +46,7 @@ def scrape_text_from_url(url, temp_name):

if __name__ == "__main__":

-     parser = argparse.ArgumentParser(description="Scraping text from URL")
+     parser = argparse.ArgumentParser(description="Scraping text from URLs.")
    parser.add_argument(
        "-i",
        "--tsv_input_file",
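Beyond this reworded description, the scraper itself is not shown in the commit. For orientation, scrape_text_from_url (named in the hunk header) fetches a URL and extracts its visible text for the knowledge store; the sketch below illustrates that idea using requests and BeautifulSoup as stand-ins, and omits the PDF handling mentioned in the README. It is not the repository's actual implementation.

```python
# Illustrative sketch only -- not the repository's scrape_text_from_url.
import requests
from bs4 import BeautifulSoup


def scrape_text_from_url(url, temp_name):
    """Fetch a URL and return its visible text.

    temp_name mirrors the real signature but is unused in this sketch
    (the actual script presumably uses it for temporary files, e.g. PDFs).
    """
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop script/style content, then collapse the remaining text into non-empty lines.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return "\n".join(
        line.strip() for line in soup.get_text().splitlines() if line.strip()
    )
```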