\section*{Supplementary code}
The LLM benchmark codebase is a structured collection of files and directories for large language model evaluation. The function of each component within the \texttt{LLM benchmark} directory is introduced below:
\begin{itemize}
    \item \textbf{data}: This directory contains the datasets used for geological knowledge extraction and evaluation.
    \begin{itemize}
        \item \texttt{Task1}: Stores geological texts and their corresponding triplets for testing the triplet-extraction capabilities of large language models, providing prompts and evaluation data for assessing LLM performance in geological knowledge extraction.
        \item \texttt{Task2}: Includes Yes or No tests and Factoid tests with texts, questions, and answers; each test type comprises five categories of sub-questions and 200 entries in total for comprehensive geological question-answering evaluation.
    \end{itemize}
    \item \textbf{scripts}: This directory contains preprocessing and utility scripts for both tasks:
    \begin{itemize}
        \item \texttt{Task1/KNN\_token.py}: Implements entity-level retrieval over geological texts, finding, for each text of the original dataset (1--500), the most similar texts among the other geological texts (500--1000) to serve as prompts.
        \item \texttt{Task2/cot\_process.py}: Extracts the core answers from Task2's chain-of-thought responses using GPT-4o-mini for evaluation purposes.
        \item \texttt{Task2/KNN.py}: Uses BERT-base-Chinese and cosine similarity to retrieve the most similar texts for prompting.
        \item \texttt{utils/LLM.py}: Provides a unified API interface for multiple large language models, including GPT, Gemini, Claude, and others; a minimal sketch of this interface is given after this list.
    \end{itemize}
    \item \textbf{tasks}: This directory contains the task-specific implementations:
    \begin{itemize}
        \item \textbf{task1}: Focuses on extracting 24 types of geological relationship triplets from given geological texts.
        \begin{itemize}
            \item \texttt{eval.py}: Evaluates triplet-extraction performance on geological texts 1--500 of the dataset, measuring precision, recall, and F1 for the extracted triplets as well as BERTScore-based metrics.
            \item \texttt{Task1\_test.ipynb}: Interactive notebook for testing different prompting approaches, including zero-shot, few-shot, KNN-based, and knowledge-guided methods.
            \item \texttt{metrics/graph\_matching.py}: Implements graph-based evaluation metrics for measuring triplet-extraction performance.
            \item \texttt{utils/prompt\_generate.py}: Contains functions for generating prompts for the various settings.
            \item \texttt{utils/response\_to\_json.py}: Contains functions for converting model responses into JSON format.
        \end{itemize}
        \item \textbf{task2}: Tests large language models' capabilities in geological judgment (Yes or No) and factual extraction (Factoid).
        \begin{itemize}
            \item \texttt{eval.py}: Evaluates LLM performance in judgment and factual extraction using accuracy, precision, recall, BERTScore, and METEOR metrics.
            \item \texttt{pretreatment\_split\_data\_Geo.py}: Splits the 200 judgment and 200 factual-extraction entries into testing sets (for LLM evaluation) and prompt sets (for LLM prompting).
            \item \texttt{Task2\_test.ipynb}: Interactive testing notebook supporting the various prompting strategies and chain-of-thought reasoning.
            \item \texttt{utils/knn\_prompt.py}: Generates prompts using K-nearest-neighbour examples for improved context-aware questioning.
            \item \texttt{utils/promp\_get.py}: Provides flexible prompt generation with support for different shot configurations and example selection.
            \item \texttt{utils/save\_response.py}: Handles structured saving of model responses in JSON format for analysis and evaluation.
        \end{itemize}
    \end{itemize}
\end{itemize}
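\texttt{utils/LLM.py} itself is not reproduced in this supplement. The sketch below illustrates the interface that the test notebooks and scripts rely on: a single \texttt{LLM\_request(model\_series, model\_name, prompt)} call returning a message object that exposes a \texttt{content} attribute. The routing shown here (an OpenAI-compatible client for the GPT and DeepSeek/Qwen/Llama endpoints) is an assumption for illustration, not the repository's actual implementation.
\begin{minted}[bgcolor=LightGray,breaklines=true,fontsize=\footnotesize]{python}
# Minimal sketch of the unified interface assumed by the test notebooks, NOT the
# repository's actual utils/LLM.py. Only the OpenAI-compatible branch is filled in.
from openai import OpenAI

def LLM_request(model_series, model_name, prompt):
    """Route a prompt to the requested provider and return the reply message."""
    if model_series in ('gpt', 'ds_V3_qwen_llama'):
        # Assumption: these endpoints are reached through an OpenAI-compatible API
        # (API key and base URL taken from the environment).
        client = OpenAI()
        completion = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
        )
        return completion.choices[0].message  # exposes a .content attribute
    elif model_series in ('gemini', 'claude'):
        # The corresponding SDK calls are omitted from this sketch.
        raise NotImplementedError(f'Add the {model_series} client here')
    else:
        raise ValueError(f'Unknown model series: {model_series}')
\end{minted}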
\newpage
The \texttt{scripts/Task1/KNN\_token.py} file:
\begin{minted}[bgcolor=LightGray,breaklines=true,fontsize=\footnotesize]{python}
'''
Entity-level retrieval using the KNN algorithm
'''
import json
import numpy as np
import faiss
from transformers import AutoTokenizer, AutoModel
import torch


class EntityLevelRetriever:
    def __init__(self, model_name='bert-base-chinese'):
        """
        Initialize the entity-level retriever
        Args:
            model_name: Pre-trained model name used for encoding
        """
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.model.eval()
        # Initialize FAISS index and storage
        self.entity_db = []
        self.metadata = []
        self.index = faiss.IndexFlatIP(768)  # Inner-product index (equals cosine similarity for normalised vectors)

    def _get_entity_span(self, text, entity):
        """Get the character span of an entity in the text"""
        start = text.find(entity)
        if start == -1:
            return None
        return (start, start + len(entity))

    def _generate_entity_embedding(self, text, entity):
        """Generate an entity-level contextual embedding"""
        span = self._get_entity_span(text, entity)
        if not span:
            return None
        inputs = self.tokenizer(text, return_tensors='pt', truncation=True)
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Convert character positions to token positions
        start_token = inputs.char_to_token(span[0])
        end_token = inputs.char_to_token(span[1] - 1)
        if start_token is None or end_token is None:  # entity truncated or not mapped to tokens
            return None
        # Extract and average the token embeddings of the entity
        entity_embedding = outputs.last_hidden_state[0, start_token:end_token + 1].mean(dim=0).numpy()
        return entity_embedding.astype('float32')

    def build_index(self, train_path):
        """Build the entity index from the training data"""
        with open(train_path, 'r', encoding='utf-8') as f:
            dataset = json.load(f)
        # Use entries 500-1000, as specified in the original implementation
        dataset = dataset[500:1000]
        for item in dataset:
            text = item['text']
            for triple in item['triple_list']:
                # Process head and tail entities
                for entity in [triple[0], triple[2]]:
                    embedding = self._generate_entity_embedding(text, entity)
                    if embedding is not None:
                        self.entity_db.append(embedding)
                        self.metadata.append({
                            'entity': entity,
                            'type': triple[1],  # Store the relation type
                            'context': text
                        })
        print(f"Entity count check - vectors: {len(self.entity_db)}, metadata: {len(self.metadata)}")
        self.index.add(np.array(self.entity_db))

    def search_texts(self, test_path, top_k=3):
        """Search for similar texts based on entity matching"""
        with open(test_path, 'r', encoding='utf-8') as f:
            test_data = json.load(f)
        results = []
        for item in test_data:
            query_text = item['text']
            # Extract entities from the query text
            query_entities = []
            for triple in item.get('triple_list', []):
                query_entities.extend([triple[0], triple[2]])
            if not query_entities:
                continue
            # Generate embeddings for the query entities
            query_embeddings = []
            for entity in query_entities:
                embedding = self._generate_entity_embedding(query_text, entity)
                if embedding is not None:
                    query_embeddings.append(embedding)
            if not query_embeddings:
                continue
            # Search for similar entities
            query_embeddings = np.array(query_embeddings)
            scores, indices = self.index.search(query_embeddings, top_k)
            # Collect the matched texts
            matched_texts = []
            for score_array, index_array in zip(scores, indices):
                for score, idx in zip(score_array, index_array):
                    if idx != -1 and score > 0.5:  # Similarity threshold
                        metadata = self.metadata[idx]
                        matched_texts.append({
                            'entity': metadata['entity'],
                            'relation': metadata['type'],
                            'context': metadata['context'],
                            'score': float(score)
                        })
            # Sort by score and keep the top results
            matched_texts.sort(key=lambda x: x['score'], reverse=True)
            matched_texts = matched_texts[:top_k]
            results.append({
                'query_text': query_text,
                'matched_texts': matched_texts
            })
        return results


# Usage example
if __name__ == "__main__":
    # Initialize the retrieval system
    retriever = EntityLevelRetriever()
    # Build the training index
    print("Building training index...")
    retriever.build_index('./data/Task1/train_triples.json')
    # Run the test retrieval
    print("\nSearching similar entities...")
    text_results = retriever.search_texts('./data/Task1/GT_500.json', top_k=3)
    # Save the results
    with open('./data/Task1/text_retrieval_results.json', 'w', encoding='utf-8') as f:
        json.dump(text_results, f, ensure_ascii=False, indent=2)
\end{minted}
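Note that \texttt{IndexFlatIP} scores candidates by raw inner product, so the values compared against the 0.5 threshold in \texttt{search\_texts()} are cosine similarities only if the embeddings are L2-normalised first. A possible adjustment, not part of the original script, is sketched below; the query embeddings passed to \texttt{index.search} would need the same normalisation.
\begin{minted}[bgcolor=LightGray,breaklines=true,fontsize=\footnotesize]{python}
# Optional adjustment (not in the original script): L2-normalise the entity
# vectors so that inner-product scores from IndexFlatIP are cosine similarities.
import faiss
import numpy as np

vectors = np.array(retriever.entity_db, dtype='float32')
faiss.normalize_L2(vectors)                      # in-place row-wise normalisation
retriever.index = faiss.IndexFlatIP(vectors.shape[1])
retriever.index.add(vectors)
# Apply faiss.normalize_L2 to the query embeddings as well before index.search.
\end{minted}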
\newpage
The \texttt{scripts/Task2/cot\_process.py} file:
\begin{minted}[bgcolor=LightGray,breaklines=true,fontsize=\footnotesize]{python}
import json
from utils.LLM import LLM_request

'''
Use GPT-4o-mini-2024-07-18 as the reasoning model to extract core phrases/entities
from the CoT responses, replace the original answers, and save the results as new files
'''

# Model result paths to process
model_results_paths = [
    './output/Task2/cot/cot/deepseek-ai/DeepSeek-R1_f.json',
    './output/Task2/cot/cot/gpt-3.5-turbo_f.json',
    './output/Task2/cot/cot/gpt-4o_f.json',
    './output/Task2/cot/cot/gemini-1.5-pro-002_f.json',
    './output/Task2/cot/cot/claude-3-5-haiku-20241022_f.json',
    './output/Task2/cot/cot/deepseek-ai/DeepSeek-V3_f.json',
    './output/Task2/cot/cot/meta-llama/Meta-Llama-3.1-405B-Instruct_f.json',
    './output/Task2/cot/cot/Qwen/Qwen2.5-72B-Instruct_f.json',
]

# Configuration of the processing model
model_series = 'gpt'
model_name = 'gpt-4o-mini-2024-07-18'

prompt_template = '''
Extract the main factual information from the following sentence that answers the question.
The answer should be entity phrases without additional explanations or prefix statements.
Question: {question}
Answer: {answer}
Please extract only the core answer:
'''

for i in range(len(model_results_paths)):
    with open(model_results_paths[i], 'r', encoding='utf-8') as f:
        data = json.load(f)
    # Extract the base filename from model_results_paths (dropping the '_f.json' suffix)
    file_name = model_results_paths[i].split('/')[-1].split('_')[0]
    print(file_name)
    # New list to store the processed data
    new_data = []
    for j in range(len(data)):
        question = data[j]['question']
        answer = data[j]['answer']
        # Build the processing prompt
        prompt = prompt_template.format(question=question, answer=answer)
        # print(prompt)
        response = LLM_request(model_series, model_name, prompt + '\n' + 'Do not include any other irrelevant explanations or meaningless replies')
        # print(response)
        # Extract the content from the ChatCompletionMessage object
        core_answer = response.content if hasattr(response, 'content') else response
        # Add the processed entry to the new list
        new_data.append({
            "question": question,
            "answer": core_answer
        })
    # Save the processed data to a new JSON file
    new_file_path = f'./output/Task2/cot/cot_new/{file_name}_f_processed.json'
    with open(new_file_path, 'w', encoding='utf-8') as f:
        json.dump(new_data, f, ensure_ascii=False, indent=4)
\end{minted}
\newpage
The \texttt{scripts/Task2/KNN.py} file:
\begin{minted}[bgcolor=LightGray,breaklines=true,fontsize=\footnotesize]{python}
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import json


class GeologicalKNNRetriever:
    """
    K-nearest-neighbour retriever for geological question answering.
    Encodes texts with a sentence-transformer model (BERT-base-Chinese in the
    benchmark description; the default here is a multilingual MiniLM) and
    matches them by cosine similarity.
    """

    def __init__(self, model_name='sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'):
        """Initialize the KNN retriever with a sentence-transformer model"""
        self.model = SentenceTransformer(model_name)

    def process_knn_data(self, data_path, test_sheet, train_sheet, output_path, k=3):
        """
        Process the test data using KNN to find similar training examples
        Args:
            data_path: Path to the Excel data file
            test_sheet: Name of the test data sheet
            train_sheet: Name of the training data sheet
            output_path: Path to save the KNN results
            k: Number of nearest neighbours to retrieve
        """
        # Load training and test data
        train_data = pd.read_excel(data_path, sheet_name=train_sheet)
        test_data = pd.read_excel(data_path, sheet_name=test_sheet)
        # Encode the training texts
        train_texts = train_data['Text'].tolist()
        train_embeddings = self.model.encode(train_texts)
        results = []
        # Process each test query
        for idx, test_row in test_data.iterrows():
            test_query = {
                'text': test_row['Text'],
                'question': test_row['Question'],
                'answer': test_row['Answer']
            }
            # Encode the test text and find similar training examples
            test_embedding = self.model.encode([test_query['text']])
            similarities = cosine_similarity(test_embedding, train_embeddings)[0]
            # Take the top-k most similar examples
            top_k_indices = similarities.argsort()[-k:][::-1]
            matched_samples = []
            for train_idx in top_k_indices:
                matched_samples.append({
                    'context': train_data.iloc[train_idx]['Text'],
                    'question': train_data.iloc[train_idx]['Question'],
                    'answer': train_data.iloc[train_idx]['Answer'],
                    'similarity': float(similarities[train_idx])
                })
            results.append({
                'test_query': test_query,
                'matched_samples': matched_samples
            })
        # Save the results to JSON
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(results, f, ensure_ascii=False, indent=2)
        print(f"KNN results saved to {output_path}")


def main():
    """Process both the Yes/No and the Factoid data"""
    retriever = GeologicalKNNRetriever()
    data_path = './data/Task2/data.xlsx'
    # Process Yes/No questions
    print("Processing Yes/No questions...")
    retriever.process_knn_data(
        data_path, 'Yes or No Test', 'Yes or No Train',
        './data/Task2/knn_yes_no.json', k=3
    )
    # Process Factoid questions
    print("Processing Factoid questions...")
    retriever.process_knn_data(
        data_path, 'Factoid Test', 'Factoid Train',
        './data/Task2/knn_factoid.json', k=3
    )


if __name__ == "__main__":
    main()
\end{minted}
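For reference, each entry written to \texttt{knn\_yes\_no.json} and \texttt{knn\_factoid.json} has the structure shown below (field values are placeholders); this is the format read back by \texttt{tasks/task2/utils/knn\_prompt.py} when building the KNN-based prompts.
\begin{minted}[bgcolor=LightGray,breaklines=true,fontsize=\footnotesize]{python}
# Shape of one entry in knn_yes_no.json / knn_factoid.json (placeholder values).
{
    "test_query": {
        "text": "...",          # text of the evaluation question
        "question": "...",
        "answer": "..."
    },
    "matched_samples": [        # k most similar training examples, most similar first
        {
            "context": "...",
            "question": "...",
            "answer": "...",
            "similarity": 0.87  # cosine similarity (placeholder value)
        }
    ]
}
\end{minted}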
\newpage
The \texttt{tasks/task2/pretreatment\_split\_data\_Geo.py} file:
\begin{minted}[bgcolor=LightGray,breaklines=true,fontsize=\footnotesize]{python}
'''
Data reading and splitting for geological question answering.
Reads the ./data/Task2/Task2.xlsx file, which contains "Yes or No" and "Factoid" sheets.
Both datasets have 200 entries with the columns "ID", "Question", "Answer", and "Text".
The 200 entries are divided into 5 subcategories of 40 entries each, arranged by ID
(e.g., geological disaster development characteristics, economic loss assessment, etc.).
Each subcategory is split into 20 entries for training (prompting) and 20 for testing (evaluation).
Note: when the dataset was constructed, adjacent questions may come from the same
text segment, so the split cannot simply follow the original order; random sampling
is used instead.
Random sampling: 20 test + 20 train per subcategory (100 test and 100 train per major category in total).
'''
import pandas as pd
import random


def split_data(data_path, output_path):
    """
    Split the geological QA data into training and testing sets
    Args:
        data_path: Input Excel file path
        output_path: Output Excel file path for the split data
    """
    # Read data from both sheets
    data_yes_no = pd.read_excel(data_path, sheet_name='Yes or No')
    data_factoid = pd.read_excel(data_path, sheet_name='Factoid')
    # Split the Yes or No data by category (5 categories, 40 entries each)
    train_data_yes_no = []
    test_data_yes_no = []
    for category in range(5):
        category_data = data_yes_no[category * 40:(category + 1) * 40]
        random.seed(42)
        train = category_data.sample(n=20, random_state=42)
        test = category_data.drop(train.index)
        train_data_yes_no.append(train)
        test_data_yes_no.append(test)
    # Split the Factoid data by category (5 categories, 40 entries each)
    train_data_factoid = []
    test_data_factoid = []
    for category in range(5):
        category_data = data_factoid[category * 40:(category + 1) * 40]
        random.seed(42)
        train = category_data.sample(n=20, random_state=42)
        test = category_data.drop(train.index)
        train_data_factoid.append(train)
        test_data_factoid.append(test)
    # Combine the splits from both sheets
    train_data_yes_no = pd.concat(train_data_yes_no)
    test_data_yes_no = pd.concat(test_data_yes_no)
    train_data_factoid = pd.concat(train_data_factoid)
    test_data_factoid = pd.concat(test_data_factoid)
    # Create the Excel writer object
    writer = pd.ExcelWriter(output_path, engine='xlsxwriter')
    # Write the splits to separate sheets
    train_data_yes_no.to_excel(writer, sheet_name='Yes or No Train', index=False)
    test_data_yes_no.to_excel(writer, sheet_name='Yes or No Test', index=False)
    train_data_factoid.to_excel(writer, sheet_name='Factoid Train', index=False)
    test_data_factoid.to_excel(writer, sheet_name='Factoid Test', index=False)
    # Save the Excel file
    writer.close()
    print(f"Data saved to {output_path}")


if __name__ == "__main__":
    data_path = './data/Task2/Task2.xlsx'
    output_path = './data/Task2/train_test_data.xlsx'
    split_data(data_path, output_path)
\end{minted}
\newpage
The \texttt{tasks/task2/Task2\_test.ipynb} file:
\begin{minted}[bgcolor=LightGray,breaklines=true,fontsize=\footnotesize]{python}
# Zero-shot testing and random-sample prompting (random_few_shot)
from utils.LLM import LLM_request
from utils.promp_get import get_prompt
from utils.save_response import save_responses
import json

# Model configuration
model_series = 'gpt'
model_name = 'gpt-3.5-turbo'
# model_name = 'gpt-4o'
# model_name = 'deepseek-ai/DeepSeek-R1'
# model_series = 'gemini'
# model_name = 'gemini-1.5-pro-002'
# model_series = 'claude'
# model_name = 'claude-3-5-haiku-20241022'
# model_series = 'ds_V3_qwen_llama'
# model_name = 'deepseek-ai/DeepSeek-V3'
# model_name = 'Qwen/Qwen2.5-72B-Instruct'
# model_name = 'meta-llama/Meta-Llama-3.1-405B-Instruct'

# Task configuration
type = 'yes_no'
# type = 'factoid'

# Shot-type configuration
shot_type = 'one_shot'
# shot_type = 'two_shot'
# shot_type = 'three_shot'

# Generate the prompts and process each one
prompt = get_prompt(type, shot_type)
for i in range(0, len(prompt)):
    # Send the request to the LLM
    response = LLM_request(model_series, model_name, prompt[i] + '\n' + 'Do not include any other irrelevant explanations or meaningless replies')
    print(prompt[i])
    print(response)
    # Parse the response and save it
    save_responses(type, i, response, '../../output/Task2/nomal/' + shot_type + '/' + model_name + '.json', '../../output/Task2/nomal/' + shot_type + '_raw/' + model_name + '.json')
    # save_responses(type, i, response, '../../output/Task2/nomal/' + shot_type + '/' + model_name + '_f' + '.json', './output/Task2/nomal/' + shot_type + '_raw/' + model_name + '_f' + '.json')

# KNN-shot testing
from utils.knn_prompt import generate_prompt
from utils.save_response import save_responses_knn

# Task and shot configuration for KNN
type = 'yes_no'
# type = 'factoid'
shot_type = 'one_shot'
# shot_type = 'two_shot'
# shot_type = 'three_shot'

# Generate the KNN-based prompts
prompt = generate_prompt(type, '../../data/Task2/knn_' + type + '.json', example_num=1)
for i in range(0, len(prompt)):
    # Send the request to the LLM
    response = LLM_request(model_series, model_name, prompt[i] + '\n' + 'Do not include any other irrelevant explanations or meaningless replies')
    # print(prompt[i])
    # print(response)
    # Parse the response and save it together with the KNN prompt
    save_responses_knn(prompt[i], type, i, response, '../../output/Task2/knn/' + shot_type + '/' + model_name + '.json', '../../output/Task2/knn/' + shot_type + '_raw/' + model_name + '.json')
    # save_responses_knn(prompt[i], type, i, response, '../../output/Task2/knn/' + shot_type + '/' + model_name + '_f' + '.json', '../../output/Task2/knn/' + shot_type + '_raw/' + model_name + '_f' + '.json')

# Chain-of-thought (CoT) testing
from utils.knn_prompt import generate_prompt_cot

# Task configuration for CoT
type = 'yes_no'
# type = 'factoid'

# Generate the CoT prompts
prompt = generate_prompt_cot(type, '../../data/Task2/knn_' + type + '.json')
for i in range(0, len(prompt)):
    # Send the request to the LLM with CoT prompting
    response = LLM_request(model_series, model_name, prompt[i] + '\n' + 'Do not include any other irrelevant explanations or meaningless replies')
    print(prompt[i])
    print(response)
    # Parse the response and save it
    save_responses_knn(prompt[i], type, i, response, '../../output/Task2/cot/cot/' + model_name + '.json', '../../output/Task2/cot/cot_raw/' + model_name + '.json')
    # save_responses_knn(prompt[i], type, i, response, '../../output/Task2/cot/cot/' + model_name + '_f' + '.json', '../../output/Task2/cot/cot_raw/' + model_name + '_f' + '.json')
\end{minted}
\newpage
The \texttt{tasks/task2/eval.py} file:
\begin{minted}[bgcolor=LightGray,breaklines=true,fontsize=\footnotesize]{python}
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
import os
from bert_score import score as score_bert
from nltk.translate.meteor_score import single_meteor_score
import jieba
from collections import Counter


def calculate_metrics(model_results_paths, data_path='./data/Task2/data.xlsx'):
    """
    Evaluate Yes/No classification performance with multiple metrics
    Args:
        model_results_paths: List of paths to model result JSON files
        data_path: Path to the ground-truth Excel file
    """
    # Load the ground-truth labels
    true_labels = pd.read_excel(data_path, sheet_name='Yes or No Train')
    for model_results_path in model_results_paths:
        # Load the model-generated results
        model_results = pd.read_json(model_results_path)
        # Map predictions and ground truth to binary labels
        predicted = model_results['answer'].apply(lambda x: 1 if x == 'Yes' else 0)
        true = true_labels['Answer'].apply(lambda x: 1 if x == 'Yes' else 0)
        # Compute the evaluation metrics
        accuracy = accuracy_score(true, predicted)
        recall = recall_score(true, predicted)
        precision = precision_score(true, predicted)
        f1 = f1_score(true, predicted)
        # Compute AUROC (here from the hard labels; can be adjusted if probabilities are available)
        predicted_prob = predicted
        auroc = roc_auc_score(true, predicted_prob)
        # Format and save the results
        results = (
            f'Model {model_results_path} evaluation results:\n'
            f'Accuracy: {accuracy:.4f}\n'
            f'Recall: {recall:.4f}\n'
            f'Precision: {precision:.4f}\n'
            f'F1 Score: {f1:.4f}\n'
            f'AUROC: {auroc:.4f}\n'
            '---\n'
        )
        # Save the results to file
        save_path = './output/Task2'
        results_file_path = os.path.join(save_path, 'results_yes_or_no.txt')
        with open(results_file_path, 'a', encoding='utf-8') as f:
            f.write(results)
        print(results)


def calculate_metrics_Factoid(model_results_paths, data_path='./data/Task2/data.xlsx'):
    """
    Comprehensive evaluation for factoid question answering using BERTScore and METEOR
    Args:
        model_results_paths: List of paths to model result JSON files
        data_path: Path to the ground-truth Excel file
    """
    # Load the ground-truth labels
    true_labels = pd.read_excel(data_path, sheet_name='Factoid Train')
    for model_results_path in model_results_paths:
        # Load the model-generated results
        model_results = pd.read_json(model_results_path)
        # Preprocess the answers into lists
        predictions = []
        references = []
        for pred, ref in zip(model_results['answer'], true_labels['Answer']):
            # Handle null values, convert to string, strip whitespace
            pred = str(pred).strip() if not pd.isna(pred) else ""
            ref = str(ref).strip() if not pd.isna(ref) else ""
            predictions.append(pred)
            references.append(ref)
        # 1. BERTScore
        P, R, F1 = score_bert(predictions, references, lang='zh', verbose=False)
        bert_precision = P.mean().item()
        bert_recall = R.mean().item()
        bert_f1 = F1.mean().item()
        # 2. METEOR score and related statistics
        meteor_scores = []
        meteor_precision_scores = []
        meteor_recall_scores = []
        meteor_penalty_scores = []
        weighted_harmonic_means = []
        # METEOR parameters
        ALPHA = 0.9  # Precision weight in the harmonic mean
        BETA = 3.0   # Chunk penalty weight
        GAMMA = 0.5  # Penalty factor
        empty_pred = []
        for pred, ref in zip(predictions, references):
            # Check whether the prediction is empty
            if not pred:
                empty_pred.append(pred)
                print(f"Warning: Found empty prediction. Original prediction: {pred}, Reference: {ref}")
            # Tokenize with jieba
            pred_tokens = list(jieba.cut(pred))
            ref_tokens = list(jieba.cut(ref))
            # Clean the tokenization results, remove whitespace tokens
            pred_tokens = [token for token in pred_tokens if token.strip()]
            ref_tokens = [token for token in ref_tokens if token.strip()]
            # Basic METEOR score
            meteor = single_meteor_score(ref_tokens, pred_tokens)
            # Use Counter to handle repeated words
            pred_counter = Counter(pred_tokens)
            ref_counter = Counter(ref_tokens)
            matched_count = sum((pred_counter & ref_counter).values())
            precision = matched_count / len(pred_tokens) if pred_tokens else 0
            recall = matched_count / len(ref_tokens) if ref_tokens else 0
            # Weighted harmonic mean (Fmean)
            if precision > 0 and recall > 0:
                weighted_harmonic_mean = (precision * recall) / (ALPHA * precision + (1 - ALPHA) * recall)
            else:
                weighted_harmonic_mean = 0
            # Penalty score
            if weighted_harmonic_mean != 0:
                meteor_penalty_score = 1 - (meteor / weighted_harmonic_mean)
            else:
                meteor_penalty_score = 1
            meteor_penalty_scores.append(meteor_penalty_score)
            weighted_harmonic_means.append(weighted_harmonic_mean)
            meteor_precision_scores.append(precision)
            meteor_recall_scores.append(recall)
            meteor_scores.append(meteor)
        # Average scores
        avg_meteor_precision = sum(meteor_precision_scores) / len(meteor_precision_scores) if meteor_precision_scores else 0
        avg_meteor_recall = sum(meteor_recall_scores) / len(meteor_recall_scores) if meteor_recall_scores else 0
        ave_Fmean = sum(weighted_harmonic_means) / len(weighted_harmonic_means) if weighted_harmonic_means else 0
        avg_meteor_penalty = sum(meteor_penalty_scores) / len(meteor_penalty_scores) if meteor_penalty_scores else 0
        avg_meteor = sum(meteor_scores) / len(meteor_scores) if meteor_scores else 0
        # Empty-prediction rate
        empty_pred_rate = len(empty_pred) / len(predictions)
        # Format and save the results
        results = (
            f'Model {model_results_path} evaluation results:\n'
            f'\nBERTScore evaluation results:\n'
            f'BERT Precision: {bert_precision:.4f}\n'
            f'BERT Recall: {bert_recall:.4f}\n'
            f'BERT F1: {bert_f1:.4f}\n'
            f'\nMETEOR evaluation results:\n'
            f'METEOR Precision: {avg_meteor_precision:.4f}\n'
            f'METEOR Recall: {avg_meteor_recall:.4f}\n'
            f'METEOR Fmean: {ave_Fmean:.4f}\n'
            f'METEOR Penalty (gamma={GAMMA:.1f}, beta={BETA:.1f}): {avg_meteor_penalty:.4f}\n'
            f'METEOR Score: {avg_meteor:.4f}\n'
            f'Empty prediction rate: {empty_pred_rate:.4f}\n'
            '---\n'
        )
        save_path = './output/Task2'
        results_file_path = os.path.join(save_path, 'results_factoid.txt')
        with open(results_file_path, 'a', encoding='utf-8') as f:
            f.write(results)
        print(results)


if __name__ == '__main__':
    # Configure the model result paths to evaluate
    Factoid_results_paths = [
        # GPT-3.5-turbo, eight tests
        './output/Task2/nomal/zero_shot/gpt-3.5-turbo_f.json',
        './output/Task2/nomal/one_shot/gpt-3.5-turbo_f.json',
        './output/Task2/nomal/two_shot/gpt-3.5-turbo_f.json',
        './output/Task2/nomal/three_shot/gpt-3.5-turbo_f.json',
        './output/Task2/knn/one_shot/gpt-3.5-turbo_f.json',
        './output/Task2/knn/two_shot/gpt-3.5-turbo_f.json',
        './output/Task2/knn/three_shot/gpt-3.5-turbo_f.json',
        './output/Task2/cot/cot_new/gpt-3.5-turbo_f_processed.json',
        # GPT-4o, eight tests
        './output/Task2/nomal/zero_shot/gpt-4o_f.json',
        './output/Task2/nomal/one_shot/gpt-4o_f.json',
        './output/Task2/nomal/two_shot/gpt-4o_f.json',
        './output/Task2/nomal/three_shot/gpt-4o_f.json',
        './output/Task2/knn/one_shot/gpt-4o_f.json',
        './output/Task2/knn/two_shot/gpt-4o_f.json',
        './output/Task2/knn/three_shot/gpt-4o_f.json',
        './output/Task2/cot/cot_new/gpt-4o_f_processed.json',
        # Additional models...
        './output/Task2/nomal/zero_shot/gemini-1.5-pro-002_f.json',
        './output/Task2/nomal/zero_shot/claude-3-5-haiku-20241022_f.json',
        './output/Task2/nomal/zero_shot/deepseek-ai/DeepSeek-V3_f.json',
    ]
    data_path = './data/Task2/data.xlsx'
    calculate_metrics_Factoid(Factoid_results_paths, data_path)
\end{minted}
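In \texttt{calculate\_metrics()}, a prediction counts as Yes only when the stored answer is exactly the string \texttt{'Yes'}; any other phrasing (\texttt{'yes'}, \texttt{'Yes.'}) is scored as No. If the saved model outputs are not already cleaned to that form, a normalisation step such as the following (an illustration, not part of the repository) keeps the 0/1 mapping faithful:
\begin{minted}[bgcolor=LightGray,breaklines=true,fontsize=\footnotesize]{python}
import pandas as pd

# Hypothetical normalisation before the 0/1 mapping in calculate_metrics().
answers = pd.Series(['Yes', 'yes.', 'No', ' YES '])
predicted = answers.astype(str).str.strip().str.lower().str.startswith('yes').astype(int)
print(predicted.tolist())  # [1, 1, 0, 1]
\end{minted}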
\newpage
The \texttt{tasks/task2/utils/save\_response.py} file:
\begin{minted}[bgcolor=LightGray,breaklines=true,fontsize=\footnotesize]{python}
import json
import pandas as pd
import os


def save_responses(type, index, response, output_file, raw_output_file):
    """
    Save model responses to JSON files in a structured format
    Args:
        type: Type of task ('yes_no' or 'factoid')
        index: Index of the current question
        response: Model response object
        output_file: Path to save the processed responses
        raw_output_file: Path to save the raw responses
    """
    # Load the question data for the given task type
    data_path = '../../data/Task2/data.xlsx'
    if type == 'yes_no':
        sheet = 'Yes or No Train'
    elif type == 'factoid':
        sheet = 'Factoid Train'
    data_train = pd.read_excel(data_path, sheet_name=sheet)
    question_train = data_train['Question']
    # Determine the response type and extract the content
    if hasattr(response, 'content'):  # ChatCompletionMessage objects
        content = response.content
    elif isinstance(response, dict) and 'content' in response:  # dictionary format
        content = response['content']
    else:
        content = "Invalid response format"  # Handle invalid responses
    # Build the record to save
    result = {
        "question": question_train[index],
        "answer": content
    }
    # Load existing data if the file exists
    if os.path.exists(output_file):
        with open(output_file, 'r', encoding='utf-8') as f:
            existing_data = json.load(f)
    else:
        existing_data = []
    # Append the new result
    existing_data.append(result)
    # Save the model responses as a JSON file
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(existing_data, f, ensure_ascii=False, indent=4)
    # Save the raw response as a JSON file
    raw_response = {
        "question": question_train[index],
        "raw_response": str(response)  # Convert the object to a string
    }
    if os.path.exists(raw_output_file):
        with open(raw_output_file, 'r', encoding='utf-8') as f:
            existing_raw_data = json.load(f)
    else:
        existing_raw_data = []
    # Append the new raw response
    existing_raw_data.append(raw_response)
    with open(raw_output_file, 'w', encoding='utf-8') as f:
        json.dump(existing_raw_data, f, ensure_ascii=False, indent=4)


def save_responses_knn(prompt, type, index, response, output_file, raw_output_file):
    """
    Save KNN-enhanced model responses together with the prompt
    Args:
        prompt: The input prompt used for generation
        type: Type of task ('yes_no' or 'factoid')
        index: Index of the current question
        response: Model response object
        output_file: Path to save the processed responses
        raw_output_file: Path to save the raw responses with prompts
    """
    # Load the question data for the given task type
    data_path = '../../data/Task2/data.xlsx'
    if type == 'yes_no':
        sheet = 'Yes or No Train'
    elif type == 'factoid':
        sheet = 'Factoid Train'
    data_train = pd.read_excel(data_path, sheet_name=sheet)
    question_train = data_train['Question']
    # Determine the response type and extract the content
    if hasattr(response, 'content'):  # ChatCompletionMessage objects
        content = response.content
    elif isinstance(response, dict) and 'content' in response:  # dictionary format
        content = response['content']
    else:
        content = "Invalid response format"  # Handle invalid responses
    # Build the record to save
    result = {
        "question": question_train[index],
        "answer": content
    }
    # Load existing processed data if the file exists
    if os.path.exists(output_file):
        with open(output_file, 'r', encoding='utf-8') as f:
            existing_data = json.load(f)
    else:
        existing_data = []
    # Append the new result
    existing_data.append(result)
    # Save the processed model responses as a JSON file
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(existing_data, f, ensure_ascii=False, indent=4)
    # Save the raw response together with the prompt
    raw_response = {
        "prompt": prompt,
        "raw_response": str(response)  # Convert the object to a string
    }
    if os.path.exists(raw_output_file):
        with open(raw_output_file, 'r', encoding='utf-8') as f:
            existing_raw_data = json.load(f)
    else:
        existing_raw_data = []
    # Append the new raw response
    existing_raw_data.append(raw_response)
    with open(raw_output_file, 'w', encoding='utf-8') as f:
        json.dump(existing_raw_data, f, ensure_ascii=False, indent=4)
\end{minted}
\newpage
The \texttt{tasks/task2/utils/promp\_get.py} file:
\begin{minted}[bgcolor=LightGray,breaklines=true,fontsize=\footnotesize]{python}
import pandas as pd
import random


def get_prompt(ask_name, shot_type):
    """
    Generate prompts for geological question answering with different shot configurations
    Args:
        ask_name: Type of question ('yes_no' or 'factoid')
        shot_type: Number of examples ('zero_shot', 'one_shot', 'two_shot', 'three_shot')
    Returns:
        A pandas Series of prompts (one per evaluation question) for the specified configuration
    """
    if ask_name == 'yes_no':
        # Paths for the Yes/No classification data
        data_path = '../../data/Task2/data.xlsx'
        sheet_train = 'Yes or No Train'
        sheet_test = 'Yes or No Test'
        # Load the data
        data_train = pd.read_excel(data_path, sheet_name=sheet_train)
        data_test = pd.read_excel(data_path, sheet_name=sheet_test)
        # Extract questions, texts, and answers.
        # These are pandas Series, so the string concatenations below are vectorized
        # and yield one prompt per row.
        question_train = data_train['Question']
        text_train = data_train['Text']
        answer_train = data_train['Answer']
        question_test = data_test['Question']
        text_test = data_test['Text']
        answer_test = data_test['Answer']
        # Generate prompts based on the shot type
        if shot_type == 'zero_shot':
            prompt = '''
Please answer the question based on the given text.
Given text: "''' + text_train + '''"''' + '\n' + '''Question: "''' + question_train + '''"
Please answer directly with Yes or No.
'''
        elif shot_type == 'one_shot':
            # Randomly select one example
            random_index = random.randint(0, len(data_test) - 1)
            example = 'Text: ' + str(text_test[random_index]) + '\n' + 'Question: ' + str(question_test[random_index]) + '\n' + 'Answer: ' + str(answer_test[random_index])
            prompt = '''
Please answer the question based on the given text.
''' + '\n' + '''Example:''' + '\n' + example + '''
''' + '\n' + '''Given text: "''' + text_train + '''"''' + '\n' + '''Question: "''' + question_train + '"' + '''
Please answer directly with Yes or No.
'''
        elif shot_type == 'two_shot':
            # Randomly select two distinct examples
            random_index1 = random.randint(0, len(data_test) - 1)
            random_index2 = random.randint(0, len(data_test) - 1)
            while random_index2 == random_index1:
                random_index2 = random.randint(0, len(data_test) - 1)
            example1 = 'Text: ' + str(text_test[random_index1]) + '\n' + 'Question: ' + str(question_test[random_index1]) + '\n' + 'Answer: ' + str(answer_test[random_index1])
            example2 = 'Text: ' + str(text_test[random_index2]) + '\n' + 'Question: ' + str(question_test[random_index2]) + '\n' + 'Answer: ' + str(answer_test[random_index2])
            prompt = '''
Please answer the question based on the given text.
''' + '\n' + '''Example 1:''' + '\n' + example1 + '''
''' + '\n' + '''Example 2:''' + '\n' + example2 + '''
''' + '\n' + '''Given text: "''' + text_train + '''"''' + '\n' + '''Question: "''' + question_train + '"' + '''
Please answer directly with Yes or No.
'''
        elif shot_type == 'three_shot':
            # Randomly select three distinct examples
            random_index1 = random.randint(0, len(data_test) - 1)
            random_index2 = random.randint(0, len(data_test) - 1)
            while random_index2 == random_index1:
                random_index2 = random.randint(0, len(data_test) - 1)
            random_index3 = random.randint(0, len(data_test) - 1)
            while random_index3 == random_index1 or random_index3 == random_index2:
                random_index3 = random.randint(0, len(data_test) - 1)
            example1 = 'Text: ' + str(text_test[random_index1]) + '\n' + 'Question: ' + str(question_test[random_index1]) + '\n' + 'Answer: ' + str(answer_test[random_index1])
            example2 = 'Text: ' + str(text_test[random_index2]) + '\n' + 'Question: ' + str(question_test[random_index2]) + '\n' + 'Answer: ' + str(answer_test[random_index2])
            example3 = 'Text: ' + str(text_test[random_index3]) + '\n' + 'Question: ' + str(question_test[random_index3]) + '\n' + 'Answer: ' + str(answer_test[random_index3])
            prompt = '''
Please answer the question based on the given text.
''' + '\n' + '''Example 1:''' + '\n' + example1 + '''
''' + '\n' + '''Example 2:''' + '\n' + example2 + '''
''' + '\n' + '''Example 3:''' + '\n' + example3 + '''
''' + '\n' + '''Given text: "''' + text_train + '''"''' + '\n' + '''Question: "''' + question_train + '"' + '''
Please answer directly with Yes or No.
'''
    elif ask_name == 'factoid':
        # Paths for the factoid question data
        data_path = '../../data/Task2/data.xlsx'
        sheet_train = 'Factoid Train'
        sheet_test = 'Factoid Test'
        # Load the data
        data_train = pd.read_excel(data_path, sheet_name=sheet_train)
        data_test = pd.read_excel(data_path, sheet_name=sheet_test)
        # Extract questions, texts, and answers
        question_train = data_train['Question']
        text_train = data_train['Text']
        answer_train = data_train['Answer']
        question_test = data_test['Question']
        text_test = data_test['Text']
        answer_test = data_test['Answer']
        # Generate prompts based on the shot type (same structure as for yes_no)
        if shot_type == 'zero_shot':
            prompt = '''
Please answer the question based on the given text.
Given text: "''' + text_train + '"' + '\n' + '''Question: "''' + question_train + '''"
Please answer the question directly.
'''
        # ... (similar implementation for one_shot, two_shot, three_shot)
    return prompt


if __name__ == '__main__':
    # Test the prompt generation
    prompt = get_prompt('yes_no', 'zero_shot')
    print('--------------------------------')
    print(len(prompt))
    print(prompt[0])
\end{minted}
\newpage
The \texttt{tasks/task2/utils/knn\_prompt.py} file:
\begin{minted}[bgcolor=LightGray,breaklines=true,fontsize=\footnotesize]{python}
import json


def generate_prompt(type, json_path, example_num=3):
    """
    Generate prompts with K-nearest-neighbour examples for context-aware questioning
    Args:
        type: Type of task ('yes_no' or 'factoid')
        json_path: Path to the JSON file containing the KNN search results
        example_num: Number of examples to include in each prompt (default: 3)
    Returns:
        List of generated prompts with KNN examples
    """
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    prompts = []
    for item in data:
        test_query = item['test_query']
        matched_samples = item['matched_samples'][:example_num]  # Limit the number of examples
        # Build the prompt with examples
        prompt = 'Please answer the question based on the given text.\nExamples:\n'
        for sample in matched_samples:
            prompt += f'Given text: "{sample["context"]}".\nQuestion: "{sample["question"]}".\nAnswer: "{sample["answer"]}"\n\n'
        # Add the task-specific instruction
        if type == 'yes_no':
            prompt += f'Given text: "{test_query["text"]}".\nQuestion: "{test_query["question"]}"\nPlease answer directly with Yes or No.'
        elif type == 'factoid':
            prompt += f'Given text: "{test_query["text"]}".\nQuestion: "{test_query["question"]}"\nPlease answer the question directly.'
        prompts.append(prompt)
    return prompts


def generate_prompt_cot(type, json_path):
    """
    Generate chain-of-thought prompts for enhanced reasoning
    Args:
        type: Type of task ('yes_no' or 'factoid')
        json_path: Path to the JSON file containing the test queries
    Returns:
        List of CoT prompts that encourage step-by-step reasoning
    """
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    prompts = []
    for item in data:
        test_query = item['test_query']
        # Build the chain-of-thought prompt
        prompt = 'Please answer the question based on the given text.\n'
        if type == 'yes_no':
            prompt += f'Given text: "{test_query["text"]}".\nQuestion: "{test_query["question"]}"\nPlease first answer Yes or No, then provide your reasoning basis.'
        elif type == 'factoid':
            prompt += f'Given text: "{test_query["text"]}".\nQuestion: "{test_query["question"]}"\nPlease first answer the question, then provide your reasoning basis.'
        prompts.append(prompt)
    return prompts


if __name__ == "__main__":
    # Usage examples
    # Test CoT prompt generation for Yes/No questions
    prompts = generate_prompt_cot('yes_no', './data/Task2/knn_yes_no.json')
    print("Chain-of-Thought Prompts for Yes/No:")
    for i, prompt in enumerate(prompts):
        print(f'Prompt {i+1}:\n{prompt}\n')
        if i >= 1:  # Print only the first two examples
            break
    # Test KNN prompt generation for Factoid questions
    prompts = generate_prompt('factoid', './data/Task2/knn_factoid.json', example_num=1)
    print('\nKNN Prompts for Factoid:')
    for i, prompt in enumerate(prompts):
        print(f'Prompt {i+1}:\n{prompt}\n')
        if i >= 1:  # Print only the first two examples
            break
\end{minted}