| # CONSTANTS-URL | |
| URL = "http://opencompass.openxlab.space/assets/OpenVLM.json" | |
| RESULTS = 'ShoppingMMLU_overall.json' | |
| SHOPPINGMMLU_README = 'https://raw.githubusercontent.com/KL4805/ShoppingMMLU/refs/heads/main/README.md' | |
| # CONSTANTS-CITATION | |
| CITATION_BUTTON_TEXT = r"""@article{jin2024shopping, | |
| title={Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models}, | |
| author={Jin, Yilun and Li, Zheng and Zhang, Chenwei and Cao, Tianyu and Gao, Yifan and Jayarao, Pratik and Li, Mao and Liu, Xin and Sarkhel, Ritesh and Tang, Xianfeng and others}, | |
| journal={arXiv preprint arXiv:2410.20745}, | |
| year={2024} | |
| }""" | |
| CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results" | |
| # CONSTANTS-TEXT | |
| LEADERBORAD_INTRODUCTION = """# Shopping MMLU Leaderboard | |
| ### Welcome to the Shopping MMLU Leaderboard! This leaderboard shares the evaluation results of LLMs obtained with the open-source framework: | |
| ### [Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models](https://github.com/KL4805/ShoppingMMLU) 🏆 | |
| ### Currently, the Shopping MMLU Leaderboard covers {} different LLMs and {} main online shopping skills. | |
| This leaderboard was last updated: {}. | |
| The Shopping MMLU Leaderboard includes only open-source LLMs and publicly available API models. To add your own model, please open a PR in [Shopping MMLU](https://github.com/KL4805/ShoppingMMLU) that adds support for your LLM; we will then help with the evaluation and update the leaderboard. For any questions or concerns, please contact us at [email protected] and [email protected]. | |
| """ | |
| # CONSTANTS-FIELDS | |
| META_FIELDS = ['Method', 'Param (B)', 'OpenSource', 'Verified'] | |
| # MAIN_FIELDS = [ | |
| # 'MMBench_V11', 'MMStar', 'MME', | |
| # 'MMMU_VAL', 'MathVista', 'OCRBench', 'AI2D', | |
| # 'HallusionBench', 'SEEDBench_IMG', 'MMVet', | |
| # 'LLaVABench', 'CCBench', 'RealWorldQA', 'POPE', 'ScienceQA_TEST', | |
| # 'SEEDBench2_Plus', 'MMT-Bench_VAL', 'BLINK' | |
| # ] | |
| MAIN_FIELDS = [ | |
| 'Shopping Concept Understanding', 'Shopping Knowledge Reasoning', 'User Behavior Alignment', 'Multi-lingual Abilities' | |
| ] | |
| # DEFAULT_BENCH = [ | |
| # 'MMBench_V11', 'MMStar', 'MMMU_VAL', 'MathVista', 'OCRBench', 'AI2D', | |
| # 'HallusionBench', 'MMVet' | |
| # ] | |
| DEFAULT_BENCH = ['Shopping Concept Understanding', 'Shopping Knowledge Reasoning', 'User Behavior Alignment', 'Multi-lingual Abilities'] | |
| MODEL_SIZE = ['<4B', '4B-10B', '10B-20B', '20B-40B', '>40B', 'Unknown'] | |
| MODEL_TYPE = ['API', 'OpenSource', 'Proprietary'] | |
| # The README file for each benchmark | |
| LEADERBOARD_MD = {} | |
| LEADERBOARD_MD['MAIN'] = f""" | |
| ## Included Shopping Skills: | |
| - Shopping Concept Understanding: Understanding domain-specific short texts in online shopping (e.g. brands, product models). | |
| - Shopping Knowledge Reasoning: Reasoning over commonsense, numeric, and implicit product-product multi-hop knowledge. | |
| - User Behavior Alignment: Modeling heterogeneous and implicit user behaviors (e.g. click, query, purchase). | |
| - Multi-lingual Abilities: Online shopping across marketplaces around the globe. | |
| ## Main Evaluation Results | |
| - Metrics: | |
| - Avg Score: The average score across all 4 online shopping skills (normalized to 0-100; higher is better). | |
| - Detailed metrics and evaluation results for each skill are provided in the subsequent tabs. | |
| """ | |
| LEADERBOARD_MD['Shopping Concept Understanding'] = """ | |
| ## Shopping Concept Understanding Evaluation Results | |
| Online shopping concepts such as brands and product models are domain-specific and rarely seen in pre-training. Moreover, they often appear in short texts (e.g. queries, attribute-value pairs) that provide too little context to interpret them. Failing to understand these concepts therefore compromises the performance of LLMs on downstream tasks. | |
| This skill covers the following sub-skills and tasks: | |
| - **Concept Normalization**: | |
| - Product Category Synonym | |
| - Attribute Value Synonym | |
| - **Elaboration**: | |
| - Attribute Explanation | |
| - Product Category Explanation | |
| - **Relational Inference**: | |
| - Applicable Attribute to Product Category | |
| - Applicable Product Category to Attribute | |
| - Inapplicable Attributes | |
| - Valid Attribute Value Given Attribute and Product Category | |
| - Valid Attribute Given Attribute Value and Product Category | |
| - Product Category Classification | |
| - Product Category Generation | |
| - **Sentiment Analysis**: | |
| - Aspect-based Sentiment Classification | |
| - Aspect-based Review Retrieval | |
| - Aspect-based Review Selection | |
| - Aspect-based Reviews Overall Sentiment Classification | |
| - **Information Extraction**: | |
| - Attribute Value Extraction | |
| - Query Named Entity Recognition | |
| - Aspect-based Review Keyphrase Selection | |
| - Aspect-based Review Keyphrase Extraction | |
| - **Summarization**: | |
| - Attribute Naming from Description | |
| - Product Category Naming from Description | |
| - Review Aspect Retrieval | |
| - Single Conversation Topic Selection | |
| - Multi-Conversation Topic Retrieval | |
| - Product Keyphrase Selection | |
| - Product Keyphrase Retrieval | |
| - Product Title Generation | |
| """ | |
| LEADERBOARD_MD['Shopping Knowledge Reasoning'] = """ | |
| ## Shopping Knowledge Reasoning Evaluation Results | |
| This skill focuses on understanding and applying various kinds of implicit knowledge to reason over products and their attributes. For example, calculations such as the total volume of a product pack require numeric reasoning, and finding compatible products requires multi-hop reasoning among various products over a product knowledge graph. | |
| This skill covers the following sub-skills and tasks: | |
| - **Numeric Reasoning**: | |
| - Unit Conversion | |
| - Product Numeric Reasoning | |
| - **Commonsense Reasoning** | |
| - **Implicit Multi-Hop Reasoning**: | |
| - Product Compatibility | |
| - Complementary Product Categories | |
| - Implicit Attribute Reasoning | |
| - Related Brands Selection | |
| - Related Brands Retrieval | |
| """ | |
| LEADERBOARD_MD['User Behavior Alignment'] = """ | |
| ## User Behavior Alignment Evaluation Results | |
| Accurately modeling user behaviors is a crucial skill in online shopping. A large variety of user behaviors exist in online shopping, including queries, clicks, add-to-carts, purchases, etc. Moreover, these behaviors are generally implicit and not expressed in text. | |
| Consequently, LLMs trained on general text struggle to align with these heterogeneous and implicit user behaviors, as they rarely observe such inputs during pre-training. | |
| This skill covers the following sub-skills and tasks: | |
| - **Query-Query Relations**: | |
| - Query Re-Writing | |
| - Query-Query Intention Selection | |
| - Intention-Based Related Query Retrieval | |
| - **Query-Product Relations**: | |
| - Product Category Selection for Query | |
| - Query-Product Relation Selection | |
| - Query-Product Ranking | |
| - **Sessions**: | |
| - Session-based Query Recommendation | |
| - Session-based Next Query Selection | |
| - Session-based Next Product Selection | |
| - **Purchases**: | |
| - Product Co-Purchase Selection | |
| - Product Co-Purchase Retrieval | |
| - **Reviews and QA**: | |
| - Review Rating Prediction | |
| - Aspect-Sentiment-Based Review Generation | |
| - Review Helpfulness Selection | |
| - Product-Based Question Answering | |
| """ | |
| LEADERBOARD_MD['Multi-lingual Abilities'] = """ | |
| ## Multi-lingual Abilities Evaluation Results | |
| Multi-lingual models are desired in online shopping as they can be deployed in multiple marketplaces without re-training. | |
| This skill covers the following sub-skills and tasks: | |
| - **Multi-lingual Shopping Concept Understanding**: | |
| - Multi-lingual Product Title Generation | |
| - Multi-lingual Product Keyphrase Selection | |
| - Cross-lingual Product Title Translation | |
| - Cross-lingual Product Entity Alignment | |
| - **Multi-lingual User Behavior Alignment**: | |
| - Multi-lingual Query-product Relation Selection | |
| - Multi-lingual Query-product Ranking | |
| - Multi-lingual Session-based Product Recommendation | |
| """ | |
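As a rough illustration of how the constants above might be consumed, the sketch below computes the Avg Score described under "Main Evaluation Results" and maps a parameter count (in billions) to one of the `MODEL_SIZE` labels. The helper names `avg_score` and `size_bucket`, the boundary convention for the size buckets, and the sample scores are assumptions made for this example; they are not taken from the leaderboard's actual code.

```python
from statistics import mean

# Copied from the constants defined above.
MAIN_FIELDS = [
    'Shopping Concept Understanding', 'Shopping Knowledge Reasoning',
    'User Behavior Alignment', 'Multi-lingual Abilities',
]
MODEL_SIZE = ['<4B', '4B-10B', '10B-20B', '20B-40B', '>40B', 'Unknown']


def avg_score(skill_scores):
    """Average the four skill scores (hypothetical helper).

    Assumes every score is already normalized to the 0-100 range.
    """
    return round(mean(skill_scores[field] for field in MAIN_FIELDS), 1)


def size_bucket(param_b):
    """Map a parameter count in billions to a MODEL_SIZE label (hypothetical helper).

    Assumption: lower bounds are inclusive and upper bounds exclusive,
    so a model with exactly 40B parameters lands in '>40B'.
    """
    if param_b is None:
        return 'Unknown'
    # Pair each upper bound with the first four labels, in order.
    for upper, label in zip((4, 10, 20, 40), MODEL_SIZE):
        if param_b < upper:
            return label
    return '>40B'


if __name__ == '__main__':
    # Made-up scores, for illustration only.
    example = dict(zip(MAIN_FIELDS, (60, 70, 80, 90)))
    print(avg_score(example), size_bucket(7))
```

Keeping the bucket logic driven by `MODEL_SIZE` means the labels shown in the UI filter and the labels assigned to models cannot drift apart, although the numeric boundaries still have to be kept in sync by hand.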