Found 4 AI tools
MLE-bench is a benchmark launched by OpenAI to measure how well AI agents perform at machine learning engineering. It curates 75 machine-learning-engineering-related competitions from Kaggle, forming a diverse set of challenging tasks that test real-world skills such as training models, preparing datasets, and running experiments. Human baselines were established for each competition using Kaggle's publicly available leaderboard data. Multiple cutting-edge language models were evaluated on the benchmark with open-source agent frameworks, and the best-performing setup, OpenAI's o1-preview paired with the AIDE framework, achieved at least a Kaggle bronze medal in 16.9% of the competitions. Various forms of resource scaling for AI agents and the impact of contamination from pre-training are also studied. The MLE-bench code has been open sourced to facilitate future research on the machine learning engineering capabilities of AI agents.
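The medal-based grading described above can be pictured as ranking an agent's submission against the human leaderboard and checking whether it lands inside Kaggle's bronze zone. The sketch below is a minimal illustration under simplified, assumed thresholds; the function names and cutoff rules are approximations for illustration, not MLE-bench's actual grading code.

```python
def bronze_cutoff(num_teams: int) -> int:
    """Approximate number of leaderboard ranks that earn at least bronze.

    Loosely based on Kaggle's published progression rules; the exact
    thresholds here are a simplifying assumption, not MLE-bench's grader.
    """
    if num_teams < 250:
        return max(1, int(num_teams * 0.40))  # top 40% of teams
    if num_teams < 1000:
        return 100                            # top 100 teams
    return int(num_teams * 0.10)              # top 10% of teams


def achieves_bronze(agent_score: float, leaderboard: list[float],
                    higher_is_better: bool = True) -> bool:
    """Check whether the agent's score would fall inside the bronze zone."""
    if higher_is_better:
        beaten_by = sum(1 for s in leaderboard if s > agent_score)
    else:
        beaten_by = sum(1 for s in leaderboard if s < agent_score)
    rank = beaten_by + 1
    return rank <= bronze_cutoff(len(leaderboard))


# Hypothetical usage: 500 human teams, higher score is better.
human_scores = [0.90 - 0.001 * i for i in range(500)]
print(achieves_bronze(0.85, human_scores))  # True: rank 51, bronze cutoff 100
```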
SFR-Judge is a family of evaluation (judge) models launched by Salesforce AI Research, aiming to accelerate the evaluation and fine-tuning of large language models (LLMs). The models can perform a variety of evaluation tasks, including pairwise comparison, single-item rating, and binary classification, while providing explanations for their judgments rather than acting as black boxes. SFR-Judge performs well on multiple benchmarks, demonstrating its effectiveness in evaluating model outputs and guiding fine-tuning.
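The three evaluation modes listed above (pairwise comparison, single-item rating, binary classification) can be thought of as different prompt templates handed to the same judge model. The sketch below is a generic, hypothetical prompting scheme: the templates and the `ask_judge` callable are assumptions for illustration and do not reproduce SFR-Judge's actual formats or API.

```python
# Hypothetical prompt templates for three common judging modes. `ask_judge`
# stands in for any chat-completion call to a judge model; it is not
# SFR-Judge's real interface.

PAIRWISE = (
    "Instruction:\n{instruction}\n\n"
    "Response A:\n{a}\n\nResponse B:\n{b}\n\n"
    "Which response is better? Answer 'A' or 'B', then explain your reasoning."
)

SINGLE_RATING = (
    "Instruction:\n{instruction}\n\nResponse:\n{response}\n\n"
    "Rate the response from 1 (poor) to 5 (excellent), then explain the score."
)

BINARY = (
    "Instruction:\n{instruction}\n\nResponse:\n{response}\n\n"
    "Does the response satisfy the instruction? Answer 'yes' or 'no', then explain."
)


def judge_pairwise(ask_judge, instruction: str, a: str, b: str) -> str:
    """Return the judge model's verdict together with its explanation."""
    return ask_judge(PAIRWISE.format(instruction=instruction, a=a, b=b))
```

Because every template asks for an explanation alongside the verdict, the judge's output stays inspectable, which is the point made above about avoiding black-box evaluation.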
OpenCompass 2.0 is a platform focused on performance evaluation of large language models. It uses multiple closed-source datasets for multi-dimensional evaluation, giving each model an overall average score as well as scores in individual areas of expertise. By updating its rankings in real time, the platform helps developers and researchers understand how different models perform in language, knowledge, reasoning, mathematics, and programming.
RULER is a new synthetic benchmark that provides a more comprehensive evaluation of long-context language models. It extends the vanilla retrieval test to cover different types and numbers of information points, and introduces new task categories, such as multi-hop tracing and aggregation, to test behavior beyond retrieval from context. Ten long-context language models were evaluated on RULER across 13 representative tasks. Although these models achieve almost perfect accuracy on the vanilla retrieval test, their performance drops sharply as context length increases; only four models (GPT-4, Command-R, Yi-34B, and Mixtral) maintain reasonable performance at a length of 32K. RULER has been open sourced to facilitate comprehensive evaluation of long-context language models.
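As a toy illustration of the kind of synthetic retrieval task described above, the sketch below builds a needle-in-a-haystack prompt with a configurable number of key-value "needles" hidden in filler text. The sentence wording, filler, and function name are assumptions for illustration, not RULER's actual task generator.

```python
import random
import string


def make_niah_prompt(num_needles: int = 4, context_sentences: int = 250,
                     seed: int = 0) -> tuple[str, dict[str, str]]:
    """Build a toy needle-in-a-haystack prompt with several key-value needles.

    Returns the prompt and the ground-truth key -> value mapping so a
    model's answers can be scored. Purely illustrative; not RULER's code.
    """
    rng = random.Random(seed)
    filler = ["The grass is green and the sky is blue."] * context_sentences

    needles: dict[str, str] = {}
    for _ in range(num_needles):
        key = "".join(rng.choices(string.ascii_lowercase, k=6))
        value = "".join(rng.choices(string.digits, k=7))
        needles[key] = value
        # Hide each needle sentence at a random position in the filler.
        pos = rng.randrange(len(filler) + 1)
        filler.insert(pos, f"The special magic number for {key} is {value}.")

    question = ("What are the special magic numbers for: "
                + ", ".join(needles) + "?")
    return "\n".join(filler) + "\n\n" + question, needles


# Hypothetical usage: a longer context is produced simply by adding more
# filler sentences while keeping the needles fixed.
prompt, answers = make_niah_prompt(num_needles=4, context_sentences=1000)
```

Scaling the filler while holding the needles fixed is, in spirit, how such synthetic tests stress longer context lengths; RULER additionally varies the type and number of needles and adds non-retrieval tasks.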
AI model evaluation is a popular subcategory under Productivity, featuring 4 quality AI tools.