Found 4 AI tools
MLE-bench is a benchmark launched by OpenAI to measure how well AI agents perform at machine learning engineering. It curates 75 machine-learning-engineering-related competitions from Kaggle, forming a diverse set of challenging tasks that test real-world skills such as training models, preparing datasets, and running experiments. Human baselines were established for each competition using Kaggle's publicly available leaderboard data. Multiple cutting-edge language models were evaluated on the benchmark with open-source agent frameworks, and the best-performing setup, OpenAI's o1-preview paired with the AIDE framework, achieved at least a Kaggle bronze medal in 16.9% of the competitions. Various forms of resource scaling for AI agents and the impact of contamination from pre-training are also studied. The MLE-bench code has been open sourced to facilitate future research on the machine learning engineering capabilities of AI agents.
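The medal-based grading described above can be pictured as ranking an agent's submission against the human leaderboard and checking whether it lands inside Kaggle's bronze zone. The sketch below is a minimal illustration under simplified, assumed thresholds; the function names and cutoff rules are approximations for illustration, not MLE-bench's actual grading code.

```python
def bronze_cutoff(num_teams: int) -> int:
    """Approximate number of leaderboard ranks that earn at least bronze.

    Loosely based on Kaggle's published progression rules; the exact
    thresholds here are a simplifying assumption, not MLE-bench's grader.
    """
    if num_teams < 250:
        return max(1, int(num_teams * 0.40))  # top 40% of teams
    if num_teams < 1000:
        return 100                            # top 100 teams
    return int(num_teams * 0.10)              # top 10% of teams


def achieves_bronze(agent_score: float, leaderboard: list[float],
                    higher_is_better: bool = True) -> bool:
    """Check whether the agent's score would fall inside the bronze zone."""
    if higher_is_better:
        beaten_by = sum(1 for s in leaderboard if s > agent_score)
    else:
        beaten_by = sum(1 for s in leaderboard if s < agent_score)
    rank = beaten_by + 1
    return rank <= bronze_cutoff(len(leaderboard))


# Hypothetical usage: 500 human teams, higher score is better.
human_scores = [0.90 - 0.001 * i for i in range(500)]
print(achieves_bronze(0.85, human_scores))  # True: rank 51, bronze cutoff 100
```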
SFR-Judge is a family of evaluation (judge) models launched by Salesforce AI Research, aiming to accelerate the evaluation and fine-tuning of large language models (LLMs). The models can perform a variety of evaluation tasks, including pairwise comparison, single-item rating, and binary classification, while providing explanations for their judgments rather than acting as black boxes. SFR-Judge performs well on multiple benchmarks, demonstrating its effectiveness in evaluating model outputs and guiding fine-tuning.
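The three evaluation modes listed above (pairwise comparison, single-item rating, binary classification) can be thought of as different prompt templates handed to the same judge model. The sketch below is a generic, hypothetical prompting scheme: the templates and the `ask_judge` callable are assumptions for illustration and do not reproduce SFR-Judge's actual formats or API.

```python
# Hypothetical prompt templates for three common judging modes. `ask_judge`
# stands in for any chat-completion call to a judge model; it is not
# SFR-Judge's real interface.

PAIRWISE = (
    "Instruction:\n{instruction}\n\n"
    "Response A:\n{a}\n\nResponse B:\n{b}\n\n"
    "Which response is better? Answer 'A' or 'B', then explain your reasoning."
)

SINGLE_RATING = (
    "Instruction:\n{instruction}\n\nResponse:\n{response}\n\n"
    "Rate the response from 1 (poor) to 5 (excellent), then explain the score."
)

BINARY = (
    "Instruction:\n{instruction}\n\nResponse:\n{response}\n\n"
    "Does the response satisfy the instruction? Answer 'yes' or 'no', then explain."
)


def judge_pairwise(ask_judge, instruction: str, a: str, b: str) -> str:
    """Return the judge model's verdict together with its explanation."""
    return ask_judge(PAIRWISE.format(instruction=instruction, a=a, b=b))
```

Because every template asks for an explanation alongside the verdict, the judge's output stays inspectable, which is the point made above about avoiding black-box evaluation.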
OpenCompass 2.0 is a platform focused on performance evaluation of large language models. It uses multiple closed-source datasets for multi-dimensional evaluation, giving each model an overall average score as well as scores in individual areas of expertise. By updating its rankings in real time, the platform helps developers and researchers understand how different models perform in language, knowledge, reasoning, mathematics, and programming.
RULER is a new synthetic benchmark that provides a more comprehensive evaluation of long-context language models. It extends the vanilla retrieval test to cover different types and numbers of information points, and introduces new task categories, such as multi-hop tracing and aggregation, to test behavior beyond retrieval from context. Ten long-context language models were evaluated on RULER across 13 representative tasks. Although these models achieve almost perfect accuracy on the vanilla retrieval test, their performance drops sharply as context length increases; only four models (GPT-4, Command-R, Yi-34B, and Mixtral) maintain reasonable performance at a length of 32K. RULER has been open sourced to facilitate comprehensive evaluation of long-context language models.
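As a toy illustration of the kind of synthetic retrieval task described above, the sketch below builds a needle-in-a-haystack prompt with a configurable number of key-value "needles" hidden in filler text. The sentence wording, filler, and function name are assumptions for illustration, not RULER's actual task generator.

```python
import random
import string


def make_niah_prompt(num_needles: int = 4, context_sentences: int = 250,
                     seed: int = 0) -> tuple[str, dict[str, str]]:
    """Build a toy needle-in-a-haystack prompt with several key-value needles.

    Returns the prompt and the ground-truth key -> value mapping so a
    model's answers can be scored. Purely illustrative; not RULER's code.
    """
    rng = random.Random(seed)
    filler = ["The grass is green and the sky is blue."] * context_sentences

    needles: dict[str, str] = {}
    for _ in range(num_needles):
        key = "".join(rng.choices(string.ascii_lowercase, k=6))
        value = "".join(rng.choices(string.digits, k=7))
        needles[key] = value
        # Hide each needle sentence at a random position in the filler.
        pos = rng.randrange(len(filler) + 1)
        filler.insert(pos, f"The special magic number for {key} is {value}.")

    question = ("What are the special magic numbers for: "
                + ", ".join(needles) + "?")
    return "\n".join(filler) + "\n\n" + question, needles


# Hypothetical usage: a longer context is produced simply by adding more
# filler sentences while keeping the needles fixed.
prompt, answers = make_niah_prompt(num_needles=4, context_sentences=1000)
```

Scaling the filler while holding the needles fixed is, in spirit, how such synthetic tests stress longer context lengths; RULER additionally varies the type and number of needles and adds non-retrieval tasks.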
AI model evaluation is a popular subcategory under Productivity, featuring 4 quality AI tools.