Found 65 related AI tools
ZeroSearch is a novel reinforcement learning framework designed to incentivize the search capabilities of large language models (LLMs) without interacting with real search engines. Through supervised fine-tuning, ZeroSearch turns an LLM into a retrieval module capable of generating both relevant and irrelevant documents, and introduces a curriculum rollout mechanism to gradually stimulate the model's reasoning capabilities. The main advantage of this technique is that it outperforms models trained against real search engines while incurring zero API cost. It works with LLMs of all sizes and supports different reinforcement learning algorithms, making it suitable for research and development teams that need efficient retrieval capabilities.
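The description above mentions two mechanisms, a simulated search engine and a curriculum rollout. The sketch below is a rough illustration of how they could fit together, not the ZeroSearch implementation; the prompt wording, noise schedule, and helper names are assumptions.

```python
import random

def simulated_search(llm_generate, query: str, noisy: bool) -> str:
    """Use the fine-tuned LLM as a simulated search engine returning useful or noisy documents."""
    style = "irrelevant or misleading" if noisy else "relevant and useful"
    prompt = f"You are a search engine. Return five {style} documents for the query: {query}"
    return llm_generate(prompt)

def noise_ratio(step: int, total_steps: int, start: float = 0.0, end: float = 0.5) -> float:
    """Curriculum rollout: mostly useful documents early in training, progressively more noise later."""
    return start + (end - start) * step / max(total_steps, 1)

def rollout_retrieval(llm_generate, query: str, step: int, total_steps: int) -> str:
    noisy = random.random() < noise_ratio(step, total_steps)
    return simulated_search(llm_generate, query, noisy)
```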
Search-R1 is a reinforcement learning framework designed to train language models (LLMs) capable of reasoning and invoking search engines. It is built on veRL and supports multiple reinforcement learning methods and different LLM architectures, making it efficient and scalable in tool-enhanced inference research and development.
This model improves the reasoning capabilities of diffusion large language models through reinforcement learning and masked self-supervised fine-tuning of high-quality reasoning trajectories. The importance of this technology lies in its ability to optimize the model's inference process and reduce computational costs while ensuring the stability of learning dynamics. Ideal for users who want to be more efficient in writing and reasoning tasks.
DeepCoder-14B-Preview is a reinforcement learning-based large language model for code reasoning that handles long contexts and achieves a 60.6% pass rate, making it suitable for programming tasks and automated code generation. The model's advantage lies in its innovative training method, which delivers better performance than comparable models. It is fully open source and supports a wide range of community applications and research.
Hunyuan T1 is an ultra-large-scale reasoning model launched by Tencent. Built on reinforcement learning, it significantly improves reasoning capabilities through extensive post-training. It performs outstandingly in long-text processing and context capture while keeping compute consumption low, giving it efficient reasoning. It is suitable for all kinds of reasoning tasks, especially mathematics and logical reasoning, is based on deep learning, and is continuously optimized with real-world feedback, making it suitable for applications in scientific research, education and other fields.
Light-R1-14B-DS is an open-source mathematical model developed by Beijing Qihoo Technology Co., Ltd. The model was trained with reinforcement learning starting from DeepSeek-R1-Distill-Qwen-14B and achieved high scores of 74.0 and 60.2 on the AIME24 and AIME25 mathematics competition benchmarks respectively, surpassing many 32B-parameter models. It successfully applies reinforcement learning, under a lightweight budget, on top of a model already fine-tuned for long-chain reasoning, providing the open-source community with a powerful mathematical model. Open-sourcing the model helps advance natural language processing in education, especially mathematical problem solving, and gives researchers and developers a valuable research foundation and practical tool.
Light-R1 is an open source project developed by Qihoo360 that trains long-chain reasoning models through curriculum-based supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning (RL). The project builds long-chain reasoning capabilities from scratch using decontaminated datasets and efficient training methods. Its main advantages include open-source training data, low-cost training, and excellent performance in mathematical reasoning. The project was motivated by the current need to train long-chain reasoning models and aims to provide a transparent, reproducible training recipe. It is currently free and open source, suitable for research institutions and developers.
R1-Omni is an innovative multi-modal emotion recognition model that improves the model's reasoning and generalization capabilities through reinforcement learning. This model is developed based on HumanOmni-0.5B, focuses on emotion recognition tasks, and can perform emotion analysis through visual and audio modal information. Its main advantages include powerful inference capabilities, significantly improved emotion recognition performance, and excellent performance on out-of-distribution data. This model is suitable for scenarios that require multi-modal understanding, such as sentiment analysis, intelligent customer service and other fields, and has important research and application value.
Steiner is a family of reasoning models developed by Yichao 'Peak' Ji that are trained on synthetic data with reinforcement learning and can explore multiple paths and autonomously verify or backtrack during inference. The goal of the model is to reproduce the reasoning capabilities of OpenAI o1 and to validate the scaling curve at inference time. Steiner-preview is an ongoing project, open-sourced to share knowledge and gather feedback from real users. While the model performs well on some benchmarks, it has not yet fully reproduced OpenAI o1's inference-time scaling capabilities and is still under development.
NotaGen is an innovative symbolic music generation model that improves the quality of music generation through three stages of pre-training, fine-tuning and reinforcement learning. It uses large language model technology to generate high-quality classical scores, bringing new possibilities to music creation. The main advantages of this model include efficient generation, diverse styles, and high-quality output. It is suitable for fields such as music creation, education and research, and has broad application prospects.
SWE-RL is a reinforcement learning-based reasoning technique for large language models proposed by Facebook Research. It uses open-source software evolution data to improve model performance on software engineering tasks. The technique optimizes the model's reasoning capabilities through a rule-driven reward mechanism, allowing it to better understand and generate high-quality code. SWE-RL's main advantages are its innovative reinforcement learning method and its effective use of open-source data, which open up new possibilities for software engineering. The technique is currently at the research stage and commercial pricing has not been determined, but it has significant potential to improve development efficiency and code quality.
MLGym is an open source framework and benchmark developed by Meta's GenAI team and UCSB NLP team for training and evaluating AI research agents. It promotes the development of reinforcement learning algorithms by providing diverse AI research tasks and helping researchers train and evaluate models in real-world research scenarios. The framework supports a variety of tasks, including computer vision, natural language processing and reinforcement learning, and aims to provide a standardized testing platform for AI research.
VLM-R1 is a visual language model based on reinforcement learning, focusing on visual understanding tasks such as Referring Expression Comprehension (REC). The model demonstrates excellent performance on both in-domain and out-of-domain data by combining R1 (Reinforcement Learning) and SFT (Supervised Fine-Tuning) methods. The main advantages of VLM-R1 include its stability and generalization capabilities, allowing it to perform well on a variety of visual language tasks. The model is built on Qwen2.5-VL and utilizes advanced deep learning technologies such as Flash Attention 2 to improve computing efficiency. VLM-R1 is designed to provide an efficient and reliable solution for visual language tasks, suitable for applications requiring precise visual understanding.
NovaSky is an artificial intelligence platform focused on improving the performance of code generation and reasoning models. It significantly improves non-reasoning models through innovative test-time scaling techniques (such as S*), reasoning distillation based on reinforcement learning, and other techniques, making it stand out in code generation. The platform provides developers with efficient, low-cost model training and optimization solutions to help them achieve higher efficiency and accuracy in programming tasks. NovaSky originates from the Sky Computing Lab @ Berkeley, with strong academic backing and a cutting-edge research foundation. It currently offers a variety of model optimization methods, including inference-cost optimization and model distillation, to meet the needs of different developers.
AlphaMaze is a decoder language model designed specifically to solve visual reasoning tasks. It demonstrates the potential of language models for visual reasoning by training them on a maze-solving task. The model is built on the 1.5 billion parameter Qwen model and trained through supervised fine-tuning (SFT) and reinforcement learning (RL). Its main advantage is that it can convert visual tasks into text format for reasoning, thus making up for the shortcomings of traditional language models in spatial understanding. The model was developed to improve AI performance on vision tasks, especially in scenarios that require step-by-step reasoning. Currently, AlphaMaze is a research project and its commercial pricing and market positioning have not yet been clarified.
HOMIE is an innovative humanoid-robot teleoperation solution designed to achieve precise locomotion and manipulation through reinforcement learning and a low-cost exoskeleton hardware system. Its importance is that it addresses the inefficiency and instability of traditional teleoperation systems: through human motion capture and a reinforcement learning training framework, it enables robots to perform complex tasks more naturally. Its main advantages include efficient task completion, no need for complex motion-capture equipment, and fast training. The product mainly targets robotics research institutions and the manufacturing and logistics industries. Pricing has not been disclosed, but the hardware system is low-cost and highly cost-effective.
DeepScaleR-1.5B-Preview is a large language model optimized by reinforcement learning, focusing on improving mathematical problem solving capabilities. This model significantly improves the accuracy in long text reasoning scenarios through distributed reinforcement learning algorithms. Its main advantages include efficient training strategies, significant performance improvements, and the flexibility of open source. The model was developed by UC Berkeley’s Sky Computing Lab and Berkeley AI Research teams to advance the use of artificial intelligence in education, particularly in mathematics education and competitive mathematics. The model is licensed under the MIT open source license and is completely free for researchers and developers to use.
R1-V is a project focused on enhancing the generalization capabilities of visual language models (VLM). It significantly improves the generalization ability of VLM in visual counting tasks through reinforcement learning with verifiable rewards (RLVR) technology, especially in the out-of-distribution (OOD) test. The importance of this technology lies in its ability to achieve efficient optimization of large-scale models at extremely low cost (only a training cost of $2.62), providing new ideas for the practical use of visual language models. The project background is based on the improvement of existing VLM training methods. The goal is to improve the model's performance in complex visual tasks through innovative training strategies. The open source nature of R1-V also makes it an important resource for researchers and developers to explore and apply advanced VLM technology.
Tülu 3 405B is an open source language model developed by the Allen Institute for AI with 405 billion parameters. The model improves performance through an innovative reinforcement learning with verifiable rewards (RLVR) framework, especially on mathematics and instruction-following tasks. It is built on the Llama-405B model and optimized with techniques such as supervised fine-tuning and preference optimization. Its open-source nature makes Tülu 3 405B a powerful tool for research and development of applications that require high-performance language models.
Computer-Using Agent (CUA) is an advanced artificial intelligence model developed by OpenAI that combines GPT-4o's vision capabilities with advanced reasoning learned through reinforcement learning. It can interact with graphical user interfaces (GUIs) the way a human does, without relying on operating-system-specific or web-specific APIs. This flexibility lets CUA perform tasks in a variety of digital environments, such as filling out forms and browsing the web. The technology marks the next step in the development of AI, opening up new possibilities for applying AI in everyday tools. CUA is currently a research preview available to Pro users in the United States through Operator.
DeepSeek-R1-Distill-Qwen-1.5B is an open source language model developed by the DeepSeek team, optimized via distillation based on the Qwen2.5 series. The model uses large-scale reinforcement learning and data distillation to significantly improve reasoning capability and performance while keeping the model small. It performs well on multiple benchmarks, with notable advantages in math, code generation, and reasoning tasks. The model supports commercial use and allows users to modify it and develop derivative works, making it suitable for research institutions and enterprises building high-performance natural language processing applications.
DeepSeek-R1-Distill-Qwen-7B is a reasoning model obtained by distillation based on Qwen-7B and optimized with reinforcement learning. It performs well on math, coding, and reasoning tasks, producing high-quality reasoning chains and solutions. Through large-scale reinforcement learning and data distillation, the model significantly improves reasoning capability and efficiency, making it suitable for scenarios that require complex reasoning and logical analysis.
DeepSeek-R1-Distill-Llama-8B is a high-performance language model developed by the DeepSeek team, based on the Llama architecture and optimized for reinforcement learning and distillation. The model performs well in reasoning, code generation, and multilingual tasks, and is the first model in the open source community to improve reasoning capabilities through pure reinforcement learning. It supports commercial use, allows modifications and derivative works, and is suitable for academic research and corporate applications.
DeepSeek-R1-Distill-Qwen-14B is a distillation model based on Qwen-14B developed by the DeepSeek team, focusing on reasoning and text generation tasks. This model uses large-scale reinforcement learning and data distillation technology to significantly improve reasoning capabilities and generation quality, while reducing computing resource requirements. Its main advantages include high performance, low resource consumption, and broad applicability to scenarios that require efficient reasoning and text generation.
DeepSeek-R1-Distill-Qwen-32B is a high-performance language model developed by the DeepSeek team, obtained by distillation based on the Qwen-2.5 series. The model performs well on multiple benchmarks, especially math, coding, and reasoning tasks. Its main advantages include efficient reasoning, strong multi-language support, and open-source availability, which facilitates secondary development by researchers and developers. The model suits scenarios requiring high-performance text generation, such as intelligent customer service, content creation, and code assistance, and has broad application prospects.
DeepSeek-R1-Distill-Llama-70B is a large language model developed by the DeepSeek team, based on the Llama-70B architecture and optimized through reinforcement learning. The model performs well in reasoning, conversational and multilingual tasks and supports a variety of application scenarios, including code generation, mathematical reasoning and natural language processing. Its main advantages are efficient reasoning capabilities and the ability to solve complex problems, while supporting open source and commercial use. This model is suitable for enterprises and research institutions that require high-performance language generation and reasoning capabilities.
PaSa is an advanced academic paper search agent developed by ByteDance. Based on large language model (LLM) technology, it can autonomously call search tools, read papers and filter relevant references to obtain comprehensive and accurate results for complex academic queries. The technique is optimized through reinforcement learning, trained using the synthetic dataset AutoScholarQuery, and performs well on the real-world query dataset RealScholarQuery, significantly outperforming traditional search engines and GPT-based methods. The main advantage of PaSa is its high recall and precision rates, which provide researchers with a more efficient academic search experience.
Kimi k1.5 is a multi-modal language model developed by MoonshotAI. Through reinforcement learning and long-context expansion techniques, it significantly improves performance on complex reasoning tasks. The model has reached industry-leading levels on multiple benchmarks, surpassing GPT-4o and Claude 3.5 Sonnet on mathematical reasoning tasks such as AIME and MATH-500. Its main advantages include an efficient training framework, powerful multi-modal reasoning capabilities, and support for long contexts. Kimi k1.5 mainly targets application scenarios that require complex reasoning and logical analysis, such as programming assistance, mathematical problem solving, and code generation.
DeepSeek-R1-Zero is a reasoning model developed by the DeepSeek team that focuses on improving reasoning capabilities through reinforcement learning. The model exhibits powerful reasoning behaviors such as self-verification, reflection, and generation of long reasoning chains without any supervised fine-tuning. Its key benefits include efficient reasoning, usability without supervised fine-tuning, and strong performance on math, coding, and reasoning tasks. The model is built on the DeepSeek-V3 architecture, supports large-scale reasoning tasks, and is suitable for research and commercial applications.
DeepSeek-R1 is the first-generation reasoning model released by the DeepSeek team. Trained with large-scale reinforcement learning, it demonstrates excellent reasoning capabilities without supervised fine-tuning. The model performs well on math, coding, and reasoning tasks and is comparable to the OpenAI o1 model. DeepSeek-R1 also comes with a range of distilled models suited to different scale and performance requirements. Its open-source release provides powerful tools for the research community and supports commercial use and secondary development.
RLLoggingBoard is a tool focused on visualizing the training process of Reinforcement Learning with Human Feedback (RLHF). It helps researchers and developers intuitively understand the training process, quickly locate problems, and optimize training effects through fine-grained indicator monitoring. This tool supports a variety of visualization modules, including reward curves, response sorting, and token-level indicators, etc., and is designed to assist existing training frameworks and improve training efficiency and effectiveness. It works with any training framework that supports saving required metrics and is highly flexible and scalable.
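To make the "saving required metrics" idea concrete, the sketch below shows the kind of per-sample record an RLHF trainer could append to a JSONL file for later visualization. The field names and file format are illustrative guesses, not the tool's required schema; consult the project's README for the exact format it expects.

```python
import json

def log_rlhf_sample(path, step, prompt, response_tokens, token_rewards, logprobs, ref_logprobs, values):
    """Append one training sample's RLHF metrics as a JSON line (hypothetical schema)."""
    record = {
        "step": step,                       # training step, used for reward curves
        "prompt": prompt,
        "response_tokens": response_tokens, # decoded tokens of the sampled response
        "reward": sum(token_rewards),       # sequence-level reward
        "token_rewards": token_rewards,     # token-level rewards (e.g. including KL penalty)
        "logprobs": logprobs,               # policy log-probs per token
        "ref_logprobs": ref_logprobs,       # reference-model log-probs, for KL monitoring
        "values": values,                   # critic value estimates per token
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```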
SakanaAI/self-adaptive-llms is an adaptive framework called Transformer² that aims to solve the challenges of traditional fine-tuning methods, which are computationally intensive and static in their ability to handle diverse tasks. The framework adapts large language models (LLMs) to unseen tasks in real time during inference via a two-step mechanism: first, a dispatch system identifies the task's properties; then, task-specific 'expert' vectors trained with reinforcement learning are dynamically mixed to obtain the target behavior for the input prompt. Key advantages include real-time task adaptability, computational efficiency, and flexibility. The project was developed by the SakanaAI team and is open source on GitHub, with 195 stars and 12 forks.
PRIME-RL/Eurus-2-7B-PRIME is a 7B parameter language model trained based on the PRIME method, aiming to improve the reasoning capabilities of the language model through online reinforcement learning. The model is trained from Eurus-2-7B-SFT, using the Eurus-2-RL-Data dataset for reinforcement learning. The PRIME method uses an implicit reward mechanism to make the model pay more attention to the reasoning process during the generation process, rather than just the results. The model performed well in multiple inference benchmarks, with an average improvement of 16.7% compared to its SFT version. Its main advantages include efficient inference improvements, lower data and model resource requirements, and excellent performance in mathematical and programming tasks. This model is suitable for scenarios that require complex reasoning capabilities, such as programming problem solving and mathematical problem solving.
EurusPRM-Stage2 is an advanced reinforcement learning model that optimizes the inference process of the generative model through implicit process rewards. This model uses the log-likelihood ratio of a causal language model to calculate process rewards, thereby improving the model's reasoning capabilities without increasing additional annotation costs. Its main advantage is the ability to learn process rewards implicitly using only response-level labels, thereby improving the accuracy and reliability of generative models. The model performs well in tasks such as mathematical problem solving and is suitable for scenarios requiring complex reasoning and decision-making.
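The log-likelihood-ratio process reward mentioned above is often written as follows; the notation here is a sketch following the PRIME line of work described in the neighboring entries, not necessarily the model's exact definition.

```latex
% Implicit process reward at step t: a scaled log-likelihood ratio between the trained
% causal LM \pi_\phi and a frozen reference model \pi_{\mathrm{ref}} (notation illustrative).
r_\phi(y_t) \;=\; \beta \,\log \frac{\pi_\phi\!\left(y_t \mid x,\, y_{<t}\right)}{\pi_{\mathrm{ref}}\!\left(y_t \mid x,\, y_{<t}\right)}
```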
EurusPRM-Stage1 is part of the PRIME-RL project, which aims to enhance the inference capabilities of generative models through implicit process rewards. This model utilizes an implicit process reward mechanism to obtain process rewards during the inference process without the need for additional process labels. Its main advantage is that it can effectively improve the performance of generative models in complex tasks while reducing labeling costs. This model is suitable for scenarios that require complex reasoning and generation capabilities, such as mathematical problem solving, natural language generation, etc.
PRIME is an open source online reinforcement learning solution that enhances the reasoning capabilities of language models through implicit process rewards. The main advantage of this technology is its ability to effectively provide dense reward signals without relying on explicit process labels, thereby accelerating model training and improving inference capabilities. PRIME performs well on mathematics competition benchmarks, outperforming existing large-scale language models. Its background information includes that it was jointly developed by multiple researchers and related code and data sets were released on GitHub. PRIME is positioned to provide powerful model support for users who require complex reasoning tasks.
GLM-Zero-Preview is Zhipu's first reasoning model trained with extended reinforcement learning. It focuses on enhancing AI reasoning and excels at mathematical logic, code, and complex problems that require deep reasoning. Compared with the base model, expert-task capability is greatly improved without significantly reducing general-task capability. On the AIME 2024, MATH500 and LiveCodeBench evaluations, its performance is comparable to OpenAI o1-preview. Zhipu Huazhang Technology Co., Ltd. is committed to improving the model's deep reasoning through reinforcement learning and will launch an official version of GLM-Zero in the future, extending deep-thinking capabilities to more technical fields.
HuatuoGPT-o1 is a large language model designed for complex medical reasoning, capable of identifying errors, exploring alternative strategies and refining answers. The model advances complex reasoning by leveraging verifiable medical questions and specialized medical validators. The main advantages of HuatuoGPT-o1 include: using validators to guide the search of complex reasoning trajectories to fine-tune large language models; applying reinforcement learning (PPO) based on validator rewards to further improve complex reasoning capabilities. The open source model, data and code of HuatuoGPT-o1 make it of great value in the fields of medical education and research.
MarS is a financial market simulation engine driven by a generative foundation model, the Large Market Model (LMM). It can dynamically generate order sequences based on historical financial market data in response to various conditions, including user-injected interactive orders, vague target scenario descriptions, and current or recent market data. MarS matches the generated order sequences with user interactive orders in real time in a simulated clearing house to produce fine-grained simulated market trajectories. This flexibility allows MarS to support a variety of downstream applications such as forecasting, detection systems, analytics platforms, and agent training environments.
Unitree RL GYM is a reinforcement learning platform based on Unitree robots, supporting Unitree Go2, H1, H1_2, G1 and other models. The platform provides an integrated environment that allows researchers and developers to train and test reinforcement learning algorithms on real or simulated robots. Its importance lies in promoting the development of robot autonomy and intelligence technology, especially in applications requiring complex decision-making and motion control. Unitree RL GYM is open source and free to use, mainly for scientific researchers and robotics enthusiasts.
Meta Motivo is the first behavioral foundation model released by Meta FAIR. It is pre-trained with a novel unsupervised reinforcement learning algorithm and is used to control a complex virtual humanoid agent to complete whole-body tasks. At test time the model can solve unseen tasks such as motion tracking, pose reaching, and reward optimization from prompts, without additional learning or fine-tuning. The importance of this technology lies in its zero-shot capability, which lets it handle a variety of complex tasks while remaining behaviorally robust. Meta Motivo was developed in pursuit of generalization to more complex tasks and different types of agents. Its open-source pre-trained model and training code encourage the community to further research behavioral foundation models.
The RLVR-GSM-MATH-IF-Mixed-Constraints dataset is a dataset focused on mathematical problems. It contains various types of math problems with their solutions and is used to train and verify reinforcement learning models. Its importance lies in helping to develop smarter educational aids and improve students' ability to solve mathematical problems. The dataset was released by allenai on the Hugging Face platform; it includes two subsets, GSM8k and MATH, plus IF prompts with verifiable constraints, and is released under the MIT License and the ODC-BY license.
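For reference, a dataset like this is typically loaded with the Hugging Face datasets library as sketched below; the repository id and split name are inferred from the description above and should be checked against the dataset card on the Hub.

```python
from datasets import load_dataset

# Repository id assumed from the dataset name above; verify it on the Hugging Face Hub.
ds = load_dataset("allenai/RLVR-GSM-MATH-IF-Mixed-Constraints", split="train")
print(ds[0])  # a GSM8k/MATH problem or an IF prompt with verifiable constraints
```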
ReFT is an open source research project that fine-tunes large language models with deep reinforcement learning to improve their performance on specific tasks. The project provides detailed code and data so that researchers and developers can reproduce the results in the paper. Its main advantages include automatically adjusting model parameters with reinforcement learning and improving performance on specific tasks through fine-tuning. ReFT is built on the Codellama and Galactica models and follows the Apache 2.0 license.
O1-CODER is a project aiming to reproduce OpenAI's o1 model with a focus on programming tasks. The project combines reinforcement learning (RL) and Monte Carlo Tree Search (MCTS) to enhance the model's System-2 thinking, with the goal of generating more efficient and logical code. It is particularly valuable for improving programming efficiency and code quality, especially in scenarios that require large amounts of automated testing and code optimization.
Tülu 3 is an open source family of advanced language models that are post-trained to adapt to a wider range of tasks and users. These models achieve a sophisticated training process by combining partially disclosed details of proprietary methods, novel techniques, and established academic research. Tülu 3's success is rooted in careful data curation, rigorous experimentation, innovative methodology, and improved training infrastructure. By openly sharing data, recipes, and findings, Tülu 3 aims to empower the community to explore new and innovative post-training methods.
Google DeepMind is a leading artificial intelligence company owned by Google, focused on developing advanced machine learning algorithms and systems. DeepMind is known for its pioneering work in deep learning and reinforcement learning, with research spanning fields from gaming to healthcare. DeepMind's goal is to advance science and medicine by building intelligent systems to solve complex problems.
Agent Q is a new generation AI agent model developed by MultiOn, which combines search, self-criticism and reinforcement learning to create advanced autonomous network agents capable of planning and self-healing. It solves the challenges of traditional large language models (LLMs) in multi-step reasoning tasks in dynamic environments by guiding Monte Carlo Tree Search (MCTS), AI self-criticism and Direct Preference Optimization (DPO) algorithms, improving the success rate in complex environments.
Meta Llama 3.1 is a series of pre-trained and instruction-tuned multilingual large language models (LLMs) supporting 8 languages, optimized for conversational use cases, and improved safety and usefulness through supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF).
DigiRL is an innovative online reinforcement learning algorithm for training agents that control devices in the wild, solving open-ended, real-world Android tasks with autonomous evaluation by a vision-language model (VLM). DigiRL's main advantages include the ability to leverage existing non-optimal offline datasets and to let agents learn from their own trial and error through offline-to-online reinforcement learning. The method uses an instruction-level value function to implicitly build an automatic curriculum that prioritizes the tasks most valuable to the agent, and a step-level value function to pick out the beneficial actions in a trajectory that contribute to the goal.
Nemotron-4-340B-Reward is a multi-dimensional reward model developed by NVIDIA for use in synthetic-data generation pipelines, helping researchers and developers build their own large language models (LLMs). The model consists of the Nemotron-4-340B-Base model plus a linear layer that converts the final-layer representation of the end-of-response token into five scalar values corresponding to the HelpSteer2 attributes. It supports context lengths of up to 4096 tokens and can score the five attributes for each assistant turn.
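As a rough illustration of the kind of reward head described here, a linear layer mapping the end-of-response token's hidden state to five attribute scores (HelpSteer2's attributes are helpfulness, correctness, coherence, complexity, and verbosity), here is a minimal PyTorch sketch; the class and variable names are illustrative, not NVIDIA's implementation.

```python
import torch
import torch.nn as nn

class MultiAttributeRewardHead(nn.Module):
    """Hypothetical reward head: maps the end-of-response hidden state to five attribute scores."""

    def __init__(self, hidden_size: int, num_attributes: int = 5):
        super().__init__()
        self.score = nn.Linear(hidden_size, num_attributes, bias=False)

    def forward(self, hidden_states: torch.Tensor, response_end_index: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) final-layer states from the base LM
        # response_end_index: (batch,) position of the end-of-response token per example
        batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
        end_states = hidden_states[batch_idx, response_end_index]   # (batch, hidden_size)
        return self.score(end_states)                               # (batch, 5) attribute scores
```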
RL4VLM is an open source project that fine-tunes large vision-language models with reinforcement learning into agents capable of making decisions. The project was developed by Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine, and other researchers. It is based on the LLaVA model and uses the PPO algorithm for reinforcement learning fine-tuning. The RL4VLM repository provides a detailed codebase structure, a getting-started guide, license information, and instructions on how to cite the research.
DIAMOND (DIffusion As a Model Of eNvironment Dreams) is a reinforcement learning agent trained inside a diffusion world model, targeting world modeling for Atari games where visual detail is critical. Agents are trained on a subset of Atari games via autoregressive imagination, and pre-trained world models can be installed and tried out quickly.
LeRobot is an open source project that aims to lower the barrier to entry into the field of robotics, allowing everyone to contribute and benefit from shared datasets and pre-trained models. It contains state-of-the-art methods proven in the real world, with a special focus on imitation learning and reinforcement learning. LeRobot provides a set of pre-trained models, a dataset with human collection demonstrations, and a simulation environment so users can get started without assembling a robot. In the coming weeks, there are plans to add support for the most affordable and capable real-world robots.
MuKoe is a fully open source implementation of MuZero that runs on GKE using Ray as the distributed orchestrator. It provides examples of Atari games and provides an overview of the code base through a Google Next 2024 talk. MuKoe supports running on CPU and TPU, has specific hardware requirements, and is suitable for AI research and development that require large-scale distributed computing resources.
Universe is a software platform that measures and trains the general intelligence capabilities of artificial intelligence through a variety of games, websites, and other applications. It allows AI agents to use computers like humans, interacting with the system by observing screen pixels and operating virtual keyboards and mice. The platform integrates thousands of environments including Flash games, web tasks, video games, etc. It aims to achieve a major breakthrough in general artificial intelligence by building an AI agent that can flexibly apply past experience to quickly master unfamiliar environments.
SERL is a carefully implemented code base that contains an efficient off-policy deep reinforcement learning method, methods for computing rewards and resetting the environment, a high-quality and widely adopted robot controller, and some challenging example tasks. It provides a resource for the community, describes its design choices, and presents experimental results. Surprisingly, we find that our implementation achieves very efficient learning, requiring only 25 to 50 minutes of training for policies such as PCB assembly, cable routing, and object relocation, improving on state-of-the-art results reported in the literature for similar tasks. These policies achieve perfect or near-perfect success rates, are extremely robust even under perturbations, and exhibit emergent recovery and correction behaviors. We hope these promising results and our high-quality open source implementation will give the robotics community a tool to promote further development of reinforcement learning in robotics.
Text-to-image diffusion models are a class of deep generative models with excellent image generation capabilities. However, these models are susceptible to implicit biases from web-scale text-image training pairs and may not accurately model the image aspects we care about, which can lead to suboptimal samples, model bias, and images that are inconsistent with human ethics and preferences. This paper introduces an efficient and scalable algorithm that leverages reinforcement learning (RL) to improve diffusion models across diverse reward functions, such as human preference, compositionality, and fairness, over millions of images. We illustrate how our approach substantially outperforms existing methods in aligning diffusion models with human preferences. We further show that it significantly improves the pre-trained Stable Diffusion (SD) model, generating samples that humans prefer 80.3% of the time while improving the composition and diversity of the generated samples.
ReFT is a simple and effective way to enhance the reasoning capabilities of large language models (LLMs). It first warms up the model with supervised fine-tuning (SFT), then further fine-tunes it with online reinforcement learning, specifically the PPO algorithm in this paper. ReFT significantly outperforms SFT by automatically sampling a large number of reasoning paths for a given problem and deriving rewards naturally from the ground-truth answers. ReFT's performance can be improved further by incorporating inference-time strategies such as majority voting and re-ranking. Notably, ReFT achieves these gains by learning from the same training problems as SFT, without relying on additional or augmented training problems, which indicates that ReFT generalizes better.
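A minimal sketch of the answer-derived reward this kind of setup relies on: extract the final answer from a sampled reasoning path and compare it with the ground-truth answer. The answer format and scoring below are assumptions for illustration, not the paper's exact scheme.

```python
import re

def extract_final_answer(reasoning_path: str) -> str | None:
    # Assumes the chain of thought ends with "The answer is <number>".
    match = re.search(r"The answer is\s*(-?\d+(?:\.\d+)?)", reasoning_path)
    return match.group(1) if match else None

def terminal_reward(reasoning_path: str, gold_answer: str) -> float:
    pred = extract_final_answer(reasoning_path)
    if pred is None:
        return 0.0                                   # unparseable path receives no reward
    return 1.0 if float(pred) == float(gold_answer) else 0.0

# During PPO, every sampled reasoning path for a training question is scored this way,
# so no supervision beyond the original gold answers is required.
```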
Parrot is a multi-objective reinforcement learning framework designed for text-to-image generation. It automatically identifies the best trade-offs between different rewards during T2I RL optimization through batch-wise Pareto-optimal selection. In addition, Parrot jointly optimizes the T2I model and a prompt expansion network, promoting the generation of quality-aware text prompts and thereby further improving final image quality. To counteract the potential catastrophic forgetting of the original user prompt caused by prompt expansion, we introduce original-prompt-centered guidance at inference time to ensure the generated images stay faithful to the user input. Extensive experiments and user studies show that Parrot outperforms several baseline methods on various quality criteria, including aesthetics, human preference, image sentiment, and text-image alignment.
Starling-7B is an open large language model (LLM) trained by reinforcement learning from AI feedback (RLAIF). It leverages our new GPT-4-labeled ranking dataset Nectar and new reward-training and policy-tuning pipelines. Starling-7B scored 8.09 on MT-Bench with GPT-4 as judge, surpassing every current model on MT-Bench except OpenAI's GPT-4 and GPT-4 Turbo. We released the ranking dataset Nectar, the reward model Starling-RM-7B-alpha, and the language model Starling-LM-7B-alpha on HuggingFace, along with an online demo in LMSYS Chatbot Arena. Please look forward to our upcoming code and paper releases, which will provide more details on the whole process.
JaxMARL is a multi-agent reinforcement learning library that combines ease of use with GPU-accelerated performance. It supports commonly used multi-agent reinforcement learning environments as well as popular benchmark algorithms. The goal is to provide a library for comprehensive evaluation of multi-agent reinforcement learning methods and comparisons with relevant benchmarks. It also introduces SMAX, a simplified version of the popular StarCraft multi-agent challenge environment that does not require the StarCraft II game engine to run.
Motif is a PyTorch-based project that trains AI agents on NetHack by deriving reward functions from the preferences of large language models (LLMs). It can produce behaviors that are intuitively aligned with human behavior and can be steered through prompt modifications.
Eureka is a reward design algorithm that reaches human-level performance by having large language models write code. It leverages the zero-shot generation, code writing, and in-context improvement capabilities of state-of-the-art language models such as GPT-4 to evolve reward code, and the generated rewards can then be used to acquire complex skills through reinforcement learning. The reward functions generated by Eureka outperformed reward functions designed by human experts in 29 open source reinforcement learning environments spanning 10 different robot morphologies. Eureka can also flexibly improve a reward function to increase the quality and safety of the generated rewards. By combining the Eureka reward function with curriculum learning, we demonstrated for the first time that a simulated Shadow Hand could perform pen-spinning, skillfully maneuvering a pen in circles at rapid speed.
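A highly simplified sketch of the propose-evaluate loop implied above: an LLM repeatedly proposes reward-function code, each candidate is evaluated by training a policy, and feedback from the best candidate seeds the next round. The callables and defaults below are placeholder stand-ins, not Eureka's actual interface.

```python
from typing import Callable, Tuple

def reward_code_search(
    propose_reward_code: Callable[[str], str],                 # LLM call: feedback text -> reward code
    evaluate_reward_code: Callable[[str], Tuple[float, str]],  # RL training: code -> (task score, stats)
    iterations: int = 5,
    samples_per_iteration: int = 16,
) -> str:
    """Return the best reward code found by iterative LLM proposal and RL evaluation."""
    best_code, best_score, feedback = "", float("-inf"), "no feedback yet"
    for _ in range(iterations):
        for _ in range(samples_per_iteration):
            code = propose_reward_code(feedback)       # generate (or refine) candidate reward code
            score, stats = evaluate_reward_code(code)  # train a policy with it and measure success
            if score > best_score:
                best_code, best_score, feedback = code, score, stats
    return best_code
```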
flowRL is a tool that uses real-time user-experience personalization and reinforcement learning to increase product revenue. It uses AI algorithms to tailor a unique application experience to each user, making real-time UI adjustments based on the user's behavior to best match their preferences. Our machine learning models use reinforcement learning to continuously learn from user data and optimize toward any target metric, from user retention to revenue and user lifetime value.
Octopus is a vision-language programmer driven by environmental feedback that can efficiently parse an agent's visual and textual task goals, formulate complex action sequences, and generate executable code. Octopus's design allows agents to handle a wide range of tasks, from everyday chores in simulators to complex interactions in sophisticated video games. Octopus is trained in our experimental environment OctoVerse by using GPT-4 to control an exploratory agent that generates training data, namely action blueprints and the corresponding executable code. We also collect feedback to enable a reinforcement training scheme, Reinforcement Learning with Environmental Feedback (RLEF). Through a series of experiments we elucidate Octopus's functionality and present convincing results, and the proposed RLEF proves effective at improving the agent's decision-making. By open-sourcing our model architecture, simulator, and datasets, we hope to inspire more innovation and foster collaborative applications within the broader embodied AI community.