Found 60 related AI tools
Level-Navi Agent is an open-source, general-purpose web search agent framework that decomposes complex questions and iteratively searches the Internet until it can answer the user's question. It ships with the Web24 dataset, a benchmark for evaluating model performance on search tasks covering five major domains: finance, games, sports, movies, and events. The framework supports zero-shot and few-shot learning, providing an important reference for applying large language models to Chinese web search agents.
Signs is an innovative NVIDIA-powered platform that helps users learn American Sign Language (ASL) through artificial intelligence and lets users contribute data by recording sign language videos, building toward the world's largest open sign language dataset. The platform uses real-time AI feedback and 3D animation to give beginners a friendly learning experience while supplying the sign language community with data that broadens access to, and the diversity of, sign language learning. The dataset is planned for public release in the second half of 2025 to spur the development of related technologies and services.
Dolphin R1 is a dataset created by the Cognitive Computations team to train reasoning models similar to the DeepSeek-R1 Distill models. It contains 300,000 reasoning samples from DeepSeek-R1, 300,000 reasoning samples from Gemini 2.0 Flash Thinking, and 200,000 Dolphin chat samples; together these give researchers and developers rich training material for improving a model's reasoning and conversational capabilities. Its creation was sponsored by Dria, Chutes, Crusoe Cloud, and other companies, which provided computing resources and financial support, and its release lays an important foundation for natural language processing research and development.
Nemotron-CC is a 6.3-trillion-token dataset derived from Common Crawl. It transforms English Common Crawl into a long-horizon pretraining dataset through classifier ensembling, synthetic data rewriting, and reduced reliance on heuristic filters, containing 4.4 trillion globally deduplicated original tokens and 1.9 trillion synthetically generated tokens. The dataset strikes a better balance between accuracy and data volume, which matters for training large language models.
mlabonne/llm-datasets is a curated collection of high-quality datasets and tools for fine-tuning large language models (LLMs). It gives researchers and developers a set of carefully selected and optimized datasets for training and tuning language models. Its main strengths are the diversity and quality of the datasets, which cover a wide range of usage scenarios and thus improve model generalization and accuracy. The collection also includes tools and concepts that help users understand and use the datasets, and is created and maintained by mlabonne to advance the LLM field.
RLVR-GSM-MATH-IF-Mixed-Constraints is a dataset focused on mathematical problems. It contains various types of math problems with corresponding solutions and is used to train and verify reinforcement learning models, helping to build smarter educational aids and improve students' math problem-solving skills. The dataset was released by allenai on the Hugging Face platform, comprises the GSM8k and MATH subsets plus IF prompts with verifiable constraints, and is released under the MIT and ODC-BY licenses.
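A minimal sketch of pulling this dataset from the Hub follows; the repo id is taken from the entry name, and the split name is an assumption:

```python
# Minimal sketch: loading the dataset from the Hugging Face Hub.
# Repo id follows the entry name; the "train" split name is an assumption.
from datasets import load_dataset

ds = load_dataset("allenai/RLVR-GSM-MATH-IF-Mixed-Constraints", split="train")
print(ds)      # row count and column names
print(ds[0])   # one prompt with its verifiable constraint
```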
MAmmoTH-VL is a large-scale multimodal reasoning platform that significantly improves the performance of multimodal large language models (MLLMs) on multimodal tasks through instruction tuning. Using open models, it builds a dataset of 12 million instruction-response pairs covering diverse, reasoning-intensive tasks with detailed and faithful rationales. MAmmoTH-VL achieves state-of-the-art results on benchmarks such as MathVerse, MMMU-Pro, and MuirBench, demonstrating its value for education and research.
FineWeb2 is a large-scale multilingual pretraining dataset from Hugging Face covering more than 1,000 languages. It is carefully designed to support the pretraining and fine-tuning of natural language processing (NLP) models, especially in multilingual settings. Known for its quality, scale, and diversity, it helps models learn features shared across languages and improves performance on language-specific tasks. FineWeb2 performs well against other multilingual pretraining datasets and, in some cases, even outperforms datasets designed for a single language.
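Because the corpus is far too large to download wholesale, streaming one language configuration is the usual access pattern. A hedged sketch, where the config name "cmn_Hani" (Mandarin, Han script) and the "text" column are assumptions about the repo's layout:

```python
# Hedged sketch: streaming one language subset of FineWeb2 so the
# multi-terabyte corpus never has to be fully downloaded.
from datasets import load_dataset

stream = load_dataset("HuggingFaceFW/fineweb-2", name="cmn_Hani",
                      split="train", streaming=True)
for i, doc in enumerate(stream):
    print(doc["text"][:200])   # "text" column name is an assumption
    if i == 2:
        break
```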
OLMo 2 1124 13B Preference Mixture is a large multilingual dataset from Hugging Face containing 377.7k preference pairs, used for training and optimizing language models, especially for preference learning and instruction following. Its value lies in providing a diverse, large-scale data environment for developing more precise and personalized language processing technology.
The DOLMino dataset mix for OLMo 2 stage-2 annealing training blends a variety of high-quality data for the second stage of OLMo 2 training. It contains web pages, STEM papers, encyclopedia entries, and other data types, and is designed to improve model performance on text generation tasks, providing rich training material for building smarter and more accurate natural language processing models.
Tülu 3 is an open-source family of advanced language models, post-trained to adapt to a wider range of tasks and users. The models combine the partial details of proprietary methods with novel techniques and established academic research to enable complex training pipelines. Tülu 3's success rests on careful data management, rigorous experimentation, innovative methodology, and improved training infrastructure. By openly sharing its data, recipes, and findings, Tülu 3 aims to empower the community to explore new post-training methods.
WorkflowLLM is a data-centric framework designed to enhance the workflow orchestration capabilities of large language models (LLMs). At its core is WorkflowBench, a large-scale supervised fine-tuning dataset containing 106,763 samples drawn from 1,503 APIs across 83 applications in 28 categories. By fine-tuning Llama-3.1-8B on it, WorkflowLLM produces WorkflowLlama, a model optimized specifically for workflow orchestration. Experiments show that WorkflowLlama performs well at orchestrating complex workflows and generalizes to unseen APIs.
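The entry does not spell out the fine-tuning recipe; purely as an illustration, a minimal supervised fine-tuning sketch in the style of the trl library might look like the following, where the data file and its column layout are hypothetical, not the authors' script:

```python
# Hedged sketch of supervised fine-tuning on workflow samples with trl's
# SFTTrainer. The JSONL export and its "text" column are hypothetical.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_ds = load_dataset("json", data_files="workflowbench_train.jsonl",
                        split="train")  # hypothetical local export

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",   # base model named in the entry
    train_dataset=train_ds,            # SFTTrainer expects a "text" column by default
    args=SFTConfig(output_dir="workflowllama-sft"),
)
trainer.train()
```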
GenXD is a framework for 3D and 4D scene generation that jointly studies general 3D and 4D generation using the camera and object motions common in daily life. Because the community lacks large-scale 4D data, GenXD first proposes a data curation pipeline that recovers camera poses and object motion intensity from videos, and uses it to introduce CamVid-30K, a large-scale real-world 4D scene dataset. Leveraging all available 3D and 4D data, the GenXD framework can generate any 3D or 4D scene. It proposes multiview-temporal modules that decouple camera and object motion so the model learns seamlessly from both 3D and 4D data, and it uses masked latent conditioning to support multiple conditioning views. GenXD can generate videos that follow camera trajectories as well as consistent 3D views that can be lifted to a 3D representation. Extensive evaluation on real-world and synthetic datasets demonstrates its effectiveness and versatility relative to previous 3D and 4D generation methods.
Sparsh is a family of general-purpose tactile representations trained with self-supervised algorithms such as MAE, DINO, and JEPA. It produces useful representations for the DIGIT, Gelsight'17, and Gelsight Mini sensors, significantly outperforming end-to-end models on the downstream tasks proposed in TacBench, and supports data-efficient training for new downstream tasks. The Sparsh project includes PyTorch implementations, pretrained models, and the datasets released with Sparsh.
The 1X World Model is a machine learning program that simulates how the world responds to a robot's actions. Building on advances in video generation and self-driving-car world models, it gives robots a virtual simulator that can predict future scenarios and evaluate robot policies. The model handles complex object interactions such as rigid bodies, the effects of dropped objects, partial observability, deformable objects, and articulated objects, and can also run evaluations in changing environments, which is crucial for robotics development.
GameGen-O is the first diffusion-transformer model tailored to generating open-world video games. It enables high-quality, open-domain generation by simulating multiple features of game engines, such as novel characters, dynamic environments, complex actions, and diverse events, and it offers interactive controllability for gameplay simulation. Developing GameGen-O involved a comprehensive from-scratch data collection and processing effort, including building OGameData, the first open-world video game dataset, with efficient sorting, scoring, filtering, and decoupled captioning through a proprietary data pipeline. This large, powerful OGameData underpins the model's training.
CSGO is a text-to-image generation model based on content-style composition. Through a data construction pipeline that generates and automatically cleans stylized data triples, it builds IMAGStyle, the first large-scale style transfer dataset, containing 210k image triples. The CSGO model is trained end to end, explicitly decouples content and style features, and realizes this through independent feature injection. It supports image-driven style transfer, text-driven style synthesis, and text-editing-driven style synthesis; it requires no fine-tuning at inference time, preserves the generative ability of the original text-to-image model, and unifies style transfer with style synthesis.
MedTrinity-25M is a large-scale multimodal dataset with medical annotations at multiple granularities, developed by a team of authors to advance research on medical image and text processing. Its construction includes data extraction and multi-granularity text description generation, and it supports a variety of medical image analysis tasks such as visual question answering (VQA) and pathology image analysis.
MINT-1T is a multimodal dataset open-sourced by Salesforce AI, containing one trillion text tokens and 3.4 billion images, ten times larger than existing open-source datasets. Beyond HTML documents, it also includes PDF documents and ArXiv papers, enriching the dataset's diversity. Its construction involved collecting, processing, and filtering data from multiple sources to ensure high quality and diversity.
SA-V Dataset is an open-world video dataset designed for training general object segmentation models, containing 51K diverse videos and 643K spatio-temporal segmentation masks (masklets). The dataset is intended for computer vision research and is released under the CC BY 4.0 license. The videos are diverse, covering places, objects, and scenes, with masks ranging from large objects such as buildings to fine details such as interior decorations.
Segment Anything Model 2 (SAM 2) is a visual segmentation model from FAIR, Meta's AI research division. It achieves real-time video processing through a simple transformer architecture with streaming memory. Through a model-in-the-loop data engine driven by user interaction, the team collected SA-V, the largest video segmentation dataset to date; SAM 2 is trained on this dataset and delivers strong performance across a wide range of tasks and visual domains.
DCLM-baseline is a pretraining dataset for language model benchmarking, containing 4T tokens and 3B documents. It is extracted from Common Crawl through carefully planned cleaning, filtering, and deduplication steps, and aims to demonstrate the importance of data curation for training efficient language models. The dataset is for research use only and is not suitable for production environments or for domain-specific model training such as coding and mathematics.
PixelProse is a large-scale dataset created by tomg-group-umd that leverages the advanced vision-language model Gemini 1.0 Pro Vision to generate more than 16 million detailed image descriptions. The dataset is of great significance for developing and improving image-to-text technology and can be used for tasks such as image captioning and visual question answering.
emo-visual-data is a public visual annotation dataset of meme images. It contains 5,329 memes annotated visually using the glm-4v and step-free-api projects. The dataset can be used to train and test large multimodal models and is valuable for studying the relationship between image content and text descriptions.
The UltraMedical project develops specialized generalist models for the biomedical field, designed to answer questions from examinations, clinical scenarios, and research while maintaining a broad general knowledge base for cross-domain problems. Large language models are trained on the UltraMedical dataset using advanced alignment techniques, including supervised fine-tuning (SFT), direct preference optimization (DPO), and odds ratio preference optimization (ORPO), to produce powerful, versatile models that serve the needs of the biomedical community.
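As an illustration of the DPO stage named above (not the project's actual setup), a minimal trl-style sketch, where the checkpoint name and preference file are hypothetical placeholders:

```python
# Hedged sketch of direct preference optimization with trl's DPOTrainer.
# The SFT checkpoint and the JSONL of prompt/chosen/rejected pairs are
# hypothetical placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("your-sft-checkpoint")
tokenizer = AutoTokenizer.from_pretrained("your-sft-checkpoint")
prefs = load_dataset("json", data_files="medical_prefs.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="ultramedical-dpo", beta=0.1),
    train_dataset=prefs,
    processing_class=tokenizer,
)
trainer.train()
```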
FlashRAG is a Python toolkit for reproducing and developing Retrieval-Augmented Generation (RAG) research. It includes 32 pre-processed RAG benchmark datasets and 12 state-of-the-art RAG algorithms. FlashRAG provides an extensive, customizable framework with the basic components RAG scenarios require, such as retrievers, rerankers, generators, and compressors, allowing flexible assembly of complex pipelines. It also offers an efficient pre-processing stage and optimized execution, with support for tools such as vLLM and FastChat to accelerate LLM inference and vector index management.
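To make the component structure concrete, here is a generic sketch of the retriever, reranker, and generator composition such toolkits let you assemble; it is illustrative only and not FlashRAG's actual API:

```python
# Generic retriever -> reranker -> generator composition; illustrative only,
# not FlashRAG's actual API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RAGPipeline:
    retrieve: Callable[[str, int], List[str]]      # (query, k) -> passages
    rerank: Callable[[str, List[str]], List[str]]  # reorder by relevance
    generate: Callable[[str], str]                 # prompt -> answer

    def answer(self, query: str, k: int = 20, keep: int = 5) -> str:
        passages = self.rerank(query, self.retrieve(query, k))[:keep]
        prompt = "\n\n".join(passages) + f"\n\nQuestion: {query}\nAnswer:"
        return self.generate(prompt)
```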
ImageInWords (IIW) is a human-in-the-loop iterative annotation framework for curating hyper-detailed image descriptions, along with the new dataset it produces. The dataset achieves state-of-the-art results in both automated and human side-by-side (SxS) evaluations. IIW significantly improves on previous datasets and on GPT-4V output across multiple description dimensions, including readability, comprehensiveness, specificity, hallucination, and human-likeness. Moreover, models fine-tuned on IIW data perform well in text-to-image generation and vision-language reasoning, generating descriptions closer to the original images.
The WildChat dataset is a corpus of 1 million real-world user interactions with ChatGPT, characterized by diverse languages and user prompts. It was used to fine-tune Meta's Llama-2, producing the WildLlama-7b-user-assistant chatbot, which can predict both user prompts and assistant responses.
HuggingFace Mirror Station is a non-profit project that gives AI developers in China a fast and stable platform for downloading models and datasets. By optimizing the download process and reducing interruptions caused by network issues, it greatly improves developer productivity. The mirror supports several download methods: direct download from its web pages, the official huggingface-cli command-line tool, the site's own hfd download tool, and setting environment variables for non-intrusive redirection of downloads.
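The environment-variable route works because the official huggingface_hub client honors HF_ENDPOINT; a minimal sketch, with the mirror URL left as a placeholder:

```python
# Hedged sketch of non-intrusive redirection via HF_ENDPOINT, which
# huggingface_hub reads at import time. The mirror URL is a placeholder.
import os
os.environ["HF_ENDPOINT"] = "https://<mirror-host>"  # set BEFORE the import below

from huggingface_hub import snapshot_download
snapshot_download(repo_id="bert-base-uncased")  # now fetched via the mirror
```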
The FineWeb dataset contains more than 15 trillion tokens of cleaned and deduplicated English web data sourced from CommonCrawl. It is designed for large language model pretraining and aims to advance open-source models. The data is carefully processed and filtered to ensure high quality, making it suitable for a variety of natural language processing tasks.
The StableDesign project provides datasets and training methods for generative interior design: users upload a picture of an empty room plus a text prompt to generate a decorated rendering. Its pipeline, spanning Airbnb data download, feature extraction, and ControlNet model training, combines image processing with natural language processing to offer new ideas and methods for the task.
LMSYS Org is an organization that aims to democratize the technology behind large models and their system infrastructure. It developed the Vicuna chatbot, whose 7B/13B/33B models achieve 90% of ChatGPT's quality as judged by GPT-4. Its Chatbot Arena performs large-scale, gamified evaluation of LLMs through crowdsourcing and the Elo rating system. SGLang provides an efficient interface and runtime for complex LLM programs; LMSYS-Chat-1M is a large-scale real-world LLM conversation dataset; FastChat is an open platform for training, serving, and evaluating LLM-based chatbots; and MT-Bench is a set of challenging, multi-turn, open-ended questions for evaluating chatbots.
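For context, the Elo scheme behind Chatbot Arena-style leaderboards reduces to a simple pairwise update; a minimal sketch:

```python
# Minimal sketch of the Elo update rule that Chatbot Arena-style leaderboards
# are built on: each pairwise battle nudges both models' ratings.
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a, r_b

# Example: two models start at 1000; A wins and gains 16 points.
print(elo_update(1000.0, 1000.0, 1.0))  # -> (1016.0, 984.0)
```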
The Apollo project, maintained by the FreedomIntelligence organization, aims to democratize medical AI by providing multilingual medical large language models (LLMs) covering languages spoken by 6 billion people. The project includes models, datasets, benchmarks, and associated code.
MNBVC (Massive Never-ending BT Vast Chinese corpus) is a project that provides rich Chinese corpora for AI, covering not only mainstream culture but also niche culture and Internet slang. The dataset includes plain-text Chinese data in many forms: news, essays, novels, books, magazines, papers, scripts, forum posts, wikis, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.
Refined-Anime-Text is a refined anime text dataset provided by CausalLM. It contains a large amount of anime-related text and is suitable for training and optimizing text generation models, especially for anime-domain applications.
The Aria Everyday Activities Dataset is a re-release of Project Aria's first pilot dataset, updated with new tools and location data to accelerate development in machine perception and artificial intelligence. It contains egocentric video sequences of daily-life scenes accompanied by rich sensor data, annotations, and 3D point clouds generated by the Aria machine perception services. Researchers can get started quickly using the dedicated tools Aria provides.
AutoMathText is an extensive, carefully curated dataset of roughly 200GB of mathematical text. Each item is independently selected and scored by Qwen, an advanced open-source language model, ensuring high standards of relevance and quality. The dataset is particularly suited to advancing research at the intersection of mathematics and AI, serving as an educational tool for learning and teaching complex mathematical concepts, and providing a basis for developing and training AI models that process and understand mathematical content.
LiveFood is a dataset of more than 5,100 food videos covering four domains: ingredients, cooking, presentation, and eating. All videos are carefully annotated by professional annotators, with a strict double-check mechanism to further ensure annotation quality. The authors also propose the Global Prototype Encoding (GPE) model to handle the accompanying incremental learning problem, achieving competitive performance compared with traditional techniques.
MAGNeT is a community platform that provides a variety of artificial intelligence models and datasets. Users can find advanced natural language processing and machine learning models on the platform, along with related datasets, and the platform offers solutions including text-to-speech conversion, image processing, and more. MAGNeT positions itself to supply developers, researchers, and enterprises with high-quality AI models and datasets.
ANIM-400K is a comprehensive dataset of over 425,000 aligned Japanese and English animated video clips, supporting various video-related tasks such as automatic dubbing, simultaneous translation, video summarization, genre/topic/style classification, etc. This dataset is publicly available for research purposes.
DL3DV-10K is a large-scale real-world scene dataset containing more than 10,000 high-quality videos. Each video is manually annotated with scene keypoints and complexity, and comes with camera poses, NeRF-estimated depth, point clouds, and 3D meshes. The dataset supports computer vision research such as general NeRF research, scene-consistency tracking, and vision-language models.
En3D is a platform that provides advanced natural language processing models, offering a wide variety of models and datasets to help developers build and deploy natural language processing applications. Its strength lies in its large number of pre-trained models and convenient deployment tools, which let developers build NLP applications quickly and efficiently.
ml-ferret is an end-to-end multimodal large language model (MLLM) that accepts various forms of referring input and responds with precise grounding in multimodal settings. It combines a hybrid region representation with a spatially aware visual sampler, supporting fine-grained and open-vocabulary referring and grounding. The release also includes the GRIT dataset (approximately 1.1 million samples) and the Ferret-Bench evaluation benchmark.
LLM Spark is a development platform for building LLM-based applications. It offers rapid testing across multiple LLMs, version control, observability, collaboration, and multi-LLM support, making it easy to build intelligent applications such as AI chatbots and virtual assistants; it integrates with your provider keys for best performance. GPT-driven templates accelerate the creation of various AI applications, while custom projects can still be started from scratch, and datasets can be uploaded seamlessly to enhance an application's capabilities. With comprehensive logging and analytics, users can compare GPT results, iterate, and deploy intelligent AI applications. The platform also supports testing multiple models simultaneously, saving prompt versions and history, easy collaboration, and semantic search that matches meaning rather than just keywords. In addition, LLM Spark supports integrating external datasets into LLMs and complies with GDPR requirements to ensure data security and privacy.
Distil-Whisper is a platform providing models and datasets, where users can access a variety of pre-trained models and datasets for their own applications and research. Its rich resources help users quickly get started with natural language processing and machine learning work.
OpenXLab Puyuan is a one-stop AI development platform for developers and users in the field of artificial intelligence, offering application development, free model hosting, dataset downloads, and other services. The application center provides a platform for building applications, the model center hosts community models, and the dataset center offers a large collection of high-quality AI datasets.
V7 is an AI data engine that provides a complete infrastructure for enterprise-level training data, covering annotation, workflow, datasets and human-in-the-loop. It can help users label, process and manage training data quickly and efficiently, improving the accuracy and performance of AI models. V7 supports automated annotation, video annotation, document processing and other functions, and is suitable for various industries and application scenarios.
HyperHuman is a model for generating realistic human images, producing coherent and natural results by capturing structural features from coarse body skeletons down to fine-grained spatial geometry. It consists of three parts: 1) HumanVerse, a large-scale human dataset of 340M images with comprehensive annotations such as human pose, depth, and surface normals; 2) a latent structure diffusion model that simultaneously denoises depth, surface normals, and the synthesized RGB image, enforcing joint learning of image appearance, spatial relationships, and geometry in a unified network in which each branch is both structure-aware and texture-rich; and 3) a structure-guided refiner that further improves visual quality for more detailed high-resolution generation. Extensive experiments show that the model generates highly realistic and diverse human images across various scenarios, achieving state-of-the-art performance.
RoleLLM is a framework for building and evaluating the role-playing capabilities of large language models. It consists of four stages: role profile construction, context-based instruction generation (Context-Instruct), role prompting using GPT (RoleGPT), and role-conditioned instruction tuning (RoCIT). Through Context-Instruct and RoleGPT, the authors create RoleBench, a systematic, fine-grained, role-level benchmark dataset with 168,093 samples. Applying RoCIT on RoleBench yields RoleLLaMA (English) and RoleGLM (Chinese), which significantly improve role-playing capability and even achieve results comparable to RoleGPT built on GPT-4.
Awesome-Domain-LLM is a project that collects and curates open-source models, datasets, and evaluation benchmarks for vertical domains, including medicine, law, finance, education, and many others, aiming to bring the power of large models to every industry. Users can find models and datasets suited to their own domain to improve the efficiency and quality of their work.
Shizhi AI is a platform committed to providing high-quality AI models and datasets to research institutions, enterprises, public organizations, and individuals. It offers many types of models and datasets, spanning images, video, natural language processing, and more, so users can choose what fits their needs. Pricing is reasonable, with different packages available, and Shizhi AI positions itself to become a leading platform for AI models and datasets.
I2VGen-XL is an AI model library and dataset platform that provides rich models and datasets to help users quickly build AI applications. It supports a variety of AI tasks, including image recognition, natural language processing, and speech recognition. Users can upload, download, and share models and datasets through the platform, or call them via its API. Both free and paid tiers are available, so users can choose the service that suits them.
Process Street is an easy-to-use, no-code process platform that helps businesses create, track, automate, and complete tasks to optimize processes and increase efficiency. Its main functions include task assignment, approvals, conditional logic, automation, and scheduling and grouping. Using AI, Process Street also offers workflow design that adapts to an enterprise's unique operational needs, driving productivity and growth. It additionally provides forms, datasets, and pages, as well as integrations with tools such as Salesforce, Slack, Microsoft Teams, and Google Sheets.
Inst-Inpaint is an image inpainting method that determines which object to remove from natural language input and removes it in a single pass, without requiring a user-drawn mask. The release includes the GQA-Inpaint dataset and the novel Inst-Inpaint framework for removing objects from images based on textual prompts. The authors provide various GAN and diffusion baselines, run experiments on synthetic and real image datasets, and report different evaluation metrics for model quality and accuracy, showing significant quantitative and qualitative improvements.
CelebV-Text is a large-scale, high-quality, and diverse face text-video dataset designed to promote research on face text-video generation tasks. The dataset contains 70,000 video clips of faces in the wild, each with 20 texts, covering 40 general appearances, 5 detailed appearances, 6 lighting conditions, 37 actions, 8 emotions, and 6 light directions. CelebV-Text validates its superiority in video, text, and text-video correlation through comprehensive statistical analysis, and builds a benchmark to standardize the evaluation of face text-video generation tasks.
Botdocs is a series of high-quality datasets for training artificial intelligence to handle common customer service interactions. It can be used to train large language models, intent classifiers, and natural language understanding engines, helping enterprises automate routine customer service interactions, understand customer intent, and deliver superior customer experiences. Botdocs is provided in CSV, JSONL, and Dialogflow (ES) formats to meet the differing needs of AI developers and systems.
Objaverse is a large-scale dataset of more than 800K annotated 3D objects, each with a name, description, tags, and other metadata. It spans many object types, including static objects, animated objects, characters with part annotations, decomposable models, and indoor and outdoor environments, in a variety of visual styles. Objaverse can be used for 3D model generation, as an augmentation for 2D instance segmentation, for open-vocabulary embodied AI, and for studying the robustness of CLIP.
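Access is commonly through the objaverse helper package; a hedged sketch, treating its load_uids / load_annotations / load_objects helpers as assumptions:

```python
# Hedged sketch using the `objaverse` helper package (pip install objaverse);
# the function names are taken from its documented API but treated as
# assumptions here.
import objaverse

uids = objaverse.load_uids()                        # all object ids (~800K)
annotations = objaverse.load_annotations(uids[:5])  # name/description/tag metadata
objects = objaverse.load_objects(uids=uids[:5])     # downloads .glb files locally
print(objects)                                      # {uid: local_file_path}
```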
ClearCypherAI is a US-based AI startup building cutting-edge solutions. Its products include text-to-speech (T2A), speech-to-text (A2T), and speech-to-speech (A2A), supporting multilingual, multimodal, real-time voice intelligence. The company also provides natural language datasets, threat assessments, AI customization platforms, and other services, with deep customizability, advanced technology, and strong customer support.
Riku.AI is a no-code AI building tool for creating AI models and datasets. AI can be used easily through integrations with existing tools, APIs, or public share links, making AI accessible to everyone.
LAION is a non-profit organization dedicated to making machine learning resources available to the public, including datasets, tools, and models. It encourages open public education and greener use of resources through the reuse of existing datasets and models, and provides multiple datasets, models, and projects to support a wide range of AI research.