Upstage AI leverages powerful large language models and document processing engines to transform workflows and improve efficiency for enterprises. Its main advantages are high precision, high performance, and solutions customized for various industries, positioning it to empower leading enterprises and boost productivity.
Seed-Coder is a series of open-source code large language models launched by the ByteDance Seed team, comprising base, instruct, and reasoning variants. It aims to have the model curate its own code training data with minimal human effort, thereby significantly improving programming capability. Seed-Coder leads comparable open-source models in performance, suits a wide range of coding tasks, and is positioned to advance the open-source LLM ecosystem for both research and industry.
ZeroSearch is a novel reinforcement learning framework designed to incentivize the search capability of large language models (LLMs) without interacting with real search engines. Through supervised fine-tuning, ZeroSearch turns an LLM into a retrieval module capable of generating both relevant and irrelevant documents, and introduces a curriculum-based rollout mechanism to progressively elicit the model's reasoning capability. Its main advantage is that it outperforms models trained against real search engines while incurring zero API cost. It works with LLMs of all sizes, supports different reinforcement learning algorithms, and suits research and development teams that need efficient retrieval capability.
NoteLLM is a retrieval-oriented large language model focused on user-generated content, designed to improve the performance of recommendation systems. By combining topic generation with embedding generation, NoteLLM better understands and represents note content. The model adopts an end-to-end fine-tuning strategy and accepts multi-modal inputs, broadening its applicability across diverse content domains. Its significance lies in effectively improving the accuracy and user experience of note recommendation, making it especially suitable for UGC platforms such as Xiaohongshu.
SWE-RL is a reinforcement-learning-based reasoning technique for large language models proposed by Facebook Research. It uses open-source software evolution data to improve model performance on software engineering tasks, optimizing reasoning through a rule-based reward mechanism so the model better understands and generates high-quality code. Its main advantages are an innovative reinforcement learning method and effective use of open-source data, which open new possibilities for software engineering. The technique is still at the research stage and commercial pricing has not been determined, but it has significant potential to improve development efficiency and code quality.
Coding-Tutor is a programming tutoring tool based on large language models (LLMs), designed to help learners improve their programming ability through conversational interaction. Its Trace-and-Verify (Traver) workflow combines knowledge tracing with turn-by-turn verification to address key challenges in programming tutoring. Beyond programming education, the approach can extend to other task-coaching scenarios, adapting teaching content to the learner's knowledge level. The project is open source and welcomes community contributions.
Goedel-Prover is an open-source large language model focused on automated theorem proving. By translating natural-language mathematical problems into formal languages (such as Lean 4) and generating formal proofs, it significantly improves the efficiency of automated proving. The model achieves a 57.6% success rate on the miniF2F benchmark, surpassing other open-source models. Its main advantages are high performance, open-source extensibility, and a deep understanding of mathematical problems. Goedel-Prover aims to advance automated theorem proving and to provide powerful tool support for mathematical research and education.
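As a minimal illustration of the kind of Lean 4 formalization Goedel-Prover targets, here is a toy statement-and-proof pair (an invented example, not model output):

```lean
-- Toy Lean 4 formalization: the natural-language claim "addition of natural
-- numbers is commutative" stated formally and discharged with the core
-- lemma Nat.add_comm.
theorem sum_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Goedel-Prover's task is the hard direction of this picture: producing both the formal statement and a checkable proof term automatically from informal mathematics.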
OmniParser is an advanced image parsing technology developed by Microsoft that is designed to convert irregular screenshots into a structured list of elements, including the location of interactable areas and functional descriptions of icons. It achieves efficient parsing of UI interfaces through deep learning models, such as YOLOv8 and Florence-2. The main advantages of this technology are its efficiency, accuracy and wide applicability. OmniParser can significantly improve the performance of large language model (LLM)-based UI agents, enabling them to better understand and operate various user interfaces. It performs well in a variety of application scenarios, such as automated testing, intelligent assistant development, etc. OmniParser's open source nature and flexible license make it a powerful tool for developers and researchers.
Mistral Small 24B is a 24-billion-parameter large language model developed by the Mistral AI team that supports multilingual dialogue and instruction following. Through instruction fine-tuning, it generates high-quality text for scenarios such as chat, writing, and programming assistance. Its main advantages are strong language generation, multilingual support, and efficient inference. Released under an open-source license with support for local deployment and quantization, it suits individual and enterprise users who need high-performance language processing, including scenarios that demand data privacy.
DeepSeek-R1-Distill-Llama-70B is a large language model from the DeepSeek team, built on the Llama-70B architecture and distilled from the reinforcement-learning-optimized DeepSeek-R1. The model performs well on reasoning, conversational, and multilingual tasks and supports scenarios including code generation, mathematical reasoning, and natural language processing. Its main advantages are efficient inference and the ability to solve complex problems, with both open-source and commercial use supported. It suits enterprises and research institutions that need high-performance language generation and reasoning.
InternVL2.5-MPO is a series of multi-modal large-scale language models based on InternVL2.5 and Mixed Preference Optimization (MPO). It performs well on multi-modal tasks by integrating the newly incrementally pretrained InternViT with multiple pretrained large language models (LLMs), such as InternLM 2.5 and Qwen 2.5, using randomly initialized MLP projectors. This model series was trained on the multi-modal reasoning preference data set MMPR, which contains approximately 3 million samples. Through effective data construction processes and hybrid preference optimization technology, the model's reasoning capabilities and answer quality are improved.
SakanaAI/self-adaptive-llms is an adaptive framework called Transformer² that addresses the computational cost and static nature of traditional fine-tuning when handling diverse tasks. The framework adapts large language models (LLMs) to unseen tasks in real time during inference via a two-step mechanism: first, a dispatch system identifies the task's attributes; then, task-specific 'expert' vectors trained with reinforcement learning are dynamically blended to obtain the target behavior for the input prompt. Key advantages include real-time task adaptability, computational efficiency, and flexibility. The project was developed by the SakanaAI team and is open source on GitHub, with 195 stars and 12 forks.
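The second step, blending expert vectors by dispatcher scores, can be sketched as a weighted sum. The vectors and scores below are invented; the real system trains expert vectors with reinforcement learning and applies them inside the model's weights:

```python
# Toy sketch of Transformer²'s two-step idea: a dispatcher scores task
# attributes, then task-specific "expert" vectors are mixed by those scores.
# All numbers here are illustrative, not from the actual system.

def blend_experts(scores, experts):
    """Weighted sum of expert vectors, with weights normalized to sum to 1."""
    total = sum(scores.values())
    dim = len(next(iter(experts.values())))
    mixed = [0.0] * dim
    for task, s in scores.items():
        w = s / total  # normalized dispatcher weight for this expert
        mixed = [m + w * v for m, v in zip(mixed, experts[task])]
    return mixed

experts = {"math": [1.0, 0.0], "code": [0.0, 1.0]}  # hypothetical expert vectors
scores = {"math": 3.0, "code": 1.0}                 # dispatcher's attribute scores
z = blend_experts(scores, experts)                  # → [0.75, 0.25]
```

A prompt judged mostly mathematical thus receives a behavior vector dominated by the math expert while still carrying some of the code expert.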
InternLM3-8B-Instruct is a large language model developed by the InternLM team with excellent reasoning and knowledge-intensive task processing. Trained on only 4 trillion high-quality tokens, it costs more than 75% less to train than models of the same tier, while surpassing models such as Llama3.1-8B and Qwen2.5-7B across multiple benchmarks. It supports a deep-thinking mode that solves complex reasoning tasks through long chains of thought, while retaining smooth user interaction. The model is open source under the Apache-2.0 license and suits applications that require efficient reasoning and knowledge processing.
Dria-Agent-a-3B is a large language model based on the Qwen2.5-Coder series, focused on agent applications. It uses Pythonic function calling, offering parallel multi-function calls in a single turn, free-form reasoning and action, and on-the-fly generation of complex solutions. The model performs well on multiple benchmarks, including the Berkeley Function Calling Leaderboard (BFCL), MMLU-Pro, and the Dria-Pythonic-Agent-Benchmark (DPAB). It has 3.09B parameters and uses the BF16 tensor type.
Dria-Agent-a-7B is a large language model trained from the Qwen2.5-Coder series, focused on agent applications. Compared with traditional JSON function calling, its Pythonic function calling offers parallel multi-function calls in a single turn, free-form reasoning and action, and on-the-fly generation of complex solutions. The model performs well on multiple benchmarks, including the Berkeley Function Calling Leaderboard (BFCL), MMLU-Pro, and the Dria-Pythonic-Agent-Benchmark (DPAB). It has 7.62 billion parameters, uses the BF16 tensor type, and supports text generation tasks. Its main advantages are strong programming assistance, an efficient function calling style, and high accuracy in specific domains. It suits applications that require complex logic and multi-step task execution, such as automated programming and intelligent agents, and is currently available free on the Hugging Face platform.
Dria-Agent-α is a large language model (LLM) tool-interaction framework, released on Hugging Face, that calls tools through Python code. Compared with the traditional JSON mode, it makes fuller use of the LLM's reasoning ability, letting the model solve complex problems in a way closer to natural human reasoning. The framework leverages Python's popularity and near-pseudocode syntax to improve LLM performance in agentic scenarios. Dria-Agent-α was developed using the synthetic data generation tool Dria, which uses a multi-stage pipeline to generate realistic scenarios and train models for complex problem solving. Two models, Dria-Agent-α-3B and Dria-Agent-α-7B, have been released on Hugging Face.
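The contrast between JSON-style and Pythonic tool calling described above can be sketched as follows. The tool names, arguments, and the model "output" are invented for illustration and are not Dria-Agent's actual schema:

```python
# Hypothetical illustration of JSON-style vs. Pythonic function calling.
# The tools (get_weather, send_email) and the model output are invented.

# Traditional JSON-style tool call: one structured object per call.
json_style_call = {
    "name": "get_weather",
    "arguments": {"city": "Paris", "unit": "celsius"},
}

# Pythonic style: the model emits executable Python, so one response can
# chain calls, hold intermediate variables, and branch on results.
pythonic_style_call = """
forecast = get_weather(city="Paris", unit="celsius")
if forecast["temp"] < 5:
    send_email(to="alice@example.com", body=f"Cold in Paris: {forecast['temp']}C")
"""

def get_weather(city, unit="celsius"):
    # Stub tool implementation for the sketch.
    return {"city": city, "temp": 3, "unit": unit}

sent = []
def send_email(to, body):
    sent.append({"to": to, "body": body})

# Executing the model's Pythonic output runs both tools in sequence.
exec(pythonic_style_call)
```

Expressing the same conditional two-step plan as a sequence of JSON objects would require an extra round trip to the model between the two calls.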
This model is a quantized version of a large language model, using 4-bit quantization to reduce storage and compute requirements. It targets natural language processing, has 8.03B parameters, is free for non-commercial use, and suits users who need high-performance language applications in resource-constrained environments.
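As intuition for what 4-bit quantization does, here is a toy symmetric quantizer in plain Python. Real schemes such as GPTQ, AWQ, or NF4 operate group-wise with more sophisticated scaling; this sketch only shows the storage/accuracy trade-off in miniature:

```python
# Toy symmetric 4-bit quantization: floats become integers in [-8, 7]
# (4 bits) plus one shared scale, instead of 32 bits each.

def quantize_4bit(weights):
    """Map floats to integer codes in [-8, 7] with a single scale factor."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_4bit(q, scale):
    """Approximate reconstruction of the original floats."""
    return [v * scale for v in q]

w = [0.12, -0.7, 0.33, 0.07]
q, s = quantize_4bit(w)          # 4 bits per weight instead of 32
w_hat = dequantize_4bit(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The reconstruction error is bounded by roughly half a quantization step, which is why 4-bit models trade a small accuracy loss for an ~8x reduction in weight storage.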
InternVL2.5-MPO is an advanced multi-modal large-scale language model series built on InternVL2.5 and Mixed Preference Optimization (MPO). This series of models performs well in multi-modal tasks, capable of processing image, text and video data and generating high-quality text responses. The model adopts the 'ViT-MLP-LLM' paradigm to optimize visual processing capabilities through pixel unshuffle operations and dynamic resolution strategies. In addition, the model also introduces support for multiple image and video data, further expanding its application scenarios. InternVL2.5-MPO surpassed multiple benchmark models in multi-modal capability evaluation, proving its leading position in the multi-modal field.
Agent Laboratory is a project developed by Samuel Schmidgall et al. It aims to help researchers complete the entire research process from literature review to experimental execution to report writing through specialized agents driven by large language models. It is not intended to replace human creativity, but to complement it, allowing researchers to focus on ideation and critical thinking while automating repetitive and time-consuming tasks such as coding and documentation. The source code of this tool is licensed under the MIT license, which allows use, modification and distribution of the code under the terms of the MIT license.
InternVL2_5-26B-MPO-AWQ is a multi-modal large-scale language model developed by OpenGVLab, aiming to improve the model's reasoning capabilities through mixed preference optimization. The model performs well in multi-modal tasks and is able to handle complex relationships between images and text. It adopts advanced model architecture and optimization technology, giving it significant advantages in multi-modal data processing. This model is suitable for scenarios that require efficient processing and understanding of multi-modal data, such as image description generation, multi-modal question answering, etc. Its main advantages include powerful inference capabilities and efficient model architecture.
AnyParser Pro is an innovative document parsing tool developed by CambioML. It uses large language model (LLM) technology to quickly and accurately extract complete text content from PDF, PPT and image files. The main advantages of this technology are its efficient processing speed and high-precision parsing capabilities, which can significantly improve the efficiency of document processing. Background information on AnyParser Pro shows that it was launched by CambioML, a startup incubated by Y Combinator, and aims to provide users with an easy-to-use and powerful document parsing solution. Currently, the product offers a free trial and users can access its features by obtaining an API key.
Sonus-1 is a series of large language models (LLMs) launched by Sonus AI to push the boundaries of artificial intelligence. Designed for their high performance and multi-application versatility, these models include Sonus-1 Mini, Sonus-1 Air, Sonus-1 Pro and Sonus-1 Pro (w/ Reasoning) in different versions to suit different needs. Sonus-1 Pro (w/ Reasoning) performed well on multiple benchmarks, particularly on reasoning and math problems, demonstrating its ability to outperform other proprietary models. Sonus AI is committed to developing high-performance, affordable, reliable, and privacy-focused large-scale language models.
InternVL2_5-26B-MPO is a multimodal large language model (MLLM). Based on InternVL2.5, it further improves the model performance through Mixed Preference Optimization (MPO). This model can process multi-modal data including images and text, and is widely used in scenarios such as image description and visual question answering. Its importance lies in its ability to understand and generate text that is closely related to the content of the image, pushing the boundaries of multi-modal artificial intelligence. Product background information includes its superior performance in multi-modal tasks and evaluation results in OpenCompass Leaderboard. This model provides researchers and developers with powerful tools to explore and realize the potential of multimodal artificial intelligence.
InternVL2_5-8B-MPO-AWQ is a multi-modal large-scale language model launched by OpenGVLab. It is based on the InternVL2.5 series and uses Mixed Preference Optimization (MPO) technology. The model demonstrates excellent performance in visual and language understanding and generation, especially in multi-modal tasks. It achieves in-depth understanding and interaction of images and text by combining the visual part InternViT and the language part InternLM or Qwen, using randomly initialized MLP projectors for incremental pre-training. The importance of this technology lies in its ability to process multiple data types including single images, multiple images, and video data, providing new solutions in the field of multi-modal artificial intelligence.
InternVL2.5-MPO is an advanced multi-modal large-scale language model series built on InternVL2.5 and Mixed Preference Optimization (MPO). The series integrates the newly incrementally pretrained InternViT with various pretrained large language models, including InternLM 2.5 and Qwen 2.5, using randomly initialized MLP projectors. InternVL2.5-MPO retains the model architecture of InternVL 2.5 and its predecessors, following the "ViT-MLP-LLM" paradigm. It supports multi-image and video data, and Mixed Preference Optimization further improves performance on multi-modal tasks.
HuatuoGPT-o1-8B is a large language model (LLM) in the medical field designed for advanced medical reasoning. It generates a complex thought process that reflects and refines its reasoning before providing a final response. The model is built based on LLaMA-3.1-8B, supports English, and adopts the 'thinks-before-it-answers' method. The output format includes the reasoning process and final response. This model is of great significance in the medical field because of its ability to handle complex medical problems and provide thoughtful answers, which is crucial to improving the quality and efficiency of medical decision-making.
HuatuoGPT-o1 is a large language model designed for complex medical reasoning, capable of identifying errors, exploring alternative strategies and refining answers. The model advances complex reasoning by leveraging verifiable medical questions and specialized medical validators. The main advantages of HuatuoGPT-o1 include: using validators to guide the search of complex reasoning trajectories to fine-tune large language models; applying reinforcement learning (PPO) based on validator rewards to further improve complex reasoning capabilities. The open source model, data and code of HuatuoGPT-o1 make it of great value in the fields of medical education and research.
InternVL2_5-4B-MPO-AWQ is a multimodal large language model (MLLM) focused on improving performance in image-text interaction tasks. The model is based on the InternVL2.5 series and further improves performance through Mixed Preference Optimization (MPO). It can handle single-image, multi-image, and video inputs and suits complex tasks that require interactive understanding of images and text. With its excellent multi-modal capabilities, InternVL2_5-4B-MPO-AWQ provides a powerful solution for image-to-text tasks.
InternVL2.5-MPO is an advanced multi-modal large-scale language model series built based on InternVL2.5 and hybrid preference optimization. The model integrates the new incremental pre-trained InternViT and various pre-trained large language models, such as InternLM 2.5 and Qwen 2.5, using randomly initialized MLP projectors. It supports multiple image and video data and performs well in multi-modal tasks, capable of understanding and generating image-related text content.
Valley is a multimodal large language model (MLLM) developed by ByteDance, designed to handle a variety of tasks involving text, image, and video data. The model achieved the best results on internal e-commerce and short-video benchmarks, far outperforming other open-source models, and posted excellent results on the OpenCompass multimodal model evaluation leaderboard with an average score of 67.40, ranking in the top two among known open-source MLLMs under 10B parameters.
InternVL2_5-2B-MPO is a family of multi-modal large-scale language models that demonstrates excellent overall performance. The series is built on InternVL2.5 and hybrid preference optimization. It integrates the newly incrementally pretrained InternViT with various pretrained large language models, including InternLM 2.5 and Qwen 2.5, using randomly initialized MLP projectors. The model performs well in multi-modal tasks and is able to handle a variety of data types including images and text, making it suitable for scenarios that require understanding and generating multi-modal content.
InternVL2_5-1B-MPO is a multimodal large language model (MLLM) built on InternVL2.5 and Mixed Preference Optimization (MPO), demonstrating superior overall performance. The model integrates the newly incrementally pretrained InternViT with various pretrained large language models (LLMs), including InternLM 2.5 and Qwen 2.5, using randomly initialized MLP projectors. InternVL2.5-MPO retains the same "ViT-MLP-LLM" paradigm as InternVL 2.5 and its predecessors in model architecture, and introduces support for multiple image and video data. This model performs well in multi-modal tasks and can handle a variety of visual language tasks including image description, visual question answering, etc.
ExploreToM is a framework developed by Facebook Research for generating diverse and challenging theory-of-mind data at scale, to strengthen the training and evaluation of large language models (LLMs). The framework uses the A* search algorithm over a custom domain-specific language to generate complex story structures and novel, diverse, and plausible scenarios that test the limits of LLMs.
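The A* search that ExploreToM applies to its story-state space follows the standard algorithm. Below is a generic A* skeleton on a toy grid; the grid, costs, and heuristic are invented stand-ins for ExploreToM's story actions and domain-specific scoring:

```python
# Generic A* search: expand the frontier in order of g(n) + h(n), where
# g is the cost so far and h is an admissible heuristic estimate to the goal.
import heapq

def a_star(start, goal, neighbors, heuristic):
    """Return a lowest-cost path from start to goal, or None if unreachable."""
    frontier = [(heuristic(start), 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        for nxt, step_cost in neighbors(node):
            new_cost = cost + step_cost
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic(nxt), new_cost, nxt, path + [nxt]),
                )
    return None

# Toy 3x3 grid with one blocked cell, 4-connected moves of cost 1.
blocked = {(1, 1)}
def grid_neighbors(p):
    x, y = p
    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if 0 <= nx <= 2 and 0 <= ny <= 2 and (nx, ny) not in blocked:
            yield (nx, ny), 1

manhattan = lambda p: abs(p[0] - 2) + abs(p[1] - 2)  # admissible heuristic
path = a_star((0, 0), (2, 2), grid_neighbors, manhattan)
```

In ExploreToM the nodes are partial story states and the heuristic scores how promising a partial story is, but the frontier-expansion loop is the same.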
EXAONE-3.5-32B-Instruct-GGUF belongs to a series of instruction-tuned bilingual (English and Korean) generative models developed by LG AI Research, with parameter counts ranging from 2.4B to 32B. These models support long-context processing of up to 32K tokens, demonstrating state-of-the-art performance in real-world use cases and long-context understanding while remaining competitive with recently released models of similar scale in the general domain. The family is documented through technical reports, blog posts, and GitHub. This release contains the instruction-tuned 32B model in multiple precisions, with the following specifications: 30.95B parameters (excluding embeddings), 64 layers, grouped-query attention (GQA) with 40 query heads and 8 key-value heads, a vocabulary of 102,400, a context length of 32,768 tokens, and GGUF quantizations including Q8_0, Q6_0, Q5_K_M, Q4_K_M, and IQ4_XS (BF16 weights are also provided).
CosyVoice 2 is a speech synthesis model developed by Alibaba Group's SpeechLab@Tongyi team. Built on supervised discrete speech tokens, it combines two popular generative approaches, language models (LMs) and flow matching, to achieve synthesis with high naturalness, content consistency, and speaker similarity. The model has important applications in multimodal large language models (LLMs), especially interactive experiences where response latency and real-time operation are critical. CosyVoice 2 improves the codebook utilization of speech tokens through finite scalar quantization, simplifies the text-to-speech language model architecture, and designs a chunk-aware causal flow matching model to fit different synthesis scenarios. Trained on large-scale multilingual data, it achieves human-comparable synthesis quality with very low response latency and real-time performance.
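Finite scalar quantization, which CosyVoice 2 uses to improve speech-token codebook utilization, simply rounds each latent dimension to a few uniform levels, so the "codebook" is implicit and naturally well used. A toy version (level counts and inputs are illustrative, not the paper's configuration):

```python
# Toy finite scalar quantization (FSQ): each latent dimension is clipped
# to [-1, 1] and rounded to one of L uniform levels; the product of the
# per-dimension level counts defines an implicit codebook.

def fsq(vector, levels):
    """Quantize each dimension to an integer code in [0, levels[i] - 1]."""
    codes = []
    for x, L in zip(vector, levels):
        x = max(-1.0, min(1.0, x))                 # bound the value
        half = (L - 1) / 2
        codes.append(int(round(x * half) + half))  # shift to non-negative code
    return codes

def fsq_decode(codes, levels):
    """Map integer codes back to the level centers in [-1, 1]."""
    return [(c - (L - 1) / 2) / ((L - 1) / 2) for c, L in zip(codes, levels)]

levels = [5, 5, 3]             # implicit codebook of 5 * 5 * 3 = 75 entries
z = [0.4, -0.9, 0.1]           # hypothetical encoder output
codes = fsq(z, levels)
z_hat = fsq_decode(codes, levels)
```

Because every combination of per-dimension levels is a valid code, there are no "dead" codebook entries to leave unused, unlike a learned vector-quantization codebook.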
Command R7B is a high-performance, scalable large language model (LLM) launched by Cohere, specially designed for enterprise-level applications. It provides first-class speed, efficiency and quality while maintaining a small model size. It can be deployed on ordinary GPUs, edge devices and even CPUs, significantly reducing the cost of production deployment of AI applications. Command R7B excels in multi-language support, reference-validated retrieval enhanced generation (RAG), inference, tool usage and agent behavior, making it ideal for enterprise use cases that require optimized speed, cost performance and computing resources.
InternVL 2.5 is a family of advanced multimodal large language models based on InternVL 2.0, introducing significant enhancements in training and testing strategies and data quality while maintaining the core model architecture. The family offers an in-depth look at the relationship between model scaling and performance, systematically exploring trends across visual encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluation on benchmarks covering multi-disciplinary reasoning, document understanding, multi-image/video understanding, real-world understanding, multi-modal hallucination detection, visual grounding, multilingual capability, and pure language processing, InternVL 2.5 demonstrates competitiveness on par with leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, it is the first open-source MLLM to exceed 70% on the MMMU benchmark, achieving a 3.7-percentage-point improvement via chain-of-thought (CoT) reasoning and demonstrating strong potential for test-time scaling.
InternVL2_5-4B is an advanced multimodal large language model (MLLM) that retains the core architecture of InternVL 2.0 while adding significant enhancements in training and testing strategies and data quality. The model performs well on image-and-text-to-text tasks, especially multi-modal reasoning, mathematical problem solving, OCR, charts, and document understanding. As an open-source model, it gives researchers and developers a powerful tool for exploring and building vision-and-language intelligent applications.
MLPerf Client is a new benchmark co-developed by MLCommons designed to evaluate the performance of large language models (LLMs) and other AI workloads on personal computers, ranging from laptops to desktops to workstations. This benchmark provides clear indicators of how a system handles generative AI workloads by simulating real-world AI tasks. The MLPerf Client Working Group hopes this benchmark will drive innovation and competition, ensuring PCs can meet the challenges of an AI-driven future.
InternVL 2.5 is an advanced multi-modal large language model series that builds on InternVL 2.0 by introducing significant training and testing strategy enhancements and data quality improvements while maintaining its core model architecture. The model integrates the newly incrementally pretrained InternViT with various pretrained large language models, such as InternLM 2.5 and Qwen 2.5, using randomly initialized MLP projectors. InternVL 2.5 supports multiple image and video data, with dynamic high-resolution training methods that provide better performance when processing multi-modal data.
InternVL 2.5 is a series of advanced multimodal large language models (MLLM) that builds on InternVL 2.0 by introducing significant training and testing strategy enhancements and data quality improvements, while maintaining its core model architecture. The model integrates the newly incrementally pretrained InternViT with various pretrained large language models (LLMs), such as InternLM 2.5 and Qwen 2.5, using randomly initialized MLP projectors. InternVL 2.5 supports multiple image and video data, and enhances the model's ability to handle multi-modal data through dynamic high-resolution training methods.
InternVL2_5-8B is a multi-modal large language model (MLLM) developed by OpenGVLab. It has significant training and testing strategy enhancements and data quality improvements based on InternVL 2.0. The model adopts the 'ViT-MLP-LLM' architecture, which integrates the new incremental pre-trained InternViT with multiple pre-trained language models, such as InternLM 2.5 and Qwen 2.5, using a randomly initialized MLP projector. InternVL 2.5 series models demonstrate excellent performance on multi-modal tasks, including image and video understanding, multi-language understanding, etc.
InternVL2_5-26B is an advanced multimodal large language model (MLLM) that is further developed based on InternVL 2.0 by introducing significant training and testing strategy enhancements and data quality improvements. The model maintains the "ViT-MLP-LLM" core model architecture of its predecessor and integrates the newly incrementally pretrained InternViT with various pretrained large language models (LLMs), such as InternLM 2.5 and Qwen 2.5, using randomly initialized MLP projectors. InternVL 2.5 series models demonstrate excellent performance in multi-modal tasks, especially in visual perception and multi-modal capabilities.
InternVL 2.5 is a series of multi-modal large-scale language models launched by OpenGVLab. It has significant training and testing strategy enhancements and data quality improvements based on InternVL 2.0. This model series can process image, text and video data, and has the ability to understand and generate multi-modal data. It is a cutting-edge product in the current field of multi-modal artificial intelligence. The InternVL 2.5 series models provide powerful support for multi-modal tasks with their high performance and open source features.
InternVL 2.5 is a series of advanced multimodal large language models (MLLM) that builds on InternVL 2.0 by introducing significant training and testing strategy enhancements and data quality improvements. This model series is optimized in terms of visual perception and multi-modal capabilities, supporting a variety of functions including image and text-to-text conversion, and is suitable for complex tasks that require processing of visual and language information.
Llama-3.3-70B-Instruct is a large language model with 70 billion parameters developed by Meta, optimized specifically for multilingual dialogue scenarios. The model uses an optimized Transformer architecture and applies supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to improve helpfulness and safety. It supports multiple languages, handles text generation tasks, and is an important technology in the field of natural language processing.
Sandbox Fusion is a versatile code sandbox designed for large language models (LLMs). It supports up to 20 programming languages and can comprehensively test multiple areas including programming, mathematics, and hardware programming. Sandbox Fusion integrates over 10 coding-related evaluation datasets with standardized data formats, all accessible via a unified HTTP API. It is also optimized for cloud infrastructure deployment and provides built-in safety isolation when privileged containers are available. Developed by ByteDance, Sandbox Fusion aims to give developers a safe and efficient code-testing environment.
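Calling such a sandbox from a client reduces to one HTTP POST per code snippet. The sketch below only builds the request without sending it; the endpoint path (/run_code) and JSON field names are assumptions for illustration, so consult Sandbox Fusion's documentation for the real schema:

```python
# Sketch of submitting a snippet to a code sandbox over a unified HTTP API.
# Endpoint path and payload fields are assumed, not Sandbox Fusion's
# documented schema; the request is built but never sent.
import json
from urllib.request import Request

def build_run_request(base_url, code, language):
    """Build a POST request asking the sandbox to execute `code`."""
    payload = {"code": code, "language": language}
    return Request(
        base_url + "/run_code",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_run_request("http://localhost:8080", "print(1 + 1)", "python")
```

Because every dataset and language goes through the same request shape, an evaluation harness needs only one client regardless of which benchmark it is running.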
OLMo 2 7B is a 7B parameter large language model developed by the Allen Institute for AI (Ai2). It shows excellent performance on multiple natural language processing tasks. This model can understand and generate natural language through training on large-scale data sets, and supports scientific research and applications related to multiple language models. The main advantages of OLMo 2 7B include its large-scale parameter volume, which enables the model to capture more subtle language features, and its open source nature, which promotes further research and application in academia and industry.
Star-Attention is a novel block-sparse attention mechanism proposed by NVIDIA to improve the inference efficiency of Transformer-based large language models (LLMs) on long sequences. The technique operates in two phases and significantly improves inference speed while maintaining 95-100% accuracy. It is compatible with most Transformer-based LLMs, can be used directly without additional training or fine-tuning, and can be combined with other optimizations such as Flash Attention and KV cache compression to further improve performance.
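The block-sparse idea can be illustrated with a toy attention mask: each token attends causally within its own block plus a shared "anchor" block, instead of the full quadratic pattern. Block size and layout here are illustrative; the actual method distributes context blocks across hosts and adds a global query phase:

```python
# Toy block-sparse attention mask in the spirit of Star-Attention's
# blockwise context phase: local causal attention within each block plus
# a shared anchor block, instead of full O(n^2) attention.

def star_mask(seq_len, block_size):
    """mask[i][j] is True when token i may attend to token j."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        blk = i // block_size
        for j in range(seq_len):
            same_block = (j // block_size == blk) and j <= i  # causal, local
            anchor = j < block_size                           # first block shared
            mask[i][j] = same_block or anchor
    return mask

mask = star_mask(seq_len=8, block_size=2)
attended = sum(sum(row) for row in mask)  # far fewer entries than 8 * 8 = 64
```

As the sequence grows, attended positions scale roughly linearly with sequence length (each token sees one local block plus one anchor block) rather than quadratically.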
Mistral-Large-Instruct-2411 is a large language model with 123B parameters provided by Mistral AI. It has the most advanced capabilities in reasoning, knowledge, encoding, etc. The model supports multiple languages and is trained on more than 80 programming languages, including but not limited to Python, Java, C, C++, etc. It is agent-centric and has native function calling and JSON output capabilities, making it an ideal choice for scientific research and development.
Qwen2.5-Coder is the latest series of Qwen large-scale language models, designed for code generation, code reasoning and code repair. Based on the powerful Qwen2.5, by increasing the training token to 5.5 trillion, including source code, text code base, synthetic data, etc., Qwen2.5-Coder-32B has become the most advanced open source code large-scale language model currently, and its coding capabilities match GPT-4o. This model is an instruction-tuned version of 1.5B parameters, adopts GGUF format, and has features such as causal language model, pre-training and post-training stages, and transformers architecture.
WorkflowLLM is a data-centric framework designed to enhance the capabilities of large language models (LLMs) in workflow orchestration. The core is WorkflowBench, a large-scale supervised fine-tuning dataset containing 106,763 samples from 1,503 APIs in 83 applications and 28 categories. WorkflowLLM created a WorkflowLlama model specifically optimized for workflow orchestration tasks by fine-tuning the Llama-3.1-8B model. Experimental results show that WorkflowLlama performs well in orchestrating complex workflows and generalizes well to unseen APIs.
Pixtral-Large-Instruct-2411 is a 124B parameter large multi-modal model developed by Mistral AI. It is built on Mistral Large 2 and demonstrates cutting-edge image understanding capabilities. The model is able to understand documents, diagrams, and natural images while maintaining Mistral Large 2’s leadership in text understanding. It has achieved state-of-the-art performance on MathVista, DocVQA, VQAv2 and other data sets, and is a powerful tool for scientific research and commercial applications.
fixie-ai/ultravox-v0_4_1-llama-3_1-70b is a large language model based on pre-trained Llama3.1-70B-Instruct and whisper-large-v3-turbo, which can process speech and text input and generate text output. The model converts the input audio into an embedding through the special pseudo tag <|audio|> and merges it with the text prompt to generate the output text. Ultravox was developed to expand application scenarios for speech recognition and text generation, such as voice agents, speech-to-speech translation, and spoken audio analysis. This model is licensed under the MIT license and developed by Fixie.ai.
fixie-ai/ultravox-v0_4_1-llama-3_1-8b is a large language model based on pre-trained Llama3.1-8B-Instruct and whisper-large-v3-turbo, which can process speech and text input and generate text output. The model converts input audio into embeddings via special <|audio|> pseudo-tags and generates output text. Future releases plan to extend the token vocabulary to support the generation of semantic and acoustic audio tokens, which can in turn be used by the vocoder to produce speech output. The model performs well in translation evaluation without preference adjustment and is suitable for scenarios such as voice agency, speech-to-speech translation, and speech analysis.
Hermes 3 is the latest version of the Hermes series of large language models (LLM) launched by Nous Research. Compared with Hermes 2, it has significant improvements in agent capabilities, role playing, reasoning, multi-turn dialogue, and long text coherence. The core concept of the Hermes series of models is to align LLM with users, giving end users powerful guidance capabilities and control. Based on Hermes 2, Hermes 3 further enhances function calling and structured output capabilities, and improves general assistant capabilities and code generation skills.
Nous Chat, the first user-facing chatbot from AI research organization Nous Research, provides access to the large language model Hermes 3-70B. Hermes 3-70B is a variant of Meta’s Llama 3.1, which has been fine-tuned to serve as popular AI chat tools like ChatGPT. The chatbot features its retro design language and fonts and characters from early PC terminals, with dark and light modes available for users to choose from. Although Nous Chat is designed to allow users to deploy and control their own AI models, it actually has some safeguards in place, including a ban on the manufacture of illegal drugs. Additionally, the model has a knowledge deadline of April 2023, so it may not be as useful as other competitors in capturing the latest events. Still, Nous Chat is an interesting experiment, and as new features are added, it could become an attractive alternative to enterprise chatbots and AI models.
Agora is a simple cross-platform protocol that allows heterogeneous large language models (LLMs) to efficiently communicate with each other through negotiation. The protocol enables rare communications via natural language and negotiates a communications protocol for frequent communications, often involving structured data (such as JSON). Once the protocols are determined, they implement routines using LLMs, which are simple scripts (such as Python) that are used to send or receive data. Future communications will be handled using these routines, meaning LLMs are no longer needed, enabling efficiency, versatility and portability.
PPLLaVA is an efficient large-scale language model for video that combines fine-grained visual cue alignment, user-instructed convolutional-style pooling of visual token compression, and CLIP context extension. The model establishes new state-of-the-art results on datasets such as VideoMME, MVBench, VideoChatGPT Bench, and VideoQA Bench, using only 1024 visual tokens and achieving an 8x increase in throughput.
5ire is an AI product built with simplicity and user-friendliness at its core, designed to make it easy for even beginners to leverage large language models. It supports parsing and vectorization of multiple document formats, and has functions such as local knowledge base, usage analysis, prompt library, bookmarks, and fast keyword search. As an open source project, 5ire is free to download and provides a pay-as-you-go large language model API service.
O1-Journey is a project initiated by the GAIR research group at Shanghai Jiao Tong University to replicate and reimagine the capabilities of OpenAI’s O1 model. This project proposes a new training paradigm of "journey learning" and builds the first model to successfully integrate search and learning in mathematical reasoning. This model becomes an effective way to handle complex reasoning tasks through processes of trial and error, correction, backtracking, and reflection.
Ferret-UI is the first user interface-centric multimodal large-scale language model (MLLM) designed for referent expression, localization and reasoning tasks. It is built on Gemma-2B and Llama-3-8B and is capable of performing complex user interface tasks. This version follows Apple's research paper and is a powerful tool for image-to-text tasks, with advantages in conversation and text generation.
URL Parser Online is an online tool that converts complex URLs into an input format suitable for use by large language models (LLMs). The importance of this technology lies in its ability to help developers and researchers process and parse URL data more efficiently, especially when performing web content analysis and data extraction. Product background information shows that with the explosive growth of Internet data volume, the demand for URL parsing and processing is increasing. URL Parser Online provides users with a convenient solution with its simple user interface and efficient parsing capabilities. The product currently provides free services and is targeted at developers and data analysts.
SELA is an innovative system that enhances automated machine learning (AutoML) by combining Monte Carlo Tree Search (MCTS) with large language model (LLM) based agents. Traditional AutoML methods often produce low diversity and suboptimal code, limiting their effectiveness in model selection and integration. By representing pipeline configurations as trees, SELA enables agents to intelligently explore the solution space and iteratively improve their strategies based on experimental feedback.
LongVU is an innovative long video language understanding model that reduces the number of video tags through a spatiotemporal adaptive compression mechanism while retaining visual details in long videos. The importance of this technology lies in its ability to process a large number of video frames with only a small loss of visual information within the limited context length, significantly improving the ability to understand and analyze long video content. LongVU outperforms existing methods on multiple video understanding benchmarks, especially on the task of understanding hour-long videos. Additionally, LongVU is able to efficiently scale to smaller model sizes while maintaining state-of-the-art video understanding performance.
BitNet is an official inference framework developed by Microsoft and designed for 1-bit large language models (LLMs). It provides a set of optimized cores that support fast and lossless 1.58-bit model inference on the CPU (NPU and GPU support coming soon). BitNet achieved a speed increase of 1.37 times to 5.07 times on ARM CPU, and the energy efficiency ratio increased by 55.4% to 70.0%. On x86 CPUs, the speed improvement ranges from 2.37 times to 6.17 times, and the energy efficiency ratio increases by 71.9% to 82.2%. In addition, BitNet is able to run the 100B parameter BitNet b1.58 model on a single CPU, achieving inference speeds close to human reading speed, broadening the possibility of running large language models on local devices.
Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA, focusing on improving the helpfulness of answers generated by large language models (LLM). The model performs well on multiple automatic alignment benchmarks, such as Arena Hard, AlpacaEval 2 LC, and GPT-4-Turbo MT-Bench. It is trained on the Llama-3.1-70B-Instruct model by using RLHF (specifically the REINFORCE algorithm), Llama-3.1-Nemotron-70B-Reward and HelpSteer2-Preference hints. This model not only demonstrates NVIDIA's technology in improving the helpfulness of common domain instruction compliance, but also provides a model transformation format that is compatible with the HuggingFace Transformers code library and can be used for free managed inference through NVIDIA's build platform.
ComfyGen is an adaptive workflow system focused on text-to-image generation that automates and customizes efficient workflows by learning user prompts. The advent of this technology marks a shift from the use of a single model to complex workflows that combine multiple specialized components to improve the quality of image generation. The main benefit behind ComfyGen is the ability to automatically adjust the workflow based on the user's text prompts to produce higher quality images, which is important for users who need to produce images of a specific style or theme.
Mistral-8B-Instruct-2410 is a large-scale language model developed by the Mistral AI team and is designed for local intelligence, device-side computing and edge usage scenarios. The model performs well among similar sized models, supports 128k context windows and interleaved sliding window attention mechanisms, can be trained on multi-language and code data, supports function calls, and has a vocabulary of 131k. The Ministral-8B-Instruct-2410 model performs well in various benchmarks, including knowledge and general knowledge, code and mathematics, and multi-language support. The model performs particularly well in chat/arena (judged by gpt-4o) and is able to handle complex conversations and tasks.
MM1.5 is a family of multimodal large language models (MLLMs) designed to enhance text-rich image understanding, visual referential representation and grounding, and multi-image reasoning. This model is based on the MM1 architecture and adopts a data-centric model training method to systematically explore the impact of different data mixtures throughout the model training life cycle. MM1.5 models range from 1B to 30B parameters, including intensive and mixed expert (MoE) variants, and provide detailed training process and decision-making insights through extensive empirical studies and ablation studies, providing valuable guidance for future MLLM development research.
Lumigator is a product developed by Mozilla.ai to help developers choose from a number of large language models (LLMs) that are best suited for their specific projects. It evaluates models by providing a task-specific metric framework to ensure that the selected model meets project needs. Lumigator's vision is to be an open source platform that promotes ethical and transparent AI development and fills gaps in the industry tool chain.
Tilores Identity RAG is a platform that provides customer data search, unification and retrieval services for Large Language Models (LLMs). It handles misspellings and inaccurate information through real-time fuzzy search technology to deliver accurate, relevant and unified responses to customer data. The platform solves the challenges faced by large language models in retrieving structured customer data, such as disparate data sources, difficulty finding customer data when search terms do not exactly match, and the complexity of unifying customer records. It allows for rapid retrieval of structured customer data, building dynamic customer profiles, and providing real-time unified and accurate customer data at query time.
NVLM 1.0 is a cutting-edge multi-modal large-scale language model series launched by NVIDIA ADLR. It has reached the industry-leading level in visual-language tasks and is comparable to top proprietary models and open access models. The model improved in accuracy even on plain text tasks after multi-modal training. NVLM 1.0’s open source model weights and Megatron-Core training code provide a valuable resource to the community.
NVLM-D-72B is a multi-modal large-scale language model launched by NVIDIA. It focuses on visual-language tasks and improves text performance through multi-modal training. The model achieves results comparable to industry-leading models on visual-language benchmarks.
Diabetica-7B is a large language model optimized for the diabetes care domain. It excels at a variety of diabetes-related tasks, including diagnosis, treatment recommendations, medication management, lifestyle recommendations, patient education, and more. The model is fine-tuned based on open source models, using disease-specific data sets and fine-tuning techniques, providing a reproducible framework that can accelerate the development of AI-assisted medicine. In addition, it has undergone comprehensive evaluation and clinical trials to verify its effectiveness in clinical applications.
Diabetica-1.5B is a large-scale language model specially customized for the field of diabetes care. It performs well in multiple diabetes-related tasks such as diagnosis, treatment recommendations, medication management, lifestyle recommendations, and patient education. The model is developed based on open source models and fine-tuned using disease-specific data sets, providing a reproducible framework that can accelerate the development of AI-assisted medicine.
Llama-3.2-11B-Vision is a multi-modal large language model (LLMs) released by Meta that combines the capabilities of image and text processing and aims to improve the performance of visual recognition, image reasoning, image description and answering general questions about images. The model outperforms numerous open source and closed multi-modal models on common industry benchmarks.
Llama 3.2 is a multi-lingual large-scale language model (LLMs) launched by Meta Company, including 1B and 3B scale pre-training and instruction tuning generation models. These models are optimized in multiple language conversation use cases, including agent retrieval and summarization tasks. Llama 3.2 outperforms many existing open source and closed chat models on many industry benchmarks.
Mishi AI Community is a knowledge community focusing on the fields of artificial intelligence and product management, providing relevant knowledge systems and R&D product use cases for AI product management. Community members have the opportunity to become 'super individuals and one-person companies'. You can contact the manager via email or social media to join the AI PM community.
NVLM 1.0 is a series of cutting-edge multi-modal large language models (LLMs) that achieve advanced results on visual-linguistic tasks that are comparable to leading proprietary models and open-access models. It is worth noting that NVLM 1.0’s text performance even surpasses its LLM backbone model after multi-modal training. We open sourced the model weights and code for the community.
OneGen is an efficient single-pass generation and retrieval framework designed for large language models (LLMs) to fine-tune generation, retrieval, or hybrid tasks. Its core idea is to integrate generation and retrieval tasks into the same context, enabling LLM to perform both tasks in a single forward pass by assigning retrieval tasks to retrieval tokens generated in an autoregressive manner. Not only does this approach reduce deployment costs, it also significantly reduces inference costs because it avoids the need for two forward-pass computations on the query.
Open Source LLM Tools is a platform focused on collecting and displaying open source large language model (LLM) tools. It provides a frequently updated resource library to help developers and researchers discover and utilize the latest open source AI tools. The main advantages of this platform are its high update frequency and focus on active open source AI developers, allowing users to obtain the latest industry trends and technological progress in a timely manner.
XVERSE-MoE-A36B is a multi-language large-scale language model independently developed by Shenzhen Yuanxiang Technology. It adopts a hybrid expert model (MoE) architecture and has a total parameter scale of 255.4 billion and an activation parameter amount of 36 billion. This model supports more than 40 languages including Chinese, English, Russian, Spanish, etc., and performs particularly well in Chinese and English bilinguals. The model uses 8K-length training samples, and through refined data sampling ratios and dynamic data switching strategies, the high quality and diversity of the model are ensured. In addition, the model has been customized and optimized for the MoE architecture, improving computing efficiency and overall throughput.
the Shire is an AI programming agent language designed to enable communication between large language models (LLM) and integrated development environments (IDEs) to support automated programming. It originated from the AutoDev project and aims to provide developers with an AI-driven IDE, including DevIns, the predecessor of Shire. Shire enables users to build an AI-driven development environment that meets their personal needs by providing customized AI agents.
PromptChainer is a tool designed to improve the output quality of large language models. By automating the generation of prompt chains, it helps users break down complex tasks into manageable small steps, thereby obtaining more accurate and high-quality results. It is particularly suitable for tasks that require multiple steps and/or a lot of context and knowledge.
LongCite is an open source model that trains large language models (LLMs) to generate accurate answers and precise sentence-level quotations in long text question and answer scenarios. The importance of this technology lies in its ability to improve the accuracy and credibility of question answering systems, allowing users to verify the source of the output information. LongCite supports context lengths up to 128K and provides two models: LongCite-glm4-9b and LongCite-llama3.1-8b, which are trained based on GLM-4-9B and Meta-Llama-3.1-8B respectively.
LongLLaVA is a multi-modal large-scale language model that efficiently scales to 1000 images through a hybrid architecture, aiming to improve image processing and understanding capabilities. Through innovative architectural design, this model achieves effective learning and reasoning on large-scale image data, which is of great significance to fields such as image recognition, classification and analysis.
iText2KG is a Python package designed to leverage large language models to extract entities and relationships from text documents and incrementally build consistent knowledge graphs. It has zero-shot capabilities, allowing knowledge extraction across different domains without specific training. The package includes document distillation, entity extraction, and relationship extraction modules, ensuring entities and relationships are resolved and unique. It provides visual representation of knowledge graphs through Neo4j, supporting interactive exploration and analysis of structured data.
Reflection Llama-3.1 70B is currently the world's top open source large language model (LLM). It is trained using a new technology called Reflection-Tuning, which enables the model to detect errors in its reasoning and make corrections. The model was trained on synthetic data, generated by Glaive. Glaive is an excellent tool for users who are training models. The model uses the standard Llama 3.1 chat format, with special tags to differentiate between the model's internal thinking and the final answer, improving the user experience.
OLMoE-1B-7B is an expert hybrid large language model (LLM) with 100 million active parameters and 700 million total parameters, released in September 2024. This model performs well among models of similar cost, competing with larger models such as the Llama2-13B. OLMoE is completely open source and supports a variety of functions, including text generation, model training and deployment, etc.
ChatMLX is a modern, open source, high-performance MacOS chat application built on large-scale language models. It leverages the powerful performance of MLX and Apple silicon to support multiple models and provide users with rich conversation options. ChatMLX runs large language models natively to ensure user privacy and security.
C4AI Command R 08-2024 is a 3.5 billion parameter large-scale language model developed by Cohere and Cohere For AI, optimized for a variety of use cases such as reasoning, summarization, and question answering. The model supports training in 23 languages and is evaluated in 10 languages, with high-performance RAG (Retrieval Augmentation Generation) capabilities. It is trained through supervised fine-tuning and preference to match human preferences for usefulness and safety. Additionally, the model has conversational tooling capabilities, enabling tool-based responses to be generated through specific prompt templates.
EAGLE is a vision-centered, high-resolution multimodal large language model (LLM) family that enhances the perceptual capabilities of multimodal LLMs by mixing visual encoders and different input resolutions. The model contains channel connection based 'CLIP+X' fusion, suitable for vision experts with different architectures (ViT/ConvNets) and knowledge (detection/segmentation/OCR/SSL). The EAGLE model family supports input resolutions over 1K and achieves excellent results on multi-modal LLM benchmarks, especially on resolution-sensitive tasks such as optical character recognition and document understanding.
SlowFast-LLaVA is a training-free multi-modal large-scale language model designed for video understanding and reasoning. It achieves performance comparable to or better than state-of-the-art video large language models on a variety of video question answering tasks and benchmarks without any fine-tuning on any data.
mPLUG-Owl3 is a multi-modal large-scale language model focused on the understanding of long image sequences. It can learn knowledge from the retrieval system, engage in alternating text and picture conversations with users, and watch long videos to remember their details. The source code and weights of the model have been released on HuggingFace, and are suitable for scenarios such as visual question answering, multi-modal benchmarking, and video benchmarking.
Seed-ASR is a speech recognition model based on Large Language Model (LLM) developed by ByteDance. It leverages the power of LLM by feeding continuous speech representations and contextual information into LLM, guided by large-scale training and context-aware capabilities, to significantly improve performance on a comprehensive evaluation set that includes multiple domains, accents/dialects, and languages. Compared with recently released large-scale ASR models, Seed-ASR achieves a 10%-40% word error rate reduction on Chinese and English public test sets, further demonstrating its powerful performance.
Parsera is a lightweight Python library specifically designed to be combined with large language models (LLMs) to simplify the process of website data scraping. It makes data scraping more efficient and cost-effective by using minimal tokens to increase speed and reduce costs. Parsera supports multiple chat models and can be customized to use different models, such as OpenAI or Azure.
ShieldGemma is a series of safe content moderation models developed by Google and built on Gemma 2, focusing on four harm categories (inappropriate content, dangerous content, hate and harassment). They are text-to-text decoder-only large language models, English only, with open weights, including models with 2B, 9B, and 27B parameter sizes. These models are designed to improve the safety of AI applications as part of a responsible generative AI toolkit.
nanoPerplexityAI is an open source implementation of a Large Language Model (LLM) service, citing information from Google. No complex GUI or LLM agent, just 100 lines of Python code.
CLASI is a high-quality, human-like simultaneous interpretation system developed by ByteDance’s research team. It balances translation quality and latency with a novel data-driven reading and writing strategy, employs multi-modal retrieval modules to enhance translation of domain-specific terms, and leverages large language models (LLMs) to generate fault-tolerant translations that take into account input audio, historical context, and retrieval information. In real-world scenarios, CLASI achieved a valid information ratio (VIP) of 81.3% and 78.0% in the Chinese-English and English-Chinese translation directions respectively, far exceeding other systems.