Found 64 related AI tools
GPT OSS is an open source language model released by OpenAI under the Apache 2.0 license, with strong reasoning capabilities. It emphasizes efficiency, safety, and API compatibility, positioning it as a forerunner for future open source language models.
CameraBench is a model for analyzing camera motion in video, aiming to understand camera movement patterns from footage. Its main advantage lies in using generative vision-language models for principled classification of camera motion and for video-text retrieval. Compared with traditional structure-from-motion (SfM) and simultaneous localization and mapping (SLAM) methods, the model shows significant advantages in capturing scene semantics. The model is open source and suitable for researchers and developers, and improved versions will be released in the future.
HiDream-I1 is a new open source image generation base model with 17 billion parameters that can generate high-quality images in seconds. The model is suitable for research and development, has performed well in multiple evaluations, and is efficient and flexible, making it a good fit for a variety of creative design and generation tasks.
Together Chat is a secure AI chat platform that offers 100 free messages per day for users who want private conversations and high-quality interactions. Its servers are located in North America to help keep user information secure.
Wan 2.1 AI is an open source large-scale video generation AI model developed by Alibaba. It supports text-to-video (T2V) and image-to-video (I2V) generation and can turn simple inputs into high-quality video content. The model is significant for the video generation field: it greatly simplifies the video creation process, lowers the barrier to entry, improves efficiency, and gives users rich and diverse creative possibilities. Its main strengths include high-quality video output, smooth rendering of complex motion, realistic physical simulation, and a wide range of artistic styles. The product is fully open source and its basic functions are free to use, which makes it highly practical for individuals and businesses that need video content but lack professional skills or equipment.
CSM 1B is a speech generation model based on the Llama architecture that generates RVQ audio codes from text and audio input. It targets speech synthesis and offers high-quality speech generation. Its strength lies in handling multi-speaker dialogue scenarios and producing natural, fluent speech from contextual information. The model is open source and intended for research and educational purposes; use for impersonation, fraud, or illegal activities is expressly prohibited.
Gemma 3 is the latest open source model from Google, built on the same research and technology as Gemini 2.0. It is a lightweight, high-performance model that can run on a single GPU or TPU, giving developers powerful AI capabilities. Gemma 3 is available in multiple sizes (1B, 4B, 12B and 27B), supports over 140 languages, and features advanced text and visual reasoning. Its key benefits are high performance, low compute requirements, and broad multi-language support, enabling rapid deployment of AI applications on a variety of devices. Gemma 3 aims to make AI technology more accessible and help developers build efficiently on different hardware platforms.
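Since Gemma 3 is designed to run on a single GPU, a minimal local-inference sketch with the Hugging Face transformers library may help illustrate the workflow. The checkpoint name `google/gemma-3-1b-it`, the need for a recent transformers release, and the gated-access terms are assumptions to verify against the official model card.

```python
# Hedged sketch: local inference with a small Gemma 3 checkpoint via transformers.
# The repo ID below is assumed; accepting the license on Hugging Face may be required.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="google/gemma-3-1b-it",  # assumed instruction-tuned 1B text checkpoint
    torch_dtype="auto",
    device_map="auto",             # places the model on the available GPU
)

messages = [{"role": "user", "content": "Summarize what Gemma 3 is in one sentence."}]
result = generator(messages, max_new_tokens=64)
# With chat-format input, generated_text holds the conversation; the last turn is the reply.
print(result[0]["generated_text"][-1]["content"])
```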
HunyuanVideo-I2V is Tencent's open source image-to-video generation model, developed based on the HunyuanVideo architecture. This model effectively integrates reference image information into the video generation process through image latent stitching technology, supports high-resolution video generation, and provides customizable LoRA effect training functions. This technology is of great significance in the field of video creation, as it can help creators quickly generate high-quality video content and improve creation efficiency.
Wan2.1-T2V-14B is an advanced text-to-video generation model based on a diffusion transformer architecture that combines an innovative spatiotemporal variational autoencoder (VAE) with large-scale data training. It is capable of generating high-quality video content at multiple resolutions, supports Chinese and English text input, and surpasses existing open source and commercial models in performance and efficiency. This model is suitable for scenarios that require efficient video generation, such as content creation, advertising production, and video editing. The model is currently available for free on the Hugging Face platform and is designed to promote the development and application of video generation technology.
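For readers who want to try the model, the following is a hedged sketch of text-to-video generation with the diffusers library. The repo ID `Wan-AI/Wan2.1-T2V-14B-Diffusers`, the availability of diffusers-format weights, the default frame count, and the `.frames` output attribute are assumptions based on other diffusers video pipelines; consult the official model card for the supported workflow.

```python
# Hedged sketch: text-to-video with diffusers; repo ID and output format are assumptions.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",  # assumed diffusers-format repository
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

result = pipe(
    prompt="A red fox running through fresh snow, cinematic lighting",
    num_frames=81,  # assumed default, roughly 5 seconds at 16 fps
)
export_to_video(result.frames[0], "fox.mp4", fps=16)
```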
PIKE-RAG is a domain knowledge and reasoning enhanced generation framework developed by Microsoft, designed to extend the capabilities of large language models (LLMs) through knowledge extraction, storage, and reasoning logic. Thanks to its multi-module design, it can handle complex multi-hop question-answering tasks and significantly improves answer accuracy in fields such as industrial manufacturing, mining, and pharmaceuticals. Its main advantages include efficient knowledge extraction, strong multi-source information integration, and multi-step reasoning, making it well suited to scenarios that require deep domain knowledge and complex logical reasoning.
SkyReels V1 is a human-centered video generation model fine-tuned based on HunyuanVideo. It is trained through high-quality film and television clips to generate video content with movie-like quality. This model has reached the industry-leading level in the open source field, especially in facial expression capture and scene understanding. Its key benefits include open source leadership, advanced facial animation technology and cinematic light and shadow aesthetics. This model is suitable for scenarios that require high-quality video generation, such as film and television production, advertising creation, etc., and has broad application prospects.
SkyReels-V1 is an open source human-centered video basic model, fine-tuned based on high-quality film and television clips, focusing on generating high-quality video content. This model has reached the top level in the open source field and is comparable to commercial models. Its main advantages include: high-quality facial expression capture, cinematic light and shadow effects, and the efficient inference framework SkyReelsInfer, which supports multi-GPU parallel processing. This model is suitable for scenarios that require high-quality video generation, such as film and television production, advertising creation, etc.
DeepScaleR-1.5B-Preview is a large language model optimized by reinforcement learning, focusing on improving mathematical problem solving capabilities. This model significantly improves the accuracy in long text reasoning scenarios through distributed reinforcement learning algorithms. Its main advantages include efficient training strategies, significant performance improvements, and the flexibility of open source. The model was developed by UC Berkeley’s Sky Computing Lab and Berkeley AI Research teams to advance the use of artificial intelligence in education, particularly in mathematics education and competitive mathematics. The model is licensed under the MIT open source license and is completely free for researchers and developers to use.
Lumina-Video is a video generation model developed by the Alpha-VLLM team, mainly used to generate high-quality video content from text. This model is based on deep learning technology and can generate corresponding videos based on text prompts input by users, which is efficient and flexible. It is of great significance in the field of video generation, providing content creators with powerful tools to quickly generate video materials. The project is currently open source, supports video generation at multiple resolutions and frame rates, and provides detailed installation and usage guides.
Zonos-v0.1 is a real-time text-to-speech (TTS) model developed by the Zyphra team with high-fidelity voice cloning capabilities. The release comprises a 1.6B-parameter pure transformer model and a 1.6B-parameter hybrid model, both under the Apache 2.0 open source license. It generates natural, expressive speech from text prompts and supports multiple languages. In addition, Zonos-v0.1 enables high-quality voice cloning from speech clips of 5 to 30 seconds and can be conditioned on speaking speed, pitch, voice quality, and emotion. Its main advantages are high generation quality, support for real-time interaction, and flexible voice control. The model is released to promote research and development of TTS technology.
Hibiki is an advanced model focused on streaming speech translation. It produces correct translations chunk by chunk as it accumulates enough context in real time, supports both speech and text translation, and can transfer the speaker's voice. Built on a multi-stream architecture, it processes source and target speech simultaneously, generating a continuous audio stream together with timestamped text translation. Its key benefits include high-fidelity voice transfer, low-latency real-time translation, and compatibility with complex inference strategies. Hibiki currently supports French-to-English translation, which suits scenarios that require efficient real-time translation, such as international conferences and multi-language live broadcasts. The model is open source and free, suitable for developers and researchers.
Qwen2.5-1M is an open source language model designed for long-sequence tasks, supporting a context length of up to 1 million tokens. Through innovative training methods and technical optimizations, it significantly improves the performance and efficiency of long-sequence processing. It performs well on long-context tasks while maintaining performance on short-text tasks, making it an excellent open source alternative to existing long-context models. The model suits scenarios that require processing large amounts of text, such as document analysis and information retrieval, and gives developers powerful language processing capabilities.
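A minimal sketch of standard transformers inference with a Qwen2.5-1M checkpoint follows. The repo ID `Qwen/Qwen2.5-7B-Instruct-1M` and the placeholder input file are assumptions; reaching the full 1M-token context in practice typically requires the optimized serving stack described in the model card rather than plain `generate`.

```python
# Hedged sketch: feeding a long document to an assumed Qwen2.5-1M checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct-1M"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

long_document = open("report.txt").read()  # placeholder long input
messages = [{"role": "user", "content": f"Summarize the key findings:\n\n{long_document}"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```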
BEN2 (Background Erase Network) is an innovative image segmentation model that uses a Confidence Guided Matting (CGM) process: a refinement network specifically handles pixels where the model's confidence is lower, producing more accurate matting results. BEN2 performs well in hair matting, 4K image processing, object segmentation, and edge refinement. Its base model is open source, and users can try the full model for free via API or web demo. The training data includes the DIS5k dataset and a 22K proprietary segmentation dataset, covering a wide range of image processing needs.
YuE is an open source music generation model developed by the Hong Kong University of Science and Technology and the Multimodal Art Projection team. It can generate a complete song of up to 5 minutes, including vocals and backing parts, from given lyrics. The model tackles the difficult lyrics-to-song problem through several technical innovations, such as a semantically enhanced audio tokenizer, a dual-token technique, and lyrics chain-of-thought. Its main advantages are that it generates high-quality music, supports multiple languages and musical styles, and remains highly scalable and controllable. The model is free and open source and aims to advance music generation technology.
Llasa-1B is a text-to-speech model developed by the Hong Kong University of Science and Technology Audio Lab. It is based on the LLaMA architecture and converts text into natural, fluent speech using speech tokens from the XCodec2 codebook. The model was trained on 250,000 hours of Chinese and English speech data and supports generation from plain text or synthesis conditioned on a given speech prompt. Its main advantage is high-quality multi-language speech, suitable for scenarios such as audiobooks and voice assistants. The model is licensed under CC BY-NC-ND 4.0, and commercial use is prohibited.
Llasa-3B is a powerful text-to-speech (TTS) model developed based on the LLaMA architecture and focuses on Chinese and English speech synthesis. By combining the speech coding technology of XCodec2, this model can efficiently convert text into natural and smooth speech. Its main advantages include high-quality speech output, support for multi-language synthesis, and flexible voice prompt functions. This model is suitable for a variety of scenarios that require speech synthesis, such as audiobook production, voice assistant development, etc. Its open source nature also allows developers to freely explore and extend its functionality.
MiniRAG is a retrieval-augmented generation (RAG) system designed for small language models, aiming to simplify the RAG pipeline and improve efficiency. It addresses the limited performance of small models in traditional RAG frameworks through a semantic-aware heterogeneous graph indexing mechanism and a lightweight topology-enhanced retrieval method. The system has clear advantages in resource-constrained scenarios such as mobile devices or edge computing environments, and its open source nature makes it easy for the developer community to adopt and improve.
MatterGen is a generative AI tool launched by Microsoft Research for material design. It can directly generate new materials with specific chemical, mechanical, electronic or magnetic properties according to the design requirements of the application, providing a new paradigm for materials exploration. The emergence of this tool is expected to accelerate the research and development process of new materials, reduce research and development costs, and play an important role in batteries, solar cells, CO2 adsorbents and other fields. Currently, MatterGen’s source code is open source on GitHub for public use and further development.
Kokoro-82M is a text-to-speech (TTS) model created by hexgrad and hosted on Hugging Face. It has 82 million parameters and is open source under the Apache 2.0 license. Version v0.19 was released on December 25, 2024 and provides 10 unique voice packs. Kokoro-82M ranked first in the TTS Spaces Arena, demonstrating its efficiency relative to its parameter count and training data. It supports American and British English and can generate high-quality speech output.
Llama-3-Patronus-Lynx-8B-Instruct is a fine-tuned version of meta-llama/Meta-Llama-3-8B-Instruct developed by Patronus AI, mainly used to detect hallucinations in RAG settings. The model is trained on multiple datasets, including CovidQA, PubmedQA, DROP, and RAGTruth, combining human annotations and synthetic data. Given a document, question, and answer, it evaluates whether the answer is faithful to the document content, introduces no information beyond the document, and does not contradict it.
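The sketch below shows one hedged way to use the model as a faithfulness judge with transformers. The repo ID is assumed from the entry name, and the prompt is a simplified placeholder; the model card defines the exact prompt template and expected output format (for example a PASS/FAIL score), which should be used instead in practice.

```python
# Hedged sketch: prompting the Lynx judge; repo ID and prompt format are assumptions.
from transformers import pipeline

judge = pipeline(
    "text-generation",
    model="PatronusAI/Llama-3-Patronus-Lynx-8B-Instruct",  # assumed repo ID
    torch_dtype="auto",
    device_map="auto",
)

document = "The Eiffel Tower is 330 metres tall and located in Paris."
question = "How tall is the Eiffel Tower?"
answer = "It is 500 metres tall."  # deliberately contradicts the document

prompt = (
    "Given the DOCUMENT, QUESTION and ANSWER, decide whether the ANSWER is "
    "faithful to the DOCUMENT and explain why.\n"
    f"DOCUMENT: {document}\nQUESTION: {question}\nANSWER: {answer}"
)
result = judge([{"role": "user", "content": prompt}], max_new_tokens=128)
# With chat-format input, the last message in generated_text is the model's verdict.
print(result[0]["generated_text"][-1]["content"])
```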
Meta Video Seal is an advanced open source video watermarking model that embeds invisible watermarks which persist through video editing. As AI-generated content grows, verifying the origin of a video becomes critical. Because the embedded watermark survives editing, Video Seal is valuable for copyright protection and content verification.
OLMo-2-1124-13B-Instruct is a large-scale language model developed by the Allen Institute for AI, focusing on text generation and dialogue tasks. The model performs well across many tasks, including mathematical and scientific problem solving. It is a 13B-parameter version trained with supervised fine-tuning and reinforcement learning on specific datasets to improve performance and safety. As an open source model, it allows researchers and developers to explore and advance the science of language models.
OLMo-2-1124-7B-Instruct is a large-scale language model developed by the Allen Institute for Artificial Intelligence, focusing on dialogue generation tasks. The model is optimized on a variety of tasks, including mathematical problem solving, GSM8K, IFEval, etc., and is supervised fine-tuned on the Tülu 3 dataset. It is built on top of the Transformers library and can be used for research and educational purposes. The main advantages of this model include high performance, multi-task adaptability and open source, making it an important tool in the field of natural language processing.
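Because the model is distributed in Transformers format, a short hedged sketch of chat-style inference may be useful. The repo ID `allenai/OLMo-2-1124-7B-Instruct` matches the entry name but should be verified against the Allen AI model card, and a recent transformers release with OLMo 2 support is assumed.

```python
# Hedged sketch: chat inference with the OLMo 2 7B Instruct model via transformers.
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="allenai/OLMo-2-1124-7B-Instruct",  # assumed repo ID, verify on the model card
    torch_dtype="auto",
    device_map="auto",
)

out = chat(
    [{"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"}],
    max_new_tokens=128,
)
# With chat-format input, the final message in generated_text is the assistant's answer.
print(out[0]["generated_text"][-1]["content"])
```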
Allegro-TI2V is a text-and-image-to-video generation model capable of generating video content from user-provided prompts and images. The model has attracted attention for its open source nature, diverse content creation capabilities, high-quality output, small and efficient parameter count, and support for multiple precisions and GPU memory optimizations. It represents cutting-edge progress in AI video generation and has significant technical value and commercial potential. Allegro-TI2V is provided on the Hugging Face platform under the Apache 2.0 open source license, and users can download and use it for free.
Llama-3.1-Tulu-3-70B-DPO is part of the Tülu3 family of models designed to provide a comprehensive guide to modern post-training techniques. This family of models aims to achieve state-of-the-art performance on a variety of tasks beyond chatting, such as MATH, GSM8K and IFEval. It is based on models trained on publicly available, synthetic and human-created datasets, is primarily in English, and is licensed under the Llama 3.1 Community License.
Llama-3.1-Tulu-3-70B is a member of the Tülu3 family of models designed to provide a comprehensive guide to modern post-training techniques. The model not only performs well on chat tasks, but also shows excellent performance on multiple tasks such as MATH, GSM8K and IFEval. As an open source model, it allows researchers and developers to access and use its data and code to advance natural language processing technology.
Qwen2.5-Coder is the latest series of Qwen large-scale language models, focusing on code generation, code reasoning, and code repair. Built on the powerful Qwen2.5, the series was trained on 5.5 trillion tokens including source code, text-code grounding data, and synthetic data. It is currently a leader among open source code language models, with coding capabilities comparable to GPT-4o. In addition, Qwen2.5-Coder provides a more comprehensive foundation for real-world applications such as code agents, enhancing coding capabilities while maintaining its advantages in mathematics and general abilities.
Qwen2.5-Coder is the latest series of Qwen large-scale language models, designed for code generation, reasoning, and repair. Built on the powerful Qwen2.5, the series was trained on 5.5 trillion tokens including source code, text-code grounding data, and synthetic data, bringing its coding capabilities to the state of the art among open source code LLMs. It not only strengthens coding skills but also maintains advantages in math and general abilities.
Qwen2.5-Coder-3B-Instruct-GPTQ-Int8 is a large language model in the Qwen2.5-Coder series, specially optimized for code generation, code reasoning, and code repair. The model is based on Qwen2.5 and was trained on 5.5 trillion tokens of source code, text-code grounding data, synthetic data, and more. The flagship Qwen2.5-Coder-32B is currently the most advanced open source code language model, with coding capabilities matching GPT-4o. The series also provides a more comprehensive foundation for real-world applications such as code agents, enhancing coding capabilities while maintaining advantages in mathematical and general abilities.
Qwen2.5-Coder is the latest series of Qwen large-scale language models, focusing on code generation, code reasoning, and code repair. Built on the powerful Qwen2.5, training was scaled to 5.5 trillion tokens, including source code, text-code grounding data, synthetic data, and more. Qwen2.5-Coder-32B is currently the most advanced open source code language model, with coding capabilities matching GPT-4o. The series provides a more comprehensive foundation for practical applications such as code agents, enhancing coding capabilities while maintaining advantages in mathematics and general abilities.
Qwen2.5-Coder-32B-Instruct-GPTQ-Int8 is a large language model in the Qwen series optimized for code generation. It has 32 billion parameters, supports long text processing, and is one of the most advanced models in open source code generation. The model has been further trained and optimized on top of Qwen2.5, with significant improvements in code generation, reasoning, and repair, while maintaining advantages in mathematics and general capabilities. It uses GPTQ 8-bit quantization to reduce model size and improve inference efficiency.
Qwen2.5-Coder-1.5B is a large language model in the Qwen2.5-Coder series, focusing on code generation, code reasoning, and code repair. Built on the powerful Qwen2.5 and trained on 5.5 trillion tokens including source code, text-code grounding data, and synthetic data, the series is a leader among current open source code LLMs, with coding capabilities comparable to GPT-4o. In addition, Qwen2.5-Coder-1.5B strengthens mathematical and general capabilities, providing a more comprehensive foundation for practical applications such as code agents.
Qwen2.5-Coder is the latest series of Qwen large-scale language models, focusing on code generation, code reasoning, and code repair. Built on the strengths of Qwen2.5, the series was trained on 5.5 trillion tokens including source code, text-code grounding data, and synthetic data. It is currently a leader among open source code generation language models, with coding capabilities comparable to GPT-4o. It not only enhances coding capabilities but also maintains its advantages in mathematics and general abilities, providing a more comprehensive foundation for practical applications such as code agents.
Qwen2.5-Coder is the latest series of Qwen large-scale language models, focusing on code generation, code reasoning, and code repair. Built on the powerful Qwen2.5, the series significantly improves code generation, reasoning, and repair by scaling training to 5.5 trillion tokens, including source code, text-code grounding data, and synthetic data. Qwen2.5-Coder-3B is a model in the series with 3.09B parameters, 36 layers, 16 query attention heads and 2 key-value attention heads (grouped-query attention), and a full 32,768-token context length. The series leads current open source code LLMs, with coding capabilities matching GPT-4o, giving developers a powerful code assistance tool.
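As a concrete illustration, here is a hedged sketch of code completion with the instruct variant of the 3B model via transformers and its chat template. The repo ID `Qwen/Qwen2.5-Coder-3B-Instruct` is assumed from the entry name and should be checked against the official collection.

```python
# Hedged sketch: asking the Qwen2.5-Coder 3B instruct model to write a function.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-3B-Instruct"  # assumed repo ID for the 3B chat model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user",
     "content": "Write a Python function that checks whether a string is a palindrome."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```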
CogVideoX1.5-5B-SAT is an open source video generation model developed by the Knowledge Engineering and Data Mining team at Tsinghua University, and an upgraded version of the CogVideoX model. It supports generating 10-second videos as well as higher-resolution output. The model includes Transformer, VAE, and text encoder modules and can generate video content from text descriptions. With its strong generation capabilities and high-resolution support, CogVideoX1.5-5B-SAT offers a powerful tool for video content creators, particularly in education, entertainment, and business.
Mochi is Genmo's latest open source video generation model, optimized to run in ComfyUI even on consumer-grade GPUs. Known for its high-fidelity motion and excellent prompt following, Mochi brings state-of-the-art video generation to the ComfyUI community. The Mochi weights are released under the Apache 2.0 license, so developers and creators are free to use, modify, and integrate Mochi without a restrictive license. Mochi runs on consumer-grade GPUs such as the RTX 4090 and supports multiple attention backends in ComfyUI, allowing it to fit in less than 24GB of VRAM.
Tencent Hunyuan 3D is an open source 3D generative model that aims to address the shortcomings of existing 3D generative models in generation speed and generalization. The model uses a two-stage approach: the first stage quickly generates multi-view images with a multi-view diffusion model, and the second stage quickly reconstructs the 3D asset with a feed-forward reconstruction model. Hunyuan 3D-1.0 helps 3D creators and artists produce 3D assets automatically, supports rapid single-image 3D generation, and completes end-to-end generation, including mesh and texture extraction, within about 10 seconds.
hertz-dev is Standard Intelligence's open source full-duplex, audio-only transformer base model with 8.5 billion parameters. The model demonstrates scalable cross-modal learning, converting mono 16kHz speech into an 8Hz latent representation at a bitrate of 1kbps and outperforming other audio encoders. Its main advantages are low latency, high efficiency, and ease of fine-tuning and building upon for researchers. Standard Intelligence is committed to building general intelligence that benefits all of humanity, and hertz-dev is the first step on that journey.
Mochi 1 is a research preview of an open source video generation model from Genmo, aimed at solving fundamental problems in today's AI video field. The model is known for its unmatched motion quality, superior prompt-following, and ability to cross the uncanny valley, generating coherent, fluid human movements and expressions. Mochi 1 was developed in response to the need for high-quality video content generation, particularly in the gaming, film, and entertainment industries. The product currently offers a free trial; specific pricing is not provided on the page.
Allegro is an advanced text-to-video model developed by Rhymes AI that converts simple text prompts into high-quality short video clips. Allegro's open source nature makes it a powerful tool for creators, developers, and researchers in the field of AI video generation. The main advantages of Allegro include open source, diverse content creation, high-quality output, and small and efficient model size. It supports multiple precisions (FP32, BF16, FP16), and in BF16 mode, the GPU memory usage is 9.3 GB and the context length is 79.2k, which is equivalent to 88 frames. Allegro's technology core includes large-scale video data processing, video compression into visual tokens, and extended video diffusion transformers.
Janus is an innovative autoregressive framework that addresses the limitations of previous approaches by separating visual encoding into distinct paths while utilizing a single, unified transformer architecture for processing. This decoupling not only alleviates the conflicting roles of the visual encoder in understanding and generation, but also enhances the flexibility of the framework. Janus' performance surpasses previous unified models and meets or exceeds the performance of task-specific models. Janus' simplicity, high flexibility, and effectiveness make it a strong candidate for the next generation of unified multimodal models.
LightRAG is a retrieval-enhanced generation model that aims to improve the performance of text generation tasks by combining the advantages of retrieval and generation. This model can provide more accurate and relevant information while maintaining generation speed, which is particularly important for application scenarios that require fast and accurate information retrieval. The development background of LightRAG is based on the need to improve existing text generation models, especially when large amounts of data and complex queries need to be processed. The model is currently open source and freely available, providing researchers and developers with a powerful tool to explore and implement retrieval-based text generation tasks.
This open source text-to-image generation model, developed by a Tsinghua University team, offers high-resolution output and has broad application prospects in the field of image generation.
Aria is a multimodal-native mixture-of-experts model with strong performance on multimodal, language, and coding tasks. It excels at video and document understanding, supports multimodal input of up to 64K tokens, and can describe a 256-frame video in 10 seconds. Aria has 25.3B parameters and can be loaded in bfloat16 precision on a single A100 (80GB) GPU. It was developed to meet the need for multimodal data understanding, particularly in video and document processing, and is an open source model intended to advance multimodal artificial intelligence.
CursorCore is a series of open source models designed to assist programming through programming instruction alignment, supporting features such as automated editing and inline chat. These features mimic the core capabilities of closed-source AI-assisted programming tools like Cursor. This project promotes the application of AI in the field of programming through the power of the open source community, allowing developers to write and edit code more efficiently. The project is currently in its early stages, but has already demonstrated its potential to improve programming efficiency and assist with code generation.
The Qwen2.5 series language models are a series of open source decoder-only dense models, with parameter sizes ranging from 0.5B to 72B, designed to meet the needs of different products for model size. These models perform well in many fields such as natural language understanding, code generation, and mathematical reasoning, and are particularly suitable for application scenarios that require high-performance language processing capabilities. The release of Qwen2.5 series models marks an important progress in the field of large-scale language models, providing developers and researchers with powerful tools.
Qwen2.5-Coder is a member of the Qwen2.5 open source family, focusing on code generation, reasoning, repair and other tasks. It improves coding capabilities by amplifying large-scale coding training data while maintaining mathematical and general capabilities. The model supports 92 programming languages and achieves significant improvements in code-related tasks. Qwen2.5-Coder adopts the Apache 2.0 license and is designed to accelerate the application of code intelligence.
Qwen2.5 is a series of new language models built on the Qwen2 language model, including the general language model Qwen2.5, as well as Qwen2.5-Coder specifically for programming and Qwen2.5-Math for mathematics. These models are pre-trained on large-scale data sets, have strong knowledge understanding capabilities and multi-language support, and are suitable for various complex natural language processing tasks. Their main advantages include higher knowledge density, enhanced programming and mathematical capabilities, and better understanding of long text and structured data. The release of Qwen 2.5 is a major step forward for the open source community, providing developers and researchers with powerful tools to promote research and development in the field of artificial intelligence.
g1 is an experimental project that aims to create o1-like reasoning chains on Groq hardware using the Llama-3.1 70b model. The project demonstrates that prompting techniques alone, without any additional training, can significantly improve the performance of existing open source models on logic problems. By making the reasoning steps visible, g1 helps the model reason more accurately about logical problems, which matters for improving the logical reasoning ability of AI systems.
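To make the idea concrete, here is a hedged sketch of the underlying technique: a system prompt that forces explicit step-by-step reasoning, sent to a Llama 3.1 70B model served through the Groq API. The model identifier and the prompt wording are assumptions and differ from g1's actual multi-turn JSON prompt; see the project repository for the real implementation.

```python
# Hedged sketch of the g1 idea: step-by-step prompting against a Groq-hosted Llama model.
from groq import Groq  # pip install groq; expects GROQ_API_KEY in the environment

client = Groq()
system_prompt = (
    "You are a careful reasoner. Solve the problem in explicit numbered steps, "
    "checking each step before moving on, then state the final answer."
)
response = client.chat.completions.create(
    model="llama-3.1-70b-versatile",  # assumed Groq model identifier
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "How many letters 'r' are in the word strawberry?"},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```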
AuraFlow v0.3 is a completely open source flow-based text-to-image generation model. Compared with the previous AuraFlow-v0.2, the model was trained with more compute and fine-tuned on aesthetic datasets, and it supports various aspect ratios with width and height up to 1536 pixels. The model achieves state-of-the-art results on GenEval and is currently in beta; it is being improved continuously and community feedback is very important.
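A hedged usage sketch with the diffusers `AuraFlowPipeline` follows; the repo ID `fal/AuraFlow-v0.3` and the chosen resolution are assumptions to check against the model card.

```python
# Hedged sketch: text-to-image with AuraFlow v0.3 via diffusers; repo ID is assumed.
import torch
from diffusers import AuraFlowPipeline

pipe = AuraFlowPipeline.from_pretrained(
    "fal/AuraFlow-v0.3", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a watercolor painting of a lighthouse at dusk",
    width=1536, height=1024,      # wide aspect ratio, within the stated 1536px limit
    num_inference_steps=50,
).images[0]
image.save("lighthouse.png")
```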
LongWriter is a long text generation model developed by a team at Tsinghua University. It is based on large-scale language models (LLMs) and is capable of generating text content of more than 10,000 words. This model is particularly suitable for scenarios where long coherent texts need to be generated, such as writing assistance, content creation, etc. Through fine tuning and optimization, LongWriter improves the quality and consistency of generated text while maintaining the efficiency and scalability of the model.
CogVideoX-2B is an open source video generation model developed by the Tsinghua University team. It supports English prompts, requires about 36GB of GPU memory for inference, and can generate videos 6 seconds long at 8 frames per second and 720*480 resolution. The model uses sinusoidal position embeddings and currently does not support quantized inference or multi-GPU inference. It is deployed on top of Hugging Face's diffusers library and generates videos from text prompts, offering a high degree of creativity and application potential.
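Since the model ships as a diffusers pipeline, the following hedged sketch shows one way to generate a clip; the repo ID `THUDM/CogVideoX-2b` is assumed from the model card, and CPU offloading is used here to reduce the GPU memory footprint.

```python
# Hedged sketch: text-to-video with CogVideoX-2B via the diffusers CogVideoXPipeline.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", torch_dtype=torch.float16  # assumed repo ID
)
pipe.enable_model_cpu_offload()  # trades speed for a much smaller GPU memory footprint

video = pipe(
    prompt="A panda playing a tiny guitar in a bamboo forest",
    num_frames=49,           # roughly 6 seconds at 8 fps
    num_inference_steps=50,
).frames[0]
export_to_video(video, "panda.mp4", fps=8)
```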
CogVideoX is an open source video generation model that has the same origin as the commercial model and supports the generation of video content through text descriptions. It represents the latest progress in text-to-video generation technology, has the ability to generate high-quality videos, and can be widely used in entertainment, education, business promotion and other fields.
DeepSeek-Coder-V2 is an open source Mixture-of-Experts code language model with performance comparable to GPT4-Turbo and outstanding performance on code-specific tasks. It is further pre-trained with an additional 6 trillion tokens, enhancing coding and mathematical reasoning capabilities while maintaining similar performance on general language tasks. Compared with DeepSeek-Coder-33B, there are significant improvements in code-related tasks, reasoning and general capabilities. In addition, the programming languages it supports are expanded from 86 to 338, and the context length is expanded from 16K to 128K.
DeepSeek-Coder-V2 is an open source Mixture-of-Experts (MoE) code language model with performance comparable to GPT4-Turbo and excellent performance on code-specific tasks. Based on DeepSeek-Coder-V2-Base, it is further pre-trained through a high-quality multi-source corpus of 6 trillion tokens, significantly enhancing coding and mathematical reasoning capabilities while maintaining performance on general language tasks. The supported programming languages have been expanded from 86 to 338, and the context length has been expanded from 16K to 128K.
Stable Audio Open is an open source text-to-audio model optimized for generating short audio samples, sound effects, and production elements. It allows users to generate up to 47 seconds of high-quality audio data through simple text prompts, and is particularly suitable for music production and sound design such as creating drum beats, instrumental riffs, ambient sounds, foley recordings, etc. A key benefit of the open source release is that users can fine-tune the model based on their own custom audio data.
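For orientation, here is a hedged sketch of prompt-to-audio generation with the diffusers `StableAudioPipeline`. The repo ID `stabilityai/stable-audio-open-1.0`, the `audio_end_in_s` clip-length parameter, and the `.audios` / `pipe.vae.sampling_rate` attributes are assumptions modeled on the diffusers audio pipeline documentation.

```python
# Hedged sketch: generating a short clip with Stable Audio Open via diffusers.
import torch
import soundfile as sf
from diffusers import StableAudioPipeline

pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16  # assumed repo ID
).to("cuda")

audio = pipe(
    prompt="punchy lo-fi drum loop, 90 BPM",
    audio_end_in_s=10.0,       # assumed parameter controlling clip length in seconds
    num_inference_steps=100,
).audios[0]

# audios[0] is assumed to be a (channels, samples) tensor; soundfile expects (samples, channels).
sf.write("drums.wav", audio.T.float().cpu().numpy(), pipe.vae.sampling_rate)
```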
360Zhinao is a series of 7B-scale language models open sourced by Qihoo 360, including a base model and three chat models with different context lengths. The models are pre-trained on large-scale Chinese and English corpora, perform well on tasks such as natural language understanding, knowledge, mathematics, and code generation, and have strong long-text dialogue capabilities. They can be used to develop and deploy a wide range of conversational applications.
HuggingChat Assistants is a chatbot customization platform released by HuggingFace. Users can choose from multiple open source models hosted by HuggingFace to create customized chatbots suitable for multiple fields.
PIXART LCM (PIXART-δ) is a text-to-image synthesis framework that integrates the Latent Consistency Model (LCM) and ControlNet into the advanced PIXART-α model. It is known for generating high-quality images at 1024px resolution through an efficient training process. Integrating LCM into PIXART-δ significantly speeds up inference, allowing high-quality images to be generated in only 2-4 steps. Notably, PIXART-δ can generate a 1024x1024 pixel image in 0.5 seconds, a 7-fold improvement over PIXART-α. In addition, PIXART-δ is designed to train efficiently on a 32GB V100 GPU in a single day, and with 8-bit inference it can synthesize 1024px images within an 8GB GPU memory budget, greatly improving usability and accessibility. The introduction of ControlNet-like modules enables fine-grained control over the text-to-image diffusion model: a novel ControlNet-Transformer architecture, tailored specifically for Transformers, provides explicit controllability alongside high-quality image generation. As a state-of-the-art open source image generation model, PIXART-δ offers a promising alternative to the Stable Diffusion family and makes a significant contribution to text-to-image synthesis.
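The few-step behavior described above can be sketched with the diffusers `PixArtAlphaPipeline`; the LCM checkpoint name `PixArt-alpha/PixArt-LCM-XL-2-1024-MS` is an assumption to verify, and classifier-free guidance is disabled as is typical for LCM-distilled models.

```python
# Hedged sketch: 4-step 1024px generation with an assumed PixArt LCM checkpoint.
import torch
from diffusers import PixArtAlphaPipeline

pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-LCM-XL-2-1024-MS",  # assumed LCM-distilled checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="an isometric illustration of a tiny greenhouse on a cliff",
    num_inference_steps=4,   # LCM needs only a few denoising steps
    guidance_scale=0.0,      # classifier-free guidance is typically off for LCM
).images[0]
image.save("greenhouse.png")
```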