FLUX.1 Krea [dev] is a 12-billion-parameter rectified flow transformer designed for generating high-quality images from text descriptions. The model is trained with guidance distillation to make it more efficient, and its open weights support scientific research and artistic creation. The product emphasizes its aesthetic photography quality and strong prompt-following ability, making it a strong competitor to closed-source alternatives. The model can be used for personal, scientific, and commercial purposes, enabling innovative workflows.
WAN 2.1 LoRA T2V is a tool that generates videos from text prompts. Through custom training of LoRA modules, users can tailor the generated videos, making it suitable for brand narratives, fan content, and stylized animation, and providing a highly customizable video generation experience.
Fotol AI is a website that provides AGI technology and services, dedicated to offering users powerful artificial intelligence solutions. Its main advantages include advanced technical support, rich functional modules, and a wide range of application fields. Fotol AI is positioned as a first-choice platform for exploring AGI, providing users with flexible and diverse AI solutions.
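As a rough illustration of how a FLUX-family model is typically run, here is a minimal text-to-image sketch using the diffusers FluxPipeline; the checkpoint name and generation settings are assumptions and should be checked against the official model card.

```python
# Minimal text-to-image sketch with diffusers' FluxPipeline.
# The repo id "black-forest-labs/FLUX.1-Krea-dev" is an assumption; verify on the model card.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Krea-dev",  # assumed checkpoint name
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # reduces VRAM usage at the cost of speed

image = pipe(
    prompt="a golden retriever on a foggy beach at sunrise, film photography",
    height=1024,
    width=1024,
    guidance_scale=3.5,      # typical value for FLUX.1 [dev]-style models
    num_inference_steps=28,
).images[0]
image.save("flux_krea_sample.png")
```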
OmniGen2 is an efficient multi-modal generation model that combines visual language models and diffusion models to achieve functions such as visual understanding, image generation and editing. Its open source nature provides researchers and developers with a strong foundation to explore personalized and controllable generative AI.
BAGEL is a scalable unified multimodal model that is changing the way AI interacts with complex systems. The model supports conversational reasoning, image generation, editing, style transfer, navigation, composition, and thinking. It is pre-trained on large-scale video and web data, providing a foundation for generating high-fidelity, realistic images.
FastVLM is an efficient visual encoding model designed specifically for visual language models. It uses the innovative FastViTHD hybrid visual encoder to reduce the encoding time of high-resolution images and the number of output tokens, making the model perform outstandingly in speed and accuracy. The main positioning of FastVLM is to provide developers with powerful visual language processing capabilities, suitable for various application scenarios, especially on mobile devices that require fast response.
F Lite is a large-scale diffusion model developed by Freepik and Fal with 10 billion parameters, specially trained on copyright-safe and suitable for work (SFW) content. The model is based on Freepik’s internal dataset of approximately 80 million legal and compliant images, marking the first time a publicly available model has focused on legal and safe content at this scale. Its technical report provides detailed model information and is distributed using the CreativeML Open RAIL-M license. The model is designed to promote openness and usability of artificial intelligence.
Flex.2 is positioned as the most flexible text-to-image diffusion model available, with built-in inpainting and universal controls. It is an open source, community-supported project that aims to promote the democratization of artificial intelligence. Flex.2 has 8 billion parameters, supports 512-token-length inputs, and is released under the OSI-approved Apache 2.0 license. The model can provide powerful support for many creative projects, and users can continuously improve it through feedback, driving technological progress.
InternVL3 is a multimodal large language model (MLLM) released as open source by OpenGVLab, with excellent multimodal perception and reasoning capabilities. The series includes 7 sizes ranging from 1B to 78B parameters and can process text, images, video, and other information simultaneously, showing excellent overall performance. InternVL3 performs well in fields such as industrial image analysis and 3D visual perception, and its overall text performance even exceeds that of the Qwen2.5 series. The open sourcing of this model provides strong support for multimodal application development and helps promote the application of multimodal technology in more fields.
VisualCloze is a universal image generation framework learned through visual in-context learning, aiming to overcome the inefficiency of training task-specific models for diverse needs. The framework not only supports a variety of in-domain tasks, but can also generalize to unseen tasks, helping the model understand tasks through visual examples. This approach leverages the strong generative priors of advanced image infilling models, providing strong support for image generation.
Step-R1-V-Mini is a new multimodal reasoning model launched by Step Star. It supports image and text input with text output, and has good instruction following and general capabilities. The model has been technically optimized for reasoning in multimodal collaborative scenarios. It adopts multimodal joint reinforcement learning and a training method that fully utilizes multimodal synthetic data, effectively improving the model's handling of complex processing chains in image space. Step-R1-V-Mini has performed well on multiple public leaderboards, in particular ranking first domestically on the MathVision visual reasoning leaderboard, demonstrating excellent performance in visual reasoning, mathematical logic, and coding. The model has been officially launched on the Step AI web page, and an API is provided on the Step Star open platform for developers and researchers.
HiDream-I1 is a new open source image generation base model with 17 billion parameters that can generate high-quality images in seconds. The model is suitable for research and development and has performed well in multiple evaluations. It is efficient and flexible and suitable for a variety of creative design and generation tasks.
EasyControl is a framework that provides efficient and flexible control for Diffusion Transformers, aiming to solve problems such as efficiency bottlenecks and insufficient model adaptability existing in the current DiT ecosystem. Its main advantages include: supporting multiple condition combinations, improving generation flexibility and reasoning efficiency. This product is developed based on the latest research results and is suitable for use in areas such as image generation and style transfer.
RF-DETR is a transformer-based real-time object detection model designed to provide high accuracy and real-time performance for edge devices. It exceeds 60 AP in the Microsoft COCO benchmark, with competitive performance and fast inference speed, suitable for various real-world application scenarios. RF-DETR is designed to solve object detection problems in the real world and is suitable for industries that require efficient and accurate detection, such as security, autonomous driving, and intelligent monitoring.
Stable Virtual Camera is a 1.3B-parameter general-purpose diffusion model developed by Stability AI; it is a transformer-based image-to-video model. Its importance lies in providing technical support for Novel View Synthesis (NVS): it can generate 3D-consistent novel views of a scene from input views and target cameras. Its main advantages are the freedom to specify target camera trajectories, the ability to generate samples with large viewpoint changes and temporal smoothness, the ability to maintain high consistency without additional Neural Radiance Field (NeRF) distillation, and the ability to generate high-quality seamless looping videos of up to half a minute. The model is free for research and non-commercial use only, and is positioned to provide innovative image-to-video solutions for researchers and non-commercial creators.
Flat Color - Style is a LoRA model designed specifically for generating flat-color-style images and videos. It is trained on the Wan Video model and produces a distinctive lineless, low-depth look, making it suitable for animation, illustration, and video generation. The main advantages of this model are its ability to reduce color bleeding and strengthen blacks while delivering high-quality visuals. It is suitable for scenarios that require concise, flat design, such as anime character design, illustration creation, and video production. The model is free for users and is designed to help creators quickly achieve visual works with a modern, concise style.
Aya Vision 32B is an advanced visual language model developed by Cohere For AI with 32 billion parameters and supports 23 languages, including English, Chinese, Arabic, etc. This model combines the latest multilingual language model Aya Expanse 32B and the SigLIP2 visual encoder to achieve the combination of vision and language understanding through a multimodal adapter. It performs well in the field of visual language and can handle complex image and text tasks, such as OCR, image description, visual reasoning, etc. The model was released to promote the popularity of multimodal research, and its open source weights provide a powerful tool for researchers around the world. This model is licensed under a CC-BY-NC license and is subject to Cohere For AI’s fair use policy.
CohereForAI's Aya Vision 8B is an 8-billion-parameter multilingual visual language model optimized for a variety of visual language tasks, supporting OCR, image captioning, visual reasoning, summarization, question answering, and other functions. The model is based on the C4AI Command R7B language model, combined with the SigLIP2 visual encoder, supports 23 languages, and has a 16K context length. Its main advantages include multilingual support, powerful visual understanding capabilities, and a wide range of applicable scenarios. The model is released as open weights to advance the global research community. Under the CC-BY-NC license, users are required to comply with C4AI's acceptable use policy.
Aya Vision is an advanced vision model developed by the Cohere For AI team, focusing on multi-language and multi-modal tasks, supporting 23 languages. The model significantly improves the performance of visual and text tasks through innovative algorithm breakthroughs such as synthetic annotation, multilingual data expansion, and multimodal model fusion. Its main advantages include efficiency (it performs well even with limited computing resources) and extensive multi-language support. Aya Vision is launched to advance the cutting edge of multilingual and multimodal research and provide technical support to the global research community.
CogView4 is an advanced text-to-image generation model developed by Tsinghua University. It is based on diffusion model technology and can generate high-quality images based on text descriptions. It supports Chinese and English input and can generate high-resolution images. The main advantages of CogView4 are its powerful multi-language support and high-quality image generation capabilities, which is suitable for users who need to generate images efficiently. This model was demonstrated at ECCV 2024 and has important research and application value.
UniTok is an innovative visual tokenizer designed to bridge the gap between visual generation and understanding. It significantly improves the representation capacity of discrete tokenizers through multi-codebook quantization, enabling it to capture richer visual details and semantic information. This technique breaks through the training bottleneck of traditional tokenizers and provides an efficient, unified solution for visual generation and understanding tasks. UniTok performs well on image generation and understanding tasks, for example achieving significant zero-shot accuracy improvements on ImageNet. The main advantages of this technology include efficiency, flexibility, and strong support for multimodal tasks, bringing new possibilities to the field of visual generation and understanding.
Migician is a multimodal large language model developed by the Natural Language Processing Laboratory of Tsinghua University, focusing on multi-image grounding tasks. By introducing an innovative training framework and the large-scale dataset MGrounding-630k, the model significantly improves precise grounding in multi-image scenarios. It not only surpasses existing multimodal large language models, but even outperforms the larger 70B model. The main advantage of Migician is its ability to handle complex multi-image tasks and accept free-form grounding instructions, giving it important application prospects in the field of multi-image understanding. The model is currently open source on Hugging Face for use by researchers and developers.
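To make the multi-codebook idea concrete, the sketch below splits each latent vector into chunks and quantizes each chunk against its own small codebook via nearest-neighbor lookup. This is a simplified conceptual illustration under assumed shapes, not UniTok's actual implementation.

```python
# Conceptual sketch of multi-codebook quantization (not UniTok's actual code):
# each latent vector is split into sub-vectors, and each sub-vector is quantized
# against its own codebook, so the effective vocabulary grows multiplicatively
# without requiring one huge codebook.
import torch

def multi_codebook_quantize(z, codebooks):
    """z: (N, D) latents; codebooks: list of k tensors of shape (K, D/k)."""
    chunks = z.chunk(len(codebooks), dim=-1)
    indices, quantized = [], []
    for chunk, book in zip(chunks, codebooks):
        dists = torch.cdist(chunk, book)   # (N, K) distances to codewords
        idx = dists.argmin(dim=-1)         # nearest codeword per sub-vector
        indices.append(idx)
        quantized.append(book[idx])        # (N, D/k)
    return torch.stack(indices, dim=-1), torch.cat(quantized, dim=-1)

# toy usage: 4 codebooks of 256 entries each over a 64-dim latent
z = torch.randn(10, 64)
books = [torch.randn(256, 16) for _ in range(4)]
codes, z_q = multi_codebook_quantize(z, books)
print(codes.shape, z_q.shape)  # torch.Size([10, 4]) torch.Size([10, 64])
```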
Magma-8B is a multi-modal AI basic model developed by Microsoft and designed specifically for studying multi-modal AI agents. It combines text and image inputs, is able to generate text output, and has visual planning and agent capabilities. This model uses Meta LLaMA-3 as the backbone of the language model, combined with the CLIP-ConvNeXt-XXLarge visual encoder, to support learning spatiotemporal relationships from unlabeled video data, and has strong generalization capabilities and multi-task adaptability. Magma-8B performs well in multi-modal tasks, especially in spatial understanding and reasoning. It provides powerful tools for multimodal AI research and advances the study of complex interactions in virtual and real environments.
SigLIP2 is a multilingual visual language encoder developed by Google with improved semantic understanding, localization, and dense features. It supports zero-shot image classification and can classify images directly through text descriptions without additional training. The model performs well in multi-language scenarios and is suitable for a variety of visual language tasks. Its main advantages include efficient language image alignment capabilities, support for multiple resolutions and dynamic resolution adjustment, and strong cross-language generalization capabilities. The launch of SigLIP2 provides a new solution for multi-language vision tasks, especially suitable for scenarios that require rapid deployment and multi-language support.
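A minimal zero-shot classification sketch with the transformers pipeline is shown below; the checkpoint name is an assumption and should be checked against the released SigLIP2 models.

```python
# Zero-shot image classification sketch; the checkpoint id is an assumption.
from transformers import pipeline

classifier = pipeline(
    task="zero-shot-image-classification",
    model="google/siglip2-base-patch16-224",  # assumed SigLIP2 checkpoint
)

result = classifier(
    "cat.jpg",
    candidate_labels=["a photo of a cat", "a photo of a dog", "a photo of a bird"],
)
print(result)  # list of {"label": ..., "score": ...} entries sorted by score
```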
VLM-R1 is a visual language model based on reinforcement learning, focusing on visual understanding tasks such as Referring Expression Comprehension (REC). The model demonstrates excellent performance on both in-domain and out-of-domain data by combining R1 (Reinforcement Learning) and SFT (Supervised Fine-Tuning) methods. The main advantages of VLM-R1 include its stability and generalization capabilities, allowing it to perform well on a variety of visual language tasks. The model is built on Qwen2.5-VL and utilizes advanced deep learning technologies such as Flash Attention 2 to improve computing efficiency. VLM-R1 is designed to provide an efficient and reliable solution for visual language tasks, suitable for applications requiring precise visual understanding.
ZeroBench is a benchmark designed for evaluating the visual understanding capabilities of large multimodal models (LMMs). It challenges the limits of current models with 100 carefully crafted and rigorously vetted complex questions, with 334 sub-questions. This benchmark aims to fill the gaps in existing visual benchmarks and provide a more challenging and high-quality evaluation tool. The main advantages of ZeroBench are its high difficulty, lightweight, diverse and high-quality characteristics, which allow it to effectively differentiate the performance of models. In addition, it provides detailed sub-problem evaluations to help researchers better understand the model's inference capabilities.
WHAM (World and Human Action Model) is a generative model developed by Microsoft Research, specifically used to generate game scenes and player behaviors. The model is trained on Ninja Theory’s “Bleeding Edge” game data and can generate coherent and diverse game visuals and controller actions. The main advantage of WHAM is its ability to capture the 3D structure of the game environment and the time sequence of player behaviors, providing a powerful tool for game design and creative exploration. This model is mainly aimed at academic research and game development fields, helping developers quickly iterate game design.
Pippo is a generative model developed by Meta Reality Labs in cooperation with multiple universities. It can generate high-resolution multi-view videos from a single ordinary photo. The core benefit of this technology is the ability to generate high-quality 1K-resolution video without additional inputs such as parametric models or camera parameters. It is based on a multi-view diffusion transformer architecture and has a wide range of application prospects, such as virtual reality and film and television production. Pippo's code is open source but does not include pre-trained weights; users need to train the model themselves.
One Shot LoRA is an online platform focused on quickly training LoRA models from videos. It uses advanced machine learning technology to efficiently convert video content into LoRA models, providing users with fast and convenient model generation services. The main advantages of this product are its simplicity, no need to log in, and secure privacy. It does not require users to upload private data, nor does it store or collect any user information, ensuring the privacy and security of user data. This product is mainly aimed at users who need to quickly generate LoRA models, such as designers, developers, etc., to help them quickly obtain the required model resources and improve work efficiency.
Janus Pro is an advanced AI image generation and understanding platform powered by DeepSeek technology. It uses a revolutionary unified transformer architecture that can efficiently handle complex multi-modal operations and achieve superior performance in image generation and understanding. The platform is trained on more than 90 million samples, including 72 million synthetic aesthetic data points, ensuring that the resulting images are visually appealing and contextually accurate. Janus Pro provides developers and researchers with powerful visual AI capabilities to help them move from creative ideas to visual storytelling. The platform offers a free trial and is suitable for users who require high-quality image generation and analysis.
Agentic Object Detection is an advanced inference-driven object detection technology that can accurately identify target objects in images from text prompts alone. It achieves detection with human-like accuracy without requiring large amounts of custom training data. The technology uses agentic design patterns to reason deeply about a target's unique attributes such as color, shape, and texture, enabling smarter and more accurate recognition across a variety of scenarios. Its main advantages include high accuracy, no need for large amounts of training data, and the ability to handle complex scenes. The technology is suitable for industries that require high-precision image recognition, such as manufacturing, agriculture, and healthcare, and can help companies improve production efficiency and quality control. The product is currently in a trial stage, and users can try it for free.
DiffSplat is an innovative 3D generation technology that enables rapid generation of 3D Gaussian splats from text prompts and single-view images. It achieves efficient 3D content generation by leveraging large-scale pre-trained text-to-image diffusion models. It addresses the limited datasets and the inability of traditional 3D generation methods to effectively exploit 2D pre-trained models, while maintaining 3D consistency. The main advantages of DiffSplat include fast generation (completed in 1~2 seconds), high-quality 3D output, and support for multiple input conditions. The model has broad prospects in academic research and industrial applications, especially in scenarios that require rapid generation of high-quality 3D models.
Qwen2.5-VL is the latest flagship visual language model launched by the Qwen team and an important advance in the field of visual language models. It can not only recognize common objects, but also analyze complex content such as text, charts, and icons in images, and it supports long-video understanding and event localization. The model performs well on multiple benchmarks, especially in document understanding and visual agent tasks, demonstrating strong visual understanding and reasoning capabilities. Its main advantages include efficient multimodal understanding, powerful long-video processing, and flexible tool calling, making it suitable for a variety of application scenarios.
Animagine XL 4.0 is an animation theme generation model based on Stable Diffusion XL 1.0 fine-tuning. It used 8.4 million diverse anime-style images for training, and the training time reached 2,650 hours. This model focuses on generating and modifying anime-themed images through text prompts, supporting a variety of special tags that control different aspects of image generation. Its main advantages include high-quality image generation, rich anime-style details, and accurate reproduction of specific characters and styles. The model was developed by Cagliostro Research Lab under the CreativeML Open RAIL++-M license, which allows commercial use and modification.
MILS is an open source project released by Facebook Research that aims to demonstrate the ability of large language models (LLMs) to handle visual and auditory tasks without any training. This technology enables automatic description generation of images, audio and video by utilizing pre-trained models and optimization algorithms. This technological breakthrough provides new ideas for the development of multi-modal artificial intelligence and demonstrates the potential of LLMs in cross-modal tasks. This model is primarily intended for researchers and developers, providing them with a powerful tool to explore multimodal applications. The project is currently free and open source and aims to promote academic research and technology development.
Janus-Pro-7B is a powerful multimodal model capable of processing both text and image data. It solves the conflict between traditional models in understanding and generation tasks by separating the visual encoding path, improving the flexibility and performance of the model. The model is based on the DeepSeek-LLM architecture, uses SigLIP-L as the visual encoder, supports 384x384 image input, and performs well in multi-modal tasks. Its main advantages include efficiency, flexibility and powerful multi-modal processing capabilities. This model is suitable for scenarios requiring multi-modal interaction, such as image generation and text understanding.
Janus-Pro-1B is an innovative multimodal model focused on unifying multimodal understanding and generation. It solves the conflicting problem of traditional methods in understanding and generation tasks by separating the visual encoding path, while maintaining a single unified Transformer architecture. This design not only improves the model's flexibility but also enables it to perform well in multi-modal tasks, even surpassing task-specific models. The model is built on DeepSeek-LLM-1.5b-base/DeepSeek-LLM-7b-base, uses SigLIP-L as the visual encoder, supports 384x384 image input, and uses a specific image generation tokenizer. Its open source nature and flexibility make it a strong candidate for the next generation of multimodal models.
SmolVLM-256M is a multi-modal model developed by Hugging Face, based on the Idefics3 architecture and designed for efficient processing of image and text input. It can answer questions about images, describe visual content, or transcribe text, and requires less than 1GB of GPU memory to run inference. The model performs well on multi-modal tasks while maintaining a lightweight architecture suitable for on-device applications. Its training data comes from The Cauldron and Docmatix data sets, covering document understanding, image description and other fields, giving it a wide range of application potential. The model is currently available for free on the Hugging Face platform and is designed to provide developers and researchers with powerful multi-modal processing capabilities.
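For a sense of how such a lightweight VLM is typically queried, here is a minimal inference sketch via transformers; the checkpoint id and chat-template details are assumptions and may need adjustment against the official model card.

```python
# Minimal inference sketch for SmolVLM via transformers; the checkpoint id
# "HuggingFaceTB/SmolVLM-256M-Instruct" is an assumption.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"  # assumed instruct checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("chart.png")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Describe this image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```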
SmolVLM-500M is a lightweight multi-modal model developed by Hugging Face and belongs to the SmolVLM series. The model is based on the Idefics3 architecture and focuses on efficient image and text processing tasks. It can accept image and text input in any order and generate text output, which is suitable for tasks such as image description and visual question answering. Its lightweight architecture enables it to run on resource-constrained devices while maintaining strong multi-modal task performance. The model is licensed under the Apache 2.0 license, enabling open source and flexible usage scenarios.
Flex.1-alpha is a powerful text-to-image generative model based on an 8-billion-parameter rectified flow transformer architecture. It inherits the features of FLUX.1-schnell and trains a guidance embedder so that it can generate images without classifier-free guidance (CFG). The model supports fine-tuning, carries an open source license (Apache 2.0), and is suitable for use in multiple inference engines such as Diffusers and ComfyUI. Its main advantages include efficient generation of high-quality images, flexible fine-tuning capabilities, and open source community support. It was developed to address the compression and optimization of image generation models and to improve performance through continued training.
The Kimi vision model is an advanced image understanding capability provided by the Moonshot AI open platform. It can accurately identify and understand text, colors, object shapes, and other content in pictures, providing users with powerful visual analysis capabilities. The model is efficient and accurate and suits a variety of scenarios, such as image content description and visual question answering. Its pricing is consistent with the moonshot-v1 series models: billing is based on the total tokens inferred by the model, and each image consumes a fixed 1,024 tokens.
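Since each image is billed as a flat 1,024 tokens on top of the text tokens, a request's billable total can be estimated as in the sketch below; the per-token price is a placeholder, not Moonshot AI's actual rate.

```python
# Rough cost estimate for a Kimi vision request: each image counts as a fixed
# 1,024 tokens, added to the text tokens consumed by the model.
# PRICE_PER_1K_TOKENS is a placeholder, not an actual Moonshot AI price.
IMAGE_TOKENS = 1024
PRICE_PER_1K_TOKENS = 0.012  # placeholder price per 1K tokens

def estimate_billable_tokens(num_images: int, text_tokens: int) -> int:
    return num_images * IMAGE_TOKENS + text_tokens

tokens = estimate_billable_tokens(num_images=2, text_tokens=500)
print(tokens)                                        # 2 * 1024 + 500 = 2548
print(f"~{tokens / 1000 * PRICE_PER_1K_TOKENS:.4f} (currency units)")
```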
StructLDM is a structured latent diffusion model for learning 3D human body generation from 2D images. It can generate diverse human bodies with consistent perspectives and supports different levels of controllable generation and editing, such as combined generation and local clothing editing. This model enables clothing-independent generation and editing without the need for clothing type or mask conditions. The project was proposed by Tao Hu, Fangzhou Hong and Ziwei Liu of Nanyang Technological University's S-Lab, and the relevant paper was published in ECCV 2024.
ViTPose is a series of human pose estimation models based on Transformer architecture. It leverages the powerful feature extraction capabilities of Transformer to provide a simple and effective baseline for human pose estimation tasks. The ViTPose model performs well on multiple datasets with high accuracy and efficiency. The model is maintained and updated by the University of Sydney community and is available in a variety of different scales to meet the needs of different application scenarios. On the Hugging Face platform, ViTPose models are available to users in open source form. Users can easily download and deploy these models to conduct research and application development related to human posture estimation.
Hallo3 is a technology for portrait image animation that utilizes pre-trained transformer-based video generation models to generate highly dynamic and realistic videos, effectively solving challenges such as non-frontal perspectives, dynamic object rendering, and immersive background generation. This technology, jointly developed by researchers from Fudan University and Baidu, has strong generalization capabilities and brings new breakthroughs to the field of portrait animation.
Stable Point Aware 3D (SPAR3D) is an advanced 3D generative model launched by Stability AI. It enables real-time editing and complete structure generation of 3D objects from a single image in less than a second. SPAR3D uses a unique architecture that combines precise point cloud sampling with advanced mesh generation technology to provide unprecedented control over 3D asset creation. The model is free for commercial and non-commercial use, and the weights can be downloaded at Hugging Face, the code is available on GitHub, or accessed through the Stability AI Developer Platform API.
InternVL2_5-26B-MPO is a multimodal large language model (MLLM). Based on InternVL2.5, it further improves the model performance through Mixed Preference Optimization (MPO). This model can process multi-modal data including images and text, and is widely used in scenarios such as image description and visual question answering. Its importance lies in its ability to understand and generate text that is closely related to the content of the image, pushing the boundaries of multi-modal artificial intelligence. Product background information includes its superior performance in multi-modal tasks and evaluation results in OpenCompass Leaderboard. This model provides researchers and developers with powerful tools to explore and realize the potential of multimodal artificial intelligence.
InternVL2_5-8B-MPO-AWQ is a multi-modal large-scale language model launched by OpenGVLab. It is based on the InternVL2.5 series and uses Mixed Preference Optimization (MPO) technology. The model demonstrates excellent performance in visual and language understanding and generation, especially in multi-modal tasks. It achieves in-depth understanding and interaction of images and text by combining the visual part InternViT and the language part InternLM or Qwen, using randomly initialized MLP projectors for incremental pre-training. The importance of this technology lies in its ability to process multiple data types including single images, multiple images, and video data, providing new solutions in the field of multi-modal artificial intelligence.
1.58-bit FLUX is an advanced text-to-image generative model that quantizes the FLUX.1-dev model using 1.58-bit weights (i.e., values in {-1, 0, +1}) while maintaining comparable performance when generating 1024x1024 images. The method does not require access to image data and relies entirely on self-supervision from the FLUX.1-dev model. In addition, a custom kernel optimized for 1.58-bit operations was developed, achieving a 7.7x reduction in model storage, a 5.1x reduction in inference memory, and improved inference latency. Extensive evaluation on the GenEval and T2I-CompBench benchmarks shows that 1.58-bit FLUX significantly improves computational efficiency while maintaining generation quality.
InternVL2.5-MPO is an advanced multimodal large language model series built on InternVL2.5 and Mixed Preference Optimization (MPO). The series integrates the newly incrementally pre-trained InternViT with various pre-trained large language models, including InternLM 2.5 and Qwen 2.5, using randomly initialized MLP projectors. InternVL2.5-MPO retains the same model architecture as InternVL 2.5 and its predecessors, following the "ViT-MLP-LLM" paradigm. The model supports multiple images and video data, and MPO further improves its performance on multimodal tasks.
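The core of 1.58-bit quantization is mapping each weight to one of {-1, 0, +1} with a floating-point scale. The snippet below shows a generic absmean-style ternarization for illustration only; it is not the exact quantizer or kernel used by 1.58-bit FLUX.

```python
# Generic ternary (1.58-bit) weight quantization sketch: weights are mapped to
# {-1, 0, +1} with a single per-tensor scale. Illustrative only; not the exact
# scheme used by 1.58-bit FLUX.
import torch

def ternarize(w: torch.Tensor, eps: float = 1e-8):
    scale = w.abs().mean().clamp(min=eps)    # per-tensor scale (absmean)
    q = (w / scale).round().clamp(-1, 1)     # values in {-1, 0, +1}
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)
q, s = ternarize(w)
w_hat = dequantize(q, s)
print(q)                         # int8 tensor of -1/0/+1 values
print((w - w_hat).abs().mean())  # reconstruction error on this toy tensor
```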
InternVL2_5-4B-MPO-AWQ is a multimodal large language model (MLLM) focused on improving the model's performance in image and text interaction tasks. The model is based on the InternVL2.5 series and further improves performance through Mixed Preference Optimization (MPO). It can handle a variety of inputs including single and multi-image and video data, and is suitable for complex tasks that require interactive understanding of images and text. InternVL2_5-4B-MPO-AWQ provides a powerful solution for image-to-text tasks with its excellent multi-modal capabilities.
DynamicControl is a framework for improving control over text-to-image diffusion models. It supports adaptive selection of different numbers and types of conditions by dynamically combining diverse control signals to synthesize images more reliably and in detail. The framework first uses a dual-loop controller to generate initial true score rankings for all input conditions using pre-trained conditional generative and discriminative models. Then, an efficient condition evaluator is built through multimodal large language model (MLLM) to optimize condition ranking. DynamicControl jointly optimizes MLLM and diffusion models, leveraging the inference capabilities of MLLM to facilitate multi-condition text-to-image tasks. The final sorted conditions are input to the parallel multi-control adapter, which learns feature maps of dynamic visual conditions and integrates them to adjust ControlNet and enhance control of the generated images.
InternVL2.5-MPO is an advanced multi-modal large-scale language model series built based on InternVL2.5 and hybrid preference optimization. The model integrates the new incremental pre-trained InternViT and various pre-trained large language models, such as InternLM 2.5 and Qwen 2.5, using randomly initialized MLP projectors. It supports multiple image and video data and performs well in multi-modal tasks, capable of understanding and generating image-related text content.
Valley is a cutting-edge multimodal large model developed by ByteDance that can handle a variety of tasks involving text, image, and video data. The model achieved the best results on internal e-commerce and short-video benchmarks, outperforming other open source models. On OpenCompass, compared with models of the same scale, its average score was greater than or equal to 67.40, ranking second among models smaller than 10B. The Valley-Eagle version draws on Eagle and introduces a vision encoder that can flexibly adjust the number of tokens and runs in parallel with the original visual tokens, enhancing the model's performance in extreme scenarios.
InternVL2_5-2B-MPO is a family of multi-modal large-scale language models that demonstrates excellent overall performance. The series is built on InternVL2.5 and hybrid preference optimization. It integrates the newly incrementally pretrained InternViT with various pretrained large language models, including InternLM 2.5 and Qwen 2.5, using randomly initialized MLP projectors. The model performs well in multi-modal tasks and is able to handle a variety of data types including images and text, making it suitable for scenarios that require understanding and generating multi-modal content.
InternVL2_5-1B-MPO is a multimodal large language model (MLLM) built on InternVL2.5 and Mixed Preference Optimization (MPO), demonstrating superior overall performance. The model integrates the newly incrementally pretrained InternViT with various pretrained large language models (LLMs), including InternLM 2.5 and Qwen 2.5, using randomly initialized MLP projectors. InternVL2.5-MPO retains the same "ViT-MLP-LLM" paradigm as InternVL 2.5 and its predecessors in model architecture, and introduces support for multiple image and video data. This model performs well in multi-modal tasks and can handle a variety of visual language tasks including image description, visual question answering, etc.
DisPose is a method for controlling human image animation that improves the quality of video generation through motion field guidance and keypoint correspondence. This technology is able to generate videos from reference images and driving videos while maintaining consistency of motion alignment and identity information. DisPose provides region-level dense guidance by generating dense motion fields from sparse motion fields and reference images while maintaining the generalization ability of sparse pose control. Furthermore, it extracts diffusion features corresponding to pose key points from the reference image and transfers these point features to the target pose to provide unique identity information. Key benefits of DisPose include the ability to extract more versatile and efficient control signals without the need for additional dense inputs, as well as improved quality and consistency of generated videos via plug-and-play hybrid ControlNet without freezing existing model parameters.
Ruyi-Models is an image-to-video model capable of generating cinematic videos up to 768 resolution and 24 frames per second, supporting lens control and motion range control. Using an RTX 3090 or RTX 4090 graphics card, you can generate 512-resolution, 120-frame video losslessly. This model has attracted attention for its high-quality video generation capabilities and precise control of details, especially in areas where high-quality video content needs to be generated, such as film production, game production, and virtual reality experiences.
CAP4D is a technology that uses Morphable Multi-View Diffusion Models to create 4D human avatars. It is able to generate images of different perspectives and expressions from any number of reference images and adapt them to a 4D avatar that can be controlled via 3DMM and rendered in real time. Key advantages of this technology include highly realistic image generation, adaptability to multiple perspectives, and the ability to render in real time. CAP4D's technical background is based on recent advances in deep learning and image generation, especially in diffusion models and 3D facial modeling. Due to its high-quality image generation and real-time rendering capabilities, CAP4D has broad application prospects in entertainment, game development, virtual reality and other fields. Currently, the technology is available as code for free, but specific commercial applications may require further licensing and pricing.
HDR is a new technology for repairing damaged historical documents, aiming to predict their original appearance. By introducing the large-scale dataset HDR28K and the diffusion-based network DiffHDR, it can handle a variety of damage, including missing characters, paper damage, and ink erosion. The main advantage of HDR is its ability to accurately capture character content and style while harmonizing the restored area with the background. The technology can not only repair damaged documents but can also be extended to document editing and text block generation, demonstrating high flexibility and generalization. HDR is of great significance to the preservation of priceless culture and civilization.
ComfyUI-IF_MemoAvatar is a memory-guided diffusion based model for generating expressive videos. The technology allows users to create expressive talking avatar videos from a single image and audio input. The importance of this technology lies in its ability to convert static images into dynamic videos while retaining the facial features and emotional expressions of the characters in the images, providing new possibilities for video content creation. This model was developed by Longtao Zheng and others, and related papers were published on arXiv.
GenEx is an AI model capable of creating a fully explorable 360° 3D world from a single image. Users can interactively explore this generated world. GenEx advances embodied AI in imaginary spaces and has the potential to extend these capabilities to real-world exploration.
DeepSeek-VL2 is a series of advanced large-scale Mixture-of-Experts (MoE) visual language models that are significantly improved compared to the previous generation DeepSeek-VL. This model series has demonstrated excellent capabilities in a variety of tasks such as visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. DeepSeek-VL2 consists of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1 billion, 2.8 billion and 4.5 billion activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance compared to existing open source dense and MoE-based models with similar or fewer activated parameters.
DeepSeek-VL2 is a series of advanced large-scale Mixture-of-Experts (MoE) visual language models that are significantly improved compared to the previous generation DeepSeek-VL. This model series has demonstrated excellent capabilities in multiple tasks such as visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. DeepSeek-VL2 consists of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance compared to existing open source dense and MoE-based models with similar or fewer activated parameters.
DeepSeek-VL2 is a series of large-scale Mixture-of-Experts visual language models that are significantly improved compared to the previous generation DeepSeek-VL. This model series demonstrates excellent capabilities in tasks such as visual question answering, optical character recognition, document/table/diagram understanding, and visual localization. DeepSeek-VL2 contains three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activation parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance compared to existing open source dense and MoE base models with similar or fewer activation parameters.
InternVL2_5-4B is an advanced multimodal large language model (MLLM) that maintains the core model architecture of InternVL 2.0 while introducing significant enhancements in training and testing strategies and data quality. The model performs well on image-text-to-text tasks, especially in multimodal reasoning, mathematical problem solving, OCR, and chart and document understanding. As an open source model, it provides researchers and developers with powerful tools to explore and build vision- and language-based intelligent applications.
InternVL 2.5 is an advanced multi-modal large language model series that builds on InternVL 2.0 by introducing significant training and testing strategy enhancements and data quality improvements while maintaining its core model architecture. The model integrates the newly incrementally pretrained InternViT with various pretrained large language models, such as InternLM 2.5 and Qwen 2.5, using randomly initialized MLP projectors. InternVL 2.5 supports multiple image and video data, with dynamic high-resolution training methods that provide better performance when processing multi-modal data.
InternVL 2.5 is a series of advanced multimodal large language models (MLLM) that builds on InternVL 2.0 by introducing significant training and testing strategy enhancements and data quality improvements, while maintaining its core model architecture. The model integrates the newly incrementally pretrained InternViT with various pretrained large language models (LLMs), such as InternLM 2.5 and Qwen 2.5, using randomly initialized MLP projectors. InternVL 2.5 supports multiple image and video data, and enhances the model's ability to handle multi-modal data through dynamic high-resolution training methods.
Sana is a text-to-image generation framework developed by NVIDIA that can efficiently generate images with resolutions up to 4096×4096. With its fast speed and powerful text-image alignment capabilities, Sana can be deployed on laptop GPUs and represents an important advancement in image generation technology. The model is based on a linear diffusion transformer and uses a pre-trained text encoder and a spatially compressed latent feature encoder to generate and modify images based on text cues. Sana's open source code can be found on GitHub, and its research and application prospects are broad, especially in artistic creation, educational tools, and model research.
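Assuming the SanaPipeline integration available in recent diffusers releases, a minimal text-to-image sketch might look like the following; the checkpoint id and settings are assumptions and should be verified against the official Sana repositories.

```python
# Text-to-image sketch for Sana, assuming the SanaPipeline integration in
# recent diffusers releases; the repo id below is an assumption.
import torch
from diffusers import SanaPipeline

pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_diffusers",  # assumed repo id
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # helps fit on laptop-class GPUs

image = pipe(
    prompt="an astronaut sketching equations on a whiteboard, watercolor style",
    height=1024,
    width=1024,
    num_inference_steps=20,
).images[0]
image.save("sana_sample.png")
```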
InternViT-300M-448px-V2_5 is an enhanced version of InternViT-300M-448px. By using ViT incremental learning with NTP loss (Stage 1.5), it improves the visual encoder's ability to extract visual features, especially in areas that are underrepresented in large-scale web datasets, such as multilingual OCR data and mathematical charts. This model is part of the InternVL 2.5 series, retaining the same "ViT-MLP-LLM" model architecture as the previous generation, and integrating the newly incrementally pre-trained InternViT with various pre-trained LLMs, such as InternLM 2.5 and Qwen 2.5, using randomly initialized MLP projectors.
InternViT-6B-448px-V2_5 is a visual model based on InternViT-6B-448px-V1-5. By using ViT incremental learning with NTP loss (stage 1.5), it improves the visual encoder's ability to extract visual features, especially in areas that are underrepresented in large-scale network datasets, such as multi-language OCR data and mathematical charts. This model is part of the InternVL 2.5 series, retaining the same "ViT-MLP-LLM" model architecture as the previous generation, and integrating the new incremental pre-trained InternViT with various pre-trained LLMs, including InternLM 2.5 and Qwen 2.5, using randomly initialized MLP projectors.
InternVL2_5-8B is a multi-modal large language model (MLLM) developed by OpenGVLab. It has significant training and testing strategy enhancements and data quality improvements based on InternVL 2.0. The model adopts the 'ViT-MLP-LLM' architecture, which integrates the new incremental pre-trained InternViT with multiple pre-trained language models, such as InternLM 2.5 and Qwen 2.5, using a randomly initialized MLP projector. InternVL 2.5 series models demonstrate excellent performance on multi-modal tasks, including image and video understanding, multi-language understanding, etc.
InternVL2_5-26B is an advanced multimodal large language model (MLLM) that is further developed based on InternVL 2.0 by introducing significant training and testing strategy enhancements and data quality improvements. The model maintains the "ViT-MLP-LLM" core model architecture of its predecessor and integrates the newly incrementally pretrained InternViT with various pretrained large language models (LLMs), such as InternLM 2.5 and Qwen 2.5, using randomly initialized MLP projectors. InternVL 2.5 series models demonstrate excellent performance in multi-modal tasks, especially in visual perception and multi-modal capabilities.
Sana is a text-to-image generation framework developed by NVIDIA that can efficiently generate images with resolutions up to 4096×4096. Sana's fast speed and powerful text-image alignment capabilities allow it to be deployed on laptop GPUs. It is a linear diffusion transformer text-to-image model with 1,648M parameters, used to generate multi-scale images at a base resolution of 1024px. The main advantages of the Sana model include high-resolution image generation, fast synthesis, and powerful text-image alignment. The Sana model is developed as open source code, the source code is available on GitHub, and it is released under the CC BY-NC-SA 4.0 License.
InternVL 2.5 is a series of multi-modal large-scale language models launched by OpenGVLab. It has significant training and testing strategy enhancements and data quality improvements based on InternVL 2.0. This model series can process image, text and video data, and has the ability to understand and generate multi-modal data. It is a cutting-edge product in the current field of multi-modal artificial intelligence. The InternVL 2.5 series models provide powerful support for multi-modal tasks with their high performance and open source features.
InternVL 2.5 is a series of advanced multimodal large language models (MLLM) that builds on InternVL 2.0 by introducing significant training and testing strategy enhancements and data quality improvements. This model series is optimized in terms of visual perception and multimodal capabilities, supports image-text-to-text generation among other functions, and is suitable for complex tasks that require processing both visual and language information.
Sana is a text-to-image framework developed by NVIDIA that can efficiently generate images with resolutions up to 4096×4096. The model synthesizes high-resolution, high-quality images at blazing speed, maintains strong text-image alignment, and can be deployed on laptop GPUs. The Sana model is based on a linear diffusion transformer, uses a pre-trained text encoder and a spatially compressed latent feature encoder, and supports Emoji, Chinese and English, and mixed prompts.
TRELLIS is a native 3D generative model based on a unified structured latent representation and rectified flow transformers, enabling diverse and high-quality 3D asset creation. The model comprehensively captures structural (geometry) and textural (appearance) information while maintaining flexibility during decoding by integrating sparse 3D grids with dense multi-view visual features extracted from a powerful vision foundation model. TRELLIS models scale up to 2 billion parameters and are trained on a large 3D asset dataset containing 500,000 diverse objects. The model produces high-quality results under text or image conditions, significantly outperforming existing methods, including recent methods of similar scale. TRELLIS also demonstrates flexible output-format selection and local 3D editing capabilities not offered by previous models. Code, models, and data will be released.
The Qwen2-VL-72B is the latest iteration of the Qwen-VL model and represents nearly a year of innovation. The model achieves state-of-the-art performance in visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, and more. It can understand videos of more than 20 minutes and can be integrated into mobile phones, robots and other devices to perform automatic operations based on the visual environment and text instructions. In addition to English and Chinese, Qwen2-VL now supports the understanding of text in images in different languages, including most European languages, Japanese, Korean, Arabic, Vietnamese, and more. Model architecture updates include Naive Dynamic Resolution and Multimodal Rotary Position Embedding (M-ROPE), which enhance its multi-modal processing capabilities.
The Qwen2-VL-7B is the latest iteration of the Qwen-VL model and represents nearly a year of innovation. The model achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, and others. It can understand videos longer than 20 minutes and provide high-quality support for video-based question answering, dialogue, content creation, etc. In addition, Qwen2-VL also supports multi-language, in addition to English and Chinese, it also includes most European languages, Japanese, Korean, Arabic, Vietnamese, etc. Model architecture updates include Naive Dynamic Resolution and Multimodal Rotary Position Embedding (M-ROPE), which enhance its multi-modal processing capabilities.
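A minimal image-question sketch for Qwen2-VL with transformers follows the common usage pattern below; the checkpoint id and prompt formatting are taken as reasonable assumptions and may need adjustment against the official model card.

```python
# Minimal image-question sketch for Qwen2-VL via transformers; the checkpoint
# id "Qwen/Qwen2-VL-7B-Instruct" is an assumption.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # assumed instruct checkpoint
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "What text appears in this image?"}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[Image.open("receipt.jpg")],
                   return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```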
AWPortraitCN is a text-to-image generation model developed based on FLUX.1-dev, specially trained for the appearance and aesthetics of Chinese people. It contains multiple types of portraits, such as indoor and outdoor portraits, fashion and studio photos, with strong generalization capabilities. Compared with the original version, AWPortraitCN is more delicate and realistic in skin texture. In order to pursue a more realistic original image effect, it can be used together with the AWPortraitSR workflow.
Sana is a text-to-image generation framework developed by NVIDIA that can efficiently generate high-definition images with strong text-image consistency at resolutions up to 4096×4096; it is extremely fast and can be deployed on laptop GPUs. The Sana model is based on a linear diffusion transformer, using a pre-trained text encoder and a spatially compressed latent feature encoder. The importance of this technology lies in its ability to quickly generate high-quality images, which has a revolutionary impact on art creation, design, and other creative fields. The Sana model is licensed under the CC BY-NC-SA 4.0 license and the source code is available on GitHub.
Sana is a text-to-image generation framework developed by NVIDIA that can efficiently generate images with resolutions up to 4096×4096. Sana is known for its fast speed, powerful text-image alignment capabilities, and the fact that it can be deployed on laptop GPUs. This model is based on a linear diffusion transformer using a pretrained text encoder and a spatially compressed latent feature encoder, and represents the latest advancement in text-to-image generation technology. Sana's key advantages include high-resolution image generation, fast synthesis, deployability on laptop GPUs, and open source code, making it valuable in research and practical applications.
FLOAT is an audio-driven portrait video generation method based on a flow matching generative model that transfers generative modeling from a pixel-based latent space to a learned motion latent space, achieving temporally consistent motion design. This technology introduces a transformer-based vector field predictor with a simple yet effective frame-by-frame conditional mechanism. In addition, FLOAT supports voice-driven emotional enhancement and can naturally incorporate expressive movements. Extensive experiments show that FLOAT outperforms existing audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.
Luma Photon is an innovative image generation model known for its high degree of creativity, intelligence and personalization. It is built on a new breakthrough architecture that delivers ultra-high-definition images with 10x greater cost efficiency. Luma Photon surpassed all models on the market in large-scale double-blind evaluations, excelling in quality, creativity and understanding, while also delivering revolutionary improvements in efficiency.
MV-Adapter is an adapter-based multi-view image generation solution that enhances pre-trained text-to-image (T2I) models and their derived models without changing the original network structure or feature space. By updating fewer parameters, MV-Adapter achieves efficient training and retains the prior knowledge embedded in the pre-trained model, reducing the risk of overfitting. This technology enables the adapter to inherit the strong priors of the pre-trained model to model new 3D knowledge through innovative designs such as replicated self-attention layers and parallel attention architectures. In addition, MV-Adapter also provides a unified conditional encoder, seamlessly integrates camera parameters and geometric information, and supports applications such as text- and image-based 3D generation and texture mapping. MV-Adapter implements 768-resolution multi-view generation on Stable Diffusion XL (SDXL) and demonstrates its adaptability and versatility, which can be extended to arbitrary view generation, opening up wider application possibilities.
PSHuman is an innovative framework that leverages multi-view diffusion models and explicit reconstruction techniques to reconstruct realistic 3D human models from a single image. The importance of this technique lies in its ability to handle complex self-occlusion problems and avoid geometric distortion in the generated facial details. PSHuman jointly models global body shape and local facial features through a cross-scale diffusion model, achieving new perspective generation that is rich in detail and maintains identity features. In addition, PSHuman also enhances cross-view body shape consistency under different human postures through body priors provided by parametric models such as SMPL-X. The main advantages of PSHuman include rich geometric details, high texture fidelity, and strong generalization capabilities.
MyTimeMachine is a facial age transformation model based on artificial intelligence technology. It can perform age regression (age reduction) and age advancement (age increase) in a personalized manner through approximately 50 personal selfie photos. It can generate facial images similar to the target age while maintaining high fidelity and identity characteristics. This technology is of great value in virtual age applications such as film and television special effects, and can provide high-quality, identity-consistent, and temporally coherent age effects.
Sana-1.6B is an efficient high-resolution image synthesis model based on a linear diffusion transformer, capable of generating high-quality images. Developed by NVIDIA Labs, the model uses DC-AE (deep compression autoencoder) technology with a 32x compressed latent space and can be deployed across multiple GPUs, providing powerful image generation capabilities. Sana-1.6B is known for efficient synthesis and high-quality output, and is an important technology in the field of image synthesis.
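A minimal usage sketch with Hugging Face diffusers follows; the repository id and the sampler settings are assumptions based on how diffusers-format checkpoints are normally loaded, not an official quickstart.

```python
# Minimal sketch: load a diffusers-format checkpoint and sample one image.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_diffusers",  # assumed repo id
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(
    prompt="a photograph of a lighthouse at dawn, volumetric fog",
    height=1024,
    width=1024,
    num_inference_steps=20,
).images[0]
image.save("sana_sample.png")
```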
Diffusion Self-Distillation is a self-distillation technique based on diffusion models for zero-shot customized image generation. It lets artists and users generate their own datasets with a pre-trained text-to-image model, without large amounts of paired data, and then fine-tune the model for text- and image-conditioned image-to-image tasks. The approach outperforms existing zero-shot methods on identity-preserving generation and is comparable to per-instance tuning techniques, without requiring test-time optimization.
SmolVLM is a small but capable vision-language model (VLM) with 2B parameters, leading models of its size with a small memory footprint and efficient performance. SmolVLM is fully open source: all model checkpoints, VLM datasets, training recipes and tools are released under the Apache 2.0 license. The model is suitable for local deployment in browsers or on edge devices, reducing inference costs and allowing user customization.
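Since the model is aimed at local deployment, a minimal local-inference sketch with transformers is shown below; the repository id, dtype, and prompt are assumptions rather than an official recipe.

```python
# Minimal sketch: describe one local image with a SmolVLM-style checkpoint.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

image = Image.open("photo.jpg")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Describe this image in one sentence."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```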
OneDiffusion is a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across a variety of tasks. Code and checkpoints are expected to be released in early December. Its significance lies in handling both image synthesis and understanding, an important advance for image generation and recognition. The project was developed jointly by multiple researchers, and the results have been published on arXiv.
FLUX1.1 [pro] is a high-resolution image generation model that supports image resolutions up to 4MP while keeping generation time to roughly 10 seconds per sample. The FLUX1.1 [pro] – ultra mode generates images at four times the standard resolution without sacrificing speed; performance benchmarks show it is over 2.5 times faster than comparable high-resolution models. The FLUX1.1 [pro] – raw mode produces more natural, less synthetic-looking images for creators pursuing realism, significantly improving the diversity of characters and the authenticity of natural photography. The model is priced competitively at $0.06 per image.
The Aquila-VL-2B model is a vision-language model (VLM) trained with the LLaVA-OneVision framework, using Qwen2.5-1.5B-instruct as the language model (LLM) and siglip-so400m-patch14-384 as the vision tower. It is trained on the self-built Infinity-MM dataset, which contains approximately 40 million image-text pairs combining open-source data collected from the Internet with synthetic instruction data generated by an open-source VLM. Aquila-VL-2B is open-sourced to advance multimodal research, especially the joint processing of images and text.
Claude Vision Object Detection is a Python-based tool that uses the Claude 3.5 Sonnet vision API to detect and visualize objects in images. The tool automatically draws bounding boxes around detected objects, labels them, and displays confidence scores. It can process a single image or an entire directory of images, and renders each detection with its confidence score in a bright, distinct color. It can also save the annotated images along with the detection results.
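The tool's own code is not shown here; the sketch below only illustrates the kind of Anthropic Messages API call such a tool would make. The model id string, the prompt, and the expected JSON output format are assumptions, and the tool would still need to parse the reply and draw the boxes itself.

```python
# Minimal sketch: send one image to the Anthropic Messages API and ask for
# object detections as JSON; parsing and drawing are left to the caller.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("photo.jpg", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # assumed model id
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64}},
            {"type": "text",
             "text": "List every object you can see as JSON with fields label, "
                     "confidence, and bounding_box [x0, y0, x1, y1] in relative coordinates."},
        ],
    }],
)
print(message.content[0].text)  # the tool would parse this and draw the boxes
```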
D-FINE is a powerful real-time object detection model that achieves excellent performance without adding inference or training cost by redefining the bounding-box regression task in DETRs as fine-grained distribution refinement (FDR) and introducing global optimal localization self-distillation (GO-LSD). The model was developed by researchers at the Chinese Academy of Sciences to improve the accuracy and efficiency of object detection.
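The entry above describes treating box regression as distribution refinement. The snippet below is only a conceptual sketch of distribution-based edge regression (predict a discrete distribution per box edge and take its expectation), with invented shapes and bin counts; it is not the authors' implementation of FDR or GO-LSD.

```python
# Conceptual sketch of distribution-based box-edge regression.
import torch
import torch.nn.functional as F

def edges_from_logits(edge_logits, bin_values):
    # edge_logits: (N, 4, n_bins) per-edge logits over discrete offset bins
    # bin_values:  (n_bins,)      offset value represented by each bin
    probs = F.softmax(edge_logits, dim=-1)      # a distribution per box edge
    return (probs * bin_values).sum(dim=-1)     # expected offset per edge, (N, 4)

n_bins = 17
bin_values = torch.linspace(-0.5, 0.5, n_bins)  # normalized offset bins
logits = torch.randn(8, 4, n_bins)              # 8 queries, 4 edges each
print(edges_from_logits(logits, bin_values).shape)  # torch.Size([8, 4])
# In FDR, offsets like these refine an initial box layer by layer; GO-LSD then
# distills the final layer's refined distributions into earlier layers.
```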
InstantIR is a blind image restoration method based on diffusion models that can handle unknown degradations at test time, improving the model's generalization. During inference it dynamically generates reference images and adjusts the generation conditions accordingly, providing robust conditioning. Its main advantages include restoring detail in severely degraded images, producing realistic textures, and steering restoration with text descriptions for creative editing. The technology was jointly developed by researchers from Peking University, the InstantX team and the Chinese University of Hong Kong, with sponsorship from HuggingFace and fal.ai.
Stable Diffusion 3.5 Medium is an AI-based image generation model from Stability AI that can generate high-quality images from text descriptions. Its importance lies in its ability to advance creative industries such as game design, advertising and art. Stable Diffusion 3.5 Medium is favored by users for its efficient image generation, ease of use and low resource consumption. The model is currently available for free trial on the Hugging Face platform.
Stable Diffusion 3.5 Medium is a text-to-image generative model developed by Stability AI with improved image quality, typography, complex prompt understanding, and resource efficiency. The model uses three fixed pre-trained text encoders, improves training stability through QK-normalization, and introduces dual attention blocks in the first 12 transformer layers. It excels at multi-resolution image generation, consistency, and adaptability to a variety of text-to-image tasks.
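A minimal text-to-image sketch with diffusers follows. The repository id matches the Hugging Face release named above, the sampler settings are illustrative, and the checkpoint may require accepting the model license before download.

```python
# Minimal sketch: generate one image with Stable Diffusion 3.5 Medium.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(
    prompt="an isometric illustration of a cozy game studio, soft lighting",
    num_inference_steps=28,   # illustrative settings
    guidance_scale=4.5,
).images[0]
image.save("sd35_medium.png")
```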
Flux.1 Lite is an 8B-parameter text-to-image generation model released by Freepik, distilled from the FLUX.1-dev model. It uses 7 GB less RAM and runs 23% faster than the original model while keeping the same precision (bfloat16). The release aims to make high-quality AI models more accessible, especially for users with consumer GPUs.
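A minimal loading sketch with diffusers follows, assuming the distilled checkpoint works with the standard FluxPipeline; the repository id and sampler settings are assumptions.

```python
# Minimal sketch: sample one image from a FLUX.1-dev-distilled checkpoint.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "Freepik/flux.1-lite-8B-alpha",   # assumed repo id
    torch_dtype=torch.bfloat16,       # the model keeps bfloat16 precision
)
pipe.to("cuda")

image = pipe(
    prompt="a macro photo of dew on a spider web at sunrise",
    guidance_scale=3.5,
    num_inference_steps=24,
).images[0]
image.save("flux_lite.png")
```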
Ultralight-Digital-Human is an ultra-lightweight digital human model that can run in real time on mobile devices. The model is open source and, to the best of the developer's knowledge, is the first lightweight open-source digital human model of its kind. Its main advantages are its lightweight design, suitability for mobile deployment, and real-time performance. It builds on deep learning, particularly face synthesis and voice simulation, to achieve high-quality results with low resource consumption. The project is currently free and mainly targets technology enthusiasts and developers.