Dream 7B is the latest diffusion large language model jointly launched by the NLP Group of the University of Hong Kong and Huawei Noah's Ark Lab. It has demonstrated excellent performance in text generation, especially in complex reasoning, long-term planning, and contextual coherence. The model adopts advanced training methods, offers strong planning and flexible reasoning capabilities, and provides more powerful support for a variety of AI applications.
AccVideo is a novel and efficient distillation method that accelerates the inference of video diffusion models using synthetic datasets. The model is able to achieve an 8.5x speedup in generating videos while maintaining similar performance. It uses a pre-trained video diffusion model to generate multiple effective denoised trajectories, thus optimizing the data usage and generation process. AccVideo is particularly suitable for scenarios that require efficient video generation, such as film production, game development, etc., and is suitable for researchers and developers.
InfiniteYou (InfU) is a powerful diffusion transformer-based framework designed to enable flexible photo recrafting while preserving user identity. By injecting identity features and employing a multi-stage training strategy, it significantly improves the quality and aesthetics of generated images as well as text-image alignment. This technology is of great significance for improving identity similarity and aesthetics in image generation and is suitable for a variety of image generation tasks.
TrajectoryCrafter is an advanced camera trajectory redirection tool that uses diffusion model technology to redesign camera movement in monocular videos, improving the expressiveness and visual appeal of the footage. The technology can be widely applied in fields such as film and television production and virtual reality. Efficient, convenient, and innovative, it aims to give users greater creative freedom and control.
Inception Labs is a company focused on developing diffusion large language models (dLLMs). Its technology is inspired by advanced image and video generation systems such as Midjourney and Sora. With diffusion models, Inception Labs offers 5-10 times faster generation, greater efficiency, and more control than traditional autoregressive models. Its models support parallel text generation, can correct errors and hallucinations, are suitable for multimodal tasks, and perform well in reasoning and structured data generation. The company, comprised of researchers and engineers from Stanford, UCLA and Cornell University, is a pioneer in the field of diffusion modeling.
Project Starlight is an AI video enhancement model from Topaz Labs designed to improve the quality of low-resolution and corrupted videos. It uses diffusion model technology to achieve video super-resolution, noise reduction, deblurring, and sharpening functions while maintaining temporal consistency and ensuring smooth transitions between video frames. This technology is a major breakthrough in the field of video enhancement, bringing unprecedented high-quality effects to video repair and enhancement. Currently, Project Starlight offers a free trial, with plans to support 4K export in the future, primarily for users and businesses in need of high-quality video restoration and enhancement.
Mercury Coder is the first commercial-grade diffusion large language model (dLLM) launched by Inception Labs, specially optimized for code generation. The model uses diffusion technology to significantly improve generation speed and quality through a 'coarse-to-fine' generation process. It is 5-10 times faster than traditional autoregressive language models and can exceed 1,000 tokens per second on NVIDIA H100 hardware while maintaining high-quality code generation. This addresses the generation-speed and inference-cost bottleneck of current autoregressive language models: Mercury Coder breaks through this limitation via algorithmic optimization and provides a more efficient, lower-cost solution for enterprise applications.
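As general background (not Mercury Coder's proprietary decoder), the sketch below illustrates how a masked-diffusion language model can generate text "coarse-to-fine": start from a fully masked sequence, predict every position in parallel, and commit only the most confident tokens at each step. The toy_denoiser, vocabulary, and schedule are hypothetical placeholders.

```python
import torch

# Hypothetical stand-ins: any model that scores every masked position in parallel.
VOCAB_SIZE, MASK_ID, SEQ_LEN = 1000, 0, 32

def toy_denoiser(tokens: torch.Tensor) -> torch.Tensor:
    """Placeholder for a diffusion LM: returns logits for every position at once."""
    return torch.randn(tokens.shape[0], tokens.shape[1], VOCAB_SIZE)

def diffusion_decode(steps: int = 8) -> torch.Tensor:
    tokens = torch.full((1, SEQ_LEN), MASK_ID)           # start fully masked ("pure noise")
    for step in range(steps):
        logits = toy_denoiser(tokens)                    # predict all positions in parallel
        probs, guesses = logits.softmax(-1).max(-1)      # confidence and best token per slot
        still_masked = tokens.eq(MASK_ID)
        # Coarse-to-fine: commit only the most confident fraction of masked slots this step.
        k = max(1, int(still_masked.sum() / (steps - step)))
        conf = probs.masked_fill(~still_masked, -1.0)
        commit = torch.topk(conf, k, dim=-1).indices
        tokens[0, commit[0]] = guesses[0, commit[0]]
    return tokens

print(diffusion_decode())
```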
VideoGrain is a video editing technology based on the diffusion model, which realizes multi-granularity video editing by adjusting the spatiotemporal attention mechanism. This technology solves the problems of semantic alignment and feature coupling in traditional methods and enables fine control of video content. Its main advantages include zero-shot editing capabilities, efficient text-to-region control, and feature separation capabilities. This technology is suitable for scenarios that require complex editing of videos, such as film and television post-production, advertising production, etc., and can significantly improve editing efficiency and quality.
MakeAnything is a diffusion transformer-based model focused on multi-domain procedural sequence generation. By combining advanced diffusion models and transformer architecture, the technology is able to generate high-quality, step-by-step creative sequences such as paintings, sculptures, icon designs, and more. Its main advantage is that it can handle generation tasks in a variety of domains and can quickly adapt to new domains with a small number of samples. Developed by the National University of Singapore Show Lab team and currently available as open source, the model aims to advance the development of generative technologies across multiple domains.
Pippo is a generative model developed by Meta Reality Labs in cooperation with multiple universities. It can generate high-resolution multi-view videos of a person from a single ordinary photo. The core benefit of this technology is the ability to generate high-quality 1K-resolution video without additional inputs such as parametric models or camera parameters. It is based on a multi-view diffusion transformer architecture and has broad application prospects in areas such as virtual reality and film and television production. Pippo's code is open source, but it does not include pre-trained weights; users need to train the model themselves.
On-device Sora is an open source project that aims to achieve efficient video generation on mobile devices such as the iPhone 15 Pro through techniques such as Linear Proportional Leap (LPL), Temporal Dimension Token Merging (TDTM), and Concurrent Inference with Dynamic Loading (CI-DL). The project is built on the Open-Sora model and can generate high-quality videos from text input. Its main advantages include high efficiency, low power consumption, and optimization for mobile devices. The technology is suitable for scenarios where video content needs to be generated quickly on mobile devices, such as short video creation and advertising production. The project is currently open source and free to use.
DiffSplat is an innovative 3D generation technology that enables rapid generation of 3D Gaussian splats from text prompts and single-view images. By leveraging large-scale pre-trained text-to-image diffusion models, it enables efficient 3D content generation. It addresses the limited datasets and the inability of traditional 3D generation methods to effectively exploit 2D pre-trained models, while maintaining 3D consistency. The main advantages of DiffSplat include fast generation (completed in 1-2 seconds), high-quality 3D output, and support for multiple input conditions. The model has broad prospects in academic research and industrial applications, especially in scenarios that require rapid generation of high-quality 3D models.
Go with the Flow is an innovative video generation technology that achieves efficient control of motion patterns in video diffusion models by using warped noise in place of traditional Gaussian noise. The technique enables precise control of object and camera motion in videos without modifying the original model architecture or increasing computational cost. Its main advantages include efficiency, flexibility, and scalability, and it can be widely applied to scenarios such as image-to-video and text-to-video generation. The technology was developed by researchers from institutions including Netflix Eyeline Studios, has high academic value and commercial application potential, and is currently open source and freely available to the public.
TokenVerse is an innovative multi-concept personalization method that leverages a pre-trained text-to-image diffusion model to decouple complex visual elements and attributes from a single image and enable seamless concept combination generation. This method breaks through the limitations of existing technologies in concept type or breadth, supporting a variety of concepts, including objects, accessories, materials, poses, and lighting. The importance of TokenVerse lies in its ability to bring more flexible and personalized solutions to the field of image generation to meet the diverse needs of users in different scenarios. Currently, TokenVerse’s code has not been made public, but its potential for personalized image generation has attracted widespread attention.
X-Dyna is an innovative zero-shot human image animation technology that generates realistic and expressive dynamics by transferring the facial expressions and body movements in a driving video to a single human image. The technology is based on a diffusion model: through the Dynamics-Adapter module, the reference appearance context is effectively integrated into the spatial attention of the diffusion model while preserving the motion module's ability to synthesize smooth and complex dynamic details. It can not only control body posture but also capture identity-independent facial expressions through a local control module for precise expression transfer. X-Dyna is trained on a mixture of human and scene videos, enabling it to learn physical human motion and natural scene dynamics and to generate highly realistic and expressive animations.
Hunyuan3D 2.0 is an advanced large-scale 3D synthesis system launched by Tencent, focusing on generating high-resolution textured 3D assets. The system includes two basic components: the large-scale shape generation model Hunyuan3D-DiT and the large-scale texture synthesis model Hunyuan3D-Paint. It provides users with a flexible 3D asset creation platform by decoupling the challenges of shape and texture generation. This system surpasses existing open source and closed source models in terms of geometric details, conditional alignment, texture quality, etc., and is extremely practical and innovative. At present, the inference code and pre-training model of this model have been open sourced, and users can quickly experience it through the official website or Hugging Face space.
Diffusion as Shader (DaS) is an innovative video generation control model designed to achieve diversified control of video generation through the diffusion process of 3D perception. This model utilizes 3D tracking video as control input and can support multiple video control tasks under a unified architecture, such as mesh-to-video generation, camera control, motion transfer, and object manipulation. The main advantage of DaS is its 3D perception capability, which can effectively improve the temporal consistency of generated videos and demonstrate powerful control capabilities through fine-tuning with a small amount of data in a short time. This model was jointly developed by research teams from many universities including the Hong Kong University of Science and Technology. It aims to promote the development of video generation technology and provide more flexible and efficient solutions for film and television production, virtual reality and other fields.
SeedVR is an innovative diffusion transformer model specifically designed for real-world video restoration tasks. Through its shifted window attention mechanism, the model can efficiently process video sequences of arbitrary length and resolution. SeedVR is designed to achieve significant improvements in both generative power and sampling efficiency, performing well on synthetic and real-world benchmarks compared to traditional diffusion models. In addition, SeedVR incorporates modern practices such as a causal video autoencoder, mixed image-and-video training, and progressive training, further improving its competitiveness in the field of video restoration. As a cutting-edge video restoration technology, SeedVR provides video content creators and post-production staff with a powerful tool that can significantly improve video quality, especially when working with low-quality or damaged footage.
CreatiLayout is an innovative layout-to-image generation technology that utilizes the Siamese Multimodal Diffusion Transformer to achieve high-quality and fine-grained controllable image generation. This technology can accurately render complex attributes such as color, texture, shape, quantity and text, making it suitable for application scenarios that require precise layout and image generation. Its main advantages include efficient layout guidance integration, powerful image generation capabilities and support for large-scale data sets. CreatiLayout was jointly developed by Fudan University and ByteDance to promote the application of image generation technology in the field of creative design.
DiffSensei is a customized comic generation model that combines multimodal large language models (LLMs) and diffusion models. It can generate controllable black and white comic panels based on user-provided text prompts and character images, with flexible character adaptability. The importance of this technology is that it combines natural language processing with image generation, providing new possibilities for comic creation and personalized content generation. The DiffSensei model has attracted attention for its high-quality image generation, diverse application scenarios, and efficient use of resources. Currently, the model is public on GitHub and can be downloaded and used for free, but specific use may require certain computing resources.
DynamicControl is a framework for improving control over text-to-image diffusion models. It supports adaptive selection of different numbers and types of conditions by dynamically combining diverse control signals to synthesize images more reliably and in detail. The framework first uses a dual-loop controller to generate initial true score rankings for all input conditions using pre-trained conditional generative and discriminative models. Then, an efficient condition evaluator is built through multimodal large language model (MLLM) to optimize condition ranking. DynamicControl jointly optimizes MLLM and diffusion models, leveraging the inference capabilities of MLLM to facilitate multi-condition text-to-image tasks. The final sorted conditions are input to the parallel multi-control adapter, which learns feature maps of dynamic visual conditions and integrates them to adjust ControlNet and enhance control of the generated images.
InvSR is a diffusion inversion-based image super-resolution technique that leverages the rich image priors in a large pre-trained diffusion model to improve super-resolution performance. The technique constructs an intermediate state of the diffusion model as the starting sampling point via a partial noise prediction strategy, and uses a deep noise predictor to estimate the optimal noise map used to initialize sampling along the forward diffusion process and generate high-resolution results. InvSR supports an arbitrary number of sampling steps, from one to five, and achieves performance better than or comparable to existing state-of-the-art methods even with single-step sampling.
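A minimal sketch of the general diffusion-inversion idea the entry describes, under the assumption of a generic epsilon-prediction denoiser and an illustrative noise schedule: sampling starts from a partially noised upscale of the input rather than pure noise, and a single reverse step already yields a result. The toy_denoiser and schedule values are placeholders, not InvSR's actual components.

```python
import torch
import torch.nn.functional as F

def toy_denoiser(x_t: torch.Tensor, t: int) -> torch.Tensor:
    """Placeholder for a pretrained diffusion model's noise prediction eps_theta(x_t, t)."""
    return torch.zeros_like(x_t)

def sr_by_partial_inversion(lr_image: torch.Tensor, scale: int = 4,
                            start_t: int = 250, total_t: int = 1000) -> torch.Tensor:
    # 1) Upsample the low-res input to the target resolution.
    x0_coarse = F.interpolate(lr_image, scale_factor=scale, mode="bicubic")
    # 2) Jump to an *intermediate* diffusion state instead of starting from pure Gaussian noise;
    #    InvSR's noise predictor estimates this noise map, here we simply sample it.
    alphas_cumprod = torch.linspace(0.9999, 0.01, total_t)   # illustrative schedule
    a = alphas_cumprod[start_t]
    noise = torch.randn_like(x0_coarse)
    x_t = a.sqrt() * x0_coarse + (1 - a).sqrt() * noise
    # 3) A single DDIM-style reverse step back to an x0 estimate (InvSR supports 1-5 steps).
    eps = toy_denoiser(x_t, start_t)
    return (x_t - (1 - a).sqrt() * eps) / a.sqrt()

hr = sr_by_partial_inversion(torch.rand(1, 3, 64, 64))
print(hr.shape)  # torch.Size([1, 3, 256, 256])
```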
ColorFlow is a model designed for colorizing image sequences, with special emphasis on preserving identity information of characters and objects during the colorization process. The model leverages contextual information to accurately generate colors for different elements in a sequence of black and white images, such as a character's hair and clothing, based on a pool of reference images and ensures color consistency with the reference images. Through a three-stage diffusion model framework, ColorFlow proposes a novel retrieval-enhanced colorization process that enables image colorization with correlated color references without the need for per-identity fine-tuning or explicit identity embedding extraction. The main advantages of ColorFlow include its ability to provide high-quality coloring results while preserving identity information, which has important market value for coloring cartoon or comic series.
Leffa is a unified framework for controllable human image generation that enables precise control of a character's appearance (e.g., virtual try-on) and pose (e.g., pose transfer). The model reduces detail distortion while maintaining high image quality by guiding target queries to attend to the corresponding regions of reference images during training. A key advantage of Leffa is that it is model-agnostic and can be used to improve the performance of other diffusion models.
HelloMeme is a diffusion model integrated with Spatial Knitting Attentions for embedding high-level and detail-rich conditions. This model supports the generation of images and videos, and has the advantages of improving expression consistency between generated videos and driven videos, reducing VRAM usage, and optimizing algorithms. HelloMeme, developed by the HelloVision team and owned by HelloGroup Inc., is a cutting-edge image and video generation technology with important commercial and educational value.
Color-diffusion is an image colorization project based on the diffusion model, which uses the LAB color space to colorize black-and-white images. Its main advantage is the ability to use the existing grayscale information (L channel) to predict the color information (A and B channels) by training the model. This technique is of great value in image processing, especially for old-photo restoration and artistic creation. Color-diffusion is an open source project that the author built quickly out of curiosity and to gain experience training diffusion models from scratch. The project is currently free and has plenty of room for improvement.
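The core idea, conditioning on the L (lightness) channel and generating the A/B chroma channels, can be sketched as follows; the conversion uses scikit-image, and chroma_denoiser is a hypothetical stand-in for the project's trained model rather than its real API.

```python
import numpy as np
import torch
from skimage import color

def chroma_denoiser(noisy_ab: torch.Tensor, l_channel: torch.Tensor, t: int) -> torch.Tensor:
    """Placeholder for a trained diffusion model that predicts noise on the A/B channels,
    conditioned on the grayscale L channel."""
    return torch.zeros_like(noisy_ab)

def colorize(gray_rgb: np.ndarray, steps: int = 50) -> np.ndarray:
    lab = color.rgb2lab(gray_rgb)                          # L in [0, 100], A/B roughly [-128, 127]
    l = torch.from_numpy(lab[..., :1]).permute(2, 0, 1).float().unsqueeze(0)
    ab = torch.randn(1, 2, *gray_rgb.shape[:2])            # start the A/B channels from Gaussian noise
    for t in reversed(range(steps)):
        eps = chroma_denoiser(ab, l, t)
        ab = ab - eps / steps                              # stand-in for a real reverse-diffusion update
    ab_np = ab[0].permute(1, 2, 0).numpy().clip(-1, 1) * 110
    return color.lab2rgb(np.concatenate([lab[..., :1], ab_np], axis=-1))

print(colorize(np.random.rand(64, 64, 3)).shape)           # (64, 64, 3)
```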
AnchorCrafter is an innovative diffusion model system designed to generate 2D videos containing target people and customized objects, achieving high visual fidelity and controllable interaction through the integration of human-object interaction (HOI). The system enhances the ability to recognize the appearance of objects from any multiple perspectives through HOI-appearance perception and separates the appearance of people and objects; HOI-motion injection achieves complex human-object interaction by overcoming the challenges of object trajectory conditions and mutual occlusion management. In addition, the HOI region reweighted loss is used as a training target, which enhances the learning of object details. This technology maintains object appearance and shape awareness while also maintaining consistency in character appearance and movement, which is of great significance to areas such as online commerce, advertising, and consumer engagement.
text-to-pose is a research project that aims to generate human poses from text descriptions and use these poses to generate images. The technology combines natural language processing and computer vision to enable text-to-image generation by improving the control and quality of diffusion models. The project background is based on papers published at the NeurIPS 2024 Workshop, which is innovative and cutting-edge. Key advantages of the technology include improved accuracy and controllability of image generation, as well as potential applications in areas such as artistic creation and virtual reality.
DiffusionDrive is a truncated diffusion model for real-time end-to-end autonomous driving that speeds up computation by reducing diffusion denoising steps while maintaining high accuracy and diversity. The model learns directly from human demonstrations, enabling real-time autonomous driving decisions without complex pre- or post-processing steps. DiffusionDrive achieved a breakthrough score of 88.1 PDMS on the NAVSIM benchmark and was able to run at 45 FPS.
TryOffDiff is a diffusion model-based high-fidelity clothing reconstruction technique that generates standardized garment images from a single photo of a person wearing the garment. This technology differs from traditional virtual try-on in that it aims to extract canonical images of garments, which poses unique challenges in capturing garment shape, texture, and complex patterns. TryOffDiff ensures high fidelity and detail preservation by using Stable Diffusion and SigLIP-based visual conditioning. Experiments on the VITON-HD dataset show that the method outperforms baselines based on pose transfer and virtual try-on while requiring fewer pre- and post-processing steps. TryOffDiff not only improves the quality of e-commerce product images but also advances the evaluation of generative models and inspires future work on high-fidelity reconstruction.
Diffusion Self-Distillation is a diffusion model-based self-distillation technology for zero-shot customized image generation. This technology allows artists and users to generate their own datasets through pre-trained text-to-image models without large amounts of paired data, and then fine-tune the models to achieve text- and image-conditioned image-to-image tasks. This approach outperforms existing zero-shot methods in maintaining performance on the identity generation task and is comparable to per-instance tuning techniques without test-time optimization.
CAT4D is a technology that uses multi-view video diffusion models to generate 4D scenes from monocular videos. It can convert input monocular video into multi-view video and reconstruct dynamic 3D scenes. The importance of this technology lies in its ability to extract and reconstruct complete information of three-dimensional space and time from single-view video data, providing powerful technical support for fields such as virtual reality, augmented reality, and three-dimensional modeling. Product background information shows that CAT4D was jointly developed by researchers from Google DeepMind, Columbia University and UC San Diego. It is a case in which cutting-edge scientific research results are transformed into practical applications.
OneDiffusion is a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding, covering a variety of tasks. The model is expected to have code and checkpoints released in early December. The importance of OneDiffusion lies in its ability to handle image synthesis and understanding tasks, which is an important advancement in the field of artificial intelligence, especially in image generation and recognition. Product background information shows that this is a project jointly developed by multiple researchers, and its research results have been published on arXiv.
JoyVASA is an audio-driven portrait animation technology based on a diffusion model that generates facial dynamics and head movements by separating dynamic facial expressions and static 3D facial representations. This technology not only improves video quality and lip sync accuracy, but also extends to animal facial animation, supports multiple languages, and improves training and inference efficiency. Key advantages of JoyVASA include longer video generation capabilities, character identity-independent motion sequence generation, and high-quality animation rendering.
Fashion-VDM is a video diffusion model (VDM) for generating virtual try-on videos. The model accepts an image of a garment and a video of a person as input, and aims to generate a high-quality try-on video of the person wearing the given garment while preserving the person's identity and movements. Compared with traditional image-based virtual try-on, Fashion-VDM performs well in clothing detail and temporal consistency. The main advantages of this technology include a diffusion architecture, classifier-free guidance for enhanced control, a progressive temporal training strategy for single-pass 64-frame 512px video generation, and the effectiveness of joint image-video training. Fashion-VDM sets a new industry standard for video virtual try-on.
InstantIR is a blind image restoration method based on the diffusion model that can handle unknown degradations at test time and improve the generalization ability of the model. During inference, it dynamically adjusts the generation conditions by generating a reference image, providing robust conditioning for restoration. The main advantages of InstantIR include the ability to restore the details of extremely degraded images, produce realistic textures, and enable creative restoration by adjusting the generation reference through text descriptions. The technology was jointly developed by researchers from Peking University, the InstantX team, and the Chinese University of Hong Kong, with sponsorship support from HuggingFace and fal.ai.
PromptFix is a comprehensive framework that enables diffusion models to follow human instructions to perform various image processing tasks. This framework builds a large-scale instruction following data set, proposes a high-frequency guided sampling method to control the denoising process, and designs an auxiliary prompt adapter to use a visual language model to enhance text prompts and improve the model's task generalization ability. PromptFix outperforms previous methods in a variety of image processing tasks and exhibits superior zero-shot capabilities in blind recovery and combination tasks.
MarDini is a video diffusion model launched by Meta AI Research that integrates the advantages of masked autoregression (MAR) into a unified diffusion model (DM) framework. The model can generate video at any frame position based on any number of mask frames, and supports a variety of video generation tasks such as video interpolation, image-to-video generation, and video expansion. MarDini is designed to be efficient, allocating most of the computing resources to low-resolution planning models, making spatial-temporal attention possible at large scales. MarDini sets a new benchmark in video interpolation and efficiently generates videos comparable to more expensive advanced image-to-video models within a few inference steps.
FasterCache is an innovative training-free strategy designed to accelerate the inference of video diffusion models while generating high-quality video content. The significance of this technique is that it can substantially improve the efficiency of video generation while maintaining or improving content quality, which is very valuable for industries that need to produce video content quickly. FasterCache was developed by researchers from the University of Hong Kong, Nanyang Technological University, and Shanghai Artificial Intelligence Laboratory, and the project page provides more visual results and detailed information. The product is currently available for free and is mainly targeted at video content generation, AI research and development, and related fields.
genmoai/models is an open source video generation model that represents the latest progress in video generation technology. The model, named Mochi 1, is a 10-billion-parameter diffusion model based on the Asymmetric Diffusion Transformer (AsymmDiT) architecture. It is trained from scratch and is the largest openly released video generation model to date. It features high-fidelity motion and strong prompt adherence, significantly narrowing the gap between closed and open video generation systems. The model is released under the Apache 2.0 license, and users can try it out for free on Genmo's playground.
Stable Diffusion 3.5 Large Turbo is a Multimodal Diffusion Transformer (MMDiT) model for text-based image generation that uses adversarial diffusion distillation (ADD) technology to improve image quality, typography, complex prompt understanding, and resource efficiency, with a special focus on reducing inference steps. The model performs well in generating images, is able to understand and generate complex text prompts, and is suitable for a variety of image generation scenarios. It is released on the Hugging Face platform and follows the Stability Community License, which is suitable for research, non-commercial use, and free use by organizations or individuals with annual income of less than $1 million.
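The model is typically driven through the diffusers library; a hedged usage sketch is below. The repository id, step count, and guidance setting are taken as assumptions about the release, so verify them against the model card before use.

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Assumed repo id; the Turbo variant is distilled (ADD) to run in very few steps.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large-turbo", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    "a macro photo of a dew-covered spider web at sunrise",
    num_inference_steps=4,   # distilled models target a handful of steps
    guidance_scale=0.0,      # distilled models are typically run without CFG
).images[0]
image.save("sd35_turbo.png")
```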
Stable Diffusion 3.5 Large is a Multimodal Diffusion Transformer (MMDiT) model that generates images from text, developed by Stability AI. The model offers significant improvements in image quality, typography, complex prompt understanding, and resource efficiency. It uses three fixed pre-trained text encoders and improves training stability through QK normalization. In terms of training data and strategy, it uses both synthetic data and filtered publicly available data. Under the Community License, the Stable Diffusion 3.5 Large model is free for research, non-commercial use, and commercial use by organizations or individuals with annual revenue of less than $1 million.
ACE is an all-round creator and editor based on a diffusion transformer, which achieves joint training across multiple visual generation tasks through a unified conditional format, the Long-context Condition Unit (LCU). ACE addresses the lack of training data through efficient data collection methods and generates accurate text instructions with multimodal large language models. ACE has significant performance advantages in visual generation, making it easy to build chat systems that respond to any image creation request and avoiding the cumbersome pipelines typically employed by visual agents.
Inverse Painting is a diffusion model-based method that generates time-lapse videos of the painting process from a target painting. The technology learns the painting process of real artists through training, can handle multiple art styles, and generates videos similar to the painting process of human artists. It combines text and region understanding, defines a set of painting instructions, and updates the canvas using a novel diffusion-based renderer. This technique is not only capable of handling the limited acrylic painting styles in which it was trained, but also provides reasonable results for a wide range of art styles and genres.
HelloMeme is a diffusion model integrating Spatial Knitting Attentions, aiming to embed high-fidelity and rich conditions into the image generation process. The technology generates videos by extracting the features of each frame of the driving video and using them as input to the HMControlModule. By further optimizing the Animatediff module, the continuity and fidelity of the generated videos are improved. In addition, HelloMeme supports facial expressions driven by ARKit facial blend shapes, as well as Lora or Checkpoint modules based on SD1.5, implementing a hot-swappable adapter for the framework that does not affect the generalization ability of the T2I model.
Diffusers Image Outpaint is an image outpainting technique based on the diffusion model, which can generate additional parts of an image based on the existing image content. This technology has broad application prospects in image editing, game development, virtual reality, and other fields. It uses advanced machine learning algorithms to make image generation more natural and realistic, providing users with an innovative image-processing method.
InstantDrag is an optimization-free pipeline that enhances interactivity and speed by using only an image and a drag instruction as input. The technology consists of two carefully designed networks: a drag-conditioned optical flow generator (FlowGen) and an optical-flow-conditioned diffusion model (FlowDiffusion). InstantDrag learns the motion dynamics of drag-based image editing on real-world video datasets by decomposing the task into motion generation and motion-conditioned image generation. Its ability to quickly produce photorealistic edits without masks or text prompts makes it a promising solution for interactive, real-time applications.
OmniGen is an innovative diffusion framework that unifies multiple image generation tasks into a single model without the need for task-specific networks or fine-tuning. This technology simplifies the image generation process, improves efficiency, and reduces development and maintenance costs.
Concept Sliders is a technique for precise control of concepts in diffusion models. It is applied on top of pre-trained models through low-rank adapters (LoRA), allowing artists and users to train sliders that control the direction of specific attributes using simple text descriptions or image pairs. The main advantage of this technique is the ability to make subtle adjustments to the generated image, such as eye size or lighting, without changing its overall structure, allowing finer control. It gives artists a new means of creative expression while mitigating the problem of blurry or distorted outputs.
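The mechanism described here, a low-rank delta whose strength is scaled at inference time to slide along an attribute direction, can be sketched in plain PyTorch; the layer, rank, and scale values are illustrative and not the project's actual code.

```python
import torch
import torch.nn as nn

class SliderLinear(nn.Module):
    """A frozen linear layer plus a low-rank (LoRA) delta whose scale acts as the 'slider'."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B, zero-initialized
        nn.init.zeros_(self.up.weight)
        self.scale = 0.0                                             # slider position

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

layer = SliderLinear(nn.Linear(320, 320))
x = torch.randn(1, 320)
for s in (-2.0, 0.0, 2.0):            # e.g. smaller ... neutral ... larger eyes
    layer.scale = s
    _ = layer(x)                      # in a real pipeline this would wrap attention projections
```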
This product is an image-to-video diffusion model that can generate continuous video sequences with coherent motion from a pair of key frames through lightweight fine-tuning technology. This method is particularly suitable for scenarios where a smooth transition animation needs to be generated between two static images, such as animation production, video editing, etc. It leverages the power of large-scale image-to-video diffusion models by fine-tuning them to predict videos between two keyframes, achieving forward and backward consistency.
Follow-Your-Canvas is a diffusion model-based video outpainting technology capable of generating high-resolution video content. The technology overcomes GPU memory limitations through distributed processing and spatial window merging while maintaining the spatial and temporal consistency of the video. It excels at large-scale video outpainting, capable of significantly increasing video resolution, for example from 512×512 to 1152×2048, while producing high-quality and visually pleasing results.
DiPIR is a physics-based method jointly developed by the Toronto AI Lab and NVIDIA Research that enables virtual objects to be realistically inserted into indoor and outdoor scenes by recovering scene lighting from a single image. The technology not only optimizes materials and tone mapping, but also automatically adjusts to different environments to improve the realism of images.
GameNGen is a game engine driven entirely by a neural model, enabling real-time interaction with complex environments and maintaining high quality over long trajectories. It can interactively simulate the classic game "DOOM" at more than 20 frames per second, and its next-frame prediction reaches a PSNR of 29.4, comparable to lossy JPEG compression. Human evaluators were only slightly better than chance at distinguishing gameplay clips from simulation clips. GameNGen is trained in two stages: (1) an RL agent learns to play the game, and the actions and observations of its training sessions are recorded as training data for the generative model; (2) a diffusion model is trained to predict the next frame conditioned on the past sequence of actions and observations. Conditioning augmentations enable stable autoregressive generation over long trajectories.
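Stage two, a diffusion model that predicts the next frame conditioned on past frames and actions, can be illustrated with a toy conditioning setup; the architecture, shapes, and noising below are hypothetical and only show how the recorded observations and actions enter the model.

```python
import torch
import torch.nn as nn

class NextFrameDenoiser(nn.Module):
    """Toy denoiser conditioned on a stack of past frames and an action embedding."""
    def __init__(self, context_frames: int = 4, num_actions: int = 8):
        super().__init__()
        self.action_embed = nn.Embedding(num_actions, 16)
        # Input channels: noisy next frame (3) + past frames (3 * context) + action embedding (16).
        self.net = nn.Conv2d(3 + 3 * context_frames + 16, 3, kernel_size=3, padding=1)

    def forward(self, noisy_next, past_frames, action, t):
        a = self.action_embed(action)[:, :, None, None].expand(-1, -1, *noisy_next.shape[-2:])
        return self.net(torch.cat([noisy_next, past_frames.flatten(1, 2), a], dim=1))

model = NextFrameDenoiser()
past = torch.rand(2, 4, 3, 64, 64)            # recorded observations from the RL agent (stage 1)
action = torch.randint(0, 8, (2,))            # recorded actions
target = torch.rand(2, 3, 64, 64)             # the true next frame
noise = torch.randn_like(target)
noisy = 0.7 * target + 0.7 * noise            # illustrative noising at some timestep
loss = nn.functional.mse_loss(model(noisy, past, action, t=None), noise)
loss.backward()
```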
ml-mdm is a Python package for efficient training of high-quality text-to-image diffusion models. The model uses the Matryoshka diffusion model technique to train a single pixel-space model at a resolution of 1024x1024 pixels, demonstrating strong zero-shot generalization capabilities.
TexGen is an innovative multi-view sampling and resampling framework for synthesizing 3D textures from arbitrary textual descriptions. It utilizes pre-trained text-to-image diffusion models, multi-view sampling strategies through consistent view sampling and attention guidance, and noise resampling techniques to significantly improve the texture quality of 3D objects with a high degree of view consistency and rich appearance details.
CatVTON is a virtual try-on technology based on the diffusion model, featuring a lightweight network (899.06M parameters in total), efficient parameter training (49.57M trainable parameters), and simplified inference (<8G VRAM at 1024×768 resolution). It achieves fast and efficient virtual try-on through a simplified network structure and inference process, and is especially suitable for the fashion industry and personalized recommendation scenarios.
DiT-MoE is a diffusion transformer model implemented using PyTorch, capable of scaling to 16 billion parameters, demonstrating highly optimized inference capabilities while competing with dense networks. It represents the cutting-edge technology in the field of deep learning when processing large-scale data sets and has important research and application value.
TCAN is a novel portrait animation framework based on the diffusion model that maintains temporal consistency and generalizes well to unseen domains. The framework uses unique modules such as an appearance-pose adaptation (APPA) layer, a temporal control network, and a pose-driven temperature map to ensure that the generated video preserves the appearance of the source image and follows the pose of the driving video while maintaining background consistency.
RodinHD is a high-fidelity 3D avatar generation technology based on the diffusion model. It was developed by researchers such as Bowen Zhang and Yiji Cheng. It aims to generate detailed 3D avatars from a single portrait image. This technology solves the shortcomings of existing methods in capturing complex details such as hairstyles. It integrates regularization terms through novel data scheduling strategies and weights to improve the decoder's ability to render sharp details. In addition, through multi-scale feature representation and cross-attention mechanism, the guidance effect of portrait images is optimized. The generated 3D avatar is significantly better in details than previous methods and can be generalized to wild portrait input.
AsyncDiff is an asynchronous denoising acceleration scheme for parallelizing diffusion models. It enables parallel processing of the model by splitting the noise prediction model into multiple components and distributing them to different devices. This approach significantly reduces inference latency with minimal impact on generation quality. AsyncDiff supports multiple diffusion models, including Stable Diffusion 2.1, Stable Diffusion 1.5, Stable Diffusion x4 Upscaler, Stable Diffusion XL 1.0, ControlNet, Stable Video Diffusion, and AnimateDiff.
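The underlying trick, letting each component of the split denoiser consume the previous component's output from the preceding denoising step so that the components could run concurrently on separate devices, can be illustrated with toy modules on CPU; this is a conceptual sketch, not AsyncDiff's API.

```python
import torch
import torch.nn as nn

# Three toy sub-networks standing in for consecutive chunks of a noise-prediction network.
components = nn.ModuleList([nn.Linear(8, 8) for _ in range(3)])

def sequential_step(x):
    """Baseline: fully sequential denoiser call (each part waits for the previous one)."""
    h = x
    for comp in components:
        h = comp(h)
    return h

def async_step(x, cached):
    """Each component uses the *stale* activation cached from the previous denoising step,
    so in a multi-GPU setup all components could run at the same time."""
    new_cache = [components[0](x)]
    for i in range(1, len(components)):
        new_cache.append(components[i](cached[i - 1]))
    return new_cache[-1], new_cache

x = torch.randn(1, 8)
cached = [torch.zeros(1, 8) for _ in components]   # warm-up values for the first step
for t in range(4):                                  # a few denoising steps
    eps, cached = async_step(x, cached)
    x = x - 0.1 * eps                               # illustrative update; quality depends on staleness
```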
Flash Diffusion is an efficient image generation model that generates high-quality images in fewer steps and is suitable for a variety of image processing tasks, such as text-to-image, inpainting, super-resolution, etc. The model achieves state-of-the-art performance on the COCO2014 and COCO2017 datasets with low training time and low number of parameters.
UniAnimate is a unified video diffusion model framework for human image animation. It reduces optimization difficulty and ensures temporal coherence by mapping reference images, pose guidance, and noisy videos into a common feature space. UniAnimate can handle long sequences and supports random noise input and first frame conditional input, significantly improving the ability to generate long-term videos. Furthermore, it explores alternative temporal modeling architectures based on state-space models as a replacement for the original computationally intensive temporal Transformer. UniAnimate achieves synthetic results that outperform existing state-of-the-art techniques in both quantitative and qualitative evaluations, and is able to generate highly consistent one-minute videos by iteratively using a first-frame conditional strategy.
HOI-Swap is a video editing framework based on the diffusion model, focusing on the complexity of hand-object interaction in video editing. In its first stage, the model is trained with self-supervision to swap objects in a single frame and learns to adjust hand interaction patterns, such as the grip, to changes in object attributes. The second stage extends single-frame editing to the entire video sequence, enabling high-quality video editing through motion alignment and video generation.
Hallo is a portrait image animation technology developed by Fudan University that uses diffusion models to generate realistic and dynamic portrait animations. Unlike traditional intermediate facial representations that rely on parametric models, Hallo adopts an end-to-end diffusion paradigm and introduces a layered audio-driven visual synthesis module to enhance the alignment accuracy between audio input and visual output, including lips, expressions, and gesture movements. This technology provides adaptive control of the diversity of expressions and postures, can more effectively achieve personalized customization, and is suitable for people with different identities.
Bootstrap3D is a framework for improving 3D content creation that addresses the scarcity of high-quality 3D assets through synthetic data generation. It uses 2D and video diffusion models to generate multi-view images from text prompts, and uses the 3D-aware MV-LLaVA model to filter high-quality data and rewrite inaccurate captions. The framework has produced 1 million high-quality synthetic multi-view images with dense descriptive captions to address the shortage of high-quality 3D data. It further proposes a training timestep rearrangement (TTR) strategy that learns multi-view consistency through the denoising process while preserving the original 2D diffusion prior.
Era3D is an open source high-resolution multi-view diffusion model that generates high-quality images through an efficient row-wise attention mechanism. The model can generate multi-view color and normal images, with custom parameters supported for optimal results. Era3D is significant in the field of image generation because it provides a new way to generate realistic three-dimensional imagery.
ViViD is a new framework for video virtual try-on utilizing diffusion models. It extracts fine semantic features of clothing by designing a clothing encoder, and introduces a lightweight pose encoder to ensure spatiotemporal consistency and generate realistic video try-on effects. ViViD has collected the largest video virtual try-on data set with the most diverse clothing types and the highest resolution to date.
I2VEdit is an innovative video editing technology that extends edits made to a single frame to the entire video through pre-trained image-to-video models. The technology adaptively maintains the visual and motion integrity of the source video and effectively handles global edits, local edits, and moderate shape changes, which existing methods cannot. The core of I2VEdit consists of two main processes, coarse motion extraction and appearance refinement, with precise adjustment via coarse-grained attention matching. Furthermore, a skip-interval strategy is introduced to mitigate quality degradation during autoregressive generation of multiple video clips. Experimental results demonstrate I2VEdit's superior performance in fine-grained video editing, showing its ability to produce high-quality, temporally consistent output.
StreamV2V is a diffusion model that enables real-time video-to-video (V2V) translation via user prompts. Different from traditional batch processing methods, StreamV2V adopts streaming processing and can process infinite frames of video. Its core is to maintain a feature library that stores information from past frames. For newly incoming frames, StreamV2V directly fuses similar past features into the output by extending self-attention and direct feature fusion technology. The feature library is continuously updated by merging stored and new features, keeping it compact and information-rich. StreamV2V stands out for its adaptability and efficiency, seamlessly integrating with image diffusion models without the need for fine-tuning.
Make-An-Audio 2 is a text-to-audio generation technology based on the diffusion model, jointly developed by researchers from Zhejiang University, ByteDance, and the Chinese University of Hong Kong. The technology improves the quality of generated audio by using pre-trained large language models (LLMs) to parse text, optimizing semantic alignment and temporal consistency. It also designs a feedforward Transformer-based diffusion denoiser to improve the performance of variable-length audio generation and enhance the extraction of temporal information. Furthermore, the problem of temporal data scarcity is solved by using LLMs to convert large amounts of audio label data into audio text datasets.
TryOnDiffusion is an innovative image synthesis technology that simultaneously maintains clothing details and adapts to significant body posture and shape changes in a single network through the combination of two UNets (Parallel-UNet). This technology can adapt to different body postures and shapes while maintaining clothing details, solving the shortcomings of previous methods in detail maintenance and posture adaptation, and achieving industry-leading performance.
DIAMOND (DIffusion As a Model Of eNvironment Dreams) is a reinforcement learning agent trained inside a diffusion world model, built for world modeling in Atari games where visual detail is critical. It is trained on a subset of Atari games via autoregressive imagination, and pre-trained world models can be installed and tried out quickly.
Slicedit is a zero-shot video editing technology that utilizes a text-to-image diffusion model and combines spatiotemporal slicing to enhance temporal consistency in video editing. This technology is able to preserve the structure and motion of the original video while complying with the target text description. Through extensive experiments, Slicedit has been proven to have clear advantages in editing real-world videos.
CAT3D is a website that uses multi-view diffusion models to generate novel views of 3D scenes from any number of input images. It converts the generated views into an interactively renderable 3D representation through a robust 3D reconstruction pipeline. The entire process, including view generation and 3D reconstruction, takes just one minute.
MuLan is an open source multilingual diffusion model designed to provide diffusion model support for over 110 languages that can be used without additional training. Through adaptation technology, this model enables the diffusion model that originally required a large amount of training data and computing resources to quickly adapt to the new language environment, greatly expanding the application scope and language diversity of the diffusion model. The main advantages of MuLan include support for multiple languages, optimized memory usage, and providing rich resources to researchers and developers through the release of technical reports and code models.
Lumina-T2X is an advanced text-to-any-modality generation framework that converts text descriptions into vivid images, dynamic videos, detailed multi-view 3D images, and synthesized speech. The framework uses a flow-based large diffusion transformer (Flag-DiT) that scales up to 7 billion parameters and extends sequence lengths to 128K tokens. Lumina-T2X integrates images, videos, multi-views of 3D objects, and speech spectrograms into a spatiotemporal latent token space and can generate output at any resolution, aspect ratio, and duration.
IDM-VTON is a novel diffusion model for image-based virtual try-on tasks, which generates virtual try-on images with high realism and detail by combining high-level semantics and low-level features of visual encoders and UNet networks. The technology enhances the realism of generated images by providing detailed textual prompts, and further improves fidelity and realism in real-world scenarios through customization methods.
Imagine Flash is a novel diffusion model that enables high-fidelity, diverse sample generation using only one to three steps through a backward distillation framework. The model contains three key components: backward distillation, dynamically adaptive knowledge transfer, and noise correction techniques that significantly improve image quality and sample diversity at very low steps.
Diffusion-RWKV is a diffusion model based on the RWKV architecture, aiming to improve the scalability of diffusion models. It has been optimized for image generation tasks and can generate high-quality images. The model supports unconditional and class-conditional training and has good performance and scalability.
DreamWalk is a text-aware image generation method based on diffusion guidance, which allows fine-grained control over the style and content of images without fine-tuning the diffusion model or modifying its internal layers. It supports interpolation between multiple styles and spatially varying guidance functions, and can be applied broadly across diffusion models.
VAR is a new visual autoregressive modeling method that can surpass diffusion models and achieve more efficient image generation. It establishes power-law scaling laws for visual generation and exhibits zero-shot generalization. VAR provides a series of pre-trained models of different sizes for users to explore and use.
MuseV is a virtual human video generation framework based on the diffusion model, supports unlimited length video generation, and adopts a novel visual condition parallel denoising scheme. It provides a pre-trained virtual human video generation model, supports functions such as Image2Video, Text2Image2Video, Video2Video, etc., and is compatible with the Stable Diffusion ecosystem, including basic models, LoRA, ControlNet, etc. It supports multiple reference image technologies, such as IPAdapter, ReferenceOnly, ReferenceNet, IPAdapterFaceID, etc. The advantage of MuseV is that it can generate high-fidelity videos of unlimited length and is positioned in the field of video generation.
Make-Your-Anchor is a 2D avatar generation framework based on the diffusion model. With only about one minute of video footage, it can automatically generate an anchor-style video with precise upper-body and hand movements. The system uses a structure-guided diffusion model to render 3D mesh states into character appearances, and a two-stage training strategy effectively binds movements to specific appearances. To generate time-series videos of arbitrary length, the 2D U-Net of the frame-wise diffusion model is extended to 3D, and a simple and effective batch-overlapping temporal denoising module is proposed, breaking the video-length limit at inference time. Finally, an identity-specific face enhancement module is introduced to improve the visual quality of the facial region in the output video. Experiments show that the system outperforms existing techniques in visual quality, temporal consistency, and identity fidelity.
ObjectDrop is a supervised method designed to achieve photorealistic object removal and insertion. It leverages a counterfactual dataset and bootstrapped supervision. Its main capability is removing objects from an image along with their effects on the scene (such as occlusions, shadows, and reflections), as well as inserting objects into an image in a highly realistic way. It achieves object removal by fine-tuning a diffusion model on a small, specially captured dataset; for object insertion, it uses bootstrapped supervision to synthesize a large-scale counterfactual dataset with the removal model, trains on this dataset, and then fine-tunes on the real dataset to obtain a high-quality insertion model. Compared with previous methods, ObjectDrop significantly improves the realism of object removal and insertion.
MOTIA is a diffusion-based method built on test-time adaptation that leverages the intrinsic content and motion patterns of the source video to perform effective video outpainting. The method consists of two main stages, intrinsic adaptation and extrinsic rendering, and aims to improve the quality and flexibility of video outpainting.
ELLA (Efficient Large Language Model Adapter) is a lightweight method to equip existing CLIP-based diffusion models with powerful LLMs. ELLA improves prompt-following ability, enabling text-to-image models to understand long texts. It designs a timestep-aware semantic connector (TSC) to extract timestep-dependent conditions from a pre-trained LLM for the various denoising stages. The TSC dynamically adapts semantic features over different sampling timesteps, helping to condition the frozen U-Net at different semantic levels. ELLA performs well on benchmarks such as DPG-Bench, especially for dense prompts involving multiple objects, diverse attributes, and relationships.
SLD is a self-correcting LLM-controlled diffusion model framework that enhances generative models by integrating detectors to achieve accurate text-to-image alignment. The SLD framework supports image generation and fine editing, and is compatible with any image generator, such as DALL-E 3, without requiring additional training or data.
ResAdapter is a resolution adapter designed for diffusion models (such as Stable Diffusion), which can generate images of arbitrary resolutions and aspect ratios while maintaining style domain consistency. Different from multi-resolution generation methods that process static resolution images, ResAdapter directly generates dynamic resolution images, improving inference efficiency and reducing additional inference time.
DistriFusion is a training-free algorithm that leverages multiple GPUs to accelerate diffusion model inference without sacrificing image quality. DistriFusion can reduce latency based on the number of devices used while maintaining visual fidelity.
Neural Network Diffusion is a diffusion approach developed by the High Performance Computing and Artificial Intelligence Laboratory of the National University of Singapore. It uses the diffusion process to generate neural network parameters, showing that diffusion models can produce high-performing model weights in addition to images.
Sora is a text-conditioned video generation diffusion model trained at scale. It can generate high-definition videos up to one minute long, covering a wide range of visual data types and resolutions. Sora enables scalable video generation by training in the compressed latent space of videos and images, decomposing them into spacetime patches. Sora also demonstrates some ability to simulate the physical and digital worlds, such as three-dimensional consistency and interaction, suggesting that continuing to scale up video generation models is a promising path toward highly capable simulators.
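OpenAI's technical report describes decomposing compressed video latents into spacetime patches that serve as transformer tokens; a minimal patchification sketch with made-up sizes is shown below.

```python
import torch

def spacetime_patches(latent: torch.Tensor, pt: int = 2, ph: int = 4, pw: int = 4) -> torch.Tensor:
    """Split a video latent (B, C, T, H, W) into flattened spacetime patches (B, N, C*pt*ph*pw)."""
    b, c, t, h, w = latent.shape
    x = latent.reshape(b, c, t // pt, pt, h // ph, ph, w // pw, pw)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)            # (B, T', H', W', C, pt, ph, pw)
    return x.reshape(b, -1, c * pt * ph * pw)        # one token per spacetime patch

latent = torch.randn(1, 8, 16, 32, 32)               # illustrative compressed video latent
tokens = spacetime_patches(latent)
print(tokens.shape)                                   # torch.Size([1, 512, 256])
```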
Diffuse to Choose is a diffusion-based image inpainting model mainly used in virtual try-on scenarios. It can preserve the details of reference items while inpainting images and is capable of accurate semantic manipulation. By directly incorporating fine-grained features of the reference image into the latent feature maps of the main diffusion model and adding a perceptual loss to further preserve the reference item's details, the model achieves a good balance between fast inference and high-fidelity detail.
AnyDoor AI is a breakthrough image generation tool whose design is based on the diffusion model. It seamlessly embeds target objects into new, user-specified scene locations. AnyDoor first uses a segmenter to remove the background of the target object, and then uses an ID extractor to capture identity information (ID tokens). This information, along with target object details, is fed into a pre-trained text-to-image diffusion model. Guided by the extracted information and details, the model generates the desired image. What makes the model unique is that it does not require per-object parameter tuning. In addition, its powerful customization capabilities allow users to easily position and adjust objects in scene images, achieving high-fidelity and diverse zero-shot object-scene synthesis. Beyond photo editing, the tool also has broad application prospects in e-commerce. With AnyDoor, concepts such as "one-click clothing change" can be realized, swapping clothes on real-life models to provide users with a more personalized shopping experience. More broadly, AnyDoor can be understood as "one-click Photoshop compositing" or a "context-aware move tool": it offers seamless image integration and the ability to swap scene objects and place image objects into target locations. By harnessing advanced technology, AnyDoor essentially redefines image manipulation, promising more user-friendly applications in everyday interactions.
Make-A-Shape is a new 3D generative model designed to train efficiently on large-scale data, capable of leveraging 10 million publicly available shapes. We innovatively introduce a wavelet-tree representation to compactly encode shapes by formulating a subband coefficient filtering scheme, and then arrange the representation in a low-resolution grid with a subband coefficient packing scheme, making it amenable to generation by diffusion models. Furthermore, we propose a subband adaptive training strategy that enables our model to effectively learn to generate coarse and fine wavelet coefficients. Finally, we extend our framework to be controlled by additional input conditions, so that it can generate shapes from various modalities such as single/multi-view images, point clouds, and low-resolution voxels. In extensive experiments, we demonstrate applications such as unconditional generation, shape completion, and conditional generation. Our method not only surpasses the state of the art in delivering high-quality results, but also generates shapes efficiently, typically within 2 seconds under most conditions.
This paper proposes a simple and effective personalized image restoration method called dual-pivot tuning. The method consists of two steps: 1) leveraging the conditional information in the encoder for personalization by fine-tuning the conditional generative model; 2) fixing the generative model and adjusting the parameters of the encoder to adapt to the enhanced personalized prior. This produces natural images that preserve both personalized facial features and the image degradation characteristics. Experiments demonstrate that the method generates higher-fidelity facial images than non-personalized approaches.
This paper introduces a perceptual loss-based diffusion model that improves sample quality by directly incorporating perceptual loss into diffusion training. For conditional generation, this method only improves sample quality without affecting the conditional input, thus not sacrificing sample diversity. For unconditional generation, this approach also improves sample quality. The paper introduces the principle and experimental results of the method in detail.
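The idea of folding a perceptual term into diffusion training can be sketched as follows, using the widely available lpips package as the perceptual metric; the denoiser, schedule, and weighting lam are placeholders rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

perceptual = lpips.LPIPS(net="vgg")   # off-the-shelf perceptual metric, stands in for the paper's choice

def training_step(denoiser, x0, alphas_cumprod, lam=0.1):
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))
    a = alphas_cumprod[t][:, None, None, None]
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise          # forward diffusion
    eps_hat = denoiser(x_t, t)
    x0_hat = (x_t - (1 - a).sqrt() * eps_hat) / a.sqrt()  # predicted clean image
    # Standard noise-matching loss plus a perceptual term computed on the x0 estimate.
    return F.mse_loss(eps_hat, noise) + lam * perceptual(x0_hat, x0).mean()

toy_denoiser = lambda x, t: torch.zeros_like(x)           # placeholder network
loss = training_step(toy_denoiser, torch.rand(2, 3, 64, 64) * 2 - 1,
                     torch.linspace(0.999, 0.01, 1000))
print(float(loss))
```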
InstructVideo is a method for fine-tuning text-to-video diffusion models with rewards derived from human feedback. It recasts reward fine-tuning as an editing process, reducing fine-tuning cost and improving efficiency. It uses an established image reward model to provide reward signals through piecewise sparse sampling and temporally decaying rewards, significantly improving the visual quality of generated videos. InstructVideo not only improves the visual quality of generated videos but also maintains strong generalization. For more information, please visit the official website.
X-Adapter is a universal upgrade tool that enables pre-trained plug-in modules (e.g., ControlNet, LoRA) to be used directly with upgraded text-to-image diffusion models (e.g., SD-XL) without further retraining. By training an additional network to control the frozen upgraded model, X-Adapter retains the connectors of the old model and adds trainable mapping layers that bridge the decoders of the two model versions for feature remapping. The remapped features serve as guidance for the upgraded model. To enhance X-Adapter's guidance ability, a null-text training strategy is adopted. After training, a two-stage denoising strategy is introduced to align the initial latents of X-Adapter and the upgraded model. X-Adapter demonstrates universal compatibility with various plug-ins and enables plug-ins of different versions to work together, extending the capabilities of the diffusion community. Extensive experiments show that X-Adapter may have wider applications in upgraded foundational diffusion models.
Upscale-A-Video is a diffusion-based model that increases video resolution by taking a low-resolution video and text prompts as input. The model ensures temporal consistency through two key mechanisms: locally, it integrates temporal layers into the U-Net and VAE-Decoder to maintain consistency within short sequences; globally, it introduces a flow-guided recurrent latent propagation module that enhances overall video stability by propagating and fusing latents across the entire sequence. Thanks to the diffusion paradigm, the model also offers a trade-off between fidelity and quality by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation. Extensive experiments demonstrate that Upscale-A-Video surpasses existing methods on both synthetic and real-world benchmarks, as well as AI-generated videos, demonstrating impressive visual fidelity and temporal consistency.
MagicAnimate is an advanced diffusion model-based framework for human body image animation. It can generate animated videos from single images and dynamic videos with temporal consistency, maintain the characteristics of reference images, and significantly improve the fidelity of animations. MagicAnimate supports image animation using action sequences from a variety of sources, including animation across identities and unseen areas such as paintings and movie characters. It also integrates seamlessly with T2I diffusion models such as DALLE3, which can give dynamic actions to images generated based on text. MagicAnimate is jointly developed by the National University of Singapore Show Lab and Bytedance.