Found 45 AI tools
Hallo2 is a portrait image animation technology based on a latent diffusion generation model that generates high-resolution, long-duration videos driven by audio. It expands Hallo's capabilities with several design improvements, including long-duration video generation, 4K-resolution output, and enhanced expression control through text prompts. Hallo2's key advantages are high-resolution output, long-term stability, and text-prompt-based control, making it well suited to generating rich and diverse portrait animation content.
DreamMesh4D is a new framework that combines mesh representation with sparsely controlled deformation to generate high-quality 4D objects from monocular videos. Traditional methods that adopt implicit Neural Radiance Fields (NeRF) or explicit Gaussian splatting as the underlying representation struggle with spatial-temporal consistency and surface texture quality, and DreamMesh4D is designed to address these challenges. Drawing inspiration from modern 3D animation pipelines, it binds Gaussian splats to triangular mesh surfaces, enabling differentiable optimization of both texture and mesh vertices. The framework starts from a coarse mesh provided by a single-image 3D generation method and builds a deformation graph from uniformly sampled sparse control points to improve computational efficiency and provide additional constraints. Through two-stage learning, combining reference-view photometric loss, score distillation loss, and other regularization losses, it jointly learns the static surface Gaussians and mesh vertices together with a dynamic deformation network. DreamMesh4D outperforms previous video-to-4D generation methods in rendering quality and spatial-temporal consistency, and its mesh-based representation is compatible with modern geometry pipelines, demonstrating its potential in the 3D gaming and film industries.
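To make the mesh-binding idea concrete, here is a minimal Python/NumPy sketch (not the DreamMesh4D code): Gaussian centers are attached to mesh triangles via barycentric coordinates, so deforming the mesh vertices automatically carries the bound Gaussians along with the surface.

```python
# Illustrative sketch of binding Gaussian centers to mesh triangles.
import numpy as np

def bind_gaussians(faces, n_per_face=4, seed=0):
    """Sample Gaussian centers on each triangle, storing barycentric coordinates."""
    rng = np.random.default_rng(seed)
    face_ids, barys = [], []
    for f in range(len(faces)):
        u = rng.random((n_per_face, 2))
        u[u.sum(axis=1) > 1] = 1 - u[u.sum(axis=1) > 1]  # fold back into the simplex
        b = np.column_stack([1 - u.sum(axis=1), u])       # (n, 3) barycentric weights
        face_ids.append(np.full(n_per_face, f))
        barys.append(b)
    return np.concatenate(face_ids), np.concatenate(barys)

def gaussian_centers(vertices, faces, face_ids, barys):
    """Recompute Gaussian centers from (possibly deformed) vertices."""
    tri = vertices[faces[face_ids]]             # (N, 3, 3) triangle corners
    return np.einsum('nk,nkd->nd', barys, tri)  # barycentric interpolation

# Toy mesh: a single triangle that we then deform.
verts = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
faces = np.array([[0, 1, 2]])
fid, bary = bind_gaussians(faces)
print(gaussian_centers(verts, faces, fid, bary))                # rest pose
print(gaussian_centers(verts + [0., 0., 1.], faces, fid, bary)) # deformed pose
```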
Inverse Painting is a diffusion model-based method that generates time-lapse videos of the painting process from a target painting. The technique learns the painting process of real artists through training, can handle multiple art styles, and generates videos that resemble how a human artist would paint. It combines text and region understanding, defines a set of painting instructions, and updates the canvas using a novel diffusion-based renderer. Although trained only on a limited set of acrylic painting styles, it also produces reasonable results for a wide range of art styles and genres.
DepthFlow is a highly customizable parallax shader for animating your images. It is a free and open-source ImmersityAI alternative capable of converting images into videos with a 2.5D parallax effect. The tool renders quickly and supports a range of post-processing effects such as vignette, depth of field, and lens distortion. It exposes many adjustable parameters for creating flexible motion effects and ships with a variety of preset animations. It also supports video encoding and export, including H264, HEVC, and AV1 formats, and produces output without watermarks.
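The core 2.5D parallax effect can be illustrated with a short sketch. The following Python/NumPy example is not DepthFlow's implementation, just the underlying idea: each pixel is displaced in proportion to its depth, so a moving virtual camera makes the foreground slide past the background.

```python
# Minimal depth-based parallax: nearer pixels shift more for the same camera offset.
import numpy as np

def parallax_frame(image, depth, offset):
    """image: (H, W, 3) uint8; depth: (H, W) in [0, 1] (1 = near); offset: pixels."""
    h, w = depth.shape
    out = np.zeros_like(image)
    xs = np.arange(w)
    for y in range(h):
        shift = (offset * depth[y]).astype(int)
        src = np.clip(xs - shift, 0, w - 1)   # where each output pixel samples from
        out[y] = image[y, src]
    return out

# Toy example: a 4x8 image whose "near" right half shifts more than the far left half.
img = np.arange(4 * 8 * 3, dtype=np.uint8).reshape(4, 8, 3)
dep = np.concatenate([np.zeros((4, 4)), np.ones((4, 4))], axis=1)
frames = [parallax_frame(img, dep, off) for off in range(-3, 4)]  # a short loop
print(len(frames), frames[0].shape)
```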
Stable Video Portraits is an innovative hybrid 2D/3D generation method that utilizes a pre-trained text-to-image model (2D) and a 3D morphable model (3D) to generate realistic dynamic face videos. The technique upgrades a general 2D stable diffusion model into a video model through person-specific fine-tuning. By conditioning on a time series of 3D morphable model parameters and introducing a temporal denoising process, it generates temporally smooth face images that can be edited and transformed into a text-defined celebrity appearance without additional test-time fine-tuning. The method outperforms existing monocular head avatar methods in both quantitative and qualitative evaluations.
PhysGen is an innovative image-to-video generation method that converts single images and input conditions (e.g., forces and torques exerted on objects in the image) into realistic, physically plausible, and temporally coherent videos. The technology enables dynamic simulation in image space by combining model-based physical simulation with a data-driven video generation process. Key benefits of PhysGen include that the generated videos appear physically and visually realistic and can be precisely controlled, demonstrating its superiority over existing data-driven image-to-video generation efforts through quantitative comparisons and comprehensive user studies.
HelloMeme is a diffusion model integrating spatial weaving attention, aiming to embed high-fidelity and rich conditions into the image generation process. The technique generates videos by extracting features from each frame of the driving video and feeding them into an HMControlModule. By further optimizing the AnimateDiff module, the continuity and fidelity of the generated videos are improved. In addition, HelloMeme supports facial expressions driven by ARKit facial blendshapes, as well as SD1.5-based LoRA or checkpoint models, implementing a hot-swappable adapter for the framework that does not affect the generalization ability of the T2I model.
Robust Dual Gaussian Splatting (DualGS) is a novel Gaussian-based volumetric video representation method that captures complex human performances by optimizing joint Gaussians and skin Gaussians and achieves robust tracking and high-fidelity rendering. Demonstrated at SIGGRAPH Asia 2024, the technology enables real-time rendering on low-end mobile devices and VR headsets, providing a user-friendly and interactive experience. DualGS achieves a compression ratio of up to 120 times through a hybrid compression strategy, making the storage and transmission of volumetric videos more efficient.
This product is an image-to-video diffusion model that can generate continuous video sequences with coherent motion from a pair of keyframes through lightweight fine-tuning. The method is particularly suitable for scenarios that require a smooth transition animation between two static images, such as animation production and video editing. It leverages the power of large-scale image-to-video diffusion models by fine-tuning them to predict the video between two keyframes, achieving forward and backward consistency.
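A toy sketch of the forward/backward-consistency idea (with a stand-in for the real image-to-video model, since the actual model and its API are not specified here): generate a clip forward from the first keyframe, generate another from the second keyframe on the reversed timeline, and blend the two so that each frame is dominated by the temporally closer keyframe.

```python
# Toy keyframe in-betweening via forward and time-reversed generation.
import numpy as np

def fake_i2v(keyframe, n_frames):
    """Stand-in for an image-to-video model: drifts the keyframe over time."""
    t = np.linspace(0, 1, n_frames)[:, None, None, None]
    return keyframe[None] * (1 - 0.1 * t)   # placeholder "motion"

def inbetween(key_a, key_b, n_frames=16):
    fwd = fake_i2v(key_a, n_frames)          # A -> future
    bwd = fake_i2v(key_b, n_frames)[::-1]    # B -> past, then reversed in time
    w = np.linspace(1, 0, n_frames)[:, None, None, None]
    return w * fwd + (1 - w) * bwd           # the closer keyframe dominates

a = np.zeros((8, 8, 3)); b = np.ones((8, 8, 3))
clip = inbetween(a, b)
print(clip.shape, clip[0].mean(), clip[-1].mean())  # endpoints match the keyframes
```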
Animate3D is an innovative framework for animating any static 3D model. Its core idea consists of two main parts: 1) a new multi-view video diffusion model (MV-VDM), built on multi-view renderings of static 3D objects and trained on the accompanying large-scale multi-view video dataset (MV-Video); 2) on top of MV-VDM, a framework combining reconstruction and 4D Score Distillation Sampling (4D-SDS) that uses multi-view video diffusion priors to animate 3D objects. Animate3D enhances spatial and temporal consistency with a new spatiotemporal attention module and preserves the identity of the static 3D model through multi-view rendering. It also proposes an efficient two-stage process: first directly reconstructing motion from the generated multi-view videos, then refining appearance and motion through the introduced 4D-SDS.
EchoMimic is an advanced portrait image animation model capable of generating realistic portrait videos driven by audio and selected facial feature points individually or in combination. Through a novel training strategy, it solves the possible instability of traditional methods when driven by audio and the unnatural results that may be caused by facial key point driving. EchoMimic is comprehensively compared on multiple public and self-collected datasets and demonstrates superior performance in both quantitative and qualitative evaluations.
ControlNeXt is an open source image and video generation model that achieves faster convergence and superior efficiency by reducing trainable parameters by up to 90%. The project supports multiple forms of control information and can be combined with LoRA technology to change styles and ensure more stable generation effects.
VividDream is an innovative technology that generates explorable 4D scenes with environmental dynamics from a single input image or text prompt. It first expands the input image into a static 3D point cloud, then uses a video diffusion model to generate a collection of animated videos, and finally optimizes a 4D scene representation to achieve consistent motion and immersive scene exploration. This makes it possible to generate engaging 4D experiences from a wide variety of real-world images and text prompts.
Follow-Your-Emoji is a portrait animation framework based on the diffusion model, which can animate the target expression sequence onto the reference portrait while maintaining the consistency of portrait identity, expression delivery, temporal coherence and fidelity. By employing expression-aware landmarks and facial fine-grained loss techniques, it significantly improves the model's performance in controlling free-style human expressions, including real people, cartoons, sculptures, and even animals. In addition, it extends to stable long-term animation through a simple and effective stepwise generation strategy, increasing its potential application value.
ToonCrafter is an open source research project focused on interpolating two cartoon images using pretrained image-to-video diffusion priors. The project aims to positively impact the field of AI-driven video generation, offering users the freedom to create videos but requiring users to comply with local laws and use responsibly.
Lumina-T2X is an advanced text-to-any-modality generation framework that converts text descriptions into vivid images, dynamic videos, detailed multi-view 3D renderings, and synthesized speech. The framework uses a flow-based large diffusion transformer (Flag-DiT) that scales up to 7 billion parameters and extends sequence lengths to 128,000 tokens. Lumina-T2X integrates images, videos, multi-view renderings of 3D objects, and speech spectrograms into a unified spatio-temporal latent token space and can generate output at any resolution, aspect ratio, and duration.
StoryDiffusion is an open source image and video generation model that can generate coherent long sequences of images and videos through a consistent self-attention mechanism and motion predictors. The main advantage of this model is that it is able to generate character-consistent images and can be extended to video generation, providing users with a new way to create long videos. The model has a positive impact on the field of AI-driven image and video generation, and users are encouraged to use the tool responsibly.
PhysDreamer is a physics-based method that imparts interactive dynamics to static 3D objects by leveraging object dynamics priors learned by video generation models. This allows realistic responses to novel interactions, such as external forces or agent manipulations, to be simulated even without data on the physical properties of real objects. The realism of the synthesized interactions is evaluated through user studies, and PhysDreamer takes a step toward more engaging and realistic virtual experiences.
SC-GS is a new representation technique that models the motion and appearance of dynamic scenes with sparse control points and dense 3D Gaussians, respectively. It learns compact 6-DoF transformation bases at a small number of control points, which are locally interpolated with learned weights to obtain the motion field of the 3D Gaussians. A deformation MLP predicts the time-varying 6-DoF transformation of each control point, reducing learning complexity, enhancing learning capability, and producing spatio-temporally coherent motion. The 3D Gaussians, the canonical-space positions of the control points, and the deformation MLP are jointly learned to reconstruct the appearance, geometry, and dynamics of the 3D scene. During training, the positions and number of control points are adaptively adjusted to match the motion complexity of different regions, and an as-rigid-as-possible (ARAP) loss enforces spatial continuity and local rigidity of motion. Thanks to the explicit sparsity of the motion representation and its separation from appearance, the approach enables user-controlled motion editing while preserving high-fidelity appearance. Extensive experiments show that the method outperforms existing approaches in novel view synthesis and high-speed rendering, and supports new appearance-preserving motion editing applications.
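The sparse-control idea can be sketched in a few lines. The following Python/NumPy example is illustrative only (simple radial-basis weights stand in for SC-GS's learned interpolation weights): each control point carries a rigid transform, and every Gaussian center blends its neighbours' transforms to obtain its own motion.

```python
# Driving dense Gaussian centers from a few control points with blended rigid transforms.
import numpy as np

def blend_transforms(points, ctrl_pos, ctrl_R, ctrl_t, sigma=0.5):
    d2 = ((points[:, None, :] - ctrl_pos[None, :, :]) ** 2).sum(-1)   # (N, K)
    w = np.exp(-d2 / (2 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)                                 # interpolation weights
    # Apply each control point's rigid transform about its own position, then blend.
    local = points[:, None, :] - ctrl_pos[None, :, :]                 # (N, K, 3)
    moved = np.einsum('kij,nkj->nki', ctrl_R, local) + ctrl_pos + ctrl_t
    return (w[..., None] * moved).sum(axis=1)                         # (N, 3)

# Two control points: one rotates 90 degrees about z, the other stays put.
ctrl_pos = np.array([[0., 0., 0.], [3., 0., 0.]])
Rz = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
ctrl_R = np.stack([Rz, np.eye(3)])
ctrl_t = np.zeros((2, 3))
gaussians = np.random.default_rng(0).normal(size=(5, 3))
print(blend_transforms(gaussians, ctrl_pos, ctrl_R, ctrl_t))
```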
PhysAvatar is an innovative framework that combines inverse rendering and inverse physics to automatically estimate body shape, appearance, and the physical parameters of clothing from multi-view video data. It uses mesh-aligned 4D Gaussians for spatio-temporal mesh tracking and a physically based inverse renderer to estimate intrinsic material properties. PhysAvatar integrates a physics simulator to estimate the physical parameters of clothing in a principled manner using gradient-based optimization. These capabilities enable PhysAvatar to render high-quality novel views of avatars wearing loose clothing under motion and lighting conditions outside the training data.
SurMo is a new dynamic human body rendering paradigm that achieves high-fidelity human body rendering in a unified framework by jointly modeling temporal motion dynamics and human body appearance. This method uses a surface-based three-plane representation to efficiently encode human motion, and designs a physical motion decoding module and a 4D appearance decoding module, which can synthesize time-varying human appearance effects, such as clothing wrinkles, motion shadows, etc. Compared with existing methods, SurMo has significantly improved both quantitative and qualitative rendering indicators.
CameraCtrl provides precise camera pose control for text-to-video generation models, achieving camera control during video generation by training a camera encoder on parameterized camera trajectories. Through a comprehensive study of the effects of various datasets, the work shows that videos with diverse camera distributions and similar appearance enhance controllability and generalization. Experiments demonstrate that CameraCtrl is highly effective at precise, domain-adaptive camera control, marking an important step toward dynamic, customized video storytelling from text and camera pose input.
NUWA is a series of research projects developed by Microsoft, including NUWA, NUWA-Infinity, NUWA-LIP, Learning 3D Photography Videos, and NUWA-XL. These projects involve pre-trained models for visual synthesis, capable of generating or manipulating visual data, such as images and videos, to perform a variety of visual synthesis tasks.
Motion-I2V is a new framework for consistent and controllable image-to-video generation (I2V). Unlike previous methods that directly learn a complex image-to-video mapping, Motion-I2V decomposes I2V into two stages with explicit motion modeling. The first stage uses a diffusion-based motion field predictor that focuses on inferring the trajectories of the reference image's pixels. The second stage introduces motion-augmented temporal attention to strengthen the limited one-dimensional temporal attention in video latent diffusion models; this module effectively propagates reference image features to the synthesized frames, guided by the trajectories predicted in the first stage. Compared with existing methods, Motion-I2V generates more consistent videos even under large motion and viewpoint changes. By training a sparse trajectory control network for the first stage, Motion-I2V lets users precisely control motion trajectories and motion regions with sparse trajectory and region annotations, offering more controllability than text instructions alone. In addition, the second stage naturally supports zero-shot video-to-video translation. Qualitative and quantitative comparisons demonstrate that Motion-I2V outperforms previous approaches in consistent and controllable image-to-video generation.
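A minimal sketch of the second-stage propagation idea (not Motion-I2V's code, and using nearest-neighbour warping for brevity): given the per-pixel trajectories from stage one, reference features are warped along them to guide each synthesized frame.

```python
# Warping reference features along predicted trajectories (backward flow).
import numpy as np

def warp_features(ref_feat, flow):
    """ref_feat: (H, W, C); flow: (H, W, 2) backward flow in pixels (dy, dx)."""
    h, w, _ = ref_feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    src_y = np.clip((ys + flow[..., 0]).round().astype(int), 0, h - 1)
    src_x = np.clip((xs + flow[..., 1]).round().astype(int), 0, w - 1)
    return ref_feat[src_y, src_x]   # nearest-neighbour gather from the reference

# Toy trajectory: the whole feature map slides two pixels to the right per frame.
ref = np.random.default_rng(0).random((16, 16, 8))
frames = []
for t in range(1, 5):
    flow_t = np.zeros((16, 16, 2)); flow_t[..., 1] = -2 * t   # look back to the left
    frames.append(warp_features(ref, flow_t))
print(len(frames), frames[0].shape)
```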
UniVG is a unified multi-modal video generation system capable of handling multiple video generation tasks across text and image modalities. By introducing multi-condition cross-attention and biased Gaussian noise, it covers both high-freedom and low-freedom video generation. It achieves the lowest Fréchet Video Distance (FVD) on the public academic benchmark MSR-VTT, surpasses current open-source methods in human evaluation, and is on par with the closed-source method Gen-2.
This paper proposes a dynamic novel view synthesis method based on diffusion priors that generates new views of dynamic scenes from monocular videos. The method achieves geometric and scene consistency through fine-tuning on the video frames and knowledge distillation. Qualitative and quantitative experiments evaluate its effectiveness and robustness, demonstrating its advantages in complex scenarios.
DragNUWA is a video generation tool that converts user actions into camera movements or object movements by directly manipulating backgrounds or objects in an image to generate the corresponding video. DragNUWA 1.5 is based on Stable Video Diffusion and animates an image along user-specified paths. DragNUWA 1.0 utilizes text, images, and trajectories as three control factors to enable highly controllable video generation at the semantic, spatial, and temporal levels. Users can clone the repository via git, download the pre-trained model, and drag trajectories on an image to generate animations.
Audio to Photoreal Embodiment is a framework for generating full-body photorealistic humanoid avatars. It dynamically generates gestural motion of the face, body, and hands based on conversational dynamics. The key to the approach is combining the sample diversity of vector quantization with the high-frequency detail obtained through diffusion to produce more dynamic and expressive motion. The resulting motions are visualized through highly realistic humanoid avatars capable of expressing important nuances in gesture (such as mockery and arrogance). To facilitate this research direction, we introduce a first-of-its-kind multi-view conversational dataset that enables photorealistic reconstruction. Experiments demonstrate that our model generates appropriate and diverse motions, outperforming diffusion-only and vector-quantization-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (versus meshes) in accurately assessing subtle motion details in conversational poses. The code and dataset are available online.
Human101 is a framework for quickly reconstructing the human body from a single view. It is able to train a 3D Gaussian model in 100 seconds and render 1024-resolution images at over 60FPS without pre-stored Gaussian attributes for each frame. The Human101 pipeline is as follows: First, 2D human poses are extracted from single-view videos. Then, the pose is used to drive the 3D simulator to generate matching 3D skeleton animation. Finally, a time-related 3D Gaussian model is constructed based on animation and rendered in real time.
VOODOO 3D is a high-fidelity, 3D-aware one-shot head reenactment technique. Our approach transfers the driver's expression to the source subject and produces view-consistent renderings for holographic displays. The method is built on a fully volumetric neural disentanglement framework that separates source appearance from driver expression for 3D-aware one-shot head reconstruction. It runs in real time, produces high-fidelity and view-consistent output, and is suitable for 3D teleconferencing systems based on holographic displays. We demonstrate state-of-the-art performance on a variety of datasets and high-quality 3D-aware head reenactment on highly challenging and diverse subjects, including non-frontal head poses and complex expressions from both source and driver.
DreaMoving is a diffusion model-based controllable video generation framework for producing high-quality customized human dance videos. Given a target identity and a pose sequence, DreaMoving can generate a video of the target identity dancing anywhere, driven by the pose sequence. To this end, we propose a Video ControlNet for motion control and a Content Guider to preserve identity information. The model is easy to use and can be adapted to most stylized diffusion models to produce diverse results.
VividTalk is a one-shot audio-driven talking head generation technique based on 3D hybrid priors. It can generate realistic talking head videos with rich expressions, natural head poses, and accurate lip sync. The technique adopts a two-stage general framework to generate high visual quality talking head videos with all of the above characteristics. In the first stage, audio is mapped to a mesh by learning two types of motion: non-rigid expression motion and rigid head motion. For expression motion, both blendshapes and vertices are used as intermediate representations to maximize the model's representational ability; for natural head motion, a novel learnable head pose codebook is proposed together with a two-stage training mechanism. In the second stage, a dual-branch motion VAE and a generator convert meshes into dense motion and synthesize high-quality video frame by frame. Extensive experiments demonstrate that VividTalk generates high visual quality talking head videos with accurate lip sync and lifelike quality, outperforming previous state-of-the-art works in both objective and subjective comparisons. Code will be released publicly upon publication.
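The blendshape intermediate representation in the first stage can be illustrated with a toy sketch (random placeholder data, not VividTalk's model): per-frame blendshape weights, which in the real system would come from the audio-to-motion network, are combined with a fixed basis to produce deformed face vertices for each frame.

```python
# Blendshapes as an intermediate representation: vertices[t] = template + sum_k w[t,k] * basis[k].
import numpy as np

rng = np.random.default_rng(0)
n_vertices, n_blendshapes, n_frames = 500, 52, 30

template = rng.normal(size=(n_vertices, 3))                      # neutral face mesh
basis = 0.01 * rng.normal(size=(n_blendshapes, n_vertices, 3))   # per-shape vertex offsets

# Placeholder weights; in the real system these come from an audio-to-motion network.
t = np.linspace(0, 2 * np.pi, n_frames)
weights = 0.5 * (1 + np.sin(t[:, None] + np.arange(n_blendshapes)))  # (T, 52)

animated = template[None] + np.einsum('tk,kvd->tvd', weights, basis)
print(animated.shape)   # (30, 500, 3): one deformed mesh per frame
```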
GAIA aims to synthesize natural talking videos from speech and a single portrait image. We introduce GAIA (Generative AI for Avatar), which eliminates domain priors in talking avatar generation. GAIA is divided into two stages: 1) decompose each frame into motion and appearance representations; 2) generate motion sequences conditioned on the speech and a reference portrait image. We collected a large-scale, high-quality talking avatar dataset and trained the model at different scales. Experimental results verify the superiority, scalability, and flexibility of GAIA. The method combines a variational autoencoder (VAE) with a diffusion model, where the diffusion model is optimized to generate motion sequences conditioned on the speech sequence and a random frame from the video clip. GAIA can be used in different applications such as controllable talking avatar generation and text-guided avatar generation.
Sketch Video Synthesis is an optimization-based video sketch generation framework that represents video through frame-wise Bézier curves. It optimizes curve positions with a semantic loss and a newly designed consistency loss to generate impressionistic-style video sketches while maintaining temporal coherence. It can be used for video editing and video doodling, supporting flexible rendering of SVG strokes, including resizing, color filling, and overlaying doodles on the original background.
Animate Anyone is designed to generate character videos from static images via driving signals. We leverage the power of diffusion models to propose a new framework tailored for character animation. To maintain the consistency of complex appearance features from the reference image, we design ReferenceNet to incorporate detailed features via spatial attention. To ensure controllability and continuity, we introduce an efficient pose guider to direct the character's movements and adopt an effective temporal modeling method for smooth cross-frame transitions. By expanding the training data, our method can animate arbitrary characters, achieving excellent results in character animation compared to other image-to-video methods. Furthermore, we evaluate our method on benchmarks for fashion video and human dance synthesis, achieving state-of-the-art results.
SparseCtrl was developed to enhance control over text-to-video generation, with the ability to flexibly combine sparse signals for structural control with only one or a small number of inputs. It includes an additional conditional encoder to handle these sparse signals without affecting the pre-trained text-to-video model. The method is compatible with various modalities including sketch, depth and RGB images, providing more practical control for video generation and driving applications such as storyboarding, depth rendering, keyframe animation and interpolation. Extensive experiments demonstrate the generalization ability of SparseCtrl on both original and personalized text-to-video generators.
MagicDance is a novel and effective method for generating realistic human videos, enabling vivid motion and facial expression transfer, and consistent zero-tuning generation of 2D cartoon-style animations. With MagicDance, we can accurately generate results with consistent appearance, while original T2I models such as Stable Diffusion and ControlNet struggle to accurately maintain subject identity information. Furthermore, our proposed module can be considered as an extension/plug-in of the original T2I model without modifying its pre-trained weights.
FrameAI - AI Video Generator instantly turns your photos into AI videos. With advanced Deforum Stable Diffusion technology, it intelligently analyzes and enhances your content to produce visually appealing and engaging videos. Choose from a variety of carefully designed templates and styles, such as anime, clown, and cyberpunk, to quickly generate stunning AI videos. Seamless integration with popular social media platforms makes it easy to share your creations and impress your audience with beautiful AI-generated videos.
Kandinsky Deforum is a text-to-video animation model that combines the Kandinsky text-to-image model with Deforum-style animation features, converting text into video efficiently, quickly, and accurately. Its core method is to generate a reference frame, apply a small transformation to the previous frame, and re-diffuse the resulting image through an image-to-image step. The advantage of Kandinsky Deforum is that it can generate high-quality videos while offering good scalability and flexibility. The product aims to provide users with an efficient, fast, and accurate text-to-video generation model.
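The loop described above can be sketched schematically. In the following Python example the text-to-image and image-to-image calls are stubs (the real Kandinsky pipeline is not shown); it only illustrates how each frame is a small transform of the previous one, re-anchored to the prompt by an img2img step.

```python
# Schematic Deforum-style animation loop with stub diffusion calls.
import numpy as np

def text_to_image(prompt, size=64):
    """Stub for the reference-frame generator."""
    return np.random.default_rng(hash(prompt) % 2**32).random((size, size, 3))

def small_transform(frame, shift=2):
    """A tiny camera move: roll the image a few pixels (zoom/rotate would also work)."""
    return np.roll(frame, shift, axis=1)

def image_to_image(frame, prompt, strength=0.4):
    """Stub for the img2img step that re-anchors the frame to the prompt."""
    return (1 - strength) * frame + strength * text_to_image(prompt, frame.shape[0])

prompt = "a castle in the clouds"
frames = [text_to_image(prompt)]
for _ in range(23):
    frames.append(image_to_image(small_transform(frames[-1]), prompt))
print(len(frames), frames[0].shape)   # 24 frames of a drifting animation
```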
Endless AI Video Loops is an AI artistic video loop generator that converts pictures into infinitely looping videos. Users can generate video loops from their own images or from prompts provided by the app. It is the only AI artistic video loop generator that lets users easily create captivating video loops in just seconds. The app is completely free, with an in-app purchase option for lifetime unlimited credits.
MagicAvatar is a multimodal framework capable of generating/animating avatars by converting various input modes (text, video, and audio) into motion signals. It can create avatars from simple text prompts or create avatars that follow a given movement based on a given source video. Additionally, it can animate theme-specific avatars. The advantage of MagicAvatar is its ability to combine multiple input modes to generate high-quality avatars and animations.
Flythroughs is an application based on AI and 3D generation technology that helps users easily create professional 3D flythroughs. It uses the world's most advanced NeRF-based 3D generation technology to produce realistic 3D experiences from video, without any training or special equipment. Flythroughs also integrates a new 3D camera-path AI that generates a realistic 3D flythrough with one click. It is suited to real estate, construction, tourism, entertainment, and other fields, helping users showcase the flow and uniqueness of a space.
Aiweiwen is a tool that uses AI drawing to convert novel promotion posts into comic-style video explainers with one click, helping users monetize quickly. Through AI recognition and generation, work that originally took a day is shortened to about 10 minutes, improving video output efficiency. All images, subtitles, and dubbing are original, ensuring fully original video content.
Pix2Pix Video is a lightweight tool that converts images into photorealistic videos. It uses the Pix2Pix model and is able to generate high-quality videos that bring static images to life. Pix2Pix Video has a simple, easy-to-use interface: users only need to upload an image and set the relevant parameters to generate a striking video. It can be used in a variety of scenarios, such as animation production, virtual reality, and special effects. Pix2Pix Video is a powerful image processing tool that offers wide creative possibilities.
D-ID is a creative AI platform that uses AI technology to convert photos into videos. It easily generates videos from text, providing AI-driven, affordable video solutions for training materials, internal communications, marketing, and more. D-ID can also enable face-to-face conversations with chatbots, giving users a more immersive and human experience. D-ID also provides APIs and self-service studios for developers to use.