AI Hug Video Generator is an online platform that uses advanced machine learning technology to transform static photos into dynamic, lifelike hug videos. Users can create personalized, emotion-filled videos based on their precious photos. The technology creates photorealistic digital hugs by analyzing real human interactions, including subtle gestures and emotions. The platform provides a user-friendly interface, making it easy for both technology enthusiasts and video production novices to create AI hug videos. Additionally, the resulting video is high-definition and suitable for sharing on any platform, ensuring great results on every screen.
MIMO is a universal video synthesis model that can simulate anyone performing complex motions while interacting with objects. It synthesizes character videos with controllable attributes (characters, actions, and scenes) from simple user-provided inputs such as reference images, pose sequences, and scene videos or images. MIMO achieves this by encoding 2D video into compact spatial codes and decomposing them into three spatial components (the main character, the underlying scene, and floating occlusions). This approach allows flexible user control, spatial motion expression, and 3D-aware synthesis, making it suitable for interactive real-world scenarios.
LVCD is a reference-based line-art video colorization technique that uses a large-scale pre-trained video diffusion model to generate colorized animation videos. It combines Sketch-guided ControlNet and Reference Attention to colorize animation sequences with fast, large motions while preserving temporal coherence. Its main advantages are temporal coherence in the colorized output, the ability to handle large motions, and high-quality results.
ComfyUI-LumaAI-API is a plug-in designed for ComfyUI that lets users call the Luma AI API directly from ComfyUI. The Luma AI API is based on the Dream Machine video generation model developed by Luma. By providing a variety of nodes, such as text-to-video, image-to-video, and video preview, the plug-in greatly expands the possibilities of video generation and offers convenient tools for video creators and developers.
Tongyi Wanxiang AI Creative Painting is a product that uses artificial intelligence technology to convert users' text descriptions or images into video content. Through advanced AI algorithms, it can understand the user's creative intentions and automatically generate artistic videos. This product can not only improve the efficiency of content creation, but also stimulate users' creativity, and is suitable for many fields such as advertising, education, and entertainment.
Loopy is an end-to-end audio-driven video diffusion model. It is designed with inter- and intra-clip temporal modules and an audio-to-latents module, enabling the model to leverage long-term motion information in the data to learn natural motion patterns and improve the correlation between audio and portrait motion. This removes the need for the manually specified spatial motion templates used in existing methods and yields more realistic, higher-quality results across a variety of scenarios.
CyberHost is an end-to-end audio-driven human animation framework that achieves hand integrity, identity consistency, and natural motion generation through a regional codebook attention mechanism. This model utilizes the dual U-Net architecture as the basic structure and uses a motion frame strategy for temporal continuation to establish a baseline for audio-driven human animation. CyberHost improves the quality of synthesis results through a series of human-led training strategies, including body motion maps, hand articulation scores, pose-aligned reference features, and local augmentation supervision. CyberHost is the first audio-driven human body diffusion model capable of zero-shot video generation at the human body scale.
EmoTalk3D is a research project focused on 3D talking-head synthesis. It addresses the problems of view consistency and insufficient emotional expression in traditional 3D head synthesis by collecting multi-view videos with per-frame emotion annotations and 3D geometry. The project proposes a novel approach, trained on the EmoTalk3D dataset, that achieves emotion-controllable 3D talking-head synthesis with improved lip synchronization and rendering quality. The EmoTalk3D model generates 3D animations over a wide range of viewing angles at high rendering quality, while capturing dynamic facial details such as wrinkles and subtle expressions.
Clapper.app is an open source AI story visualization tool that can interpret and render scripts into storyboards, videos, sounds and music. Currently, the tool is still in the early stages of development and is not suitable for ordinary users as some features are not yet complete and there are no tutorials.
Stable Video 4D (SV4D) is a generative model built on Stable Video Diffusion (SVD) and Stable Video 3D (SV3D) that takes a single-view video and generates multiple novel-view videos of the object (a 4D image matrix). The model is trained to generate 40 frames (5 video frames x 8 camera views) at 576x576 resolution, given 5 reference frames of the same size. In practice, an orbital video is first generated by running SV3D; that orbital video serves as the reference views for SV4D, while the input video provides the reference frames for 4D sampling. Longer novel-view videos are generated by using the first generated frames as anchors and then densely sampling (interpolating) the remaining frames.
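The 5x8 arrangement described above is just index arithmetic over the 40 generated frames. The following minimal sketch (not the official SV4D code) shows how such an output could be reshaped into a (frame, view) image matrix; the ordering assumption is noted in the comments.

```python
# Minimal sketch: mapping 40 generated frames onto the 5-frame x 8-view matrix.
import numpy as np

num_frames, num_views = 5, 8          # 5 video frames x 8 camera views
H = W = 576                           # generation resolution

# Placeholder for the model output: 40 RGB frames, assumed frame-major order
# (views varying fastest; the real ordering depends on the SV4D implementation).
generated = np.zeros((num_frames * num_views, 3, H, W), dtype=np.float32)

# Rearrange into a 4D image matrix indexed by (frame, view).
image_matrix = generated.reshape(num_frames, num_views, 3, H, W)

# e.g. all 8 novel views of the 3rd reference frame:
views_of_frame_2 = image_matrix[2]    # shape: (8, 3, 576, 576)
print(views_of_frame_2.shape)
```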
FasterLivePortrait is a real-time portrait animation project based on deep learning. Using TensorRT, it reaches 30+ FPS on an RTX 3090 GPU, a figure that includes pre- and post-processing, not just model inference. The project also converts the LivePortrait model to ONNX and, with onnxruntime-gpu on an RTX 3090, achieves about 70 ms/frame inference, enabling cross-platform deployment. In addition, it ships a native Gradio app that is several times faster and supports simultaneous inference on multiple faces. The codebase has been restructured and no longer depends on PyTorch; all models run inference via ONNX or TensorRT.
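For orientation, here is a minimal sketch of ONNX inference with onnxruntime-gpu in the spirit of the project's ONNX pipeline. The model filename and input layout are assumptions for illustration; the repository defines the actual pre- and post-processing.

```python
# Hedged sketch: running an exported ONNX model with onnxruntime-gpu.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "liveportrait_generator.onnx",                       # hypothetical filename
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

input_name = session.get_inputs()[0].name
frame = np.random.rand(1, 3, 256, 256).astype(np.float32)  # dummy frame tensor

outputs = session.run(None, {input_name: frame})
print([o.shape for o in outputs])
```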
RunwayML is a leading next-generation creative suite that provides a rich set of tools for turning ideas into reality. Through its text-to-video generation technology, the app lets users generate videos on their phone from a text description alone. Its main advantages include: 1. Text-to-video generation: users only need to enter a text description to generate a video. 2. Regular updates: new features are released frequently, so users always have the latest AI video and image tools. 3. Seamless asset transfer: users can move assets between phone and computer without friction. 4. Multiple subscription options: Standard, Professional, and monthly 1,000-credit subscription plans are available.
TCAN is a novel portrait animation framework based on diffusion models that maintains temporal consistency and generalizes well to unseen domains. The framework uses dedicated modules such as an appearance-pose adaptation (APPA) layer, a temporal control network, and a pose-driven temperature map to ensure that the generated video keeps the appearance of the source image and follows the pose of the driving video while maintaining background consistency.
LivePortrait is a generative portrait animation model based on an implicit keypoint framework that synthesizes photorealistic videos by using a single source image as a reference for appearance and deriving actions (such as facial expressions and head poses) from driving video, audio, text, or generation. The model not only achieves an effective balance between computational efficiency and controllability, but also significantly improves the generation quality and generalization ability by expanding the training data, adopting a hybrid image-video training strategy, upgrading the network architecture, and designing better motion conversion and optimization goals.
MimicMotion is a high-quality human motion video generation model jointly developed by Tencent and Shanghai Jiao Tong University. The model makes the generation process controllable through confidence-aware pose guidance, improves temporal smoothness, and reduces image distortion. It builds on an advanced image-to-video diffusion model and combines a spatiotemporal U-Net with PoseNet to generate high-quality videos of arbitrary length conditioned on pose sequences. MimicMotion significantly outperforms previous methods in several respects, including hand generation quality and accurate adherence to the reference poses.
Gen-3 Alpha is the first in a series of models trained by Runway on new infrastructure built for multi-modal training at scale. It offers significant improvements over Gen-2 in fidelity, consistency, and motion, and is a step toward building a universal world model. The model's ability to generate expressive characters with rich movements, gestures and emotions offers new opportunities for storytelling.
UniAnimate is a unified video diffusion model framework for human image animation. It reduces optimization difficulty and ensures temporal coherence by mapping reference images, pose guidance, and noisy videos into a common feature space. UniAnimate can handle long sequences and supports random noise input and first frame conditional input, significantly improving the ability to generate long-term videos. Furthermore, it explores alternative temporal modeling architectures based on state-space models as a replacement for the original computationally intensive temporal Transformer. UniAnimate achieves synthetic results that outperform existing state-of-the-art techniques in both quantitative and qualitative evaluations, and is able to generate highly consistent one-minute videos by iteratively using a first-frame conditional strategy.
VideoTetris is a novel framework for compositional text-to-video generation, particularly suited to complex scenarios that involve multiple objects or changing object counts. The framework follows complex textual semantics precisely through spatio-temporal compositional diffusion, achieved by manipulating and composing the spatial and temporal attention maps of the denoising network. It also introduces a new reference-frame attention mechanism to improve the consistency of autoregressive video generation. VideoTetris achieves impressive qualitative and quantitative results in compositional text-to-video generation.
Depth Anything V2 is an improved monocular depth estimation model that provides finer and more robust depth predictions than the previous version by training using synthetic images and a large number of unlabeled real images. The model has significant improvements in efficiency and accuracy, and is more than 10 times faster than the latest models based on Stable Diffusion.
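As a usage illustration, the sketch below runs monocular depth estimation through the generic Hugging Face "depth-estimation" pipeline; the checkpoint id is an assumption and should be checked against the model card.

```python
# Hedged sketch: single-image depth prediction via the transformers pipeline.
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline(
    task="depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",  # assumed checkpoint id
)

image = Image.open("example.jpg")
result = depth_estimator(image)
result["depth"].save("depth_map.png")   # PIL image of the predicted depth map
```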
MotionClone is a training-free framework that clones motion from a reference video to control text-to-video generation. It uses temporal attention in video inversion to represent the motion of the reference video, and introduces primary temporal attention guidance to mitigate the influence of noisy or very subtle motion in the attention weights. Furthermore, to help the generative model synthesize reasonable spatial relationships and follow prompts more closely, a location-aware semantic guidance mechanism is proposed that exploits the rough foreground location in the reference video together with the original classifier-free guidance features.
WorldDreamer is an innovative video generation model that understands and simulates world dynamics by predicting masked visual tokens. It excels at image-to-video synthesis, text-to-video generation, video inpainting, video stylization, and action-to-video generation. Drawing on the success of large language models, it treats world modeling as an unsupervised visual sequence modeling problem: visual inputs are mapped to discrete tokens and the masked ones are predicted.
Follow-Your-Pose is a text-to-video generation model that uses pose information and text descriptions to generate editable, pose-controllable character videos. The technology has important applications in digital character creation, addressing the lack of comprehensive datasets and pretrained priors for this kind of video generation. Pose-controllable video generation is achieved through a two-stage training scheme combined with a pre-trained text-to-image model.
MotionFollower is a lightweight score-guided diffusion model for video motion editing. It uses two lightweight signal controllers to control posture and appearance respectively, without involving heavy attention calculations. The model is designed with a score guidance principle based on a dual-branch architecture, including reconstruction and editing branches, which significantly enhances the modeling capabilities of texture details and complex backgrounds. Experiments show that MotionFollower reduces GPU memory usage by about 80% compared to the most advanced motion editing model, MotionEditor, while providing superior motion editing performance and exclusively supporting a wide range of camera movements and actions.
Keling (Kling) is Kuaishou's self-developed large model with powerful video generation capabilities. Using advanced technology, it can generate videos up to 2 minutes long, simulate physical-world characteristics, support concept combination, and produce cinematic-quality footage.
CamCo is an innovative image-to-video generation framework capable of producing high-quality videos with 3D consistency. The framework injects camera information via Plücker coordinates and proposes a geometry-consistent, epipolar-constrained attention module. In addition, CamCo is fine-tuned on real-world videos whose camera poses are estimated with structure-from-motion algorithms, so that object motion is synthesized more faithfully.
EasyAnimate is a transformer-based pipeline for generating AI images and videos and for training baseline and LoRA models for the Diffusion Transformer. It supports direct inference from pre-trained EasyAnimate models, generating videos of about 6 seconds at 24 fps at various resolutions. Users can also train their own baseline and LoRA models to perform specific style transfers.
AnimateAnyone is a deep-learning-based video generation model that can turn static images or videos into animations. This model is an unofficial implementation by Novita AI, inspired by MooreThreads/Moore-AnimateAnyone, with adjustments to the training process and dataset.
MusePose is an image-to-video generation framework developed by Lyra Lab of Tencent Music Entertainment. It is designed to generate videos of virtual characters through posture control signals. It is the final building block in the Muse open source series, which, along with MuseV and MuseTalk, aims to move the community toward the vision of generating virtual characters with full-body movement and interaction capabilities. Based on diffusion models and pose guidance, MusePose is able to generate dancing videos of people in reference images, and the result quality surpasses almost all current open source models on the same topic.
FIFO-Diffusion is a novel inference technique based on pre-trained diffusion models for text-conditioned video generation. It can generate arbitrarily long videos without training by iteratively performing diagonal denoising, which handles gradually increasing noise levels over a sequence of consecutive frames held in a queue: at each step the method dequeues a fully denoised frame at the head while enqueuing a new random-noise frame at the tail. In addition, latent partitioning is introduced to reduce the training-inference gap, and lookahead denoising exploits the benefit of forward referencing.
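The queue mechanism is easiest to see as pseudocode. The sketch below is schematic only (not the authors' implementation): frames in the queue sit at increasing noise levels, each step denoises every queued frame by one level, pops the clean head frame, and pushes fresh noise at the tail. `denoise_one_level` stands in for one step of a pre-trained video diffusion model.

```python
# Schematic sketch of FIFO-Diffusion's diagonal denoising queue.
from collections import deque
import torch

num_levels = 16                      # queue length == number of noise levels
shape = (4, 64, 64)                  # per-frame latent shape (assumed)

def denoise_one_level(latents, levels):
    """Hypothetical wrapper: one joint denoising step of the pre-trained model,
    with each queued frame at its own noise level."""
    return latents                   # placeholder

queue = deque(torch.randn(shape) for _ in range(num_levels))
video = []

for _ in range(120):                 # arbitrarily many output frames
    latents = torch.stack(list(queue))
    latents = denoise_one_level(latents, levels=list(range(num_levels)))
    queue = deque(latents.unbind(0))
    video.append(queue.popleft())    # head frame is now fully denoised
    queue.append(torch.randn(shape)) # enqueue fresh noise at the tail
```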
Veo is Google's latest video generation model, capable of generating high-quality 1080p resolution videos and supporting a variety of movies and visual styles. Through advanced natural language and visual semantic understanding, it can accurately capture the user's creative vision and generate video content that is consistent with the prompt's tone and rich in detail. Veo models offer an unprecedented level of creative control, understanding cinematic terms like "time-lapse" or "aerial landscape" to create coherent footage that allows people, animals and objects to move realistically within the shot.
ID-Animator is a zero-shot human video generation method capable of personalized video generation from a single reference facial image without the need for further training. This technique inherits existing diffusion-based video generation frameworks and incorporates face adapters to encode identity-related embeddings. Through this method, ID-Animator is able to maintain the details of character identity during video generation while improving training efficiency.
VASA-1 is a model developed by Microsoft Research that focuses on generating realistic facial animations synchronized with audio in real time. The technology uses deep learning to automatically generate matching lip shapes and facial expressions from the input speech, providing a new interactive experience. Its main advantages are highly realistic output and real-time responsiveness, allowing virtual characters to interact with users more naturally. VASA-1 is currently aimed at virtual assistants, online education, entertainment, and similar fields. Pricing has not yet been announced, but a free trial version is expected.
Imagen provides advanced generative media capabilities. Gemini models are ideal for advanced reasoning and general-purpose use cases, while task-specific generative AI models help enterprises deliver specialized capabilities. The text-to-live-image feature previewed today makes Imagen even more powerful for enterprise workloads, allowing marketing and creative teams to generate animated images, such as GIFs, from text prompts. Initially, these animated images will be delivered at 24 frames per second (fps) at a resolution of 360x640 pixels with a duration of 4 seconds, with ongoing enhancements planned. Because the model is designed for enterprise applications, it excels at subjects such as nature, food imagery, and animals. It can generate a range of camera angles and movements while keeping the sequence consistent. Imagen's animated image generation comes with safety filters and digital watermarks to maintain trust between creators and users. Additionally, Imagen 2.0's image generation has been updated with advanced photo editing capabilities, including inpainting and outpainting. These features, now available on Vertex AI, let users easily remove unwanted elements from an image, add new elements, and extend image borders to create a wider field of view. Finally, the digital watermarking functionality based on Google DeepMind's SynthID technology is now generally available, enabling customers to embed invisible watermarks in, and verify, images and animated images generated by the Imagen family of models.
MagicTime is a model for generating high-quality metamorphic videos from text descriptions. It achieves highly realistic simulations of metamorphic processes by learning real-world physics from time-lapse videos. The model comprises three main components: MagicAdapter, Dynamic Frames Extraction, and Magic Text-Encoder, which together let it understand a change process from text and generate the corresponding video. The team also built a dedicated time-lapse video dataset, ChronoMagic, to support metamorphic video generation. Comprehensive experiments show that MagicTime performs well at generating dynamic, realistic metamorphic videos, offering a new path toward building a metamorphic simulator of the physical world.
MuseV is a virtual human video generation framework based on the diffusion model, supports unlimited length video generation, and adopts a novel visual condition parallel denoising scheme. It provides a pre-trained virtual human video generation model, supports functions such as Image2Video, Text2Image2Video, Video2Video, etc., and is compatible with the Stable Diffusion ecosystem, including basic models, LoRA, ControlNet, etc. It supports multiple reference image technologies, such as IPAdapter, ReferenceOnly, ReferenceNet, IPAdapterFaceID, etc. The advantage of MuseV is that it can generate high-fidelity videos of unlimited length and is positioned in the field of video generation.
Make-Your-Anchor is a 2D avatar generation framework based on diffusion models. With only about one minute of video footage, it automatically generates an anchor-style video with precise upper-body and hand movements. The system uses a structure-guided diffusion model to render 3D mesh states into character appearances, and a two-stage training strategy effectively binds movements to a specific appearance. To generate temporally continuous videos of arbitrary length, the 2D U-Net of the frame-wise diffusion model is extended to 3D, and a simple yet effective batch-overlapped temporal denoising module is proposed, breaking the video-length limit at inference time. Finally, an identity-specific face enhancement module improves the visual quality of the facial region in the output video. Experiments show that the system outperforms existing techniques in visual quality, temporal consistency, and identity fidelity.
AniPortrait is a project that generates animated talking and singing videos from audio and image input. It can produce realistic facial animation with consistent lip movements from audio and a static face picture. It supports multiple languages, face reenactment, and head-pose control. Features include audio-driven animation synthesis, face reenactment, head-pose control, support for both self-driven and audio-driven video generation, high-quality animation output, and flexible model and weight configuration.
SceneScript is a new 3D scene reconstruction technology developed by the Reality Labs research team. The technology uses AI to understand and reconstruct complex 3D scenes, enabling the creation of detailed 3D models from a single image. SceneScript significantly improves the accuracy and efficiency of 3D reconstruction by combining multiple advanced deep learning techniques, such as semi-supervised learning, self-supervised learning and multi-modal learning.
StreamingT2V is an advanced autoregressive technology that can create long videos with rich dynamic motion without any stagnation. It ensures temporal consistency within the video, tight alignment with descriptive text, and maintains high frame-level image quality.
MOTIA is a diffusion-based method using test-time adaptation that exploits the intrinsic content and motion patterns of the source video to perform effective video outpainting. The method consists of two main stages, intrinsic adaptation and extrinsic rendering, aimed at improving the quality and flexibility of video outpainting.
DynamiCrafter is an image-to-video model that generates dynamic videos about 2 seconds long from an input image and text. The model is trained to generate high-resolution videos at 576x1024. Its main advantage is the ability to capture the dynamics implied by the input image and text description and turn them into realistic short video content. It is suited to video production, animation creation, and similar scenarios, giving content creators an efficient productivity tool. The model is currently in the research phase and is intended for personal and research use only.
VLOGGER is a method for generating text- and audio-driven videos of speaking humans from a single human input image, building on the recent success of generative diffusion models. Our approach consists of 1) a stochastic human-to-3D motion diffusion model, and 2) a novel diffusion-based architecture that enhances text-to-image models with temporal and spatial control. This approach is capable of generating high-quality videos of variable length and is easily controllable with advanced expressions of human faces and bodies. Unlike previous work, our approach does not require training for each individual, does not rely on face detection and cropping, generates complete images (not just faces or lips), and takes into account the wide range of scenarios required to correctly synthesize communicative humans (e.g. visible torsos or diverse subject identities).
AtomoVideo is a novel high-fidelity image-to-video (I2V) generation framework that generates high-fidelity videos from input images, achieves better motion intensity and consistency compared to existing work, and is compatible with various personalized T2I models without specific adjustments.
EMO, from Alibaba, is a tool for generating expressive facial-expression videos. It can produce talking-avatar videos with various head poses and rich expressions from an input portrait image and voice audio. It supports songs in multiple languages and a variety of portrait styles, and can generate dynamic, expressive animated characters that follow the rhythm of the audio.
AnimateLCM-SVD-xt is a new image-to-video generation model that can produce high-quality, coherent videos in only a few steps. The model uses consistency knowledge distillation and stereo matching learning to make the generated videos more stable and coherent while greatly reducing computation. Key features include: 1) generating 25 frames of 576x1024 video in 4-8 steps; 2) requiring roughly 12.5x less computation than an ordinary video diffusion model; 3) producing good-quality videos without needing additional classifier-free guidance.
Sora is a text-controlled video generation diffusion model trained at large scale. It can generate high-definition videos up to one minute long, covering a wide range of visual data types and resolutions. Sora achieves scalable video generation by training in the compressed latent space of videos and images and decomposing them into spacetime patches. Sora also demonstrates emerging capabilities in simulating the physical and digital worlds, such as 3D consistency and interaction, suggesting that continuing to scale video generation models is a promising path toward highly capable simulators.
Meshy-2 is the latest addition to our 3D generative AI product family, coming three months after the release of Meshy-1. This version is a huge leap forward in the field of Text to 3D, providing better structured meshes and rich geometric details for 3D objects. In Meshy-2, Text to 3D offers four style options: Realistic, Cartoon, Low Poly, and Voxel, to satisfy a variety of artistic preferences and inspire new creative directions. We've increased the speed of generation without compromising quality, with preview time around 25 seconds and fine results within 5 minutes. Additionally, Meshy-2 introduces a user-friendly mesh editor with polygon count control and a quad mesh conversion system to provide more control and flexibility in 3D projects. The Text to Texture feature has been optimized to render textures more clearly and twice as fast. The enhanced Image to 3D feature produces higher-quality results in 2 minutes. We are shifting our focus from Discord to web applications, encouraging users to share AI-generated 3D art in the web application community.
DynamiCrafter is an image animation tool developed by Jinbo Xing, Menghan Xia and others. By leveraging pre-trained video diffusion priors, DynamiCrafter can animate open-domain still images based on textual cues. The tool supports high-resolution models, providing better dynamics, higher resolution and greater consistency. DynamiCrafter is mainly used in scenarios such as story video generation, loop video generation, and frame interpolation generation.
Stable Video Diffusion (SVD) 1.1 Image-to-Video is a diffusion model that generates a video from a still image used as a conditioning frame. It is a latent diffusion model trained to generate short video clips from an image: 25 frames at a resolution of 1024x576, conditioned on a context frame of the same size, fine-tuned from SVD Image-to-Video [25 frames]. Fine-tuning was performed with fixed conditioning at 6 FPS and Motion Bucket Id 127 to improve output consistency without requiring hyperparameter adjustment.
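A minimal usage sketch with the diffusers StableVideoDiffusionPipeline is shown below, matching the fine-tuning conditions described above (25 frames, 6 fps, motion bucket id 127); the exact 1.1 checkpoint id is an assumption, so check the model card before use.

```python
# Hedged sketch: image-to-video inference with diffusers.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",  # assumed repo id
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

image = load_image("input.jpg").resize((1024, 576))

# Conditioning chosen to match the description: 25 frames, 6 fps, bucket 127.
frames = pipe(image, num_frames=25, fps=6, motion_bucket_id=127,
              decode_chunk_size=8).frames[0]
export_to_video(frames, "output.mp4", fps=6)
```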
AnimateLCM is a model that uses deep learning to generate animated videos. It can generate high-fidelity animated videos using only a few sampling steps. Different from directly performing consistency learning on the original video data set, AnimateLCM adopts a decoupled consistency learning strategy to decouple the extraction of image generation prior knowledge and motion generation prior knowledge, thereby improving training efficiency and enhancing the generated visual quality. In addition, AnimateLCM can also be used with the plug-in module of the Stable Diffusion community to achieve various controllable generation functions. AnimateLCM has proven its performance in image-based video generation and layout-based video generation.
Lumiere is a text-to-video diffusion model designed to synthesize videos that exhibit realistic, diverse, and coherent motion, solving key challenges in video synthesis. We introduce a space-time U-Net architecture that can generate the entire video's temporal duration at once, in a single pass of the model. This is in contrast to existing video models, which synthesize distant keyframes and then perform temporal super-resolution, an approach that inherently makes global temporal consistency difficult to achieve. By deploying spatial and (importantly) temporal downsampling and upsampling, and leveraging a pretrained text-to-image diffusion model, our model learns to directly generate full frame rate, low-resolution video at multiple spatiotemporal scales. We present state-of-the-art text-to-video generation results and show that our design easily facilitates a variety of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.
ActAnywhere is a generative model for automatically generating video backgrounds that match the motion and appearance of a foreground subject. The task involves synthesizing a background consistent with the movement and appearance of the foreground subject while also conforming to the artist's creative intent. ActAnywhere leverages the power of large-scale video diffusion models and is specifically tailored for this task. It takes a sequence of foreground subject segmentations as input and an image describing the desired scene as a condition, and generates a coherent video consistent with the condition frame, with realistic foreground-background interaction. The model is trained on a large-scale dataset of human-scene interaction videos. Extensive evaluations show that it performs significantly better than baselines and generalizes to samples from diverse distributions, including non-human subjects.
ComfyUI-Moore-AnimateAnyone is a ComfyUI custom node implemented based on the Moore-AnimateAnyone model, which can generate corresponding human animations through simple text descriptions. This node is easy to install and use, supports the generation of a variety of human postures and movements, and can be used to improve the quality of design works. Its output animation is delicate and natural, providing powerful tools for creators.
The I2V-Adapter is designed to convert static images into dynamic, lifelike video sequences while maintaining the fidelity of the original image. It uses lightweight adapter modules to process noisy video frames and input images in parallel. This module acts as a bridge, effectively connecting the input to the model’s self-attention mechanism, maintaining spatial details without changing the structure of the T2I model. The I2V-Adapter has fewer parameters than traditional models and ensures compatibility with existing T2I models and control tools. Experimental results show that the I2V-Adapter is able to generate high-quality video output, which is of great significance for AI-driven video generation, especially in the field of creative applications.
MagicVideo-V2 is an end-to-end video generation pipeline that integrates text-to-image models, video motion generators, reference image embedding modules, and frame interpolation modules. Its architectural design enables MagicVideo-V2 to produce beautiful-looking, high-resolution video with excellent fidelity and smoothness. Through large-scale user evaluation, it has demonstrated superior performance over leading text-to-video systems such as Runway, Pika 1.0, Morph, Moon Valley and Stable Video Diffusion.
Fairy is a minimalistic yet powerful adaptation of the diffusion model for image editing targeted at video editing applications. Its core is an anchor-based cross-frame attention mechanism that implicitly propagates diffusion features between frames, ensuring better temporal coherence and high-fidelity synthesis. Fairy not only solves the memory and processing speed limitations of previous models, but also improves temporal consistency through a unique data augmentation strategy.
InstructVideo is a method for instructing text-to-video diffusion models with human feedback via reward fine-tuning. It recasts reward fine-tuning as editing, which reduces fine-tuning cost and improves efficiency. It uses an established image reward model to provide reward signals through piecewise sparse sampling and temporally decaying rewards, significantly improving the visual quality of the generated videos. InstructVideo improves visual quality while maintaining strong generalization capabilities. For more information, please visit the official website.
VideoPoet is a large-scale language model that turns any autoregressive language model into a high-quality video generator. It generates videos based on input text descriptions without any visual or audio guidance. VideoPoet is capable of generating various types of videos including text to video, image to video, video editing, stylization and restoration, etc. It can be used in film production, animation, advertising production, virtual reality and other fields. VideoPoet has high-quality video generation capabilities and can be flexibly applied to different scenarios.
W.A.L.T is a transformer-based method for photorealistic video generation that jointly compresses images and videos into a unified latent space, enabling cross-modal training and generation. It uses a window attention mechanism to improve memory and training efficiency. The method achieves state-of-the-art performance on multiple video and image generation benchmarks.
MagicAnimate is a diffusion-based human image animation tool with temporal consistency. It performs diffusion operations on human body images to achieve high-quality, natural, and smooth human animation. MagicAnimate is highly controllable and flexible, and different animation effects can be achieved by tuning parameters. It is suitable for human animation creation, virtual character design, and related fields.
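To see why windowed attention saves memory, the generic sketch below (an illustration, not the W.A.L.T implementation) partitions a (time, height, width) latent grid into non-overlapping windows and attends only within each window rather than globally.

```python
# Generic illustration of windowed attention over a video latent grid.
import torch

T, H, W, C = 8, 16, 16, 64            # video latent grid and channel dim
wt, wh, ww = 2, 4, 4                  # window size along each axis

x = torch.randn(1, T, H, W, C)

# Partition the grid into non-overlapping windows -> (num_windows, tokens, C)
windows = (
    x.view(1, T // wt, wt, H // wh, wh, W // ww, ww, C)
     .permute(0, 1, 3, 5, 2, 4, 6, 7)
     .reshape(-1, wt * wh * ww, C)
)

attn = torch.nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)
out, _ = attn(windows, windows, windows)   # attention within each window only

# Global attention would score (T*H*W)^2 ~= 4.2M token pairs per head;
# windowed attention scores num_windows * (wt*wh*ww)^2 = 64 * 1024 ~= 65k.
print(out.shape)                           # (64, 32, 64)
```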
Make Pixels Dance is a highly dynamic video generation tool that produces rich and varied dynamic video effects from image or text input. The tool offers a basic mode and a magic mode, and users can choose between them depending on their needs. It is powerful yet simple to use, and suits a wide range of creative video production scenarios.
DynVideo-E is a human video editing tool that utilizes dynamic NeRF technology for large-scale movement and perspective changes. The tool represents the video as a 3D foreground normalized body space, combined with a deformation field and a 3D background static space. By utilizing techniques such as reconstruction loss, 2D personalized diffusion prior, 3D diffusion prior, and local part super-resolution, the movable normalized human body space is edited in multi-view and multi-pose configurations. At the same time, the reference style is transferred into the 3D background model through the style transfer loss in the feature space. Users can perform corresponding rendering based on the source video camera pose in the edited video-NeRF model. DynVideo-E can not only process short videos, but also human body videos with large-scale movements and perspective changes, providing users with more directly controllable editing methods. Experiments on two challenging data sets demonstrate that DynVideo-E achieves a significant advantage of 50% to 95% in human preference compared to existing methods. DynVideo-E code and data will be released to the community.
MagicEdit is a high-fidelity, temporally coherent video editing model. By explicitly learning to separate appearance and motion, it supports a variety of editing applications such as video stylization, local editing, video mixing, and video outpainting. The video outpainting task is supported without retraining.
Gen-2 is a multi-modal artificial intelligence system that can generate novel videos from text, images, or video clips. It can do this by applying the composition and style of an image or text prompt to the structure of a source video (Video to Video), or by using text alone (Text to Video). It's like shooting something completely new without actually shooting anything. Gen-2 offers a variety of modes to turn any image, video clip, or text prompt into a compelling film.