Jingyi Intelligent AI Video Generation Artifact is a product that uses artificial intelligence to turn static old photos into dynamic videos. Combining deep learning with image processing, it lets users easily bring precious old photos back to life and create memorable video content. Its main advantages are simple operation, realistic results, and personalized customization. It meets individual users' needs to organize and creatively reuse family photo archives, and also gives business users a novel way to market and promote. The product currently offers a free trial; specific pricing and positioning have yet to be confirmed.
TANGO is a co-speech gesture video reenactment technology based on hierarchical audio-motion embedding and diffusion interpolation. It uses advanced artificial intelligence algorithms to convert speech signals into corresponding gesture movements, naturally reproducing the gestures of the person in the video. The technology has broad application prospects in video production, virtual reality, augmented reality and other fields, and can improve the interactivity and realism of video content. TANGO was jointly developed by the University of Tokyo and CyberAgent AI Lab and represents the current state of the art in gesture recognition and motion generation.
Coverr AI Workflows is a platform focused on AI video generation, offering a variety of AI tools and workflows that help users produce high-quality video content in a few simple steps. The platform gathers the expertise of AI video practitioners: through workflows shared by the community, users can learn how to combine different AI tools to create videos. Coverr AI Workflows grew out of the increasingly widespread use of artificial intelligence in video production; by providing workflows that are easy to understand and follow, it lowers the technical barrier to video creation and lets non-professionals produce professional-level content. Coverr AI Workflows currently provides free video and music resources, targeting the video production needs of creative workers and small businesses.
AI video generation artifact is an online tool that uses artificial intelligence technology to convert pictures or text into video content. Through deep learning algorithms, it can understand the meaning of pictures and text and automatically generate attractive video content. The application of this technology has greatly reduced the cost and threshold of video production, allowing ordinary users to easily produce professional-level videos. Product background information shows that with the rise of social media and video platforms, users' demand for video content is growing day by day. However, traditional video production methods are costly and time-consuming, making it difficult to meet the rapidly changing market demand. The emergence of AI video generation artifacts has just filled this market gap, providing users with a fast and low-cost video production solution. Currently, this product provides a free trial, and the specific price needs to be checked on the website.
Eddie AI is an innovative video editing platform that uses artificial intelligence technology to help users edit videos quickly and easily. The main advantage of this platform is its user-friendliness and efficiency, which allows users to talk to the AI as if they were talking to another editor, proposing the type of video clip they want. Background information on Eddie AI reveals that it aims to scale video editing through the use of custom AI editing/storytelling models, suggesting its potential revolutionary impact in the world of video production.
Pyramid Flow is an efficient video generation technique that applies flow matching within an autoregressive video generation model. Its main advantage is high training efficiency: it can be trained on open-source datasets in relatively few GPU hours while still generating high-quality video content. Pyramid Flow was jointly developed by Peking University, Kuaishou Technology and Beijing University of Posts and Telecommunications, and its paper, code and models have been published on multiple platforms.
AI Hug Video Generator is an online platform that uses advanced machine learning technology to transform static photos into dynamic, lifelike hug videos. Users can create personalized, emotion-filled videos based on their precious photos. The technology creates photorealistic digital hugs by analyzing real human interactions, including subtle gestures and emotions. The platform provides a user-friendly interface, making it easy for both technology enthusiasts and video production novices to create AI hug videos. Additionally, the resulting video is high-definition and suitable for sharing on any platform, ensuring great results on every screen.
LLaVA-Video is a large multi-modal model (LMM) focused on video instruction tuning. It addresses the difficulty of obtaining large amounts of high-quality raw video data from the web by creating a high-quality synthetic dataset, LLaVA-Video-178K. The dataset covers tasks such as detailed video description, open-ended question answering, and multiple-choice question answering, and is designed to improve the understanding and reasoning capabilities of video-language models. The LLaVA-Video model performs well on multiple video benchmarks, demonstrating the effectiveness of the dataset.
JoggAI is a platform that uses artificial intelligence technology to help users quickly convert product links or visual materials into attractive video ads. It provides rich templates, diverse AI avatars, and fast-response services to create engaging content and drive website traffic and sales. The main advantages of JoggAI include rapid video content creation, AI script writing, batch mode production, video clip understanding, text-to-speech conversion, etc. These features make JoggAI ideal for e-commerce, marketing, sales and business owners as well as agencies and freelancers who need to produce video content efficiently.
Hailuo AI Video Generator is a tool that uses artificial intelligence technology to automatically generate video content based on text prompts. It uses deep learning algorithms to convert users' text descriptions into visual images, which greatly simplifies the video production process and improves creation efficiency. This product is suitable for individuals and businesses who need to quickly generate video content, especially in areas such as advertising, social media content production and movie previews.
Guangying AI is a platform that uses artificial intelligence technology to help users quickly create popular videos. It simplifies the video editing process through AI technology, allowing users to produce high-quality video content without video editing skills. This platform is particularly suitable for individuals and businesses that need to quickly produce video content, such as social media operators, video bloggers, etc.
Meta Movie Gen is an advanced media foundation AI model that lets users generate customized video and sound, edit existing videos, or turn personal images into unique videos from simple text input. The technology represents the latest breakthrough of AI in content creation, giving content creators unprecedented creative freedom and efficiency.
JoyHallo is a digital human model designed for Mandarin video generation. It created the jdh-Hallo dataset by collecting 29 hours of Mandarin videos from employees of JD Health International Co., Ltd. The dataset covers different ages and speaking styles, including conversational and professional medical topics. The JoyHallo model uses the Chinese wav2vec2 model for audio feature embedding, and proposes a semi-decoupled structure to capture the interrelationships between lips, expressions and gesture features, improving information utilization efficiency and speeding up inference by 14.3%. In addition, JoyHallo also performs well in generating English videos, demonstrating excellent cross-language generation capabilities.
MIMO is a universal video synthesis model capable of simulating anyone interacting with objects in complex movements. It is capable of synthesizing character videos with controllable attributes (such as characters, actions, and scenes) based on simple user-provided inputs (such as reference images, pose sequences, scene videos, or images). MIMO achieves this by encoding 2D video into a compact spatial code and decomposing it into three spatial components (main character, underlying scene, and floating occlusion). This approach allows flexible user control, spatial motion expression, and 3D perception synthesis, suitable for interactive real-world scenarios.
LVCD is a reference-based line drawing video coloring technology that uses a large-scale pre-trained video diffusion model to generate colorized animated videos. This technology uses Sketch-guided ControlNet and Reference Attention to achieve color processing of animation videos with fast and large movements while ensuring temporal coherence. The main advantages of LVCD include temporal coherence in generating colorized animated videos, the ability to handle large motions, and high-quality output results.
ComfyUI-LumaAI-API is a plug-in designed for ComfyUI, which allows users to use the Luma AI API directly in ComfyUI. The Luma AI API is based on the Dream Machine video generation model, developed by Luma. This plug-in greatly enriches the possibilities of video generation by providing a variety of nodes, such as text to video, image to video, video preview, etc., and provides convenient tools for video creators and developers.
Runway API is a powerful video modeling platform that provides advanced generative video models, allowing users to embed Gen-3 Alpha Turbo into their products in a safe and reliable environment. It supports a wide range of application scenarios, including creative advertising, music videos, film production, etc., and is the first choice of the world's top creative professionals.
Dream Machine API is a creative intelligence platform that provides a series of advanced video generation models. Through intuitive APIs and open source SDKs, users can build and extend creative AI products. With features like text to video, image to video, keyframe control, expansion, looping and camera control, the platform is designed to work with humans through creative intelligence to help them create better content. The Dream Machine API is designed to drive richness in visual exploration and creation, allowing more ideas to be tried, better narratives built, and diverse stories told by those who were previously unable to do so.
AI-Faceless-Video-Generator is a project that uses artificial intelligence technology to generate video scripts, voices and talking avatars based on topics. It combines sadtalker for facial animation, gTTS to generate AI speech, and OpenAI language model to generate scripts, providing an end-to-end solution for generating personalized videos. Key benefits of the project include script generation, AI voice generation, facial animation creation, and an easy-to-use interface.
Tongyi Wanxiang AI Creative Painting is a product that uses artificial intelligence technology to convert users' text descriptions or images into video content. Through advanced AI algorithms, it can understand the user's creative intentions and automatically generate artistic videos. This product can not only improve the efficiency of content creation, but also stimulate users' creativity, and is suitable for many fields such as advertising, education, and entertainment.
Follow-Your-Canvas is a diffusion-based video outpainting technology capable of generating high-resolution video content. It overcomes GPU memory limitations through distributed processing and spatial window merging while maintaining the spatial and temporal consistency of the video. It excels at large-scale video expansion, significantly increasing video resolution, for example from 512×512 to 1152×2048, while producing high-quality and visually pleasing results.
Runway Staff Picks is a platform showcasing a selection of short films and experimental works created with Runway Gen-3 Alpha technology. The works range from art to technology, demonstrating Runway's cutting-edge capabilities in video creation and experimental art. Runway also collaborated with Tribeca Festival 2024 and partnered with Media.Monks to further push the boundaries of creativity.
Loopy is an end-to-end audio-driven video diffusion model. It is designed with inter-clip and intra-clip temporal modules and an audio-to-latents module, which let the model exploit long-term motion information in the data to learn natural motion patterns and strengthen the correlation between audio and portrait motion. This removes the need for the manually specified spatial motion templates used in existing methods and yields more realistic, higher-quality results across a variety of scenarios.
CyberHost is an end-to-end audio-driven human animation framework that achieves hand integrity, identity consistency, and natural motion generation through a regional codebook attention mechanism. This model utilizes the dual U-Net architecture as the basic structure and uses a motion frame strategy for temporal continuation to establish a baseline for audio-driven human animation. CyberHost improves the quality of synthesis results through a series of human-led training strategies, including body motion maps, hand articulation scores, pose-aligned reference features, and local augmentation supervision. CyberHost is the first audio-driven human body diffusion model capable of zero-shot video generation at the human body scale.
CogVideo is a text-to-video generation model developed by the Tsinghua University team, which converts text descriptions into video content through deep learning technology. This technology has broad application prospects in video content creation, education, entertainment and other fields. Through large-scale pre-training, the CogVideo model can generate videos that match text descriptions, providing a new automated method for video production.
MoneyPrinterPlus is an open source AI short video generation tool. It uses AI large model technology to generate various short videos in batches with one click, supports one-click mixing and cutting of short videos, and can automatically publish videos to Douyin, Kuaishou, Xiaohongshu, video accounts and other platforms. This tool is designed to help users easily grasp short video traffic and achieve rapid dissemination and monetization of content.
Creatify 2.0 is an AI video ad maker with text-to-video conversion function and AI editing function, designed to create viral ad creatives and improve marketing efficiency. It supports more than 9 styles, including cartoon, reality, 3D, etc., helping users quickly generate engaging video ads, and can conduct customized promotions for specific audiences.
CogVideoX is an open source video generation model developed by the Tsinghua University team that supports video generation from text descriptions. It offers a range of models, from entry-level to large-scale, to meet different quality and cost needs. The models support multiple precisions, including FP16 and BF16; it is recommended to run inference in the same precision used for training. The CogVideoX-5B model is particularly suitable for scenarios that require high-quality video content, such as film production, game development and advertising creatives.
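To make the precision recommendation concrete, here is a minimal sketch of running the 5B checkpoint through the Hugging Face diffusers integration (CogVideoXPipeline, available in recent diffusers releases); the prompt, frame count and guidance scale are illustrative choices rather than fixed requirements.

```python
# Minimal sketch: text-to-video with CogVideoX-5B via Hugging Face diffusers.
# Assumes a recent diffusers release that ships CogVideoXPipeline.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,  # match the training precision (BF16 for the 5B model)
)
pipe.enable_model_cpu_offload()  # reduces GPU memory pressure on smaller cards

prompt = "A panda playing guitar in a bamboo forest, cinematic lighting"
video_frames = pipe(
    prompt=prompt,
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video_frames, "cogvideox_sample.mp4", fps=8)
```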
Video-CCAM is a series of flexible video multi-modal large language models (Video-MLLMs) developed by the Tencent QQ Multimedia Research Team. It is dedicated to improving video-language understanding and is especially suitable for analyzing both short and long videos, which it achieves through Causal Cross-Attention Masks. Video-CCAM performs well on multiple benchmarks, especially MVBench, VideoVista, and MLVU. The model's source code has been rewritten to simplify deployment.
auto-video-generateor (automatic video generator) is an innovative AI tool that automatically generates explainer videos from a topic text entered by the user. It uses a large language model to generate a story or explanatory script, speech synthesis to generate the narration, and text-to-image technology to generate pictures that match the text, then integrates these elements into an explainer video. The project is built on Baidu Intelligent Cloud's Qianfan large model platform, using the ERNIE series of models together with open-source speech synthesis and text-to-image technology to automate the entire video generation process.
Video-Foley is an innovative video-to-sound generation system that achieves highly controllable and synchronized video-sound synthesis by using root mean square (RMS) as the temporal event condition, combined with semantic timbre cues (audio or text). The system uses a label-free self-supervised learning framework, including two stages, Video2RMS and RMS2Sound, and combines novel concepts such as RMS discretization and RMS-ControlNet with a pre-trained text-to-audio model. Video-Foley delivers state-of-the-art performance in audio and video alignment and control of sound timing, intensity, timbre and detail.
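To make the RMS conditioning concrete, the sketch below (my own illustration under simplifying assumptions, not the authors' code) computes a frame-level RMS envelope from a mono waveform with NumPy and quantizes it into discrete bins, mirroring the RMS discretization idea described above.

```python
# Illustrative sketch of frame-level RMS extraction and discretization,
# in the spirit of Video-Foley's temporal event condition (not the official code).
import numpy as np

def rms_envelope(waveform: np.ndarray, frame_len: int = 1024, hop: int = 512) -> np.ndarray:
    """Frame-wise root mean square: sqrt(mean(x^2)) over sliding windows."""
    frames = []
    for start in range(0, len(waveform) - frame_len + 1, hop):
        frame = waveform[start:start + frame_len]
        frames.append(np.sqrt(np.mean(frame ** 2)))
    return np.asarray(frames)

def discretize_rms(rms: np.ndarray, num_bins: int = 64) -> np.ndarray:
    """Map continuous RMS values to integer bins, a simple stand-in for RMS discretization."""
    rms_norm = (rms - rms.min()) / (rms.max() - rms.min() + 1e-8)
    return np.minimum((rms_norm * num_bins).astype(int), num_bins - 1)

if __name__ == "__main__":
    sr = 16000
    t = np.linspace(0, 2.0, 2 * sr, endpoint=False)
    audio = 0.5 * np.sin(2 * np.pi * 440 * t) * np.exp(-t)  # decaying tone as dummy audio
    bins = discretize_rms(rms_envelope(audio))
    print(bins[:20])
```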
NarratoAI is a tool that uses large AI models to explain and edit videos with one click. It provides a one-stop solution for script writing, automatic video editing, dubbing and subtitle generation, powered by LLM to increase the efficiency of content creation.
PixVerse is an innovative AI video creation platform designed to help users easily create high-quality video content. Through advanced generative AI technology, PixVerse can transform text, images and characters into vivid videos, greatly improving the efficiency and flexibility of creation. Whether you're a professional content creator or a casual user, PixVerse provides powerful tools to realize your creative ideas. The platform’s ease of use and powerful features make it unique in the market and suitable for all types of video production needs.
Tavus Conversational Video Interface (CVI) is an innovative video conversation platform that provides face-to-face interactive experiences through digital twin technology. The platform features low-latency (less than one second) instant response capabilities and combines advanced speech recognition, visual processing and conversational awareness to provide users with a rich, natural conversational experience. The platform is easy to deploy and expand, supports customized LLM or TTS, and is suitable for a variety of industries and scenarios.
ReSyncer is an innovative framework built around a style-injected Transformer for efficiently synchronizing audio and video. It can generate high-fidelity lip-synced videos and also supports rapid personalized fine-tuning, video-driven lip syncing, speaking-style transfer, and even face swapping. These capabilities are essential for creating virtual hosts and performers, significantly increasing the naturalness and realism of video content.
EmoTalk3D is a research project focused on 3D virtual head synthesis. It solves the problems of perspective consistency and insufficient emotional expression in traditional 3D head synthesis by collecting multi-view videos, emotional annotations and 3D geometric data per frame. This project proposes a novel approach to achieve emotion-controlled 3D human head synthesis with enhanced lip synchronization and rendering quality by training on the EmoTalk3D dataset. The EmoTalk 3D model is capable of generating 3D animations with a wide viewing angle and high rendering quality, while capturing dynamic facial details such as wrinkles and subtle expressions.
ComfyUI-CogVideoXWrapper is a Python-based ComfyUI wrapper for video generation and conversion that uses the T5 model. It supports image-to-video conversion workflows and showed interesting results during its experimental phase. It is mainly aimed at professional users who need to create and edit video content, especially those with specific video generation and conversion needs.
CogVideoX-2B is an open source video generation model developed by the Tsinghua University team. It supports English prompts, requires about 36GB of GPU memory for inference, and can generate 6-second videos at 8 frames per second with a resolution of 720×480. The model uses sinusoidal position embeddings and currently does not support quantized inference or multi-GPU inference. It is deployed via Hugging Face's diffusers library and can generate videos from text prompts, offering a high degree of creativity and application potential.
CogVideoX is an open source video generation model that has the same origin as the commercial model and supports the generation of video content through text descriptions. It represents the latest progress in text-to-video generation technology, has the ability to generate high-quality videos, and can be widely used in entertainment, education, business promotion and other fields.
Tora is a diffusion transformer (DiT)-based video generation model that enables precise control of video content dynamics by integrating textual, visual and trajectory conditions. Tora's design takes full advantage of DiT's scalability, allowing the generation of high-quality video content at different durations, aspect ratios and resolutions. The model excels in motion fidelity and simulation of physical world movement, opening up new possibilities for video content creation.
Clapper.app is an open source AI story visualization tool that can interpret and render scripts into storyboards, videos, sounds and music. Currently, the tool is still in the early stages of development and is not suitable for ordinary users as some features are not yet complete and there are no tutorials.
Qingying AI video generation service is an innovative artificial intelligence platform designed to generate high-quality video content through intelligent algorithms. The service is suitable for users in a variety of industries and enables the generation of creative visual content quickly and easily. Whether it is commercial advertising, educational courses or entertainment videos, Qingying AI can provide high-quality solutions. This product relies on the advanced GLM large model to ensure the accuracy and richness of the generated content while meeting the personalized needs of users. Free trials are available to encourage users to explore the endless possibilities of AI video creation.
Open-Sora Plan v1.2 is an open source video generation model focused on text-to-video tasks. It adopts a 3D full-attention architecture to optimize the visual representation of videos and improve inference efficiency. The model is innovative in the field of video generation, better capturing joint spatio-temporal features and providing a new technical path for automatic video content generation.
MaskVAT is a video-to-audio (V2A) generative model that exploits the visual features of videos to generate realistic sounds that match the scene. This model places special emphasis on the synchronization of the starting point of the sound with the visual action to avoid unnatural synchronization problems. MaskVAT combines a full-band high-quality universal audio codec and a sequence-to-sequence mask generation model to achieve competitiveness comparable to non-codec generation audio models while ensuring high audio quality, semantic matching, and time synchronization.
Stable Video 4D (SV4D) is a generative model based on Stable Video Diffusion (SVD) and Stable Video 3D (SV3D) that takes a single-view video and generates multiple novel-view videos (a 4D image matrix) of the object. The model is trained to generate 40 frames (5 video frames x 8 camera views) at 576x576 resolution, given 5 reference frames of the same size. In the pipeline, SV3D is first run to produce an orbital video, which is then used as the reference views for SV4D, with the input video serving as the reference frames for 4D sampling. The model can also generate longer novel-view videos by using the first generated frame as an anchor and then densely sampling (interpolating) the remaining frames.
Stable Video 4D is the latest AI model launched by Stability AI, which is able to convert a single object video into multiple novel view videos from eight different angles/views. This technology represents a leap in capabilities from image-based video generation to full 3D dynamic video synthesis. It has potential applications in areas such as game development, video editing, and virtual reality, and is being continuously optimized.
PixVerse V2 is a revolutionary update that empowers every user to create stunning video content with ease. With V2, you can easily create visually stunning movies, even incorporating elements that don't exist in the real world. The main advantages include model upgrades, improved image quality, and consistency between edits.
HeyGen Interactive Avatar is an online AI video generator that focuses on creating and optimizing avatar videos, supporting real-time interaction. It allows users to create an avatar optimized for continuous streaming, while reminding users to maintain minimal head and hand movements. HeyGen's background includes collaborations with big names like Baron David and Ryan Hoover, and the product is currently in beta testing with a free trial available.
Flow Studio is a video generation platform based on artificial intelligence technology that focuses on providing users with high-quality, personalized video content. The platform uses advanced AI algorithms to generate 3-minute videos in a short time, which is better than similar products such as Luma, Pika and Sora. Users can quickly create attractive video content by selecting different templates, characters, and scenes. The main advantages of Flow Studio include fast generation speed, realistic effects, and easy operation.
RunwayML is a leading next-generation creative suite that provides a rich set of tools enabling users to turn any idea into reality. Through its text-to-video generation technology, the app lets users generate videos on their phone using only a text description. Its main advantages include: 1. Text-to-video generation: users only need to enter a text description to generate a video. 2. Real-time updates: new features and updates are released regularly, so users always have the latest AI video and image tools. 3. Seamless asset transfer: users can move assets seamlessly between phone and computer. 4. Multiple subscription options: Standard and Professional subscriptions are available, including plans with 1,000 monthly generation credits.
TCAN is a novel portrait animation framework based on the diffusion model that maintains temporal consistency and generalizes well to unseen domains. The framework uses unique modules such as an appearance-pose adaptation (APPA) layer, a temporal control network, and a pose-driven temperature map to ensure that the generated video preserves the appearance of the source image and follows the pose of the driving video while keeping the background consistent.
MiraData is a large-scale video dataset focused on long video clips with an average length of 72 seconds. It provides structured captions with an average length of 318 words, enriching the description of video content. By using technologies such as GPT-4V, MiraData achieves high accuracy and semantic coherence in video understanding and caption generation.
vta-ldm is a deep learning model focused on video-to-audio generation, capable of generating audio content based on video content that is semantically and temporally aligned with the video input. It represents a new breakthrough in the field of video generation, especially after the significant progress in text-to-video generation technology. This model was developed by Manjie Xu and others from Tencent AI Lab. It has the ability to generate audio that is highly consistent with video content and has important application value in fields such as video production and audio post-processing.
HeyGen Labs provides Expressive Photo Avatar, an online AI video generator that allows users to create avatar videos with expressions and mouth shapes by uploading photos and audio files. This technology uses AI algorithms to simulate the facial expressions and mouth shapes of real people, making video content more vivid and attractive. The product background is to provide users with a simple and fast way to create personalized video content, suitable for a variety of scenarios, such as social media, advertising, education, etc.
Jockey is a conversational video agent built on Twelve Labs API and LangGraph. It combines the capabilities of existing Large Language Models (LLMs) with Twelve Labs' API for task distribution through LangGraph, allocating the load of complex video workflows to the appropriate underlying model. LLMs are used to logically plan execution steps and interact with users, while video-related tasks are passed to the Twelve Labs API powered by Video Foundation Models (VFMs) to process videos natively without the need for intermediary representations like pre-generated subtitles.
LivePortrait is a generative portrait animation model based on an implicit keypoint framework that synthesizes photorealistic videos by using a single source image as a reference for appearance and deriving actions (such as facial expressions and head poses) from driving video, audio, text, or generation. The model not only achieves an effective balance between computational efficiency and controllability, but also significantly improves the generation quality and generalization ability by expanding the training data, adopting a hybrid image-video training strategy, upgrading the network architecture, and designing better motion conversion and optimization goals.
MimicMotion is a high-quality human action video generation model jointly developed by Tencent and Shanghai Jiao Tong University. This model achieves controllability of the video generation process through confidence-aware posture guidance, improves the temporal smoothness of the video, and reduces image distortion. It adopts an advanced image-to-video diffusion model and combines spatiotemporal U-Net and PoseNet to generate high-quality videos of arbitrary length based on pose sequence conditions. MimicMotion significantly outperforms previous methods in several aspects, including hand generation quality, accurate adherence to reference poses, etc.
PAB is a technology for real-time video generation that accelerates the video generation process through Pyramid Attention Broadcast, providing an efficient video generation solution. The main advantages of this technology include real-time performance, efficiency and quality assurance. PAB is suitable for application scenarios that require real-time video generation capabilities, bringing a major breakthrough in the field of video generation.
MOFA-Video is a method that can animate a single picture through various control signals. It adopts sparse-to-dense (S2D) motion generation and flow-based motion adaptation technology, which can effectively animate a single picture using different types of control signals such as trajectories, keypoint sequences, and their combinations. In the training phase, sparse control signals are generated through sparse motion sampling, and then different MOFA-Adapters are trained to generate videos through pre-trained SVD. During the inference phase, different MOFA-Adapters can be combined to jointly control the frozen SVD.
Kling AI is a text-to-video generation model developed by Kuaishou Technology that can generate highly realistic videos from text prompts. It can efficiently generate videos of up to 2 minutes at 30 frames per second, and advanced techniques such as a 3D spatio-temporal joint attention mechanism and physical world simulation give it a significant competitive advantage in AI video generation.
The ShareGPT4Video family aims to facilitate video understanding with Large Video-Language Models (LVLMs) and video generation with Text-to-Video Models (T2VMs) through dense and accurate captions. The series includes: 1) ShareGPT4Video, 40K dense video captions annotated with GPT4V and developed through carefully designed data filtering and annotation strategies. 2) ShareCaptioner-Video, an efficient and capable captioning model for arbitrary videos, which has annotated 4.8M high-quality, aesthetically appealing videos. 3) ShareGPT4Video-8B, a simple but superior LVLM that achieves the best performance on three advanced video benchmarks.
DeepFuze is an advanced deep learning tool integrated seamlessly with ComfyUI to revolutionize facial transformation, lipsyncing, video generation, voice cloning and lipsync translation. Utilizing advanced algorithms, DeepFuze enables users to combine audio and video with unparalleled realism, ensuring perfect synchronization of facial movements. This innovative solution is ideal for content creators, animators, developers, and anyone looking to enhance their video editing projects with advanced AI-driven capabilities.
TikTok Symphony is a new suite of creative solutions powered by generative AI designed to simplify the content creation journey for marketers and creators on TikTok. By combining human imagination with AI-driven efficiency, TikTok Symphony enables businesses, creators, and agencies of all sizes to elevate content creation, increase productivity, and uncover valuable insights.
VideoLLaMA2-7B is a multi-modal large language model developed by the DAMO-NLP-SG team, focusing on the understanding and generation of video content. The model achieves remarkable performance in visual question answering and video captioning, capable of processing complex video content and generating accurate, natural-language descriptions. It is optimized for spatial-temporal modeling and audio understanding, providing powerful support for intelligent analysis and processing of video content.
VideoLLaMA2-7B-Base is a large video-language model developed by DAMO-NLP-SG, focusing on the understanding and generation of video content. The model demonstrates excellent performance in visual question answering and video captioning, providing users with a new video content analysis tool through advanced spatial-temporal modeling and audio understanding capabilities. It is based on the Transformer architecture and can process multi-modal data, combining textual and visual information to produce accurate and insightful output.
VideoLLaMA2-7B-16F-Base is a large video-language model developed by the DAMO-NLP-SG team, focusing on visual question answering and video captioning. The model combines advanced spatial-temporal modeling and audio understanding capabilities to provide powerful support for multi-modal video content analysis. It demonstrates excellent performance on visual question answering and video captioning tasks, handling complex video content and generating accurate descriptions and answers.
Video-to-audio (V2A) technology is a DeepMind innovation that combines video pixels with natural language text cues to generate rich soundscapes synchronized with on-screen action. This technology can be combined with video generation models such as Veo to generate dramatic soundtracks for videos, realistic sound effects, or dialogue that matches the characters and tone of the video. It can also generate soundtracks for traditional material, including archival material, silent films, and more, opening up a wider range of creative opportunities.
UniAnimate is a unified video diffusion model framework for human image animation. It reduces optimization difficulty and ensures temporal coherence by mapping reference images, pose guidance, and noisy videos into a common feature space. UniAnimate can handle long sequences and supports random noise input and first frame conditional input, significantly improving the ability to generate long-term videos. Furthermore, it explores alternative temporal modeling architectures based on state-space models as a replacement for the original computationally intensive temporal Transformer. UniAnimate achieves synthetic results that outperform existing state-of-the-art techniques in both quantitative and qualitative evaluations, and is able to generate highly consistent one-minute videos by iteratively using a first-frame conditional strategy.
LVBench is a benchmark specifically designed for long video understanding, aiming to push the capabilities of multi-modal large language models in understanding hours-long videos, which is critical for practical applications such as long-term decision making, in-depth movie reviews and discussions, live sports commentary, and more.
VideoTetris is a novel framework that implements text-to-video generation and is particularly suitable for processing complex video generation scenarios that contain multiple objects or dynamic changes in the number of objects. The framework accurately follows complex textual semantics through spatial-temporal combination diffusion techniques, and does so by manipulating and combining spatial and temporal attention maps of denoising networks. Additionally, it introduces a new reference frame attention mechanism to improve the consistency of autoregressive video generation. VideoTetris achieves impressive qualitative and quantitative results in combining text-to-video generation.
TC-Bench is a tool specifically designed to evaluate the temporal compositionality of video generation models. It measures a model's ability to introduce new concepts and transform their relations at different points in time, using carefully designed text prompts, corresponding ground-truth videos, and robust evaluation metrics. TC-Bench applies not only to text-conditioned models but also to image-conditioned models capable of generative frame interpolation. The tool was developed to advance video generation technology and improve the quality and consistency of generated videos.
VideoLLaMA 2 is a large language model optimized for video understanding tasks that improves the parsing and understanding of video content through advanced spatial-temporal modeling and audio understanding capabilities. The model demonstrates excellent performance on tasks such as multiple-choice video question answering and video captioning.
MotionClone is a training-free framework that clones motion from a reference video to control text-to-video generation. It uses temporal attention in video inversion to represent the motion of the reference video, and introduces primary temporal-attention guidance to mitigate the influence of noisy or very subtle motion within the attention weights. In addition, to help the generative model synthesize reasonable spatial relationships and improve its prompt-following ability, it proposes a location-aware semantic guidance mechanism that leverages the rough location of the foreground in the reference video together with the original classifier-free guidance features.
Dream Machine API is a Python script that uses Dream Machine API to generate videos, asynchronously checks the video generation status, and outputs the latest generated video link. It requires Python 3.7+ environment and requests and aiohttp library support. Users need to log in to LumaAI's Dream Machine website to obtain an access_token to use the script.
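A minimal sketch of how such a polling script might look, using aiohttp as the entry describes; the endpoint path and JSON field names below are placeholders rather than Luma's documented API, and the access_token is read from an environment variable.

```python
# Hedged sketch of an async status-polling loop in the spirit of the script described above.
# NOTE: the URL and response fields are hypothetical placeholders; consult Luma's actual
# Dream Machine API documentation for the real endpoints and payloads.
import asyncio
import os

import aiohttp

API_BASE = "https://example.com/dream-machine/api"  # placeholder, not the real endpoint
ACCESS_TOKEN = os.environ.get("LUMA_ACCESS_TOKEN", "")

async def poll_generation(session: aiohttp.ClientSession, task_id: str) -> str:
    """Poll a (hypothetical) generation-status endpoint until a video URL is available."""
    headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
    while True:
        async with session.get(f"{API_BASE}/tasks/{task_id}", headers=headers) as resp:
            data = await resp.json()
        if data.get("state") == "completed":   # placeholder field names
            return data["video_url"]
        await asyncio.sleep(10)                # wait before polling again

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        video_url = await poll_generation(session, task_id="YOUR_TASK_ID")
        print("Latest generated video:", video_url)

if __name__ == "__main__":
    asyncio.run(main())
```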
WorldDreamer is an innovative video generation model that understands and simulates world dynamics by predicting occluded visual tokens. It excels in image-to-video synthesis, text-to-video generation, video restoration, video stylization, and motion-to-video generation. This model draws on the success of large language models and treats world modeling as an unsupervised visual sequence modeling challenge by mapping visual inputs to discrete tokens and predicting occluded tokens.
Follow-Your-Pose is a text-to-video generation model that leverages pose information and text descriptions to generate editable, pose-controllable character videos. This technology has important application value in the field of digital character creation, solving the limitations of the lack of comprehensive data sets and prior models for video generation. Through a two-stage training scheme, combined with a pre-trained text-to-image model, pose-controllable video generation is achieved.
SF-V is a diffusion-based video generation model that optimizes the pre-trained model through adversarial training to achieve the ability to generate high-quality videos in a single step. This model significantly reduces the computational cost of the denoising process while maintaining the temporal and spatial dependence of video data, paving the way for real-time video synthesis and editing.
Detail is an app designed specifically for iPad for TikTok enthusiasts, podcast creators, and Instagram influencers. It integrates a powerful video editor, a convenient teleprompter, smart subtitles, and cutting-edge camera technology to make creating stunning videos fast and easy with AI-powered editing features and instant video presets.
Kuaiying is a video editing application officially launched by Kuaishou. It provides comprehensive video editing functions, including editing, audio, subtitles, special effects, etc., aiming to help users easily create interesting and professional video content. It has AI animation video function, which can convert videos into animation style, providing a variety of style choices, such as animation style, Chinese style, Japanese style, etc. In addition, Kuaiying also has AI creation tools, such as AI painting, AI drawings, and AI copywriting library, to assist users in creation. Kuaiying also provides a creation center to help users view data and find inspiration, and provides a powerful material library, including stickers, hot memes, etc., to enhance users' online experience.
Keling Large Model is a self-developed large model with powerful video generation capabilities. It uses advanced technology to achieve up to 2 minutes of video generation, simulates physical world characteristics, concept combination capabilities, etc., and can generate movie-level images.
CamCo is an innovative image-to-video generation framework capable of generating high-quality videos with 3D consistency. The framework introduces camera information via Plücker coordinates and proposes an attention module that enforces epipolar constraints for geometric consistency. In addition, CamCo is fine-tuned on real-world videos whose camera poses are estimated with structure-from-motion algorithms, so that it better synthesizes object motion.
EasyAnimate is a transformer-based pipeline for generating AI images and videos and for training baseline and LoRA models for the Diffusion Transformer. It supports generating videos of about 6 seconds (24 fps) at different resolutions directly from pre-trained EasyAnimate models, and users can also train their own baseline and LoRA models to perform specific style transfers.
AnimateAnyone is a deep learning-based video generation model that can convert static pictures or videos into animations. This model is unofficially implemented by Novita AI, inspired by the implementation of MooreThreads/Moore-AnimateAnyone, and adjusted on the training process and data set.
MusePose is an image-to-video generation framework developed by Lyra Lab of Tencent Music Entertainment. It is designed to generate videos of virtual characters through posture control signals. It is the final building block in the Muse open source series, which, along with MuseV and MuseTalk, aims to move the community toward the vision of generating virtual characters with full-body movement and interaction capabilities. Based on diffusion models and pose guidance, MusePose is able to generate dancing videos of people in reference images, and the result quality surpasses almost all current open source models on the same topic.
StreamV2V is a diffusion model that enables real-time video-to-video (V2V) translation via user prompts. Different from traditional batch processing methods, StreamV2V adopts streaming processing and can process infinite frames of video. Its core is to maintain a feature library that stores information from past frames. For newly incoming frames, StreamV2V directly fuses similar past features into the output by extending self-attention and direct feature fusion technology. The feature library is continuously updated by merging stored and new features, keeping it compact and information-rich. StreamV2V stands out for its adaptability and efficiency, seamlessly integrating with image diffusion models without the need for fine-tuning.
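A conceptual sketch (my own simplification, not StreamV2V's released code) of the feature-bank idea described above: banked keys and values from past frames are concatenated into self-attention for the incoming frame, and the bank is then updated by merging in the new features while staying compact.

```python
# Conceptual sketch of a feature bank with extended self-attention, loosely following
# the description above; tensor shapes and the merging rule are simplifying assumptions.
import torch
import torch.nn.functional as F

class FeatureBank:
    def __init__(self, max_tokens: int = 4096):
        self.keys = None      # (N_bank, d)
        self.values = None    # (N_bank, d)
        self.max_tokens = max_tokens

    def extended_attention(self, q, k, v):
        """Self-attention over current-frame tokens plus banked tokens from past frames."""
        if self.keys is not None:
            k = torch.cat([k, self.keys], dim=0)
            v = torch.cat([v, self.values], dim=0)
        attn = F.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v

    def update(self, k_new, v_new):
        """Merge new features into the bank and keep it compact."""
        self.keys = k_new if self.keys is None else torch.cat([self.keys, k_new], dim=0)
        self.values = v_new if self.values is None else torch.cat([self.values, v_new], dim=0)
        if self.keys.shape[0] > self.max_tokens:   # drop the oldest tokens
            self.keys = self.keys[-self.max_tokens:]
            self.values = self.values[-self.max_tokens:]

# Usage on dummy per-frame tokens:
bank = FeatureBank()
for _ in range(3):                                 # a short stream of frames
    q = k = v = torch.randn(256, 64)               # 256 tokens, 64-dim features per frame
    out = bank.extended_attention(q, k, v)
    bank.update(k, v)
    print(out.shape)                               # torch.Size([256, 64])
```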
V-Express is an avatar video generation model developed by Tencent AI Lab. It balances different control signals through a series of progressive discarding operations, so that the generated video can consider posture, input image and audio at the same time. This model is specifically optimized for situations where the audio signal is weak, solving the challenge of generating avatar videos with varying signal strengths.
Open-Sora-Plan is a text-to-video generation model developed by the Peking University Yuanzu team. It first launched version v1.0.0 in April 2024 and has gained wide recognition in the field of text-to-video generation for its simple and efficient design and remarkable performance. Version v1.1.0 brings significant improvements in video generation quality and duration, including better compressed visual representation, higher generation quality, and the ability to generate longer videos. This version adopts an optimized CausalVideoVAE architecture with stronger performance and higher inference efficiency. It also retains the minimalist design and data efficiency of v1.0.0 and performs similarly to the Sora base model, indicating that its version evolution is consistent with the scaling law demonstrated by Sora.
KREA Video is an online video generation and enhancement tool that leverages advanced artificial intelligence technology to provide users with real-time video generation and editing capabilities. It allows users to upload images or text prompts, generate videos with animation effects, and adjust the duration and keyframes of the video. The main advantages of KREA Video are its ease of operation, user-friendly interface, and ability to quickly generate high-quality video content, making it suitable for content creators, advertising producers, and video editing professionals.
FIFO-Diffusion is a novel inference technique based on pre-trained diffusion models for text-conditioned video generation. It is able to generate infinitely long videos without training, by iteratively performing diagonal denoising while handling gradually increasing noise levels over a sequence of consecutive frames in the queue; the method dequeues a fully denoised frame at the head while enqueuing a new random noise frame at the tail. Furthermore, latent segmentation is introduced to reduce the training inference gap and exploit the benefits of forward references through lookahead denoising.
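The queue mechanics read more clearly in code. The toy sketch below is an illustration under my own assumptions, with a stubbed denoiser standing in for the pre-trained model: it performs diagonal denoising over a queue whose slots carry increasing noise levels, pops the fully denoised frame at the head each step, and pushes a fresh random-noise frame at the tail.

```python
# Toy illustration of FIFO-Diffusion-style diagonal denoising (not the official implementation).
# Each queue slot i holds a frame latent at noise level i; one step denoises every slot by
# one level, emits the fully denoised head frame, and appends a pure-noise frame at the tail.
from collections import deque
import numpy as np

QUEUE_LEN = 8          # number of noise levels / frames held at once (assumed value)
LATENT_SHAPE = (4, 32, 32)

def denoise_one_level(latent: np.ndarray, noise_level: int) -> np.ndarray:
    """Stub for one reverse-diffusion step of a pretrained model; here it just damps noise."""
    return latent * (noise_level / (noise_level + 1.0))

def fifo_generate(num_output_frames: int):
    queue = deque(
        (np.random.randn(*LATENT_SHAPE), level) for level in range(1, QUEUE_LEN + 1)
    )
    outputs = []
    while len(outputs) < num_output_frames:
        # Diagonal denoising: every frame in the queue advances by one noise level.
        queue = deque((denoise_one_level(z, lvl), lvl - 1) for z, lvl in queue)
        head, head_level = queue.popleft()
        assert head_level == 0                 # head frame is now fully denoised
        outputs.append(head)
        queue.append((np.random.randn(*LATENT_SHAPE), QUEUE_LEN))  # fresh noise at the tail
    return outputs

frames = fifo_generate(5)
print(len(frames), frames[0].shape)
```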
Veo is Google's latest video generation model, capable of generating high-quality 1080p resolution videos and supporting a variety of movies and visual styles. Through advanced natural language and visual semantic understanding, it can accurately capture the user's creative vision and generate video content that is consistent with the prompt's tone and rich in detail. Veo models offer an unprecedented level of creative control, understanding cinematic terms like "time-lapse" or "aerial landscape" to create coherent footage that allows people, animals and objects to move realistically within the shot.
AniTalker is an innovative framework capable of generating realistic conversational facial animations from a single portrait. It enhances action expressiveness through two self-supervised learning strategies, while developing an identity encoder through metric learning, effectively reducing the need for labeled data. Not only is AniTalker capable of creating detailed and realistic facial movements, it also highlights its potential for producing dynamic avatars in real-world applications.
AI Video Generator lets everyone create stunning videos from text. Functions include: creative to video, blog to video, PPT to video, tweets to video, avatar video, product to video, etc. Suitable for content creation, business marketing, education and training, e-commerce and other fields.
Video Mamba Suite is a new state-space model suite for video understanding, designed to explore and evaluate the potential of Mamba in video modeling. The suite contains 14 models/modules covering 12 video understanding tasks, demonstrating efficient performance and superiority in video and video-language tasks.
Mira (Mini-Sora) is an experimental project aimed at exploring the field of high-quality, long-term video generation, especially in imitating Sora-style video generation. It builds on existing text-to-video (T2V) generation frameworks and achieves breakthroughs in several key aspects: extending sequence length, enhancing dynamic characteristics, and maintaining 3D consistency. Currently, the Mira project is in the experimental stage and there is still room for improvement compared with more advanced video generation technologies such as Sora.
ID-Animator is a zero-shot human video generation method capable of personalized video generation from a single reference facial image without the need for further training. This technique inherits existing diffusion-based video generation frameworks and incorporates face adapters to encode identity-related embeddings. Through this method, ID-Animator is able to maintain the details of character identity during video generation while improving training efficiency.
VideoGigaGAN is a video super-resolution (VSR) model based on the large-scale image upsampler GigaGAN. It is capable of generating videos with high-frequency detail and temporal consistency. This model significantly improves the temporal consistency of the video by adding a temporal attention layer and feature propagation module, and uses an anti-aliasing block to reduce the aliasing effect. VideoGigaGAN is compared with state-of-the-art VSR models on public datasets and demonstrates 8x super-resolution video results.
Ctrl-Adapter is a Controlnet specially designed for video generation. It provides fine control functions for images and videos, optimizes video time alignment, adapts to a variety of basic models, has video editing capabilities, and significantly improves video generation efficiency and quality.
MA-LMM is a large-scale multi-modal model based on a large language model, designed mainly for long-term video understanding. It processes videos online and uses a memory bank to store past video information, so it can reference historical video content for long-term analysis without exceeding the language model's context length limit or GPU memory limits. MA-LMM can be seamlessly integrated into current multi-modal language models and has achieved leading performance in tasks such as long video understanding, video question answering and video captioning.
text2video is a tool that automatically converts text into video. It uses technologies such as stable-diffusion and edge-tts to segment text, generate matching images and speech, and then combine them with ffmpeg into a video with subtitles and narration. The tool was originally built to give novels a visual reading experience, helping users engage with text content more vividly. It is free to use and can be customized to personal needs.
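A rough sketch of how the text → image + speech → video pipeline described above can be wired together; it assumes edge-tts and ffmpeg are installed, stands in a placeholder image for the stable-diffusion step, and the voice name and ffmpeg flags are illustrative choices only.

```python
# Hedged sketch of a text-to-video segment pipeline in the spirit of text2video:
# synthesize narration with edge-tts, pair it with a (placeholder) image, and mux
# them into a video clip with ffmpeg. Paths, voice and flags are illustrative.
import asyncio
import subprocess

import edge_tts

TEXT = "In a quiet village, the story begins."
VOICE = "en-US-AriaNeural"   # assumed voice name; any installed edge-tts voice works

async def synthesize_speech(text: str, out_path: str) -> None:
    await edge_tts.Communicate(text, VOICE).save(out_path)

def make_segment(image_path: str, audio_path: str, out_path: str) -> None:
    # Loop the still image for the duration of the narration; subtitle burning could be
    # added later with ffmpeg's `subtitles=` filter.
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-loop", "1", "-i", image_path,
            "-i", audio_path,
            "-c:v", "libx264", "-tune", "stillimage", "-pix_fmt", "yuv420p",
            "-c:a", "aac", "-shortest",
            out_path,
        ],
        check=True,
    )

if __name__ == "__main__":
    asyncio.run(synthesize_speech(TEXT, "narration.mp3"))
    # "frame.png" stands in for an image produced by a stable-diffusion step.
    make_segment("frame.png", "narration.mp3", "segment.mp4")
```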
Imagen provides advanced generative media capabilities. Gemini models are ideal for advanced reasoning and general-purpose use cases, while task-specific generative AI models help enterprises deliver specialized capabilities. The text-to-live-image feature previewed today makes Imagen even more powerful for enterprise workloads: it allows marketing and creative teams to generate animated images, such as GIFs, from text prompts. Initially, live images will be delivered at 24 frames per second (fps) at a resolution of 360x640 pixels with a duration of 4 seconds, with ongoing enhancements planned. Because this model is designed for enterprise applications, it excels at subjects such as nature, food imagery, and animals. It can generate a range of camera angles and motions while maintaining consistency across the sequence. Imagen's live-image generation comes with safety filters and digital watermarks to uphold the promise of trust between creators and users. Additionally, Imagen 2.0's image generation capabilities have been updated with advanced photo-editing features, including inpainting and outpainting. These features, now available on Vertex AI, let users easily remove unwanted elements from images, add new elements, and expand image boundaries to create a wider field of view. Finally, the digital watermarking functionality based on Google DeepMind's SynthID technology is now generally available, enabling customers to generate invisible watermarks and to verify images and live images generated by the Imagen family of models.