Found 120 related AI tools
Retro Image Prompt is a retro image prompt generator powered by Google Nano Banana. It supports text-to-image (T2I) and image-to-image (I2I) workflows, helping users quickly create high-quality retro image prompts and retro AI art. Its main advantage is a rich library of retro styles to choose from, with high-quality output and consistent style. Pricing is points-based: users acquire points and spend them on generations. It targets retro image creation and can serve individual artists, designers, and hobbyists.
Seedream4 is an image generator with revolutionary multi-modal AI technology that combines text-to-image generation, precise image editing, and batch creation. Key benefits include fast 1.8 second generation speed, natural language control, complete creative control and enterprise integration availability. Please visit the official website for pricing information.
FLUX.1 Krea [dev] is a 12-billion-parameter rectified flow transformer designed to generate high-quality images from text descriptions. The model is trained with guidance distillation for efficiency, and its open weights support scientific research and artistic creation. The product emphasizes aesthetic photography and strong prompt following, making it a serious competitor to closed-source alternatives. The model can be used for personal, scientific, and commercial purposes, enabling innovative workflows.
OpenDream AI is an online AI art generation platform that uses advanced AI models to convert text prompts into images. Launched in 2023, it aims to democratize graphic design and make visual content creation accessible to everyone. No artistic skills are required: describe what you want to see and let OpenDream's AI create it for you.
Blip 3o is an application hosted on the Hugging Face platform that uses advanced generative models to generate images from text and to analyze and answer questions about existing images. It provides powerful image generation and understanding capabilities, making it well suited to designers, artists, and developers. Its main advantages are fast generation, high-quality output, and support for multiple input forms, which improves the user experience. The product is free and open to all users.
CogView4-6B is a text-to-image generation model developed by the Knowledge Engineering Group of Tsinghua University. It is based on deep learning technology and is able to generate high-quality images based on user-entered text descriptions. The model performs well in multiple benchmarks, especially in generating images from Chinese text. Its main advantages include high-resolution image generation, support for multiple language inputs, and efficient inference speed. This model is suitable for creative design, image generation and other fields, and can help users quickly convert text descriptions into visual content.
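A minimal sketch of how a diffusers-format checkpoint like CogView4-6B could be loaded and sampled. The repo id "THUDM/CogView4-6B", the prompt, and the sampler settings are illustrative assumptions; check the model card for the recommended pipeline and parameters.

```python
# Hedged sketch: generic diffusers loading of a CogView4-style text-to-image checkpoint.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "THUDM/CogView4-6B",          # assumed Hugging Face repo id
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(
    prompt="a red paper lantern hanging over a rainy Beijing hutong at night",
    num_inference_steps=50,       # assumed values; tune per the model card
    guidance_scale=3.5,
).images[0]
image.save("cogview4_sample.png")
```

`DiffusionPipeline.from_pretrained` dispatches to the pipeline class declared in the repo's `model_index.json`, so the same pattern works for other diffusers-format models in this list.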
CogView4 is an advanced text-to-image generation model developed by Tsinghua University. Based on diffusion model technology, it generates high-quality images from text descriptions, supports both Chinese and English input, and can produce high-resolution images. Its main advantages are strong multilingual support and high-quality image generation, making it suitable for users who need to generate images efficiently. The model was demonstrated at ECCV 2024 and has significant research and application value.
DiffSplat is an innovative 3D generation technique that rapidly produces 3D Gaussian splats from text prompts and single-view images. It achieves efficient 3D content generation by leveraging large-scale pre-trained text-to-image diffusion models, addressing the limited datasets and poor reuse of 2D pre-trained models that hamper traditional 3D generation methods, while maintaining 3D consistency. Its main advantages include fast generation (1-2 seconds), high-quality 3D output, and support for multiple input conditions. The model has broad prospects in academic research and industrial applications, especially where high-quality 3D models must be generated quickly.
Fashion-Hut-Modeling-LoRA is a diffusion-based text-to-image generation model mainly used to produce high-quality images of fashion models. With specific training parameters and datasets, it generates fashion-photography images with particular styles and details from text prompts, which is valuable in fashion design and advertising, helping designers and advertisers quickly produce concept images. The model is still in training, so some outputs may be poor, but it already shows strong potential. The training dataset contains 14 high-resolution images; training used the AdamW optimizer with a constant learning-rate scheduler and focused on image detail and quality.
Flux-Midjourney-Mix2-LoRA is a deep-learning text-to-image generation model designed to produce high-quality images from natural-language descriptions. It is based on a diffusion architecture combined with LoRA for efficient fine-tuning and stylized image generation. Its main advantages are high-resolution output, support for diverse styles, and strong handling of complex scenes. It suits users who need high-quality image generation, such as designers, artists, and content creators, helping them quickly realize creative ideas.
NeuralSVG is an implicit neural representation method for generating vector graphics from text prompts. Inspired by Neural Radiance Fields (NeRFs), it encodes an entire scene into the weights of a small multi-layer perceptron (MLP) and optimizes with Score Distillation Sampling (SDS). A dropout-based regularization technique encourages the generated SVG to have a layered structure so that each shape carries independent meaning within the scene. The neural representation also offers inference-time control: from a single learned representation, users can dynamically adjust the generated SVG's color, aspect ratio, and other properties based on the provided input. Extensive qualitative and quantitative evaluation shows NeuralSVG outperforms existing methods at generating structured, flexible SVGs. The model was developed by researchers at Tel Aviv University and MIT CSAIL; the code has not yet been released.
Story-Adapter is a training-free iterative framework designed for long-form story visualization. It optimizes the image generation process through an iterative paradigm and a global reference cross-attention module, maintaining semantic coherence across the story while reducing computational cost. Its significance lies in generating high-quality, detailed images for long stories, addressing the challenges traditional text-to-image models face in long-story visualization, such as semantic consistency and computational feasibility.
DynamicControl is a framework for improving control over text-to-image diffusion models. It adaptively selects different numbers and types of conditions by dynamically combining diverse control signals to synthesize images more reliably and in greater detail. The framework first uses a dual-loop controller, built on pre-trained conditional generative and discriminative models, to produce an initial score ranking for all input conditions. An efficient condition evaluator based on a multimodal large language model (MLLM) then optimizes this ranking. DynamicControl jointly optimizes the MLLM and the diffusion model, using the MLLM's reasoning ability to support multi-condition text-to-image tasks. The final ranked conditions are fed to a parallel multi-control adapter, which learns feature maps of the dynamic visual conditions and integrates them to modulate ControlNet and enhance control over the generated images.
LuminaBrush is an interactive tool for painting lighting effects onto images. It uses a two-stage approach: the first stage converts an image into a "uniformly lit" appearance, and the second generates lighting effects from user scribbles. This decomposition simplifies learning and avoids external constraints (such as light-transport consistency) that a single-stage approach would have to consider. LuminaBrush uses "uniformly lit" appearances extracted from high-quality in-the-wild images to construct paired data for training the final interactive lighting model. The uniform-lighting stage can also be used on its own to "de-light" an image.
fofr/flux-condensation is an AI model that generates images from text. Built with the Diffusers library and LoRA, it generates images from the text prompts users provide. The model was trained on Replicate under the non-commercial flux-1-dev license. It represents the latest advances in text-to-image generation, giving designers, artists, and content creators a powerful tool for visual expression.
Sana is a text-to-image generation framework developed by NVIDIA that can efficiently generate images with resolutions up to 4096×4096. With its speed and strong text-image alignment, Sana can be deployed on laptop GPUs and represents an important advance in image generation technology. The model is based on a linear diffusion transformer and uses a pre-trained text encoder and a spatially compressed latent feature encoder to generate and modify images from text prompts. Sana's source code is available on GitHub, and it has broad research and application prospects, especially in artistic creation, educational tools, and model research.
Sana is a text-to-image generation framework developed by NVIDIA that can efficiently generate images with resolutions up to 4096×4096. Its speed and strong text-image alignment allow it to be deployed on laptop GPUs. This checkpoint is a 1648M-parameter linear diffusion transformer (text-to-image generative model) used to generate multi-scale images at a 1024px base resolution. The main advantages of the Sana model include high-resolution image generation, fast synthesis, and strong text-image alignment. Sana is developed as open source: the code can be found on GitHub, and the model follows the CC BY-NC-SA 4.0 License.
shou_xin is a text-to-image generative model that produces hand-drawn-style pencil sketch images from user text prompts. The model is built with the diffusers library and LoRA to achieve high-quality image generation. With its distinctive artistic style and efficient generation, shou_xin holds a place in the image generation field and is especially suitable for users who need to quickly produce images in a specific artistic style.
Sana is a text-to-image framework developed by NVIDIA that can efficiently generate images with resolutions up to 4096×4096. The model synthesizes high-resolution, high-quality images at blazing speed, maintains strong text-image alignment, and can be deployed on laptop GPUs. The Sana model is based on a linear diffusion transformer, uses a pre-trained text encoder and a spatially compressed latent feature encoder, and supports Emoji, Chinese and English, and mixed prompts.
Bylo.ai is an advanced AI image generator that quickly converts text descriptions into high-quality images. It supports negative prompts and a variety of models, including the popular Flux AI image generator, allowing users to customize their creations. With free online access, fast generation, advanced customization options, flexible image settings, and high-quality output, Bylo.ai is well suited to both personal and commercial use.
AWPortraitCN is a text-to-image generation model developed based on FLUX.1-dev, specially trained for the appearance and aesthetics of Chinese people. It contains multiple types of portraits, such as indoor and outdoor portraits, fashion and studio photos, with strong generalization capabilities. Compared with the original version, AWPortraitCN is more delicate and realistic in skin texture. In order to pursue a more realistic original image effect, it can be used together with the AWPortraitSR workflow.
Sana is a text-to-image framework developed by NVIDIA that can efficiently generate images with resolutions up to 4096×4096. Sana is able to synthesize high-resolution, high-quality images extremely quickly with strong text-to-image alignment capabilities and can be deployed on laptop GPUs. The model is based on a linear diffusion transformer, uses a fixed pre-trained text encoder and a spatially compressed latent feature encoder, and supports mixed prompts in English, Chinese and emoji. Sana's main advantages include high efficiency, high-resolution image generation capabilities, and multi-language support.
Sana is a text-to-image generation framework developed by NVIDIA that efficiently generates high-definition images with strong text-image consistency at resolutions up to 4096×4096, and it is fast enough to be deployed on laptop GPUs. The model is based on a linear diffusion transformer, using a pre-trained text encoder and a spatially compressed latent feature encoder. Its ability to generate high-quality images quickly has significant implications for art creation, design, and other creative fields. The Sana model is licensed under CC BY-NC-SA 4.0 and the source code is available on GitHub.
Sana is a text-to-image generation framework developed by NVIDIA that can efficiently generate images with resolutions up to 4096×4096. Sana is known for its fast speed, powerful text-image alignment capabilities, and the fact that it can be deployed on laptop GPUs. This model is based on a linear diffusion transformer using a pretrained text encoder and a spatially compressed latent feature encoder, and represents the latest advancement in text-to-image generation technology. Sana's key advantages include high-resolution image generation, fast synthesis, deployability on laptop GPUs, and open source code, making it valuable in research and practical applications.
MV-Adapter is an adapter-based multi-view image generation solution that enhances pre-trained text-to-image (T2I) models and their derived models without changing the original network structure or feature space. By updating fewer parameters, MV-Adapter achieves efficient training and retains the prior knowledge embedded in the pre-trained model, reducing the risk of overfitting. This technology enables the adapter to inherit the strong priors of the pre-trained model to model new 3D knowledge through innovative designs such as replicated self-attention layers and parallel attention architectures. In addition, MV-Adapter also provides a unified conditional encoder, seamlessly integrates camera parameters and geometric information, and supports applications such as text- and image-based 3D generation and texture mapping. MV-Adapter implements 768-resolution multi-view generation on Stable Diffusion XL (SDXL) and demonstrates its adaptability and versatility, which can be extended to arbitrary view generation, opening up wider application possibilities.
text-to-pose is a research project that aims to generate human poses from text descriptions and use these poses to generate images. The technology combines natural language processing and computer vision to enable text-to-image generation by improving the control and quality of diffusion models. The project background is based on papers published at the NeurIPS 2024 Workshop, which is innovative and cutting-edge. Key advantages of the technology include improved accuracy and controllability of image generation, as well as potential applications in areas such as artistic creation and virtual reality.
Sana is a text-to-image framework capable of efficiently generating images with resolutions up to 4096×4096. It synthesizes high-resolution, high-quality images extremely quickly, maintains strong text-image alignment, and can be deployed on laptop GPUs. Sana's core design includes a deep-compression autoencoder, linear diffusion transformers (DiTs), a decoder-only small language model as the text encoder, and efficient training and sampling strategies. Sana-0.6B is 20 times smaller than modern large-scale diffusion models, with measured throughput more than 100 times higher. It can be deployed on a 16GB laptop GPU and generates a 1024×1024 image in under one second, making low-cost content creation possible.
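A hedged sketch of running a Sana checkpoint through the generic diffusers loader, matching the 1024px use case described above. The repo id "Efficient-Large-Model/Sana_600M_1024px_diffusers" and the sampler settings are assumptions for illustration; consult the official Sana model card for the real repository and recommended parameters.

```python
# Hedged sketch: sampling a 1024px image from an assumed diffusers-format Sana repo.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_600M_1024px_diffusers",  # assumed repo id
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")  # the 0.6B variant is reported to fit on a 16GB laptop GPU

image = pipe(
    prompt="a cyberpunk street market at dusk, neon signs, rain-slick pavement",
    height=1024,
    width=1024,
    num_inference_steps=20,   # assumed values; tune per the model card
    guidance_scale=4.5,
).images[0]
image.save("sana_1024.png")
```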
Stable Diffusion 3.5 ControlNets is a set of text-to-image control networks (ControlNets) provided by Stability AI, covering control types such as Canny edge detection, depth maps, and high-fidelity upscaling. The models generate high-quality images from text prompts and are particularly suited to illustration, architectural rendering, and 3D asset textures. Their significance lies in offering finer-grained image control and improving the quality and detail of generated images. Background information includes an academic citation (arxiv:2302.05543) and the Stability Community License. Use is free for non-commercial purposes and for commercial use with annual revenue under US$1 million; above that threshold, an enterprise license is required.
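A hedged sketch of pairing one of these control networks with the SD3.5 base model using diffusers' SD3 ControlNet classes. The ControlNet repo id, the precomputed edge map, and the conditioning scale are assumptions for illustration, not the official setup.

```python
# Hedged sketch: Canny-conditioned generation with Stable Diffusion 3.5 in diffusers.
import torch
from diffusers import SD3ControlNetModel, StableDiffusion3ControlNetPipeline
from diffusers.utils import load_image

controlnet = SD3ControlNetModel.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large-controlnet-canny",  # assumed repo id
    torch_dtype=torch.bfloat16,
)
pipe = StableDiffusion3ControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    controlnet=controlnet,
    torch_dtype=torch.bfloat16,
).to("cuda")

canny_map = load_image("canny_edges.png")  # precomputed Canny edge map of the target layout
image = pipe(
    prompt="isometric illustration of a cozy reading nook, warm lighting",
    control_image=canny_map,
    controlnet_conditioning_scale=0.8,  # assumed value; tune per control type
    num_inference_steps=28,
).images[0]
image.save("sd35_canny.png")
```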
FLUX.1-dev-IP-Adapter is an IP-Adapter for the FLUX.1-dev model, developed by the InstantX Team. It lets the pipeline work with images as flexibly as with text, making image generation and editing more efficient and intuitive. It supports image reference but is not suited to fine-grained style transfer or character consistency. The model was trained on a 10M-image open-source dataset with a batch size of 128 for 80K steps. It is an innovative option in the image generation field and can provide diverse generation results, but coverage of some styles or concepts may be insufficient.
FLUX.1 Tools is a suite of models launched by Black Forest Labs that adds control and steerability to the FLUX.1 text-to-image model, making it possible to modify and re-create real and generated images. The suite consists of four distinct features, released as open-access models in the FLUX.1 [dev] model family and as additions to the BFL API supporting FLUX.1 [pro]. Its main capabilities include advanced inpainting and outpainting, structural guidance, and image variation and reconstruction, all of which are significant for image editing and creation.
Edify Image is an image generation model launched by NVIDIA that can generate realistic image content with pixel-level accuracy. The model adopts a cascaded pixel spatial diffusion model and is trained through a novel Laplacian diffusion process, which is able to attenuate image signals at different rates in different frequency bands. Edify Image supports a variety of applications, including text-to-image synthesis, 4K upsampling, ControlNets, 360° HDR panorama generation and image customization fine-tuning. It represents the latest progress in image generation technology and has broad application prospects and important commercial value.
FLUX.1-dev LoRA Outfit Generator is a text-to-image model that generates clothing based on the color, pattern, fit, style, material, and type specified by the user. It was trained on the H&M Fashion Captions Dataset and built with Ostris' AI Toolkit. Its value lies in helping designers quickly realize design ideas and accelerating innovation and production in the clothing industry.
Regional-Prompting-FLUX is a training-free regional prompting method for diffusion transformers, bringing fine-grained compositional text-to-image generation to models such as FLUX without any additional training. It is not only effective but also highly compatible with LoRA and ControlNet, and it reduces GPU memory usage while maintaining high speed.
Stable Diffusion 3.5 Medium is an artificial intelligence-based image generation model provided by Stability AI that can generate high-quality images based on text descriptions. The importance of this technology lies in its ability to greatly promote the development of creative industries, such as game design, advertising, art creation and other fields. Stable Diffusion 3.5 Medium is favored by users for its efficient image generation capabilities, ease of use and low resource consumption. The model is currently available as a free trial to users on the Hugging Face platform.
Stable Diffusion 3.5 Medium is a text-to-image generative model developed by Stability AI with improved image quality, typography, complex prompt understanding, and resource efficiency. The model uses three fixed pre-trained text encoders, improves training stability through QK normalization, and introduces dual attention blocks in the first 12 transformer layers. It excels in multi-resolution image generation, consistency, and adaptability to various text-to-image tasks.
Flux.1 Lite is an 8B-parameter text-to-image generation model released by Freepik, distilled from the FLUX.1-dev model. This version uses 7GB less RAM and runs 23% faster than the original while keeping the same precision (bfloat16). Its release aims to make high-quality AI models more accessible, especially to users of consumer GPUs.
Stable Diffusion 3.5 Large Turbo is a Multimodal Diffusion Transformer (MMDiT) model for text-based image generation that uses adversarial diffusion distillation (ADD) technology to improve image quality, typography, complex prompt understanding, and resource efficiency, with a special focus on reducing inference steps. The model performs well in generating images, is able to understand and generate complex text prompts, and is suitable for a variety of image generation scenarios. It is released on the Hugging Face platform and follows the Stability Community License, which is suitable for research, non-commercial use, and free use by organizations or individuals with annual income of less than $1 million.
Stable Diffusion 3.5 Large is a Multimodal Diffusion Transformer (MMDiT) text-to-image model developed by Stability AI. It offers significant improvements in image quality, typography, complex prompt understanding, and resource efficiency. It uses three fixed pre-trained text encoders and improves training stability through QK normalization. Its training data and strategy combine synthetic data with filtered publicly available data. Under the Stability Community License, the model is free for research, non-commercial use, and commercial use by organizations or individuals with annual revenue under $1 million.
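A minimal text-to-image sketch for Stable Diffusion 3.5 Large using diffusers' StableDiffusion3Pipeline. The sampler settings are illustrative defaults rather than official recommendations, and the gated repository requires accepting the license on Hugging Face and authenticating with an access token.

```python
# Minimal sketch: prompt-to-image with Stable Diffusion 3.5 Large via diffusers.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="a hand-lettered poster that says 'OPEN STUDIO', risograph texture",
    negative_prompt="blurry, low contrast",
    num_inference_steps=28,   # assumed settings; adjust to taste
    guidance_scale=4.5,
).images[0]
image.save("sd35_large.png")
```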
Stable Diffusion 3.5 is a lightweight package for simple inference that contains the text encoders, VAE decoder, and core MM-DiT. It is designed to help partner organizations implement SD3.5 and can be used to generate high-quality images. Its value lies in efficient inference and low resource requirements, making image generation available to a wide range of users. The model is subject to the Stability AI Community License Agreement and is free to use.
SD3.5-LoRA-Linear-Red-Light is an AI model based on text-to-image generation. By using LoRA (Low-Rank Adaptation) technology, the model is able to generate high-quality images based on text prompts provided by users. The importance of this technique lies in its ability to fine-tune the model at a low computational cost while maintaining the diversity and quality of the generated images. The model is based on the Stable Diffusion 3.5 Large model, which has been optimized and adjusted to suit specific image generation needs.
FLUX.1-dev-LoRA-Text-Poster is a text-to-image model developed by Shakker-Labs specifically for generating artistic text posters. It uses LoRA to generate images from text prompts, offering an innovative way to create artwork. The model was trained by its copyright holder, cooooool, and is shared on the Hugging Face platform to promote community exchange and development. It is released under the flux-1-dev non-commercial license.
ComfyGen is an adaptive workflow system focused on text-to-image generation that automates and customizes efficient workflows by learning user prompts. The advent of this technology marks a shift from the use of a single model to complex workflows that combine multiple specialized components to improve the quality of image generation. The main benefit behind ComfyGen is the ability to automatically adjust the workflow based on the user's text prompts to produce higher quality images, which is important for users who need to produce images of a specific style or theme.
An open-source text-to-image generation model developed by a Tsinghua University team, with broad application prospects in image generation and the advantage of high-resolution output.
Flux Ghibsky Illustration is a text-to-image model that combines the fantastical detail of Hayao Miyazaki's Studio Ghibli with the serene skies of Makoto Shinkai's work to create enchanting scenes. It is particularly suited to fantastical visuals, and users can generate images with this distinctive aesthetic via a specific trigger word. It is an open-source project on the Hugging Face platform; users can download the model and run it on Replicate.
Easy Anime Maker is an AI anime generator that uses deep learning techniques such as generative adversarial networks to convert text descriptions or uploaded photos into anime-style artwork. Its significance is that it lowers the threshold for creating anime art, letting users without professional drawing skills create personalized anime images. It is an online platform where users generate anime art through simple text prompts or photo uploads, ideal for anime enthusiasts and professionals who need anime-style images quickly. A free trial is offered: users receive 5 free points on registration, and additional points can be purchased without a subscription.
FLUX.1-Turbo-Alpha is an 8-step distilled LoRA based on the FLUX.1-dev model, released by the AlimamaCreative Team. It uses a multi-head discriminator to improve distillation quality and can be used with FLUX-related models such as text-to-image (T2I) and inpainting ControlNets. A guidance scale of 3.5 and a LoRA scale of 1 are recommended. The model was trained on 1M open-source and in-house images using adversarial training, freezing the original FLUX.1-dev transformer as the discriminator backbone and adding multiple heads to each transformer layer.
FLUX.1-dev-LoRA-One-Click-Creative-Template is a LoRA-trained image generation model provided by Shakker-Labs. It focuses on creative photo generation and can turn users' text prompts into creative images. The model uses advanced text-to-image technology and is particularly suited to users who need to generate high-quality images quickly. Hosted on the Hugging Face platform, it is easy to deploy and use. Non-commercial use is free; commercial use must comply with the corresponding license agreement.
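A hedged sketch of applying a FLUX.1-dev LoRA such as the Shakker-Labs templates above with diffusers. The LoRA repo id, LoRA scale, and prompt wording are placeholders; take the real values and any trigger phrase from the specific model card.

```python
# Hedged sketch: loading a LoRA on top of FLUX.1-dev with diffusers.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Attach the LoRA weights on top of the frozen base model.
# Pass weight_name=... if the repo contains more than one weights file.
pipe.load_lora_weights(
    "Shakker-Labs/FLUX.1-dev-LoRA-One-Click-Creative-Template",  # assumed repo id
)
pipe.fuse_lora(lora_scale=0.9)  # assumed scale; many LoRA cards suggest 0.8-1.0

image = pipe(
    prompt="creative template photo of a product flat lay, pastel background",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("flux_lora_template.png")
```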
Flux 1.1 Pro AI is an advanced artificial intelligence-based image generation platform that leverages cutting-edge AI technology to transform users' text prompts into high-quality visuals. The platform delivers 6x faster image generation, significantly improved image quality, and enhanced compliance with prompts. Flux 1.1 Pro AI is not only suitable for artists and designers, but also for content creators, marketers and other professionals, helping them realize visual ideas in their respective fields and improve creative efficiency and quality.
FLUX1.1 [pro] is the latest image generation model released by Black Forest Labs, which has significant improvements in speed and image quality. This model delivers six times the speed of its predecessor while improving image quality, prompt compliance, and diversity. FLUX1.1 [pro] also provides more advanced customization options and better cost performance, suitable for developers and enterprises that require efficient, high-quality image generation.
OpenFLUX.1 is a fine-tuned version of the FLUX.1-Schnell model with the distillation removed, making it fine-tunable, released under the permissive open-source Apache 2.0 license. The model generates striking images in just 1-4 steps. It is an attempt to undo the distillation and provide an openly licensed model that can be fine-tuned.
Stable Video Portraits is an innovative hybrid 2D/3D generation method that combines a pre-trained text-to-image model (2D) with a 3D morphable face model (3D) to generate realistic dynamic face videos. The technique upgrades a general 2D stable diffusion model into a video model through person-specific fine-tuning: conditioned on a time series of 3D morphable model parameters and using a temporal denoising procedure, it generates temporally smooth face images that can be edited and morphed into text-defined celebrity identities without additional test-time fine-tuning. The method outperforms existing monocular head avatar approaches in both quantitative and qualitative evaluation.
CogView3 is a cascaded text-to-image generation system based on the relay diffusion framework. It breaks high-resolution image generation into multiple stages: Gaussian noise is added to the low-resolution results, and the relayed super-resolution diffusion process then starts from these noisy images. CogView3 surpasses SDXL in image generation, with faster generation and higher image quality.
Prompt Llama is an AI model testing platform focused on text-to-image generation. It allows users to collect high-quality text prompts and test the performance of different models under the same prompt. The platform supports a variety of AI models, including but not limited to midjourney, DALL·E 3, Firefly, etc., and is a valuable resource for researchers and enthusiasts in the field of AI image generation.
FLUX AI Image Generator is an innovative image generation model capable of generating high-quality images based on textual prompts. The importance of FLUX.1 lies in its ability to democratize high-quality content creation tools, providing a streamlined solution for professionals and amateurs alike, allowing users to produce professional-grade visuals without requiring extensive technical knowledge or resources.
StoryMaker is an AI model focused on text-to-image generation that can generate coherent images of characters and scenes based on text descriptions. By combining advanced image generation technology with face encoding technology, it provides users with a powerful tool for creating storytelling visual content. The main advantages of this model include efficient image generation capabilities, precise control of details, and high responsiveness to user input. It has broad application prospects in the creative industries, advertising and entertainment fields.
Concept Sliders is a technique for precise control of concepts in diffusion models. Applied on top of pre-trained models via low-rank adapters (LoRA), it lets artists and users train sliders that control the direction of specific attributes using simple text descriptions or image pairs. Its main advantage is the ability to make subtle adjustments to a generated image, such as eye size or lighting, without changing the overall structure, enabling finer control. It gives artists a new mode of creative expression while also addressing the problem of blurry or distorted generations.
Pony Diffusion V6 XL is a text-to-image diffusion model specifically designed to generate high-quality pony-themed artwork. It was fine-tuned on a dataset of approximately 80,000 pony images, ensuring that the resulting images are both relevant and attractive. The model offers a user-friendly interface, is easy to use, and employs CLIP-based aesthetic ranking to improve image quality. Pony Diffusion is provided under the CreativeML OpenRAIL license, which allows users to freely use, redistribute, and modify the model.
Flux Image Generator is a tool that uses advanced AI models to quickly turn users' ideas into high-quality images. It offers three model variants: FLUX.1 [schnell] for rapid local development and personal use, FLUX.1 [dev], a guidance-distilled model for non-commercial applications, and FLUX.1 [pro], which provides state-of-the-art image generation. The tool is suitable for personal projects as well as commercial use and can meet the needs of different users.
RECE is a concept erasure technique for text-to-image diffusion models, which achieves reliable and efficient erasure of specific concepts by introducing regularization terms during the model training process. This technology is important for improving the security and control of image generation models, especially in scenarios where the generation of inappropriate content needs to be avoided. The main advantages of RECE technology include high efficiency, high reliability and easy integration into existing models.
CSGO is a text-to-image generation model based on content-style composition. Through a data construction pipeline it generates and automatically cleans stylized data triplets, building IMAGStyle, the first large-scale style transfer dataset, containing 210k image triplets. The CSGO model is trained end to end, explicitly decouples content and style features, and realizes this through independent feature injection. It supports image-driven style transfer, text-driven style synthesis, and text-editing-driven style synthesis, with the advantages of fine-tuning-free inference, preservation of the original text-to-image model's generation ability, and a unified treatment of style transfer and style synthesis.
AuraFlow v0.3 is a fully open-source flow-based text-to-image generation model. Compared with the previous version, AuraFlow-v0.2, it was trained with more compute and fine-tuned on an aesthetic dataset, and it supports various aspect ratios with width and height up to 1536 pixels. The model achieved state-of-the-art results on GenEval and is currently in beta; it is being continuously improved, and community feedback is very important.
half_illustration is a text-to-image generation model based on the Flux Dev 1 model that combines photography and illustration elements to create artistic images. This model uses LoRA technology, which can maintain style consistency through specific trigger words, and is suitable for use in the fields of art creation and design.
FLUX.1-dev-Controlnet-Union-alpha is a text-to-image generative model in the Diffusers family that uses ControlNet for control. The currently released alpha version is not yet fully trained, but it demonstrates that the code works. The model aims to advance the Flux ecosystem through the rapid growth of the open-source community. Although a fully trained Union model may not match a specialized model in specific areas such as pose control, its performance will continue to improve as training progresses.
flux-RealismLora is a LoRA technology based on the FLUX.1-dev model released by the XLabs AI team for generating realistic images. The technology generates images from text prompts and supports a variety of styles, such as animation, fantasy, and naturalistic cinema. XLabs AI provides training scripts and configuration files to facilitate user model training and use.
flux-controlnet-canny is a ControlNet Canny model based on the FLUX.1-dev model developed by the XLabs AI team for text-to-image generation. After training, the model can generate high-quality images based on text prompts and is widely used in the fields of creative design and visual arts.
TexGen is an innovative multi-view sampling and resampling framework for synthesizing 3D textures from arbitrary textual descriptions. It utilizes pre-trained text-to-image diffusion models, multi-view sampling strategies through consistent view sampling and attention guidance, and noise resampling techniques to significantly improve the texture quality of 3D objects with a high degree of view consistency and rich appearance details.
Flux AI is an advanced text-to-image AI model developed by Black Forest Labs that uses a transformer-based flow model to generate high-quality images. Its key advantages include superior visual quality, strict prompt adherence, diversity of sizes and aspect ratios, typography, and output diversity. Flux AI comes in three variants, FLUX.1 [pro], FLUX.1 [dev], and FLUX.1 [schnell], targeting different usage scenarios and performance levels. Flux AI is committed to making cutting-edge AI accessible to everyone: by providing FLUX.1 [schnell] as a free open-source model, it ensures that individuals, researchers, and small developers can benefit from advanced AI without financial barriers.
Phantasma Anime is an anime-style illustration model focused on fantasy themes. Using text-to-image generation, it provides users with anime illustrations rich in specific effect details. The model is flexible and renders fantasy elements well, making it suitable for users who need to quickly generate anime-style images.
FLUX.1 [schnell] is a 12-billion-parameter rectified flow transformer capable of generating images from text descriptions. It is known for its cutting-edge output quality and competitive prompt-following, matching the performance of closed-source alternatives. The model is trained using latent adversarial diffusion distillation and can generate high-quality images in 1 to 4 steps. FLUX.1 [schnell] is released under the Apache-2.0 license and may be used for personal, scientific, and commercial purposes.
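A minimal sketch of the few-step generation described above, using diffusers' FluxPipeline. The settings mirror the commonly documented schnell usage (4 steps, guidance disabled); treat them as illustrative rather than authoritative.

```python
# Minimal sketch: 4-step generation with the timestep-distilled FLUX.1 [schnell].
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # optional: trade speed for lower VRAM use

image = pipe(
    prompt="macro photo of a dew-covered spiderweb at sunrise",
    num_inference_steps=4,    # schnell is distilled for 1-4 steps
    guidance_scale=0.0,       # the distilled model does not use classifier-free guidance
    max_sequence_length=256,
).images[0]
image.save("flux_schnell.png")
```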
FLUX.1-dev is a 12-billion-parameter rectified flow transformer that generates images from text descriptions. It represents the latest development in text-to-image generation, with output quality second only to its professional counterpart, FLUX.1 [pro]. The model gains efficiency through guidance distillation, and its open weights are intended to drive new scientific research and empower artists to build innovative workflows. Generated outputs may be used for personal, scientific, and commercial purposes as described in the flux-1-dev-non-commercial-license.
Adobe Firefly Vector AI is a family of creative generative AI models launched by Adobe, designed to enhance creative work with generative AI. Firefly models and services are used in Adobe creative applications such as Photoshop, Illustrator, and Lightroom. It helps users generate rich, realistic images and artwork with unprecedented control and creativity through features such as Text to Image, Generative Fill, Generative Expand, and more. Firefly's training data includes Adobe Stock licensed content, openly licensed content, and public domain content, ensuring it is safe for commercial use. Adobe is committed to developing generative AI responsibly and works closely with the creative community to continuously improve the technology to support and enhance the creative process.
AuraFlow v0.1 is a fully open source, flow-based text-to-image generation model that achieves state-of-the-art results on GenEval. The model is currently in the beta stage and is being continuously improved, and community feedback is crucial. Thanks to the two engineers @cloneofsimo and @isidentical for making this project a reality, and to the researchers who laid the foundation for it.
Kolors is a large-scale text-to-image generation model developed by the Kuaishou Kolors team, based on a latent diffusion model and trained on billions of text-image pairs. It outperforms both open-source and closed-source models in terms of visual quality, accuracy of complex semantics, and rendering of Chinese and English text. Kolors supports Chinese and English input, and is especially good at understanding and generating specific Chinese content.
This is a LoRA adaptive weight model based on stabilityai/stable-diffusion-xl-base-1.0, specially designed for generating watercolor illustration style images. It enhances the specific style generation capabilities of the original model through LoRA technology, allowing users to more accurately control the style of generated images.
Midsommar Cartoon is an image-generating model that combines retro style with anime elements. It is based on stable-diffusion technology and can generate illustrations with Nordic cartoon characteristics through text-to-image conversion. The model supports loading on the Inference API, allowing users to easily transform text descriptions into visual images.
AsyncDiff is an asynchronous denoising acceleration scheme for parallelizing diffusion models. It enables parallel processing of the model by splitting the noise prediction model into multiple components and distributing them to different devices. This approach significantly reduces inference latency with minimal impact on generation quality. AsyncDiff supports multiple diffusion models, including Stable Diffusion 2.1, Stable Diffusion 1.5, Stable Diffusion x4 Upscaler, Stable Diffusion XL 1.0, ControlNet, Stable Video Diffusion, and AnimateDiff.
Stable Diffusion 3 is the latest text-to-image model developed by Stability AI, with significantly improved image fidelity, multi-subject handling, and text rendering. Built on the Multimodal Diffusion Transformer (MMDiT) architecture, it uses separate image and language representations, supports API, download, and online-platform access, and suits a variety of application scenarios.
InstantX is an independent research organization focused on AI content generation, working on text-to-image generation technology. Its research projects include style-preserving text-to-image generation (InstantStyle) and zero-shot identity-preserving generation (InstantID). The organization updates and communicates projects through the GitHub community to promote the application and development of AI in the field of image generation.
Stable Diffusion 3 Medium is the most advanced text-to-image model Stability AI has released to date. It has 2 billion parameters, delivers excellent detail, color, and lighting, and supports a variety of styles. The model understands long texts and complex prompts well and can generate images with spatial reasoning, compositional elements, motion, and style. It also achieves unprecedented text quality, reducing errors in spelling, kerning, letter formation, and spacing. The model is highly resource-efficient and suitable for running on standard consumer-grade GPUs. It also supports fine-tuning and can absorb fine details from small datasets, making it ideal for customization.
HyperDreamBooth is a hypernetwork developed by Google Research for rapid personalization of text-to-image models. By generating a small set of personalized weights from a single face image, combined with rapid fine-tuning, it is able to generate face images with high thematic detail in multiple contexts and styles, while maintaining the model's critical knowledge of diverse styles and semantic modifications.
SDXL Flash is a text-to-image generation model launched by the SD community in collaboration with Project Fluently. It provides faster processing than LCM, Turbo, Lightning and Hyper while maintaining the quality of the generated images. This model is based on Stable Diffusion XL technology and achieves high efficiency and high quality of image generation by optimizing steps and CFG (Guidance) parameters.
Slicedit is a zero-shot video editing technology that utilizes a text-to-image diffusion model and combines spatiotemporal slicing to enhance temporal consistency in video editing. This technology is able to preserve the structure and motion of the original video while complying with the target text description. Through extensive experiments, Slicedit has been proven to have clear advantages in editing real-world videos.
Imagen 3 is Google's advanced text-to-image generative model that generates images with extremely high levels of detail and photorealism, with significantly fewer visually distracting elements than previous models. The model has a deeper understanding of natural language, can better grasp the intent behind prompts, and extract details from longer prompts. Additionally, Imagen 3 excels at rendering text, opening up new possibilities for personalized birthday messages, presentation title slides, and more.
Lumina-T2X is an advanced text-to-any-modality generation framework that converts text descriptions into vivid images, dynamic videos, detailed multi-view 3D images, and synthesized speech. The framework uses a flow-based large diffusion transformer (Flag-DiT) that scales up to 7 billion parameters and extends sequence lengths to 128,000 tokens. Lumina-T2X unifies images, videos, multi-views of 3D objects, and speech spectrograms in a spatiotemporal latent token space and can generate output at any resolution, aspect ratio, and duration.
ID-Aligner is a feedback learning framework for enhanced identity-preserving text-to-image generation, which uses reward feedback learning to solve the problems of identity feature preservation, aesthetic appeal of generated images, and compatibility with LoRA and Adapter methods. The approach leverages feedback from face detection and recognition models to improve generated identity preservation, and provides aesthetic adjustment signals via human annotated preference data and automatically constructed feedback. ID-Aligner is suitable for LoRA and Adapter models, and its effectiveness has been verified through extensive experiments.
PixArt-Sigma is a PyTorch-based collection of model definitions, pre-trained weights, and inference/sampling code for exploring weak-to-strong training of diffusion transformers for 4K text-to-image generation. It supports image generation from low to high resolution and provides a variety of features and benefits such as a fast experience, user-friendly code base and multiple model choices.
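A hedged sketch of sampling from a PixArt-Sigma checkpoint with diffusers' PixArtSigmaPipeline; the project also ships its own training and inference code. The repo id "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS" and the sampler settings are assumptions for illustration.

```python
# Hedged sketch: sampling a 1024px image from a PixArt-Sigma checkpoint via diffusers.
import torch
from diffusers import PixArtSigmaPipeline

pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",  # assumed repo id for the 1024px model
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="an astronaut planting a flag made of wildflowers on the moon",
    num_inference_steps=20,   # assumed values; tune per the model card
    guidance_scale=4.5,
).images[0]
image.save("pixart_sigma.png")
```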
Stable Diffusion 3 is an advanced text-to-image generation system that rivals or betters top systems like DALL-E 3 and Midjourney v6 in terms of layout and prompt following. The system uses a new multi-modal diffusion transformer (MMDiT) architecture that uses different sets of weights to improve the representation of images and language, thereby improving text understanding and spelling capabilities. The Stable Diffusion 3 API is now live on the Stability AI developer platform, partnering with Fireworks AI to provide fast and reliable API service, with a commitment to open model weights for self-hosting through Stability AI membership in the near future.
ViewDiff is a method that uses pre-trained text-to-image models as prior knowledge to learn to generate multi-view consistent images from real-world data. It adds 3D volume rendering and cross-frame attention layers to the U-Net network, which can generate 3D consistent images in a single denoising process. Compared with existing methods, ViewDiff generates results with better visual quality and 3D consistency.
InstantStyle is a general framework that leverages two simple but powerful techniques to achieve effective separation of style and content in reference images. Principles include separating content from images, injecting only style blocks, and providing features such as style composition and image generation. InstantStyle can help users maintain style during the text-to-image generation process, providing users with a better generation experience.
SPRIGHT is a large-scale visual language dataset and model focusing on spatial relationships. It builds the SPRIGHT dataset by re-describing 6 million images, significantly increasing the number of spatial phrases in the descriptions. The model was fine-tuned and trained on 444 images containing a large number of objects to optimize the generation of images with spatial relationships. SPRIGHT achieves state-of-the-art spatial consistency across multiple benchmarks while improving image quality scores.
Canva’s AI image generator app lets you have the perfect image anytime you want—even if it doesn’t exist yet. Using the Text to Image feature, you can simply enter text and generate an image for use in creative projects, such as presentations or social media posts. Choose from different image styles such as watercolor, film, neon and more. You can also use Canva’s other AI generator apps like DALL·E and Imagen. Whether you're a content creator, entrepreneur, or artist, you can use these tools to efficiently create unique images and branding assets. Canva offers free and paid subscriptions, with the paid version generating more images each month.
Animagine XL 3.1 is a text-to-image generation model capable of generating high-quality anime-style images based on text prompts. It is built on the foundation of Stable Diffusion XL and is specifically optimized for anime styles. The model features broader knowledge of anime characters, an optimized dataset, and new aesthetic labels, resulting in improved quality and accuracy of generated images. It aims to provide valuable resources for anime fans, artists, and content creators.
FineControlNet is the official PyTorch implementation of a method for generating images in which the shape and texture of each instance can be controlled through spatially aligned control inputs (such as 2D human poses) and instance-specific text descriptions. Its spatial inputs range from simple line drawings to complex human poses. FineControlNet ensures natural interaction and visual harmony between instances and their environment while retaining the quality and generalization of Stable Diffusion, with finer control.
ELLA (Efficient Large Language Model Adapter) is a lightweight method for equipping existing CLIP-based diffusion models with a powerful LLM. ELLA improves prompt following, enabling text-to-image models to understand long texts. It uses a timestep-aware semantic connector (TSC) to extract timestep-dependent conditions from a pre-trained LLM for the various denoising stages; the TSC dynamically adapts semantic features across sampling timesteps, helping to condition the frozen U-Net at different semantic levels. ELLA performs well on benchmarks such as DPG-Bench, especially on dense prompts involving multiple objects, diverse attributes, and relationships.
Muse Pro delivers unparalleled speed and quality with GPT-4 Vision technology and supports real-time AI guidance, allowing artists to unleash their creativity using familiar tools and innovative AI. It features text-to-image functionality, randomized creation, detail enhancement, visual description, intuitive AI control sliders, pause functionality, and a library of layers and brushes, among other diverse tools.
SLD is a self-correcting LLM-controlled diffusion model framework that enhances generative models by integrating detectors to achieve accurate text-to-image alignment. The SLD framework supports image generation and fine editing, and is compatible with any image generator, such as DALL-E 3, without requiring additional training or data.
PIXART-Σ is a diffusion transformer model that directly generates 4K-resolution images, providing higher image fidelity and better alignment with text prompts than its predecessor PixArt-α. Key features include an efficient training process that evolves from a "weaker" baseline model to a "stronger" model by incorporating higher-quality data, a process known as "weak-to-strong training." Other improvements include higher-quality training data and efficient token compression.
TCD is a consistency distillation technique for text-to-image synthesis that reduces errors in the synthesis process through a Trajectory Consistency Function (TCF) and Strategic Stochastic Sampling (SSS). TCD significantly improves image quality at low NFE (number of function evaluations) and preserves more detail than the teacher model at high NFE. TCD requires no additional discriminator or LPIPS supervision and maintains superior generation quality at both low and high NFE.
DiffuseKronA is a parameter-efficient fine-tuning method for personalized diffusion models. It significantly reduces the number of parameters and improves the quality of image synthesis by introducing an adaptation module based on Kronecker product. This method reduces the sensitivity to hyperparameters, generates high-quality images under different hyperparameters, and brings significant progress to the field of text-to-image generation models.
OpenDiT is an open-source project providing a high-performance Colossal-AI-based implementation of the Diffusion Transformer (DiT), designed to improve training and inference efficiency for DiT applications, including text-to-video and text-to-image generation. OpenDiT delivers up to 80% speedup and 50% memory reduction on GPU through kernel optimizations including FlashAttention, fused AdaLN, and fused LayerNorm; hybrid parallelism combining ZeRO, Gemini, and DDP, plus sharding of the EMA model to further cut memory costs; and FastSeq, a novel sequence-parallel method suited to workloads like DiT where activations are large but parameters are small, saving up to 48% of communication cost with single-node sequence parallelism and breaking the memory limit of a single GPU to reduce overall training and inference time. These gains require only small code changes, and users do not need to know the details of distributed training. OpenDiT also provides complete text-to-image and text-to-video pipelines that researchers and engineers can adapt to practical applications without modifying the parallelism code, and it performs text-to-image training on ImageNet and publishes checkpoints.