Innovative AI application and game development
Polyverse is a leading mobile application company headquartered in New York that focuses on developing applications and games powered by AI technology. By combining creativity with AI, we build innovative apps and games that change the way people interact with their devices. Our products include AI Mirror, Bricks Royale, and Spellai. AI Mirror is an AI-powered image generation app that transforms your photos into works of art in a variety of styles. Bricks Royale is a classic brick-breaking game in which you clear level after level of bricks to save the king and princess. Spellai is an AI-powered app that turns your text into photo art, letting you unleash your creativity in different styles with simple prompts. Our products are widely used in entertainment, creative design, personalization, and other scenarios.
Widely used in entertainment, creative design, personalization and other scenarios
Use AI Mirror to transform photos into cartoon images
Play Bricks Royale game to save the king and princess
Turn text into photo art with Spellai
Discover more AI tools of similar quality
ControlMM is a full-body motion generation framework with plug-and-play multi-modal control, capable of generating robust motion across multiple domains, including Text-to-Motion, Speech-to-Gesture, and Music-to-Dance. The model offers clear advantages in controllability, sequential coherence, and motion plausibility, providing a new motion generation solution for the field of artificial intelligence.
SignLLM is the first multilingual sign language generation model built on public sign language data, covering American Sign Language (ASL) and seven other sign languages. The model can generate sign language gestures from text or prompts, and it accelerates training with reinforcement learning to improve the quality of data sampling. SignLLM achieves state-of-the-art performance on sign language production tasks across eight sign languages.
MagicTime merges the MagicTime temporal LoRA into the AnimateDiff v3 motion model and converts its spatial LoRA into the .safetensors format, so the two can be used together in ComfyUI with AnimatediffEvolved. It provides high-quality spatio-temporal fusion effects and is well suited to motion-model applications. Pricing depends on the specific use case; the product is positioned as advanced spatio-temporal fusion technology.
IMUSIC is a novel facial expression capture system based on IMU sensor signals. It uses an optimized IMU sensor placement scheme and a decoupling algorithm to accurately predict facial blendshape parameters from IMU signals alone. Unlike traditional vision-based face capture solutions, IMUSIC can capture facial expressions in visually occluded scenarios while protecting user privacy.
Score Distillation Sampling (SDS) is a recent but already widely popular method that relies on image diffusion models to control optimization problems using text prompts. The paper provides an in-depth analysis of the SDS loss function, identifies problems inherent in its formulation, and proposes an unexpected but effective fix. Specifically, the authors decompose the loss into different factors and isolate the component that produces noisy gradients. In the original formulation, high text guidance is used to compensate for this noise, which leads to undesirable side effects. Instead, the authors train a shallow network that imitates the timestep-dependent denoising deficiencies of the image diffusion model in order to factor them out efficiently. They demonstrate the versatility and effectiveness of the new loss formulation through multiple qualitative and quantitative experiments, including optimization-based image synthesis and editing, zero-shot image translation network training, and text-to-3D synthesis.
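For reference, the standard SDS gradient (as introduced in DreamFusion), which the analysis above decomposes, takes the following form; the notation here is the commonly used one rather than necessarily the paper's:

```latex
\nabla_{\theta}\mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_{\phi}(\mathbf{x}_t;\,y,\,t)-\epsilon\bigr)\,
      \frac{\partial \mathbf{x}(\theta)}{\partial \theta}
    \right],
  \qquad
  \mathbf{x}_t = \alpha_t\,\mathbf{x}(\theta) + \sigma_t\,\epsilon,
  \quad \epsilon\sim\mathcal{N}(0,\mathbf{I})
```

Here the frozen image diffusion model's (classifier-free-guided) noise prediction is written as the epsilon-hat term, y is the text prompt, w(t) is a timestep weighting, and theta are the parameters being optimized; the residual between the predicted and sampled noise is the term whose noisy component the analysis isolates.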
Instruct-Imagen is a multi-modal image generation model. By introducing multi-modal instructions, it can handle heterogeneous image generation tasks and generalize to unseen tasks. The model uses natural language to integrate different modalities (such as text, edges, styles, and subjects) and to express rich generation intents in a standardized form. It is built by fine-tuning a pre-trained text-to-image diffusion model with a two-stage framework: retrieval-augmented training followed by fine-tuning on diverse image generation tasks. Human evaluations on various image generation datasets show that it matches or surpasses previous task-specific models in-domain and exhibits promising generalization to unseen and more complex tasks.
DL3DV-10K is a large-scale real-world scene dataset containing more than 10,000 high-quality videos. Each video is manually annotated with scene key points and complexity, and the dataset provides camera poses, NeRF-estimated depth, point clouds, and 3D meshes. It can be used for computer vision research such as general NeRF studies, scene-consistency tracking, and vision-language models.
aMUSEd is an open-source, lightweight masked image model (MIM) based on MUSE for text-to-image generation. Compared to latent diffusion, MIM requires fewer inference steps and is easier to interpret. In addition, MIM can be fine-tuned to learn additional styles from just a single image. aMUSEd ships two model checkpoints that directly generate images at 256x256 and 512x512 resolution.
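As a usage sketch, assuming the AmusedPipeline integration in Hugging Face diffusers and the amused/amused-256 checkpoint name (check the model card for the current API):

```python
import torch
from diffusers import AmusedPipeline  # assumes a diffusers release with aMUSEd support

# Load the 256x256 checkpoint; a 512x512 variant is also published.
pipe = AmusedPipeline.from_pretrained("amused/amused-256", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# MIM sampling needs far fewer steps than typical latent diffusion.
image = pipe("a watercolor painting of a lighthouse at dawn",
             num_inference_steps=12).images[0]
image.save("lighthouse.png")
```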
This paper introduces a perceptual-loss-based diffusion model that improves sample quality by incorporating a perceptual loss directly into diffusion training. For conditional generation, the method improves sample quality without affecting the conditional input, and therefore does not sacrifice sample diversity. For unconditional generation, the approach also improves sample quality. The paper details the principle of the method and presents experimental results.
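To make the idea concrete, here is a minimal training-loss sketch that adds a perceptual term (using the off-the-shelf lpips package) to the standard noise-prediction objective. The model signature, schedule, and weighting are illustrative placeholders, and this is not the paper's exact formulation, which makes its own choice of perceptual network:

```python
import torch
import torch.nn.functional as F
import lpips  # off-the-shelf perceptual metric (assumption: pip install lpips)

# Frozen perceptual network; move it to the same device as the data in real training.
perceptual = lpips.LPIPS(net="vgg").eval()

def diffusion_loss(model, x0, t, noise, alphas_cumprod, lambda_p=0.1):
    """Standard epsilon-prediction loss plus a perceptual term on the
    reconstructed clean image. Illustrative sketch only."""
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise            # forward diffusion
    eps_pred = model(xt, t)                                 # placeholder model call
    mse = F.mse_loss(eps_pred, noise)                       # usual diffusion loss
    x0_pred = (xt - (1 - a).sqrt() * eps_pred) / a.sqrt()   # recover an x0 estimate
    p = perceptual(x0_pred.clamp(-1, 1), x0).mean()         # perceptual distance in [-1, 1] range
    return mse + lambda_p * p
```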
SteinDreamer provides a text-to-3D score distillation solution. The authors propose a variance-reduction technique called Stein Score Distillation (SSD), which reduces the variance of the distillation gradient through control variates constructed via Stein's identity. Their experiments show that SSD effectively reduces distillation variance and consistently improves visual quality in both object-level and scene-level generation. Furthermore, they show that SteinDreamer converges faster than existing methods.
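As background for the variance-reduction idea, here is a generic sketch of control variates and Stein's identity, not the paper's exact derivation:

```latex
% Subtracting a zero-mean baseline h keeps a gradient estimator g unbiased:
\mathbb{E}[\,g-\alpha h\,]=\mathbb{E}[g],
\qquad
\alpha^{*}=\frac{\operatorname{Cov}(g,h)}{\operatorname{Var}(h)}
\;\text{minimizes}\;\operatorname{Var}(g-\alpha h).

% Stein's identity supplies such zero-mean baselines from the score function
% of a smooth density p, for a suitable test function phi:
\mathbb{E}_{x\sim p}\!\left[\nabla_{x}\log p(x)\,\phi(x)^{\top}+\nabla_{x}\phi(x)\right]=0 .
```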
X-Adapter is a universal upgrade tool that enables pre-trained plug-in modules (e.g. ControlNet, LoRA) to be used directly with upgraded text-to-image diffusion models (e.g. SD-XL) without further retraining. By training an additional network to guide the frozen upgraded model, X-Adapter retains the plug-in connectors of the old model and adds trainable mapping layers that connect the decoders of the two model versions for feature remapping; the remapped features then serve as guidance for the upgraded model. To enhance X-Adapter's guidance ability, the authors adopt a null-text training strategy, and after training they introduce a two-stage denoising strategy to align the initial latents of X-Adapter and the upgraded model. X-Adapter demonstrates universal compatibility with various plug-ins and allows plug-ins of different versions to work together, extending the capabilities of the diffusion community. Extensive experiments show that X-Adapter may have wider applications with upgraded base diffusion models.
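As a rough architectural sketch (a hypothetical module with illustrative channel sizes, not the released implementation), the trainable mapping layer can be pictured as a small convolutional projector whose output is added as guidance to the upgraded model's decoder features:

```python
import torch
import torch.nn as nn

class FeatureMapper(nn.Module):
    """Hypothetical sketch of a trainable mapping layer in the spirit of X-Adapter:
    remaps decoder features from a frozen base model into the feature space of an
    upgraded model. Channel sizes are placeholders."""
    def __init__(self, in_channels=320, out_channels=640):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.GroupNorm(32, out_channels),
            nn.SiLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, old_decoder_feat, upgraded_feat):
        # Project the old model's decoder features, resize them to match the
        # upgraded model's decoder resolution, and add them as guidance.
        guidance = self.proj(old_decoder_feat)
        guidance = nn.functional.interpolate(
            guidance, size=upgraded_feat.shape[-2:], mode="nearest")
        return upgraded_feat + guidance
```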
Stable Diffusion XL is a Hugging Face Space running on TPU v5e that serves the Stable Diffusion XL model. Stable Diffusion XL is a powerful text-to-image diffusion model widely used for image generation, creative design, and other fields. Running on TPU v5e, the Space is efficient and stable and can handle large-scale workloads and complex prompts.
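For readers who want to run the model itself rather than the hosted Space, here is a minimal sketch using the diffusers library on a local GPU rather than TPU v5e:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the SDXL base checkpoint in half precision and move it to the GPU.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

image = pipe(prompt="an astronaut riding a horse, digital art").images[0]
image.save("astronaut.png")
```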
MiniGPT-5 is an interleaved vision-and-language generation technique based on generative vokens, capable of producing text narratives and the associated images at the same time. It adopts a two-stage training strategy for description-free multi-modal generation: a unimodal alignment stage followed by a multi-modal learning stage. The model achieves good results on multi-modal dialogue generation tasks.