Found 124 related AI tools
CameraBench is a model for analyzing camera motion in video, aiming to understand camera motion patterns from video. Its main advantage lies in using generative vision-language models for principled classification of camera motion and for video-text retrieval. Compared with traditional structure-from-motion (SfM) and simultaneous localization and mapping (SLAM) methods, the model shows significant advantages in capturing scene semantics. The model is open source and suitable for researchers and developers, and improved versions will be released in the future.
The Describe Anything Model (DAM) can process specific regions of an image or video and generate detailed descriptions. Its main advantage is that it can generate high-quality localized descriptions from simple markers (points, boxes, scribbles, or masks), which greatly improves image understanding in the field of computer vision. Developed jointly by NVIDIA and multiple universities, the model is suitable for research, development, and real-world applications.
EasyControl is a framework that provides efficient and flexible control for Diffusion Transformers (DiT), aiming to solve problems such as efficiency bottlenecks and insufficient model adaptability in the current DiT ecosystem. Its main advantages include support for multiple condition combinations and improved generation flexibility and inference efficiency. The framework is built on recent research results and is suitable for areas such as image generation and style transfer.
Thera is an advanced super-resolution technology capable of producing high-quality images at different scales. Its main advantage lies in a built-in physical observation model that effectively avoids aliasing. Developed by a research team at ETH Zurich, the technology is suited to image enhancement and computer vision, with particularly broad applications in remote sensing and photogrammetry.
MIDI is an innovative image-to-3D scene generation technology that utilizes a multi-instance diffusion model to generate multiple 3D instances with accurate spatial relationships directly from a single image. The core of this technology lies in its multi-instance attention mechanism, which can effectively capture the interaction and spatial consistency between objects without complex multi-step processing. MIDI excels in image-to-scene generation, and is suitable for synthetic data, real scene data, and stylized scene images generated by text-to-image diffusion models. Its main advantages include efficiency, high fidelity, and strong generalization capabilities.
GaussianCity is a framework focused on efficiently generating unbounded 3D cities, based on 3D Gaussian splatting. It resolves the memory and computation bottlenecks that traditional methods face when generating large-scale urban scenes through a compact 3D scene representation and a spatially aware Gaussian attribute decoder. Its main advantage is the ability to quickly generate large-scale 3D cities in a single forward pass, significantly outperforming existing methods. The framework was developed by the S-Lab team at Nanyang Technological University, the related paper was published at CVPR 2025, and the code and model have been open-sourced, making it suitable for researchers and developers who need to efficiently generate 3D urban environments.
MLGym is an open source framework and benchmark developed by Meta's GenAI team and UCSB NLP team for training and evaluating AI research agents. It promotes the development of reinforcement learning algorithms by providing diverse AI research tasks and helping researchers train and evaluate models in real-world research scenarios. The framework supports a variety of tasks, including computer vision, natural language processing and reinforcement learning, and aims to provide a standardized testing platform for AI research.
Pippo is a generative model developed by Meta Reality Labs in cooperation with multiple universities. It can generate high-resolution multi-view videos from a single ordinary photo. The core benefit of this technology is the ability to generate high-quality 1K-resolution video without additional inputs such as parametric models or camera parameters. It is based on a multi-view diffusion transformer architecture and has broad application prospects in areas such as virtual reality and film and television production. Pippo's code is open source but does not include pre-trained weights, so users need to train the model themselves.
VideoWorld is a deep generative model focused on learning complex knowledge from purely visual input (unlabeled videos). It uses autoregressive video generation to explore how task rules, reasoning, and planning capabilities can be learned from visual information alone. The core advantage of the model lies in its innovative latent dynamics model (LDM), which efficiently represents multi-step visual changes and thereby significantly improves learning efficiency and knowledge acquisition. VideoWorld performs well in video Go and robot control tasks, demonstrating strong generalization and the ability to learn complex tasks. The research is motivated by how living organisms acquire knowledge through vision rather than language, and aims to open up new ways for artificial intelligence to acquire knowledge.
Video Depth Anything is a deep learning-based video depth estimation model that provides high-quality, time-consistent depth estimation for extremely long videos. This technology is developed based on Depth Anything V2 and has strong generalization capabilities and stability. Its main advantages include depth estimation capabilities for videos of arbitrary length, temporal consistency, and good adaptability to open-world videos. This model was developed by ByteDance’s research team to solve challenges in depth estimation in long videos, such as temporal consistency issues and adaptability issues in complex scenes. Currently, the code and demonstration of the model are publicly available for researchers and developers to use.
ViTPose is a series of human pose estimation models based on the Transformer architecture. It leverages the powerful feature extraction capabilities of Transformers to provide a simple and effective baseline for human pose estimation tasks. The ViTPose models perform well on multiple datasets with high accuracy and efficiency. The models are maintained and updated by the University of Sydney community and are available at a variety of scales to meet the needs of different application scenarios. On the Hugging Face platform, ViTPose models are available in open source form, and users can easily download and deploy them for research and application development related to human pose estimation.
TryOffAnyone is a deep learning model for generating tiled cloth images from photos of a dressed person. The model converts pictures of people wearing clothes into flat cloth tiles, which is of great significance to fields such as clothing design and virtual fitting. It uses deep learning to achieve highly realistic cloth simulation, allowing users to preview how garments will look more intuitively. The main advantages of this model include realistic cloth simulation and a high degree of automation, which can reduce the time and cost of physical fittings.
video-analyzer is a video analysis tool that combines the Llama 3.2 11B vision model and OpenAI's Whisper model. It describes what is happening in a video by extracting key frames, feeding them into the vision model to obtain details, and combining the per-frame details with the available transcript. The tool brings together computer vision, audio transcription, and natural language processing to generate detailed descriptions of video content. Its key benefits include running completely locally without a cloud service or API key, intelligent extraction of video keyframes, high-quality audio transcription using OpenAI's Whisper, frame analysis using Ollama and the Llama 3.2 11B vision model, and generation of natural language descriptions of video content.
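The following is a minimal sketch (not the tool's actual code) of that kind of pipeline: sample a few frames with OpenCV, transcribe the audio with Whisper, ask a local Ollama vision model to describe each frame, and combine everything into one summary prompt. The file path and model tag are placeholders.

```python
import base64
import cv2          # pip install opencv-python
import requests
import whisper      # pip install openai-whisper

VIDEO = "clip.mp4"  # placeholder input path

# 1) Transcribe the audio track with Whisper.
transcript = whisper.load_model("base").transcribe(VIDEO)["text"]

# 2) Sample a handful of evenly spaced frames as key frames.
cap = cv2.VideoCapture(VIDEO)
total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
frames_b64 = []
for idx in range(0, total, max(total // 5, 1)):
    cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
    ok, frame = cap.read()
    if ok:
        _, buf = cv2.imencode(".jpg", frame)
        frames_b64.append(base64.b64encode(buf.tobytes()).decode())
cap.release()

# 3) Ask a local Ollama vision model to describe each sampled frame.
descriptions = []
for img in frames_b64:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.2-vision",  # assumes this model has been pulled locally
            "prompt": "Describe this video frame in one sentence.",
            "images": [img],
            "stream": False,
        },
    )
    descriptions.append(resp.json()["response"])

# 4) Combine the per-frame details with the transcript into a final summary prompt.
summary_prompt = (
    "Audio transcript:\n" + transcript
    + "\n\nFrame descriptions:\n" + "\n".join(descriptions)
    + "\n\nDescribe what is happening in this video."
)
print(summary_prompt)
```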
MegaSaM is a system for accurate, fast, and robust estimation of camera parameters and depth maps from monocular videos of dynamic scenes. It moves beyond the limits of traditional structure-from-motion and monocular SLAM techniques, which usually assume that the input video mainly contains static scenes with large amounts of parallax. Through careful modifications to a deep visual SLAM framework, MegaSaM scales to real-world videos of complex dynamic scenes, including videos with unknown fields of view and unconstrained camera paths. Extensive experiments on synthetic and real videos show that MegaSaM is more accurate and robust in camera pose and depth estimation than prior and concurrent work, with faster or comparable runtimes.
The NVIDIA Jetson Orin Nano Super Developer Kit is a compact generative AI supercomputer that delivers higher performance at a lower price. It serves a wide range of users, from commercial AI developers to hobbyists and students, providing a 1.7x improvement in generative AI inference performance, 67 INT8 TOPS of compute, and 102 GB/s of memory bandwidth. It is ideal for developing LLM chatbots based on retrieval-augmented generation, building visual AI agents, or deploying AI-based robots.
This is a video amodal segmentation and content completion model proposed by Carnegie Mellon University. The model leverages the prior knowledge of video generation models: given sequences of visible object masks in a video, it performs a conditional generation task to produce object masks and RGB content covering both the visible and occluded parts. The main advantages of this technique include the ability to handle heavy occlusion and to deal effectively with deforming objects. In addition, the model outperforms existing state-of-the-art methods on multiple datasets, with improvements of up to 13% on segmentation of the occluded (non-visible) regions of objects.
StableAnimator is the first end-to-end identity-preserving video diffusion framework capable of synthesizing high-quality videos without post-processing. This technique ensures identity consistency through conditional synthesis of a reference image and a sequence of poses. Its main advantage is that there is no need to rely on third-party tools, and it is suitable for users who require high-quality portrait animation.
Controllable Human-Object Interaction Synthesis (CHOIS) is an advanced technique that simultaneously generates object motion and human motion from language descriptions, initial object and human states, and sparse object waypoints. The technique is critical for simulating realistic human behaviors, especially in scenarios that require precise hand-object contact and appropriate contact with the ground. CHOIS improves the match between the generated object motion and the input object waypoints and ensures the realism of the interaction by introducing an object geometry loss as additional supervision during training and by designing guidance terms that enforce contact constraints during sampling of the diffusion model.
PSHuman is an innovative framework that leverages multi-view diffusion models and explicit reconstruction techniques to reconstruct realistic 3D human models from a single image. The importance of the technique lies in its ability to handle complex self-occlusion and to avoid geometric distortion in the generated facial details. PSHuman jointly models global body shape and local facial features through a cross-scale diffusion model, achieving novel-view generation that is rich in detail and preserves identity. In addition, PSHuman enhances cross-view body shape consistency under different human poses through body priors provided by parametric models such as SMPL-X. Its main advantages include rich geometric detail, high texture fidelity, and strong generalization.
text-to-pose is a research project that aims to generate human poses from text descriptions and use these poses to generate images. The technology combines natural language processing and computer vision to enable text-to-image generation by improving the control and quality of diffusion models. The project background is based on papers published at the NeurIPS 2024 Workshop, which is innovative and cutting-edge. Key advantages of the technology include improved accuracy and controllability of image generation, as well as potential applications in areas such as artistic creation and virtual reality.
Phantomy AI is an advanced tool that leverages computer vision to enhance user interaction and presentations through on-screen object detection and gesture recognition. It requires no additional hardware and lets users control the screen through intuitive gestures, providing a touch-free way to interact. The main advantages of Phantomy AI include highly accurate on-screen object detection, gesture-based control, smooth slide navigation, an enhanced user experience, and a wide range of application scenarios. Background information indicates that Phantomy AI was developed by AI engineer Almajd Ismail, who has a background in software and full-stack development. No specific pricing or positioning information is provided on the page.
DINO-X is a large visual model with object perception as the core. It has core capabilities such as open set detection, intelligent question and answer, human posture, object counting, and clothing color changing. It can not only identify known targets, but also flexibly deal with unknown categories. With advanced algorithms, the model has excellent adaptability and robustness, can accurately respond to various unforeseen challenges, and provides a comprehensive solution for complex visual data. DINO-X has a wide range of application scenarios, including robotics, agriculture, retail industry, security monitoring, traffic management, manufacturing, smart home, logistics and warehousing, entertainment media, etc. It is DeepDataSpace's flagship product in the field of computer vision technology.
Data Annotation Platform is an end-to-end data annotation platform that allows users to upload computer vision data, select annotation types, and download results without any minimum commitment. The platform supports a variety of annotation types, including rectangles, polygons, 3D cuboids, keypoints, semantic segmentation, instance segmentation, and panoptic segmentation, serving AI project managers, machine learning engineers, AI startups, and research teams in solving the challenges they encounter during data annotation. With features such as seamless execution, a cost calculator, an instruction generator, free tasks, API access, and team access, the platform provides users with a simple, efficient, and cost-effective data annotation solution.
AutoSeg-SAM2 is an automatic full video segmentation tool based on Segment-Anything-2 (SAM2) and Segment-Anything-1 (SAM1). It can track every object in the video and detect possible new objects. The importance of this tool lies in its ability to provide static segmentation results and track these results using SAM2, which is of great significance in areas such as video content analysis, object recognition and video editing. Product background information shows that it was developed by zrporz and is based on Facebook Research's SAM2 and zrporz's own SAM1. Price-wise, since this is an open source project, it is free.
TurboLens is a full-featured platform that integrates OCR, computer vision and generative AI. It can automatically generate insights from unstructured images quickly and simplify workflow. Product background information shows that TurboLens is designed to extract customized insights from printed and handwritten documents through its innovative OCR technology and AI-driven translation and analysis suite. In addition, TurboLens also provides mathematical formula and table recognition functions, converts images into actionable data, and translates mathematical formulas into LaTeX format and tables into Excel format. In terms of product price, TurboLens provides free and paid plans to meet the needs of different users.
LLaMA-Mesh is a technology that extends the ability of large language models (LLMs) pre-trained on text to generate 3D meshes. This technology leverages the spatial knowledge already embedded in LLMs and enables conversational 3D generation and mesh understanding. The main advantage of LLaMA-Mesh is its ability to represent the vertex coordinates and face definitions of 3D meshes as plain text, allowing direct integration with LLMs without extending the vocabulary. Key benefits of this technology include the ability to generate 3D meshes from text prompts, produce interleaved text and 3D mesh output on demand, and understand and interpret 3D meshes. LLaMA-Mesh achieves mesh generation quality comparable to models trained from scratch while maintaining strong text generation performance.
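As a toy illustration of the plain-text idea (not code from the paper), a mesh in Wavefront OBJ format is just lines of vertices ("v x y z") and 1-indexed faces ("f i j k"), so it can be embedded directly in an LLM prompt like any other token sequence:

```python
# A tetrahedron expressed as plain OBJ text; nothing here is specific to LLaMA-Mesh.
tetrahedron_obj = """\
v 0 0 0
v 1 0 0
v 0 1 0
v 0 0 1
f 1 2 3
f 1 2 4
f 1 3 4
f 2 3 4
"""

# The mesh text can be placed inside an ordinary prompt string.
prompt = "Here is a 3D mesh in OBJ format. Describe its shape:\n" + tetrahedron_obj
print(prompt)
```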
CountAnything is a cutting-edge application that leverages advanced computer vision algorithms to enable automatic, accurate object counting. It is suitable for a variety of scenarios, including industry, breeding, construction, medicine, and retail. The main advantage of this product is its high precision and efficiency, which can significantly improve the accuracy and speed of counting work. Product background information shows that CountAnything is currently open to users outside of mainland China and provides free trials.
The NVIDIA AI Blueprint for Video Search and Summarization is a reference workflow based on NVIDIA NIM microservices and generative AI models for building visual AI agents that understand natural language cues and perform visual question answering. These agents can be deployed in a variety of scenarios such as factories, warehouses, retail stores, airports, traffic intersections, etc. to help operations teams make better decisions from the rich insights generated from natural interactions.
GenXD is a framework focused on 3D and 4D scene generation, which jointly studies general 3D and 4D generation using common camera and object motions in daily life. Due to the lack of large-scale 4D data in the community, GenXD first proposed a data curation process to obtain camera poses and object motion intensity from videos. Based on this process, GenXD introduces a large-scale real-world 4D scene dataset: CamVid-30K. By utilizing all 3D and 4D data, the GenXD framework is able to generate any 3D or 4D scene. It proposes multi-view-temporal modules that decouple camera and object motion to seamlessly learn from 3D and 4D data. In addition, GenXD also uses masked latent conditions to support multiple condition views. GenXD is able to generate videos that follow camera trajectories and consistent 3D views that can be lifted to a 3D representation. It is extensively evaluated on a variety of real-world and synthetic datasets, demonstrating the effectiveness and versatility of GenXD compared to previous methods for 3D and 4D generation.
Tencent-Hunyuan-Large (Hunyuan large model) is an industry-leading open source large-scale mixture-of-experts (MoE) model launched by Tencent, with 389 billion total parameters and 52 billion active parameters. The model has made significant progress in fields such as natural language processing, computer vision, and scientific tasks, especially in handling long-context inputs. The open-sourcing of the Hunyuan large model aims to inspire more researchers to innovate and jointly advance the progress and application of AI technology.
Flex3D is a two-stage process that generates high-quality 3D assets from a single image or text prompt. This technology represents the latest advancement in the field of 3D reconstruction and can significantly improve the efficiency and quality of 3D content generation. The development of Flex3D is supported by Meta and team members with deep backgrounds in 3D reconstruction and computer vision.
StableDelight is an advanced model focused on removing specular reflections from textured surfaces. It builds on the success of StableNormal, which focuses on improving the stability of monocular normal estimation; StableDelight applies the same idea to the challenging task of reflection removal. The training data includes Hypersim, Lumos, and various specular-highlight removal datasets from TSHRNet. In addition, a multi-scale SSIM loss and stochastic conditional scaling are integrated during diffusion training to improve the sharpness of one-step diffusion predictions.
Colorful Diffuse Intrinsic Image Decomposition is an image processing technique that decomposes photographs taken in the wild into albedo, diffuse shading, and non-diffuse residual components. By progressively removing the monochromatic-lighting and Lambertian-world assumptions, the technique estimates colorful diffuse shading in images, accounting for multiple illuminants and secondary reflections in the scene while modeling specularities and visible light sources. It is important for image editing applications such as specularity removal and pixel-level white balancing.
diffusion-e2e-ft is an open source image conditional diffusion model fine-tuning tool that improves the performance of specific tasks by fine-tuning pre-trained diffusion models. The tool supports a variety of models and tasks, such as depth estimation and normal estimation, and provides detailed usage instructions and model checkpoints. It has important applications in the fields of image processing and computer vision, and can significantly improve the accuracy and efficiency of models on specific tasks.
opencv_contrib is an additional module library for OpenCV used to develop and test new image processing functions. These modules are usually integrated into the OpenCV core library after the API is stable, fully tested, and widely accepted. This library allows developers to use the latest image processing technology, driving innovation in the field of computer vision.
OpenCV is a cross-platform open source computer vision and machine learning software library that provides a range of programming functions, including but not limited to image processing, video analysis, feature detection, machine learning, etc. This library is widely used in academic research and commercial projects, and is favored by developers because of its powerful functionality and flexibility.
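A minimal sketch of typical OpenCV usage, combining basic image processing with feature detection (the input file name is a placeholder):

```python
import cv2  # pip install opencv-python

img = cv2.imread("scene.jpg")                      # placeholder input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

edges = cv2.Canny(gray, 100, 200)                  # classic edge detection

orb = cv2.ORB_create(nfeatures=500)                # ORB keypoints and descriptors
keypoints, descriptors = orb.detectAndCompute(gray, None)
print(f"{len(keypoints)} keypoints detected")

vis = cv2.drawKeypoints(img, keypoints, None, color=(0, 255, 0))
cv2.imwrite("keypoints.png", vis)
cv2.imwrite("edges.png", edges)
```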
GVHMR is an innovative human motion recovery technique that recovers world-grounded human motion from monocular video via a gravity-view coordinate system. The approach reduces the ambiguity of learning the image-to-pose mapping and avoids the error accumulation over consecutive frames seen in autoregressive methods. GVHMR performs strongly on in-the-wild benchmarks, surpassing existing state-of-the-art methods in both accuracy and speed, and its training procedure and model weights are publicly available, giving it high research and practical value.
Shangchen Zhou is a PhD student with a strong research background in computer vision and machine learning. His work mainly focuses on visual content enhancement, editing and generative AI (2D and 3D). His research results are widely used in fields such as super-resolution, deblurring, and low-light enhancement of images and videos, making important contributions to improving the quality of visual content and user experience.
Segment Anything 2 for Surgical Video Segmentation is a surgical video segmentation model based on Segment Anything Model 2. It uses advanced computer vision technology to automatically segment surgical videos to identify and locate surgical tools, improving the efficiency and accuracy of surgical video analysis. This model is suitable for various surgical scenarios such as endoscopic surgery and cochlear implant surgery, and has the characteristics of high accuracy and high robustness.
AvatarPose is a method for estimating the 3D pose and shape of multiple closely interacting people from sparse multi-view videos. This technique significantly improves the robustness and accuracy of estimating 3D poses in close interactions by reconstructing each person's personalized implicit neural avatar and using it as a prior to refine the pose through color and contour rendering losses.
SA-V Dataset is an open-world video dataset designed for training general object segmentation models, containing 51K diverse videos and 643K spatio-temporal segmentation masks (masklets). This dataset is used for computer vision research and is allowed to be used under the CC BY 4.0 license. Video content is diverse and includes topics such as places, objects, and scenes, with masks ranging from large-scale objects such as buildings to details such as interior decorations.
Meta Segment Anything Model 2 (SAM 2) is a next-generation model developed by Meta for real-time, promptable object segmentation in videos and images. It achieves state-of-the-art performance and supports zero-shot generalization, i.e., no need for custom adaptation to apply to previously unseen visual content. The release of SAM 2 follows an open science approach, with the code and model weights shared under the Apache 2.0 license, and the SA-V dataset also shared under the CC BY 4.0 license.
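A minimal sketch of prompting SAM 2 with a single point click, assuming the open-source sam2 package and a downloaded checkpoint; the config and checkpoint paths below are placeholders taken from the public release layout and may differ by version:

```python
import cv2
import numpy as np
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Load an RGB image (path is a placeholder).
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)

# Build the model from a config + checkpoint shipped with the release.
model = build_sam2("configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt")
predictor = SAM2ImagePredictor(model)

with torch.inference_mode():
    predictor.set_image(image)
    # One positive click (label 1) at pixel (x=500, y=300) prompts the segmentation.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 300]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )

print(masks.shape, scores)  # candidate masks and their quality scores
```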
roboflow/sports is an open source computer vision toolset focusing on applications in the sports field. It utilizes advanced image processing technologies such as object detection, image segmentation, key point detection, etc. to solve challenges in sports analysis. This toolset was developed by Roboflow to promote the application of computer vision technology in the sports field and is continuously optimized through community contributions.
VGGSfM is a deep learning-based 3D reconstruction technique designed to recover the camera poses and 3D structure of a scene from an unconstrained set of 2D images. The method is end-to-end trainable through a fully differentiable deep learning framework. It leverages deep 2D point tracking to extract reliable pixel-level trajectories, recovers all cameras from image and trajectory features, and optimizes cameras and triangulated 3D points via a differentiable bundle adjustment layer. VGGSfM achieves state-of-the-art performance on three popular datasets: CO3D, IMC Phototourism, and ETH3D.
MASt3R is an advanced model for 3D image matching developed by Naver Corporation, which focuses on improving geometric 3D vision tasks in the field of computer vision. This model utilizes the latest deep learning technology and can achieve accurate 3D matching between images through training, which is of great significance to fields such as augmented reality, autonomous driving, and robot navigation.
GaussianCube is an innovative 3D radiance representation that greatly advances 3D generative modeling through a structured and explicit representation. The technique achieves high-accuracy fitting by rearranging Gaussians into a predefined voxel grid using a novel densification-constrained Gaussian fitting algorithm and an optimal transport method. GaussianCube has fewer parameters and higher quality than traditional implicit feature decoders or spatially unstructured radiance representations, making 3D generative modeling easier.
L4GM is a 4D large-scale reconstruction model capable of quickly generating animated objects from single-view video input. It is trained on a novel dataset of multi-view videos showing animated objects rendered from Objaverse. The dataset contains 44K different objects and 110K animations, rendered from 48 viewpoints, resulting in 12M videos with a total of 300M frames. L4GM is built on the pre-trained 3D large-scale reconstruction model LGM, which outputs 3D Gaussian ellipsoids from multi-view image inputs. L4GM outputs a 3D Gaussian splatting representation for each frame, which is then upsampled to a higher frame rate for temporal smoothness. In addition, L4GM adds temporal self-attention layers to help learn temporal consistency and uses a multi-view rendering loss at each time step to train the model.
UniAnimate is a unified video diffusion model framework for human image animation. It reduces optimization difficulty and ensures temporal coherence by mapping reference images, pose guidance, and noisy videos into a common feature space. UniAnimate can handle long sequences and supports random noise input and first frame conditional input, significantly improving the ability to generate long-term videos. Furthermore, it explores alternative temporal modeling architectures based on state-space models as a replacement for the original computationally intensive temporal Transformer. UniAnimate achieves synthetic results that outperform existing state-of-the-art techniques in both quantitative and qualitative evaluations, and is able to generate highly consistent one-minute videos by iteratively using a first-frame conditional strategy.
MASA is an advanced model for matching objects across video frames, capable of handling multi-object tracking (MOT) in complex scenes. MASA does not rely on domain-specific annotated video datasets; instead, it learns instance-level correspondences from the rich object segmentations produced by the Segment Anything Model (SAM). MASA provides a universal adapter that can be combined with base segmentation or detection models to achieve zero-shot tracking, performing well even in complex domains.
VastGaussian is an open source project for 3D scene reconstruction that models the geometry and appearance of large scenes using 3D Gaussians. The project was implemented by the author from scratch and may contain some errors, but it offers a fresh attempt in the field of 3D scene reconstruction. Its main advantages include the ability to handle large datasets, as well as improvements over the original 3DGS project that make it easier to understand and use.
AI Online Course is an interactive learning platform that provides a clear and concise introduction to artificial intelligence, making complex concepts easy to understand. It covers machine learning, deep learning, computer vision, autonomous driving, chatbots, etc., and emphasizes practical applications and technical advantages.
SadCaptcha is a plugin for solving TikTok CAPTCHAs; it can solve TikTok's rotation, puzzle, and 3D-shape CAPTCHAs quickly and accurately. It uses advanced computer vision algorithms to solve CAPTCHAs efficiently and works on any device and screen resolution.
JavaVision is an all-round visual intelligent recognition project developed based on Java. It not only implements core functions such as PaddleOCR-V4, YoloV8 object recognition, face recognition, and image search, but can also be easily expanded to other fields, such as speech recognition, animal recognition, security inspection, etc. Project features include the use of the SpringBoot framework, versatility, high performance, reliability and stability, easy integration and flexible scalability. JavaVision aims to provide Java developers with a comprehensive visual intelligent recognition solution, allowing them to build advanced, reliable and easy-to-integrate AI applications in a familiar and favorite programming language.
CoreNet is a deep neural network toolkit that enables researchers and engineers to train standard and novel small- and large-scale models for a variety of tasks, including basic models (such as CLIP and LLM), object classification, object detection, and semantic segmentation.
SAM is an advanced video object segmentation model that combines optical flow and RGB information to discover and segment moving objects in videos. The model achieves significant performance gains on both single- and multi-object benchmarks while maintaining object identity consistency.
ZeST is an image material transfer technology jointly developed by the University of Oxford, Stability AI and MIT CSAIL research teams. It can achieve material transfer of objects from one image to another without any prior training. ZeST supports the migration of a single material and can handle multiple material editing in a single image. Users can easily apply one material to multiple objects in the image. In addition, ZeST also supports fast image processing on the device, getting rid of dependence on cloud computing or server-side processing, greatly improving efficiency.
PhysAvatar is an innovative framework that combines inverse rendering and inverse physics to automatically estimate the physical parameters of human body shape, appearance, and clothing from multi-view video data. It uses mesh-aligned 4D Gaussian spatiotemporal tracking and a physically based inverse renderer to estimate intrinsic material properties. PhysAvatar integrates a physics simulator to estimate the physical parameters of clothing in a principled manner using gradient-based optimization. These capabilities enable PhysAvatar to render high-quality novel-view avatars wearing loose clothing under motions and lighting conditions outside the training data.
Open-Sora-Plan is an open source project that aims to provide high-quality video datasets to the open source community. The project has crawled and processed 40,258 high-quality videos from open source websites, covering 60% of horizontal screen videos. It also provides automatically generated dense subtitles for use in applications such as machine learning. The project is free and open source, and everyone is welcome to participate and support it.
Scenic is a codebase focused on attention-based computer vision research. It provides optimized training and evaluation loops, baseline models, and other utilities, and supports multi-modal data such as images, video, and audio. It offers SOTA models and baselines to support rapid prototyping, and is free to use.
MindSpore is Huawei's open source, self-developed AI framework. It supports automatic differentiation and automatic parallelism, and a model trained once can be deployed across device, edge, and cloud scenarios. This all-scenario deep learning training and inference framework is mainly used in AI fields such as computer vision and natural language processing, and targets data scientists, algorithm engineers, and similar users. Its main features include source-code-transformation-based general automatic differentiation, automatic distributed parallel training, data processing, and a graph execution engine, making it easy to train neural networks with automatic differentiation. The framework is open source, and Huawei is cultivating an AI development ecosystem around it.
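A minimal sketch of MindSpore's functional automatic differentiation, assuming a MindSpore 2.x installation (API details may vary across versions):

```python
import numpy as np
import mindspore as ms
from mindspore import Tensor

def f(x):
    # Simple scalar function of a tensor: f(x) = x^3 + 2x
    return x ** 3 + 2 * x

x = Tensor(np.array(2.0, dtype=np.float32))
grad_f = ms.grad(f)            # returns a function that computes df/dx

print(f(x), grad_f(x))         # expect 12.0 and 14.0 (since f'(x) = 3x^2 + 2)
```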
ObjectDrop is a supervised method designed to achieve photorealistic object removal and insertion. It leverages a counterfactual dataset and bootstrapped supervision techniques. Its main capabilities are removing objects from an image along with their effects on the scene (such as occlusions, shadows, and reflections), and inserting objects into an image in an extremely realistic way. It achieves object removal by fine-tuning a diffusion model on a small, specially captured dataset. For object insertion, it uses bootstrapped supervision: the removal model is used to synthesize a large-scale counterfactual dataset, the insertion model is trained on it, and then fine-tuned on the real dataset to obtain a high-quality insertion model. Compared with previous methods, ObjectDrop significantly improves the realism of object removal and insertion.
T-Rex2 is a paradigm-breaking object detection technology capable of identifying a wide range of objects, from the mundane to the esoteric, without task-specific tuning or massive training datasets. It combines visual and text prompts, giving it powerful zero-shot capabilities applicable to object detection tasks in many scenarios. T-Rex2 combines four components: an image encoder, a visual prompt encoder, a text prompt encoder, and a box decoder. It follows DETR's end-to-end design principles and covers a variety of application scenarios. T-Rex2 achieved the best performance on four academic benchmarks, including COCO, LVIS, ODinW, and Roboflow100.
StyleSketch is a method for extracting high-resolution stylized sketches from facial images. The method exploits the rich semantics of the deep features of a pre-trained StyleGAN and can train a sketch generator with only 16 pairs of faces and corresponding sketch images. Through part-based losses in staged learning, StyleSketch converges quickly and extracts high-quality sketches. Compared with existing state-of-the-art sketch extraction methods and few-shot image adaptation methods, StyleSketch performs better at extracting high-resolution abstract facial sketches.
Skyvern is an automation tool that combines large language models (LLMs) and computer vision techniques to automate browser-based workflows. It provides a simple API endpoint to fully automate manual workflows, replacing brittle or unreliable automation solutions.
Glyph-ByT5 is a custom text encoder designed to improve visual text rendering accuracy in text-to-image generative models. It does this by fine-tuning the character-aware ByT5 encoder on a carefully curated paired glyph-text dataset. Integrating Glyph-ByT5 with SDXL yields the Glyph-SDXL model, which raises text rendering accuracy in design image generation from less than 20% to nearly 90%. The model can also perform automatic multi-line layout rendering of paragraph text, with character counts ranging from dozens to hundreds, while maintaining high spelling accuracy. In addition, by fine-tuning on a small number of high-quality real images containing visual text, Glyph-SDXL's scene text rendering in open-domain real images is also greatly improved. These encouraging results are intended to encourage further exploration of customized text encoders for different challenging tasks.
LensAI is a contextual advertising platform based on artificial intelligence that places ads directly in objects through video and image analysis to achieve effective traffic monetization. LensAI recognizes objects, signs, actions and context and matches them with relevant ads, fine-tuning targeting to deliver significant advertising effects and high rates of return. LensAI provides advanced image and video advertising solutions that bring benefits to all key stakeholders in the digital advertising ecosystem.
FineControlNet is an official PyTorch implementation for generating images in which the shape and texture of image instances can be controlled through geometric control inputs (such as 2D human poses) and spatially aligned, instance-specific text descriptions. It can use anything from simple line drawings to complex human poses as spatial input. FineControlNet ensures natural interaction and visual harmony between instances and the environment, while retaining the quality and generalization of Stable Diffusion with finer control.
DUSt3R is a novel dense, unconstrained stereo 3D reconstruction method suitable for arbitrary image collections. It requires no prior knowledge of camera calibration or viewpoint poses, and relaxes the strict constraints of traditional projective camera models by casting the pairwise reconstruction problem as a regression of pointmaps. DUSt3R provides a unified monocular and binocular reconstruction approach and proposes a simple and effective global alignment strategy for the multi-image case. Its network architecture is built on standard Transformer encoders and decoders, taking advantage of powerful pretrained models. DUSt3R directly provides a 3D model and depth information of the scene, from which pixel correspondences and relative and absolute camera parameters can be recovered.
U-xer is a computer vision-based test automation and RPA tool designed to automate anything seen on the screen, including web and desktop applications. It has two modes: easy and advanced, which can meet the different needs of non-technical users and advanced users. U-xer can recognize the screen and interpret the screen content like a human, enabling more natural and accurate automation. It is suitable for various application scenarios, including web applications, desktop software, mobile devices, etc., and provides customized solutions. Please check the official website for pricing and positioning of U-xer.
YOLOv8 is the latest version of the YOLO series of object detection models. It can accurately and quickly identify and locate multiple objects in images or videos and track their movement in real time. Compared with previous versions, YOLOv8 greatly improves detection speed and accuracy, and supports a variety of additional computer vision tasks such as instance segmentation and pose estimation. YOLOv8 can be deployed on different hardware platforms in a variety of formats, providing a one-stop, end-to-end object detection solution.
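A minimal sketch of running YOLOv8 detection through the ultralytics package (the image path is a placeholder):

```python
from ultralytics import YOLO  # pip install ultralytics

model = YOLO("yolov8n.pt")                 # small pretrained detection model
results = model("bus.jpg")                 # run inference on an image path or URL

for box in results[0].boxes:               # iterate over detected objects
    cls_id = int(box.cls)
    print(model.names[cls_id], float(box.conf), box.xyxy.tolist())

# The same API covers other tasks, e.g. YOLO("yolov8n-seg.pt") for instance
# segmentation or YOLO("yolov8n-pose.pt") for pose estimation.
```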
VisFusion is a technology that uses video data for online 3D scene reconstruction. It can extract and reconstruct a 3D environment from videos in real time. This technology combines computer vision and deep learning to provide users with a powerful tool for creating accurate 3D models.
Sora is a text-to-video generation model developed by OpenAI that can generate realistic videos up to one minute long from text descriptions. It can understand and simulate the physical world in motion, with the goal of training models that help people solve problems requiring physical interaction. Sora can interpret long prompts and generate a variety of people, animals, landscapes, and city scenes based on text input. Its limitation is that it struggles to accurately depict the physics of complex scenes and to understand cause and effect.
AnimateLCM-SVD-xt is a new image-to-video generation model that can generate high-quality, coherent videos in a few steps. The model uses consistency knowledge distillation and stereo-matching learning to make generated videos more stable and coherent while greatly reducing computation. Key features include: 1) generating 25 frames of 576x1024-resolution video in 4-8 steps; 2) reducing computation by about 12.5x compared with an ordinary video diffusion model; 3) producing good-quality videos that require no additional classifier-free guidance.
GLIGEN is a grounded, open-set image generation model that generates images from text descriptions together with constraints such as bounding boxes. The model is implemented by freezing the parameters of a pre-trained text-to-image diffusion model and injecting the new grounding information through additional trainable layers. This modular design enables efficient training and strong inference flexibility. GLIGEN supports conditional image generation in the open world and generalizes well to novel concepts and layouts.
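A minimal sketch of box-grounded generation, assuming the GLIGEN text-box checkpoint integrated into Hugging Face diffusers; the checkpoint id and argument names follow that integration and may differ in other releases:

```python
import torch
from diffusers import StableDiffusionGLIGENPipeline

pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a birthday cake and a teapot on a wooden table",
    gligen_phrases=["a birthday cake", "a teapot"],
    gligen_boxes=[[0.10, 0.40, 0.45, 0.90], [0.60, 0.45, 0.95, 0.85]],  # normalized xyxy boxes
    gligen_scheduled_sampling_beta=1.0,
    num_inference_steps=50,
).images[0]

image.save("gligen_out.png")
```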
Ollama is an open source project that runs a wide range of large AI models locally on Windows, with GPU acceleration, a built-in OpenAI-compatible API layer, and an always-available local API. Users can seamlessly access Ollama's complete model library, including models for image and voice interaction. Ollama provides powerful AI capabilities with no configuration required, helping developers and creators build AI applications on Windows.
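A minimal sketch of the OpenAI-compatible layer mentioned above, assuming Ollama is running locally and the llama3.2 model has been pulled:

```python
from openai import OpenAI  # pip install openai

# Point the standard OpenAI client at the local Ollama endpoint; the key is ignored.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize what Ollama does in one sentence."}],
)
print(resp.choices[0].message.content)
```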
Vision Arena is an open source platform created by Hugging Face for testing and comparing the effects of different computer vision models. It provides a friendly interface that allows users to upload images and process them through different models to visually compare the quality of the results. The platform is pre-installed with mainstream image classification, object detection, semantic segmentation and other models, and also supports custom models. The key advantages are that it is open source and free, easy to use, supports multi-model parallel testing, and is conducive to model effect evaluation and selection. It is suitable for computer vision R&D personnel, algorithm engineers and other roles, and can accelerate the experiment and optimization of computer vision models.
ImageTools is a universal cutout tool that uses advanced computer vision algorithms to accurately and automatically remove the background from photos and highlight the subject. It is suitable for image editing, advertising design, e-commerce and other scenarios, providing users with the flexibility and creative space to display image subjects in various situations.
LiveFood is a dataset containing more than 5,100 food videos covering four domains: ingredients, cooking, presentation, and eating. All videos are carefully annotated by professional workers, with a strict double-checking mechanism to further ensure annotation quality. The authors also propose the Global Prototype Encoding (GPE) model to handle this incremental learning problem, achieving competitive performance compared with traditional techniques.
VMamba is a visual state space model that combines the advantages of convolutional neural networks (CNNs) and vision transformers (ViTs), achieving linear complexity without sacrificing global receptive fields. A Cross-Scan Module (CSM) is introduced to address direction sensitivity, enabling excellent performance across various visual perception tasks. As image resolution increases, it shows increasingly significant advantages over existing benchmark models.
Vision Mamba is an efficient visual representation learning framework built on bidirectional Mamba blocks, which overcomes computation and memory limitations to perform Transformer-style understanding of high-resolution images. It does not rely on self-attention; instead, it compresses visual representations through position embeddings and bidirectional state space models, achieving higher performance with better computation and memory efficiency. The framework outperforms classic vision Transformers such as DeiT on ImageNet classification, COCO object detection, and ADE20K semantic segmentation, while being 2.8x faster and saving 86.8% of GPU memory.
SIFU is a method for reconstructing high-quality 3D clothed avatar models using side-view images. Its core innovation is a new side-view-conditioned implicit function that enhances feature extraction and improves geometric accuracy. In addition, SIFU introduces a 3D-consistent texture refinement process that greatly improves texture quality and enables texture editing with the help of a text-to-image diffusion model. SIFU handles complex poses and loose clothing well, making it an ideal solution for practical applications.
FMA-Net is a deep learning model for video super-resolution and deblurring. It can restore low-resolution, blurry videos to high-resolution, clear ones. Through flow-guided dynamic filtering and multi-attention iterative feature refinement, the model can effectively handle large motions in videos and achieve joint video super-resolution and deblurring. The model has a simple structure and remarkable results, and can be widely used in video enhancement, editing, and other fields.
Repaint123 can generate high-quality, multi-view consistent 3D content from a single image in 2 minutes. It combines the powerful image generation capability of a 2D diffusion model with the texture alignment capability of a progressive repainting strategy to generate high-quality, consistent multi-view images, and improves image quality during repainting through visibility-aware adaptive repainting strength. The resulting high-quality, multi-view consistent images enable fast 3D content generation with a simple mean squared error loss.
The code repository contains research on learning from synthetic image data (mainly pictures), including three projects: StableRep, Scaling and SynCLR. These projects studied how to train visual representation models using synthetic image data generated by text-to-image models, and achieved very good results.
ODIN (Omni-Dimensional INstance segmentation) is a model that can segment and label both 2D RGB images and 3D point clouds using a transformer architecture. It alternates between fusing information within 2D views and across 3D views, distinguishing 2D and 3D feature operations. ODIN achieves state-of-the-art performance on the ScanNet200, Matterport3D, and AI2THOR 3D instance segmentation benchmarks, and competitive performance on ScanNet, S3DIS, and COCO. It surpasses all previous work when point clouds sampled from 3D meshes are used instead of sensed 3D point clouds. When used as the 3D perception engine in an instructable embodied agent architecture, it sets a new state of the art on the TEACh dialogue-based action benchmark. Code and checkpoints are available on the project website.
3D Fauna is a method for building 3D animal models by learning from 2D internet images. It addresses the challenge of model generalization by introducing a collection of semantically related models, and provides a new large-scale dataset. At inference time, given an image of any quadruped, the model can reconstruct a corresponding 3D mesh model in a feed-forward manner within seconds.
Wild2Avatar is a neural rendering method for rendering human appearance in occluded wild monocular videos. It can render humans in realistic scenes, even if obstacles may block the camera view and cause partial occlusion. This method is implemented by decomposing the scene into three parts (occlusions, humans and background) and using a specific objective function to enforce the separation of humans from occlusions and background to ensure the integrity of the human model.
Human101 is a framework for quickly reconstructing the human body from a single view. It is able to train a 3D Gaussian model in 100 seconds and render 1024-resolution images at over 60FPS without pre-stored Gaussian attributes for each frame. The Human101 pipeline is as follows: First, 2D human poses are extracted from single-view videos. Then, the pose is used to drive the 3D simulator to generate matching 3D skeleton animation. Finally, a time-related 3D Gaussian model is constructed based on animation and rendered in real time.
Why choose Innovatiana for data annotation outsourcing? Innovatiana is a company dedicated to providing meaningful and impactful outsourcing services for your artificial intelligence needs. We recruit and train our own data annotation team in Madagascar, providing them with fair salaries, good working conditions and career development opportunities. We reject the use of crowdsourcing practices to provide you with meaningful and impactful outsourcing services and transparently source the data used for AI. Our tasks are handled by an English or French speaking manager, allowing for close management and communication. We offer flexible pricing based on your needs and budget. We value the security and confidentiality of data and adopt best information security practices to protect data. Our data annotation experts are professionally trained to provide you with high-quality annotated data for training your artificial intelligence models.
UniRef is a unified model for reference-based object segmentation in images and videos. It supports tasks such as referring image segmentation (RIS), few-shot segmentation (FSS), referring video object segmentation (RVOS), and video object segmentation (VOS). The core of UniRef is the UniFusion module, which efficiently injects various reference information into the base network. UniRef can also be used as a plug-in component for foundation models such as SAM. UniRef provides models trained on multiple benchmark datasets, and the source code is released for research use.
ALFI is an enterprise SaaS platform powered by artificial intelligence, using computer vision, machine learning, deep learning and edge computing technologies. It provides features such as ad targeting, real-time audience analytics, and personalized content delivery. ALFI’s unique network installs AI screens in ride-sharing services such as Uber and Lyft, enabling precise targeting and personalized delivery of digital out-of-home advertising. It uses computer vision technology to match audiences with relevant ads in real time and delivers content in a privacy-compliant manner. ALFI's goal is to provide brands with more precise advertising and provide enterprises with real-time audience analysis and customized content delivery.
Wrestling Endurance Challenge is a wrestling endurance challenge application that combines artificial intelligence and computer vision. The app assigns tasks through AI and uses computer vision to detect the user's duration. Users receive instructions through speakers or headphones to participate in endurance challenges. The application uses continuous machine learning to perform calculations in the cloud and ensures privacy and security. Video will not be sent and only joint coordinates and trajectory data will be exported.
Vision AI offers three computer vision products, including Vertex AI Vision, custom machine learning models, and the Vision API. You can use these products to extract valuable information from images, perform image classification and search, and create a variety of computer vision applications. Vision AI provides an easy-to-use interface and powerful pre-trained models to meet different user needs.
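A minimal sketch of calling the Vision API for label detection with the official Python client (requires Google Cloud credentials; the image path is a placeholder):

```python
from google.cloud import vision  # pip install google-cloud-vision

client = vision.ImageAnnotatorClient()

with open("photo.jpg", "rb") as f:
    image = vision.Image(content=f.read())

# Ask the Vision API which labels it sees in the image.
response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, round(label.score, 3))
```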
The Manot Insight Management Platform improves the performance of computer vision models by precisely pinpointing where they fail. It provides product managers and engineers with actionable insights so they can determine why computer vision models are failing.
Adversarial Diffusion Distillation is a real-time image editing platform that converts any physical medium to digital and edits anywhere via your phone, tablet or computer. It uses advanced computer vision technology to quickly and easily convert physical media to digital media, including paper, walls, whiteboards, books, and more. Adversarial Diffusion Distillation can help users improve work efficiency and reduce time and costs.
This product is a novel denoising diffusion probabilistic model that learns to sample from signal distributions that are never directly observed, but are instead measured through known differentiable forward models. It can sample directly from a partially observed, unknown signal distribution and is suitable for computer vision tasks. In inverse graphics, it can generate a distribution of 3D scenes consistent with a single 2D input image. Pricing is flexible, and the product is positioned in the fields of image processing and computer vision.
AttentionKart is a platform that uses artificial intelligence to provide engagement insights. It uses computer vision technologies such as facial recognition, expression recognition, eye tracking, etc. to help users analyze engagement and interaction and gain in-depth insights into user behavior. The platform can analyze video footage offline and integrate third-party applications online. The main functions include participation analysis, accurate user portraits, interaction optimization, etc. It is suitable for online courses in educational institutions, corporate conference presentations, sales calls and other scenarios.
YOLO-NAS Pose is a free, open source library for training computer vision models based on PyTorch. It provides training scripts and examples for quickly and easily replicating model results. Built-in SOTA models make it easy to load and fine-tune production-ready pre-trained models, including best practices and validated hyperparameters for optimal accuracy. It can shorten the training life cycle and eliminate uncertainty. Models for different tasks such as classification, detection, and segmentation are provided and can be easily integrated into the code base.
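A minimal sketch of loading a pre-trained pose model with the super-gradients library; the model name and pretrained-weights tag follow its model zoo and may change between releases:

```python
from super_gradients.training import models  # pip install super-gradients

# Load a YOLO-NAS Pose model with COCO-pose pretrained weights (names assumed from the model zoo).
model = models.get("yolo_nas_pose_l", pretrained_weights="coco_pose")

# Run inference on an image (path is a placeholder) and visualize the predicted keypoints.
prediction = model.predict("people.jpg", conf=0.5)
prediction.show()
```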
OpenCV is a real-time optimized computer vision library that provides a powerful set of tools and hardware support. It also supports the execution of machine learning (ML) and artificial intelligence (AI) models. OpenCV is open source and free for commercial use.
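A minimal sketch of executing an ML model through OpenCV's dnn module; "classifier.onnx" stands in for any exported ONNX classification model:

```python
import cv2
import numpy as np

# Load an ONNX model into OpenCV's DNN runtime (model file is a placeholder).
net = cv2.dnn.readNetFromONNX("classifier.onnx")

img = cv2.imread("photo.jpg")
# Preprocess: scale pixel values, resize to the model's input size, swap BGR -> RGB.
blob = cv2.dnn.blobFromImage(img, scalefactor=1 / 255.0, size=(224, 224), swapRB=True)

net.setInput(blob)
scores = net.forward()
print("predicted class index:", int(np.argmax(scores)))
```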
PanoHead is a 360° geometry-aware 3D full-head synthesis method that can be trained solely on unstructured images from the wild to achieve high-quality view-consistent 360° full-head image synthesis with different appearances and detailed geometries.