Cricket (QuQu) is a free, open source desktop voice input and text processing tool designed for Chinese users. Unlike Wispr Flow, it processes speech locally, protecting privacy and charging no subscription fee. By integrating the FunASR local model, Cricket accurately recognizes Chinese and optimizes the voice input experience, making it suitable for both developers and everyday users.
BlabbyAI is a speech-to-text AI transcription tool delivered as a Chrome extension. It significantly improves text-input efficiency, which is especially useful when content must be captured quickly or manual typing is inconvenient. Key benefits include fast, accurate speech recognition and seamless voice typing on any website, meeting the modern need for efficient input methods. Pricing is not stated; the product is positioned as a voice input aid that helps users improve productivity.
Mumble Note is an AI voice shorthand tool that converts users' dictation into clear notes, to-dos, and polished output. With privacy protection and intelligent Q&A features, it provides an efficient voice recording and management experience.
Speechly is a tool designed to turn your speech into structured emails, making it easy to get clear and easy-to-read messages without manual typing. It supports up to 100 languages.
Unmute is an innovative speech recognition and synthesis tool designed to enable users to efficiently interact with AI through natural language. Its low-latency technology ensures a smooth user experience and is suitable for scenarios that require real-time feedback. The product will be released as open source to promote the participation of more developers and users. The price has not yet been announced, but it is expected to be a combination of free and paid models.
Kimi-Audio is an advanced open source audio base model designed to handle a variety of audio processing tasks such as speech recognition and audio dialogue. The model is massively pre-trained on more than 13 million hours of diverse audio and text data, with powerful audio inference and language understanding capabilities. Its main advantages include excellent performance and flexibility, making it suitable for researchers and developers to conduct audio-related research and development.
Amazon Nova Sonic is a cutting-edge foundation model that unifies speech understanding and generation to improve the natural flow of human-machine dialogue. The model overcomes the complexity of traditional voice applications, achieving deeper communication understanding through a unified architecture. It is suitable for AI applications across multiple industries and has significant business value. As artificial intelligence technology continues to develop, Nova Sonic will provide customers with a better voice interaction experience and improved service efficiency.
Yinke Transcription is an online tool focused on audio and video transcription. It uses advanced speech recognition technology to quickly convert audio or video files into text. Its main advantages include fast transcription, high accuracy, and support for multiple languages and file formats. Positioned as an efficient aid for office work and study, it is designed to save users time and effort. Yinke Transcription provides a free trial so users can experience its core functions, while the paid version adds advanced features and large-file support to meet different users' needs.
DuRT is a speech recognition and translation tool focused on macOS systems. It realizes real-time recognition and translation of speech through local AI models and system services, supports multiple speech recognition methods, and improves recognition accuracy and language support range. The product displays results in the form of a floating box, allowing users to quickly obtain information during use. Its main advantages include high accuracy, privacy protection (no user information is collected), and convenient operation experience. DuRT is positioned as an efficient productivity tool designed to help users communicate and work more efficiently in multi-language environments. The product is currently available for download on the Mac App Store, and the specific price is not clearly mentioned on the page.
Scribe is a high-precision speech-to-text model developed by ElevenLabs, designed to handle the unpredictability of real-world audio. It supports 99 languages and provides features such as word-level timestamps, speaker separation, and audio event tagging. Scribe performs well on the FLEURS and Common Voice benchmarks, outperforming leading models such as Gemini 2.0 Flash, Whisper Large V3, and Deepgram Nova-3. It significantly reduces error rates for traditionally underserved languages such as Serbian, Cantonese, and Malayalam, which typically see error rates in excess of 40% from competing models. Scribe provides API interfaces for developers to integrate, and a low-latency version for real-time applications is planned.
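Word-level timestamps and speaker separation of the kind Scribe provides combine naturally into speaker turns. The sketch below shows one way to group word-level output into turns; the field names are hypothetical illustrations, not Scribe's actual response schema.

```python
# Illustrative only: group word-level ASR output (word, start, end, speaker)
# into contiguous speaker turns. Field names are hypothetical.

def group_turns(words):
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    for w in words:
        if turns and turns[-1]["speaker"] == w["speaker"]:
            turns[-1]["text"] += " " + w["word"]
            turns[-1]["end"] = w["end"]
        else:
            turns.append({"speaker": w["speaker"], "text": w["word"],
                          "start": w["start"], "end": w["end"]})
    return turns

words = [
    {"word": "Hello", "start": 0.0, "end": 0.4, "speaker": "A"},
    {"word": "there", "start": 0.5, "end": 0.9, "speaker": "A"},
    {"word": "Hi",    "start": 1.2, "end": 1.5, "speaker": "B"},
]
print(group_turns(words))
```

A real integration would read these fields from the transcription API's JSON response instead of hard-coded dictionaries.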
Phi-4-multimodal-instruct is a multimodal foundation model developed by Microsoft that accepts text, image, and audio input and generates text output. The model builds on the research and datasets of Phi-3.5 and Phi-4 and goes through supervised fine-tuning, direct preference optimization, and reinforcement learning from human feedback to improve instruction following and safety. It supports text, image, and audio input in multiple languages, has a 128K context length, and suits a variety of multimodal tasks such as speech recognition, speech translation, and visual question answering. The model achieves significant gains in multimodal capability, especially on speech and vision tasks, giving developers powerful multimodal processing capabilities for building applications.
FireRedASR-AED-L is an open source, industrial-grade automatic speech recognition model designed for high-efficiency, high-performance speech recognition. The model uses an attention-based encoder-decoder architecture and supports Mandarin, Chinese dialects, and English. It achieves new state-of-the-art results on public Mandarin speech recognition benchmarks and performs well on singing-lyrics recognition. Its main advantages include high performance, low latency, and broad applicability to a variety of voice interaction scenarios. Being open source, developers are free to use and modify the code, further advancing speech recognition technology.
FireRedASR is an open source industrial-grade Mandarin automatic speech recognition model that uses Encoder-Decoder and LLM integrated architecture. It comes in two variants: FireRedASR-LLM and FireRedASR-AED, designed for high performance and energy efficiency requirements respectively. The model performed well on the Mandarin benchmark, as well as on dialect and English speech recognition. It is suitable for industrial-level applications that require efficient speech-to-text conversion, such as smart assistants, video subtitle generation, etc. The model is open source, making it easy for developers to integrate and optimize.
Bulletpen is an innovative AI writing application designed to help users convert spoken expressions into high-quality written text. It uses speech recognition and natural language processing technology to optimize and polish the user's spoken content to generate written text with clear structure and smooth language. The main advantage of this product is that it can significantly improve writing efficiency, especially for users who find it difficult or lack inspiration when writing. Bulletpen was developed by 17-year-old high school student Rexan Wong with the goal of providing students, writers, and content creators with an easy-to-use writing aid. It offers both free and paid plans to meet the needs of different users.
Whisper Turbo is a speech recognition tool optimized based on the Whisper Large-v3 model and designed for fast speech transcription. It leverages advanced AI technology to efficiently convert speech to text from different audio sources, supporting multiple languages and accents. This tool is provided to users for free and is designed to help people save time and energy and improve work efficiency. It is mainly aimed at users who need to quickly and accurately transcribe voice content, such as bloggers, content creators, enterprises, etc., providing them with convenient speech-to-text solutions.
RealtimeSTT is an open source speech recognition model that converts speech to text in real time. It uses advanced voice activity detection to automatically detect the start and end of speech without manual intervention. It also supports wake-word activation: users can start voice recognition by speaking a specific wake word. With low latency and high efficiency, it suits applications that require real-time transcription, such as voice assistants and meeting notes. It is developed in Python, easy to integrate and use, and open source on GitHub with an active community and continuous updates and improvements.
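The voice-activity-detection idea — finding where speech starts and ends in an audio stream — can be illustrated with a deliberately simplified energy threshold. RealtimeSTT itself relies on trained VAD models (e.g. Silero or WebRTC VAD), not this naive rule; the sketch only conveys the concept.

```python
# Simplified VAD illustration: flag audio frames whose RMS energy exceeds a
# threshold. Real systems use trained VAD models, not this energy rule.
import math

def rms(frame):
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def detect_speech(frames, threshold=0.1):
    """Return (first, last) index of frames above the threshold, or None."""
    active = [i for i, f in enumerate(frames) if rms(f) > threshold]
    if not active:
        return None
    return active[0], active[-1]

silence = [0.0] * 160          # one 10 ms frame at 16 kHz, all zeros
loud = [0.5, -0.5] * 80        # a frame with high energy
frames = [silence, silence, loud, loud, silence]
print(detect_speech(frames))   # speech spans frames 2..3
```

A streaming implementation would additionally apply hangover/debounce logic so brief pauses inside an utterance do not end the segment.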
xiaozhi-esp32 is an open source AI chatbot project built on Espressif's ESP-IDF. It combines large language models with hardware devices so users can create personalized AI companions. The project supports speech recognition and dialogue in multiple languages and includes voiceprint recognition to identify the voice characteristics of different users. Being open source lowers the threshold for AI hardware development, provides valuable learning resources for students, developers, and others, and helps promote the application and innovation of AI in hardware. The project is currently free and open source, suitable for developers of all levels to learn from and build upon.
Tongyi is a browser plug-in that integrates speech recognition, real-time subtitle translation, intelligent summarization, and other functions. It is designed to improve users' efficiency in online classes, TV watching, online meetings, and similar scenarios, using AI to help users quickly record, transcribe, translate, and summarize web content; it is especially suitable for users who need to process large amounts of information. Against the background of today's information explosion, users need more efficient tools to manage, understand, and digest information. The product currently offers a free trial; specific pricing and positioning will be determined by user demand.
Robo Blogger is an artificial intelligence assistant focused on converting speech into blog posts. It captures ideas in natural language and structures them into organized blog content, while incorporating reference materials to ensure accuracy and depth. This tool is based on concepts from the previous Report mAIstro project and is optimized for blog post creation. By separating creative capture and content structuring, Robo Blogger helps keep original ideas authentic while ensuring professional presentation.
Moonshine Web is a simple application built with React and Vite that runs Moonshine Base, a powerful speech recognition model optimized for fast, accurate automatic speech recognition (ASR) on resource-constrained devices. The app runs entirely in the browser, using Transformers.js with WebGPU acceleration (or WASM as a fallback). Its significance lies in providing local, serverless speech recognition, which is particularly important for applications that must process speech data quickly.
OmniAudio-2.6B is a 2.6B parameter multi-modal model capable of seamlessly processing text and audio input. This model combines Gemma-2B, Whisper turbo and a custom projection module. Unlike the traditional method of concatenating ASR and LLM models, it unifies these two capabilities in an efficient architecture and implements it with minimal latency and resource overhead. This enables secure and fast processing of audio text directly on edge devices such as smartphones, laptops and robots.
Megrez-3B-Omni is an end-to-end full-modal understanding model developed by Wuwen Xinqiong. It is based on the large language model Megrez-3B-Instruct extension and has the ability to understand and analyze three modal data: pictures, text, and audio. This model achieves optimal accuracy in image understanding, language understanding, and speech understanding. It supports Chinese and English voice input and multiple rounds of dialogue, supports voice questions on input pictures, and directly responds to text based on voice commands. It has achieved leading results on multiple benchmark tasks.
Shortcut by Poised is a voice-based AI assistant designed to improve users' work efficiency through natural conversations. It allows users to quickly get answers, organize thoughts, and draft messages, emails, and documents through voice input while maintaining a consistent workflow. The product uses AI technology to convert natural language into refined text, and provides a variety of language style options to meet the needs of different occasions. The background information of Shortcut by Poised shows that it was published on Product Hunt and will soon launch Windows and mobile app versions. The Mac version is currently available for download.
Coval is a platform focused on AI agent testing and evaluation, aiming to improve the reliability and efficiency of AI agents through simulation and evaluation. Built by experts in the field of autonomous testing, the platform supports the testing of voice and chat agents and provides comprehensive evaluation reports to help users optimize the performance of AI agents. Coval’s key benefits include simplifying the testing process, providing AI-driven simulations, compatibility with voice AI, and providing detailed performance analysis. Product background information shows that Coval aims to help enterprises deploy AI agents quickly and reliably to improve the quality and efficiency of customer services. Coval offers three pricing plans to meet the needs of businesses of different sizes.
Whisper-NER is an innovative model that allows simultaneous speech transcription and entity recognition. The model supports open-type named entity recognition (NER), capable of identifying diverse and evolving entities. Whisper-NER is intended to serve as a powerful base model for automatic speech recognition (ASR) and NER downstream tasks, and can be fine-tuned on specific datasets to improve performance.
ultravox-v0_4_1-mistral-nemo is a multimodal speech large language model (LLM) based on the pre-trained Mistral-Nemo-Instruct-2407 and whisper-large-v3-turbo. The model can process both speech and text input, for example a text system prompt and a speech user message. Ultravox converts the input audio into embeddings via the special <|audio|> pseudo-token and generates output text. Future releases plan to extend the token vocabulary to support generating semantic and acoustic audio tokens, which could then be fed into a vocoder to produce speech output. The model was developed by Fixie.ai and is licensed under the MIT license.
fixie-ai/ultravox-v0_4_1-llama-3_1-70b is a large language model based on pre-trained Llama3.1-70B-Instruct and whisper-large-v3-turbo, which can process speech and text input and generate text output. The model converts the input audio into an embedding through the special pseudo tag <|audio|> and merges it with the text prompt to generate the output text. Ultravox was developed to expand application scenarios for speech recognition and text generation, such as voice agents, speech-to-speech translation, and spoken audio analysis. This model is licensed under the MIT license and developed by Fixie.ai.
fixie-ai/ultravox-v0_4_1-llama-3_1-8b is a large language model based on the pre-trained Llama3.1-8B-Instruct and whisper-large-v3-turbo that can process speech and text input and generate text output. The model converts input audio into embeddings via the special <|audio|> pseudo-token and generates output text. Future releases plan to extend the token vocabulary to support generating semantic and acoustic audio tokens, which a vocoder can then use to produce speech output. The model performs well on translation evaluations without preference tuning and suits scenarios such as voice agents, speech-to-speech translation, and spoken audio analysis.
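The <|audio|> pseudo-token mechanism the Ultravox entries describe can be sketched conceptually: the text prompt is tokenized as usual, and the placeholder marks where the audio encoder's embeddings are spliced into the input sequence. The "tokenization" below is a fake whitespace split and the embeddings are string stand-ins, purely for illustration — not the model's real pipeline.

```python
# Conceptual sketch of an <|audio|> placeholder: audio embeddings are merged
# into the token sequence at the placeholder position. All tokens/embeddings
# here are illustrative strings, not real model inputs.
AUDIO_TOKEN = "<|audio|>"

def build_input(prompt, audio_embeddings):
    """Replace the <|audio|> placeholder with one slot per audio embedding."""
    seq = []
    for tok in prompt.split():          # fake tokenizer: whitespace split
        if tok == AUDIO_TOKEN:
            seq.extend(audio_embeddings)  # spliced in at the placeholder
        else:
            seq.append(tok)
    return seq

fake_audio = ["<emb0>", "<emb1>", "<emb2>"]
seq = build_input("Transcribe: <|audio|> please.", fake_audio)
print(seq)
```

In the real model the spliced items are continuous embedding vectors produced by the Whisper encoder, and the LLM attends over them exactly as it does over text-token embeddings.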
Ultravox.ai is an advanced speech language model (SLM) that processes speech directly without converting it to text, enabling more natural and smooth conversations. It supports multiple languages and easily adapts to new languages or accents, ensuring smooth communication with different audiences. Product background information shows that Ultravox.ai is an open source model that users can customize and deploy according to their own needs at a price of 5 cents per minute.
Kaka Subtitle Assistant (VideoCaptioner) is a powerful video subtitle preparation software that uses a large language model to perform intelligent segmentation, correction, optimization, and translation of subtitles, realizing one-click processing of the entire subtitle video process. The product does not require high configuration, is simple to operate, and has a built-in basic LLM model, ensuring that it can be used out of the box and consumes less model tokens, making it suitable for video producers and content creators.
Najva is an AI-powered voice assistant designed specifically for the Mac that combines advanced local speech recognition with powerful AI models to convert your speech into smart text. The app especially suits users who think faster than they type, such as writers, developers, and medical professionals. A lightweight, native Swift application with zero tracking, and completely free, Najva gives users a workflow solution focused on privacy and efficiency.
hertz-dev is Standard Intelligence's open source full-duplex, audio-only transformer foundation model with 8.5 billion parameters. The model represents a scalable cross-modal learning technique, converting mono 16kHz speech into an 8Hz latent representation with a bitrate of 1kbps, outperforming other audio encoders. Its main advantages include low latency, high efficiency, and ease of fine-tuning and building upon for researchers. Standard Intelligence states that it is committed to building general intelligence that benefits all humanity, and hertz-dev is the first step in that journey.
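The quoted figures can be sanity-checked with simple arithmetic: a 16 kHz waveform mapped to an 8 Hz latent stream implies a 2000x temporal downsampling, and 1 kbps spread over 8 latent frames per second implies 125 bits per frame.

```python
# Sanity-check the quoted hertz-dev figures: 16 kHz mono audio compressed to
# an 8 Hz latent representation at roughly 1 kbps.
sample_rate_hz = 16_000
latent_rate_hz = 8
bitrate_bps = 1_000

downsampling = sample_rate_hz / latent_rate_hz   # audio samples per latent frame
bits_per_frame = bitrate_bps / latent_rate_hz    # bits available per latent frame

print(downsampling)    # 2000.0
print(bits_per_frame)  # 125.0
```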
Transcribro is a private, on-device speech recognition keyboard and text-service application for Android. It uses whisper.cpp to run OpenAI's Whisper family of models, combined with Silero VAD for voice activity detection. The app provides a voice-input keyboard that lets users enter text by voice; it can be invoked explicitly by other applications or set as the user's chosen speech-to-text provider, which some applications may then use. Transcribro exists to give users a more secure, private voice-to-text solution, avoiding the privacy risks of cloud processing. The application is open source, and users are free to view, modify, and distribute the code.
Universal-2 is the latest speech recognition model launched by AssemblyAI. It surpasses the previous generation Universal-1 in accuracy and precision. It can better capture the complexity of human language and provide users with audio data without the need for secondary inspection. The importance of this technology lies in its ability to provide sharper insights, faster workflows, and a best-in-class product experience. Universal-2 has significantly improved in proper noun recognition, text formatting and alphanumeric recognition, reducing word error rates in practical applications.
GLM-4-Voice is an end-to-end speech model developed by a Tsinghua University team. It can directly understand and generate Chinese and English speech for real-time voice dialogue. Using advanced speech recognition and synthesis technology, it achieves seamless speech-to-text-to-speech conversion with low latency and intelligent conversational ability. The model is optimized for intelligence and expressiveness in speech mode and suits scenarios requiring real-time speech interaction.
Whispo is a voice dictation tool that uses AI to convert the user's voice into text in real time. It performs speech recognition with OpenAI Whisper technology, supports transcription through a custom API, and allows post-transcription processing by large language models. Whispo supports multiple operating systems, including macOS (Apple Silicon) and Windows x64, and stores all data locally, protecting user privacy. It was designed to improve efficiency for users who do a lot of text input, whether programming, writing, or daily note-taking. Whispo is currently free to try, but specific pricing is not yet stated on the page.
Spirit LM is a foundation multimodal language model that freely mixes text and speech. The model is based on a 7B pre-trained text language model and is extended to the speech modality by continued training on text and speech units. Speech and text sequences are concatenated into a single token stream and trained with a word-level interleaving approach on a small, automatically curated speech-text parallel corpus. Spirit LM comes in two versions: the base version uses speech phoneme units (HuBERT), while the expressive version adds pitch and style units to model expressivity. In both versions, text is encoded with subword BPE tokens. The model demonstrates both the semantic capability of text models and the expressive capability of speech models, and the authors show that Spirit LM can learn new tasks (e.g., ASR, TTS, speech classification) across modalities from a small number of examples.
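The word-level interleaving described above can be sketched in a few lines: given a corpus where each word is aligned to both its text tokens and its speech units, training sequences switch modality at word boundaries. The tokens below are string stand-ins, not real BPE or HuBERT units, and the random switching rule is a simplification of the actual training recipe.

```python
# Illustrative sketch of word-level text/speech interleaving: for each aligned
# word, emit either its text tokens or its speech units, marked by a modality
# tag. Tokens are stand-ins, not real BPE/HuBERT units.
import random

def interleave(aligned, p_speech=0.5, seed=0):
    """Build one token stream, choosing a modality per aligned word."""
    rng = random.Random(seed)
    stream = []
    for text_tokens, speech_units in aligned:
        if rng.random() < p_speech:
            stream += ["[SPEECH]"] + speech_units
        else:
            stream += ["[TEXT]"] + text_tokens
    return stream

aligned = [
    (["he", "llo"], ["hu_12", "hu_87"]),
    (["world"], ["hu_3", "hu_44", "hu_9"]),
]
print(interleave(aligned))
```

Setting `p_speech` to 0.0 or 1.0 yields pure-text or pure-speech sequences, with mixed streams in between; a single model trained on such streams sees both modalities in one vocabulary.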
FunASR is a voice offline file transcription service software package that integrates voice endpoint detection, speech recognition, punctuation and other models. It can convert long audio and video into text with punctuation, and supports simultaneous transcription of multiple requests. It supports ITN and user-defined hot words, the server is integrated with ffmpeg, supports input of multiple audio and video formats, and provides multiple programming language clients. It is suitable for enterprises and developers who require efficient and accurate voice transcription services.
AsrTools is an AI-based speech-to-text tool. By calling the ASR service APIs of major providers, it delivers efficient speech recognition without a GPU or complex configuration. The tool supports batch processing and multi-threaded concurrency and can quickly convert audio files into subtitle files in SRT or TXT format. Its user interface, built with PyQt5 and qfluentwidgets, offers a polished, easy-to-use experience. Key advantages include the stability of the major providers' APIs, freedom from complex configuration, and flexible multi-format output. AsrTools suits users who need to quickly convert speech to text, especially in video production, audio editing, and subtitle generation. Currently, AsrTools provides free use of major providers' ASR services, which can significantly cut costs and improve efficiency for individuals and small teams.
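As a sketch of the SRT output format such tools emit, the snippet below turns timed transcript segments into numbered SRT cues. The input structure is hypothetical, not AsrTools' actual API; only the SRT timestamp and cue layout follow the standard format.

```python
# Minimal sketch: format timed transcript segments as SRT subtitle cues.
# Segment structure (start, end, text) is hypothetical, for illustration.
def srt_time(seconds):
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """Render (start, end, text) tuples as numbered SRT cue blocks."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{srt_time(start)} --> {srt_time(end)}\n{text}")
    return "\n\n".join(blocks)

print(to_srt([(0.0, 2.5, "Hello."), (2.5, 4.0, "Welcome back.")]))
```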
NotesGPT is an online service that uses artificial intelligence technology to convert users' voice notes into organized summaries and clear action items. It uses advanced speech recognition and natural language processing technology to help users record and manage notes more efficiently. It is especially suitable for users who need to quickly record information and organize it into structured content. Product background information shows that NotesGPT is technically supported by Together.ai and Convex, which shows that there is strong AI technology support behind it. At present, the product seems to be in the promotion stage, and the specific price and positioning information are not clearly displayed on the page.
Reverb is an open source speech recognition and speaker segmentation model inference code that uses the WeNet framework for speech recognition (ASR) and the Pyannote framework for speaker segmentation. It provides detailed model descriptions and allows users to download models from Hugging Face. Reverb aims to provide developers and researchers with high-quality speech recognition and speaker segmentation tools to support a variety of speech processing tasks.
Rev AI provides high-precision speech transcription services, supports more than 58 languages, and can convert speech to text in video and voice applications. It sets the accuracy standard for video and speech applications by training with the world's most diverse collection of sounds. Rev AI also provides services such as live streaming transcription, human transcription, language recognition, sentiment analysis, topic extraction, summarization and translation. Rev AI’s technical strengths include low word error rates, minimal bias against gender and racial accent, support for more languages, and the most readable transcripts possible. Additionally, it complies with the world's top security standards, including SOC II, HIPAA, GDPR, and PCI compliance.
AI-Powered Meeting Summarizer is a Gradio-based web application that converts meeting recordings to text, using whisper.cpp for audio-to-text conversion and an Ollama server for text summarization. The tool is great for quickly extracting key points, decisions, and action items from a meeting.
EMOVA (EMotionally Omni-present Voice Assistant) is a multi-modal language model that enables end-to-end speech processing while maintaining leading visual-linguistic performance. The model enables emotionally rich multimodal dialogue through a semantic-acoustic decoupled speech tokenizer and achieves state-of-the-art performance on visual-linguistic and speech benchmarks.
OmniSenseVoice is a speech recognition model optimized on top of SenseVoice, designed for fast inference and precise timestamps, offering a smarter, faster way to transcribe audio.
The Deepgram Voice Agent API is a unified speech-to-speech API that allows natural-sounding conversations between humans and machines. The API is powered by industry-leading speech recognition and speech synthesis models to listen, think and speak naturally and in real time. Deepgram is committed to driving the future of voice-first AI through its voice agent API, integrating advanced generative AI technology to create a business world capable of smooth, human-like voice agents.
iFlytek Spark is an AI large language model launched by iFlytek that is fully benchmarked against GPT-4 Turbo. It integrates multiple AI technologies, such as speech recognition, natural language processing, machine learning, etc., to provide users with efficient and intelligent office productivity tools. This product can not only process text information, but also perform speech recognition and generation. It supports multiple languages and is suitable for many fields such as enterprise services, smart hardware, smart government affairs, smart finance, and smart medical care.
iFlytek Virtual Human uses the latest AI virtual-image technology, combining core AI technologies such as speech recognition, semantic understanding, speech synthesis, NLP, and the Spark model, to provide multi-scenario virtual human products with image asset construction, AI driving, and multi-modal interaction. It offers one-stop production of virtual human audio and video content, with AIGC adding flexibility and efficiency: enter text or a recording in the virtual 'AI studio', complete an audio or video work with one click, and render the piece within 3 minutes.
EVI 2 is a new basic speech-to-speech model launched by Hume AI, which can have smooth conversations with users in a natural way close to humans. It has the ability to respond quickly, understand user intonation, generate different intonations, and perform specific requests. EVI 2 has enhanced emotional intelligence through special training to predict and adapt to user preferences, maintaining a fun and engaging character and personality. In addition, EVI 2 also has multi-language capabilities and can adapt to different application scenarios and user needs.
The Xinchen Lingo speech model is an advanced artificial intelligence speech model that focuses on providing efficient and accurate speech recognition and processing services. It can understand and process natural language, making human-computer interaction smoother and more natural. The model relies on Xihu Xinchen’s powerful AI technology and is committed to providing high-quality voice interaction experience in various scenarios.
Linglong is an AI note-taking assistant that lets users record information at any time and save it as rich text via voice AI note-taking. It also has AI smart tagging that automatically generates titles, helping users converse with their own knowledge base. In addition, Linglong adopts an original AI card-box note-taking method, allowing users to record continuously and let knowledge emerge naturally. The product supports multi-platform sync across Android, Apple, and Web versions to meet different users' needs.
Aixploria is an artificial intelligence-focused website that provides an online AI tool catalog to help users discover and select the best AI tools for their needs. The platform's simplified design and intuitive search engine allow users to easily find various AI applications through keyword searches. Aixploria not only provides a list of tools, but also publishes articles on how each AI works, helping users understand the latest trends and most popular applications. In addition, Aixploria also has a 'top 10 AI' section that is updated in real time, allowing users to quickly learn about the top AI tools in each category. Aixploria is suitable for everyone interested in AI, whether a beginner or an expert, you will find valuable information here.
Mini-Omni is an open source multimodal large language model that enables real-time speech input and streaming audio output in dialogue. It offers real-time speech-to-speech conversation without additional ASR or TTS models, and it can speak while thinking, generating text and audio simultaneously. Mini-Omni further enhances performance with batch inference for 'Audio-to-Text' and 'Audio-to-Audio'.
Easy Voice Toolkit is an AI voice toolbox based on open source voice projects, providing a variety of automated audio tools including voice model training. The toolbox integrates seamlessly to form a complete workflow, and users can use the tools selectively as needed or in sequence to gradually convert raw audio files into ideal speech models.
OpenVoiceChat is an open source project that aims to provide a platform for natural speech conversations with large language models (LLM). It supports multiple speech recognition (STT), text-to-speech (TTS) and LLM models, allowing users to interact with AI through speech. The project adopts the Apache-2.0 license, emphasizing openness and ease of use, and aims to become an open source alternative to closed commercial implementations.
Llama3-s v0.2 is a multi-modal checkpoint developed by Homebrew Computer Company focused on improving speech understanding. The model is improved through early integration of semantic tokens and community feedback, simplifying the model structure, improving compression efficiency, and achieving consistent speech feature extraction. Llama3-s v0.2 performs stably on multiple speech understanding benchmarks and provides live demos, allowing users to experience its capabilities for themselves. Although the model is still in the early stages of development and has some limitations, such as being sensitive to audio compression and unable to handle audio longer than 10 seconds, the team plans to address these issues in future updates.
Encounter AI - Advisor is a speech recognition product built on SRI's Hidden Markov Model (HMM) technology that provides real-time audio monitoring services to multi-unit restaurant operators. It accurately tracks and analyzes every conversation at the restaurant level, eliminating the common 'he said/she said' subjectivity problem and providing retail leaders with real-time conversation analysis to help them achieve their goals and increase revenue.
Seed-ASR is a speech recognition model based on a large language model (LLM), developed by ByteDance. It leverages the power of the LLM by feeding continuous speech representations and contextual information into it; guided by large-scale training and context-aware capabilities, it significantly improves performance on a comprehensive evaluation set covering multiple domains, accents/dialects, and languages. Compared with recently released large-scale ASR models, Seed-ASR achieves a 10%-40% reduction in word error rate on Chinese and English public test sets, further demonstrating its powerful performance.
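The word error rate (WER) figures quoted above are the standard ASR metric: the word-level Levenshtein edit distance between hypothesis and reference, divided by the number of reference words. A minimal sketch:

```python
# Word error rate: (substitutions + insertions + deletions) / reference words,
# computed via Levenshtein edit distance over word sequences.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 substitution over 6 reference words
```

A "10%-40% WER reduction" is relative: dropping from 8.0% to 4.8% WER, for example, is a 40% reduction.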
speech-to-speech is an open source, modular GPT-4o-style project that implements speech-to-speech conversion through cascaded stages: voice activity detection, speech-to-text, a language model, and text-to-speech. It leverages the Transformers library and models available on the Hugging Face hub, providing a high degree of modularity and flexibility.
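The cascaded design can be sketched as plain function composition, which is what makes each stage swappable. This is a conceptual sketch, not the project's actual API; the stub stages stand in for real VAD, STT, LLM, and TTS models.

```python
# Minimal sketch of a cascaded speech-to-speech pipeline: each stage
# is a plain callable, so any implementation can be swapped in.
# The stubs below are placeholders, not the project's real models.

from typing import Callable, List

def run_pipeline(audio_chunk: bytes, stages: List[Callable]) -> object:
    data = audio_chunk
    for stage in stages:
        data = stage(data)
    return data

def vad(chunk: bytes) -> bytes:  # keep only non-empty chunks
    return chunk if chunk else b""

def stt(chunk: bytes) -> str:    # e.g. a Whisper model in practice
    return chunk.decode("utf-8")

def llm(text: str) -> str:       # e.g. a chat model in practice
    return f"echo: {text}"

def tts(text: str) -> bytes:     # e.g. a TTS model in practice
    return text.encode("utf-8")

out = run_pipeline(b"hello", [vad, stt, llm, tts])
print(out)  # b'echo: hello'
```

In the real project each stage runs in its own thread with queues between them, so audio streams through the cascade instead of being processed in one shot.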
Hanwang Voice King App is a flagship intelligent voice application independently developed by Hanwang Technology based on its self-developed multi-modal world model. It integrates AI voice recording, intelligent translation, and simultaneous interpretation, and supports functions such as accurate AI transcription, recording synchronization, script organization, intelligent summaries, and uninterrupted real-time translation. Relying on full-stack AI technology, Hanwang Voice King is committed to helping users overcome language barriers and work more efficiently and conveniently in office, study, conference, travel, and other scenarios.
Whisper-diarization is an open source project that combines Whisper's automatic speech recognition (ASR) capabilities with voice activity detection (VAD) and speaker embedding technology. It improves the accuracy of speaker embeddings by first extracting the voiced parts of the audio, then using Whisper to generate a transcript whose timestamps are corrected and aligned with WhisperX, reducing segmentation errors caused by time offsets. Next, MarbleNet performs VAD and segmentation to exclude silence, and TitaNet extracts speaker embeddings to identify the speaker of each segment. Finally, the results are associated with the WhisperX-generated timestamps: the speaker of each word is determined from its timestamp, and a punctuation model is used to realign the output and compensate for small temporal shifts.
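The final alignment step, matching word timestamps to diarized speaker segments, can be illustrated with a small sketch. The data and the max-overlap rule below are simplified stand-ins for what the project actually does.

```python
# Illustrative sketch: assign each word to the speaker whose diarized
# segment overlaps its timestamp most. Segments and words are made up.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words, segments):
    labeled = []
    for word, w_start, w_end in words:
        best = max(segments,
                   key=lambda seg: overlap(w_start, w_end, seg[1], seg[2]))
        labeled.append((word, best[0]))
    return labeled

segments = [("SPEAKER_0", 0.0, 2.0), ("SPEAKER_1", 2.0, 4.0)]
words = [("hello", 0.1, 0.5), ("there", 0.6, 1.0), ("hi", 2.2, 2.5)]
print(assign_speakers(words, segments))
# [('hello', 'SPEAKER_0'), ('there', 'SPEAKER_0'), ('hi', 'SPEAKER_1')]
```

This is why accurate word-level timestamps matter: a small timing offset can push a word across a segment boundary and flip its speaker label, which the punctuation-based realignment mentioned above helps correct.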
Qwen2 Audio Instruct Demo is an interactive demonstration website based on audio commands. It uses the latest artificial intelligence technology to allow users to interact with web pages through voice commands. This technology not only enhances the user experience, but also provides easier access for people with disabilities. The product background information includes its development team and technical support, and the price is positioned as a free trial, mainly for user groups interested in artificial intelligence interaction.
WeST is an open source speech recognition transcription model that implements speech-to-text conversion on top of a large language model (LLM) in a concise 300 lines of code. It consists of a large language model, a speech encoder, and a projector, of which only the projector is trainable. The development of WeST is inspired by SLAM-ASR and LLaMA 3.1, aiming to achieve efficient speech recognition through simplified code.
Listening-while-Speaking Language Model (LSLM) is an artificial intelligence dialogue model designed to improve the naturalness of human-computer interaction. Through full-duplex modeling (FDM), it can listen while speaking, enhancing real-time interactivity; in particular, it can be interrupted and respond in real time when the generated content is unsatisfactory. LSLM uses a token-based decoder-only TTS model for speech generation and a streaming self-supervised learning (SSL) encoder for real-time audio input, and explores the optimal interaction balance through three fusion strategies: early fusion, middle fusion, and late fusion.
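The difference between the fusion strategies named above is where the listening and speaking channels are combined relative to the model: before it (early) or after it (late). The toy sketch below uses a simple nonlinear stand-in for the model to show that the two orderings genuinely produce different results; the vectors and combination rule are invented for illustration, not LSLM's actual learned fusion.

```python
# Toy early vs. late fusion of a listening channel and a speaking
# channel. "square" is a stand-in for the model; because it is
# nonlinear, fusing before vs. after the model gives different outputs.

def early_fusion(listen_feats, speak_feats, process):
    # fuse the inputs first, then run the shared model once
    fused = [a + b for a, b in zip(listen_feats, speak_feats)]
    return process(fused)

def late_fusion(listen_feats, speak_feats, process):
    # run the model per channel, then fuse the outputs
    return [a + b for a, b in zip(process(listen_feats),
                                  process(speak_feats))]

def square(xs):
    return [x * x for x in xs]

print(early_fusion([1, 2], [3, 4], square))  # [16, 36]
print(late_fusion([1, 2], [3, 4], square))   # [10, 20]
```

Middle fusion, which the paper finds strikes the best balance, merges the channels partway through the model's layers rather than at either end.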
Voice Assistant Plugin for GPT is a voice assistant plugin specially designed for GPT, aiming to improve user experience through voice interaction. The plugin combines advanced speech recognition technology to allow users to communicate with GPT through voice commands, achieving a more natural and convenient conversation experience. Product background information shows that the plugin was developed by Air Tech Studio, supports multiple languages, pays attention to user data security, and does not share any data with third parties.
Say My Name! is a speech recognition app built with fun and personalization at its core. It leverages advanced speech recognition technology to allow a user's device to recognize and respond to the user's voice, especially the user's name. This application not only increases the fun of user interaction with the device, but also improves the convenience of operation. The main advantages of Say My Name! include highly accurate speech recognition, personalized password settings and user-friendly interface.
PC Agent uses artificial intelligence technology to understand the user's computer environment through screen content and audio transcription, thereby providing more accurate auxiliary services. It aims to address the limitations of current chatbots and improve user experience through deeper interactions. Product background information shows that PC Agent focuses on improving the efficiency of personal computer use. Its main advantages include intelligent understanding of the environment, providing personalized help and continuous feature updates.
AIAvatarKit is a tool for quickly building AI-based conversational avatars. It supports running on VRChat, Cluster, and other Metaverse platforms, as well as on real-world devices. The tool is easy to start, has unlimited expansion capabilities, and can be customized to the user's needs. The main advantages include: 1. Multi-platform support: runs on multiple platforms, including VRChat, Cluster, and other Metaverse platforms. 2. Easy to start: users can start conversations immediately without complicated setup. 3. Scalability: users can add unlimited functions as needed. 4. Technical requirements: the VOICEVOX API, a Google or Azure voice service API key, and an OpenAI API key are required.
SenseVoiceSmall is a basic speech model with multiple speech understanding capabilities, including automatic speech recognition (ASR), language identification (LID), speech emotion recognition (SER), and audio event detection (AED). The model has been trained on more than 400,000 hours of data, supports more than 50 languages, and has recognition performance that surpasses the Whisper model. SenseVoice-Small uses a non-autoregressive end-to-end framework with extremely low inference latency: it takes only 70 milliseconds to process 10 seconds of audio, 15 times faster than Whisper-Large. In addition, SenseVoice provides convenient fine-tuning scripts and strategies, and supports service deployment pipelines with multiple concurrent requests. Client languages include Python, C++, HTML, Java, and C#.
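Latency claims like the one above are often expressed as a real-time factor (RTF): processing time divided by audio duration. The numbers below use the 70 ms / 10 s figure quoted for SenseVoice-Small; the Whisper-Large value is only what the "15 times faster" comparison would imply, not an independently measured figure.

```python
# Real-time factor: processing time / audio duration.
# RTF < 1 means faster than real time.

def rtf(processing_seconds: float, audio_seconds: float) -> float:
    return processing_seconds / audio_seconds

print(rtf(0.070, 10.0))       # SenseVoice-Small: ~0.007 (70 ms for 10 s of audio)
print(rtf(0.070 * 15, 10.0))  # implied Whisper-Large RTF: ~0.105
```

An RTF of 0.007 means the model could, in principle, transcribe about 140 seconds of audio per second of compute on the benchmarked hardware.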
Onyxium is a comprehensive AI tool platform that provides a variety of AI technologies including image recognition, text analysis, speech recognition, etc. It is designed to help users easily access the latest AI technologies, use these tools at low cost, and improve the efficiency of projects and workflows.
FunAudioLLM is a framework designed to enhance natural speech interaction between humans and large language models (LLMs). It contains two innovative models: SenseVoice handles high-precision multilingual speech recognition, emotion recognition, and audio event detection; CosyVoice handles natural speech generation with multilingual, timbre, and emotion control. SenseVoice supports more than 50 languages with extremely low latency; CosyVoice excels at multilingual voice generation, zero-shot in-context generation, cross-language voice cloning, and instruction following. The models have been open sourced on ModelScope and Hugging Face, and the corresponding training, inference, and fine-tuning code has been released on GitHub.
SenseVoice is a basic speech model that includes multiple speech understanding capabilities such as automatic speech recognition (ASR), language identification (LID), speech emotion recognition (SER), and audio event detection (AED). It focuses on high-precision multilingual speech recognition, speech emotion recognition, and audio event detection, supports more than 50 languages, and its recognition performance exceeds the Whisper model. The model uses a non-autoregressive end-to-end framework with extremely low inference latency, making it ideal for real-time speech processing.
Azure Cognitive Services Speech is a speech recognition and synthesis service launched by Microsoft that supports speech-to-text and text-to-speech functions in more than 100 languages and dialects. It improves the accuracy of your transcriptions by creating custom speech models that handle specific terminology, background noise, and accents. In addition, the service also supports real-time speech-to-text, speech translation, text-to-speech and other functions, and is suitable for a variety of business scenarios, such as subtitle generation, post-call transcription analysis, video translation, etc.
GPT-4o is OpenAI’s latest innovation and represents the forefront of artificial intelligence technology. It extends the capabilities of GPT-4 with a truly multimodal approach, including text, visuals, and audio. GPT-4o revolutionizes our interactions with AI technology thanks to its speed, cost-effectiveness, and universal accessibility. It excels in text understanding, image analysis, and speech recognition, providing smooth and intuitive AI interaction, suitable for a variety of applications from academic research to specific industry needs.
sherpa-onnx is a speech recognition and speech synthesis project based on next-generation Kaldi. It uses onnxruntime for inference and supports a variety of speech-related functions, including speech-to-text (ASR), text-to-speech (TTS), speaker recognition, speaker verification, language identification, and keyword detection. It runs on multiple platforms and operating systems, including embedded systems, Android, iOS, Raspberry Pi, RISC-V, servers, and more.
StreamSpeech is a real-time speech-to-speech translation model based on multi-task learning. It simultaneously learns translation and synchronization strategies through a unified framework, effectively identifies translation opportunities in streaming speech input, and achieves a high-quality real-time communication experience. The model achieves leading performance on the CVSS benchmark and can provide low-latency intermediate results such as ASR or translation results.
LookOnceToHear is an innovative smart headphone interaction system that allows users to select the target speaker they want to hear through simple visual recognition. This technology received an honorable mention for Best Paper at CHI 2024. It achieves real-time speech extraction by synthesizing audio mixes, head-related transfer functions (HRTFs) and binaural room impulse responses (BRIRs), providing users with a novel way to interact.
BeMyEars is a real-time subtitle generation tool that uses local devices to complete speech recognition, providing the ultimate experience for the hearing-impaired and users who need subtitles. Its main advantages include multi-language support, multi-source input, privacy protection, etc.
Transkriptor is a browser plug-in that converts audio to text. It uses advanced artificial intelligence technology to automatically record and transcribe different types of voice content such as meetings, interviews, and lectures. Transkriptor has a simple and intuitive interface, supports multiple file formats, provides secure transcription services, and has functions such as generating subtitles, supporting multi-language transcription, and remote collaborative editing.
Chartnote is a plug-in that can quickly complete medical documents. It makes writing medical records quick and easy by using technologies such as generative artificial intelligence, speech recognition, and smart templates. Its main advantages are increased work efficiency, reduced documentation time, and accurate clinical records. Chartnote is suitable for doctors, nurses and other medical practitioners.
Gemini 1.5 Flash is the latest AI model launched by the Google DeepMind team, which uses a 'distillation' process to extract core knowledge and skills from the larger 1.5 Pro model and serve it in the form of a smaller, more efficient model. The model performs well in multi-modal reasoning, long text processing, chat applications, image and video subtitle generation, long document and tabular data extraction, etc. Its importance lies in providing a solution for applications that require low latency and low-cost services while maintaining high-quality output.
FunClip is a fully open source, locally deployed automated video editing tool that performs video speech recognition by calling the open source FunASR Paraformer series models of Alibaba Tongyi Lab. Then users can freely select the text fragment or speaker in the recognition result, and click the crop button to obtain the video of the corresponding fragment. FunClip integrates Alibaba's open source industrial-grade model Paraformer-Large, which is currently one of the open source Chinese ASR models with the best recognition results, and can accurately predict timestamps in an integrated manner.
TransLinguist is a remote interpretation product that uses speech recognition and automatic translation technology to perform real-time interpretation between various languages. It provides high-quality remote interpretation services to help users eliminate language barriers in meetings, trainings, speeches, and other events. The main advantages of TransLinguist are cost savings, increased audience engagement, and safe and reliable language services.
Retell AI is a powerful AI agent building platform that allows users to quickly build and test complex workflows and deploy them via phone calls, web calls, or anywhere else. The platform supports the use of any large language model (LLM) and provides a real-time interactive experience, including human-like voice and speech cloning support. Key benefits of Retell AI include low latency, high stability, and HIPAA-compliant security.
boff.ai is a website based on artificial intelligence speech recognition and natural language processing technology. Its main advantage is that it can quickly and accurately recognize the user's voice input and be able to understand its intention to provide corresponding answers and suggestions. boff.ai is positioned to provide intelligent voice assistant services to help users process information and complete tasks more efficiently.
The A.I. intelligent customer service solution is a complete customer service system provided by iFlytek for enterprises based on its advanced voice technology. The system realizes functions such as intelligent outbound calls, intelligent answering, voice navigation, online text customer service, quality inspection analysis, and agent assistance through multiple channels such as phone, Web, APP, mini-programs, and self-service terminals. It helps companies improve customer service efficiency, reduce labor costs, and improve customer service experience through technologies such as high-recognition speech recognition engines, natural and smooth speech synthesis technology, intelligent interruption capabilities, IVR navigation, and customer service platform middleware.
Feishu Miaoji is an intelligent meeting minutes tool that can transcribe meeting contents into easily searchable and translatable verbatim drafts, automatically summarize meeting minutes and to-do items, and improve review and collaboration efficiency.
WhisperKit, launched by Argmax, is an inference toolkit based on the Whisper project that allows speech recognition and transcription in iOS and macOS applications. The goal of the project is to gather developer feedback and release a stable release candidate within a few weeks to accelerate the production of on-device inference.
Luzia is a smart assistant that provides easy access to the power of artificial intelligence through WhatsApp, no registration required and absolutely free. Luzia helps you with the daily tasks of work, school, socializing, and pursuing your passions.
AnyGPT is a unified multi-modal large language model that uses discrete representations to process various modalities uniformly, including speech, text, images, and music. AnyGPT can be trained stably without changing the current large language model architecture or training paradigm. It relies entirely on data-level preprocessing, which lets new modalities be integrated into the language model as seamlessly as adding new languages. The authors construct a text-centric multimodal dataset for multimodal alignment pre-training and, leveraging generative models, synthesize the first large-scale any-to-any multimodal instruction dataset: 108,000 multi-turn dialogue examples with multiple modalities intertwined, enabling the model to handle any combination of multi-modal input and output. Experimental results show that AnyGPT supports any-to-any multi-modal dialogue while achieving performance comparable to dedicated models in all modalities, demonstrating that discrete representations can effectively and conveniently unify multiple modalities in a language model.
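The discrete-representation idea above can be sketched in a few lines: tokens from each modality get their own id range in one shared vocabulary, so a single language model can consume interleaved text, audio, and image tokens. The vocabulary sizes and offset scheme below are invented for illustration and are not AnyGPT's actual configuration.

```python
# Toy shared vocabulary: each modality's tokenizer produces local ids,
# which are shifted into disjoint ranges of one combined vocabulary.
# Sizes are made up for illustration.

TEXT_VOCAB = 50_000
AUDIO_VOCAB = 1_024   # e.g. codes from a neural audio codec
IMAGE_VOCAB = 8_192   # e.g. codes from a VQ image tokenizer

OFFSETS = {
    "text": 0,
    "audio": TEXT_VOCAB,
    "image": TEXT_VOCAB + AUDIO_VOCAB,
}

def to_shared(modality: str, token_id: int) -> int:
    """Map a modality-local token id into the shared vocabulary."""
    return OFFSETS[modality] + token_id

def from_shared(shared_id: int):
    """Recover (modality, local id) from a shared-vocabulary id."""
    for modality in ("image", "audio", "text"):  # descending offsets
        if shared_id >= OFFSETS[modality]:
            return modality, shared_id - OFFSETS[modality]

print(to_shared("audio", 7))  # 50007
print(from_shared(50_007))    # ('audio', 7)
```

Because every modality reduces to ids in one vocabulary, training and generation stay ordinary next-token prediction, which is why no architecture change is needed.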
VocBot Turbo is an efficient speech-to-text tool that can quickly convert speech content into text, supports multiple languages and audio formats, and provides accurate recognition results. VocBot Turbo has a high degree of accuracy and flexibility and is suitable for various scenarios, including meeting recording, voice transcription, voice search, etc. It also has a user-friendly interface and easy-to-use operations, allowing you to perform speech-to-text with ease.
WhisperFusion is a product based on the functions of WhisperLive and WhisperSpeech, which enables seamless conversations with AI by integrating the Mistral Large Language Model (LLM) in the real-time speech-to-text process. Both Whisper and LLM are optimized by the TensorRT engine to maximize performance and real-time processing capabilities. WhisperSpeech uses torch.compile for optimization. The product is positioned to provide ultra-low latency AI real-time conversation experience.
Voxos is a versatile and user-friendly desktop voice assistant that integrates LLMs into daily workflows, making it more streamlined than accessing an LLM through a Web UI. It is ideal for anyone who uses a desktop computer and wants to save time and energy. Thanks to Voxos' modular design, you can also build your own custom features: Voxos is designed to be easily extensible and customizable, so you are encouraged to tailor your modifications to the current design patterns and share them with all Voxos users by submitting an MR.
AI Audio Kit is a tool for audio transcription on macOS using OpenAI's official Whisper API. It uses advanced AI technology to achieve accurate transcription without tedious upload steps, while supporting long text summarization capabilities. Available for $9, the AI Audio Kit is designed to save users time and effort.
MacGaiver is an AI assistant software that helps users get help quickly in any application. Users simply activate MacGaiver with a keyboard shortcut, then ask questions via voice or text without leaving the app, and MacGaiver provides answers in text and voice. It uses OpenAI's GPT model with vision capabilities, via the OpenAI Vision API, to answer user questions in seconds.
HoneyDo is a voice recognition AI shopping list assistant that inputs a shopping list by voice, and AI converts it into a neat and orderly list. In addition, it also supports functions such as taking photos to identify ingredients and making a list, as well as sharing shopping lists with family members in real time. HoneyDo is divided into free version and PRO version. The PRO version provides unlimited voice recording and image capture functions.
Tencent Cloud Speech Recognition (ASR) provides developers with a best-in-class speech-to-text service characterized by high recognition accuracy, convenient access, and stable performance. It offers three service forms: real-time speech recognition, sentence recognition, and recording file recognition, meeting the needs of different types of developers. With advanced technology, high cost-effectiveness, and multi-language support, it is suitable for customer service, conference, court, and other scenarios.
Hintscribe is an innovative speech-to-text desktop application. It can transcribe system audio in real time, and through integration with ChatGPT, it allows users to interact with the transcribed text to achieve a variety of tasks such as answering questions, translating text, or creating witty comments for social platforms. The application's real-time transcription function can significantly improve meeting efficiency; the seamless integration with various conference platforms enables simple and convenient transcription; the real-time interview recording and transcription function can reduce the interviewer's note-taking burden and allow the interviewer to focus more on interacting with the candidate. The application can also provide interview response suggestions through ChatGPT to help candidates improve their performance.
elsAi is a powerful AI assistant tool that helps users improve work efficiency and productivity. It has multiple functions such as intelligent translation, speech recognition, and intelligent recommendation, and supports multiple languages and scenario applications. elsAi is positioned to provide users with convenient AI assistance tools.
speakSync is a real-time speech translation APP based on artificial intelligence. It can achieve instant translation between multiple languages, supports speech-to-text and text-to-speech, and uses OpenAI's Whisper and GPT models to achieve smooth and accurate translation effects. This APP is specially designed for travelers, business people and language learners, simplifying the translation process and creating a barrier-free cross-language communication environment.