Found 155 related AI tools
Lip Sync AI is an advanced AI tool that transforms videos and images into lifelike spoken content with synchronized lip movements, suitable for presentations, marketing, and education.
FlowSpeech is a free AI podcast generator that uses the latest speech synthesis technology to convert text into natural-sounding human voices. It supports input in multiple formats, including PDF and TXT, so users can quickly turn existing material into audio, and it offers a variety of subscription options to help creators produce podcasts more efficiently.
MOSS-TTSD is an open source bilingual dialogue synthesis model that supports natural, expressive speech generation. It converts conversation scripts into high-quality speech, making it suitable for podcast production and AI conversation applications. The model features zero-shot voice cloning and long-form speech generation with a high degree of expressiveness and realism. MOSS-TTSD is trained on large-scale text and speech data, which ensures the naturalness and accuracy of the generated speech. It is fully open source and available for commercial use.
EaseVoice Trainer is a backend project designed to simplify and enhance the training process for speech synthesis and voice conversion. It builds on GPT-SoVITS, focusing on user experience and system maintainability. Its design diverges from the original project, aiming for a more modular, customizable solution that fits scenarios from small-scale experiments to large-scale production. The tool helps developers and researchers carry out speech synthesis and voice conversion work more efficiently.
MegaTTS 3 is an efficient PyTorch-based speech synthesis model developed by ByteDance, with ultra-high-quality voice cloning capabilities. Its lightweight architecture contains only 0.45B parameters, supports Chinese, English, and code-switching, and can generate natural, fluent speech from input text. It is widely used in academic research and technology development.
OpenAI.fm is an interactive demonstration platform that lets developers experience the latest audio models in the OpenAI API: gpt-4o-transcribe and gpt-4o-mini-transcribe for speech-to-text, and gpt-4o-mini-tts for text-to-speech. The TTS model generates natural, fluent speech that makes text content vivid and easy to understand. It suits a wide range of scenarios, especially voice assistants and content creation, helping developers communicate with users and improve the user experience. The product is positioned as efficient speech synthesis for developers who want to integrate voice features.
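For orientation, here is a minimal sketch of calling the gpt-4o-mini-tts model through the OpenAI Python SDK. The voice name and output path are illustrative, and the exact helper for writing the binary response may vary by SDK version; consult the official API reference for current parameters.

```python
# Minimal text-to-speech sketch with the OpenAI Python SDK (assumes OPENAI_API_KEY is set).
# Model and voice names are illustrative; check the API reference for current options.
from openai import OpenAI

client = OpenAI()

audio = client.audio.speech.create(
    model="gpt-4o-mini-tts",   # TTS model demonstrated on OpenAI.fm
    voice="alloy",             # one of the built-in voices
    input="Hello! This sentence will be spoken aloud.",
)

# The response wraps the binary audio payload (MP3 by default).
with open("speech.mp3", "wb") as f:
    f.write(audio.content)
```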
Orpheus TTS is an open source text-to-speech system based on the Llama-3b model, designed to deliver more natural, human-sounding speech synthesis. It offers strong voice cloning and emotional expressiveness and suits a variety of real-time application scenarios. The product is free and aims to give developers and researchers a convenient speech synthesis tool.
CSM 1B is a speech generation model based on the Llama architecture, capable of generating RVQ audio codes from text and audio input. This model is mainly used in the field of speech synthesis and has high-quality speech generation capabilities. Its advantage lies in its ability to handle multi-speaker dialogue scenarios and generate natural and smooth speech through contextual information. The model is open source and intended to support research and educational purposes, but use for impersonation, fraud, or illegal activities is expressly prohibited.
CSM is a conversational speech generation model developed by Sesame that generates high-quality speech from text and audio input. The model is based on the Llama architecture and uses the Mimi audio encoder. It is primarily used for speech synthesis and interactive speech applications such as voice assistants and educational tools. The main advantages of CSM are its ability to generate natural and smooth speech and its ability to optimize speech output through contextual information. The model is currently open source and suitable for research and educational purposes.
Sesame AI represents the next generation of speech synthesis technology. By combining advanced artificial intelligence technology and natural language processing, it can generate extremely realistic speech, with real emotional expression and natural conversation flow. The platform excels at generating human-like speech patterns while maintaining consistent personality traits, making it ideal for content creators, developers and enterprises to add natural speech capabilities to their applications. Its specific price and market positioning are currently unclear, but its powerful functions and wide range of application scenarios make it highly competitive in the market.
Spark-TTS is an efficient text-to-speech model built on a large language model, characterized by single-stream, decoupled speech tokens. It leverages the power of large language models to reconstruct audio directly from predicted codes, omitting a separate acoustic feature generation model and thereby increasing efficiency and reducing complexity. The model supports zero-shot text-to-speech and handles cross-lingual and code-switching scenarios, making it well suited to speech synthesis applications that demand high naturalness and accuracy. It also supports virtual voice creation: users can generate different voices by adjusting parameters such as gender, pitch, and speaking rate. The model was designed to address the inefficiency and complexity of traditional speech synthesis systems, aiming to provide an efficient, flexible, and powerful solution for research and production. It is currently geared mainly toward academic research and legitimate applications such as personalized speech synthesis, assistive technology, and language research.
Llasa is a text-to-speech (TTS) foundation model based on the Llama framework, designed for large-scale speech synthesis tasks. The model is trained on 160,000 hours of labeled speech data and offers efficient language generation capabilities and multi-language support. Its main advantages include powerful speech synthesis capability, low inference cost, and flexible framework compatibility. It suits education, entertainment, and business scenarios and can provide users with high-quality speech synthesis solutions. The model is currently available for free on Hugging Face, aiming to promote the development and application of speech synthesis technology.
Octave TTS is a next-generation speech synthesis model developed by Hume AI that not only converts text into speech, but also understands the semantics and emotion of the text to generate expressive speech output. The core advantage of this technology lies in its deep understanding of language, which enables it to generate natural and vivid speech based on context, and is suitable for a variety of application scenarios, such as audiobooks, virtual assistants, and emotional voice interactions. The emergence of Octave TTS marks the development of speech synthesis technology from simple text reading to a more expressive and interactive direction, providing users with a more personalized and emotional voice experience. Currently, the product is mainly aimed at developers and creators, providing services through APIs and platforms, and is expected to be expanded to more languages and application scenarios in the future.
IndexTTS is a GPT-style text-to-speech (TTS) model, mainly developed based on XTTS and Tortoise. It can correct the pronunciation of Chinese characters through pinyin and control pauses through punctuation. This system introduces a character-pinyin hybrid modeling method in the Chinese scene, which significantly improves training stability, timbre similarity, and sound quality. Additionally, it integrates BigVGAN2 to optimize audio quality. The model was trained on tens of thousands of hours of data and outperformed currently popular TTS systems such as XTTS, CosyVoice2, and F5-TTS. IndexTTS is suitable for scenarios that require high-quality speech synthesis, such as voice assistants, audiobooks, etc. Its open source nature also makes it suitable for academic research and commercial applications.
Xingsheng AI is a tool focused on generating AI podcasts. It utilizes advanced LLM models (such as kimi) and TTS models (such as Minimax Speech-01-Turbo) to quickly transform text content into vivid podcasts. The main advantage of this technology is its efficient content generation capabilities, which can help creators quickly produce podcasts and save time and energy. Xingsheng AI is suitable for content creators, podcast enthusiasts, and users who need to quickly generate audio content. Its positioning is to provide users with a convenient podcast generation solution, and there is currently no clear price information.
Zonos-v0.1-hybrid is an open source text-to-speech model developed by Zyphra that generates highly natural speech from text prompts. The model is trained on a large amount of English speech data, uses eSpeak for text normalization and phonemization, and then predicts DAC tokens through a transformer or hybrid backbone network. It supports multiple languages, including English, Japanese, Chinese, French, and German, and offers fine-grained control over speaking rate, pitch, audio quality, and emotion. It also provides zero-shot voice cloning that requires only 5 to 30 seconds of reference audio to achieve high-fidelity cloning. The model runs quickly on an RTX 4090, with a real-time factor of about 2x. It ships with an easy-to-use Gradio interface and can be installed and deployed via a Dockerfile. The model is available on Hugging Face and is free to use, but users need to deploy it themselves.
LLaSA_training is a LLaMA-based speech synthesis training project that aims to improve the efficiency and performance of speech synthesis models by optimizing the compute spent at training and inference time. The project trains on open source and internal datasets, supports multiple configurations and training methods, and offers high flexibility and scalability. Its main advantages include efficient data processing, strong speech synthesis quality, and multi-language support. It suits researchers and developers who need high-performance speech synthesis solutions and can be used to build applications such as intelligent voice assistants and voice broadcast systems.
Llasa-1B is a text-to-speech model developed by the audio laboratory at the Hong Kong University of Science and Technology. Based on the LLaMA architecture, it converts text into natural, fluent speech by incorporating speech tokens from the XCodec2 codebook. The model was trained on 250,000 hours of Chinese and English speech data and supports generation from plain text or synthesis conditioned on a given speech prompt. Its main advantage is high-quality multilingual speech generation, suitable for scenarios such as audiobooks and voice assistants. The model is licensed under CC BY-NC-ND 4.0; commercial use is prohibited.
Llasa-3B is a powerful text-to-speech (TTS) model developed based on the LLaMA architecture and focuses on Chinese and English speech synthesis. By combining the speech coding technology of XCodec2, this model can efficiently convert text into natural and smooth speech. Its main advantages include high-quality speech output, support for multi-language synthesis, and flexible voice prompt functions. This model is suitable for a variety of scenarios that require speech synthesis, such as audiobook production, voice assistant development, etc. Its open source nature also allows developers to freely explore and extend its functionality.
AI ContentCraft is a powerful content creation platform designed to help creators quickly generate stories, podcast scripts, and multimedia content. It provides creators with a one-stop solution by integrating text generation, speech synthesis, and image generation technologies. This tool supports the conversion of Chinese and English content and is suitable for users who need efficient creation. Its technology stack includes DeepSeek AI, Kokoro TTS and Replicate API, ensuring high-quality content generation. The product is currently open source and free, suitable for individuals and teams.
Hailuo AI Audio uses advanced speech synthesis technology to convert text into natural and smooth speech. Its main advantage is that it can generate high-quality, expressive speech and is suitable for a variety of scenarios, such as audiobook production, voice broadcast, etc. This product is positioned as a professional-grade audio synthesis tool. It currently provides a limited-time free trial, aiming to provide users with efficient and convenient speech generation solutions.
kokoro-onnx is a text-to-speech (TTS) project based on the Kokoro model and ONNX runtime. It supports English, with plans to support French, Japanese, Korean and Chinese. This model features fast, near-real-time performance on macOS M1 and offers multiple sound options, including whispers. The model is lightweight, about 300MB (about 80MB after quantization). The project is open source on GitHub and adopts the MIT license to facilitate integration and use by developers.
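A minimal local usage sketch follows, assuming the `Kokoro` class and `create()` call described in the project's README; the model file, voice pack file, and voice id are assumptions that vary by release.

```python
# Sketch of local TTS with kokoro-onnx; file names and the voice id are assumptions
# taken from the project README and may differ between releases.
import soundfile as sf
from kokoro_onnx import Kokoro

kokoro = Kokoro("kokoro-v0_19.onnx", "voices.bin")   # model weights + voice pack
samples, sample_rate = kokoro.create(
    "Hello from kokoro-onnx.",
    voice="af",        # an English voice id
    speed=1.0,
    lang="en-us",
)
sf.write("kokoro.wav", samples, sample_rate)          # save the generated audio
```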
Audiblez is a tool that uses Kokoro's high-quality speech synthesis technology to convert ordinary e-books (.epub format) into .m4b format audiobooks. It supports multiple languages and sounds, and users can complete the conversion through simple command line operations, which greatly enriches the e-book reading experience and is especially suitable for use in inconvenient reading scenarios such as driving and sports. This tool was developed by Claudio Santini in 2025 and is free and open source under the MIT license.
Kokoro-82M is a text-to-speech (TTS) model created by hexgrad and hosted on Hugging Face. It has 82 million parameters and is open source using the Apache 2.0 license. The model released v0.19 on December 25, 2024, and provides 10 unique voice packs. Kokoro-82M ranked first in TTS Spaces Arena, showing its efficiency in parameter scale and data usage. It supports US English and British English and can be used to generate high-quality speech output.
Synthesys is an AI content generation platform that provides AI video, AI voice and AI image generation services. It helps users generate professional-level content at lower costs and with simpler operations by using advanced artificial intelligence technology. Synthesys' product background is based on the current market demand for high-quality, low-cost content generation. Its main advantages include supporting ultra-realistic speech synthesis in multiple languages, generating high-definition videos without professional equipment, and user-friendly interface design. The platform's pricing strategy includes free trials and different levels of paid services, positioned to meet the content generation needs of enterprises of different sizes.
Voxdazz is an online platform that uses artificial intelligence technology to imitate celebrity voices. Users can select a celebrity's voice template, enter what they want to say, and Voxdazz will generate a corresponding video. This technology is based on complex algorithms and is able to simulate natural intonation, rhythm and emphasis, very close to human speech. It is not only suitable for the production of entertainment and humorous videos, but also for sharing funny content imitating celebrities. With its high-quality speech generation and user-friendly interface, Voxdazz provides users with a new way of entertainment and creative expression.
Flash is the latest text-to-speech (TTS) model from ElevenLabs. It generates speech in as little as 75 milliseconds, excluding application and network latency, making it the preferred model for low-latency conversational voice agents. Flash v2 supports English only, while Flash v2.5 supports 32 languages and costs 1 credit per two characters. Flash consistently outperforms comparable ultra-low-latency models in blind tests and is the fastest model that still guarantees quality.
Gemini 2.0 Flash Experimental is the latest AI model developed by Google DeepMind, designed to provide an intelligent agent experience with low latency and enhanced performance. This model supports the use of native tools and can natively create images and generate speech for the first time, representing an important advancement in AI technology in understanding and generating multimedia content. The Gemini Flash model family has become one of the key technologies that promotes the development of the AI field with its efficient processing capabilities and wide range of application scenarios.
CosyVoice 2 is a speech synthesis model developed by Alibaba Group's SpeechLab@Tongyi team. It is built on supervised discrete speech tokens and combines two popular generative models, language models (LMs) and flow matching, to achieve speech synthesis with high naturalness, content consistency, and speaker similarity. The model is important for multimodal large language models (LLMs), especially interactive experiences where response latency and real-time factor are critical. CosyVoice 2 improves the codebook utilization of speech tokens through finite scalar quantization, simplifies the text-to-speech language model architecture, and designs a chunk-aware causal flow matching model to adapt to different synthesis scenarios. Trained on large-scale multilingual datasets, it achieves human-comparable synthesis quality with extremely low response latency and real-time performance.
CosyVoice speech generation large model 2.0-0.5B is a high-performance speech synthesis model that supports zero-shot and cross-lingual speech synthesis and can generate speech output directly from text content. Provided by Tongyi Laboratory, it has powerful speech synthesis capabilities and a wide range of application scenarios, including but not limited to smart assistants, audiobooks, and virtual anchors. Its significance lies in delivering natural, fluent speech output that greatly enriches the human-computer interaction experience.
OuteTTS-0.2-500M is a text-to-speech synthesis model built on Qwen-2.5-0.5B. Trained on a larger dataset, it achieves significant improvements in accuracy, naturalness, vocabulary coverage, voice cloning, and multi-language support. The developers acknowledge Hugging Face for the GPU grant that supported the model's training.
ClipTurbo is an AI-driven video generation tool designed to help users easily create high-quality marketing videos. It uses AI to handle copywriting, translation, icon matching, and TTS speech synthesis, and finally renders the video with manim, avoiding the platform restrictions that purely generative AI video runs into. The tool (also known as Xiao Video Bao) supports a variety of templates: users can choose resolution, frame rate, aspect ratio, or screen orientation as needed, and the template adapts automatically. It also supports multiple voice services, including built-in EdgeTTS voices. ClipTurbo is still in early development and is currently available only to registered users of Sanhua AI.
OuteTTS is an experimental text-to-speech model that uses pure language modeling methods to generate speech. Its importance lies in its ability to convert text into natural-sounding speech through advanced language model technology, which is of great significance to areas such as speech synthesis, voice assistants and automatic dubbing. Developed by OuteAI, this model provides support for the Hugging Face model and the GGUF model, and can perform advanced functions such as voice cloning through the interface.
OuteTTS-0.1-350M is a text-to-speech synthesis technology based on a pure language modeling approach. It requires no external adapters or complex architectures, achieving high-quality speech synthesis through carefully designed prompts and audio tokens. The model is based on the LLaMa architecture with 350M parameters, demonstrating the potential of using language models directly for speech synthesis. It processes audio in three steps: audio tokenization with WavTokenizer, CTC forced alignment to create a precise word-to-audio-token mapping, and construction of structured prompts that follow a specific format. Key advantages of OuteTTS include its pure language modeling approach, voice cloning capability, and compatibility with llama.cpp and the GGUF format.
Fish Speech is a product focused on speech synthesis. It uses advanced deep learning technology to convert text into natural, fluent speech. The product supports multiple languages, including Chinese and English, and suits scenarios that require text-to-speech conversion, such as voice assistants and audiobook production. Its main advantages are high-quality speech output, ease of use, and flexibility. Background information shows that the product is continuously updated, with a growing dataset and improved quantizer parameters, to provide better service.
MiniMates is a lightweight, image-driven digital human algorithm that runs in real time on an ordinary computer and supports both voice-driven and expression-driven modes. It is 10-100 times faster than algorithms on the market such as LivePortrait, EchoMimic, and MuseTalk, allowing users to customize their own AI companion with very little resource consumption. Its main advantages include an extremely fast experience, personalized customization, and the ability to be embedded on-device, removing the dependence on Python and CUDA. MiniMates is released under the MIT license and suits applications that need fast, efficient facial animation and speech synthesis.
SoundStorm is an audio generation technology developed by Google Research that significantly reduces audio synthesis time by generating audio tokens in parallel. The technology can generate high-quality audio that stays highly consistent with the given voice and acoustic conditions, and it can be combined with a text-to-semantic model to control the spoken content, the speaker's voice, and speaker turns, enabling long-form speech synthesis and natural dialogue generation. The importance of SoundStorm is that it addresses the slow inference of traditional autoregressive audio generation models on long sequences, improving both the efficiency and the quality of audio generation.
MaskGCT TTS Demo is a text-to-speech (TTS) demonstration based on the MaskGCT model, provided by amphion on the Hugging Face platform. The model uses deep learning technology to convert text into natural, fluent speech and suits multiple languages and scenarios. MaskGCT has attracted attention for its efficient speech synthesis capabilities and multi-language support; it can improve the accuracy of speech recognition and synthesis and provide personalized speech services across application scenarios. The demo is currently free to try on the Hugging Face platform; specific pricing and positioning have not yet been announced.
GLM-4-Voice is an end-to-end speech model developed by the Tsinghua University team. It can directly understand and generate Chinese and English speech for real-time voice dialogue. Using advanced speech recognition and synthesis technology, it achieves seamless speech-to-text-to-speech conversion with low latency and intelligent conversational ability. The model is optimized for intelligence and expressiveness in the speech modality and suits scenarios that require real-time voice interaction.
MaskGCT is an innovative zero-shot text-to-speech (TTS) model that addresses problems in both autoregressive and non-autoregressive systems by eliminating the need for explicit alignment information and phoneme-level duration prediction. MaskGCT uses a two-stage model: in the first stage, text is used to predict semantic tokens extracted from a speech self-supervised learning (SSL) model; in the second stage, the model predicts acoustic tokens conditioned on those semantic tokens. MaskGCT follows a mask-and-predict learning paradigm: during training it learns to predict masked semantic or acoustic tokens given conditions and prompts, and during inference it generates tokens of a specified length in parallel. Experiments show that MaskGCT surpasses current state-of-the-art zero-shot TTS systems in quality, similarity, and intelligibility.
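The mask-and-predict decoding described above is a general technique rather than something unique to MaskGCT. The sketch below illustrates the basic idea of confidence-based iterative parallel decoding (in the spirit of MaskGIT-style schedulers); it is a generic illustration, not MaskGCT's actual implementation, and `model` is a hypothetical network returning per-position token logits given partially masked tokens plus conditioning.

```python
# Generic illustration of confidence-based mask-and-predict parallel decoding.
import math
import torch

def mask_predict_decode(model, cond, seq_len, mask_id, steps=10):
    """Fill a fully masked sequence over a few steps, committing the most
    confident predictions first according to a cosine schedule."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(1, steps + 1):
        logits = model(tokens, cond)                  # (1, seq_len, vocab)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                # best token per position + confidence
        masked = tokens.eq(mask_id)
        if not masked.any():
            break
        # How many positions should remain masked after this step (cosine schedule).
        n_masked_next = int(math.cos(math.pi / 2 * step / steps) * seq_len)
        n_unmask = max(int(masked.sum()) - n_masked_next, 1)
        # Commit only positions that are still masked, highest confidence first.
        conf = conf.masked_fill(~masked, float("-inf"))
        idx = conf.topk(n_unmask, dim=-1).indices
        tokens.scatter_(1, idx, pred.gather(1, idx))
    return tokens
```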
F5-TTS is a text-to-speech synthesis (TTS) model developed by the SWivid team. It uses deep learning technology to convert text into natural and smooth speech output that is faithful to the original text. When generating speech, this model not only pursues high naturalness, but also focuses on the clarity and accuracy of speech. It is suitable for various application scenarios that require high-quality speech synthesis, such as voice assistants, audiobook production, automatic news broadcasts, etc. The F5-TTS model is released on the Hugging Face platform, which users can easily download and deploy. It supports multiple languages and sound types and has high flexibility and scalability.
Llama 3.2 3b Voice is a speech synthesis model based on the Hugging Face platform, which can convert text into natural and smooth speech. This model uses advanced deep learning technology and can imitate the intonation, rhythm and emotion of human speech, and is suitable for a variety of scenarios, such as voice assistants, audio books, automatic broadcasts, etc.
VALL-E 2 is a speech synthesis model from Microsoft Research Asia. It uses repetition-aware sampling and grouped code modeling to greatly improve the robustness and naturalness of synthesized speech. The model converts written text into natural speech and suits fields such as education, entertainment, and multilingual communication, playing an important role in improving accessibility and enhancing cross-language communication.
The Deepgram Voice Agent API is a unified speech-to-speech API that allows natural-sounding conversations between humans and machines. The API is powered by industry-leading speech recognition and speech synthesis models to listen, think and speak naturally and in real time. Deepgram is committed to driving the future of voice-first AI through its voice agent API, integrating advanced generative AI technology to create a business world capable of smooth, human-like voice agents.
MiniMax Model Matrix is a suite of products that integrates multiple large AI models, covering video generation, music generation, text generation, and speech synthesis. It aims to drive innovation in content creation through advanced artificial intelligence. These models can deliver high-resolution, high-frame-rate video generation, compose music in various styles, generate high-quality text, and provide speech synthesis with highly human-like timbres. The MiniMax model matrix represents cutting-edge AI for content creation: efficient, innovative, and diverse, able to meet the creative needs of different users.
iFlytek Virtual Human uses the latest AI avatar technology, combined with core AI capabilities such as speech recognition, semantic understanding, speech synthesis, NLP, and the Spark large model, to provide multi-scenario virtual human products covering avatar asset construction, AI driving, and multimodal interaction. It offers one-stop production of virtual human audio and video content, with AIGC adding flexibility and efficiency: enter text or a recording in the virtual 'AI studio', output finished audio and video with one click, and render a script within 3 minutes.
AI-Faceless-Video-Generator is a project that uses artificial intelligence technology to generate video scripts, voices and talking avatars based on topics. It combines sadtalker for facial animation, gTTS to generate AI speech, and OpenAI language model to generate scripts, providing an end-to-end solution for generating personalized videos. Key benefits of the project include script generation, AI voice generation, facial animation creation, and an easy-to-use interface.
OptiSpeech is an efficient, lightweight and fast text-to-speech model designed for on-device text-to-speech conversion. It leverages advanced deep learning technology to convert text into natural-sounding speech, making it suitable for applications that require speech synthesis in mobile devices or embedded systems. The development of OptiSpeech was supported by GPU resources provided by Pneuma Solutions, which significantly accelerated the development process.
Mini-Omni is an open source multi-modal large-scale language model that can achieve real-time speech input and streaming audio output dialogue capabilities. It features real-time speech-to-speech dialogue without the need for additional ASR or TTS models. In addition, it can also perform speech output while thinking, supporting the simultaneous generation of text and audio. Mini-Omni further enhances performance with batch inference of 'Audio-to-Text' and 'Audio-to-Audio'.
speech-to-speech is an open source, modular GPT-4o-style project that implements speech-to-speech conversion through a cascade of components: voice activity detection, speech-to-text, a language model, and text-to-speech. It leverages the Transformers library and models available on the Hugging Face hub, providing a high degree of modularity and flexibility.
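As a rough sketch of this kind of cascade, the snippet below chains Hugging Face `transformers` pipelines for the speech-to-text, language model, and text-to-speech stages. The checkpoints are placeholders rather than the ones the project ships with, and the voice-activity-detection stage is omitted for brevity.

```python
# Minimal cascade sketch: speech -> text -> reply -> speech, using transformers pipelines.
# Checkpoint names are illustrative placeholders; the actual project is configurable.
from transformers import pipeline

stt = pipeline("automatic-speech-recognition", model="openai/whisper-small")
llm = pipeline("text-generation", model="HuggingFaceTB/SmolLM-360M-Instruct")
tts = pipeline("text-to-speech", model="suno/bark-small")

def respond(audio_path: str):
    user_text = stt(audio_path)["text"]                              # 1. transcribe
    reply = llm(user_text, max_new_tokens=64)[0]["generated_text"]   # 2. generate a reply
    speech = tts(reply)                                              # 3. synthesize
    return speech["audio"], speech["sampling_rate"]
```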
Bailing-TTS is a large-scale text-to-speech (TTS) model series developed by Giant Network’s AI Lab that focuses on generating high-quality Chinese dialect speech. The model uses continuous semi-supervised learning and a specific Transformer architecture to effectively align text and speech tokens through a multi-stage training process to achieve high-quality speech synthesis in Chinese dialects. Bailing-TTS has demonstrated speech synthesis effects close to natural human expressions in experiments, which is of great significance to the field of dialect speech synthesis.
Gan.AI is a company focused on conversational artificial intelligence research and products. It is committed to providing personalized video and audio communication solutions to well-known global brands through its advanced AI technology. The company's products and technologies have demonstrated remarkable results in personalized marketing, fan engagement, and improved user experience, and have been recognized and applied by brands including Samsung, Coca-Cola, and the San Antonio Spurs.
Wondercraft is an innovative online service that converts an author's manuscript into a voice reading that sounds like the author's own voice. This technology not only saves authors the time and money of recording in a studio and hiring audio experts to edit mixes, but it also provides an efficient, cost-effective solution that allows authors to focus on creating without having to be distracted by audio production.
ElevenLabs AI Audio API provides high-quality speech synthesis services, supports multiple languages, and suits chatbots, agents, websites, and applications, with low latency and fast response. The API supports enterprise-grade requirements, ensuring data security and SOC 2 and GDPR compliance.
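For orientation, a request to the ElevenLabs text-to-speech REST endpoint looks roughly like the following; the voice ID and model ID are placeholders, so check the official API documentation for current values.

```python
# Sketch of a text-to-speech request against the ElevenLabs REST API.
# VOICE_ID and model_id are placeholders; consult the API docs for real values.
import requests

VOICE_ID = "your-voice-id"
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"

resp = requests.post(
    url,
    headers={"xi-api-key": "YOUR_API_KEY", "Content-Type": "application/json"},
    json={
        "text": "Hello from the ElevenLabs API.",
        "model_id": "eleven_multilingual_v2",
    },
)
resp.raise_for_status()

with open("output.mp3", "wb") as f:
    f.write(resp.content)   # the endpoint returns MP3 audio by default
```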
CosyVoice is a large-scale multi-lingual speech generation model that not only supports speech generation in multiple languages, but also provides full-stack capabilities from inference to training to deployment. This model is important in the field of speech synthesis because it can generate natural and smooth speech that is close to real people and is suitable for multiple language environments. Background information on CosyVoice shows that it was developed by the FunAudioLLM team under the Apache-2.0 license.
Swift is a fast AI voice assistant powered by Groq, Cartesia and Vercel. It uses Groq for fast inference with OpenAI Whisper and Meta Llama 3, Cartesia's Sonic speech model for fast speech synthesis, and real-time streaming to the front end. VAD technology is used to detect the user speaking and run callbacks on the speech clips. Swift is a Next.js project written in TypeScript and deployed on Vercel.
FunAudioLLM is a framework designed to enhance natural voice interaction between humans and large language models (LLMs). It contains two innovative models: SenseVoice handles high-precision multilingual speech recognition, emotion recognition, and audio event detection, while CosyVoice handles natural speech generation with multilingual, timbre, and emotion control. SenseVoice supports more than 50 languages with extremely low latency; CosyVoice excels at multilingual voice generation, zero-shot in-context generation, cross-lingual voice cloning, and instruction following. The models are open-sourced on ModelScope and Hugging Face, with the corresponding training, inference, and fine-tuning code released on GitHub.
Text-to-speech technology is a technology that converts text information into speech. It is widely used in assisted reading, voice assistants, audiobook production and other fields. It improves the convenience of information acquisition by simulating human speech, which is especially helpful for visually impaired people or those who cannot use their eyes to read.
Azure Cognitive Services Speech is a speech recognition and synthesis service launched by Microsoft that supports speech-to-text and text-to-speech functions in more than 100 languages and dialects. It improves the accuracy of your transcriptions by creating custom speech models that handle specific terminology, background noise, and accents. In addition, the service also supports real-time speech-to-text, speech translation, text-to-speech and other functions, and is suitable for a variety of business scenarios, such as subtitle generation, post-call transcription analysis, video translation, etc.
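A minimal text-to-speech sketch with the Azure Speech SDK for Python follows; the subscription key, region, and voice name are placeholders.

```python
# Minimal Azure Speech SDK text-to-speech sketch; key, region, and voice are placeholders.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="eastus")
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

# Write the synthesized audio to a WAV file instead of the default speaker.
audio_config = speechsdk.audio.AudioOutputConfig(filename="greeting.wav")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config,
                                          audio_config=audio_config)

result = synthesizer.speak_text_async("Hello from Azure Speech.").get()
if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesis failed:", result.reason)
```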
This product is an advanced online text-to-speech tool that uses artificial intelligence technology to convert text into natural and realistic speech. It supports multiple languages and voice styles and is suitable for advertising, video narration, audiobook production and other scenarios, enhancing the accessibility and attractiveness of content. Product background information shows that it provides great convenience for digital marketers, content creators, audiobook authors, and educators.
ToucanTTS is a multilingual and controllable text-to-speech synthesis toolkit developed by the Institute of Natural Language Processing at the University of Stuttgart in Germany. It's built using pure Python and PyTorch to keep it simple and easy to get started while being as powerful as possible. The toolkit supports teaching, training, and using cutting-edge speech synthesis models with a high degree of flexibility and customizability for education and research.
Awesome-ChatTTS is an open source project that aims to provide FAQs and related resource collections for the ChatTTS project to help users get started quickly and solve problems they may encounter during use. This project not only compiles detailed installation guides and parameter descriptions, but also provides examples of various tone seeds, as well as auxiliary materials such as video tutorials.
sherpa-onnx is a speech recognition and speech synthesis project based on the next generation Kaldi. It uses onnxruntime for inference and supports a variety of speech-related functions, including speech-to-text (ASR), text-to-speech (TTS), speaker recognition, speaker verification, language recognition, keyword detection, etc. It supports multiple platforms and operating systems, including embedded systems, Android, iOS, Raspberry Pi, RISC-V, servers, and more.
StreamSpeech is a real-time speech-to-speech translation model based on multi-task learning. It simultaneously learns translation and synchronization strategies through a unified framework, effectively identifies translation opportunities in streaming speech input, and achieves a high-quality real-time communication experience. The model achieves leading performance on the CVSS benchmark and can provide low-latency intermediate results such as ASR or translation results.
AudioLCM is a PyTorch-based text-to-audio generation model that uses a latent consistency model to produce high-quality audio efficiently. Developed by Huadai Liu and collaborators, it provides an open source implementation and pre-trained models. It can convert text descriptions into near-realistic audio and has important application value, especially in fields such as speech synthesis and audio production.
seed-tts-eval is a test set for evaluating a model's zero-shot speech generation capability. It provides an objective, out-of-domain evaluation set, including samples drawn from English and Mandarin public corpora, to measure model performance on various objective metrics. It uses 1,000 samples from the Common Voice dataset and 2,000 samples from the DiDiSpeech-2 dataset.
Seed-TTS is a series of large-scale autoregressive text-to-speech (TTS) models from ByteDance that can generate speech indistinguishable from human speech. It excels in speech in-context learning, speaker similarity, and naturalness, and can be fine-tuned to further improve subjective ratings. Seed-TTS also provides superior control over speech attributes such as emotion and can generate highly expressive and diverse speech. The work further proposes a self-distillation method for speech factorization and a reinforcement learning approach to enhance model robustness, speaker similarity, and controllability. Also presented is Seed-TTS_DiT, a non-autoregressive (NAR) variant of Seed-TTS that adopts a fully diffusion-based architecture and performs speech generation end to end without relying on pre-estimated phoneme durations.
ChatTTS-ui is a web interface and API interface provided for the ChatTTS project, allowing users to perform speech synthesis operations through web pages and make remote calls through the API interface. It supports a variety of timbre options, and users can customize speech synthesis parameters, such as laughter, pauses, etc. This project provides an easy-to-use interface for speech synthesis technology, lowering the technical threshold and making speech synthesis more convenient.
ChatTTS is a sound generation model designed for dialogue scenarios. It is especially suitable for dialogue tasks of large-scale language model assistants, as well as applications such as conversational audio and video introductions. It supports Chinese and English, and demonstrates high-quality and natural speech synthesis capabilities by using approximately 100,000 hours of Chinese and English data training.
AudioBook Bot is a tool that uses generative artificial intelligence to convert text to speech. It can provide the voices of multiple characters for your book and narrate the book using your own voice. It can generate audiobooks with an entire cast of characters with very few samples.
TTS Generator AI is an innovative free online text-to-speech tool that uses advanced AI technology to convert written text into high-quality, natural and smooth audio. The tool is suitable for a variety of users, including students who need auditory learning materials, researchers who want to listen to long-form documents, and professionals who want to make their written content more accessible. One of the highlights of the TTS tool is its ability to support a variety of text formats, from simple text files to complex PDF files, making it very flexible.
OpenVoice V2 is a text-to-speech (TTS) model released in April 2024 that includes all the features of V1 with further improvements. It uses different training strategies, delivers better audio quality, and supports multiple languages including English, Spanish, French, Chinese, Japanese, and Korean. It is also free for commercial use. OpenVoice V2 can accurately clone a reference timbre and generate speech in a variety of languages and accents. It supports zero-shot cross-lingual voice cloning, meaning neither the language of the generated speech nor the language of the reference speech needs to appear in the large-scale multilingual training data.
The A.I. intelligent customer service solution is a complete customer service system that iFlytek provides to enterprises based on its advanced voice technology. The system delivers intelligent outbound calling, intelligent answering, voice navigation, online text-based customer service, quality inspection analysis, and agent assistance across channels such as phone, web, app, mini-programs, and self-service terminals. Through high-accuracy speech recognition engines, natural and fluent speech synthesis, intelligent barge-in, IVR navigation, and customer service platform middleware, it helps companies improve service efficiency, reduce labor costs, and improve the customer service experience.
Hume AI’s Empathic Voice Interface (EVI) is an API driven by the Empathic Large Language Model (eLLM), which can understand and simulate speech pitch, word accent, etc. to optimize human-computer interaction. It is based on more than 10 years of research, millions of patent data points and more than 30 papers published in top journals. EVI aims to provide a more natural and compassionate voice interface for any application, making people's interactions with AI more humane. This technology can be widely used in sales/meeting analysis, health and wellness, AI research services, social networks and other fields.
AI Speech Generator is a simple, easy-to-use product that uses artificial intelligence to convert text into audio. It offers up to 25 different voices for natural English delivery. Users simply enter text on Telegram and receive the corresponding audio in reply, with no waiting. Try it now to quickly convert text to speech.
ApolloAI is an artificial intelligence platform that provides AI images, videos, music, speech synthesis and other functions. Users can generate various types of content through text or image input, and have commercial use rights. Pricing is flexible, with both subscription and one-time purchase models available.
Voice Engine is an advanced speech synthesis model that can generate natural speech that is very similar to the original speaker with only 15 seconds of speech samples. This model is widely used in education, entertainment, medical and other fields. It can provide reading assistance for non-literate people, translate speech for video and podcast content, and give unique voices to non-verbal people. Its significant advantages are that it requires fewer speech samples, generates high-quality speech, and supports multiple languages. Voice Engine is currently in a small-scale preview stage, and OpenAI is discussing its potential applications and ethical challenges with people from all walks of life.
VoiceBar provides highly realistic AI speech synthesis services covering multiple languages and accents, with advanced voice quality and realism. No subscription is required and usage-based pricing is highly competitive. It suits voice messages, multilingual text-to-speech, TikTok, explainer videos, learning, and other scenarios.
The AI video dubbing and text-to-video app is a perfect tool for content creators, marketers, production companies, and businesses. Use our real, human-like AI voices and animated AI characters to dub your existing videos in 40 natural languages, or create videos from text. Fast, accurate translation and lip-sync capabilities give you studio-like quality. Pricing is flexible, fast and affordable.
This product uses AI technology to automatically dub videos and synchronize lip movements, making multilingual video translation easy while preserving the original timbre. Its main features are: 1) more than 33% synchronization accuracy, comparable to manual lip-syncing; 2) lossless video resolution; 3) high-fidelity voice translation. Target users include corporate training departments, salespeople, marketing teams, and content creators. A free entry version and a paid professional version are available; you are welcome to try it.
The Aura TTS (text-to-speech) demo showcases Deepgram’s advanced speech synthesis technology that converts text into natural-sounding speech with multiple voice options.
REECHO.AI is a hyper-realistic AI voice cloning platform. Users upload voice samples, and the system uses deep learning to clone the voice and generate extremely high-quality AI speech, enabling voice-style conversion across different characters. The platform provides voice creation, dubbing, and other services, letting more people participate in voice content creation through AI and lowering the barrier to entry. It is positioned as a mass-market platform, with basic features free to use.
VideoTrans is a free and open source video translation and dubbing tool. It can recognize video subtitles with one click, translate them into other languages, perform multiple speech synthesis, and finally output target language videos with subtitles and dubbing. The software is easy to use and supports a variety of translation and dubbing engines, which can greatly improve the efficiency of video translation.
ToolBaz is a free AI writing tool that can help users generate various AI content, including stories, emails, lyrics, pictures, voices, etc. It provides a variety of AI tools that can quickly generate content similar to human writing to meet users' various writing needs.
AnyGPT is a unified multi-modal large-scale language model that uses discrete representations to perform unified processing of various modalities, including speech, text, images and music. AnyGPT can stabilize training without changing the current large language model architecture or training paradigm. It relies entirely on data-level preprocessing, which promotes the seamless integration of new modalities into language models, similar to the addition of new languages. We construct a text-centric multimodal dataset for multimodal alignment pre-training. Leveraging generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108,000 multi-turn dialogue examples, with multiple modalities intertwined, thus enabling the model to handle any combination of multi-modal input and output. Experimental results show that AnyGPT can promote any-to-any multi-modal dialogue while achieving performance comparable to dedicated models in all modalities, proving that discrete representations can effectively and conveniently unify multiple modalities in language models.
BASE TTS is a large-scale text-to-speech model developed by Amazon. It uses a 1-billion-parameter autoregressive Transformer to convert text into discrete speech codes, then generates the speech waveform with a convolutional decoder. The model was trained on more than 100,000 hours of public speech data, setting a new state of the art in speech naturalness. It also introduces novel speech coding techniques featuring speaker disentanglement and compression. As model size increases, BASE TTS demonstrates the ability to render complex sentences with natural prosody.
Celebrity AI Voice Generator is a free online tool that can quickly generate the voice of any celebrity. It uses advanced AI technology to simulate and generate the voices of celebrities by analyzing their voice samples. Users only need to enter the name of the celebrity and the corresponding voice will be generated. Celebrity AI Voice Generator can be used in a variety of scenarios such as personal entertainment, education, and advertising.
Stability AI's high-fidelity text-to-speech work aims to provide natural language guidance for speech synthesis models trained on large-scale datasets. It enables natural language guidance by annotating different speaker identities, styles, and recording conditions, and the method is applied to a 45,000-hour dataset used to train a speech language model. The work also proposes simple ways to improve audio fidelity and, despite relying entirely on found data, performs well to a large extent.
Luvvoice is a free text-to-speech tool that offers more than 200 voice options to convert text to speech according to user needs. Luvvoice offers the advantages of ease of use, multi-language support and high-quality voice synthesis. Luvvoice's pricing is very affordable, allowing users to use more features for free, while also offering premium features for a fee.
Gotalk.ai is a powerful AI speech generator capable of creating lifelike voices in minutes. Perfect for YouTube, podcasts, and phone system greetings. Experience natural speech synthesis through advanced AI algorithms and deep learning technology. Our platform offers advanced AI speech synthesis and is the solution of choice for professionals looking for innovative and efficient speech generation tools.
Whisper Speech is a fully open source text-to-speech model trained by Collabora and LAION on the JUWELS supercomputer. It supports multiple languages and multiple integration options, including Node.js, Python, Elixir, HTTP, Cog, and Docker. The model's advantages are efficient speech synthesis and flexible deployment. In terms of pricing, Whisper Speech is completely free; it is positioned to provide developers and researchers with a powerful, customizable text-to-speech solution.
OpenAI Donakosy is a powerful AI platform that can generate a variety of text content, including articles, blogs, advertisements, sales and marketing documents, social media content, business names and winning strategies, etc., supporting 53 languages. It also provides features such as advanced analytics, team management, project management, and custom templates. Sign up for a free trial now!
Crikk is an affordable yet capable text-to-speech tool that supports 56 languages and provides realistic speech synthesis. Whether used for voice broadcasts, audiobooks, or education, Crikk delivers high-quality synthesized audio. Users can choose the free trial or the $20-per-month professional version, which includes a monthly quota of 500,000 characters, 6 different voices, and 56 languages. Crikk also plans to launch a mobile application that converts images or PDFs to speech. Monster Incorporation Inc. is located in Delaware, United States.
audio2photoreal is an open source project for generating photorealistic avatars from audio. It contains a pytorch implementation that can synthesize human images of conversations from audio. The project provides training code, test code, pre-trained motion models, and dataset access. Its models include face diffusion model, human body diffusion model, human VQ VAE model and human guided transformer model. The project enables researchers and developers to train their own models and synthesize high-quality, lifelike avatars based on speech.
OpenVoice is an open source voice cloning technology that can accurately clone reference timbres and generate voices in multiple languages and accents. It can flexibly control speech style parameters such as emotion and accent, as well as rhythm, pauses and intonation, etc. It implements zero-shot cross-language speech cloning, that is, neither the language of the generated speech nor the reference speech needs to appear in the training data.
Guagua Audio Production AI+ is a full-process integrated sound production tool that combines human-machine cooperation, speech synthesis, virtual recording studio and full-chain data to improve production efficiency and reduce costs. Users can use AI-assisted drawings and fully automatic track alignment functions to easily complete sound production. The product supports the mass production of audio works, and has internationally leading speech synthesis technology, providing a variety of timbre options. At the same time, the product also provides virtual recording studio and full-chain data management functions, making the production process more efficient and transparent.
Sound reproduction is an efficient, lightweight voice customization solution. By recording just a few seconds of audio in an open environment, users can quickly obtain their own exclusive AI-customized voice. Its core advantages include ultra-low cost, extremely fast cloning, high fidelity, and technological leadership. Applicable scenarios include video dubbing, voice assistants, in-car assistants, online education, and audio reading.
TurnVoice is a command line tool that converts and translates voices in YouTube videos. It provides voice conversion and voice translation, can replace the voice of a specific speaker, supports processing local files, and preserves the original background audio. The tool works with multiple speech synthesis engines and supports multiple languages. TurnVoice suits scenarios such as creative video production and voice translation. The product is currently under development; see the official website for details such as supported features and pricing.
Video Translate can translate uploaded videos with one click while preserving the natural style of the voice. It supports MP4, AVI, and MOV videos under 300MB and up to 60 seconds long. Translation covers multiple languages, and the speech synthesis comes from leading speech technology companies. Free and paid versions are available; the paid version offers higher-definition output. The product is positioned to help users seamlessly translate video content and reach multilingual audiences.
Free Text to Speech Online Converter is a multi-language text-to-speech online platform. It supports more than 20 languages, has natural pronunciation, is free to use without registration, and has fast conversion speed.