In my last post, I described trying out the Kokoro text-to-speech (TTS) model via the Kokoro-FastAPI web UI in a macOS (native) container. Here, I install Kokoro-TTS and Abogen on Windows to take advantage of my NVIDIA GPU.
Kokoro-TTS on Windows
I mentioned Kokoro-TTS in passing in my previous post. Kokoro-TTS is...
> A CLI text-to-speech tool using the Kokoro model, supporting multiple languages, voices (with blending), and various input formats including EPUB books and PDF documents.
On Windows, I did the basic install:
- run `uv tool install kokoro-tts` or `pip install kokoro-tts` in a working directory,
- and download the ONNX model file and voices file to the same directory.
But, alas, I could not get Kokoro-TTS to use my NVIDIA GPU: it runs an ONNX model rather than PyTorch, and I could not be bothered to figure out the GPU installation for it...
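Before giving up on the GPU, one quick sanity check is to ask ONNX Runtime which execution providers it actually exposes. A minimal diagnostic sketch, assuming `onnxruntime` (or `onnxruntime-gpu`) is installed; the `gpu_available` helper is my own, not part of Kokoro-TTS:

```python
def gpu_available(providers):
    """Return True if ONNX Runtime lists the CUDA execution provider."""
    return "CUDAExecutionProvider" in providers

try:
    import onnxruntime as ort
    providers = ort.get_available_providers()
except ImportError:
    providers = ["CPUExecutionProvider"]  # onnxruntime not installed here

print("GPU available:", gpu_available(providers))
```

If only `CPUExecutionProvider` shows up, the plain `onnxruntime` wheel is installed and inference will stay CPU-bound regardless of what Kokoro-TTS does.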
Abogen on Windows
Enter Abogen, which has a nice cross-platform PyQt desktop GUI and is much faster, utilising my GPU rather than CPU only.
> Abogen is a powerful text-to-speech conversion tool that makes it easy to turn ePub, PDF, or text files into high-quality audio with matching subtitles in seconds.
Again following the instructions, I installed Abogen v1.1.16:
- first, download and install `espeak-ng.msi`,
- then, for NVIDIA (and using `uv` instead of `pip`):

```shell
mkdir abogen
cd abogen
uv venv
.venv/Scripts/activate.cmd
uv pip install abogen
```
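Once the install finishes, it is worth confirming inside the new venv that PyTorch can actually see the NVIDIA GPU. A hedged sketch, assuming Abogen's dependencies pulled in `torch`; the `describe_device` helper is just for illustration:

```python
def describe_device(cuda_ok):
    """Map the CUDA availability flag to the device inference would use."""
    return "cuda" if cuda_ok else "cpu"

try:
    import torch
    cuda_ok = torch.cuda.is_available()
except ImportError:
    cuda_ok = False  # torch not installed in this environment

print("Inference device:", describe_device(cuda_ok))
```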
Abogen is easy to use via the desktop user interface, and options include:
- voice blending!
- converting the ePub into a single audio file - I prefer `.m4b` (MPEG-4 audiobook format) with chapter markers, and metadata like title and author,
- optionally generating separate audio files for each chapter - I prefer `.mp3`, which is more widely supported on my devices (no metadata),
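To check that the chapter markers actually made it into the `.m4b`, one option is to inspect the file with `ffprobe`. A sketch assuming ffmpeg is installed; `book.m4b` is a hypothetical file name:

```python
import json
import shutil
import subprocess

def chapter_titles(ffprobe_json):
    """Extract chapter titles from ffprobe's JSON output."""
    data = json.loads(ffprobe_json)
    return [c.get("tags", {}).get("title", "") for c in data.get("chapters", [])]

# Only run ffprobe if it is actually on PATH; "book.m4b" is a placeholder.
if shutil.which("ffprobe"):
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_chapters", "book.m4b"],
        capture_output=True, text=True,
    ).stdout
    if out.strip():
        print(chapter_titles(out))
```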
Abogen is about 12x faster on my GPU than with just my CPU!