In my last post, I described trying out the Kokoro text-to-speech (TTS) model via the Kokoro-FastAPI web UI in a macOS (native) container. Here, I install Kokoro-TTS and Abogen on Windows to take advantage of my Nvidia GPU.

Kokoro-TTS on Windows

I mentioned Kokoro-TTS in passing in my previous post. Kokoro-TTS is...

A CLI text-to-speech tool using the Kokoro model, supporting multiple languages, voices (with blending), and various input formats including EPUB books and PDF documents.

On Windows, I did the basic install:

  • run uv tool install kokoro-tts (or pip install kokoro-tts) in a working directory,
  • and download the ONNX model file and voices file to the same directory.
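The two steps above, as a sketch — the download URLs are placeholders, and the model/voices filenames are the ones the kokoro-tts README pointed at when I looked, so check the project page for the current release links:

```
# install the CLI into a working directory
uv tool install kokoro-tts

# fetch the ONNX model and voices bundle into the same directory
curl -LO <release URL for kokoro-v1.0.onnx>
curl -LO <release URL for voices-v1.0.bin>
```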

But, alas, I could not get Kokoro-TTS to use my Nvidia GPU: it runs an ONNX model rather than PyTorch, and I cannot be bothered to figure out the GPU-enabled installation...

Abogen on Windows

Enter Abogen, which has a nice cross-platform PyQt desktop GUI and is much faster, utilising my GPU rather than CPU only.

Abogen is a powerful text-to-speech conversion tool that makes it easy to turn ePub, PDF, or text files into high-quality audio with matching subtitles in seconds.

Again following the instructions, I installed Abogen v1.1.16:

  • first, download and install espeak-ng.msi;
  • then, for Nvidia (and using uv instead of pip):
    mkdir abogen
    cd abogen
    uv venv
    .venv\Scripts\activate.bat
    uv pip install abogen

    [Screenshot: Kokoro Text-to-Speech with Abogen v1.1.6]
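Before converting anything, it's worth a quick sanity check that the venv's PyTorch (which Abogen pulls in) can actually see the GPU. A small sketch, written defensively so it still prints something useful if torch isn't importable:

```shell
# If this prints "CUDA available: False", synthesis will fall back to CPU only.
python -c "import importlib.util as u; print(('CUDA available: ' + str(__import__('torch').cuda.is_available())) if u.find_spec('torch') else 'torch not installed')"
```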

Abogen is easy to use via the desktop user interface, and options include:

  • voice blending!
  • converting the ePub into a single audio file - I prefer .m4b (MPEG-4 audiobook format) with chapter markers and metadata like title and author,
  • optionally generating separate audio files for each chapter - I prefer .mp3, which is more widely supported on my devices (but carries no metadata),
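To confirm the chapter markers and metadata actually made it into the output, ffprobe (ships with ffmpeg) can dump them; book.m4b here is a stand-in for your generated file:

```
# list chapters and container metadata of the generated audiobook
ffprobe -hide_banner -show_chapters -show_format book.m4b
```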

Abogen is about 12x faster on my GPU than with just my CPU!
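For a rough sense of scale of that ~12x figure (the CPU time here is a hypothetical number, not a measurement from my runs):

```shell
# hypothetical example: a book needing 360 minutes of CPU-only synthesis
cpu_minutes=360
speedup=12
gpu_minutes=$((cpu_minutes / speedup))
echo "roughly ${gpu_minutes} minutes on the GPU"   # prints: roughly 30 minutes on the GPU
```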