cross-posted from: https://hexbear.net/post/3182889

So I've had this idea floating around in my head for some time now, and I was finally able to get the required tools running on my PC.

Piper is a Text To Speech utility that uses deep learning voice models to read text aloud. This utility is primarily utilized in Home Assistant to allow for a self-hosted voice assistant experience. However, I think it has revolutionary potential.

This idea hit me a while back as a result of my use of the Read Aloud extension for Firefox. This extension supports Piper voice models, and over time I have settled on a model that, I think, is very nice to listen to and gets about 90% of pronunciations correct.

I spent a portion of my day today pulling some text from various Marxist thinkers and creating a few samples. I think there would be a lot of work involved in cleaning up the output produced by these models, but only as far as making sure the model pauses appropriately in some situations. The trouble spots are usually punctuation such as [] () - and years such as 1885 or 1993.

The symbols are basically ignored, which can create a kind of run-on sentence where it feels like the voice model should be winded due to the lack of pauses. Replacing these symbols with a comma seems to help, but it varies case by case. As for dates, the model often reads a year as though it were an ordinary number, so for example, 1917 becomes one-thousand nine-hundred and seventeen, instead of the more natural nineteen seventeen.

Tools could be built to detect these patterns in the text and replace them with the appropriate written equivalents.
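
Here is a rough sketch of what such a tool could look like in Python. It is not a finished tool, just the idea: brackets, parentheses, and spaced dashes get swapped for commas, and four-digit numbers get rewritten the way a year is spoken. The function names are made up for illustration, and it assumes that any standalone number from 1500 to 1999 is a year, which will be wrong some of the time.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out a number from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def year_to_words(year: int) -> str:
    """Read a year as two pairs of digits: 1917 -> 'nineteen seventeen'."""
    high, low = divmod(year, 100)
    if low == 0:
        return f"{two_digits(high)} hundred"      # 1900 -> nineteen hundred
    if low < 10:
        return f"{two_digits(high)} oh {ONES[low]}"  # 1905 -> nineteen oh five
    return f"{two_digits(high)} {two_digits(low)}"


def soften_symbols(text: str) -> str:
    """Turn brackets, parentheses, and spaced dashes into comma pauses."""
    text = re.sub(r"\s*[\[(]\s*", ", ", text)         # opening bracket
    text = re.sub(r"\s*[\])]", ",", text)             # closing bracket
    text = re.sub(r"\s+[-\u2013\u2014]\s+", ", ", text)  # hyphen, en or em dash
    # Drop a comma that ended up directly before other punctuation.
    return re.sub(r",(?=\s*[.,;:!?])", "", text)


def normalize(text: str) -> str:
    """Apply the symbol fixes, then spell out anything that looks like a year."""
    text = soften_symbols(text)
    # Assumption: a standalone number from 1500 to 1999 is a year.
    return re.sub(r"\b(1[5-9]\d{2})\b",
                  lambda m: year_to_words(int(m.group(1))), text)


print(normalize("He left in 1917 (see note)."))
```

Since the comma trick varies case by case, the output of something like this would still need a listen-through and hand corrections.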

Another issue would be how and when to read end notes aloud. My goal would be to utilize these audio files in something like AudioBookShelf. AudioBookShelf likes having the audio files broken down by chapter and section, so end notes likely would be read at the end of a section or chapter, making them easily skippable, as opposed to reading them inline.

Anyway, here are some samples I created today.

The bulk of the labor involved in generating these files is finding those areas where the flow of reading seems off. As noted above, they can be easy to spot, but every author will also have their own stylistic quirks that you'll need to work around. From there, chop the text up into files for each chapter and subsection, as well as end notes. Then run Piper against each text file, convert the output WAV file into a more digestible MP3, and finally apply some metadata to those files.
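
The last few steps could be scripted roughly like this. This is a sketch under assumptions, not the exact commands used for the samples: it assumes `piper` and `ffmpeg` are both on the PATH, that each chapter or section is already its own `.txt` file, and that file names sort in reading order. The function names and the folder layout are hypothetical.

```python
import subprocess
from pathlib import Path


def piper_command(model: Path, wav_path: Path) -> list[str]:
    """Build the Piper call; the text to read is passed on standard input."""
    return ["piper", "--model", str(model), "--output_file", str(wav_path)]


def ffmpeg_command(wav_path: Path, mp3_path: Path,
                   title: str, album: str, track: int) -> list[str]:
    """Build the ffmpeg call that encodes to MP3 and tags the file."""
    return [
        "ffmpeg", "-y", "-i", str(wav_path),
        "-codec:a", "libmp3lame", "-qscale:a", "2",
        "-metadata", f"title={title}",
        "-metadata", f"album={album}",
        "-metadata", f"track={track}",
        str(mp3_path),
    ]


def render_book(text_dir: Path, out_dir: Path, model: Path, album: str) -> None:
    """Run every .txt file in text_dir through Piper, then convert to MP3."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: file names sort in reading order, e.g. 01-preface.txt.
    for track, txt in enumerate(sorted(text_dir.glob("*.txt")), start=1):
        wav = out_dir / f"{txt.stem}.wav"
        mp3 = out_dir / f"{txt.stem}.mp3"
        subprocess.run(piper_command(model, wav),
                       input=txt.read_text(encoding="utf-8"),
                       text=True, encoding="utf-8", check=True)
        subprocess.run(ffmpeg_command(wav, mp3, txt.stem, album, track),
                       check=True)
        wav.unlink()  # keep only the MP3
```

Calling something like `render_book(Path("texts/some-book"), Path("audio/some-book"), Path("voice.onnx"), "Some Book")` would then produce one tagged MP3 per text file, with the paths and model file name standing in for whatever you actually use.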

This is still just a proof of concept. I'm sure there is a way in which we can collaborate to build the text-scripts needed for a clean recording, maybe a Git repository of texts. That part of this process is still not fleshed out.

I think as a goal, I'd like to create a kind of "Intro to Marxism" audio collection. A selection of books to onboard people, perhaps based on prolewiki's Absolute Beginner's List.

Then eventually I can tackle the bigger fish.

Thoughts?