Create Audiobooks With Text to Speech

The $3,000 Quote That Made Me Learn TTS

Self-published a book last year. 47,000 words. Did okay on Amazon. Then readers started asking: "Where's the audiobook?"

Good question. I wanted one too. Checked ACX (Amazon's audiobook platform). Narrators quoted $3,000 to $4,500. Per-finished-hour rates ranged from $200 to $400. My book would've been about 8 hours of audio.

Yeah, no. Not happening on a self-published budget.

So I tried text-to-speech. Downloaded some free TTS software. Pasted my manuscript. Hit export.

Sounded like a robot reading a grocery list. Monotone. No emotion. Mispronounced character names. Paused in weird places mid-sentence. Completely unusable.

About to give up. Then I found a Reddit thread where actual authors were using TTS successfully. Not the way I tried though. They had a whole workflow I didn't know existed.

Took me three weeks to figure out. Here's what actually works.

Why Your First TTS Attempt Will Sound Terrible

Most people do this: paste entire manuscript into a TTS tool, export, upload to ACX or Findaway Voices, get rejected for "poor audio quality."

That approach fails for specific reasons:

Dialogue tags break the flow. When your text says "Hello," she said nervously, the TTS voice reads the tag in the same flat tone as the dialogue, so "she said nervously" lands as an awkward interruption. Sounds awful.

Proper nouns get butchered. Character named Siobhan? TTS will say "See-oh-ban" instead of "Shiv-awn." Fantasy place names? Forget it.

No emotional variation. A dramatic chapter climax sounds identical to a calm breakfast scene. Zero tension. Zero pacing.

Chapter breaks aren't formatted right. TTS just keeps reading straight through "Chapter Seven" into the first sentence without proper pause timing.

You can't just export raw and expect it to work. But here's the thing—with proper prep work, modern TTS voices sound surprisingly good. Neural voices from Google, Amazon Polly, or ElevenLabs can pass for human if you set them up correctly.

The Actual Workflow Professional Self-Publishers Use

This is the process that works. Takes time upfront but saves thousands in narrator costs.

Step 1: Format Your Manuscript For Audio (Not Print)

Your print manuscript needs heavy edits before TTS can handle it.

Remove ALL dialogue tags. Seriously. Every "he said," "she whispered," "they shouted"—gone. Replace with action beats or just let the dialogue stand alone. Instead of:

"I'm leaving," John said angrily.

Write:

"I'm leaving." John slammed the door.

TTS reads the second version way cleaner. No awkward "said angrily" interruptions.
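Finding every tag by eye in 47,000 words is slow, so a small script can flag likely candidates for you to rewrite by hand. The following is a minimal Python sketch, not a complete solution: the verb list and the function name find_dialogue_tags are illustrative assumptions, and the pattern only catches a tag that directly follows a closing quote.

```python
import re

# Hypothetical verb list; extend it with the tags your own manuscript uses.
TAG_VERBS = ("said", "whispered", "shouted", "asked", "replied")

# A closing quote (straight or curly), whitespace, an optional name or
# pronoun, then one of the tag verbs.
TAG_PATTERN = re.compile(
    r'["\u201d]\s+(?:\w+\s+)?(?:%s)\b' % "|".join(TAG_VERBS),
    re.IGNORECASE,
)


def find_dialogue_tags(text):
    """Return (line_number, line) pairs that appear to contain a dialogue tag."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if TAG_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits
```

It only reports lines. Replacing a tag with an action beat is still a writing decision, so the rewrite stays manual.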

Fix all abbreviations. "Dr. Smith" becomes "Doctor Smith." "Mr. Jones" becomes "Mister Jones." TTS voices handle full words better than punctuation-heavy shortcuts.
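Abbreviation cleanup is mechanical enough to script. Here is a rough Python sketch, assuming a small hand-maintained mapping. The entries beyond "Dr." and "Mr." and the name expand_abbreviations are illustrative. Ambiguous cases such as "St." (Saint or Street) are deliberately left out because they need a human decision.

```python
import re

# Illustrative mapping only; add the abbreviations your manuscript uses.
ABBREVIATIONS = {
    "Dr.": "Doctor",
    "Mr.": "Mister",
    "Prof.": "Professor",
}


def expand_abbreviations(text, mapping=ABBREVIATIONS):
    """Replace each abbreviation in the mapping with its spelled-out form."""
    # Longest keys first so a longer abbreviation wins over a shorter prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:%s)" % "|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


print(expand_abbreviations("Dr. Smith met Mr. Jones."))
# prints: Doctor Smith met Mister Jones.
```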

Add pronunciation guides. Character named Aislinn? Spell it phonetically in your TTS document as "Ash-lin" so the voice says it right. You'll change it back for print later but for audio, phonetic wins.
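Keeping the respellings in one dictionary means the print manuscript never has to change: you generate the audio copy from it each time. A minimal Python sketch under that assumption follows. The respellings and the name apply_pronunciations are examples only, and each spelling should be tuned by ear for the voice you chose.

```python
import re

# Hypothetical respellings; adjust each one by ear for your chosen voice.
PRONUNCIATIONS = {
    "Aislinn": "Ash-lin",
    "Siobhan": "Shiv-awn",
}


def apply_pronunciations(text, mapping=PRONUNCIATIONS):
    """Replace whole-word names with phonetic respellings for the audio copy."""
    pattern = re.compile(r"\b(%s)\b" % "|".join(re.escape(k) for k in mapping))
    return pattern.sub(lambda m: mapping[m.group(1)], text)


print(apply_pronunciations("Aislinn waved at Siobhan."))
# prints: Ash-lin waved at Shiv-awn.
```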

Break chapters into smaller files. Don't feed TTS a 47,000-word document. Split into chapter files—Chapter01.txt, Chapter02.txt, etc. Way easier to fix mistakes without re-rendering everything.
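Splitting can be scripted too. The sketch below assumes every chapter begins with a heading line that starts with the word "Chapter" (for example "Chapter Seven" or "Chapter 7"). Anything before the first heading is dropped, and a body line that happens to start with "Chapter" would be treated as a heading. The function names are illustrative.

```python
import re
from pathlib import Path

# Assumed heading format: a line starting with "Chapter" plus a number or word.
HEADING = re.compile(r"^Chapter[ \t]+\S+.*$", re.MULTILINE)


def split_chapters(manuscript):
    """Return a list of chapter texts, each starting with its heading line."""
    starts = [m.start() for m in HEADING.finditer(manuscript)]
    ends = starts[1:] + [len(manuscript)]
    return [manuscript[s:e].strip() for s, e in zip(starts, ends)]


def write_chapter_files(manuscript, out_dir):
    """Write Chapter01.txt, Chapter02.txt, ... and return the created paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, chapter in enumerate(split_chapters(manuscript), start=1):
        path = out / f"Chapter{index:02d}.txt"
        path.write_text(chapter + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```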

Step 2: Choose The Right Voice (This Matters More Than You Think)

Free TTS voices sound robotic. You need neural or AI-trained voices.

Best options I tested:

Google Cloud Text-to-Speech – Neural2 voices sound natural. Costs about $16 per million characters. For a 50,000-word book (roughly 300,000 characters), expect around $5-$8 total. Journey or WaveNet voices work well too.

Amazon Polly – Neural voices (Joanna, Matthew) are solid. Similar pricing to Google: about $16 per million characters for neural voices. The $4 per million rate applies to the older standard voices.

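ElevenLabs – Best quality I've heard. Sounds almost human. More expensive though: the $22/month plan is capped at roughly 100,000 characters (the 30,000-character cap belongs to the cheaper starter tier, and plans change often, so check current pricing). You'd need multiple months for a full book but the quality difference is noticeable.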

Avoid free options like eSpeak or Festival. They sound like 1998 GPS navigation. Nobody will listen past chapter one.
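The per-character pricing is easy to sanity-check before you commit. This is a back-of-the-envelope Python sketch. The six-characters-per-word figure is an assumption (about five letters plus a space), so count the characters in your own chapter files for a real number.

```python
def estimate_cost(word_count, price_per_million_chars, chars_per_word=6.0):
    """Estimate synthesis cost in dollars.

    chars_per_word is an assumed average; measure your own files
    for an accurate figure.
    """
    characters = word_count * chars_per_word
    return characters / 1_000_000 * price_per_million_chars


# A 50,000-word book at $16 per million characters.
print(f"${estimate_cost(50_000, 16):.2f}")
# prints: $4.80
```

Re-rendering chapters after fixes adds to the bill, which is one reason a real total can land above the single-pass estimate.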

Don't just pick the first voice you hear. I tested this wrong initially—used a calm chapter with description, thought "yeah sounds fine," then rendered a fight scene and it sounded completely lifeless. Test with emotional scenes. Grab your most tense dialogue chapter. Render it with Matthew, then Joanna, then maybe Ruth. Listen to which one actually builds tension. Romance book? You need warmth in the voice. Thriller? Needs that edge. A narrator mismatch kills audiobooks—same with TTS voices.
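Rendering the same tense passage with each candidate voice is simple to automate. The sketch below assumes Amazon Polly through boto3, with AWS credentials already configured. The helper name render_voice_samples and the output file names are illustrative. The client is passed in so the function itself has no cloud dependency.

```python
from pathlib import Path


def render_voice_samples(client, text, voices, out_dir):
    """Render the same passage once per voice and return the MP3 paths.

    client is expected to behave like boto3.client("polly"). Keep the
    passage short: a single synthesize_speech request is size-limited.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for voice in voices:
        response = client.synthesize_speech(
            Text=text,
            OutputFormat="mp3",
            VoiceId=voice,
            Engine="neural",
        )
        path = out / f"sample_{voice}.mp3"
        path.write_bytes(response["AudioStream"].read())
        paths.append(path)
    return paths


# Usage, assuming boto3 is installed and credentials are set up:
#   import boto3
#   polly = boto3.client("polly")
#   render_voice_samples(polly, tense_scene, ["Matthew", "Joanna", "Ruth"], "samples")
```

Listen to the resulting files back to back. Hearing the same fight scene three times in a row makes the differences between voices obvious.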

Step 3: Use SSML Tags To Control Pacing and Emotion

Okay this part sounds technical but it's actually not that bad. SSML—Speech Synthesis Markup Language. Basically HTML but for telling TTS voices how to perform your text instead of how to display it.

I ignored SSML at first. Big mistake. The difference between using it and not using it is like night and day. With SSML, you control pauses, emphasis, pacing. Without it, you're stuck with whatever flat delivery the default voice gives you.

Example—add pauses:

<break time="2s"/>

This creates a 2-second pause. Use it after chapter titles or before scene changes.

Control speaking rate for different scenes:

<prosody rate="slow">The door creaked open.</prosody>

The voice slows down for that sentence, adding tension.

Emphasize words:

<emphasis level="strong">Never</emphasis> come back here.

The voice will stress "Never" like a real person would.

Adjust pitch for different characters (if you're brave enough to try character voices):

<prosody pitch="+10%">This is the high-pitched character.</prosody>

Learning SSML takes an afternoon. It's tedious. But it's the difference between robotic audio and something listenable. Tag support varies by engine and even by voice, so confirm that the voice you picked accepts the tags you plan to use before marking up the whole book.
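The snippets above are fragments. A full SSML document wraps everything in a <speak> root element, and characters like & or < in your prose have to be escaped or the engine will reject the input. Here is a minimal Python sketch that builds one document per chapter, with a pause after the title and at scene changes. The "***" scene marker and the name chapter_to_ssml are assumptions for illustration.

```python
from xml.sax.saxutils import escape

SCENE_BREAK = "***"  # assumed marker for a scene change in the audio copy


def chapter_to_ssml(title, paragraphs, pause="2s"):
    """Wrap a chapter in <speak>, pausing after the title and at scene breaks."""
    parts = ["<speak>", f"<p>{escape(title)}</p>", f'<break time="{pause}"/>']
    for paragraph in paragraphs:
        if paragraph.strip() == SCENE_BREAK:
            parts.append(f'<break time="{pause}"/>')
        else:
            # escape() converts &, < and > so the SSML stays well-formed.
            parts.append(f"<p>{escape(paragraph)}</p>")
    parts.append("</speak>")
    return "\n".join(parts)


print(chapter_to_ssml("Chapter Seven", ["The door creaked open.", "***", "Morning came."]))
```

Because every paragraph is escaped, this helper treats its input as plain text. Hand-written prosody or emphasis tags would need to be added after this step, or the helper extended to pass them through.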

Step 4: Export, Listen, Fix Mistakes, Repeat

First export will have issues. Guaranteed.

Mispronounced words. Weird pacing. Maybe the voice reads a URL out loud you forgot to delete. Listen to every chapter. Make notes. Fix the text file. Re-export just that chapter.

This step took me the longest. Went through each chapter 2-3 times before it sounded right. But way faster than manually recording or hiring a narrator.
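To avoid re-rendering (and re-paying for) chapters that did not change, you can track a hash of each chapter file. This is a rough Python sketch assuming the ChapterNN.txt naming from Step 1. The manifest file and the name changed_chapters are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def changed_chapters(chapter_dir, manifest_path):
    """Return chapter files that changed since the last call, then update the manifest."""
    manifest_file = Path(manifest_path)
    if manifest_file.exists():
        old = json.loads(manifest_file.read_text(encoding="utf-8"))
    else:
        old = {}
    new = {}
    changed = []
    for path in sorted(Path(chapter_dir).glob("Chapter*.txt")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        new[path.name] = digest
        if old.get(path.name) != digest:
            changed.append(path)
    # The manifest is saved immediately; if a render later fails,
    # delete the manifest or edit the file again to force a re-render.
    manifest_file.write_text(json.dumps(new, indent=2), encoding="utf-8")
    return changed
```

On the first run every chapter is reported. After that, only the files you edited since the previous run come back.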

Step 5: Edit Audio (Yes, You Still Need This)

TTS output isn't upload-ready. You'll need basic audio editing.

I use Audacity because it's free and does the job. Reaper's $60 if you want something more professional but honestly Audacity worked fine for me. Import your chapter audio files. You'll need to do a few things ACX requires: add 0.5 seconds of silence at the start (ACX calls it room tone) and at least a second at the end, normalize the volume so Chapter 3 doesn't suddenly blast louder than Chapter 2, and export as 192 kbps MP3 or WAV depending on what the platform wants.
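If you would rather script the padding than do it by hand in an editor, Python's standard wave module is enough for uncompressed files. The sketch below assumes 16-bit PCM WAV input, where silence is all zero bytes, and uses an illustrative function name. Converting to MP3 afterwards still needs a separate tool.

```python
import wave


def add_head_silence(src_path, dst_path, seconds=0.5):
    """Copy a 16-bit PCM WAV file, prepending digital silence at the start."""
    with wave.open(str(src_path), "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    silent_frames = int(params.framerate * seconds)
    # One zero-valued sample per channel per frame.
    silence = b"\x00" * (silent_frames * params.nchannels * params.sampwidth)
    with wave.open(str(dst_path), "wb") as dst:
        dst.setparams(params)
        dst.writeframes(silence + frames)
    return silent_frames
```

The return value is the number of frames added, which is handy for a quick check: at 44,100 Hz, half a second is 22,050 frames.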

ACX is picky about audio specs. I failed their automated check twice before figuring out my peaks were too hot (ACX wants peaks below -3 dB). Findaway Voices is way more flexible, but clean audio still matters.
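You can catch hot peaks before uploading by measuring the file yourself. The following Python sketch reads a 16-bit PCM WAV file and reports peak and RMS levels relative to full scale. The default limits (peak at or below -3 dB, RMS between -23 dB and -18 dB) are the figures I understand ACX to publish. Treat them as assumptions and confirm against the current submission requirements. The function names are illustrative.

```python
import array
import math
import sys
import wave


def _to_db(ratio):
    return 20 * math.log10(ratio) if ratio > 0 else float("-inf")


def measure_levels(path):
    """Return (peak_db, rms_db) relative to full scale for a 16-bit PCM WAV."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    if not samples:
        raise ValueError("file contains no audio")
    peak = max(abs(s) for s in samples) / 32768.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / 32768.0
    return _to_db(peak), _to_db(rms)


def check_levels(path, peak_max=-3.0, rms_range=(-23.0, -18.0)):
    """Return a list of human-readable problems; an empty list means OK."""
    peak_db, rms_db = measure_levels(path)
    problems = []
    if peak_db > peak_max:
        problems.append(f"peak {peak_db:.1f} dB is above {peak_max} dB")
    if not rms_range[0] <= rms_db <= rms_range[1]:
        problems.append(f"RMS {rms_db:.1f} dB is outside {rms_range}")
    return problems
```

This measures the whole file at once, which is a simplification. It does not check noise floor or file format, so it complements the platform's own checker rather than replacing it.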

Step 6: Test With Beta Listeners Before Publishing

Don't skip this. Send sample chapters to readers who already liked your book—they're invested enough to give honest feedback.

I sent three chapters to five people. Asked them straight up: "Does this sound listenable? Any parts where it sounds too robotic? Character voices working or weird?"

Two people said totally fine. Three pointed out specific sections that sounded off—one had a weird pause in dialogue, another mispronounced a location name I missed. Fixed those before uploading everything.

If your beta listeners say it sounds robotic, believe them. You've been listening to your own manuscript for weeks—you're deaf to problems at that point. Fresh ears catch what you miss.

When TTS Actually Works (And When It Doesn't)

Let's be real. TTS audiobooks aren't for every genre.

Works well for:

Non-fiction (business, self-help, how-to guides)
Light fiction with limited dialogue
Genre fiction where readers prioritize story over narration
Books with straightforward prose (no heavy accents, dialects, or complex character voices)

Struggles with:

Literary fiction requiring emotional nuance
Books with heavy accents or dialect writing
Multiple POV characters needing distinct voices
Poetry or experimental prose

My book was a sci-fi thriller. Worked fine. Friend tried TTS on his Southern Gothic novel with heavy dialect. Sounded terrible. Know your genre.

The Honest Cost and Time Breakdown

Here's what I actually spent:

Google Cloud TTS neural voice: $7
Audacity (audio editor): Free
Time formatting manuscript: ~12 hours
Time rendering and fixing audio: ~8 hours
Time editing final audio: ~6 hours

Total: $7 and about 26 hours of work

vs. hiring a narrator: $3,000-$4,500

Worth it? Depends on your budget and time. If you're broke and have free weekends, yeah. If you're making $10k per book, pay the narrator.
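One way to frame the trade-off is to ask what the 26 hours effectively "paid". A tiny Python sketch using the figures above follows. The function name is illustrative, and the calculation ignores any difference in sales between a TTS and a human-narrated edition.

```python
def effective_hourly_saving(narrator_quote, tts_cost, hours_worked):
    """Dollars saved per hour of your own labor by going the TTS route."""
    return (narrator_quote - tts_cost) / hours_worked


low = effective_hourly_saving(3_000, 7, 26)
high = effective_hourly_saving(4_500, 7, 26)
print(f"${low:.0f} to ${high:.0f} per hour")
# prints: $115 to $173 per hour
```

If your time is worth more than that, or the hours simply aren't there, the narrator quote starts to look reasonable.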

Should You Actually Do This?

TTS audiobooks are controversial. Some readers hate them. ACX reviewers can be picky about audio quality. You risk negative reviews mentioning "computer-generated narration."

But here's the thing: most listeners can't tell if you do it right. I've sold 200+ audiobook copies. Three reviews mentioned the voice. Two said they didn't mind. One complained.

If your book isn't selling enough to justify narrator costs, TTS is a legitimate option. Better to have an imperfect audiobook than no audiobook at all.

Just be honest in your product description. Don't pretend it's human-narrated. Say "digitally narrated" or "AI-narrated" upfront. Readers appreciate transparency.

If you rendered a chapter in several segments, any audio editor or merge tool can stitch them into a single cohesive file before you upload.

And if you're going the TTS route? Spend the time on SSML formatting. That's the real difference between "this sounds like a robot" and "wait, is this actually AI?"