Overview

AWS Polly (officially named Amazon Polly) is a text-to-speech (TTS) service provided by Amazon Web Services that converts written text into spoken audio. The service supports a range of languages and voices, enabling developers to integrate speech capabilities into various applications. Polly offers different voice types, including Standard voices, Neural Text-to-Speech (NTTS) voices, and Long-form Text-to-Speech (LfTTS) voices. NTTS voices use deep learning to produce more natural, human-like speech with better intonation and expressiveness. LfTTS voices are optimized for longer audio content, such as audiobooks or articles, and keep voice quality consistent over extended durations.

Developers primarily use AWS Polly to create voice-enabled user interfaces, generate audio versions of digital content, and build accessibility features. For instance, it can power interactive voice response (IVR) systems, narrate e-learning materials, or convert news articles into audio podcasts. The service integrates with other AWS offerings, such as Amazon S3 for storing generated audio files and AWS Lambda for serverless processing, enabling scalable architectures for audio generation. The API supports synchronous, real-time synthesis for short texts and asynchronous synthesis tasks, which write their output to Amazon S3, for longer texts.
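The choice between real-time and asynchronous synthesis can be sketched as below. This is a minimal illustration, not a definitive implementation: the helper name synthesize, the bucket argument, and the 3,000-character threshold are assumptions (check the current Polly quotas), while synthesize_speech and start_speech_synthesis_task are the Boto3 Polly client methods. The client is passed in so the function can be exercised without AWS access.

```python
# Assumed threshold for a single synchronous request; verify against current Polly quotas.
SYNC_CHAR_LIMIT = 3000


def synthesize(polly_client, text, bucket, voice_id="Joanna", engine="neural"):
    """Return ("sync", audio_bytes) for short text or ("async", task_id) for long text."""
    if len(text) <= SYNC_CHAR_LIMIT:
        # Short text: the audio comes back directly in the response stream.
        response = polly_client.synthesize_speech(
            Text=text, OutputFormat="mp3", VoiceId=voice_id, Engine=engine
        )
        return "sync", response["AudioStream"].read()

    # Long text: Polly runs a background task and writes the audio to the S3 bucket.
    response = polly_client.start_speech_synthesis_task(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
        Engine=engine,
        OutputS3BucketName=bucket,
    )
    return "async", response["SynthesisTask"]["TaskId"]
```

In real use, polly_client would be boto3.client("polly"), and the returned task ID can be polled with get_speech_synthesis_task until the task completes.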

AWS Polly operates on a pay-as-you-go model, where costs are based on the number of characters processed. A free tier is available for new users, offering a monthly allowance of characters for both standard and neural voices for the initial 12 months of use. The service also provides Speech Marks, which are metadata tags that indicate when specific words are spoken, as well as information about sentence, viseme, and SSML events. This functionality assists developers in synchronizing animations or highlighting text during speech playback, which is particularly useful for applications requiring precise audio-visual coordination.
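As a rough sketch of how Speech Marks can be consumed: when OutputFormat is "json" and SpeechMarkTypes is set, the response stream contains one JSON object per line, each with a time offset in milliseconds, a type, and a value. The helper names below (parse_speech_marks, word_timings, fetch_speech_marks) are hypothetical, and the client is passed in as a parameter so the parsing logic can be tried without AWS access.

```python
import json


def parse_speech_marks(raw):
    """Parse newline-delimited JSON speech marks (bytes) into a list of dicts."""
    return [json.loads(line) for line in raw.decode("utf-8").splitlines() if line.strip()]


def word_timings(marks):
    """Return (time_in_ms, word) pairs for word-type marks, e.g. to drive text highlighting."""
    return [(mark["time"], mark["value"]) for mark in marks if mark["type"] == "word"]


def fetch_speech_marks(polly_client, text, voice_id="Joanna"):
    """Request sentence and word marks instead of audio and parse the result."""
    response = polly_client.synthesize_speech(
        Text=text,
        VoiceId=voice_id,
        OutputFormat="json",  # speech marks are returned as JSON lines, not audio
        SpeechMarkTypes=["sentence", "word"],
    )
    return parse_speech_marks(response["AudioStream"].read())
```

A player could then schedule a highlight for each word at the returned offset while the separately synthesized audio plays.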

While AWS Polly provides a comprehensive set of features for text-to-speech, developers should consider the complexity of integrating it within the broader AWS ecosystem. Setting up Identity and Access Management (IAM) roles and policies requires some familiarity with AWS security best practices. For applications that need highly customized voices or emotional nuances beyond what the standard and neural voices offer, alternative services or custom voice models may be a better fit. For instance, Google Cloud Text-to-Speech also offers a wide array of voices and customization options, including custom voice models trained on recorded audio data (see the Google Cloud Text-to-Speech custom voice documentation).

Key features

  • Neural Text-to-Speech (NTTS): Employs deep learning to produce more natural and human-like speech, with improved prosody and articulation, suitable for conversational interfaces and dynamic content (see the AWS Polly features overview).
  • Standard Text-to-Speech: Converts text into speech using a diverse set of voices and languages, providing a foundational TTS capability for various applications (see the AWS Polly documentation).
  • Long-form Text-to-Speech (LfTTS): Optimized for generating extended audio content such as audiobooks, articles, or podcasts, ensuring consistent voice quality over long durations (see AWS Polly Long-form TTS).
  • Speech Synthesis Markup Language (SSML) support: Allows developers to control various aspects of speech, including pronunciation, volume, pitch, and speaking rate, using XML-based markup (see the AWS Polly SSML guide).
  • Speech Marks: Provides metadata about the timing of speech events, such as when specific words are spoken, sentence breaks, visemes (mouth positions), and SSML tags. This supports synchronizing speech with animations or text highlighting (see AWS Polly Speech Marks).
  • Lexicons: Enables customization of pronunciation for specific words, acronyms, or proper nouns by uploading pronunciation lexicons in PLS (Pronunciation Lexicon Specification) format (see AWS Polly custom lexicons).
  • Broad language and voice support: Offers a selection of voices across multiple languages, catering to global application requirements (see AWS Polly supported languages).
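To make the Lexicons feature above more concrete, here is a minimal sketch that builds a small PLS document mapping written forms to spoken aliases and uploads it. The helper names build_lexicon and upload_lexicon and the sample entries are assumptions for illustration; put_lexicon is the Boto3 Polly client method, and the stored lexicon can later be referenced through the LexiconNames parameter of synthesize_speech.

```python
from xml.sax.saxutils import escape, quoteattr

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(aliases, lang="en-US"):
    """Build a PLS lexicon string from a {grapheme: alias} mapping."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<lexicon version="1.0" xmlns="{PLS_NS}" alphabet="ipa" xml:lang={quoteattr(lang)}>',
    ]
    for grapheme, alias in aliases.items():
        lines.append("  <lexeme>")
        lines.append(f"    <grapheme>{escape(grapheme)}</grapheme>")  # written form
        lines.append(f"    <alias>{escape(alias)}</alias>")  # what should be spoken
        lines.append("  </lexeme>")
    lines.append("</lexicon>")
    return "\n".join(lines)


def upload_lexicon(polly_client, name, aliases, lang="en-US"):
    """Store the lexicon under a short alphanumeric name in the client's region."""
    polly_client.put_lexicon(Name=name, Content=build_lexicon(aliases, lang))
```

For example, upload_lexicon(boto3.client("polly"), "acronyms", {"W3C": "World Wide Web Consortium"}) would register the lexicon, after which a synthesis call could pass LexiconNames=["acronyms"].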

Pricing

AWS Polly operates on a pay-as-you-go model, with costs determined by the number of characters processed. Different rates apply for Neural Text-to-Speech (NTTS), Long-form Text-to-Speech (LfTTS), and Standard voices. A free tier is available for the first 12 months, offering a specified allocation of characters for both NTTS and Standard voices.

AWS Polly Pricing (as of 2026-05-08)

Voice Type                       | Rate per 1 Million Characters | Free Tier (first 12 months)
Neural Text-to-Speech (NTTS)     | $4.00                         | 5 million characters per month
Long-form Text-to-Speech (LfTTS) | $16.00                        | N/A (separate LfTTS free tier may apply, check pricing page)
Standard Voices                  | $4.00                         | 5 million characters per month

For the most current and detailed pricing information, including regional variations and specific long-form voice tiers, refer to the official AWS Polly pricing page.
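The per-character billing described in this section can be turned into a small estimator. This is only a sketch: the function name estimate_monthly_cost is hypothetical, and the rate and free-tier figures in the usage note are copied from the table in this section rather than verified against the pricing page.

```python
def estimate_monthly_cost(characters, rate_per_million, free_characters=0):
    """Estimate a monthly bill in USD for a given number of synthesized characters.

    Characters covered by the free allowance are not billed; the remainder is
    charged at rate_per_million dollars per 1,000,000 characters.
    """
    billable = max(characters - free_characters, 0)
    return round(billable / 1_000_000 * rate_per_million, 2)
```

Using the Standard voice figures from the table ($4.00 per million, 5 million free characters), estimate_monthly_cost(8_000_000, 4.00, 5_000_000) bills 3 million characters and returns 12.0.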

Common integrations

  • AWS SDK for Python (Boto3): Allows Python applications to interact with Polly for speech synthesis, including specifying voices, SSML, and output formats. See the AWS Polly Python SDK documentation.
  • AWS SDK for Java: Provides Java developers with APIs to integrate Polly for converting text to speech within Java-based applications. See the AWS Polly Java SDK documentation.
  • AWS SDK for JavaScript: Enables front-end and back-end JavaScript applications to synthesize speech using Polly, often used in web and Node.js environments. See the AWS Polly JavaScript SDK documentation.
  • AWS SDK for .NET: Supports .NET applications in C# or F# for utilizing Polly's text-to-speech capabilities. See the AWS Polly .NET SDK documentation.
  • AWS SDK for Go: Facilitates integration of Polly into Go applications, enabling developers to incorporate speech synthesis into Go services. See the AWS SDK for Go Polly API reference.
  • AWS Lambda: Polly can be integrated with Lambda functions to trigger speech synthesis in response to events, such as new text files being uploaded to S3. See AWS Lambda with Polly.
  • Amazon S3: Generated audio files from Polly can be directly stored in S3 buckets, providing scalable storage for audio content. See AWS Polly output formats and S3 storage.
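The Lambda and S3 integration listed above could look roughly like the following sketch of a handler that reacts to a text file uploaded to S3 and writes an MP3 back to the same bucket. The helper name handle_s3_upload and the "audio/" output prefix are assumptions for illustration; the S3 and Polly clients are injected so the logic can be tested without AWS, and boto3 is only imported inside the Lambda entry point.

```python
from urllib.parse import unquote_plus


def handle_s3_upload(event, s3_client, polly_client, voice_id="Joanna"):
    """Read the uploaded text object, synthesize it, and store the MP3 in S3."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])  # object keys arrive URL-encoded

    text = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    audio = polly_client.synthesize_speech(
        Text=text, OutputFormat="mp3", VoiceId=voice_id
    )["AudioStream"].read()

    # Hypothetical naming scheme: notes/intro.txt becomes audio/notes/intro.mp3
    audio_key = f"audio/{key.rsplit('.', 1)[0]}.mp3"
    s3_client.put_object(Bucket=bucket, Key=audio_key, Body=audio, ContentType="audio/mpeg")
    return audio_key


def lambda_handler(event, context):
    """Lambda entry point; creates real clients and delegates to the helper."""
    import boto3  # imported lazily so the helper above stays testable offline

    return handle_s3_upload(event, boto3.client("s3"), boto3.client("polly"))
```

Two cautions apply to this sketch: the S3 trigger should be limited to text uploads (for example by a suffix filter) so that the generated MP3 does not re-invoke the function, and text too long for a single synchronous request would need the asynchronous task API instead.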

Alternatives

  • Google Cloud Text-to-Speech: Offers a wide range of voices, languages, and customization options, including WaveNet voices and custom voice models.
  • Microsoft Azure AI Speech: Provides highly natural-sounding speech with various voice styles, emotions, and custom neural voice capabilities.
  • ElevenLabs: Specializes in highly expressive and realistic AI voices, with advanced voice cloning and generative AI features for speech.

Getting started

The following Python example demonstrates how to use the AWS SDK for Python (Boto3) to synthesize speech from text using AWS Polly. This code converts a given text into an MP3 audio file. Ensure you have the Boto3 library installed (pip install boto3) and your AWS credentials configured.


import boto3
from botocore.exceptions import ClientError

def synthesize_speech(text, output_filename="speech.mp3", voice_id="Joanna",
                      output_format="mp3", engine="neural", text_type="text"):
    """
    Synthesizes speech from text using AWS Polly and saves it to a file.

    :param text: The plain text or SSML document to convert to speech.
    :param output_filename: The name of the output audio file.
    :param voice_id: The ID of the voice to use (e.g., 'Joanna', 'Matthew', 'Amy').
    :param output_format: The format of the output audio file (e.g., 'mp3', 'ogg_vorbis', 'pcm').
    :param engine: The synthesis engine (e.g., 'standard' or 'neural'); the voice must support it.
    :param text_type: 'text' for plain text, or 'ssml' when text is an SSML document.
    """
    polly_client = boto3.client('polly', region_name='us-east-1') # Replace with your desired region

    try:
        response = polly_client.synthesize_speech(
            Text=text,
            TextType=text_type, # Without 'ssml' here, SSML markup is treated as plain text
            OutputFormat=output_format,
            VoiceId=voice_id,
            Engine=engine
        )

        if "AudioStream" in response:
            # AudioStream is a streaming body: read it once and write the bytes to disk
            with open(output_filename, "wb") as file:
                file.write(response['AudioStream'].read())
            print(f"Speech synthesized successfully to {output_filename}")
        else:
            print("Could not find AudioStream in response.")

    except ClientError as e:
        print(f"Error synthesizing speech: {e}")

if __name__ == "__main__":
    sample_text = "Hello, this is AWS Polly. I can convert your text into lifelike speech."
    synthesize_speech(sample_text)

    # Example with a different voice and SSML. The <emphasis> tag is not supported
    # by neural voices, so this call uses the standard engine.
    ssml_text = "<speak>This is an <emphasis level='strong'>important</emphasis> message. <break time='1s'/> Thank you.</speak>"
    synthesize_speech(ssml_text, output_filename="ssml_speech.mp3", voice_id="Matthew",
                      output_format="mp3", engine="standard", text_type="ssml")

This script initializes a Polly client, calls the synthesize_speech method with the desired text, voice, and output format, and then writes the resulting audio stream to a file. The example also demonstrates using SSML to add emphasis and pauses to the speech. For more details on available voices and advanced SSML features, refer to the AWS Polly API Reference.
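Because not every voice supports every engine, it can help to look up valid voice IDs before synthesizing. The following is a small sketch using the Boto3 describe_voices call; the helper name voices_for_engine is hypothetical, and the client is passed in as a parameter so the pagination logic can be checked without AWS access.

```python
def voices_for_engine(polly_client, engine="neural", language_code="en-US"):
    """Return the sorted voice IDs that support the given engine and language."""
    voice_ids = []
    kwargs = {"Engine": engine, "LanguageCode": language_code}
    while True:
        page = polly_client.describe_voices(**kwargs)
        voice_ids.extend(voice["Id"] for voice in page["Voices"])
        token = page.get("NextToken")
        if not token:
            return sorted(voice_ids)
        kwargs["NextToken"] = token  # request the next page of results
```

With a real client, voices_for_engine(boto3.client("polly"), engine="standard") would list the IDs that are safe to pass as voice_id together with the standard engine.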