How Does It Work?

Temi is the fastest and easiest way to convert audio to text. Upload a file, and we transcribe it and email you a transcript in minutes.

Amazon Transcribe makes it easy for developers to add speech-to-text capabilities to their applications. Audio data is virtually impossible for computers to search and analyze, so recorded speech needs to be converted to text before it can be used in applications.

Powerful transcription that's ready for work. If you’re a mobile professional, or anyone on the go, who relies on a digital voice recorder or smartphone to capture notes and memos, use Dragon’s robust transcription features to turn your recordings into text quickly, easily and accurately.

What is Watson Speech to Text? IBM Watson Speech to Text technology enables fast and accurate speech transcription in multiple languages for a variety of use cases, including but not limited to customer self-service, agent assistance and speech analytics.

KPMG streamlines call transcription. KPMG uses Speech to Text to transcribe and catalog thousands of hours of calls, reducing compliance costs for its clients by as much as 80 percent.

Upload Files

Start by uploading your files into our automated transcription software. You can upload files from Dropbox and YouTube or just upload any file from your computer.

SpeechPal Magic

In no time, your file will be magically transcribed. We use the most advanced technology in the industry to provide exceptional results at the lowest prices.

Edit, Share & Download

Once your file is completed, you can edit and review the transcription directly from your SpeechPal portal. When you are finished, you can share or download your completed work.

Simplicity is key

Just drop your audio and video files into our transcription software, sit back, and relax. In no time at all, you will have a high-quality transcription.

Get started

Smarter Automation

When we say our automated transcription software is the most affordable and easiest to use, WE MEAN IT! In no time at all, no matter the length of the audio, you will have a remarkably high-quality transcription ready for use.

Intuitive Editor

The superhuman SpeechPal editor allows endless possibilities when proofing your transcription. Follow along as your audio plays in the editor, adjust the play speed, fix mistakes and much more.

What kind of audio works best?

Speech-to-text technology is really smart, yet it can’t produce perfect transcripts. Our automated transcription software works with just about any file type, but audio with the following qualities produces the highest-quality transcription.

Distinct Speakers

Clear Audio

Low Ambient Noise

Experience you can count on!

SpeechPal takes transcription to the next level. We’ve revolutionized the speech to text industry and we continue to break barriers by providing exceptional quality transcriptions at affordable prices.

Need an enterprise solution?

We have solutions for your entire team.

Contact Us

Looking for human transcription?

Try our sister company,

Take me there

Conversation Transcription is a speech-to-text solution that combines speech recognition, speaker identification, and sentence attribution to each speaker (also known as diarization) to provide real-time and/or asynchronous transcription of any conversation. Conversation Transcription distinguishes speakers in a conversation to determine who said what and when, and makes it easy for developers to add speech-to-text with multi-speaker diarization to their applications.

Key features

  • Timestamps - each speaker utterance has a timestamp, so that you can easily find when a phrase was said.
  • Readable transcripts - transcripts have formatting and punctuation added automatically to ensure the text closely matches what was being said.
  • User profiles - user profiles are generated by collecting user voice samples and sending them to signature generation.
  • Speaker identification - speakers are identified using user profiles and a speaker identifier is assigned to each.
  • Multi-speaker diarization - determine who said what by synthesizing the audio stream with each speaker identifier.
  • Real-time transcription - provide live transcripts of who is saying what and when while the conversation is happening.
  • Asynchronous transcription - provide transcripts with higher accuracy by using a multichannel audio stream.


Although Conversation Transcription does not put a limit on the number of speakers in the room, it is optimized for 2-10 speakers per session.
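
As a rough illustration of the timestamp and speaker-attribution features listed above, the sketch below models a transcript as a list of utterances and looks up when a phrase was said. The `Utterance` class, its field names, and `find_phrase` are hypothetical and chosen for this example only; they are not the service's response format or API.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """One speaker turn. Hypothetical shape, not the service's output format."""
    speaker_id: str        # identifier assigned to the speaker
    start_seconds: float   # offset from the start of the conversation
    text: str              # formatted, punctuated transcript text


def find_phrase(utterances, phrase):
    """Return (speaker_id, start_seconds) for each utterance containing phrase."""
    needle = phrase.lower()
    return [
        (u.speaker_id, u.start_seconds)
        for u in utterances
        if needle in u.text.lower()
    ]


# Made-up sample data for illustration.
transcript = [
    Utterance("Speaker1", 0.0, "Let's review the budget."),
    Utterance("Speaker2", 4.2, "The budget looks fine to me."),
    Utterance("Speaker1", 9.8, "Great, next topic."),
]

hits = find_phrase(transcript, "budget")
```

Here `hits` holds one entry per matching utterance, so a caller can jump straight to the moment each speaker mentioned the phrase.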

Get started

See the real-time conversation transcription quickstart to get started.

Use cases

Inclusive meetings

To make meetings inclusive for everyone, including participants who are deaf or hard of hearing, it is important to have transcription in real time. Conversation Transcription in real-time mode takes meeting audio and determines who is saying what, allowing all meeting participants to follow the transcript and participate in the meeting without a delay.

Improved efficiency

Meeting participants can focus on the meeting and leave note-taking to Conversation Transcription. Participants can actively engage in the meeting and quickly follow up on next steps, using the transcript instead of taking notes and potentially missing something during the meeting.

How it works

This is a high-level overview of how Conversation Transcription works.

Expected inputs

  • Multi-channel audio stream – For specification and design details, see Microsoft Speech Device SDK Microphone. To learn more or purchase a development kit, see Get Microsoft Speech Device SDK.
  • User voice samples – To identify speakers by name, Conversation Transcription needs user profiles in advance of the conversation. Collect audio recordings from each user, then send the recordings to the Signature Generation Service to validate the audio and generate user profiles.


User voice samples are optional. Without this input, the transcription still distinguishes between speakers, but labels them 'Speaker1', 'Speaker2', and so on instead of recognizing them as specific pre-enrolled speakers by name.
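
The fallback described above can be pictured with a small sketch: speakers with an enrolled profile get their name, and everyone else gets a generic label. The function `display_names`, its parameters, and the sample ids are assumptions made for illustration; the service's own labeling logic is not documented here.

```python
def display_names(speaker_ids, profiles=None):
    """Map speaker ids to display names.

    speaker_ids: ids in order of appearance (hypothetical input).
    profiles: optional dict of speaker id -> enrolled name.
    Speakers without a profile are labeled Speaker1, Speaker2, ...
    """
    profiles = profiles or {}
    names = {}
    generic_count = 0
    for sid in speaker_ids:
        if sid in names:
            continue  # already labeled on first appearance
        if sid in profiles:
            names[sid] = profiles[sid]
        else:
            generic_count += 1
            names[sid] = f"Speaker{generic_count}"
    return names


# No voice samples were enrolled: generic labels only.
anonymous = display_names(["a", "b", "a"])

# One enrolled profile: that speaker is shown by name.
mixed = display_names(["a", "b"], {"b": "Dana"})
```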

Real-time vs. asynchronous

Conversation Transcription offers three transcription modes:


Real-time

Audio data is processed live to return speaker identifier and transcript. Select this mode if your transcription solution needs to give conversation participants a live transcript view of their ongoing conversation. For example, building an application to make meetings more accessible to deaf and hard-of-hearing participants is an ideal use case for real-time transcription.


Asynchronous

Audio data is batch processed to return speaker identifier and transcript. Select this mode if your transcription solution needs higher accuracy and does not need a live transcript view. For example, if you want to build an application that lets meeting participants easily catch up on missed meetings, use the asynchronous transcription mode to get high-accuracy transcription results.

Real-time plus asynchronous

Audio data is processed live to return speaker identifier and transcript, and, in addition, a request is created to get a high-accuracy transcript through asynchronous processing. Select this mode if your application needs real-time transcription but also requires a higher-accuracy transcript for use after the conversation or meeting has occurred.
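
The choice between the three modes comes down to two questions: does the application need a live view, and does it need the higher-accuracy transcript? A minimal sketch of that decision follows. The enum `TranscriptionMode` and the function `choose_mode` are hypothetical names for this example, not part of any SDK.

```python
from enum import Enum


class TranscriptionMode(Enum):
    """The three modes described above (names are illustrative)."""
    REAL_TIME = "real-time"
    ASYNCHRONOUS = "asynchronous"
    REAL_TIME_PLUS_ASYNCHRONOUS = "real-time plus asynchronous"


def choose_mode(needs_live_view: bool, needs_high_accuracy: bool) -> TranscriptionMode:
    """Pick a mode from the two requirements discussed in the text."""
    if needs_live_view and needs_high_accuracy:
        return TranscriptionMode.REAL_TIME_PLUS_ASYNCHRONOUS
    if needs_live_view:
        return TranscriptionMode.REAL_TIME
    # No live view required: batch processing gives the higher-accuracy result.
    return TranscriptionMode.ASYNCHRONOUS


# An accessibility app that shows captions during the meeting:
captions_mode = choose_mode(needs_live_view=True, needs_high_accuracy=False)

# A catch-up app for people who missed the meeting:
catch_up_mode = choose_mode(needs_live_view=False, needs_high_accuracy=True)
```

Defaulting to asynchronous when no live view is needed is a design choice of this sketch, based on the text's statement that batch processing yields higher accuracy.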

Language support

Currently, Conversation Transcription supports all speech-to-text languages in the following regions: centralus, eastasia, eastus, westeurope. If you require additional locale support, contact the Conversation Transcription Feature Crew.
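
An application could guard against unsupported regions before starting a session. The sketch below hard-codes the four regions named above, which may be out of date, and the helper `is_supported_region` is a hypothetical name for this example.

```python
# Regions copied from the text above; treat this list as a snapshot.
SUPPORTED_REGIONS = frozenset({"centralus", "eastasia", "eastus", "westeurope"})


def is_supported_region(region: str) -> bool:
    """Return True if the region name is in the list given in the text."""
    return region.strip().lower() in SUPPORTED_REGIONS


ok = is_supported_region("WestEurope")
not_ok = is_supported_region("southindia")
```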

Next steps
