Oracle Corporation

03/11/2024 | Press release | Distributed by Public on 03/11/2024 10:32

OCI Speech supports the Whisper model

The Oracle Cloud Infrastructure (OCI) Speech service now supports the Whisper model from OpenAI. Trained on a large corpus of multilingual data, Whisper is a speech-to-text model that supports file-based transcription for over 50 languages. It uses the same service endpoints, API, and software development kit (SDK) interfaces as the OCI Speech model, giving you flexibility and compatibility when you move between models. The Whisper model in OCI Speech also supports speaker diarization, a feature that distinguishes and labels different voices within an audio stream, allowing for precise speaker separation in the transcription.

The Whisper model comes in five sizes: tiny, base, small, medium, and large-V2. For the best cost-performance trade-off, the medium Whisper model is available in all OC1 regions from both the Oracle Cloud Console and the SDK.

Figure 1: Create an OCI Speech job using the Whisper model

The large-V2 model is supported when submitting a service request in the Ashburn and Phoenix regions. We plan to make more regions and models available in the future, based on customer feedback.
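
The availability described above can be captured in a small lookup. The following is a minimal Python sketch, not an OCI API: it assumes the region identifiers `us-ashburn-1` and `us-phoenix-1` for Ashburn and Phoenix, and the model labels and the function name `whisper_models_for_region` are hypothetical names chosen for illustration.

```python
from typing import List

# Illustrative labels only; these are not confirmed OCI API values.
MEDIUM = "whisper-medium"
LARGE_V2 = "whisper-large-v2"

# Assumed region identifiers for Ashburn and Phoenix.
LARGE_V2_REGIONS = frozenset({"us-ashburn-1", "us-phoenix-1"})


def whisper_models_for_region(region: str) -> List[str]:
    """Return the Whisper model sizes offered in an OC1 region, per the announcement."""
    models = [MEDIUM]  # the medium model is available in all OC1 regions
    if region.lower() in LARGE_V2_REGIONS:
        models.append(LARGE_V2)  # large-V2 is limited to Ashburn and Phoenix
    return models
```

A client could call this before submitting a job and fall back to the medium model when large-V2 is not offered in the target region.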

Key features and benefits

The Whisper model in OCI Speech offers the following features and benefits:

  • Multilingual support: Broaden your audience reach with Whisper's multilingual voice-to-text transcription for over 50 languages, including Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Māori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.
  • Diarization for speaker labeling: Diarization labels the speakers in an audio recording so that multiple speakers can be identified distinctly. You can either specify the number of speakers (2-16) when submitting the transcription job or let OCI Speech automatically detect the number of speakers.
  • Same API and SDK interface as the native OCI Speech model: The Whisper model uses the same API and SDK interface as the native OCI Speech model, which ensures a smooth transition between models within OCI Speech. See the following table for a comparison of the native OCI Speech model and the Whisper model.

| Feature | OCI Speech model | The Whisper model in OCI Speech |
|---|---|---|
| Real time transcriptions | Supported | Not supported |
| Large file size | Up to 2GB | Up to 2GB |
| Word level timestamp | Supported | Supported |
| File format | AAC, AC3, AMR, AU, FLAC, M4A, MKV, MP3, MP4, OGA, OGG, WAV, WEBM | AAC, AC3, AMR, AU, FLAC, M4A, MKV, MP3, MP4, OGA, OGG, WAV, WEBM |
| Multilingual support | EN, ES, FR, DE, PT, HI, IT | Same as Oracle ASR model plus 50 other languages |
| Diarization | Supported | Supported |
| English translation | Not supported | Coming soon |

Table 1: Compare native OCI Speech model and the Whisper model in OCI Speech
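
The limits in Table 1 and the diarization options described earlier can be checked on the client side before a job is submitted. The following Python sketch is an illustration under stated assumptions rather than the OCI SDK: the function names and the dictionary keys (`isDiarizationEnabled`, `numberOfSpeakers`) are hypothetical, and 2GB is interpreted here as 2 * 1024**3 bytes.

```python
from pathlib import PurePosixPath
from typing import Optional

# File formats listed in Table 1 (identical for both models).
SUPPORTED_FORMATS = frozenset({
    "aac", "ac3", "amr", "au", "flac", "m4a", "mkv",
    "mp3", "mp4", "oga", "ogg", "wav", "webm",
})
# Assumption: "2GB" is read as 2 * 1024**3 bytes.
MAX_FILE_BYTES = 2 * 1024**3


def check_audio_file(name: str, size_bytes: int) -> None:
    """Raise ValueError if the file falls outside the limits in Table 1."""
    extension = PurePosixPath(name).suffix.lstrip(".").lower()
    if extension not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported file format: {extension or 'none'}")
    if size_bytes > MAX_FILE_BYTES:
        raise ValueError("file exceeds the 2GB limit")


def build_diarization_settings(number_of_speakers: Optional[int] = None) -> dict:
    """Build hypothetical diarization settings.

    Passing None omits the speaker count, leaving detection to the service.
    """
    if number_of_speakers is not None and not 2 <= number_of_speakers <= 16:
        raise ValueError("number_of_speakers must be between 2 and 16")
    settings = {"isDiarizationEnabled": True}  # hypothetical key name
    if number_of_speakers is not None:
        settings["numberOfSpeakers"] = number_of_speakers  # hypothetical key name
    return settings
```

For example, `check_audio_file("meeting.wav", 10_000_000)` passes silently, while `build_diarization_settings(17)` raises a `ValueError` because the announcement limits the speaker count to 2-16.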

Want to know more?

The OCI Speech service team is committed to empowering you with tools that redefine possibilities, and we look forward to you benefiting from the Whisper model's newly introduced multilingual support and diarization capabilities. Contact your Oracle representative to discuss how OCI Speech with diarization can help you unlock the value of your multimedia data and gain the insight you need to bring your business to the next level.

If you're new to Oracle Cloud Infrastructure, try Oracle Cloud Free Trial, a free 30-day trial with US$300 in credits.

For more information, see the following resources: