In a previous blogpost, I showed you how to leverage the Whisper model on Azure OpenAI Services to transcribe an audio file to text.

We can easily take this a step further: instead of just returning the speech as text, why not return it translated into another language? Interestingly, the Whisper model and the Azure OpenAI APIs do support translating a given input to English. Translating the input to any other language is not supported by the model. But for that we can simply use a GPT model: after transcribing the audio file with Whisper, we can use the generated output as the input of a GPT prompt and ask it to translate the text into another language.

Using Whisper to translate speech to English

As mentioned, translating a non-English audio file to written English is supported by the model (and the API) out of the box. We can use the GetAudioTranslationAsync method on the Azure.AI.OpenAI.OpenAIClient class (Azure.AI.OpenAI NuGet package) to get the English translation of any given audio file:

private async Task<string> TranslateToEnglish(BinaryData data)
{
    var _client = new OpenAIClient(new Uri("[Your Azure OAI endpoint]"),
        new AzureKeyCredential("[Your key]"));

    var response = await _client.GetAudioTranslationAsync(
        new AudioTranslationOptions("whisper", data));

    return response.Value.Text;
}

We create an instance of an OpenAIClient, passing in our Azure OpenAI endpoint and a key. We can then call the GetAudioTranslationAsync method, passing in the binary data of the audio file we want to translate. The “whisper” parameter refers to the name we’ve given to the Whisper model we’ve deployed on Azure. For more info about deploying an Azure OpenAI Service and deploying a Whisper model through the Azure OpenAI Studio, have a look at my previous blogpost.

In response.Value.Text we can find the English translation of the given audio file in plain text. Other response formats are available as well (Srt, Vtt, ...). These can be requested by setting the ResponseFormat property on the AudioTranslationOptions that we pass in to the GetAudioTranslationAsync method. For now, the default (Simple) is what we need.
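If you would rather get subtitles back, for example, you could set that property and request the Srt format. Here’s a minimal sketch of what that could look like, assuming the AudioTranslationFormat values exposed by the Azure.AI.OpenAI package (TranslateToEnglishAsSrt is just a hypothetical helper name):

private async Task<string> TranslateToEnglishAsSrt(BinaryData data)
{
    // Same call as before, but asking for SRT subtitles instead of plain text.
    // Assumes AudioTranslationFormat.Srt; Vtt works the same way.
    var options = new AudioTranslationOptions("whisper", data)
    {
        ResponseFormat = AudioTranslationFormat.Srt
    };

    var response = await _client.GetAudioTranslationAsync(options);

    return response.Value.Text;
}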

Using Whisper and GPT to translate speech to other languages

If we want to translate an audio file to a language other than English, we can do this in two steps: transcribe the audio file using Whisper, and then ask GPT to translate that transcription.

Let’s take a look at the first step, transcribing an audio file:

private async Task<string> Transcribe(BinaryData data)
{
    var response = await _client.GetAudioTranscriptionAsync(
        new AudioTranscriptionOptions("whisper", data));

    return response.Value.Text;
}

This looks a lot like the previous method, but instead of calling GetAudioTranslationAsync, we call GetAudioTranscriptionAsync, which transcribes the given audio data in the input language.
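As a side note: if you already know which language is being spoken, you can hint it to Whisper, which typically improves transcription quality. Here’s a small sketch, assuming the Language property on AudioTranscriptionOptions, which takes an ISO-639-1 code (TranscribeDutch is just a hypothetical variant of the method above):

private async Task<string> TranscribeDutch(BinaryData data)
{
    // Hint the spoken language to Whisper (assumes Language takes an ISO-639-1 code).
    var options = new AudioTranscriptionOptions("whisper", data)
    {
        Language = "nl"
    };

    var response = await _client.GetAudioTranscriptionAsync(options);

    return response.Value.Text;
}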

Now we can take this text and ask a GPT model (GPT 3.5 Turbo, for example) to translate it. Let’s start by deploying a GPT 3.5 Turbo model in Azure OpenAI Studio. Under Management – Deployments we can select Create new deployment, pick the GPT 3.5 Turbo model and give the deployment a name; the code below assumes the deployment is called gpt-35-turbo.

Once this model is in place, we can use it through the Azure.AI.OpenAI SDK. Here’s how we can do this:

private async Task<string> Translate(string content, string to)
{
    var chatResult = await _client.GetChatCompletionsAsync(
        new ChatCompletionsOptions
        {
            DeploymentName = "gpt-35-turbo",
            Temperature = 0,
            Messages =
            {
                new(ChatRole.System, $"You are a translator, translate the text entered by the user to {to}."),
                new ChatMessage(ChatRole.User, content),
            }
        });

    var answer = chatResult.Value.Choices[0].Message.Content;
    return answer;
}

Through the GetChatCompletionsAsync method, we can send messages to our GPT model. The DeploymentName property refers to the name we’ve defined when deploying the GPT 3.5 Turbo model. The Temperature is set to 0, which controls the creativity of the model: 0 means we want the result to be more deterministic and focused. The Messages property is the collection of messages we’re sending off to the LLM. The first one is the System prompt. With such a prompt we can give the model instructions; in this case we’re telling the LLM to act as a translator and translate the given text to a certain language. The second message is the content we want to translate.

When the request finishes, the result (the translated content) can be found in chatResult.Value.Choices[0].Message.Content which we then return.

Here’s what this entire class (TranslationService) looks like:

using Azure;
using Azure.AI.OpenAI;

namespace WhisperTranslate.API;

public class TranslationService
{
    readonly OpenAIClient _client;

    public TranslationService()
    {
        _client = new OpenAIClient(new Uri("[Your Azure OAI endpoint]"), new AzureKeyCredential("[Your key]"));
    }

    public async Task<string> Translate(BinaryData audio, string to)
    {
        string translation;

        if (to.Equals("english", StringComparison.InvariantCultureIgnoreCase))
        {
            translation = await TranslateToEnglish(audio);
        }
        else
        {
            var transcription = await Transcribe(audio);
            translation = await Translate(transcription, to);
        }

        return translation;
    }


    private async Task<string> Transcribe(BinaryData data)
    {
        var response = await _client.GetAudioTranscriptionAsync(
            new AudioTranscriptionOptions("whisper", data));

        return response.Value.Text;
    }

    private async Task<string> TranslateToEnglish(BinaryData data)
    {
        var response = await _client.GetAudioTranslationAsync(
            new AudioTranslationOptions("whisper", data));

        return response.Value.Text;
    }

    private async Task<string> Translate(string content, string to)
    {
        var chatResult = await _client.GetChatCompletionsAsync(
            new ChatCompletionsOptions
            {
                DeploymentName = "gpt-35-turbo",
                Temperature = 0,
                Messages =
                {
                    new(ChatRole.System, $"You are a translator, translate the text entered by the user to {to}."),
                    new ChatMessage(ChatRole.User, content),
                }
            });

        var answer = chatResult.Value.Choices[0].Message.Content;
        return answer;
    }
}

And we can simply put this behind an API like this:

app.MapPost("/translate", async (IFormFile file, string to) =>
{
    var stream = new MemoryStream((int)file.Length);
    await file.CopyToAsync(stream);

    return await new TranslationService().Translate(BinaryData.FromBytes(stream.ToArray()), to);
})
.WithName("TranslateAudio")
.WithOpenApi();

With this in place, we have an API endpoint we can post an audio file to, and with the “to” parameter we can specify which language we want the audio file to be translated to.

Creating the mobile app

Now that we’ve got this API, we want to start sending audio files to it and get the translated text back. Recording audio in MAUI isn’t difficult, thanks to the Plugin.Maui.Audio NuGet package. More info about this package and how to use it to record audio can be found on GitHub.

Let’s define our UI:

<VerticalStackLayout VerticalOptions="Center">
    <Picker x:Name="Picker" HorizontalOptions="Center">
        <Picker.ItemsSource>
            <x:Array Type="{x:Type x:String}">
                <x:String>English</x:String>
                <x:String>Dutch</x:String>
                <x:String>French</x:String>
                <x:String>Spanish</x:String>
            </x:Array>
        </Picker.ItemsSource>
    </Picker>

    <Button
        x:Name="btnStartRec"
        Clicked="Start_Clicked"
        HorizontalOptions="Center"
        Text="Start" />

    <Button
        x:Name="btnStopRec"
        Clicked="Stop_Clicked"
        HorizontalOptions="Center"
        IsVisible="false"
        Text="Stop" />

    <Label x:Name="lblStatus" HorizontalOptions="Center" />

    <Button
        x:Name="btnTranslate"
        Clicked="Translate_Clicked"
        HorizontalOptions="Center"
        IsVisible="False"
        Text="Translate"
        VerticalOptions="Center" />

    <Label x:Name="lblResult" HorizontalOptions="Center" />

</VerticalStackLayout>

There’s not much to it: a picker which the user can use to pick the language to translate to, a few buttons and labels.

After installing the Plugin.Maui.Audio NuGet Package, we can start writing code.
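The snippets below assume a few fields in the page’s code-behind: an IAudioManager to create recorders with, the IAudioRecorder that is currently recording, and a byte array holding the recorded audio. Here’s a minimal sketch of those fields, assuming a page called MainPage and the AudioManager.Current instance the plugin exposes (you could also register IAudioManager in MauiProgram and inject it instead):

using Plugin.Maui.Audio;

public partial class MainPage : ContentPage
{
    // Fields assumed by the handlers below.
    readonly IAudioManager audioManager = AudioManager.Current;
    IAudioRecorder audioRecorder;
    byte[] audioBytes;

    public MainPage()
    {
        InitializeComponent();
    }

    // ... button handlers from the snippets below go here
}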

When the Start button is clicked, we want to update our UI controls and call the StartRecording method, which checks for the required permissions, creates an audio recorder and calls the audio recorder’s StartAsync method:

private async void Start_Clicked(object sender, EventArgs e)
{
    audioBytes = null;
    btnStopRec.IsVisible = true;
    btnStartRec.IsVisible = false;
    lblStatus.Text = "Recording...";

    try
    {
        await StartRecording();
    }
    catch (PermissionException pe)
    {
        lblStatus.Text = pe.Message;
    }
}

private async Task StartRecording()
{
    if (await CheckPermissionIsGrantedAsync<Microphone>())
    {
        audioRecorder = audioManager.CreateRecorder();

        await audioRecorder.StartAsync();
    }
    else
    {
        throw new PermissionException("No permission granted to access Microphone");
    }
}

public static async Task<bool> CheckPermissionIsGrantedAsync<TPermission>() where TPermission : BasePermission, new()
{
    PermissionStatus status = await Permissions.CheckStatusAsync<TPermission>();

    if (status == PermissionStatus.Granted)
    {
        return true;
    }

    if (status == PermissionStatus.Denied && DeviceInfo.Platform == DevicePlatform.iOS)
    {
        // Prompt the user to turn on in settings
        // On iOS once a permission has been denied it may not be requested again from the application
        return false;
    }

    if (Permissions.ShouldShowRationale<TPermission>())
    {
        // Prompt the user with additional information as to why the permission is needed
    }

    status = await Permissions.RequestAsync<TPermission>();

    return status == PermissionStatus.Granted;
}


When clicking the Stop button, we want to stop recording the audio and convert the recording into a byte array, after which we update some UI controls:

private async void Stop_Clicked(object sender, EventArgs e)
{
    await StopRecording();

    lblStatus.Text = "Recorded";

    btnTranslate.IsEnabled = true;

    btnTranslate.IsVisible = true;
    btnStartRec.IsVisible = true;
    btnStopRec.IsVisible = false;
}

private async Task StopRecording()
{
    if (audioRecorder is not null)
    {
        var audioSource = await audioRecorder.StopAsync();

        var audioStream = audioSource.GetAudioStream();

        List<byte> totalStream = new();
        byte[] buffer = new byte[32];
        int read;

        while ((read = audioStream.Read(buffer, 0, buffer.Length)) > 0)
        {
            totalStream.AddRange(buffer.Take(read));
        }

        audioBytes = totalStream.ToArray();
    }
}

And, finally, when clicking the Translate button, we want to grab the bytes of the recorded audio, take the language selected by the user and call our API sending the bytes and the language parameter:

private async void Translate_Clicked(object sender, EventArgs e)
{
    if (audioBytes is not null)
    {
        lblStatus.Text = "Translating...";
        var result = await TranslateAudio(audioBytes, Picker.SelectedItem as string ?? Picker.Items.First());
        lblResult.Text = result;
        lblStatus.Text = "Translated";
    }
}

private async Task<string> TranslateAudio(byte[] bytes, string to)
{
    using (var httpClient = new HttpClient())
    {
        using (var form = new MultipartFormDataContent())
        {
            using (var fileContent = new ByteArrayContent(bytes))
            {
                fileContent.Headers.ContentType = MediaTypeHeaderValue.Parse("multipart/form-data");
                form.Add(fileContent, "file", "audio");
                HttpResponseMessage response = await httpClient.PostAsync($"https://localhost:7224/translate?to={to}", form);

                return await response.Content.ReadAsStringAsync();
            }
        }
    }
}

The TranslateAudio method returns the translated text, which we then put on the screen.

Run both your app and the API (make sure your app accesses your API on the correct port where it is running) and you should be able to record some spoken text and get it translated to the selected language.

 

There you have it: a very simple demo that shows how to record spoken text and get it transcribed and translated by leveraging Whisper and a GPT model. This code sample can be found on GitHub.