openai-whisper-talk is a sample
Text completion endpointなど、
This is a
proof of concept application to demonstrate
voice chat between a human person and an AI.
I am sure that anyone who have tinkered with the
OpenAI tools would have probably came up with the same idea, leveraging the different tools to create something interesting.
This is part of the series of projects I created as code exercise to study
Vue 3 and
Nuxt 3. There are still lots of things I need to learn.
The user interface is like a normal phone. A list of AI friends is shown. You select an AI friend that you want to talk to and you will be shown the talk page. When you are finished, press the close button to go back to Home page.
Currently, there are 4 AI friends with different identity that you can talk to.
- Alice, responds in Valspeak (Valley girl) English
- Daniel, responds in Shakespearean, archaic English
- Junko, responds in Kansai Japanese
- Mark, responds in English
You can add more or edit the existing items by updating
Audio data is recorded automatically if sound is detected.
There is a threshold setting to eliminate background noise from triggering the audio capture. By default it is set to
-45dB (0dB is the loudest sound). Adjust the variable
MIN_DECIBELS if you want to set it to lower or higher depending on your needs.
When recording is enabled and sound is not detected for
2 seconds, the audio data will be uploaded and sent to the backend for transcribing. You can adjust the time by editing
Please note that transcribing time depends on the length of the audio data so it is possible that a new reply comes first before the old reply. To go around this limitation, at first I was thinking of setting the audio capture part like in
walkie-talkie where you press a button to speak and wait for the reply after.
I suggest to run this app with your
earphone attached to the machine or device. This will prevent the app from hearing its own reply. Also, please do not talk for too long, perhaps just one sentence at a time and pause a bit to allow time for transcription.
voice to text part, using
Whisper, takes time so do not expect instant reply.
I am developing this in an old machine and transcribing a simple ‘Good morning‘ takes about 5 seconds or so.
But perhaps in newer machines, it will be much faster.
I have set the
tiny to adapt to my circumstance but if you find that your machine is faster, set it to other models for improved voice transcription.
One of the AI friend (
Junko) that you can talk to will respond in Japanese.
In this case it might be useful to set the
language to Japanese and
task to translate.
$ whisper audio.ogg --language English --task translate --model tiny --output_dir './public/upload'
By doing so, you can speak both in English and Japanese.
If the voice input will always be in English, it is not necessary to set it to the same language as the expected output reply. The
text completion endpoint will take care of that.
Please note that the voice data will be uploaded in
public/upload and the output files of
Whisper will be saved in the same directory. I am not sure if it is possible to disable this. So you might see exponential increase of the number of files here.
Anyway, there are other configuration options that you can use to make
whisper more efficient in translating/transcribing the voice data. Please check
By writing a descriptive instruction in the beginning of the prompt for
Text Completion, it is possible to create different identity to the response.
Also, to make the conversation seem more natural, the previous conversation is stored in the backend and attached to the prompt every time we made a request to the endpoint. This will make the AI remember previous details of the conversation and it is possible to infer to those details in your conversation.
However, doing so makes the prompt longer as the conversation takes time. Please note that
completion endpoint has a maximum limit of 2048 tokens (4096 tokens for newer models)(4 chars = 1 token). But you will not hit the maximum since I set the ceiling much lower for the app (1800 tokens). When you reach this point, I remove the oldest parts of the conversation so that the conversation can continue. By that time, the remove parts will probably not matter anymore since the conversion has already dragged on for so long.
Text to Speech
Using Web Speech API, the text output from the
completion endpoint will be spoken by a selected voice depending on the AI friend identity.
You can change the voice parameter by editing the
assets/contacts.json file. You can check the available voices and test settings using this page.
Currently, I am using the basic
fetch function to send audio data to the backend. In the future, I will probably replace this with
useLazyFetch which is supposedly non-blocking.
I also need to use
session variable to store the chat conversation in the backend. I have seen @sidebase/nuxt-session but have not tried it yet.
First, you need to install
Whisper and its
$ pip install git+https://github.com/openai/whisper.git
You also need
ffmpeg installed on your system
# macos $ brew install ffmpeg # windows using chocolatey $ choco install ffmpeg # windows using scoop $ scoop install ffmpeg
By this time, you can test
Whisper using command line
$ whisper myaudiofile.ogg --language English --task translate
You can find sample audio files from here.
If that is successful, you can proceed to install this app.
Clone the repository and install the dependencies
$ git clone https://github.com/supershaneski/openai-whisper-talk.git myproject $ cd myproject $ npm install
.env file in the root directory and supply your own OpenAI API Key
Then run the app
$ npm run dev
Open your browser to
http://localhost:5000/ (port number depends on availability) to load the application page.
You might want to run this app using
This is needed to enable audio capture using a separate device like a smartphone.
In order to do so, prepare the proper
key files and edit
server.mjs at the root directory.
Then buid the project
$ npm run build
Finally, run the app
$ node server.mjs
Now, open your browser to
https://localhost:3000/ (port number depends on availability) or use your local IP address to load the page.