# TP3: word of mouth between AIs

## Goals
- Use off-the-shelf pre-trained models
- Aggregate several models in a common pipeline


In this practical session, you will use off-the-shelf pre-trained models to implement a revisited version of the [Word of Mouth game](https://en.wikipedia.org/wiki/Word_of_mouth), played between AIs.
Traditionally, the word of mouth game is played between humans using two modalities: thoughts and speech. A sentence must be preserved as it travels along a chain of participants, each one whispering it into the ear of the next.
In this version, we will consider 4 deep neural networks, each one associated with a given task:
- From speech to text
- From text to image
- From image to text
- From text to speech

The goal is to have the output speech as close as possible to the input speech.
To this end, you will be asked to use off-the-shelf pre-trained models available on GitHub and Hugging Face, for inference only (no training required).

## 0 - Install dependencies and download data

If you are using Colab, check that your session includes a GPU!


Reminder: you can install packages in Colab with the following lines:
```
!sudo apt install package
!pip install package
```

In [1]:
!mkdir _cache
!wget -nc -P _cache/ https://people.irisa.fr/Denis.Coquenet/courses/content/M2-DLV/TP3/test_audio.mp3
!wget -nc -P _cache/ https://people.irisa.fr/Denis.Coquenet/courses/content/M2-DLV/TP3/astronaut_rides_horse.png

mkdir: cannot create directory ‘_cache’: File exists
File ‘_cache/test_audio.mp3’ already there; not retrieving.

File ‘_cache/astronaut_rides_horse.png’ already there; not retrieving.



## 1 - Audio playing

Here is some code to play sounds from a Python notebook.

**TO DO**: 
Test this function with the "test_audio.mp3" file

In [None]:
from IPython.display import Audio, display


def playsound(filepath):
    # Embed an audio player in the notebook output and start playback immediately
    display(Audio(filepath, autoplay=True))

## 2 - Speech-to-text: Whisper

This part is dedicated to the speech-to-text module. We will use the Whisper model, whose source code is available on [GitHub](https://github.com/openai/whisper).

**TO DO**: 
1) Follow the instructions of the GitHub repository to implement a function that takes an mp3 file path and a Whisper model as input and outputs the corresponding string. 
2) Test your function with the "test_audio.mp3" file.


In [None]:
def speech_to_text(model, audio_path):
    pass  # TO DO
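
As a starting point, here is a minimal sketch of what such a function could look like. It assumes the `openai-whisper` package is installed and that `model` is the object returned by `whisper.load_model`. The helper `load_whisper` and the `"base"` checkpoint size are illustrative choices, not requirements of this session.

```python
def load_whisper(name="base"):
    """Hypothetical helper: load a Whisper checkpoint (weights are downloaded on first use)."""
    import whisper  # imported lazily so that defining the functions does not require the package

    return whisper.load_model(name)


def speech_to_text(model, audio_path):
    """Transcribe an audio file and return the recognized text as a string."""
    # transcribe() returns a dict; the full transcription is stored under the "text" key
    result = model.transcribe(audio_path)
    return result["text"].strip()
```

Under these assumptions, `speech_to_text(load_whisper(), "_cache/test_audio.mp3")` would return the transcription of the sample file.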

## 3 - Text-to-Image: Stable Diffusion

This part is dedicated to the text-to-image module. We will use the Stable Diffusion model, which is available on [Hugging Face](https://huggingface.co/stabilityai/stable-diffusion-2-1).

**TO DO**: 
1) Follow the instructions of the model page to implement a function that takes a string and a Stable Diffusion model as input and outputs the generated image. 
2) Test your function with the prompt "a photo of an astronaut riding a horse on mars" *several times* and display the generated images. What do you notice? Was it expected?


In [None]:
def text_to_image(model, text):
    pass  # TO DO
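
One possible shape for this function is sketched below. It assumes the `diffusers` and `torch` packages are installed and that `model` is a `StableDiffusionPipeline`. The helper `load_stable_diffusion` and the optional `generator` argument are additions made for illustration; they are not part of the required signature.

```python
def load_stable_diffusion(model_id="stabilityai/stable-diffusion-2-1", device="cuda"):
    """Hypothetical helper: load the pipeline in half precision and move it to the GPU."""
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    return pipe.to(device)


def text_to_image(model, text, generator=None):
    """Generate one image from a text prompt.

    `generator` is an optional torch.Generator; passing a seeded one makes
    the sampling reproducible.
    """
    # The pipeline output exposes the generated PIL images in its `images` attribute
    return model(text, generator=generator).images[0]
```

Calling the function with `generator=torch.Generator("cuda").manual_seed(0)` fixes the random seed, which is a convenient way to compare runs when investigating the question above.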

## 4 - Image-to-text: BLIP

This part is dedicated to the image-to-text module. We will use the BLIP model, which is available on [Hugging Face](https://huggingface.co/Salesforce/blip-image-captioning-large).

**TO DO**: 
1) Follow the instructions of the model page to implement a function that takes an image and a BLIP model as input and outputs the generated caption. 
2) Test your function with the "astronaut_rides_horse.png" file.

In [None]:
def image_to_text(model, image):
    pass  # TO DO
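
The sketch below shows one way to write this function, assuming the `transformers` package is installed. BLIP needs both a processor and a network, so this sketch assumes that `model` is a `(processor, blip)` pair; this packaging and the helper `load_blip` are illustrative choices, not something imposed by the session.

```python
def load_blip(model_id="Salesforce/blip-image-captioning-large", device="cuda"):
    """Hypothetical helper: return the (processor, network) pair used by image_to_text."""
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_id)
    blip = BlipForConditionalGeneration.from_pretrained(model_id).to(device)
    return processor, blip


def image_to_text(model, image):
    """Generate a caption for a PIL image; `model` is assumed to be (processor, blip)."""
    processor, blip = model
    # Preprocess the image into tensors located on the same device as the network
    inputs = processor(image, return_tensors="pt").to(blip.device)
    # Generate the caption token ids, then decode them back to a string
    output_ids = blip.generate(**inputs)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

The image itself can be loaded with Pillow, for example `Image.open("_cache/astronaut_rides_horse.png").convert("RGB")`.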

## 5 - Text-to-Speech: SpeechT5

This part is dedicated to the text-to-speech module. We will use the SpeechT5 model, which is available on [Hugging Face](https://huggingface.co/microsoft/speecht5_tts).

**TO DO**: 
1) Follow the instructions of the model page to implement a function that takes a text, a SpeechT5 model and a file path as input and saves the generated speech at the given path. 
2) Test your function with the following text "a photo of an astronaut riding a horse on mars".

In [None]:
def text_to_speech(model, text, output_path):
    pass  # TO DO
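
A possible sketch follows, assuming `transformers`, `datasets` and `torch` are installed. SpeechT5 needs a processor, the network, a vocoder and a speaker embedding, so this sketch assumes `model` is a dict bundling the four (a hypothetical packaging built by the helper `load_speecht5`). The speaker-embedding dataset and index follow the example on the model page as best recalled and may need adapting to your library versions. To avoid an extra dependency, the audio is written as a 16-bit WAV file with the standard `wave` module rather than with `soundfile`.

```python
import array
import sys
import wave


def load_speecht5():
    """Hypothetical helper: bundle the objects needed by text_to_speech in a dict."""
    import torch
    from datasets import load_dataset
    from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

    xvectors = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
    return {
        "processor": SpeechT5Processor.from_pretrained("microsoft/speecht5_tts"),
        "tts": SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts"),
        "vocoder": SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan"),
        "speaker_embeddings": torch.tensor(xvectors[7306]["xvector"]).unsqueeze(0),
    }


def save_wav(samples, path, sample_rate=16000):
    """Write float samples in [-1, 1] as a mono 16-bit PCM WAV file."""
    pcm = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())


def text_to_speech(model, text, output_path):
    """Synthesize `text` and save it at `output_path` (WAV, 16 kHz)."""
    inputs = model["processor"](text=text, return_tensors="pt")
    speech = model["tts"].generate_speech(
        inputs["input_ids"], model["speaker_embeddings"], vocoder=model["vocoder"]
    )
    # generate_speech returns a 1-D waveform tensor; convert it to plain floats
    save_wav(speech.tolist(), output_path)
```

Since the file is written as WAV, a `.wav` extension for `output_path` keeps the file name consistent with its content.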

## 6 - All together now

The goal now is to merge the four models into a single pipeline, going from speech to text, to image, back to text, and finally back to speech.

**TO DO**: 
1) Implement a function that performs the four tasks in cascade (input: audio file path and the four models, output: speech). 
2) Run this function with the "test_audio.mp3" file. Is the message preserved? 
3) Apply this function 10 times in a row, feeding each output speech back as the next input. Is the message preserved? Display the intermediate outputs to see the evolution.
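
The cascade and its iteration can be sketched as follows. To keep the sketch independent of any particular model, it assumes each step is passed as a callable with its model already bound (for instance with `functools.partial(speech_to_text, whisper_model)`). The function names `word_of_mouth` and `iterate_word_of_mouth`, the returned dict and the output file pattern are all hypothetical choices.

```python
def word_of_mouth(audio_path, stt, tti, itt, tts, output_path):
    """Run one round: speech -> text -> image -> text -> speech.

    stt, tti and itt are one-argument callables; tts takes (text, output_path).
    Returns the intermediate outputs so that they can be displayed.
    """
    heard = stt(audio_path)      # speech-to-text
    image = tti(heard)           # text-to-image
    caption = itt(image)         # image-to-text
    tts(caption, output_path)    # text-to-speech, saved at output_path
    return {"heard": heard, "image": image, "caption": caption, "audio": output_path}


def iterate_word_of_mouth(audio_path, n_rounds, stt, tti, itt, tts,
                          output_pattern="_cache/round_{:02d}.wav"):
    """Chain several rounds, feeding each generated audio file into the next round."""
    history = []
    for i in range(n_rounds):
        out = word_of_mouth(audio_path, stt, tti, itt, tts, output_pattern.format(i))
        history.append(out)
        audio_path = out["audio"]  # the output speech becomes the next input
    return history
```

Keeping every round in `history` makes it easy to print the successive transcriptions and captions, or to display the images, and to see where the message starts to drift.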

## 7 - Try it yourself

Record your voice (in English) and test the pipeline on several examples. Here is some code to record an audio file from your computer's microphone. This may not work online; in that case, run it locally and upload the generated file to Colab to continue.

In [None]:
import sounddevice as sd
from scipy.io.wavfile import write


def record_voice(duration=5, output_filename="my_audio_file.wav"):
    freq = 44100  # sampling frequency in Hz

    # Start recording with the given duration and sampling frequency
    recording = sd.rec(int(duration * freq), samplerate=freq, channels=2)

    # Block until the recording is finished
    sd.wait()

    # Save the NumPy array as a WAV file (scipy writes WAV data, hence the .wav extension)
    write(output_filename, freq, recording)

## 8 - To go further

Find an additional step to add to the pipeline (for instance, text translation) and make it run.
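
As one illustration of the translation idea, the sketch below inserts a round trip through another language between two text steps. It assumes the `transformers` package is installed and that each translator is a `translation` pipeline; the helper names and the example checkpoint mentioned in the docstring are assumptions, and any other translation model could be substituted.

```python
def load_translator(model_id):
    """Hypothetical helper: build a translation pipeline,
    e.g. with model_id="Helsinki-NLP/opus-mt-en-fr" (an assumed example checkpoint)."""
    from transformers import pipeline

    return pipeline("translation", model=model_id)


def translate(translator, text):
    """Translate a string; translation pipelines return a list of dicts."""
    return translator(text)[0]["translation_text"]


def round_trip(text, en_to_xx, xx_to_en):
    """Translate English text to another language and back to English."""
    return translate(xx_to_en, translate(en_to_xx, text))
```

Such a step fits naturally right after the speech-to-text module, since the rest of the pipeline still receives English text.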

## 9 - Bonus

Use a Stable Diffusion model (available on Hugging Face) for image editing. For instance, try replacing the horse with another animal in the "astronaut_rides_horse.png" sample image.
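
One possible route, among others, is an instruction-based editing pipeline. The sketch below assumes the `diffusers` and `torch` packages are installed and uses the InstructPix2Pix pipeline with the `timbrooks/instruct-pix2pix` checkpoint; this choice of model, the helper names and the default parameter values are assumptions made for illustration.

```python
def load_editor(model_id="timbrooks/instruct-pix2pix", device="cuda"):
    """Hypothetical helper: load an instruction-based image editing pipeline."""
    import torch
    from diffusers import StableDiffusionInstructPix2PixPipeline

    pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    )
    return pipe.to(device)


def edit_image(pipe, image, instruction, num_inference_steps=20, image_guidance_scale=1.5):
    """Edit a PIL image according to a text instruction and return the new image."""
    result = pipe(
        instruction,
        image=image,
        num_inference_steps=num_inference_steps,
        # Higher values keep the result closer to the input image
        image_guidance_scale=image_guidance_scale,
    )
    return result.images[0]
```

An instruction such as "turn the horse into a zebra" could then be tried on the sample image, adjusting `image_guidance_scale` to balance fidelity to the original against the strength of the edit.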