Articles

Timbre Transfer with Machine Learning by Isaac Io Schankler

Machine learning has long been a part of computer music, but with recent advances in the field, we have been curious to see how these new tools could be put to use in Max. We invited Isaac Io Schankler to share an example from their explorations with the nn~ object, which implements the RAVE model in Max. We hope this offers a good starting point for experimenting with ML tools in Max.

From Isaac:

There are lots of ways to use machine learning in music! If you are Google or OpenAI, you can spend millions of dollars to scrape the internet for data and train your models using thousands of GPUs. But even if you are not a tech conglomerate with endless resources and questionable ethics, there are ways to make use of machine learning at home that are slightly less resource-intensive and less fraught. Lately I’ve been having fun using Antoine Caillon and Philippe Esling’s RAVE (Realtime Audio Variational autoEncoder). You can use RAVE as a kind of “timbre transfer” tool to make one thing sound like another, e.g. make a guitar sound like a violin, or make an orchestra sound like a Game Boy, or yes, make one person’s voice sound like another person.

For this patch, I trained a model on the Bach Violin Dataset, which features recordings by Kinga Augustyn, Ray Chen, Oliver Colbentson, Ko Donghwi, John Garner, Karen Gomyo, Minji Kim, Silei Li, and Emil Telmanyi. (While there are no laws about using copyrighted material to train AI models yet, I tried to stick to recordings that were in the public domain or had similarly permissive licenses.)

If you want to explore further, there are a few other pre-trained models here. If you’re interested in training your own model, you’ll need at least 3 hours of recorded audio and a computer with a capable GPU. If, like me, you don’t have access to a fancy GPU, you can use a cloud computing service like Google Colab or vast.ai. A Google Colab sketch, helpfully provided by hexorcismos, can be found here.
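If you do train your own model, RAVE's training pipeline is driven from the command line. As a rough sketch of the workflow (based on the RAVE v2 CLI; exact flag names may differ between versions, so check the project's README before running anything):

```shell
# Install RAVE's training tools into a Python environment
pip install acids-rave

# 1. Preprocess a folder of audio files into a training dataset
rave preprocess --input_path ./audio --output_path ./dataset --channels 1

# 2. Train — expect many hours or days, even on a capable GPU
rave train --config v2 --db_path ./dataset --name my_model

# 3. Export a TorchScript (.ts) file that nn~ can load
rave export --run ./runs/my_model --streaming
```

The `--streaming` flag matters for live use: it exports a model configured for low-latency, block-by-block inference rather than offline processing.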

HOW TO USE THE PATCH

You’ll need to download and install nn~ first. Once that is installed, download the patch. After you run the patch and turn the audio on, you can try out the timbre transfer by turning on the mic input, or by playing back the little synth guitar loop that I provided. You should start to hear violin-like sounds that mimic the audio input—that’s the pre-trained RAVE model trying to reproduce your sounds!

You can then adjust the mix of the original input and the two violin models to your liking. I also added some reverb to “soften” the occasionally noisy output you get from RAVE, but you can adjust that too if you like.

You might also want to adjust the delay of the original audio input. It takes RAVE a bit of time to generate its audio, so if you want the violin models to be in sync with the audio input, you will probably have to delay the original signal a bit (usually between 250 and 1000 ms).
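Since Max delay objects are often specified in samples rather than milliseconds, here is a small (hypothetical) helper showing the conversion you'd do when dialing in that compensation delay:

```python
def delay_in_samples(delay_ms: float, sample_rate: int = 44100) -> int:
    """Convert a delay time in milliseconds to a sample count,
    e.g. for compensating RAVE's processing latency on the dry signal."""
    return round(delay_ms / 1000.0 * sample_rate)

# 500 ms at 44.1 kHz:
print(delay_in_samples(500))  # 22050
```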

Okay, so far so good. But here’s where things get really interesting. To generate its audio, RAVE uses a bunch of constantly changing control parameters. Normally these parameters are controlled directly by the audio input, but we can hijack them and control them manually if we want. RAVE gives us 16 of those parameters to play with. These parameters are created by RAVE during training, so they don’t really map onto the musical parameters that we know and love (e.g. high and low, loud and soft, etc.). But imo that’s part of the fun of it: the exploration, experimentation, and noodling to try to locate the kinds of sounds you want!
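Conceptually, "hijacking" means intercepting the 16-dimensional latent vector between RAVE's encoder and decoder and overwriting some of its dimensions with manual values. In the patch this happens between nn~'s encode and decode stages; the sketch below (hypothetical names, plain Python lists standing in for latent frames) just illustrates the idea:

```python
def hijack_latents(encoded, overrides):
    """Replace selected dimensions of a latent vector with manual values.

    encoded   -- list of 16 floats produced by the encoder
    overrides -- dict mapping dimension index -> manually chosen value
    """
    hijacked = list(encoded)          # leave non-hijacked dims as the encoder set them
    for dim, value in overrides.items():
        hijacked[dim] = value
    return hijacked

z = [0.0] * 16                        # one latent frame from the encoder
z2 = hijack_latents(z, {0: 1.5, 7: -0.8})
print(z2[0], z2[7])  # 1.5 -0.8
```

Dimensions you don't override stay driven by the input audio, which is why you can blend "following the mic" with manual exploration.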

Since 16 parameters is a lot to deal with, I also added some LFOs that slowly sweep through the range of each parameter, in case you just want to sit back and hear how the sound changes over time.
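The LFO idea can be sketched like this: one slow sine per parameter, each with a slightly different period so the sweeps drift out of phase with each other. The ±2.0 range here is an assumption (RAVE latents are roughly unit-normal, so a couple of standard deviations covers most of the interesting territory):

```python
import math

def lfo(t, period_s, lo=-2.0, hi=2.0):
    """Slow sine LFO: returns a value between lo and hi at time t (seconds)."""
    phase = math.sin(2 * math.pi * t / period_s)  # -1 .. 1
    return lo + (phase + 1) / 2 * (hi - lo)       # remapped to lo .. hi

# 16 LFOs with staggered periods, sampled at t = 10 s
values = [lfo(t=10.0, period_s=30.0 + 2 * i) for i in range(16)]
```

Because each parameter gets its own period, the combined 16-dimensional sweep never quite repeats, which makes for a nice slowly evolving drone out of a static input.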

Enjoy(?)!
--Isaac Io Schankler
Find Isaac on Twitter and Instagram

by Andrew Benson on June 7, 2023

Creative Commons License
Robert Koster:

I’m most interested in training my own model. You can’t do it with an M1 or M2 ARM Mac, can you? Does it require NVIDIA or some other chipset?

Would be great to see some of the other machine learning models come to Max, specifically the models from Qosmo's Neutone, DDSP, or whatever. Defs lots to explore. 😊

Carlos Ramos:

Robert, I tried training the RAVE model with 3 hours of audio on on-demand NVIDIA H100s. It took a little over 72 hours of training time and around 150 EUR to get a more-or-less decent inference model that I can use to explore the latent space for *that* particular sound. Since I'm just a hobbyist, I don't feel especially encouraged to train a new dataset for another 72 hours, or even to train on the same 3 hours of audio for longer to get higher-quality results.
I have tried training it on my M1 Max too, and it took 12 hours to barely reach 200 epochs. lol