WDJP - Stability AI Releases Revolutionary Stable Audio

Pure text-to-audio magic now available for everyone

Published: Sept. 18, 2023

Writer: Rehan S.

News Type: Tech

In the ever-evolving world of AI technology, where generative models have dazzled us with images and code, a new symphony emerges: text-to-audio generation. Stability AI has just released its text-to-audio technology to the public, letting anyone craft audio clips from simple text prompts. While Stability AI is best known for Stable Diffusion, its text-to-image generation model, it now steps onto a fresh stage with Stable Audio.

This summer, Stable Diffusion took a major step forward with its SDXL base model, which improved image composition. And just last month, the company expanded its horizons further, venturing into code generation with the launch of StableCode. But now, it's time to let the music play.

Stable Audio, a cutting-edge capability, draws upon the same core AI techniques that have enabled Stable Diffusion to weave images from thin air. However, this time, the magic unfolds in the auditory dimension. It relies on a diffusion model, painstakingly trained not on images, but on audio, to conjure enchanting audio clips.
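For readers curious what "diffusion" means for sound: models of this family are typically trained by corrupting clean data with Gaussian noise at varying strengths and then teaching a network to undo the damage. The snippet below is a minimal sketch of only that noising step, applied to a toy sine wave. The function names, the 8 kHz sample rate, and the noise levels are illustrative assumptions, and none of it is Stability AI's code or a description of its actual pipeline.

```python
import math
import random


def sine_wave(freq_hz, seconds, sample_rate=8000):
    """Toy stand-in for a clean audio clip (sample rate is an assumption)."""
    n = round(seconds * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def add_noise(clean, signal_keep, rng):
    """Generic forward-diffusion step: blend clean samples with Gaussian noise.

    signal_keep is the fraction of variance kept from the clean signal
    (1.0 = untouched, close to 0.0 = almost pure noise).
    """
    signal_scale = math.sqrt(signal_keep)
    noise_scale = math.sqrt(1.0 - signal_keep)
    return [signal_scale * x + noise_scale * rng.gauss(0.0, 1.0) for x in clean]


def correlation(a, b):
    """Pearson correlation, used here to see how much of the clip survives."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
clean = sine_wave(440.0, 0.1)
slightly_noisy = add_noise(clean, 0.99, rng)  # early step: clip mostly intact
mostly_noise = add_noise(clean, 0.01, rng)    # late step: clip mostly drowned out

print("early step correlation:", correlation(clean, slightly_noisy))
print("late step correlation:", correlation(clean, mostly_noise))
```

A trained model learns to run this process in reverse, starting from noise and removing it step by step, guided by the text prompt.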

"Stability AI is best known for its work in images, but now we're launching our first product for music and audio generation, which is called Stable Audio. The concept is really simple: you describe the music or audio that you want to hear in text, and our system generates it for you."

- Ed Newton-Rex, VP of Audio at Stability AI

Harmonai: The Heartbeat of Stable Audio

Ed Newton-Rex is no newcomer to the world of computer-generated music, having founded his own startup, Jukedeck, back in 2011, which eventually found a new home under the TikTok umbrella in 2019. However, the roots of Stable Audio stretch beyond Jukedeck, germinating in Stability AI's internal research studio for music generation, lovingly named Harmonai, a brainchild of Zach Evans.

"It's a lot of taking the same ideas technologically from the image generation space and applying them to the domain of audio."

"Harmonai is the research lab that I started, and it is fully part of Stability AI. It is basically a way to have this generative audio research happening as a community effort in the open."

- Zach Evans

While generating base audio tracks with technology is not entirely novel, Stable Audio takes it several steps further. Traditional 'symbolic generation' techniques, as Evans puts it, often work with MIDI (Musical Instrument Digital Interface) files, which are capable of representing basic musical elements like a drum roll. However, the generative AI power of Stable Audio transcends these limitations, empowering users to craft new music that breaks free from the shackles of repetitive notes typically associated with MIDI and symbolic generation.
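To make the distinction concrete, here is a small sketch contrasting a symbolic note, of the kind MIDI describes, with the raw samples a model working on audio has to produce. The `NoteEvent` class, the `render` helper, and the 44,100 Hz sample rate are hypothetical choices for illustration only; real MIDI files and Stable Audio's internals are considerably more involved.

```python
import math
from dataclasses import dataclass


@dataclass
class NoteEvent:
    """Symbolic description of one note, loosely modeled on a MIDI event."""
    pitch: int         # MIDI note number; 69 is the A above middle C
    velocity: int      # loudness, 0 to 127
    duration_s: float  # length in seconds


def midi_pitch_to_hz(pitch):
    """Standard equal-temperament mapping with note 69 tuned to 440 Hz."""
    return 440.0 * 2 ** ((pitch - 69) / 12)


def render(note, sample_rate=44100):
    """Turn the symbolic note into raw samples using a plain sine tone."""
    n = round(note.duration_s * sample_rate)
    freq = midi_pitch_to_hz(note.pitch)
    amplitude = note.velocity / 127
    return [amplitude * math.sin(2 * math.pi * freq * i / sample_rate)
            for i in range(n)]


note = NoteEvent(pitch=69, velocity=100, duration_s=1.0)
samples = render(note)

print("symbolic form:", note)
print("raw form:", len(samples), "samples for one second of sound")
```

The symbolic form is three numbers and says nothing about timbre, while the raw form is tens of thousands of values per second that fully determine what the listener hears.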

Stable Audio doesn't dabble in MIDI; instead, it operates directly with raw audio samples, ensuring a superior output quality. The model was meticulously trained on a vast trove of over 800,000 licensed music pieces from the esteemed AudioSparx audio library.

"Having that much data, it's very complete metadata."

"That's one of the really hard things to do when you're doing these text-based models: having audio data that is not only high-quality audio but also has good corresponding metadata."

- Zach Evans

Unleashing Creativity with Stable Audio

Unlike image generation models that often cater to the whims of replicating specific artists' styles, Stable Audio takes a different route. Users won't be able to instruct the AI to compose music reminiscent of the Beatles or any other iconic musical group.

"We haven't trained on the Beatles."

"With audio sample generation for musicians, that has tended not to be what people want to go for."

- Ed Newton-Rex, VP of Audio at Stability AI

Mastering the Art of Text-to-Audio Generation

As a diffusion model, Stable Audio has approximately 1.2 billion parameters, putting it roughly on par in size with the original release of Stable Diffusion for image generation.

The text model that interprets prompts was built and trained by Stability AI. Evans explained that the text model uses a technique known as Contrastive Language Audio Pretraining (CLAP). As part of the Stable Audio launch, Stability AI is also rolling out a prompt guide to help users write text prompts that steer the AI toward the audio they want.
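The general idea behind contrastive language-audio pretraining is to pull the embedding of a text description toward the embedding of its matching audio clip while pushing it away from the others in the same batch. The following is a minimal sketch of that kind of symmetric contrastive loss on toy two-dimensional embeddings. The function names, the hand-written vectors, and the temperature of 0.07 are assumptions for illustration, not details of Stability AI's model.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(text_embs, audio_embs, temperature):
    """Cosine similarity of every text against every audio clip, temperature-scaled."""
    texts = [normalize(v) for v in text_embs]
    audios = [normalize(v) for v in audio_embs]
    return [[sum(x * y for x, y in zip(t, a)) / temperature for a in audios]
            for t in texts]


def matched_pair_cross_entropy(logits):
    """Mean cross-entropy where row i's correct answer is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(text_embs, audio_embs, temperature=0.07):
    """Average of the text-to-audio and audio-to-text losses."""
    logits = similarity_logits(text_embs, audio_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (matched_pair_cross_entropy(logits)
                  + matched_pair_cross_entropy(transposed))


# Toy batch: two captions and two clips (hand-made embeddings).
text_embs = [[1.0, 0.0], [0.0, 1.0]]
audio_matched = [[0.9, 0.1], [0.1, 0.9]]  # each clip sits near its caption
audio_swapped = [[0.1, 0.9], [0.9, 0.1]]  # clips paired with the wrong caption

print("loss, matched pairs:", contrastive_loss(text_embs, audio_matched))
print("loss, swapped pairs:", contrastive_loss(text_embs, audio_swapped))
```

Correctly paired embeddings yield a much lower loss than swapped ones, which is the signal that pushes a text encoder and an audio encoder toward a shared space during training.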

Stable Audio comes in two tiers: a free version and a $12/month Pro plan. The free tier grants users 20 generations per month of tracks up to 20 seconds long, while the Pro plan raises this to 500 generations per month and 90-second tracks.

"We want to give everyone the chance to use this and experiment with it."

- Ed Newton-Rex, VP of Audio at Stability AI

In the world of AI-driven audio creation, Stability AI's Stable Audio has taken center stage, inviting all to join the symphony of creativity, where text becomes music, and imagination knows no bounds.
