Deep Papers
Deep Papers is a podcast series featuring deep dives on today’s most important AI papers and research. Hosted by Arize AI founders and engineers, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning.
Google's NotebookLM and the Future of AI-Generated Audio
This week, Aman Khan and Harrison Chu explore NotebookLM's unique features, including its ability to generate realistic-sounding podcast episodes from text (but this podcast is very real!). They dive into some technical underpinnings of the product, specifically the SoundStorm model used for generating high-quality audio, and how it leverages residual vector quantization (RVQ), a hierarchical approach, to maintain consistency in speaker voice and tone over long audio durations.
The discussion also touches on ethical implications of such technology, particularly the potential for hallucinations and the need to balance creative freedom with factual accuracy. We close out with a few hot takes, and speculate on the future of AI-generated audio.
Learn more about AI observability and evaluation in our course, join the Arize AI Slack community or get the latest on LinkedIn and X.
Aman Khan: All right. I think we'll give it a few minutes for folks to join the call.
Harrison Chu: Aman, did you end up putting your LinkedIn in there?
Aman Khan: I struggled with getting good results from it. I've done that a few times, I tried like three times, and for some reason my own LinkedIn is not the strongest on it, but other people's sound good. Yeah.
Have you done that one?
Harrison Chu: I think I'm too embarrassed.
Aman Khan: Oh, you should definitely try.
Okay, I think we have enough. We've got enough folks in the room now, so maybe we can just kick off. I'll do a quick intro. My name's Aman. I'm on the product team at Arize.
And I'm joined here by Harrison. You wanna do a quick intro of yourself?
Harrison Chu: How's it going everybody? I'm Harrison. I'm the Director of Engineering here at Arize.
Aman Khan: Awesome. So this one is pretty exciting.
I feel like we're trending more in this direction. Maybe it's indicative of the space overall. Research is still super interesting, and it's great to follow research, but we thought we'd
do a little bit of a turn on that journey and deep dive on a product that has definitely taken the entire AI space by storm, and that is NotebookLM.
And what we thought we'd do would be to deep dive on the product itself and speak to the internals around how it works.
So keep in mind some of this is going to be from the outside in. We're taking resources that we found across the Internet, across interviews from product managers of the product, and honestly, a lot of Reddit speculation and sort of synthesizing that into the product tear down you're about to see. So it should be great. It should be interesting. We'd love any engagement as well in the chat, so feel free to drop questions or thoughts.
Tell us we got something wrong, and we're happy to debate it real time as well. Cool. Let's hop in.
So I think a good place to start is NotebookLM–what is it?
So if you haven't seen the product yet, we can just sort of pull it up in real time.
What's interesting is it's a pretty simple interface product, launched by Google. This is actually maybe one of the first non-chatbot products in the AI space that Google has launched.
And the way it works is, you have a sort of product primitive of a notebook.
They even have the experimental tag on it. So it's like, an early version of the product, they don't want to set super high expectations.
Harrison Chu: Very Google.
Aman Khan: Yes.
And what you can do is drop in almost any file type; most file types are supported here. So you could do PDF, text, audio. A really cool feature they launched last week as well is you can actually drop in a YouTube link.
Or, with that context, you can copy over some text and insert that here. And what you get on the output is…let me see if I can pull up one I already have.
What you get on the output is actually an audio file. Oh, let me just use one of the examples. It converts the context into a format called a notebook, which you can then use to create this audio file, and that produces an extremely realistic-sounding podcast episode based on the text.
Harrison, I don't know if you want to maybe play a short clip from your end, just so folks that maybe haven't tried it out can get a sense of what it sounds like.
Harrison Chu: Sure. Yeah, and I can also share my use case that I've been super excited about.
A few weeks ago, the team and I were trying to think about ways to make substring search faster in our database.
And it's just one of those things that, if you're like most people, like myself, it's kind of a grind to read a bunch of Wikipedia pages on related algorithms, see what the common thread is.
But with NotebookLM, we had this really interesting experience where we knew we had 2 or 3 algorithms we were interested in. Maybe we also took some inspiration from Grep from Unix
and wanted to have a pretty high-level synthesis that could tell us what the algorithms are doing and what the key takeaways are. So we loaded these three sources and generated the audio. It took a few minutes.
And in the end we could just put it on and listen to it while we went on walks and slowly digested it in a different form.
I'll try and play the audio. It might not be the best through Zoom, but I'll give it a go.
Audio Example:
Person 1: Ever feel like your code's like stuck in slow motion, you know, just spending way too much time hunting down a specific bit of text.
Person 2: Oh, absolutely. It's a classic programming headache. You're basically trying to find a needle in a haystack of digital words.
Person 1: Exactly. And today's deep dive is all about getting that needle fast. We're talking about algorithms that can seriously speed up your search.
Harrison Chu: So I'll just stop there. Hopefully it comes through. For me it was similar to the 2022 ChatGPT experience, where you couldn't believe that on the other side of the screen there wasn't actually a real person.
And I think that sort of feeling is even more intense because it's a very specific topic you care about. And it's these two really professional polished voices talking about it.
And you know what? The information quality was really dense, too. I got a lot out of those 10 minutes, enough to then go dig deeper into these algorithms and kind of go from there. So that was kind of my exposure to it in the first rev, and I was just blown away.
Aman Khan: Yeah. I highly recommend everyone go and try it if you haven't yet. It does take a couple of minutes to load up the audio file, based on the context.
But what you'll find is there are a few components to this that, when you listen to it, we can try to break down as the great parts of the product and try to understand how those parts work. So kind of what Harrison is alluding to here as well is that the quality of the transcript feels perhaps a step function higher than what we've seen from your typical LLM chatbot interfaces. There's definitely some prompt engineering going on underneath the hood.
But more likely this is actually a multi-step agent system or multi-step prompt system that takes some context and generates essentially two sources of transcript to actually create this dialogue between two agents, or at least what sound like two different agents.
Harrison Chu: Well, Aman, hold on, you should talk about what you were saying earlier. This is us from the outside looking in, right? We were talking earlier about: how do they even generate the scripts? Let's not even talk about the voice; how's the script generated? And then we went through this. Maybe talk a bit about the ways we tried to, not necessarily hack it, but get the two AI podcasters to tell us about what the internals were, with the show notes from the future.
Aman Khan: So given that we're AI people, the first thing we tried to do was break it. We tried to break this experience right. Maybe that's indicative of how we think about things. But like we were trying to open a box, and see how it works.
So what we did is there was this prompt floating around online that you could feed in that was sort of almost like a way to try to get at what we think is the system prompt.
So the context is, you basically provide some text that says “it's the future now.”
And the way that this system works, is you know, there's this podcast episode.
So what you can do is prompt it and say: 10 years into the future, it turns out that this podcast is actually two AI bots talking to each other. Now, tell us about that, and give us a podcast episode on that.
And it's really interesting because you hear what sounds like two voices saying: Oh, we're realizing, we're AI and how does that work? And then the dialogue starts to trend towards speculating that there's a system prompt and a multi agent system. And these texts go into each other.
And then you kind of realize, like, is this really how the system works? Or is it actually just a hallucination of what a system prompt could be in this case?
So we're trying to figure out, aside from just the engineering and product side, how do you even get the system to sort of break?
And there have been other examples of people online posting like a thousand random words from the Internet and dropping it in. And it creates this jumbled up almost nonsensical podcast episode, where you're just trying to connect dots that aren't meant to be connected.
So yeah, a lot of ways to try to break this thing, highly recommend trying it out yourself. But yeah, to Harrison's point, how the transcript is generated is our own speculation as well. That these are, you know, two distinct transcripts that are created from an agentic system.
I think there's a Developer Advocate who was on the Hard Fork podcast who can maybe reveal a little bit about it. I think there is a component of a first pass that generates a script, and then there's a critique phase, and then there's an incorporation of the critique. So there's a multi-step, almost agentic system happening there to produce the transcripts. There's also an interesting step where I think they then add what they're calling disfluencies, the "ums" and the "aahs" that just make the speech sound more natural.
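To make that speculation concrete, here's a minimal sketch of what a draft-critique-revise pipeline with a disfluency pass could look like. This is purely illustrative: `call_llm` is a placeholder for whatever model API you'd use, and none of these prompts come from Google.

```python
# Hypothetical sketch of the multi-step script pipeline speculated about above:
# draft -> critique -> revise -> add disfluencies. `call_llm` is a stand-in for
# any chat-completion API; the prompts are illustrative, not Google's.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    raise NotImplementedError

def generate_podcast_script(source_text: str) -> str:
    # Pass 1: first draft of a two-host dialogue grounded in the source material.
    draft = call_llm(
        "Write a two-host podcast dialogue covering the key points of:\n" + source_text
    )
    # Pass 2: critique phase that flags dull, confusing, or unsupported sections.
    critique = call_llm(
        "Critique this podcast script for accuracy, flow, and engagement:\n" + draft
    )
    # Pass 3: incorporate the critique into a revised script.
    revised = call_llm(
        f"Rewrite the script to address this critique.\nSCRIPT:\n{draft}\n\nCRITIQUE:\n{critique}"
    )
    # Pass 4: sprinkle in disfluencies ("um", "you know", small interruptions)
    # so the dialogue sounds less scripted before it reaches the audio model.
    return call_llm(
        "Insert natural disfluencies and brief interruptions into this script:\n" + revised
    )
```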
Harrison Chu: Before we move on, there's another question I wanted to ask you from a product perspective. And I guess I'll show the NotebookLM UI again.
By the way, for anyone who's gonna try it later, if you go on this page and can't find it, you're not dumb; Google, in the most Google fashion, buried the best feature under this notebook guide. So you have to look this up. But what's most interesting is, I think when they first shipped this product, it was going to be a chat-over-docs feature, right? Like, I have this doc with my Dungeons and Dragons notes, and I can theoretically ask it questions about my doc, but no one talks about that, right?
No one's excited about this chat over docs thing. Everyone's excited about this audio that you can generate. If you take your product lens, why do you think that is?
Aman Khan: Yeah, I think my honest take is that, kind of like ChatGPT, the NotebookLM moment feels like the product is having a moment, but sort of by accident.
And so the product was designed in a way that, if you look at what ChatGPT can do, well, it's this chat interface. Okay, if you try to replicate that over docs, you're gonna copy over some of the UI, but that's actually not necessarily the novel feature behind this. What's novel is that transcript generation and the state-of-the-art audio model underneath it that's producing the super realistic-sounding, high-quality audio.
So my take on that is it really does kind of feel like the product is having a moment sort of by accident in that it's not necessarily great at the thing it was supposed to be good at, but it's excellent at something else which is this uncanny valley of a really high quality podcast discussion.
And we'll talk about the quality of the actual podcast in a bit. But perception-wise, high-level, it just feels super novel and new as a format. So that's my take. I don't know if it was intentionally designed so that the experience was centered around what it is now.
Harrison Chu: Yeah.
Aman Khan: Cool. Yeah, so I'll share a couple more tidbits or anecdotes on the transcript. A couple of other fun use cases you can try, which Harrison and I were talking about earlier, is you can drop your LinkedIn profile in. What's interesting is, if you just drop the LinkedIn link, it's not going to be that great, because the public surface area of your LinkedIn is limited; LinkedIn blocks things. But if you copy and paste all the text on your LinkedIn as the source, what you end up with is what feels like a super personalized rendition of your career history.
And I think that is one of the use cases that really blew people away. You can use this thing as an educational tool to go learn about a new topic.
It's an engaging new way to learn, like Harrison's earlier use case of trying to understand this paper, which is what I think it was intended for.
But when you personalize it, it ends up becoming so powerful, because, I mean, humans and our egos. I think there's something to hearing two people talk about you and rationalize career decisions that you've made, like: oh, it's so clear Harrison was doing an incredible job at Lyft, and now he's at Arize.
So you get this really interesting story that feels personalized, and I think that was a bit of an aha moment of a use case that started making a lot of waves on the Internet around how you can use this tool.
I think the other component of this is the transcript generation, or the transcripts themselves sort of present ideas in a way that feel kind of new. And what I mean by that is it's almost like they're scratching at humor.
The model itself will use turns of speech that we take for granted, things like alliteration.
So it's like, oh, this is the, you know, President of Prompting, or something like that, things that feel very human-like. So I think that is kind of novel as well.
So that's my last couple of notes on the transcript part. I guess anything else to add on to that?
Harrison Chu: No
Aman Khan: Okay. So, aside from being a pretty impressive back and forth transcript, I think the second component of this product is the conversation quality, and specifically the audio quality.
So we were kind of like taking a look at this earlier, had two friends in the space as well kind of send this over this morning–this is Podcastify AI.
It's an open source version of NotebookLM, they even reference it here. And this is not a drag on Podcastify, who is doing amazing work. But when you listen to the audio quality, it still falls short of the NotebookLM audio model.
And there are three things that I'll call out that make the NotebookLM experience feel more realistic, and Harrison will add more onto that. My three are: 1) the inflection, the audio kind of going up or down the way humans talk; 2) the way the pace of speech is sometimes faster and slower; and 3) the ums. They call them disfluencies, but it's the way we find little ways to, you know... maybe there's vocal fry, or I'm talking over you a little bit, or you talk over me. The way the audio adds these little nudges that feel more human-like and less perfect.
Like just what I just did there.
So that's part of the audio model. I think that product-wise makes it really stand out.
Harrison Chu: Yeah. I think for me, it's a few things. And, by the way, it's funny to see, like, Karpathy will tweet about how much he loves NotebookLM, and then you see all the other solopreneurs going: well, I created PDF-to-podcast like three months ago, and they post a link to it.
And it's just funny to see everyone getting worked up, because I bet a hundred, maybe a thousand people have had this idea, right? But I think the execution of the model matters a lot. Just like you said, I think humans are really good at attuning to whether or not someone is real, and that last 2% of being 98% human-like is really important.
I think there's a technical piece of it too. If we get into the paper, there's this huge technical component: there is a computational constraint on how long of a podcast you can generate, and there's this trade-off between how good you can make the audio quality and the duration. Even in this one you can see the duration is not very long, three minutes and 40 seconds, so I think there was a technical limit there.
I think there's also this distribution advantage that Google just has. I can hear about NotebookLM and I can just get it up and running with my Google account, and I can have it chat over Google Docs I already have. It takes five seconds, versus a lot of these other products people talk about. I guarantee with Podcastify.ai you have to pip install something, right? I have to pip install this, I've got to download something from Hugging Face.
I think that gap is pretty big, and I think that plays into it. But yeah, it matters a lot.
Aman Khan: I think it's really funny, like Noor in the chat says it's all deep fake webinars from Arize from now on.
Two thoughts on that. One, the initial pitch on this episode was: what if we just fed the paper into NotebookLM, put that up as the episode, and see if people noticed?
But I think we realized we should probably actually do a little bit more work than that, you know, rather than just try to automate our jobs.
Harrison Chu: Yeah, we did have this existential crisis, like, why, even freaking do this?
Aman Khan: Yeah, exactly. And then we realized there are actually ways to improve the experience. So another thing we tried, and this doesn't come across very well because there's no recording of it, was we had NotebookLM playing in the background and we had the real-time audio from OpenAI on our phone, and we were trying to see if there could be a discussion going back and forth between a little bit of the NotebookLM audio and the real-time audio. But it is interesting; maybe for me the killer feature is the fact that these audio streams slightly talk over each other. They have these moments where, even with real-time audio, which is mind-blowing, it's so crazy that it's a low-latency audio stream and text-to-speech, it's still lacking a little bit of that humanness quality.
Okay. So I think that's most of what we can riff on on the product side. Now, for the people in the room, I have a feeling there's a desire to get into the technical. So let's turn this NotebookLM episode a little bit more technical. Harrison, what are your thoughts on SoundStorm?
Harrison Chu: Yeah, let's see the AIs do that, they probably will.
So we're gonna try and talk about this paper called SoundStorm, which is again from the outside looking in.
But many people speculate that this is the model that NotebookLM is using to generate the high quality audio that we hear, you know from NotebookLM.
So let's start by framing it clearly: What does it actually solve? There are text to speech models out there, what does SoundStorm specifically solve for NotebookLM?
This kind of gets to what you were saying earlier, right Aman? The power is the consistency. It's like the same host over a long duration of the episode.
And hopefully–the paper's pretty dense–but hopefully, what this audience can get out of this at the end is understanding how these two things were trade offs, and what the technical innovation from SoundStorm was to push the boundary of this like consistency in terms of the speaker’s tone, the timbre, the pitch, and the duration of the episode.
So, let's go to the next slide. Before I talk about RVQ, just the quickest primer on text to speech.
The fundamental unit of data is this frame of a recording. It's typically something like 20 frames per second, just a clip of the waveform that's encoded into, let's say, a vector. You can think of it as the embedding of a short length of speech. That's the token in this world.
And a really clever thing that SoundStorm exploits, but does not invent, is this thing called RVQ: residual vector quantization. The power of this way of representing the token is that it's hierarchical, which I'll get at in a second. At the root of this there's this thing called a codebook, which is sort of like… you can think of it as, like, eigenvectors of all the possible ways a sound can sound.
So there might be a vector that represents pitch in a certain direction, another vector that represents timbre in another direction, or phonemes in yet another, and so on. But together they span as much of the latent space as possible.
And what RVQ does is it takes that vector which is the embedding of the audio frame, and it quantizes it according to the codebook.
So in this example, you take the vector, you quantize it to the vector that's most similar in the codebook, maybe it's the fourth one, and you get out the number four.
So we've quantized it down to something super low-dimensional. Obviously, the number four alone is probably not representative enough.
So it will do this repeatedly. Go to the next slide, Aman?
So the same example. Vector representing the frame comes in, quantize it the first time. Maybe it's the number three representing the third vector in the codebook.
And then you do a subtraction, which is the residual. And then you repeat the same process on that residual, and then again and then again and then again.
So each time you're capturing more and more information about the audio frame, such that the first level is the most information-rich. So this could be, and I'm not an audio expert, phonemes, which apparently are the smallest units of distinguishable linguistic speech, all the way down to really, really fine aspects of pitch and timbre, things you can identify the speaker with.
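To make the coarse-to-fine idea concrete, here's a tiny numpy sketch of residual vector quantization. The codebooks are random for illustration only; in a real codec like SoundStream they're learned, but the encode/decode loop has the same shape.

```python
import numpy as np

# Toy residual vector quantization (RVQ). Codebooks are random for illustration;
# a real codec (e.g. SoundStream) learns them, but the loop has the same shape.
rng = np.random.default_rng(0)
dim, codebook_size, num_levels = 8, 16, 3
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(num_levels)]

def rvq_encode(frame_embedding):
    """Quantize one frame embedding into a coarse-to-fine list of codebook indices."""
    residual = frame_embedding
    indices = []
    for codebook in codebooks:
        # Pick the codebook vector nearest to what's left to explain.
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        indices.append(idx)
        # Subtract it out; the leftover residual goes to the next, finer level.
        residual = residual - codebook[idx]
    return indices

def rvq_decode(indices):
    """Approximate the frame by summing the chosen vectors across levels."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

frame = rng.normal(size=dim)              # stand-in for one audio frame's embedding
codes = rvq_encode(frame)                 # coarse index first, finest index last
print(codes, np.linalg.norm(frame - rvq_decode(codes)))  # error shrinks as levels are added
```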
Aman Khan: Is there any analogy or any relation to like compression happening here? Or is it really more of expanding on the data that you have or, it seems like you're changing the representation of the data with each pass. Maybe you can talk about that first.
Harrison Chu: Right, it is compression in one sense. The green rectangle on the left is this dense embedding, again representing the frame.
The thing on the right is just three numbers, and they tell you which vector from the codebook each one is. But the most important part is that as you read this vector of 3, 12, 91 from left to right, the 3 is the most important part. It might tell you that it's a "cuh" or an "uhh" sound, and then the 91 all the way on the right could be whether it was intonated up or intonated down.
So coarse to fine.
And the way you can imagine it is that there are layers of compression. Maybe if it was an image, the 3 would give you a really blurry image, right, the 12 would give you a slightly more detailed image, and the 91 is a super high-fidelity 4K picture that tells you, oh, there's a frog sitting on a leaf, or whatever.
Aman Khan: It's like when you're using Midjourney, you can see each pass: it starts kind of grainy, just colors, and then gets refined and refined and refined. Kind of similar.
Harrison Chu: Right that but for audio. I should have just led with that.
Okay, so we'll try and keep that in mind. I think we're ready for the next slide. So you have this thing called an RVQ embedding. As an engineer, I'm not a total research scientist, so I usually think of these things in terms of interfaces: what actually goes into the model? What is the shape of the things that go into this model, and what comes out of it?
So what goes into SoundStorm is, first of all, what in the paper they call conditioning tokens.
So conditioning tokens: they're embeddings generated from an upstream model that takes text and converts it into embeddings that capture the most coarse-grained aspect of what an audio frame would sound like for that text.
In this case, it's this other model called AudioLM. Let's say you have this example where Speaker 1 says: "Hi, how are you?" "I'm good." "That's great."
So the upstream model converts it to these conditioning tokens. And again, because they're very coarse-grained, they might capture the "uh" and the "up," but they'll capture nothing about any particular human's voice. So they're super coarse-grained, very low bit rate.
The second component that goes into this model is a prompt. Maybe, if you're trying to mimic my voice and yours as Speaker 1 and Speaker 2, I would say, "Hi!" and maybe you would say, "How are you?"
And these will be converted to the SoundStream RVQ vectors that we were just talking about, all the way up to the finest layers that have the identifying characteristics of how I speak and how you speak.
And then the magic comes in where you align each token: the conditioning token that captures just the content of what's being said with the RVQ vector that captures all the unique characteristics of us. You line them up, and then the model decodes the rest. So, go to the next slide now?
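As a rough picture of that interface, here's how the inputs might be laid out, as we read the paper: conditioning tokens aligned frame by frame with a grid of RVQ tokens, where only the short voice prompt is filled in and everything else is masked. The shapes and vocabulary sizes here are made up for illustration.

```python
import numpy as np

# Illustrative layout of SoundStorm-style inputs as we read the paper; all shapes
# and vocab sizes are made up. MASK marks positions the model has to fill in.
MASK = -1
num_frames, num_levels = 12, 3    # time steps x RVQ levels (coarse -> fine)
prompt_frames = 4                 # short recorded prompt from the two speakers

rng = np.random.default_rng(0)

# Conditioning tokens: one per frame, produced upstream (an AudioLM-style semantic
# model) from the script text. They carry the content, not the speaker identity.
conditioning_tokens = rng.integers(0, 500, size=num_frames)

# RVQ token grid: the prompt frames are fully specified (that's where the voices
# come from); every frame after the prompt starts out masked at every level.
rvq_tokens = np.full((num_frames, num_levels), MASK)
rvq_tokens[:prompt_frames] = rng.integers(0, 1024, size=(prompt_frames, num_levels))

# The model's job: given (conditioning_tokens, rvq_tokens), predict every MASK so
# the generated frames say the script in the prompt speakers' voices.
print(conditioning_tokens)
print(rvq_tokens)
```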
Okay, so how does the model do this? And how does it exploit the structure of RVQ to address the problem we were talking about in the beginning: consistency over a long duration?
So those of us who are familiar with LLMs automatically think autoregressive models, right? Like, I might decode the first frame, then the next frame, then the next, and the next. And each time I'm decoding, I'm decoding all aspects of the RVQ tokens, from the coarse-grained to the finest-grained, one by one.
The trade-off comes from the fact that the attention heads now have to attend to a lot more of the sequence because of how finely I'm looking at each token.
Now, the way SoundStorm exploits the property of RVQ is that it has the transformer do self-attention over, if you think about it in terms of the x-axis of time, the whole sequence, and decode it all at once for the first level of the RVQ token. Let me see if I can annotate this.
So in this figure you have time, which you can think of as the episode going from 0 to when it ends here. And Q1 is the first quantization level, Q2 is the second level, then the third. Again, we're going from coarse-grained to fine-grained.
Now the auto regressive case would have you decoding this token and this token and this token.
So you would be able to decode the frame with exactly the identifying characteristics of the speaker, but you can't go on for too long, because you've used up all the attention mechanism going vertical.
Aman Khan: I see. So the limitation is back to attention. In this case, in the transformer architecture.
Harrison Chu: Yes. But the way it beautifully exploits the RVQ is it says, no, why don't I just decode the first layer while having self-attention throughout the entire sequence of tokens from 0 to time T?
And so it fills out this whole first layer here.
And because it's doing that all in parallel, it's consistent over a longer time horizon. You then gradually decode the second layer here, and the third layer here, with the key being that at each layer, because you're paying attention to the entire sequence, you're maintaining a very consistent representation of the sound of the frame.
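Here's a toy sketch of just that decoding order, with a dummy `predict` standing in for the transformer. It's not the real model; SoundStorm actually fills each level over a few confidence-based rounds (MaskGIT-style), but the level-by-level contrast with autoregressive decoding is the point.

```python
# Toy sketch of decoding order only; `predict` stands in for the transformer's
# prediction at (frame, level). SoundStorm actually fills each level over a few
# confidence-based rounds (MaskGIT-style), collapsed here to one pass per level.

def decode_autoregressive(num_frames, num_levels, predict):
    # Frame by frame: finish every level of frame t before moving to frame t+1,
    # so attention has to cover every level of everything decoded so far.
    grid = [[None] * num_levels for _ in range(num_frames)]
    for t in range(num_frames):
        for level in range(num_levels):
            grid[t][level] = predict(grid, t, level)
    return grid

def decode_level_by_level(num_frames, num_levels, predict):
    # Level by level: the coarse level is filled across the *entire* timeline
    # first (conceptually in parallel), attending to the whole sequence, which
    # is what keeps the speaker consistent; finer levels are layered on after.
    grid = [[None] * num_levels for _ in range(num_frames)]
    for level in range(num_levels):
        for t in range(num_frames):  # conceptually parallel within a level
            grid[t][level] = predict(grid, t, level)
    return grid

dummy_predict = lambda grid, t, level: (t, level)   # placeholder prediction
print(decode_level_by_level(4, 3, dummy_predict))
```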
Aman Khan: And what's the consistency component of this? It's like consistency in terms of “this is the speaker, and what they sound like.” Is that sort of what that is?
Harrison Chu: Correct. In a model that doesn't do this, the way you would experience it as a listener could be that Speaker 1 sounds like a man at the beginning of the episode, might sound like a woman in the middle, and might sound like a child at the end, or something like that.
There's nothing that guarantees it has the same characteristics of speech throughout the whole thing unless you look at and self-attend to the entire duration of the episode. So it means you don't have to trade off quality against duration, because you build up the quality over time.
Aman Khan: Yup, and is that the role of the conditioning tokens as well? Like, these are the things nudging it that this is Harrison's voice, or Aman's voice, or something like that?
Harrison Chu: No, the conditioning token is a little more abstract. It has no identifying characteristics. I would say the first layer of each of these SoundStream tokens is probably what starts to give it a certain identity.
And remember, the input to this model is that you and I might say the first few parts of the script completely, right? So it could be that at time one and time two these tokens are fully filled out.
And so as the model's filling them in, it's kind of based on what this sounds like here and here and here.
Aman Khan: I see. Okay.
Harrison Chu: That's basically it. The paper goes into a lot more detail about a lot of this. But I think for me, that was the most interesting takeaway of how they maintain that consistency.
Aman Khan: So let me ask you, Harrison, did you use NotebookLM to understand how this paper worked, or what mechanisms are underneath the paper? Did you try that out?
Harrison Chu: You know what, I kind of have a hot take on this. Well, I did and I didn't. I did, because the initial conversion of the paper into audio was a nice introduction.
But as I was trying to understand, what is a conditioning token? What is a conformer attention head? I could use NotebookLM's chat-over-docs feature, but I didn't find it that useful.
And this is kind of a hot take, because I actually think pasting the paper into ChatGPT and talking it over is better, because ChatGPT tries less to anchor on: oh, I'm saying this because I'm referring to this part of the text.
But products like NotebookLM will. The whole value proposition is that everything you get from it is anchored to a source. And I think that's important when you're doing research, but it's less important when you're trying to learn something. When you're trying to learn, I think it's okay to get analogies or explanations that are sort of not true, that you might consider hallucinations. But hopefully you're a good enough learner to double-check them and bottom that out for yourself.
And I think that's the case where, for learning, a bit of hallucination in the product is a good thing.
Aman Khan: I mean, to build on that for a moment, the text version and the audio version of NotebookLM are actually pretty different in that way, right? In the audio version you get this transcript.
And I found, actually, we were talking about this earlier, right at the start of the call, I put in my LinkedIn profile, and it was like, whoa, this is cool, they're talking about me. And then it would just start making stuff up, and it was truly a hallucination in the sense of the rationale behind some talking point, like: and then he moved from San Francisco to New York to start this new job. It started coming up with this story, and then it would build on top of that, and that became the new anchor for some segment of the audio file. But in the text version, yeah, it cites sources extremely generously.
I think those are two discrete experiences too, where it really depends what you're trying to do. But that was one of the things where I was hooked, and then I was pulled back out of the experience, because it was hallucinating about my life, and that really was existential.
I was like…did I do that? I don't think so.
Harrison Chu: I did not think about those two parallel tracks in the NotebookLM product. That's super interesting.
Aman Khan: And so I think that's part of it, too. There was a question in the comments.
One of the questions was: don't you think there will be sliders in the future? We were kind of speculating, like probably the next thing that would make sense here would be like you being able to converse or interject, you know, maybe this experience getting more real time on the audio side, similar to the chat experience. Like, if you think about the use case, it's like: you're going to have questions as the podcast hosts are discussing the paper, and you can ask them, oh, well, how did you get there? What's the analogy? Maybe having follow-ups or expanding on a point?
So it kind of feels like, product-wise, that's perhaps one of the directions to go. I also think hallucination and the quality of the ideas generated is something… our thought here is, is it actually a good podcast episode? Is it something you would keep coming back to? What's the staying power of this product versus this format, where there's a bit more randomness, a bit more that's organic? What are your thoughts on that?
Harrison Chu: I think in the end, if I was a consumer, I'd be less interested in trying to dial up the number of hosts or the tonal qualities. I kind of want that to be left to the, I'll just call it the creator, to decide, right? I think what will be interesting about this class of products is the pseudo-relationship you build with these people over time. Like, half the podcasts I listen to, I don't even like the content anymore, I just kind of like the people talking about the certain thing.
I don't think you'll necessarily want to change that as a user. But I can see having features where I can interject in the episode and turn the direction a slightly different way that I'm interested in. I think it's a separate question, though I have no doubt someone's gonna create a set of tools that are amazing, that allow you to adjust the tone, quality, and mood.
And what I think will be interesting to see in the next few years is not the prevalence of pure AI podcasts. I think there'll be podcasts that are like 80% real and 20% post, right? People talk about that in the movie industry all the time: we'll get that in post, in editing. There's a certain number of frames you film on a stage, and half the movie is visual effects. I think that's probably where we're gonna go. It'll just get really, really convincing.
That'll be the interesting thing to see play out.
Aman Khan: I mean, I think another interesting component of podcasts is, if you're listening to this on Spotify versus listening to us live or on YouTube, it's actually a pretty different experience. You kind of anchor on: do you like this person's voice or not? Do you actually want to keep hearing them talk?
And then, do you know them? Can you relate to them? So to me there's this interesting component of like, if you just focus on refining the quality of the audio, it's probably easier to some degree than like trying to create this convincing, persuasive video format.
So riffing on that for a moment, one amazing product use case, and maybe an obvious one here, is: imagine a Joe Rogan, you know, an incredible podcast host, cloning their voice and then creating personalized ads for you as you're listening to the podcast episode.
So based on some format, you know, here's AG1 or something like that. And I think that
instead of Joe Rogan going and recording 30-second clips for 10,000 products, he can just clone his voice once and get 10,000 personalized ads that are way higher value.
Similarly, you could say, I wonder if they're going to talk about the latest Jurassic Park movie. So you can create an episode, maybe it's a bonus episode or a specific sponsored episode, but it costs him nothing, costs them $0 to do that. So I think that's an interesting aspect of this that feels like it's testing the waters product-wise.
Harrison Chu: Interesting. The future of IP is like guarding your personal RVQ token.
Aman Khan: Yeah, your weights. That's it, protect your weights. And then there was a question of at what point hallucination should be curbed. I think the interesting component here is that hallucinations are simultaneously a detriment and a feature. They're both the bug and the feature in this experience, because you kind of want some amount of the temperature to be higher for the generation to have a more interesting, slightly off-topic conversation. I think that's part of the future of this.
Harrison Chu: I think that's an interesting one to talk about, maybe in a future paper reading, some of the frontier of what people are doing to reduce hallucinations. There's some really exciting stuff.
Aman Khan: Definitely, yeah.
Cool. This was a fun one. Maybe we should feed the transcript back into NotebookLM, post that, and see what it comes up with.
Harrison Chu: That'll be a fun one. Yeah, cool.
Aman Khan: Alright awesome. Thanks, Harrison, thanks everyone. See ya.