Alexa, Aural Attention Economy, Daily Updates, Future Ear Radio, Google Assistant, Hearables, Hearing Aids, Podcasts, VoiceFirst

056 – Ron Jaworski – Making the Internet More Audible

Hello and welcome back to Future Ear and the launch of Season Two of the Future Ear Radio podcast! When I started Future Ear in 2017, my goal was to “connect the trends converging around the ear.” Since then, I’ve been fortunate to meet many of the top experts working in the industries that are gravitating toward our ears and bring them on to the podcast to share their hard-earned wisdom and insight. My plan this season is to continue to expand the conversation into new directions and to go deeper in the areas that I covered in the initial season.

To kick off this season, I’m joined by Ron Jaworski, CEO & Founder of Trinity Audio. This is a fitting start to the new season as we explore a big topic that I’ve yet to really cover in depth on the show, which is the innovation happening around text-to-speech technology. Trinity Audio is one of the leading text-to-speech (TTS) companies, allowing publishers to quickly convert all of their content into one-click audio files that can be embedded on their website, so I figured Ron would be the perfect conversational partner for this discussion.

As someone working in the world of AirPods, hearables and hearing aids, the picture that Ron helps to paint is quite compelling. Imagine a scenario where you’re reading an article on a train and you arrive at your destination just as you’ve read half of the article. Rather than picking up where you left off later on, what if you could seamlessly transition from reading to listening to the remainder of the article as you depart the train on foot with AirPods in your ears?
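As a back-of-the-envelope sketch of how that handoff could work (purely illustrative, not how any particular product implements it): if the player knows how far into the text the reader got, it can map that position onto an estimated timestamp in the synthesized audio using an assumed average narration rate. Real TTS engines can return exact per-word timing marks, which would replace this estimate.

```python
# Sketch: map a reading position (character offset) to an audio resume point.
# Assumes a roughly constant speaking rate; the 150 wpm figure is a common
# narration-pace assumption, not anything confirmed by a specific engine.
WORDS_PER_MINUTE = 150

def resume_seconds(article: str, chars_read: int) -> float:
    """Estimate where in the audio to resume, given how far the user read."""
    words_read = len(article[:chars_read].split())
    return words_read / WORDS_PER_MINUTE * 60.0

article = "word " * 300  # a 300-word article, roughly 2 minutes of audio
halfway = len(article) // 2
print(round(resume_seconds(article, halfway)))  # resume near the 60-second mark
```

In practice a player would persist this offset when the reader leaves the page, then seek the audio stream to it when playback starts.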

That’s the broad direction that text-to-speech is headed. Speech engines such as Amazon Polly have steadily improved in their ability to sound human-like, which is enabling broad swaths of the internet to be made audible using turnkey solutions like Trinity Audio. If you want to hear what the cutting edge sounds like, just tune into the BBC’s newest synthetic voice, which the company developed as part of its “Project Songbird” initiative.
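Under the hood, turning a full article into audio usually takes more than one API call: engines like Polly cap the characters accepted per request (around 3,000 billed characters per synthesize_speech call, per AWS’s documentation), so long articles get split at sentence boundaries first. A minimal sketch of that chunking step, with the actual engine call left as a comment since it would need AWS credentials:

```python
import re

# Engines like Amazon Polly cap characters per request (~3000 billed
# characters for synthesize_speech), so a full article is split into
# sentence-aligned chunks before synthesis.
MAX_CHARS = 3000

def chunk_article(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks under max_chars, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk would then be synthesized separately, e.g. with boto3:
#   polly.synthesize_speech(Text=chunk, OutputFormat="mp3",
#                           VoiceId="Joanna", Engine="neural")
```

The resulting audio segments are simply concatenated in order to produce the one-click file the listener hears.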

Throughout the conversation, Ron shares his unique perspective on what’s driving the uptick in quality in the TTS engines, and what’s on the near and far-term horizons with this technology that we should expect. In addition, he shares some interesting insight around some of the ways he foresees companies, publishers and advertisers monetizing this content channel.

Ultimately, we’re seeing a third modality (audio) being enabled alongside text and video. The rise of the audio modality coincides with the dramatic increase in adoption of hearables. As the audio modality matures in its sophistication and broadens its surface area, it should ultimately translate to more opportunities for content consumption. In essence, our attention economy will seek to continue to grow by tapping into all the on-the-go and in-between moments that fill our days.

-Thanks for Reading-
Dave

EPISODE TRANSCRIPT

Dave Kemp:

Okay. Welcome back. Here we are. Season two. I am so excited for this season, so excited for my very first conversation here. I have an awesome guest, Ron Jaworski. Ron, tell us a little bit about who you are and what you do.

Ron Jaworski:

Thanks, Dave. It’s a pleasure being here. I’m Ron Jaworski. After a short NFL career as a quarterback of the Philadelphia [inaudible 00:00:27]. No, I’m just joking. I’m the CEO of Trinity Audio, a company I started that is trying to audify the internet, so to speak. And I’m super excited to be here and talk to you.

Dave Kemp:

Absolutely. It’s so funny that you mention that, Ron Jaworski. When I was getting ready to interview you, I was searching Spotify to see the podcasts that you’ve been on, and I typed in Ron Jaworski, and there’s a hundred different… football talk, America, and all these different things. No, I’m talking to the better Ron Jaworski. I’m not talking to Jaws today. But it’s awesome to have you here. I’m really, really excited for this, because what’s funny is when I first started Future Ear, I thought, okay, what are going to be the future use cases for hearing aids?

Dave Kemp:

That was the very central question to everything. Then obviously the consumer hearable side took off. AirPods became a monstrous hit. So then it evolved into, okay, what are the use cases going to be not only for hearing aids, but for all these true wireless earbuds, AirPods, and the like? And so as time goes on, that continues to be at the root of what I think about. And it never fully registered with me how big of a deal text-to-speech is for my world, like the impact that it’s ultimately going to have.

Dave Kemp:

I’ve had a number of conversations with you, and it just was like, yeah, this is a big deal. I understand it. But the gravitas never fully registered. And then what happened was I actually came across BBC’s synthetic voice that was out of Project Songbird, and it just blew me away. I’m like, “Oh, my God. If this is the way that the synthetic voices in all of these automated text-to-speech engines are starting to sound, they sound really, really realistic.” And it started to get me really excited, because ultimately it made me think that what happens if you not only think of spoken audio as being podcasts, but you think of the entire internet as being made audible?

Dave Kemp:

This idea that any article you read, your local newspaper, the New York Times, the Wall Street Journal, any type of content that you read, in the not too distant future you can consume it in the same way that you do through video or through text, except it’s a third modality, done through audio, and done in a really compelling way. What are all the different facets of that? So I’m really excited to have this conversation today, because I think this obviously has massive implications for the world that I operate in.

Dave Kemp:

If you’re a hearing aid wearer, suddenly you not only have this device that serves as an amplification tool to help improve your quality of life, but you can also think of it as a newspaper for your ears. If you’re a huge fan of the New York Times and you read it every single day, what happens if you can just listen to 50% of those articles, or 100% of those articles? So I wanted to get Ron on because he’s the expert in this space. Let’s start with Trinity. This is your baby. You created this thing and you had an epiphany. So tell us about how this whole thing came to be and what the big vision is here, because I think we’re on the same wavelength about the potential of text-to-speech.

Ron Jaworski:

First of all, yes. Yes to everything that you said. Definitely right. This is the same vision that I have. The epiphany moment basically came as I was going down the elevator. I was reading an article on my mobile phone, got into my car, and thought to myself, why can’t I listen to the article I just started reading? This was the moment. It was 2017 and I told myself, “It doesn’t make sense that I don’t have an easy solution where I just press play and this is it. I’m listening to it.”

Ron Jaworski:

This was the eureka moment, and I said, “Okay, I need to set forth and do that.” I have a history coming from the ad tech industry, from video products, dealing with media companies, dealing with users that consume content, and dealing with advertisers, because we also offer monetization on top of our solution. I understood that there is a product that, first of all, makes sense for the user because, yes, there are times, there are mindsets, where I want to listen instead of reading, instead of watching. I want to listen.

Ron Jaworski:

I understood that there is… In 2017, for those of us that like audio, people for whom audio is their thing, we started feeling like, okay, this is… Our time is coming. It’s coming soon. So I did understand that there was market education to be done, but definitely media companies, publishers, and content creators would start to understand in the coming years the importance of audio. Content creators and media companies are always looking for new ways to engage users, a new user experience. And of course, if they can add a monetization layer and also generate some revenue out of it, thumbs up.

Ron Jaworski:

The last thing is the advertiser. I was sure that the same shift that happened with TV, when YouTube started and a lot of the budgets went from TV advertisement to online video, was going to happen with audio, from radio to online audio in different forms, due to all the additional data, information, and targeting that you can use within the digital landscape. So, by the way, this is the reason it’s called Trinity, because it’s the holy triangle between the user, the publisher, and the advertiser.

Ron Jaworski:

And of course I always tell this joke that it’s also funny to have the holy trinity from a couple of Jews in Israel. But this was the moment that I understood it. I actually remember it clearly, sitting in my car and saying, “Okay, there is something here.” Actually, the epiphany was both. I understood that if it made sense for me, it would make sense to a lot of other people. And I understood that it checks all the boxes, all the pillars of the internet. And I remember the first media conference when I had… It’s funny.

Ron Jaworski:

I had a webpage. I think I even… of USA Today with some sort of an embedded player. This was the first design that I had in mind. And I went to a different [inaudible 00:07:44]. I talked to people from USA Today and from the Wall Street Journal, and I showed them, okay, this is what I want to do. And everybody looked at me, “Okay, so you are the crazy guy in the room. Okay, we just wanted to know who it is.” But, you know, three years later, Trinity Audio is a fact. And more and more publishers, media companies, and content creators are joining on a daily basis.

Ron Jaworski:

And as I said in the beginning, our vision is basically to audify the internet. This is our aim. Turn any textual asset to audio, add a voice layer to it, and have it accessible. As you said, walking down the street, have your personal assistant give you any kind of information, any kind of data. Anything that you want will be available with a voice command.

Dave Kemp:

Yeah. No, I love it. It’s funny. I feel as if I’ve met more audio companies and audio engineers from Israel. There must be something in the water over there, because you guys all seem to have a lot of prowess when it comes to these different technologies. I really like the fact that you came from the ad tech space, and I want to definitely circle back to that. I think that this idea of… It’s ultimately part of the attention economy. In today’s day and age, attention is the only scarce commodity out there, other than maybe bitcoin.

Dave Kemp:

So if you think about it, that’s the one thing that everybody’s fighting over is your attention. That’s the one thing that we all have, and everybody’s fighting over it. So I think that it makes a lot of sense that you had that thought process where, okay, television ad budgets eventually migrated to YouTube, at least in some capacity. And if the attention is ultimately going to be that a lot of people are going to be spending more and more time consuming content and spending their attention through their ears, that then means that there’s probably going to be a lot of advertising opportunity.

Dave Kemp:

So I would imagine that your hunch there is right, and that there’s a lot of different… You can do pre-roll, mid-roll, all the different ads that you would insert into these in the same way that, when you’re reading an article online, you have ads intermixed intermittently. So that makes a lot of sense to me. Let’s take this example that you mention there about walking down the street. Again, applying this to the use case that I’m most interested in, on the go. I think that the world that we’re living in today, I’ve mentioned this before and I’m going to just keep hammering it home, 100 million pairs of AirPods exist.

Dave Kemp:

There’s probably a whole lot more than that. I believe that was as of the end of 2020. I’m sorry, as of the end of 2019. So the pandemic only increased it, not to mention all of the competitive types of hearables that are out there. So there are hundreds of millions of true wireless earbuds that are out there, and it’s only picking up steam. So I always frame all of my thinking in these use cases under the assumption that this is new. For anybody that’s building anything today… I got really excited at the end of last year about Marsbot, because I thought now we’re starting to see apps that are legitimately being built off of the assumption that people are wearing AirPods in their ears for extended periods of time.

Dave Kemp:

And that is going to continue to happen, and you can apply it to hearing aids as well. So I thought, okay, let’s start to parse this out and apply this logic. If I’m reading an article, I think that one of the most intriguing possibilities here is let’s say that Trinity Audio has the TTS engine that is running… I’m sorry. Text-to-speech, yeah. TTS. And it’s running behind that article that I’m reading. So I’m reading an article that takes me 10 minutes to read. I’ve gotten halfway through it. I want to finish it, and what I’d prefer to do is just listen to that last five minutes.

Dave Kemp:

A seamless handoff would be incredible. And so let’s say that that exists, and I would love to hear if this is something that you’re working on, where I can then consume that last five minutes. And then what’s another intriguing possibility is what happens if I have another five minutes of my walk? Is that where the voice assistant thing comes in, and I have a really short little turnstile conversation where I can more or less retrieve another type of piece of content? Walk me through your thought process here, because I think this is where things are starting to get really, really interesting, is content more or less being handed off in a seamless fashion, and then the voice assistant playing the role of the curator for the next piece of content that you’re delivered.

Ron Jaworski:

Okay. I will divide it… Let’s divide it to two groups, okay?

Dave Kemp:

Okay.

Ron Jaworski:

The early adopters, the audiophiles, the ones that are super excited about audio, that have been listening to podcasts for the past 10 years and not just the past one or two years, people like ourselves. For us, placing our finger in the middle of an article and listening from a specific place, or discussing with a virtual assistant, is a no-brainer. This is where we want to be. In many cases, the technology hasn’t caught up with our needs or our wants yet. When will it happen? My guess is one, two, three years, tops. We will be there. It will be easy for us to interact and manage our day with the virtual assistant.

Ron Jaworski:

But let’s also talk about the rest of the population out there right now. For them, even the concept of having the option to listen to an article is not obvious. I must say that we are seeing from the user engagement with our product many people, for example, that press on our player and think that it’s actually a video player, and that the image on top of the article should turn into a video. And it’s written in the player, “listen to the article” and everything. So a lot of people don’t understand that they have the option right now.

Ron Jaworski:

Another thing, by the way… and by the way, we get a lot of feedback from users who said, “Well, I pressed your player. I wasn’t sure what it was supposed to do, and I listened and it was amazing. It’s great. And thank you very much. It’s a great solution. I want to learn more.” So this is where we stand right now. I think we still have tons of market education to do, and that’s if we’re talking about just listening to the content using an AI voice. On the virtual assistant side, we definitely have much more.

Ron Jaworski:

The good thing about all of this is that we have large companies like Amazon and Google that are investing tons of dollars and engineering power and marketing into making this revolution come true. So this is the first thing. In regards to where we stand right now, I would say we are somewhere in the middle of the situation that you describe. So we don’t have the option right now, although it will be developed during 2021, to just, I don’t know, highlight a word and then start reading out from that specific word. It’s something that involves…

Ron Jaworski:

It’s not something that’s, I would say, a barrier on the tech side. It’s just a matter of integration, and again, education of the user. They need to know that this option exists. And this is, I think, definitely… Education of the market is something that I’m going to say again and again and again. But it’s a fact. This is something that is upon us to do. We are doing personalization of content. So we are learning your user behavior, we understand what you like to listen to, we know based on the publisher… It still stays in the same place of a specific publisher.

Ron Jaworski:

So for example, we work with the McClatchy Media Group, so if you are a Miami Herald user, you will be served several articles in the same session. So you don’t need to… You press play and you’re walking down the street. You can keep on walking and you get the latest articles, which are based on what is the hottest thing and on what you like to listen to. So we are about there. Now, in regards to the virtual assistant, again, this is something that we are working on.

Ron Jaworski:

We already have a beta working where you can converse, you can talk. You have the basic functionality: pause, next, increase volume, decrease volume, increase speed, things like that. And you have the discovery mode, like give me the latest news from a specific journalist. Something like that. So let’s call it our personal assistant; it is basically in beta mode right now. We will release it later on. But again, when we think about it, market education. We need to enable the microphone. Users need to enable their microphone. And then the biggest question is, what will be the method of activation?

Ron Jaworski:

Will it be push to talk, where you need to press something to activate the personal assistant and then start engaging with it? Or will it be always on, where it just waits for an activation keyword? For example, I don’t know, if you are on the Miami Herald, let’s say you said, “Miami,” and then the [inaudible 00:17:46] virtual assistant starts to work. It’s a question of privacy. It’s a question of load time. Basically, we consume the battery and the internet bandwidth of a specific user, depending on which system we use.

Ron Jaworski:

On the other hand, if it’s push to talk, we lose a little bit of this voice interface. There are a lot of questions about what would be the best way, and what would make users take the first step in starting to use it. We have no doubt that three years from now, four years from now, it will be a commodity. Everybody will use it, just because at the end of the day it makes sense. But it’s a process. The same thing happened with smartphones. When they came out in 2006, 2007, three, four, five years later, everybody looked back and said, “Wow. We are hooked.”

Dave Kemp:

Yeah. No, I think that’s a great way to frame it. It’s actually interesting; it relates to one of the next questions that I had. When I was parsing through all the Jaws interviews and I came across yours, I had actually already listened to it, but I re-listened to the one that you had with Bret Kinsella. So thanks again, Bret, for doing some of my homework for me, because I listened back to that one, and I actually wanted to ask you about this point that you make, because you made it on that podcast too, about the commoditization piece.

Dave Kemp:

This is really interesting to me, and I think it’s cool that you’re transparent about it, that basically it’s a building block, this whole text-to-speech engine. I believe at the time you said that you were using Amazon Polly. Is that still what you’re using? Okay. If you’re operating under the assumption that eventually all kinds of websites are going to have this type of text-to-speech, and that it’s just going to gradually get better and better and sound more and more realistic, what’s the next phase? How do you differentiate? Where are those differentiating factors?

Dave Kemp:

I have some thoughts as to what I would imagine it to be, but obviously I want to hear it from you of where these things can be differentiated.

Ron Jaworski:

Okay. Let’s talk at the back stage, and then we’ll move to the front.

Dave Kemp:

Cool.

Ron Jaworski:

On the back stage, we need to understand that text-to-speech has been with us for the past 20 years. So although it has made a major leap in the last two years, the technology has existed for ages. And it’s just a part, as you said, a building block. And the product itself, it’s a full building. So the TTS is definitely a part. But then you need to… For example, one of the advantages that we have in our solution that our competitors don’t have is the fact that you take our JavaScript, you embed it on the page, and then in many cases, in a couple of minutes, our algorithm does an assessment, and it knows exactly what is the textual part of a website and what is not.

Ron Jaworski:

Now, in many cases when you take a text-to-speech solution and you say, “Okay, I want to embed it on my webpage, on my mobile app,” whatever platform you use to upload your content, you publish it per article. You publish it per research piece. You publish it per whatever type of content, basically. So the fact that you can take our solution, embed it on the page, and in a matter of minutes it is relevant for all your webpages, for all your articles, this is definitely something that differentiates us. And this gives you the option of scale, so you don’t need to go one webpage at a time.
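Trinity’s detection logic isn’t public, but the idea of automatically finding “the textual part” of a page can be sketched with a simple heuristic: walk the HTML, skip obvious chrome (navigation, footers, scripts), and collect the remaining visible text. A toy version with Python’s standard-library HTML parser; the tag list and the approach are entirely illustrative:

```python
from html.parser import HTMLParser

# Tags whose contents are page chrome, not article text (illustrative list).
SKIP = {"script", "style", "nav", "footer", "header", "aside"}

class ArticleText(HTMLParser):
    """Collect visible text that is not inside a skipped region."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.chunks = []  # harvested text fragments

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (assumes well-formed input).
        while self.stack and self.stack.pop() != tag:
            pass

    def handle_data(self, data):
        text = data.strip()
        if text and not any(t in SKIP for t in self.stack):
            self.chunks.append(text)

def extract_article_text(html: str) -> str:
    parser = ArticleText()
    parser.feed(html)
    return " ".join(parser.chunks)

page = """<html><body>
<nav>Home | Sports | Tech</nav>
<article><p>Text-to-speech lets publishers turn articles into audio.</p>
<p>Readers press play and listen on the go.</p></article>
<footer>Copyright 2021</footer>
</body></html>"""
print(extract_article_text(page))
```

A production system would presumably also score elements by text density and handle messier markup, but the output of even this toy, the article body with the chrome stripped, is exactly what gets handed to the TTS engine.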

Ron Jaworski:

Yeah, and then for example, I had a call earlier with another website that we did some testing with. He said, “The fact that in a matter of seconds all my latest articles, and even the articles that I wrote 10 years ago, can be in an audio form, this is [inaudible 00:21:50].” This is the first thing. And then of course turning text… First of all, the analysis of the textual part of a specific article or whatnot. Then you have, of course, the creation of lists, as I said earlier. You have a playlist. You’re pressing play. You’re not looking for an article for two, three, four, five minutes. You are listening for as long as you want. This is the second thing.

Ron Jaworski:

The third thing, and this is something that is super robust, is something we started working on at the beginning of last year, and actually two weeks ago we released a major product that is built on all those different products. So we have our CMS, which is our content management system, where basically any kind of audio content that we create for the publisher is stored. So once the publisher starts working with us, all the audio content that we provide is on the content management system. There, they can mix and match. They can create an RSS feed from any kind of content that they want.

Ron Jaworski:

By the way, all of that is categorized, so anything that is related to voice SEO, which will become a major thing in the coming years, is already being set up for. So we are actually preparing for what’s going to happen two or three years from now. And for any kind of engine, mainly Google, of course, but not only, it will be easy for the publisher to have a higher SEO rank, because we know that voice SEO will be a major thing in the coming years. And basically, we index the audio files that we’re creating.

Ron Jaworski:

But not only that, any kind of search queries that we will need to make to enable voice discovery are easy for us, because we have already attached the major data that is relevant to a specific audio file. So we know that any kind of virtual assistant, ours of course, but anyone else’s too, can also find the relevant audio file for any query. And then using this RSS feed, you have the option to distribute the content, using our CMS, to any kind of audio platform. So if you want to have your content as a media company on Spotify, Google Podcasts, Apple Podcasts, iHeartRadio, or whatever kind of platform you want, with a click of a button, it’s there.
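The reason one feed can reach Spotify, Apple Podcasts, iHeartRadio, and the rest is that they all ingest the same podcast RSS format. A minimal, hedged sketch of generating such a feed from a list of synthesized audio files using Python’s standard library (real feeds also carry iTunes-namespace tags, artwork, publication dates, and so on; the publisher name and URLs here are made up):

```python
import xml.etree.ElementTree as ET

def build_podcast_rss(title: str, link: str, episodes: list[dict]) -> str:
    """Build a minimal podcast RSS 2.0 feed from episode metadata dicts."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    for ep in episodes:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = ep["title"]
        # The enclosure element is what podcast apps download and play.
        ET.SubElement(item, "enclosure",
                      url=ep["audio_url"], type="audio/mpeg",
                      length=str(ep["bytes"]))
    return ET.tostring(rss, encoding="unicode")

feed = build_podcast_rss(
    "Example Herald Audio",  # hypothetical publisher feed
    "https://example.com",
    [{"title": "Morning news brief",
      "audio_url": "https://example.com/brief.mp3",
      "bytes": 1048576}],
)
print(feed)
```

Submitting that feed URL once to each platform is, in essence, the “click of a button” distribution Ron describes: every new item appended to the feed shows up everywhere the feed is registered.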

Ron Jaworski:

In addition, Alexa Skills and Google Actions use the same RSS feed. We can push the relevant content, and we can pull it from, let’s say, I don’t know, the most listened to, the most recent, a specific category. Let’s say the recent news stories, the recent sports stories. And you can have different feeds, and you can have different channels, whether it’s within an Alexa Skill or on Spotify. It doesn’t really matter. But you have a new way to distribute and connect with your users everywhere, all the time.

Ron Jaworski:

And I want to mention two things about audio here. In the last three years, we have seen that podcasts are booming and audiobooks are booming as well. And I think it’s something that we need to hold on a minute and think about, because you would imagine that one should grow at the expense of the other. It doesn’t make sense that people would listen to more audiobooks and more podcasts. But what we found, and this is how it makes sense, when you were talking about on the go, is the fact that audio stands for new time throughout the day.

Ron Jaworski:

It doesn’t come… In many cases we talk with media companies who say, “Okay, but if they listen, they won’t read my article.” Well, that’s not the case. When someone wants to listen, they don’t want to read, and they don’t want to watch [inaudible 00:26:02]. And this is the reason why audio is booming, because it’s new time throughout the day. We are able to do something that we wouldn’t be able to do in the past. If I’m coming back from work, and I need to prepare dinner for my kids, and I want to listen to an article, I can’t read while I’m cooking. I can’t read while I’m driving. Hopefully, I’m not doing it, because there are some people that are reading while they’re driving.

Ron Jaworski:

But this is what we are doing. This is why it’s a major revolution, because media companies find new time throughout the day. So a media company says, “Okay, I’m meeting this user for two hours throughout the day. Okay, now I have another 30 minutes where I can meet them.” This is a major thing for them. This is an epiphany for them, saying, “Okay, we need to leverage audio, because our users want to communicate with us. They want to consume the product that we have to offer.”

Ron Jaworski:

A bit of a side note; I’ll go back to the system. Now, one of the things that we released two weeks ago… As I said in the beginning, one of the differentiators for our system is the robustness, the scale, the fact that we can audify your whole content in a matter of seconds. But we do understand that in many, many cases you want to have the option to do something which is more customized. You want to control the content that you are listening to. You want to have a news flash, whatever you name it. You want to have specific content that is not TTS, as simple as that.

Ron Jaworski:

It’s not something that will be published on your webpage, something like that. It’s something that you want to edit, something that you want to renew all the time. So we created an editor for journalists, reporters, writers, it doesn’t really matter, to create their own types of content with more control over the voices, the pronunciation, the tone. They can play with it more. They can edit it. Because they want the option to control a specific article in different manners, like changing voices, or they have some problem with pronunciation that they want to pronounce differently, or something that they want to release all the time, like a news flash that they want to update every two to three hours.

Ron Jaworski:

So we have this option, and we built a product called the Octopus. That’s its name, because it touches many different places. And basically it offers a news company, a sports company, whatever, the option to have their AI newsroom, so to speak, that updates all the time and sends all the relevant, let’s call it, new reports or new information to all the different angles. And with a click of a button, they update it and it’s on Spotify, on Google Podcasts, on their Alexa Skills, or wherever they want. So with that solution, we basically give the option of robustness on one hand and customization on the other.
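The per-article control Ron describes, changing voices, fixing pronunciations, adjusting tone, is typically expressed through SSML, which the major TTS engines accept in some form. A hedged sketch of wrapping a snippet in SSML with a slower speaking rate and a forced pronunciation (tag support and phoneme alphabets vary by engine and voice, and the IPA string here is only illustrative):

```python
# Sketch: SSML markup gives editors control over delivery and pronunciation.
# <prosody> adjusts speaking rate; <phoneme> forces how a word is spoken.
def to_ssml(text: str, rate: str = "95%") -> str:
    """Wrap text in a minimal SSML document with an adjusted speaking rate."""
    return f'<speak><prosody rate="{rate}">{text}</prosody></speak>'

snippet = (
    'The CEO, <phoneme alphabet="ipa" ph="jaˈvɔʁski">Jaworski</phoneme>, '
    "founded the company in 2017."
)
print(to_ssml(snippet))
```

An editor tool would generate markup like this behind a form, so a journalist tweaks a pronunciation or pacing without ever seeing the tags.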

Ron Jaworski:

And two of the major things that we are dealing with right now for 2021 are definitely the voice layer, which as I said earlier is in beta, and of course more and more personalization for the users, because I think that is definitely a major one.

Dave Kemp:

Yeah. No, I think those are… I got so many different thoughts on this. The first thing I want to say, going off of the side note that you made, is I can’t agree more with what you just said about this idea that at the end of the day, like I said, attention, very scarce. You only have so much that you can use and it’s finite. So it’s not as if I can use an hour retroactively or anything like that. So it’s an efficiency thing. It’s that feeling of I’m efficient. As an anecdotal example, this is a little embarrassing to admit, I was going to walk my dog the other night, and I couldn’t find my headphones. And I couldn’t bear the thought of walking in silence.

Dave Kemp:

But what it really was, I think, is that efficiency thing. This is going to be 30, 45 minutes that I could be spending consuming information and feeling like, oh, I just knocked out another podcast, and I feel a little bit more well-researched and smart, and all these different things. So I fully support that notion. You get 24 hours a day. You should be sleeping for seven to eight of those, so we’re really talking about 15 to 16 hours, and it’s like… As an advertiser, they’re trying to get at you at any given moment, and you want to just consume content.

Dave Kemp:

I feel like whether you’re reading something, or you’re watching something, or now you’re listening to things, I think you’re right that it’s an additive thing. It’s a new modality that plays into those times throughout the day. And again, I think a lot of those times tend to be those in-between moments, and that’s where I tend to think the biggest opportunity here, especially around text-to-speech, is: the handoff. If you’re watching a video and you can still relay some of that information through audio or through an article, and then you want to hand it off, because you have to walk five minutes from the subway to your destination, and you’re, again, operating off the assumption that everybody’s wearing things in their ears, that to me just screams opportunity.

Dave Kemp:

So I agree with you that it is that additive piece. The other thing I really want to get into here, and I definitely want to touch on some of the other things that you mentioned there, but the piece that I really want to talk about next is around the… You mentioned a little bit ago that in the last two years… TTS has been around for a while, but in the last two years we’ve seen dramatic increases. Like I said, what really caught my attention was Project Songbird from the BBC, like, “Hello, I’m the new synthetic voice from the BBC.”

Dave Kemp:

I'm like, this sounds like a person. And you had a really fascinating Twitter exchange with somebody, which I was looking at, where you were talking about a study that you conducted. I want you to share that study that you guys did, and then use it as a reference to speak to how much these voices are improving, and maybe the backend reason why they are sounding more and more humanlike.

Ron Jaworski:

Okay, so the experiment that we did, and we did it a couple of years ago: we took, I think it was eight, yes, eight different audio pieces of the same content, by the way, and we spread them within our company, and we asked the people in the company to rate whether each was human or machine. From those eight pieces, there were two that 100% of people said, "Okay, this is a machine." There were two that 100% said, "It's definitely a human." And the other four were 50/50, something around that area. And the one thing that nobody knew is that all the voices were AI voices.

Ron Jaworski:

And what we tried to find out, actually what I tried to prove, is that people are biased. If I'm telling you that this is an AI voice, you will try to find where the machine got it wrong. You will sit there: "Wait. There you go. You didn't pronounce that correctly." And I think this is the problem. When I talk about this experiment, it's important for me to mention two major lines that are basically converging. On one hand, we have text-to-speech that, as I said, over the last two years is becoming a more and more robust, more human-sounding solution, due to the progress in machine learning and AI that makes it easier for a synthetic voice to sound human. This is one thing.

Ron Jaworski:

So we have this line moving forward. On the other hand, we have us as human beings engaging, having discussions, having conversations with AI. We are talking to virtual assistants, we are talking to voice chatbots, we are talking to machines more and more on a daily basis. And because of that, our ears are becoming more and more tolerant of mechanical voices. So those two lines are converging. And I think it's in the near future. I'm not talking 10 years or five years. Even less. It will become easier for us because we will get used to it. It won't bother us the way it bothered us in the past, when we heard a TTS engine and it sounded like Stephen Hawking, something like that.

Ron Jaworski:

So this is where we are right now, and it's just getting better, and it's improving exponentially. Another thing which is super important, by the way, is branded voices. And when we're talking about branded voices, what I mean is… The best example, and this is my dream, and if somebody in the audience is a friend of the CEO of Nespresso, please make the intro or get a call: my dream is to take George Clooney, put him in a studio, generate a George Clooney text-to-speech engine, and have all Nespresso correspondence with their clients be in George Clooney's voice.

Ron Jaworski:

And think about it. You wake up in the morning. You say, "Alexa, enable Nespresso," and then George will say, "Good morning, Ron. How are you? What would you like to order?" I will order the capsules that I want, and then George will tell me a joke, or tell me about the latest movie that he's working on. And I'll say, "Thank you very much, George," and I'll wait for… Or I'll go to the customer center, and then again, the voice chatbot will be in the voice of George Clooney. This is what I imagine.

Ron Jaworski:

And there are tons of different brands, and a ton of different advertisers, and each government, each organization, and each university will find their own unique voice. And by the way, this technology, two or three years ago, would probably have cost you at least a million dollars with the actor and everything in it. Of course, not with George Clooney. George Clooney will definitely cost more than that, but any other kind of voice actor. Today, definitely for big brands, it's not an issue. It's not big bucks. But it will keep getting cheaper and cheaper, and everybody will use it.

Ron Jaworski:

And you'll have the local dealership with their own unique voice assistant, and the local pizza place the same, with their own unique voice.

Dave Kemp:

Yeah. No, I think this is a great point that you make, because I have two lines of thought here. First, I think that it's highly important. And this is a trend that we're seeing now in the voice ecosystem much more broadly: the characterization that we're going to have an oligopoly of just Alexa and Google Assistant, with everything funneling through them, is quickly being broken down. I think people realize now it's going to be a land of probably millions of voice assistants, all kinds of different voice assistants.

Dave Kemp:

It's going to be a metaverse of them, basically. I think that raises the question, again, of one of the constant themes I've heard from a lot of the different folks operating in the voice ecosystem when they're communicating to brands: "What does your brand sound like?" Sonic branding, and a spokesperson like you mentioned, and I think this is another really good example of it. In a world where, for so much of that content, there's no skeuomorphic interface. You don't have the app in front of you. You don't see it. You don't have the advertisements, whatever that may be. Everything's audio.

Dave Kemp:

Who knows? Maybe there will be some visual modalities that go along with it in an ambient type of environment, but by and large I think it's going to be something where it's all based around audio. So I do think there's going to have to be a lot of thought put into every single brand: what exactly is that umbrella, that overarching theme of our brand, in an audio-only setting? So this idea of having a branded voice, more or less, whether it be for the ads that are about you or the owned media that you're putting out as a brand, makes a lot of sense to me.

Dave Kemp:

The other thing that is running through my head here is the fear that some might have about, again, oh, AI is going to take all of our jobs. That kind of thought process. An interesting tangent to this is if you look at the media ecosystem today, one of the biggest trends is newsletters. I look at Substack, for example. It seems like everyone and their brother now has a newsletter that they're publishing through Substack. It's the hot thing to do, and I look at that and I say, okay, if we use that particular medium and apply it to this trajectory, and we say that media is more or less migrating to our ears, and all those newsletters become audiofied in some way, I see this maybe going down one of two paths.

Dave Kemp:

Now, it could go the TTS path, with a lot of specialization and customization around what maybe you want your Substack to sound like. But the other option I could see is kind of like what Ben Thompson has done with Stratechery, where he actually reads it himself. He has a partner that reads his quotes, so it's not him reading his quotes, but it has a very distinct feel to it. And again, it all ties into this theme that media seems to be going to the ear. My question to you applies those two thoughts: clearly, as more and more media migrates to your ear, you're going to want to have a specialty touch to it. You'll want to have something that differentiates in the way that it sounds, in the way that it's read.

Dave Kemp:

What are your thoughts on all of this? Do you think there's room for both, and that some will just opt to have it done manually, where they'll upload everything in their own voice and read it all, and there are obviously a lot of voice actors that exist? Or does that opportunity fall more on the TTS side, where you create your own TTS engine that's branded around your voice, but is able to be automated as an engine?

Ron Jaworski:

That's a great question. I think there will be a mixture. I think at the end of the day most of it will be AI, because of the efficiency and the cost. A voice actor, definitely everybody likes to listen to a great voice actor. But it's also important to understand, I listen to a lot of books on Audible, and in many cases there's a good chance I would probably prefer reading the book myself over the voice actor they brought in to read it for me. So it's also an expertise, which is important: getting the right voice actors and making a good enough experience.

Ron Jaworski:

And it's costly. So I think in many cases, for the highlights, or something which is more personal, something like, I don't know, the Christmas greeting, it will be a human voice. And the ongoing things, the daily routines, the daily updates, and things like that, will use an AI voice. What we need to remember is that at the end of the day, with the progress in humanness, if there's some sort of a word like that, AI voices will get quite close. It will be hard for you to differentiate a human voice from AI.

Ron Jaworski:

It's already happening in many cases. In a lot of cases we're getting feedback asking, "Is this really an AI voice?" And this is now. So think what will happen two or three years from now. So this is the first thing. The second thing, I think the differentiator will be the different voices for brands. So you'll have the New York Times voice on one end, but you'll also have a unique voice for… By the way, as the technology becomes cheaper and cheaper, you'll have a unique voice for, again, the local dealership, the local newspaper, things like that. Even the public school will have its own unique voice reading out the relevant books for the kids.

Ron Jaworski:

So I think… And by the way, if it's a known voice, it also has its own characteristics, so it's fine. But I think in the end it will be 95%, 97% AI. And for those unique, vintage moments it will be a human voice. Every time a revolution starts, everybody thinks, oh my God, what will happen now? We won't have any more jobs.

Dave Kemp:

New jobs come about.

Ron Jaworski:

Yeah. All the time. All the time.

Dave Kemp:

One thing that I'm curious about, another aspect of this that I've been thinking a lot about lately, is the whole notion of what is a podcast? Obviously, everybody has a preconceived notion of what a podcast is. It's a 30-minute, a 15-minute, a 60-minute interview, or a soliloquy, or a group chat. But my point is that in a world where truly anything can be converted into a podcast, more or less, if we consider a podcast to be kind of like a snippet of audio, what is the role that you see for… You mentioned earlier Spotify, some of these major content… I think of them as warehouses. CMS, I guess, is the more appropriate term.

Dave Kemp:

But these are places that have gigantic catalogs. Obviously, Spotify starts with music, then they move into traditional podcasts. But I envision that in the next few years, as this text-to-speech world really balloons, a podcast, like you said, more or less could come from a major publication, or from any type of publication that's putting out recurring content on a daily basis, for example. It gives you the ability to turn that into a piece of audio. And then I look at them and think, okay, so then they have a ton more audio in their catalog.

Dave Kemp:

And I know this applies with you guys having your own CMS catalog. I'm just thinking that clearly there comes a point in time where the catalog becomes so gigantic that it's hard to even wade through it. There's so much content. And then it's like, okay, how do you solve that? And everybody would probably say, "Well, you have to have some mechanism of discovery." So when it comes to these content management systems, how do you envision, across the next few years, that becoming something where, on a per-user basis, you are able to efficiently connect the user to the content that they're looking for?

Dave Kemp:

You mentioned some of the metadata that you guys are getting today. I'm just really fixated on this thought: in the same way that Twitter has these smart algorithms that surface content based on what the people you follow like, and you can apply that to any of the different social media tools. They've all been innovating around this to make you spend your attention on their app for longer periods of time. They have all kinds of clever ways to constantly keep you enthralled with more information, information that's more or less personalized to you. So the question is, what does this sort of mechanism look like in the audio world? How do you, as a user, without having to go and discover all this on your own…

Dave Kemp:

And maybe that's the answer: you do have to discover this on your own. What are your thoughts around how the innovation around discovery will unfold?

Ron Jaworski:

Wow, that’s a tough one.

Dave Kemp:

It’s like the million-dollar question. Boom. Go. On the spot.

Ron Jaworski:

No doubt about it. Okay, let's start from my small domain. I think that when I look at what we are doing in, let's call it, the whole internet audio library, you have the obvious one, which is the first pillar: radio, radio shows, music. The obvious one, the one that was there when we started consuming audio on our mobile phones: radio, Wi-Fi radio, and music. And then audiobooks started to become more and more common. This is the second pillar. You have the third pillar, which is podcasting, which has been with us for more than 10 years but only really started getting into our lives…

Dave Kemp:

With Serial.

Ron Jaworski:

Yeah. Exactly. And when I look at Trinity Audio, I think Trinity Audio is the fourth and last pillar, which is taking all the textual assets in the world and turning them into audio. So those are the four pillars that build the audio library, and I feel that Trinity Audio is definitely in the place where the most content, the most potential content, is. By the way, we did, for example, a summarization for the year. We did some checkups on how much content we create and how many users we reach, and we found out that we create hundreds of thousands of content pieces in a single month.

Ron Jaworski:

Hundreds of thousands of audio pieces [inaudible 00:50:37]. This is crazy. This is something that no podcast company is creating. We meet millions of users on a monthly basis. It's already robust, and we are just at the beginning. So this is where we are right now. And so where do I see Trinity Audio? Well, I see Trinity Audio meeting the listener and giving him a personalized experience, in the sense that we will meet him at different times throughout the day. We will meet him when he consumes different publications. We will meet him when he's traveling through different states.

Ron Jaworski:

So we will be able to give him what he likes and what he's used to consuming, based on his preferences. Now, how you combine it, I don't know. If there's a close integration with Spotify, then of course Spotify can do the combination between what we know about him and what Spotify knows about him, and give him the relevant mixture of music, radio shows, podcasting, news, sports, and so on and so forth, and any other kind of information. But I think at the end of the day you will have your own… You will have the David Kemp radio station-

Dave Kemp:

Good radio station.

Ron Jaworski:

-with all of the relevant content. You will have your own personalized radio station with the content that you like, with the news that you like, and the radio shows. In the relevant app, you will have… And I'm sure, and this is what we are aiming for, that Trinity Audio will play a major role within all of that, because when you are driving to work in L.A., for example, and it's, let's say, anywhere between a 30-minute and a four-hour drive, you want to consume different types of content. Some of it will be super educational. Some of it will be gossip. Some of it will be music. And Trinity will be a portion of that mix.
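The "combination between what we know about him and what Spotify knows about him" that Ron describes can be sketched as a simple preference blend. Everything below, the category names, the weights, the affinity scores, is a hypothetical illustration, not either company's actual algorithm.

```python
# Sketch of blending two services' knowledge of a listener into one
# personalized ranking. Scores are made-up affinities in the range 0..1.

def blend_preferences(source_a: dict, source_b: dict, weight_a: float = 0.5) -> dict:
    """Weighted average of two per-category affinity dicts."""
    categories = set(source_a) | set(source_b)
    return {
        c: weight_a * source_a.get(c, 0.0) + (1 - weight_a) * source_b.get(c, 0.0)
        for c in categories
    }

def rank_catalog(items: list, prefs: dict) -> list:
    """Order (title, category) pairs by the listener's blended affinity."""
    return sorted(items, key=lambda item: prefs.get(item[1], 0.0), reverse=True)

tts_prefs = {"news": 0.9, "sports": 0.4}    # e.g. from article-listening history
music_prefs = {"music": 0.8, "news": 0.5}   # e.g. from a streaming service
blended = blend_preferences(tts_prefs, music_prefs)

catalog = [("Morning headlines", "news"), ("Jazz mix", "music"), ("Game recap", "sports")]
print([title for title, _ in rank_catalog(catalog, blended)])  # most-liked first
```

A production recommender would obviously be far richer, but the core idea, merging affinity signals from multiple sources before ranking a mixed catalog, is the one Ron is gesturing at.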

Dave Kemp:

Yeah. Well said. I love that. I keep thinking about, like you said, we're at the beginning of this, with so many different facets of it. In my world with hearables, we're at the beginning of the biometric sensors. And I've had some really fascinating discussions on this podcast about what happens when you start to embed some of these different sensors that will be able to read, more or less, the telemetry and physiological metrics that you'll be outputting. And a lot of this will be highly scrutinized, around how it needs to be secure and all that, so I don't want to freak anyone out.

Dave Kemp:

There are all kinds of ethical debates about these things, but from a technology standpoint it will be feasible to know, okay, you're agitated, or you're really relaxed, or whatever that is. And the David Kemp audio radio station is actually a really good analogy to use, because I think it applies to any kind of audio that can fit into that. And this is where I continue to believe that the voice assistant in this scenario is so fundamental to this vision: the voice assistant will be the mediator of that content. I don't think it's just going to randomly start playing these things. It's going to be suggesting them.

Dave Kemp:

And this is where it's more of a conversation that you're going to be having. Like you said, it recognizes that you have a four-hour drive, or it knows that you have a 15-minute walk, and maybe it's the same 15-minute walk every single day. It picks up on that contextual behavior. It picks up on habits. It knows what you like. In the morning, at that given time, I like podcasts, and then maybe later in the day my brain is zapped, and I just want to listen to some instrumental music or whatever that might be. So that's where I see this getting really exciting.

Dave Kemp:

For your role at Trinity, I think you guys are part of the mechanism that feeds into this stream, more or less. You're helping bring all that text into a new modality. And we were talking about it before we started recording. You mentioned Dr. Teri Fisher, who talks about the whole notion, and I think he got this from Brian Roemmele, that we've always had to communicate with computers in their language, going all the way back to every computer language: HTML, Java, whatever these things are. And now, for the first time, we're able to communicate with computers in our voices.

Dave Kemp:

And I think that in this really infantile, primitive state of voice technology, it feels so crude that it's like, I just want to revert back to speaking their language and talking to my computer with my fingers on a keyboard. But I do think it's actually pretty profound when you think about it. As things continue to progress, and as you gain the ability to consume content in these different modalities, and these different interfaces, with these different types of hardware, all of that then enables this vision, this possibility where you don't ever really have to do anything more than just speak to the computer to get what you want. And that's what's really exciting about this.

Dave Kemp:

And I think in this scenario, where you're able to just converse and say, "Just give me some of the top stories of the day," whether that be in a podcast format, or maybe a mix where I get 30 minutes of content and five minutes of it is from a talk show with two people I really respect, five minutes is a Wall Street Journal story that's read to me in a very humanlike-sounding voice, five minutes is Twitter banter from a number of different people about the tweets they've had for that day. This is all actually becoming enabled. It's quietly happening under the radar, piecemeal, but in aggregate it's becoming, I think, more real than a lot of people realize.

Dave Kemp:

So I do think that, while in isolation, wow, text-to-speech is really important, when you start to combine that with all the other things that are happening, and you look at it from the big picture, you say, “Yeah, okay. Now I do start to see how I can have many of the same experiences that I have today with my phone in a setting that doesn’t involve my phone at all.”

Ron Jaworski:

Look, I think… You mentioned Teri Fisher, and definitely I think this is a big revolution. We talk a lot about audio and voice, voice and audio, all the time. They're two sides of the same coin. You can't have one without the other. And you talk about earpieces and AirPods and headphones, and we need to talk about how the voice interface within them will change the usage of our phone. You said leaving our phone out of the equation. This is so true, because I believe that in the coming years we'll move more and more to wearables. And wearables can be, besides headsets of course, the watch that will replace our phone, the glasses that will replace other functionalities of our phone.

Ron Jaworski:

And in the coming years, we will use more and more voice and audio to interact with the world around us when it comes to data, information, things like that, whether it's our email account or just listening to the news. And it will be with the headset, with the glasses, and with the watch we'll have on our wrist. And of course, payments will go through that and all of that. So definitely we are moving to a phoneless era in the coming years. And definitely audio and voice is a major part of it. This is one thing.

Ron Jaworski:

The second thing, and I think it's one of the things that I like about voice: a lot of the time when you talk about revolutions and new technologies, there is a new technology and the early adopters are usually the young people. In many cases, either kids, or people who adopt the technology between 20 and 30. Then from 30 and up it's being adopted, but more slowly. And if you're talking about 60-plus, there's a good chance it isn't adopted at all. I like to give the example of when the computer came into our lives, especially the laptop, and everybody started using them a lot.

Ron Jaworski:

Not like… Let's say, I don't know, 20 years ago, something around that area. And making the elderly understand double-clicking a mouse, it wasn't obvious. It took elderly people a long time to get that a double-click activates a link or makes something happen. It wasn't known to them. But voice technology is super adopted within the age group of 65 and up. Why? Because it's natural. And I think this is something new about this technology: nobody can exactly predict when we will get to the tipping point, because there are different ages that really…

Ron Jaworski:

The change, it is a change. We need to educate the market, going back to that again. And people know how to learn. They need to play with it, but it's much easier than anything else, because we are using our voice, the most natural thing for us. So I think when historians look back at this time when voice technology came into our lives, it will definitely mark a major change in the way human beings behave in general, and in the way we are getting closer and closer to becoming the cyborgs that we all talk about, with all the wearables and virtual assistants and things like that.

Ron Jaworski:

And you know, the singularity point is becoming much, much closer. Much, much closer. I'm not sure if that's a good thing or a bad thing, but it's getting closer and closer.

Dave Kemp:

It's interesting that you touched on how this is being adopted widespread, across the board. Actually, it's like a barbell. The two biggest cohorts that are adopting it the fastest are young kids and older adults. And really, that was part of the big reason why I got so involved in the voice space: that realization of, oh my gosh, you have seniors, 70-plus, 80-plus-year-olds, that are picking this up, loving it, and feeling that they're not being marginalized by tech. Not having to type on a small little glass screen when they have dexterity issues, or they can't quite see the text on there.

Dave Kemp:

So it is something where it's kind of exciting from that standpoint, where it really does feel inclusive at a really broad level. That to me is why I get really excited about this: the thought that, man, if you just put a voice assistant, or all of these voice assistants and all this functionality, in a hearing aid, it just transforms that experience. It changes the whole notion of why you would wear that device. It incentivizes you on a whole new level. Because again, you're then thinking about this differently. It's not just something that helps to amplify the sound around you and improve your quality of life that way. It also might be the main mechanism by which you consume the news.

Dave Kemp:

So those are all things that get me excited, and that's why I wanted to lead off this season with you, because I do think this is going to be a gigantic use case for in-the-ear devices in general, and hearing aids specifically. But I really enjoyed this conversation. I think we touched on a lot of different topics, and it's going to be really interesting to bring you back on down the line as Trinity continues to evolve. I'm so impressed that in, really, three years' time you've gotten to the point where, literally, you can take a piece of code, put it on a WordPress site, and every single article on that WordPress site, or any kind of blog site, content management system, whatever, within minutes has this option at the top of the article: click to listen.
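For the curious, here's a rough sketch of what a turnkey "audiofy every article" pipeline behind such an embed snippet might do: convert each article to audio once, and reuse the cached result until the text changes. The `fake_tts` stub stands in for a real speech engine such as Amazon Polly; everything here is illustrative, not Trinity Audio's actual implementation.

```python
# Hedged sketch of a batch text-to-audio pipeline with content-hash caching.
import hashlib

def fake_tts(text: str) -> bytes:
    """Placeholder for a real TTS call; returns deterministic dummy 'audio'."""
    return hashlib.sha256(text.encode("utf-8")).digest()

def audiofy_articles(articles: dict, cache: dict) -> dict:
    """Convert each article once, reusing cached audio on repeat runs."""
    for slug, body in articles.items():
        key = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if cache.get(slug, (None, None))[0] != key:  # new or edited article
            cache[slug] = (key, fake_tts(body))
    return cache

site = {"post-1": "Hello, ears.", "post-2": "Text becomes audio."}
cache: dict = {}
audiofy_articles(site, cache)
print(sorted(cache))  # every article now has a cached audio entry
```

The content hash is what lets a plugin re-scan a whole site cheaply: unchanged articles cost nothing, and an edited article is automatically re-synthesized.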

Dave Kemp:

It’s just going to get better. The synthetic voices are going to just get more and more human-sounding, and I just continue to believe that this is going to be a massive, massive use case for people in general, but really specifically for in-the-ear devices.

Ron Jaworski:

Well, I must say that every time I talk with a voice or audio enthusiast like me, basically, and like you, it's so much fun. Because I remember the path we started three years ago, and it was hard for me to find people I could share my enthusiasm with, to talk about things to replace [inaudible 01:04:44], things like that. And this group is growing. There are more and more people you can talk with about it. It's so much fun. So first of all, I want to say thank you for having me, and I will be more than glad to come back, I don't know, a year from now, and tell you, "Okay, this is what we achieved, and this is what we didn't."

Ron Jaworski:

Because it's also evolving all the time. You know, we take different turns all the time because it's evolving. And again, thank you for having me. My dream for this year is to have some podcast recordings face-to-face. This is my dream for 2021.

Dave Kemp:

Absolutely. I can't wait. When the day comes, I'm serious, when the day comes that we can all start to travel again, I'm going to be living out of a suitcase. I'm going to be so on the road, ready to just meet with people again. I can't wait for it. I'm so deprived right now. So I'm looking forward to that too, and we will 100% have to do a face-to-face podcast at some point. So Ron, really quick as we wrap up, why don't you share with everybody where people can find you and find more information on Trinity?

Ron Jaworski:

First of all, of course, our website at trinityaudio.ai. Or you can just ping me on LinkedIn or Twitter at Ron Jaworski. Again, any kind of audio or voice enthusiast who wants to exchange ideas or do some brainstorming, I'm always keen to learn more.

Dave Kemp:

Ron Jaworski, the audio guy, not the NFL guy. Not Jaws. Awesome, Ron. Thank you.

Ron Jaworski:

Thank you very much, David.

Dave Kemp:

Yeah, man. Thank you. Thanks to everybody who tuned in here to the end, and we will chat with you next time. Cheers.
