Building the Open Metaverse

Bilawal Sidhu: A Mind Living at the Intersection of Bits and Atoms

Bilawal Sidhu discusses his journey in AR/VR and AI. He provides insights on creator tools, platform strategies, and ethical AI. Sidhu recommends embracing AI as a tool and focusing on utility for mainstream AR/VR adoption.

Guests

Bilawal Sidhu
Creative Technologist & AI Creator


Announcer:

Today on Building the Open Metaverse.

Bilawal Sidhu:

It's wild how fast you can move as a creator these days. Even if you're juggling a day job, you can have a cool idea; you could spot a trend and just go for it.

And if you can find that content-market fit, sometimes you'll go viral.

Announcer:

Welcome to Building the Open Metaverse, where technology experts discuss how the community is building the open metaverse together, hosted by Patrick Cozzi and Marc Petit.

Marc Petit:

Hello and welcome, metaverse builders, dreamers, and pioneers. You're listening to Building the Open Metaverse, season five, the podcast that is your portal into open virtual worlds and spatial computing. 

My name is Marc Petit, and this is my co-host, Patrick Cozzi.

Patrick Cozzi:

Hey Marc, thrilled to be here.

Marc Petit:

We bring you the people and the projects that are at the leading edge of building the immersive internet of the future, the open and interoperable metaverse for all.

Patrick Cozzi:

Today, we have a special guest joining us on that mission.

Bil Sidhu is an immersive tech expert and visionary creator who's worked on cutting-edge AR, VR, and AI projects at Google. He's also an influential content creator with over a million followers, including me, across YouTube, TikTok, and several other platforms. Welcome, Bil.

Bilawal Sidhu:

Thank you so much for having me, a longtime listener, first-time dialer. Great to be here.

Marc Petit:

As you know, if you're a listener, the first question we like to ask is for you to describe your journey to the metaverse in your own words.

Bilawal Sidhu:

My journey to the metaverse really started at the age of 11. I fell in love with visual effects. I was in India at the time, and learning Flash 5, that was the extent of interactive stuff on the web back then. Then I saw this little show on Discovery Kids called Mega Movie Magic, and they had this amazing episode on Independence Day.

What I saw was the sequence where the mothership arrives over New York City and blows the city to smithereens. And they were doing all of this on a computer; it blew my mind. The same computer that I was using to make these cartoony vector animations could be capable of doing something that seamlessly blended the physical and the digital worlds. I wanted to learn how to do that. So, that took me down the path of learning 3DS Max, Maya, After Effects, all that good stuff.

I fell in love with visual effects, and specifically digital compositing and blending 3D elements into the real world.

That blending of the physical and digital has been a theme throughout my career. As I got older, I got more into the technical side of things and ended up studying computer science and business administration at the University of Southern California. Then I graduated in 2013, right when the mobile boom was taking place. That got me into product design, and then I really spent a decade doing AR/VR.

I went from design to production, and then to product management on the platform side. Along the way, I've continued to flex my creative muscles, making really fun short-form VFX and playing with the latest AI tools. And here we are today.

Patrick Cozzi:

Look, you've had a very impressive career bridging the physical and digital world. I'm clearly a fan of the 3D digital twins work you did at Google, and also a fan of all the viral AR effects and videos that you've put out there.

What are some of your biggest lessons learned or takeaways?

Bilawal Sidhu:

Gosh, there's a bunch to distill here. I would say the meta takeaway really is that there are fun things that you can do on a tanker ship, and then there's really fun things you can do on the speedboat. What I mean by tanker ship is large companies, platforms that have the resources to tackle problem spaces that you just can't otherwise, digital twins certainly being one of them.

But there's also the stuff you can do as an individual creator on these speedboats. Let me start with the speedboat and then get to the tanker ship. It's wild how fast you can move as a creator these days. Even if you're juggling a day job, as I did for an extended period of time, you could have a cool idea, you could spot a trend, and just go for it. If you can find that content-market fit, sometimes you'll go viral. That's just a lot of fun, and it becomes this addictive feedback loop.

It's wild to even look back. I was tallying up all my views; it's like 350 million views across YouTube and TikTok. Over a million followers, as you said. And as we were talking in our prep call, Marc, you mentioned the 100 million views on the AR filters. That a solo creator in this day and age can have that type of impact is awe-inspiring to me and keeps me going every day.

But on the other hand, there are things that you can only do on the tanker ship. I would say two areas come to mind there, both with interesting takeaways.

The first one is definitely VR cameras and VR media, if we go back maybe three or four hype cycles to the early days of VR, circa 2016, 2017. Working on immersive capture systems, Jump and VR180 at Google, and then producing content that drove billions of hours of watch time on YouTube was just a lot of fun.

You can't seed a new form of immersive media anywhere else. To be able to do that on a platform like YouTube with YouTube VR was amazing. Interesting lessons there, similar to what I said, if you want outliers, one of the things that really crystallized for me back then was you have to build for the next generation, both the next generation of consumers or viewers as well as the creators. We did a bunch of stuff, so I had a chance to produce stuff like Elton John VR, Coachella, a bunch of stuff with Warner Music. But then also cool stuff with YouTube-native creators, like Temple Run VR and SOKRISPYMEDIA, a really cool guy that ended up doing all the stuff for MrBeast in his Squid Games episode.

What was interesting to me, looking back at that, the stuff that really disproportionately outperformed was the stuff that targeted these native creators.

It was the younger viewers, younger celebrities, younger boy bands that drew in audiences to go dust off their cardboard VR headset, or go buy a headset, to be able to feel like they're actually there with these folks. That was mind-blowing. Also on the creation end, it's like, you work with a lot of top agencies. But it was Sam, a twentysomething, with the two other creators at SOKRISPYMEDIA, that just did things with the capabilities we gave them that ran rings around all the top-tier agencies.

At that time, one of the things I had a chance to work on was depth-based stitching. If you know stereo VR media, especially 360 VR media, 60 frames per second, two eyes, doing any complex compositing is just a gigantic pain in the behind.

We launched depth-based stitching back then, which, looking at monocular depth estimation now, feels like, oh my god, that's so dated. But it was super wild the type of stuff this kid was able to do. I think that content still holds up. If you go look up Tiny Tank VR on YouTube and watch it in the latest headsets, it's absolutely amazing.
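For a sense of what the "monocular depth estimation now" he contrasts with looks like in practice, here is a minimal sketch using the Hugging Face transformers depth-estimation pipeline. The model checkpoint and file names are illustrative assumptions, not anything from the Jump/VR180 pipeline he describes.

```python
# Minimal sketch: single-image (monocular) depth estimation with an off-the-shelf model.
# The checkpoint and file paths are illustrative assumptions.
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

image = Image.open("frame_from_360_rig.jpg")   # hypothetical input frame
result = depth_estimator(image)

# The pipeline returns a per-pixel depth map (PIL image) plus the raw tensor.
result["depth"].save("frame_depth.png")
print(result["predicted_depth"].shape)
```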

The last one is really digital twins. This is where I spent four years of my life. And obviously, Patrick, that's where you and I got to meet and work on some really cool stuff together. Google Maps had obviously been a thing, Google Earth had been a thing. But the opportunity to remap the world with a new set of sensors, the latest satellite aerial and ground-level imagery, and creating this ground-up digital twin that is perhaps the most ubiquitous digital twin of the world was just an amazing experience.

This digital twin powers so much; it connects the worlds of bits and atoms, which, to me, is a fundamental characteristic of the metaverse. It powers everything from navigation to flood forecasting to AR/VR. There I learned, obviously, all that goes into creating this type of digital twin, but also about how you get a very large incumbent organization to disrupt themselves rather than being disrupted. And that's a lesson that I'm going to take with me into the future. Because building something is half of the challenge. The other half is also getting it out to people.

So the last thing I'll say here is working on the ARCore Geospatial API, which was launched at I/O last year, and then the 3D Tiles API, which we had a chance to work on, which came out this I/O. Taking 15 years of imagery and making it accessible to everyone now, not just for internal applications, is absolutely wild.

To think that you can be anywhere in the world with Street View coverage, which is over 100 countries, and get sub-meter visual localization accuracy and a couple of degrees of rotational accuracy is mind-blowing, and it's out there at no cost to developers. A similar thing with 3D Tiles: this dataset that's been the crown jewels and stayed in consumer applications is now available to developers. I think we're just seeing the initial signs of what people are going to do with this stuff and just scratching the surface of all the cool stuff that's going to happen with these technologies.
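As a rough illustration of how developers can touch that 3D Tiles dataset, here is a minimal sketch that fetches the root tileset of Google's Photorealistic 3D Tiles over the Map Tiles API. The endpoint and key handling are assumptions based on the public documentation, and in practice a 3D Tiles renderer such as CesiumJS would traverse the tileset for you.

```python
# Minimal sketch: fetch the root tileset of Google's Photorealistic 3D Tiles.
# Endpoint and API-key handling are assumptions; a real renderer walks the tree for you.
import os
import requests

API_KEY = os.environ["GOOGLE_MAPS_API_KEY"]          # hypothetical environment variable
ROOT_URL = "https://tile.googleapis.com/v1/3dtiles/root.json"

tileset = requests.get(ROOT_URL, params={"key": API_KEY}, timeout=30).json()

root = tileset["root"]
print("Tileset version:", tileset["asset"]["version"])
print("Root bounding volume:", root["boundingVolume"])
print("Immediate children:", len(root.get("children", [])))
```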

We're laying these platform primitives. And gosh, every day, even when I look on LinkedIn, I'll sometimes see on your feed, Patrick, that you're retweeting all this cool stuff people are doing, stuff there's no way you could have thought of and put in some PRD on all the possible use cases.

Then finally, consumer stuff. Giving it to developers is fun, but to have had the opportunity to be one of the co-founders of Immersive View and create something that brings together 3D visualization, cloud rendering, real-world simulation, and NeRF tech in service of user experience was amazing, and a testament to the awesome folks I got to work with at Google. Because hey, there's Street View, Live View, and now there's Immersive View.

I'm super excited to see what happens on both fronts, the consumer and developer. Gosh, we're just in the first or second innings.

Marc Petit:

Talking about that, you've been putting out a lot of content recently on the latest capture technology. What excites you the most about this evolution of 3D mapping techniques?

We had photogrammetry, NeRF, neural radiance fields, and now we have Gaussian splats, GASP, if I'm not mistaken.

How do you think those technologies will impact everything we do, including architecture, real estate, and design?

Bilawal Sidhu:

One observation, especially with the Gaussian splatting stuff, is certainly the velocity at which people are going from research to product. Just hot off the press: Polycam yesterday rolled out support for Gaussian splatting. Reality capture is the bucket I put all of this stuff in, because the techniques always keep evolving, but at the end of the day, it's all a form of reality capture.

Photogrammetry is obviously not new. Both of you and your listeners probably know it's been around since before there were computers. Then, obviously, it took data center scale and large teams of experts to do this stuff. Then you got tools like Agisoft Metashape, RealityCapture, and all these other tools that let you do it on a consumer GPU.

All these technologies are getting more and more accessible. It's funny: NeRF came out maybe two years ago as a research paper, and then it bled into products, with Luma obviously leading the charge on one end.

I love listening to that episode you all had with Amit. Then the stuff Google's doing with pre-rendered NeRF fly-throughs is also super, super exciting. Gaussian splatting is maybe two months old. From two years to two months from research to product, oh, my God, I can't even fathom what the velocity is going to be from here. Can it even get faster? I don't know. Two months seems like the upper bound for how fast it can get.

Speaking to your point, Marc, about how this is going to impact all these use cases: I view it as this world model. You have these 3D world models that have three facets; visualization, analysis, and machine understanding is roughly how I bucket it. You've got these beautiful human-readable models, so NeRFs, photogrammetry, Gaussian splats, something that a human could look at. It's like, oh, this is exactly what reality looks like.

Then you've also got these abstracted representations that are really useful for all the GEOINT stuff that's been happening in Esri Land and so forth. Viewshed analysis, doing all sorts of classification work. Really, really cool stuff there.

Finally, there's this machine-readable model that makes zero sense to a human, but you can do things like localization, and obviously all the SLAM stuff that powers automotive navigation, and so forth.

All those facets are getting democratized. You could go take a 360 camera. There's a bunch of technologies out there for you to build your own VPS map. Now, on the visualization end, we're taking that step up from photogrammetry assets that ostensibly have already impacted gaming. Games like Battlefront and Call of Duty, where it's way easier to go into your fricking parking lot and photo scan a bunch of things, and then optimize them and kitbash reality together to replicate the complexity of reality, but you still have a lot of human effort in the loop.

With stuff like NeRFs and GASP, oh my God, it's getting way cleaner to go from those input images to something that just looks production-ready, and you're starting to see that bleed into virtual production already.

NeRFs are a super hot topic. But GASP, unlike the implicit black-box representation of NeRFs, is just the best of classical computer graphics meeting ML-based approaches for training, so it's easier to integrate into photogrammetry pipelines. I'm excited to see more and more photogrammetry players adopt it. The incumbents in photogrammetry, the bigger apps like Agisoft Metashape, and of course RealityCapture and so forth, and ContextCapture from Bentley, I think will not be able to help but adopt radiance fields and these light-field-esque representations.

They're not exactly light fields, technically speaking, but they certainly model the complexity of reality. I think it's just going to be a key primitive for utility and delight, and it'll be easier and easier to capture these things and now render these things. Interoperability, hopefully, will get addressed with Gaussian splatting and associated techniques.

How cool to go from stuff that required massive data centers to your fricking iPhone and a consumer GPU to do some wild stuff with it.

Patrick Cozzi:

Marc, you may know, but Bil's social media is where I go to learn about a lot of this emerging tech. One of the things I really like is the re-skinning reality video where you're showcasing generative AI with a 3D scan.

Could you walk us through how you made it, the creative vision, and the technical process?

Bilawal Sidhu:

Yeah, that was a fun one. My current affliction on social media has definitely been Twitter and X, and I think that was one of my first posts that went viral this past March on there.

The vision behind it really is just taking all the primitives that I'm interested in, like Lego bricks, and asking how you put them together in interesting combinations. Because one of the things I find is we're at this place where new primitives keep dropping, and we haven't put them together in even the most obvious permutations and combinations.

One of them for me was when Stable Diffusion 2.0 came out late last year and they rolled out something called depth-to-image. Typing in a text prompt and getting something cool is fun, but I call it slot-machine AI, where you just keep re-rolling the prompt and hoping for something interesting. How do you exert control over that image generation process?

Depth-to-image was the first inkling of that. As you know, depth is one way of conditioning image generation: it certainly models the geometry of the scene, but it loses all the high-frequency texture detail and all this other stuff that you need. My goal was really to re-skin reality.

How do you capture something and restyle it while staying faithful to the spatial composition and contents of the scene? A fancy filter, if you will, is probably the simplest way to describe this.
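For context on the depth-to-image conditioning mentioned above, here is a minimal sketch using the Stable Diffusion 2 depth-conditioned pipeline from the diffusers library. The prompt, file paths, and strength value are illustrative assumptions, one possible way to get the "fancy filter" effect rather than his exact setup.

```python
# Minimal sketch of depth-conditioned restyling: the pipeline estimates a depth map
# from the input photo and preserves the scene's geometry while restyling it.
# Prompt, paths, and strength are illustrative assumptions.
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

room = Image.open("drawing_room.jpg")  # hypothetical capture of the room

restyled = pipe(
    prompt="traditional Rajasthani decor, ornate furniture, warm lighting",
    negative_prompt="blurry, distorted",
    image=room,
    strength=0.7,  # how far to depart from the original pixels
).images[0]

restyled.save("drawing_room_restyled.jpg")
```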

Then ControlNet dropped earlier this year. ControlNet said, hey, here are all these task-specific models that have been in the computer vision space forever, for depth estimation, computing edge maps, computing normal maps. How do you use all of that to accomplish that same goal?

This experiment came out of a 3D capture I'd done. My parents had just recently retired, and they were leaving their house in India. Before they did that, being a photo-scanning nerd, I went and immaculately captured it.

I wanted to see if I could re-skin my mom's drawing room, as she likes to call it. What would it look like if I wanted to do different styles of Indian decor? Is this something that's possible? So I tried that out: basically taking those images, creating a photogrammetry model, and taking the depth maps you get from that, but then also using ControlNet to compute the edge maps. Between that combination of depth and Canny edge detection, you get a good sense of the scene. You've got the furniture that's really nicely represented with the depth map, and then the paintings on the wall, the nuances, all the furniture texture detail pop out very nicely in the edge maps.

Using ControlNet to create keyframes, and then this software called EbSynth, which is not even ML, to interpolate between those keyframes; putting all that together to re-skin reality. What was wild was when I showed that to my mom, even without really any fancy prompting, the paintings that were being replaced were regionally accurate. She's like, "Oh, this is so-and-so ruler in Rajasthan in India." I was like, oh wow. Somehow latent space is organized in a fashion that's still perhaps somewhat semantically meaningful.
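To make that depth-plus-edges recipe concrete, here is a minimal sketch using diffusers with two ControlNets, depth and Canny, conditioning a single Stable Diffusion generation. The checkpoints, prompt, paths, and conditioning weights are assumptions, and the EbSynth keyframe interpolation he describes would happen outside this script.

```python
# Minimal sketch: condition Stable Diffusion on BOTH a depth map and a Canny edge map,
# roughly mirroring the depth + edge recipe described above. Checkpoints, prompt,
# and conditioning weights are illustrative assumptions.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Depth map exported from the photogrammetry scene (hypothetical path),
# plus a Canny edge map computed from the matching color keyframe.
depth_map = Image.open("keyframe_depth.png").convert("RGB")

gray = cv2.imread("keyframe_color.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(gray, 100, 200)
canny_map = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16),
]
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnets, torch_dtype=torch.float16
).to("cuda")

keyframe = pipe(
    prompt="Indian drawing room, Mughal-era decor, miniature paintings on the walls",
    image=[depth_map, canny_map],              # one conditioning image per ControlNet
    controlnet_conditioning_scale=[1.0, 0.8],  # lean a bit harder on depth than edges
    num_inference_steps=30,
).images[0]

keyframe.save("keyframe_restyled.png")
```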

What's wild is, as you fast-forward a couple of months, tools like Kaiber AI make that super hacky workflow one click. It was cool to create this trend of, hey, let's scan something and then re-skin it, and then to see this explosion of creators doing exactly that with Luma and Kaiber, now as a one-click thing rather than this lofty pipeline.

It was a lot of fun. It perhaps also speaks to how trends these days last just two or three months. The type of engagement I got on that in March, it's nothing like what it would get now; if I posted that today, people would be like, we've seen re-skinning reality, blah, blah, this is old hat. What's the next thing?

Marc Petit:

Let's go back to speedboats and tankers because you have that unique point of view. You've worked at both big tech companies, and you are now an independent creator.

How do you think AI, back to what you just said, will empower both major studios and indie creators?

Bilawal Sidhu:

I think that is the trillion-dollar question right now. It's funny because, at a high level, I think there's a multiplier effect that's going to happen, where you're going to be able to apply high-visual-fidelity content in places where there wasn't going to be the budget for it, or places you wouldn't expect it, such as social media, for example.

You're used to these fast-produced pieces of content; suddenly you can uplevel the production quality there massively. Indies will be able to rival the output of a studio; then, holy crap, studios are going to set whole new standards altogether.

I often think about what Marvel is going to do with their treasure trove of visual umami. They've got just insane, insane, insane IP, and amazing artwork that hundreds and thousands of people have poured blood, sweat, and tears into for the longest time. I'm really, really excited to see what happens.

Yet, there's this weird handcuffing situation with all the bigger companies. Everyone's worried about, oh, well, what's the data provenance of this stuff? Can I actually play with this stuff in production? I haven't seen as many major studios play with this stuff, and when they have, there's been this visceral backlash. Think of Marvel: they just had Secret Invasion come out. The introduction sequence for Secret Invasion used a bunch of custom-trained models, it was done with generative AI, and it thematically fit the theme of the shapeshifting aliens that are the subject of that Marvel IP. And, oh my God, if you go look at the comments on the internet, it was scathing.

I think stuff like that is scaring the incumbents, the bigger studios, away from playing with it.

Funnily, and ironically enough, it's the indies that are most unencumbered to play with this stuff right now, which is where you're seeing perhaps a lot of the innovation. But I think that's just a question of time. Clearly, there needs to be some legal precedent set about whether it's considered fair use for these models to train on copyright imagery or whether it's not. I think all of that stuff will solve itself.

In parallel, you have companies like Adobe, Shutterstock, Getty focusing on datasets where they do have rock-solid provenance and provide an alternative. I had a chance to interview some of the Adobe folks working on that stuff, and they described it as putting their models on a data diet. So we'll see, we've got these massive models that take all the complexity of stuff that humans have imaged and created.

On the other hand, you've got these models that have rock-solid data provenance, trained on a perhaps much smaller, controlled dataset, stock imagery, for example. The optionality is going to be there, and it'll be exciting to see major studios hop on board. But as always, it's the innovator's dilemma in play as well. Will these folks jump on board? Will there be a new set of Pixars that emerge, or will it be the unbundling of Pixars?

It's how we saw celebrities get unbundled into online influencers and macro-influencers that in aggregate might have the reach of, say, a Tom Cruise. That part is hard to say, but I've got a feeling adoption is only a question of time. And the folks doing the coolest stuff, definitely in indie land, don't have to worry about a massive legal department.

Marc Petit:

Do you have a quick recommendation for artists who are worried about AI? What would you tell them?

Bilawal Sidhu:

One of the things I see happening is it's the demise of specialization. The way we've made content, and both of you all are intimately familiar with this, it's been this waterfall approach of specialization.

You've got modeling artists, you've got riggers, you've got texturing artists, people who make an entire career out of specializing in digital compositing, which is where I got into all this stuff. That's their career. I think what happens now is that T-shaped set of expertise, where, hey, you learn to work a little cross-functionally so you can chain a bunch of these people together into this assembly-line pipeline, is less relevant, and your deep well of expertise can now be augmented by these AI models.

I like to say you go from being T-shaped to a tripod or a table. What I tell creators and technologists that are getting into just creative tech, in general, is to embrace that. The stuff you're good at, awesome, but lean on these models to do all this other stuff.

The thing I suck at is coming up with fricking titles and thumbnails, and even that is a fricking career for people that get paid big bucks just to come up with high click-through rate titles and thumbnails. Well, you could use an AI model to do that. All these crazy ideas you had that you couldn't have accomplished by yourself, you can now. To put it a different way, it's fun to play an instrument in the symphony, but now you can orchestrate the symphony. And play the instruments you like, but orchestrate these models to do the rest.

The one thing I would say is, please, play with these models. Don't be afraid of it. Because I found when people do play with them, they realize it's not this big bad, crazy Kaiju Godzilla that's going to take their jobs. It's more like this friendly house spider that's in their backyard. Eventually, they realize, actually, this is for my own benefit. This is just another tool.

Everyone freaked out about Photoshop. Gosh, the transition from film to digital video, everyone had these pangs of anxiety. It's just that maybe the rate at which it's happening is drumming up a little more anxiety than it should. It's just another tool. And we creators will learn to operate at a higher level of abstraction and still stay relevant.

Patrick Cozzi:

There's clearly a lot to unpack with AI. I wanted to ask what potential risks or downsides most concern you around personalized, addictive AI content, and how you think we could avoid them.

Bilawal Sidhu:

There is definitely a dark side to all of this. It's not all rainbows and sunshine; I think we have to be honest about that. We've already seen this transition. Look, I'm a millennial with back pain. I still love YouTube a lot, and of course I use TikTok; that's my biggest channel. But the experience as a consumer is so different.

On YouTube, pre-Shorts, of course. Now everything has a feed of content where you don't really decide; it's an infinite feed, and the algorithm decides what to show you next. Contrast the YouTube experience, where maybe you search for something or it's on your homepage, you watch a video, and you decide what to watch next, versus just scrolling to the next thing and passively consuming in this feed-style model, which is very ubiquitous now.

I think that's going to get supercharged with generative AI. The real dystopian vision of this is perhaps you've got the perfect biometric feedback loop, whether it's eye gaze detection, and based on how the blood vessels in your eyes are moving, your heart rate, and blood pressure. Maybe you've got a bunch of IoT sensors on. Screw retention editing on YouTube, you have real-time retention editing happening to hack your attention.

Obviously, that's going to be great for monetization. You almost have this tension between wellbeing and monetization that I think platforms need to really care about, but also creators. Before that automated, fully generative feed of content happens, and I'm convinced that'll happen for some long-tail set of content that perhaps creators don't even enjoy making anyway, I think it's very, very important not to create a treadmill of your own design that exhausts you.

The other aspect that I worry about is on the platform side: how do you balance these two opposing tensions? How do you keep engagement high and keep wellbeing high, but also have a thriving multi-party marketplace where advertisers, the platform itself, and creators can all monetize this stuff?

Storage, compute, and certainly GPUs; you've got to pay the Jensen tax at some point too. But on the creation end, I love folks like Gary Vaynerchuk. I don't know if you're familiar with him, but Gary Vee is very popular for “if you're not making 10 pieces of TikTok content a day, you're just missing out.” You've got to be cranking out content every single day in this feed economy, just bombarding content online so that a few pieces break through.

I see a world where we have to make a decision where, hey, do I want to automate my job as a creator so I can spend half the time making the same amount of content, have a fulfilling life, and spend time in the real world? Or is it going to become this race to the bottom where everyone's like, well, crap, now I can automate this stuff, so I'm going to take that extra time and make even more content?

The expectation for creators to break through goes from five or ten pieces of content a week to 100 pieces of content. I think that could end badly for mental health and some of the other things we talked about.

Marc Petit:

You've been very thoughtful in your content and talking about the ethics around emerging technology, and you just talked about wellbeing.

What do you think are the steps and the mindset that those platforms, policymakers, and even creators themselves should take to make sure that AI does have a positive impact?

Bilawal Sidhu:

There's a quote that really resonated with me, or a paraphrasing of the quote, which is that generative AI is to the world of bits what atomic energy is to the world of atoms. I think that's actually the case because it can be extremely enriching and extremely disruptive, for the reasons we just discussed.

There's certainly the value alignment challenges, and that goes back to, well, whose value are we aligning to? There are values about what it means to be a creator, what's a fulfilling life as a creator. 

Same thing on the consumption end. These models become the lens through which we consume information, so it's all the stuff that social media and search companies have been dealing with. Values in many ways are geopolitical in nature, but then there are some agreed-upon universal values. I think that's going to be very challenging. I don't think there's an answer for that. I certainly don't pretend to have one.

What I've noticed is that the thing we really need to be cautious of is that, as a step towards regulation, we can end up creating the very thing we are seeking to avoid. What do I mean by that? If you saw any of the various Congressional and Senate hearings on AI, there's a big narrative happening about, on one hand, hey, look how forthright AI companies are about engaging with regulators before regulation needs to step in, before good or bad things have happened in the world.

On the other hand, that can be perceived as regulatory capture. To me, the truth is always somewhere in the middle. But it's also about, what happens if we start regulating these models, and let's say there's a decision made that open source should continue to blossom. Even there, there are challenges.

Let's say you do decide, hey, we've got to regulate open source. Holy crap. Well, how do you do that? Are you going to run some process on every Nvidia GPU that gives some centralized authority visibility into what workloads are happening? Suddenly, in an attempt to keep the world safe, you end up creating Orwell's wildest nightmare. These things that seem on the surface like a good thing to do can quickly, by virtue of this being dual-use tech, end up being implemented in a way that becomes very authoritarian and very dystopian.

Maybe the last thing I'd say on this is that I think we need to regulate different models differently. Generative AI is pretty broad. Personally, I don't worry about these image generation models that much. Yeah, people talk about the deepfake problem and all this stuff, but inherently those capabilities have existed; maybe they're now more democratized.

Creation tools have been getting democratized for a while, and we saw misinformation in the last round of elections anyway, without any of this stuff. So yeah, maybe this supercharges it. But I think that turns into how we deal with software today, where software isn't foolproof.

We all use iOS, Android, Windows; they all have zero-day exploits. They all have vulnerabilities that need to be patched. I think in this content world we can figure out a way to play that game of whack-a-mole and stay slightly ahead of it, or at least drastically mitigate the harm that happens.

The stuff I worry about personally, and this perhaps goes back to some of the GEOINT stuff that you all are certainly familiar with too, is multimodal understanding.

If you have multimodal models, like an uncapped version of GPT-4 or GPT-4 with vision, that stuff scares me. We've already seen examples of these red-teaming, blue-teaming attempts that say, let me take all the public webcam feeds that exist, then go look at all the Instagram photos taken in certain areas, and then do basically pattern-of-life analysis on where these influencers have been. And people are able to pinpoint, hey, this photo that this influencer took on Instagram was at this time in Times Square, and this is when this happened.

If you have these types of models at the edge, or certainly available without a very robust trust and safety layer, I think that could be very problematic.

It's like nation-states and centralized actors are one thing, but once you give bad actors that are smaller and independent this ability, oh my God, things could get so scary because it's big brother in a box.

I don't know how exactly you implement that, because I don't want, again, the solution there to be, well, we need to have visibility into every single model on every single device at the edge. I think what OpenAI is doing is perhaps the right way to do this. They announced vision back in March or April, and only now, in September, almost October, is it being rolled out, after they did sufficient red teaming to feel like they mitigated 80% of the risks, with the rest being stuff they can be reactive to.

Maybe that's the right way to do it, put stuff out into the world, do some thinking in advance, and then be very responsive to the stuff out there.

Patrick Cozzi:

Yeah, I like that approach. I also want to double back because we were just talking about the pace of innovation and this rapid innovation with generative AI. You're both a creator and a technologist, and I wanted to ask about what excites you but also what concerns you about this.

Bilawal Sidhu:

The pace of innovation is exhausting, and all the AI creators that I talk to today are pretty much exhausted. They just can't keep up with everything. It's almost like we have to be a bit more of a hive mind, if you will, where different creators specialize in different wells of stuff that's happening, and then put that knowledge together and transfer it to each other so we can keep the bigger picture, the broader map, intact.

Again, like I said, you've got to be a generalist. You can't hyper-specialize, because the magic happens when you put all these things together. The other part that I worry about as a technologist, and perhaps as a product-builder, is the precarious fate of startups.

I'm an angel investor in a few companies: Pika Labs, they're doing some very cool stuff; Backbone; and then a couple of other startups that haven't decloaked yet.

As I talk to a bunch of them, it's also exhausting that you have to have this pivot party every three months. Oh, this new thing came out, or the foundational model subsumed the thing that you had built a moat around. It's this constant churn that you have to deal with. In many ways, it feels like a young person's game right now, for whoever has the hunger to keep up with all of this stuff, especially as they're building.

The chasm between a new capability and a sticky product is wide; there's still a lot of iteration that needs to happen there, and then you need to keep doing it. It's fascinating. What does that mean for startups? I think right now we're seeing this unbundling effect happening. To stick to creation tools: people have taken what Adobe's doing, or Autodesk and other companies, and unbundled the problem space.

Hey, we specialize in audio, and within audio there's speech and music. We specialize in video: text-to-video, image-to-video, video-to-video. Similar things in 3D: reality capture versus text-to-3D, et cetera. I think there's going to be a phase where all of that stuff gets subsumed and a rebundling happens, and maybe that'll be a bunch of acquisitions.

I wonder what's going to happen to the startup model there. Because the other funny trend underpinning all of this is that people are rethinking the VC model itself. All the LPs are wondering, if I can get really good interest rates here and these are the returns you're giving me, do I even put money into VC? And parallel to that, hey, you don't even really need a lot of money to build these startups.

There's just so much change happening on so many fronts that it's almost hard to foresee. All the mental models we've had for tech thus far almost don't apply.

Some people try to make the “this is just industrialization” argument. I'm like, homie, this is so different from industrialization, which happened over such a long period of time. Yes, the US economy went from being largely agrarian to industrialized, but over a 30-to-40-year period. What's going to happen here? I don't know; there are so many concerns but also so many opportunities.

It's hard to see past the next three or four months. There's this almost fog beyond that point. Yes, we'll have nebulous goals like AGI, but what the heck does AGI mean? I don't think anyone's really agreed upon the definition. As the fog clears up, we just move the goalpost to the next thing. Because yo, if you showed Alan Turing GPT-4, I think he'd probably say, this is AGI. But we're still sitting here and debating it.

What excites me is so much opportunity. What scares me is the immense amount of change that I don't see going away.

Marc Petit:

From your perspective, and you have a front-row seat on all of this emerging technology, how far are we, and where are we in the process of getting AR and VR to the mainstream?

Bilawal Sidhu:

It feels like every two years we say, yes, the hockey stick moment is just next year. Then next year is going to be the year of VR. Then next year is going to be the year of AR.

I think it's interesting; you all talked about this in one of the episodes that I really loved, the Matthew Ball episode you did early on, and then again after the hype of circa 2021, 2022. It was interesting to me that the metaverse got positioned as AR/VR-optional circa 2021. It's like, hey, you don't need it. Fortnite, that's the metaverse. VRChat, that's the metaverse. And Rec Room. And where a lot of these platforms are getting a bunch of their growth is certainly not inside of the headset. I think that gives time for the underpinning technology to proliferate and get out there.

A bunch of the stuff we talked about for real-world AR, certainly a bunch of the game engine tech underpinning this. All these other problems around, how do you have that massive Coachella-like experience where you're truly embodied with thousands of people, and it feels exactly like that, versus some cruddy pixel art Web 3.0 rendition of that concert that just doesn't feel as hot?

I think AR/VR is back in vogue now after that little winter. Apple certainly rebranded it to be spatial media and spatial computing. Hats off to Zuck and Meta for really carrying the hardware space, seeding the install base for a while now, and continuing to do that.

What do I think are the main drivers or blockers to AR/VR adoption? Obviously, it's just the install base of headsets. You are not going to have that flywheel effect of creation and consumption taking place until that crosses a certain threshold of devices. I think the Quest 3, oh my god, it's already better than the Quest Pro; I regret wasting money on the Quest Pro last year, and the Quest 3 is $499. Then you've got the Tesla Roadster from Apple, super expensive, very high-end. I will probably buy that, too, because we're all early adopters here. But on the other hand, yeah, the $499 one, that's 80% of the bang at 20% of the buck. Oh, my God. I think Quest 3 is going to sell a record number of units.

Marc Petit:

What did you think about those no-display glasses, like the Ray-Bans, putting some AI into glasses that have cameras and an understanding of your surroundings? I find this fascinating.

We had Google Glass, which was a major failure. But it looks to me like we are starting to see a new class of products maybe emerging, not as immersive as AR/VR, but that augment your experience because of that machinery, because you put cameras everywhere and you understand what's around you.

It's a bit scary as well.

Bilawal Sidhu:

Yeah, it's like, what's going to be the Robert Scoble version of the new Ray-Ban glasses fiasco? It's a funny one, and I talk about this with Robert too.

You're bringing up a really good point here, Marc, which is that Meta's strategy could be viewed in two ways. On one end of the spectrum, you keep making these headsets, they're still VR headsets, like what Apple's doing; let's make them passthrough, make them mixed reality, so to speak, and start slimming them down with pancake lenses and slimmer form factors, but fully immersive experiences.

On the other hand, you've got these passive experiences where you keep putting sensors on a form factor that resembles that dream of always-on AR glasses, but no display, not immersive. Already, that's so cool for a couple of things. We were talking about Visual Positioning System and stuff like that.

Gosh, I hate doing the six-DoF dance with a cell phone, right? Oh, okay, let me extract feature points, and then the matching happens. Okay, I'm localized now. What if your headset, or just your glasses, is doing that passively? All these multimodal capabilities on the understanding side are going to be awesome.
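As a rough sketch of that "extract feature points, match, localize" dance, here is a toy single-image localization example with OpenCV. The file paths, camera intrinsics, and pre-built map of 3D points are hypothetical stand-ins for what a real visual positioning system does at scale.

```python
# Toy sketch of visual localization: detect features in the phone frame, match them
# against a pre-built map keyframe, then recover the 6-DoF camera pose with PnP.
# Paths, intrinsics, and the 3D point map are hypothetical stand-ins.
import cv2
import numpy as np

frame = cv2.imread("phone_frame.jpg", cv2.IMREAD_GRAYSCALE)
map_image = cv2.imread("map_keyframe.jpg", cv2.IMREAD_GRAYSCALE)
map_points_3d = np.load("map_points_3d.npy")   # one 3D point per map keypoint, from SfM

orb = cv2.ORB_create(2000)
kp_frame, des_frame = orb.detectAndCompute(frame, None)
kp_map, des_map = orb.detectAndCompute(map_image, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_map, des_frame), key=lambda m: m.distance)[:200]

# 2D observations in the phone frame paired with the map's 3D points.
image_pts = np.float32([kp_frame[m.trainIdx].pt for m in matches])
object_pts = np.float32([map_points_3d[m.queryIdx] for m in matches])

K = np.array([[1500.0, 0, 960.0], [0, 1500.0, 540.0], [0, 0, 1]])  # assumed intrinsics
ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_pts, image_pts, K, None)
print("Localized:", ok, "translation:", tvec.ravel())
```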

I think you could view it as approaching it from both ends of the spectrum, and they'll meet in the middle as a way of hedging your bets. Or you can view it as burning the candle at both ends. I think we have to fast-forward five or ten years to see what the right decision is. I'm curious if Apple is going to do something similar. Because there were all these rumors about there being lightweight glasses in development, and then at the last mile, they made the decision to do this MR headset instead.

Will that change? There's also a constellation of devices, with AirPods that have really fancy IMUs in them, and all these other platform capabilities that could come together in this Voltron-like perfect AR-meets-VR device. But I think the last piece that was always missing for AR/VR, or in other words spatial computing, was language, which these generative models solve.

The dream was, hey, it's way more intuitive. You don't need your mouse; now you can use your hands, and it's all spatial, and your brain... We work in a 3D world every single day. You see a doorknob, you know exactly what to do with it; you don't need to build some new mental model. But language is the one that was missing. You talked about Google Glass, circa 2013, and I was definitely a nerd that got a Google Glass Explorer Edition. But holy crap, voice assistants sucked back then.

It was really before the ML era; the ImageNet and post-ImageNet era hadn't even really taken off. In many ways, that primitive of language, being able to talk intuitively to these agents that are chatbots now, and that perhaps get embodied in these virtual spaces, is going to be very exciting. To be able to search what you see, query what you see, all that stuff that's stuck in a Lens app on your phone and in Google can now come to your headset. I think it's going to be very exciting to see how that pans out.

I don't know what is going to be the first mainstream device. Is it going to be VR going from niche to ubiquity, or is it going to be something much lighter weight that many more people buy that gets progressively more immersive? That's hard to say, but Meta seems to be trying both.

Marc Petit:

You can see potentially, as you describe it, a lot of utility, and I think that utility is what drives adoption, because it will help people.

Patrick Cozzi:

I wanted to ask you a bit on the application side. If you could wave your magic wand and bring one creative AI app or experience to life, what would you do?

Bilawal Sidhu:

I want this intermediate creation tool between a 3D app, a compositing tool, and a nonlinear editor, with collaboration capabilities. To draw an analogy, I started off in product design, and back then, Fireworks was still a thing. You did a lot of your design stuff in Illustrator and Photoshop, and then you did all your interactive prototyping by sending fricking PDFs with hyperlinks. Then along came Sketch, and then along came Figma, which brought collaboration all together, and suddenly you didn't need all these other tools.

The velocity and volume of content that creators are expected to create has gone up, but the tools are still in that waterfall approach: specialized tools for specialized creators. You want to do crazy simulation, you've got to go use Houdini. You want to do anything in production-quality 3D animation, Maya is the tool at hand. Nuke and After Effects are similar for compositing, et cetera.

Now we've seen some convergences happening, obviously a big fan of Unreal Engine, Marc. The stuff that you can do in Unreal Engine is phenomenal, and virtual production is taking advantage of it. But for a lot of the stuff, as a creator, you've got the... I think we might've even talked about this in the prep call. It's like you've got the baggage of a game engine with you.

It's the same thing on the Unity end, too, and the same thing with Blender. You can do a bunch of cool stuff in Blender. It has a compositing tool, but not many people know that. It has a real-time engine in it too, but it's got the bloat of a 3D tool. I want this lighter-weight tool where you can jump between the contexts of video editing, compositing, and 3D all in one application.

The closest thing I have seen in the market is this application called HitFilm. It's like After Effects meets Premiere with some light 3D tools, 3D capabilities on training wheels. It's like Creative Suite on training wheels, but it's not quite there as a modern creation tool.

It's good for perhaps 11-year-old me that's just getting into stuff. Oh my God, how I wish I'd had that, by the way. I probably wouldn't have had to pirate a copy of Maya and After Effects in that case. I think that could happen now because that tech has been incubated.

The other funny thing is that a bunch of that real-time segmentation, light estimation, all that stuff is sitting in AR platforms like ARCore, ARKit, and Snapchat, but creators are still stuck with this ill-tailored tapestry of tools.

I have plenty of thoughts on why this doesn't exist and all that, but maybe all these new AI capabilities can justify that investment to reimagine the all-in-one creation tool. Gosh, I would certainly love that.

Marc Petit:

Before we let you go, Patrick and I have a very interesting question to ask you.

You have 1.3 million subscribers across YouTube, TikTok, Twitter, LinkedIn, Substack, Maven, and maybe even other platforms. How did you do that? And what advice do you have for people like us to get that exposure?

Bilawal Sidhu:

You asked, how did I build up that audience? It definitely happened in phases and tranches.

Push one was probably circa 2016 to 2018. YouTube was the first platform I started on. I think I was lucky in that, as YouTube VR started to pop off, I just started making 360 and VR media content, and a couple of pieces went viral then. My top video has 54 million views, and it was the video that I never expected to go viral. It's this pumpkin zombie, a Halloween piece I made, coming down at you in front of the Transamerica building.

What I learned there is to just put in the reps, keep making stuff, and something's going to pop if you find that content-market fit. Retrospectively, you'll always find inklings of that, and then you've got to double down on it.

If you have a piece go viral, look at why. Then honestly, it feels weird as a creator. You think, “oh, I just said this already. Why would I make another variation of it?” But the world is a big place, and the way algorithms work now, 99.99% of people haven't heard of you. They haven't heard that message. You have to keep repeating that message over and over again, whether that's an informational message or a visual message. That was YouTube. It took me a while to get to even 100K. I think 2018 is where I really started reaching that point.

Then push two, for me, like for many other creators, came over the pandemic. When lockdown happened, suddenly I didn't need to sit in a godforsaken bus from San Francisco to Mountain View, and I just started cranking out way more content on TikTok.

I got in early on TikTok and then YouTube Shorts when that popped up. That was almost perfect because I think a lot of people find long-form video creation very daunting.

Whether you're into visual creation or informational content, it is very easy to make a 15-60 second short. You could pick up your phone, capture the video, edit it all in-app if you want to, or very quickly do it in a couple of hours on another tool.

I think being early to a platform or early to a technology and doubling down on it is another lesson. Right now, LinkedIn video is such underpriced attention. If anyone's trying to grow on LinkedIn, you should be posting short-form video there for sure. Same thing with Twitter right now: Elon and X are clearly pushing longer-form video. The thread era is gone; longer-form posts work there too. You should be getting in on that.

The thing that the platform is pushing, take that and bring your unique message to it. Finally, the last tranche where I'm at today is I'm obsessed with Twitter and X, and Substack too. It's like, I think Twitter and X... I will never get used to saying X; it's just so hard being like, you're like an X creator. That just feels weird to say. But that's where the AI community is. It's a refreshing change for me, having made visual shorts for a long time to start getting into, hey, “how do you make these things? What's the story behind this new research paper that just came out, and how do you contextualize it for entrepreneurs, for creators, for anyone who's interested?” And I've been enjoying doing that. You have to focus on a platform and move on to the next thing.

The last thing I'll say is it's never too late to get started. I had like 2,000 followers on Twitter or X in February of this year. I'm pushing 35 or 40K right now, hoping to get to 100K. But it felt so daunting. It's like, oh my god, I wish I'd started five years ago. No, no, no, just start today. You never know what's going to happen.

If you have a unique voice, which many people do, the platform is eventually going to see your value, and you're going to get that moment. When you get that moment, oh my god, please triple down on it. Don't get a viral post, and make your next post five months later. There are times I've done that on TikTok, and I deeply regret it.

Hopefully some takeaways for folks who are thinking about social media.

Patrick Cozzi:

I think you know we like to wrap up the show with a shout-out or shout-outs to any organization or people.

Bilawal Sidhu:

I got to give a shout-out to Google. Got to give a shout-out to Deloitte. I got to play with all the AR/VR stuff in the innovation group at Deloitte. Folks across both of these organizations supported some really lofty, crazy ideas.

I was the person that would do the opposite of the advice that my mentors gave me. The mentor's advice was always, do the tried and tested thing, move a metric that matters, and that's how you get promoted. I was lucky enough to get promoted three times over five years at Google, doing all the crazy stuff. That wouldn't have happened without all the mentors, the colleagues, and stuff that really supported these crazy asinine ideas, all the way up the stack to all the VPs.

I’m really, really thankful for that.

Of course, I'd be remiss if I didn't shout out my parents. I'm a young boy from India, and if you know anything about any conservative culture, whether it's an Asian culture, there are three careers you can choose: doctor, lawyer, and engineer. They did a great job of not indoctrinating that type of thinking in me.

As a kid, oh gosh, I really wanted to go do this 3D animation thing. They supported it, but I think they did the right job of giving me a sensible idea of the pragmatic realities of that too. Despite getting into my dream film school, USC, I ended up studying computer science and business. In retrospect, thank the Lord I did that. Who could have thought creative technology would converge like this?

Shout-out to them. They really supported me early on, got me internet access super early, and then supported all my crazy dreams outside of work. Gosh, it wouldn't be where I was if not for the amazing companies that I worked for, and certainly the two people that brought me into this world. So, shout out mom and dad.

Marc Petit:

It was amazing to have you with us today. I strongly encourage our listeners to check you out. You're everywhere, but I think on every platform, you bring a pretty unique point of view. People got a chance to hear how articulate you are.

You are at the forefront of many of the things that are happening in the creative tech industries, and I really appreciate the depth of your thinking, the fact that you look at all aspects of it and are not just surfing trends for the purpose of gathering followers. I think the depth of your content is quite amazing, so I encourage everybody to check you out wherever you are.

Thank you for being there with us today.

Bilawal Sidhu:

Thank you so much for the opportunity.

Marc Petit:

Of course, thank you to our ever-growing audience as well. We're not as big as Bilawal, but you can reach us for feedback on our website, buildingtheopenmetaverse.org, as well as on our LinkedIn page, our YouTube channel, and, of course, all the podcast platforms.

Thank you everybody. Thank you, Patrick. And thank you again, Bilawal.