Transcript: Jeff Tatarchuk, Co-founder & Chief Growth Officer of TensorWave | Startup Project

This page contains the readable, near-verbatim transcript from this Startup Project episode.

  • Guest: Jeff Tatarchuk
  • Company: TensorWave

Full Transcript

Nataraj (00:02.214) Hello, everyone. Welcome to Startup Project. Today on the show, we have Jeff. Jeff is the co-founder of TensorWave. With so much AI compute spending, I think we're in an interesting time where, for the first time ever, we're seeing a bunch of new neoclouds evolve around different strategies to provide more of the AI compute that we need. And TensorWave is one such company. So today, I'm going to talk about

how TensorWave is solving the AI compute problem, why they're exclusively working with AMD, what it takes to build modern-day data centers, and a lot more. With that, Jeff, welcome to the show.

Jeff Tatarchuk (00:47.79) Alright, it's good to be here, man. Thanks for having me.

Nataraj (00:49.924) Yeah, so a couple of years back, if you asked me, like, you know, there were smaller cloud companies. You know, Cloudflare was a smaller cloud company trying to be a larger cloud company. There's DigitalOcean, which is an even smaller cloud company. But I always wondered, why are more cloud companies not coming up? And then once the whole

AI compute spending cycle started, we've seen a lot of neocloud companies doing different things. Some are focused exclusively on bringing more infrastructure online; some companies, like OAS Data, are bringing whole new architectures. I think every time we have a big wave of new innovation, there's always a new architecture innovation that comes up, and then we see companies built around that. And I think you are one such company as well.

So, for those folks who have never heard about TensorWave, can you give a little bit of an introduction? You know, what is TensorWave, what do you guys do, and how did the whole idea of TensorWave come together?

Jeff Tatarchuk (01:58.05) Yeah. So TensorWave is, simply put, a neocloud that only deploys AMD GPUs. And how it started: my co-founder Darrick and I initially launched an FPGA cloud business about eight years ago. We were solving one of the harder problems first. FPGAs are usually more of an edge chip that's used for really low-latency use cases.

And we decided to make them available at scale in the cloud, and we were one of the largest FPGA clouds. And so, yeah, VMAccel was the company. We were working with a lot of the chip providers: Xilinx at the time, and Intel, who had acquired Altera and then recently kind of decoupled Altera once they realized how challenging FPGAs can be.

Nataraj (02:34.007) Was that VMAccel?

Jeff Tatarchuk (02:56.21) So we started with that. And we were mostly focused on video transcoding and weather modeling, not really focused on AI at the time. But it taught us what we needed to know to deploy cloud infrastructure, set up data center infrastructure, and create as easy an experience as possible for the end user.

And so we were doing that for some time, and we realized that as soon as the market shifted all-in on AI after ChatGPT launched, all the attention and resources that were going into FPGAs shifted into GPUs. And, you know, in 2022 and 2023, there was huge demand for GPUs. Nvidia had supply shortages; you couldn't get access to anything. And we had a friend come up to us,

knowing that we were doing cloud and were kind of AI adjacent, asking if we could help him and his portfolio companies get access to some GPUs. And we said, you know, we're not really focused on GPUs right now, but we do work with AMD. We had worked primarily with Xilinx, and Xilinx had been acquired by AMD about four-plus years earlier at that point. And so our company

was working closely with AMD. We became kind of their support internally with our FPGA cloud. They would send us all of their latest and greatest FPGAs, and we'd co-develop with them and get the chips debugged in the cloud. So we had already built with AMD and were embedded with them. And when they announced their GPU offering, it made sense for us to make the shift and go all in on deploying

their GPUs at scale. And so when our friend, my VC buddy, came to us saying, hey, can you help us get access to some GPUs, we said, would your portfolio companies consider going with AMD? And he said something that I never forgot: if it works, we will definitely encourage them to use AMD. And the light bulb went off. The next day, frankly, we created TensorWave

Jeff Tatarchuk (05:03.566) and called AMD and said, hey, we need a significant allocation of GPUs, and we're going to go all in to be the first and best to deploy your GPUs at scale. And we were announced in December 2023 as one of AMD's official launch partners for their first real data center chip, the MI300X. And we've been off to the races ever since. So it started with the supply shortage, and then the frustration that customers were having around

the pricing and the profit margin that Jensen and Nvidia were demanding. We kept having people coming to us saying, hey, we're tired of giving our profit and margin to Jensen; we need another solution. So we set out to help bring optionality to the market. With our experience with AMD, and seeing Lisa Su's vision and roadmap for where they plan on going, we knew that AMD was the next best solution

to go all in on, and to be that for them. And so we are the premier AMD support partner. If you are considering AMD, we're your best solution to make it as easy as possible to deploy their chips at scale. So that's how it got started, and we've been off to the races ever since.

Nataraj (06:21.318) Talking a little bit about the FPGA cloud, what was that business like? Who were your customers? I think that will help us understand how it helped in transitioning toward building TensorWave.

Jeff Tatarchuk (06:37.868) Yeah, so, I mean...

With FPGAs, there are a lot of significant challenges. There aren't compilers written, or if there are, they're not very good, and everybody was trying to solve that problem. And one of the things with an FPGA is, yes, theoretically it can be more flexible, more efficient, and more performant than a GPU. But the amount of complexity it takes to squeeze out that extra performance,

we found out very quickly, wasn't worth the squeeze. Even if we could get an extra 10 to 15% performance boost, the amount of extra work you had to do wasn't worth it. And that's one of the lessons we learned very early on, as we were one of very few companies working in this space, working with a small ecosystem.

Nataraj (07:32.262) Were there other companies trying to build a cloud for FPGAs, or was it just you?

Jeff Tatarchuk (07:35.694) There were a couple, but not very many. But it was more so the ecosystem that we were working with, working with Xilinx and Altera, and there were a few other researchers that were also trying to solve this problem. So everybody was kind of leaning in. I'll never forget being at Intel, and Intel asking us the same question: who are your customers? Who's actually buying these things? And we did have a number of customers

that were building various products with them, like the weather modeling products I mentioned, video transcoding products, a lot of simulation products. So it was rough, because there weren't a lot of tools available, and then there were the developers you needed. Because, for those that don't know, an FPGA is a field-programmable gate array. It's very, very fast, and

you can reprogram it, but you used to have to be an electrical engineer to go in and reprogram it, you know, in Verilog or whatever. And even now, the engineers that can reprogram an FPGA and create the bitstreams necessary to do it are very, very expensive and specialized. And so we ran into a lot of challenges early on as we were putting all this together. But, you know,

it taught us everything we needed to know: deploying the data center infrastructure to support this, building our relationships with our OEMs, and building a cloud platform on top to make it as easy as possible. Our initial goal was to create a cloud platform that completely abstracted away what was underneath the hood, so the customer could just say, this is what we need,

and we could spin it up and give them access to it. So working on some of those problems early on is what gave us the experience. And it was funny when we first went into doing the GPUs with AMD. Everybody was like, you know, these could be very challenging; the first batch of AMD GPUs will have a lot of problems. And we're like, man, you don't know the kind of problems we were dealing with previously. The problems we're dealing with on the GPU side are nothing in comparison.

Jeff Tatarchuk (10:02.784) So yeah: building out the infrastructure, supporting customers, and then making a platform as easy as possible for the customers to use, with a very complex chip, was our first kind of strike at this.

Nataraj (10:17.476) And for the FPGA cloud, is it similar to building a regular cloud, where you're working with different data center vendors? Who were your partners in that effort?

Jeff Tatarchuk (10:31.158) Yeah, so we actually started off colocated. When we first built, we just needed access to power and we needed to do it quickly. So we got set up in Cheyenne, Wyoming. And then quickly…

Nataraj (10:44.676) Colocated, and this is for the audience: there are already working data centers, and you basically take some space in a working data center and put your infrastructure there.

Jeff Tatarchuk (10:55.288) That's right. That's right. They'd already built the data center; they were managing it. All we had to do was bring our servers and deploy them. They had a team working there around the clock that would manage and maintain it, and if we had any issues, they would go in and help us with it. So we started with that, and then we were able to actually build our own data center. And that's where we were able to get really creative and work on some new and exciting efficiencies that made it

even better for us to do it ourselves. We were actually able to save a lot more money and learn along the way. We learned the whole stack, from acquiring the power, to building out the data center infrastructure, to managing and maintaining the servers, to building out the software and cloud stack to support it, with a very small tiger team at the time. So, yeah.

I still have PTSD from those days, but it was a great experience, and we learned a lot of lessons about when you start a company. There were a lot of incentives that Cheyenne, Wyoming had given us to move out there. They'd given us some grants and different things, and all of that was contingent upon us hiring people for…

you know, to fulfill those grants. And if you're in a town or a place where people don't want to live, it's very hard to recruit. We were trying to recruit from the coasts into a town that didn't have access to the infrastructure, or housing with amenities, or resources. It became a challenge. We had thought, hey, this is a place where we can be a big fish in a small pond. We can grow, we can work with the city and the government and make all these different moves, and they're going to give us money.

But sometimes when people give you money, it comes at a cost. And you have to count that cost when you're starting a company. That was one of the things we definitely learned early on.

Nataraj (13:00.55) So in some sense, you were right time, right place, with the right experience, deeply embedded into the infrastructure space and how to build data centers. And you were, in some ways, positioned to start TensorWave, if you look back at it. So let's talk about TensorWave. You're now building data centers with

AMD GPU clusters. And on top of that, you offer a cloud platform on which I could just deploy my own compute clusters and do inference, training, all that stuff. Is that the right way to describe TensorWave?

Jeff Tatarchuk (13:46.776) That's right. So yeah, we will do the whole thing: from identifying power, to building the data center, to retrofitting older data centers to support these GPUs, and then building out the whole stack for customers that need access to compute for both training and inference in production.

Nataraj (14:07.598) What does building a data center today look like? How long does it take? What are the challenges you're seeing? How much of what you see in the news, about getting access to power and how quickly people want to get these things running, is hype? And how much of it do you actually see on the ground?

Jeff Tatarchuk (14:31.468) No, it's a real problem. Power is the commodity that is the bottleneck right now. They can make the chips; the supply chain is fine for now. But there isn't currently enough developed critical power to deploy the GPUs against. And so there are a lot of companies out there that have access to power

that are trying to sell it, but you need to be able to get your substations and everything built and set up, and the water and all the other pieces necessary in order to build your data center on top. And those timelines take a lot longer than people usually anticipate. So yes, getting access to power is a major concern; there isn't enough power available. I think in order to hit the current demand over the next few years, we need something like 30 gigawatts. And if you think about it,

it takes one nuclear power plant to produce about one gigawatt of power. And how fast are we putting up nuclear power plants today? It's just not fast enough. So one of the things that we focus on is, you know, we have a pipeline of opportunities for us to build, and they could be completely open greenfield opportunities that allow us to build on top and do all of those things. But for us, we have to optimize for speed. We have to be able to build quickly and deploy quickly.
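
As a rough sanity check on those figures (a sketch using the round numbers from the conversation, roughly 30 gigawatts of demand and about one gigawatt per nuclear plant, not a forecast):

```python
# Back-of-the-envelope on the power gap described above.
# Both inputs are the round numbers from the conversation.
demand_gw = 30           # rough projected AI data center demand, next few years
plant_output_gw = 1.0    # rough output of one nuclear power plant

plants_needed = demand_gw / plant_output_gw
print(f"~{plants_needed:.0f} nuclear plants' worth of new capacity needed")  # ~30
```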

We have customers that want it now, so we optimize for speed. They're willing to pay a premium for being able to deploy fast, because this is a race, and the big AI labs need access to as much compute as they possibly can get. And yeah, as I mentioned, the biggest bottleneck is getting access to power. And then there's what we're seeing today: there's a lot of stranded power out in the middle of Texas.

And with some of these other larger projects we're seeing, yes, you can get access to power and you can build these data centers, but getting the people out to those data centers to build and support them is the challenge. Because it takes thousands of plumbers and electricians and, you know, everybody else to support and build it. So if it's out in the middle of nowhere, you almost have to build your own little town around it to support it. So that comes with its own individual challenges.

Jeff Tatarchuk (16:56.738) And there are some towns, as we're seeing, that are running into permitting issues, or towns or cities protesting having data centers in their town. They don't want, you know, AI taking over their jobs. So there are a lot of moving parts that have to go into identifying data centers and making sure that the data centers and power and builds all happen within the proper timelines for the end user. Because that's

really what's most important to them. So yeah, there are a lot of moving parts, and then there's the financing of all of those things that has to come together as well.

Nataraj (17:33.991) Yeah, that was my next question. I mean, you guys raised 100 million in VC. It looks like a lot of money, but when you're thinking about constructing a lot of data centers, it still looks like a small amount, considering big tech is spending, I think, about 600 billion in 2026 on capital expenditure for building data centers. And you guys have built, I'm assuming, two data centers already

that are up and running, or is the second one already up and running? Either way, you have data centers up and running, which is quite a cost. So how does the financing work today, and how are you sharing costs?

Jeff Tatarchuk (18:19.214) Yeah, it is a lot. It's a lot of money. When we got the hundred million dollars, it just came into the bank and right out of the bank. I always tell people that we're still living on ramen, even though we've raised, you know, mid nine figures at this point for our data center and GPU deployments. And so luckily we do have great partners when it comes to financing with our investors. Our lead investor

is Magnetar. Magnetar is the fund that took CoreWeave from seed to IPO and backed their financing. And they're able to help bring in some of the other larger banks to help with the financing as well. So cost of capital is an important factor in this. We are a new company; we're only two years old. And so

being able to, number one, get access to customers that have a great credit rating matters; the better the credit rating, the better our cost of capital is, and then we're able to lower the price of our deployment for the end user. So we are in a great position with funding to do everything we need. But again, it still comes down to, you know, lining up, or playing air traffic control to line up, your

power with your customer and with the financing, and making sure all of those things land and can be built and deployed at the time necessary. Coordinating all of that is really the challenge behind it. One of the things you mentioned early on is that there are a lot of people attempting to do cloud. Before, it was just the hyperscalers and then a couple of others. And then all of a sudden, one of my friends at Nvidia said they have 150

neoclouds, or something like that, working with them. I don't even feel like there are 150 AI labs in this ecosystem. So how is an ecosystem with 150 neoclouds actually able to stay alive? And I do think some people saw the profit opportunity and thought all you need is a data center and some servers with some GPUs in them; you plug them in and you rent them back to

Jeff Tatarchuk (20:42.606) open AI and you're fine when that's there's a lot more that goes into it that a lot of people don't take into consideration as they are as they're doing this. And so I think some people just thought as a financial arbitrage and I think those that saw it as such will find out the hard way that there's a lot more that goes into it.

Nataraj (21:07.406) I think the difference would be companies who, on paper, can just go to a company like Equinix that builds data centers and partner with someone like that. And then, if you can arrange financing and buy GPU clusters from Nvidia, you put those in that data center, create a cloud on top of it, and rent it out. That seems doable, on paper, if you're good at raising certain amounts of capital.

I think there are a lot of companies like that, but I would not consider them neoclouds, because there is still a technical challenge of bringing a new type of GPU into the data center and building the racks, the cooling systems, the compilers, you know, Nvidia's CUDA. Now you have to make sure that if I'm a large AI lab, I'm

running my models on both Nvidia and AMD. That means there's a technical challenge: how do you make that happen for a customer who is running a large cluster, already has Nvidia in another one, and needs an easy way to run on your cluster? So there's a technical challenge of building the actual rack and making it available to a customer through your cloud. And I think if we count it that way, there are not that many neoclouds.

But I think the neocloud market definition is slightly different, because on paper a lot of companies can look like neoclouds that are not actually neoclouds, because there's no technical differentiation, no technical challenge being solved as a company, right?

Jeff Tatarchuk (22:46.402) No, no, it's all spreadsheets and financial arbitrage. And if you put all of that together, you can make it happen, but the rubber meets the road when you actually have to deploy it. GPUs can be finicky. And, as you mentioned, there are a lot of challenges at stake.

We love it though, because we worked on some of the harder problems first in the FPGA world. And seeing some of the opportunities that we're able to work on with AMD to truly bring this to market is really exciting for us.

Nataraj (23:24.474) So, I mean, you obviously talked a little bit about your role. I mean, you're the chief growth officer; you're trying to grow this thing. Talk a little bit about customers, right? One thing is, all the top AI labs want more and more compute. That's pretty obvious. But whenever I think of training workloads, I think there's a larger market for fine-tuning. I think a lot more companies are doing fine-tuning with either smaller models or larger models.

But how many companies are going after these big clusters, and how do you see that demand shaping up?

Jeff Tatarchuk (24:00.226) Yeah.

You're right. I do think, especially at the enterprise level, fine-tuning is where we see the bigger opportunity. And there are a lot of companies out there working specifically with enterprises, where they can take one of the larger models and then fine-tune it to the enterprise's specific use case or needs. And that being done at scale, I think, is a significant opportunity. Still, I do feel like the enterprise is trying to navigate the AI

landscape, and how they're going to integrate and implement it into what they're doing. So I do see a lot of pent-up demand on the enterprise side that hasn't fully broken yet, but there are a lot of people trying to solve that. On the other front, yes: you asked how many, and again, I don't think there are more than a hundred doing significant, like, hero training runs that need thousand-plus-GPU clusters for years at a time.

I could be wrong. Maybe there are a lot more hiding underneath the bushes, but I can't imagine there being more than that. And, you know, the primary focus is on being able to support the top 10 hyperscalers and AI labs that need access to compute, for both training and for inference. Now, with AMD, AMD started out optimizing their GPUs for inference, and

that was their first use case; making them optimized for that was really important to hit the ground running. While Nvidia was focused more on training, AMD was able to capture a lot of the inference market. Meta announced that they host their Llama models on AMD. Obviously, Azure has AMD in their platform. And OpenAI just announced,

Jeff Tatarchuk (25:55.906) A few months back that they're doing a significant deployment. They six gigawatts of AMD will be deployed in the future. And so we're seeing a lot more of the AI labs interested and focused and they're still their primary focuses on inference. But as of recently, we are seeing more like if you look at our 8,000 GPU cluster that we deployed of the MI325s, it was built as a training cluster.

And so Lisa Su had given us the challenge and the mandate to create the most performant AMD training cluster. And we did, in record time. So we've been able to focus on that. And from my perspective as the chief growth officer, or, as I like to joke, the chief GPU officer: as I'm meeting with the labs that need access to compute, they have a lot of people banging down their doors.

The customers I have that have announced they've raised 50 to over a hundred million dollars, they're like, man, we're getting so many steak dinners from all of these AI clouds and getting flown on private jets all over the place. And they have a lot to choose from when it comes to the Nvidia world. But if they want to make a bet on helping diversify and democratize what they're using, they have to look for an AMD solution. And, you know, we are the best at that.

When we first started, I mean, the challenge was people saying, I had no idea AMD did that. And so it was coming up with as many different creative ways to get the attention of the market, to make sure that they knew that AMD, yes, has a GPU; yes, it can do inference; and yes, it can also do training, and giving people options to try it. And so I'll never forget when we first started the company, like

three months in. We started in December, and then GTC 2024 was in March. We had just raised our first 40 million bucks, and we decided to get one of those LED trucks and circle the San Jose Convention Center during GTC. We had a comparison of the H100 and the MI300X on the back, showcasing all the different specs and showing that the MI300X definitely is better. And at the end, there was this robot

Jeff Tatarchuk (28:21.07) that had a red pill in its hand and it went like this to the audience, to the thing. And everybody loved it. It blew people's minds. knew exactly what we were trying to communicate. And people were taking selfies in front of it. And everybody at a lot of the big companies that were at GTC at the time were like, yeah, you guys were all in our Slack channel as a way of getting the attention. So that people just needed to know, right? It was kind of a grassroots guerrilla marketing.

Nataraj (28:27.088) The Matrix.

Jeff Tatarchuk (28:49.25) that needed to be done to just kind of shake people: hey, there's a new player that is viable on the market, and we should consider it. And so we got a ton of leads coming in. A lot of people were just curious. Initially, we had a lot of people coming who had worked for the national labs, because a lot of those supercomputers are built using AMD. So they were already familiar with AMD; that was kind of our initial audience coming over. And then we have some

clients that have never bought an Nvidia GPU, intentionally, just because they love AMD, they love what they stand for, and they're committed to it. So we still see that kind of transfer over from the consumer side to those that are now developing in the AI world who want only AMD. And then we still have to go out and prove ourselves: that it does work, that it's just as good, if not more efficient, than an Nvidia GPU,

and that we are the best at supporting them through that process at scale. So we typically will bring in a customer and analyze: what are they doing, what's their use case, what frameworks are they building on? Then our internal AI/ML team will validate everything they're using and make sure that there aren't any gotchas or bugs or any issues that they're going to

run into. And sometimes there are issues where we have to go back to AMD, or go back to one of the frameworks, and fix something before we let the prospect on. And then once we have all that ironed out, we get them on a POC to let them, you know, take a test drive, sit behind the wheel for themselves, and see that yes, it does truly work, and it's just as efficient as, and better than, Nvidia in a lot of ways.

Nataraj (30:44.294) Are there any specific architectural advantages people get by using AMD? Like, is there something AMD does better than Nvidia?

Jeff Tatarchuk (30:54.786) Yeah. So right now AMD still has, and launched with, the advantage of more memory. They have significantly more VRAM than Nvidia does. And so if you need to host some of the larger models, for instance a 70B model, you need two

GPUs to do it on H100s, which have only 80 gigabytes of VRAM each, versus the MI300X, which has 192 gigabytes. You're able to host more without having to split it up across as many GPUs. And then AMD has their chiplet architecture, which Lisa bet on early on. And as a result of the chiplet architecture,

yeah, it's starting to pay off: now they can break a chip down into more pieces and take advantage of the chip even more than what you can do on Nvidia.
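
A rough sketch of the VRAM arithmetic behind that claim (fp16/bf16 weights only; real deployments also need headroom for the KV cache and activations, so treat this as a lower bound, not a sizing guide):

```python
import math

# Weights of a 70B-parameter model at 2 bytes per parameter (fp16/bf16).
params = 70e9
weights_gb = params * 2 / 1e9          # ~140 GB of weights

h100_vram_gb = 80                      # H100, 80 GB variant
mi300x_vram_gb = 192                   # AMD MI300X

print(f"weights: ~{weights_gb:.0f} GB")
print(f"H100s needed:   {math.ceil(weights_gb / h100_vram_gb)}")    # 2
print(f"MI300Xs needed: {math.ceil(weights_gb / mi300x_vram_gb)}")  # 1
```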

Nataraj (32:07.142) One of the biggest advantages that Nvidia has is obviously CUDA. It provides libraries, compilers, and debugging tools for GPUs, sort of their proprietary system designed to run on Nvidia GPUs, and that has long been argued as one of the biggest Nvidia moats out there. How does that affect customers

Jeff Tatarchuk (32:27.15) Mm-hmm.

Nataraj (32:34.096) trying to run on AMD? Like, their development team might already be familiar with, you know, building on top of Nvidia. And in some sense, it's not a unique problem. It's like when AWS, Azure, and GCP compete: if your organization has AWS as a strength, then you hire more AWS people, and that sort of compounds. That's one of the reasons why AWS is so popular in startups: the early talent knows AWS, and every startup

sort of builds on AWS, right? They have those small-company advantages. So, talking about those challenges of having CUDA as a big moat, how does that change what you guys are building? Because you're sort of one of the earliest clouds building AMD clusters. What does that look like for AMD?

Jeff Tatarchuk (33:21.17) Yeah, so that was the initial kind of gut response: you know, how is the software? And what we found is, yes, there are some folks that have built on some of the CUDA-specific libraries that are proprietary, and if they were to have to switch, you know, if they were using cuBLAS or cuDNN or

something like that, they would have to take some time porting it over from that to AMD. So Nvidia has CUDA, and AMD has ROCm. But what we're finding is most AI engineers are using, you know, PyTorch, TensorFlow, or JAX. And if you are building with any of those frameworks,

you can port your code over from CUDA, or Nvidia, to AMD seamlessly. This was actually one of the things that we used to raise our initial money. There was an article done by Databricks and the MosaicML team at the time, with Naveen and Abhi over there, simply showcasing on an MI250 that you can

take your code from Nvidia on CUDA and run it straight out of the box on AMD, and it works. And that was when the light bulb went off for us. This was in, I believe, 2023. And it revolutionized everything; we didn't even know it was possible to do that. So yes, there are challenges.
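
A minimal sketch of what "straight out of the box" means in practice: ROCm builds of PyTorch expose AMD GPUs through the same `torch.cuda` interface, so ordinary device-agnostic code runs unchanged on either vendor's hardware (the toy model below is illustrative):

```python
import torch
import torch.nn as nn

# On a ROCm build of PyTorch, torch.cuda.is_available() returns True on AMD
# GPUs and the "cuda" device maps to the HIP backend, so this exact code runs
# unmodified on an H100 or an MI300X.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
).to(device)

x = torch.randn(8, 1024, device=device)
with torch.no_grad():
    y = model(x)

print(y.shape, y.device)  # torch.Size([8, 1024]) cuda:0, even on AMD hardware
```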

But every day AMD has done a lot to catch up on the software side, and things are becoming more and more efficient. And one of the things that we've done at TensorWave is, my team and I launched a summit called Beyond CUDA. So the first year we did the LED truck; the second year we did Beyond CUDA during GTC. We did it the Monday of GTC.

Jeff Tatarchuk (35:39.534) And we brought together the best researchers, founders, and engineers who were building things outside of the CUDA ecosystem, to showcase what they were doing, and that it was actually possible, and easy, and just as performant. And we had about 400 people show up, and there was so much energy and excitement around it that the momentum has continued to this day. And so,

continuing from there, AMD's done a lot to support all of the other inference engine frameworks, from Triton to SGLang to vLLM and the others. And there are a lot of people working on some really cool projects that are trying to create a kind of heterogeneous ability to switch from a TPU to an Nvidia GPU to an AMD GPU or whatever.
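
For instance, serving through one of those engines looks the same on either vendor's GPUs; a minimal vLLM sketch (the model name is illustrative, and this assumes a CUDA or ROCm build of vLLM is installed):

```python
from vllm import LLM, SamplingParams

# vLLM ships a ROCm backend alongside CUDA, so this script is identical on
# Nvidia and AMD GPUs; the engine uses whichever accelerator is present.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(
    ["Why does more VRAM per GPU matter for inference?"], params
)
print(outputs[0].outputs[0].text)
```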

And for us, one of my things is I want to build an ecosystem of people that are doing these sorts of things, solving these sorts of problems, and working on them together. We just had a customer the other day; I was telling them about this, because the number one objection we usually hear is about the CUDA moat. And this particular customer said, you only need CUDA if you don't understand how to really do low-level development on the GPU. And for them, CUDA,

they don't need it at all. They know what to do with the hardware to get it to do what it needs to do. So, one of the original guys that developed CUDA, his name is Greg Diamos, is on our team. He's worked for Nvidia and Intel and AMD, and did his own startup that was built on AMD GPUs. And now, you know, his main focus is on this, and he now has this…

Nataraj (37:33.734) I think he's the founder of the open-source ScalarLM, right?

Jeff Tatarchuk (37:37.358) Yeah, that's right. Exactly. He's great. And so ScalarLM is a unified training and inference stack that he's been developing. One of his quips is that CUDA was always built with the purpose of going beyond CUDA, you know, growing and going beyond the ecosystem. And so that's one of the things: we partner very closely with AMD to

support the projects that are out there, to help build the ecosystem and strengthen it so that it is more robust and people have the resources they need to be supported. So yes, CUDA is no longer a moat. AMD works out of the box. And yeah, if you think otherwise, we'd love to let you try it for yourself and we'll show you.

Nataraj (38:32.27) And so that means anyone like OpenAI can just easily change, or quickly adapt, and deploy their training cluster on any AMD-based cloud, essentially.

Jeff Tatarchuk (38:45.344) Exactly. Yep, exactly.

Nataraj (38:47.789) What does success for TensorWave look like, you know, three years down the line? What does TensorWave being successful look like?

Jeff Tatarchuk (38:59.852) Yeah, I guess I'll start with me. What success would look like, from my perspective, is that the customer has viable options for buying compute, that they don't have to be dictated to by the demands of Jensen or Nvidia, and that there is a true competitive market available for compute. And from TensorWave's

perspective, we are able to play a significant part in that by providing the best optionality and the best AMD GPUs, so the market is able to consider those options. And so for us, we want to make sure that we have a resilient, secure, performant cloud that they can rely on. We think of ourselves as the AI utility company. We want to make it just as

dependable as the power in your house: you're just as confident in your cluster running behind the scenes, as you do your training and run your inference in production, as you are flipping a light switch. And so for me, I would see success as our customers being happy, because the compute decision they have to make is the largest purchase they're going to make

as a company; they're spending hundreds of millions of dollars, if not billions, on these GPUs. And if they have to spend a lot of time messing with them, fixing them when a GPU dies, dealing with downtime, and working on all of the extra stuff, they're wasting time that they should be spending on their customers, their product, and the problems they should be solving. So I would consider TensorWave a success

if we've completely abstracted all of that away and we have a great, resilient product that is able to serve their AI training and inference needs at scale. And yeah, we are the ultimate go-to for that on the AMD side.

Nataraj (41:12.102) Yeah, I think that's a good note to end the conversation on. This has been a very fascinating conversation, and thanks for coming on the show.

Jeff Tatarchuk (41:19.17) Hey, thanks for having me.