
Season 8, Episode 141

The Evolution Of Data, AI, And Security In Tech With Tomasz Tunguz

Guests:

Tomasz Tunguz


Episode Summary

In this episode, Tomasz Tunguz of Theory Ventures discusses the intersection of AI, technology, and security. We explore how AI is revolutionizing software development, the challenges of data management, and security's vital role in this dynamic landscape.

Show Notes

In this episode of The Secure Developer, Guy Podjarny engages in a deep and insightful conversation with Tomasz Tunguz, founding partner of Theory Ventures. They delve into the fascinating world of AI security and its burgeoning impact on the software development landscape. Tomasz brings a unique investor's lens to the discussion, shedding light on how early-stage software companies are leveraging AI to revolutionize market strategies.

The conversation navigates through the complexities of AI in the realm of security. Tomasz highlights key trends such as data loss prevention, categorization of AI-related companies, and the significant security challenges in this dynamic space. The episode also touches on the critical role of data governance and compliance in the age of AI, exploring how these elements are becoming increasingly intertwined with security concerns.

A significant part of the discussion is dedicated to the future of AI-powered software development. Guy and Tomasz ponder the evolution of coding, predicting a shift towards higher levels of abstraction and the potential challenges this may pose for security. They speculate on the profound changes AI could bring, transforming how software is developed and the implications for developers and security professionals.

This episode provides a comprehensive look into the intersection of AI, technology, and security. It's a must-listen for anyone interested in understanding AI's current and future landscape in the tech world, especially from a security standpoint. The insights and predictions offered by Tomasz Tunguz make it an engaging and informative session, perfect for professionals and enthusiasts alike who are keen to stay ahead.


[INTRODUCTION]

[0:00:29] ANNOUNCER: You are listening to The Secure Developer, where we speak to leaders and experts about DevSecOps, dev and sec collaboration, cloud security, and much more. The podcast is part of the DevSecCon community, found on devseccon.com, where you can find incredible dev and security resources, and discuss them with other smart and kind community members.

This podcast is sponsored by Snyk. Snyk's developer security platform helps developers build secure applications without slowing down, fixing vulnerabilities in code, open source, containers, and infrastructure as code. To learn more, visit snyk.io/tsd.

[EPISODE]

[0:01:16] Guy Podjarny: Hello, everyone, welcome back to The Secure Developer. Thanks for tuning back in. Today, we have a really fun conversation with a very different lens than what we've had before. We have Tom Tunguz, who's the founding partner of Theory Ventures. We'll hear more about that in a second, to talk a little bit about AI from an investor lens, about AI, and AI security in particular. So, Tom, thanks for coming on to the show.

[0:01:39] Tomasz Tunguz: Privileged to be here. Thanks very much for having me, Guy.

[0:01:41] Guy Podjarny: So, Tom, maybe just for starters, tell us a little bit about yourself and Theory Ventures. Give us a bit of context to your views of the world.

[0:01:48] Tomasz Tunguz: Yes. So, I've been a venture capitalist for 14 years. I was at Google for three years before that, and worked at a company that went public. When I was 17, I started a little software company and fell in love with technology, and loved the incredibly rapid learning curve. About a year ago, I left my previous firm to start a new one called Theory. We're six people today. We spend a lot of time in and around AI, and we build a super concentrated portfolio. The idea is to find companies through extensive research, then back them, and continue to back them, and primarily help very technical founders build out their go-to-market strategies.

[0:02:21] Guy Podjarny: That's awesome. And Theory, I guess, as the name implies, it's deep theses on a domain, with AI being, I guess, the area. I mean, that's a pretty big title. You spent some time, I guess, exploring the domain, probably before the ChatGPT craze?

[0:02:37] Tomasz Tunguz: Yes. I mean, I remember when I was in grad school, there were four sections to the machine learning book. There was classification: is this a dog or a cat? There was prediction: what will the stock price be tomorrow? There was natural language processing, bag of words and all that, named entity extraction. And then the last one, it was always the last part of the chapter, and the professor said, "You need to know this, but it doesn't really work. Maybe one day." That was neural nets. Over the last 20 years, we've seen each one of those technologies really have its day in the sun, and now it's clearly the last one's turn.

[0:03:11] Guy Podjarny: Yes. Definitely the neural networks, although there's some NLP involved, I guess, in AI [inaudible 0:03:17]. So, we'll dig a little bit into AI security, and we've got a bunch of things to talk about there. But maybe first, I'd love to get your help classifying AI. You just gave us these four types. When you look at today's landscape, and you look at the types of companies in it, how do you divide them, the AI-related companies?

[0:03:38] Tomasz Tunguz: Yes. So, I think it's a classic three-layer cake here. There are the infrastructure companies, where there's databases, or vector databases. You have the large language model foundation companies, and you might have tooling around that, like vector compute, which is really important for RAG. Then you have the developer platforms: you want to run a private model, and you want to run it in a VPC, like a Google Vertex or Amazon Bedrock, so call that the platform layer. And then you have the applications themselves.

I would argue the application tier is going the same way that every company became a mobile company. Like, if you look at the old venture websites, people used to have on their bios, "I invest in Internet," right? It's an absolutely absurd assertion today, and we're kind of in this transition phase now of, "We invest in AI." But the reality is that at the application layer in particular, we're just investing in software.

[0:04:28] Guy Podjarny: Yes. And AI kind of needs to be embedded. But some people lean a bit more into the AI disruption, applying a new solution approach to a problem domain because of the new power that has become available through AI?

[0:04:45] Tomasz Tunguz: Yes, that's right. The way I characterise the application layer is, we're sort of waiting for the AI-native moment. What I mean by that is, when the mobile app store launched, all of the initial applications were little websites, websites stuffed inside of the iOS Chiclets, and it took Foursquare and Uber to reimagine those applications in a way where they used the accelerometer, or they used the GPS, and they were mobile native. I think most software companies today are like those first apps, and we will see a raft of companies that start from first principles and reimagine applications entirely AI first.

[0:05:27] Guy Podjarny: Yes. I like the mobile-native comparison, because I haven't really thought about mobile as much. I've been thinking a lot about cloud-native, and it's maybe the same trajectory, though it's more about how you operate a company versus the interface, which is, I guess, what you're highlighting. But it's similar. You could have used the cloud as lift-and-shift, running on virtualised infrastructure that someone else runs versus locally. Then there's the actual cloud-native, which is elastic and takes advantage of all these capabilities, a different development process, et cetera, et cetera.

[0:06:00] Tomasz Tunguz: Yes. That's exactly right. So, you have these new features, which might, say, change the billing. One of the things that we've been debating at the application layer is, you have these very classic demarcations between software. You have CRM software, and marketing software, and customer support software. What if you use an LLM to combine all that, and reconfigure these categories to try to capture more share?

[0:06:22] Guy Podjarny: Yes. That sounds confusing.

[0:06:24] Tomasz Tunguz: Yes. Especially if you have existing budget line items, and all of a sudden, you're trying to reconfigure those.

[0:06:31] Guy Podjarny: Yes. You're still needing to tap into one of those. So, from a security lens, I'm hearing maybe two takeaways. One, I guess, is that if I'm building an AI security startup, something that helps you secure the AI-related functionality you're adding, that would fall into the second bracket, into the tools?

[0:06:51] Tomasz Tunguz: That's right. Yes. So, there's a lot of demand there. I mean, we live this stuff every day, but the future is here and it's unevenly distributed, and I would argue LLMs are very unevenly distributed at this point. We came across some research, I think it was Morgan Stanley: less than 3% of the Global 2000 have touched a large language model. We host dinners with buyers. We had two dinners with 12 buyers, and only one of them had touched it.

The main dynamic inside of enterprises today, or at least the one that we've been able to capture, is that the board and the CEO are pushing hard: "We need an LLM strategy." I'm sure everybody in the audience has heard this. And internally, whether it's the CISO or the VP of data, they are worried. They're worried about data loss. They're worried about the lack of determinism within these models, the hallucinations, and the impact of making poor business decisions. They're worried about these models being co-opted.

I mean, you just saw Meta released an open-source bot that three days later had to be shut down; Microsoft had Eva. And these things are like humans, where you keep telling them a certain thing and they'll learn and evolve. So, there's a whole security apparatus that needs to be created. Today, it's buried under this moniker of governance, and the funny thing is, I ask everybody I meet, "What does that word mean?" Nobody can tell you what it actually means. So, we're trying to distil it: is it data loss prevention? Is it model drift? Is it observability? Is it compliance? Is it legal? I think we need to get to that next level of granularity to be able to answer that question.

[0:08:25] Guy Podjarny: Yes, for sure. And we might explore that a little bit over here. But it's interesting, the governance. To an extent, it's just a responsibility layer: use it, but do it responsibly. Sure, I'll have some governance, so I'm responsible, I'm doing it correctly.

[0:08:38] Tomasz Tunguz: Then, we were asking ourselves, I think it was on Monday: is there governance software that exists for classic software engineers? I mean, it sounds absurd. Sure, there's SOC 2 and ISO 27001, and all those certifications, but there's no governance tool.

[0:08:53] Guy Podjarny: I mean, security oftentimes is referred to as security governance. So, to an extent it exists, and it sits under security, which has had more time to brew the terminology. I'm not sure it's any more crisp or clear, but security governance probably has a few better definitions.

The other takeaway, though, and we'll probably explore both of those in more depth in a sec, is that if every company becomes an AI company, like it was with mobile, with internet, with cloud, then if you're a security person, you'd better get on it in terms of understanding the security implications of AI. Because I guess the answer that says, "Oh, my company would not really use AI", is probably either a short-lived or an incorrect statement.

[0:09:33] Tomasz Tunguz: That's right. I think the dynamic here is, I remember when Dropbox first started, security teams blocked it and said, "We don't know what this is. We have to stop it." But through employees' personal computers and mobile phones, it will leak into the organisation. I think it's inevitable, and the attack vectors that exist for these models are different and important enough that it's worth getting up to speed on.

[0:09:57] Guy Podjarny: Yes. It's the reason I've been doing many of these episodes on AI security with our guests here, because of the same conviction: we all need to form an opinion on it, given the pace of adoption. So, it's interesting indeed to hear your view. My sense was that the pace of adoption, if you contrast it to cloud, or maybe mobile, is faster. Maybe the value proposition is more immediate. Maybe it's just the hype and the range, maybe the ease of implementation. But even so, despite the stats you mentioned right now in enterprise, think about how many enterprises were embracing cloud. I mean, would these 12 enterprises have embraced cloud within a year of it being launched?

[0:10:41] Tomasz Tunguz: We'd count them on one hand, maybe. You look at Microsoft's latest quarterly announcement: they went from 12,000 to, I think, 18,000 or 19,000 enterprise users of some form of the technology, whether it's Copilot or the new Clippy. So, I think the thing that's different about the cloud wave, but similar to web and mobile, is that the press is interested in it. The press writes about it. And so, as a result, it's a phenomenon like I haven't seen in a really long time, where at Thanksgiving Day dinner, we're not talking about Bitcoin this year, we're talking about AI. What is the bot you're using? What are you doing with it? How are you saving time at work? What funny meme can you generate out of Midjourney?

[0:11:22] Guy Podjarny: Yes, indeed. In the end, it permeates. I guess, in that sense, it is a little bit more like the Internet, or like Bitcoin: consumer-ready. It's not like cloud, which is backend that not that many people really know about. There was an attempt, I think, when the cloud came out, to talk about it. But it's just too hard to explain.

[0:11:41] Tomasz Tunguz: Yes, it's too amorphous. I mean, even the word cloud came later, and whether it was ASPs or software as a service, people were like, what are you talking about? But this AI stuff, you can go on Bing or Google and kind of play around with it.

[0:11:52] Guy Podjarny: Feel it for yourself. Maybe let's explore especially that first category of AI security, though I guess it touches both. So again, with the investor lens here, you're seeing a bunch of companies, and you're hearing from buyers what it is that they worry about. When you hear about AI security, what types of categories or concerns jump out, from whatever lens you have, whether it comes from startups or from buyers?

[0:12:20] Tomasz Tunguz: Yes. So, the first need is data loss prevention. That's what everybody is worried about: some non-public information being used to train a model, or it popping up elsewhere. You look at the launch of the GPTs from OpenAI on Monday, and with one quick command, you can get access to the entire underlying data set. A classic example of total data loss. DLP is a category, and my understanding from talking to buyers is that it's a tainted acronym: historically, the category has underperformed and people have been really disappointed. So, I think there'll be a new wave of data loss prevention for LLMs.

[0:12:53] Guy Podjarny: Or something different, so they're not tainted by previous tool markets. But effectively, it would be data loss prevention.

[0:13:00] Tomasz Tunguz: Exactly, right. The second category I would call the LLM endpoint, which is: I'm an individual user within an enterprise, and I want to use a model. The CISO or someone inside the business wants to control which model I use, and then wants to run some scan on both the input and the output of that model. So, this touches a little bit into the DLP side, but maybe I want to check whether a Copilot-style assistant is producing copyrighted code that violates some laws, or put in place budget limits or rate limiting that's imposed at that particular point. Are there compliance concerns if I work in finance? So, there's both the infrastructure layer and the endpoint layer, and I think you'll have two different solutions.
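
To make the "LLM endpoint" idea concrete, here is a minimal sketch in Python of the kind of control layer described above: an approved-model list, a per-user budget, and a scan on both the input and the output. The model names, limits, and patterns are illustrative assumptions, not any vendor's actual API.

```python
import re

ALLOWED_MODELS = {"approved-cloud-model", "internal-private-model"}  # CISO-approved
USER_BUDGET_USD = 50.0          # assumed per-user monthly spend cap
_spend: dict[str, float] = {}   # user -> dollars spent so far

def scan_text(text: str) -> list[str]:
    """Rough input/output scan: flag things a policy might care about."""
    findings = []
    if re.search(r"\b\d{3}-\d{2}-\d{4}\b", text):
        findings.append("possible SSN")
    if "-----BEGIN PRIVATE KEY-----" in text:
        findings.append("private key material")
    return findings

def guarded_completion(user: str, model: str, prompt: str, call_llm) -> str:
    """Wrap every model call with the endpoint policy."""
    if model not in ALLOWED_MODELS:
        raise PermissionError(f"model {model!r} is not approved")
    if _spend.get(user, 0.0) >= USER_BUDGET_USD:
        raise RuntimeError("budget limit reached")       # budget/rate limiting
    if findings := scan_text(prompt):
        raise ValueError(f"input blocked: {findings}")   # scan the input
    output = call_llm(model, prompt)                     # the actual provider call
    if scan_text(output):
        return "[output withheld by policy]"             # scan the output
    _spend[user] = _spend.get(user, 0.0) + 0.03          # assumed per-call cost
    return output
```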

[0:13:44] Guy Podjarny: Just to delineate those a little bit. The first one is more about trust, with the data leaving my boundaries. It's more: I gave OpenAI my data, my data has been transferred to OpenAI, and then within the OpenAI interfaces, the admins and all that might have access to it. The second is more about the end user giving data. Is that correct? But also, how much you trust the output. So, it's less the setup, less the infrastructure and data, and more the day-to-day use.

[0:14:12] Tomasz Tunguz: Yes. That's a great clarification. Just to sharpen that point a little bit: data loss prevention is, I want to use a cloud model, or I want to train or fine-tune my own. How do I control the data that's going in there, and make sure that it's in a safe box? Then, the second is, my employee is the risk vector, and how do I wrap that person?

So, those are two. Then, there's a third category, which is around compliance. The EU has different laws around the right to explainability, and a lot of these LLM models do not have that. So, for mortgage scoring, or credit decisioning, we'll have to finally come up with a solution. Those are the three categories so far that we've been able to identify. The fourth one, which borders on security and data, is basically data access control. We don't really have a great name for it. It touches LLMs, but it's not exclusive to them, especially as marketers are going to need certain amounts of data in order to fine-tune whatever model they're producing. Let's say it's to produce a million ads a month. The customer support teams will want access to data to be able to create robots that respond to customer support emails.

It seems to me like there's a missing layer. It's not row-level access control. It's not RBAC. It's just data approval, access approval. Maybe it's PAM, privileged access management, for data. Something that looks like that, where all of the data assets of a company are put together, and then there are workflows for ongoing management that tie into compliance and tie into different departments. That needs to be built.
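
As a sketch of what "PAM for data" could look like, here is a toy Python version of that missing layer: a catalogue of data assets plus an approval workflow in front of them. The asset names, fields, and auto-approval rule are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DataAsset:
    name: str
    owner: str            # the accountable team
    sensitivity: str      # e.g. "public", "internal", "pii"

@dataclass
class AccessRequest:
    requester: str
    asset: DataAsset
    purpose: str          # e.g. "fine-tune a marketing copy model"
    approved: bool = False

catalogue = {
    "support_tickets": DataAsset("support_tickets", "support-eng", "pii"),
    "public_docs": DataAsset("public_docs", "docs-team", "public"),
}

def request_access(requester: str, asset_name: str, purpose: str) -> AccessRequest:
    """File an access request; only public data is auto-approved."""
    req = AccessRequest(requester, catalogue[asset_name], purpose)
    req.approved = req.asset.sensitivity == "public"
    return req

req = request_access("marketing-bot", "support_tickets", "fine-tune reply model")
print(req.approved)  # False: PII waits for a human owner, the workflow part
```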

[0:15:50] Guy Podjarny: It sounds like, and I think I relate to this, that to an extent you can paint all four under a data security mantle. One being almost at the source: okay, do I hand over that data for training, large bulks of data that leave the perimeter I secure on my own? The second is around the flow of data between users and the LLMs and back again, how trustworthy that data is, and what gets leaked. Leaving compliance for the end, the third one you talked about is access to data that is within the system. With prompt injection, to an extent, the data the LLM knows, any user of that LLM might know. So, how do you prevent that? The data is amorphous, so how do you even define who's allowed to know what, let alone enforce it? Then, the last one is, there's probably a bunch of regulations around this stuff, so how do I demonstrate that I've done what I needed to do for the other three? Is that right?

[0:16:44] Tomasz Tunguz: Beautiful summary. It’s my blog post.

[0:16:49] Guy Podjarny: It makes sense, because AI is all about data and the security of it. Your characterisation is one that I buy into for the concerns and the problems. From a solution perspective, though, in the world of cloud security, and to an extent mobile security and such, there's also a delineation of stage. Maybe, simplistically, you can talk about the AppSec side: hey, I'm building something, am I doing it securely, am I building a secure thing? Then there's the SecOps side: when I operate it, how do I know if someone's attacking me? Do you see those? Or is the industry just not there yet, still thinking about the basics, and so not thinking about those tools?

[0:17:29] Tomasz Tunguz: Yes. I don't think we're there yet, talking about RASPs and security operations centres, DDoS attacks on LLM bots, and that kind of stuff. It's happening to some extent. I think Google just published some research that it's suing some group that was using Bard maliciously, so it's starting to happen. But I haven't seen it yet.

[0:17:46] Guy Podjarny: Yes. It's probably just not yet the concern, because they haven't had the mileage with it yet. It's really interesting to geek out about it a bit more theoretically. I think you're conveying what you're actually seeing in the real world, and maybe I'm being a bit more speculative. But when you think about the complexity of SecOps here, it's really hard. In this probabilistic world, how would you know that you're being attacked? How would you identify that it's an attack? When it is an attack, what type of forensic information do you need available to you? And what types of mitigation tools would you have? When the attack is happening right now, maybe you can cut off that user entirely, but are there other mitigation-type capabilities? I think that's super interesting.

[0:18:34] Tomasz Tunguz: Yes. Just quickly on that point: you can imagine these systems spit out logs, right? So, do those logs end up going into your SIEM or whatever analysis framework you have? The other challenge is that one of these requests that hits an LLM is like 10 to 50 times more expensive than a regular page load. You can imagine DDoS attacks that are actually bankruptcy attacks, just trying to run up the infrastructure costs. So maybe a mitigation mechanism there is to fall back to a caching layer that serves the responses, or to shadow-ban certain IP addresses or accounts, serving them off much less expensive infrastructure, as opposed to hitting the LLM each time.

[0:19:13] Guy Podjarny: Yes. That's actually a really interesting point. When elastic infrastructure came along, and even serverless, I had one talk in which I talked about denial-of-wallet attacks: effectively, you're letting the service keep running, you're just driving the company bankrupt in the process. But serverless was cheap, and then –

[0:19:31] Tomasz Tunguz: I mean, $3 – we were calculating, some of these requests are $3 a request. So, you imagine like 1,000 KPS, at three bucks a request.

[0:19:39] Guy Podjarny: Definitely. You could easily shut down a shop, and it might be a little bit just for fun. Sometimes there are actual financial gains from making a company suffer. But it is a super interesting vector. I guess on the build side, I'm curious to hear a little bit of your take on the personas involved. So, maybe I'll share a concern I hear. Some security leaders that are a bit further along recognise that they have a challenge around collaboration with their data science teams.

If you think about the AppSec side, for a while now they've built a relationship with the development teams. Some are working better, some not as well. Now, they suddenly have this new constituent that, in most companies, existed in the dark corners of the org, and that deals with super sensitive data. They come along and ask, "You have this, so where did this data come from?" "Well, I don't know. I never really looked. We have the data. It's okay. That's all we need. We can assess it." A lot of these organisations don't even have these datasets documented. Data organisations don't have a documented and structured development process, and they modify stuff in production.

So, it freaks them out, I think somewhat legitimately. What do you see around ownership for that tooling layer, for that infrastructure layer? Is it the data science teams that are coming in? Is it the hardcore, hardened platform teams that come in and say, "Well, I'll own the data tools, and these data science folks will follow them, because I know how to use those responsibly, with governance"? I don't know. It's a bit of a long-winded question. But who do you see owning the AI tooling within companies when they're purchasing? Probably the budget owners, but who are the actual product owners?

[0:21:19] Tomasz Tunguz: Yes. Okay. So, I think you're hitting on a really important trend, which we talk about a lot in the data world. Historically, you've had the developers build an application. The application produces data. That data then moves into a cloud data warehouse, and then you have the data team that does post hoc analysis. After everything has happened, let's analyse what happened. How much money did we make? What are our click-through rates?

What's been happening over the last decade, and has suddenly been accelerated by LLMs, is that all of a sudden you need that data within the application. So, this data team is now thrust, instead of being post hoc, into being part of the product development cycle, and the most sophisticated data teams look like software engineering teams.

So, what do I mean by that? There's a VP of data. The VP of data is basically a VP of product. That person manages a series of other PMs who are writing PRDs, and they're receiving feature requests from primarily internal constituents on Jira. They are starting to commit code to GitHub. The problem is, unlike regular software, you have two different assets that you need to manage when you're in data. There's the code itself, and then there's the data. So, there's the versioning of the data, and there is some software infrastructure, like Iceberg and Project Nessie, where you can version and roll back and fork, just the way that you would think. But that's really –

[0:22:45] Guy Podjarny: The standards around data versioning.

[0:22:46] Tomasz Tunguz: It's like day one. I mean, top of the first. So, that's starting to happen, but I agree with you. I think one of the most important dynamics within the security world is that the CISO and the head of data will become far closer than they ever have been. Because you could argue it's at least as important as the partnership between the CISO and the VP of Engineering.

[0:23:08] Guy Podjarny: Yes. I think I agree with that. I guess there's an interesting timing question on it. First, I'll ask: are there rising stars within the data tooling world that are helping data teams move to a slightly more structured development process? And what's your view on the willingness of the data teams to adopt it? Because the tools will come, but the practices take longer to –

[0:23:34] Tomasz Tunguz: So, there is a lot of demand. I mean, Guy, you must be very familiar with that. figure, the DevOps –

[0:23:41] Guy Podjarny: Yes, the Infinity Loop.

[0:23:41] Tomasz Tunguz: So, there is now a DataOps Figure Eight. You can think about it: okay, there's planning, and I put together this table, which is, here's what happened in DevOps, here's the parallel in the data world. So, you can think about the PRD, or SLAs, and what the equivalent is.

Sorry, let me take a step back. One of the big trends in the data world over the last while is called data mesh. Well, what is data mesh? It's microservices for data. That's what it is. It's breaking down –

[0:24:08] Guy Podjarny: Yes, like a service mesh.

[0:24:10] Tomasz Tunguz: Yes, exactly right. So, now each individual team produces data. Okay, so once you start producing data, what do you need? You need to tell other teams how frequently you're producing that data. What is the format? What's the SLA? Great. Okay, so that's an SLA in the DevOps world. What do we call it in the data world? It's called a data contract, and there's a company behind that called Gable.
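
As a rough illustration, a data contract can be as simple as a typed record of what a producing team promises its consumers. This Python sketch is illustrative only; the fields are assumptions, not Gable's actual format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataContract:
    dataset: str
    schema: dict[str, str]     # column name -> type
    cadence: str               # how frequently new data lands
    freshness_sla_hours: int   # maximum acceptable staleness
    owner: str                 # the producing team

orders_contract = DataContract(
    dataset="orders_v1",
    schema={"order_id": "string", "amount_usd": "decimal", "ts": "timestamp"},
    cadence="hourly",
    freshness_sla_hours=2,
    owner="checkout-team",
)
```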

In DevOps, there's this notion of containers and CI/CD, and how you and I can both be working on a particular piece of software. If we write conflicting things, there's a merge process. There's a company called Tobiko Data that's doing this for data. So, you and I can both be working on a data pipeline, and I write a breaking change – we have an ephemeral developer environment that understands the semantics, understands what the breaking changes are, and we merge them in a beautiful way. Just the way that there's a Datadog for infrastructure monitoring, there's now data observability, and Monte Carlo is the leader there: what's happening in my pipeline?

So, each one of those eight or nine steps in the DevOps Figure Eight now has an equivalent, sometimes a little different, but mostly equivalent, with vendors. So, that stack is starting to happen. But could I tell you that a single CISO has been involved in any of those procurement conversations, having invested in that category for the last six years? I don't think they've ever been a constituency. But in the next five years, no doubt they will be.

[0:25:24] Guy Podjarny: Yes. It's an interesting model, though. I think that's a good articulation of the journey that the data development, or data engineering, community is undergoing. So, to an extent, maybe a starting point for someone looking to secure these things is to say, "Okay, I understand this phase in the data pipeline. Can I relate it to its software parallel? Then can I ask, what did I do about securing that phase?" And by proxy, try to apply that to the data side.

[0:25:56] Tomasz Tunguz: Yes. The parallels are very strong. Very, very, very, very strong. So, it should be pretty – it's just, every time there's a new world that comes up with the same concept, they give it a different name. Just learning –

[0:26:07] Guy Podjarny: Yes. Eventually, they merge together. That's super interesting. You mentioned committing code to GitHub. I know a company called DagsHub, and I think there are others, trying to be a GitHub for data. Do you think there is a budding world of collaboration tools, maybe even open source tools, that might challenge GitHub?

[0:26:25] Tomasz Tunguz: I don't think so. I mean, with GitHub, it's hard to put big files on there. There's Git LFS, large file storage, which is great, but a lot of these files are just gargantuan. It's kind of like asking, would you ever put the logs that you put in Datadog into GitHub? It's just not really designed for that. The way that the data world is evolving is that S3 and R2, the storage on Amazon and Cloudflare, is where people are dumping data and starting to manage it. And so, there are open source abstractions, like the Iceberg we talked about.

The idea behind it: you can imagine a database like an Excel spreadsheet. Let's say you had a million Excel spreadsheets in a folder, and each had six columns. All of a sudden, we need a seventh column that's column A plus column B. What you have to do today is write a script, with all the classic scripting problems, right? You have to get the plus versus the minus right, and you need to be able to roll back. So, what these open source systems like Nessie and Iceberg do is say, "Okay, add a column, and version it."
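
The mechanics being gestured at here can be shown with a toy in-memory version. Real systems like Iceberg and Nessie do this as metadata operations over files in object storage; this sketch only illustrates the version-and-rollback idea.

```python
class VersionedTable:
    """Toy table where adding a column creates a new, reversible version."""

    def __init__(self, rows):
        self.versions = [rows]          # version 0 is the original data

    @property
    def current(self):
        return self.versions[-1]

    def add_column(self, name, fn):
        """New version with a derived column; older versions stay intact."""
        self.versions.append([{**row, name: fn(row)} for row in self.current])

    def rollback(self):
        if len(self.versions) > 1:
            self.versions.pop()

t = VersionedTable([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
t.add_column("c", lambda r: r["a"] + r["b"])  # the "column A plus column B" case
print(t.current)  # [{'a': 1, 'b': 2, 'c': 3}, {'a': 3, 'b': 4, 'c': 7}]
t.rollback()      # back to the original columns
```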

So, I think those will exist outside of GitHub, and the data storage place will be S3 or R2 because they're so cheap relative to alternatives. And you'll probably have this separate system. Now, how GitHub and those systems talk? TBD.

[0:27:42] Guy Podjarny: Yes. So, the code can still be versioned, and the data versioning capabilities sit next to it; it won't be stored in the same system. Although there might still need to be collaboration capabilities around that, some easier UI where you don't go and browse a file list on S3. And then there needs to be migration, which, I guess, again comes back to GitHub. So, there's some mix with the code.

[0:28:02] Tomasz Tunguz: There will be some mix.

[0:28:04] Guy Podjarny: What about – so in the world of software that has been reused in open source. So, from a management and from a security perspective, it's very important. It’s very valuable for software to have been packaged, as here's the version of this piece of software on it, even registries over time that evolved on it. I see nascent elements to it, but what are you seeing around open data? And is there some nascent packaging standard of it, so that whatever –I had a security hat on here, and I'm saying, that would be very useful to be able to tag a piece of data with whatever assurances I think I can apply to it, and [inaudible 0:28:44], I can start worrying and inspect it a bit more deeply.

[0:28:48] Tomasz Tunguz: Yes, it's a great question. I mean, you can imagine, for a regulator, you need to prove that a machine learning model, given a particular set of data, produces a particular outcome. So, you need to match the model version with the data version. Both of those can change independently, and there's no link between them today. People are starting to talk about this idea. There's a company called Patch that's doing data package management, an npm equivalent, where I can say, this is the version of the data as of this particular day. The format is frozen, and then it continues to update. It's not there yet, but it's absolutely, absolutely, absolutely essential.
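
A sketch of what that model-to-data link might look like: a frozen, versioned data package pinned to a model release, so the pair is reproducible for a regulator. The package names, versions, and fields are hypothetical, not Patch's actual format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataPackage:
    name: str
    version: str      # a frozen, date-stamped snapshot of the data
    checksum: str     # integrity check over the frozen format

@dataclass(frozen=True)
class ModelRelease:
    model_version: str
    trained_on: DataPackage   # the explicit model-to-data link missing today

release = ModelRelease(
    model_version="credit-scorer-2.1.0",
    trained_on=DataPackage("loan-history", "2023-11-06", "sha256:ab12cd34"),
)
```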

[0:29:21] Guy Podjarny: Yes. That's interesting. So, from a security perspective, maybe taking it back to the customer lens: how much are these security concerns actually holding back enterprises from adopting? Or is the business driver too strong, and they're running ahead, and the CISO is being told, figure it out, because we're going?

[0:29:44] Tomasz Tunguz: So, I would say for large language models, it's definitely holding things back. There's a real concern. For CISOs, I feel for them, because if they make a mistake, their job can often be on the line, and the downstream effects can be huge. In the world of data, you could botch an analysis and you won't be fired. But now, all of a sudden, with these LLMs, that has changed, and the heads of data know this. They've seen it; I mean, they've probably been hacked like five times through their hospital and all that kind of stuff. So, they realise that all of a sudden, all this data is now everywhere. Until they have a high degree of confidence that all the data they're using is public, and if it were to leak, it would be no big deal, or that there's a governance/security layer that exists on top, I think they'll be very reticent, unless they're incredibly sophisticated.

[0:30:40] Guy Podjarny: I guess, how much are they looking for assurances from the platform they're using, versus dedicated tools that would give them the right capabilities? Like how much of it is OpenAI, telling you, “I'm not going to use your data”, or giving you whatever our back controls over, versus some LLM guard that you run the –

[0:31:02] Tomasz Tunguz: Yes. So, four months ago, we were talking to a publicly traded company that was building on top of OpenAI, and we asked them how they were managing their data security. They had two different answers. The first was, they had a proxy: every query that went to OpenAI was intercepted at the network layer, they ran regexes for Social Security and credit card numbers, and stripped those out. State of the art. Then, the other thing they did is, somebody would log in once a week to OpenAI and hit the delete-personal-data button. So, this stuff is new, right? A lot of the –
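
The network-layer interception described here amounts to something like the following Python sketch: scrub each outbound prompt with regexes before it reaches the provider. The patterns are deliberately simplistic stand-ins, not production-grade detectors.

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")   # crude 13-16 digit card match

def scrub(prompt: str) -> str:
    """Strip SSN- and card-shaped strings before the prompt leaves the network."""
    prompt = SSN.sub("[REDACTED-SSN]", prompt)
    return CARD.sub("[REDACTED-CARD]", prompt)

print(scrub("My SSN is 123-45-6789 and my card is 4111 1111 1111 1111."))
# -> "My SSN is [REDACTED-SSN] and my card is [REDACTED-CARD]."
```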

[0:31:34] Guy Podjarny: But what are they looking for? Are you sort of hearing narratives that say, “Hey, I need more secure LLM platforms, and that would address my problems?” Or are you hearing them say, “I want to buy, like I will also seek a secure LLM platform. But I need a dedicated, LLM-related, AI-related security tools.”

[0:31:52] Tomasz Tunguz: I think it will be both. The core platforms: Microsoft has pushed aggressively a VPC equivalent for OpenAI, Google has Vertex, and Amazon has Bedrock. Those have assuaged some of the concerns around the first category we talked about, which is uploading my private data into something that I control, and then training a model from which only I benefit, which produces intellectual property. I think at the core infrastructure layer, that part will be managed by the hyperscalers. But all the other categories we talked about, like data loss prevention and those kinds of things, are huge opportunities for startups to come in and build businesses.

[0:32:30] Guy Podjarny: Yes, I agree, and I guess it's all about the timing element. It sounds like probably the biggest value unlock you can provide for companies at the moment is exactly that: unlocking their ability to use the LLMs. We're seeing that in our specific slice of it. We're still seeing companies being fearful of using Copilot because of a variety of security concerns, that it produces vulnerabilities and things like that. So, oftentimes, they look for solutions like Snyk to be a guardrail. They say, "Okay, fine. You can use Copilot. It's okay."

I can explain why that's a wise move, but I'll also admit that for many of them, it's not deep in their area of expertise. They know they need to have a security guardrail. They know it needs to not get in the way of the individuals, in this case, developers doing their work. But it's not necessarily that they fully mapped out the threat landscape and said this is the thing I'm worried about the most.

[0:33:24] Tomasz Tunguz: But I don't think anybody knows, you know what I mean? We're all creating this right now. We're all learning from each other. So, I think you're right. If I were a buyer in this category, I would look to a trusted partner who could tell me what's important and what isn't, and I'd want to deploy something. Over time, I can optimise and figure out, actually, this particular attribute, or this particular threat vector, is more appropriate for me. Today, it's just like, "Ah."

[0:33:49] Guy Podjarny: Do something. You have to do something. You have to get in the game. So, if you're a security person, you need to build a theory, understand it's going to be imperfect, do something, get in, and continue to scrutinise. Don't think you're done because you've embraced it. So, how much do you think about – you mentioned, in some of your recent newsletter posts, the notion of a constellation of models. If I understand the term correctly, it's like the multicloud conversation: would you really just use OpenAI? Would you really just use one? Or should we brace ourselves for using many? That's very important for security in terms of priorities. What are you seeing enterprises lean towards? What's your prediction?

[0:34:33] Tomasz Tunguz: So, it's really early days. When a company PM just wants to get started, they'll use OpenAI, and they'll say, "Great, I'll use the most expensive, most comprehensive one. Let's just get it working." Then, what happens over time in the bleeding-edge companies is they move off OpenAI: they'll take an open source model, and then they'll train it. What they'll notice is, okay, this big model is really expensive to operate, and there are three or four actions, like summarise some text, or extract some data from a PDF, or auto-complete code, for which they will take smaller models.

So, you have large language models, and then you have small language models, and many of these are open source. The most sophisticated companies now have constellations of models. One big model: an input comes in that the system doesn't recognise, or that needs a lot of compute, so it goes to the big model. A second prompt comes in: okay, PDF extraction, let's route it. So now, there's a routing layer that identifies what kind of query this is, classifies it, and then sends it to the right model.

Then, there's even starting to be, and this is very, very early, a caching layer: I've seen this before. You can imagine, "How do you say bagel in Russian?" You don't even need to hit the LLM; you only need to hit it once, and then you can cache it. The time-to-live can be decades, because that's not going to change. So now, a query comes in, there's some kind of routing system, there's a caching layer, and then there are two, three, four, five different models being deployed and managed, and then that's –
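
Put together, the constellation pattern is roughly this Python sketch: a cache in front, a routing layer that classifies the query, and a set of models behind it. The model names, the keyword classifier, and the cache policy are illustrative assumptions.

```python
cache: dict[str, str] = {}   # long-TTL answers, e.g. "how do you say bagel in Russian"

def classify(prompt: str) -> str:
    """Stand-in router; in practice this could itself be a small, cheap model."""
    if "pdf" in prompt.lower():
        return "small-extractor"      # cheap task-specific model
    if "summarize" in prompt.lower():
        return "small-summarizer"
    return "big-general-model"        # unrecognised or compute-heavy queries

def answer(prompt: str, models: dict) -> str:
    if prompt in cache:               # 1. caching layer: never pay twice
        return cache[prompt]
    model = classify(prompt)          # 2. routing layer: pick the model
    result = models[model](prompt)    # 3. constellation: call the chosen model
    cache[prompt] = result
    return result
```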

[0:36:09] Guy Podjarny: And you foresee: today, it's maybe only the very advanced companies that apply that. Would you anticipate that that pattern will emerge everywhere, because of the cost aspect of it?

[0:36:22] Tomasz Tunguz: I just think so, because of the cost. A dollar, $3 a query, even with massive ACVs, you just cannot sustain. Or think about how Microsoft and others are working on security configuration products, right? So, you deploy a new WAF, and you can type into the chatbot, how do I configure my rules around port forwarding, and all that kind of stuff. Actually, it's probably not the right layer, but you get my point. That's a really expensive thing. At a $20,000 ACV, you're probably talking about 10% to 20% of gross margin impact as a result of these LLMs being put inside of products. So, people will push against that and try to improve the margin structure, and this is the way they'll do it.

[0:37:02] Guy Podjarny: That's really interesting. And that's, I guess, probably sufficient motivation in its own right. Then on top of that, you have the competitive pressures: it could be that one model is better than another, or that one division has opted into one thing and another division in the enterprise into something else. For CISOs, one of the challenges in the world of security is that while developers like depth, the individual teams might say, this stack is awesome, I'm going to use it and anything that works with that stack, and I don't care about the other one because it's not my problem, security people need to keep up with threats. It's not practical for them to deal with the same threat in seven different tools across seven different dev teams. So, they need things that cover the breadth.

So, it might be another reason to weigh in favour of external security tools that give you an assurance spanning these different stacks.

[0:37:47] Tomasz Tunguz: Exactly right. You nailed it.

[0:37:48] Guy Podjarny: Typically, startups don't go straight to the enterprise. They kind of feed themselves, if you will; they build up their product on smaller companies, on startups that are advanced in their adoption of new technology. So, if I were an AI security company today: classically, a security company might go to early or mid-market companies that are more technology-adopting and try to sell them, say, cloud security solutions, before they go to the giants, who are very hard to satisfy.

Those smaller companies today, do they care about AI security at all? We talked about how much enterprises care about AI security, and how it's a holdback. When you look at the startups, the pioneers that have embraced AI in fuller force, how much do these mid-market and early-stage companies care? How much do they talk about it, and what are they concerned about in security?

[0:38:43] Tomasz Tunguz: They care enough to satisfy SOC 2, I guess, is the way that I'd put it. But there's nothing beyond that. I mean, I guess that's the question: will the SOC 2 requirements change, as expectations change as a result of LLMs?

The other dynamic is on the GDPR side, the data processor question. A lot of them are wondering, when selling to the enterprise, how do I architect myself so that I'm not a marginal data processor? That's another really important component to this. Those are the two areas that seem to pop up.

[0:39:14] Guy Podjarny: So, they basically – I guess it's all a sophistication type element. Their customers will ask them about things that they know to ask them. So, the very, very basic things they'll ask them about, but beyond that, we don't really care too. It's a bit of on, one hand, super compelling, on the other hand, pretty tough world for us. A security company today, might need to go to the top or to the more security-sensitive people, more so, than the more technology adopters.

[0:39:38] Tomasz Tunguz: Exactly.

[0:39:40] Guy Podjarny: I'm curious, slightly veering out of security, but relevant and applicable to security as well, about the flip side. We talked about how the world of data needs to embrace software development processes, but AI and LLMs are also changing how software will be developed over time. You're deep and immersed in this space. What's your prediction for how AI-powered software development looks in five years' time, in 10 years' time? And intentionally the long run, not the simple code completion or prompts that generate code.

[0:40:11] Tomasz Tunguz: Yes. I think we will see a jump like the one from assembly to an interpreted language. You think about going from effectively zeros and ones to high-level objects; I think we're going to see an even bigger jump with AI. Because the reality is, 50% of code contributed to GitHub already is machine-generated, and most of the code that I write is so formulaic, it's always the same thing. I mean, this week, OpenAI just announced that I can upload a zip of my repository and then ask the model to refactor across classes. Let me contrast this.

So, four months ago, I could ask the AI to complete a line for me. Then, maybe two months after that, I could say, write a function for me. Now, I can say, write a class for me, a whole document. Now, we're at the point where, and I haven't tested it yet, it can refactor across multiple classes. So, within the span of 9 months, call it 10 months, a year, you've got a higher and higher level of abstraction. I think the other really interesting dynamic around these code assistants is learning. For me, one of the hardest things to understand has been a hash map, or how to use this function in Ruby called map. I just look at the syntax and I feel dumb looking at it. It just does not click. But when the AI writes the code, it's like, "Oh, that's the way that it works."

So, I think that's really important. You'll see more and more people within an enterprise coding, people who might be sophisticated in Excel or slightly technical languages, who are now all of a sudden building real applications. Yes, the amount of code that we will produce will just grow by 100x, maybe 1,000x.

[0:41:50] Guy Podjarny: Yes. If you're a security person, that is a very scary statement. Especially when it's not necessarily vetted, right? Because it might be someone who knows Excel but could not themselves produce 10 files of Python. What is their ability to scrutinise those files? For starters, they can probably check that the code does the right functional thing to a reasonable degree. But security, that's a much harder thing to spot.

[0:42:16] Tomasz Tunguz: Yes. I mean, like, are they using the right encryption algorithm? What are they doing for like key rotation?

[0:42:21] Guy Podjarny: Even just straight up vulnerabilities. Are they sanitizing input that sort of moves from one function –

[0:42:26] Tomasz Tunguz: SQL injection, all that stuff.

[0:42:26] Guy Podjarny: And when you have vulnerability, if it's not composable, if not components, then that vulnerability will now be basically duplicated across 50 different instances, versus being able to know that this shared library has a vulnerability on it, and everybody else just needs to update.

[0:42:42] Tomasz Tunguz: Yes. That’s right.

[0:42:41] Guy Podjarny: It's a scary proposition on it – I do agree, though, with your prediction, which is, abstractions will get higher and higher. Development, software development will become more democratised. They come back a little bit to the, almost like the GitHub disruption. I'm not saying, it might be GitHub themselves to do this. But I think when you imagine that world of very high abstraction layer, but still need some form of versioning datasets that you need to run, and they do need some form of versioning. But it's no longer code, the stronghold, the stranglehold that existing source code repositories or communities on it might be diminished now.

[0:43:15] Tomasz Tunguz: Yes, absolutely possible.

[0:43:18] Guy Podjarny: I guess, we're in startup world, so everything has disruptable on it.

[0:43:23] Tomasz Tunguz: What’s the next juicy target?

[0:43:26] Guy Podjarny: Yes, I don't know if GitHub is an easy target and a stretch. But it's interesting to think about that, that software. I share your view of kind of switching to being higher level abstraction, a bit more sort of specification driven. But maybe this is back to the equivalents in software development and data, which is something needs to structure this thing. Just producing vast amounts of wild code. At some point, you don't want to know the code, you just want them to generate functionality, and you want to manage the functionality somehow, not know if and what code exists on this.

[0:43:56] Tomasz Tunguz: Yes, that's right. I mean, maybe you see sandboxing of these applications again, right? And particularly because somebody builds an application with an LLM inside, it's very difficult to instrument to understand exactly what's going on. So, you need to look primarily at inputs and outputs.

[0:44:09] Guy Podjarny: Yes. The exciting aspect of it is that you get to potentially reinvent the whole software stack of it. The transition that kind of gives you a little bit of security shivers in the process.

This has been an awesome conversation. I really enjoyed it. I think we're kind of coming up on time here. Before I let you go, I'd love to ask you, my typical ending question over here. So, if you could delegate or outsource one aspect of your job to AI, what would that be?

[0:44:37] Tomasz Tunguz: It would be calendaring. So, I look at a calendar and I feel like I lose 100 IQ points. I mean, it's such a difficult problem. I would love for email inbox to schedule. Done.

[0:44:50] Guy Podjarny: With like not just the schedule and coordination, but the wise time management aspect of it, right?

[0:44:55] Tomasz Tunguz: Yes. Cluster everything, figure out the travel time, look at the weather, figure out like whether she and Biden are going to be in San Francisco.

[0:45:05] Guy Podjarny: Yes. That would be optimizing. I think it's interesting that your answer right now, I feel like a lot of answers rotated around the admin of life. Let me just do my job, and just get this thing. That is a compelling part, so that’s what you would to spend time on.

[0:45:19] Tomasz Tunguz: That’s right. Yes, exactly.

[0:45:21] Guy Podjarny: Thank you very much, Tom, for joining in and sharing those views. I think it's a very fresh lesson and very interesting to me, and I'm sure to everybody listening.

[0:45:28] Tomasz Tunguz: So fun to be with you today. Thanks again.

[0:45:29] Guy Podjarny: And thanks, everybody, for tuning in. And I hope you'll join us for the next one.

[OUTRO]

[0:45:39] ANNOUNCER: Thanks for listening to The Secure Developer. You will find other episodes and full transcriptions on devseccon.com. We hope you enjoyed the episode, and don’t forget to leave us a review on Apple iTunes or Spotify, and share the episode with others who may enjoy it and gain value from it.

If you would like to recommend a guest or topic, or share some feedback, you can find us on Twitter, @devseccon, and LinkedIn at The Secure Developer. See you in the next episode.
