
Season 4, Episode 28

Developer Empathy With Jason Chan

Guests:
Jason Chan

In episode 28 of The Secure Developer, Guy is joined by Jason Chan of Netflix to discuss simplifying the security process for software developers, as well as some of the open source projects Netflix has shared with the community.

Jason Chan: “We're not trying to make software developers security experts. We realise they have a lot of other responsibilities. So, if we can solve those problems elegantly and efficiently, what we'd like them to do is opt into our systems versus trying to solve on their own. We'll talk a lot internally sometimes about this idea of cognitive load, and how much does a developer need to keep in their brain to be able to get ideas into production, and we want to minimise the impact that security has on that. If you think about how traditional security practices have been, we really wanted to turn that on its head.”

[INTRODUCTION]

[0:00:36] Guy Podjarny: Hi, I'm Guy Podjarny, CEO and Co-Founder of Snyk. You're listening to The Secure Developer, a podcast about security for developers covering security tools and practices you can and should adopt into your development workflow. It is a part of the Secure Developer community. Check out thesecuredeveloper.com for great talks and content about developer security, and to ask questions, and share your knowledge. The Secure Developer is brought to you by Heavybit, a program dedicated to helping startups take their developer products to market. For more information, visit heavybit.com.

[EPISODE]

[0:01:09] Guy Podjarny: Welcome back to The Secure Developer, everybody. Today, I've got Jason Chan from Netflix with us. Thanks for coming on the show, Jason.

[0:01:15] Jason Chan: Thanks for having me.

[0:01:17] Guy Podjarny: Jason, there's a whole million things I want to ask you about and dig into around security at Netflix. But before I go too deep, would you just give us a quick rundown of what's your background, what's your role today, and maybe what's the journey that got you there?

[0:01:29] Jason Chan: Sure. I've been in security for about 20 years or so. I started doing defence work for the Space and Naval Warfare Center. Moved from that really into boutique consulting for about the first half of my career. I would say for the last 10 years or so, I've been really managing security programs. I led the security program at VMware, and now I've been at Netflix for about eight years.

[0:01:50] Guy Podjarny: Okay. So, in VMware, you weren't on the security product side. You were on the, kind of keeping VMware secure side?

[0:01:55] Jason Chan: Exactly. We were in the IT organisation, so the information security team handled corporate enterprise stuff.

[0:02:00] Guy Podjarny: Got it. Eight years is a long time at Netflix, and the organisation has changed somewhat in those eight years. What were your scopes of responsibility roughly?

[0:02:10] Jason Chan: Sure. It has changed quite a bit. Business-wise, you think about it moving from shipping DVDs, to being a streaming provider, to now really being a full-fledged entertainment company creating its own content.

I started out relatively early on when Netflix was pushing into the public cloud and AWS, really as an individual contributor to explore the space. From there, as it turns out, you need more than one person to protect a cloud.

[0:02:35] Guy Podjarny: Sometimes.

[0:02:37] Jason Chan: I just eventually built the team. We started as a product security function, really working within the engineering team to try to effect change there.
Then over the years, probably about 60% organic growth and about 40% reorganising of other teams into my org, we've grown to an enterprise-wide security team.

[0:02:55] Guy Podjarny: Got it. How does this fit in the org? Is it still a part of the engineering organisation?

[0:02:59] Jason Chan: Our security team is within the product organisation, which I guess you could think of as the engineering organisation, and I report into the chief product officer.

[0:03:07] Guy Podjarny: Okay. So, it's like a peer organisation to the engineering team?

[0:03:11] Jason Chan: Yes. Our chief product officer basically runs engineering. We don't have a CTO or a CIO, so my peers, the other VPs, are other engineering VPs.

[0:03:20] Guy Podjarny: Got it. I think when you look at Netflix in general, Netflix has been pioneering a lot of modern practices around microservices, Chaos Engineering, and a whole bunch of others. Specifically, there have been a lot of statements. I've heard you speak, and in general heard statements from Netflix, around modern philosophies, like very intentional approaches to software engineering, including security.

So, I'd love to dig in a little bit around this topic of how you run security in this fast-paced environment, dealing with developers. What would you say, at a high level, when you look to engage or work with the product team, with engineering teams, with your peers, and help instil security – what are the core philosophies, core principles that guide you in making decisions and taking action?

[0:04:07] Jason Chan: I would say to start with, when you think about some of the things that Netflix is well-known for, things like Chaos Engineering. A lot of that is – our intent is really the same as it always has been, whether you're talking about keeping systems available or secure.
The intent is always the same, but how we apply it is fundamentally different. It has to be different because of running in the cloud, and running large-scale distributed engineering teams. That's like the foundation, is that we're still trying to achieve the same objectives.

But we realise organisationally and culturally, we need to approach it a different way. Probably the leading principle or philosophy or whatever you might call it when we're working with our engineering teams is just this idea of, “We want to think about guardrails instead of gates.” If you think about how traditional security practices have been, you have this idea of a gate where maybe a project or an application, you get to a certain phase and then you need to go talk to security –
[0:05:05] Guy Podjarny: Stop for an audit.

[0:05:07] Jason Chan: Yes. You might have to get a pen test. We really wanted to turn that on its head and really think about this idea of guardrails, where we can keep things moving fast, but also keep things staying safe. We try to infuse that general principle into anything we do, whether it's a tool we're going to build to support developers, whether it's any kind of process or mechanism or workflow, something like that. We're really trying to minimise the situations where somebody has to come ask us to do something.

[0:05:33] Guy Podjarny: This sounds much more aligned with moving quickly. I don't know if I'm phrasing it correctly, but I think the Netflix values are like freedom and accountability, or something of that nature. Freedom and?

[0:05:44] Jason Chan: It's freedom and responsibility.

[0:05:45] Guy Podjarny: And responsibility?

[0:05:45] Jason Chan: Yes.

[0:05:47] Guy Podjarny: How does that align? You work with the dev teams – how much freedom do you give them? You've given them the guardrails; how much do you give them that freedom, and how much do you hold them responsible?

[0:05:59] Jason Chan: There's definitely two sides of that coin. I would say a general management philosophy for Netflix, and you can even – I know our CEO has talked about it publicly, is you really want to distribute decision-making as much as you can. So, me, as leading the security team, I want to be making as few decisions as possible, and the best way to facilitate that is to make sure that people have context about what's important to the company, what's important to the team.

The idea being, if you have all the information you need to make good decisions, that given maximum freedom, you're likely to come within a range of acceptability for decision-making. Then the responsibility aspect of it is, you are free to make your own choices, you're free to pursue your own paths. Sometimes they'll be the wrong choices. The responsibility element is you have to be accountable for those things.

So, security-wise, and I would say generally, from an engineering perspective, we want people to be able to make their own choices, but there's a little bit of, I don't know if I'd call it informal peer pressure, that keeps things working. Where you generally want to align with other teams; if centralised teams are providing a particular service, you generally want to consume that, versus building your own. Just because those are the things that kind of help you be a good citizen of the overall engineering ecosystem.

[0:07:16] Guy Podjarny: Yes. To what level does that require security expertise? What's the threshold there?

[0:07:22] Jason Chan: Generally, what we're trying to do, and maybe we have a slightly different approach to security expertise for developers than some organisations. But my general philosophy is that we're not trying to make software developers security experts. We realise they have a lot of other responsibilities. They have to build features and products. They have to worry about performance reliability. We want to make participating in security as easy as possible.

A lot of security tasks are quite difficult, they require expertise, whether it's cryptography or things like that. We don't really want developers who are really trying to focus on some other element whether it's UI or personalisation to be having to worry about those decisions. So, if we can solve those problems elegantly and performantly and efficiently for them, then what we'd like them to do is opt into our systems versus trying to solve on their own.

[0:08:13] Guy Podjarny: Yes. I guess, it's a combination of not reinventing the wheel. So, you're working with those systems and things of use, and maybe sandboxing environment? Is that the way to think about it? It's like you play within the sandbox then you're fine?

[0:08:28] Jason Chan: Yes. I think that's a good way to talk about compartmentalisation or segmentation, and one of the examples, or I guess an analogy, we'll use around self-service is when you go to a grocery store and you want to do self-service checkout. It's normally fine, but if you have drugs, or not drugs, but if you have alcohol or cigarettes, you're going to have to talk to a person.

[0:08:47] Guy Podjarny: Probably drugs as well. It's probably okay –

[0:08:49] Jason Chan: I mean, who knows? Some drugs, maybe Sudafed? So, the idea is that a lot of things you can do and you can interact with, and it's not necessarily going to cause that much of a security downside. But we really want to identify that minimum set of things that we feel would be pretty impactful if there was some kind of damage or some kind of error, and figure out the right way to sandbox those – the right way to allow the developer to interface with those systems, but in a safer way.

[0:09:12] Guy Podjarny: Got it. Let's maybe drill down a sec. We talk philosophy, probably the best way is to just talk about specific practices that exemplify it. So, let's chat about a bunch of practices that you have here. Can you give us some examples of some tools that you use or some security processes that you apply to them?

[0:09:32] Jason Chan: Sure. A good example that follows on that idea of making security safe to interact with is Lemur, which is a system that we built to allow developers to interface with PKI, with SSL. We all know historically things like OpenSSL are just – it's a difficult thing to use. Certificates are not straightforward to request or configure, but they're important. So really, after, I think it was in 2014, after Heartbleed happened, which was the big SSL issue, we built Lemur as a way – we knew we wanted to make encrypted communications and SSL more pervasive throughout the environment. But we didn't want to rely on manual management of those certificates.

So, Lemur is a way for all of our developers, in a really simple and easy way, to request certificates, have their certificates get automatically renewed, have them monitored. I'm sure we've all had experience with those cases where a certificate expires and all of a sudden something stops working. So, we wanted to just make all those problems generally go away, just fade away – again, try to let the developers focus on what they're actually getting paid for at Netflix and let us worry about the security stuff.
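
To make Lemur's role concrete, here's a rough sketch of what requesting a certificate through a Lemur-style REST API could look like. The endpoint path, field names, and token are illustrative assumptions rather than Lemur's exact schema; the real project is open source at github.com/Netflix/lemur.

```python
# Hypothetical request to a Lemur-style certificate API.
# Endpoint, fields, and auth header are assumptions, not Lemur's exact schema.
import requests

LEMUR_URL = "https://lemur.example.com/api/1/certificates"  # assumed endpoint

payload = {
    "owner": "team-dl@example.com",         # who gets expiry notifications
    "commonName": "myservice.example.com",  # host the certificate protects
    "authority": {"name": "internal-ca"},   # CA Lemur should issue from
    "notify": True,                         # let Lemur monitor and renew
}

resp = requests.post(
    LEMUR_URL,
    json=payload,
    headers={"Authorization": "Bearer <api-token>"},  # placeholder token
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # issued certificate metadata; renewal happens centrally
```

The paved-road point shows even in the sketch: the developer names the service and an owner, and everything else – CA selection, renewal, expiry monitoring – is the platform's problem.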

[0:10:39] Guy Podjarny: This is Lemur, that's the name of the tool that's there? That's an example of something that's entirely out of bounds. With that system, there should really be – there is no freedom for somebody to use their own implementation of OpenSSL. There needs to be some spectacular reason for them to do so.

[0:10:56] Jason Chan: Yes. Lemur, yes, it's really the standard, and it's there. We open-sourced that a few years ago. But what we hope to do is if we build a tool that's sufficiently capable and simple, then you really have no motivation to go outside of that bounds. We call that concept, “A paved road.” You could certainly bushwhack and make your way through the woods. But if you have this nice smooth paved road that gets you to your destination, you're likely to opt in there.

Now, with freedom and responsibility, we do preserve the individual decision-making to go off that, but then they become responsible for that. So, from a Lemur or SSL certificate perspective, a developer, if they didn't want to use Lemur, they're going to have to figure out which certificate authority to use, how to provision those, how to make sure they don't expire. So, they're going to be on the hook for all that, and there's really no reason to do that.

[0:11:42] Guy Podjarny: Yes. There needs to be something strong. Okay, cool. We're starting from the point of tools that are not really offering terribly much choice to the developer. What's an example of a tool that's maybe closer to the user side of the fence? You mentioned cloud permissions. You have some capability there and another tool that you've open-sourced?

[0:12:03] Jason Chan: We built a system called Repokid, and Repokid works on Amazon Web Services as, really, a mechanism to evaluate the permissions that an application is using and, if it has been provisioned with more permissions than it's using, to automatically whittle those away. The philosophy there is we don't really want developers thinking about, what specific permissions do I need in AWS? We want them to be able to use what they need. So, if they need a queuing service or storage, they should just be able to use that without worrying how all the mechanics work from a policy side.

So, Repokid just works in the background. It will automatically manage and monitor those permissions, take away ones that aren't used. It will of course notify. But the goal is to have developers really have no notion of how permissions work in AWS and everything will just work. We'll talk a lot internally sometimes about this idea of cognitive load, and how much does a developer need to keep in their brain to be able to get ideas from their head into production? We want to minimise that. We want to minimise the impact that security has on that. We think of Repokid and similar tools as a way of just taking entire classes of problems that used to be or they have historically been problematic, and just really making them disappear into the background.
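
A minimal sketch of Repokid's core move, finding granted-but-unused permissions, can be built on AWS's Access Advisor APIs. The boto3 calls below are real; the role ARN is a placeholder, and the actual Repokid does far more (notification, rollback, policy rewriting).

```python
# Find IAM services a role was granted but never used, via Access Advisor.
# The pruning/notification steps of Repokid are omitted; this only reports.
import time
import boto3

iam = boto3.client("iam")

def unused_services(role_arn: str) -> list[str]:
    job = iam.generate_service_last_accessed_details(Arn=role_arn)
    while True:
        details = iam.get_service_last_accessed_details(JobId=job["JobId"])
        if details["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(1)
    # Services with no LastAuthenticated timestamp were granted but never used.
    return [
        svc["ServiceNamespace"]
        for svc in details["ServicesLastAccessed"]
        if "LastAuthenticated" not in svc
    ]

if __name__ == "__main__":
    print(unused_services("arn:aws:iam::123456789012:role/my-app"))  # example ARN
```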

[0:13:22] Guy Podjarny: Interesting. So, that's it. I think the SSL statement wasn't controversial at all, in the sense that I know very few developers would have any aspirations to reimplement their SSL, or even certificate assignment. But specifically, when it comes to permissions or to getting permissions, that's indeed typically, or more commonly, you see this as the realm of the application definition. You're saying in this case, it is you're defining it. You're getting it down to that least privilege by way of observation?

[0:13:51] Jason Chan: Yes. You might think of it as automated least privilege, where what we do out of the gate is we'll give you what we call the base set of permissions, and that set of permissions has been created by observing many hundreds and thousands of applications operate in AWS over a number of years. We generally have a sense of how most applications operate.

So, if you were at Netflix and you were an engineer and you created a new application, you would get that set of permissions, and the chances are, what you need to do, you already have permissions to do. You may try to do something that you don't have permission to do, and we have pretty simple workflows for you to gain access to those. But then what we do is we'll observe your application over time, and if you have been given permissions that you don't actually need, we'll just kind of shrink those away. We have a standard change notification process, so we'll let you know, “Hey, your application was given these permissions that it's not using. We're going to take them away. If you have any questions, here's the documentation. Come talk to us, either in person or in our Slack channel.”
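
For illustration only, the "base set" Jason mentions might be expressed as an IAM policy along these lines; the specific actions are assumptions for the sketch, not Netflix's actual baseline.

```python
# An invented example of a permissive-but-bounded starting policy that
# observation-based pruning (in the style of Repokid) would later shrink.
import json

BASE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "CommonAppNeeds",
        "Effect": "Allow",
        "Action": [
            "sqs:SendMessage",           # queuing
            "sqs:ReceiveMessage",
            "s3:GetObject",              # storage
            "s3:PutObject",
            "cloudwatch:PutMetricData",  # metrics
        ],
        "Resource": "*",  # a real setup would scope resources per application
    }],
}

print(json.dumps(BASE_POLICY, indent=2))
```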

[0:14:52] Guy Podjarny: But the default is to take it away. They can appeal. They get some advance notice for it, and they can appeal or prevent it from happening. But if they've done nothing, the permission would go away.

[0:15:03] Jason Chan: Yes. It would go away. There's really no basis for appeal. That's one of the nice things is that if permission is not being used, then there's no real justification to have it. Now, you may run into sometimes, a developer will say, “Hey, I only use that maybe once a month or once a quarter.” Something like that. We certainly could provide that. But generally, it's been pretty simple.

I was checking our metrics a few months back and we have a really low rollback rate. Most of the decisions we make in an automated way have no impact, and again, it kind of fades into the background. That's one of the things that we're trying to do. It's not that we don't like to work with our developers, but when we think about scale, what we're trying to do is figure out the investments that we can make that potentially take out entire classes of interactions that you perhaps used to have to have, but now they can just go away through automation.

[0:15:55] Guy Podjarny: Cool. Let's take a step then into the code. We've talked about SSL, which is arguably a little bit more of an infrastructure element, and cloud access is as well. What about the code itself? Within the application code itself, you've got maybe Docker containers. You've got vulnerable dependencies. You've got vulnerabilities in the code. What are the practices? How do you tackle those? Getting closer to explicit decision-making from the developer.

[0:16:23] Jason Chan: Yes. With specific code-level vulnerabilities, we have a number of mechanisms that we would plug into, whether they're scanners or other systems that could give us some signal that a vulnerability exists somewhere. What we try to do, because as you know, there’s never really a shortage of security tools that will tell you there's something wrong. What the problem is, is like, “Okay, how do you fix it?”

Then, what we try to do is match up the vulnerabilities that we see really commonly – maybe they surface through a bug bounty or through a penetration test or something like that – those more difficult problems, and same thing, where we're going to try to solve those for the developers so they don't need to action them on their own. A good example would be something like secrets management, where secrets management feels like, “Well, that should be a simple and solved problem.” But of course, at scale, when you have thousands of developers and thousands of systems, it's really quite difficult. So, we built that system. We built the system that handles PKI and things like that to give systems identity, which then becomes the foundation for use cases like secrets management, and being able to store encrypted secrets in your code repository versus leaving them in plain text.
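
As a hedged sketch of the "encrypted secrets in the repository" pattern, here is what the commit-time and runtime halves could look like using AWS KMS envelope encryption. The key alias is hypothetical, and Netflix's internal system is its own PKI-based service, not this code.

```python
# Sketch: commit ciphertext to the repo, decrypt at runtime via system identity.
# The key alias is a placeholder; IAM limits who may call kms:Decrypt on it.
import base64
import boto3

kms = boto3.client("kms")
KEY_ID = "alias/my-app-secrets"  # hypothetical KMS key alias

def encrypt_for_repo(plaintext: str) -> str:
    """Encrypt a secret so the ciphertext can be committed safely."""
    out = kms.encrypt(KeyId=KEY_ID, Plaintext=plaintext.encode())
    return base64.b64encode(out["CiphertextBlob"]).decode()

def decrypt_at_runtime(ciphertext_b64: str) -> str:
    """Decrypt at boot; the application's role grants kms:Decrypt."""
    blob = base64.b64decode(ciphertext_b64)
    return kms.decrypt(CiphertextBlob=blob)["Plaintext"].decode()
```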

[0:17:38] Guy Podjarny: But still, to drill down into this notion of the vulnerabilities in code. So, you build scanners? You provide those scanners to developers? I guess two questions, one is, do they have to use it? Maybe this comes back to conformity. Second is, who handles the results of it? Code scanners are notorious for their false positives. How do you engage there?

[0:18:00] Jason Chan: They are. I mean, we definitely plug into a variety of scanners and other tools that will help you surface vulnerabilities. One of the things that we're investing in is a system that, internally, we call Security Brain, which is intended as a way of aggregating vulnerabilities for developers, so that you're not – the old case we used to talk about was you'd run a scan and you'd give like a 300-page PDF report to the developer, and –

[0:18:27] Guy Podjarny: And everybody hides under the desk.

[0:18:29] Jason Chan: It hasn't been reviewed. What we try to do with Security Brain is surface the most important things that we really want the developer to fix, and make it really clear what those issues are and what the fixes are, and it may link to things like Jiras that we've added. We're trying to think about, what is the developer's interface to security issues? We want to minimise the number of places they have to go and how they would actually interact with that.

So, Security Brain for us would be – I don't know if it would be, like, it's not necessarily a normalisation layer, but it's one place where we can surface findings from an arbitrary number of tools into that system, so that the developer has – you could think of it like an application context. If you own an application, you could go there. When you log in, you'll see all the applications you own with the various issues that we found with each of those. You can be relatively certain that we're not going to come pester you about things that are not surfaced in that. Of course, there could be critical things that pop up that we would maybe engage on –

[0:19:28] Guy Podjarny: That's more of the response bit.

[0:19:30] Jason Chan: But you can think of that as the basis for your workflow, and you won't need to worry about logging into 15 other systems to find problems.
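
A toy sketch of the aggregation idea behind Security Brain: normalise findings from an arbitrary number of tools into one per-application view, sorted so the most important fixes surface first. All field names here are invented for illustration; Security Brain itself is an internal system.

```python
# Invented data model: one place for a developer to see what to fix, per app.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Finding:
    app: str       # application the finding belongs to
    source: str    # which scanner or tool surfaced it
    severity: str  # e.g. "critical", "high", "medium", "low"
    summary: str
    fix: str       # the actionable remediation the developer should see

def by_application(findings: list[Finding]) -> dict[str, list[Finding]]:
    """Group findings by app, most important first - not a 300-page PDF."""
    grouped: dict[str, list[Finding]] = defaultdict(list)
    for f in findings:
        grouped[f.app].append(f)
    order = {"critical": 0, "high": 1, "medium": 2, "low": 3}
    for app_findings in grouped.values():
        app_findings.sort(key=lambda f: order.get(f.severity, 4))
    return grouped
```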

[0:19:36] Guy Podjarny: Okay. Interesting. So, that's all that communication. How do you handle containers? Again, Netflix is known for early adopters, microservices, a lot of those components. How do you handle security of containers and application dependencies? Those things are between infrastructure and code?

[0:19:54] Jason Chan: That's a good question. That's a good emerging area that most teams are trying to figure out how to tackle. Because one of the things is, as you move to the cloud, or as you move to, say, immutable infrastructure, or this idea of having golden images or base AMIs, or whatever you might call it in your organisation – really, the line between infrastructure and the network, and security and applications, just kind of goes away. I would say we still have the same philosophy, where we're trying to get leverage by building security into the platform, versus chasing every single potential variation that folks might have.

We have a team – you may be aware, we have our container runtime called Titus. We work with them pretty closely from a relationship basis to build the security features we need into the container runtime system. We also work pretty closely with our team that handles the operating system on our base images. But what we're trying to do, similarly, is not necessarily to make them security experts. What we're trying to do is work with them closely, in close partnership, because we know there's high leverage by investing in those things, so that they can build security into what they're providing the rest of the ecosystem.

[0:21:03] Guy Podjarny: Yes. Very cool. Thanks for the tour through the different problem areas, and clearly there are even more threads to it. But I love how, indeed, the philosophy holds within those components. How do you handle response? A lot of those up until now have analogies to quality that you might draw. Maybe one that's slightly different is, indeed, incident response. A new vulnerability has been disclosed. It affects one of your dependencies. Do you get that notice? Does the dev team get the notice? Who gets pulled in?

[0:21:39] Jason Chan: If we had, for example, like a product security –

[0:21:42] Guy Podjarny: Another breach. A breach is a little bit stressful – exactly, like a new stressful vulnerability.

[0:21:48] Jason Chan: We have a centralised response team, and my goal with that team is really to have them be able to support different types of incidents. It could be data issues, it could be severe product security issues, it could be a corporate issue. It really doesn't matter; what we want them to bring to the table is things like crisis management skills, and communication skills, and technical remediation, and leading that.

For a product security incident – for example, say you had a popular application server that had a serious security vulnerability – generally, our AppSec team would take lead on that, working in conjunction with our incident response team. Just because they're going to have a bit more context about who may be using that system, and they're more familiar with the system, to be able to get that data pretty quickly. They would be the technical lead on that incident and they would be doing a lot of technical execution of the investigation.

But what we attempt to do is have pretty good context about the environment. If you think of it from an inventory perspective, so that you could answer relatively quickly, “Hey, who is using this particular library or service, and who do we need to contact?” We also want to be able to add qualifiers to that, right? If you're using it and you are edge-facing, we're going to want to engage with you more quickly than if you're using it on an internal system that's not facing the Internet.
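
The inventory question Jason describes might look like this in miniature; the data model is invented purely to show the "who uses this library, edge-facing first" query.

```python
# Invented inventory model: answer "who uses library X?" with edge-facing
# applications prioritised for the first wave of outreach.
from dataclasses import dataclass

@dataclass
class App:
    name: str
    owner_email: str
    dependencies: set[str]
    edge_facing: bool  # internet-exposed apps get contacted first

def affected(apps: list[App], library: str) -> list[App]:
    hits = [a for a in apps if library in a.dependencies]
    return sorted(hits, key=lambda a: not a.edge_facing)  # edge-facing first
```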

[0:23:04] Guy Podjarny: Okay. So, it sounds like in those contexts, the application security team or the product security team is still a little bit fronting for the development team, trying to buffer. It comes back to trying to keep the noise to a minimum for the development team before you push to them. The AppSec team might have context about whether, if it's a component that is vulnerable, I should worry or not.

[0:23:26] Jason Chan: Yes. We'll generally have the AppSec team make those decisions, and really what we found is one of the great things about the Netflix culture is that people care and they want to do the right thing. If they can help with a security issue, they're more than willing –

[0:23:40] Guy Podjarny: They're going to jump on it.

[0:23:40] Jason Chan: Totally. I've never had an issue in eight years of any team not being responsive or not taking responsibility for issues. I give them a lot of credit. They tend to go above and beyond when there's an issue.

[0:23:54] Guy Podjarny: But my understanding is that it's still handled a little bit differently than an outage. It does imply that the person getting paged when there is an outage, versus when there's a vulnerability in the same application, is not the same person.

[0:24:06] Jason Chan: It would generally be the same person.

[0:24:09] Guy Podjarny: Oh, so it's still the same person.

[0:24:10] Jason Chan: Yes. So, say if you had an application team and they had an on-call, we would just – if we felt it was urgent enough that we needed to trigger the on-call, it would be the same person responding whether it was an outage or a security issue. They may then engage other folks, but they would be the first responder.

[0:24:22] Guy Podjarny: Got it. Then this would be the central incident response that would have paged them, because they said, “This is a sufficiently” – they might not have application context, but they know that the vulnerability that's been disclosed is sufficiently severe to care.

[0:24:35] Jason Chan: Yes. That's another thing that we tried to do, when we talk about cognitive load and how much we want people to have to worry about: we really designed our incident response processes after our SRE team's. They already had years of experience managing outages and working with engineers and bringing in the right people at the right time. To me, it would be ridiculous to try to build a capability that was markedly different from that, so we really tried to borrow from it, so that when people are responding to a security incident, it feels very similar to an outage, and they know things will work the same way. We'll do the same type of post-incident reviews, the same type of reporting.

[0:25:12] Guy Podjarny: That makes perfect sense. Maybe on the flip side of that, we talked about problems happening. Do you have anything that you do that is more about celebrating successes? As the security team, if some developer went above and beyond, or did something awesome for security, do you have some element of that? Maybe a similar question is, is there any notion of a security champions program, or things like that? How do you go the other way around, identifying leaders and celebrating them within dev?

[0:25:42] Jason Chan: We don't have a formal champions program, but I would say we have a pretty strong and robust informal program where just the nature of software engineering, you have a lot of folks that have worked on security products and they know. They tend to be good informal champions for us. We do have a program – like anybody else, we probably don't celebrate enough. But we have a program that we call Security All-Stars. Maybe it's a little bit corny or a little bit cheesy, but we will recognise people if they go above and beyond. We give a little bit of swag. Nothing big. I think it's always nice to be recognised by your peers, and I think it's generally appreciated.

[0:26:18] Guy Podjarny: I know I promised I'm not going to talk about tools, but then I realised that we didn't talk about Security Monkey, and it's not quite the elephant in the room, but – can you tell us a little bit about the Simian Army from Netflix, infamous or famous, I don't know?

[0:26:32] Jason Chan: Sure.

[0:26:33] Guy Podjarny: What is Security Monkey and what does it do?

[0:26:34] Jason Chan: Yes. So, Security Monkey, I believe, was the first tool that my team open-sourced, I think in 2014-ish or so. The Simian Army, at least the way I always thought about it – and that goes with things like Janitor Monkey and Chaos Monkey and Chaos Kong – is just that you've always had some sort of technology governance, or some sort of patterns and practices that you would follow, or adhere to, to lead to the outcomes you wanted. If you think about 15 years ago, how would you handle things like performance or efficiency or reliability? It would tend to be process-oriented.

When we designed things like the Simian Army or Chaos Monkey or Security Monkey, they were similarly designed to get those same outcomes. We knew we would have to design in a fundamentally different way. What Security Monkey does – if you think of that guardrails versus gates principle – is it monitors the environment, but it's not stopping you from doing anything. It's identifying issues that we believe could be security problems and flagging them for the security team. It's a passive monitor versus, “Let me throw up a gate in front of you to stop you from doing something.” That's how a lot of our security tools work, with the idea that you can't prevent everything from happening. But if you build a lot of muscle and a lot of skill around detection and response, that will allow you to move pretty fast, because you have higher confidence that you'll find issues quickly and be able to fix them in production quickly.
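
In that guardrails spirit, a minimal passive check might scan for security groups open to the internet and flag them without blocking anyone. The boto3 calls are real; the "alerting" is reduced to a print here, where Security Monkey itself tracks state and notifies the security team.

```python
# Passive guardrail: flag world-open security groups, don't block anything.
import boto3

ec2 = boto3.client("ec2")

def flag_world_open_groups() -> None:
    paginator = ec2.get_paginator("describe_security_groups")
    for page in paginator.paginate():
        for sg in page["SecurityGroups"]:
            for perm in sg["IpPermissions"]:
                for ip_range in perm.get("IpRanges", []):
                    if ip_range.get("CidrIp") == "0.0.0.0/0":
                        # Flag for the security team; the developer isn't gated.
                        print(f"FLAG: {sg['GroupId']} open to the internet "
                              f"on port {perm.get('FromPort', 'all')}")

if __name__ == "__main__":
    flag_world_open_groups()
```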

[0:28:01] Guy Podjarny: I think that makes a lot of sense. Chaos Monkey is more disruptive than what Security Monkey sounds like. Is there a Chaos Security Monkey? Is there a thing that just grabs some over-broad permissions and breaks into a system and sees what it can get through? Or some aggressive version of an ongoing pen test?

[0:28:19] Jason Chan: Yes. We've experimented with kind of security chaos-type things. We do run attack simulation tools and things like that.

[0:28:26] Guy Podjarny: More Red Team style?

[0:28:27] Jason Chan: Yes, Red Team, and some security testing automation things. Nothing quite like Chaos Monkey, but I think Chaos Monkey is such a fundamentally simple thing. It's this idea of, what would happen if a monkey got into your data centre and started unplugging things?

[0:28:39] Guy Podjarny: Yes, pulling plugs in.

[0:28:40] Jason Chan: Because that's a real-world way of testing reliability. I remember early in my career, when I would do things like network engineering, you would always have these plans for high availability, and everything had to work just right for the failover to actually work. Of course, in reality, it never quite happened that way. So, with things like Chaos Monkey, it really forces you to poke and prod at all different dimensions of your system and really see how you can respond.

[0:29:06] Guy Podjarny: What do you think is missing? If you had unlimited budget and resources, what revolution do you think is still missing there? I don't know if revolution, but key opportunities in the line of security automation or the like?

[0:29:20] Jason Chan: One of the things that we've invested some in, and that I would like to continue, is this idea – we've had different names for it – but we have an open-source routing gateway called Zuul that our edge team built and open-sourced many years ago.
It's a modular system, so any traffic that comes into netflix.com is going through Zuul, and you can add arbitrary components to it; for example, we do some rate limiting on it.

But you can also add – we've experimented with adding, for example, a web application firewall module to it. You can add an authentication module to it, or a logging module to it, and there's that idea of having a proxy that takes care of a lot of the security concerns. That's really where I want to keep pushing: I want developers to be able to focus on what they're actually hired to do, and worry less and less about security. It's not that I don't want them to care about security. I want them to focus their 9 to 5 on what they're actually at Netflix to do. The more we can abstract, and the safer we can make it for them, that's really the flag on top of the mountain that I'm going after.
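
Zuul's actual filters are JVM-based (the project is at github.com/Netflix/zuul), so the following is only a language-neutral sketch of the idea: a modular gateway where rate limiting, authentication, and WAF-style inspection are filters applied to every request before it reaches the application.

```python
# Sketch of a modular gateway: security concerns live in the proxy, not the app.
from typing import Callable

Request = dict  # stand-in request type
Filter = Callable[[Request], Request]

def rate_limit(req: Request) -> Request:
    # e.g. consult a token bucket keyed on the client; deny when exhausted
    return req

def authenticate(req: Request) -> Request:
    # e.g. validate a session token and attach identity to the request
    return req

def waf(req: Request) -> Request:
    # e.g. inspect the payload for known-bad patterns
    return req

GATEWAY_FILTERS: list[Filter] = [rate_limit, authenticate, waf]

def handle(req: Request) -> Request:
    """Every inbound request flows through the same security modules."""
    for f in GATEWAY_FILTERS:
        req = f(req)
    return req
```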

[0:30:25] Guy Podjarny: Interesting. There's something about that statement that actually almost goes back to the perimeter, because it extracts it from where the application is into something outside. The catch would be, how do you do that in a way that keeps up with the application's complexity?

[0:30:40] Jason Chan: Yes, it's true. It's not an easy problem to solve. But similarly, in a different way, if you think about serverless or Lambda: if you're just running Lambdas, you're just running functions, and there's a bunch of security problems that just don't exist. That doesn't mean no security problems exist; it's just a much smaller set. That's the kind of philosophy of where we're trying to get to, where there are just fewer things that can go wrong. We have a higher assurance that we understand the environment and that we feel comfortable with the controls that exist.

[0:31:12] Guy Podjarny: Indeed. I talk a lot about serverless security, and I feel like a lot of problems indeed don't theoretically go away, but they're basically handled by somebody else – in this context, the cloud platform. So, you're saying, “Well, I want to take another chunk of those components and handle them for my development team as well.” Unfortunately, the attackers would shift their attention to the remaining gaps.

[0:31:33] Jason Chan: But I think if you compare – take, say, 15 years ago, where you are running on-premise: you are running a server, you're running all the network, you are maybe running a virtualisation layer, you're running the OS, you're running the middleware, the application tier, whatever. You are responsible for all of that. Then you contrast that to Lambda, where again, you're just minimising the attack surface. You're really aggressively managing the – because there's always more stuff to worry about, but you're really able to control it there. So, to me, it's a really neat direction.

[0:32:05] Guy Podjarny: The smaller unit also allows for least privilege, and the like. Maybe one last question before we wrap up: in this very empowered environment that you have within Netflix, what are your thoughts, and are you concerned at all, about this notion of a malicious or a compromised developer?

[0:32:21] Jason Chan: Malicious, yes. A malicious insider, I think, is definitely something really any security team is concerned about. I think we acknowledge that the malicious insider is a very difficult problem. If you think about Snowden, or any of the others – I mean, these are really difficult problems to solve when you have a knowledgeable, malicious inside attacker that has a lot of rights. We've invested a fair amount into – if you think about identity, we try to make identity as pervasive as possible, and we try to use behavioural analytics to better understand: hey, is this developer acting normally? Or is any other employee acting normally? Or is there potentially some kind of issue there?

I would say, generally, it's one of the largest problems in information security, so I wouldn't say, really anybody has solved it. It's certainly something that we pay attention to.

[0:33:11] Guy Podjarny: I do think that you've open-sourced BLESS. You have the SSH access, so zero trust plays a component in it. At the very least, if you're listening to this and unfamiliar with it, BLESS is like an SSH proxy that is more single sign-on based, or more conscious of who's allowed to access the system at any given time, without managing a proliferation of SSH keys.

[0:33:36] Jason Chan: Yes. It's a nice mechanism for managing an SSH CA, so you can have short-lived sessions that are user-specific and endpoint-specific, and that really helps. Because we've always thought about, you know, the bastion host model as a nice way to choke-point access, but there are problems with that architecture. So, BLESS is intended to address some of those limitations. It gives you more granularity on authorisation, but also from an accountability and auditing perspective.
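
As a hedged sketch, a BLESS-style CA runs as an AWS Lambda that signs a user's public key into a short-lived SSH certificate. The payload fields below follow BLESS's general shape but should be treated as illustrative, and the function name is assumed.

```python
# Request a short-lived SSH certificate from a BLESS-style Lambda CA.
# Payload fields and function name are illustrative assumptions.
import json
import boto3

lam = boto3.client("lambda", region_name="us-east-1")

payload = {
    "bastion_user": "alice",       # who is asking
    "bastion_user_ip": "10.0.0.5", # where they are asking from
    "remote_usernames": "deploy",  # account(s) the cert is valid for
    "public_key_to_sign": open("id_rsa.pub").read(),
}

resp = lam.invoke(
    FunctionName="bless",          # assumed function name
    Payload=json.dumps(payload).encode(),
)
cert = json.load(resp["Payload"]).get("certificate")
# Write the short-lived certificate next to the key and ssh as usual; it
# expires on its own, so there is no long-lived credential to revoke.
print(cert)
```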

[0:34:01] Guy Podjarny: Yes. That's one aspect of it, because you know at least that if somebody did get compromised, they don't have the keys to the kingdom. They have to jump through the system, and that would be auditable – it's just managed in a more reasonable fashion.

[0:34:13] Jason Chan: Yes. If you bring it back to, if you like, the kill chain model, one of the things that you're always trying to do is make the chain as long as possible, so that a number of things have to go wrong in succession for the attacker to be able to achieve their objectives. As much as we can do to extend that and build more detections and things like that is generally what we want to invest in.

[0:34:37] Guy Podjarny: Sounds good. This has been great, Jason. Before I let you go here, I like asking every guest on the show: if you had one bit of advice or pet peeve or something you would like to tell a team looking to level up their security posture, what would that be?

[0:34:52] Jason Chan: I believe the real differentiator for any security team, from organisation to organisation, is how well you understand the company's culture and the company's risk appetite. Because the body of knowledge for security is out there; it's knowable. But where the real art comes in is how you apply the company's philosophy on risk and the company's culture, how you shape that body of knowledge to the problem at hand. I think many security folks will have their bag of tricks, or they'll have their experience of having done certain things a certain way at another company or with another customer. To me, it's really the customisation – really investing in understanding what your company wants, what it's comfortable with, how it wants to operate, and being able to flex your security program to fit that.

[0:35:42] Guy Podjarny: That's great advice. So, if somebody wants to tweet at you or such and find you on the Internet to ask more questions, where can they find you?

[0:35:49] Jason Chan: Sure. Please come find me on Twitter, it's @chanjbs.

[0:35:56] Guy Podjarny: Jason, thanks for coming on the show.

[0:35:57] Jason Chan: Thank you. Thanks for having me. I appreciate it.

[0:35:58] Guy Podjarny: Thanks everybody for tuning in. Hope you join us for the next one.

[OUTRO]

[0:36:04] Guy Podjarny: That's all we have time for today. If you'd like to come on as a guest on this show, or want us to cover a specific topic, find us on Twitter, @thesecuredev. To learn more about Heavybit, browse to heavybit.com. You can find this podcast and many other great ones, as well as over a hundred videos about building developer tooling companies, given by top experts in the field.