access2024

From Manual to Automated: Implementing Least Privilege in AWS with SCPs


In this session

Speaker

Cole Horsman


AVP, Security Operations, Global Atlantic Financial Group

Summary

Learn first hand about the journey of achieving least privilege in the cloud using Service Control Policies (SCPs) in AWS. This case study will start by showcasing the manual approach, detailing the design and implementation with cloud-native tools. Then, the discussion will focus on leveraging automation tools to streamline the process, significantly reducing time and effort. Take away practical advice to apply to your own least privilege journey in the cloud.

Resources

Read Summary
Imagine accelerating your identity security maturity in the vast expanse of the cloud. At the Access 2023 Cloud Identity Access and Permission Summit, Chad Lorenc, a seasoned security manager from AWS, shared invaluable insights on mastering this vital aspect of cybersecurity. Key Principles: Chad emphasized that all cloud access is privileged access. He stressed the need for understanding roles and centralizing them through robust organizational structures. By mapping roles to identity policies, businesses can tailor security measures to their unique needs. Understanding Cloud Identities: In the complex cloud landscape, comprehending the nuances of identities is paramount. Chad highlighted various identity types, from cloud users to root users. This knowledge forms the bedrock of secure cloud environments. Phases of Identity Security Maturity: Chad outlined a roadmap for your journey. Begin by establishing a baseline of identities. Then, build protections around your business objectives and fine-tune them for maximum efficiency. The ultimate phase involves continuous monitoring, ensuring your security evolves with your needs. Aligning Security with Business: Don’t just implement security measures; integrate them with your business objectives. Chad urged against merely promoting "least privilege". Instead, showcase how it streamlines operations, ensures compliance, reduces errors, fortifies security, and facilitates scalability. Key Takeaways:
  • Understand Identities: Recognize the diverse identity types in the cloud—from users to cloud owners.
  • Align Security: Integrate security seamlessly with your business objectives for tangible benefits.
  • Continuous Monitoring: Implement robust monitoring and response strategies, ensuring your identity security stays ahead of the curve.
Embark on this transformative journey armed with knowledge, and watch as your cloud security reaches new heights.
View Transcript
Joseph Barringhaus (00:00): Awesome. Okay. So we just had a great session from James. Thank you so much James for that session. We saved the best for last. Everyone is here for our next session, From Manual to Automated: Implementing Least Privilege in AWS With SCPs. I'm excited for the session. It's with Cole Horsman, the AVP of Security Operations at Global Atlantic Financial.

(00:18): And just like the last sessions, the chat and the questions are active and watched at all times. Our presenters are here today. Cole, go ahead and join me up here on stage. We're super excited for the session and yes, again, reminder, even though it's the last day and you've been with us for a few hours, this session is being recorded and we will share the slides. If you have questions throughout the session, drop them in the channel, in the chat on the side, put them in the Q&A. We'll make sure we have lots of time here with Cole at the end. So with that, Cole, I'm going to turn it over to you. I'll be in the backstage if you need me. We'll come back on for questions at the end. And with that, take it away, Cole.

Cole Horsman (00:53): All right. Thanks for the introduction, Joseph. Much appreciated. Okay. My name is Cole Horsman. Again, I'm the AVP of Security Operations, primarily focused on cloud security at Global Atlantic. I've been here for about five years and I guess you could say that part of the reason that I'm here is because of some of the identity problems. I started out contracting as a consultant and then Global Atlantic brought me on full-time when we were building out their cloud environment to fix some security issues that they had with their legacy environment.

(01:32): I'll get into that here in a minute. Just to give you a quick background on Global Atlantic: we are a leading US retirement and life insurance company. We were founded at Goldman Sachs in 2004. What's interesting is that we started out as a startup.
It was a creative investment revenue stream for Goldman Sachs, and it separated from Goldman Sachs as an independent company in 2013.

(02:01): KKR, a large private equity company that acquired us, or majority acquired us, in 2021, saw something in us. Essentially, Global Atlantic's mantra is that we can do more with less. We run lean, we work hard, and we have the ability to do pretty incredible things with small teams. So that's a theme that you'll see and that I'll talk about, especially in how it relates to identity: doing more with less.

(02:33): So anyway, today, like I said, KKR fully owns Global Atlantic and we operate independently as a standalone insurance business. A lot of integration and alignment with them. Essentially we are the first relationship that KKR has had with one of their acquisitions where we have fully combined. So anyway, that's a little bit of our history. So table of contents here, I'll just run through it.

(03:02): Global Atlantic Financial Group, that's our company. I'm based out of Des Moines, Iowa. We're headquartered out of New York. Got several satellite offices across the US but that's primarily where we are. Why we're here today is to talk about our journey to least privilege, cloud identity issues, lessons learned, strategies, pivots and things like that. And then Sonrai Security and how we discovered them, how we're utilizing them today. And they saved us from some projects that we... An initial plan that we were going to implement; implementing Sonrai for it instead allowed us to save a lot of time and resources. And at the end, I'll talk about a few things that you can do, Sonrai or not.

(03:53): Like I said, I love what we've done with Sonrai, but there are things that you can do that are open source, that we've implemented, that have helped us in our cloud identity space. Let's see here. So I'll start out with the Global Atlantic identity composition.
So a little history, going back to what we had talked about before, our legacy environment. When I came in, we had a lot of AWS problems. We had a mixed environment for our production. So all of our application teams were sharing one production account and we had full administrator access for developers.

(04:36): Arguably the reason I have a job is because of these issues that we had. So I can't really... This was job security for me, but there were a lot of things to address when we first got into this environment. We had access keys for pretty much every process. There were no roles that were running processes. This was all driven by access keys. We found access keys any place that you can imagine: in S3 buckets, running on servers, in plain text, none of which were in secret server or anything like that.

(05:07): So we had an uphill battle with identity in the beginning. For system access, again, we had a one-to-many relationship. So one access key would be controlling five processes, and when you go to rotate that, you would have to coordinate with several teams just to rotate an access key. So that was job zero: just visibility, finding out the connections and dependencies, where all of these access keys were connected and who it impacted.

(05:41): I hate to say it, but in that environment, I think the easiest way to put it is you've got to break a few eggs to make an omelet, and definitely we had to learn the hard way on a few of those things because of how coupled that environment was. A lot of public resources, no encryption standards, no governance or guardrails. This was my reality when I first got here. And again, my job was not to come in and be a security engineer when I got there; it was, as a cloud engineer, to build out the 2.0 environment, a lot of which is building out guardrails, so setting up organizations. So naturally it just progressed into that role.

(06:28): Just make sure I didn't miss anything.
We had access keys that were as old as the processes, so four years old and things like that. I'm sure I'm speaking to some of you in the crowd that have encountered this and seen this in some early cloud days, but I'll get to the good part. So 2.0 objectives, and I just want to lay this out there. These are the objectives that we set out to do, and in a lot of ways they're the reality as well, but there are some things that we had to pivot on.

(07:00): So developer access: we moved to role-based access. We moved to SSO-integrated, MFA-enabled, all through Azure. At the time, it was just me doing the cloud security, so we had a pretty prescriptive role-based model. So if you're a developer in one application team, you would have power user access in the development environment and ReadOnlyAccess in anything beyond that.

(07:33): We've moved to automated deployments. So our pipelines would deploy any infrastructure outside of the development environment, and we basically just had to drive that point home and make that the way. So in QA in the beginning we fought with some manual deployment, some drift, I guess, but I guess the 80/20 was that we were going to deploy through pipelines. The pipelines deploy resources. We have release management, SSO check-out accounts that are integrated with ServiceNow for any production rollouts. And again, one of the previous sessions earlier was talking about JIT, and that's exactly what we found as well: from a JIT perspective, these check-out accounts that we did were our home-baked JIT solution, where you would go in, log into our secret server, have a change ticket that was validated with ServiceNow, and then you get access to release your production code.

(08:44): I could talk about that integration specifically for a while. Feel free to ping me outside of that if you're interested. From a system access perspective, we don't want access keys. I've heard that earlier today as well.
A lot of great sessions by the way. Appreciate all of the contributors to that and the people behind the scenes making that happen. But the access key is a problem. We still have access keys, truthfully, for processes that run outside of AWS, that we're iteratively trying to remove. That's also a journey.

(09:20): So I'm not going to sit here and say that we have a perfect situation. We still deal with access keys today and I assume that that's fairly normal, but we discourage it. It requires an exception process that goes through... lands on the desk of pretty much the CISO, like, this is a required exception. So also in 2.0 we wanted a logical separation of AWS accounts, and therefore permissions, network, et cetera, through many accounts. So a separate dev, QA, and production environment.

(09:56): We also do network segmentation based on the same thing through Transit Gateway, etc. From a security, or I guess compliance, perspective, more or less, we wanted to define our controls: our detective, corrective and preventative controls. So what are we doing for detective? We're using CSPM, and we were just talking before the call: CSPM is great for the visibility. It's a starting point. You should have some visibility into it. However, it's just going to point out your problems. Out of the box, it is most likely not going to solve them or bring solutions. So just keep in mind that you methodically need to come up with a plan to fix those issues and address those as well, especially in automated deployment environments, where a correction you make to a production resource could ultimately end up as drift and cause issues down the road.

(10:56): So just be cognizant of that: CSPM is your starting point and a great way to identify some high severity issues, but it can also be overwhelming. That alert burn-down strategy is really important. Corrective, so we've got a couple of different ways.
We started out with some automated remediation and then some custom remediation, where we built some Python runbooks that would look at compliance issues and then correct based on a schedule. We've tried a lot of different things.

(11:30): Code security was a big one. Obviously, this was not a big thing right when we deployed in 2020 or put in our AWS organization. But Bridgecrew came out somewhere in 2021, I believe, or we adopted it in 2021, so that we could start scanning our repositories for misconfigurations using Terraform. And then the other preventative control: organization policy, service control policies. Hopefully if you've been here the rest of the day or caught any of the other sessions, you're familiar, at least to some degree, with service control policies.

(12:07): But moving on here, those are our objectives for 2.0. That was the deployment strategy. And I'll kind of get into some of the baseline service control policies when you're setting up an AWS organization. We created a 10 commandments, and this was all new. We used some tools like asecure.cloud. It used to be free. Sorry, it's not anymore, I don't think. Or maybe there is a free version, but you can go check it out. That's where we got a good library of where do we start. Okay, how do service control policies work? Okay, so we started piecing some things together from there, using some online resources.

(12:46): Organizations was a fairly new concept, especially the Terraform provider for it in the beginning. But we did create a 10 commandments: block the root user, deny the regions, and we've talked about that today in some previous sessions as well. If you're not using a region, there's no reason to allow that region rather than block it. I mean, I wish it was just turned off by default, et cetera. I'll move on there. But unencrypted services.
So preventing people from uploading objects unless they're encrypted in S3, preventing people from spinning up new RDS instances that aren't encrypted by default, enforcing AMIs and denying unused services. Kind of like a service whitelist for AWS.

(13:35): A lot of these things are themes that have been discussed earlier with some of the Sonrai presentations as well. But just calling that out, just getting a set of these 10 commandments. They're probably widely available at this point too, to use as a baseline for organizations. The other thing that we wanted to do, and again this goes back to the access key thing, is deny the creation of IAM users.

(13:58): So that was a strategy and definitely still something I recommend, just to set that stuff up initially so that you can, again, move to preventative. Once a resource is out in production, good luck trying to get it back out of production and arguing with the application teams on why you can't do that. It just becomes very difficult to pull something back out of production in my experience.

(14:21): Some of the challenges that we ran into. There are character limits. There are attachment limits to OUs. Again, there were some limitations with AWS Organizations and the Terraform provider. We couldn't use Terraform with Organizations in the beginning.

(14:41): Testing challenges. I personally have caused an outage in production that took down a financial forecasting tool that we had for hours, and it was something to do with the batch process that happened overnight. So I didn't find out until the next morning. That thing takes a while to run. That was a hard lesson learned in knowing exactly how these service control policies work and also in understanding our environment, that our QA and our development environments are very different than our production environment.

(15:15): So even if you've tested it in lower environments, there's a possibility that you are going to run into issues in production.
I'm not even... Like a possibility. It depends on your environment of course, but it's a probability. I would say work with the application teams. It would be hard to anticipate everything that you're going to run into, but sanity check where you can and just do the due diligence.

(15:43): Again, for us, moving to a platform like Sonrai that can help with our automation of service control policies, and that has done a lot of the R&D, would've saved us from something like that happening, because this was just basically... It was a service that ran on the back end that we weren't expecting. And when that happened, that took out the batch process, and I think that would've been avoided with somebody dedicated to service control policies, like we're going to talk about here in a little bit.

(16:14): That also, when you cause an outage in a batch process, that causes people to lose trust in the system and moves you to internal approvals, more processes around service control policies. That became a bit of a problem and a friction point. So FYI, that's something you could run into. Let's see here. So 2024 objectives, moving that, flash forward into today. What we wanted to set out to do this year: we had three verticals. We had network, we had identity, and data security, and those were the three themes that we really wanted to look at this year. When you're looking at your backlog, it's really hard to prioritize, but those were the three verticals that we had set up.

(17:02): I'm going to focus just on the identity part, just based on the theme of this conference. I think it's good to focus on those three areas, from a network perspective, from an identity and data perspective ultimately, and I think Chad in a previous slide today talked about identity being at the edge, and it really is your firewall. So that resonates, because for the things that you are protecting, that data, it's good to identify that data and track down what can access it.

(17:34): And it all starts with identity.
I've preached that from the beginning with people who are coming into the cloud team: you really got to figure out and get a good handle on identity. That's really the building blocks for understanding cloud. So what we wanted to do this year: version control, ensure our identity resources are routed through CI/CD, ensure we can do preventative scans on that. So prevent star policies from being deployed, or a star in an action. We need to be able to audit our cloud and identity resources and then centralize management of it.

(18:10): CIEM. We want to develop our CIEM solution, get visibility into it and just see where we're at. And again, this was the start of 2024. Just wanted to get a better idea of what our identity posture was and identify unused principals.

(18:27): Least privilege. I'm not going to sit here and try to say that we were trying to completely achieve identity nirvana or least privilege completely, but again, just work toward least privilege where it makes sense. I think there are a lot of ideas about least privilege that are a little bit pie in the sky; for us, at least, it's working toward it and getting a handle on the unused and the overly permissive. That's the goal. We didn't have any unrealistic expectations this year that we were going to only provision necessary policies and things like that. Just get better at it, iterate on it.

(19:06): Let's see here. So we discovered a tool called IAMbic. Definitely something I recommend checking out. IAMbic is IAM but in code. And what this does, it's a tool that you can deploy in your organization and it will go out and look for all of your identity resources in AWS. What that means is it will pull your roles, your policies, your permission sets, your service control policies all into one repository, and it converts them to YAML files. So it makes them pretty human readable, and you can kind of see those resources, make changes to those resources and, again, audit those resources all in one repository.
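The preventative scan mentioned here (catching a full star in a policy before it is deployed) can be illustrated with a minimal sketch. This is not the speaker's tooling, Bridgecrew, or IAMbic; the function name `find_wildcard_findings` and the sample policy are hypothetical, and the check covers only a bare `*` in `Action` or `Resource` of `Allow` statements in a standard IAM policy document.

```python
def _as_list(value):
    """IAM policy fields may be a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_wildcard_findings(policy):
    """Return (statement id, message) pairs for Allow statements that use a bare '*'.

    Hypothetical helper for illustration: it does not evaluate conditions,
    NotAction/NotResource, or service-level wildcards such as 's3:*'.
    """
    findings = []
    for index, statement in enumerate(_as_list(policy.get("Statement"))):
        if statement.get("Effect") != "Allow":
            continue  # Deny statements with '*' are not over-permissive
        sid = statement.get("Sid", f"statement[{index}]")
        if "*" in _as_list(statement.get("Action")):
            findings.append((sid, "Action is '*'"))
        if "*" in _as_list(statement.get("Resource")):
            findings.append((sid, "Resource is '*'"))
    return findings


# Made-up example policy: one admin-style statement, one scoped statement.
SAMPLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "Admin", "Effect": "Allow", "Action": "*", "Resource": "*"},
        {
            "Sid": "ReadOneBucket",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/*",
        },
    ],
}

for sid, message in find_wildcard_findings(SAMPLE_POLICY):
    print(f"{sid}: {message}")
```

A check like this could run in a pipeline against the policy files in the repository and fail the build when findings are returned, which is the shift-left idea the talk describes.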
(19:50): Before, we had some manually provisioned IAM roles, and we want to prevent drift with this. We want to be able to audit and we want to enforce a source of truth. And what comes into play here, and something else that you can do with IAMbic, is you can set up an automation server. And really that automation server runs on a cron schedule.

(20:12): And if someone were to make a change, so what is this? This is an organization service control policy. I will say that we have pivoted to using Sonrai as our source of truth rather than IAMbic, but for demonstration purposes, if you're looking at this, what we would do is add an enforced flag to this template, and then when the automation server runs on that cron schedule and sees that someone made a change outside of the pipeline, it would go correct it. It runs on a schedule every five, 10 minutes. And then we can know for sure that if someone... Like I said, if someone made a change, this would go back to the source of truth, which is IAMbic.

(20:50): So we'll enforce all of our policies and permission sets that are distributed at first, and inch our way toward what makes sense for our organization. Obviously, with just a very lean team doing identity, and to some previous points today, we want to allow some developer autonomy with guardrails. We can't be the central identity team for all of AWS. That's not scalable and not feasible at our company.

(21:20): So CIEM, one thing I wanted to call out here: we wanted to get a snapshot of how many resources we had. And I'm not going to go into who the provider is or anything like that, but we had some categorically incorrect information that was pulled from our platform for CIEM, and there were some misconfigurations that should have indicated overly provisioned access on this diagram, but they were showing zero. Everything looks pretty good here.

(21:56): There's a false sense of security, so you really need to do some fact checking.
Use IAM Access Analyzer in conjunction with your CIEM to make sure that the information is accurate. And I wanted to call that out. As you start stacking up more tools and you start getting more insight, make sure it's correct, because it's not something that you should just rely on one solution for.

(22:18): And I think at the end, if there's time, and I doubt there will be, I could talk about testing your service control policies and doing some QA testing as well, but I just wanted to call that out really quick: the information represented to you, you need to make sure that it's correct.

(22:34): Code scanning. So I talked about Bridgecrew early on and I'm going to move through this quickly. We're running out of time, but code scanning is critical for us. We've got two ways to be preventative, and those, if you recall, are code scanning and service control policies. So we want to make sure that if a developer, or cloud ops, or anyone is provisioning an identity resource, we prevent the low-hanging fruit.

(23:02): We don't want to see stars in the resource part of a policy where we can avoid it. We definitely don't want to see a star, like a full star, in the actions part of the policy. So just preventing those kinds of things from going into our environment, because again, once it's out there, going back and modifying it is a little bit more challenging. It can take a week, can take a couple of weeks, et cetera. We've got to wait for another sprint. Just try to get preventative as soon as possible if you can. The old adage, the shift-left mentality. So that's basically what that's getting at.

(23:39): So our initial plan, and this is where it comes into Sonrai and the value we got out of it. The least privilege plan was to use the CIEM, which was somewhat unreliable, to create policies, and run Python runbooks against those policies to correct permissions. So I'll give you an example.
So somebody goes out and provisions an RDS instance and attaches full admin to it.

(24:09): Well, this would look at the permissions and remove the permissions based on a finding. It would send an SQS message. We would have to set up an SQS queue. We'd have to set up a Lambda function. We'd have to do custom Python coding just to be able to correct that permission set. We were planning on it taking six months, basically the second half of the year. We would have infrastructure management, so Lambda functions, perhaps some other compute as well.

(24:40): Custom development for, again, a two-to-three-person team that's not doing full-time development. Removing permissions has a lot of risk associated when you can't easily put them back. And then a lot of potential for alienating app teams because of causing outages.

(25:01): So enter Sonrai Security. I think I saw it on LinkedIn at the beginning part of the year. I think they had just released, somebody shared something about Sonrai, and I looked at it and I was like, "Oh that's interesting. I know that no one is doing that right now. I think this could solve a problem."

(25:18): So we set up a demo, we looked through it and we worked very closely with their team to see if this would fit for us, if it worked how we thought it would work, and if it would solve our problem. And it checked a lot of the boxes that we were trying to solve for this year, and it checked some other boxes with audit that we weren't expecting it to solve for. But the implementation, I guess, when we were looking at it to go through and remediate this, again, keep in mind that it was going to take us six months to go and try to build this out, and that was anticipated.

(25:48): I'm not really sure exactly what that would've taken, but by doing this, we were able to start with the services and block.
If you look at the slide here, you can see that we had a lot of unused identities, a lot of permissions that weren't being used, a lot of zombies. We went through and established a process in development for correcting what our service block list is, and we can now manage that through Sonrai.

(26:18): We were originally managing it, like I said, and that's how I broke production. We originally managed it manually. One thing I'm going to call out too, when you're managing it manually, and I think I missed this earlier, is you have to go put it back. So for example, our root user. We blocked the root user. If I need to go rotate a credential for root, I've got to go manually remove the block, then I've got to go rotate the credential, then put it back.

(26:42): I will say that that's caused issues for us and has been a problem for us. So having an automation platform to handle that is also an unintended benefit of Sonrai that we weren't expecting. But you can see here what it saved us by starting with the services, then moving to the zombies, and then moving to the sensitive permissions. We were able to do that within a few days of starting here, a few days, and then we started quarantining zombies, and then we started doing the unused permissions. What would've taken, like I said, months.

(27:15): And not only that, but the chat ops component was hugely beneficial because we got immediate feedback when someone got blocked. We can go reach out to that team and go get the permissions put back if they needed them, to not cause friction with the application team. Sure, when we introduced it, there's a little bit that you have to work through, but when they saw the demo, when they saw it in action, they saw how quickly we could respond. It was a non-issue, and in fact they could see what we were trying to achieve and the value of the product.

(27:47): So takeaways, what you can do today: there's a lot of open source tooling that you can use today.
IAMbic is one that I definitely suggest checking out. Get visibility into your environment. Somebody talked about a few open source tools today. I think Prowler was mentioned, and then I latched onto the role-based account factory that Chad was talking about as well. So those are really good tools that you could just kind of look around and find something. And if there's not time for the demo, then I would say that another thing that you can do is use some logic, some creativity, and go use ChatGPT outside of your environment, or in a safe controlled environment, to say: how can I test my service control policies to make sure that they're working?

(28:33): Things like that. Shift left toward preventative controls. That's a given. If you can prevent it from going into your production environment, you should. Cloud native tools. If you're using a cloud, like AWS or OneCloud, then these are the things that you should be using. Some other things that you can use. And then with limited resources, automation is the only way. So find a way to do it more efficiently. Use things like ChatGPT to help you, I guess, get code or something like Python runbooks and things like that. All of this kind of starts with that. Obviously, you're going to use your own judgment on certain things that you put in your environment. But those are the things that I would say that you could do today. If you want to reach out on LinkedIn or anything like that, feel free. That was a lot and I didn't really expect to go that long, so sorry about that, guys.

Joseph Barringhaus (29:27): Oh, cool. You've actually got some more time. Your session, you're doing great on time. If you want to go into some of your... I know you were working with using SCPs yourself manually and validating, and QA'ing them with ChatGPT. If you want to dive into that a little bit, we've got some time today.

Cole Horsman (29:41): All right.

Joseph Barringhaus (29:42): Let's do it.
Cole Horsman (29:44): Let me share a different tab here.

Joseph Barringhaus (29:46): And I'll go backstage for a minute, but I'll come back on for Q&A at the end. Keep the questions coming in the chat, y'all. Cole is going through this with us.

Cole Horsman (29:53): Yeah. So this again was just more or less, how do you do more with less? So I just asked ChatGPT, "Write a process to ensure that the service control policy is working as intended." And what I really want to do is, "Hey, is my service control policy that blocks regions working? I don't have any way of knowing that. I set up this thing and somebody said it worked, but I don't know if it's actually working, if certain services work. How do I test that?" ChatGPT has got some advice that you can run through, but it wasn't exactly what I was looking for. I'm more or less like, "Give me the runbook."

(30:28): So I got some of the takeaways from what it initially offered and then I ran through and said, "Write a Python script to test that the service control policy is working in the us-east-1 region." And sure enough, it'll give you a policy, or a Python runbook, that will give you the structure of doing it. You've got to enter some of your own information. You've got to build it out, but you get where this is going. You can build this out in your environment. You can start to test things. You can save a lot of time. Like, I'm not a Python developer. Enough to be dangerous, but this is something where you can get your baseline started, save yourself some time on getting it deployed, and then you can do some follow-up questions and interact with ChatGPT a little bit further to say, "Okay. Here's the next iteration."

(31:17): Validate that it's working. Send the results to an SNS topic. So now I want to get a message, "Hey, this failed. I got an access denied here," or something like that. Okay, great. So I need feedback that tells me that's working. I want to run that on a schedule.
All right, every month or every week, or whatever the schedule is, tell me every week that my service block is working. Or every day, I just want an output of what the service control policies say. Tell me you tried to deploy an S3 bucket that was unencrypted. Give me the results. Tell me you got an access denied. That's what I'm looking for. And say, "Hey, success," if that happens.

(31:55): So just a real world scenario; as you build this out, you could see where this is going. You could see the storyline of me interacting with ChatGPT and how you could do that. So other things that you could do, I don't know about testing this, but like blocking root and things like that. You could develop your service control policies with ChatGPT. You can go ask it to create a service control policy to block root. It will tell you the basics and, again, do your independent testing, but it will give you a service control policy to block those services that you need to.

(32:25): So I encourage people and even my team to go out and do some testing on their own, use ChatGPT in a safe environment, use logic to pick out the pieces that you're going to have to build on your own, but bring that in to get you that starting point and get the creativity going.

(32:50): So I think that's more or less it from a demo perspective. I mean, it's really not a demo, but I just wanted to show, I guess, how to get through it and use AI to your advantage with ChatGPT, and how you can use it in your organization safely.

Joseph Barringhaus (33:07): Awesome. Thanks, Cole. I'm going to have you stop sharing if you don't mind and we're going to go through some questions that have come...
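The SCP test loop described in the demo (attempt something the guardrail should block, expect an access denied, report the result) might be sketched roughly as follows. This is not the script shown in the session. The allowed regions, the probe region, and the names `REGION_DENY_SCP`, `probe`, and `run_live_checks` are assumptions for illustration; the `boto3` calls used (`client`, `describe_instances`, SNS `publish`) are real, but the live part is left uncalled so the block can be read and tested without AWS credentials.

```python
import json

# Assumption: a region-restriction SCP in the spirit of the "deny the regions"
# guardrail from the talk. The region list and exempted global services are examples.
ALLOWED_REGIONS = ["us-east-1", "us-east-2"]
REGION_DENY_SCP = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUnapprovedRegions",
            "Effect": "Deny",
            "NotAction": ["iam:*", "organizations:*", "sts:*", "support:*"],
            "Resource": "*",
            "Condition": {"StringNotEquals": {"aws:RequestedRegion": ALLOWED_REGIONS}},
        }
    ],
}

SCP_MAX_CHARS = 5120  # AWS Organizations size quota for a single SCP document

# Error codes that AWS services commonly return for a denied call.
DENIED_CODES = {"AccessDenied", "AccessDeniedException", "UnauthorizedOperation"}


def scp_size_ok(policy):
    """Check the minified policy against the SCP character limit."""
    return len(json.dumps(policy, separators=(",", ":"))) <= SCP_MAX_CHARS


def is_access_denied(exc):
    """True if the exception carries a botocore-style denied error code."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") in DENIED_CODES


def probe(name, call):
    """Run a call the guardrail should block; PASS means it was denied."""
    try:
        call()
    except Exception as exc:  # broad on purpose: every outcome is reported
        if is_access_denied(exc):
            return {"check": name, "result": "PASS", "detail": "denied as expected"}
        return {"check": name, "result": "ERROR", "detail": repr(exc)}
    return {"check": name, "result": "FAIL", "detail": "call succeeded; not blocked"}


def run_live_checks(topic_arn, blocked_region="eu-west-3"):
    """Live variant (not called here): needs boto3, credentials, and an SNS topic."""
    import boto3  # third-party; imported lazily so the helpers above stay stdlib-only

    ec2 = boto3.client("ec2", region_name=blocked_region)
    results = [probe(f"region-deny {blocked_region}", ec2.describe_instances)]
    boto3.client("sns").publish(TopicArn=topic_arn, Message=json.dumps(results))
    return results
```

An access denied on its own does not prove the SCP caused it, since a missing IAM permission produces the same error codes. The probing role would need the permission through IAM so that a denial can be attributed to the guardrail. Running `run_live_checks` on a schedule would give the daily or weekly confirmation the speaker asks for.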
