Knowit AB

03/24/2023 | News release | Distributed by Public on 03/25/2023 15:40

What is Amazon Security Lake?

On 29 November 2022, at the most recent re:Invent, AWS announced an exciting new security service, Amazon Security Lake (currently in preview), supported by major industry players such as Cisco and Splunk. Whether you are a SysOps engineer, security administrator or data engineering manager looking to improve security management across your AWS organization, this may be the service for you.

Security Lake gives your organization an all-around capability to manage and normalize all of its security data in one centralized data lake, storing records in Parquet format according to the Open Cybersecurity Schema Framework (OCSF) standard. Relying on that standard, the key feature of Security Lake is that both AWS service-specific and external data sources are transformed into this common data format for downstream consumption, for example threat detection and incident response automation. Example AWS subscribers include OpenSearch, Athena and SageMaker, but developers are free to write their own OCSF-compatible subscribers for Security Lake.
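To make the idea of normalization concrete, here is a minimal sketch of two differently shaped vendor records being mapped onto one common record shape. The vendor formats, their field names and the two mapping functions are hypothetical, and the output is only a tiny OCSF-style subset (time, severity_id, message), not a complete or validated OCSF event. The time value is assumed here to be milliseconds since the epoch.

```python
from datetime import datetime, timezone

# Severity names mapped to numeric severity_id values; 0 is used for "unknown".
SEVERITY_IDS = {"informational": 1, "low": 2, "medium": 3, "high": 4, "critical": 5}


def from_vendor_a(raw: dict) -> dict:
    """Hypothetical vendor A: ISO 8601 timestamps (assumed UTC), textual severity."""
    ts = datetime.fromisoformat(raw["timestamp"]).replace(tzinfo=timezone.utc)
    return {
        "time": int(ts.timestamp() * 1000),
        "severity_id": SEVERITY_IDS.get(raw["severity"].lower(), 0),
        "message": raw["msg"],
    }


def from_vendor_b(raw: dict) -> dict:
    """Hypothetical vendor B: epoch seconds, numeric priority from 1 to 5."""
    return {
        "time": raw["epoch_s"] * 1000,
        "severity_id": raw["prio"] if raw["prio"] in range(1, 6) else 0,
        "message": raw["text"],
    }


a = from_vendor_a({"timestamp": "2023-03-24T10:00:00", "severity": "High", "msg": "Port scan"})
b = from_vendor_b({"epoch_s": 1679652000, "prio": 4, "text": "Port scan"})
print(a)
print(b)
```

Once both records share the same keys and value conventions, a downstream consumer no longer needs to know which vendor produced them.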

Laying the context

We live in the age of big data production, where every year we produce more data than in all earlier years combined. With the abundance of data comes an abundance of data-driven tools developed by various vendors, and the cybersecurity sector is no exception. For a security engineer whose aim is to automate the remediation of security incidents, the abundance of different tools and formats distracts from the real work of detection and incident response automation.

Currently, cybersecurity data management faces issues such as vendor log format incompatibility, an abundance of mutually incompatible security formats, security data ownership and lifecycle concerns, solution provider lock-in due to data schema incompatibilities, underutilization of open-source standards and cost-ineffective solutions. This diverse set of issues motivates the development of a solution that addresses them and helps improve the organization's general cybersecurity posture.

The solution to the aforementioned issues should facilitate the operationalization, management, storage and consumption of cybersecurity data while keeping storage, querying and ingestion costs to a minimum. It should also facilitate threat detection automation and integration with both third-party SIEM and open-source tools.

Why does it matter?

According to Forbes, cybercrime is growing exponentially, with its cost expected to increase by 2.5 trillion USD over the next two years. Not a day goes by without news of cyberattacks, and major vendors in the industry such as Google, Amazon and Microsoft have launched their own security tools. In terms of the AWS offering, AWS Security Hub has been the most comprehensive security solution from AWS so far. But has it proved itself to be the king of the cybersecurity services demanded in the industry?

If you have used AWS Security Hub before, you may have found that its dashboard update interval could be shorter to stay relevant during real-time incidents, and that it can produce a considerable number of false-positive findings, which could perhaps be reduced by an AI system. On the other hand, its Security Score is a very useful metric that derives its inputs from multiple services, both third-party and AWS, including for example GuardDuty, which can leverage artificial intelligence-based algorithms trained on third-party datasets.

But what if the customer wants to leverage specific datasets that are either generated by their own applications or rely on historical trends from third-party providers? And what if they need near real-time detection of events that would not be visible to the human eye, along with quick access to enriched historical data? Today no single AWS service does all of that, but Security Lake can get you very close to meeting that requirement.

Problem faced by security engineers currently

Every week you can read news about how AI is automating human jobs. Nevertheless, in order to leverage AI and free humans to do more meaningful work, the data needs to be homogenized. Most cybersecurity teams and analysts in the world still clean and transform their data manually, which can lead to fatigue, burnout and boredom at work. So the need for standardization and automation of cybersecurity data is evident. With regard to implementing standardization on AWS, the first framework that may come to mind is the AWS Well-Architected Framework.

Among other things, the Well-Architected Framework refers to ready-made playbooks for incident response. More specifically, SEC04-BP04 (Implement actionable security events) mentions runbooks (enabling consistent and prompt responses to well-understood events by documenting procedures in runbooks) and playbooks (predefined steps to perform to identify an issue).


However, there's a difference between automating the remediation of a Security Hub finding (see the image below) and a real security attack scenario.

Reference: AWS, 2023. How to automate security remediation based on AWS Security Hub findings. Automated Security Response on AWS.

What would the specific steps in the runbook be when leveraging 10 different cybersecurity data sources in different formats? Although the AWS Solutions Library includes sample solutions for automatic security incident remediation, the concrete way to resolve an attack cannot always be predicted beforehand, so there would not be a ready-made playbook or pre-deployed AWS Lambda function for it. If, as a result, the security engineer has to parse third-party vendor data manually, the delay could cost the organization under attack millions of euros. Oops.

Enter Security Lake

And this is where Security Lake enters the scene again: fast (hourly) data homogenization that could be used for real-time incident remediation is where it excels. In a real-life scenario, CloudWatch alarms or SNS messages will trigger appropriate responses for simpler automations, but for more complex responses the security team has to step in, turning to the hourly data generated by Security Lake for further insight or leveraging a pre-built automation on top of that data. The Security Pillar therefore currently doesn't seem to cover more complex attack scenarios, for which more complex, possibly security-data-driven applications need to be developed and deployed.

Let's now take a look at how the cloud security posture, previously controlled by AWS Security Hub, is enhanced with the introduction of Security Lake.

Security synergies - Combining Security Hub and Security Lake

Security Hub and Security Lake are complementary services meant to improve the security posture of your entire AWS organization. Let's look a bit closer at the similarities and differences between them.

Security Hub does not itself collect any data but rather focuses on compliance and on snapshotting the overall security posture of the organization. The main focus of Security Lake, on the other hand, can be summarized as CONA, which describes the different phases of cybersecurity data management:
  • Centralize (data governance)
  • Optimize (storage and querying)
  • Normalize (data formats)
  • Analyze (clean data in OCSF format)

Security Hub has a central dashboard, whereas Security Lake has no built-in dashboard: you need to add an OpenSearch dashboard as a subscriber.

As for similarities, both Security Hub and Security Lake use one particular data format and both can integrate with third-party providers. The difference is that Security Lake is based on the OCSF data format, which is open source, whereas Security Hub leverages ASFF (AWS Security Finding Format), which is proprietary to AWS.

Security Hub supports custom data sources as well, so an organization can get security findings from all supported Security Hub partners.

Both services relieve your development team of building custom integrations with individual AWS services such as Macie, GuardDuty or Inspector. To develop an automation, the developer just needs to integrate with Security Hub or Security Lake directly.

Making these two services work in tandem creates a strong base for enhancing your security defenses against malicious actors. Security Hub could be encapsulated as "Security today", focusing on general principles, guidelines and security hygiene, whereas Security Lake could be summarized as "Security for tomorrow", in the sense that its data acts as a foundation on which predictive automations, not just descriptive ones, can be built.

Now that we have a clear understanding of where Amazon Security Lake is positioned, let's summarize why it would be a good idea to start using it in your AWS organization.

Why adopt Security Lake in your AWS organization?

  1. Solves the challenge of security data format heterogeneity

  2. Easy for developers to work with because it conforms to a central standard

    The consumer application developed just needs to integrate with Security Lake (by becoming a subscriber). That way, building a custom solution leveraging Security Lake is a straightforward process.

  3. Easy to test and launch

    It takes around 5 minutes to launch the service and in 1 hour you will already have cleaned data, possibly from many accounts and regions ready to be checked out.

  4. Scaling security throughout the organization - protection also for your application data

    For custom sources, all that is required is an application that writes data to S3 in a certain partitioned format, for example using the AWS SDK for pandas library (formerly AWS Data Wrangler), but any service that writes the data in the correct way is supported.

  5. Future-oriented, bulletproofing against future attack vectors possibly driven by AI

    The topic of cloud and cybersecurity is becoming increasingly important, and conventional rule-based systems won't be enough to handle the complex attacks of the future. Imagine algorithms with a pseudo-intelligence comparable to that of GPT-4 or ChatGPT being used to generate new synthetic attack vectors on the fly (generative models being the current trend). These patterns can change over time and may be very difficult for humans to detect.

  6. No upfront pricing

    Pricing depends on the use of the underlying services (mostly AWS Glue). Keeping it running in 7 regions with a few data sources in a minimal configuration kept the monthly cost under 5 dollars, all of it originating from the other services (there was no charge for the Security Lake service itself). Read more about the pricing.
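As an illustration of the custom source point above, the following minimal sketch builds a partitioned S3 object key for a record batch. The key layout used here (a source prefix followed by region, accountId and eventDay partitions) is an assumption made for illustration only, and the function name and arguments are hypothetical. The exact layout that Security Lake expects should be taken from the AWS documentation.

```python
from datetime import datetime, timezone


def partitioned_key(source: str, region: str, account_id: str,
                    event_time: datetime, filename: str) -> str:
    """Build an S3 object key partitioned by region, account and event day.

    The partition names below are an assumed layout, not a confirmed
    Security Lake requirement.
    """
    # Partition on the UTC calendar day of the event.
    day = event_time.astimezone(timezone.utc).strftime("%Y%m%d")
    return (
        f"ext/{source}/region={region}/accountId={account_id}/"
        f"eventDay={day}/{filename}"
    )


key = partitioned_key(
    source="my-app",                      # hypothetical custom source name
    region="eu-north-1",
    account_id="123456789012",            # placeholder account ID
    event_time=datetime(2023, 3, 24, 23, 30, tzinfo=timezone.utc),
    filename="part-0.parquet",
)
print(key)
```

In practice, the Parquet file itself could be written with any tool that supports partitioned output. The sketch only shows how the partition path can be derived from the event metadata.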


How to deploy Security Lake

Getting started with Amazon Security Lake

Use case: How to synergize Security Lake data and build a security threat detection application

Assume an AWS organization has deployed Security Lake across many accounts over many environments. The data collected from sources 1 and 2 would be in this format:

We can see that, irrespective of the data source, the datasets share a common schema. Using this, we can build a machine learning application. Since all the data already comes with a severity_id label, there is no need to predict it; instead, we are interested in classifying the type of a potential cyberattack. One way to achieve that is to build a deep learning pipeline. It may require models pre-trained on vulnerability databases and substantial computational resources in the training phase, depending on the required performance of the model.
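A minimal sketch of what the shared schema buys us: the same few lines of code can process records regardless of which source produced them. The sample records below are made up for illustration and are not actual Security Lake output. The field selection and helper names are likewise hypothetical.

```python
from collections import Counter

# Made-up sample records from two sources, sharing the same field names.
records = [
    {"source": "source-1", "class_uid": 2001, "severity_id": 4, "time": 1679652000000},
    {"source": "source-2", "class_uid": 4001, "severity_id": 2, "time": 1679652060000},
    {"source": "source-2", "class_uid": 4001, "severity_id": 4, "time": 1679652120000},
]

# Fields used as model input; the choice is illustrative only.
FEATURES = ("class_uid", "severity_id", "time")


def to_feature_row(record: dict) -> list:
    """Extract the same ordered feature values from any record."""
    return [record[name] for name in FEATURES]


def severity_histogram(rows: list) -> Counter:
    """Count records per severity_id across all sources."""
    return Counter(row["severity_id"] for row in rows)


print([to_feature_row(r) for r in records])
print(severity_histogram(records))
```

No per-source branching is needed, which is what makes a single feature pipeline for model training feasible.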

The solution entails converting the Security Lake data (especially Security Hub findings and VPC Flow Logs, i.e. packet-level data) into images and (re)training a DCGAN, a deep convolutional generative adversarial network.
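The article does not specify how the records are turned into images. One simple possibility, shown here purely as an assumption, is to min-max scale a vector of numeric features to the 0-255 range and lay the values out row by row in a square grayscale grid, padding the remainder with zeros. The function name and padding choice are hypothetical.

```python
def to_grayscale_image(values: list[float], side: int) -> list[list[int]]:
    """Scale values to 0..255 and arrange them in a side x side grid.

    Unused cells at the end of the grid are padded with 0.
    """
    if not values:
        raise ValueError("values must not be empty")
    if len(values) > side * side:
        raise ValueError("too many values for the requested image size")

    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant input
    pixels = [round(255 * (v - lo) / span) for v in values]
    pixels += [0] * (side * side - len(pixels))

    # Slice the flat pixel list into rows.
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


image = to_grayscale_image([0, 2, 10], side=2)
print(image)
```

A real pipeline would need a more careful encoding (for example, handling categorical fields and keeping the scaling consistent across samples), but the grid layout is what lets a convolutional network consume tabular security data.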

A generative adversarial network is built on the idea of a generator that produces synthetic (fake) data in an attempt to fool the discriminator, and a discriminator that tries to correctly separate the real data from the fake data, as exemplified by the simplified diagram below.

Reference: Agrawal, R. 2021. An End-to-End Introduction to Generative Adversarial Networks (GANs). Analytics Vidhya.
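The tug-of-war between the two networks can be written down as two loss functions. The sketch below uses the standard GAN formulation with the commonly used non-saturating generator loss. It is a textbook illustration for single samples, not the specific training setup of the solution described here. The arguments d_real and d_fake stand for the discriminator's estimated probability that a real sample, respectively a generated sample, is real.

```python
import math


def discriminator_loss(d_real: float, d_fake: float) -> float:
    """Low when real samples score near 1 and generated samples near 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake: float) -> float:
    """Low when the discriminator believes generated samples are real."""
    return -math.log(d_fake)


# A discriminator that separates well has a lower loss than a confused one.
print(discriminator_loss(0.9, 0.1), discriminator_loss(0.6, 0.4))

# The generator's loss drops as its samples fool the discriminator more often.
print(generator_loss(0.1), generator_loss(0.9))
```

Training alternates between lowering the first loss by updating the discriminator and lowering the second by updating the generator.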

The idea behind this kind of generative modelling is to learn a "simplified" representation of the security data, in order to improve the security team's understanding of it and to generate new security data that resembles possible future cyberattack data, for the enhancement of threat detection.


Reference: Karimi-Bidhendi, S., Arafat, A., Cheng, A., Wu, Y., Kheradvar, A. & Jafarkhani, H. 2020. Deep convolutional generative adversarial network (DCGAN) architecture. Journal of Cardiovascular Magnetic Resonance.

We can then train this neural network in an adversarial fashion and use the resulting models for threat detection.

The advantage of a DCGAN compared to supervised learning models is that it avoids the need for labeled classes, i.e. identifying "attack" vs. "no attack" and sub-classifying the attack type. The model will be able to generate new data similar to the known attack patterns, and this knowledge can then be leveraged in the application. To make the model more robust, transfer learning from known cyberattack databases would be necessary.

We can then deploy the DCGAN generator network to generate new security training data that looks like the data generated by Amazon Security Lake and use it, for example, for visualization purposes to better understand the structure of the data.

For direct integration with Security Lake, we can deploy the DCGAN discriminator network to run inference on new feature data obtained from Security Lake. It can be deployed as a Security Lake subscriber and, with the power of intelligent predictions, it can save the organization significant costs in the event of a cyberattack.
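A minimal sketch of the inference step inside such a subscriber follows. How the subscriber receives new records from Security Lake is left out, and the function names, the threshold and the toy scoring function are all hypothetical. In a real deployment the score callable would wrap the trained discriminator, and how its output is interpreted as a threat signal is a modelling decision.

```python
from typing import Callable, Iterable


def flag_suspicious(records: Iterable[dict],
                    score: Callable[[dict], float],
                    threshold: float = 0.5) -> list[dict]:
    """Return the records whose score reaches the threshold, with the score attached."""
    flagged = []
    for record in records:
        value = score(record)
        if value >= threshold:
            flagged.append({**record, "score": value})
    return flagged


def toy_score(record: dict) -> float:
    """Stand-in for the discriminator: scales severity_id into 0..1."""
    return min(1.0, record.get("severity_id", 0) / 5)


batch = [{"severity_id": 5}, {"severity_id": 1}]
print(flag_suspicious(batch, toy_score))
```

Keeping the model behind a plain callable makes it easy to swap the toy scorer for a real one without touching the subscriber logic.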

Knowit is your trusted Well-Architected partner

Sign up for a Well-Architected Review here.