With the increased adoption of cloud services, organizations are moving their critical workloads to AWS. Some of these workloads store, process, and analyze sensitive data that must be audited to satisfy security and compliance requirements. The most common auditor questions concern who logged in to the system and when, who queried which sensitive data and when, and when a user last modified or updated their credentials.
By default, Amazon Redshift logs all information related to user connections, user modifications, and user activity on the database. However, to efficiently manage disk space, log tables are retained for only 2–5 days, depending on log usage and available disk space. To retain the log data for a longer period of time, enable database audit logging. After it's enabled, Amazon Redshift periodically pushes the data to a configured S3 bucket.
Amazon Redshift Spectrum is a recently released feature that enables querying and joining data stored in Amazon S3 with Amazon Redshift tables. With Redshift Spectrum, you can retrieve the audit data stored in S3 to answer all security and compliance–related questions, and you can join the datasets stored in S3 with the tables in the database. It supports files in Parquet, text file (CSV, pipe-delimited, TSV), SequenceFile, and RCFile formats, and compression types such as gzip, Snappy, and bzip2.
In this post, I demonstrate how to query the Amazon Redshift audit data logged to S3 to answer the common audit questions described above.
You set up the following resources:
- Amazon Redshift cluster and parameter group
- IAM role and policies to give Redshift Spectrum access to Amazon S3
- Redshift Spectrum external tables
Prerequisites
- Create an AWS account
- Configure the AWS CLI to access your AWS account
- Get access to a query tool compatible with Amazon Redshift
- Create an S3 bucket
The Amazon Redshift cluster must:
- Be in the same region as the S3 bucket storing the audit log files.
- Be version 1.0.1294 or later.
- Have read bucket and put object permissions on the S3 bucket to be configured for logging.
- Have an IAM role attached with at least the AmazonS3ReadOnlyAccess and AmazonAthenaFullAccess managed policies.
Set up Amazon Redshift
Create a new parameter group to enable user activity logging:
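A minimal sketch of this step using the AWS CLI might look like the following; the parameter group name and description are placeholders you can change:

```shell
# Create a new parameter group (name and description are examples).
aws redshift create-cluster-parameter-group \
    --parameter-group-name rs-audit-pg \
    --parameter-group-family redshift-1.0 \
    --description "Parameter group with user activity logging enabled"

# Turn on user activity logging in the new parameter group.
aws redshift modify-cluster-parameter-group \
    --parameter-group-name rs-audit-pg \
    --parameters ParameterName=enable_user_activity_logging,ParameterValue=true
```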
Create the Amazon Redshift cluster using the new parameter group created:
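One possible CLI invocation for this step is shown below; the cluster identifier, node type, and credentials are placeholders, and the parameter group name assumes the one created in the previous step:

```shell
# Launch a single-node cluster that uses the audit parameter group.
# Replace the identifier, username, and password with your own values.
aws redshift create-cluster \
    --cluster-identifier rs-audit-demo \
    --node-type dc1.large \
    --cluster-type single-node \
    --master-username admin \
    --master-user-password 'Password#123' \
    --cluster-parameter-group-name rs-audit-pg
```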
Wait for the cluster to build. When it is complete, enable audit logging:
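Assuming the placeholder cluster identifier and an S3 bucket you have already created, the check-and-enable step might look like this:

```shell
# Poll until the cluster status is "available".
aws redshift describe-clusters \
    --cluster-identifier rs-audit-demo \
    --query 'Clusters[0].ClusterStatus'

# Enable audit logging to the S3 bucket (bucket name and prefix are examples).
aws redshift enable-logging \
    --cluster-identifier rs-audit-demo \
    --bucket-name my-audit-log-bucket \
    --s3-key-prefix auditlogs
```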
Set up Redshift Spectrum
To set up Redshift Spectrum, create an IAM role and policies, an external database, and external tables.
IAM role and policies
Create an IAM role for Redshift Spectrum to access S3 bucket from Redshift database:
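A sketch of the role creation, assuming a hypothetical role name SpectrumRole and a trust policy that lets the Redshift service assume the role:

```shell
# Trust policy allowing Amazon Redshift to assume the role.
cat > redshift-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "redshift.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role \
    --role-name SpectrumRole \
    --assume-role-policy-document file://redshift-trust.json
```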
Attach built-in policies AmazonS3ReadOnlyAccess and AmazonAthenaFullAccess for Amazon Redshift to access S3 using Redshift Spectrum:
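Assuming the role name from the previous step, attaching the two AWS managed policies might look like this:

```shell
aws iam attach-role-policy \
    --role-name SpectrumRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

aws iam attach-role-policy \
    --role-name SpectrumRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonAthenaFullAccess
```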
Add the role that you just created to the Amazon Redshift cluster. Replace the account number with your account number.
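A sketch of this step, assuming the placeholder cluster identifier and role name used earlier:

```shell
# Replace 123456789012 with your own AWS account number.
aws redshift modify-cluster-iam-roles \
    --cluster-identifier rs-audit-demo \
    --add-iam-roles arn:aws:iam::123456789012:role/SpectrumRole
```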
At this point, Redshift Spectrum is completely configured to access S3 from the Amazon Redshift cluster.
External database and schema
While you are logged in to the Amazon Redshift database, set up an external database and schema that support creating external tables, so that you can query data stored in S3.
An external database created in Amazon Redshift is stored in the Amazon Athena data catalog and can be accessed directly from Athena. It maps a database, whose metadata lives either in the Athena data catalog or in a user-specified Hive metastore, to a local schema inside the Amazon Redshift cluster. This mapping allows table definitions to be shared between Athena and Amazon Redshift, or even across different setups of Amazon Redshift, Apache Spark, and Apache Hive.
To query audit log data, create an external database and schema in Amazon Redshift. Update the account number, role name, and region name below before running the DDL.
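The DDL might look like the following sketch; the schema name, database name, role ARN, and region are placeholders:

```sql
-- Create an external schema backed by the Athena data catalog.
-- Update the account number, role name, and region before running.
create external schema spectrum_schema
from data catalog
database 'spectrumdb'
iam_role 'arn:aws:iam::123456789012:role/SpectrumRole'
region 'us-east-1'
create external database if not exists;
```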
The region parameter specifies the region of the Athena data catalog. By default, this is the same region as the Amazon Redshift cluster.
Create external tables
Redshift Spectrum supports querying S3 data by creating external tables. External tables can be created in Amazon Redshift, Athena, or the Hive metastore. External tables are read-only. Currently, data in S3 files cannot be modified using this feature.
An external table is always referenced by prefixing the table name with the schema name.
Create three tables to query three different audit log files:
- User DDL: Captures the changes made to user definitions or any users created or deleted.
- User connection: Logs all successful and unsuccessful logon attempts.
- User activity: Captures all queries executed by users.
The user DDL and user connection logs are pipe-delimited text files, and the user activity log is free-form text with each query delimited by a new line. All three logs are compressed with the gzip utility, which Redshift Spectrum handles transparently.
Update the S3 bucket locations in the following queries with your own bucket.
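The table definitions might look like the following sketch. The schema, table names, and S3 locations are placeholders, and the column lists follow the layouts documented for the Redshift audit log files — verify them against your own log files before running:

```sql
-- User DDL log: pipe-delimited, gzip-compressed.
create external table spectrum_schema.s_userlog (
    userid        int,
    username      varchar(50),
    oldusername   varchar(50),
    action        varchar(10),
    usecreatedb   varchar(10),
    usesuper      varchar(10),
    usecatupd     varchar(10),
    valuntil      varchar(50),
    pid           int,
    xid           bigint,
    recordtime    varchar(50)
)
row format delimited
fields terminated by '|'
stored as textfile
location 's3://my-audit-log-bucket/auditlogs/userlog/';

-- User connection log: pipe-delimited, gzip-compressed.
create external table spectrum_schema.s_connectionlog (
    event          varchar(50),
    recordtime     varchar(50),
    remotehost     varchar(32),
    remoteport     varchar(32),
    pid            int,
    dbname         varchar(50),
    username       varchar(50),
    authmethod     varchar(32),
    duration       bigint,
    sslversion     varchar(50),
    sslcipher      varchar(128),
    mtu            int,
    sslcompression varchar(64),
    sslexpansion   varchar(64),
    iamauthguid    varchar(128)
)
row format delimited
fields terminated by '|'
stored as textfile
location 's3://my-audit-log-bucket/auditlogs/connectionlog/';

-- User activity log: free-form text, one query per line.
create external table spectrum_schema.s_activitylog (
    logtext varchar(20000)
)
row format delimited
lines terminated by '\n'
stored as textfile
location 's3://my-audit-log-bucket/auditlogs/activitylog/';
```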
Demonstrate the use case
To demonstrate the use cases in this post, create a user "guest," log in as that user, and create a table, "person." Then, query the table to generate some user activity.
Log in as admin user and create a new user “guest”:
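For example (the password here is a placeholder that satisfies Redshift's password rules):

```sql
create user guest password 'Guest#123';
```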
Log in as user “guest” and run the DDLs below:
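A minimal sketch of the DDL; the column list and sample rows are illustrative:

```sql
create table person (
    id    int,
    name  varchar(100),
    city  varchar(100)
);

insert into person values (1, 'Sam', 'Seattle');
insert into person values (2, 'Ann', 'Austin');
```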
While you are still logged in as guest, run some ad hoc queries on the person table to generate the activity log:
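Any simple queries will do; for example:

```sql
select count(*) from person;
select * from person where city = 'Seattle';
select name from person order by id;
```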
In the next sections, I demonstrate one use case for each type of log generated in the S3 bucket.
User Definition Log
The following query identifies whether the definition of the user 'guest' was altered during the day.
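Assuming the external table is named spectrum_schema.s_userlog as sketched earlier, the query might look like this (the date filter is a hypothetical example):

```sql
select username, action, recordtime
from spectrum_schema.s_userlog
where username = 'guest'
  and recordtime like '2017-06-01%'
order by recordtime;
```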
User Connection Log
The following query identifies the remote host from which the user 'guest' logged on, the logon time, and so on.
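Assuming the external table is named spectrum_schema.s_connectionlog as sketched earlier:

```sql
select event, recordtime, remotehost, username
from spectrum_schema.s_connectionlog
where username like '%guest%'
order by recordtime;
```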
User Activity Log
The following query pulls information on when the person table was accessed and by which users.
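Because the activity log is free-form text, a simple pattern match on the logged query text works; this sketch assumes the external table name spectrum_schema.s_activitylog used earlier:

```sql
select logtext
from spectrum_schema.s_activitylog
where logtext like '%person%'
  and logtext not like '%s_activitylog%';  -- exclude this query itself
```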