Safely Manage AWS Security in Apache Spark

In this post, we will learn how to safely manage AWS security in Apache Spark. Apache Spark provides several filesystem clients (s3, s3n, s3a) for reading from and writing to Amazon S3. Because the s3 block filesystem is deprecated and the s3n native filesystem is no longer maintained, the code examples in this post use the recommended s3a filesystem client. You can read more about the differences between s3, s3n, and s3a here: https://wiki.apache.org/hadoop/AmazonS3.

Prerequisites:

  • spark-core_2.11 2.2.0
  • hadoop-client 2.8.1
  • aws-java-sdk 1.11.145
  • Running Spark cluster on AWS

Access S3 filesystem using AWS user key/secret

The simplest way to configure AWS credentials is to create an access key/secret pair from the IAM Management Console and embed them in the Spark configuration. SparkS3EmbeddedCredentials uses this AWS key/secret to read from and write to an S3 bucket. Let's create the AWS access key/secret first:

S3 filesystem access with embedded credentials

We need to set the newly created access key and secret on the fs.s3a.access.key and fs.s3a.secret.key Spark configuration properties respectively. The following code snippet shows how to configure the AWS key/secret to fetch data from an S3 bucket. We also need to configure the S3 input file by replacing <s3-bucket-name> and <input-file-name> with the actual bucket and input file names.

// Set the AWS access key/secret directly on the Hadoop configuration used by Spark
Configuration configuration = sc.hadoopConfiguration();
configuration.set("fs.s3a.access.key", "<AWS-ACCESS-KEY-ID>");
configuration.set("fs.s3a.secret.key", "<AWS-SECRET-ACCESS-KEY>");
// Read the input file from S3 using the s3a client
JavaRDD<String> s3InputRDD = sc.textFile("s3a://<s3-bucket-name>/<input-file-name>");

Download complete source code: SparkS3EmbeddedCredentials.java

Sample command line:

$ spark-submit --class com.spark.aws.samples.s3.SparkS3EmbeddedCredentials aws-spark-samples-0.0.1-SNAPSHOT.jar s3a://spark.data.com/shakespeare-1-100.txt s3a://spark.data.com/output

S3 filesystem access by Java system properties

To configure the key/secret using Java system properties, we need to pass -Daws.accessKeyId=<AWS-ACCESS-KEY-ID> and -Daws.secretKey=<AWS-SECRET-ACCESS-KEY> on the java command line. With Spark, we pass these properties through the --driver-java-options option. Spark command line example:

$ spark-submit --driver-java-options="-Daws.accessKeyId=<AWS-ACCESS-KEY-ID> -Daws.secretKey=<AWS-SECRET-ACCESS-KEY>" --class com.spark.aws.samples.s3.SparkS3CommandLineCredentials aws-spark-samples-0.0.1-SNAPSHOT.jar s3a://spark.data.com/shakespeare-1-100.txt s3a://spark.data.com/output

SystemPropertiesCredentialsProvider loads credentials from the Java system properties aws.accessKeyId and aws.secretKey. The following code snippet shows how to fetch credentials using SystemPropertiesCredentialsProvider.

Configuration configuration = sc.hadoopConfiguration();
// Load the key/secret passed as -Daws.accessKeyId / -Daws.secretKey on the command line
SystemPropertiesCredentialsProvider provider = new SystemPropertiesCredentialsProvider();
configuration.set("fs.s3a.access.key", provider.getCredentials().getAWSAccessKeyId());
configuration.set("fs.s3a.secret.key", provider.getCredentials().getAWSSecretKey());
JavaRDD<String> s3InputRDD = sc.textFile("s3a://spark.data.com/shakespeare-1-100.txt");

Download complete source code: SparkS3CommandLineCredentials.java

S3 filesystem access by credentials configuration file

AWS provides the aws configure command, which stores credentials in the file ~/.aws/credentials. To use the AWS command line interface, you first need to install it (AWS doc: Installing the AWS command line interface). Configure the key/secret using the aws configure command (AWS doc: Configuring the AWS CLI):

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json

The ProfileCredentialsProvider class automatically loads credentials from the AWS credentials file. The following code snippet shows how to fetch credentials using ProfileCredentialsProvider.

Configuration configuration = sc.hadoopConfiguration();
// Load the key/secret stored in ~/.aws/credentials by aws configure
ProfileCredentialsProvider provider = new ProfileCredentialsProvider();
configuration.set("fs.s3a.access.key", provider.getCredentials().getAWSAccessKeyId());
configuration.set("fs.s3a.secret.key", provider.getCredentials().getAWSSecretKey());
JavaRDD<String> s3InputRDD = sc.textFile("s3a://<s3-bucket-name>/<input-file-name>");

Download complete source code: SparkS3ConfigFileCredentials.java

S3 filesystem access by environment variables

AWS loads credentials from system environment variables: AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) for the access key and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY) for the secret. System environment variable example (Ubuntu):

satish@satish-CMP:~$ export AWS_ACCESS_KEY="my-access-key"
satish@satish-CMP:~$ export AWS_SECRET_KEY="my-secret-key"
satish@satish-CMP:~$ echo $AWS_ACCESS_KEY $AWS_SECRET_KEY
my-access-key my-secret-key

The EnvironmentVariableCredentialsProvider class automatically loads credentials by searching the AWS_ACCESS_KEY_ID/AWS_ACCESS_KEY and AWS_SECRET_ACCESS_KEY/AWS_SECRET_KEY system environment variables. In the case of temporary credentials, the AWS_SESSION_TOKEN variable also needs to be set in the environment. The following code snippet shows how to fetch credentials using EnvironmentVariableCredentialsProvider.

Configuration configuration = sc.hadoopConfiguration();
// Load the key/secret exported as environment variables
EnvironmentVariableCredentialsProvider provider = new EnvironmentVariableCredentialsProvider();
configuration.set("fs.s3a.access.key", provider.getCredentials().getAWSAccessKeyId());
configuration.set("fs.s3a.secret.key", provider.getCredentials().getAWSSecretKey());
JavaRDD<String> s3InputRDD = sc.textFile("s3a://<s3-bucket-name>/<input-file-name>");

Download complete source code: SparkS3EnvironmentCredentials.java

Access S3 filesystem using Temporary Credentials

AWS provides a temporary credentials feature through EC2 instance profiles, which is more secure than distributing an AWS key/secret (generated using the IAM console) with the application. Instance profile credentials have an expiry, after which we automatically get a fresh key/secret pair, and they can only be used from EC2 machines.
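
As a quick illustration (my own sketch, not part of the sample project), the following shows what these credentials look like when fetched on an EC2 instance with an instance profile attached; notice the session token, which marks them as temporary:

// Sketch: inspect the temporary credentials served by the EC2 instance metadata service.
// With refreshCredentialsAsync = true, the provider refreshes the credentials in the background before they expire.
InstanceProfileCredentialsProvider provider = new InstanceProfileCredentialsProvider(true);
BasicSessionCredentials credentials = (BasicSessionCredentials) provider.getCredentials();
System.out.println("Access key: " + credentials.getAWSAccessKeyId());
System.out.println("Session token present: " + (credentials.getSessionToken() != null));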

EC2 Instance Profile credentials workflow

The following image shows the workflow of EC2 instance profile temporary credentials.

Create Instance Profile IAM Role

AWS documentation: EC2 Instance Profiles

Open the IAM console https://console.aws.amazon.com/iam/ and go to the Roles section. Click on ‘Create Role’.

Select role type: AWS Service. Choose service: EC2. Select your use case: EC2. Click ‘Next: Permissions’.

Search and check AmazonS3FullAccess policy. Click ‘Next: Review’.

Under the Review section, enter the Role name: S3FullAccessInstanceProfile. Click ‘Create role’. Your role should now appear in the Roles list.
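
If you prefer to script this step instead of clicking through the console, here is a minimal sketch using the IAM client from the AWS Java SDK (an illustration I am adding, not part of the sample project; note that the console creates the instance profile for you, while the API requires it explicitly):

// Sketch: create the instance profile role programmatically (assumes the IAM module of aws-java-sdk is on the classpath).
AmazonIdentityManagement iam = AmazonIdentityManagementClientBuilder.defaultClient();
// Trust policy that allows the EC2 service to assume the role
String ec2TrustPolicy = "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\","
        + "\"Principal\":{\"Service\":\"ec2.amazonaws.com\"},\"Action\":\"sts:AssumeRole\"}]}";
iam.createRole(new CreateRoleRequest()
        .withRoleName("S3FullAccessInstanceProfile")
        .withAssumeRolePolicyDocument(ec2TrustPolicy));
// Attach the AWS managed AmazonS3FullAccess policy to the role
iam.attachRolePolicy(new AttachRolePolicyRequest()
        .withRoleName("S3FullAccessInstanceProfile")
        .withPolicyArn("arn:aws:iam::aws:policy/AmazonS3FullAccess"));
// Create an instance profile and add the role to it so it can be attached to an EC2 instance
iam.createInstanceProfile(new CreateInstanceProfileRequest().withInstanceProfileName("S3FullAccessInstanceProfile"));
iam.addRoleToInstanceProfile(new AddRoleToInstanceProfileRequest()
        .withInstanceProfileName("S3FullAccessInstanceProfile")
        .withRoleName("S3FullAccessInstanceProfile"));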

Attach IAM role to EC2 Instance

Now we need to attach the S3FullAccessInstanceProfile role to the EC2 instance. Go to the EC2 management console. Click Attach/Replace IAM role under the Instance Settings menu (right click the instance, or open it from the Actions menu).

Select the S3FullAccessInstanceProfile role and Apply.

Spark Temporary Credentials Configuration

Now we are ready to read/write files on the S3 filesystem using instance profile temporary credentials. The following code shows the usage of instance profile credentials. With temporary credentials, we also need to pass the session token along with the key/secret. The fs.s3a.session.token configuration property stores the session token, and because of this property we also need to set fs.s3a.aws.credentials.provider to TemporaryAWSCredentialsProvider.

Configuration configuration = sc.hadoopConfiguration();
InstanceProfileCredentialsProvider credentialsProvider = new InstanceProfileCredentialsProvider(true);
BasicSessionCredentials credentials = (BasicSessionCredentials) credentialsProvider.getCredentials();
configuration.set("fs.s3a.access.key", credentials.getAWSAccessKeyId());
configuration.set("fs.s3a.secret.key", credentials.getAWSSecretKey());
configuration.set("fs.s3a.session.token", credentials.getSessionToken());
configuration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider");

Download complete source code: SparkS3InstanceProfile.java

Sample command line:

$ spark-submit --class com.spark.aws.samples.s3.S3WordCountInstanceProfileCredentials ~/spark-aws-samples-0.0.1-SNAPSHOT.jar s3a://spark.data.com/shakespeare-1-100.txt s3a://spark.data.com/output1

Access S3 filesystem using STS Assume Role

The STS assume role approach is the safest and most secure way to access the S3 filesystem. Using the STS AssumeRole API we can request short-lived/temporary credentials with limited permissions. We need to provide the ARN of the desired role as input when requesting the temporary credentials. The following image shows the workflow of STS assume role in combination with EC2 instance profile credentials.

Activate AWS STS in an AWS Region

Follow this post and check whether STS is enabled in your AWS region: http://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_enable-regions.html

Create Roles

AWS documentation: AWS STS Documentation, AWS STS AssumeRole API

This time we need to create 3 roles:

  1. EC2TemporaryCredentials: This role will be used to initialize the STS service. Don’t attach any permissions to this role.
  2. S3ReadOnlyCredentials: This role will be used by the STS assume role API to create temporary credentials with read-only permission. We also need to add EC2TemporaryCredentials to its trust relationships.
  3. S3FullAccessCredentials: This role will be used by the STS assume role API to create temporary credentials with S3 full access permissions. A user can perform any S3 operation using this role. We also need to add EC2TemporaryCredentials to its trust relationships.

EC2TemporaryCredentials/S3ReadOnlyCredentials/S3FullAccessCredentials: Create all 3 roles by following the steps under ‘Create Instance Profile IAM Role’ above, without attaching any permissions. Warning: Don’t attach any permissions such as AmazonS3FullAccess or anything else to these roles at this stage.

I assume you have created all 3 roles with no permissions. Verify this by going to the Permissions tab of each role; it should show no permission entries. If any permissions are associated with any of these roles, please remove them before going further.

EC2TemporaryCredentials: Nothing needs to change here. Attach this IAM role to the EC2 instance running the Spark master (the instance that is going to call the STS assume role API).

Configure Assume Role Relationship

S3ReadOnlyCredentials: Under the Permissions tab, attach the AmazonS3ReadOnlyAccess policy. Switch to the Trust Relationships tab and click on ‘Edit Trust Relationship’. Replace the existing policy document with:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::111122223333:role/EC2TemporaryCredentials" },
    "Action": "sts:AssumeRole"
  }]
}

After updating the trust policy, S3ReadOnlyCredentials authorizes EC2TemporaryCredentials to perform the assume role action.

S3FullAccessCredentials: Under the Permissions tab, attach the AmazonS3FullAccess policy. Switch to the Trust Relationships tab and click on ‘Edit Trust Relationship’. Replace the existing policy document with the same trust policy shown above.

The trust policy document is the same for both the S3ReadOnlyCredentials and S3FullAccessCredentials roles.
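
Before wiring this into Spark, you can sanity-check the trust relationship from the EC2 instance by calling the AssumeRole API directly. This is a minimal sketch of my own (the account id, role name, and session name are placeholders), not part of the sample project:

// Sketch: verify that EC2TemporaryCredentials is allowed to assume S3ReadOnlyCredentials.
InstanceProfileCredentialsProvider instanceCredentials = new InstanceProfileCredentialsProvider(true);
AWSSecurityTokenService sts = AWSSecurityTokenServiceClientBuilder.standard()
        .withCredentials(instanceCredentials).build();
AssumeRoleRequest request = new AssumeRoleRequest()
        .withRoleArn("arn:aws:iam::111122223333:role/S3ReadOnlyCredentials")
        .withRoleSessionName("TrustRelationshipCheck")
        .withDurationSeconds(900); // shortest allowed session duration
Credentials temporary = sts.assumeRole(request).getCredentials();
System.out.println("Temporary credentials expire at: " + temporary.getExpiration());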

Attach EC2TemporaryCredentials to the EC2 instance that is running the Spark master node. Now we are ready to execute our code for accessing the S3 filesystem. The following code snippet shows how to load the instance profile credentials and use STS to assume the S3ReadOnlyCredentials role.

Configuration configuration = sc.hadoopConfiguration();
String roleSessionName = "S3ReadOnlyCredentialsSession";
String roleArn = "arn:aws:iam::111122223333:role/S3ReadOnlyCredentials";
// Use the EC2 instance profile credentials to call the STS service
InstanceProfileCredentialsProvider credentialsProvider = new InstanceProfileCredentialsProvider(true);
AWSSecurityTokenService sts = AWSSecurityTokenServiceClientBuilder.standard()
        .withCredentials(credentialsProvider).build();
// Assume the S3ReadOnlyCredentials role to obtain short-lived credentials
STSAssumeRoleSessionCredentialsProvider provider = new STSAssumeRoleSessionCredentialsProvider.Builder(roleArn, roleSessionName)
        .withStsClient(sts).build();
configuration.set("fs.s3a.access.key", provider.getCredentials().getAWSAccessKeyId());
configuration.set("fs.s3a.secret.key", provider.getCredentials().getAWSSecretKey());
configuration.set("fs.s3a.session.token", provider.getCredentials().getSessionToken());
configuration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider");

Download complete source code: SparkS3STSAssumeRole.java

Sample command line:

$ spark-submit --class com.spark.aws.samples.s3.S3WordCountSTSAssumeRoleCredentials ~/spark-aws-samples-0.0.1-SNAPSHOT.jar s3a://spark.data.com/shakespeare-1-100.txt s3a://spark.data.com/output2

Conclusion

In this post, we explored 3 different ways to configure Spark to access the S3 filesystem: directly passing an AWS key/secret, an instance profile IAM role, and STS assume role credentials. Passing the AWS key/secret directly is not secure and should not be used in production environments. Instance profile and STS assume role are much safer because there is no need to distribute any predefined key/secret; AWS itself provides the keys on demand. STS assume role is still better than a plain instance profile because the instance profile exposes its credentials to everything running on the EC2 machine, whereas with STS assume role we explicitly request credentials for a specific role and receive only the permissions that role grants.
