Data-Engineer-Associate AWS Certified Data Engineer - Associate (DEA-C01) Questions and Answers

Question # 4

A data engineer needs to securely transfer 5 TB of data from an on-premises data center to an Amazon S3 bucket. Approximately 5% of the data changes every day. Updates to the data need to be regularly propagated to the S3 bucket. The data includes files that are in multiple formats. The data engineer needs to automate the transfer process and must schedule the process to run periodically.

Which AWS service should the data engineer use to transfer the data in the MOST operationally efficient way?

A.

AWS DataSync

B.

AWS Glue

C.

AWS Direct Connect

D.

Amazon S3 Transfer Acceleration
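
For a sense of scale, the figures in this question can be worked through in a few lines. This is only a rough sketch: it assumes decimal units (1 TB = 1000 GB), and the link speed passed to `hours_to_transfer` is a hypothetical value that does not come from the question.

```python
# Rough sizing of the transfer described in the question.
# Assumption: decimal units, 1 TB = 1000 GB and 1 GB = 8000 megabits.

TOTAL_TB = 5           # initial dataset size stated in the question
DAILY_CHANGE = 0.05    # about 5% of the data changes each day


def daily_delta_gb(total_tb: float, change_fraction: float) -> float:
    """Approximate volume of changed data per day, in GB."""
    return total_tb * 1000 * change_fraction


def hours_to_transfer(size_gb: float, link_mbps: float) -> float:
    """Hours needed to move size_gb over a link of link_mbps megabits per second.

    link_mbps is a hypothetical parameter; the question gives no bandwidth.
    """
    seconds = (size_gb * 8000) / link_mbps
    return seconds / 3600


if __name__ == "__main__":
    delta = daily_delta_gb(TOTAL_TB, DAILY_CHANGE)
    print(f"daily delta: {delta:.0f} GB")
    print(f"hours at 1000 Mbps: {hours_to_transfer(delta, 1000):.2f}")
```

The point of the arithmetic is that the recurring daily load is much smaller than the initial 5 TB copy, which is why incremental, scheduled transfer matters in this scenario.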

Question # 5

A company has a production AWS account that runs company workloads. The company's security team created a security AWS account to store and analyze security logs from the production AWS account. The security logs in the production AWS account are stored in Amazon CloudWatch Logs.

The company needs to use Amazon Kinesis Data Streams to deliver the security logs to the security AWS account.

Which solution will meet these requirements?

A.

Create a destination data stream in the production AWS account. In the security AWS account, create an IAM role that has cross-account permissions to Kinesis Data Streams in the production AWS account.

B.

Create a destination data stream in the security AWS account. Create an IAM role and a trust policy to grant CloudWatch Logs the permission to put data into the stream. Create a subscription filter in the security AWS account.

C.

Create a destination data stream in the production AWS account. In the production AWS account, create an IAM role that has cross-account permissions to Kinesis Data Streams in the security AWS account.

D.

Create a destination data stream in the security AWS account. Create an IAM role and a trust policy to grant CloudWatch Logs the permission to put data into the stream. Create a subscription filter in the production AWS account.

Question # 6

A company needs a solution to manage costs for an existing Amazon DynamoDB table. The company also needs to control the size of the table. The solution must not disrupt any ongoing read or write operations. The company wants to use a solution that automatically deletes data from the table after 1 month.

Which solution will meet these requirements with the LEAST ongoing maintenance?

A.

Use the DynamoDB TTL feature to automatically expire data based on timestamps.

B.

Configure a scheduled Amazon EventBridge rule to invoke an AWS Lambda function to check for data that is older than 1 month. Configure the Lambda function to delete old data.

C.

Configure a stream on the DynamoDB table to invoke an AWS Lambda function. Configure the Lambda function to delete data in the table that is older than 1 month.

D.

Use an AWS Lambda function to periodically scan the DynamoDB table for data that is older than 1 month. Configure the Lambda function to delete old data.
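
One of the options refers to expiring items based on timestamps. DynamoDB TTL reads a Number attribute that holds the expiry time in Unix epoch seconds, so the writer has to compute that value. The sketch below shows the computation only. It treats "1 month" as 30 days, and the attribute name `expires_at` is a hypothetical choice; both are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "1 month" is approximated as 30 days.
RETENTION = timedelta(days=30)


def expires_at(created: datetime, retention: timedelta = RETENTION) -> int:
    """Return the expiry time as Unix epoch seconds."""
    return int((created + retention).timestamp())


def with_ttl(item: dict, created: datetime) -> dict:
    """Return a copy of item with a TTL attribute added.

    The attribute name "expires_at" is hypothetical; any Number attribute
    can be designated as the TTL attribute on the table.
    """
    return {**item, "expires_at": expires_at(created)}


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(with_ttl({"pk": "order#1"}, now))
```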

Question # 7

A company hosts its applications on Amazon EC2 instances. The company must use SSL/TLS connections that encrypt data in transit to communicate securely with AWS infrastructure that is managed by a customer.

A data engineer needs to implement a solution to simplify the generation, distribution, and rotation of digital certificates. The solution must automatically renew and deploy SSL/TLS certificates.

Which solution will meet these requirements with the LEAST operational overhead?

A.

Store self-managed certificates on the EC2 instances.

B.

Use AWS Certificate Manager (ACM).

C.

Implement custom automation scripts in AWS Secrets Manager.

D.

Use Amazon Elastic Container Service (Amazon ECS) Service Connect.

Question # 8

A company uses Amazon S3 as a data lake. The company sets up a data warehouse by using a multi-node Amazon Redshift cluster. The company organizes the data files in the data lake based on the data source of each data file.

The company loads all the data files into one table in the Redshift cluster by using a separate COPY command for each data file location. This approach takes a long time to load all the data files into the table. The company must increase the speed of the data ingestion. The company does not want to increase the cost of the process.

Which solution will meet these requirements?

A.

Use a provisioned Amazon EMR cluster to copy all the data files into one folder. Use a COPY command to load the data into Amazon Redshift.

B.

Load all the data files in parallel into Amazon Aurora. Run an AWS Glue job to load the data into Amazon Redshift.

C.

Use an AWS Glue job to copy all the data files into one folder. Use a COPY command to load the data into Amazon Redshift.

D.

Create a manifest file that contains the data file locations. Use a COPY command to load the data into Amazon Redshift.
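
For readers unfamiliar with the manifest mentioned in one of the options: a Redshift COPY manifest is a JSON document with an `entries` list, where each entry gives an S3 `url` and a `mandatory` flag. The sketch below only builds that document; the bucket and key names are hypothetical.

```python
import json


def build_manifest(urls: list[str], mandatory: bool = True) -> str:
    """Return a COPY manifest as a JSON string listing each S3 object URL."""
    entries = [{"url": url, "mandatory": mandatory} for url in urls]
    return json.dumps({"entries": entries}, indent=2)


if __name__ == "__main__":
    # Hypothetical file locations, one per data source.
    print(build_manifest([
        "s3://example-bucket/source-a/part-0000.csv",
        "s3://example-bucket/source-b/part-0000.csv",
    ]))
```

A single COPY command that points at the manifest object and includes the MANIFEST option can then read files from several prefixes in one load.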

Question # 9

A retail company has a customer data hub in an Amazon S3 bucket. Employees from many countries use the data hub to support company-wide analytics. A governance team must ensure that the company's data analysts can access data only for customers who are within the same country as the analysts.

Which solution will meet these requirements with the LEAST operational effort?

A.

Create a separate table for each country's customer data. Provide access to each analyst based on the country that the analyst serves.

B.

Register the S3 bucket as a data lake location in AWS Lake Formation. Use the Lake Formation row-level security features to enforce the company's access policies.

C.

Move the data to AWS Regions that are close to the countries where the customers are. Provide access to each analyst based on the country that the analyst serves.

D.

Load the data into Amazon Redshift. Create a view for each country. Create separate IAM roles for each country to provide access to data from each country. Assign the appropriate roles to the analysts.

Question # 10

A data engineer uploads unpredictable volumes of unstructured data to an Amazon S3 bucket throughout the day. The data engineer needs to transform the data by using complex processing logic that takes from 5 to 30 minutes to complete. The solution must automatically scale with incoming data volume and process each uploaded file only one time.

Which solution will meet these requirements with the LEAST operational overhead?

A.

Create AWS Lambda functions that are invoked by S3 Event Notifications to process the data as the data arrives in the S3 bucket.

B.

Use AWS Glue jobs with job bookmarks enabled to process the data with automatic scaling based on workload.

C.

Set up an Amazon EMR cluster that runs a Spark job to transform data when new files are detected in the S3 bucket.

D.

Create an Amazon EC2 Auto Scaling group with instances that poll the S3 bucket for new data.

Question # 11

A data engineer uses Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to run data pipelines in an AWS account. A workflow recently failed to run. The data engineer needs to use Apache Airflow logs to diagnose the failure of the workflow.

Which log type should the data engineer use to diagnose the cause of the failure?

A.

YourEnvironmentName-WebServer

B.

YourEnvironmentName-Scheduler

C.

YourEnvironmentName-DAGProcessing

D.

YourEnvironmentName-Task

Question # 12

A company manages an Amazon Redshift data warehouse. The data warehouse is in a public subnet inside a custom VPC. A security group allows only traffic from within itself. An ACL is open to all traffic.

The company wants to generate several visualizations in Amazon QuickSight for an upcoming sales event. The company will run QuickSight Enterprise edition in a second AWS account inside a public subnet within a second custom VPC. The new public subnet has a security group that allows outbound traffic to the existing Redshift cluster.

A data engineer needs to establish connections between Amazon Redshift and QuickSight. QuickSight must refresh dashboards by querying the Redshift cluster.

Which solution will meet these requirements?

A.

Configure the Redshift security group to allow inbound traffic on the Redshift port from the QuickSight security group.

B.

Assign Elastic IP addresses to the QuickSight visualizations. Configure the QuickSight security group to allow inbound traffic on the Redshift port from the Elastic IP addresses.

C.

Confirm that the CIDR ranges of the Redshift VPC and the QuickSight VPC are the same. If CIDR ranges are different, reconfigure one CIDR range to match the other. Establish network peering between the VPCs.

D.

Create a QuickSight gateway endpoint in the Redshift VPC. Attach an endpoint policy to the gateway endpoint to ensure only specific QuickSight accounts can use the endpoint.

Question # 13

A data engineer uses the AWS Glue Data Catalog to manage data lake metadata. The data engineer's extract, transform, and load (ETL) process creates new partitions in an Amazon S3 data lake throughout the day. The new partitions are not queryable through Amazon Athena until an AWS Glue crawler run finishes each night. The data engineer needs to make new partitions immediately available for querying.

Which solution will meet these requirements?

A.

Modify the ETL process to use the AWS Glue CreatePartition API call after creating each new partition in Amazon S3.

B.

Configure S3 Event Notifications to invoke an AWS Lambda function that copies new partition data to a separate cataloged S3 bucket.

C.

Use Amazon DynamoDB Streams to track partition changes and update the AWS Glue Data Catalog.

D.

Use the AWS Glue StartImportLabelsTaskRun API call to synchronize partitions on demand.

Question # 14

A sales company uses AWS Glue ETL to collect, process, and ingest data into an Amazon S3 bucket. The AWS Glue pipeline creates a new file in the S3 bucket every hour. File sizes vary from 200 KB to 300 KB. The company wants to build a sales prediction model by using data from the previous 5 years. The historic data includes 44,000 files.

The company builds a second AWS Glue ETL pipeline by using the smallest worker type. The second pipeline retrieves the historic files from the S3 bucket and processes the files for downstream analysis. The company notices significant performance issues with the second ETL pipeline.

The company needs to improve the performance of the second pipeline.

Which solution will meet this requirement MOST cost-effectively?

A.

Use a larger worker type.

B.

Increase the number of workers in the AWS Glue ETL jobs.

C.

Use the AWS Glue DynamicFrame grouping option.

D.

Enable AWS Glue auto scaling.
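
The numbers in this question describe a small-files workload, which a short calculation makes concrete. The sketch below estimates how many read groups the 44,000 files would collapse into at a given group size, and builds the S3 connection options that AWS Glue documents for file grouping (`groupFiles` and `groupSize`). The 250 KB average file size, the 128 MB group size, and the S3 path are assumptions chosen for illustration.

```python
import math

FILE_COUNT = 44_000               # historic files stated in the question
AVG_FILE_BYTES = 250 * 1024       # assumption: midpoint of the 200-300 KB range


def estimated_groups(file_count: int, avg_file_bytes: int, group_size_bytes: int) -> int:
    """Estimate how many groups the files form at the given target group size."""
    total_bytes = file_count * avg_file_bytes
    return math.ceil(total_bytes / group_size_bytes)


def grouping_options(paths: list[str], group_size_bytes: int) -> dict:
    """Build S3 connection options that request file grouping within a partition.

    groupSize is passed as a string of bytes, as in the AWS Glue documentation.
    """
    return {
        "paths": paths,
        "groupFiles": "inPartition",
        "groupSize": str(group_size_bytes),
    }


if __name__ == "__main__":
    group_size = 128 * 1024 * 1024    # hypothetical 128 MB target
    print(estimated_groups(FILE_COUNT, AVG_FILE_BYTES, group_size))
    print(grouping_options(["s3://example-bucket/sales/"], group_size))
```

Under these assumptions the 44,000 files reduce to 84 groups, so the job tracks far fewer read tasks without any change to worker type or worker count.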

Question # 15

A company stores customer data in an Amazon S3 bucket. Multiple teams in the company want to use the customer data for downstream analysis. The company needs to ensure that the teams do not have access to personally identifiable information (PII) about the customers.

Which solution will meet this requirement with the LEAST operational overhead?

A.

Use Amazon Macie to create and run a sensitive data discovery job to detect and remove PII.

B.

Use S3 Object Lambda to access the data, and use Amazon Comprehend to detect and remove PII.

C.

Use Amazon Kinesis Data Firehose and Amazon Comprehend to detect and remove PII.

D.

Use an AWS Glue DataBrew job to store the PII data in a second S3 bucket. Perform analysis on the data that remains in the original S3 bucket.

Question # 16

A company uses an Amazon Redshift Single-AZ cluster for enterprise analytics. The company wants to set up a highly resilient disaster recovery (DR) solution for the cluster. The solution must meet a recovery time objective (RTO) of less than 1 hour.

Which solution will meet this requirement MOST cost-effectively?

A.

Use a Redshift dense storage (DS2) node. Enable Multi-AZ deployment.

B.

Use a Redshift RA3 node. Enable Multi-AZ deployment.

C.

Configure a Redshift cluster from a cross-Region snapshot copy in a second AWS Region when necessary.

D.

Use a Redshift RA3 node. Enable cluster relocation.

Question # 17

A global ecommerce company processes customer transactions, inventory updates, and user activity logs across multiple AWS services. The company needs a scalable, fully managed, and event-driven orchestration solution to coordinate complex extract, transform, and load (ETL) workflows. The solution must use AWS Glue and Amazon EMR to process data. The data will be stored in Amazon Redshift and Amazon S3. The solution must support dependency management, automated retries, and data pipeline monitoring.

Which solution will meet these requirements?

A.

Use AWS Step Functions to define an express workflow that invokes the data transformation and loading tasks across Amazon EMR and AWS Glue.

B.

Create AWS Lambda functions for each step of the workflow. Configure Amazon EventBridge to invoke AWS Glue jobs. Configure the Lambda functions to process and move data through the pipeline.

C.

Use Apache Airflow on Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to create Directed Acyclic Graphs (DAGs) to manage ETL workflows.

D.

Create an AWS Lambda function that runs each step of the workflow. Create an Amazon EventBridge scheduled rule to invoke the function every day.

Question # 18

A company that operates globally must follow regulations that require data from an AWS Region to be accessible only within that Region.

A data engineer is creating a data pipeline that will create resources in the Region where the data engineer works. The data pipeline should have access to data only from the Region where the data engineer works. The pipeline uses Active Directory as an identity and authentication system. The pipeline uses a custom identity broker application to verify that employees are signed in to Active Directory and to obtain temporary credentials by using the AssumeRole API operation.

Which solution will meet the locality requirements with the LEAST administrative effort?

A.

Create an IAM role that has permissions to create resources. Create a policy for each Region that ensures users can create resources only in that Region. Pass the policy as the session policy when employees obtain the temporary credentials.

B.

Create an IAM role for data engineers in each Region separately. Instruct each data engineer to obtain temporary credentials by assuming the appropriate Region-specific IAM role.

C.

Create an IAM group for each Region. Include the required IAM policies for each IAM group. Add users to each IAM group so that when users log in by obtaining the temporary credentials, the users will receive the appropriate access based on the IAM group.

D.

Create individual IAM policies that allow users to create resources in a specific Region. Assign the policies to each data engineer. Allow users to assume the individually assigned role when the users log in to AWS.

Question # 19

A company loads transaction data for each day into Amazon Redshift tables at the end of each day. The company wants to have the ability to track which tables have been loaded and which tables still need to be loaded.

A data engineer wants to store the load statuses of Redshift tables in an Amazon DynamoDB table. The data engineer creates an AWS Lambda function to publish the details of the load statuses to DynamoDB.

How should the data engineer invoke the Lambda function to write load statuses to the DynamoDB table?

A.

Use a second Lambda function to invoke the first Lambda function based on Amazon CloudWatch events.

B.

Use the Amazon Redshift Data API to publish an event to Amazon EventBridge. Configure an EventBridge rule to invoke the Lambda function.

C.

Use the Amazon Redshift Data API to publish a message to an Amazon Simple Queue Service (Amazon SQS) queue. Configure the SQS queue to invoke the Lambda function.

D.

Use a second Lambda function to invoke the first Lambda function based on AWS CloudTrail events.

Question # 20

A company has a data warehouse in Amazon Redshift. To comply with security regulations, the company needs to log and store all user activities and connection activities for the data warehouse.

Which solution will meet these requirements?

A.

Create an Amazon S3 bucket. Enable logging for the Amazon Redshift cluster. Specify the S3 bucket in the logging configuration to store the logs.

B.

Create an Amazon Elastic File System (Amazon EFS) file system. Enable logging for the Amazon Redshift cluster. Write logs to the EFS file system.

C.

Create an Amazon Aurora MySQL database. Enable logging for the Amazon Redshift cluster. Write the logs to a table in the Aurora MySQL database.

D.

Create an Amazon Elastic Block Store (Amazon EBS) volume. Enable logging for the Amazon Redshift cluster. Write the logs to the EBS volume.

Question # 21

A company uses AWS Key Management Service (AWS KMS) to encrypt an Amazon Redshift cluster. The company wants to configure a cross-Region snapshot of the Redshift cluster as part of its disaster recovery (DR) strategy.

A data engineer needs to use the AWS CLI to create the cross-Region snapshot.

Which combination of steps will meet these requirements? (Select TWO.)

A.

Create a KMS key and configure a snapshot copy grant in the source AWS Region.

B.

In the source AWS Region, enable snapshot copying. Specify the name of the snapshot copy grant that is created in the destination AWS Region.

C.

In the source AWS Region, enable snapshot copying. Specify the name of the snapshot copy grant that is created in the source AWS Region.

D.

Create a KMS key and configure a snapshot copy grant in the destination AWS Region.

E.

Convert the cluster to a Multi-AZ deployment.

Question # 22

A data engineer needs to use Amazon Neptune to develop graph applications.

Which programming languages should the engineer use to develop the graph applications? (Select TWO.)

A.

Gremlin

B.

SQL

C.

ANSI SQL

D.

SPARQL

E.

Spark SQL

Question # 23

A data engineer needs to create an Amazon Athena table based on a subset of data from an existing Athena table named cities_world. The cities_world table contains cities that are located around the world. The data engineer must create a new table named cities_us to contain only the cities from cities_world that are located in the US.

Which SQL statement should the data engineer use to meet this requirement?

A.

Option A

B.

Option B

C.

Option C

D.

Option D

Question # 24

A company stores Apache Parquet files in an Amazon S3 data lake. The data lake receives thousands of files from multiple sources every hour. The files range in size from 50 KB to 100 KB.

The company is evaluating the implementation of Apache Iceberg tables for the data lake. The company is using AWS Glue Data Catalog as part of the evaluation. The company needs a solution to optimize query performance in Iceberg. The solution must ensure that Iceberg table performance does not degrade when more files are added over time.

Which solution will meet these requirements?

A.

Use an AWS Glue job to compact the files into a standard size of 512 MB at the end of each day. Run an AWS Glue crawler to update the Data Catalog.

B.

Configure the Data Catalog to automatically compact the files every minute.

C.

Configure Iceberg table properties to enable automatic compaction based on thresholds for file size and the number of files.

D.

Implement a partition strategy in Amazon S3. Run an AWS Glue crawler to update the Data Catalog every 5 minutes.

Question # 25

A technology company currently uses Amazon Kinesis Data Streams to collect log data in real time. The company wants to use Amazon Redshift for downstream real-time queries and to enrich the log data.

Which solution will ingest data into Amazon Redshift with the LEAST operational overhead?

A.

Set up an Amazon Data Firehose delivery stream to send data to a Redshift provisioned cluster table.

B.

Set up an Amazon Data Firehose delivery stream to send data to Amazon S3. Configure a Redshift provisioned cluster to load data every minute.

C.

Configure Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to send data directly to a Redshift provisioned cluster table.

D.

Use Amazon Redshift streaming ingestion from Kinesis Data Streams and present the data as a materialized view.

Question # 26

A company wants to ingest streaming data into an Amazon Redshift data warehouse from an Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster. A data engineer needs to develop a solution that provides low data access time and that optimizes storage costs.

Which solution will meet these requirements with the LEAST operational overhead?

A.

Create an external schema that maps to the MSK cluster. Create a materialized view that references the external schema to consume the streaming data from the MSK topic.

B.

Develop an AWS Glue streaming extract, transform, and load (ETL) job to process the incoming data from Amazon MSK. Load the data into Amazon S3. Use Amazon Redshift Spectrum to read the data from Amazon S3.

C.

Create an external schema that maps to the streaming data source. Create a new Amazon Redshift table that references the external schema.

D.

Create an Amazon S3 bucket. Ingest the data from Amazon MSK. Create an event-driven AWS Lambda function to load the data from the S3 bucket to a new Amazon Redshift table.

Question # 27

A company has a gaming application that stores data in Amazon DynamoDB tables. A data engineer needs to ingest the game data into an Amazon OpenSearch Service cluster. Data updates must occur in near real time.

Which solution will meet these requirements?

A.

Use AWS Step Functions to periodically export data from the Amazon DynamoDB tables to an Amazon S3 bucket. Use an AWS Lambda function to load the data into Amazon OpenSearch Service.

B.

Configure an AWS Glue job to have a source of Amazon DynamoDB and a destination of Amazon OpenSearch Service to transfer data in near real time.

C.

Use Amazon DynamoDB Streams to capture table changes. Use an AWS Lambda function to process and update the data in Amazon OpenSearch Service.

D.

Use a custom OpenSearch plugin to sync data from the Amazon DynamoDB tables.
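
To make the stream-based option concrete, the sketch below shows the kind of transformation a function would perform on a DynamoDB Streams record before sending it to OpenSearch: it converts the typed attribute values to plain values and emits lines in the bulk API's action-then-document format. It is a minimal illustration, not a full implementation. The index name `games` and key attribute `player_id` are hypothetical, and only the S, N, and BOOL attribute types are handled.

```python
import json


def _plain(attr: dict):
    """Convert one DynamoDB-typed attribute to a plain value (S, N, BOOL only)."""
    if "S" in attr:
        return attr["S"]
    if "N" in attr:
        return float(attr["N"]) if "." in attr["N"] else int(attr["N"])
    if "BOOL" in attr:
        return attr["BOOL"]
    raise ValueError(f"unsupported attribute type: {list(attr)}")


def to_bulk_lines(record: dict, index: str = "games", key_name: str = "player_id") -> list[str]:
    """Translate one stream record into OpenSearch bulk API lines.

    index and key_name are hypothetical defaults for this sketch.
    """
    doc_id = str(_plain(record["dynamodb"]["Keys"][key_name]))
    if record["eventName"] == "REMOVE":
        return [json.dumps({"delete": {"_index": index, "_id": doc_id}})]
    # INSERT and MODIFY both carry the latest item state in NewImage.
    doc = {name: _plain(value) for name, value in record["dynamodb"]["NewImage"].items()}
    return [
        json.dumps({"index": {"_index": index, "_id": doc_id}}),
        json.dumps(doc),
    ]


if __name__ == "__main__":
    sample = {
        "eventName": "MODIFY",
        "dynamodb": {
            "Keys": {"player_id": {"S": "p1"}},
            "NewImage": {"player_id": {"S": "p1"}, "score": {"N": "42"}},
        },
    }
    print("\n".join(to_bulk_lines(sample)))
```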

Question # 28

A retail company stores data from a product lifecycle management (PLM) application in an on-premises MySQL database. The PLM application frequently updates the database when transactions occur.

The company wants to gather insights from the PLM application in near real time. The company wants to integrate the insights with other business datasets and to analyze the combined dataset by using an Amazon Redshift data warehouse.

The company has already established an AWS Direct Connect connection between the on-premises infrastructure and AWS.

Which solution will meet these requirements with the LEAST development effort?

A.

Run a scheduled AWS Glue extract, transform, and load (ETL) job to get the MySQL database updates by using a Java Database Connectivity (JDBC) connection. Set Amazon Redshift as the destination for the ETL job.

B.

Run a full load plus CDC task in AWS Database Migration Service (AWS DMS) to continuously replicate the MySQL database changes. Set Amazon Redshift as the destination for the task.

C.

Use the Amazon AppFlow SDK to build a custom connector for the MySQL database to continuously replicate the database changes. Set Amazon Redshift as the destination for the connector.

D.

Run scheduled AWS DataSync tasks to synchronize data from the MySQL database. Set Amazon Redshift as the destination for the tasks.

Question # 29

A data engineer must orchestrate a data pipeline that consists of one AWS Lambda function and one AWS Glue job. The solution must integrate with AWS services.

Which solution will meet these requirements with the LEAST management overhead?

A.

Use an AWS Step Functions workflow that includes a state machine. Configure the state machine to run the Lambda function and then the AWS Glue job.

B.

Use an Apache Airflow workflow that is deployed on an Amazon EC2 instance. Define a directed acyclic graph (DAG) in which the first task is to call the Lambda function and the second task is to call the AWS Glue job.

C.

Use an AWS Glue workflow to run the Lambda function and then the AWS Glue job.

D.

Use an Apache Airflow workflow that is deployed on Amazon Elastic Kubernetes Service (Amazon EKS). Define a directed acyclic graph (DAG) in which the first task is to call the Lambda function and the second task is to call the AWS Glue job.

Question # 30

A data engineer uses AWS Lake Formation to manage access to data that is stored in an Amazon S3 bucket. The data engineer configures an AWS Glue crawler to discover data at a specific file location in the bucket, s3://examplepath. The crawler execution fails with the following error:

"The S3 location: s3://examplepath is not registered."

The data engineer needs to resolve the error.

A.

Attach an appropriate IAM policy to the IAM role of the AWS Glue crawler to grant the crawler permission to read the S3 location.

B.

Register the S3 location in Lake Formation to allow the crawler to access the data.

C.

Create a new AWS Glue database. Assign the correct permissions to the database for the crawler.

D.

Configure the S3 bucket policy to allow cross-account access.

Question # 31

A company uses Amazon Redshift as its data warehouse service. A data engineer needs to design a physical data model.

The data engineer encounters a de-normalized table that is growing in size. The table does not have a suitable column to use as the distribution key.

Which distribution style should the data engineer use to meet these requirements with the LEAST maintenance overhead?

A.

ALL distribution

B.

EVEN distribution

C.

AUTO distribution

D.

KEY distribution

Question # 32

A data engineer must manage the ingestion of real-time streaming data into AWS. The data engineer wants to perform real-time analytics on the incoming streaming data by using time-based aggregations over a window of up to 30 minutes. The data engineer needs a solution that is highly fault tolerant.

Which solution will meet these requirements with the LEAST operational overhead?

A.

Use an AWS Lambda function that includes both the business and the analytics logic to perform time-based aggregations over a window of up to 30 minutes for the data in Amazon Kinesis Data Streams.

B.

Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to analyze the data that might occasionally contain duplicates by using multiple types of aggregations.

C.

Use an AWS Lambda function that includes both the business and the analytics logic to perform aggregations for a tumbling window of up to 30 minutes, based on the event timestamp.

D.

Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to analyze the data by using multiple types of aggregations to perform time-based analytics over a window of up to 30 minutes.
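
The phrase "time-based aggregations over a window of up to 30 minutes" can be illustrated with a tumbling window, in which each event falls into exactly one fixed, non-overlapping interval. The sketch below is plain Python used as a stand-in to show the computation; it is not Apache Flink code, and the event shape `(epoch_seconds, key)` is an assumption.

```python
from collections import defaultdict

WINDOW_SECONDS = 30 * 60   # 30-minute window from the question


def tumbling_counts(events, window_seconds: int = WINDOW_SECONDS) -> dict:
    """Count events per (window_start, key) using event timestamps.

    events is an iterable of (epoch_seconds, key) tuples (assumed shape).
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Align each timestamp down to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [(0, "a"), (1799, "a"), (1800, "a"), (10, "b")]
    for (start, key), count in sorted(tumbling_counts(sample).items()):
        print(start, key, count)
```

A managed stream processor adds what this sketch leaves out: durable state, checkpointing, and handling of late or out-of-order events.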

Question # 33

A company stores datasets in JSON format and .csv format in an Amazon S3 bucket. The company has Amazon RDS for Microsoft SQL Server databases, Amazon DynamoDB tables that are in provisioned capacity mode, and an Amazon Redshift cluster. A data engineering team must develop a solution that will give data scientists the ability to query all data sources by using syntax similar to SQL.

Which solution will meet these requirements with the LEAST operational overhead?

A.

Use AWS Glue to crawl the data sources. Store metadata in the AWS Glue Data Catalog. Use Amazon Athena to query the data. Use SQL for structured data sources. Use PartiQL for data that is stored in JSON format.

B.

Use AWS Glue to crawl the data sources. Store metadata in the AWS Glue Data Catalog. Use Redshift Spectrum to query the data. Use SQL for structured data sources. Use PartiQL for data that is stored in JSON format.

C.

Use AWS Glue to crawl the data sources. Store metadata in the AWS Glue Data Catalog. Use AWS Glue jobs to transform data that is in JSON format to Apache Parquet or .csv format. Store the transformed data in an S3 bucket. Use Amazon Athena to query the original and transformed data from the S3 bucket.

D.

Use AWS Lake Formation to create a data lake. Use Lake Formation jobs to transform the data from all data sources to Apache Parquet format. Store the transformed data in an S3 bucket. Use Amazon Athena or Redshift Spectrum to query the data.

Question # 34

A company uses Amazon Redshift to store order transactions from the current day. The company has an orders table that contains the previous order data. The company also has a staging table that contains new or updated order records. The company needs to remove stale records from the orders table and insert the most recent data in the orders table from the staging table. Several downstream applications need the orders table to display up-to-date information.

Which solution will meet these requirements?

A.

Use Amazon Redshift Spectrum to delete stale records from the orders table and insert records from the staging table into the orders table.

B.

Unload the orders table and the staging table to Amazon S3. Delete stale orders table data and insert new staging table data in Amazon S3 by using Amazon Athena. Copy the orders S3 table to the orders Amazon Redshift table.

C.

Use Amazon Athena federated queries to read stale records from the orders table. Delete the stale records and insert the records from the staging table into the orders table.

D.

Write an Amazon Redshift stored procedure that deletes the stale records from the orders table and inserts new records from the staging table.
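
The delete-then-insert refresh described in the question can be sketched as two statements in one transaction. The example below uses Python's built-in `sqlite3` module as a local stand-in so that it runs anywhere; it is not Redshift code. The column names and the reading of "stale" as "any orders row that has a newer version in staging" are assumptions.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical orders and staging tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute("CREATE TABLE staging (order_id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "placed"), (2, "placed")])
    conn.executemany("INSERT INTO staging VALUES (?, ?)", [(2, "shipped"), (3, "placed")])
    conn.commit()
    return conn


def refresh_orders(conn: sqlite3.Connection) -> None:
    """Replace stale rows and add new rows from staging in a single transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "DELETE FROM orders WHERE order_id IN (SELECT order_id FROM staging)"
        )
        conn.execute("INSERT INTO orders SELECT order_id, status FROM staging")


if __name__ == "__main__":
    db = make_demo_db()
    refresh_orders(db)
    print(db.execute("SELECT order_id, status FROM orders ORDER BY order_id").fetchall())
```

Running both statements in one transaction means downstream readers never observe the table after the delete but before the insert.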

Question # 35

A data engineer uses Amazon Redshift to run resource-intensive analytics processes once every month. Every month, the data engineer creates a new Redshift provisioned cluster. The data engineer deletes the Redshift provisioned cluster after the analytics processes are complete every month. Before the data engineer deletes the cluster each month, the data engineer unloads backup data from the cluster to an Amazon S3 bucket.

The data engineer needs a solution to run the monthly analytics processes that does not require the data engineer to manage the infrastructure manually.

Which solution will meet these requirements with the LEAST operational overhead?

A.

Use AWS Step Functions to pause the Redshift cluster when the analytics processes are complete and to resume the cluster to run new processes every month.

B.

Use Amazon Redshift Serverless to automatically process the analytics workload.

C.

Use the AWS CLI to automatically process the analytics workload.

D.

Use AWS CloudFormation templates to automatically process the analytics workload.

Question # 36

A company stores data in a data lake that is in Amazon S3. Some data that the company stores in the data lake contains personally identifiable information (PII). Multiple user groups need to access the raw data. The company must ensure that user groups can access only the PII that they require.

Which solution will meet these requirements with the LEAST effort?

A.

Use Amazon Athena to query the data. Set up AWS Lake Formation and create data filters to establish levels of access for the company's IAM roles. Assign each user to the IAM role that matches the user's PII access requirements.

B.

Use Amazon QuickSight to access the data. Use column-level security features in QuickSight to limit the PII that users can retrieve from Amazon S3 by using Amazon Athena. Define QuickSight access levels based on the PII access requirements of the users.

C.

Build a custom query builder UI that will run Athena queries in the background to access the data. Create user groups in Amazon Cognito. Assign access levels to the user groups based on the PII access requirements of the users.

D.

Create IAM roles that have different levels of granular access. Assign the IAM roles to IAM user groups. Use an identity-based policy to assign access levels to user groups at the column level.

Question # 37

A data engineer has a one-time task to read data from objects that are in Apache Parquet format in an Amazon S3 bucket. The data engineer needs to query only one column of the data.

Which solution will meet these requirements with the LEAST operational overhead?

A.

Configure an AWS Lambda function to load data from the S3 bucket into a pandas dataframe. Write a SQL SELECT statement on the dataframe to query the required column.

B.

Use S3 Select to write a SQL SELECT statement to retrieve the required column from the S3 objects.

C.

Prepare an AWS Glue DataBrew project to consume the S3 objects and to query the required column.

D.

Run an AWS Glue crawler on the S3 objects. Use a SQL SELECT statement in Amazon Athena to query the required column.

Question # 38

A data engineer is optimizing query performance in Amazon Athena notebooks that use Apache Spark to analyze large datasets that are stored in Amazon S3. The data is partitioned. An AWS Glue crawler updates the partitions.

The data engineer wants to minimize the amount of data that is scanned to improve efficiency of Athena queries.

Which solution will meet these requirements?

A.

Apply partition filters in the queries.

B.

Increase the frequency of AWS Glue crawler invocations to update the data catalog more often.

C.

Organize the data that is in Amazon S3 by using a nested directory structure.

D.

Configure Spark to use in-memory caching for frequently accessed data.
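
As a rough illustration of how a partition filter limits the data an engine has to read, the sketch below simulates partition pruning over Hive-style S3 prefixes in plain Python. The bucket name, prefixes, sizes, and the `prune` helper are all hypothetical; a real Spark or Athena engine performs this step internally by using the partition metadata in the Data Catalog.

```python
# Hypothetical partition layout: S3 prefix -> size of the data under it (arbitrary units).
PARTITIONS = {
    "s3://example-bucket/events/year=2024/month=01/": 400,
    "s3://example-bucket/events/year=2024/month=02/": 350,
    "s3://example-bucket/events/year=2025/month=01/": 500,
}


def partition_values(prefix):
    """Parse Hive-style key=value path segments into a dict."""
    segments = prefix.strip("/").split("/")
    return dict(seg.split("=", 1) for seg in segments if "=" in seg)


def prune(partitions, **filters):
    """Keep only the partitions whose values match every filter."""
    return {
        prefix: size
        for prefix, size in partitions.items()
        if all(partition_values(prefix).get(k) == v for k, v in filters.items())
    }


# Without a filter every partition is read; with one, only the matching prefix is.
scanned_all = sum(PARTITIONS.values())
scanned_filtered = sum(prune(PARTITIONS, year="2024", month="01").values())
print("units scanned without filter:", scanned_all)
print("units scanned with filter:", scanned_filtered)
```

The saving depends on the filter referencing the partition columns directly; a predicate on a non-partition column would still require reading every prefix in this toy model.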

Question # 39

A data engineer maintains a materialized view that is based on an Amazon Redshift database. The view has a column named load_date that stores the date when each row was loaded.

The data engineer needs to reclaim database storage space by deleting all the rows from the materialized view.

Which command will reclaim the MOST database storage space?

A.

Option A

B.

Option B

C.

Option C

D.

Option D

Question # 40

A data engineer is processing a large amount of log data from web servers. The data is stored in an Amazon S3 bucket. The data engineer uses AWS services to process the data every day. The data engineer needs to extract specific fields from the raw log data and load the data into a data warehouse for analysis.

A.

Use Amazon EMR to run Apache Hive queries on the raw log files in the S3 bucket to extract the specified fields. Store the output as ORC files in the original S3 bucket.

B.

Use AWS Step Functions to orchestrate a series of AWS Batch jobs to parse the raw log files. Load the specified fields into an Amazon RDS for PostgreSQL database.

C.

Use an AWS Glue crawler to parse the raw log data in the S3 bucket and to generate a schema. Use AWS Glue ETL jobs to extract and transform the data and to load it into Amazon Redshift.

D.

Use AWS Glue DataBrew to run AWS Glue ETL jobs on a schedule to extract the specified fields from the raw log files in the S3 bucket. Load the data into partitioned tables in Amazon Redshift.

Question # 41

A company is developing a product recommendation system that uses Amazon OpenSearch Service. The system needs to perform k-nearest neighbors (k-NN) vector searches on 10 million product embeddings with 768-dimensional vectors. The system must maintain high recall accuracy and support incremental updates without reindexing as new products are added each day. The system must also accommodate complex filtering based on product categories and inventory status.

Which vector index type will meet these requirements?

A.

Faiss Inverted File Index (IVF) with an nlist value of 1024 and an nprobes value of 10.

B.

Lucene Hierarchical Navigable Small Worlds (HNSW) index with an M value of 16 and an efConstruction value of 200.

C.

Exact k-NN search that uses Painless script scoring.

D.

Faiss index with binary quantization and an nlist value of 4096.

Question # 42

A company uses an Amazon Redshift provisioned cluster as its database. The Redshift cluster has five reserved ra3.4xlarge nodes and uses key distribution.

A data engineer notices that one of the nodes frequently has a CPU load over 90%. SQL queries that run on the node are queued. The other four nodes usually have a CPU load under 15% during daily operations.

The data engineer wants to maintain the current number of compute nodes. The data engineer also wants to balance the load more evenly across all five compute nodes.

Which solution will meet these requirements?

A.

Change the sort key to be the data column that is most often used in a WHERE clause of the SQL SELECT statement.

B.

Change the distribution key to the table column that has the largest dimension.

C.

Upgrade the reserved node from ra3.4xlarge to ra3.16xlarge.

D.

Change the primary key to be the data column that is most often used in a WHERE clause of the SQL SELECT statement.

Question # 43

A company is building an analytics solution. The solution uses Amazon S3 for data lake storage and Amazon Redshift for a data warehouse. The company wants to use Amazon Redshift Spectrum to query the data that is in Amazon S3.

Which actions will provide the FASTEST queries? (Choose two.)

A.

Use gzip compression to compress individual files to sizes that are between 1 GB and 5 GB.

B.

Use a columnar storage file format.

C.

Partition the data based on the most common query predicates.

D.

Split the data into files that are less than 10 KB.

E.

Use file formats that are not

Question # 44

A company runs an extract, transform, and load (ETL) job in AWS Glue. The job processes personally identifiable information (PII) data and writes logs to an Amazon CloudWatch Logs log group. A data engineer needs to mask PII data in the CloudWatch Logs log group.

Which solution will meet these requirements?

A.

Attach an AWS Glue security configuration to the ETL job.

B.

Configure a data protection policy. Attach the policy to the CloudWatch log group.

C.

Run an Amazon Macie sensitive data discovery job.

D.

Call AWS Glue sensitive data detection APIs in the ETL job.

Question # 45

A company builds a new data pipeline to process data for business intelligence reports. Users have noticed that data is missing from the reports.

A data engineer needs to add a data quality check for columns that contain null values and for referential integrity at a stage before the data is added to storage.

Which solution will meet these requirements with the LEAST operational overhead?

A.

Use Amazon SageMaker Data Wrangler to create a Data Quality and Insights report.

B.

Use AWS Glue ETL jobs to perform a data quality evaluation transform on the data. Use an IsComplete rule on the requested columns. Use a ReferentialIntegrity rule for each join.

C.

Use AWS Glue ETL jobs to perform a SQL transform on the data to determine whether requested columns contain null values. Use a second SQL transform to check referential integrity.

D.

Use Amazon SageMaker Data Wrangler and a custom Python transform to create custom rules to check for null values and referential integrity.

Question # 46

A company currently stores all of its data in Amazon S3 by using the S3 Standard storage class.

A data engineer examined data access patterns to identify trends. During the first 6 months, most data files are accessed several times each day. Between 6 months and 2 years, most data files are accessed once or twice each month. After 2 years, data files are accessed only once or twice each year.

The data engineer needs to use an S3 Lifecycle policy to develop new data storage rules. The new storage solution must continue to provide high availability.

Which solution will meet these requirements in the MOST cost-effective way?

A.

Transition objects to S3 One Zone-Infrequent Access (S3 One Zone-IA) after 6 months. Transfer objects to S3 Glacier Flexible Retrieval after 2 years.

B.

Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. Transfer objects to S3 Glacier Flexible Retrieval after 2 years.

C.

Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. Transfer objects to S3 Glacier Deep Archive after 2 years.

D.

Transition objects to S3 One Zone-Infrequent Access (S3 One Zone-IA) after 6 months. Transfer objects to S3 Glacier Deep Archive after 2 years.
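
The four options differ only in the target storage classes. As a minimal sketch, the helper below builds a lifecycle configuration in the shape that the S3 `PutBucketLifecycleConfiguration` API accepts (for example through the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration`). The rule ID and the helper itself are hypothetical, and 180 and 730 days are assumed approximations of "6 months" and "2 years". The pair of classes passed at the end is just one of the listed combinations, chosen to show the shape.

```python
import json

# API identifiers for the storage classes named in the options.
STORAGE_CLASS = {
    "S3 Standard-IA": "STANDARD_IA",
    "S3 One Zone-IA": "ONEZONE_IA",
    "S3 Glacier Flexible Retrieval": "GLACIER",
    "S3 Glacier Deep Archive": "DEEP_ARCHIVE",
}


def lifecycle_config(first_tier, archive_tier, first_days=180, archive_days=730):
    """Build a two-transition lifecycle configuration (hypothetical helper)."""
    return {
        "Rules": [
            {
                "ID": "tiering-example",   # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: apply to every object
                "Transitions": [
                    {"Days": first_days, "StorageClass": STORAGE_CLASS[first_tier]},
                    {"Days": archive_days, "StorageClass": STORAGE_CLASS[archive_tier]},
                ],
            }
        ]
    }


config = lifecycle_config("S3 Standard-IA", "S3 Glacier Flexible Retrieval")
print(json.dumps(config, indent=2))
```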

Question # 47

A company plans to use Amazon Kinesis Data Firehose to store data in Amazon S3. The source data consists of 2 MB .csv files. The company must convert the .csv files to JSON format. The company must store the files in Apache Parquet format.

Which solution will meet these requirements with the LEAST development effort?

A.

Use Kinesis Data Firehose to convert the .csv files to JSON. Use an AWS Lambda function to store the files in Parquet format.

B.

Use Kinesis Data Firehose to convert the .csv files to JSON and to store the files in Parquet format.

C.

Use Kinesis Data Firehose to invoke an AWS Lambda function that transforms the .csv files to JSON and stores the files in Parquet format.

D.

Use Kinesis Data Firehose to invoke an AWS Lambda function that transforms the .csv files to JSON. Use Kinesis Data Firehose to store the files in Parquet format.

Question # 48

A company needs to build a data lake in AWS. The company must provide row-level data access and column-level data access to specific teams. The teams will access the data by using Amazon Athena, Amazon Redshift Spectrum, and Apache Hive from Amazon EMR.

Which solution will meet these requirements with the LEAST operational overhead?

A.

Use Amazon S3 for data lake storage. Use S3 access policies to restrict data access by rows and columns. Provide data access through Amazon S3.

B.

Use Amazon S3 for data lake storage. Use Apache Ranger through Amazon EMR to restrict data access by rows and columns. Provide data access by using Apache Pig.

C.

Use Amazon Redshift for data lake storage. Use Redshift security policies to restrict data access by rows and columns. Provide data access by using Apache Spark and Amazon Athena federated queries.

D.

Use Amazon S3 for data lake storage. Use AWS Lake Formation to restrict data access by rows and columns. Provide data access through AWS Lake Formation.

Question # 49

A company stores a 100 MB dataset in an Amazon S3 bucket as an Apache Parquet file. A data engineer needs to profile the data before performing data preparation steps on the data.

Which solution will meet this requirement in the MOST operationally efficient way?

A.

Create a profile job on the dataset in AWS Glue DataBrew. Review the profile job results.

B.

Stream the data into Amazon Managed Service for Apache Flink for SQL queries. Use the Apache Flink dashboard to profile the data.

C.

Ingest the data into Amazon Redshift Spectrum. Use SQL queries to profile the data.

D.

Load the data into an Amazon QuickSight dataset. Build a topic to profile the data with questions.

Question # 50

A data engineer at a large company needs to create centralized datasets that are optimized for Amazon Redshift performance. The company has multiple downstream teams that use their own AWS accounts and dedicated Amazon Redshift clusters with RA3 nodes. All downstream teams need access to the centralized datasets.

Which solution will provide immediate access to the datasets and maintain the current Amazon Redshift performance?

A.

Copy the datasets to an Amazon S3 bucket by using the UNLOAD command. Register the table definitions in a dedicated AWS Glue Data Catalog schema. Share the schema with the other AWS accounts by using AWS Lake Formation. Use Amazon Redshift Spectrum to access the data.

B.

Create a daily extract, transform, and load (ETL) job to unload the data to an Amazon S3 staging area. Instruct the teams to copy the data into their Amazon Redshift clusters.

C.

Set up Amazon Redshift data sharing between the Amazon Redshift producer clusters and the consumer clusters to provide access to the centralized datasets.

D.

Set up an AWS DataSync job that automatically syncs the data between the Amazon Redshift producer clusters and the consumer clusters.

Question # 51

A company needs to generate a one-time performance report by joining data that is stored in Amazon DynamoDB, Amazon RDS, Amazon Redshift, and Amazon S3. The company wants to avoid unnecessary data movement and to minimize query execution time.

Which solution will meet these requirements?

A.

Capture data from DynamoDB by using DynamoDB Streams. Migrate data from Amazon RDS by using AWS DMS. Export Amazon Redshift data. Store all data in Amazon S3. Use Redshift Spectrum to run queries.

B.

Set up an AWS Glue ETL pipeline to extract, transform, and centralize data in Amazon S3. Use Amazon Athena to run analytical queries.

C.

Deploy an Amazon EMR cluster powered by Apache Spark to ingest, process, and merge datasets from multiple sources. Run analytical workloads on the merged data.

D.

Use Amazon Athena Federated Query to perform one-time joins and analysis across DynamoDB, Amazon RDS, Amazon Redshift, and Amazon S3.

Question # 52

A data engineer develops an AWS Glue Apache Spark ETL job to perform transformations on a dataset. When the data engineer runs the job, the job returns an error that reads, “No space left on device.”

The data engineer needs to identify the source of the error and provide a solution.

Which combinations of steps will meet this requirement MOST cost-effectively? (Select TWO.)

A.

Scale out the workers vertically to address data skewness.

B.

Use the Spark UI and AWS Glue metrics to monitor data skew in the Spark executors.

C.

Scale out the number of workers horizontally to address data skewness.

D.

Enable the --write-shuffle-files-to-s3 job parameter. Use the salting technique.

E.

Use error logs in Amazon CloudWatch to monitor data skew.
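
One option mentions the salting technique. The sketch below shows the idea in plain Python, without Spark: a single hot key overloads one shuffle partition, and appending a small salt spreads that key over several partitions. The `partition_of` function is a toy stand-in for a shuffle partitioner, not Spark's actual hash, and the key names and counts are hypothetical. A round-robin salt is used only to keep the example deterministic; real jobs often draw the salt at random and must apply a matching salt on the other side of a join.

```python
from collections import Counter

NUM_PARTITIONS = 8
SALT_BUCKETS = 8


def partition_of(key, num_partitions):
    """Toy partitioner: byte sum of the key modulo the partition count."""
    return sum(key.encode("utf-8")) % num_partitions


# Skewed input: one hot key accounts for 80% of the records.
records = ["customer_42"] * 800 + ["customer_7"] * 100 + ["customer_9"] * 100

# Without salting, every record for the hot key lands in the same partition.
unsalted = Counter(partition_of(key, NUM_PARTITIONS) for key in records)

# With salting, the hot key becomes SALT_BUCKETS distinct keys.
salted = Counter(
    partition_of(f"{key}_{i % SALT_BUCKETS}", NUM_PARTITIONS)
    for i, key in enumerate(records)
)

print("largest partition without salt:", max(unsalted.values()))
print("largest partition with salt:", max(salted.values()))
```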

Question # 53

A company uses Amazon S3 to store data and Amazon QuickSight to create visualizations.

The company has an S3 bucket in an AWS account named Hub-Account. The S3 bucket is encrypted with an AWS Key Management Service (AWS KMS) key. The company’s Amazon QuickSight instance is in a separate AWS account named BI-Account.

The company updates the S3 bucket policy to grant access to the QuickSight service role. The company wants to enable cross-account access to allow QuickSight to interact with the S3 bucket.

Which combination of steps will meet this requirement? (Select TWO.)

A.

Use the existing AWS KMS key to encrypt connections from QuickSight to the S3 bucket.

B.

Add the S3 bucket as a resource that the QuickSight service role can access.

C.

Use AWS Resource Access Manager (AWS RAM) to share the S3 bucket with the BI-Account.

D.

Add an IAM policy to the QuickSight service role to give QuickSight access to the KMS key that encrypts the S3 bucket.

E.

Add the KMS key as a resource that the QuickSight service role can access.

Question # 54

A company needs to collect logs for an Amazon RDS for MySQL database and make the logs available for audits. The logs must track each user that modifies data in the database or makes changes to the database instance.

Which solution will meet these requirements?

A.

Enable Amazon CloudWatch Logs. Create metric filters to monitor database changes and instance-level changes. Configure automated notification systems to send near real-time alerts for suspicious database operations.

B.

Configure an Amazon EventBridge rule to monitor database activity. Create an AWS Lambda function to process EventBridge events and store them in Amazon OpenSearch Service.

C.

Configure AWS CloudTrail to log API calls. Use Amazon CloudWatch Logs for basic monitoring. Use IAM policies to control access to the logs. Set up scheduled reporting for log audits.

D.

Enable and configure native Amazon RDS database audit logging. Enable Amazon CloudWatch Logs. Configure metric filters and alarms. Configure AWS CloudTrail audit logging.

Question # 55

A data engineer configured an AWS Glue Data Catalog for data that is stored in Amazon S3 buckets. The data engineer needs to configure the Data Catalog to receive incremental updates.

The data engineer sets up event notifications for the S3 bucket and creates an Amazon Simple Queue Service (Amazon SQS) queue to receive the S3 events.

Which combination of steps should the data engineer take to meet these requirements with LEAST operational overhead? (Select TWO.)

A.

Create an S3 event-based AWS Glue crawler to consume events from the SQS queue.

B.

Define a time-based schedule to run the AWS Glue crawler, and perform incremental updates to the Data Catalog.

C.

Use an AWS Lambda function to directly update the Data Catalog based on S3 events that the SQS queue receives.

D.

Manually initiate the AWS Glue crawler to perform updates to the Data Catalog when there is a change in the S3 bucket.

E.

Use AWS Step Functions to orchestrate the process of updating the Data Catalog based on S3 events that the SQS queue receives.

Question # 56

A company needs to optimize storage for an Amazon S3 bucket. Objects older than 1 year must be accessible within 5 hours. All versions of the objects must be retained and immutable for 7 years. All versions of the objects must use the write-once-read-many (WORM) model.

Which solution will meet these requirements?

A.

Configure S3 Versioning on the bucket and use the S3 Intelligent-Tiering storage class. Configure a lifecycle policy for the bucket to transition objects that are older than 1 year to S3 Glacier Flexible Retrieval. Configure the policy to delete objects that are older than 7 years.

B.

Configure S3 Object Lock on the bucket and use the S3 Intelligent-Tiering storage class. Configure a lifecycle policy for the bucket to transition objects that are older than 1 year to S3 Glacier Deep Archive. Configure the policy to delete objects that are older than 7 years.

C.

Configure S3 Object Lock on the bucket and use the S3 Intelligent-Tiering storage class. Configure a lifecycle policy for the bucket to transition objects that are older than 1 year to S3 Glacier Flexible Retrieval. Configure the policy to delete objects that are older than 7 years.

D.

Configure S3 Versioning on the bucket and use the S3 Intelligent-Tiering storage class. Configure a lifecycle policy for the bucket to transition objects that are older than 1 year to S3 Glacier Deep Archive. Configure the policy to delete objects that are older than 7 years.

Question # 57

A company aggregates high-frequency sensor telemetry into an Amazon S3 data lake. Each sensor stream emits structured records every hour. The records include metadata such as sensor category, unit ID, operational state, event timestamp, and site location. The data scales up to millions of records each day. The company runs complex queries each day to uncover performance insights specific to sensor categories.

Which solution will meet these requirements with the FASTEST query execution time?

A.

Persist the data in Apache ORC format. Partition the data by date. Sort the data by sensor category.

B.

Persist the data in CSV format. Partition the data by date. Sort the data by operational status.

C.

Persist the data in Parquet format. Partition the data by sensor category. Sort the data by date.

D.

Persist the data in CSV format. Partition the data by date. Sort the data by sensor category.

Question # 58

A data engineer configures a large number of AWS Glue jobs that all start up around the same time. All the jobs run for less than 1 hour in the same subnet of the same VPC. All the AWS Glue jobs run on a G.1X worker type.

Some of the jobs occasionally fail with the following error: “The specified subnet does not have enough free addresses to satisfy the request.”

What is the likely root cause of the error?

A.

There are not enough IP addresses in the subnet.

B.

The G.1X worker type cannot access the subnet.

C.

AWS Glue does not have the correct IAM permissions to add additional IP addresses to the subnet.

D.

There are not enough IP addresses in the VPC.
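
The error message is about address capacity, so a quick back-of-the-envelope check can help when sizing a subnet for concurrent jobs. The sketch below uses Python's `ipaddress` module and the fact that AWS reserves five addresses in every subnet. The CIDR block, the job and worker counts, and the assumption of one private IP address per worker are hypothetical values chosen for illustration.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves five addresses in every subnet


def free_addresses(cidr, in_use=0):
    """Addresses left in a subnet after AWS reservations and current usage."""
    total = ipaddress.ip_network(cidr).num_addresses
    return total - AWS_RESERVED_PER_SUBNET - in_use


# Hypothetical workload: 30 jobs start together with 10 workers each,
# assuming each worker needs one private IP address in the subnet.
needed = 30 * 10
available = free_addresses("10.0.1.0/24")

print("addresses needed:", needed)
print("addresses available:", available)
print("fits in subnet:", needed <= available)
```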

Question # 59

A company uses Amazon Athena for one-time queries against data that is in Amazon S3. The company has several use cases. The company must implement permission controls to separate query processes and access to query history among users, teams, and applications that are in the same AWS account.

Which solution will meet these requirements?

A.

Create an S3 bucket for each use case. Create an S3 bucket policy that grants permissions to appropriate individual IAM users. Apply the S3 bucket policy to the S3 bucket.

B.

Create an Athena workgroup for each use case. Apply tags to the workgroup. Create an IAM policy that uses the tags to apply appropriate permissions to the workgroup.

C.

Create an IAM role for each use case. Assign appropriate permissions to the role for each use case. Associate the role with Athena.

D.

Create an AWS Glue Data Catalog resource policy that grants permissions to appropriate individual IAM users for each use case. Apply the resource policy to the specific tables that Athena uses.

Question # 60

A company stores customer records in Amazon S3. The company must not delete or modify the customer record data for 7 years after each record is created. The root user also must not have the ability to delete or modify the data.

A data engineer wants to use S3 Object Lock to secure the data.

Which solution will meet these requirements?

A.

Enable governance mode on the S3 bucket. Use a default retention period of 7 years.

B.

Enable compliance mode on the S3 bucket. Use a default retention period of 7 years.

C.

Place a legal hold on individual objects in the S3 bucket. Set the retention period to 7 years.

D.

Set the retention period for individual objects in the S3 bucket to 7 years.
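
For reference, S3 Object Lock has two retention modes. In governance mode, principals that hold the `s3:BypassGovernanceRetention` permission can override the retention settings; in compliance mode, no user, including the root user, can shorten or remove the retention period. The sketch below builds a bucket-level default retention configuration in the shape that the S3 `PutObjectLockConfiguration` API accepts (for example through boto3's `put_object_lock_configuration`). The helper name is hypothetical, and the snippet only constructs the request body without calling AWS.

```python
import json

VALID_MODES = {"GOVERNANCE", "COMPLIANCE"}


def default_retention(mode, years):
    """Build an Object Lock configuration with a default retention rule (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": mode, "Years": years}},
    }


# Print the request body for each mode with a 7-year default retention period.
for mode in sorted(VALID_MODES):
    print(json.dumps(default_retention(mode, 7)))
```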

Question # 61

A company wants to combine data from multiple software as a service (SaaS) applications for analysis.

A data engineering team needs to use Amazon QuickSight to perform the analysis and build dashboards. A data engineer needs to extract the data from the SaaS applications and make the data available for QuickSight queries.

Which solution will meet these requirements in the MOST operationally efficient way?

A.

Create AWS Lambda functions that call the required APIs to extract the data from the applications. Store the data in an Amazon S3 bucket. Use AWS Glue to catalog the data in the S3 bucket. Create a data source and a dataset in QuickSight.

B.

Use AWS Lambda functions as Amazon Athena data source connectors to run federated queries against the SaaS applications. Create an Athena data source and a dataset in QuickSight.

C.

Use Amazon AppFlow to create a flow for each SaaS application. Set an Amazon S3 bucket as the destination. Schedule the flows to extract the data to the bucket. Use AWS Glue to catalog the data in the S3 bucket. Create a data source and a dataset in QuickSight.

D.

Export the data from the SaaS applications as Microsoft Excel files. Create a data source and a dataset in QuickSight by uploading the Excel files.

Question # 62

A company processes 500 GB of audience and advertising data daily, storing CSV files in Amazon S3 with schemas registered in AWS Glue Data Catalog. They need to convert these files to Apache Parquet format and store them in an S3 bucket.

The solution requires a long-running workflow with 15 GiB memory capacity to process the data concurrently, followed by a correlation process that begins only after the first two processes complete.

A.

Use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate the workflow by using AWS Glue. Configure AWS Glue to begin the third process after the first two processes have finished.

B.

Use Amazon EMR to run each process in the workflow. Create an Amazon Simple Queue Service (Amazon SQS) queue to handle messages that indicate the completion of the first two processes. Configure an AWS Lambda function to process the SQS queue by running the third process.

C.

Use AWS Glue workflows to run the first two processes in parallel. Ensure that the third process starts after the first two processes have finished.

D.

Use AWS Step Functions to orchestrate a workflow that uses multiple AWS Lambda functions. Ensure that the third process starts after the first two processes have finished.

Question # 63

A company’s data processing pipeline uses AWS Glue jobs and AWS Glue Data Catalog. All AWS Glue jobs must run in a custom VPC inside a private subnet. The company uses a NAT gateway to support outbound connections.

A data engineer needs to use AWS Glue to migrate data from an on-premises PostgreSQL database to Amazon S3. There is no current network connection between AWS and the on-premises environment. However, the data engineer has updated the on-premises database to allow traffic from the custom VPC.

Which solution will meet these requirements?

A.

Create a JDBC connection in AWS Glue with the database JDBC URL, username, and password.

B.

Create a Simple Authentication and Security Layer (SASL) connection in AWS Glue to the on-premises database.

C.

Create a JDBC connection in AWS Glue with a security group that allows TCP traffic to and from itself.

D.

Create a JDBC connection in AWS Glue that uses a JDBC driver stored in Amazon S3. Retrieve the database URL, username, and password from AWS Secrets Manager.

Question # 64

A data engineer is building a data pipeline. A large data file is uploaded to an Amazon S3 bucket once each day at unpredictable times. An AWS Glue workflow uses hundreds of workers to process the file and load the data into Amazon Redshift. The company wants to process the file as quickly as possible.

Which solution will meet these requirements?

A.

Create an on-demand AWS Glue trigger to start the workflow. Create an AWS Lambda function that runs every 15 minutes to check the S3 bucket for the daily file. Configure the function to start the AWS Glue workflow if the file is present.

B.

Create an event-based AWS Glue trigger to start the workflow. Configure Amazon S3 to log events to AWS CloudTrail. Create a rule in Amazon EventBridge to forward PutObject events to the AWS Glue trigger.

C.

Create a scheduled AWS Glue trigger to start the workflow. Create a cron job that runs the AWS Glue job every 15 minutes. Set up the AWS Glue job to check the S3 bucket for the daily file. Configure the job to stop if the file is not present.

D.

Create an on-demand AWS Glue trigger to start the workflow. Create an AWS Database Migration Service (AWS DMS) migration task. Set the DMS source as the S3 bucket. Set the target endpoint as the AWS Glue workflow.
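
One option routes S3 `PutObject` events through CloudTrail and EventBridge. As a sketch of what such a rule matches, the snippet below builds an event pattern of that kind and checks it against a sample event. The bucket name and the sample event are hypothetical, and `matches` is a deliberately simplified stand-in for EventBridge's matching logic that handles only nested objects and lists of exact values.

```python
import json


def s3_put_object_pattern(bucket_name):
    """Event pattern for PutObject calls on one bucket, as recorded by CloudTrail."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["s3.amazonaws.com"],
            "eventName": ["PutObject"],
            "requestParameters": {"bucketName": [bucket_name]},
        },
    }


def matches(pattern, event):
    """Simplified matcher: nested dicts plus lists of exact allowed values only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


pattern = s3_put_object_pattern("example-daily-files")  # hypothetical bucket name

# Hypothetical, heavily trimmed event for illustration.
sample_event = {
    "source": "aws.s3",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {
        "eventSource": "s3.amazonaws.com",
        "eventName": "PutObject",
        "requestParameters": {"bucketName": "example-daily-files", "key": "daily.csv"},
    },
}

print(json.dumps(pattern, indent=2))
print("sample event matches:", matches(pattern, sample_event))
```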

Question # 65

A data engineer needs to build a data pipeline to process medical records from 50 hospitals. The pipeline must ingest 5 GB of data from each hospital and remove personally identifiable information (PII). The pipeline must then transform the data and save the data in a central store. The pipeline must automatically retry after transient failures without manual intervention.

Which solution will meet these requirements with the LEAST operational overhead?

A.

Store the data in Amazon S3. Use AWS Glue extract, transform, and load (ETL) jobs to process the data. Use AWS Glue DataBrew to remove the PII. Orchestrate the pipeline by using AWS Step Functions.

B.

Deploy an Amazon EC2 instance to run a custom Python script to orchestrate the pipeline and remove the PII. Store the data in Amazon RDS. Use AWS Batch to process the data.

C.

Store the data in Amazon S3. Create an AWS Lambda function to process the data and mask the PII. Configure Amazon EventBridge to orchestrate the pipeline.

D.

Orchestrate the pipeline by using AWS Batch to remove the PII and transform the data. Store the data in Amazon S3.

Question # 66

A data engineer uses Amazon Kinesis Data Streams to ingest and process records that contain user behavior data from an application every day.

The data engineer notices that the data stream is experiencing throttling because hot shards receive much more data than other shards in the data stream.

How should the data engineer resolve the throttling issue?

A.

Use a random partition key to distribute the ingested records.

B.

Increase the number of shards in the data stream. Distribute the records across the shards.

C.

Limit the number of records that are sent each second by the producer to match the capacity of the stream.

D.

Decrease the size of the records that the producer sends to match the capacity of the stream.
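
To see why the choice of partition key produces hot shards, recall that Kinesis Data Streams maps each partition key to a 128-bit integer with an MD5 hash and routes the record to the shard whose hash key range contains that value. The sketch below reproduces that mapping locally, assuming the shards split the hash range evenly (the default for a newly created stream). The key names, shard count, and record counts are hypothetical.

```python
import hashlib
import uuid
from collections import Counter

HASH_SPACE = 2 ** 128  # size of the 128-bit MD5 hash key space


def shard_for(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split hash key ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


SHARDS = 4

# Every record shares one partition key, so every record lands on the same shard.
fixed = Counter(shard_for("app-events", SHARDS) for _ in range(10_000))

# A random partition key per record spreads the records over all shards.
spread = Counter(shard_for(str(uuid.uuid4()), SHARDS) for _ in range(10_000))

print("shards used with a fixed key:", len(fixed))
print("shards used with random keys:", len(spread))
print("busiest shard with random keys:", max(spread.values()))
```

Records that share a partition key always reach the same shard, so a fully random key evens out the load but gives up per-key ordering.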

Question # 67

A company uses Amazon Redshift for its data warehouse. The company must automate refresh schedules for Amazon Redshift materialized views.

Which solution will meet this requirement with the LEAST effort?

A.

Use Apache Airflow to refresh the materialized views.

B.

Use an AWS Lambda user-defined function (UDF) within Amazon Redshift to refresh the materialized views.

C.

Use the query editor v2 in Amazon Redshift to refresh the materialized views.

D.

Use an AWS Glue workflow to refresh the materialized views.

Question # 68

A data engineer needs to debug an AWS Glue job that reads from Amazon S3 and writes to Amazon Redshift. The data engineer enabled the bookmark feature for the AWS Glue job. The data engineer has set the maximum concurrency for the AWS Glue job to 1.

The AWS Glue job is successfully writing the output to Amazon Redshift. However, the Amazon S3 files that were loaded during previous runs of the AWS Glue job are being reprocessed by subsequent runs.

What is the likely reason the AWS Glue job is reprocessing the files?

A.

The AWS Glue job does not have the s3:GetObjectAcl permission that is required for bookmarks to work correctly.

B.

The maximum concurrency for the AWS Glue job is set to 1.

C.

The data engineer incorrectly specified an older version of AWS Glue for the Glue job.

D.

The AWS Glue job does not have a required commit statement.

Question # 69

A company runs concurrent analytical queries on Amazon Redshift tables multiple times each day. The queries require consistent data views three times each day. The company runs extract, transform, and load (ETL) operations that update dimension tables while the queries run. The company has noticed that the queries cause table-level locks during the ETL operations. The company's current solution experiences query timeouts and deadlocks during peak processing hours, which affects analytical reporting and on-demand analysis.

Which solution will fix this issue?

A.

Use Amazon Redshift materialized views for analytical queries. Schedule ETL operations during off-peak hours to minimize lock contention.

B.

Configure Amazon Redshift federated queries to access source data directly. Use read replicas to isolate analytical workloads from ETL operations.

C.

Use Amazon Redshift Spectrum to query data in Amazon S3 for analytical workloads. Maintain ETL operations on Amazon Redshift tables with transaction isolation.

D.

Deploy separate Amazon Redshift clusters for ETL and analytics workloads. Use cross-database queries and data sharing to maintain data consistency.

Question # 70

A company has multiple applications that use datasets that are stored in an Amazon S3 bucket. The company has an ecommerce application that generates a dataset that contains personally identifiable information (PII). The company has an internal analytics application that does not require access to the PII.

To comply with regulations, the company must not share PII unnecessarily. A data engineer needs to implement a solution that will redact PII dynamically, based on the needs of each application that accesses the dataset.

Which solution will meet the requirements with the LEAST operational overhead?

A.

Create an S3 bucket policy to limit the access each application has. Create multiple copies of the dataset. Give each dataset copy the appropriate level of redaction for the needs of the application that accesses the copy.

B.

Create an S3 Object Lambda endpoint. Use the S3 Object Lambda endpoint to read data from the S3 bucket. Implement redaction logic within an S3 Object Lambda function to dynamically redact PII based on the needs of each application that accesses the data.

C.

Use AWS Glue to transform the data for each application. Create multiple copies of the dataset. Give each dataset copy the appropriate level of redaction for the needs of the application that accesses the copy.

D.

Create an API Gateway endpoint that has custom authorizers. Use the API Gateway endpoint to read data from the S3 bucket. Initiate a REST API call to dynamically redact PII based on the needs of each application that accesses the data.
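
To make "dynamic redaction based on the needs of each application" concrete, here is a minimal sketch of per-application redaction logic of the kind a transformation function could apply to an object before returning it. It is plain Python and omits any AWS event handling. The field names, the application names, and the REDACTION_PROFILES mapping are assumptions made for illustration.

```python
import json

# Hypothetical PII fields and per-application redaction profiles.
ALL_PII_FIELDS = {"name", "email", "phone"}
REDACTION_PROFILES = {
    "ecommerce": set(),           # needs the PII, so nothing is redacted
    "analytics": ALL_PII_FIELDS,  # does not need PII, so all of it is redacted
}


def redact(raw_bytes, application):
    """Return a JSON array of records with PII masked for the given application."""
    # Unknown callers fall back to the most restrictive profile.
    fields = REDACTION_PROFILES.get(application, ALL_PII_FIELDS)
    records = json.loads(raw_bytes)
    cleaned = [
        {key: ("[REDACTED]" if key in fields else value) for key, value in record.items()}
        for record in records
    ]
    return json.dumps(cleaned).encode("utf-8")
```

Because the redaction happens at read time, a single stored copy of the dataset can serve both applications in this sketch.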

Question # 71

A company's data engineer needs to optimize the performance of SQL queries on tables. The company stores data in an Amazon Redshift cluster. The data engineer cannot increase the size of the cluster because of budget constraints.

The company stores the data in multiple tables and loads the data by using the EVEN distribution style. Some tables are hundreds of gigabytes in size. Other tables are less than 10 MB in size.

Which solution will meet these requirements?

A.

Keep using the EVEN distribution style for all tables. Specify primary and foreign keys for all tables.

B.

Use the ALL distribution style for large tables. Specify primary and foreign keys for all tables.

C.

Use the ALL distribution style for rarely updated small tables. Specify primary and foreign keys for all tables.

D.

Specify a combination of distribution, sort, and partition keys for all tables.

Question # 72

A financial company wants to use Amazon Athena to run on-demand SQL queries on a petabyte-scale dataset to support a business intelligence (BI) application. An AWS Glue job that runs during non-business hours updates the dataset once every day. The BI application has a standard data refresh frequency of 1 hour to comply with company policies.

A data engineer wants to cost optimize the company's use of Amazon Athena without adding any additional infrastructure costs.

Which solution will meet these requirements with the LEAST operational overhead?

A.

Configure an Amazon S3 Lifecycle policy to move data to the S3 Glacier Deep Archive storage class after 1 day.

B.

Use the query result reuse feature of Amazon Athena for the SQL queries.

C.

Add an Amazon ElastiCache cluster between the BI application and Athena.

D.

Change the format of the files that are in the dataset to Apache Parquet.

Question # 73

A retail company stores customer data in an Amazon S3 bucket. Some of the customer data contains personally identifiable information (PII) about customers. The company must not share PII data with business partners.

A data engineer must determine whether a dataset contains PII before making objects in the dataset available to business partners.

Which solution will meet this requirement with the LEAST manual intervention?

A.

Configure the S3 bucket and S3 objects to allow access to Amazon Macie. Use automated sensitive data discovery in Macie.

B.

Configure AWS CloudTrail to monitor S3 PUT operations. Inspect the CloudTrail trails to identify operations that save PII.

C.

Create an AWS Lambda function to identify PII in S3 objects. Schedule the function to run periodically.

D.

Create a table in AWS Glue Data Catalog. Write custom SQL queries to identify PII in the table. Use Amazon Athena to run the queries.

Question # 74

A data engineer is creating a product recommendation system that requires vector search across 5 million product embeddings with 768 dimensions. The system must prioritize search accuracy and maintain search latency under 100 milliseconds. The data engineer wants to implement a k-nearest neighbors (k-NN) vector index in Amazon OpenSearch Service.

Which vector index type should the data engineer use?

A.

Inverted File Index (IVF) with the Faiss engine.

B.

Hierarchical Navigable Small World (HNSW) with the Faiss engine.

C.

Exact k-NN search with a scoring script.

D.

Flat index with binary quantization.

Question # 75

A gaming company uses AWS Glue to perform read and write operations on Apache Iceberg tables for real-time streaming data. The data in the Iceberg tables is stored in Apache Parquet format. The company is experiencing slow query performance.

Which solutions will improve query performance? (Select TWO.)

A.

Use AWS Glue Data Catalog to generate column-level statistics for the Iceberg tables on a schedule.

B.

Use AWS Glue Data Catalog to automatically compact the Iceberg tables.

C.

Use AWS Glue Data Catalog to automatically optimize indexes for the Iceberg tables.

D.

Use AWS Glue Data Catalog to enable copy-on-write for the Iceberg tables.

E.

Use AWS Glue Data Catalog to generate views for the Iceberg tables.

Question # 76

A company stores historical customer data in an Amazon Redshift table. A column named Email contains null entries and values that are not email addresses. The quality of the Email column is critical for multiple downstream processes. A data engineer must create an AWS Glue Data Quality rule that fails when the percentage of valid email addresses in the Email column is less than 90%.

Which component of an AWS Glue Data Quality rule will meet these requirements?

A.

Uniqueness "Email" matches with a threshold set to > 0.9

B.

ColumnValues "Email" matches with a threshold set to > 0.1

C.

ColumnValues "Email" matches with a threshold set to > 0.9

D.

UniqueValueRatio "Email" matches with a threshold set to > 0.1
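
The threshold wording in these options can be illustrated with a small pure-Python sketch: compute the fraction of column values that match a pattern, then compare that fraction against a threshold. This is not AWS Glue Data Quality itself. The simplified email pattern, the treatment of nulls as non-matching, and the function names are assumptions made for illustration.

```python
import re

# Deliberately simple email pattern; an assumption, not the rule AWS applies.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def match_ratio(values):
    """Fraction of values that fully match the pattern; nulls count as misses."""
    if not values:
        return 0.0
    matched = sum(
        1 for value in values
        if value is not None and EMAIL_PATTERN.fullmatch(value)
    )
    return matched / len(values)


def rule_passes(values, threshold=0.9):
    """A 'matches with threshold > t' style check on the matching fraction."""
    return match_ratio(values) > threshold
```

With 19 valid addresses and one null, the ratio is 0.95 and the check passes at a 0.9 threshold. With 9 valid addresses and one malformed value, the ratio is exactly 0.9, which is not strictly greater than 0.9, so the check fails.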

Question # 77

A company has an Amazon S3–based data lake. The data lake contains datasets that belong to multiple departments. The data lake ingests millions of customer records each day.

A data engineer needs to design an access and storage solution that allows departments to access only the subset of the company’s dataset that each department requires. The solution must follow the principle of least privilege.

Which solution will meet these requirements with the LEAST operational effort?

A.

Define IAM policies and IAM roles for each department. Specify the S3 access paths from the data lake that each team can access.

B.

Set up Amazon Redshift and Amazon Redshift Spectrum as the primary entry points for the data lake. Define an IAM role that Amazon Redshift can assume. Configure the IAM role to grant access to the data that is in Amazon S3.

C.

Set up AWS Lake Formation. Assign LF-Tags to AWS Glue Data Catalog resources. Enable Lake Formation tag-based access control (LF-TBAC).

D.

Deploy an Amazon RDS for PostgreSQL database that has the aws_s3 extension installed. Configure AWS Step Functions events to invoke an AWS Lambda function to sync the data lake with the database.

Question # 78

A company currently uses a provisioned Amazon EMR cluster that includes general purpose Amazon EC2 instances. The EMR cluster uses EMR managed scaling between one and five task nodes for the company's long-running Apache Spark extract, transform, and load (ETL) job. The company runs the ETL job every day.

When the company runs the ETL job, the EMR cluster quickly scales up to five nodes. The EMR cluster often reaches maximum CPU usage, but the memory usage remains under 30%.

The company wants to modify the EMR cluster configuration to reduce the EMR costs to run the daily ETL job.

Which solution will meet these requirements MOST cost-effectively?

A.

Increase the maximum number of task nodes for EMR managed scaling to 10.

B.

Change the task node type from general purpose EC2 instances to memory optimized EC2 instances.

C.

Switch the task node type from general purpose EC2 instances to compute optimized EC2 instances.

D.

Reduce the scaling cooldown period for the provisioned EMR cluster.

Question # 79

A company uses an Amazon Redshift cluster that runs on RA3 nodes. The company wants to scale read and write capacity to meet demand. A data engineer needs to identify a solution that will turn on concurrency scaling.

Which solution will meet this requirement?

A.

Turn on concurrency scaling in workload management (WLM) for Redshift Serverless workgroups.

B.

Turn on concurrency scaling at the workload management (WLM) queue level in the Redshift cluster.

C.

Turn on concurrency scaling in the settings during the creation of a new Redshift cluster.

D.

Turn on concurrency scaling for the daily usage quota for the Redshift cluster.

Question # 80

A data engineer is configuring Amazon SageMaker Studio to use AWS Glue interactive sessions to prepare data for machine learning (ML) models.

The data engineer receives an access denied error when the data engineer tries to prepare the data by using SageMaker Studio.

Which change should the engineer make to gain access to SageMaker Studio?

A.

Add the AWSGlueServiceRole managed policy to the data engineer's IAM user.

B.

Add a policy to the data engineer's IAM user that includes the sts:AssumeRole action for the AWS Glue and SageMaker service principals in the trust policy.

C.

Add the AmazonSageMakerFullAccess managed policy to the data engineer's IAM user.

D.

Add a policy to the data engineer's IAM user that allows the sts:AddAssociation action for the AWS Glue and SageMaker service principals in the trust policy.

Question # 81

A company is building an inventory management system and an inventory reordering system to automatically reorder products. Both systems use Amazon Kinesis Data Streams. The inventory management system uses the Amazon Kinesis Producer Library (KPL) to publish data to a stream. The inventory reordering system uses the Amazon Kinesis Client Library (KCL) to consume data from the stream. The company configures the stream to scale up and down as needed.

Before the company deploys the systems to production, the company discovers that the inventory reordering system received duplicated data.

Which factors could have caused the reordering system to receive duplicated data? (Select TWO.)

A.

The producer experienced network-related timeouts.

B.

The stream's value for the IteratorAgeMilliseconds metric was too high.

C.

There was a change in the number of shards, record processors, or both.

D.

The AggregationEnabled configuration property was set to true.

E.

The max_records configuration property was set to a number that was too high.
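
Whatever the cause of duplicate delivery, consumers of at-least-once streams are commonly written to be idempotent. The sketch below shows one such approach in plain Python: the consumer remembers a unique ID that the producer is assumed to embed in each record and skips IDs it has already handled. The record layout, the "id" field, and the function name are hypothetical. A real consumer would keep the seen-ID set in durable storage rather than in memory.

```python
def process_once(records, seen_ids, handler):
    """Apply handler to each record at most once, keyed by record['id'].

    Returns the number of records actually handled in this batch.
    """
    handled = 0
    for record in records:
        record_id = record["id"]
        if record_id in seen_ids:
            # Duplicate delivery (for example, a retried put); skip it.
            continue
        handler(record)
        seen_ids.add(record_id)
        handled += 1
    return handled
```

Calling process_once twice with the same batch and the same seen_ids set handles each distinct record only on the first call.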

Question # 82

A company receives call logs as Amazon S3 objects that contain sensitive customer information. The company must protect the S3 objects by using encryption. The company must also use encryption keys that only specific employees can access.

Which solution will meet these requirements with the LEAST effort?

A.

Use an AWS CloudHSM cluster to store the encryption keys. Configure the process that writes to Amazon S3 to make calls to CloudHSM to encrypt and decrypt the objects. Deploy an IAM policy that restricts access to the CloudHSM cluster.

B.

Use server-side encryption with customer-provided keys (SSE-C) to encrypt the objects that contain customer information. Restrict access to the keys that encrypt the objects.

C.

Use server-side encryption with AWS KMS keys (SSE-KMS) to encrypt the objects that contain customer information. Configure an IAM policy that restricts access to the KMS keys that encrypt the objects.

D.

Use server-side encryption with Amazon S3 managed keys (SSE-S3) to encrypt the objects that contain customer information. Configure an IAM policy that restricts access to the Amazon S3 managed keys that encrypt the objects.

Question # 83

A data engineer is designing a log table for an application that requires continuous ingestion. The application must provide dependable API-based access to specific records from other applications. The application must handle more than 4,000 concurrent write operations and 6,500 read operations every second.

A.

Create an Amazon Redshift table with the KEY distribution style. Use the Amazon Redshift Data API to perform all read and write operations.

B.

Store the log files in an Amazon S3 Standard bucket. Register the schema in AWS Glue Data Catalog. Create an external Redshift table that points to the AWS Glue schema. Use the table to perform Amazon Redshift Spectrum read operations.

C.

Create an Amazon Redshift table with the EVEN distribution style. Use the Amazon Redshift JDBC connector to establish a database connection. Use the database connection to perform all read and write operations.

D.

Create an Amazon DynamoDB table that has provisioned capacity to meet the application's capacity needs. Use the DynamoDB table to perform all read and write operations by using DynamoDB APIs.

Question # 84

A company needs to store semi-structured transactional data in a serverless database.

The application writes data infrequently but reads it frequently, with millisecond retrieval required.

A.

Store the data in an Amazon S3 Standard bucket. Enable S3 Transfer Acceleration.

B.

Store the data in an Amazon S3 Apache Iceberg table. Enable S3 Transfer Acceleration.

C.

Store the data in an Amazon RDS for MySQL cluster. Configure RDS Optimized Reads.

D.

Store the data in an Amazon DynamoDB table. Configure a DynamoDB Accelerator (DAX) cache.

Question # 85

A research company stores data in an Amazon Redshift cluster. The company needs to share data between departments and maintain regulatory compliance. The company needs a solution that gives researchers access to only the records from their own departments and does not create multiple dataset copies. The solution must also ensure that personally identifiable information (PII) is protected from unauthorized access.

Which solution will meet these requirements?

A.

Create a datashare in Amazon Redshift for each department. Use cross-Region data sharing to distribute copies of the entire dataset to each department's Amazon Redshift cluster.

B.

Implement row-level security policies with basic SQL filters based on department. Attach the security policies to the data tables. Grant EXPLAIN RLS permission to authorized researchers.

C.

Create separate schemas for each department with appropriate views that filter data. Grant each department access to only their respective schema.

D.

Use row-level security policies with multi-condition SQL predicates. Attach the security policies to the data tables. Grant each department's role access to the appropriate policies.

Question # 86

A company is building a data lake for a new analytics team. The company is using Amazon S3 for storage and Amazon Athena for query analysis. All data that is in Amazon S3 is in Apache Parquet format.

The company is running a new Oracle database as a source system in the company's data center. The company has 70 tables in the Oracle database. All the tables have primary keys. Data can occasionally change in the source system. The company wants to ingest the tables every day into the data lake.

Which solution will meet this requirement with the LEAST effort?

A.

Create an Apache Sqoop job in Amazon EMR to read the data from the Oracle database. Configure the Sqoop job to write the data to Amazon S3 in Parquet format.

B.

Create an AWS Glue connection to the Oracle database. Create an AWS Glue bookmark job to ingest the data incrementally and to write the data to Amazon S3 in Parquet format.

C.

Create an AWS Database Migration Service (AWS DMS) task for ongoing replication. Set the Oracle database as the source. Set Amazon S3 as the target. Configure the task to write the data in Parquet format.

D.

Create an Oracle database in Amazon RDS. Use AWS Database Migration Service (AWS DMS) to migrate the on-premises Oracle database to Amazon RDS. Configure triggers on the tables to invoke AWS Lambda functions to write changed records to Amazon S3 in Parquet format.

Question # 87

A data engineer wants to optimize the runtime performance of an AWS Glue extract, transform, and load (ETL) job. The job processes large JSON files from Amazon S3. The job currently reads all fields from the source files but transforms only a subset of the fields.

Which solution will meet these requirements?

A.

Enable job bookmarks. Implement a custom bookmark key that uses a timestamp field.

B.

Implement pushdown predicates. Specify only required fields in the source schema definition.

C.

Create multiple smaller AWS Glue jobs. Configure each job to process a different field subset in parallel.

D.

Convert input files to Parquet format by using an AWS Glue crawler before processing the files.

Question # 88

A company wants to implement real-time analytics capabilities. The company wants to use Amazon Kinesis Data Streams and Amazon Redshift to ingest and process streaming data at the rate of several gigabytes per second. The company wants to derive near real-time insights by using existing business intelligence (BI) and analytics tools.

Which solution will meet these requirements with the LEAST operational overhead?

A.

Use Kinesis Data Streams to stage data in Amazon S3. Use the COPY command to load data from Amazon S3 directly into Amazon Redshift to make the data immediately available for real-time analysis.

B.

Access the data from Kinesis Data Streams by using SQL queries. Create materialized views directly on top of the stream. Refresh the materialized views regularly to query the most recent stream data.

C.

Create an external schema in Amazon Redshift to map the data from Kinesis Data Streams to an Amazon Redshift object. Create a materialized view to read data from the stream. Set the materialized view to auto refresh.

D.

Connect Kinesis Data Streams to Amazon Kinesis Data Firehose. Use Kinesis Data Firehose to stage the data in Amazon S3. Use the COPY command to load the data from Amazon S3 to a table in Amazon Redshift.

Question # 89

A company is migrating its database servers from Amazon EC2 instances that run Microsoft SQL Server to Amazon RDS for Microsoft SQL Server DB instances. The company's analytics team must export large data elements every day until the migration is complete. The data elements are the result of SQL joins across multiple tables. The data must be in Apache Parquet format. The analytics team must store the data in Amazon S3.

Which solution will meet these requirements in the MOST operationally efficient way?

A.

Create a view in the EC2 instance-based SQL Server databases that contains the required data elements. Create an AWS Glue job that selects the data directly from the view and transfers the data in Parquet format to an S3 bucket. Schedule the AWS Glue job to run every day.

B.

Schedule SQL Server Agent to run a daily SQL query that selects the desired data elements from the EC2 instance-based SQL Server databases. Configure the query to direct the output .csv objects to an S3 bucket. Create an S3 event that invokes an AWS Lambda function to transform the output format from .csv to Parquet.

C.

Use a SQL query to create a view in the EC2 instance-based SQL Server databases that contains the required data elements. Create and run an AWS Glue crawler to read the view. Create an AWS Glue job that retrieves the data and transfers the data in Parquet format to an S3 bucket. Schedule the AWS Glue job to run every day.

D.

Create an AWS Lambda function that queries the EC2 instance-based databases by using Java Database Connectivity (JDBC). Configure the Lambda function to retrieve the required data, transform the data into Parquet format, and transfer the data into an S3 bucket. Use Amazon EventBridge to schedule the Lambda function to run every day.
