

Databricks Certified Data Engineer Associate Exam Questions and Answers

Question # 4

A data engineer has a Job with a complex run schedule, and they want to transfer that schedule to other Jobs.

Rather than manually selecting each value in the scheduling form in Databricks, which of the following tools can the data engineer use to represent and submit the schedule programmatically?

A.

pyspark.sql.types.DateType

B.

datetime

C.

pyspark.sql.types.TimestampType

D.

Cron syntax

E.

There is no way to represent and submit this information programmatically

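For reference, Databricks Job schedules are expressed in Quartz cron syntax, which can be submitted programmatically. A minimal sketch of the schedule block as it might appear in a Jobs API request; the expression and timezone here are illustrative assumptions, not taken from the question:

# Hypothetical Jobs API schedule block.
schedule = {
    "quartz_cron_expression": "0 0 9 * * ?",  # Quartz fields: sec min hour day-of-month month day-of-week; fires daily at 09:00
    "timezone_id": "UTC",
}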
Question # 5

A data engineer is maintaining ETL pipeline code in a GitHub repository linked to their Databricks account. The data engineer wants to deploy the ETL pipeline to production as a Databricks Workflow.

Which approach should the data engineer use?

A.

Databricks Asset Bundles (DAB) + GitHub Integration

B.

Maintain workflow_config.json and deploy it using the Databricks CLI

C.

Manually create and manage the workflow in the UI

D.

Maintain workflow_config.json and deploy it using Terraform

Question # 6

A data engineer has a Python notebook in Databricks, but they need to use SQL to accomplish a specific task within a cell. They still want all of the other cells to use Python without making any changes to those cells.

Which of the following describes how the data engineer can use SQL within a cell of their Python notebook?

A.

It is not possible to use SQL in a Python notebook

B.

They can attach the cell to a SQL endpoint rather than a Databricks cluster

C.

They can simply write SQL syntax in the cell

D.

They can add %sql to the first line of the cell

E.

They can change the default language of the notebook to SQL

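For context, the %sql magic command changes the language of a single cell without affecting the rest of the notebook. A minimal sketch of such a cell; my_table is a hypothetical table name:

%sql
SELECT status, count(*) AS n
FROM my_table  -- only this cell runs as SQL; the other cells stay Python
GROUP BY status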
Question # 7

A data engineer needs access to a table new_table, but they do not have the correct permissions. They can ask the table owner for permission, but they do not know who the table owner is.

Which of the following approaches can be used to identify the owner of new_table?

A.

Review the Permissions tab in the table's page in Data Explorer

B.

All of these options can be used to identify the owner of the table

C.

Review the Owner field in the table's page in Data Explorer

D.

Review the Owner field in the table's page in the cloud storage solution

E.

There is no way to identify the owner of the table

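Besides the Data Explorer UI, the owner can usually also be retrieved programmatically. A sketch, assuming permission to describe the table:

# DESCRIBE TABLE EXTENDED includes an "Owner" row in its detailed metadata.
spark.sql("DESCRIBE TABLE EXTENDED new_table").show(truncate=False)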
Question # 8

A data engineer is working on a personal laptop and needs to perform complex transformations on data stored in a Delta Lake on cloud storage. The engineer decides to use Databricks Connect to interact with Databricks clusters and work in their local IDE.

How does Databricks Connect enable the engineer to develop, test, and debug code seamlessly on their local machine while interacting with Databricks clusters?

A.

By allowing direct execution of Spark jobs from the local machine without needing a network connection

B.

By providing a local environment that mimics the Databricks runtime, enabling the engineer to develop, test, and debug code using a specific IDE that is required by Databricks

C.

By providing a local environment that mimics the Databricks runtime, enabling the engineer to develop, test, and debug code using their preferred IDE

D.

By providing a local environment that mimics the Databricks runtime, enabling the engineer to develop, test, and debug code only through Databricks' own web interface

Question # 9

A data engineer needs access to a table new_table, but they do not have the correct permissions. They can ask the table owner for permission, but they do not know who the table owner is.

Which approach can be used to identify the owner of new_table?

A.

There is no way to identify the owner of the table

B.

Review the Owner field in the table's page in the cloud storage solution

C.

Review the Permissions tab in the table's page in Data Explorer

D.

Review the Owner field in the table’s page in Data Explorer

Question # 10

A data engineer has been using a Databricks SQL dashboard to monitor the cleanliness of the input data to an ELT job. The ELT job has a Databricks SQL query that returns the number of input records containing unexpected NULL values. The data engineer wants their entire team to be notified via a messaging webhook whenever this value reaches 100.

Which of the following approaches can the data engineer use to notify their entire team via a messaging webhook whenever the number of NULL values reaches 100?

A.

They can set up an Alert with a custom template.

B.

They can set up an Alert with a new email alert destination.

C.

They can set up an Alert with a new webhook alert destination.

D.

They can set up an Alert with one-time notifications.

E.

They can set up an Alert without notifications.

Question # 11

A global retail company sells products across multiple categories (e.g., Electronics, Clothing) and regions (e.g., North, South, East, West). The sales team has provided the data engineer with a PySpark DataFrame named sales_df as below, and the team wants the data engineer to analyze the sales data to help them make strategic decisions.

Which code block correctly calculates the total sales amount for each category?

A.

category_sales = sales_df.groupBy("category").agg(sum("sales_amount").alias("total_sales_amount"))

B.

category_sales = sales_df.sum("sales_amount").groupBy("category").alias("total_sales_amount")

C.

category_sales = sales_df.agg(sum("sales_amount")).groupBy("category").alias("total_sales_amount")

D.

Category_sales = sales_df.groupBy("reqion"). agq(sum("sales_amountn).alias(ntotal_sales_amount''))

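A runnable sketch of the correct pattern (option A); the sample rows are invented for illustration, and note that sum must be pyspark.sql.functions.sum, not Python's built-in:

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_  # avoid shadowing Python's built-in sum

spark = SparkSession.builder.getOrCreate()
sales_df = spark.createDataFrame(
    [("Electronics", "North", 1200.0), ("Clothing", "South", 300.0), ("Electronics", "West", 800.0)],
    ["category", "region", "sales_amount"],
)  # hypothetical sample data
category_sales = sales_df.groupBy("category").agg(sum_("sales_amount").alias("total_sales_amount"))
category_sales.show()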
Question # 12

A data engineer and data analyst are working together on a data pipeline. The data engineer is working on the raw, bronze, and silver layers of the pipeline using Python, and the data analyst is working on the gold layer of the pipeline using SQL. The raw source of the pipeline is a streaming input. They now want to migrate their pipeline to use Delta Live Tables.

Which change will need to be made to the pipeline when migrating to Delta Live Tables?

A.

The pipeline can have different notebook sources in SQL & Python.

B.

The pipeline will need to be written entirely in SQL.

C.

The pipeline will need to be written entirely in Python.

D.

The pipeline will need to use a batch source in place of a streaming source.

Question # 13

Identify the impact of ON VIOLATION DROP ROW and ON VIOLATION FAIL UPDATE for a constraint violation.

A data engineer has created an ETL pipeline using Delta Live Tables to manage their company's travel reimbursement details. They want to ensure that if the location details have not been provided by the employee, the pipeline is terminated.

How can the scenario be implemented?

A.

CONSTRAINT valid_location EXPECT (location = NULL)

B.

CONSTRAINT valid_location EXPECT (location != NULL) ON VIOLATION FAIL UPDATE

C.

CONSTRAINT valid_location EXPECT (location != NULL) ON DROP ROW

D.

CONSTRAINT valid_location EXPECT (location != NULL) ON VIOLATION FAIL

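The Python equivalent of option B, sketched as a DLT table definition; the source table name is an assumption, and this code only runs inside a DLT pipeline:

import dlt

@dlt.table
@dlt.expect_or_fail("valid_location", "location IS NOT NULL")  # same effect as ON VIOLATION FAIL UPDATE
def reimbursements():
    return spark.read.table("raw_reimbursements")  # hypothetical source table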
Question # 14

Which of the following describes the relationship between Bronze tables and raw data?

A.

Bronze tables contain less data than raw data files.

B.

Bronze tables contain more truthful data than raw data.

C.

Bronze tables contain aggregates while raw data is unaggregated.

D.

Bronze tables contain a less refined view of data than raw data.

E.

Bronze tables contain raw data with a schema applied.

Question # 15

A data engineer needs to process SQL queries on a large dataset with fluctuating workloads. The workload requires automatic scaling based on the volume of queries, without the need to manage or provision infrastructure. The solution should be cost-efficient and charge only for the compute resources used during query execution.

Which compute option should the data engineer use?

A.

Databricks SQL Analytics

B.

Databricks Jobs

C.

Databricks Runtime for ML

D.

Serverless SQL Warehouse

Question # 16

Which TWO items are characteristics of the Gold Layer?

Choose 2 answers

A.

Read-optimized

B.

Normalised

C.

Raw Data

D.

Historical lineage

E.

De-normalised

Question # 17

A data engineer is attempting to drop a Spark SQL table my_table and runs the following command:

DROP TABLE IF EXISTS my_table;

After running this command, the engineer notices that the data files and metadata files have been deleted from the file system.

Which of the following describes why all of these files were deleted?

A.

The table was managed

B.

The table's data was smaller than 10 GB

C.

The table's data was larger than 10 GB

D.

The table was external

E.

The table did not have a location

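To illustrate the distinction: dropping a managed table deletes both the metadata and the underlying data files, while dropping an external table (one created with an explicit LOCATION) removes only the metadata. A sketch with hypothetical names and a hypothetical path:

spark.sql("CREATE TABLE managed_demo (id INT)")  # managed: DROP TABLE also deletes the data files
spark.sql("CREATE TABLE external_demo (id INT) LOCATION 's3://my-bucket/external_demo'")  # external: data files survive DROP TABLE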
Question # 18

The Delta transaction log for the ‘students’ table is shown using the ‘DESCRIBE HISTORY students’ command. A Data Engineer needs to query the table as it existed before the UPDATE operation listed in the log.

Which command should the Data Engineer use to achieve this? (Choose two.)

A.

SELECT * FROM students@v4

B.

SELECT * FROM students TIMESTAMP AS OF ‘2024-04-22T14:32:47.000+00:00’

C.

SELECT * FROM students FROM HISTORY VERSION AS OF 3

D.

SELECT * FROM students VERSION AS OF 5

E.

SELECT * FROM students TIMESTAMP AS OF ‘2024-04-22T14:32:58.000+00:00’

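For reference, both time-travel forms are shown below; the version number and timestamp are illustrative, since the DESCRIBE HISTORY output is not reproduced here:

spark.sql("SELECT * FROM students VERSION AS OF 3")  # query by version number
spark.sql("SELECT * FROM students TIMESTAMP AS OF '2024-04-22T14:32:47.000+00:00'")  # query by timestamp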
Question # 19

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.

The code block used by the data engineer is below:

If the data engineer only wants the query to process all of the available data in as many batches as required, which of the following lines of code should the data engineer use to fill in the blank?

A.

processingTime(1)

B.

trigger(availableNow=True)

C.

trigger(parallelBatch=True)

D.

trigger(processingTime="once")

E.

trigger(continuous="once")

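A sketch of option B in context; the table names and checkpoint path are assumptions. trigger(availableNow=True) processes everything currently available, in as many batches as required, and then stops:

(spark.readStream.table("source_table")                      # hypothetical streaming source
     .writeStream
     .option("checkpointLocation", "/tmp/checkpoints/demo")  # hypothetical path
     .trigger(availableNow=True)                             # consume all available data, then stop
     .toTable("target_table"))                               # hypothetical target table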
Question # 20

Which of the following data lakehouse features results in improved data quality over a traditional data lake?

A.

A data lakehouse provides storage solutions for structured and unstructured data.

B.

A data lakehouse supports ACID-compliant transactions.

C.

A data lakehouse allows the use of SQL queries to examine data.

D.

A data lakehouse stores data in open formats.

E.

A data lakehouse enables machine learning and artificial Intelligence workloads.

Question # 21

A data engineering team has noticed that their Databricks SQL queries are running too slowly when they are submitted to a non-running SQL endpoint. The data engineering team wants this issue to be resolved.

Which of the following approaches can the team use to reduce the time it takes to return results in this scenario?

A.

They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to "Reliability Optimized."

B.

They can turn on the Auto Stop feature for the SQL endpoint.

C.

They can increase the cluster size of the SQL endpoint.

D.

They can turn on the Serverless feature for the SQL endpoint.

E.

They can increase the maximum bound of the SQL endpoint's scaling range

Question # 22

Which method should a Data Engineer apply to ensure Workflows are being triggered on schedule?

A.

Scheduled Workflows require an always-running cluster, which is more expensive but reduces processing latency.

B.

Scheduled Workflows process data as it arrives at configured sources.

C.

Scheduled Workflows can reduce resource consumption and expense since the cluster runs only long enough to execute the pipeline.

D.

Scheduled Workflows run continuously until manually stopped.

Question # 23

Which of the following is hosted completely in the control plane of the classic Databricks architecture?

A.

Worker node

B.

JDBC data source

C.

Databricks web application

D.

Databricks Filesystem

E.

Driver node

Question # 24

A data engineer who is new to Python needs to create a Python function that adds two integers together and returns the sum.

Which of the following code blocks can the data engineer use to complete this task?

A)

B)

C)

D)

E)

A.

Option A

B.

Option B

C.

Option C

D.

Option D

E.

Option E

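The option code blocks are not reproduced above; a correct implementation would resemble this minimal sketch:

def add_integers(a: int, b: int) -> int:
    return a + b

print(add_integers(3, 4))  # 7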
Question # 25

A data engineer wants to schedule their Databricks SQL dashboard to refresh once per day, but they only want the associated SQL endpoint to be running when it is necessary.

Which of the following approaches can the data engineer use to minimize the total running time of the SQL endpoint used in the refresh schedule of their dashboard?

A.

They can ensure the dashboard’s SQL endpoint matches each of the queries’ SQL endpoints.

B.

They can set up the dashboard’s SQL endpoint to be serverless.

C.

They can turn on the Auto Stop feature for the SQL endpoint.

D.

They can reduce the cluster size of the SQL endpoint.

E.

They can ensure the dashboard’s SQL endpoint is not one of the included query’s SQL endpoint.

Question # 26

A data engineer is running code in a Databricks Repo that is cloned from a central Git repository. A colleague of the data engineer informs them that changes have been made and synced to the central Git repository. The data engineer now needs to sync their Databricks Repo to get the changes from the central Git repository.

Which of the following Git operations does the data engineer need to run to accomplish this task?

A.

Merge

B.

Push

C.

Pull

D.

Commit

E.

Clone

Question # 27

In which of the following scenarios should a data engineer select a Task in the Depends On field of a new Databricks Job Task?

A.

When another task needs to be replaced by the new task

B.

When another task needs to fail before the new task begins

C.

When another task has the same dependency libraries as the new task

D.

When another task needs to use as little compute resources as possible

E.

When another task needs to successfully complete before the new task begins

Question # 28

A Databricks workflow fails at the last stage due to an error in a notebook. This workflow runs daily. The data engineer fixes the mistake and wants to rerun the pipeline. This workflow is very costly and time-intensive to run.

Which action should the data engineer do in order to minimise downtime and cost?

A.

Switch to another cluster

B.

Repair run

C.

Re-run the entire workflow

D.

Restart the cluster

Question # 29

Which of the following describes the type of workloads that are always compatible with Auto Loader?

A.

Dashboard workloads

B.

Streaming workloads

C.

Machine learning workloads

D.

Serverless workloads

E.

Batch workloads

Question # 30

Identify how the count_if function and the count where x is null can be used.

Consider a table random_values with the below data:

col1
0
1
2
NULL
2
3

What would be the output of the below query?

SELECT count_if(col1 > 1) AS count_a, count(*) AS count_b, count(col1) AS count_c FROM random_values

A.

3 6 5

B.

4 6 5

C.

3 6 6

D.

4 6 6

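A runnable sketch that reproduces the scenario; the arithmetic confirms option A (3, 6, 5):

spark.sql("CREATE OR REPLACE TEMP VIEW random_values AS "
          "SELECT * FROM VALUES (0), (1), (2), (NULL), (2), (3) AS t(col1)")
spark.sql("""
    SELECT count_if(col1 > 1) AS count_a,  -- 3: rows where col1 > 1 (2, 2, 3)
           count(*)           AS count_b,  -- 6: every row, NULL included
           count(col1)        AS count_c   -- 5: non-NULL values only
    FROM random_values
""").show()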
Question # 31

A data engineer only wants to execute the final block of a Python program if the Python variable day_of_week is equal to 1 and the Python variable review_period is True.

Which of the following control flow statements should the data engineer use to begin this conditionally executed code block?

A.

if day_of_week = 1 and review_period:

B.

if day_of_week = 1 and review_period = "True":

C.

if day_of_week == 1 and review_period == "True":

D.

if day_of_week == 1 and review_period:

E.

if day_of_week = 1 & review_period: = "True":

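A minimal sketch of option D, the idiomatic form: == for comparison, and the boolean variable tested directly:

day_of_week = 1
review_period = True

if day_of_week == 1 and review_period:
    print("Executing final block")  # runs only when both conditions hold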
Question # 32

Which of the following is stored in the Databricks customer's cloud account?

A.

Databricks web application

B.

Cluster management metadata

C.

Repos

D.

Data

E.

Notebooks

Question # 33

A data engineer has three tables in a Delta Live Tables (DLT) pipeline. They have configured the pipeline to drop invalid records at each table. They notice that some data is being dropped due to quality concerns at some point in the DLT pipeline. They would like to determine at which table in their pipeline the data is being dropped.

Which of the following approaches can the data engineer take to identify the table that is dropping the records?

A.

They can set up separate expectations for each table when developing their DLT pipeline.

B.

They cannot determine which table is dropping the records.

C.

They can set up DLT to notify them via email when records are dropped.

D.

They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics.

E.

They can navigate to the DLT pipeline page, click on the “Error” button, and review the present errors.

Question # 34

In order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing, which of the following two approaches is used by Spark to record the offset range of the data being processed in each trigger?

A.

Checkpointing and Write-ahead Logs

B.

Structured Streaming cannot record the offset range of the data being processed in each trigger.

C.

Replayable Sources and Idempotent Sinks

D.

Write-ahead Logs and Idempotent Sinks

E.

Checkpointing and Idempotent Sinks

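Spark records the offset range of each trigger using checkpointing and write-ahead logs stored under the checkpoint location. A sketch of where that is configured; the table names and path are assumptions:

events = spark.readStream.table("events_raw")                    # hypothetical stream source
(events.writeStream
       .option("checkpointLocation", "/mnt/checkpoints/events")  # offsets and write-ahead log are persisted here
       .toTable("events_bronze"))                                # hypothetical target table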
Question # 35

A data engineer has realized that they made a mistake when making a daily update to a table. They need to use Delta time travel to restore the table to a version that is 3 days old. However, when the data engineer attempts to time travel to the older version, they are unable to restore the data because the data files have been deleted.

Which of the following explains why the data files are no longer present?

A.

The VACUUM command was run on the table

B.

The TIME TRAVEL command was run on the table

C.

The DELETE HISTORY command was run on the table

D.

The OPTIMIZE command was run on the table

E.

The HISTORY command was run on the table

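For context: VACUUM removes data files that are no longer referenced by the current table version and are older than the retention threshold, so time travel to versions that need those files fails afterward. Removing files required by a version only 3 days old implies VACUUM was run with a shorter-than-default retention. A sketch on a hypothetical table:

spark.sql("VACUUM my_table RETAIN 168 HOURS")  # 168 hours is the 7-day default; my_table is hypothetical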
Question # 36

A data engineer works for an organization that must meet a stringent Service Level Agreement (SLA) that demands minimal runtime errors and high availability for its data processing pipelines. The data engineer wants to avoid the operational overhead of managing and tuning clusters.

Which architectural solution will meet the requirements?

A.

Implement a hybrid approach with scheduled batch jobs on custom cloud VMs.

B.

Use an auto-scaling cluster configured and monitored by the user.

C.

Utilize Databricks serverless compute that automatically optimizes resources and abstracts cluster management.

D.

Deploy a dedicated, manually managed cluster optimized by in-house IT staff.

Question # 37

A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.

The pipeline is configured to run in Production mode using Continuous Pipeline Mode.

Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?

A.

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.

B.

All datasets will be updated once and the pipeline will persist without any processing. The compute resources will persist but go unused.

C.

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.

D.

All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.

E.

All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.

Question # 38

A data engineer is using the following code block as part of a batch ingestion pipeline to read from a composable table:

Which of the following changes needs to be made so this code block will work when the transactions table is a stream source?

A.

Replace predict with a stream-friendly prediction function

B.

Replace schema(schema) with option ("maxFilesPerTrigger", 1)

C.

Replace "transactions" with the path to the location of the Delta table

D.

Replace format("delta") with format("stream")

E.

Replace spark.read with spark.readStream

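Option E in context, as a minimal sketch; the rest of the original code block is not reproduced, so everything beyond the read is assumed:

transactions_df = spark.readStream.table("transactions")  # spark.readStream (not spark.read) treats the Delta table as a stream source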
Question # 39

A data organization leader is upset about the data analysis team’s reports being different from the data engineering team’s reports. The leader believes the siloed nature of their organization’s data engineering and data analysis architectures is to blame.

Which of the following describes how a data lakehouse could alleviate this issue?

A.

Both teams would autoscale their work as data size evolves

B.

Both teams would use the same source of truth for their work

C.

Both teams would reorganize to report to the same department

D.

Both teams would be able to collaborate on projects in real-time

E.

Both teams would respond more quickly to ad-hoc requests

Question # 40

A data engineer needs to use a Delta table as part of a data pipeline, but they do not know if they have the appropriate permissions.

In which location can the data engineer review their permissions on the table?

A.

Jobs

B.

Dashboards

C.

Catalog Explorer

D.

Repos

Question # 41

A Data Engineer is building a simple data pipeline using Delta Live Tables (DLT) in Databricks to ingest customer data. The raw customer data is stored in a cloud storage location in JSON format. The task is to create a DLT pipeline that reads the raw JSON data and writes it into a Delta table for further processing.

Which code snippet will correctly ingest the raw JSON data and create a Delta table using DLT?

A)

B)

C)

D)

A.

Option A

B.

Option B

C.

Option C

D.

Option D

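The option code blocks are not reproduced above; a typical correct definition would resemble this sketch using Auto Loader, with a hypothetical storage path:

import dlt

@dlt.table(name="customers_raw")
def customers_raw():
    return (spark.readStream
                 .format("cloudFiles")                 # Auto Loader
                 .option("cloudFiles.format", "json")
                 .load("/mnt/raw/customers"))          # hypothetical path to the raw JSON files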
Question # 42

Which SQL code snippet will correctly demonstrate a Data Definition Language (DDL) operation used to create a table?

A.

DROP TABLE employees;

B.

INSERT INTO employees (id, name) VALUES (1, 'Alice');

C.

CREATE TABLE employees (id INT, name STRING);

D.

ALTER TABLE employees ADD COLUMN salary DECIMAL(10,2);

Question # 43

What is stored in a Databricks customer's cloud account?

A.

Data

B.

Cluster management metadata

C.

Databricks web application

D.

Notebooks

Question # 44

An engineering manager uses a Databricks SQL query to monitor ingestion latency for each data source. The manager checks the results of the query every day, but they are manually rerunning the query each day and waiting for the results.

Which of the following approaches can the manager use to ensure the results of the query are updated each day?

A.

They can schedule the query to refresh every 1 day from the SQL endpoint's page in Databricks SQL.

B.

They can schedule the query to refresh every 12 hours from the SQL endpoint's page in Databricks SQL.

C.

They can schedule the query to refresh every 1 day from the query's page in Databricks SQL.

D.

They can schedule the query to run every 1 day from the Jobs UI.

E.

They can schedule the query to run every 12 hours from the Jobs UI.

Question # 45

A dataset has been defined using Delta Live Tables and includes an expectations clause:

CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION FAIL UPDATE

What is the expected behavior when a batch of data containing data that violates these constraints is processed?

A.

Records that violate the expectation cause the job to fail.

B.

Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.

C.

Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.

D.

Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
