

Databricks Certified Associate Developer for Apache Spark 3.0 Exam Questions and Answers

Question # 4

Which of the following code blocks returns a 2-column DataFrame that shows the distinct values in column productId and the number of rows with that productId in DataFrame transactionsDf?

A.

transactionsDf.count("productId").distinct()

B.

transactionsDf.groupBy("productId").agg(col("value").count())

C.

transactionsDf.count("productId")

D.

transactionsDf.groupBy("productId").count()

E.

transactionsDf.groupBy("productId").select(count("value"))

Question # 5

Which of the following code blocks returns a copy of DataFrame itemsDf where the column supplier has been renamed to manufacturer?

A.

itemsDf.withColumn(["supplier", "manufacturer"])

B.

itemsDf.withColumn("supplier").alias("manufacturer")

C.

itemsDf.withColumnRenamed("supplier", "manufacturer")

D.

itemsDf.withColumnRenamed(col("manufacturer"), col("supplier"))

E.

itemsDf.withColumnsRenamed("supplier", "manufacturer")
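A minimal sketch of the renaming API in question, assuming itemsDf has a supplier column; withColumnRenamed returns a new DataFrame and leaves itemsDf untouched:

# Existing column name first, new name second, both as strings.
renamedDf = itemsDf.withColumnRenamed("supplier", "manufacturer")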

Question # 6

Which of the following describes a shuffle?

A.

A shuffle is a process that is executed during a broadcast hash join.

B.

A shuffle is a process that compares data across executors.

C.

A shuffle is a process that compares data across partitions.

D.

A shuffle is a Spark operation that results from DataFrame.coalesce().

E.

A shuffle is a process that allocates partitions to executors.

Question # 7

Which of the following code blocks creates a new 6-column DataFrame by appending the rows of the 6-column DataFrame yesterdayTransactionsDf to the rows of the 6-column DataFrame todayTransactionsDf, ignoring that both DataFrames have different column names?

A.

union(todayTransactionsDf, yesterdayTransactionsDf)

B.

todayTransactionsDf.unionByName(yesterdayTransactionsDf, allowMissingColumns=True)

C.

todayTransactionsDf.unionByName(yesterdayTransactionsDf)

D.

todayTransactionsDf.concat(yesterdayTransactionsDf)

E.

todayTransactionsDf.union(yesterdayTransactionsDf)
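As a reference point, a sketch of the two union variants the options contrast, assuming both DataFrames have six columns:

# union appends rows purely by column position, so differing column names are ignored.
combinedDf = todayTransactionsDf.union(yesterdayTransactionsDf)

# unionByName resolves columns by name instead; allowMissingColumns=True (Spark 3.1+)
# only tolerates columns missing on one side, it does not reconcile different names.
# combinedByNameDf = todayTransactionsDf.unionByName(yesterdayTransactionsDf, allowMissingColumns=True)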

Question # 8

The code block shown below should convert up to 5 rows in DataFrame transactionsDf that have the value 25 in column storeId into a Python list. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__).__3__(__4__)

A.

1. filter

2. "storeId"==25

3. collect

4. 5

B.

1. filter

2. col("storeId")==25

3. toLocalIterator

4. 5

C.

1. select

2. storeId==25

3. head

4. 5

D.

1. filter

2. col("storeId")==25

3. take

4. 5

E.

1. filter

2. col("storeId")==25

3. collect

4. 5
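A sketch of the filter-then-fetch pattern being filled in, assuming transactionsDf has a storeId column; take(n) returns at most n rows to the driver as a Python list of Row objects, whereas collect() returns all matching rows:

from pyspark.sql.functions import col

# Keep rows where storeId equals 25, then pull at most 5 of them into a list.
rows = transactionsDf.filter(col("storeId") == 25).take(5)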

Question # 9

In which order should the code blocks shown below be run in order to create a table of all values in column attributes next to the respective values in column supplier in DataFrame itemsDf?

1. itemsDf.createOrReplaceView("itemsDf")

2. spark.sql("FROM itemsDf SELECT 'supplier', explode('Attributes')")

3. spark.sql("FROM itemsDf SELECT supplier, explode(attributes)")

4. itemsDf.createOrReplaceTempView("itemsDf")

A.

4, 3

B.

1, 3

C.

2

D.

4, 2

E.

1, 2
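A sketch of the temp-view-plus-SQL pattern the ordering refers to, assuming attributes is an array column of itemsDf:

# A temporary view makes the DataFrame addressable from spark.sql.
itemsDf.createOrReplaceTempView("itemsDf")

# explode() emits one output row per element of the attributes array.
spark.sql("SELECT supplier, explode(attributes) FROM itemsDf").show()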

Question # 10

The code block displayed below contains an error. The code block is intended to return all columns of DataFrame transactionsDf except for columns predError, productId, and value. Find the error.

Code block:

transactionsDf.select(~col("predError"), ~col("productId"), ~col("value"))

A.

The select operator should be replaced by the drop operator and the arguments to the drop operator should be column names predError, productId and value wrapped in the col operator so they should be expressed like drop(col(predError), col(productId), col(value)).

B.

The select operator should be replaced with the deselect operator.

C.

The column names in the select operator should not be strings and wrapped in the col operator, so they should be expressed like select(~col(predError), ~col(productId), ~col(value)).

D.

The select operator should be replaced by the drop operator.

E.

The select operator should be replaced by the drop operator and the arguments to the drop operator should be column names predError, productId and value as strings.

(Correct)
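A sketch of the drop-based variant the options discuss, assuming transactionsDf contains the three named columns; drop accepts plain string column names and returns all remaining columns:

remainingDf = transactionsDf.drop("predError", "productId", "value")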

Question # 11

Which of the following code blocks sorts DataFrame transactionsDf both by column storeId in ascending and by column productId in descending order, in this priority?

A.

transactionsDf.sort("storeId", asc("productId"))

B.

transactionsDf.sort(col(storeId)).desc(col(productId))

C.

transactionsDf.order_by(col(storeId), desc(col(productId)))

D.

transactionsDf.sort("storeId", desc("productId"))

E.

transactionsDf.sort("storeId").sort(desc("productId"))

Question # 12

Which of the following code blocks reads in the parquet file stored at location filePath, given that all columns in the parquet file contain only whole numbers and are stored in the most appropriate format for this kind of data?

A.

spark.read.schema(
  StructType(
    StructField("transactionId", IntegerType(), True),
    StructField("predError", IntegerType(), True)
  )).load(filePath)

B.

spark.read.schema([
  StructField("transactionId", NumberType(), True),
  StructField("predError", IntegerType(), True)
]).load(filePath)

C.

spark.read.schema(
  StructType([
    StructField("transactionId", StringType(), True),
    StructField("predError", IntegerType(), True)]
  )).parquet(filePath)

D.

spark.read.schema(
  StructType([
    StructField("transactionId", IntegerType(), True),
    StructField("predError", IntegerType(), True)]
  )).format("parquet").load(filePath)

E.

spark.read.schema([
  StructField("transactionId", IntegerType(), True),
  StructField("predError", IntegerType(), True)
]).load(filePath, format="parquet")
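A sketch of reading Parquet with an explicit schema, using the column names that appear in the options; a schema is a StructType wrapping a list of StructFields, and with an explicit format, format("parquet").load(path) behaves like parquet(path):

from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("transactionId", IntegerType(), True),
    StructField("predError", IntegerType(), True),
])
df = spark.read.schema(schema).format("parquet").load(filePath)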

Question # 13

The code block displayed below contains an error. The code block should return a DataFrame where all entries in column supplier contain the letter combination et in this order. Find the error.

Code block:

itemsDf.filter(Column('supplier').isin('et'))

A.

The Column operator should be replaced by the col operator and instead of isin, contains should be used.

B.

The expression inside the filter parenthesis is malformed and should be replaced by isin('et', 'supplier').

C.

Instead of isin, it should be checked whether column supplier contains the letters et, so isin should be replaced with contains. In addition, the column should be accessed using col['supplier'].

D.

The expression only returns a single column and filter should be replaced by select.
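A sketch of the substring check this question is getting at, assuming itemsDf has a supplier column; contains matches the letters et anywhere in the value, while isin only matches exact values:

from pyspark.sql.functions import col

filteredDf = itemsDf.filter(col("supplier").contains("et"))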

Question # 14

Which of the following statements about Spark's DataFrames is incorrect?

A.

Spark's DataFrames are immutable.

B.

Spark's DataFrames are equal to Python's DataFrames.

C.

Data in DataFrames is organized into named columns.

D.

RDDs are at the core of DataFrames.

E.

The data in DataFrames may be split into multiple chunks.

Question # 15

Which of the following is not a feature of Adaptive Query Execution?

A.

Replace a sort merge join with a broadcast join, where appropriate.

B.

Coalesce partitions to accelerate data processing.

C.

Split skewed partitions into smaller partitions to avoid differences in partition processing time.

D.

Reroute a query in case of an executor failure.

E.

Collect runtime statistics during query execution.
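For context, Adaptive Query Execution is switched on through configuration and re-optimizes plans at runtime from collected statistics; a minimal sketch using the standard Spark 3 properties:

spark.conf.set("spark.sql.adaptive.enabled", "true")                      # enable AQE
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")   # coalesce small shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")             # split skewed partitions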

Question # 16

The code block shown below should return a two-column DataFrame with columns transactionId and supplier, with combined information from DataFrames itemsDf and transactionsDf. The code block should merge rows in which column productId of DataFrame transactionsDf matches the value of column itemId in DataFrame itemsDf, but only where column storeId of DataFrame transactionsDf does not match column itemId of DataFrame itemsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(itemsDf, __2__).__3__(__4__)

A.

1. join

2. transactionsDf.productId==itemsDf.itemId, how="inner"

3. select

4. "transactionId", "supplier"

B.

1. select

2. "transactionId", "supplier"

3. join

4. [transactionsDf.storeId!=itemsDf.itemId, transactionsDf.productId==itemsDf.itemId]

C.

1. join

2. [transactionsDf.productId==itemsDf.itemId, transactionsDf.storeId!=itemsDf.itemId]

3. select

4. "transactionId", "supplier"

D.

1. filter

2. "transactionId", "supplier"

3. join

4. "transactionsDf.storeId!=itemsDf.itemId, transactionsDf.productId==itemsDf.itemId"

E.

1. join

2. transactionsDf.productId==itemsDf.itemId, transactionsDf.storeId!=itemsDf.itemId

3. filter

4. "transactionId", "supplier"

Question # 17

The code block shown below should return a one-column DataFrame where the column storeId is converted to string type. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__.__3__(__4__))

A.

1. select

2. col("storeId")

3. cast

4. StringType

B.

1. select

2. col("storeId")

3. as

4. StringType

C.

1. cast

2. "storeId"

3. as

4. StringType()

D.

1. select

2. col("storeId")

3. cast

4. StringType()

E.

1. select

2. storeId

3. cast

4. StringType()
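A sketch of the cast pattern being filled in, assuming transactionsDf has a storeId column; cast takes a DataType instance (note the parentheses on StringType()) or the type name as a string:

from pyspark.sql.functions import col
from pyspark.sql.types import StringType

stringDf = transactionsDf.select(col("storeId").cast(StringType()))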

Question # 18

Which of the following describes the characteristics of accumulators?

A.

Accumulators are used to pass around lookup tables across the cluster.

B.

All accumulators used in a Spark application are listed in the Spark UI.

C.

Accumulators can be instantiated directly via the accumulator(n) method of the pyspark.RDD module.

D.

Accumulators are immutable.

E.

If an action including an accumulator fails during execution and Spark manages to restart the action and complete it successfully, only the successful attempt will be counted in the accumulator.
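For context, a minimal accumulator sketch, assuming transactionsDf exists; accumulators are created via the SparkContext (not the RDD class) and their values are read back on the driver:

# Create a counter accumulator and increment it once per row inside an action.
acc = spark.sparkContext.accumulator(0)
transactionsDf.foreach(lambda row: acc.add(1))
print(acc.value)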

Question # 19

Which of the following statements about Spark's configuration properties is incorrect?

A.

The maximum number of tasks that an executor can process at the same time is controlled by the spark.task.cpus property.

B.

The maximum number of tasks that an executor can process at the same time is controlled by the spark.executor.cores property.

C.

The default value for spark.sql.autoBroadcastJoinThreshold is 10MB.

D.

The default number of partitions to use when shuffling data for joins or aggregations is 300.

E.

The default number of partitions returned from certain transformations can be controlled by the spark.default.parallelism property.
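A quick way to inspect the properties the options refer to; the values noted in the comments are the stock defaults and can be overridden per cluster:

spark.conf.get("spark.sql.shuffle.partitions")          # "200" by default
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")  # 10MB (10485760 bytes) by default
spark.conf.get("spark.task.cpus", "1")                  # CPUs reserved per task, default 1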

Question # 20

The code block displayed below contains an error. The code block should arrange the rows of DataFrame transactionsDf using information from two columns in an ordered fashion, arranging first by column value, showing smaller numbers at the top and greater numbers at the bottom, and then by column predError, for which all values should be arranged in the inverse way of the order of items in column value. Find the error.

Code block:

transactionsDf.orderBy('value', asc_nulls_first(col('predError')))

A.

Two orderBy statements with calls to the individual columns should be chained, instead of having both columns in one orderBy statement.

B.

Column value should be wrapped by the col() operator.

C.

Column predError should be sorted in a descending way, putting nulls last.

D.

Column predError should be sorted by desc_nulls_first() instead.

E.

Instead of orderBy, sort should be used.
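A sketch of the intended ordering, assuming transactionsDf has value and predError columns; value ascends (smaller numbers first) and predError descends, the inverse direction:

from pyspark.sql.functions import desc

orderedDf = transactionsDf.orderBy("value", desc("predError"))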

Question # 21

The code block displayed below contains an error. The code block should read the CSV file located at path data/transactions.csv into DataFrame transactionsDf, using the first row as the column header and casting the columns to the most appropriate type. Find the error.

First 3 rows of transactions.csv:

transactionId;storeId;productId;name
1;23;12;green grass
2;35;31;yellow sun
3;23;12;green grass

Code block:

transactionsDf = spark.read.load("data/transactions.csv", sep=";", format="csv", header=True)

A.

The DataFrameReader is not accessed correctly.

B.

The transaction is evaluated lazily, so no file will be read.

C.

Spark is unable to understand the file type.

D.

The code block is unable to capture all columns.

E.

The resulting DataFrame will not have the appropriate schema.
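For reference, a sketch of the read with schema inference added; header=True only supplies column names, and without inferSchema=True every column is read as a string rather than the most appropriate type:

transactionsDf = spark.read.load(
    "data/transactions.csv", sep=";", format="csv", header=True, inferSchema=True
)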

Question # 22

Which of the following code blocks performs an inner join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively, excluding columns value and storeId from DataFrame transactionsDf and column attributes from DataFrame itemsDf?

A.

transactionsDf.drop('value', 'storeId').join(itemsDf.select('attributes'), transactionsDf.productId==itemsDf.itemId)

B.

transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')

spark.sql("SELECT -value, -storeId FROM transactionsDf INNER JOIN itemsDf ON productId==itemId").drop("attributes")

C.

transactionsDf.drop("value", "storeId").join(itemsDf.drop("attributes"), "transactionsDf.productId==itemsDf.itemId")

D.

transactionsDf \
  .drop(col('value'), col('storeId')) \
  .join(itemsDf.drop(col('attributes')), col('productId')==col('itemId'))

E.

transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')

statement = """
SELECT * FROM transactionsDf
INNER JOIN itemsDf
ON transactionsDf.productId==itemsDf.itemId
"""
spark.sql(statement).drop("value", "storeId", "attributes")
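A sketch of an inner join followed by dropping the unwanted columns, assuming the DataFrames described in the question; drop with string names removes columns from either side of the join:

joinedDf = (
    transactionsDf
    .join(itemsDf, transactionsDf.productId == itemsDf.itemId)  # inner join by default
    .drop("value", "storeId", "attributes")
)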

Question # 23

Which of the following statements about DAGs is correct?

A.

DAGs help direct how Spark executors process tasks, but are a limitation to the proper execution of a query when an executor fails.

B.

DAG stands for "Directing Acyclic Graph".

C.

Spark strategically hides DAGs from developers, since the high degree of automation in Spark means that developers never need to consider DAG layouts.

D.

In contrast to transformations, DAGs are never lazily executed.

E.

DAGs can be decomposed into tasks that are executed in parallel.

Question # 24

Which of the following statements about lazy evaluation is incorrect?

A.

Predicate pushdown is a feature resulting from lazy evaluation.

B.

Execution is triggered by transformations.

C.

Spark will fail a job only during execution, but not during definition.

D.

Accumulators do not change the lazy evaluation model of Spark.

E.

Lineages allow Spark to coalesce transformations into stages.
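For context, a minimal sketch of lazy evaluation, assuming transactionsDf exists; transformations only extend the lineage, and execution starts when an action is called:

from pyspark.sql.functions import col

# A transformation: nothing is executed yet.
filtered = transactionsDf.filter(col("storeId") == 25)

# An action such as count() triggers execution of the accumulated plan.
n = filtered.count()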

Question # 25

The code block shown below should return a copy of DataFrame transactionsDf without columns value and productId and with an additional column associateId that has the value 5. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__, __3__).__4__(__5__, 'value')

A.

1. withColumn

2. 'associateId'

3. 5

4. remove

5. 'productId'

B.

1. withNewColumn

2. associateId

3. lit(5)

4. drop

5. productId

C.

1. withColumn

2. 'associateId'

3. lit(5)

4. drop

5. 'productId'

D.

1. withColumnRenamed

2. 'associateId'

3. 5

4. drop

5. 'productId'

E.

1. withColumn

2. col(associateId)

3. lit(5)

4. drop

5. col(productId)
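A sketch of the add-literal-column-then-drop pattern being filled in, assuming transactionsDf contains value and productId; withColumn requires a Column expression, which lit(5) provides:

from pyspark.sql.functions import lit

resultDf = transactionsDf.withColumn("associateId", lit(5)).drop("productId", "value")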

Question # 26

Which of the following code blocks immediately removes the previously cached DataFrame transactionsDf from memory and disk?

A.

array_remove(transactionsDf, "*")

B.

transactionsDf.unpersist()

(Correct)

C.

del transactionsDf

D.

transactionsDf.clearCache()

E.

transactionsDf.persist()
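A sketch of uncaching, assuming transactionsDf was previously cached; unpersist marks the DataFrame for removal from memory and disk, and blocking=True waits until the cached blocks are actually freed:

transactionsDf.unpersist(blocking=True)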

Question # 27

Which of the following describes a difference between Spark's cluster and client execution modes?

A.

In cluster mode, the cluster manager resides on a worker node, while it resides on an edge node in client mode.

B.

In cluster mode, executor processes run on worker nodes, while they run on gateway nodes in client mode.

C.

In cluster mode, the driver resides on a worker node, while it resides on an edge node in client mode.

D.

In cluster mode, a gateway machine hosts the driver, while it is co-located with the executor in client mode.

E.

In cluster mode, the Spark driver is not co-located with the cluster manager, while it is co-located in client mode.
