

Databricks Certified Associate Developer for Apache Spark 3.0 Exam Questions and Answers

Question # 4

Which of the following code blocks returns a 2-column DataFrame that shows the distinct values in column productId and the number of rows with that productId in DataFrame transactionsDf?

A.

transactionsDf.count("productId").distinct()

B.

transactionsDf.groupBy("productId").agg(col("value").count())

C.

transactionsDf.count("productId")

D.

transactionsDf.groupBy("productId").count()

E.

transactionsDf.groupBy("productId").select(count("value"))

Question # 5

Which of the following code blocks returns a copy of DataFrame itemsDf where the column supplier has been renamed to manufacturer?

A.

itemsDf.withColumn(["supplier", "manufacturer"])

B.

itemsDf.withColumn("supplier").alias("manufacturer")

C.

itemsDf.withColumnRenamed("supplier", "manufacturer")

D.

itemsDf.withColumnRenamed(col("manufacturer"), col("supplier"))

E.

itemsDf.withColumnsRenamed("supplier", "manufacturer")
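A minimal sketch of the renaming API in question, assuming itemsDf has a supplier column; withColumnRenamed returns a new DataFrame and leaves itemsDf untouched:

# Existing column name first, new name second, both as strings.
renamedDf = itemsDf.withColumnRenamed("supplier", "manufacturer")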

Question # 6

Which of the following describes a shuffle?

A.

A shuffle is a process that is executed during a broadcast hash join.

B.

A shuffle is a process that compares data across executors.

C.

A shuffle is a process that compares data across partitions.

D.

A shuffle is a Spark operation that results from DataFrame.coalesce().

E.

A shuffle is a process that allocates partitions to executors.

Question # 7

Which of the following code blocks creates a new 6-column DataFrame by appending the rows of the 6-column DataFrame yesterdayTransactionsDf to the rows of the 6-column DataFrame todayTransactionsDf, ignoring that both DataFrames have different column names?

A.

union(todayTransactionsDf, yesterdayTransactionsDf)

B.

todayTransactionsDf.unionByName(yesterdayTransactionsDf, allowMissingColumns=True)

C.

todayTransactionsDf.unionByName(yesterdayTransactionsDf)

D.

todayTransactionsDf.concat(yesterdayTransactionsDf)

E.

todayTransactionsDf.union(yesterdayTransactionsDf)
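As a reference point, a sketch of the two union variants the options contrast, assuming both DataFrames have six columns:

# union appends rows purely by column position, so differing column names are ignored.
combinedDf = todayTransactionsDf.union(yesterdayTransactionsDf)

# unionByName resolves columns by name instead; allowMissingColumns=True (Spark 3.1+)
# only tolerates columns missing on one side, it does not reconcile different names.
# combinedByNameDf = todayTransactionsDf.unionByName(yesterdayTransactionsDf, allowMissingColumns=True)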

Question # 8

The code block shown below should convert up to 5 rows in DataFrame transactionsDf that have the value 25 in column storeId into a Python list. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__).__3__(__4__)

A.

1. filter

2. "storeId"==25

3. collect

4. 5

B.

1. filter

2. col("storeId")==25

3. toLocalIterator

4. 5

C.

1. select

2. storeId==25

3. head

4. 5

D.

1. filter

2. col("storeId")==25

3. take

4. 5

E.

1. filter

2. col("storeId")==25

3. collect

4. 5
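A sketch of the filter-then-fetch pattern being filled in, assuming transactionsDf has a storeId column; take(n) returns at most n rows to the driver as a Python list of Row objects, whereas collect() returns all matching rows:

from pyspark.sql.functions import col

# Keep rows where storeId equals 25, then pull at most 5 of them into a list.
rows = transactionsDf.filter(col("storeId") == 25).take(5)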

Question # 9

In which order should the code blocks shown below be run in order to create a table of all values in column attributes next to the respective values in column supplier in DataFrame itemsDf?

1. itemsDf.createOrReplaceView("itemsDf")

2. spark.sql("FROM itemsDf SELECT 'supplier', explode('Attributes')")

3. spark.sql("FROM itemsDf SELECT supplier, explode(attributes)")

4. itemsDf.createOrReplaceTempView("itemsDf")

A.

4, 3

B.

1, 3

C.

2

D.

4, 2

E.

1, 2
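A sketch of the temp-view-plus-SQL pattern the ordering refers to, assuming attributes is an array column of itemsDf:

# A temporary view makes the DataFrame addressable from spark.sql.
itemsDf.createOrReplaceTempView("itemsDf")

# explode() emits one output row per element of the attributes array.
spark.sql("SELECT supplier, explode(attributes) FROM itemsDf").show()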

Question # 10

The code block displayed below contains an error. The code block is intended to return all columns of DataFrame transactionsDf except for columns predError, productId, and value. Find the error.

Code block:

transactionsDf.select(~col("predError"), ~col("productId"), ~col("value"))

A.

The select operator should be replaced by the drop operator and the arguments to the drop operator should be column names predError, productId and value wrapped in the col operator so they should be expressed like drop(col(predError), col(productId), col(value)).

B.

The select operator should be replaced with the deselect operator.

C.

The column names in the select operator should not be strings and wrapped in the col operator, so they should be expressed like select(~col(predError), ~col(productId), ~col(value)).

D.

The select operator should be replaced by the drop operator.

E.

The select operator should be replaced by the drop operator and the arguments to the drop operator should be column names predError, productId and value as strings.

(Correct)
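A sketch of the drop-based variant the options discuss, assuming transactionsDf contains the three named columns; drop accepts plain string column names and returns all remaining columns:

remainingDf = transactionsDf.drop("predError", "productId", "value")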

Question # 11

Which of the following code blocks sorts DataFrame transactionsDf both by column storeId in ascending and by column productId in descending order, in this priority?

A.

transactionsDf.sort("storeId", asc("productId"))

B.

transactionsDf.sort(col(storeId)).desc(col(productId))

C.

transactionsDf.order_by(col(storeId), desc(col(productId)))

D.

transactionsDf.sort("storeId", desc("productId"))

E.

transactionsDf.sort("storeId").sort(desc("productId"))

Question # 12

Which of the following code blocks reads in the parquet file stored at location filePath, given that all columns in the parquet file contain only whole numbers and are stored in the most appropriate format for this kind of data?

A.

spark.read.schema(
  StructType(
    StructField("transactionId", IntegerType(), True),
    StructField("predError", IntegerType(), True)
  )).load(filePath)

B.

spark.read.schema([
  StructField("transactionId", NumberType(), True),
  StructField("predError", IntegerType(), True)
]).load(filePath)

C.

spark.read.schema(
  StructType([
    StructField("transactionId", StringType(), True),
    StructField("predError", IntegerType(), True)]
  )).parquet(filePath)

D.

spark.read.schema(
  StructType([
    StructField("transactionId", IntegerType(), True),
    StructField("predError", IntegerType(), True)]
  )).format("parquet").load(filePath)

E.

spark.read.schema([
  StructField("transactionId", IntegerType(), True),
  StructField("predError", IntegerType(), True)
]).load(filePath, format="parquet")
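A sketch of reading Parquet with an explicit schema, using the column names that appear in the options; a schema is a StructType wrapping a list of StructFields, and with an explicit format, format("parquet").load(path) behaves like parquet(path):

from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("transactionId", IntegerType(), True),
    StructField("predError", IntegerType(), True),
])
df = spark.read.schema(schema).format("parquet").load(filePath)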

Question # 13

The code block displayed below contains an error. The code block should return a DataFrame where all entries in column supplier contain the letter combination et in this order. Find the error.

Code block:

itemsDf.filter(Column('supplier').isin('et'))

A.

The Column operator should be replaced by the col operator and instead of isin, contains should be used.

B.

The expression inside the filter parenthesis is malformed and should be replaced by isin('et', 'supplier').

C.

Instead of isin, it should be checked whether column supplier contains the letters et, so isin should be replaced with contains. In addition, the column should be accessed using col['supplier'].

D.

The expression only returns a single column and filter should be replaced by select.
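A sketch of the substring check this question is getting at, assuming itemsDf has a supplier column; contains matches the letters et anywhere in the value, while isin only matches exact values:

from pyspark.sql.functions import col

filteredDf = itemsDf.filter(col("supplier").contains("et"))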

Question # 14

Which of the following statements about Spark's DataFrames is incorrect?

A.

Spark's DataFrames are immutable.

B.

Spark's DataFrames are equal to Python's DataFrames.

C.

Data in DataFrames is organized into named columns.

D.

RDDs are at the core of DataFrames.

E.

The data in DataFrames may be split into multiple chunks.

Question # 15

Which of the following is not a feature of Adaptive Query Execution?

A.

Replace a sort merge join with a broadcast join, where appropriate.

B.

Coalesce partitions to accelerate data processing.

C.

Split skewed partitions into smaller partitions to avoid differences in partition processing time.

D.

Reroute a query in case of an executor failure.

E.

Collect runtime statistics during query execution.
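For context, Adaptive Query Execution is switched on through configuration and re-optimizes plans at runtime from collected statistics; a minimal sketch using the standard Spark 3 properties:

spark.conf.set("spark.sql.adaptive.enabled", "true")                      # enable AQE
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")   # coalesce small shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")             # split skewed partitions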

Question # 16

The code block shown below should return a two-column DataFrame with columns transactionId and supplier, with combined information from DataFrames itemsDf and transactionsDf. The code block should merge rows in which column productId of DataFrame transactionsDf matches the value of column itemId in DataFrame itemsDf, but only where column storeId of DataFrame transactionsDf does not match column itemId of DataFrame itemsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(itemsDf, __2__).__3__(__4__)

A.

1. join

2. transactionsDf.productId==itemsDf.itemId, how="inner"

3. select

4. "transactionId", "supplier"

B.

1. select

2. "transactionId", "supplier"

3. join

4. [transactionsDf.storeId!=itemsDf.itemId, transactionsDf.productId==itemsDf.itemId]

C.

1. join

2. [transactionsDf.productId==itemsDf.itemId, transactionsDf.storeId!=itemsDf.itemId]

3. select

4. "transactionId", "supplier"

D.

1. filter

2. "transactionId", "supplier"

3. join

4. "transactionsDf.storeId!=itemsDf.itemId, transactionsDf.productId==itemsDf.itemId"

E.

1. join

2. transactionsDf.productId==itemsDf.itemId, transactionsDf.storeId!=itemsDf.itemId

3. filter

4. "transactionId", "supplier"

Question # 17

The code block shown below should return a one-column DataFrame where the column storeId is converted to string type. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__.__3__(__4__))

A.

1. select

2. col("storeId")

3. cast

4. StringType

B.

1. select

2. col("storeId")

3. as

4. StringType

C.

1. cast

2. "storeId"

3. as

4. StringType()

D.

1. select

2. col("storeId")

3. cast

4. StringType()

E.

1. select

2. storeId

3. cast

4. StringType()
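A sketch of the cast pattern being filled in, assuming transactionsDf has a storeId column; cast takes a DataType instance (note the parentheses on StringType()) or the type name as a string:

from pyspark.sql.functions import col
from pyspark.sql.types import StringType

stringDf = transactionsDf.select(col("storeId").cast(StringType()))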

Question # 18

Which of the following describes the characteristics of accumulators?

A.

Accumulators are used to pass around lookup tables across the cluster.

B.

All accumulators used in a Spark application are listed in the Spark UI.

C.

Accumulators can be instantiated directly via the accumulator(n) method of the pyspark.RDD module.

D.

Accumulators are immutable.

E.

If an action including an accumulator fails during execution and Spark manages to restart the action and complete it successfully, only the successful attempt will be counted in the accumulator.
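For context, a minimal accumulator sketch, assuming transactionsDf exists; accumulators are created via the SparkContext (not the RDD class) and their values are read back on the driver:

# Create a counter accumulator and increment it once per row inside an action.
acc = spark.sparkContext.accumulator(0)
transactionsDf.foreach(lambda row: acc.add(1))
print(acc.value)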

Question # 19

Which of the following statements about Spark's configuration properties is incorrect?

A.

The maximum number of tasks that an executor can process at the same time is controlled by the spark.task.cpus property.

B.

The maximum number of tasks that an executor can process at the same time is controlled by the spark.executor.cores property.

C.

The default value for spark.sql.autoBroadcastJoinThreshold is 10MB.

D.

The default number of partitions to use when shuffling data for joins or aggregations is 300.

E.

The default number of partitions returned from certain transformations can be controlled by the spark.default.parallelism property.
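A quick way to inspect the properties the options refer to; the values noted in the comments are the stock defaults and can be overridden per cluster:

spark.conf.get("spark.sql.shuffle.partitions")          # "200" by default
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")  # 10MB (10485760 bytes) by default
spark.conf.get("spark.task.cpus", "1")                  # CPUs reserved per task, default 1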

Question # 20

The code block displayed below contains an error. The code block should arrange the rows of DataFrame transactionsDf using information from two columns in an ordered fashion, arranging first by column value, showing smaller numbers at the top and greater numbers at the bottom, and then by column predError, for which all values should be arranged in the inverse way of the order of items in column value. Find the error.

Code block:

transactionsDf.orderBy('value', asc_nulls_first(col('predError')))

A.

Two orderBy statements with calls to the individual columns should be chained, instead of having both columns in one orderBy statement.

B.

Column value should be wrapped by the col() operator.

C.

Column predError should be sorted in a descending way, putting nulls last.

D.

Column predError should be sorted by desc_nulls_first() instead.

E.

Instead of orderBy, sort should be used.
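A sketch of the intended ordering, assuming transactionsDf has value and predError columns; value ascends (smaller numbers first) and predError descends, the inverse direction:

from pyspark.sql.functions import desc

orderedDf = transactionsDf.orderBy("value", desc("predError"))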

Question # 21

The code block displayed below contains an error. The code block should read the CSV file located at path data/transactions.csv into DataFrame transactionsDf, using the first row as the column header and casting the columns to the most appropriate type. Find the error.

First 3 rows of transactions.csv:

transactionId;storeId;productId;name
1;23;12;green grass
2;35;31;yellow sun
3;23;12;green grass

Code block:

transactionsDf = spark.read.load("data/transactions.csv", sep=";", format="csv", header=True)

A.

The DataFrameReader is not accessed correctly.

B.

The transaction is evaluated lazily, so no file will be read.

C.

Spark is unable to understand the file type.

D.

The code block is unable to capture all columns.

E.

The resulting DataFrame will not have the appropriate schema.
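For reference, a sketch of the read with schema inference added; header=True only supplies column names, and without inferSchema=True every column is read as a string rather than the most appropriate type:

transactionsDf = spark.read.load(
    "data/transactions.csv", sep=";", format="csv", header=True, inferSchema=True
)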

Question # 22

Which of the following code blocks performs an inner join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively, excluding columns value and storeId from DataFrame transactionsDf and column attributes from DataFrame itemsDf?

A.

transactionsDf.drop('value', 'storeId').join(itemsDf.select('attributes'), transactionsDf.productId==itemsDf.itemId)

B.

transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')

spark.sql("SELECT -value, -storeId FROM transactionsDf INNER JOIN itemsDf ON productId==itemId").drop("attributes")

C.

transactionsDf.drop("value", "storeId").join(itemsDf.drop("attributes"), "transactionsDf.productId==itemsDf.itemId")

D.

transactionsDf \
  .drop(col('value'), col('storeId')) \
  .join(itemsDf.drop(col('attributes')), col('productId')==col('itemId'))

E.

transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')

statement = """
SELECT * FROM transactionsDf
INNER JOIN itemsDf
ON transactionsDf.productId==itemsDf.itemId
"""
spark.sql(statement).drop("value", "storeId", "attributes")
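A sketch of an inner join followed by dropping the unwanted columns, assuming the DataFrames described in the question; drop with string names removes columns from either side of the join:

joinedDf = (
    transactionsDf
    .join(itemsDf, transactionsDf.productId == itemsDf.itemId)  # inner join by default
    .drop("value", "storeId", "attributes")
)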

Question # 23

Which of the following statements about DAGs is correct?

A.

DAGs help direct how Spark executors process tasks, but are a limitation to the proper execution of a query when an executor fails.

B.

DAG stands for "Directing Acyclic Graph".

C.

Spark strategically hides DAGs from developers, since the high degree of automation in Spark means that developers never need to consider DAG layouts.

D.

In contrast to transformations, DAGs are never lazily executed.

E.

DAGs can be decomposed into tasks that are executed in parallel.

Question # 24

Which of the following statements about lazy evaluation is incorrect?

A.

Predicate pushdown is a feature resulting from lazy evaluation.

B.

Execution is triggered by transformations.

C.

Spark will fail a job only during execution, but not during definition.

D.

Accumulators do not change the lazy evaluation model of Spark.

E.

Lineages allow Spark to coalesce transformations into stages.
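For context, a minimal sketch of lazy evaluation, assuming transactionsDf exists; transformations only extend the lineage, and execution starts when an action is called:

from pyspark.sql.functions import col

# A transformation: nothing is executed yet.
filtered = transactionsDf.filter(col("storeId") == 25)

# An action such as count() triggers execution of the accumulated plan.
n = filtered.count()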

Question # 25

The code block shown below should return a copy of DataFrame transactionsDf without columns value and productId and with an additional column associateId that has the value 5. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__, __3__).__4__(__5__, 'value')

A.

1. withColumn

2. 'associateId'

3. 5

4. remove

5. 'productId'

B.

1. withNewColumn

2. associateId

3. lit(5)

4. drop

5. productId

C.

1. withColumn

2. 'associateId'

3. lit(5)

4. drop

5. 'productId'

D.

1. withColumnRenamed

2. 'associateId'

3. 5

4. drop

5. 'productId'

E.

1. withColumn

2. col(associateId)

3. lit(5)

4. drop

5. col(productId)
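A sketch of the add-literal-column-then-drop pattern being filled in, assuming transactionsDf contains value and productId; withColumn requires a Column expression, which lit(5) provides:

from pyspark.sql.functions import lit

resultDf = transactionsDf.withColumn("associateId", lit(5)).drop("productId", "value")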

Question # 26

Which of the following code blocks immediately removes the previously cached DataFrame transactionsDf from memory and disk?

A.

array_remove(transactionsDf, "*")

B.

transactionsDf.unpersist()

(Correct)

C.

del transactionsDf

D.

transactionsDf.clearCache()

E.

transactionsDf.persist()
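A sketch of uncaching, assuming transactionsDf was previously cached; unpersist marks the DataFrame for removal from memory and disk, and blocking=True waits until the cached blocks are actually freed:

transactionsDf.unpersist(blocking=True)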

Question # 27

Which of the following describes a difference between Spark's cluster and client execution modes?

A.

In cluster mode, the cluster manager resides on a worker node, while it resides on an edge node in client mode.

B.

In cluster mode, executor processes run on worker nodes, while they run on gateway nodes in client mode.

C.

In cluster mode, the driver resides on a worker node, while it resides on an edge node in client mode.

D.

In cluster mode, a gateway machine hosts the driver, while it is co-located with the executor in client mode.

E.

In cluster mode, the Spark driver is not co-located with the cluster manager, while it is co-located in client mode.
