New Year Special Sale - Limited Time 70% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: mxmas70

Home > Databricks > ML Data Scientist > Databricks-Machine-Learning-Associate

Databricks-Machine-Learning-Associate Databricks Certified Machine Learning Associate Exam Question and Answers

Question # 4

Which of the following approaches can be used to view the notebook that was run to create an MLflow run?

A.

Open the MLmodel artifact in the MLflow run paqe

B.

Click the "Models" link in the row corresponding to the run in the MLflow experiment paqe

C.

Click the "Source" link in the row corresponding to the run in the MLflow experiment page

D.

Click the "Start Time" link in the row corresponding to the run in the MLflow experiment page

Full Access
Question # 5

A data scientist is using the following code block to tune hyperparameters for a machine learning model:

Which change can they make the above code block to improve the likelihood of a more accurate model?

A.

Increase num_evals to 100

B.

Change fmin() to fmax()

C.

Change sparkTrials() to Trials()

D.

Change tpe.suggest to random.suggest

Full Access
Question # 6

A data scientist is using Spark SQL to import their data into a machine learning pipeline. Once the data is imported, the data scientist performs machine learning tasks using Spark ML.

Which of the following compute tools is best suited for this use case?

A.

Single Node cluster

B.

Standard cluster

C.

SQL Warehouse

D.

None of these compute tools support this task

Full Access
Question # 7

What is the name of the method that transforms categorical features into a series of binary indicator feature variables?

A.

Leave-one-out encoding

B.

Target encoding

C.

One-hot encoding

D.

Categorical

E.

String indexing

Full Access
Question # 8

A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.

Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?

A.

import pyspark.pandas as ps

df = ps.DataFrame(spark_df)

B.

import pyspark.pandas as ps

df = ps.to_pandas(spark_df)

C.

spark_df.to_sql()

D.

import pandas as pd

df = pd.DataFrame(spark_df)

E.

spark_df.to_pandas()

Full Access
Question # 9

A machine learning engineer is converting a decision tree from sklearn to Spark ML. They notice that they are receiving different results despite all of their data and manually specified hyperparameter values being identical.

Which of the following describes a reason that the single-node sklearn decision tree and the Spark ML decision tree can differ?

A.

Spark ML decision trees test every feature variable in the splitting algorithm

B.

Spark ML decision trees automatically prune overfit trees

C.

Spark ML decision trees test more split candidates in the splitting algorithm

D.

Spark ML decision trees test a random sample of feature variables in the splitting algorithm

E.

Spark ML decision trees test binned features values as representative split candidates

Full Access
Question # 10

A data scientist is wanting to explore the Spark DataFrame spark_df. The data scientist wants visual histograms displaying the distribution of numeric features to be included in the exploration.

Which of the following lines of code can the data scientist run to accomplish the task?

A.

spark_df.describe()

B.

dbutils.data(spark_df).summarize()

C.

This task cannot be accomplished in a single line of code.

D.

spark_df.summary()

E.

dbutils.data.summarize (spark_df)

Full Access
Question # 11

An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.

Which of the following explanations justifies this suggestion?

A.

One-hot encoding is not supported by most machine learning libraries.

B.

One-hot encoding is dependent on the target variable's values which differ for each application.

C.

One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.

D.

One-hot encoding is not a common strategy for representing categorical feature variables numerically.

E.

One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.

Full Access
Question # 12

A data scientist has produced two models for a single machine learning problem. One of the models performs well when one of the features has a value of less than 5, and the other model performs well when the value of that feature is greater than or equal to 5. The data scientist decides to combine the two models into a single machine learning solution.

Which of the following terms is used to describe this combination of models?

A.

Bootstrap aggregation

B.

Support vector machines

C.

Bucketing

D.

Ensemble learning

E.

Stacking

Full Access
Question # 13

A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column price is greater than 0.

Which of the following code blocks will accomplish this task?

A.

spark_df[spark_df["price"] > 0]

B.

spark_df.filter(col("price") > 0)

C.

SELECT * FROM spark_df WHERE price > 0

D.

spark_df.loc[spark_df["price"] > 0,:]

E.

spark_df.loc[:,spark_df["price"] > 0]

Full Access
Question # 14

A machine learning engineer has created a Feature Table new_table using Feature Store Client fs. When creating the table, they specified a metadata description with key information about the Feature Table. They now want to retrieve that metadata programmatically.

Which of the following lines of code will return the metadata description?

A.

There is no way to return the metadata description programmatically.

B.

fs.create_training_set("new_table")

C.

fs.get_table("new_table").description

D.

fs.get_table("new_table").load_df()

E.

fs.get_table("new_table")

Full Access
Question # 15

A machine learning engineer wants to parallelize the inference of group-specific models using the Pandas Function API. They have developed theapply_modelfunction that will look up and load the correct model for each group, and they want to apply it to each group of DataFramedf.

They have written the following incomplete code block:

Which piece of code can be used to fill in the above blank to complete the task?

A.

applyInPandas

B.

groupedApplyInPandas

C.

mapInPandas

D.

predict

Full Access
Question # 16

A data scientist is utilizing MLflow Autologging to automatically track their machine learning experiments. After completing a series of runs for the experiment experiment_id, the data scientist wants to identify the run_id of the run with the best root-mean-square error (RMSE).

Which of the following lines of code can be used to identify the run_id of the run with the best RMSE in experiment_id?

A)

B)

C)

D)

A.

OptionA

B.

Option B

C.

Option C

D.

Option D

Full Access
Question # 17

A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrametrain_dfto train the model.

The Spark DataFrametrain_dfhas the following schema:

The machine learning engineer shares the following code block:

Which of the following changes does the machine learning engineer need to make to complete the task?

A.

They need to call the transform method on train df

B.

They need to convert the features column to be a vector

C.

They do not need to make any changes

D.

They need to utilize a Pipeline to fit the model

E.

They need to split thefeaturescolumn out into one column for each feature

Full Access
Question # 18

A data scientist uses 3-fold cross-validation when optimizing model hyperparameters for a regression problem. The following root-mean-squared-error values are calculated on each of the validation folds:

• 10.0

• 12.0

• 17.0

Which of the following values represents the overall cross-validation root-mean-squared error?

A.

13.0

B.

17.0

C.

12.0

D.

39.0

E.

10.0

Full Access
Question # 19

A machine learning engineer is trying to scale a machine learning pipelinepipelinethat contains multiple feature engineering stages and a modeling stage. As part of the cross-validation process, they are using the following code block:

A colleague suggests that the code block can be changed to speed up the tuning process by passing the model object to theestimatorparameter and then placing the updated cv object as the final stage of thepipelinein place of the original model.

Which of the following is a negative consequence of the approach suggested by the colleague?

A.

The model will take longerto train for each unique combination of hvperparameter values

B.

The feature engineering stages will be computed using validation data

C.

The cross-validation process will no longer be

D.

The cross-validation process will no longer be reproducible

E.

The model will be refit one more per cross-validation fold

Full Access
Question # 20

A data scientist has developed a machine learning pipeline with a static input data set using Spark ML, but the pipeline is taking too long to process. They increase the number of workers in the cluster to get the pipeline to run more efficiently. They notice that the number of rows in the training set after reconfiguring the cluster is different from the number of rows in the training set prior to reconfiguring the cluster.

Which of the following approaches will guarantee a reproducible training and test set for each model?

A.

Manually configure the cluster

B.

Write out the split data sets to persistent storage

C.

Set a speed in the data splitting operation

D.

Manually partition the input data

Full Access
Question # 21

Which of the following tools can be used to distribute large-scale feature engineering without the use of a UDF or pandas Function API for machine learning pipelines?

A.

Keras

B.

Scikit-learn

C.

PyTorch

D.

Spark ML

Full Access
Question # 22

A data scientist has developed a linear regression model using Spark ML and computed the predictions in a Spark DataFrame preds_df with the following schema:

prediction DOUBLE

actual DOUBLE

Which of the following code blocks can be used to compute the root mean-squared-error of the model according to the data in preds_df and assign it to the rmse variable?

A)

B)

C)

D)

A.

Option A

B.

Option B

C.

Option C

D.

Option D

Full Access