Databricks-Machine-Learning-Associate Databricks Certified Machine Learning Associate Exam exact Exam Questions

Question # 4

A data scientist is using Spark SQL to import their data into a machine learning pipeline. Once the data is imported, the data scientist performs machine learning tasks using Spark ML.

Which of the following compute tools is best suited for this use case?

Single Node cluster

Standard cluster

SQL Warehouse

None of these compute tools support this task

Question # 5

A data scientist has developed a linear regression model using Spark ML and computed the predictions in a Spark DataFrame preds_df with the following schema:

prediction DOUBLE

actual DOUBLE

Which of the following code blocks can be used to compute the root mean-squared-error of the model according to the data in preds_df and assign it to the rmse variable?

Option A

Option B

Option C

Option D

Option E

Question # 6

A machine learning engineer has created a Feature Table new_table using Feature Store Client fs. When creating the table, they specified a metadata description with key information about the Feature Table. They now want to retrieve that metadata programmatically.

Which of the following lines of code will return the metadata description?

There is no way to return the metadata description programmatically.

fs.create_training_set("new_table")

fs.get_table("new_table").description

fs.get_table("new_table").load_df()

fs.get_table("new_table")

Question # 7

A machine learning engineering team has a Job with three successive tasks. Each task runs a single notebook. The team has been alerted that the Job has failed in its latest run.

Which of the following approaches can the team use to identify which task is the cause of the failure?

Run each notebook interactively

Review the matrix view in the Job's runs

Migrate the Job to a Delta Live Tables pipeline

Change each Task’s setting to use a dedicated cluster

Question # 8

A data scientist is performing hyperparameter tuning using an iterative optimization algorithm. Each evaluation of unique hyperparameter values is being trained on a single compute node. They are performing eight total evaluations across eight total compute nodes. While the accuracy of the model does vary over the eight evaluations, they notice there is no trend of improvement in the accuracy. The data scientist believes this is due to the parallelization of the tuning process.

Which change could the data scientist make to improve their model accuracy over the course of their tuning process?

Change the number of compute nodes to be half or less than half of the number of evaluations.

Change the number of compute nodes and the number of evaluations to be much larger but equal.

Change the iterative optimization algorithm used to facilitate the tuning process.

Change the number of compute nodes to be double or more than double the number of evaluations.

Question # 9

A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.

Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?

import pyspark.pandas as ps

df = ps.DataFrame(spark_df)

import pyspark.pandas as ps

df = ps.to_pandas(spark_df)

spark_df.to_sql()

import pandas as pd

df = pd.DataFrame(spark_df)

spark_df.to_pandas()

Question # 10

A machine learning engineer is using the following code block to scale the inference of a single-node model on a Spark DataFrame with one million records:

Assuming the default Spark configuration is in place, which of the following is a benefit of using an Iterator?

The data will be limited to a single executor preventing the model from being loaded multiple times

The model will be limited to a single executor preventing the data from being distributed

The model only needs to be loaded once per executor rather than once per batch during the inference process

The data will be distributed across multiple executors during the inference process

Question # 11

Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?

pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata

pandas API on Spark DataFrames are more performant than Spark DataFrames

pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata

pandas API on Spark DataFrames are less mutable versions of Spark DataFrames

pandas API on Spark DataFrames are unrelated to Spark DataFrames

Question # 12

A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model.

Which of the following possible explanations for this difference is invalid?

The second model is much more accurate than the first model

The data scientist failed to exponentiate the predictions in the second model prior to computing the RMSE

The data scientist failed to take the log of the predictions in the first model prior to computing the RMSE

The first model is much more accurate than the second model

The RMSE is an invalid evaluation metric for regression problems
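
The scale mismatch described in this scenario can be illustrated with a small sketch. It uses plain Python rather than Spark, and the price values, the `rmse` helper, and the assumption that the second model is perfect on the log scale are all hypothetical, chosen only to show how comparing log-scale predictions against raw prices inflates the RMSE.

```python
import math


def rmse(predictions, actuals):
    """Root mean-squared error between two equal-length sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical actual prices (illustrative values only).
actual_prices = [100.0, 150.0, 200.0, 250.0]

# Model 1 predicts price directly and is off by 10 on every record.
price_preds = [110.0, 140.0, 210.0, 240.0]

# Model 2 predicts log(price); assume it is perfect on the log scale.
log_price_preds = [math.log(p) for p in actual_prices]

# RMSE of model 1 on the price scale.
rmse_price_model = rmse(price_preds, actual_prices)

# RMSE of model 2 when its log-scale output is compared to raw prices.
rmse_log_raw = rmse(log_price_preds, actual_prices)

# RMSE of model 2 after exponentiating its predictions back to the price scale.
rmse_log_fixed = rmse([math.exp(p) for p in log_price_preds], actual_prices)

print(rmse_price_model, rmse_log_raw, rmse_log_fixed)
```

In this sketch the second model looks far worse than the first until its predictions are exponentiated, at which point its error is essentially zero.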

Question # 13

In which of the following situations is it preferable to impute missing feature values with their median value over the mean value?

When the features are of the categorical type

When the features are of the boolean type

When the features contain a lot of extreme outliers

When the features contain no outliers

When the features contain no missing values
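
How the mean and the median react to extreme values can be seen in a minimal sketch. The feature column below is hypothetical, and plain Python's `statistics` module stands in for whatever imputation tool is actually used.

```python
import statistics

# Hypothetical feature column with one missing entry (None) and one extreme outlier.
values = [10.0, 12.0, 11.0, None, 13.0, 1000.0]

# Statistics are computed from the observed (non-missing) entries only.
observed = [v for v in values if v is not None]
mean_fill = statistics.mean(observed)      # pulled toward the outlier
median_fill = statistics.median(observed)  # stays near the typical values

# Replace the missing entry with each candidate fill value.
imputed_with_mean = [mean_fill if v is None else v for v in values]
imputed_with_median = [median_fill if v is None else v for v in values]

print(mean_fill, median_fill)
```

With these values the mean-based fill lands far from every typical observation, while the median-based fill sits among them.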

Question # 14

A machine learning engineer is converting a decision tree from sklearn to Spark ML. They notice that they are receiving different results despite all of their data and manually specified hyperparameter values being identical.

Which of the following describes a reason that the single-node sklearn decision tree and the Spark ML decision tree can differ?

Spark ML decision trees test every feature variable in the splitting algorithm

Spark ML decision trees automatically prune overfit trees

Spark ML decision trees test more split candidates in the splitting algorithm

Spark ML decision trees test a random sample of feature variables in the splitting algorithm

Spark ML decision trees test binned features values as representative split candidates

Question # 15

A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model.

The Spark DataFrame train_df has the following schema:

The machine learning engineer shares the following code block:

Which of the following changes does the machine learning engineer need to make to complete the task?

They need to call the transform method on train_df

They need to convert the features column to be a vector

They do not need to make any changes

They need to utilize a Pipeline to fit the model

They need to split the features column out into one column for each feature

Question # 16

A machine learning engineer has been notified that a new Staging version of a model registered to the MLflow Model Registry has passed all tests. As a result, the machine learning engineer wants to put this model into production by transitioning it to the Production stage in the Model Registry.

From which of the following pages in Databricks Machine Learning can the machine learning engineer accomplish this task?

The home page of the MLflow Model Registry

The experiment page in the Experiments observatory

The model version page in the MLflow Model Registry

The model page in the MLflow Model Registry

Question # 17

A data scientist has defined a Pandas UDF function predict to parallelize the inference process for a single-node model:

They have written the following incomplete code block to use predict to score each record of Spark DataFrame spark_df:

Which of the following lines of code can be used to complete the code block to successfully complete the task?

predict(*spark_df.columns)

mapInPandas(predict)

predict(Iterator(spark_df))

mapInPandas(predict(spark_df.columns))

predict(spark_df.columns)

Question # 18

A new data scientist has started working on an existing machine learning project. The project is a scheduled Job that retrains every day. The project currently exists in a Repo in Databricks. The data scientist has been tasked with improving the feature engineering of the pipeline’s preprocessing stage. The data scientist wants to make necessary updates to the code that can be easily adopted into the project without changing what is being run each day.

Which approach should the data scientist take to complete this task?

They can create a new branch in Databricks, commit their changes, and push those changes to the Git provider.

They can clone the notebooks in the repository into a Databricks Workspace folder and make the necessary changes.

They can create a new Git repository, import it into Databricks, and copy and paste the existing code from the original repository before making changes.

They can clone the notebooks in the repository into a new Databricks Repo and make the necessary changes.

Question # 19

Which of the following evaluation metrics is not suitable to evaluate runs in AutoML experiments for regression problems?

R-squared

MAE

MSE

Question # 20

Which of the following tools can be used to distribute large-scale feature engineering without the use of a UDF or pandas Function API for machine learning pipelines?

Keras

pandas

PyTorch

Spark ML

Scikit-learn

Question # 21

A team is developing guidelines on when to use various evaluation metrics for classification problems. The team needs to provide input on when to use the F1 score over accuracy.

Which of the following suggestions should the team include in their guidelines?

The F1 score should be utilized over accuracy when the number of actual positive cases is identical to the number of actual negative cases.

The F1 score should be utilized over accuracy when there are greater than two classes in the target variable.

The F1 score should be utilized over accuracy when there is significant imbalance between positive and negative classes and avoiding false negatives is a priority.

The F1 score should be utilized over accuracy when identifying true positives and true negatives are equally important to the business problem.
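
The difference between the two metrics is easiest to see on a skewed dataset. The sketch below is a hypothetical illustration in plain Python: the labels, the always-negative classifier, and the helper functions `accuracy` and `f1_score` are all made up for this example and are not taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced data: 5 positive cases, 95 negative cases.
y_true = [1] * 5 + [0] * 95

# A classifier that always predicts the negative class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(acc, f1)
```

Here the classifier misses every positive case, yet its accuracy still looks high, while its F1 score exposes the problem.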

Question # 22

Which of the following tools can be used to parallelize the hyperparameter tuning process for single-node machine learning models using a Spark cluster?

MLflow Experiment Tracking

Spark ML

Autoscaling clusters

Delta Lake

Answer: Spark ML

Explanation:

Spark ML (part of Apache Spark's MLlib) is designed to handle machine learning tasks across multiple nodes in a cluster, effectively parallelizing tasks like hyperparameter tuning. It supports various machine learning algorithms that can be optimized over a Spark cluster, making it suitable for parallelizing hyperparameter tuning for single-node machine learning models when they are adapted to run on Spark.

References

Apache Spark MLlib Guide: https://spark.apache.org/docs/latest/ml-guide.html

Spark ML is a library within Apache Spark designed for scalable machine learning. In more detail, it can be used for this task as follows:

Hyperparameter Tuning with CrossValidator: Spark ML includes the CrossValidator and TrainValidationSplit classes, which are used for hyperparameter tuning. These classes can evaluate multiple sets of hyperparameters in parallel using a Spark cluster.

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Define the model
model = ...

# Create a parameter grid
paramGrid = ParamGridBuilder() \
    .addGrid(model.hyperparam1, [value1, value2]) \
    .addGrid(model.hyperparam2, [value3, value4]) \
    .build()

# Define the evaluator
evaluator = BinaryClassificationEvaluator()

# Define the CrossValidator
crossval = CrossValidator(estimator=model,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)

Parallel Execution: CrossValidator and TrainValidationSplit accept a parallelism parameter (default 1) that sets how many hyperparameter combinations are fit at the same time, so raising it lets several candidate models train simultaneously on the cluster.
Scalability: Each fit is itself a distributed Spark job, so large datasets are processed across many nodes, which can speed up hyperparameter tuning considerably compared with single-node computation.
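
The idea of evaluating several hyperparameter combinations concurrently can be sketched without a cluster. The example below is only an analogy written with Python's standard library, not Spark's implementation: `validation_score` is a hypothetical stand-in for fitting and scoring a model, the grid values are made up, and `max_workers` plays a role loosely comparable to the `parallelism` parameter of Spark ML's CrossValidator.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def validation_score(params):
    """Hypothetical stand-in for 'fit a model with params and score it'."""
    reg, depth = params["reg"], params["depth"]
    # Toy objective that peaks at reg=0.1, depth=5 (illustrative only).
    return -((reg - 0.1) ** 2) - 0.01 * (depth - 5) ** 2


# Parameter grid: every combination of the listed values.
grid = [{"reg": r, "depth": d} for r, d in product([0.01, 0.1, 1.0], [3, 5, 7])]

# Evaluate up to four combinations at a time; map preserves the grid order.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(validation_score, grid))

# Pick the combination with the highest score.
best_score, best_params = max(zip(scores, grid), key=lambda pair: pair[0])
print(best_params, best_score)
```

Because each combination is scored independently of the others, a grid search like this parallelizes cleanly; the number of workers only changes how long the search takes, not which combination wins.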

