Databricks-Machine-Learning-Professional Databricks Certified Machine Learning Professional exact Exam Questions

Question # 4

A machine learning engineering team has written predictions computed in a batch job to a Delta table for querying. However, the team has noticed that the querying is running slowly. The team has already tuned the size of the data files. Upon investigating, the team has concluded that the rows meeting the query condition are sparsely located throughout each of the data files.

Based on the scenario, which of the following optimization techniques could speed up the query by colocating similar records while considering values in multiple columns?

Z-Ordering

Bin-packing

Write as a Parquet file

Data skipping

Tuning the file size

Answer: Z-Ordering

Explanation:

Z-Ordering speeds up the query by colocating similar records while taking the values of multiple columns into account. It organizes data in storage based on the values of one or more columns, mapping multidimensional data to one dimension while preserving the locality of the data points. As a result, rows with similar values for the specified columns are stored close together in the same set of files, and queries that filter on those columns can skip irrelevant files or data blocks. Z-Ordering also makes data skipping more effective, because each file then covers a narrower range of values for the chosen columns [1]. The other options are incorrect because:

Option B: Bin-packing compacts small files into larger ones, but it does not colocate similar records based on multiple columns. It can speed up queries by reducing the number of files that need to be read, but it does not change how rows are laid out within the files [2].
Option C: Writing as a Parquet file is a file format choice, not an optimization technique. Parquet is a columnar storage format that supports efficient compression and encoding schemes, which reduces the storage footprint and the amount of data transferred, but it does not colocate similar records based on multiple columns [3].
Option D: Data skipping skips files or data blocks that cannot match the query predicates, but it does not colocate similar records itself. It avoids unnecessary data scans, yet how much it helps depends on the data layout and on the metadata collected for each file [4].
Option E: Tuning the file size adjusts the data files to a target size, but it does not colocate similar records based on multiple columns. It balances the trade-off between parallelism and per-file overhead without changing the data layout within the files, and the team has already done it [5].

References:

1. Z-Ordering (multi-dimensional clustering)
2. Compaction (bin-packing)
3. Parquet
4. Data skipping
5. Tuning file sizes
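
The interaction between Z-Ordering and data skipping can be illustrated with a toy model. The sketch below is not Delta Lake's implementation; it only assumes that each file keeps min/max statistics per column and that a file can be skipped when those ranges do not overlap the query. All names (z_value, files_scanned, striped_files, z_files) are hypothetical. On a Delta table the corresponding operation is the OPTIMIZE command with a ZORDER BY (col1, col2) clause.

```python
from typing import List, Tuple

Row = Tuple[int, int]  # (x, y): two columns the query filters on


def z_value(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of x and y into a single Z-order (Morton) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return code


def files_scanned(files: List[List[Row]], x_range: Row, y_range: Row) -> int:
    """Count files whose per-column min/max ranges overlap the query box."""
    scanned = 0
    for rows in files:
        xs = [x for x, _ in rows]
        ys = [y for _, y in rows]
        overlaps_x = min(xs) <= x_range[1] and max(xs) >= x_range[0]
        overlaps_y = min(ys) <= y_range[1] and max(ys) >= y_range[0]
        if overlaps_x and overlaps_y:
            scanned += 1
    return scanned


# 64 rows covering an 8 x 8 grid of (x, y) values, stored in 16 files of 4 rows.
rows = [(x, y) for x in range(8) for y in range(8)]

# Layout 1: rows striped across files, so matching rows are sparsely located.
striped_files = [rows[i::16] for i in range(16)]

# Layout 2: rows sorted by Z-order code before being cut into files.
z_rows = sorted(rows, key=lambda r: z_value(*r))
z_files = [z_rows[i:i + 4] for i in range(0, len(z_rows), 4)]

# Query: 0 <= x <= 1 AND 0 <= y <= 1 (four matching rows).
print(files_scanned(striped_files, (0, 1), (0, 1)))  # 4
print(files_scanned(z_files, (0, 1), (0, 1)))        # 1
```

In the striped layout each of the four matching rows sits in a different file, so four files must be read. After sorting by Z-order code the same four rows share one file, and the min/max statistics rule out the other fifteen.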

Question # 5

A data scientist has written a function to track the runs of their random forest model. The data scientist is changing the number of trees in the forest across each run.

Which of the following MLflow operations is designed to log single values like the number of trees in a random forest?

mlflow.log_artifact

mlflow.log_model

mlflow.log_metric

mlflow.log_param

There is no way to store values like this.

Question # 6

A machine learning engineer is using the following code block as part of a batch deployment pipeline:

Which of the following changes needs to be made so this code block will work when the inference table is a stream source?

Replace "inference" with the path to the location of the Delta table

Replace schema(schema) with option("maxFilesPerTrigger", 1)

Replace spark.read with spark.readStream

Replace format("delta") with format("stream")

Replace predict with a stream-friendly prediction function

Answer: Replace spark.read with spark.readStream

Explanation:

To read data from a stream source, such as Kafka, socket, or rate, use spark.readStream instead of spark.read. Loading data through spark.readStream produces a streaming DataFrame that represents the unbounded input data stream. spark.readStream accepts many of the same formats and options as spark.read, such as delta, csv, and json. It can also read from a Delta table as a stream source when given format("delta") and the path or table name of the Delta table [1][2][3].

The other options are incorrect because:

A. Replacing “inference” with the path to the location of the Delta table still leaves spark.read in place, which performs a batch read and cannot consume a stream source. spark.readStream should be used instead, and it accepts either the path or the table name of the Delta table.
B. Replacing schema(schema) with option("maxFilesPerTrigger", 1) still leaves spark.read in place. maxFilesPerTrigger is an optional setting that limits the number of files processed in each trigger for file-based stream sources such as delta, csv, and json. It does not turn a batch read into a streaming read [4].
D. Replacing format("delta") with format("stream") still leaves spark.read in place, and "stream" is not a valid format. The supported streaming formats include delta, kafka, socket, and rate [1].
E. Replacing predict with a stream-friendly prediction function still leaves spark.read in place. The predict function does not need to change, as long as it can accept a streaming DataFrame as input and return a column of predictions as output [5].
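
As a rough sketch of the corrected read, the helper below swaps the batch reader for the streaming reader. It assumes spark is an existing SparkSession and that the Delta table is registered under the name inference, as in the question. The helper name read_inference_stream and its max_files_per_trigger parameter are hypothetical, and the original code block is not reproduced here.

```python
def read_inference_stream(spark, table_name="inference", max_files_per_trigger=None):
    """Return a streaming DataFrame over a Delta table.

    `spark` is assumed to be an existing SparkSession. Using
    spark.readStream (not spark.read) is what makes the read streaming.
    """
    reader = spark.readStream.format("delta")
    if max_files_per_trigger is not None:
        # Optional rate limit: cap the files picked up per micro-batch.
        reader = reader.option("maxFilesPerTrigger", max_files_per_trigger)
    return reader.table(table_name)
```

The returned streaming DataFrame can then be passed to the same prediction logic and written out with writeStream.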

References:

1. Input Sources - Structured Streaming Programming Guide - Spark 3.2.0 Documentation
2. Structured Streaming + Delta Lake - Databricks
3. Structured Streaming Programming Guide - Spark 3.2.0 Documentation
4. Configuration - Structured Streaming Programming Guide - Spark 3.2.0 Documentation
5. Machine Learning with Structured Streaming - Databricks

Question # 7

A machine learning engineer is migrating a machine learning pipeline to use Databricks Machine Learning. They have programmatically identified the best run from an MLflow Experiment and stored its URI in the model_uri variable and its Run ID in the run_id variable. They have also determined that the model was logged with the name "model". Now, the machine learning engineer wants to register that model in the MLflow Model Registry with the name "best_model".

Which of the following lines of code can they use to register the model to the MLflow Model Registry?

mlflow.register_model(model_uri, "best_model")

mlflow.register_model(run_id, "best_model")

mlflow.register_model(f"runs:/{run_id}/best_model", "model")

mlflow.register_model(model_uri, "model")

mlflow.register_model(f"runs:/{run_id}/model")

Question # 8

Which of the following operations in Feature Store Client fs can be used to return a Spark DataFrame of a data set associated with a Feature Store table?

fs.create_table

fs.write_table

fs.get_table

There is no way to accomplish this task with fs

fs.read_table

Question # 9

Which of the following tools can assist in real-time deployments by packaging software with its own application, tools, and libraries?

Cloud-based compute

None of these tools

REST APIs

Containers

Autoscaling clusters

Question # 10

A machine learning engineer has developed a model and registered it using the FeatureStoreClient fs. The model has model URI model_uri. The engineer now needs to perform batch inference on customer-level Spark DataFrame spark_df, but it is missing a few of the static features that were used when training the model. The customer_id column is the primary key of spark_df and the training set used when training and logging the model.

Which of the following code blocks can be used to compute predictions for spark_df when the missing feature values can be found in the Feature Store by searching for features by customer_id?

df = fs.get_missing_features(spark_df, model_uri)

fs.score_model(model_uri, df)

fs.score_model(model_uri, spark_df)

df = fs.get_missing_features(spark_df, model_uri)

fs.score_batch(model_uri, df)

df = fs.get_missing_features(spark_df)

fs.score_batch(model_uri, df)

fs.score_batch(model_uri, spark_df)

Answer: fs.score_batch(model_uri, spark_df)

Explanation:

To compute predictions for spark_df when the missing feature values can be found in the Feature Store by searching for features by customer_id, pass spark_df directly to fs.score_batch:

Python

# Score the DataFrame; features missing from spark_df are looked up in the Feature Store by customer_id

fs.score_batch(model_uri, spark_df)

The fs.score_batch method takes a model URI and a Spark DataFrame as arguments. A model logged through the FeatureStoreClient is packaged with metadata about the feature lookups that built its training set. At scoring time, fs.score_batch uses that metadata to retrieve any features missing from the DataFrame by joining it with the feature tables on the lookup key, which must match the primary key of the feature tables (here, customer_id). It then applies the model and returns a new Spark DataFrame that contains the original columns, the retrieved feature values, and a prediction column. The model URI must point to a model that was logged with the FeatureStoreClient.

The other options are incorrect because:

The options that call fs.score_model: FeatureStoreClient has no score_model method. The batch scoring method is fs.score_batch.
The options that call fs.get_missing_features: FeatureStoreClient has no get_missing_features method, with or without a model URI argument. No separate lookup step is needed, because fs.score_batch retrieves the missing features itself.

References: Score batch
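
A minimal sketch of batch scoring with a Feature Store packaged model is shown below. It assumes fs is a FeatureStoreClient, model_uri points to a model that was logged through that client, and spark_df contains the customer_id lookup key. The wrapper name score_customers is hypothetical and only exists to make the single call explicit.

```python
def score_customers(fs, model_uri, spark_df):
    """Score a customer-level DataFrame with a Feature Store packaged model.

    `fs` is assumed to be a FeatureStoreClient. score_batch looks up the
    features that spark_df lacks, using the lookup key recorded with the
    model, so no separate feature retrieval step is written here.
    """
    return fs.score_batch(model_uri, spark_df)
```

The result is a Spark DataFrame, so it can be written to a Delta table or inspected with display like any other DataFrame.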

Question # 11

A data scientist has developed a scikit-learn model sklearn_model and they want to log the model using MLflow.

They write the following incomplete code block:

Which of the following lines of code can be used to fill in the blank so the code block can successfully complete the task?

mlflow.spark.track_model(sklearn_model, "model")

mlflow.sklearn.log_model(sklearn_model, "model")

mlflow.spark.log_model(sklearn_model, "model")

mlflow.sklearn.load_model("model")

mlflow.sklearn.track_model(sklearn_model, "model")

Question # 12

A machine learning engineer is converting a Hyperopt-based hyperparameter tuning process from manual MLflow logging to MLflow Autologging. They are trying to determine how to manage nested Hyperopt runs with MLflow Autologging.

Which of the following approaches will create a single parent run for the process and a child run for each unique combination of hyperparameter values when using Hyperopt and MLflow Autologging?

Starting a manual parent run before calling fmin

Ensuring that a built-in model flavor is used for the model logging

Starting a manual child run within the objective function

There is no way to accomplish nested runs with MLflow Autologging and Hyperopt

MLflow Autologging will automatically accomplish this task with Hyperopt

Question # 13

A machine learning engineer wants to log feature importance data from a CSV file at path importance_path with an MLflow run for model model.

Which of the following code blocks will accomplish this task inside of an existing MLflow run block?

mlflow.log_data(importance_path, "feature-importance.csv")

mlflow.log_artifact(importance_path, "feature-importance.csv")

None of these code blocks can accomplish the task.

Question # 14

Which of the following is a simple, low-cost method of monitoring numeric feature drift?

Jensen-Shannon test

Summary statistics trends

Chi-squared test

None of these can be used to monitor feature drift

Kolmogorov-Smirnov (KS) test

Question # 15

A data scientist has developed a model model and computed the RMSE of the model on the test set. They have assigned this value to the variable rmse. They now want to manually store the RMSE value with the MLflow run.

They write the following incomplete code block:

Which of the following lines of code can be used to fill in the blank so the code block can successfully complete the task?

log_artifact

log_model

log_metric

log_param

There is no way to store values like this.

Question # 16

A data scientist set up a machine learning pipeline to automatically log a data visualization with each run. They now want to view the visualizations in Databricks.

Which of the following locations in Databricks will show these data visualizations?

The MLflow Model Registry Model page

The Artifacts section of the MLflow Experiment page

Logged data visualizations cannot be viewed in Databricks

The Artifacts section of the MLflow Run page

The Figures section of the MLflow Run page

Question # 17

Which of the following MLflow operations can be used to delete a model from the MLflow Model Registry?

client.transition_model_version_stage

client.delete_model_version

client.update_registered_model

client.delete_model

client.delete_registered_model

Question # 18

After a data scientist noticed that a column was missing from a production feature set stored as a Delta table, the machine learning engineering team has been tasked with determining when the column was dropped from the feature set.

Which of the following SQL commands can be used to accomplish this task?

VERSION

DESCRIBE

HISTORY

DESCRIBE HISTORY

TIMESTAMP
