
Question # 4

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.

What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?

A.

Can Manage

B.

Can Edit

C.

No permissions

D.

Can Read

E.

Can Run

Question # 5

The business intelligence team has a dashboard configured to track various summary metrics for retail stores. This includes total sales for the previous day alongside totals and averages for a variety of time periods. The fields required to populate this dashboard have the following schema:

For demand forecasting, the Lakehouse contains a validated table of all itemized sales, updated incrementally in near real-time. This table, named products_per_order, includes the following fields:

Because reporting on long-term sales trends is less volatile, analysts using the new dashboard only require data to be refreshed once daily. Because the dashboard will be queried interactively by many users throughout a normal business day, it should return results quickly and reduce total compute associated with each materialization.

Which solution meets the expectations of the end users while controlling and limiting possible costs?

A.

Use the Delta Cache to persist the products_per_order table in memory to quickly update the dashboard with each query.

B.

Populate the dashboard by configuring a nightly batch job to save the required values as a table, overwritten with each update.

C.

Use Structured Streaming to configure a live dashboard against the products_per_order table within a Databricks notebook.

D.

Define a view against the products_per_order table and define the dashboard against this view.

Question # 6

Which statement regarding Spark configuration on the Databricks platform is true?

A.

Spark configuration properties set for an interactive cluster with the Clusters UI will impact all notebooks attached to that cluster.

B.

When the same Spark configuration property is set for an interactive cluster and a notebook attached to that cluster, the notebook setting will always be ignored.

C.

Spark configuration properties set within a notebook will affect all SparkSessions attached to the same interactive cluster.

D.

The Databricks REST API can be used to modify the Spark configuration properties for an interactive cluster without interrupting jobs.
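
For context, the distinction these options draw can be observed directly in a notebook: values set through spark.conf apply only to that notebook's own SparkSession, while properties set in the Clusters UI are applied when the cluster starts. A minimal sketch; the property and value below are arbitrary examples:

```python
# Session-scoped: affects only the SparkSession attached to this notebook,
# not other notebooks on the same interactive cluster.
spark.conf.set("spark.sql.shuffle.partitions", "64")

# Reading a property back; cluster-level values set in the Clusters UI are
# applied at cluster startup and visible here unless overridden in-session.
print(spark.conf.get("spark.sql.shuffle.partitions"))
```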

Question # 7

The business reporting team requires that data for their dashboards be updated every hour. The total processing time for the pipeline that extracts, transforms, and loads the data is 10 minutes.

Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?

A.

Schedule a job to execute the pipeline once an hour on a dedicated interactive cluster.

B.

Schedule a Structured Streaming job with a trigger interval of 60 minutes.

C.

Schedule a job to execute the pipeline once an hour on a new job cluster.

D.

Configure a job that executes every time new data lands in a given directory.

Question # 8

Which of the following is true of Delta Lake and the Lakehouse?

A.

Because Parquet compresses data row by row, strings will only be compressed when a character is repeated multiple times.

B.

Delta Lake automatically collects statistics on the first 32 columns of each table, which are leveraged in data skipping based on query filters.

C.

Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.

D.

Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.

E.

Z-order can only be applied to numeric values stored in Delta Lake tables

Question # 9

What is the first line of a Databricks Python notebook when viewed in a text editor?

A.

%python

B.

# Databricks notebook source

C.

-- Databricks notebook source

D.

//Databricks notebook source
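
For reference, a Python notebook exported to source format begins with a special comment marker, and subsequent cells are delimited by # COMMAND ---------- lines. A minimal example of the exported file contents:

```python
# Databricks notebook source
print("first cell")

# COMMAND ----------

print("second cell")
```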

Question # 10

An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable:

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.

If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?

A.

Each write to the orders table will only contain unique records, and only those records without duplicates in the target table will be written.

B.

Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.

C.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, these records will be overwritten.

D.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, the operation will fail.

E.

Each write to the orders table will run deduplication over the union of new and existing records, ensuring no duplicate records are present.
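
The ingestion code referenced above is not reproduced here, so the following is only a hedged sketch of the kind of nightly append job being described; the path, date value, and deduplication call are assumptions for illustration:

```python
# Hypothetical reconstruction of a nightly batch ingest: deduplication is
# applied only within the newly read batch, and the append does not check
# the target table for pre-existing keys.
date = "2024-03-15"  # placeholder for the previous day's directory name

(spark.read
    .format("parquet")
    .load(f"/mnt/raw_orders/{date}")                 # assumed mount path
    .dropDuplicates(["customer_id", "order_id"])     # dedupe within this batch only
    .write
    .format("delta")
    .mode("append")                                  # no comparison with existing rows
    .saveAsTable("orders"))
```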

Question # 11

The data engineering team maintains the following code:

Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and validated, which statement describes what will occur when this code is executed?

A.

The silver_customer_sales table will be overwritten by aggregated values calculated from all records in the gold_customer_lifetime_sales_summary table as a batch job.

B.

A batch job will update the gold_customer_lifetime_sales_summary table, replacing only those rows that have different values than the current version of the table, using customer_id as the primary key.

C.

The gold_customer_lifetime_sales_summary table will be overwritten by aggregated values calculated from all records in the silver_customer_sales table as a batch job.

D.

An incremental job will leverage running information in the state store to update aggregate values in the gold_customer_lifetime_sales_summary table.

E.

An incremental job will detect if new rows have been written to the silver_customer_sales table; if new rows are detected, all aggregates will be recalculated and used to overwrite the gold_customer_lifetime_sales_summary table.
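
The maintained code referenced above is not reproduced here. For orientation only, one common shape for a batch gold-table rebuild of this kind is sketched below; the aggregate column names are assumptions:

```python
# Illustrative only: a batch job that recomputes the gold aggregate from the
# full silver table and replaces the previous version on every run.
spark.sql("""
  CREATE OR REPLACE TABLE gold_customer_lifetime_sales_summary AS
  SELECT
    customer_id,
    SUM(sales_amount) AS lifetime_sales,   -- assumed column name
    COUNT(order_id)   AS lifetime_orders   -- assumed column name
  FROM silver_customer_sales
  GROUP BY customer_id
""")
```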

Question # 12

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream for highly selective joins on a number of fields, and will also be leveraged by the machine learning team to filter on a handful of relevant fields. In total, 15 fields have been identified that will often be used for filter and join logic.

The data engineer is trying to determine the best approach for dealing with these nested fields before declaring the table schema.

Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

A.

Because Delta Lake uses Parquet for data storage, Dremel encoding information for nesting can be directly referenced by the Delta transaction log.

B.

Tungsten encoding used by Databricks is optimized for storing string data: newly-added native support for querying JSON strings means that string types are always most efficient.

C.

Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.

D.

By default Delta Lake collects statistics on the first 32 columns in a table; these statistics are leveraged for data skipping when executing selective queries.

Question # 13

What statement is true regarding the retention of job run history?

A.

It is retained until you export or delete job run logs

B.

It is retained for 30 days, during which time you can deliver job run logs to DBFS or S3

C.

It is retained for 60 days, during which you can export notebook run results to HTML

D.

It is retained for 60 days, after which logs are archived

E.

It is retained for 90 days or until the run-id is re-used through custom run configuration

Question # 14

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

email STRING, age INT, ltv INT

The following view definition is executed:

An analyst who is not a member of the marketing group executes the following query:

SELECT * FROM email_ltv

Which statement describes the results returned by this query?

A.

Three columns will be returned, but one column will be named "redacted" and contain only null values.

B.

Only the email and ltv columns will be returned; the email column will contain all null values.

C.

The email and ltv columns will be returned with the values in user_ltv.

D.

The email, age, and ltv columns will be returned with the values in user_ltv.

E.

Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.
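
The view definition itself is not reproduced above. For orientation, a redacting view of the kind this question tests is typically written with the is_member() function, roughly as in the sketch below; this is illustrative, not the exam's exact definition:

```python
# Illustrative dynamic view: members of the marketing group see real email
# addresses, everyone else sees the literal string "REDACTED".
spark.sql("""
  CREATE OR REPLACE VIEW email_ltv AS
  SELECT
    CASE WHEN is_member('marketing') THEN email ELSE 'REDACTED' END AS email,
    ltv
  FROM user_ltv
""")
```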

Question # 15

A Delta Lake table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

Immediately after each update succeeds, the data engineering team would like to determine the difference between the new version and the previous version of the table.

Given the current implementation, which method can be used?

A.

Parse the Delta Lake transaction log to identify all newly written data files.

B.

Execute DESCRIBE HISTORY customer_churn_params to obtain the full operation metrics for the update, including a log of all records that have been added or modified.

C.

Execute a query to calculate the difference between the new version and the previous version using Delta Lake’s built-in versioning and time travel functionality.

D.

Parse the Spark event logs to identify those rows that were updated, inserted, or deleted.
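
Since the table is fully overwritten on each nightly run, Delta Lake time travel makes a version-to-version comparison straightforward. A minimal sketch, with placeholder version numbers:

```python
# Rows present in the new version but not in the previous one; swapping the
# two SELECTs gives rows that were removed or changed.
diff = spark.sql("""
  SELECT * FROM customer_churn_params VERSION AS OF 12
  EXCEPT
  SELECT * FROM customer_churn_params VERSION AS OF 11
""")
display(diff)
```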

Question # 16

A Delta Lake table was created with the below query:

Realizing that the original query had a typographical error, the below code was executed:

ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store

Which result will occur after running the second command?

A.

The table reference in the metastore is updated and no data is changed.

B.

The table name change is recorded in the Delta transaction log.

C.

All related files and metadata are dropped and recreated in a single ACID transaction.

D.

The table reference in the metastore is updated and all data files are moved.

E.

A new Delta transaction log is created for the renamed table.

Question # 17

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.

The data engineer is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields.

Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

A.

The Tungsten encoding used by Databricks is optimized for storing string data; newly-added native support for querying JSON strings means that string types are always most efficient.

B.

Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just modifying file footer information in place.

C.

Human labor in writing code is the largest cost associated with data engineering workloads; as such, automating table declaration logic should be a priority in all migration workloads.

D.

Because Databricks will infer schema using types that allow all observed data to be processed, setting types manually provides greater assurance of data quality enforcement.

E.

Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.
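
For context on the trade-off between schema inference and manual declaration, a hedged sketch of declaring an explicit schema for a nested JSON source is shown below; the field names and path are assumptions:

```python
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType

# Explicitly declared schema for a nested source: types are enforced on read
# rather than inferred from whatever happens to appear in the data.
schema = StructType([
    StructField("device_id", StringType(), True),
    StructField("recorded_at", LongType(), True),
    StructField("reading", StructType([
        StructField("metric", StringType(), True),
        StructField("value", DoubleType(), True),
    ]), True),
])

df = spark.read.schema(schema).json("/mnt/raw/device_recordings/")  # assumed path
```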

Question # 18

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating that the job run request has been submitted successfully includes a field named run_id.

Which statement describes what the number alongside this field represents?

A.

The job_id is returned in this field.

B.

The job_id and the number of times the job has been run are concatenated and returned.

C.

The number of times the job definition has been run in the workspace.

D.

The globally unique ID of the newly triggered run.

Question # 19

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.

What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?

A.

Can Manage

B.

Can Edit

C.

Can Run

D.

Can Read

Question # 20

Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?

A.

Regex

B.

Julia

C.

pyspark.ml.feature

D.

Scala Datasets

E.

C++
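
A short illustration of the regex approach against a typical log4j-style driver log line; the line format and pattern here are only examples, since actual formats vary by logging configuration:

```python
import re

# Example log4j-style line from a Spark driver log.
line = "24/03/15 12:34:56 WARN TaskSetManager: Lost task 3.0 in stage 7.0"

# Capture timestamp, level, logger name, and message.
pattern = r"^(\S+ \S+) (INFO|WARN|ERROR) ([^:]+): (.*)$"

match = re.match(pattern, line)
if match:
    timestamp, level, logger, message = match.groups()
    print(level, logger, message)
```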

Question # 21

The data engineering team maintains the following code:

Assuming that this code produces logically correct results and the data in the source tables has been de-duplicated and validated, which statement describes what will occur when this code is executed?

A.

A batch job will update the enriched_itemized_orders_by_account table, replacing only those rows that have different values than the current version of the table, using accountID as the primary key.

B.

The enriched_itemized_orders_by_account table will be overwritten using the current valid version of data in each of the three tables referenced in the join logic.

C.

An incremental job will leverage information in the state store to identify unjoined rows in the source tables and write these rows to the enriched_itemized_orders_by_account table.

D.

An incremental job will detect if new rows have been written to any of the source tables; if new rows are detected, all results will be recalculated and used to overwrite the enriched_itemized_orders_by_account table.

E.

No computation will occur until enriched_itemized_orders_by_account is queried; upon query materialization, results will be calculated using the current valid version of data in each of the three tables referenced in the join logic.

Question # 22

A data engineer is performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF.

Which code block attempts to perform an invalid stream-static join?

A.

userLookup.join(streamingDF, ["userid"], how="inner")

B.

streamingDF.join(userLookup, ["user_id"], how="outer")

C.

streamingDF.join(userLookup, ["user_id”], how="left")

D.

streamingDF.join(userLookup, ["userid"], how="inner")

E.

userLookup.join(streamingDF, ["user_id"], how="right")
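
For comparison with the options above, a minimal sketch of a stream-static join that is valid: the streaming DataFrame sits on the left side of an inner join against the static lookup table. The source table names are assumptions:

```python
# Static lookup table read as a normal batch DataFrame.
userLookup = spark.read.table("user_lookup")        # assumed table name

# Streaming source (readStream.table is available on recent runtimes).
streamingDF = spark.readStream.table("events")      # assumed table name

# Stream-static inner join keyed on user_id.
joined = streamingDF.join(userLookup, ["user_id"], how="inner")
```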

Question # 23

The security team is exploring whether or not the Databricks secrets module can be leveraged for connecting to an external database.

After testing the code with all Python variables defined as strings, they upload the password to the secrets module and configure the correct permissions for the currently active user. They then modify their code to the following (leaving all other variables unchanged).

Which statement describes what will happen when the above code is executed?

A.

The connection to the external table will fail; the string "redacted" will be printed.

B.

An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the encoded password will be saved to DBFS.

C.

An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the password will be printed in plain text.

D.

The connection to the external table will succeed; the string value of password will be printed in plain text.

E.

The connection to the external table will succeed; the string "redacted" will be printed.
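
The modified code is not reproduced above; for context, the pattern under test looks roughly like the sketch below, where the scope and key names are assumptions. Values retrieved with dbutils.secrets.get are redacted in notebook output when printed:

```python
# Fetch the password from the secrets module instead of a hard-coded string.
password = dbutils.secrets.get(scope="prod-db", key="jdbc-password")  # assumed scope/key

# The connection logic can use the real value, but printing it in a notebook
# shows a redacted placeholder rather than the plain-text password.
print(password)
```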

Question # 24

An external object storage container has been mounted to the location /mnt/finance_eda_bucket.

The following logic was executed to create a database for the finance team:

After the database was successfully created and permissions configured, a member of the finance team runs the following code:

If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?

A.

A logical table will persist the query plan to the Hive Metastore in the Databricks control plane.

B.

An external table will be created in the storage container mounted to /mnt/finance_eda_bucket.

C.

A logical table will persist the physical plan to the Hive Metastore in the Databricks control plane.

D.

A managed table will be created in the storage container mounted to /mnt/finance_eda_bucket.

E.

A managed table will be created in the DBFS root storage container.
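
The database-creation logic and the finance team member's code are not reproduced above. As a hedged reconstruction of the pattern being tested, the sketch below creates a database with a LOCATION on the mount and then a table without any explicit path; names beyond those in the question are assumptions:

```python
# Database anchored to the mounted external storage container.
spark.sql("""
  CREATE DATABASE IF NOT EXISTS finance_eda
  LOCATION '/mnt/finance_eda_bucket'
""")

# No LOCATION/path is supplied for the table itself, so it is a managed table
# and its data files land under the database's location on the mount.
spark.sql("""
  CREATE TABLE finance_eda.tx_sales AS
  SELECT * FROM prod.transactions WHERE type = 'sale'   -- assumed source query
""")
```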

Question # 25

The following code has been migrated to a Databricks notebook from a legacy workload:

The code executes successfully and provides the logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.

Which statement is a possible explanation for this behavior?

A.

%sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.

B.

Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.

C.

%sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.

D.

Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.

E.

%sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.

Question # 26

The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company.

A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.

Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to use these credentials?

A.

“Read” permissions should be set on a secret key mapped to those credentials that will be used by a given team.

B.

No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.

C.

“Read” permissions should be set on a secret scope containing only those credentials that will be used by a given team.

D.

“Manage” permission should be set on a secret scope containing only those credentials that will be used by a given team.
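
Once a team's group has been granted READ on the scope holding its credential, consuming it from a notebook is a one-liner; the scope and key names below are assumptions:

```python
# Each team reads only from its own scope; READ on the scope is the minimum
# permission needed to retrieve the credential.
db_password = dbutils.secrets.get(scope="finance-team", key="external-db-password")
```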

Question # 27

The data architect has decided that once data has been ingested from external sources into the Databricks Lakehouse, table access controls will be leveraged to manage permissions for all production tables and views.

The following logic was executed to grant privileges for interactive queries on a production database to the core engineering group.

GRANT USAGE ON DATABASE prod TO eng;

GRANT SELECT ON DATABASE prod TO eng;

Assuming these are the only privileges that have been granted to the eng group and that these users are not workspace administrators, which statement describes their privileges?

A.

Group members have full permissions on the prod database and can also assign permissions to other users or groups.

B.

Group members are able to list all tables in the prod database but are not able to see the results of any queries on those tables.

C.

Group members are able to query and modify all tables and views in the prod database, but cannot create new tables or views.

D.

Group members are able to query all tables and views in the prod database, but cannot create or edit anything in the database.

E.

Group members are able to create, query, and modify all tables and views in the prod database, but cannot define custom functions.

Question # 28

A data team's Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new field to track the number of times a promotion code is used for each item. A junior data engineer suggests updating the existing query as follows: Note that proposed changes are in bold.

Which step must also be completed to put the proposed query into production?

A.

Increase the shuffle partitions to account for additional aggregates

B.

Specify a new checkpointLocation

C.

Run REFRESH TABLE delta.`/item_agg`

D.

Remove .option('mergeSchema', 'true') from the streaming write
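
The proposed query is not reproduced above. As orientation, a hedged sketch of a streaming aggregate write of this kind is shown below, supplying a fresh checkpointLocation because the changed aggregation state is incompatible with the old checkpoint; the source, sink, and column names are assumptions:

```python
from pyspark.sql import functions as F

(spark.readStream.table("item_sales")                          # assumed source
    .groupBy("item_id")
    .agg(F.sum("quantity").alias("total_quantity"),
         F.sum("promo_uses").alias("total_promo_uses"))        # newly added aggregate
    .writeStream
    .outputMode("complete")
    .option("checkpointLocation", "/checkpoints/item_agg_v2")  # new checkpoint path
    .toTable("item_agg"))                                      # assumed target table
```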

Question # 29

Which distribution does Databricks support for installing custom Python code packages?

A.

sbt

B.

CRAN

C.

CRAM

D.

npm

E.

Wheels

F.

jars

Question # 30

A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Stream job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming production job. Recent review of data files shows that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB.

Which of the following likely explains these smaller file sizes?

A.

Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations

B.

Z-order indices calculated on the table are preventing file compaction

C.

Bloom filter indices calculated on the table are preventing file compaction

D.

Databricks has autotuned to a smaller target file size based on the overall size of data in the table

E.

Databricks has autotuned to a smaller target file size based on the amount of data in each partition

Question # 31

The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.

Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster?

A.

"Can Manage" privileges on the required cluster

B.

Workspace Admin privileges, cluster creation allowed. "Can Attach To" privileges on the required cluster

C.

Cluster creation allowed. "Can Attach To" privileges on the required cluster

D.

"Can Restart" privileges on the required cluster

E.

Cluster creation allowed. "Can Restart" privileges on the required cluster

Question # 32

What is a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster?

A.

Use &Pip install in a notebook cell

B.

Run source env/bin/activate in a notebook setup script

C.

Install libraries from PyPi using the cluster UI

D.

Use &sh install in a notebook cell

Question # 33

Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with the DBFS for use with a production job?

A.

configure

B.

fs

C.

jobs

D.

libraries

E.

workspace

Question # 34

The data engineering team maintains a table of aggregate statistics through nightly batch updates. This includes total sales for the previous day alongside totals and averages for a variety of time periods, including the 7 previous days, year-to-date, and quarter-to-date. This table is named store_sales_summary and the schema is as follows:

The table daily_store_sales contains all the information needed to update store_sales_summary. The schema for this table is:

store_id INT, sales_date DATE, total_sales FLOAT

If daily_store_sales is implemented as a Type 1 table and the total_sales column might be adjusted after manual data auditing, which approach is the safest to generate accurate reports in the store_sales_summary table?

A.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and overwrite the store_sales_summary table with each update.

B.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and append new rows nightly to the store_sales_summary table.

C.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.

D.

Implement the appropriate aggregate logic as a Structured Streaming read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.

E.

Use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update.
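
A hedged sketch of the batch overwrite approach referenced in the options: the summary is recomputed from the current state of daily_store_sales so that any audited corrections to total_sales are fully reflected. The summary column names are assumptions:

```python
from pyspark.sql import functions as F

daily = spark.read.table("daily_store_sales")

# Recompute every aggregate from scratch on each run.
summary = (daily.groupBy("store_id")
    .agg(F.sum("total_sales").alias("total_sales_to_date"),    # assumed column
         F.avg("total_sales").alias("avg_daily_sales")))       # assumed column

(summary.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("store_sales_summary"))
```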

Question # 35

A junior data engineer on your team has implemented the following code block.

The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table.

When this query is executed, what will happen with new records that have the same event_id as an existing record?

A.

They are merged.

B.

They are ignored.

C.

They are updated.

D.

They are inserted.

E.

They are deleted.
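
The junior engineer's code block is not reproduced above. For orientation only, one common shape for a merge keyed on event_id that only inserts unmatched records is sketched below; this is an illustrative assumption, not the exam's actual code:

```python
# Insert-only merge: rows whose event_id already exists in the target are
# left untouched; only genuinely new event_ids are inserted.
spark.sql("""
  MERGE INTO events
  USING new_events
  ON events.event_id = new_events.event_id
  WHEN NOT MATCHED THEN INSERT *
""")
```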
