Databricks Real Dumps Practice Exam Questions by Dumpswarp

Databricks Certified Data Engineer Professional Exam Questions and Answers

Question 1

A distributed team of data analysts share computing resources on an interactive cluster with autoscaling configured. In order to better manage costs and query throughput, the workspace administrator is hoping to evaluate whether cluster upscaling is caused by many concurrent users or resource-intensive queries.

In which location can one review the timeline for cluster resizing events?

Options:

Workspace audit logs

Driver's log file

Ganglia

Cluster Event Log

Executor's log file

Question 2

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

Options:

Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.

Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.

Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB bytes, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.

Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB* 1024*1024/512), and then write to parquet.

Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.

Question 3

The data engineering team maintains a table of aggregate statistics through batch nightly updates. This includes total sales for the previous day alongside totals and averages for a variety of time periods including the 7 previous days, year-to-date, and quarter-to-date. This table is named store_saies_summary and the schema is as follows:

The table daily_store_sales contains all the information needed to update store_sales_summary. The schema for this table is:

store_id INT, sales_date DATE, total_sales FLOAT

If daily_store_sales is implemented as a Type 1 table and the total_sales column might be adjusted after manual data auditing, which approach is the safest to generate accurate reports in the store_sales_summary table?

Options:

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and overwrite the store_sales_summary table with each Update.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and append new rows nightly to the store_sales_summary table.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.

Implement the appropriate aggregate logic as a Structured Streaming read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.

Use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update.

Answer:

Explanation:

The daily_store_sales table contains all the information needed to update store_sales_summary. The schema of the table is:

store_id INT, sales_date DATE, total_sales FLOAT

The daily_store_sales table is implemented as a Type 1 table, which means that old values are overwritten by new values and no history is maintained. The total_sales column might be adjusted after manual data auditing, which means that the data in the table may change over time.

The safest approach to generate accurate reports in the store_sales_summary table is to use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update. Structured Streaming is a scalable and fault-tolerant stream processing engine built on Spark SQL. Structured Streaming allows processing data streams as if they were tables or DataFrames, using familiar operations such as select, filter, groupBy, or join. Structured Streaming also supports output modes that specify how to write the results of a streaming query to a sink, such as append, update, or complete. Structured Streaming can handle both streaming and batch data sources in a unified manner.

The change data feed is a feature of Delta Lake that provides structured streaming sources that can subscribe to changes made to a Delta Lake table. The change data feed captures both data changes and schema changes as ordered events that can be processed by downstream applications or services. The change data feed can be configured with different options, such as starting from a specific version or timestamp, filtering by operation type or partition values, or excluding no-op changes.

By using Structured Streaming to subscribe to the change data feed for daily_store_sales, one can capture and process any changes made to the total_sales column due to manual data auditing. By applying these changes to the aggregates in the store_sales_summary table with each update, one can ensure that the reports are always consistent and accurate with the latest data. Verified References: [Databricks Certified Data Engineer Professional], under “Spark Core” section; Databricks Documentation, under “Structured Streaming” section; Databricks Documentation, under “Delta Change Data Feed” section.

Question 4

The business reporting tem requires that data for their dashboards be updated every hour. The total processing time for the pipeline that extracts transforms and load the data for their pipeline runs in 10 minutes.

Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?

Options:

Schedule a jo to execute the pipeline once and hour on a dedicated interactive cluster.

Schedule a Structured Streaming job with a trigger interval of 60 minutes.

Schedule a job to execute the pipeline once hour on a new job cluster.

Configure a job that executes every time new data lands in a given directory.

Question 5

A data architect has designed a system in which two Structured Streaming jobs will concurrently write to a single bronze Delta table. Each job is subscribing to a different topic from an Apache Kafka source, but they will write data with the same schema. To keep the directory structure simple, a data engineer has decided to nest a checkpoint directory to be shared by both streams.

The proposed directory structure is displayed below:

Which statement describes whether this checkpoint directory structure is valid for the given scenario and why?

Options:

No; Delta Lake manages streaming checkpoints in the transaction log.

Yes; both of the streams can share a single checkpoint directory.

No; only one stream can write to a Delta Lake table.

Yes; Delta Lake supports infinite concurrent writers.

No; each of the streams needs to have its own checkpoint directory.

Question 6

An external object storage container has been mounted to the location /mnt/finance_eda_bucket.

The following logic was executed to create a database for the finance team:

After the database was successfully created and permissions configured, a member of the finance team runs the following code:

If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?

Options:

A logical table will persist the query plan to the Hive Metastore in the Databricks control plane.

An external table will be created in the storage container mounted to /mnt/finance eda bucket.

A logical table will persist the physical plan to the Hive Metastore in the Databricks control plane.

An managed table will be created in the storage container mounted to /mnt/finance eda bucket.

A managed table will be created in the DBFS root storage container.

Question 7

Each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160 total cores and only one Executor per VM.

Given a job with at least one wide transformation, which of the following cluster configurations will result in maximum performance?

Options:

• Total VMs; 1

• 400 GB per Executor

• 160 Cores / Executor

• Total VMs: 8

• 50 GB per Executor

• 20 Cores / Executor

• Total VMs: 4

• 100 GB per Executor

• 40 Cores/Executor

• Total VMs:2

• 200 GB per Executor

• 80 Cores / Executor

Question 8

A small company based in the United States has recently contracted a consulting firm in India to implement several new data engineering pipelines to power artificial intelligence applications. All the company's data is stored in regional cloud storage in the United States.

The workspace administrator at the company is uncertain about where the Databricks workspace used by the contractors should be deployed.

Assuming that all data governance considerations are accounted for, which statement accurately informs this decision?

Options:

Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines must be deployed in the region where the data is stored.

Databricks workspaces do not rely on any regional infrastructure; as such, the decision should be made based upon what is most convenient for the workspace administrator.

Cross-region reads and writes can incur significant costs and latency; whenever possible, compute should be deployed in the same region the data is stored.

Databricks leverages user workstations as the driver during interactive development; as such, users should always use a workspace deployed in a region they are physically near.

Databricks notebooks send all executable code from the user's browser to virtual machines over the open internet; whenever possible, choosing a workspace region near the end users is the most secure.

Question 9

The data engineer team is configuring environment for development testing, and production before beginning migration on a new data pipeline. The team requires extensive testing on both the code and data resulting from code execution, and the team want to develop and test against similar production data as possible.

A junior data engineer suggests that production data can be mounted to the development testing environments, allowing pre production code to execute against production data. Because all users have

Admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team.

Which statement captures best practices for this situation?

Options:

Because access to production data will always be verified using passthrough credentials it is safe to mount data to any Databricks development environment.

All developer, testing and production code and data should exist in a single unified workspace; creating separate environments for testing and development further reduces risks.

In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.

Because delta Lake versions all data and supports time travel, it is not possible for user error or malicious actors to permanently delete production data, as such it is generally safe to mount production data anywhere.

Question 10

Which distribution does Databricks support for installing custom Python code packages?

Options:

sbt

CRAN

CRAM

nom

Wheels

jars

Question 11

An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable:

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.

If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?

Options:

Each write to the orders table will only contain unique records, and only those records without duplicates in the target table will be written.

Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, these records will be overwritten.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, the operation will tail.

Each write to the orders table will run deduplication over the union of new and existing records, ensuring no duplicate records are present.

Question 12

The marketing team is looking to share data in an aggregate table with the sales organization, but the field names used by the teams do not match, and a number of marketing specific fields have not been approval for the sales org.

Which of the following solutions addresses the situation while emphasizing simplicity?

Options:

Create a view on the marketing table selecting only these fields approved for the sales team alias the names of any fields that should be standardized to the sales naming conventions.

Use a CTAS statement to create a derivative table from the marketing table configure a production jon to propagation changes.

Add a parallel table write to the current production pipeline, updating a new sales table that varies as required from marketing table.

Create a new table with the required schema and use Delta Lake's DEEP CLONE functionality to sync up changes committed to one table to the corresponding table.

Question 13

A table is registered with the following code:

Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?

Options:

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.

All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.

Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.

The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.

Question 14

Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with the DBFS for use with a production job?

Options:

configure

jobs

libraries

workspace

Question 15

Which statement describes Delta Lake Auto Compaction?

Options:

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 1 GB.

Before a Jobs cluster terminates, optimize is executed on all tables modified during the most recent job.

Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.

Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 128 MB.

Question 16

The data science team has created and logged a production using MLFlow. The model accepts a list of column names and returns a new column of type DOUBLE.

The following code correctly imports the production model, load the customer table containing the customer_id key column into a Dataframe, and defines the feature columns needed for the model.

Which code block will output DataFrame with the schema'' customer_id LONG, predictions DOUBLE''?

Options:

Model, predict (df, columns)

Df, map (lambda k:midel (x [columns]) ,select (''customer_id predictions'')

Df. Select (''customer_id''.

Model (''columns) alias (''predictions'')

Df.apply(model, columns). Select (''customer_id, prediction''

Question 17

What statement is true regarding the retention of job run history?

Options:

It is retained until you export or delete job run logs

It is retained for 30 days, during which time you can deliver job run logs to DBFS or S3

t is retained for 60 days, during which you can export notebook run results to HTML

It is retained for 60 days, after which logs are archived

It is retained for 90 days or until the run-id is re-used through custom run configuration

Question 18

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A.

If tasks A and B complete successfully but task C fails during a scheduled run, which statement describes the resulting state?

Options:

All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have completed successfully.

All logic expressed in the notebook associated with tasks A and B will have been successfully completed; any changes made in task C will be rolled back due to task failure.

All logic expressed in the notebook associated with task A will have been successfully completed; tasks B and C will not commit any changes because of stage failure.

Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until ail tasks have successfully been completed.

Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task C failed, all commits will be rolled back automatically.

Question 19

The data architect has decided that once data has been ingested from external sources into the

Databricks Lakehouse, table access controls will be leveraged to manage permissions for all production tables and views.

The following logic was executed to grant privileges for interactive queries on a production database to the core engineering group.

GRANT USAGE ON DATABASE prod TO eng;

GRANT SELECT ON DATABASE prod TO eng;

Assuming these are the only privileges that have been granted to the eng group and that these users are not workspace administrators, which statement describes their privileges?

Options:

Group members have full permissions on the prod database and can also assign permissions to other users or groups.

Group members are able to list all tables in the prod database but are not able to see the results of any queries on those tables.

Group members are able to query and modify all tables and views in the prod database, but cannot create new tables or views.

Group members are able to query all tables and views in the prod database, but cannot create or edit anything in the database.

Group members are able to create, query, and modify all tables and views in the prod database, but cannot define custom functions.

Question 20

The following code has been migrated to a Databricks notebook from a legacy workload:

The code executes successfully and provides the logically correct results, however, it takes over 20 minutes to extract and load around 1 GB of data.

Which statement is a possible explanation for this behavior?

Options:

%sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.

Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.

%sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.

Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.

%sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.

Question 21

A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job:

Which statement describes the execution and results of running the above query multiple times?

Options:

Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.

Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.

Each time the job is executed, the target table will be overwritten using the entire history of inserted or updated records, giving the desired result.

Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.

Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table giving the desired result.

Question 22

The data engineering team is migrating an enterprise system with thousands of tables and views into the Lakehouse. They plan to implement the target architecture using a series of bronze, silver, and gold tables. Bronze tables will almost exclusively be used by production data engineering workloads, while silver tables will be used to support both data engineering and machine learning workloads. Gold tables will largely serve business intelligence and reporting purposes. While personal identifying information (PII) exists in all tiers of data, pseudonymization and anonymization rules are in place for all data at the silver and gold levels.

The organization is interested in reducing security concerns while maximizing the ability to collaborate across diverse teams.

Which statement exemplifies best practices for implementing this system?

Options:

Isolating tables in separate databases based on data quality tiers allows for easy permissions management through database ACLs and allows physical separation of default storage locations for managed tables.

Because databases on Databricks are merely a logical construct, choices around database organization do not impact security or discoverability in the Lakehouse.

Storinq all production tables in a single database provides a unified view of all data assets available throughout the Lakehouse, simplifying discoverability by granting all users view privileges on this database.

Working in the default Databricks database provides the greatest security when working with managed tables, as these will be created in the DBFS root.

Because all tables must live in the same storage containers used for the database they're created in, organizations should be prepared to create between dozens and thousands of databases depending on their data isolation requirements.

Question 23

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

Streaming DataFrame df has the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.

Options:

to_interval("event_time", "5 minutes").alias("time")

window("event_time", "5 minutes").alias("time")

"event_time"

window("event_time", "10 minutes").alias("time")

lag("event_time", "10 minutes").alias("time")

Question 24

If task A fails during a scheduled run, which statement describes the results of this run?

Options:

Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.

Tasks B and C will attempt to run as configured; any changes made in task A will be rolled back due to task failure.

Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task A failed, all commits will be rolled back automatically.

Tasks B and C will be skipped; some logic expressed in task A may have been committed before task failure.

Tasks B and C will be skipped; task A will not commit any changes because of stage failure.

Question 25

The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings.

The below query is used to create the alert:

The query is set to refresh each minute and always completes in less than 10 seconds. The alert is set to trigger when mean (temperature) > 120. Notifications are triggered to be sent at most every 1 minute.

If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?

Options:

The total average temperature across all sensors exceeded 120 on three consecutive executions of the query

The recent_sensor_recordingstable was unresponsive for three consecutive runs of the query

The source query failed to update properly for three consecutive minutes and then restarted

The maximum temperature recording for at least one sensor exceeded 120 on three consecutive executions of the query

The average temperature recordings for at least one sensor exceeded 120 on three consecutive executions of the query

Question 26

The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings.

The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization.

The compliance officer has recently learned about Delta Lake's time travel functionality. They are concerned that this might allow continued access to deleted data.

Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?

Options:

Because the vacuum command permanently deletes all files containing deleted records, deleted records may be accessible with time travel for around 24 hours.

Because the default data retention threshold is 24 hours, data files containing deleted records will be retained until the vacuum job is run the following day.

Because Delta Lake time travel provides full access to the entire history of a table, deleted records can always be recreated by users with full admin privileges.

Because Delta Lake's delete statements have ACID guarantees, deleted records will be permanently purged from all storage systems as soon as a delete job completes.

Because the default data retention threshold is 7 days, data files containing deleted records will be retained until the vacuum job is run 8 days later.

Question 27

A CHECK constraint has been successfully added to the Delta table named activity_details using the following logic:

A batch job is attempting to insert new records to the table, including a record where latitude = 45.50 and longitude = 212.67.

Which statement describes the outcome of this batch insert?

Options:

The write will fail when the violating record is reached; any records previously processed will be recorded to the target table.

The write will fail completely because of the constraint violation and no records will be inserted into the target table.

The write will insert all records except those that violate the table constraints; the violating records will be recorded to a quarantine table.

The write will include all records in the target table; any violations will be indicated in the boolean column named valid_coordinates.

The write will insert all records except those that violate the table constraints; the violating records will be reported in a warning log.

Question 28

A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.

Which approach would simplify the identification of these changed records?

Options:

Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.

Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.

Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.

Modify the overwrite logic to include a field populated by calling spark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date.

Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.

Answer:

Explanation:

The approach that would simplify the identification of the changed records is to replace the current overwrite logic with a merge statement to modify only those records that have changed, and write logic to make predictions on the changed records identified by the change data feed. This approach leverages the Delta Lake features of merge and change data feed, which are designed to handle upserts and track row-level changes in a Delta table12. By using merge, the data engineering team can avoid overwriting the entire table every night, and only update or insert the records that have changed in the source data. By using change data feed, the ML team can easily access the change events that have occurred in the customer_churn_params table, and filter them by operation type (update or insert) and timestamp. This way, they can only make predictions on the records that have changed in the past 24 hours, and avoid re-processing the unchanged records.

The other options are not as simple or efficient as the proposed approach, because:

Option A would require applying the churn model to all rows in the customer_churn_params table, which would be wasteful and redundant. It would also require implementing logic to perform an upsert into the predictions table, which would be more complex than using the merge statement.
Option B would require converting the batch job to a Structured Streaming job, which would involve changing the data ingestion and processing logic. It would also require using the complete output mode, which would output the entire result table every time there is a change in the source data, which would be inefficient and costly.
Option C would require calculating the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers, which would be computationally expensive and prone to errors. It would also require storing and accessing the previous predictions, which would add extra storage and I/O costs.
Option D would require modifying the overwrite logic to include a field populated by calling spark.sql.functions.current_timestamp() as data are being written, which would add extra complexity and overhead to the data engineering job. It would also require using this field to identify records written on a particular date, which would be less accurate and reliable than using the change data feed.

References: Merge, Change data feed

Question 29

A data engineer wants to join a stream of advertisement impressions (when an ad was shown) with another stream of user clicks on advertisements to correlate when impression led to monitizable clicks.

Which solution would improve the performance?

Options:

Option A

Option B

Option C

Option D

Question 30

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream for highly selective joins on a number of fields, and will also be leveraged by the machine learning team to filter on a handful of relevant fields, in total, 15 fields have been identified that will often be used for filter and join logic.

The data engineer is trying to determine the best approach for dealing with these nested fields before declaring the table schema.

Which of the following accurately presents information about Delta Lake and Databricks that may Impact their decision-making process?

Options:

Because Delta Lake uses Parquet for data storage, Dremel encoding information for nesting can be directly referenced by the Delta transaction log.

Tungsten encoding used by Databricks is optimized for storing string data: newly-added native support for querying JSON strings means that string types are always most efficient.

Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.

By default Delta Lake collects statistics on the first 32 columns in a table; these statistics are leveraged for data skipping when executing selective queries.

Question 31

A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and executor.

When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused by code executing on the driver?

Options:

The five Minute Load Average remains consistent/flat

Bytes Received never exceeds 80 million bytes per second

Total Disk Space remains constant

Network I/O never spikes

Overall cluster CPU utilization is around 25%

Question 32

The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as "unmanaged") Delta Lake tables.

Which approach will ensure that this requirement is met?

Options:

When a database is being created, make sure that the LOCATION keyword is used.

When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.

When data is saved to a table, make sure that a full file path is specified alongside the Delta format.

When tables are created, make sure that the EXTERNAL keyword is used in the CREATE TABLE statement.

When the workspace is being configured, make sure that external cloud object storage has been mounted.

Question 33

Review the following error traceback:

Which statement describes the error being raised?

Options:

The code executed was PvSoark but was executed in a Scala notebook.

There is no column in the table named heartrateheartrateheartrate

There is a type error because a column object cannot be multiplied.

There is a type error because a DataFrame object cannot be multiplied.

There is a syntax error because the heartrate column is not correctly identified as a column.

Question 34

Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?

Options:

Regex

Julia

pyspsark.ml.feature

Scala Datasets

C++

Question 35

Which statement describes the default execution mode for Databricks Auto Loader?

Options:

New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.

Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; new files are incrementally and impotently into the target Delta Lake table.

Webhook trigger Databricks job to run anytime new data arrives in a source directory; new data automatically merged into target tables using rules inferred from the data.

New files are identified by listing the input directory; the target table is materialized by directory querying all valid files in the source directory.

Question 36

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs Ul. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.

What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?

Options:

Can manage

Can edit

Can run

Can Read

Load More Databricks-Certified-Professional-Data-Engineer Questions

Summer Limited Time 60% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: wrap60

Dumpswrap Top Menu

breadcrumb

Databricks Databricks-Certified-Professional-Data-Engineer Dumps

Databricks-Certified-Professional-Data-Engineer Free PDF Questions

Databricks Certified Data Engineer Professional Exam Questions and Answers

Options:

Answer:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation: