Databricks Real Dumps Practice Exam Questions by Dumpswarp

Databricks Certified Data Engineer Associate Exam Questions and Answers

Question 1

A data engineer is running code in a Databricks Repo that is cloned from a central Git repository. A colleague of the data engineer informs them that changes have beenmade and synced to the central Git repository. The data engineer now needs to sync their Databricks Repo to get the changes from the central Git repository.

Which of the following Git operations does the data engineer need to run to accomplish this task?

Options:

Merge

Push

Pull

Commit

Clone

Question 2

A data engineer has three tables in a Delta Live Tables (DLT) pipeline. They have configured the pipeline to drop invalid records at each table. They notice that some data is being dropped due to quality concerns at some point in the DLT pipeline. They would like to determine at which table in their pipeline the data is being dropped.

Which of the following approaches can the data engineer take to identify the table that is dropping the records?

Options:

They can set up separate expectations for each table when developing their DLT pipeline.

They cannot determine which table is dropping the records.

They can set up DLT to notify them via email when records are dropped.

They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics.

They can navigate to the DLT pipeline page, click on the “Error” button, and review the present errors.

Question 3

Which of the following describes the type of workloads that are always compatible with Auto Loader?

Options:

Dashboard workloads

Streaming workloads

Machine learning workloads

Serverless workloads

Batch workloads

Question 4

What is stored in a Databricks customer's cloud account?

Options:

Data

Cluster management metadata

Databricks web application

Notebooks

Question 5

A data engineer and data analyst are working together on a data pipeline. The data engineer is working on the raw, bronze, and silver layers of the pipeline using Python, and the data analyst is working on the gold layer of the pipeline using SQL. The raw source of the pipeline is a streaming input. They now want to migrate their pipeline to use Delta Live Tables.

Which of the following changes will need to be made to the pipeline when migrating to Delta Live Tables?

Options:

None of these changes will need to be made

The pipeline will need to stop using the medallion-based multi-hop architecture

The pipeline will need to be written entirely in SQL

The pipeline will need to use a batch source in place of a streaming source

The pipeline will need to be written entirely in Python

Question 6

The Delta transaction log for the ‘students’ tables is shown using the ‘DESCRIBE HISTORY students’ command. A Data Engineer needs to query the table as it existed before the UPDATE operation listed in the log.

Which command should the Data Engineer use to achieve this? (Choose two.)

Options:

SELECT * FROM students@v4

SELECT * FROM students TIMESTAMP AS OF ‘2024-04-22T 14:32:47.000+00:00’

SELECT * FROM students FROM HISTORY VERSION AS OF 3

SELECT * FROM students VERSION AS OF 5

SELECT * FROM students TIMESTAMP AS OF ‘2024-04-22T 14:32:58.000+00:00’

Question 7

Which of the following data lakehouse features results in improved data quality over a traditional data lake?

Options:

A data lakehouse provides storage solutions for structured and unstructured data.

A data lakehouse supports ACID-compliant transactions.

A data lakehouse allows the use of SQL queries to examine data.

A data lakehouse stores data in open formats.

A data lakehouse enables machine learning and artificial Intelligence workloads.

Question 8

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.

The code block used by the data engineer is below:

Which line of code should the data engineer use to fill in the blank if the data engineer only wants the query to execute a micro-batch to process data every 5 seconds?

Options:

trigger("5 seconds")

trigger(continuous="5 seconds")

trigger(once="5 seconds")

trigger(processingTime="5 seconds")

Question 9

A data engineer is maintaining a data pipeline. Upon data ingestion, the data engineer notices that the source data is starting to have a lower level of quality. The data engineer would like to automate the process of monitoring the quality level.

Which of the following tools can the data engineer use to solve this problem?

Options:

Unity Catalog

Data Explorer

Delta Lake

Delta Live Tables

Auto Loader

Question 10

A data architect has determined that a table of the following format is necessary:

Which of the following code blocks uses SQL DDL commands to create an empty Delta table in the above format regardless of whether a table already exists with this name?

Options:

Option A

Option B

Option C

Option D

Option E

Question 11

Which of the following Structured Streaming queries is performing a hop from a Silver table to a Gold table?

Options:

Question 12

A data analysis team has noticed that their Databricks SQL queries are running too slowly when connected to their always-on SQL endpoint. They claim that this issue is present when many members of the team are running small queries simultaneously.They ask the data engineering team for help. The data engineering team notices that each of the team’s queries uses the same SQL endpoint.

Which of the following approaches can the data engineering team use to improve the latency of the team’s queries?

Options:

They can increase the cluster size of the SQL endpoint.

They can increase the maximum bound of the SQL endpoint’s scaling range.

They can turn on the Auto Stop feature for the SQL endpoint.

They can turn on the Serverless feature for the SQL endpoint.

They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to “Reliability Optimized.”

Question 13

Which of the following statements regarding the relationship between Silver tables and Bronze tables is always true?

Options:

Silver tables contain a less refined, less clean view of data than Bronze data.

Silver tables contain aggregates while Bronze data is unaggregated.

Silver tables contain more data than Bronze tables.

Silver tables contain a more refined and cleaner view of data than Bronze tables.

Silver tables contain less data than Bronze tables.

Question 14

Which of the following is hosted completely in the control plane of the classic Databricks architecture?

Options:

Worker node

JDBC data source

Databricks web application

Databricks Filesystem

Driver node

Answer:

Explanation:

The Databricks web application is the user interface that allows you to create and manage workspaces, clusters, notebooks, jobs, and other resources. It is hosted completely in the control plane of the classic Databricks architecture, which includes the backend services that Databricks manages in your Databricks account. The other options are part of the compute plane, which is where your data is processed by compute resources such as clusters. Thecompute plane is in your own cloud account and network. References: Databricks architecture overview, Security and Trust CenterQUESTION NO: 4

Which of the following benefits of using the Databricks Lakehouse Platform is provided by Delta Lake?

A. The ability to manipulate the same data using a variety of languages

B. The ability to collaborate in real time on a single notebook

C. The ability to set up alerts for query failures

D. The ability to support batch and streaming workloads

E. The ability to distribute complex data operations

Answer: D

Delta Lake is the optimized storage layer that provides the foundation for storing data and tables in the Databricks lakehouse. Delta Lake is fully compatible with Apache Spark APIs, and was developed for tight integration with Structured Streaming, allowing you to easily use a single copy of data for both batch and streaming operations and providing incremental processing at scale1. Delta Lake supports upserts using the merge operation, which enables you to efficiently update existing data or insert new data into your Delta tables2. Delta Lake also provides time travel capabilities, which allow you to query previous versions of your data or roll back to a specific point in time3. References: 1: What is Delta Lake? | Databricks on AWS 2: Upsert into a table using merge | Databricks on AWS 3: [Query an older snapshot of a table (time travel) | Databricks on AWS]

Learn more

Question 15

Which of the following describes the relationship between Bronze tables and raw data?

Options:

Bronze tables contain less data than raw data files.

Bronze tables contain more truthful data than raw data.

Bronze tables contain aggregates while raw data is unaggregated.

Bronze tables contain a less refined view of data than raw data.

Bronze tables contain raw data with a schema applied.

Question 16

A data engineer runs a statement every day to copy the previous day’s sales into the table transactions. Each day’s sales are in their own file in the location "/transactions/raw".

Today, the data engineer runs the following command to complete this task:

After running the command today, the data engineer notices that the number of records in table transactions has not changed.

Which of the following describes why the statement might not have copied any new records into the table?

Options:

The format of the files to be copied were not included with the FORMAT_OPTIONS keyword.

The names of the files to be copied were not included with the FILES keyword.

The previous day’s file has already been copied into the table.

The PARQUET file format does not support COPY INTO.

The COPY INTO statement requires the table to be refreshed to view the copied rows.

Question 17

A data engineer needs to create a table in Databricks using data from their organization's existing SQLite database. They run the following command:

CREATE TABLE jdbc_customer360

USING

OPTIONS (

url "jdbc:sqlite:/customers.db", dbtable "customer360"

)

Which line of code fills in the above blank to successfully complete the task?

Options:

autoloader

org.apache.spark.sql.jdbc

sqlite

org.apache.spark.sql.sqlite

Question 18

Which tool is used by Auto Loader to process data incrementally?

Options:

Spark Structured Streaming

Unity Catalog

Checkpointing

Databricks SQL

Question 19

A data organization leader is upset about the data analysis team’s reports being different from the data engineering team’s reports. The leader believes the siloed nature of their organization’s data engineering and data analysis architectures is to blame.

Which of the following describes how a data lakehouse could alleviate this issue?

Options:

Both teams would autoscale their work as data size evolves

Both teams would use the same source of truth for their work

Both teams would reorganize to report to the same department

Both teams would be able to collaborate on projects in real-time

Both teams would respond more quickly to ad-hoc requests

Question 20

A data engineer has a single-task Job that runs each morning before they begin working. After identifying an upstream data issue, they need to set up another task to run a new notebook prior to the original task.

Which of the following approaches can the data engineer use to set up the new task?

Options:

They can clone the existing task in the existing Job and update it to run the new notebook.

They can create a new task in the existing Job and then add it as a dependency of the original task.

They can create a new task in the existing Job and then add the original task as a dependency of the new task.

They can create a new job from scratch and add both tasks to run concurrently.

They can clone the existing task to a new Job and then edit it to run the new notebook.

Question 21

A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.

The table is configured to run in Production mode using the Continuous Pipeline Mode.

Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?

Options:

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.

All datasets will be updated once and the pipeline will persist without any processing. The compute resources will persist but go unused.

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.

All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.

All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.

Question 22

Which of the following describes a scenario in which a data engineer will want to use a single-node cluster?

Options:

When they are working interactively with a small amount of data

When they are running automated reports to be refreshed as quickly as possible

When they are working with SQL within Databricks SQL

When they are concerned about the ability to automatically scale with larger data

When they are manually running reports with a large amount of data

Answer:

Explanation:

The scenario in which a data engineer will want to use a single-node cluster is when they are working interactively with a small amount of data. A single-node cluster is a cluster consisting of an Apache Spark driver and no Spark workers1. A single-node cluster supports Spark jobs and all Spark data sources, including Delta Lake1. A single-node cluster is helpful for single-node machine learning workloads that use Spark to load and save data, and for lightweight exploratory data analysis1. A single-node cluster can run Spark locally, spawn one executor thread per logical core in the cluster, and save all log output in the driver log1. A single-node cluster can be created by selecting the Single Node button when configuring a cluster1.

The other options are not suitable for using a single-node cluster. When running automated reports to be refreshed as quickly as possible, a data engineer will want to use a multi-node cluster that can scale up and down automatically based on the workload demand2. When working with SQL within Databricks SQL, a data engineer will want to use a SQL Endpoint that can execute SQL queries on a serverless pool or an existing cluster3. When concerned about the ability to automatically scale with larger data, a data engineer will want to use a multi-node cluster that can leverage the Databricks Lakehouse Platform and the Delta Engine to handle large-scale data processing efficiently and reliably4. When manually running reports with a large amount of data, a data engineer will want to use a multi-node cluster that can distribute the computation across multiple workers and leverage the Spark UI to monitor the performance and troubleshoot the issues.

Question 23

Which of the following data workloads will utilize a Gold table as its source?

Options:

A job that enriches data by parsing its timestamps into a human-readable format

A job that aggregates uncleaned data to create standard summary statistics

A job that cleans data by removing malformatted records

A job that queries aggregated data designed to feed into a dashboard

A job that ingests raw data from a streaming source into the Lakehouse

Question 24

A data engineer has been given a new record of data:

id STRING = 'a1'

rank INTEGER = 6

rating FLOAT = 9.4

Which of the following SQL commands can be used to append the new record to an existing Delta table my_table?

Options:

INSERT INTO my_table VALUES ('a1', 6, 9.4)

my_table UNION VALUES ('a1', 6, 9.4)

INSERT VALUES ( 'a1' , 6, 9.4) INTO my_table

UPDATE my_table VALUES ('a1', 6, 9.4)

UPDATE VALUES ('a1', 6, 9.4) my_table

Question 25

An engineering manager wants to monitor the performance of a recent project using a Databricks SQL query. For the first week following the project’s release, the manager wants the query results to be updated every minute. However, the manager is concerned that the compute resources used for the query will be left running and cost the organization a lot of money beyond the first week of the project’s release.

Which of the following approaches can the engineering team use to ensure the query does not cost the organization any money beyond the first week of the project’s release?

Options:

They can set a limit to the number of DBUs that are consumed by the SQL Endpoint.

They can set the query’s refresh schedule to end after a certain number of refreshes.

They cannot ensure the query does not cost the organization money beyond the first week of the project’s release.

They can set a limit to the number of individuals that are able to manage the query’s refresh schedule.

They can set the query’s refresh schedule to end on a certain date in the query scheduler.

Question 26

Which of the following approaches should be used to send the Databricks Job owner an email in the case that the Job fails?

Options:

Manually programming in an alert system in each cell of the Notebook

Setting up an Alert in the Job page

Setting up an Alert in the Notebook

There is no way to notify the Job owner in the case of Job failure

MLflow Model Registry Webhooks

Question 27

A data engineer wants to create a relational object by pulling data from two tables. The relational object does not need to be used by other data engineers in other sessions. In order to save on storage costs, the data engineer wants to avoid copying and storing physical data.

Which of the following relational objects should the data engineer create?

Options:

Spark SQL Table

View

Database

Temporary view

Delta Table

Question 28

A data engineer needs to determine whether to use the built-in Databricks Notebooks versioning or version their project using Databricks Repos.

Which of the following is an advantage of using Databricks Repos over the Databricks Notebooks versioning?

Options:

Databricks Repos automatically saves development progress

Databricks Repos supports the use of multiple branches

Databricks Repos allows users to revert to previous versions of a notebook

Databricks Repos provides the ability to comment on specific changes

Databricks Repos is wholly housed within the Databricks Lakehouse Platform

Question 29

A data engineer is using the following code block as part of a batch ingestion pipeline to read from a composable table:

Which of the following changes needs to be made so this code block will work when the transactions table is a stream source?

Options:

Replace predict with a stream-friendly prediction function

Replace schema(schema) with option ("maxFilesPerTrigger", 1)

Replace "transactions" with the path to the location of the Delta table

Replace format("delta") with format("stream")

Replace spark.read with spark.readStream

Question 30

A data engineer needs to create a table in Databricks using data from their organization’s existing SQLite database.

They run the following command:

Which of the following lines of code fills in the above blank to successfully complete the task?

Options:

org.apache.spark.sql.jdbc

autoloader

DELTA

sqlite

org.apache.spark.sql.sqlite

Question 31

A data engineer wants to create a new table containing the names of customers that live in France.

They have written the following command:

A senior data engineer mentions that it is organization policy to include a table property indicating that the new table includes personally identifiable information (PII).

Which of the following lines of code fills in the above blank to successfully complete the task?

Options:

There is no way to indicate whether a table contains PII.

"COMMENT PII"

TBLPROPERTIES PII

COMMENT "Contains PII"

PII

Question 32

A single Job runs two notebooks as two separate tasks. A data engineer has noticed that one of the notebooks is running slowly in the Job’s current run. The data engineer asks a tech lead for help in identifying why this might be the case.

Which of the following approaches can the tech lead use to identify why the notebook is running slowly as part of the Job?

Options:

They can navigate to the Runs tab in the Jobs UI to immediately review the processing notebook.

They can navigate to the Tasks tab in the Jobs UI and click on the active run to review the processing notebook.

They can navigate to the Runs tab in the Jobs UI and click on the active run to review the processing notebook.

There is no way to determine why a Job task is running slowly.

They can navigate to the Tasks tab in the Jobs UI to immediately review the processing notebook.

Load More Databricks-Certified-Data-Engineer-Associate Questions

Weekend Sale Discount Flat 70% Offer - Ends in 0d 00h 00m 00s - Coupon code: 70diswrap

Dumpswrap Top Menu

breadcrumb

Databricks Databricks-Certified-Data-Engineer-Associate Dumps

Databricks-Certified-Data-Engineer-Associate Free PDF Questions

Databricks Certified Data Engineer Associate Exam Questions and Answers

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Options:

Answer:

Explanation:

Options:

Answer:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options: