Databricks Databricks-Machine-Learning-Associate Dumps

Databricks Certified Machine Learning Associate Exam Questions and Answers

Question 1

A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.

Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?

Options:

A.

import pyspark.pandas as ps

df = ps.DataFrame(spark_df)

B.

import pyspark.pandas as ps

df = ps.to_pandas(spark_df)

C.

spark_df.to_sql()

D.

import pandas as pd

df = pd.DataFrame(spark_df)

E.

spark_df.to_pandas()

Question 2

Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?

Options:

A.

pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata

B.

pandas API on Spark DataFrames are more performant than Spark DataFrames

C.

pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata

D.

pandas API on Spark DataFrames are less mutable versions of Spark DataFrames

E.

pandas API on Spark DataFrames are unrelated to Spark DataFrames

Question 3

A data scientist wants to use Spark ML to one-hot encode the categorical features in their PySpark DataFrame features_df. A list of the names of the string columns is assigned to the input_columns variable.

They have developed this code block to accomplish this task:

[code block not preserved in the source]

The code block is returning an error.

Which of the following adjustments does the data scientist need to make to accomplish this task?

Options:

A.

They need to specify the method parameter to the OneHotEncoder.

B.

They need to remove the line with the fit operation.

C.

They need to use StringIndexer prior to one-hot encoding the features.

D.

They need to use VectorAssembler prior to one-hot encoding the features.
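
For context on the options above: Spark ML's OneHotEncoder operates on numeric category indices rather than raw strings, so string columns are usually indexed first. The following is a minimal plain-Python sketch of that two-step idea, not Spark code. The helper names fit_string_index and one_hot and the sample colors list are hypothetical, and the frequency-based ordering only loosely mirrors StringIndexer's default behavior.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an integer index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(index, size):
    """Return a list of length `size` with 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Hypothetical categorical column.
colors = ["red", "blue", "red", "green", "red", "blue"]

# Step 1: strings -> numeric indices (the role StringIndexer plays).
mapping = fit_string_index(colors)

# Step 2: numeric indices -> indicator vectors (the role OneHotEncoder plays).
encoded = [one_hot(mapping[c], len(mapping)) for c in colors]
```

The second step only makes sense once every category has an integer position, which is the dependency the question is probing.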

Question 4

A data scientist has created a linear regression model that uses log(price) as a label variable. Using this model, they have performed inference and the predictions and actual label values are in Spark DataFrame preds_df.

They are using the following code block to evaluate the model:

regression_evaluator.setMetricName("rmse").evaluate(preds_df)

Which of the following changes should the data scientist make to evaluate the RMSE in a way that is comparable with price?

Options:

A.

They should exponentiate the computed RMSE value

B.

They should take the log of the predictions before computing the RMSE

C.

They should evaluate the MSE of the log predictions to compute the RMSE

D.

They should exponentiate the predictions before computing the RMSE
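
The scale issue in this question can be checked with a small worked example. The sketch below uses plain Python with made-up prices (all values are hypothetical) and a hand-written rmse helper rather than Spark's RegressionEvaluator. It compares the RMSE computed on log(price) with the RMSE computed after mapping both predictions and labels back to the price scale.

```python
import math


def rmse(predictions, actuals):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical model output: both columns are on the log(price) scale.
log_preds = [math.log(100.0), math.log(220.0), math.log(290.0)]
log_actuals = [math.log(110.0), math.log(200.0), math.log(300.0)]

# RMSE in log units: not directly comparable with price.
rmse_log_scale = rmse(log_preds, log_actuals)

# Map both columns back to price units first, then compute the RMSE.
price_preds = [math.exp(p) for p in log_preds]
price_actuals = [math.exp(a) for a in log_actuals]
rmse_price_scale = rmse(price_preds, price_actuals)

# Exponentiating the log-scale RMSE afterwards gives a ratio-like number,
# which is a different quantity from the price-scale RMSE.
exp_of_log_rmse = math.exp(rmse_log_scale)
```

In this toy data the price errors are -10, 20 and -10, so the price-scale RMSE is the square root of 200, while the exponentiated log-scale RMSE stays close to 1.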

Question 5

A data scientist learned during their training to always use 5-fold cross-validation in their model development workflow. A colleague suggests that there are cases where a train-validation split could be preferred over k-fold cross-validation when k > 2.

Which of the following describes a potential benefit of using a train-validation split over k-fold cross-validation in this scenario?

Options:

A.

A holdout set is not necessary when using a train-validation split

B.

Reproducibility is achievable when using a train-validation split

C.

Fewer hyperparameter values need to be tested when using a train-validation split

D.

Bias is avoidable when using a train-validation split

E.

Fewer models need to be trained when using a train-validation split
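
The cost difference between the two validation schemes comes down to simple counting. The sketch below is a back-of-the-envelope calculation in plain Python; the function names and the 3 x 4 grid are hypothetical, and the count ignores any final refit on the full training data.

```python
def fits_for_cross_validation(num_param_combinations, k):
    """Each hyperparameter combination is trained once per fold."""
    return num_param_combinations * k


def fits_for_train_validation_split(num_param_combinations):
    """Each hyperparameter combination is trained once on the single training split."""
    return num_param_combinations


# Hypothetical grid: 3 values of one hyperparameter times 4 values of another.
num_combinations = 3 * 4

cv_fits = fits_for_cross_validation(num_combinations, k=5)
tvs_fits = fits_for_train_validation_split(num_combinations)
```

With these numbers, 5-fold cross-validation trains 60 models while a single train-validation split trains 12, at the price of a noisier performance estimate.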

Question 6

A team is developing guidelines on when to use various evaluation metrics for classification problems. The team needs to provide input on when to use the F1 score over accuracy.

Which of the following suggestions should the team include in their guidelines?

Options:

A.

The F1 score should be utilized over accuracy when the number of actual positive cases is identical to the number of actual negative cases.

B.

The F1 score should be utilized over accuracy when there are greater than two classes in the target variable.

C.

The F1 score should be utilized over accuracy when there is significant imbalance between positive and negative classes and avoiding false negatives is a priority.

D.

The F1 score should be utilized over accuracy when identifying true positives and true negatives are equally important to the business problem.
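
To see how accuracy and F1 can disagree, the following plain-Python sketch computes both from confusion-matrix counts. The two sets of counts are invented for illustration (1,000 cases, 10 of them positive), and F1 is defined as 0 when there are no true positives, false positives, or false negatives to divide by.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all cases that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denominator = 2 * tp + fp + fn
    return 0.0 if denominator == 0 else 2 * tp / denominator


# Hypothetical classifier A: predicts "negative" for every case.
always_negative = {"tp": 0, "fp": 0, "fn": 10, "tn": 990}

# Hypothetical classifier B: finds most positives but raises some false alarms.
imperfect_detector = {"tp": 8, "fp": 20, "fn": 2, "tn": 970}

acc_a = accuracy(**always_negative)
f1_a = f1_score(always_negative["tp"], always_negative["fp"], always_negative["fn"])

acc_b = accuracy(**imperfect_detector)
f1_b = f1_score(imperfect_detector["tp"], imperfect_detector["fp"], imperfect_detector["fn"])
```

Classifier A scores higher on accuracy even though it misses every positive case, while F1 ranks classifier B higher because it ignores true negatives.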

Question 7

A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model. They elect to use the Hyperopt library's fmin operation to facilitate this process. Unfortunately, the final model is not very accurate. The data scientist suspects that there is an issue with the objective_function being passed as an argument to fmin.

They use the following code block to create the objective_function:

[code block not preserved in the source]

Which of the following changes does the data scientist need to make to their objective_function in order to produce a more accurate model?

Options:

A.

Add test set validation process

B.

Add a random_state argument to the RandomForestRegressor operation

C.

Remove the mean operation that is wrapping the cross_val_score operation

D.

Replace the r2 return value with -r2

E.

Replace the fmin operation with the fmax operation
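
Hyperopt's fmin searches for the parameters that minimize the value returned by the objective. The sketch below does not call Hyperopt; minimize_over is a hypothetical stand-in that simply picks the candidate with the lowest objective value, and the R² scores per max_depth are invented. It assumes, as the options suggest, that the original objective returns an R² score.

```python
def minimize_over(objective, candidates):
    """Hypothetical stand-in for a minimizer: return the candidate with the lowest objective."""
    return min(candidates, key=objective)


# Invented cross-validated R^2 scores for a few max_depth values (higher is better).
r2_by_depth = {2: 0.55, 4: 0.71, 8: 0.83, 16: 0.78}


def objective_returning_r2(depth):
    # Returning a score where higher is better sends a minimizer in the wrong direction.
    return r2_by_depth[depth]


def objective_returning_negative_r2(depth):
    # Negating the score turns "maximize R^2" into a minimization problem.
    return -r2_by_depth[depth]


chosen_with_r2 = minimize_over(objective_returning_r2, r2_by_depth)
chosen_with_negative_r2 = minimize_over(objective_returning_negative_r2, r2_by_depth)
```

With the raw score the minimizer settles on the worst depth (2); with the negated score it settles on the best one (8).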

Question 8

A data scientist has written a feature engineering notebook that utilizes the pandas library. As the size of the data processed by the notebook increases, the notebook's runtime increases drastically and processing becomes slow.

Which of the following tools can the data scientist use to spend the least amount of time refactoring their notebook to scale with big data?

Options:

A.

PySpark DataFrame API

B.

pandas API on Spark

C.

Spark SQL

D.

Feature Store

Question 9

A data scientist is developing a single-node machine learning model. They have a large number of model configurations to test as a part of their experiment. As a result, the model tuning process takes too long to complete. Which of the following approaches can be used to speed up the model tuning process?

Options:

A.

Implement MLflow Experiment Tracking

B.

Scale up with Spark ML

C.

Enable autoscaling clusters

D.

Parallelize with Hyperopt

Question 10

A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model in parallel. They elect to use the Hyperopt library to facilitate this process.

Which of the following Hyperopt tools provides the ability to optimize hyperparameters in parallel?

Options:

A.

fmin

B.

SparkTrials

C.

quniform

D.

search_space

E.

objective_function

Question 11

A machine learning engineer is converting a decision tree from sklearn to Spark ML. They notice that they are receiving different results despite all of their data and manually specified hyperparameter values being identical.

Which of the following describes a reason that the single-node sklearn decision tree and the Spark ML decision tree can differ?

Options:

A.

Spark ML decision trees test every feature variable in the splitting algorithm

B.

Spark ML decision trees automatically prune overfit trees

C.

Spark ML decision trees test more split candidates in the splitting algorithm

D.

Spark ML decision trees test a random sample of feature variables in the splitting algorithm

E.

Spark ML decision trees test binned features values as representative split candidates
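
One way to picture the difference between exhaustive and binned split search is to count candidate thresholds for a single continuous feature. The sketch below is a simplified plain-Python illustration, not Spark's or scikit-learn's actual algorithm. Both helper functions are hypothetical, and max_bins merely plays the role that the maxBins parameter plays in Spark ML tree estimators.

```python
def exact_split_candidates(values):
    """Midpoints between every pair of adjacent distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def binned_split_candidates(values, max_bins):
    """Quantile-style boundaries: at most max_bins - 1 representative thresholds."""
    ordered = sorted(values)
    n = len(ordered)
    boundaries = {ordered[i * n // max_bins] for i in range(1, max_bins)}
    return sorted(boundaries)


# Hypothetical feature with 100 distinct values.
feature = list(range(1, 101))

exact = exact_split_candidates(feature)
binned = binned_split_candidates(feature, max_bins=4)
```

Here the exhaustive search considers 99 thresholds while the binned search considers 3, so two trees built from identical data can pick different split points.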

Question 12

Which of the following machine learning algorithms typically uses bagging?

Options:

A.

Gradient boosted trees

B.

K-means

C.

Random forest

D.

Decision tree
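
Bagging (bootstrap aggregating) trains several models on bootstrap resamples of the data and combines their predictions. The sketch below illustrates only that mechanism in plain Python: each "model" is just the mean of its bootstrap sample, which is a deliberately trivial, hypothetical learner, and the target values are invented.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def bagged_mean_predictor(targets, num_models, seed=0):
    """Fit one trivial mean-predictor per bootstrap sample and average them."""
    rng = random.Random(seed)
    model_predictions = []
    for _ in range(num_models):
        sample = bootstrap_sample(targets, rng)
        model_predictions.append(sum(sample) / len(sample))
    ensemble_prediction = sum(model_predictions) / len(model_predictions)
    return ensemble_prediction, model_predictions


# Hypothetical regression targets.
targets = [3.0, 5.0, 4.0, 10.0, 6.0, 5.5, 4.5, 7.0]

ensemble_prediction, model_predictions = bagged_mean_predictor(targets, num_models=25)
```

Each individual model sees a slightly different resample, and averaging them reduces the variance of the combined prediction.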

Question 13

Which of the following Spark operations can be used to randomly split a Spark DataFrame into a training DataFrame and a test DataFrame for downstream use?

Options:

A.

TrainValidationSplit

B.

DataFrame.where

C.

CrossValidator

D.

TrainValidationSplitModel

E.

DataFrame.randomSplit

Question 14

A data scientist wants to use Spark ML to impute missing values in their PySpark DataFrame features_df. They want to replace missing values in all numeric columns in features_df with each respective numeric column’s median value.

They have developed the following code block to accomplish this task:

[code block not preserved in the source]

The code block is not accomplishing the task.

Which of the following reasons describes why the code block is not accomplishing the imputation task?

Options:

A.

It does not impute both the training and test data sets.

B.

The inputCols and outputCols need to be exactly the same.

C.

The fit method needs to be called instead of transform.

D.

It does not fit the imputer on the data to create an ImputerModel.
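
Spark ML separates estimators, which learn from data through fit, from the fitted models they return, which apply what was learned through transform. The following plain-Python sketch mimics that pattern for median imputation. MedianImputer and MedianImputerModel are hypothetical classes written for illustration, not the Spark API, and rows are represented as dictionaries with None marking a missing value.

```python
from statistics import median


class MedianImputerModel:
    """Fitted model: holds the learned medians and fills missing values with them."""

    def __init__(self, medians):
        self.medians = medians

    def transform(self, rows):
        return [
            {
                col: (self.medians[col] if value is None and col in self.medians else value)
                for col, value in row.items()
            }
            for row in rows
        ]


class MedianImputer:
    """Estimator: knows which columns to impute but has learned nothing yet."""

    def __init__(self, input_cols):
        self.input_cols = input_cols

    def fit(self, rows):
        # Learn one median per column from the non-missing values.
        medians = {
            col: median(row[col] for row in rows if row[col] is not None)
            for col in self.input_cols
        }
        return MedianImputerModel(medians)


# Hypothetical data with missing entries.
rows = [
    {"age": 30.0, "income": None},
    {"age": None, "income": 50.0},
    {"age": 40.0, "income": 70.0},
    {"age": 50.0, "income": 90.0},
]

imputer = MedianImputer(input_cols=["age", "income"])
imputer_model = imputer.fit(rows)        # learning step
imputed = imputer_model.transform(rows)  # application step
```

The estimator itself has no transform method in this sketch; the medians only exist once fit has produced a model.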

Question 15

A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.

Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?

Options:

A.

import pyspark.pandas as ps

df = ps.DataFrame(spark_df)

B.

import pyspark.pandas as ps

df = ps.to_pandas(spark_df)

C.

spark_df.to_pandas()

D.

import pandas as pd

df = pd.DataFrame(spark_df)

Question 16

A data scientist has developed a linear regression model using Spark ML and computed the predictions in a Spark DataFrame preds_df with the following schema:

prediction DOUBLE

actual DOUBLE

Which of the following code blocks can be used to compute the root mean-squared-error of the model according to the data in preds_df and assign it to the rmse variable?

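A) [code block not preserved in the source]

B) [code block not preserved in the source]

C) [code block not preserved in the source]

D) [code block not preserved in the source]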

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

Question 17

A data scientist is using the following code block to tune hyperparameters for a machine learning model:

[code block not preserved in the source]

Which change can they make to the above code block to improve the likelihood of a more accurate model?

Options:

A.

Increase num_evals to 100

B.

Change fmin() to fmax()

C.

Change SparkTrials() to Trials()

D.

Change tpe.suggest to random.suggest

Question 18

A data scientist has developed a random forest regressor rfr and included it as the final stage in a Spark ML Pipeline pipeline. They then set up a cross-validation process with pipeline as the estimator in the following code block:

[code block not preserved in the source]

Which of the following is a negative consequence of including pipeline as the estimator in the cross-validation process rather than rfr as the estimator?

Options:

A.

The process will have a longer runtime because all stages of pipeline need to be refit or retransformed with each model

B.

The process will leak data from the training set to the test set during the evaluation phase

C.

The process will be unable to parallelize tuning due to the distributed nature of pipeline

D.

The process will leak data prep information from the validation sets to the training sets for each model

Question 19

A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column discount is less than or equal to 0.

Which of the following code blocks will accomplish this task?

Options:

A.

spark_df.loc[:,spark_df["discount"] <= 0]

B.

spark_df[spark_df["discount"] <= 0]

C.

spark_df.filter(col("discount") <= 0)

D.

spark_df.loc[spark_df["discount"] <= 0, :]

Question 20

A data scientist is attempting to tune a logistic regression model logistic using scikit-learn. They want to specify a search space for two hyperparameters and let the tuning process randomly select values for each evaluation.

They attempt to run the following code block, but it does not accomplish the desired task:

[code block not preserved in the source]

Which of the following changes can the data scientist make to accomplish the task?

Options:

A.

Replace the GridSearchCV operation with RandomizedSearchCV

B.

Replace the GridSearchCV operation with cross_validate

C.

Replace the GridSearchCV operation with ParameterGrid

D.

Replace the random_state=0 argument with random_state=1

E.

Replace the penalty=['l2', 'l1'] argument with penalty=uniform('l2', 'l1')
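
The difference between exhaustive and randomized hyperparameter search can be sketched without scikit-learn. In the plain-Python example below, grid_candidates enumerates every combination in a search space while random_candidates draws a fixed number of combinations at random. Both helpers and the search space values are hypothetical, and no model is trained.

```python
import itertools
import random

# Hypothetical search space for two logistic regression hyperparameters.
search_space = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "penalty": ["l1", "l2"],
}


def grid_candidates(space):
    """Every combination of the listed values (grid search behavior)."""
    names = list(space)
    value_lists = [space[name] for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


def random_candidates(space, n_iter, seed=0):
    """n_iter combinations, each value picked at random (randomized search behavior)."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


grid = grid_candidates(search_space)
sampled = random_candidates(search_space, n_iter=5)
```

The grid always contains all 8 combinations, whereas the randomized variant evaluates only as many candidates as requested and may repeat or skip combinations.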

Question 21

A machine learning engineer is using the following code block to scale the inference of a single-node model on a Spark DataFrame with one million records:

[code block not preserved in the source]

Assuming the default Spark configuration is in place, which of the following is a benefit of using an Iterator?

Options:

A.

The data will be limited to a single executor preventing the model from being loaded multiple times

B.

The model will be limited to a single executor preventing the data from being distributed

C.

The model only needs to be loaded once per executor rather than once per batch during the inference process

D.

The data will be distributed across multiple executors during the inference process
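
The effect of processing an iterator of batches, rather than one batch per call, can be shown with ordinary Python generators. The sketch below is not a pandas UDF: load_model is a hypothetical, deliberately cheap stand-in for an expensive model load, and a counter records how often it runs under each calling style.

```python
load_counter = {"loads": 0}


def load_model():
    """Hypothetical stand-in for an expensive model load; counts its invocations."""
    load_counter["loads"] += 1
    return lambda x: 2 * x


def predict_per_batch(batch):
    """Called once per batch, so the model is loaded on every call."""
    model = load_model()
    return [model(x) for x in batch]


def predict_over_iterator(batches):
    """Called once with an iterator of batches, so the model is loaded a single time."""
    model = load_model()
    for batch in batches:
        yield [model(x) for x in batch]


batches = [[1, 2], [3, 4], [5, 6]]

load_counter["loads"] = 0
per_batch_results = [predict_per_batch(batch) for batch in batches]
loads_per_batch_style = load_counter["loads"]

load_counter["loads"] = 0
iterator_results = list(predict_over_iterator(iter(batches)))
loads_iterator_style = load_counter["loads"]
```

Both styles produce the same predictions, but the iterator style pays the loading cost once for the whole stream of batches instead of once per batch.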

Question 22

Which of the following evaluation metrics is not suitable to evaluate runs in AutoML experiments for regression problems?

Options:

A.

F1

B.

R-squared

C.

MAE

D.

MSE

Total 74 questions