Updated Jan-2022 Pass Professional-Data-Engineer Exam - Real Practice Test Questions
NEW QUESTION 44
You need to create a data pipeline that copies time-series transaction data so that it can be queried from within BigQuery by your data science team for analysis. Every hour, thousands of transactions are updated with a new status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per day. The data is heavily structured, and your data science team will build machine learning models based on this data. You want to maximize performance and usability for your data science team. Which two strategies should you adopt? (Choose two.)
- A. Develop a data pipeline where status updates are appended to BigQuery instead of updated.
- B. Denormalize the data as much as possible.
- C. Copy a daily snapshot of transaction data to Cloud Storage and store it as an Avro file. Use BigQuery's support for external data sources to query.
- D. Use BigQuery UPDATE to further reduce the size of the dataset.
- E. Preserve the structure of the data as much as possible.
Answer: A,B
Explanation:
Denormalization improves performance by reducing query time. Frequent updates are a poor fit for BigQuery, so status changes should be appended as new rows instead of updating existing rows.
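The append-only approach in option A can be illustrated with a minimal sketch in plain Python. It does not call BigQuery; it only shows how the current status of each transaction can be derived from an appended log. The `StatusEvent` type and its field names are hypothetical and are not taken from the question.

```python
from typing import Dict, Iterable, NamedTuple


class StatusEvent(NamedTuple):
    """One appended row; field names are assumptions for illustration."""
    transaction_id: str
    status: str
    updated_at: int  # epoch seconds


def latest_status(events: Iterable[StatusEvent]) -> Dict[str, str]:
    """Return the most recent status per transaction from an append-only log."""
    newest: Dict[str, StatusEvent] = {}
    for event in events:
        current = newest.get(event.transaction_id)
        # Keep the event with the largest timestamp for each transaction.
        if current is None or event.updated_at > current.updated_at:
            newest[event.transaction_id] = event
    return {tx_id: ev.status for tx_id, ev in newest.items()}


events = [
    StatusEvent("t1", "PENDING", 100),
    StatusEvent("t2", "PENDING", 105),
    StatusEvent("t1", "SETTLED", 160),  # appended, not updated in place
]
print(latest_status(events))
```

Every status change is kept as its own row, so the full history stays available for model training while the latest state is still easy to compute.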
NEW QUESTION 45
What Dataflow concept determines when a Window's contents should be output based on certain criteria being met?
- A. Sessions
- B. OutputCriteria
- C. Triggers
- D. Windows
Answer: C
Explanation:
Triggers control when the elements for a specific key and window are output. As elements arrive, they are put into one or more windows by a Window transform and its associated WindowFn, and then passed to the associated Trigger to determine if the window's contents should be output.
Reference:
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/windowing/Trig
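The relationship between windows and triggers can be sketched with a toy model in plain Python. This is not the Dataflow SDK: the function names, the fixed-window assignment, and the "fire after N elements" rule are all assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def fixed_window_start(timestamp: int, size: int) -> int:
    """Assign a timestamp to the start of its fixed window."""
    return timestamp - (timestamp % size)


def run_with_count_trigger(
    elements: Iterable[Tuple[int, str]], window_size: int, fire_after: int
) -> List[Tuple[int, List[str]]]:
    """Buffer elements per window; emit a window's buffer once it holds
    `fire_after` elements (a toy element-count trigger)."""
    panes: Dict[int, List[str]] = defaultdict(list)
    emitted: List[Tuple[int, List[str]]] = []
    for timestamp, value in elements:
        start = fixed_window_start(timestamp, window_size)
        panes[start].append(value)
        # The trigger decides when the buffered contents are output.
        if len(panes[start]) >= fire_after:
            emitted.append((start, panes.pop(start)))
    return emitted


stream = [(1, "a"), (2, "b"), (61, "c"), (3, "d"), (62, "e")]
print(run_with_count_trigger(stream, window_size=60, fire_after=2))
```

In this sketch the window function only decides where an element belongs, while the trigger condition decides when a window's buffered contents are released.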
NEW QUESTION 46
You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change the data type of DT to TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?
- A. Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.
- B. Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.
- C. Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the numeric values from the column TS for each row. Reference the column TS instead of the column DT from now on.
- D. Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.
- E. Add two columns to the table CLICK_STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.
Answer: E
NEW QUESTION 47
Which row keys are likely to cause a disproportionate number of reads and/or writes on a particular node in a Bigtable cluster (select 2 answers)?
- A. A stock symbol followed by a timestamp
- B. A non-sequential numeric ID
- C. A timestamp followed by a stock symbol
- D. A sequential numeric ID
Answer: C,D
Explanation:
Using a timestamp as the first element of a row key can cause a variety of problems.
In brief, when a row key for a time series includes a timestamp, all of your writes will target a single node, fill that node, and then move on to the next node in the cluster, resulting in hotspotting.
Suppose your system assigns a numeric ID to each of your application's users. You might be tempted to use the user's numeric ID as the row key for your table. However, since new users are more likely to be active users, this approach is likely to push most of your traffic to a small number of nodes.
Reference:
https://cloud.google.com/bigtable/docs/schema-design
https://cloud.google.com/bigtable/docs/schema-design-time-series#ensure_that_your_row_key_avoids_hotspotting
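The difference between the two key layouts can be sketched in plain Python. The `#` separator, the zero-padded timestamp, and the helper names are assumptions for illustration, not a Bigtable API. Because Bigtable stores rows sorted by key, the sketch uses sort order as a stand-in for where writes land.

```python
from typing import List, Tuple


def symbol_first_key(symbol: str, epoch_seconds: int) -> str:
    """Row key with the stock symbol leading (hypothetical layout)."""
    return f"{symbol}#{epoch_seconds:010d}"


def timestamp_first_key(symbol: str, epoch_seconds: int) -> str:
    """Row key with the timestamp leading (hypothetical layout)."""
    return f"{epoch_seconds:010d}#{symbol}"


# Writes in arrival order: (symbol, epoch seconds).
writes: List[Tuple[str, int]] = [("GOOG", 1000), ("AAPL", 1001), ("MSFT", 1002)]

ts_keys = [timestamp_first_key(s, t) for s, t in writes]
sym_keys = [symbol_first_key(s, t) for s, t in writes]

# Timestamp-first keys arrive already sorted, so each new write lands at
# the end of the key range, next to the previous write.
print(sorted(ts_keys) == ts_keys)

# Symbol-first keys are spread across the key range by symbol.
print(sorted(sym_keys) == sym_keys)
```

With the timestamp leading, consecutive writes are always adjacent at the tail of the key space, which is the pattern the explanation describes as hotspotting. A sequential numeric ID behaves the same way.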
NEW QUESTION 48
You store historic data in Cloud Storage. You need to perform analytics on the historic data. You want to use a solution to detect invalid data entries and perform data transformations that will not require programming or knowledge of SQL.
What should you do?
- A. Use Cloud Dataflow with Beam to detect errors and perform transformations.
- B. Use Cloud Dataproc with a Hadoop job to detect errors and perform transformations.
- C. Use federated tables in BigQuery with queries to detect errors and perform transformations.
- D. Use Cloud Dataprep with recipes to detect errors and perform transformations.
Answer: D
NEW QUESTION 49
Your company's customer and order databases are often under heavy load. This makes performing
analytics against them difficult without harming operations. The databases are in a MySQL cluster, with
nightly backups taken using mysqldump. You want to perform analytics with minimal impact on operations.
What should you do?
- A. Mount the backups to Google Cloud SQL, and then process the data using Google Cloud Dataproc.
- B. Connect an on-premises Apache Hadoop cluster to MySQL and perform ETL.
- C. Add a node to the MySQL cluster and build an OLAP cube there.
- D. Use an ETL tool to load the data from MySQL into Google BigQuery.
Answer: B
NEW QUESTION 50
Which of the following statements about Legacy SQL and Standard SQL is not true?
- A. Standard SQL is the preferred query language for BigQuery.
- B. One difference between the two query languages is how you specify fully-qualified table names (i.e. table names that include their associated project name).
- C. If you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.
- D. You need to set a query language for each dataset and the default is Standard SQL.
Answer: D
Explanation:
You do not set a query language for each dataset. It is set each time you run a query and the default query language is Legacy SQL.
Standard SQL has been the preferred query language since BigQuery 2.0 was released. In legacy SQL, to query a table with a project-qualified name, you use a colon, :, as a separator. In standard SQL, you use a period, ., instead.
Due to the differences in syntax between the two query languages (such as with project-qualified table names), if you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.
Reference:
https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql
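The separator difference described above can be shown with a small Python helper that rewrites a legacy-style table reference into the standard-style form. The helper name and the regular expression are assumptions for illustration; the sketch handles only simple `[project:dataset.table]` references and ignores other legacy syntax.

```python
import re

# Matches a bracketed legacy reference such as [project:dataset.table].
_LEGACY_TABLE = re.compile(
    r"^\[(?P<project>[^:\]]+):(?P<dataset>[^.\]]+)\.(?P<table>[^\]]+)\]$"
)


def legacy_to_standard(ref: str) -> str:
    """Rewrite [project:dataset.table] as `project.dataset.table`."""
    match = _LEGACY_TABLE.match(ref)
    if match is None:
        raise ValueError(f"not a project-qualified legacy table reference: {ref!r}")
    return "`{project}.{dataset}.{table}`".format(**match.groupdict())


print(legacy_to_standard("[bigquery-public-data:samples.shakespeare]"))
```

A query that still uses the colon form is one example of Legacy SQL that can fail when it is run as Standard SQL.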
NEW QUESTION 51
You are deploying MariaDB SQL databases on GCE VM Instances and need to configure monitoring and alerting. You want to collect metrics including network connections, disk IO and replication status from MariaDB with minimal development effort and use StackDriver for dashboards and alerts.
What should you do?
- A. Install the StackDriver Logging Agent and configure fluentd in_tail plugin to read MariaDB logs.
- B. Install the OpenCensus Agent and create a custom metric collection application with a StackDriver exporter.
- C. Install the StackDriver Agent and configure the MySQL plugin.
- D. Place the MariaDB instances in an Instance Group with a Health Check.
Answer: A
NEW QUESTION 52
Your company is in a highly regulated industry. One of your requirements is to ensure individual users have access only to the minimum amount of information required to do their jobs. You want to enforce this requirement with Google BigQuery. Which three approaches can you take? (Choose three.)
- A. Segregate data across multiple tables or databases.
- B. Ensure that the data is encrypted at all times.
- C. Disable writes to certain tables.
- D. Restrict BigQuery API access to approved users.
- E. Restrict access to tables by role.
- F. Use Google Stackdriver Audit Logging to determine policy violations.
Answer: D,E,F
NEW QUESTION 53
Your software uses a simple JSON format for all messages. These messages are published to Google Cloud Pub/Sub, then processed with Google Cloud Dataflow to create a real-time dashboard for the CFO.
During testing, you notice that some messages are missing in the dashboard. You check the logs, and all messages are being published to Cloud Pub/Sub successfully. What should you do next?
- A. Check the dashboard application to see if it is not displaying correctly.
- B. Switch Cloud Dataflow to pull messages from Cloud Pub/Sub instead of Cloud Pub/Sub pushing messages to Cloud Dataflow.
- C. Run a fixed dataset through the Cloud Dataflow pipeline and analyze the output.
- D. Use Google Stackdriver Monitoring on Cloud Pub/Sub to find the missing messages.
Answer: D
Explanation:
Stackdriver Monitoring can be used to check Cloud Pub/Sub metrics such as the number of unacknowledged messages, which show whether the publisher is pushing messages faster than they are being consumed.
NEW QUESTION 54
You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules:
- No interaction by the user on the site for 1 hour
- Has added more than $30 worth of products to the basket
- Has not completed a transaction
You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?
- A. Use a fixed-time window with a duration of 60 minutes.
- B. Use a session window with a gap time duration of 60 minutes.
- C. Use a global window with a time based trigger with a delay of 60 minutes.
- D. Use a sliding time window with a duration of 60 minutes.
Answer: C
NEW QUESTION 55
Which of these rules apply when you add preemptible workers to a Dataproc cluster (select 2 answers)?
- A. A Dataproc cluster cannot have only preemptible workers.
- B. If a preemptible worker is reclaimed, then a replacement worker must be added manually.
- C. Preemptible workers cannot use persistent disk.
- D. Preemptible workers cannot store data.
Answer: A,D
Explanation:
The following rules will apply when you use preemptible workers with a Cloud Dataproc cluster:
- Processing only: Since preemptibles can be reclaimed at any time, preemptible workers do not store data. Preemptibles added to a Cloud Dataproc cluster only function as processing nodes.
- No preemptible-only clusters: To ensure clusters do not lose all workers, Cloud Dataproc cannot create preemptible-only clusters.
- Persistent disk size: By default, all preemptible workers are created with the smaller of 100 GB or the primary worker boot disk size. This disk space is used for local caching of data and is not available through HDFS.
The managed group automatically re-adds workers lost due to reclamation as capacity permits.
Reference: https://cloud.google.com/dataproc/docs/concepts/preemptible-vms
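Two of the rules above can be written down as a short Python sketch. These helpers are hypothetical and do not call the Dataproc API; they only encode the "no preemptible-only clusters" rule and the default disk size rule as stated in the explanation.

```python
def preemptible_boot_disk_gb(primary_boot_disk_gb: int) -> int:
    """Default preemptible disk size: the smaller of 100 GB or the
    primary worker boot disk size, per the rule quoted above."""
    return min(100, primary_boot_disk_gb)


def validate_worker_counts(primary_workers: int, preemptible_workers: int) -> None:
    """Reject a cluster layout that has only preemptible workers."""
    if preemptible_workers > 0 and primary_workers == 0:
        raise ValueError("a cluster cannot consist of preemptible workers only")


print(preemptible_boot_disk_gb(500))  # primary disk larger than 100 GB
print(preemptible_boot_disk_gb(50))   # primary disk smaller than 100 GB
validate_worker_counts(primary_workers=2, preemptible_workers=4)  # accepted
```

The zero-primary-worker check is a simplification of the rule; the real service enforces its own minimum worker counts.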
NEW QUESTION 56
You are a head of BI at a large enterprise company with multiple business units that each have different priorities and budgets. You use on-demand pricing for BigQuery with a quota of 2K concurrent on-demand slots per project. Users at your organization sometimes don't get slots to execute their query and you need to correct this. You'd like to avoid introducing new projects to your account.
What should you do?
- A. Switch to flat-rate pricing and establish a hierarchical priority model for your projects.
- B. Convert your batch BQ queries into interactive BQ queries.
- C. Create an additional project to overcome the 2K on-demand per-project quota.
- D. Increase the amount of concurrent slots per project at the Quotas page at the Cloud Console.
Answer: A
NEW QUESTION 57
Your financial services company is moving to cloud technology and wants to store 50 TB of financial timeseries data in the cloud. This data is updated frequently and new data will be streaming in all the time.
Your company also wants to move their existing Apache Hadoop jobs to the cloud to get insights into this data.
Which product should they use to store the data?
- A. Cloud Bigtable
- B. Google BigQuery
- C. Google Cloud Storage
- D. Google Cloud Datastore
Answer: A
Explanation:
Reference: https://cloud.google.com/bigtable/docs/schema-design-time-series
NEW QUESTION 58
You have an Apache Kafka cluster on-prem with topics containing web application logs. You need to replicate the data to Google Cloud for analysis in BigQuery and Cloud Storage. The preferred replication method is mirroring to avoid deployment of Kafka Connect plugins.
What should you do?
- A. Deploy a Kafka cluster on GCE VM Instances. Configure your on-prem cluster to mirror your topics to the cluster running in GCE. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.
- B. Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Source connector. Use a Dataflow job to read from PubSub and write to GCS.
- C. Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Sink connector. Use a Dataflow job to read from PubSub and write to GCS.
- D. Deploy a Kafka cluster on GCE VM Instances with the PubSub Kafka connector configured as a Sink connector. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.
Answer: A
Google Cloud Big Data & Machine Learning Fundamentals course
This course is a gateway that introduces you to Google Cloud's big data and machine learning capabilities. To complete this training successfully, you should have about one year of experience in SQL; extract, transform, and load (ETL) activities; data modeling; machine learning; and programming in Python. The objectives of the course are the following:
- Recognize the purpose of the key Big data and Machine Learning products in Google Cloud
- Use BigQuery and Cloud SQL for interactive data analysis
- Create ML models using BigQuery ML, APIs, and AutoML.
- Utilize Cloud SQL & Dataproc to migrate existing MySQL, Pig, Spark, or Hive workloads to Google Cloud

