AWS Certified Data Analytics - Specialty (DAS-C01)

The AWS Certified Data Analytics - Specialty (DAS-C01) exam questions were recently updated.
  • Viewing questions 6-10 out of 160 questions
Disclaimers:
  • ExamTopics website is not related to, affiliated with, endorsed, or authorized by Amazon.
  • Trademarks, certification, and product names are used for reference only and belong to Amazon.

Topic 1 - Exam A

Question #6 Topic 1

An airline has been collecting metrics on flight activities for analytics. A recently completed proof of concept demonstrates how the company provides insights to data analysts to improve on-time departures. The proof of concept used objects in Amazon S3, which contained the metrics in .csv format, and used Amazon Athena for querying the data. As the amount of data increases, the data analyst wants to optimize the storage solution to improve query performance. Which options should the data analyst use to improve performance as the data lake grows? (Choose three.)

  • A Add a randomized string to the beginning of the keys in S3 to get more throughput across partitions.
  • B Use an S3 bucket in the same account as Athena.
  • C Compress the objects to reduce the data transfer I/O.
  • D Use an S3 bucket in the same Region as Athena.
  • E Preprocess the .csv data to JSON to reduce I/O by fetching only the document keys needed by the query.
  • F Preprocess the .csv data to Apache Parquet to reduce I/O by fetching only the data blocks needed for predicates.
Suggested Answer: BCD
NOTE: The suggested answer selects options B, C, and D. Keeping the S3 bucket in the same account as Athena (B) simplifies access and avoids cross-account bucket policy configuration. Compressing the objects (C) reduces the amount of data Athena has to read from S3, improving query performance. Keeping the S3 bucket in the same Region as Athena (D) avoids cross-Region data transfer and the added latency that comes with it.
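
A minimal sketch of option C, assuming hypothetical bucket and prefix names: the existing .csv objects are rewritten as GZIP-compressed copies, which Athena can query directly while scanning and transferring less data.

```python
# Sketch only: re-write existing .csv objects as GZIP-compressed copies (option C).
# "flight-metrics-raw", "metrics-csv/" and "metrics-gzip/" are placeholder names.
import gzip
import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "flight-metrics-raw"   # hypothetical bucket in the same Region as Athena
COMPRESSED_PREFIX = "metrics-gzip/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix="metrics-csv/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".csv"):
            continue
        # Download the CSV object, compress it, and store the compressed copy.
        body = s3.get_object(Bucket=SOURCE_BUCKET, Key=key)["Body"].read()
        compressed_key = COMPRESSED_PREFIX + key.split("/")[-1] + ".gz"
        s3.put_object(Bucket=SOURCE_BUCKET, Key=compressed_key, Body=gzip.compress(body))
```
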
Question #7 Topic 1

A large financial company is running its ETL process. Part of this process is to move data from Amazon S3 into an Amazon Redshift cluster. The company wants to use the most cost-efficient method to load the dataset into Amazon Redshift. Which combination of steps would meet these requirements? (Choose two.)

  • A Use the COPY command with the manifest file to load data into Amazon Redshift.
  • B Use S3DistCp to load files into Amazon Redshift.
  • C Use temporary staging tables during the loading process.
  • D Use the UNLOAD command to upload data into Amazon Redshift.
  • E Use Amazon Redshift Spectrum to query files from Amazon S3.
Suggested Answer: AE
NOTE: Using the COPY command with the manifest file allows for efficient data loading into Amazon Redshift. Additionally, using Amazon Redshift Spectrum allows for querying files directly from Amazon S3, reducing the need for data movement.
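
A minimal sketch of option A, assuming hypothetical cluster, database, role, table, and bucket names: a COPY command that references a manifest file in S3 is submitted through the Amazon Redshift Data API.

```python
# Sketch only: issue a COPY command that loads the objects listed in an S3 manifest
# file (option A) via the Redshift Data API. All identifiers are placeholders.
import boto3

redshift_data = boto3.client("redshift-data")

# The manifest is a JSON file in S3 that lists the exact objects COPY should load.
copy_sql = """
    COPY sales_staging
    FROM 's3://etl-bucket/manifests/load.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    MANIFEST
    FORMAT AS CSV;
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",   # hypothetical cluster
    Database="analytics",
    DbUser="etl_user",
    Sql=copy_sql,
)
print(response["Id"])   # statement id, can be polled with describe_statement
```
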
Question #8 Topic 1

A smart home automation company must efficiently ingest and process messages from various connected devices and sensors. The majority of these messages are comprised of a large number of small files. These messages are ingested using Amazon Kinesis Data Streams and sent to Amazon S3 using a Kinesis data stream consumer application. The Amazon S3 message data is then passed through a processing pipeline built on Amazon EMR running scheduled PySpark jobs. The data platform team manages data processing and is concerned about the efficiency and cost of downstream data processing. They want to continue to use PySpark. Which solution improves the efficiency of the data processing jobs and is well architected?

  • A Send the sensor and devices data directly to a Kinesis Data Firehose delivery stream to send the data to Amazon S3 with Apache Parquet record format conversion enabled. Use Amazon EMR running PySpark to process the data in Amazon S3.
  • B Set up an AWS Lambda function with a Python runtime environment. Process individual Kinesis data stream messages from the connected devices and sensors using Lambda.
  • C Launch an Amazon Redshift cluster. Copy the collected data from Amazon S3 to Amazon Redshift and move the data processing jobs from Amazon EMR to Amazon Redshift.
  • D Set up AWS Glue Python jobs to merge the small data files in Amazon S3 into larger files and transform them to Apache Parquet format. Migrate the downstream PySpark jobs from Amazon EMR to AWS Glue.
Suggested Answer: A
NOTE: Solution A improves the efficiency of the data processing jobs by sending the sensor and device data directly to a Kinesis Data Firehose delivery stream with Apache Parquet record format conversion enabled. Firehose buffers the many small messages into larger Parquet objects, a columnar format that is well suited to big data processing, and the team can continue to run its PySpark jobs on Amazon EMR against the converted data in Amazon S3.
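
A minimal sketch of the downstream step in solution A, assuming hypothetical S3 paths and column names (event_time, device_id, reading): a PySpark job on EMR reads the Parquet objects written by the Firehose delivery stream and aggregates them.

```python
# Sketch only: PySpark job on EMR processing the Parquet output of the Firehose
# delivery stream. Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sensor-metrics").getOrCreate()

# Firehose with record format conversion writes Parquet objects under this prefix.
events = spark.read.parquet("s3://smart-home-data/firehose-parquet/")

# Aggregate readings per device and hour.
hourly = (
    events
    .withColumn("hour", F.date_trunc("hour", F.col("event_time")))
    .groupBy("device_id", "hour")
    .agg(F.avg("reading").alias("avg_reading"))
)

hourly.write.mode("overwrite").parquet("s3://smart-home-data/aggregates/hourly/")
```
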
Question #9 Topic 1

A media analytics company consumes a stream of social media posts. The posts are sent to an Amazon Kinesis data stream partitioned on user_id. An AWS Lambda function retrieves the records and validates the content before loading the posts into an Amazon OpenSearch Service (Amazon Elasticsearch Service) cluster. The validation process needs to receive the posts for a given user in the order they were received by the Kinesis data stream. During peak hours, the social media posts take more than an hour to appear in the Amazon OpenSearch Service (Amazon ES) cluster. A data analytics specialist must implement a solution that reduces this latency with the least possible operational overhead. Which solution meets these requirements?

  • A Migrate the validation process from Lambda to AWS Glue.
  • B Migrate the Lambda consumers from standard data stream iterators to an HTTP/2 stream consumer.
  • C Increase the number of shards in the Kinesis data stream.
  • D Send the posts stream to Amazon Managed Streaming for Apache Kafka instead of the Kinesis data stream.
Suggested Answer: C
NOTE: Increasing the number of shards in the Kinesis data stream lets more Lambda invocations process the stream in parallel (one per shard), which reduces the backlog during peak hours and therefore the latency. Because the stream is partitioned on user_id, each user's posts still map to a single shard, so the per-user ordering requirement is preserved, and no new infrastructure has to be operated.
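
A minimal sketch of the suggested answer C, assuming a hypothetical stream name: the shard count is doubled with a uniform scaling operation so more Lambda invocations can consume the stream in parallel.

```python
# Sketch only: double the open shard count of the stream (suggested answer C).
# "social-posts" is a placeholder stream name.
import boto3

kinesis = boto3.client("kinesis")

STREAM_NAME = "social-posts"

summary = kinesis.describe_stream_summary(StreamName=STREAM_NAME)
current_shards = summary["StreamDescriptionSummary"]["OpenShardCount"]

# UNIFORM_SCALING allows scaling up to double the current shard count in one call.
kinesis.update_shard_count(
    StreamName=STREAM_NAME,
    TargetShardCount=current_shards * 2,
    ScalingType="UNIFORM_SCALING",
)
```
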
Question #10 Topic 1

A company has 1 million scanned documents stored as image files in Amazon S3. The documents contain typewritten application forms with information including the applicant first name, applicant last name, application date, application type, and application text. The company has developed a machine learning algorithm to extract the metadata values from the scanned documents. The company wants to allow internal data analysts to analyze and find applications using the applicant name, application date, or application text. The original images should also be downloadable. Cost control is secondary to query performance. Which solution organizes the images and metadata to drive insights while meeting the requirements?

  • A For each image, use object tags to add the metadata. Use Amazon S3 Select to retrieve the files based on the applicant name and application date.
  • B Index the metadata and the Amazon S3 location of the image file in Amazon OpenSearch Service (Amazon Elasticsearch Service). Allow the data analysts to use OpenSearch Dashboards (Kibana) to submit queries to the Amazon OpenSearch Service (Amazon Elasticsearch Service) cluster.
  • C Store the metadata and the Amazon S3 location of the image file in an Amazon Redshift table. Allow the data analysts to run ad-hoc queries on the table.
  • D Store the metadata and the Amazon S3 location of the image files in an Apache Parquet file in Amazon S3, and define a table in the AWS Glue Data Catalog. Allow data analysts to use Amazon Athena to submit custom queries.
Suggested Answer: D
NOTE: Option D is chosen because it best organizes the images and metadata to drive insights while meeting the requirements. Storing the metadata and the Amazon S3 location of each image file in Apache Parquet files in Amazon S3 allows efficient querying and analysis with Amazon Athena, and defining a table in the AWS Glue Data Catalog makes the data easy for analysts to discover and query, while the S3 location column lets them download the original images.
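
A minimal sketch of option D's query path, assuming hypothetical database, table, column, and bucket names: an analyst submits an Athena query against the Glue Data Catalog table that holds the extracted metadata and each image's S3 location.

```python
# Sketch only: run an ad-hoc Athena query against the catalogued metadata table
# (option D). Database, table, columns, and output bucket are placeholders.
import boto3

athena = boto3.client("athena")

query = """
    SELECT applicant_first_name,
           applicant_last_name,
           application_date,
           s3_image_location
    FROM applications_metadata
    WHERE application_date BETWEEN DATE '2023-01-01' AND DATE '2023-12-31'
      AND application_text LIKE '%mortgage%'
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "applications"},
    ResultConfiguration={"OutputLocation": "s3://analytics-results/athena/"},
)
print(response["QueryExecutionId"])   # id used to fetch results when the query finishes
```
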