A data scientist needs to migrate an existing on-premises ETL process to the cloud. The current process runs on a regular schedule and uses PySpark to combine and format multiple large data sources into a single consolidated output for downstream processing. The data scientist has been given the following requirements for the cloud solution: ✑ Combine multiple data sources. ✑ Reuse existing PySpark logic. ✑ Run the solution on the existing schedule. ✑ Minimize the number of servers that will have to be managed. Which architecture should the data scientist use to build this solution?
-
A
Write the raw data to Amazon S3. Schedule an AWS Lambda function to submit a Spark step to a persistent Amazon EMR cluster based on the existing schedule. Use the existing PySpark logic to run the ETL job on the EMR cluster. Output the results to a "processed" location in Amazon S3 that is accessible for downstream use.
-
B
Write the raw data to Amazon S3. Create an AWS Glue ETL job to perform the ETL processing against the input data. Write the ETL job in PySpark to leverage the existing logic. Create a new AWS Glue trigger to trigger the ETL job based on the existing schedule. Configure the output target of the ETL job to write to a "processed" location in Amazon S3 that is accessible for downstream use.
-
C
Write the raw data to Amazon S3. Schedule an AWS Lambda function to run on the existing schedule and process the input data from Amazon S3. Write the Lambda logic in Python and implement the existing PySpark logic to perform the ETL process. Have the Lambda function output the results to a "processed" location in Amazon S3 that is accessible for downstream use.
-
D
Use Amazon Kinesis Data Analytics to stream the input data and perform real-time SQL queries against the stream to carry out the required transformations within the stream. Deliver the output results to a "processed" location in Amazon S3 that is accessible for downstream use.
Correct answer:
B
Explanation:
Option B satisfies all four requirements: AWS Glue is serverless (no servers to manage), Glue ETL jobs are written in PySpark so the existing logic can be reused, a single job can merge multiple data sources, and a Glue trigger can run the job on the existing schedule. Option A requires managing a persistent EMR cluster. Option C cannot run PySpark inside Lambda, and Lambda's runtime and memory limits are unsuited to large batch merges. Option D (originally listed as the answer) is incorrect: Kinesis Data Analytics is for real-time SQL over streams, not scheduled batch PySpark jobs, and does not reuse the existing logic.
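A minimal sketch of the Glue PySpark job described in option B. This is illustrative only: the bucket names, paths, join key, and column names are hypothetical stand-ins for the "existing PySpark logic", and the script runs only inside the AWS Glue job runtime (which provides the `awsglue` library), not locally.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate: wrap a SparkContext in a GlueContext
# and initialize the job so Glue can track bookmarks and job state.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Existing PySpark logic, reused as-is: read the raw sources from S3
# and merge them into a single consolidated output.
# (Bucket, prefixes, and "customer_id" are hypothetical examples.)
orders = spark.read.parquet("s3://example-bucket/raw/orders/")
customers = spark.read.parquet("s3://example-bucket/raw/customers/")
merged = orders.join(customers, on="customer_id", how="left")

# Write to the "processed" location that downstream consumers read.
merged.write.mode("overwrite").parquet("s3://example-bucket/processed/")

job.commit()
```

A Glue trigger of type `SCHEDULED` with a cron expression (or the existing scheduler calling the `StartJobRun` API) then runs this job on the current schedule with no servers to manage.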