
Can Google Data Fusion do the same data cleaning as Dataprep?

I want to train a machine learning model on some data. Before training the model I need to process the data, so I have been reading about ways to do it.

  1. First, create a Dataflow pipeline to upload the data to BigQuery or Google Cloud Storage, then create a data pipeline with Google Dataprep to clean it.

  2. The other way I read about is Data Fusion, which makes it easier to create data pipelines. Here is my doubt: is Data Fusion only for creating a pipeline like Dataflow, so that I still have to use Dataprep to clean the data, or can Data Fusion also clean the data and prepare it for my machine learning model?

If Data Fusion can clean the data as Dataprep does, when should I use Dataprep?

J.C Guzman asked Sep 30 '19 21:09



1 Answer

Data Fusion and Dataprep can do the same things. However, they execute differently.

  • Data Fusion creates a Spark pipeline and runs it on a Dataproc cluster
  • Dataprep creates a Beam pipeline and runs it on Dataflow

IMO, Data Fusion is designed more for data ingestion from one source to another, with few transformations. Dataprep is designed more for data preparation (as its name suggests): data cleaning, new column creation, column splitting. Dataprep also provides insights into the data to help you build your recipes.
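To make "data preparation" concrete, here is a rough sketch of the three kinds of recipe steps mentioned above (cleaning, column splitting, new column creation). It is plain Python, not the Dataprep or Data Fusion API, and the column names, sample rows and the `clean_rows` helper are all made up for illustration.

```python
# Illustrative only: plain Python, not the Dataprep or Data Fusion API.
# Column names, sample rows and clean_rows are hypothetical.

def clean_rows(rows):
    """Drop incomplete rows, split a column, and derive a new column."""
    cleaned = []
    for row in rows:
        # Cleaning: skip rows with a missing name or a missing price.
        if not row.get("full_name") or row.get("price") in (None, ""):
            continue
        # Column splitting: break "full_name" into first and last name.
        first, _, last = row["full_name"].strip().partition(" ")
        # Type conversion: the raw values arrive as strings.
        price = float(row["price"])
        quantity = int(row.get("quantity") or 0)
        cleaned.append({
            "first_name": first,
            "last_name": last,
            "price": price,
            "quantity": quantity,
            # New column: a total derived from two existing columns.
            "total": round(price * quantity, 2),
        })
    return cleaned


raw = [
    {"full_name": " Ada Lovelace ", "price": "2.50", "quantity": "4"},
    {"full_name": "", "price": "1.00", "quantity": "1"},
    {"full_name": "Alan Turing", "price": None, "quantity": "2"},
]
result = clean_rows(raw)
print(result)
```

In either tool you would define steps like these visually rather than in code; the sketch only shows the kind of transformation involved.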

In addition, Beam is part of TensorFlow Extended, so your data engineering pipeline will be more consistent if you use a tool compatible with Beam.

That's why I would recommend Dataprep over Data Fusion.

guillaume blaquiere answered Oct 28 '22 04:10