I want to train a machine learning model on some data. Before training the model I need to process the data, so I have been reading about ways to do that.
The first way is to create a Dataflow pipeline to upload the data to BigQuery or Google Cloud Storage, and then create a data pipeline with Google Dataprep to clean it.
The other way I read about is Data Fusion, which makes it easier to create data pipelines. Here is my doubt: is Data Fusion only for creating a pipeline, like Dataflow, so that I still have to use Dataprep to clean the data? Or can Data Fusion also clean the data and prepare it for my machine learning model?
If Data Fusion can clean the data the way Dataprep does, when should I use Dataprep?
Dataflow is a managed service for deploying ETL pipelines written with the Apache Beam programming model. It is useful for both batch and streaming data, and can potentially be used with whatever data sources you want (e.g. Kafka, Pub/Sub, Datastore, JDBC...). Dataprep is more limited, mainly to GCS and BigQuery.
Dataprep connects to BigQuery, Cloud Storage, Google Sheets, and hundreds of other cloud applications and traditional databases so you can transform and clean any data you want. Dataprep is built on top of Dataflow and BigQuery.
Trifacta is a privately owned software company headquartered in San Francisco with offices in Bengaluru, Boston, Berlin and London.
Extract, Load, Transform (ELT) is an alternative to ETL used to store data in data lakes in raw formats before the transformation phase. Designed by Trifacta, Dataprep is a fully managed Google cloud data service for exploring, cleaning, structuring and enriching structured and unstructured data.
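As a rough illustration of the ELT idea mentioned above (load the raw data first, transform it later inside the store), here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for a warehouse. The table names `raw_events` and `clean_events` and the sample values are made up for the example; this is not how Dataprep or BigQuery is actually driven.

```python
import sqlite3

# In-memory database standing in for a data lake / warehouse (assumption).
conn = sqlite3.connect(":memory:")

# Load: store the raw records untouched, messy values included.
conn.execute("CREATE TABLE raw_events (payload TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?)",
    [(" 12.5 ",), ("n/a",), ("7",)],
)

# Transform: done afterwards, inside the store, with SQL.
# Keep only payloads that start with a digit and cast them to numbers.
conn.execute(
    """
    CREATE TABLE clean_events AS
    SELECT CAST(TRIM(payload) AS REAL) AS value
    FROM raw_events
    WHERE TRIM(payload) GLOB '[0-9]*'
    """
)

values = [row[0] for row in conn.execute(
    "SELECT value FROM clean_events ORDER BY value"
)]
print(values)
```

The raw table is never modified, so the transformation can be rerun or changed later without loading the data again.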
Data Fusion and Dataprep can do the same things. However, the way they execute is different.
IMO, Data Fusion is designed more for data ingestion from one source to another, with few transformations. Dataprep is designed more for data preparation (as its name suggests): data cleaning, creating new columns, splitting columns. Dataprep also provides insights into the data to help you build your recipes.
In addition, Beam is part of TensorFlow Extended, and your data engineering pipeline will be more consistent if you use a tool that is compatible with Beam.
That's why I would recommend Dataprep over Data Fusion.
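To make the preparation steps listed above concrete (cleaning, creating a new column, splitting a column), here is a small sketch in plain Python. It only illustrates the kind of work a Dataprep recipe does; it does not use Dataprep, and the column names and sample rows are invented for the example.

```python
import csv
import io

# Hypothetical raw input; column names and values are made up.
RAW_CSV = """full_name,age,city
 Ada Lovelace ,36,London
Alan Turing,,Manchester
Alan Turing,,Manchester
Grace Hopper,85, new york
"""


def clean_rows(raw_text):
    """Trim whitespace, drop exact duplicates, split a column, derive a new one."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Data cleaning: strip stray whitespace from every value.
        row = {key: value.strip() for key, value in row.items()}
        # Data cleaning: skip rows that exactly duplicate an earlier row.
        fingerprint = tuple(row.items())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        # Column splitting: break full_name into first and last name.
        first, _, last = row.pop("full_name").partition(" ")
        # Type fix: turn age into an int, keeping None for missing values.
        age = int(row["age"]) if row["age"] else None
        cleaned.append({
            "first_name": first,
            "last_name": last,
            "age": age,
            "city": row["city"].title(),
            # New column creation: a flag a model could use as a feature.
            "age_missing": age is None,
        })
    return cleaned


rows = clean_rows(RAW_CSV)
for row in rows:
    print(row)
```

In Dataprep each of these steps would be a visual recipe step rather than code, but the output has the same shape: one tidy row per record, ready to feed into a model.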