 

R vs Pentaho Spoon as an ETL tool [closed]

Tags: r, pentaho

Background (sorry it's so long):

I've been tasked with maintaining an ETL process that collects a variety of online advertising data, around 20-30 MB a day, and appends it to tables in MySQL. Outside contractors built the ETL with Pentaho Spoon (the GUI for Kettle, i.e. Pentaho Data Integration; jobs are run with Kitchen). The ETL consists of about 250 jobs and transformations (.ktr, .kjb), each with about 5 to 25 steps. Something very commonly goes wrong somewhere in this large process. I've found that writing R scripts to do the transform and load is much more efficient. In fact, I think the ETL could be reduced to well under 1,000 lines of code beyond the RMySQL calls (largely thanks to plyr!). Perhaps Python would be used to extract the data from the web.
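For a sense of scale, here is roughly what one transform-and-load step looks like in R. This is a minimal sketch; the file, table, and column names are made up for illustration:

    library(RMySQL)  # also attaches DBI
    library(plyr)

    # Extract: one day's raw file (in practice pulled from the web, perhaps via Python)
    raw <- read.csv("ad_data_2013-02-21.csv", stringsAsFactors = FALSE)

    # Transform: aggregate impressions and spend per campaign per day
    daily <- ddply(raw, .(campaign_id, date), summarise,
                   impressions = sum(impressions),
                   spend       = sum(spend))

    # Load: append today's rows to the MySQL table
    con <- dbConnect(MySQL(), dbname = "ads", host = "localhost",
                     user = "etl", password = "...")
    dbWriteTable(con, "daily_campaign_stats", daily,
                 append = TRUE, row.names = FALSE)
    dbDisconnect(con)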

My use of R has led to some resistance. The programmers who designed the ETL don't know R, so they couldn't be called on if I leave, and moreover a lot of time was invested in the Spoon ETL. Also, a layman can follow the steps visually in Spoon more easily than in R scripts. For my part, I think we are getting bogged down by the ETL. However, I don't have a large say in the matter, as I don't have a background in computer science.

Please comment if you have any insights on the following. Please know I have been researching this for months and have read many opinions, but nothing as concise or reliable as SO usually provides:

  1. Some at the company have said R is not as scalable. I think the opposite, mostly because of the logging capabilities. Spoon has limited plain logging output, whereas all output from the R scripts can be sink()ed into a daily log (see the sketch after this list). Fixing and avoiding mistakes in the .ktrs is very tedious, but easy with flags and/or searching through the R log. Any thoughts on this?

  2. This leads to a big-picture question: what is the point of ETL tools like Pentaho? The post "Do I need an ETL?" leads me to believe that if you use R or another so-called OOL, there is no reason to have a tool like Pentaho. Can someone please confirm this, if so? I really need a second opinion here. If this is so, who uses tools like Pentaho? Is it simply people without a programming background, or someone else? I do see a fair amount of Pentaho questions on SO.

  3. It is true that a lot more people use R than Pentaho, right? This survey (http://www.kdnuggets.com/2012/05/top-analytics-data-mining-big-data-software.html) makes it look so. To be honest, I was surprised that Pentaho was 5th, which makes me doubly wonder who uses Pentaho and whether my doubts about its use in my work setting are misplaced.
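To illustrate the logging point in item 1, this is the kind of pattern I mean, following the documented sink() idiom (a minimal sketch; the paths are made up):

    # Divert all console output plus messages/warnings to a dated log file
    log_con <- file(file.path("logs", paste0("etl_", Sys.Date(), ".log")),
                    open = "a")
    sink(log_con)                     # standard output
    sink(log_con, type = "message")   # message() calls, warnings, and errors

    message("Starting load at ", Sys.time())
    # ... extract / transform / load steps here ...

    # Restore normal output and close the log
    sink(type = "message")
    sink()
    close(log_con)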

Thanks for any responses. I don't mean any condescension towards Spoon or Spoon users; I am just really confused and in need of outside opinions.

asked Feb 21 '13 by StatSandwich


1 Answer

R as an ETL tool? That's a new one, but whatever floats your boat.

I will say this, though: if you can get 250 jobs and transformations down to under 1,000 lines of R, I would say your ETL is poorly written.

Along with this, you have to think about supportability and scalability, both of which I would imagine are far easier with a graphical tool like Spoon than with R code.

Personally, I think you are misguided and the question you ask is poorly written, but that's a different argument.

Regarding your points: PDI's logging is very good, and you can log pretty much however you like, all into one large database table if you want a consolidated log.

ETLs won't be going away, even with the advent of unstructured data storage pools like HDFS. Also think about data analysis done outside R: if you want reporting or OLAP on top of your data, it will still need transforming regardless.

Is it true that more people use R than Pentaho? What sort of question is that? By Pentaho I assume you mean PDI? How can that ever be compared? A data analysis tool vs an ETL tool, and you want to count users? If, on the other hand, you mean R vs Pentaho as a whole, then I would guess no. You are looking at a report on R vs Weka and making it fit your ETL argument. That doesn't wash in a month of Sundays.

==EDIT== Okay, so you have around 1,000 lines of R and Python code currently. As your bosses' requirements expand, this slowly grows over time, and because you are trying to hit deadlines, the new code isn't written as cleanly or documented as well as the code you currently have in place. So over time this grows to, say, 5,000 lines plus a few Python scripts. Then one day you get hit by a bus, and some new person has to come in and manage your code... where do they start? How do they make changes?

Virtually anyone with a modicum of data experience could make a change to a PDI ETL should they be required to, whereas it would take someone with in-depth R knowledge to make changes to what you have done.

ETL tools are designed to be quick and easy to use, and they also offer far more than R can provide in terms of data connectivity to different systems (non-DB or file-based sources, for example), although I guess this is why people resort to Python etc. That said, there is room for both: there is an R plugin for PDI kicking around in the community that I've seen demonstrated.

On top of that, I've seen enough T-SQL-to-ETL migrations over the years to know from experience that even though maintaining your ETL in code may seem practical in the short term, in the long term it just brings more pain.

On the other hand, if you can cut 250 PDI transformations down to 1,000 lines of R, your ETL is likely bloated through bad design by your predecessor.

If you'd like me to give an opinion on your existing PDI ETL structure, that can also be arranged.

Tom

answered Oct 11 '22 by bugg_tb