 

Data recognition, parsing, filtering, and transformation -- GUI?

Looking for a non-cloud-based, open-source app for doing data transformation; though for a killer (and I mean killer) app built just for data transformations, I might be willing to spend up to $1000.

I've looked at Perl, Kapow Katalyst, Pentaho Kettle, and more.

Perl, Python, and Ruby are clearly languages, but I've been unable to find any frameworks/DSLs built just for processing data; meaning they're really not great development environments for this: there's no GUI for building regexes, no built-in input/output connectors (CSV, XML, JDBC, REST, etc.), and no debugger for testing rows and rows of data. They're not bad either, just not what I'm looking for, which is a GUI built for complex data transformations; that said, I'd love it if the GUI/app file were in a scripting language, and NOT just stored in some non-human-readable XML/ASCII file.
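To be concrete about the scripting route, here's a minimal Python sketch of the kind of thing I mean (the file names, the "phone" column, and the regex are just stand-ins); it works, but there's no GUI for building the regex, no connector library, and no row-by-row debugger:

    # Minimal sketch: CSV in, regex cleanup on one column, CSV out.
    # File names, the "phone" column, and the pattern are stand-ins.
    import csv
    import re

    NON_DIGITS = re.compile(r"\D+")  # strip everything that isn't a digit

    with open("input.csv", newline="") as src, \
         open("output.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row["phone"] = NON_DIGITS.sub("", row.get("phone", ""))
            writer.writerow(row)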

Kapow Katalyst is made for accessing data via HTTP (HTML, CSS, RSS, JavaScript, etc.). It's got a nice GUI for transforming unstructured text, but that's not its core value offering, and it's way, way too expensive. It does an okay job of traversing document namespace paths; I'm guessing it's just XPath on the back end, since the syntax appears to be the same.
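If it is XPath underneath, the traversal itself is easy enough to reproduce in a script; a rough sketch using Python's lxml and requests (the URL and the path expression are placeholders):

    # Rough sketch of XPath traversal; URL and path are placeholders.
    import requests
    from lxml import html

    page = html.fromstring(requests.get("http://example.com/listing").content)
    # e.g. pull the text of every cell in a results table
    for text in page.xpath("//table[@id='results']//tr/td/text()"):
        print(text.strip())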

Pentaho Kettle has a nice GUI for input/output of most common data stores, and its own take on handling data processing, which is okay and has only a small learning curve. Kettle's debugger is OK in that the data is easy to see, but errors and exceptions are not threaded with the output, and there's no way to really debug an issue; meaning you can't reload the output/error/exception, you're only able to view the system feedback. All that said, Kettle's data transformation is _______ well, let's just say it left me feeling like I must be missing something, because I was completely puzzled by "if it's not possible, just write the transformation in JavaScript"; umm, what?
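To illustrate what I mean by threading errors with the output, here's a rough Python sketch (the transformation itself is a stand-in): each output row carries its own error, so a failed row can be inspected, and re-run, on its own:

    # Sketch: keep each row's error alongside its output instead of in a
    # separate log, so failures can be inspected and re-run row by row.
    def transform(row):
        return {"total": float(row["price"]) * int(row["qty"])}

    def run(rows):
        results = []
        for i, row in enumerate(rows):
            try:
                results.append({"row": i, "output": transform(row), "error": None})
            except Exception as exc:
                results.append({"row": i, "output": None, "error": repr(exc)})
        return results

    for result in run([{"price": "9.99", "qty": "3"}, {"price": "oops", "qty": "1"}]):
        print(result)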

So, any suggestions? I do realize that I haven't really spec'd out any transformations, but I figure if you really use a product for data munging, I'd like to know about it; even Excel, I guess.

In general, though, I'm currently looking for a product that's able to handle 1,000-100,000 rows with 10-100 columns. It'd be super cool if it could profile data sets, which is a feature Kettle sort of does, but not super well.

I'd also like built-in unit testing, meaning I'm able to build out control sets of data and run any changes I make against the control set (a bare-bones sketch of what I mean follows below).

Then I'd like to be able to selectively filter out rows and columns as I build out the transformation, without altering the build; for example, I run a data set through the transformation, filter the results, and on the next run those sets are automatically blocked at the first "logical" occurrence, which in turn would mean less data to "look at" and a reduced runtime per each enhanced iteration. What would be crazy nice is if, as I'm filtering out the rows/columns, the app tracked those (and the output was filtered accordingly) and unit tested/highlighted any changes. If I made a change that would affect the application logs and their ability to track the unit tests, because I'd "broken a branch", it'd give me a warning and let me dump the data stored for that branch, and/or track the primary keys for differences in the next generation of output, or even attempt to match them using fuzzy logic. And yes, I know this is a pipe dream, but hey, figured I'd ask, just in case there's something out there I've just never seen.
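As for the control-set idea mentioned above, the mechanics I have in mind are bare-bones; something like this Python sketch (the transformation and the control/expected rows are made up), where the current transformation gets diffed against a stored expected output:

    # Bare-bones control-set check: run the current transformation over a
    # fixed control set and diff the result against stored expected rows.
    def transform(row):
        return {"name": row["name"].strip().title()}

    CONTROL = [{"name": "  alice  "}, {"name": "BOB"}]
    EXPECTED = [{"name": "Alice"}, {"name": "Bob"}]

    def check(control, expected):
        diffs = []
        for i, (row, want) in enumerate(zip(control, expected)):
            got = transform(row)
            if got != want:
                diffs.append((i, want, got))
        return diffs

    for i, want, got in check(CONTROL, EXPECTED):
        print(f"row {i}: expected {want}, got {got}")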

Feel free to comment; I'd be happy to answer any questions or offer additional info.

asked Dec 03 '10 by blunders

2 Answers

Google Refine?

answered by glenn mcdonald


Talend will need more than 5 minutes of your time, perhaps closer to an hour, to begin wiring up basic transformations and to fulfill your requirement of keeping transformations under version control as well. You described a pipeline process that can be done easily in Talend once you know how: you have multiple inputs and outputs in a project as the same raw data goes through various transformations and filtering, until it arrives at the final output you desired. Then you can schedule your jobs to repeat the process over similar data. Go back and spend more time with Talend, and you'll succeed in what you need, I'm sure.
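To make that pipeline shape concrete outside the GUI, here's a rough stand-in sketch in plain Python (these are not Talend components, just the shape of the flow): one raw input fanned through a transformation and a filter into multiple outputs:

    # Stand-in sketch of the pipeline shape Talend wires up visually:
    # raw rows -> transform -> filter -> two outputs.
    raw = [{"id": 1, "amount": "10"}, {"id": 2, "amount": "-5"}]

    def to_number(row):
        return {**row, "amount": float(row["amount"])}

    transformed = [to_number(r) for r in raw]
    valid = [r for r in transformed if r["amount"] >= 0]
    rejected = [r for r in transformed if r["amount"] < 0]

    print("valid:", valid)
    print("rejected:", rejected)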

I also happen to be one of the committers on Google Refine, and I use Talend in my daily work. I actually sometimes model my transformations for Talend first in Google Refine. (Sometimes I even use Refine to perform cleanup on borked ETL transforms themselves! LOL) I can tell you that my experience with Talend played a small part in a few of the features of Google Refine. For instance, both Talend and Google Refine have the concept of an expression editor for your transformations (Talend goes down to the Java language for this if need be).

Google Refine will never be an ETL tool, in the sense that we have not designed it to compete in that space where ETL is typically used for large data-warehouse back-end processing and transformations. However, we designed Google Refine to complement existing ETL tools like Talend by allowing easy live previewing to make informed decisions about your transformations and cleanup, and if your data isn't incredibly huge, you might opt to perform what you need within Refine itself.

answered by Thad Guidry