
Java ETL: hard to find a suitable one [closed]

I'm looking for an embeddable Java ETL, i.e., an Extract Transform Load engine that can be called from Java code.

I'm finding it surprisingly hard to find a suitable one.

I'm mainly looking at loading delimited text files into database tables, with some minor transforms along the way.

I'd like the following features:

  • the ability to specify simple mappings externally, e.g., text column 5 to database column foo, in an XML mapping file
  • the ability to give the database node a javax.sql.DataSource
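
To make the first requirement concrete, here is a minimal sketch of what such an external mapping could look like. It assumes the mapping is stored in the standard java.util.Properties XML format, not in any particular ETL tool's schema. The class name ColumnMapping and the convention that each key is a 1-based text column index are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Hypothetical helper: reads "text column index -> database column" pairs from XML. */
final class ColumnMapping {

    /**
     * Parses a document in the java.util.Properties XML format. Each entry key is
     * assumed to be a 1-based text column index, each value a database column name.
     * A non-numeric key fails with NumberFormatException.
     */
    static Map<Integer, String> load(InputStream xml) throws IOException {
        Properties props = new Properties();
        props.loadFromXML(xml); // rejects malformed documents with an IOException subclass
        Map<Integer, String> mapping = new TreeMap<>(); // sorted by text column index
        for (String key : props.stringPropertyNames()) {
            mapping.put(Integer.valueOf(key.trim()), props.getProperty(key).trim());
        }
        return mapping;
    }

    /** Convenience overload for small inline mappings. */
    static Map<Integer, String> parse(String xml) throws IOException {
        return load(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

With this convention, an entry such as <entry key="5">foo</entry> would express "text column 5 to database column foo". A real ETL engine would likely need a richer format (types, transforms), so treat this only as the smallest thing that satisfies the requirement.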

CloverETL allows mappings to be specified in XML, but database connections must be either JNDI names or a properties file specifying driverClass, url, dbusername, password, etc. Since I already have javax.sql.DataSources set up by my dependency injection framework, properties files seem painful and brittle, especially if I want this to work in several environments (dev, test, prod).

KETL tells me that "We are currently in the process of completely overhauling our documentation for KETL™. Because of this, only the installation guide has been updated." Honest, but not helpful.

Octopus is now "http://www.together.at/prod/database/tdt", which is "under construction".

Pentaho seems to use the same "specify driverClass" style that CloverETL does, rather than using a DataSource, and Pentaho's documentation for calling their engine from Java code is difficult to find.

Basically I'd really like to be able to do this pseudo-code:

extractTransformLoad(
        getInputFile( "input.csv" ),
        getXMLMapping( "myMappingFile.xml" ),
        new DatabaseWriter( getDatasource() ) );
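
For comparison, the database side of that pseudo-code can be sketched with plain JDBC and no ETL library at all. This is only an illustration under stated assumptions: DelimitedLoader and its methods are hypothetical names, the mapping is a Map from 1-based text column index to database column name, fields are split naively (no quoted delimiters), and every value is bound as a string, which relies on the driver or database converting types.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.regex.Pattern;
import javax.sql.DataSource;

/** Hypothetical loader; none of these names come from an existing ETL library. */
final class DelimitedLoader {

    /**
     * Builds "INSERT INTO table (c1, c2) VALUES (?, ?)" in the mapping's iteration order.
     * Table and column names are concatenated as-is, so they must come from a trusted file.
     */
    static String buildInsertSql(String table, Map<Integer, String> mapping) {
        StringJoiner columns = new StringJoiner(", ", "(", ")");
        StringJoiner marks = new StringJoiner(", ", "(", ")");
        for (String column : mapping.values()) {
            columns.add(column);
            marks.add("?");
        }
        return "INSERT INTO " + table + " " + columns + " VALUES " + marks;
    }

    /** Picks the mapped fields (1-based indexes) out of one delimited line. */
    static List<String> selectFields(String line, String delimiter, Map<Integer, String> mapping) {
        String[] fields = line.split(Pattern.quote(delimiter), -1); // -1 keeps trailing empty fields
        List<String> selected = new ArrayList<>();
        for (int index : mapping.keySet()) {
            selected.add(fields[index - 1].trim());
        }
        return selected;
    }

    /** Inserts every line as one batch through a connection borrowed from the DataSource. */
    static int load(Iterable<String> lines, String delimiter, String table,
                    Map<Integer, String> mapping, DataSource dataSource) throws SQLException {
        int rows = 0;
        try (Connection connection = dataSource.getConnection();
             PreparedStatement insert = connection.prepareStatement(buildInsertSql(table, mapping))) {
            for (String line : lines) {
                List<String> values = selectFields(line, delimiter, mapping);
                for (int i = 0; i < values.size(); i++) {
                    insert.setString(i + 1, values.get(i)); // JDBC parameters are 1-based
                }
                insert.addBatch();
                rows++;
            }
            insert.executeBatch();
        }
        return rows;
    }
}
```

Because load takes the DataSource as a parameter, the same code would run unchanged in dev, test and prod with whatever the dependency injection framework supplies. What it lacks (transforms, error handling per row, proper CSV parsing) is exactly what an embeddable ETL engine would be expected to add.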

Any suggestions?

tpdi asked Nov 22 '10 23:11



3 Answers

Disclosure: I'm the author of Scriptella ETL, but I believe this tool might be useful for your case.

It's a lightweight open source ETL tool with one-line integration into Java. It also supports the Spring Framework and comes with built-in drivers for CSV, text, XML, Excel and other data sources.

Example of importing a CSV file into a table:

<!DOCTYPE etl SYSTEM "http://scriptella.org/dtd/etl.dtd">
<etl>
  <connection id="in" driver="csv" url="data.csv" />
  <connection id="out" driver="oracle" url="jdbc:oracle:thin:@localhost:1521:ORCL" 
      classpath="ojdbc14.jar" user="scott" password="tiger" />
  <!-- Copy all CSV rows to a database table -->
  <query connection-id="in">
      <!-- Empty query means select all columns -->
      <script connection-id="out">
          INSERT INTO Table_Name VALUES (?id,?priority, ?summary, ?status)
      </script>
  </query>
</etl>

Running from Java:

// Execute etl.xml file
EtlExecutor.newExecutor(new File("etl.xml")).execute();

Running from command-line:

scriptella [file_name]

Integration with Spring:

  1. Use the "spring" driver and the name of the bean to reference data sources. Example:

    <connection id="spring" driver="spring" url="datasourceBeanName" />
    
  2. Add EtlExecutorBean to the application context in order to execute the job:

    <bean id="createDb" class="scriptella.driver.spring.EtlExecutorBean">
        <property name="configLocation" value="create-db.etl.xml" />
        <property name="progressIndicator"><ref local="progress" /></property>
        <property name="autostart" value="true" /> <!-- Etl will be run during app context initialization -->
    </bean>
    

For additional details see the Spring example.

ejboy answered Oct 04 '22 21:10


Do you know Talend?

It's a tool based on Eclipse (Talend Open Studio), but you can use it directly in Java by writing your own code or by exporting jobs to Java classes.

Loïc Guillois answered Oct 04 '22 21:10


Here is a list of Java-based open source ETL libraries. I see you have already evaluated a few of them, but there are more. Also, this seems to be a duplicate of https://stackoverflow.com/questions/272517/please-recommend-a-powerful-java-based-etl-framework

Aravind Yarram answered Oct 04 '22 20:10