I'm looking for an embeddable Java ETL, i.e., an Extract Transform Load engine that can be called from Java code.
I'm finding it surprisingly hard to find a suitable one.
I'm mainly looking at loading delimited text files into database tables, with some minor transforms along the way.
I'd like the following features:
CloverETL allows mapping to be specified in XML, but database connections must be either JNDI names or a properties file specifying driverClass, url, dbusername, password, etc. Since I already have javax.sql.Datasource
s set up by my dependency injection framework, properties files seem painful and non-robust, especially if I want this to work in several environments (dev, test, prod).
KETL tells me that "We are currently in the process of completely overhauling our documentation for KETL™. Because of this, only the installation guide has been updated." Honest, but not helpful.
Octopus is now "http://www.together.at/prod/database/tdt", which is "under construction".
Pentaho seems to use the same "specify driverClass" style that CloverETL does, rather that using a datasource, but Pentaho's documentation for calling their engine from java code is just difficult to find.
Basically I'd really like to be able to do this pseudo-code:
extractTransformLoad(
getInputFile( "input.csv" ) ,
getXMLMapping( "myMappingFile.xml") ,
new DatabaseWriter( getDatasource() );
Any suggestions?
Some of the important ETL testing challenges are: Unavailability of inclusive test bed at times. Lack of proper flow of business information. Loss of data might happen during the ETL process. Existence of several ambiguous software requirements.
Java is one of the most popular and powerful scripting languages. And there's an abundance of open source and paid ETLs to choose from that work with Java code. You won't have any trouble finding one that meets your specific data project needs.
A no-code ETL platform requires little to no coding. The process is easy to understand by the clients and easy to maintain. It is scalable and saves a lot of time and money for the companies handling real-time datasets. The logic is reusable for any data source and there are custom data manipulation features.
ETL is outdated. It works with traditional data center infrastructures, which cloud technologies are already replacing. The loading time takes hours, even for businesses with data sets that are just a few terabytes in size. ELT is the future of data warehousing and efficiently utilizes current cloud technologies.
Disclosure: I'm the author of Scriptella ETL, but I believe this tool might be useful for your case.
It's a lightweight open source ETL with a one-liner integration with Java. It also supports Spring Framework and comes with built-in drivers for CSV, text, XML, Excel and other data-sources.
<!DOCTYPE etl SYSTEM "http://scriptella.org/dtd/etl.dtd">
<etl>
<connection id="in" driver="csv" url="data.csv" />
<connection id="out" driver="oracle" url="jdbc:oracle:thin:@localhost:1521:ORCL"
classpath="ojdbc14.jar" user="scott" password="tiger" />
<!-- Copy all CSV rows to a database table -->
<query connection-id="in">
<!-- Empty query means select all columns -->
<script connection-id="out">
INSERT INTO Table_Name VALUES (?id,?priority, ?summary, ?status)
</script>
</query>
</etl>
// Execute etl.xml file
EtlExecutor.newExecutor(new File("etl.xml")).execute();
scriptella [file_name]
Use "spring"
driver and the name of the bean to references data-sources. Example:
<connection id="spring" driver="spring" url="datasourceBeanName" />
Add EtlExecutorBean
to the application context in order to execute the job:
<bean id="createDb" class="scriptella.driver.spring.EtlExecutorBean">
<property name="configLocation" value="create-db.etl.xml" />
<property name="progressIndicator"><ref local="progress" /></property>
<property name="autostart" value="true" /> <!-- Etl will be run during app context initialization -->
</bean>
For additional details see the Spring example.
Do you know Talend?
It's a tool based on Eclipse (Talend Open Studio), but you can use it directly in Java by writing your own code or by exporting jobs to Java classes.
Here is a list of all the java based open source ETL libraries. I see you have evaluated few of them already but there are more. Also this seems to be a duplicate of https://stackoverflow.com/questions/272517/please-recommend-a-powerful-java-based-etl-framework
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With