
Process files in Java EE [closed]

I have a system that takes large files containing documents, splits them into individual documents, and creates document objects to be persisted with JPA (or at least that is assumed in this question).

Each file contains between 1 and 100 000 documents. The files come in various types:

  • Compressed
    • Zip
    • Tar + gzip
    • Gzip
  • Plain-text
  • XML
  • PDF

The biggest concern is that the specification forbids accessing local files, at least in the way that I'm used to.

I could save the files to a database table, but is that really a good way to do it? The files can be up to 2GB, and accessing a file from the database would require downloading the whole file, either into memory or onto disk.

My first thought was to separate this process from the application server and use a more traditional approach, but I've been thinking about how to keep it on the application server for future needs such as clustering.

My questions are basically:

  1. Is there a standard way or a recommended way of dealing with this in Java EE?
  2. Is there an application server specific way around this?
  3. Can you justify breaking this process out of the application server? And how would you design the communications channel between these two separate systems?

Peter Lindqvist asked Feb 28 '23 22:02


2 Answers

Here are a few more options, sketched with the following concerns in mind:

  • scalability (file size, clustering, etc.)
  • batch architecture (job recovery, error handling, monitoring, etc.)
  • compliance with J2EE

With JCA

JCA connectors belong to the Java EE stack and permit inbound/outbound connectivity from/to the EJB world. JDBC and JMS are usually implemented as JCA connectors. An inbound JCA connector can use threads (through the work management abstraction) and transactions. It can then forward any processing to a message-driven bean (MDB).

  • write a JCA connector that polls for new files, processes them, and delegates further processing to a message-driven bean in a synchronous way
  • the MDB can then persist the information in the database with JPA
  • the JCA connector has control over the transaction, and several MDB invocations can be in the same transaction
  • the file system is not transactional, so you will need to figure out how to deal with errors such as faulty input files
  • you can probably use streaming (InputStream) all along the pipeline
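
To illustrate the streaming point, here is a minimal sketch using only java.util.zip. It is not JCA code: ZipDocumentSplitter and the handler callback are hypothetical names, and the handler merely stands in for whatever component (an MDB, for instance) would consume each document. The archive is read sequentially, so only one entry is in flight at a time and the whole file never has to sit in memory.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.function.BiConsumer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: streams the entries of a zip archive one at a time. */
final class ZipDocumentSplitter {
    private ZipDocumentSplitter() {}

    /**
     * Reads the archive sequentially and passes each entry name and its
     * content stream to the handler. Returns the number of entries handled.
     */
    static int split(InputStream archive, BiConsumer<String, InputStream> handler) {
        int count = 0;
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Shield the archive stream: a handler that closes its
                // document must not close the whole archive.
                InputStream document = new FilterInputStream(zip) {
                    @Override
                    public void close() {
                        // intentionally a no-op
                    }
                };
                handler.accept(entry.getName(), document);
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```

A caller would pass whatever InputStream it has for the archive together with a handler that parses a single document and persists it. Anything the handler leaves unread in an entry is skipped when the next entry is requested.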

With plain threads

We can achieve more or less the same as with JCA by using threads launched from a servlet context listener (or possibly an EJB timer).

  • The thread polls for new files; if a file is found, it processes it and delegates further processing to a regular SLSB in a synchronous way.
  • Threads in the web container have access to UserTransaction and can control the transaction
  • The EJB can be local so that the InputStream is passed by reference
  • The web module and the EJB can be deployed together in an EAR
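
The polling loop could look roughly like the sketch below, written against plain java.nio.file so it stays self-contained. DirectoryPoller, the three directory names, and the processor callback are all hypothetical; the callback stands in for the synchronous call to a local SLSB, and transaction handling via UserTransaction is deliberately left out. Files that fail are moved to a separate directory, which is one simple way to cope with the file system not being transactional.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical poller: processes files from an inbox and moves them away. */
final class DirectoryPoller {
    private final Path inbox;
    private final Path done;
    private final Path failed;
    private final Consumer<InputStream> processor; // stand-in for a local SLSB call

    DirectoryPoller(Path inbox, Path done, Path failed, Consumer<InputStream> processor) {
        this.inbox = inbox;
        this.done = done;
        this.failed = failed;
        this.processor = processor;
    }

    /** Handles every regular file currently in the inbox; returns how many were seen. */
    int pollOnce() {
        int handled = 0;
        try (DirectoryStream<Path> files =
                 Files.newDirectoryStream(inbox, path -> Files.isRegularFile(path))) {
            for (Path file : files) {
                Path target = done;
                try (InputStream in = Files.newInputStream(file)) {
                    processor.accept(in);
                } catch (IOException | RuntimeException e) {
                    target = failed; // keep faulty input for later inspection
                }
                // The stream is closed by now, so the file can be moved safely.
                Files.move(file, target.resolve(file.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
                handled++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return handled;
    }

    /** Starts periodic polling; an exception escaping pollOnce stops the schedule. */
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(this::pollOnce, 0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

In a servlet context listener, start would be called when the context is initialized and the returned scheduler shut down when it is destroyed. With a single polling thread there is no contention over files; several pollers on the same directory would need some form of locking.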

With JMS

To avoid the need for several concurrent polling threads and the problem of job acquisition/locking, the actual processing can be done asynchronously using JMS. JMS can also be useful for splitting the processing into smaller tasks.

  • A periodic task polls for new files. If a file is found, a JMS message is queued.
  • When the JMS message is delivered, the file is read and processed, and the information is persisted in the database with JPA
  • if JMS processing fails, the app server may retry automatically or put the message in the dead message queue
  • monitoring/error handling is more complicated
  • you can probably use streaming
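
The retry and dead message queue behaviour can be pictured with a small in-memory model. This is explicitly not the JMS API: RedeliveryQueue, its method names, and the maxAttempts setting are made up for illustration, and a real broker applies its own redelivery policy. The listener stands in for an MDB's onMessage and signals failure by throwing.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

/** In-memory model of redelivery with a dead message queue (not JMS itself). */
final class RedeliveryQueue {
    /** A queued file name plus the number of delivery attempts so far. */
    private static final class Delivery {
        final String fileName;
        int attempts;

        Delivery(String fileName) {
            this.fileName = fileName;
        }
    }

    private final Queue<Delivery> pending = new ArrayDeque<>();
    private final List<String> deadLetters = new ArrayList<>();
    private final int maxAttempts;

    RedeliveryQueue(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** Called by the periodic task when it finds a new file. */
    void enqueue(String fileName) {
        pending.add(new Delivery(fileName));
    }

    /** Delivers until the queue is empty; failed messages are retried or dead-lettered. */
    void drain(Consumer<String> listener) {
        Delivery delivery;
        while ((delivery = pending.poll()) != null) {
            delivery.attempts++;
            try {
                listener.accept(delivery.fileName);
            } catch (RuntimeException e) {
                if (delivery.attempts < maxAttempts) {
                    pending.add(delivery); // redeliver later
                } else {
                    deadLetters.add(delivery.fileName);
                }
            }
        }
    }

    /** Messages that exhausted their delivery attempts. */
    List<String> deadLetters() {
        return deadLetters;
    }
}
```

The model also shows why monitoring gets harder: a failure no longer surfaces where the file was discovered, so something has to watch the dead message queue.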

With ESB

Many projects have emerged in the past years to deal with integration: JBI, ServiceMix, OpenESB, Mule, Spring Integration, Java CAPS, BPEL. Some are technologies, some are platforms, and there is some overlap between them. They all come with a wagonload of connectors to route, transform and orchestrate message flows. IMHO, messages are supposed to be small pieces of information, and it may be hard to use these technologies to process your large data files. The Enterprise Integration Patterns website is an excellent source of more information.

IMO, the approach that best fits the Java EE philosophy is JCA, but the effort to invest is relatively high. In your case, using a plain thread that delegates further processing to an SLSB is maybe the easiest solution. The JMS approach (close to the proposition of P. Thivent) can be interesting if the processing pipeline gets more complicated. Using an ESB seems overkill to me.

ewernli answered Mar 05 '23 14:03


Is there a standard way or a recommended way of dealing with this in Java EE?

I'd use a real integration layer (as in EAI) for this purpose, running as an external process. Integration tools (ETL, EAI, ESB) are specifically designed to deal with... integration and many of them provide everything required out of the box (simplified version: transport, connectors, transformation, routing, security).

Basically, when dealing with files, a file connector monitors a directory for incoming files, which are then parsed/split into messages (optionally applying some transformations) and sent to an endpoint for business processing.

Have a look at Mule ESB for example (has a File Connector, supports many transports, can be run as a standalone process). Or maybe Spring Integration (coupled with Spring Batch?) which has File and JMS Adapters too. But I don't have much experience with it so I can't really say anything about it. Or, if you are rich, you could look at Tibco EMS, WebMethods, etc. Or build your own solution using some parsing library (e.g. jFFP or Flatworm).

Is there an application server specific way around this?

I'm not aware of anything like this.

Can you justify breaking this process out of the application server? And how would you design the communications channel between these two separate systems?

As I said, I'd use an external process for the file processing stuff (better suited) and send the content of the file as messages over JMS to the app server for the business processing (and thus benefit from Java EE features such as load balancing and transaction management).

Pascal Thivent answered Mar 05 '23 15:03