I have a system that is supposed to take large files containing documents and process these to split up the individual documents and create document objects to be persisted with JPA (or at least it is assumed in this question).
Each file contains anywhere from 1 to 100 000 documents. The files come in various types.
Now the biggest concern is that the specification forbids accessing local files, at least in the way that I'm used to.
I could save the files to a database table, but is that really a good way to do it? The files can be up to 2GB and accessing the files from the database would require that you download the whole file, either into memory or onto disk.
My first thought was to separate this process from the application server and use a more traditional approach, but I've been thinking about how to keep it on the application server for future purposes such as clustering.
My questions are basically
I sketch here a few more propositions and consider the following concerns:
With JCA
JCA connectors belong to the Java EE stack and permit inbound/outbound connectivity from/to the EJB world. JDBC and JMS are usually implemented as JCA connectors. An inbound JCA connector can use threads (through the Work abstraction) and transactions. It can then forward any processing to a message-driven bean (MDB).
With plain threads
We can achieve more or less the same as the JCA way, using threads that are launched from a web servlet context listener (or possibly an EJB timer).
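To make the plain-thread idea concrete, here is a minimal sketch using only the JDK. The class name InboxPoller, the inbox directory and the handler callback are all made up for illustration; in a real web application start() and stop() would be called from the listener's contextInitialized() and contextDestroyed(), and the handler would delegate to a session bean.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical poller: scans an inbox directory and hands new files to a handler. */
final class InboxPoller {
    private final File inbox;
    private final Consumer<File> handler;
    // Assumption: file names already handed over are remembered in memory only.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    InboxPoller(File inbox, Consumer<File> handler) {
        this.inbox = inbox;
        this.handler = handler;
    }

    /** Would be called from contextInitialized() in a servlet context listener. */
    void start(long periodSeconds) {
        scheduler.scheduleWithFixedDelay(this::pollSafely, 0, periodSeconds, TimeUnit.SECONDS);
    }

    /** Would be called from contextDestroyed(). */
    void stop() {
        scheduler.shutdownNow();
    }

    /** One polling pass; returns how many new files were handed to the handler. */
    int pollOnce() {
        File[] files = inbox.listFiles(File::isFile);
        if (files == null) {
            return 0; // directory missing or unreadable
        }
        Arrays.sort(files); // deterministic order, by path
        int handed = 0;
        for (File f : files) {
            if (seen.add(f.getName())) {
                handler.accept(f);
                handed++;
            }
        }
        return handed;
    }

    private void pollSafely() {
        try {
            pollOnce();
        } catch (RuntimeException e) {
            // Swallow so the schedule keeps running; a real system would log this.
        }
    }
}
```

Note that the in-memory set only prevents one poller from picking up the same file twice. With several nodes polling the same directory you would still need some shared locking or job-acquisition scheme.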
With JMS
To avoid the need for several concurrent polling threads and the problem of job acquisition/locking, the actual processing can be done asynchronously using JMS. JMS is also useful for splitting the processing into smaller tasks.
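As a rough illustration of splitting the work into smaller tasks, the sketch below groups documents into fixed-size batches and publishes one message per batch. It is not JMS code: a java.util.concurrent BlockingQueue stands in for the JMS queue, and BatchPublisher and the batch size are assumptions of mine. With real JMS, flush() would send one message per batch and an MDB would consume it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

/** Hypothetical helper: one queue entry (message) per batch of documents. */
final class BatchPublisher {
    private final BlockingQueue<List<String>> queue; // stand-in for a JMS queue
    private final int batchSize;
    private List<String> current = new ArrayList<>();

    BatchPublisher(BlockingQueue<List<String>> queue, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.queue = queue;
        this.batchSize = batchSize;
    }

    /** Buffers a document and publishes the batch once it is full. */
    void add(String document) {
        current.add(document);
        if (current.size() == batchSize) {
            flush();
        }
    }

    /** Publishes whatever is buffered; call once at end of file for the last partial batch. */
    void flush() {
        if (current.isEmpty()) {
            return;
        }
        queue.add(current); // throws IllegalStateException if a bounded queue is full
        current = new ArrayList<>();
    }
}
```

Each batch can then be processed in its own transaction, so a failure in one batch does not force the whole 100 000-document file to be redone.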
With ESB
Many projects have emerged in the past years to deal with integration: JBI, ServiceMix, OpenESB, Mule, Spring Integration, Java CAPS, BPEL. Some are technologies, some are platforms, and there is some overlap between them. They all come with a wealth of connectors to route, transform and orchestrate message flow. IMHO, the messages are supposed to be small pieces of information, and it may be hard to use these technologies to process your large data files. The Patterns of Enterprise Application Integration website is an excellent source of more information.
IMO, the approach that best fits the Java EE philosophy is JCA, but the effort to invest is relatively high. In your case, using plain threads that delegate further processing to an SLSB is probably the easiest solution. The JMS approach (close to the proposition of P. Thivent) can be interesting if the processing pipeline gets more complicated. Using an ESB seems overkill to me.
Is there a standard way or a recommended way of dealing with this in Java EE?
I'd use a real integration layer (as in EAI) for this purpose, running as an external process. Integration tools (ETL, EAI, ESB) are specifically designed to deal with... integration and many of them provide everything required out of the box (simplified version: transport, connectors, transformation, routing, security).
Basically, when dealing with files, a file connector is used to monitor a directory for incoming files, which are then parsed/split into messages (optionally applying some transformations) and sent to an endpoint for business processing.
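The parse/split step can be sketched roughly as follows, in plain Java rather than with any particular integration tool. I'm assuming, purely for illustration, that documents are separated by a line containing only "---"; DocumentSplitter and the endpoint callback are hypothetical names. The point is that the file is read as a stream, so a 2GB file never has to be held in memory as a whole.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

/** Hypothetical splitter: streams a source and emits one message per document. */
final class DocumentSplitter {
    // Assumption: a line containing only this marker separates two documents.
    static final String DELIMITER = "---";

    /** Reads the source line by line and returns the number of documents emitted. */
    static int split(Reader source, Consumer<String> endpoint) {
        int count = 0;
        StringBuilder doc = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.equals(DELIMITER)) {
                    if (doc.length() > 0) { // empty documents are skipped
                        endpoint.accept(doc.toString());
                        count++;
                        doc.setLength(0);
                    }
                } else {
                    if (doc.length() > 0) {
                        doc.append('\n');
                    }
                    doc.append(line);
                }
            }
            if (doc.length() > 0) { // last document has no trailing delimiter
                endpoint.accept(doc.toString());
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```

For a real file you would pass something like Files.newBufferedReader(path) as the source, and the endpoint would be whatever sends the message on for business processing.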
Have a look at Mule ESB for example (has a File Connector, supports many transports, can be run as a standalone process). Or maybe Spring Integration (coupled with Spring Batch?) which has File and JMS Adapters too. But I don't have much experience with it so I can't really say anything about it. Or, if you are rich, you could look at Tibco EMS, WebMethods, etc. Or build your own solution using some parsing library (e.g. jFFP or Flatworm).
Is there an application server specific way around this?
I'm not aware of anything like this.
Can you justify breaking this process out of the application server? And how would you design the communications channel between these two separate systems?
As I said, I'd use an external process for the file processing stuff (better suited) and send the content of the file as messages over JMS to the app server for the business processing (and thus benefit from Java EE features such as load balancing and transaction management).