I am about to get involved in a NLP-related project and I need to use various libraries. Some are in java, others in C/C++ (for tasks that require more speed) and finally some are in Python. I was thinking of using Python as the "glue" and create wrapper-classes for every task that I want to do that relies on a different language. In order to do that, the wrapper class, for example, would execute the java program and communicate with it using pipes. My questions are:
Do you think that would work for cpu-demanding and highly repetitive tasks? Or would the overhead added by the pipe-communication be too heavy?
Is there any other (preferably simple) architecture that you would suggest?
I would simply advise not doing this.
Don't implement stuff in C/C++ "for speed". The performance benefit is not likely to be as great as you expect; e.g. compared with implementing in Java using "best practice" design and performance techniques.
Don't try and glue lots of languages together. You are setting yourself up for lots of portability issues, difficulties in debugging, and reliability issues; e.g. due to C / C++ bugs crashing the JVM. In addition, there are performance overheads in bridging between languages, and there can be unexpected bottlenecks. (For instance, you may find that your C/C++ has to be run single-threaded due to threading issues, and that you therefore can't get the benefit of Java multi-threading on a typically multi-core system.)
Instead, I advise you to look for libraries that allow you to implement the entire application in one language. If that is not possible, design it so that the different language components are different executables / processes, communicating via some kind of RPC, messaging, or whatever.
Whether or not you'd have problems communicating over pipes / sockets has nothing to do with how CPU intensive the tasks are, but how frequently you'd need to send information between the processes and how much data they need to send. Setting up threads to do your communication will have little processing overhead.
You can probably automatically wrap the C/C++ code with Python (SWIG, ctypesgen, Boost.Python), so the only glue you'll have to write yourself would then be talking to Java.
You could also do it the other way -- run the Python code in the JVM with Jython so the Python and Java code are together, then talk to the C/C++ from there.
You should take a look at Apache UIMA. It is designed exactly for this. From the project website:
The Frameworks run the components, and are available for both Java and C++. The Java Framework supports running both Java and non-Java components (using the C++ framework). The C++ framework, besides supporting annotators written in C/C++, also supports Perl, Python, and TCL annotators.
UIMA can manage pipes and annotators and is built to scale.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With