How to parse data in Talend with Java (coming from a previously produced .txt file)?

I have a process in Talend which fetches the search results of a page and writes the HTML into files, as seen here:

[Image: screenshot of the Talend job]

Initially I had a two-step process, with the data parsed out of the HTML files in a separate Java program. It works and writes the data to a MySQL database. Here is the code that does the parsing (I'm a beginner, sorry for the lack of elegance):

    package org.jsoup.examples;

    import java.io.*;

    import org.jsoup.*;
    import org.jsoup.nodes.*;
    import org.jsoup.select.Elements;

    public class parse2 {

        static parse2 parseIt2 = new parse2();
        String companyName = "Platzhalter";
        String jobTitle = "Platzhalter";
        String location = "Platzhalter";
        String timeAdded = "Platzhalter";

        public static void main(String[] args) throws IOException {
            parseIt2.getData();
        }

        public void getData() throws IOException {
            Document document = Jsoup.parse(new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"), "utf-8");
            Elements elements = document.select(".joblisting");
            for (Element element : elements) {
                // Parse Data into Elements
                Elements jobTitleElement = element.select(".job_title span");
                Elements companyNameElement = element.select(".company_name span[itemprop=name]");
                Elements locationElement = element.select(".locality span[itemprop=addressLocality]");
                Elements dateElement = element.select(".job_date_added [datetime]");

                // Strip Data from unnecessary tags
                String companyName = companyNameElement.text();
                String jobTitle = jobTitleElement.text();
                String location = locationElement.text();
                String timeAdded = dateElement.attr("datetime");

                System.out.println("Firma:\t" + companyName + "\t" + jobTitle + "\t in:\t" + location + " \t Erstellt am \t" + timeAdded);
            }
        }
    }

Now I want to run the whole process end-to-end in Talend, and I was assured this works. I tried this (which looks quite shady to me):

[Image: screenshot of the Java component settings in Talend]

Basically I put all imports in "advanced settings" and the code in the "basic settings" section. The importLibrary component is meant to load the jsoup parsing library as well as the MySQL connector (though I might do the connection with Talend tools instead).

Obviously this isn't working. I also tried stripping the class and method declarations from the base code, and that was even worse. Can you help me get the generated .txt files parsed with Java here?

EDIT: Here is the link to the Talend job: http://www.share-online.biz/dl/8M5MD99NR1

EDIT2: I changed the code to the one I tried in JavaFlex, but it didn't work (the import part in the "start" part of the code, the rest in "body/main" and nothing in "end").

ZedBrannigan asked Jul 24 '14 13:07



1 Answer

This is a Talend-specific problem. In your code, use fully qualified names, i.e. class names including their packages. For your document parsing, for example, you can use:

    org.jsoup.nodes.Document document = org.jsoup.Jsoup.parse(new java.io.File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"), "utf-8");
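
To illustrate why fully qualified names help, here is a minimal sketch of the pattern. It assumes that the code in a Talend Java component ends up inside a generated method body, where `import` lines are not allowed. The class `QualifiedNamesSketch` and the helper `extractDatetimes` are hypothetical names, and a JDK regex stands in for jsoup only so that the sketch compiles without extra libraries; in the actual job the qualified calls would be `org.jsoup.Jsoup.parse(...)` and `select(...)` as in the question.

```java
// No import statements on purpose: every type is written with its package,
// which is the pattern recommended above for code living in a method body.
public class QualifiedNamesSketch {

    // Hypothetical helper standing in for the body of the Java component.
    static java.util.List<String> extractDatetimes(java.io.File input)
            throws java.io.IOException {
        String html = new String(
                java.nio.file.Files.readAllBytes(input.toPath()),
                java.nio.charset.StandardCharsets.UTF_8);

        // Assumption: regex is a JDK-only stand-in for the jsoup selector
        // ".job_date_added [datetime]" used in the question.
        java.util.regex.Matcher matcher = java.util.regex.Pattern
                .compile("datetime=\"([^\"]*)\"")
                .matcher(html);

        java.util.List<String> result = new java.util.ArrayList<>();
        while (matcher.find()) {
            result.add(matcher.group(1)); // value of the datetime attribute
        }
        return result;
    }

    public static void main(String[] args) throws java.io.IOException {
        // Write a tiny sample file so the sketch runs on its own.
        java.io.File sample = java.io.File.createTempFile("joblisting", ".txt");
        sample.deleteOnExit();
        java.nio.file.Files.write(sample.toPath(),
                "<div class=\"joblisting\"><span class=\"job_date_added\"><time datetime=\"2014-07-24\"></time></span></div>"
                        .getBytes(java.nio.charset.StandardCharsets.UTF_8));

        System.out.println(extractDatetimes(sample)); // prints [2014-07-24]
    }
}
```

Only the statements inside `extractDatetimes` would go into the component; the class and `main` wrapper exist here just to make the sketch runnable outside Talend.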
Maouven answered Oct 21 '22 06:10