
Traditional pdf indexing solution compared to graph-based version

I want to index an arbitrary directory containing PDF files (among other file types) against keywords stored in a list. I have a traditional solution, and I have heard that graph-based solutions using e.g. SimpleGraph could be more elegant, more efficient, and independent of the directory structure.

What would a graph-based solution (e.g. SimpleGraph) look like?

Traditional solution

// https://stackoverflow.com/a/14051951/1497139
List<File> pdfFiles = this.explorePath(TestPDFFiles.RFC_DIRECTORY, "pdf");
List<PDFFile> pdfs = this.getPdfsFromFileList(pdfFiles);
…
for (PDFFile pdf : pdfs) {
  // https://stackoverflow.com/a/9560307/1497139
  if (org.apache.commons.lang3.StringUtils.containsIgnoreCase(pdf.getText(), keyWord)) {
    // here we access by structure (early binding);
    // in the graph solution we access by name (late binding)
    foundList.add(pdf.file.getName());
  }
}
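
The explorePath helper is not shown here. A minimal sketch, assuming it only needs to walk the directory tree and keep files with a given extension, might look like the following. The class name ExplorePathSketch and the case-insensitive extension match are assumptions for illustration, not part of the original code.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ExplorePathSketch {

  /**
   * Hypothetical stand-in for explorePath: walk the tree below root and
   * return every regular file whose name ends with "." + extension,
   * ignoring case.
   */
  static List<File> explorePath(String root, String extension) throws IOException {
    String suffix = "." + extension.toLowerCase(Locale.ROOT);
    // Files.walk keeps directory handles open, so close the stream when done
    try (Stream<Path> paths = Files.walk(Paths.get(root))) {
      return paths
          .filter(Files::isRegularFile)
          .filter(p -> p.getFileName().toString()
              .toLowerCase(Locale.ROOT).endsWith(suffix))
          .map(Path::toFile)
          .sorted()
          .collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    // directory defaults to the current one if no argument is given
    String root = args.length > 0 ? args[0] : ".";
    for (File pdf : explorePath(root, "pdf")) {
      System.out.println(pdf.getPath());
    }
  }
}
```

This version still depends on the directory structure: the caller has to know that the input is a file tree, which is the early binding mentioned in the comment above.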
asked Aug 11 '18 13:08 by pdvsofismo



1 Answer

Basically, with SimpleGraph you'd use a combination of two modules:

  1. FileSystem
  2. PDFSystem

With the FileSystem module you collect the graph of files in the directory and filter it to include only files with the extension pdf. You then analyze those PDFs with the PDFSystem module to get the page/text structure. There is already a test case for this in the simplegraph-bundle module, showing how it works with some RFC PDFs as input:

TestPDFFiles.java

I have now added the indexing test; see the JUnit test case below.

The core functionality is taken from the old test, which searched for a single keyword. The keyword is now a parameter:

List<Object> founds = pdfSystem.g().V().hasLabel("page")
      .has("text", RegexPredicate.regex(".*" + keyWord + ".*")).in("pages")
      .dedup().values("name").toList();

This is a Gremlin query that does most of the work: it searches a whole tree of PDF files with a single call. I consider this more elegant, since you do not have to care about the structure of the input (tree, graph, file system, database, etc.).
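
The getIndex helper called in the test case is not shown in this answer. As a rough sketch of the idea, assuming it runs the single-keyword search once per keyword and collects the results in a map, the logic could look like the following. Everything here is a stand-in: the Page class replaces the "page" vertices with an in-memory list so that the example runs without SimpleGraph, the LinkedHashSet plays the role of dedup(), and the full-match regex semantics (plus DOTALL for multi-line page text) are assumptions suggested by the `.*` wrapping in the query. The keyword is passed through Pattern.quote, which the query above does not do, so that keywords containing regex metacharacters are matched literally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

class KeywordIndexSketch {

  /** Hypothetical stand-in for a "page" vertex: owning file name plus page text. */
  static final class Page {
    final String fileName;
    final String text;

    Page(String fileName, String text) {
      this.fileName = fileName;
      this.text = text;
    }
  }

  /** Same shape as the regex in the query, but with the keyword quoted. */
  static Pattern keywordPattern(String keyWord) {
    return Pattern.compile(".*" + Pattern.quote(keyWord) + ".*", Pattern.DOTALL);
  }

  /** Maps each keyword to the distinct names of the files that contain it. */
  static Map<String, List<String>> getIndex(List<Page> pages, String... keyWords) {
    Map<String, List<String>> index = new LinkedHashMap<>();
    for (String keyWord : keyWords) {
      Pattern pattern = keywordPattern(keyWord);
      // a file is listed once, even if several of its pages match
      Set<String> names = new LinkedHashSet<>();
      for (Page page : pages) {
        if (pattern.matcher(page.text).matches()) {
          names.add(page.fileName);
        }
      }
      index.put(keyWord, new ArrayList<>(names));
    }
    return index;
  }
}
```

Note that this match is case-sensitive, unlike the containsIgnoreCase check in the question; adding Pattern.CASE_INSENSITIVE to the flags would bring the two in line.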

JUnit Testcase

  /**
   * test for https://github.com/BITPlan/com.bitplan.simplegraph/issues/12
   */
  @Test
  public void testPDFIndexing() throws Exception {
    FileSystem fs = getFileSystem(RFC_DIRECTORY);
    int limit = Integer.MAX_VALUE;
    PdfSystem pdfSystem = getPdfSystemForFileSystem(fs, limit);
    Map<String, List<String>> index = this.getIndex(pdfSystem, "ARPA",
        "proposal", "plan");
    // debug=true;
    if (debug) {
      for (Entry<String, List<String>> indexEntry : index.entrySet()) {
        List<String> fileNameList = indexEntry.getValue();
        System.out.println(String.format("%15s=%3d %s", indexEntry.getKey(),
            fileNameList.size(), fileNameList));
      }
    }
    assertEquals(14, index.get("ARPA").size());
    assertEquals(9, index.get("plan").size());
    assertEquals(8, index.get("proposal").size());
  }
answered Nov 14 '22 11:11 by Wolfgang Fahl
