
What is a good way to import a directory/file structure into Neo4j from a CSV file?

I am looking to import a lot of filenames into a graph database, using Neo4j. The data is from an external source and is available in a CSV file. I'd like to create a tree structure from the data, so that I can easily 'navigate' the structure in queries later on (i.e. find all files underneath a certain directory, all files that occur in multiple directories, etc.).

So, given the example input:

/foo/bar/example.txt
/bar/baz/another.csv
/example.txt
/foo/bar/onemore.txt

I'd like to create the following graph:

( / ) <-[:in]- ( foo ) <-[:in]- ( bar ) <-[:in]- ( example.txt )
                                        <-[:in]- ( onemore.txt )
      <-[:in]- ( bar ) <-[:in]- ( baz ) <-[:in]- ( another.csv )
      <-[:in]- ( example.txt )

(where each node label is actually an attribute, e.g. path:).

I've been able to achieve the desired effect when using a fixed number of directory levels; for example, when each file is three levels deep, I could create a CSV file with 4 columns:

dir_a,dir_b,dir_c,file
foo,bar,baz,example.txt
foo,bar,ban,example.csv
foo,bar,baz,another.txt

And import it using a Cypher query:

LOAD CSV WITH HEADERS FROM "file:///sample.csv" AS row
  MERGE (dir_a:Path {name: row.dir_a})
  MERGE (dir_b:Path {name: row.dir_b}) -[:in]-> (dir_a)
  MERGE (dir_c:Path {name: row.dir_c}) -[:in]-> (dir_b)
  MERGE      (:Path {name: row.file})  -[:in]-> (dir_c)

But I'd like to have a general solution that works for any level of sub-directories (and combinations of levels in one dataset). Note that I am able to pre-process my input if necessary, so I can create any desirable structure in the input CSV file.

I've looked at gists and plugins, but cannot seem to find anything that works. I think/hope that I should be able to do something with the split() function, i.e. use split(row.path, '/') to get a list of path elements, but I do not know how to process this list into a chain of MERGE operations.

Remco van Engelen asked Jul 28 '16 15:07



1 Answer

Here is a first cut at something more generalized.

The premise is that you can split the fully qualified path into its components, then split the original path on each component to reconstruct the fully qualified path of that individual component. Use this fully qualified path as the key to merge nodes on, and set the component name after the merge. If a component is not at the root level, find its parent and create the relationship back to it. This will break down if there are duplicate component names in a fully qualified path.

First, I created a uniqueness constraint on fq_path:
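Since the question notes that the input can be pre-processed, the same decomposition can also be sketched outside Neo4j. The Python below is only an illustration, not part of the original answer: the function names `decompose` and `to_csv` and the column names `fq_path`, `name`, `parent` are assumptions made for this example. It builds each fully qualified path by joining the leading components rather than splitting on the component name, so repeated names within one path are not a problem.

```python
import csv
import io
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[str, str, Optional[str]]


def decompose(path: str) -> Iterator[Row]:
    """Yield (fq_path, name, parent) for the root and each component of an absolute path."""
    yield ("/", "/", None)  # the root has no parent
    parts = [p for p in path.split("/") if p]  # drop the empty string before the leading '/'
    parent = "/"
    for i, name in enumerate(parts):
        fq_path = "/" + "/".join(parts[: i + 1])  # join the first i+1 components
        yield (fq_path, name, parent)
        parent = fq_path


def to_csv(paths: Iterable[str]) -> str:
    """Return CSV text with one row per distinct fq_path (hypothetical column layout)."""
    seen = set()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["fq_path", "name", "parent"])
    for path in paths:
        for fq_path, name, parent in decompose(path):
            if fq_path not in seen:
                seen.add(fq_path)
                writer.writerow([fq_path, name, parent or ""])
    return out.getvalue()


if __name__ == "__main__":
    print(to_csv(["/foo/bar/example.txt", "/bar/baz/another.csv", "/example.txt"]))
```

A file in this shape has a fixed number of columns regardless of directory depth, so each row could be merged on fq_path and linked to the node whose fq_path equals the parent column.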

create constraint on (c:Component) assert c.fq_path is unique;

Here is the load statement.

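load csv from 'file:///path.csv' as line
// split each path into components; element 0 is the empty string before the leading '/'
with line[0] as line, split(line[0],'/') as path_components
unwind range(0, size(path_components)-1) as idx
with case
       when idx = 0 then '/'
       else path_components[idx]
     end as component
   // fully qualified path: everything up to and including this component
   , case
       when idx = 0 then '/'
       else split(line, path_components[idx])[0] + path_components[idx]
     end as fq_path
   // parent path: the text before this component, minus the trailing '/'
   , case
       when idx = 0 then null
       when idx = 1 then '/'
       else substring(split(line, path_components[idx])[0], 0, size(split(line, path_components[idx])[0])-1)
     end as parent
   // one-element list for non-root components, so the foreach below runs exactly once
   , case
       when idx = 0 then []
       else [1]
     end as find_parent
merge (new_comp:Component {fq_path: fq_path})
set new_comp.name = component
// foreach acts as a conditional: link to the parent unless this is the root
foreach ( y in find_parent |
  merge (theparent:Component {fq_path: parent} )
  merge (theparent)<-[:IN]-(new_comp)
)
return *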
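To see the duplicate-name limitation concretely, the fq_path expression can be emulated outside Neo4j. This is a rough sketch, assuming Python's `str.split` behaves like Cypher's split() for these inputs; the function name `fq_path_by_name_split` is made up for the illustration.

```python
def fq_path_by_name_split(line: str, idx: int) -> str:
    """Mimic the expression split(line, component)[0] + component from the load statement."""
    components = line.split("/")
    if idx == 0:
        return "/"
    component = components[idx]
    # Everything before the FIRST occurrence of the component text, plus the component.
    return line.split(component)[0] + component


if __name__ == "__main__":
    # Distinct component names: the reconstructed path is the intended one.
    print(fq_path_by_name_split("/foo/bar/example.txt", 2))
    # The last component repeats an earlier name, so the split stops too early.
    print(fq_path_by_name_split("/a/b/a", 3))
```

In the second call the component text `a` first occurs right after the leading slash, so the reconstructed path is `/a` rather than `/a/b/a`. The same thing happens when a component merely appears as a substring of an earlier one, as in `/foobar/foo`.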

If you want to differentiate between files and folders, here are a few queries you can run afterwards to set another label on the respective nodes.

Find the files and set them as File:

// find the last Components in a tree (no inbound IN)
// and set them as Files
match (c:Component)
where not (c)<-[:IN]-(:Component)
set c:File
return c

Find the folders and set them as Folder:

// find all Components with an inbound IN
// and set them as Folders
match (c:Component)
where  (c)<-[:IN]-(:Component)
set c:Folder
return c
Dave Bennett answered Oct 21 '22 09:10