I am looking to import a lot of filenames into a graph database, using Neo4j. The data is from an external source and available in a CSV file. I'd like to create a tree structure from the data, so that I can easily 'navigate' the structure in queries later on (e.g. find all files underneath a certain directory, all files that occur in multiple directories, etc.).
So, given the example input:
/foo/bar/example.txt
/bar/baz/another.csv
/example.txt
/foo/bar/onemore.txt
I'd like to create the following graph:
( / ) <-[:in]- ( foo ) <-[:in]- ( bar ) <-[:in]- ( example.txt )
                                        <-[:in]- ( onemore.txt )
      <-[:in]- ( bar ) <-[:in]- ( baz ) <-[:in]- ( another.csv )
      <-[:in]- ( example.txt )
(where each node label is actually an attribute, e.g. path:).
I've been able to achieve the desired effect when using a fixed number of directory levels; for example, when each file is three levels deep, I can create a CSV file with 4 columns:
dir_a,dir_b,dir_c,file
foo,bar,baz,example.txt
foo,bar,ban,example.csv
foo,bar,baz,another.txt
And import it using a cypher query:
LOAD CSV WITH HEADERS FROM "file:///sample.csv" AS row
MERGE (dir_a:Path {name: row.dir_a})
MERGE (dir_b:Path {name: row.dir_b}) <-[:in]- (dir_a)
MERGE (dir_c:Path {name: row.dir_c}) <-[:in]- (dir_b)
MERGE (:Path {name: row.file}) <-[:in]- (dir_c)
But I'd like to have a general solution that works for any level of sub-directories (and combinations of levels in one dataset). Note that I am able to pre-process my input if necessary, so I can create any desirable structure in the input CSV file.
I've looked at gists and plugins, but cannot seem to find anything that works. I think/hope that I should be able to do something with the split() function, i.e. use split(row.path, '/') to get a list of path elements, but I do not know how to process this list into a chain of MERGE operations.
Here is a first cut at something more generalized.
The premise is that you can split the fully qualified path into its components, and then split the path again on each component to reconstruct the fully qualified path of that individual component. Use this fully qualified path as the key to merge on, and set the component's name after the merge. For anything that is not at the root level, find the parent of the component and create the relationship back to it. This will break down if there are duplicate component names in a fully qualified path.
First, I created a uniqueness constraint on fq_path:
create constraint on (c:Component) assert c.fq_path is unique;
Here is the load statement.
load csv from 'file:///path.csv' as line
// split the fully qualified path on '/'; index 0 is the empty string
// before the leading '/', which stands for the root
with line[0] as line, split(line[0],'/') as path_components
unwind range(0, size(path_components)-1) as idx
with case
       when idx = 0 then '/'
       else path_components[idx]
     end as component
   // everything before the component's name, plus the name itself
   , case
       when idx = 0 then '/'
       else split(line, path_components[idx])[0] + path_components[idx]
     end as fq_path
   // the same prefix without its trailing '/'
   , case
       when idx = 0 then null
       when idx = 1 then '/'
       else substring(split(line, path_components[idx])[0],0,size(split(line, path_components[idx])[0])-1)
     end as parent
   // an empty list for the root, so the foreach below is skipped
   , case
       when idx = 0 then []
       else [1]
     end as find_parent
merge (new_comp:Component {fq_path: fq_path})
set new_comp.name = component
foreach ( y in find_parent |
  merge (theparent:Component {fq_path: parent} )
  merge (theparent)<-[:IN]-(new_comp)
)
return *
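Since the input can be pre-processed, the duplicate-name limitation can be sidestepped by computing fq_path and parent outside of Cypher. Below is a minimal Python sketch of that idea, not part of the load statement above. The function names `cypher_style_fq_path` and `path_rows` are made up for illustration, and the sketch assumes absolute POSIX-style paths with a single leading '/'. The first function mimics the split-on-component-name expression to show where it goes wrong; the second builds each prefix from the path itself.

```python
from pathlib import PurePosixPath


def cypher_style_fq_path(line, component):
    # Mimics split(line, component)[0] + component from the load statement:
    # the text before the FIRST occurrence of the component, plus the component.
    return line.split(component)[0] + component


def path_rows(line):
    """Yield (fq_path, name, parent) for '/' and every component of a path."""
    path = PurePosixPath(line)
    # path.parents runs from the immediate parent up to '/', so reverse it
    # to walk from the root down to the file.
    chain = list(path.parents)[::-1] + [path]
    for node in chain:
        if node == node.parent:
            # The root is its own parent and has an empty name.
            yield ("/", "/", None)
        else:
            yield (str(node), node.name, str(node.parent))


if __name__ == "__main__":
    # Both 'foo' components collapse onto the same key with the split trick.
    print(cypher_style_fq_path("/foo/foo/a.txt", "foo"))
    # Prefix-based rows keep '/foo' and '/foo/foo' apart.
    for row in path_rows("/foo/foo/a.txt"):
        print(row)
```

Rows like these could be written to a three-column CSV, so that the import only needs to merge on fq_path and link each node to its parent, without any string splitting in the query.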
If you want to differentiate between files and folders, here are two queries you can run afterwards to set an additional label on the respective nodes.
Find the files and set them as File
// find the last Components in a tree (no inbound IN)
// and set them as Files
match (c:Component)
where not (c)<-[:IN]-(:Component)
set c:File
return c
Find the folders and set them as Folder
// find all Components with an inbound IN
// and set them as Folders
match (c:Component)
where (c)<-[:IN]-(:Component)
set c:Folder
return c
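If the paths are pre-processed anyway, the same rule (a component with something inside it is a folder, everything else is a file) can be applied before the import. The following Python sketch is only an illustration of that rule; `label_components` is a hypothetical helper name, and the sketch assumes absolute POSIX-style paths. As with the two queries, an empty directory that appears in the input would end up labelled as a file.

```python
from pathlib import PurePosixPath


def label_components(paths):
    """Return {fq_path: 'Folder' or 'File'} for every component in paths."""
    nodes, folders = set(), set()
    for line in paths:
        path = PurePosixPath(line)
        nodes.add(str(path))
        # Every ancestor contains at least one component, which corresponds
        # to having an inbound IN relationship in the graph.
        folders.update(str(ancestor) for ancestor in path.parents)
    nodes |= folders
    return {n: ("Folder" if n in folders else "File") for n in sorted(nodes)}


if __name__ == "__main__":
    sample = [
        "/foo/bar/example.txt",
        "/bar/baz/another.csv",
        "/example.txt",
        "/foo/bar/onemore.txt",
    ]
    for fq_path, label in label_components(sample).items():
        print(label, fq_path)
```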