 

Is there a way to convert CSV columns into hierarchical relationships?

I have a CSV of 7 million biodiversity records in which the taxonomic levels are stored as columns. For instance:

RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
4,Plantae,nan,Magnoliopsida,Fabales,Fabaceae,Phaseoulus,Phaseolus vulgaris

I want to create a visualization in D3, but the data must be in a network (hierarchical) format, where each distinct value in a column is a child of the corresponding value in the previous column. I need to go from the CSV to something like this:

{
  name: 'Animalia',
  children: [{
    name: 'Chordata',
    children: [{
      name: 'Mammalia',
      children: [{
        name: 'Primates',
        children: 'Hominidae'
      }, {
        name: 'Carnivora',
        children: 'Canidae'
      }]
    }]
  }]
}

I haven't come up with a way to do this without using a thousand for loops. Does anybody have a suggestion on how to create this network in either Python or JavaScript?

asked Nov 12 '19 22:11 by Andres Camilo Zuñiga Gonzalez





2 Answers

To create the exact nested object you want, we'll use a mix of pure JavaScript and a D3 method named d3.stratify. However, keep in mind that 7 million rows (please see the postscript below) is a lot to compute.

It's very important to mention that, for this proposed solution, you'll have to separate the Kingdoms into different data arrays (for instance, using Array.prototype.filter). This restriction exists because we need a single root node, and in the Linnaean taxonomy there is no relationship between Kingdoms (unless you create "Domain" as a top rank, which would be the root for all eukaryotes, but then you'd have the same problem for Archaea and Bacteria).
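For readers preparing the data in Python rather than in the browser, the split by kingdom could look roughly like the following. This is a minimal sketch and not part of the original answer: the helper name `split_by_kingdom` and the sample rows are hypothetical, and it assumes the column is called `kingdom`, as in the question's CSV.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample; real data would be read from the CSV file instead.
SAMPLE_CSV = """RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis latrans
3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
"""


def split_by_kingdom(lines):
    """Group CSV rows into one list per kingdom, so each group has one root."""
    groups = defaultdict(list)
    for row in csv.DictReader(lines):
        groups[row["kingdom"]].append(row)
    return dict(groups)


rows_by_kingdom = split_by_kingdom(io.StringIO(SAMPLE_CSV))
for kingdom, rows in rows_by_kingdom.items():
    print(kingdom, len(rows))
```

Each list in `rows_by_kingdom` can then be processed on its own, which gives every resulting hierarchy exactly one root.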

So, suppose you have this CSV (I added some more rows) with just one Kingdom:

RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis latrans
3,Animalia,Chordata,Mammalia,Cetacea,Delphinidae,Tursiops,Tursiops truncatus
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Pan,Pan paniscus

Based on that CSV, we'll create an array named tableOfRelationships which, as the name implies, holds the relationships between the ranks:

const data = d3.csvParse(csv);

// Every column except the record ID is a taxonomic rank, in hierarchical order
const taxonomicRanks = data.columns.filter(d => d !== "RecordID");

const tableOfRelationships = [];

data.forEach(row => {
  taxonomicRanks.forEach((d, i) => {
    // Add each name only once; its parent is the value in the previous rank's column
    if (!tableOfRelationships.find(e => e.name === row[d])) tableOfRelationships.push({
      name: row[d],
      parent: row[taxonomicRanks[i - 1]] || null
    })
  })
});

For the data above, this is the tableOfRelationships:

+---------+----------------------+---------------+
| (Index) |         name         |    parent     |
+---------+----------------------+---------------+
|       0 | "Animalia"           | null          |
|       1 | "Chordata"           | "Animalia"    |
|       2 | "Mammalia"           | "Chordata"    |
|       3 | "Primates"           | "Mammalia"    |
|       4 | "Hominidae"          | "Primates"    |
|       5 | "Homo"               | "Hominidae"   |
|       6 | "Homo sapiens"       | "Homo"        |
|       7 | "Carnivora"          | "Mammalia"    |
|       8 | "Canidae"            | "Carnivora"   |
|       9 | "Canis"              | "Canidae"     |
|      10 | "Canis latrans"      | "Canis"       |
|      11 | "Cetacea"            | "Mammalia"    |
|      12 | "Delphinidae"        | "Cetacea"     |
|      13 | "Tursiops"           | "Delphinidae" |
|      14 | "Tursiops truncatus" | "Tursiops"    |
|      15 | "Pan"                | "Hominidae"   |
|      16 | "Pan paniscus"       | "Pan"         |
+---------+----------------------+---------------+

Notice null as the parent of Animalia: that's why I told you that you need to separate your dataset by Kingdom. There can be only one null value in the whole table.

Finally, based on that table, we create the hierarchy using d3.stratify():

const stratify = d3.stratify()
    .id(function(d) { return d.name; })
    .parentId(function(d) { return d.parent; });

const hierarchicalData = stratify(tableOfRelationships);
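For anyone doing this step in Python, the idea behind turning a name/parent table into a nested object can be approximated in a few lines. This is a rough sketch, not a port of D3: `build_tree` is a hypothetical helper, it assumes every name is unique and exactly one row has a `None` parent, and it returns plain dicts rather than D3 node objects.

```python
def build_tree(relationships):
    """Turn rows of {'name', 'parent'} into a nested {'name', 'children'} dict."""
    # One node per name; assumes names are unique across the whole table.
    nodes = {r["name"]: {"name": r["name"], "children": []} for r in relationships}
    root = None
    for r in relationships:
        node = nodes[r["name"]]
        if r["parent"] is None:
            if root is not None:
                raise ValueError("more than one root in the table")
            root = node
        else:
            # Attach the node to its parent's list of children.
            nodes[r["parent"]]["children"].append(node)
    if root is None:
        raise ValueError("no root in the table")
    return root


table = [
    {"name": "Animalia", "parent": None},
    {"name": "Chordata", "parent": "Animalia"},
    {"name": "Mammalia", "parent": "Chordata"},
    {"name": "Primates", "parent": "Mammalia"},
    {"name": "Carnivora", "parent": "Mammalia"},
]
tree = build_tree(table)
print(tree["name"], [c["name"] for c in tree["children"]])
```

The explicit root check mirrors the single-root requirement discussed above: a table containing two kingdoms would have two rows with no parent.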

And here is the demo. Open your browser's console (the snippet's console is not very good for this task) and inspect the several levels (children) of the object:

const csv = `RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis latrans
3,Animalia,Chordata,Mammalia,Cetacea,Delphinidae,Tursiops,Tursiops truncatus
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Pan,Pan paniscus`;

const data = d3.csvParse(csv);

const taxonomicRanks = data.columns.filter(d => d !== "RecordID");

const tableOfRelationships = [];

data.forEach(row => {
  taxonomicRanks.forEach((d, i) => {
    if (!tableOfRelationships.find(e => e.name === row[d])) tableOfRelationships.push({
      name: row[d],
      parent: row[taxonomicRanks[i - 1]] || null
    })
  })
});

const stratify = d3.stratify()
  .id(function(d) {
    return d.name;
  })
  .parentId(function(d) {
    return d.parent;
  });

const hierarchicalData = stratify(tableOfRelationships);

console.log(hierarchicalData);
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min.js"></script>

PS: I don't know what kind of dataviz you'll create, but you really should avoid taxonomic ranks. The whole Linnaean taxonomy is outdated and we don't use ranks anymore: since phylogenetic systematics was developed in the mid-1960s, we use only taxa, without any taxonomic rank (evolutionary biology teacher here). Also, I'm quite curious about these 7 million rows, since we have described just over 1 million species!

answered Oct 05 '22 17:10 by Gerardo Furtado



It is easy to do exactly what you need using Python and the python-benedict library (it is open source on GitHub; note: I am the author):

Installation: pip install python-benedict

from benedict import benedict as bdict

# data source can be a filepath or an url
data_source = """
RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
4,Plantae,nan,Magnoliopsida,Fabales,Fabaceae,Phaseoulus,Phaseolus vulgaris
"""
data_input = bdict.from_csv(data_source)
data_output = bdict()

ancestors_hierarchy = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
for value in data_input['values']:
    data_output['.'.join([value[ancestor] for ancestor in ancestors_hierarchy])] = bdict()

print(data_output.dump())
# if this output is ok for your needs, you don't need the following code

keypaths = sorted(data_output.keypaths(), key=lambda item: len(item.split('.')), reverse=True)

data_output['children'] = []
def transform_data(d, key, value):
    if isinstance(value, dict):
        value.update({ 'name':key, 'children':[] })
data_output.traverse(transform_data)

for keypath in keypaths:
    target_keypath = '.'.join(keypath.split('.')[:-1] + ['children'])
    data_output[target_keypath].append(data_output.pop(keypath))

print(data_output.dump())

The first print output will be:

{
    "Animalia": {
        "Chordata": {
            "Mammalia": {
                "Carnivora": {
                    "Canidae": {
                        "Canis": {
                            "Canis": {}
                        }
                    }
                },
                "Primates": {
                    "Hominidae": {
                        "Homo": {
                            "Homo sapiens": {}
                        }
                    }
                }
            }
        }
    },
    "Plantae": {
        "nan": {
            "Magnoliopsida": {
                "Brassicales": {
                    "Brassicaceae": {
                        "Arabidopsis": {
                            "Arabidopsis thaliana": {}
                        }
                    }
                },
                "Fabales": {
                    "Fabaceae": {
                        "Phaseoulus": {
                            "Phaseolus vulgaris": {}
                        }
                    }
                }
            }
        }
    }
}

The second printed output will be:

{
    "children": [
        {
            "name": "Animalia",
            "children": [
                {
                    "name": "Chordata",
                    "children": [
                        {
                            "name": "Mammalia",
                            "children": [
                                {
                                    "name": "Carnivora",
                                    "children": [
                                        {
                                            "name": "Canidae",
                                            "children": [
                                                {
                                                    "name": "Canis",
                                                    "children": [
                                                        {
                                                            "name": "Canis",
                                                            "children": []
                                                        }
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                },
                                {
                                    "name": "Primates",
                                    "children": [
                                        {
                                            "name": "Hominidae",
                                            "children": [
                                                {
                                                    "name": "Homo",
                                                    "children": [
                                                        {
                                                            "name": "Homo sapiens",
                                                            "children": []
                                                        }
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            ]
        },
        {
            "name": "Plantae",
            "children": [
                {
                    "name": "nan",
                    "children": [
                        {
                            "name": "Magnoliopsida",
                            "children": [
                                {
                                    "name": "Brassicales",
                                    "children": [
                                        {
                                            "name": "Brassicaceae",
                                            "children": [
                                                {
                                                    "name": "Arabidopsis",
                                                    "children": [
                                                        {
                                                            "name": "Arabidopsis thaliana",
                                                            "children": []
                                                        }
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                },
                                {
                                    "name": "Fabales",
                                    "children": [
                                        {
                                            "name": "Fabaceae",
                                            "children": [
                                                {
                                                    "name": "Phaseoulus",
                                                    "children": [
                                                        {
                                                            "name": "Phaseolus vulgaris",
                                                            "children": []
                                                        }
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}
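
If adding a dependency is not an option, a similar name/children structure can probably be built with the standard library alone. The following is a minimal sketch and not the python-benedict approach shown above: `rows_to_tree` and the `"root"` label are hypothetical, and it assumes the column names from the question. It reads one row at a time, which may matter for a file with millions of records.

```python
import csv
import io
import json

RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

# Hypothetical sample; a real run would pass an open file object instead.
SAMPLE_CSV = """RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
"""


def rows_to_tree(lines, ranks=RANKS):
    """Build a {'name', 'children'} tree from CSV rows in a single pass."""
    # Nested dicts keyed by name: each row walks (or extends) one path.
    trie = {}
    for row in csv.DictReader(lines):
        level = trie
        for rank in ranks:
            level = level.setdefault(row[rank], {})

    def to_nodes(level):
        # Convert the nested dicts into the list-of-children layout.
        return [{"name": name, "children": to_nodes(sub)} for name, sub in level.items()]

    # A synthetic root (an assumption) holds all kingdoms under one node.
    return {"name": "root", "children": to_nodes(trie)}


tree = rows_to_tree(io.StringIO(SAMPLE_CSV))
print(json.dumps(tree, indent=2))
```

Because nodes are identified by their full path rather than by name alone, a genus and a species that share a name (such as "Canis" in record 2) stay separate nodes.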
answered Oct 05 '22 17:10 by Fabio Caccamo
