Is there a way to convert CSV columns into hierarchical relationships?

I have a CSV of 7 million biodiversity records in which the taxonomy levels are stored as columns. For instance:

    RecordID,kingdom,phylum,class,order,family,genus,species
    1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
    2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
    3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
    4,Plantae,nan,Magnoliopsida,Fabales,Fabaceae,Phaseoulus,Phaseolus vulgaris

I want to create a visualization in D3, but the data must be in a hierarchical (network) format, where each distinct value in a column is a child of the corresponding value in the previous column. I need to go from the CSV to something like this:

    {
      name: 'Animalia',
      children: [{
        name: 'Chordata',
        children: [{
          name: 'Mammalia',
          children: [{
            name: 'Primates',
            children: 'Hominidae'
          }, {
            name: 'Carnivora',
            children: 'Canidae'
          }]
        }]
      }]
    }

I haven't come up with a way to do this without using a thousand for loops. Does anybody have a suggestion on how to create this network in either Python or JavaScript?

asked Nov 12 '19 22:11

Andres Camilo Zuñiga Gonzalez

2 Answers

To create the exact nested object you want, we'll use a mix of plain JavaScript and a D3 method named d3.stratify. However, keep in mind that 7 million rows (please see the postscript below) is a lot to compute.

It's very important to mention that, for this proposed solution, you'll have to separate the Kingdoms into different data arrays (for instance, using Array.prototype.filter). This restriction exists because we need a single root node, and in the Linnaean taxonomy there is no relationship between Kingdoms (unless you create "Domain" as a top rank, which would be the root for all eukaryotes, but then you'll have the same problem for Archaea and Bacteria).
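As a rough illustration of that separation step, the sketch below groups the parsed rows by kingdom in a single pass with a Map, rather than calling filter once per kingdom. The helper name splitByKingdom and the sample rows are hypothetical; the only assumption about the data is that each row is an object with a kingdom property, as d3.csvParse would produce for the CSV in the question.

```javascript
// Hypothetical helper: split parsed rows into one array per kingdom.
// Assumes each row is an object with a `kingdom` property.
function splitByKingdom(rows) {
  const byKingdom = new Map();
  for (const row of rows) {
    if (!byKingdom.has(row.kingdom)) byKingdom.set(row.kingdom, []);
    byKingdom.get(row.kingdom).push(row);
  }
  return byKingdom;
}

// Illustrative rows, trimmed to a few columns
const rows = [
  { RecordID: "1", kingdom: "Animalia", phylum: "Chordata" },
  { RecordID: "3", kingdom: "Plantae", phylum: "nan" },
  { RecordID: "2", kingdom: "Animalia", phylum: "Chordata" }
];

const groups = splitByKingdom(rows);
for (const [kingdom, kingdomRows] of groups) {
  console.log(kingdom, kingdomRows.length);
}
```

Each array in the Map can then be processed on its own, so every resulting table has exactly one root.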

So, suppose you have this CSV (I added some more rows) with just one Kingdom:

    RecordID,kingdom,phylum,class,order,family,genus,species
    1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
    2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis latrans
    3,Animalia,Chordata,Mammalia,Cetacea,Delphinidae,Tursiops,Tursiops truncatus
    1,Animalia,Chordata,Mammalia,Primates,Hominidae,Pan,Pan paniscus

Based on that CSV, we'll create an array here named tableOfRelationships which, as the name implies, has the relationships between the ranks:

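    const data = d3.csvParse(csv);

    // Every column except the record ID is a taxonomic rank, ordered from kingdom to species
    const taxonomicRanks = data.columns.filter(d => d !== "RecordID");

    const tableOfRelationships = [];

    data.forEach(row => {
      taxonomicRanks.forEach((d, i) => {
        // Add each name only once; its parent is the value in the previous rank's column
        if (!tableOfRelationships.find(e => e.name === row[d])) tableOfRelationships.push({
          name: row[d],
          parent: row[taxonomicRanks[i - 1]] || null
        })
      })
    });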

For the data above, this is the tableOfRelationships:

    +---------+----------------------+---------------+
    | (Index) |         name         |    parent     |
    +---------+----------------------+---------------+
    |       0 | "Animalia"           | null          |
    |       1 | "Chordata"           | "Animalia"    |
    |       2 | "Mammalia"           | "Chordata"    |
    |       3 | "Primates"           | "Mammalia"    |
    |       4 | "Hominidae"          | "Primates"    |
    |       5 | "Homo"               | "Hominidae"   |
    |       6 | "Homo sapiens"       | "Homo"        |
    |       7 | "Carnivora"          | "Mammalia"    |
    |       8 | "Canidae"            | "Carnivora"   |
    |       9 | "Canis"              | "Canidae"     |
    |      10 | "Canis latrans"      | "Canis"       |
    |      11 | "Cetacea"            | "Mammalia"    |
    |      12 | "Delphinidae"        | "Cetacea"     |
    |      13 | "Tursiops"           | "Delphinidae" |
    |      14 | "Tursiops truncatus" | "Tursiops"    |
    |      15 | "Pan"                | "Hominidae"   |
    |      16 | "Pan paniscus"       | "Pan"         |
    +---------+----------------------+---------------+
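With millions of rows, the lookup that decides whether a name is already in the table matters, because scanning the whole array for every cell gets slower as the table grows. The following is a minimal sketch, not the code used in this answer, that produces the same kind of table but tracks the names already seen in a Set. The function name buildRelationships and the shortened sample rows (including the class "Aves") are illustrative assumptions.

```javascript
// Sketch: build the { name, parent } table, using a Set for membership checks
// so that each lookup does not scan the whole table.
// `rows` and `ranks` stand in for the parsed CSV and the list of rank columns.
function buildRelationships(rows, ranks) {
  const seen = new Set();
  const table = [];
  for (const row of rows) {
    ranks.forEach((rank, i) => {
      const name = row[rank];
      if (seen.has(name)) return; // already recorded, skip
      seen.add(name);
      // The first rank has no previous column, so its parent becomes null
      table.push({ name, parent: row[ranks[i - 1]] || null });
    });
  }
  return table;
}

// Illustrative data with only three ranks
const ranks = ["kingdom", "phylum", "class"];
const rows = [
  { kingdom: "Animalia", phylum: "Chordata", class: "Mammalia" },
  { kingdom: "Animalia", phylum: "Chordata", class: "Aves" }
];

const table = buildRelationships(rows, ranks);
console.log(table);
```

Like the original loop, this keys on the name alone, so a value that appears at two ranks (such as "Canis" as both genus and species in the question's sample) is recorded only once.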

Note that Animalia has null as its parent in the table: that's why I told you that you need to separate your dataset by Kingdoms. There can be only one null parent in the whole table.
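One way to catch a mixed-kingdom table early is to list the entries whose parent is null before building the hierarchy. This is only a sketch: findRoots is a hypothetical helper, and the three-row table is made up to show what happens when two kingdoms end up in the same table.

```javascript
// Hypothetical pre-flight check: return the entries that have no parent.
function findRoots(table) {
  return table.filter(entry => entry.parent === null);
}

// Illustrative table that wrongly mixes two kingdoms
const table = [
  { name: "Animalia", parent: null },
  { name: "Chordata", parent: "Animalia" },
  { name: "Plantae", parent: null }
];

const roots = findRoots(table).map(entry => entry.name);
if (roots.length !== 1) {
  console.log("Expected one root but found: " + roots.join(", "));
}
```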

Finally, based on that table, we create the hierarchy using d3.stratify():

    const stratify = d3.stratify()
      .id(function(d) { return d.name; })
      .parentId(function(d) { return d.parent; });

    const hierarchicalData = stratify(tableOfRelationships);

And here is the demo. Open your browser's console (the snippet console is not well suited to this task) and inspect the several levels (children) of the object:

    const csv = `RecordID,kingdom,phylum,class,order,family,genus,species
    1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
    2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis latrans
    3,Animalia,Chordata,Mammalia,Cetacea,Delphinidae,Tursiops,Tursiops truncatus
    1,Animalia,Chordata,Mammalia,Primates,Hominidae,Pan,Pan paniscus`;

    const data = d3.csvParse(csv);

    const taxonomicRanks = data.columns.filter(d => d !== "RecordID");

    const tableOfRelationships = [];

    data.forEach(row => {
      taxonomicRanks.forEach((d, i) => {
        if (!tableOfRelationships.find(e => e.name === row[d])) tableOfRelationships.push({
          name: row[d],
          parent: row[taxonomicRanks[i - 1]] || null
        })
      })
    });

    const stratify = d3.stratify()
      .id(function(d) {
        return d.name;
      })
      .parentId(function(d) {
        return d.parent;
      });

    const hierarchicalData = stratify(tableOfRelationships);

    console.log(hierarchicalData);

<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min.js"></script>
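The object returned by d3.stratify is a tree of D3 node objects, with each original table row stored under the node's data property. If you want the plain { name, children } shape shown in the question instead, a sketch along these lines builds it from the same kind of table without D3. The function name toNested and the shortened table are illustrative, and the code assumes that names are unique and that exactly one entry has a null parent.

```javascript
// Sketch: turn a { name, parent } table into plain nested objects.
// Assumes unique names and exactly one entry whose parent is null.
function toNested(table) {
  const nodes = new Map();
  // First pass: create one node per name
  for (const { name } of table) nodes.set(name, { name, children: [] });
  let root = null;
  // Second pass: attach every node to its parent's children array
  for (const { name, parent } of table) {
    const node = nodes.get(name);
    if (parent === null) root = node;
    else nodes.get(parent).children.push(node);
  }
  return root;
}

// Illustrative table, cut off at the order rank
const table = [
  { name: "Animalia", parent: null },
  { name: "Chordata", parent: "Animalia" },
  { name: "Mammalia", parent: "Chordata" },
  { name: "Primates", parent: "Mammalia" },
  { name: "Carnivora", parent: "Mammalia" }
];

const nested = toNested(table);
console.log(JSON.stringify(nested, null, 2));
```

Two passes over the table are enough, so the work grows linearly with the number of distinct names. A nested object of this shape can also be passed to d3.hierarchy.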

PS: I don't know what kind of dataviz you'll create, but you really should avoid taxonomic ranks. Linnaean taxonomy is outdated and we no longer use ranks: since phylogenetic systematics was developed in the mid-1960s, we use only taxa, without any taxonomic rank (evolutionary biology teacher here). Also, I'm quite curious about these 7 million rows, since we have described just over 1 million species!

answered Oct 05 '22 17:10

Gerardo Furtado

It is easy to do exactly what you need using Python and the python-benedict library (it is open source on GitHub; note: I am the author):

Installation:

    pip install python-benedict

    from benedict import benedict as bdict

    # data source can be a filepath or an url
    data_source = """
    RecordID,kingdom,phylum,class,order,family,genus,species
    1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
    2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
    3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
    4,Plantae,nan,Magnoliopsida,Fabales,Fabaceae,Phaseoulus,Phaseolus vulgaris
    """
    data_input = bdict.from_csv(data_source)
    data_output = bdict()

    ancestors_hierarchy = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
    for value in data_input['values']:
        data_output['.'.join([value[ancestor] for ancestor in ancestors_hierarchy])] = bdict()

    print(data_output.dump())
    # if this output is ok for your needs, you don't need the following code

    keypaths = sorted(data_output.keypaths(), key=lambda item: len(item.split('.')), reverse=True)

    data_output['children'] = []
    def transform_data(d, key, value):
        if isinstance(value, dict):
            value.update({ 'name':key, 'children':[] })
    data_output.traverse(transform_data)

    for keypath in keypaths:
        target_keypath = '.'.join(keypath.split('.')[:-1] + ['children'])
        data_output[target_keypath].append(data_output.pop(keypath))

    print(data_output.dump())

The first printed output will be:

    {
        "Animalia": {
            "Chordata": {
                "Mammalia": {
                    "Carnivora": {
                        "Canidae": {
                            "Canis": {
                                "Canis": {}
                            }
                        }
                    },
                    "Primates": {
                        "Hominidae": {
                            "Homo": {
                                "Homo sapiens": {}
                            }
                        }
                    }
                }
            }
        },
        "Plantae": {
            "nan": {
                "Magnoliopsida": {
                    "Brassicales": {
                        "Brassicaceae": {
                            "Arabidopsis": {
                                "Arabidopsis thaliana": {}
                            }
                        }
                    },
                    "Fabales": {
                        "Fabaceae": {
                            "Phaseoulus": {
                                "Phaseolus vulgaris": {}
                            }
                        }
                    }
                }
            }
        }
    }

The second printed output will be:

    {
        "children": [
            {
                "name": "Animalia",
                "children": [
                    {
                        "name": "Chordata",
                        "children": [
                            {
                                "name": "Mammalia",
                                "children": [
                                    {
                                        "name": "Carnivora",
                                        "children": [
                                            {
                                                "name": "Canidae",
                                                "children": [
                                                    {
                                                        "name": "Canis",
                                                        "children": [
                                                            {
                                                                "name": "Canis",
                                                                "children": []
                                                            }
                                                        ]
                                                    }
                                                ]
                                            }
                                        ]
                                    },
                                    {
                                        "name": "Primates",
                                        "children": [
                                            {
                                                "name": "Hominidae",
                                                "children": [
                                                    {
                                                        "name": "Homo",
                                                        "children": [
                                                            {
                                                                "name": "Homo sapiens",
                                                                "children": []
                                                            }
                                                        ]
                                                    }
                                                ]
                                            }
                                        ]
                                    }
                                ]
                            }
                        ]
                    }
                ]
            },
            {
                "name": "Plantae",
                "children": [
                    {
                        "name": "nan",
                        "children": [
                            {
                                "name": "Magnoliopsida",
                                "children": [
                                    {
                                        "name": "Brassicales",
                                        "children": [
                                            {
                                                "name": "Brassicaceae",
                                                "children": [
                                                    {
                                                        "name": "Arabidopsis",
                                                        "children": [
                                                            {
                                                                "name": "Arabidopsis thaliana",
                                                                "children": []
                                                            }
                                                        ]
                                                    }
                                                ]
                                            }
                                        ]
                                    },
                                    {
                                        "name": "Fabales",
                                        "children": [
                                            {
                                                "name": "Fabaceae",
                                                "children": [
                                                    {
                                                        "name": "Phaseoulus",
                                                        "children": [
                                                            {
                                                                "name": "Phaseolus vulgaris",
                                                                "children": []
                                                            }
                                                        ]
                                                    }
                                                ]
                                            }
                                        ]
                                    }
                                ]
                            }
                        ]
                    }
                ]
            }
        ]
    }

answered Oct 05 '22 17:10

Fabio Caccamo

Tags:

python

javascript

hierarchical-data

data-visualization

d3.js