I am working with the Enron email dataset and I am trying to remove email addresses that don't contain "@enron.com" (i.e. I would like to keep Enron addresses only). When I tried to delete the addresses without "@enron.com", some of them were skipped for some reason. A small graph is shown below, where the vertices are email addresses. It is in GML format:
Creator "igraph version 0.7 Sun Mar 29 20:15:45 2015"
Version 1
graph
[
directed 1
node
[
id 0
label "[email protected]"
]
node
[
id 1
label "[email protected]"
]
node
[
id 2
label "[email protected]"
]
node
[
id 3
label "[email protected]"
]
node
[
id 4
label "[email protected]"
]
node
[
id 5
label "[email protected]"
]
node
[
id 6
label "[email protected]"
]
node
[
id 7
label "[email protected]"
]
node
[
id 8
label "[email protected]"
]
node
[
id 9
label "[email protected]"
]
edge
[
source 5
target 5
weight 1
]
]
My code is:
    import igraph as ig

    G = ig.read("enron_email_filtered.gml")
    for v in G.vs:
        print(v['label'])
        if '@enron.com' not in v['label']:
            G.delete_vertices(v.index)
            print('Deleted')
In this dataset, 7 emails should be deleted. However, based on the above code, only 5 emails are removed.
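Deleting a vertex while iterating over G.vs renumbers the remaining vertices, so the vertex that moves into the freed index is never checked. Following the igraph tutorial, you can instead collect the indices of all vertices with a given property first, and then delete them in bulk: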
to_delete_ids = [v.index for v in G.vs if '@enron.com' not in v['label']]
G.delete_vertices(to_delete_ids)
Here is the output I got:
to delete ids: [1, 3, 4, 5, 7, 8, 9]
Before deletion: IGRAPH D-W- 10 1 --
+ attr: id (v), label (v), weight (e)
+ edges:
5->5
After deletion: IGRAPH D-W- 3 0 --
+ attr: id (v), label (v), weight (e)
label: [email protected]
label: [email protected]
label: [email protected]
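To see why the original loop removes only 5 of the 7 addresses, here is a minimal sketch that mimics the index shift with a plain Python list instead of an igraph graph. The labels are hypothetical stand-ins (the real addresses are not reproduced here); they are arranged so that positions 1, 3, 4, 5, 7, 8 and 9 are non-Enron, matching the ids listed above. The helper names are made up for illustration and are not part of igraph.

```python
def make_labels():
    # Hypothetical labels: positions 1, 3, 4, 5, 7, 8, 9 are non-Enron.
    non_enron = {1, 3, 4, 5, 7, 8, 9}
    return [
        "user%d@%s" % (i, "example.org" if i in non_enron else "enron.com")
        for i in range(10)
    ]


def delete_while_iterating(labels):
    # Mimics the loop from the question: delete during the walk.
    labels = list(labels)
    i = 0
    while i < len(labels):
        if "@enron.com" not in labels[i]:
            del labels[i]  # everything after position i shifts down by one
        i += 1  # the element that moved into position i is never checked
    return labels


def delete_in_bulk(labels):
    # Mimics the fix: collect the indices first, then remove them together.
    to_delete = {i for i, label in enumerate(labels) if "@enron.com" not in label}
    return [label for i, label in enumerate(labels) if i not in to_delete]


labels = make_labels()
buggy = delete_while_iterating(labels)
fixed = delete_in_bulk(labels)

print("removed while iterating:", len(labels) - len(buggy))
print("removed in bulk:", len(labels) - len(fixed))
```

In this analogy the addresses at positions 4 and 8 survive the first approach, because each one slides into the slot of a neighbour that was just deleted. Collecting the indices before deleting anything avoids the problem.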