I have 2.92M data points in a 3.0GB CSV file and I need to loop through it twice to create a graph which I want to load into NetworkX. At the current rate it will take me days to generate this graph. How can I speed this up?
require "csv"

similarity = 8
graph = {}
topic_pages = {}

# Load every row: column 0 is the page id, the remaining columns are its topic attributes.
CSV.foreach("topic_page_node_and_edge.csv") do |row|
  topic_pages[row[0]] = row[1..-1]
end

CSV.open("generate_graph.csv", "wb") do |csv|
  i = 0
  topic_pages.each do |row|
    i += 1
    row = row.flatten
    topic_pages_attributes = row[1..-1]
    graph[row[0]] = []
    # Compare this page against every page that comes after it (upper-triangle pairwise scan).
    topic_pages.to_a[i..-1].each do |row2|
      row2 = row2.flatten
      topic_pages_attributes2 = row2[1..-1]
      num_matching_attributes = (topic_pages_attributes2 & topic_pages_attributes).count
      # Connect the two pages if they share enough topics, or one topic set contains the other.
      if num_matching_attributes >= similarity or num_matching_attributes == topic_pages_attributes2.count or num_matching_attributes == topic_pages_attributes.count
        graph[row[0]].push(row2[0])
      end
    end
    csv << [row[0], graph[row[0]]].flatten
  end
end
A) Benchmark. For example, use cProfile, which ships with Python. It is easy to have costly inefficiencies in your code, and in intensive applications they can easily come at a 10x performance cost.
Pretty code such as

(topic_pages_attributes2 & topic_pages_attributes).count

may turn out to be a major factor in your runtime, and it can often be reduced by rewriting it as more traditional code; profiling tells you where that is actually worth doing (see the sketch below).
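As a rough illustration only: this is one way to wrap the expensive step in cProfile. The build_graph function is a hypothetical placeholder for whatever actually builds your graph, not code from the question.

# profile_example.py -- hypothetical wrapper; swap build_graph() for your real entry point.
import cProfile
import pstats

def build_graph():
    # ... the expensive graph-building code goes here ...
    pass

profiler = cProfile.Profile()
profiler.enable()
build_graph()      # run the costly step under the profiler
profiler.disable()

# Print the 20 call sites with the largest cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)

Alternatively, running python -m cProfile -s cumulative your_script.py gives the same kind of report without modifying the code at all.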
B) Use a more efficient language. For example, on the benchmarksgame.alioth suite, across a number of intensive problems the fastest Python 3 program is in the median 63x slower than the fastest C program (Ruby is at 67x, JRuby at 33x). Yes, the performance gap can be big, even with well-optimized Python code. And if you have not optimized your code, the gap may be even bigger; you may be able to get a 100x-1000x speedup by moving to a more efficient language and carefully optimizing your code.
C) Consider more clever formulations of your problem. For example, instead of iterating over each pair of nodes, iterate over each edge once. In your case, that would probably mean building an inverted index, topic -> pages. This is very similar to the way text search engines work, and a popular way to compute such operations on clusters: the individual topics can be split across separate nodes. This approach benefits from the sparsity in your data and can cut the runtime of your algorithm drastically.
You have about 3 million documents. Judging from your total data size, they probably have fewer than 100 topics each on average. Your pairwise comparison approach needs on the order of (3 million)^2 comparisons, and that is what hurts you. If each of the more popular topics is used in only 30,000 documents, you may get away with roughly 30,000^2 * number-of-topics comparisons. Assuming you have 100 such very popular topics (rare topics don't matter much), this would be a 100x speedup.
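A minimal sketch of the inverted-index idea, assuming the same CSV layout as in the question (page id in the first column, topics in the remaining columns); the variable names and the edge rule at the end mirror the question's code but are otherwise placeholders:

import csv
from collections import defaultdict
from itertools import combinations

SIMILARITY = 8

topic_to_pages = defaultdict(set)   # inverted index: topic -> set of pages carrying it
page_topics = {}                    # page -> set of its topics

with open("topic_page_node_and_edge.csv", newline="") as f:
    for row in csv.reader(f):
        page, topics = row[0], [t for t in row[1:] if t]
        page_topics[page] = set(topics)
        for topic in topics:
            topic_to_pages[topic].add(page)

# Count shared topics only for pairs that actually co-occur under some topic,
# instead of comparing all ~(3 million)^2 pairs.
shared = defaultdict(int)
for pages in topic_to_pages.values():
    for a, b in combinations(sorted(pages), 2):
        shared[(a, b)] += 1

# Same edge rule as the original code: enough shared topics, or one topic set
# completely contained in the other.
edges = [(a, b) for (a, b), n in shared.items()
         if n >= SIMILARITY or n == len(page_topics[a]) or n == len(page_topics[b])]

Note that the pair counter can still grow large under a few extremely popular topics, which is exactly the caveat mentioned at the end; for those topics you may want to cap or shard the work.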
D) Simplify your problem. For example, first merge all documents that have exactly the same set of topics; sorting makes this easy. To make this even more effective, also eliminate all topics that occur exactly once. There are probably only some 10,000-100,000 distinct topic sets, so this step, which can easily be done by sorting, will make your problem some 900x-90,000x smaller (assuming the value range above).
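A small sketch of that deduplication step, again with made-up names; it assumes a page -> topic-set mapping like the page_topics dict in the previous sketch (a toy version is inlined here so the snippet runs on its own):

from collections import Counter, defaultdict

# Toy stand-in for the real page -> topic-set mapping (built as in the previous sketch).
page_topics = {
    "page1": {"t1", "t2", "t3"},
    "page2": {"t1", "t2", "t3"},
    "page3": {"t2", "t4"},
}

# Drop topics that occur exactly once; they can never be shared between two pages.
topic_freq = Counter(t for topics in page_topics.values() for t in topics)
pruned = {page: {t for t in topics if topic_freq[t] > 1}
          for page, topics in page_topics.items()}

# Merge pages whose (pruned) topic sets are identical; keep one representative per group.
groups = defaultdict(list)
for page, topics in pruned.items():
    groups[tuple(sorted(topics))].append(page)

representatives = {topics: pages[0] for topics, pages in groups.items()}
# Run the pairwise / inverted-index step on the representatives only, then expand each
# group back into a fully connected clique when writing out the final graph.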
Some of these numbers may be too optimistic. For example, I/O was not taken into account at all, and if your problem is I/O bound, using C or Java may not help much. There may also be some highly popular topics that hurt the approach discussed in C). For D) you need O(n log n) time to sort your data, but very good implementations are available, and it is definitely a simplification you should do. Documents with identical topic sets will also form fully connected cliques in your final graph, which likely hurt other analyses as well.