Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generating a directed word graph in python

From a list of sentences, I want to generate a directed graph to generate a sub sentence according to the following property:

Suppose there are three sentences:

  • I love candy
  • I love you
  • I love candy very much

The graph will have edges between every consecutive words with weight 1 each time the pair occurs.

Then I find out the path in the graph with maximum weight. Here it returns 'I love candy very much' with a weight of 3+2+1+1

str1 = """Man who run in front of car, get tired.
Man who run behind car, get exhausted."""
# create a list of words separated at whitespaces
wordList1 = str1.split(None)
# strip any punctuation marks and build modified word list
# start with an empty list
wordList2 = []
for word1 in wordList1:
# last character of each word
lastchar = word1[-1:]
# use a list of punctuation marks
if lastchar in [",", ".", "!", "?", ";"]:
    word2 = word1.rstrip(lastchar)
else:
    word2 = word1
# build a wordList of lower case modified words
wordList2.append(word2.lower())

Now wordList2 has a list of consecutive words. How can I convert it into a graph. I am using the networkx library but not much familiar with it.

How can I proceed with making the edge weighted graph?

like image 625
R1234 Avatar asked Oct 03 '12 23:10

R1234


2 Answers

Here's a solution to your problem using networkX.

This code will generate a directed graph, dG. dG will have one node per word, with a 'count' attribute indicating how many times the word has been seen. It will have one directed edge per bigram, with a 'weight' attribute indicating how many times the bigram has occurred.

import networkx as nx
import string
from sys import maxint

str1 = """Man who run in front of car, get tired.
Man who run behind car, get exhausted."""

wordList1 = str1.split(None)

wordList2 = [string.rstrip(x.lower(), ',.!?;') for x in wordList1]

dG = nx.DiGraph()

for i, word in enumerate(wordList2):
    try:
        next_word = wordList2[i + 1]
        if not dG.has_node(word):
            dG.add_node(word)
            dG.node[word]['count'] = 1
        else:
            dG.node[word]['count'] += 1
        if not dG.has_node(next_word):
            dG.add_node(next_word)
            dG.node[next_word]['count'] = 0

        if not dG.has_edge(word, next_word):
            dG.add_edge(word, next_word, weight=maxint - 1)
        else:
            dG.edge[word][next_word]['weight'] -= 1
    except IndexError:
        if not dG.has_node(word):
            dG.add_node(word)
            dG.node[word]['count'] = 1
        else:
            dG.node[word]['count'] += 1
    except:
        raise

To see the contents of the graph, you can print the nodes and edges.

for node in dG.nodes():
    print '%s:%d\n' % (node, dG.node[node]['count'])

for edge in dG.edges():
    print '%s:%d\n' % (edge, maxint - dG.edge[edge[0]][edge[1]]['weight'])

Note that the bigram edge weight, instead of counting up from zero, counts down from maxint. This is because networkX's shortest path utilities will use the weight value as a cost per edge. By making our highest counts our smallest weights, we can use these utilities to find the path with the highest edge counts.

for example, we can get a path with highest count between the word 'man' and the word 'exhausted':

>>>shortest_path = nx.shortest_path(dG, source='man', target='exhausted', weight='weight')
>>>print shortest_path
['man', 'who', 'run', 'behind', 'car', 'get', 'exhausted']

or, we can get the paths with the highest counts between the word 'man' and all other words:

>>>shortest_paths = nx.shortest_path(dG, source='man', weight='weight')
>>>print shortest_paths
{'run': ['man', 'who', 'run'], 
'get': ['man', 'who', 'run', 'behind', 'car', 'get'], 
'exhausted': ['man', 'who', 'run', 'behind', 'car', 'get', 'exhausted'], 
'car': ['man', 'who', 'run', 'behind', 'car'], 
'who': ['man', 'who'], 
'behind': ['man', 'who', 'run', 'behind'], 
'of': ['man', 'who', 'run', 'in', 'front', 'of'], 
'in': ['man', 'who', 'run', 'in'], 
'front': ['man', 'who', 'run', 'in', 'front'], 
'tired': ['man', 'who', 'run', 'behind', 'car', 'get', 'tired'], 
'man': ['man']}

As noted above, there is a possibility of getting cycles in a graph like this, and I'm not sure how well the networkx shortest path algorithm will handle that case.

Good luck!

like image 160
brentlance Avatar answered Oct 14 '22 20:10

brentlance


Use a dictionary to store bigrams, adding 1 to the value each time you encounter the bigram. Get the max of the dictionary values to generate your starting point, then recursively call a function that gets the bigram with the highest value that starts with the last word in the prior bigram, until no more bigrams that start with the last word exist. Be wary of the possibility of circular graphs, e.g. I love that I love that I love ad infinitum (set a recursion limit).

I've never worked with networkx specifically, but after glancing through the homepage, the above still applies.

like image 37
kreativitea Avatar answered Oct 14 '22 20:10

kreativitea