
BFS five letter word chain

I am in need of some help with an assignment regarding a BFS word chain. The word chain is based on five-letter words; two words x and y are connected when all of the last four letters of word x appear in word y. For example, climb and blimp are connected because the letters l, i, m, and b in climb all appear in blimp.

The recommendation is to use (or modify) the directed BFS from Sedgewick's Algorithms, 4th edition. The code can be found here: https://algs4.cs.princeton.edu/40graphs/. We are told to use the following code to read the data file into the list words:

BufferedReader r =
    new BufferedReader(new InputStreamReader(new FileInputStream(fnam)));
ArrayList<String> words = new ArrayList<String>();
while (true) {
    String word = r.readLine();
    if (word == null) { break; }
    assert word.length() == 5;  // inputcheck, if you run with assertions
    words.add(word);
}

and the following code to read the test cases from a file:

BufferedReader r = 
    new BufferedReader(new InputStreamReader(new FileInputStream(fnam)));
while (true) {
    String line = r.readLine();
    if (line == null) { break; }
    assert line.length() == 11; // inputcheck, if you run with assertions
    String start = line.substring(0, 5);
    String goal = line.substring(6, 11);
    // ... search path from start to goal here
}

The words in the data file are:

their
moist
other
blimp
limps
about
there
pismo
abcde
bcdez
zcdea
bcdef
fzcde

When the test case file is used...

other there
other their
their other
blimp moist
limps limps
moist limps
abcde zcdea

...the output should be the number of edges on the shortest path between each word pair, and -1 if there is no path between the words.

1
1
-1
3
0
-1
2

I am new to working with graphs, and I am not sure how to use Sedgewick's BFS, or how to modify it to handle the test case file. Any help is appreciated.

asked Oct 26 '20 by arvidst4


3 Answers

Let's assume n is the number of words in the dataset.

Firstly, we need to build an adjacency list for all the above words according to the given condition, i.e. there is an edge from x to y if and only if the last four letters of x are present in y. Building this adjacency list is an O(n^2 * w) operation, where w is the average length of a word in the dataset.

Secondly, for each test pair, all we have to do is a traditional BFS over this graph.

Here's the main function:

    public static void main(String[] args) throws IOException {
        // get words from dataset
        List<String> words = readData();
        // get the word pairs to test
        List<List<String>> testData = getTestData();
        // form an adjacency list
        Map<String, List<String>> adj = getAdjacencyList(words);
        
        // for each test, do a traditional BFS
        for (List<String> test : testData) {
            System.out.println(bfs(adj, test));
        }
    }
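
The readData and getTestData helpers are not shown here; a minimal sketch based on the reading code from the question might look like the following (the file names are placeholders, and the usual java.io/java.util imports are assumed):

    static List<String> readData() throws IOException {
        List<String> words = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new FileReader("words.dat"))) {
            String word;
            while ((word = r.readLine()) != null) {
                words.add(word); // each line is one five-letter word
            }
        }
        return words;
    }

    static List<List<String>> getTestData() throws IOException {
        List<List<String>> tests = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new FileReader("tests.dat"))) {
            String line;
            while ((line = r.readLine()) != null) {
                // each test line is "start goal", e.g. "other there"
                tests.add(Arrays.asList(line.substring(0, 5), line.substring(6, 11)));
            }
        }
        return tests;
    }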

Here's the function to build an adjacency list according to the given condition:

    public static Map<String, List<String>> getAdjacencyList(List<String> words) {
        Map<String, List<String>> adj = new HashMap<>();
        for (int i = 0; i < words.size(); ++i) {
            String word = words.get(i);
            adj.put(word, adj.getOrDefault(word, new ArrayList<>()));
            for (int j = 0; j < words.size(); ++j) {
                if (i == j) continue;
                String other = words.get(j);
                // count how many of the last four letters of `word` appear somewhere in `other`
                int count = 0;
                for (int k = 1; k < 5; ++k) {
                    count += other.indexOf(word.charAt(k)) != -1 ? 1 : 0;
                }
                // if all four are present, there is an edge from `word` to `other`
                if (count >= 4)
                    adj.get(word).add(other);
            }
        }

        return adj;
    }

And here's the BFS:

    public static int bfs(Map<String, List<String>> adj, List<String> test) {
        Queue<String> q = new LinkedList<>();
        Set<String> visited = new HashSet<>(); // to keep track of the visited words, since the graph is not necessarily a DAG
        String start = test.get(0);
        String end = test.get(1);
        // if `start` and `end` words are equal
        if (start.equals(end))
            return 0;

        q.add(start);
        visited.add(start);
        int count = 0;
        while (!q.isEmpty()) {
            count++;
            int size = q.size();
            for (int i = 0; i < size; ++i) {
                String word = q.poll();
                for (String val : adj.get(word)) {
                    if (val.equals(end))
                        return count; // return the number of edges on the shortest path
                    if (!visited.contains(val)) { // only enqueue words that haven't been visited yet
                        visited.add(val);         // mark as visited when enqueued, otherwise it may be added repeatedly
                        q.add(val);
                    }
                }
            }
        }
        return -1; // no path from start to end
    }
answered Oct 19 '22 by The Room


@The Room has provided a pretty good answer, but I want to suggest a simple modification for the adjacency list construction, since the provided approach builds the list in O(n^2), which leads to poor performance for large input files.

Simply take every possible sorted pattern of 4 characters of each word and insert the word's id (its index, for example) into a hash map keyed by that pattern.

C++ code example:

// map each sorted 4-letter pattern to the indices of the words that contain it
map<string, vector<int>> mappings;

for (int i = 0; i < words.size(); i++) {
    // the pattern that omits the last letter
    string word = words[i].substr(0, 4);
    sort(word.begin(), word.end());
    mappings[word].push_back(i);
    // the other four patterns: replace one of the first four letters with the last letter
    for (int j = 0; j < 4; j++) {
        word = words[i].substr(0, 4);
        word[j] = words[i][4];
        sort(word.begin(), word.end());
        mappings[word].push_back(i);
    }
}

Now, for each sorted 4-character pattern, you have a vector of the indices of all words that contain those characters; any word whose last four letters sort to that pattern must have an edge to each of them.

Then you can simply build the graph, taking care only not to create self loops (avoid adding an edge from a node to itself).

Code example:

// Building the graph with complexity of O(n * log(no. of edges))
const int N = 100000; // just an example
vector<int> graph[N];
for (int i = 0; i < words.size(); i++) {
    // sorted pattern of the last four letters of word i
    string tmp = words[i].substr(1, 4);
    sort(tmp.begin(), tmp.end());
    for (int j = 0; j < mappings[tmp].size(); j++) {
        if (i == mappings[tmp][j]) // skip self loops
            continue;

        graph[i].push_back(mappings[tmp][j]);
    }
}

Finally, you can loop over your test file, get the start and goal indices (when reading the words file, store each word as a key with its index as the value), and then apply the BFS function to count the edges, as described in @The Room's answer. A sketch of this final step is shown below.
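
This sketch is not from the original answer; it assumes the graph was stored as an index-based adjacency list (List<List<Integer>> graph) and that index maps each word to its position in the words list, with the test file name as a placeholder. It is written in Java to match the reading code from the question:

    // Level-by-level BFS over the index-based graph; returns the number of
    // edges on the shortest path, or -1 if there is no path.
    static int bfs(List<List<Integer>> graph, int start, int goal) {
        if (start == goal) return 0;
        boolean[] visited = new boolean[graph.size()];
        Queue<Integer> q = new LinkedList<>();
        q.add(start);
        visited[start] = true;
        int edges = 0;
        while (!q.isEmpty()) {
            edges++; // exploring the next level of the search
            for (int i = q.size(); i > 0; --i) {
                int v = q.poll();
                for (int w : graph.get(v)) {
                    if (w == goal) return edges;
                    if (!visited[w]) {
                        visited[w] = true;
                        q.add(w);
                    }
                }
            }
        }
        return -1; // no path between the two words
    }

    // Driver for the test file: look up the two words of each line and print the distance.
    static void runTests(List<List<Integer>> graph, Map<String, Integer> index) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader("tests.dat"))) {
            String line;
            while ((line = r.readLine()) != null) {
                int start = index.get(line.substring(0, 5));
                int goal = index.get(line.substring(6, 11));
                System.out.println(bfs(graph, start, goal));
            }
        }
    }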

I just wanted to suggest this answer for folks who may need a solution to a similar problem with large inputs; it reduces the complexity of building the graph from O(N^2) to O(N * log(no. of edges)), where N is the number of words.

answered Oct 19 '22 by Volpe95


My approach was slightly different, and there is also a subtle nuance to the question which I will come to below.

First we create an adjacency list (@Volpe95 has a nice optimisation for this). A Map of Nodes is used, where the word is the key.

Map<String, Node> nodes = new HashMap<>();

List<String> words = new DataHelper().loadWords("src/main/wordsInput.dat");
System.out.println(words);

for (int i = 0; i < words.size(); i++) {
    String l = words.get(i);
    nodes.put(l, new Node(l));
}

for (Map.Entry<String, Node> l : nodes.entrySet()) {
    for (Map.Entry<String, Node> r : nodes.entrySet()) {
        if (l.equals(r)) continue;
        if (isLinkPair(l.getKey(), r.getKey())) {
            Node t = nodes.get(l.getKey());
            System.out.println(t);
            t.addChild(nodes.get(r.getKey()));
        }
    }
}

isLinkPair checks whether the last four letters of a word can be found in a possible child word.

private static boolean isLinkPair(String l, String r) {
    // last 4 chars only
    for (int i = 1; i < l.length(); i++) {
        if (r.indexOf(l.charAt(i)) == -1) {
            return false;
        }
    }
    return true;
}

A Node stores its word and children, as well as edgeTo, which records the node's parent and is used to reconstruct the shortest path. This child-parent link will always lie on the shortest path. (Sedgewick stores this data in separate arrays, but it's often easier to group it in a class, as it makes the code easier to understand.)

(Getters, setters, equals, etc. omitted for clarity.)

public class Node {
    private Set<Node> children;
    private String word;

    private Node edgeTo;

    private int visited;

    public Node(String word) {
        children = new HashSet<>();
        this.word = word;
        edgeTo = null;
    }
}
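
The omitted members are used throughout the rest of this answer; a minimal sketch of what they might look like inside Node is shown below (the method names are inferred from the calls in the bfs and getShortestPath code, so treat the details as assumptions):

    // inside class Node
    public Set<Node> getChildren() { return children; }
    public void addChild(Node child) { children.add(child); }
    public String getWord() { return word; }
    public Node getEdgeTo() { return edgeTo; }
    public void setEdgeTo(Node edgeTo) { this.edgeTo = edgeTo; }
    public int getVisited() { return visited; }
    public void setVisited() { visited = 1; }

    // equality by word, so map lookups and the target check in bfs behave as expected
    @Override
    public boolean equals(Object o) {
        return o instanceof Node && word.equals(((Node) o).word);
    }

    @Override
    public int hashCode() { return word.hashCode(); }

    @Override
    public String toString() { return word; }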

The BFS algorithm, based on Sedgewick's, searches each node, its immediate children, their children in turn, and so on, moving ever further from the origin each time. Note that a queue is used, implemented here by LinkedList in Java.

private boolean bfs(Map<String, Node> map, Node source, Node target) {
    if (source == null || target == null) return false;
    if (source.equals(target)) return true;
    Queue<Node> queue = new LinkedList<>();
    source.setVisited();
    queue.add(source);
    while (!queue.isEmpty()) {
        Node v = queue.poll();
        for (Node c : v.getChildren()) {
            if (c.getVisited() == 0) {
                System.out.println("visiting " + c);
                c.setVisited();
                c.setEdgeTo(v);
                if (c.equals(target)) {
                    return true;
                }
                queue.add(c);
            }
        }
    }

    return false;
}

Note that v is the parent and c is one of its children; setEdgeTo is used to set a child's parent.

Finally we check the results, where source and target are the source and target words respectively (BreadthFirstPaths is presumably a wrapper class holding the bfs and getShortestPath methods shown here):

BreadthFirstPaths bfs = new BreadthFirstPaths(nodes, source, target);
int shortestPath = bfs.getShortestPath(nodes, source, target);

So what of the nuance I mentioned above? The shortest path calculation is necessary because zcdea has two parents, fzcde and bcdez, and you need the one on the shortest path. To do this, take the target's edgeTo (its parent), find that node, get its parent, and repeat until the path has been walked, as shown below. The child-parent relationship will always lie on the shortest path because of the way BFS searches outwards from the origin.

// get edgeTo on target (the parent) , find this node and get its parent
    // continue until the shortest path is walked or no path is found
    public int getShortestPath(Map<String,Node> map, String source, String target) {
        Node node = map.get(target);
        int pathLength = 0;
        do {
            if (node == null || pathLength > map.size()) return NOPATH; // NOPATH: presumably a class constant of -1, matching the "no path" output
            if(node.equals(map.get(source))) return pathLength;
            node = map.get(node.getWord()).getEdgeTo();
            pathLength++;
        } while (true);
    }

There are always space-time complexity tradeoffs and possible optimisations to consider.

answered Oct 19 '22 by Andrew