
How can I have a Map that all processes can access?

I'm building a multithreaded web crawler.

I launch a thread that fetches the first n href links from a page and parses some data. It should then add those links to a Visited list that other threads can read, and add the parsed data to a global map that is printed when the program finishes. The thread then launches n new threads that each do the same thing.

How can I set up a global list of visited sites that all threads can read, and a global map that all threads can write to?

MikeC asked Jan 07 '23 08:01


2 Answers

Processes do not share memory, so you can't share data between them directly. That doesn't mean you can't share information.

The usual way is to use a dedicated process (a server) whose job is to maintain the state, in your case the list of visited links. Other processes read and update that state by sending it messages.
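A minimal sketch of such a server process, assuming a GenServer is acceptable for the job. The module name `Crawler.Visited` and the `claim/2` and `all/1` functions are hypothetical names chosen for illustration; they are not from the question.

```elixir
defmodule Crawler.Visited do
  # Hypothetical module: one process owns the set of visited URLs.
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, MapSet.new(), opts)
  end

  # Returns true if `url` had not been seen before (and records it),
  # false otherwise. The check and the insert happen inside a single
  # call, so two crawler processes cannot both claim the same URL.
  def claim(server, url), do: GenServer.call(server, {:claim, url})

  # Returns every URL recorded so far as a list.
  def all(server), do: GenServer.call(server, :all)

  @impl true
  def init(set), do: {:ok, set}

  @impl true
  def handle_call({:claim, url}, _from, set) do
    if MapSet.member?(set, url) do
      {:reply, false, set}
    else
      {:reply, true, MapSet.put(set, url)}
    end
  end

  def handle_call(:all, _from, set) do
    {:reply, MapSet.to_list(set), set}
  end
end
```

A crawler process would call `Crawler.Visited.claim(pid, url)` before fetching a link and skip the link when the call returns `false`. Because the server handles one message at a time, no extra locking is needed.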

Another way is to use ETS (or Mnesia, the database built on top of ETS), which is designed to share information between processes.
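A short sketch of the ETS route, assuming a public named table is acceptable. The table name `:visited` and the example URL are illustrative assumptions.

```elixir
# Create a named, public table once, from a long-lived process:
# the table is deleted when the process that created it exits.
:ets.new(:visited, [:set, :public, :named_table])

# :ets.insert_new/2 returns true only for the first writer of a key,
# so it doubles as an atomic "has this URL been seen?" check.
first = :ets.insert_new(:visited, {"http://example.com/", true})
again = :ets.insert_new(:visited, {"http://example.com/", true})

# Any process can read the table by name.
seen = :ets.member(:visited, "http://example.com/")
```

The same pattern would work for the results: a second table keyed by URL that each crawler process writes its parsed data into with `:ets.insert/2`.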

Pascal answered Jan 11 '23 23:01


Just to clarify, Erlang/Elixir uses processes rather than threads.

Given a list of elements, a generic approach:

  • An empty list called processed is saved to ETS, DETS, Mnesia or some other DB.
  • The new list of elements is filtered against the processed list so that no task is repeated unnecessarily.
  • For each element of the filtered list, a task is run (which in turn spawns a process) that does some work on the element and returns a map of the required data. See the Task module: Task.async/1 and Task.yield_many/2 could be useful.
  • Once all the tasks have returned or yielded,

    1. all the maps or parts of the data in the maps are merged and can be persisted if/as required/appropriate.
    2. the elements whose tasks did not crash or time out are added to the processed list in the DB.
  • Tasks which crash or time out could be handled differently.
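The steps above can be sketched roughly as follows. This is one possible reading, not the answer's own code: the module `Crawler.Batch`, the `run/5` function and the `fetch` callback are hypothetical, the processed list is passed in as a `MapSet` instead of being read from a DB, and `Task.Supervisor.async_nolink/2` is swapped in for `Task.async/1` so that a crashing task does not take the caller down with it.

```elixir
defmodule Crawler.Batch do
  # Hypothetical module. `sup` is a running Task.Supervisor and `fetch`
  # is any 1-arity function that returns a map of data for one element.
  def run(sup, elements, processed, fetch, timeout \\ 5_000) do
    # Filter against the processed set so no work is repeated.
    todo =
      elements
      |> Enum.uniq()
      |> Enum.reject(&MapSet.member?(processed, &1))

    # One task (and therefore one process) per remaining element.
    tasks =
      Enum.map(todo, fn el ->
        Task.Supervisor.async_nolink(sup, fn -> fetch.(el) end)
      end)

    # yield_many/2 returns {task, result} pairs in the same order as `tasks`.
    results = Task.yield_many(tasks, timeout)

    todo
    |> Enum.zip(results)
    |> Enum.reduce({%{}, processed}, fn
      # Finished: merge the data and mark the element as processed.
      {el, {_task, {:ok, data}}}, {merged, done} ->
        {Map.merge(merged, data), MapSet.put(done, el)}

      # Timed out: stop the task; the element stays unprocessed.
      {_el, {task, nil}}, acc ->
        Task.shutdown(task, :brutal_kill)
        acc

      # Crashed: the element stays unprocessed and can be retried later.
      {_el, {_task, {:exit, _reason}}}, acc ->
        acc
    end)
  end
end
```

The function returns `{merged_map, updated_processed_set}`. The caller would persist both as appropriate and could feed the elements that are still unprocessed into a later run.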

stephen_m answered Jan 11 '23 22:01