Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Analyzing Large Graphs - Retrieving Clusters and Calculating Strongest Paths

I tried to use Breadth-First algorithm to retrieve the whole cluster of connections to which the selected user belongs the following way:

CREATE PROCEDURE Breadth_First (@StartNode varchar(50), @LinkStrength decimal(10,7) = 0.1, @EndNode varchar(50) = NULL)
    AS
    BEGIN
        -- Automatically rollback the transaction if something goes wrong.   
        SET XACT_ABORT ON   
        BEGIN TRAN

        -- Increase performance and do not intefere with the results.
        SET NOCOUNT ON;

        -- Create a temporary table for storing the discovered nodes as the algorithm runs
        CREATE TABLE #Discovered
        (
             DiscoveredUser varchar(50) NOT NULL,    -- The Node Id
            Predecessor varchar(50) NULL,    -- The node we came from to get to this node.
            LinkStrength decimal(10,7) NULL, -- positive - from predecessor to  DiscoveredUser, negative - from  DiscoveredUser to predecessor
            OrderDiscovered int -- The order in which the nodes were discovered.
        )

        -- Initially, only the start node is discovered.
        INSERT INTO #Discovered ( DiscoveredUser, Predecessor, LinkStrength, OrderDiscovered)
        VALUES (@StartNode, NULL, NULL, 0)

        -- Add all nodes that we can get to from the current set of nodes,
        -- that are not already discovered. Run until no more nodes are discovered.
        WHILE @@ROWCOUNT > 0
        BEGIN
            -- If we have found the node we were looking for, abort now.
            IF @EndNode IS NOT NULL
                IF EXISTS (SELECT TOP 1 1 FROM #Discovered WHERE  DiscoveredUser = @EndNode)
                    BREAK   

            -- We need to group by ToNode and select one FromNode since multiple
            -- edges could lead us to new same node, and we only want to insert it once.
            INSERT INTO #Discovered ( DiscoveredUser, Predecessor, LinkStrength, OrderDiscovered)
            (SELECT mc.called_party, mc.calling_party, mc.link_strength, d.OrderDiscovered + 1
            FROM #Discovered d JOIN monthly_connections mc ON d. DiscoveredUser = mc.calling_party
            WHERE mc.called_party NOT IN (SELECT  DiscoveredUser From #Discovered) AND mc.link_strength > @LinkStrength
            UNION
            SELECT mc.calling_party, mc.called_party, mc.link_strength * (-1), d.OrderDiscovered + 1
            FROM #Discovered d JOIN monthly_connections mc ON d. DiscoveredUser = mc.called_party
            WHERE mc.calling_party NOT IN (SELECT  DiscoveredUser FROM #Discovered) AND mc.link_strength > @LinkStrength
            )
        END;

        -- Select the results. We use a recursive common table expression to
        -- get the full path from the start node to the current node.
        WITH BacktraceCTE(Predecessor,  DiscoveredUser, LinkStrength, OrderDiscovered, Path)
        AS
        (
            -- Anchor/base member of the recursion, this selects the start node.
            SELECT d.Predecessor, n. DiscoveredUser, d.LinkStrength, d.OrderDiscovered, 
                CAST(n. DiscoveredUser AS varchar(MAX))
            FROM #Discovered d JOIN users n ON d. DiscoveredUser = n. DiscoveredUser
            WHERE d. DiscoveredUser = @StartNode

            UNION ALL

            -- Recursive member, select all the nodes which have the previous
            -- one as their predecessor. Concat the paths together.
            SELECT d.Predecessor, n. DiscoveredUser, d.LinkStrength, d.OrderDiscovered,
                CAST(cte.Path + ',' + CAST(n. DiscoveredUser as varchar(30)) as varchar(MAX))
            FROM #Discovered d JOIN BacktraceCTE cte ON d.Predecessor = cte. DiscoveredUser
            JOIN users n ON d. DiscoveredUser = n. DiscoveredUser
        )

        SELECT Predecessor,  DiscoveredUser, LinkStrength, OrderDiscovered, Path FROM BacktraceCTE
        WHERE  DiscoveredUser = @EndNode OR @EndNode IS NULL -- This kind of where clause can potentially produce
        ORDER BY OrderDiscovered                -- a bad execution plan, but I use it for simplicity here.

        DROP TABLE #Discovered
        COMMIT TRAN
        RETURN 0
    END

The Graph (social network) which I am currently analyzing has 28M connections and without ignoring weak connections (set treshold with @LinkStrength) the execution is running very long time (so far I haven't got any results and will try to leave it running over night).

Anyway the next step is to calculate the shortest (strongest) links between two users (there are around 3M users). I was thinking about using Djikstra algorithm but am not sure whether it is possible to analyze such network on the PC I am currently using (Quad CPU 2.66 GHz, 4GB RAM) and the data is stored in MS SQL Server 2008 database.

To summarize I would appreciate to get some answers/suggestions for the following questions:

  1. Is it possible to execute the queries as complex as one above for described graph (28M connections, 3M users) on the described machine (2.66 GHz, 4GB RAM)?
  2. If it is not possible are there any other possible approaches by which the execution time could be reduced (e.g. creating tables with partial results)?
  3. Do you recommend any other algorithms for detecting clusters and calculating shortest paths for the described graph?

Thank you!

like image 492
Niko Gamulin Avatar asked Nov 08 '10 13:11

Niko Gamulin


1 Answers

First, use indexes.

Second, you need to reduce your memory requirements. That means first provide a short alias for your VARCHAR(50), such as int which is 4 bytes instead of 50. That will get you a 10x speedup.

declare @tmpPeople table(
  ixPerson int identity primary key,
  UserNodeID varchar(50),
  unique(UserNodeID, ix) -- this creates an index
)
Insert @tmpPeople(UserNodeID) select UserNodeID from NormalPeopleTable
declare @relationships table(
  ixParent int,
  ixChild int,
  unique(ixParent, ixChild),
  unique(ixChild, ixParent)
)
insert @relationships(ixParent, ixChild)
select distinct p.ixPerson, c.ixPerson
from NormalRelationshipsTable nr
inner join @tmpPeople p on p.UserNodeID = nr.ParentUserNodeID
inner join @tmpPeople c on c.UserNodeID = nr.ChildUserNodeID

-- OK now got a copy of all relationships, but it is a fraction of the size
-- because we are using integers for the keys.
-- if we need to we can insert the reverse relationships too.

You need to write a query which does what you want, not something "generic".

If you want to find the shortest distance between two nodes, you can cut your search time by searching from both ends at once.

declare @p1 table(
ix int identity primary key,
ixParent int,
ixChild int,
nDeep int,
-- Need indexes
unique(ixParent, ixChild, nDeep, ix),
unique(ixChild, ixParent, nDeep, ix)
)
-- That's now indexed both ways. 
-- If you only need one, you can comment the other out.
-- define @p2 the same

insert @p1 (ixParent, ixChild, nDeep) select @ixUserFrom, @ixUserFrom, 0
insert @p2 ..... @ixUserTo, @ixUserTo, 0

-- Breadth first goes like this.
-- Just keep repeating it till you get as far as you want.
insert @p1 (ixParent, ixChild, nDeep)
select
p1.ixChild, r.ixChild, p1.nDeep+1
from @p1 p1 inner join @relationships r on r.ixParent = p1.ixChild
-- may want to exclude those already in the table
where not exists (
    select 1 from @p1 p1 
    where p1.ixParent = p.ixChild and p1.ixChild = r.ixChild
)

For a "distance from Alice to Bob" you do two searches in parallel, and stop when Alice's search contains anyone also contained in Bob's search. That also cuts your time down by a factor of n^2 where n is the average number of connections.

Don't forget to stop if the depth gets too much.

like image 138
Ben Avatar answered Sep 27 '22 18:09

Ben