I would like some advice on how to create and visualize a link map between blogs, so as to reflect the "social network" between them.
Here is how I am thinking of doing it:
I imagine that in order to do this in R, one would use RCurl/XML (thanks, Shane, for your answer here), combined with something like igraph.
But since I don't have experience with either of them, is there someone here who might be willing to correct me if I have missed any important step, or to attach a useful snippet of code for this task?
P.S.: My motivation for this question is that in a week I am giving a talk at useR 2010 on "blogging and R", and I thought this might be a nice way both to give the audience something fun and to motivate them to try something like this themselves.
Thanks a lot!
Tal
NB: This example is a very BASIC way of getting the links and therefore would need to be tweaked in order to be more robust. :)
I don't know how useful this code is, but hopefully it might give you an idea of the direction to go in (just copy and paste it into R; it's a self-contained example once you've installed the packages RCurl and XML):
library(RCurl)
library(XML)
# download a page and return the href target of every <a> tag on it
get.links.on.page <- function(u) {
  doc <- getURL(u)
  html <- htmlTreeParse(doc, useInternalNodes = TRUE)
  nodes <- getNodeSet(html, "//html//body//a[@href]")
  # use the href attribute explicitly rather than the first attribute found
  urls <- sapply(nodes, function(x) xmlGetAttr(x, "href"))
  urls <- sort(urls)
  return(urls)
}
# a naive way of doing it. Python has 'urlparse', which is supposed to be rather good at this
get.root.domain <- function(u) {
  root <- unlist(strsplit(u, "/"))[3]
  return(root)
}
# a naive method to filter out duplicated, invalid and self-referencing urls
filter.links <- function(seed, urls) {
  urls <- unique(urls)
  # keep only absolute http(s) links
  urls <- urls[which(substr(urls, start = 1, stop = 1) == "h")]
  urls <- urls[grep("http", urls, fixed = TRUE)]
  # drop links back to the seed's own domain (guard against an empty match,
  # which would otherwise drop every url)
  seed.root <- get.root.domain(seed)
  self.refs <- grep(seed.root, urls, fixed = TRUE)
  if (length(self.refs) > 0) urls <- urls[-self.refs]
  return(urls)
}
# pass each url to this function
main.fn <- function(seed) {
  raw.urls <- get.links.on.page(seed)
  filtered.urls <- filter.links(seed, raw.urls)
  return(filtered.urls)
}
### example ###
seed <- "http://www.r-bloggers.com/blogs-list/"
urls <- main.fn(seed)
# crawl first 3 links and get urls for each, put in a list
x <- lapply(as.list(urls[1:3]), main.fn)
names(x) <- urls[1:3]
x
If you copy and paste it into R, and then look at x, I think it'll make sense.
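If you also want to draw the network itself (the question mentions igraph), the list x can be turned into a graph object. What follows is only a minimal sketch, assuming x is the named list produced by the code above and that igraph is installed:

library(igraph)

# turn the named list 'x' into a two-column (from, to) edge list
edges <- do.call(rbind, lapply(names(x), function(from) {
  if (length(x[[from]]) == 0) return(NULL)
  cbind(from, x[[from]])
}))

# build a directed graph from the edge list and draw it
g <- graph_from_edgelist(edges, directed = TRUE)
plot(g, vertex.size = 4, vertex.label.cex = 0.6, edge.arrow.size = 0.3)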
Either way, good luck mate! Tony Breyal
Tal,
This type of data collection is referred to as a k-snowball search in network theory and should be fairly straightforward in R. As you note, the easiest way to accomplish this would be to use the XML package and the htmlTreeParse command. This will parse the HTML from a blog into a tree, which will allow you to more easily perform the link extraction you are interested in.
Also, igraph would be perfectly capable of representing the graphs, and it has a useful function, graph.compose, for taking two graphs and returning their edge set composition. You will need this to combine data as you continue to "roll the snowball": parse the seed blog, extract its outbound links, build a graph from those edges, then repeat for each newly discovered blog and merge the results, stopping after k rounds.
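To make the composition idea concrete, here is a tiny illustration (the vertex names are arbitrary; compose() is the current igraph name for the older graph.compose):

library(igraph)

# two one-edge directed graphs: A -> B and B -> C
g1 <- graph_from_literal(A -+ B)
g2 <- graph_from_literal(B -+ C)

# relational composition: an edge a -> c exists when a -> b is in g1 and b -> c is in g2
g3 <- compose(g1, g2)
E(g3)  # a single edge, A -> C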
I have no code for this in R, but I have generated code that performs a very similar process in Python using Google's SocialGraph API.
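Purely as a rough sketch of what that snowball loop might look like in R, reusing the main.fn function from Tony's answer above (the depth parameter k and the helper names are illustrative choices, and for simplicity it accumulates one big edge list and builds the graph at the end rather than composing per-round graphs):

library(igraph)

# one snowball round: fetch each frontier url and return the new (from, to) edges
snowball.round <- function(frontier) {
  do.call(rbind, lapply(frontier, function(s) {
    links <- tryCatch(main.fn(s), error = function(e) character(0))
    if (length(links) == 0) return(NULL)
    cbind(s, links)
  }))
}

# crawl out to depth k from a single seed url and return a directed igraph object
snowball.graph <- function(seed, k = 2) {
  all.edges <- NULL
  frontier <- seed
  visited <- character(0)
  for (i in seq_len(k)) {
    frontier <- setdiff(frontier, visited)
    if (length(frontier) == 0) break
    edges <- snowball.round(frontier)
    visited <- c(visited, frontier)
    if (is.null(edges)) break
    all.edges <- rbind(all.edges, edges)
    frontier <- unique(edges[, 2])
  }
  if (is.null(all.edges)) return(make_empty_graph(directed = TRUE))
  graph_from_edgelist(all.edges, directed = TRUE)
}

g <- snowball.graph("http://www.r-bloggers.com/blogs-list/", k = 2)
plot(g, vertex.size = 4, vertex.label = NA, edge.arrow.size = 0.2)

Be warned that even at k = 2 this will hit a lot of sites, so expect it to take a while and consider adding a polite delay between requests.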
Good luck!