Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

create a Corpus from many html files in R

I would like to create a Corpus for the collection of downloaded HTML files, and then read them in R for future text mining.

Essentially, this is what I want to do:

  • Create a Corpus from multiple html files.

I tried to use DirSource:

library(tm)
a<- DirSource("C:/test")
b<-Corpus(DirSource(a), readerControl=list(language="eng", reader=readPlain))

but it returns "invalid directory parameters"

  • Read in html files from the Corpus all at once. Not sure how to do it.

  • Parse them, convert them to plain text, remove tags. Many people suggested using XML, however, I didn't find a way to process multiple files. They are all for one single file.

Thanks very much.

like image 899
user2097824 Avatar asked Feb 22 '13 03:02

user2097824


People also ask

What does Corpus () do in R?

Corpus is an R text processing package with full support for international text (Unicode). It includes functions for reading data from newline-delimited JSON files, for normalizing and tokenizing text, for searching for term occurrences, and for computing term occurrence frequencies (including n-grams).

What is a Corpus object in R?

Notes for “Text Mining with R: A Tidy Approach” A corpus object, however, is a data structure for text data before tokenization. One common example is Corpus objects from the tm package. These store text alongside metadata, which may include an ID, date/time, title, or language for each document.


2 Answers

This should do it. Here I've got a folder on my computer of HTML files (a random sample from SO) and I've made a corpus out of them, then a document term matrix and then done a few trivial text mining tasks.

# get data
setwd("C:/Downloads/html") # this folder has your HTML files 
html <- list.files(pattern="\\.(htm|html)$") # get just .htm and .html files

# load packages
library(tm)
library(RCurl)
library(XML)
# get some code from github to convert HTML to text
writeChar(con="htmlToText.R", (getURL(ssl.verifypeer = FALSE, "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R")))
source("htmlToText.R")
# convert HTML to text
html2txt <- lapply(html, htmlToText)
# clean out non-ASCII characters
html2txtclean <- sapply(html2txt, function(x) iconv(x, "latin1", "ASCII", sub=""))

# make corpus for text mining
corpus <- Corpus(VectorSource(html2txtclean))

# process text...
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
a <- tm_map(a, PlainTextDocument)
a <- tm_map(corpus, FUN = tm_reduce, tmFuns = funcs)
a.dtm1 <- TermDocumentMatrix(a, control = list(wordLengths = c(3,10))) 
newstopwords <- findFreqTerms(a.dtm1, lowfreq=10) # get most frequent words
# remove most frequent words for this corpus
a.dtm2 <- a.dtm1[!(a.dtm1$dimnames$Terms) %in% newstopwords,] 
inspect(a.dtm2)

# carry on with typical things that can now be done, ie. cluster analysis
a.dtm3 <- removeSparseTerms(a.dtm2, sparse=0.7)
a.dtm.df <- as.data.frame(inspect(a.dtm3))
a.dtm.df.scale <- scale(a.dtm.df)
d <- dist(a.dtm.df.scale, method = "euclidean") 
fit <- hclust(d, method="ward")
plot(fit)

enter image description here

# just for fun... 
library(wordcloud)
library(RColorBrewer)

m = as.matrix(t(a.dtm1))
# get word counts in decreasing order
word_freqs = sort(colSums(m), decreasing=TRUE) 
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
# plot wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))

enter image description here

like image 55
Ben Avatar answered Nov 03 '22 10:11

Ben


This will correct the error.

 b<-Corpus(a, ## I change DireSource(a) by a
          readerControl=list(language="eng", reader=readPlain))

But I think to read your Html you need to use xml reader. Something like :

r <- Corpus(DirSource('c:\test'),
             readerControl = list(reader = readXML),spec)

But you need to supply the spec argument, which depends with your file structure. see for example readReut21578XML. It is a good example of xml/html parser.

like image 2
agstudy Avatar answered Nov 03 '22 10:11

agstudy