Create a corpus from many HTML files in R

I would like to create a corpus from a collection of downloaded HTML files and then read them into R for text mining.

Essentially, this is what I want to do:

1. Create a corpus from multiple HTML files.

I tried to use DirSource:

library(tm)
a<- DirSource("C:/test")
b<-Corpus(DirSource(a), readerControl=list(language="eng", reader=readPlain))

but it returns "invalid directory parameters".

2. Read in the HTML files from the corpus all at once. I am not sure how to do this.
3. Parse them, convert them to plain text, and remove the tags. Many people suggested using XML; however, I could not find a way to process multiple files. The examples I found all handle a single file.

Thanks very much.

asked Feb 22 '13 03:02

user2097824

2 Answers

This should do it. I have a folder of HTML files on my computer (a random sample from SO). The code below makes a corpus out of them, then a term-document matrix, and then does a few trivial text mining tasks.

# get data
setwd("C:/Downloads/html") # this folder has your HTML files 
html <- list.files(pattern="\\.(htm|html)$") # get just .htm and .html files

# load packages
library(tm)
library(RCurl)
library(XML)
# get some code from github to convert HTML to text
writeChar(con="htmlToText.R", (getURL(ssl.verifypeer = FALSE, "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R")))
source("htmlToText.R")
# convert HTML to text
html2txt <- lapply(html, htmlToText)
# clean out non-ASCII characters
html2txtclean <- sapply(html2txt, function(x) iconv(x, "latin1", "ASCII", sub=""))

# make corpus for text mining
corpus <- Corpus(VectorSource(html2txtclean))

# process text...
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
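a <- tm_map(corpus, FUN = tm_reduce, tmFuns = funcs) # apply all the cleaning functions first
a <- tm_map(a, PlainTextDocument) # then make sure each document is a PlainTextDocument again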
a.dtm1 <- TermDocumentMatrix(a, control = list(wordLengths = c(3,10))) 
newstopwords <- findFreqTerms(a.dtm1, lowfreq=10) # get most frequent words
# remove most frequent words for this corpus
a.dtm2 <- a.dtm1[!(a.dtm1$dimnames$Terms) %in% newstopwords,] 
inspect(a.dtm2)

# carry on with typical things that can now be done, ie. cluster analysis
a.dtm3 <- removeSparseTerms(a.dtm2, sparse=0.7)
a.dtm.df <- as.data.frame(as.matrix(a.dtm3)) # as.matrix() gives the full matrix; inspect() is meant for display
a.dtm.df.scale <- scale(a.dtm.df)
d <- dist(a.dtm.df.scale, method = "euclidean") 
fit <- hclust(d, method="ward.D") # "ward" was renamed to "ward.D" in R 3.1.0
plot(fit)

# just for fun... 
library(wordcloud)
library(RColorBrewer)

m = as.matrix(t(a.dtm1))
# get word counts in decreasing order
word_freqs = sort(colSums(m), decreasing=TRUE) 
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
# plot wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
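
If you would rather not download and source a script, the HTML-to-text step can be sketched directly with the XML package. This is only a rough sketch under stated assumptions: the helper names html_file_to_text and html_dir_to_text are made up for illustration, the code is not taken from the htmlToText script used above, and the XPath filter simply keeps every text node that is not inside a script or style element.

```r
library(XML)

# Hypothetical helper: pull the visible text out of one HTML file.
html_file_to_text <- function(path) {
  doc <- htmlParse(path)  # lenient HTML parser, returns an internal document
  # Keep all text nodes except those inside <script> or <style>
  nodes <- xpathSApply(
    doc,
    "//text()[not(ancestor::script)][not(ancestor::style)]",
    xmlValue
  )
  # Join the text nodes with spaces, then squeeze runs of whitespace.
  # Note: joining with a space can split a word that spans two tags.
  txt <- paste(unlist(nodes), collapse = " ")
  trimws(gsub("\\s+", " ", txt))
}

# Hypothetical helper: apply the above to every .htm/.html file in a folder.
# Returns a character vector named by file path.
html_dir_to_text <- function(dir) {
  files <- list.files(dir, pattern = "\\.(htm|html)$", full.names = TRUE)
  vapply(files, html_file_to_text, character(1))
}

# Possible usage (the folder path is the one assumed earlier in this answer):
# texts  <- html_dir_to_text("C:/Downloads/html")
# corpus <- Corpus(VectorSource(texts))
```

The result is a plain character vector with one element per file, so it can be passed to VectorSource in the same way as html2txtclean.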

answered Nov 03 '22 10:11

Ben

This will correct the error.

b <- Corpus(a, ## pass a itself instead of DirSource(a)
            readerControl=list(language="eng", reader=readPlain))
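
The likely reason is that Corpus() expects a Source object as its first argument, and a already is one, so wrapping it in DirSource() a second time passes an object where a directory path is expected. A minimal sketch of the corrected call follows. It uses a throwaway folder with two made-up files in place of C:/test; the folder name and file contents are assumptions for illustration only.

```r
library(tm)

# Throwaway folder standing in for C:/test (illustrative only)
dir_path <- file.path(tempdir(), "corpus_demo")
dir.create(dir_path, showWarnings = FALSE)
writeLines("<html><body><p>first page</p></body></html>",
           file.path(dir_path, "one.html"))
writeLines("<html><body><p>second page</p></body></html>",
           file.path(dir_path, "two.html"))

# Build the source once, restricted to .htm/.html files
a <- DirSource(dir_path, pattern = "\\.html?$")

# Pass the source object itself to Corpus(); do not wrap it again
b <- Corpus(a, readerControl = list(reader = readPlain, language = "en"))

length(b)  # number of documents read from the folder
```

With readPlain the documents still contain the raw markup, so the tags have to be removed in a separate step.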

However, to read your HTML files I think you need an XML reader, something like:

r <- Corpus(DirSource('c:/test'), ## forward slash, because '\t' inside an R string is a tab
            readerControl = list(reader = readXML(spec = spec, doc = PlainTextDocument())))

You need to supply the spec argument yourself, and it depends on the structure of your files. See readReut21578XML for an example; it is a good example of an XML/HTML parser.

answered Nov 03 '22 10:11

agstudy

Tags: html, r, xml-parsing, text-mining, corpus