 

Visualise distances between texts

I'm working on a research project for school. I've written some text mining software that analyzes legal texts in a collection and spits out a score that indicates how similar they are. I ran the program to compare each text with every other text, and I have data like this (although with many more points):

codeofhammurabi.txt crete.txt      0.570737
codeofhammurabi.txt iraqi.txt      1.13475
codeofhammurabi.txt magnacarta.txt 0.945746
codeofhammurabi.txt us.txt         1.25546
crete.txt iraqi.txt                0.329545
crete.txt magnacarta.txt           0.589786
crete.txt us.txt                   0.491903
iraqi.txt magnacarta.txt           0.834488
iraqi.txt us.txt                   1.37718
magnacarta.txt us.txt              1.09582

Now I need to plot them on a graph. I can easily invert the scores so that a small value now indicates texts that are similar and a large value indicates texts that are dissimilar: the value can be the distance between points on a graph representing the texts.

codeofhammurabi.txt crete.txt      1.75212
codeofhammurabi.txt iraqi.txt      0.8812
codeofhammurabi.txt magnacarta.txt 1.0573
codeofhammurabi.txt us.txt         0.7965
crete.txt iraqi.txt                3.0344
crete.txt magnacarta.txt           1.6955
crete.txt us.txt                   2.0329
iraqi.txt magnacarta.txt           1.1983
iraqi.txt us.txt                   0.7261
magnacarta.txt us.txt              0.9125

SHORT VERSION: Those values directly above are distances between points on a scatter plot (1.75212 is the distance between the codeofhammurabi point and the crete point). I can imagine a big system of equations with circles representing the distances between points. What's the best way to make this graph? I have MATLAB, R, Excel, and access to pretty much any software I might need.
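For reference, here is the same data read into R as a symmetric distance matrix (just a sketch, with the .txt suffixes dropped for readability):

# assemble the pairwise values above into a symmetric distance matrix (sketch)
pairs <- read.table(text = "
codeofhammurabi crete      1.75212
codeofhammurabi iraqi      0.8812
codeofhammurabi magnacarta 1.0573
codeofhammurabi us         0.7965
crete           iraqi      3.0344
crete           magnacarta 1.6955
crete           us         2.0329
iraqi           magnacarta 1.1983
iraqi           us         0.7261
magnacarta      us         0.9125",
col.names = c("a", "b", "d"), stringsAsFactors = FALSE)
texts <- unique(c(pairs$a, pairs$b))
m <- matrix(0, length(texts), length(texts), dimnames = list(texts, texts))
m[cbind(pairs$a, pairs$b)] <- pairs$d
m <- m + t(m)  # mirror into the lower triangle so the matrix is symmetric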

If you can even point me in a direction, I'll be infinitely grateful.

asked Apr 14 '13 by xiii1408

1 Answer

If the question is 'how can I do something like this guy did?' (from xiii1408's comment on the question), then the answer is to use Gephi's built-in Force Atlas 2 algorithm on Euclidean distances between the documents' topic posterior probabilities.

"This guy" is Matt Jockers, who is an innovative scholar in the digital humanities. He has documented some of his methods on his blog and else where, etc. Jockers mostly works in R and shares some of his code. His basic work flow seems to be:

  1. break plain text into 1000-word chunks (see the chunking sketch after this list),
  2. remove stopwords (don't stem),
  3. do part-of-speech tagging and keep nouns only,
  4. build a topic model (using LDA),
  5. calculate Euclidean distances between documents based on topic proportions, subset the distances to keep only ones below a certain threshold, and then
  6. visualise with a force-directed graph
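The reproducible example below works on short journal abstracts, so it skips the chunking in step 1; for long texts such as novels, a minimal sketch of that step might look like this (assuming the text has already been read into one long character string):

# split one long text into roughly 1000-word chunks (illustrative sketch)
chunk_text <- function(txt, chunk_size = 1000) {
  words  <- unlist(strsplit(txt, "\\s+"))           # tokenise on whitespace
  groups <- ceiling(seq_along(words) / chunk_size)  # chunk id for each word
  vapply(split(words, groups), paste, character(1), collapse = " ")
}
# e.g. chunks <- chunk_text(paste(readLines("novel.txt"), collapse = " "))  # hypothetical file name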

Here's a small-scale reproducible example in R (with an export to Gephi) that might be close to what Jockers did:

#### prepare workspace
# delete current objects and clear RAM
rm(list = ls(all.names = TRUE))
gc()

Get data...

#### import text
# working from the topicmodels package vignette
# using a collection of abstracts from the Journal of Statistical Software (JSS), up to 2010-08-05
install.packages("corpus.JSS.papers", repos = "http://datacube.wu.ac.at/", type = "source")
data("JSS_papers", package = "corpus.JSS.papers")
# For reproducibility of results we use only abstracts published up to 2010-08-05 
JSS_papers <- JSS_papers[JSS_papers[,"date"] < "2010-08-05",]

Clean and reshape...

#### clean and reshape data
# omit abstracts containing non-ASCII characters
JSS_papers <- JSS_papers[sapply(JSS_papers[, "description"], Encoding) == "unknown",]
# remove greek characters (from math notation, etc.)
library("tm")
library("XML")
remove_HTML_markup <- function(s) tryCatch({
    doc <- htmlTreeParse(paste("<!DOCTYPE html>", s),
                         asText = TRUE, trim = FALSE)
    xmlValue(xmlRoot(doc))
  }, error = function(s) s)
# create corpus
corpus <- Corpus(VectorSource(sapply(JSS_papers[, "description"], remove_HTML_markup)))
# clean corpus by removing stopwords, numbers, punctuation, extra whitespace, and words fewer than 3 characters long
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus_clean <- tm_map(corpus, wordLengths=c(3,Inf), FUN = tm_reduce, tmFuns = funcs)
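Depending on your version of tm, base functions such as tolower may need to be wrapped in content_transformer before being passed to tm_map, and the word-length filter is better applied when building the document-term matrix. An equivalent cleanup for newer tm versions might look like this (a sketch, not part of the original code):

# equivalent cleanup for more recent versions of the tm package (sketch)
corpus_clean <- tm_map(corpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords("english"))
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
# words shorter than 3 characters can be dropped later, e.g.
# DocumentTermMatrix(corpus_clean, control = list(wordLengths = c(3, Inf)))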

Part-of-speech tagging and subsetting of nouns...

#### Part-of-speech tagging to extract nouns only
library("NLP")      # load NLP and openNLP separately (library() takes one package at a time)
library("openNLP")
# function for POS tagging
tagPOS <- function(x) {
  s <- NLP::as.String(x)
  ## need sentence and word token annotations
  a1 <- NLP::Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- NLP::annotate(s, openNLP::Maxent_Word_Token_Annotator(), a1)
  a3 <- NLP::annotate(s, openNLP::Maxent_POS_Tag_Annotator(), a2)
  ## extract the POS tags of the word tokens
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  ## token/POS pairs could also be extracted, but are not needed here
  # POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  return(unlist(POStags))
}
# a loop to do POS tagging on each document, with garbage collection after each one
# first prepare vector to hold results (for optimal loop speed)
corpus_clean_tagged <- vector(mode = "list",  length = length(corpus_clean))
# then loop through each doc and do POS tagging
# warning: this may take some time!
for(i in 1:length(corpus_clean)){
  corpus_clean_tagged[[i]] <- tagPOS(corpus_clean[[i]])
  print(i) # nice to see what we're up to
  gc()
}
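If your version of tm stores documents as PlainTextDocument objects rather than plain character vectors, the tagger may complain about its input; in that case extract the text explicitly (a variant of the loop body above, assuming a newer tm):

# variant of the loop body for newer tm versions
corpus_clean_tagged[[i]] <- tagPOS(content(corpus_clean[[i]]))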

# subset nouns
wrds <- lapply(unlist(corpus_clean), function(i) unlist(strsplit(i, split = " ")))
NN <- lapply(corpus_clean_tagged, function(i) i == "NN")
Noun_strings <- lapply(1:length(wrds), function(i) unlist(wrds[i])[unlist(NN[i])])
Noun_strings <- lapply(Noun_strings, function(i) paste(i, collapse = " "))
# have a look to see what we've got
Noun_strings[[1]]
[1] "variogram model splus user quality variogram model pairs locations measurements variogram nonstationarity outliers variogram fit sets soil nitrogen concentration"

Topic modelling with latent Dirichlet allocation...

#### topic modelling with LDA
# Jockers uses the lda package and MALLET, and perhaps topicmodels as well; I'm not sure.
# I'm most familiar with the topicmodels package, so that's what is used here.
# Note that MALLET can also be run from R: https://gist.github.com/benmarwick/4537873
# put the cleaned documents back into a corpus for topic modelling
corpus <- Corpus(VectorSource(Noun_strings))
# create document term matrix 
JSS_dtm <- DocumentTermMatrix(corpus)
# generate topic model 
library("topicmodels")
k = 30 # arbitrary number of topics (there are ways to optimise this)
JSS_TM <- LDA(JSS_dtm, k) # make topic model
# make data frame where rows are documents, columns are topics and cells 
# are posterior probabilities of topics
JSS_topic_df <- setNames(as.data.frame(JSS_TM@gamma),  paste0("topic_",1:k))
# add row names that link each document to a human-readable bit of data
# in this case we'll just use a few words of the title of each paper
row.names(JSS_topic_df) <- lapply(1:length(JSS_papers[,1]), function(i) gsub("\\s","_",substr(JSS_papers[,1][[i]], 1, 60)))
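Before going on to distances, it can be worth sanity-checking the model by looking at the most probable terms in each topic (a quick check, not part of Jockers' workflow):

# inspect the ten most probable terms in each topic
terms(JSS_TM, 10)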

Calculate Euclidean distances of one document from another, using topic probabilities as each document's 'DNA'...

#### Euclidean distance matrix
library(cluster)
JSS_topic_df_dist <-  as.matrix(daisy(JSS_topic_df, metric =  "euclidean", stand = TRUE))
# Change row values to zero if less than row minimum plus row standard deviation
# This is how Jockers subsets the distance matrix to keep only 
# closely related documents and avoid a dense spaghetti diagram 
# that's difficult to interpret (hat-tip: http://stackoverflow.com/a/16047196/1036500)
JSS_topic_df_dist[ sweep(JSS_topic_df_dist, 1, (apply(JSS_topic_df_dist,1,min) + apply(JSS_topic_df_dist,1,sd) )) > 0 ] <- 0
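That one-liner is fairly dense; naming the row threshold explicitly makes the intent clearer (the same operation, just rewritten):

# same subsetting, spelled out: zero any distance greater than (row minimum + row SD)
row_threshold <- apply(JSS_topic_df_dist, 1, min) + apply(JSS_topic_df_dist, 1, sd)
JSS_topic_df_dist[sweep(JSS_topic_df_dist, 1, row_threshold) > 0] <- 0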

Visualise using a force-directed graph...

#### network diagram using Fruchterman & Reingold algorithm (Jockers uses the ForceAtlas2 algorithm which is unique to Gephi)
library(igraph)
g <- as.undirected(graph.adjacency(JSS_topic_df_dist))
layout1 <- layout.fruchterman.reingold(g, niter=500)
plot(g, layout=layout1, edge.curved = TRUE, vertex.size = 1,  vertex.color= "grey", edge.arrow.size = 0.1, vertex.label.dist=0.5, vertex.label = NA)

[igraph plot: force-directed network of the JSS abstracts]

If you want to use the Force Atlas 2 algorithm in Gephi, simply export the R graph object to a graphml file, then open it in Gephi and set the layout to Force Atlas 2:

# this line will export from R and make the file 'JSS.graphml' in your working directory ready to open with Gephi
write.graph(g, file="JSS.graphml", format="graphml") 

Here's the Gephi plot with the Force Atlas 2 algorithm:

[Gephi plot: the same graph laid out with Force Atlas 2]
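To connect this back to the five legal texts in the question: a force-directed layout also works directly on the pairwise scores, with no topic modelling step. In a Fruchterman-Reingold layout a heavier edge pulls its endpoints closer together, so the original similarity scores (rather than the inverted distances) are the natural edge weights. A minimal sketch using the numbers from the question:

# force-directed plot of the question's own pairwise similarity scores (sketch)
library(igraph)
edges <- read.table(text = "
codeofhammurabi crete      0.570737
codeofhammurabi iraqi      1.13475
codeofhammurabi magnacarta 0.945746
codeofhammurabi us         1.25546
crete           iraqi      0.329545
crete           magnacarta 0.589786
crete           us         0.491903
iraqi           magnacarta 0.834488
iraqi           us         1.37718
magnacarta      us         1.09582",
col.names = c("from", "to", "weight"), stringsAsFactors = FALSE)
g5 <- graph.data.frame(edges, directed = FALSE)  # the 'weight' column becomes an edge attribute
plot(g5,
     layout = layout.fruchterman.reingold(g5, weights = E(g5)$weight),
     edge.label = round(E(g5)$weight, 2), vertex.size = 5)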

answered by Ben