
FAQ markup to R data structure

I'm reading the R FAQ source in texinfo, and thinking that it would be easier to manage and extend if it was parsed as an R structure. There are several existing examples related to this:

  • the fortunes package

  • bibtex entries

  • Rd files

each with some desirable features.

In my opinion, FAQs are underused in the R community because they lack i) easy access from the R command line (i.e. through an R package); ii) powerful search functions; iii) cross-references; iv) extensions for contributed packages. Drawing ideas from the bibtex and fortunes packages, we could conceive a new system where:

  • FAQs can be searched from R. Typical calls would resemble the fortune() interface: faq("lattice print"), faq() # surprise me!, faq(51), or faq(package="ggplot2") (a rough sketch of such a function follows this list).

  • Packages can provide their own FAQ.rda, the format of which is not clear yet (see below).

  • Sweave/knitr drivers are provided to output nicely formatted Markdown/LaTeX, etc.
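To make the idea concrete, here is a minimal sketch of what such a search function might look like, assuming each package ships an FAQ.rda containing a data frame called entries with title, entry and category columns (the file layout, the column names, and the default package name are assumptions, not an existing API):

## Minimal sketch only; FAQ.rda, the `entries` data frame and its columns
## (title, entry, category) are assumptions, not an existing interface.
faq <- function(query = NULL, package = "faq") {
  e <- new.env()
  load(system.file("FAQ.rda", package = package), envir = e)
  entries <- e$entries
  if (is.null(query))                    # faq(): surprise me
    return(entries[sample(nrow(entries), 1), ])
  if (is.numeric(query))                 # faq(51): entry by number
    return(entries[query, ])
  hits <- grepl(query, entries$title, ignore.case = TRUE) |
          grepl(query, entries$entry, ignore.case = TRUE)
  entries[hits, ]                        # faq("lattice print"): keyword search
}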

QUESTION

I'm not sure, however, what the best input format would be, either for converting the existing FAQ or for adding new entries.

It is rather cumbersome to use raw R syntax with a tree of nested lists (or an ad hoc S3/S4/reference class or structure), e.g.

list(title    = "Something to be \\escaped",
     entry    = "long text with quotes, links and broken characters",
     category = c("windows", "mac", "test"))
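For illustration only, a small constructor (hypothetical, not part of any existing package) could hide some of that verbosity, although the escaping problem for long text remains:

## Hypothetical helper: build one FAQ entry as a classed list.
faq_entry <- function(title, entry, category = character()) {
  structure(list(title = title, entry = entry, category = category),
            class = "faq_entry")
}

faq_entry(title    = "Something to be \\escaped",
          entry    = "long text with quotes, links and broken characters",
          category = c("windows", "mac", "test"))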

Rd documentation, even though not an R structure per se (it is more a subset of LaTeX with its own parser), can perhaps provide a more appealing example of an input format. It also has a set of tools to parse the structure in R. However, its current purpose is rather specific and different, being oriented towards general documentation of R functions, not FAQ entries. Its syntax is not ideal either; I think a more modern markup, something like Markdown, would be more readable.
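For reference, those parsing tools live in the tools package; something along these lines reads an Rd file into an R object whose elements carry their Rd tags (the file name below is just a placeholder):

## Parse an Rd file into an R object; "myfunction.Rd" is a placeholder path.
rd <- tools::parse_Rd("myfunction.Rd")
sapply(rd, attr, "Rd_tag")   # e.g. "\\title", "\\description", ...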

Is there something else out there, maybe examples of parsing markdown files into R structures? An example of diverting Rd files from their intended purpose?
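As a crude sketch of the kind of thing I have in mind (not an existing parser), a markdown file with one level-one heading per question could be split into an R list with a few lines of base R:

## Crude sketch; assumes one FAQ entry per "# " heading in a markdown file.
md_to_faq <- function(file) {
  lines  <- readLines(file)
  starts <- grep("^# ", lines)
  ends   <- c(starts[-1] - 1, length(lines))
  Map(function(s, e) list(
        title = sub("^# ", "", lines[s]),
        entry = if (e > s) paste(lines[(s + 1):e], collapse = "\n") else ""),
      starts, ends)
}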

To summarise

I would like to come up with:

1- a good design for an R structure (a class, perhaps) that would extend the fortunes package to more general entries such as FAQ items

2- a more convenient format to enter new FAQs (rather than the current texinfo format)

3- a parser, either written in R or some other language (bison?) to convert the existing FAQ into the new structure (1), and/or the new input format (2) into the R structure.

Update 2: in the last two days of the bounty period I received two answers, both interesting but completely different. Because the question is quite broad (arguably ill-posed), neither answer provides a complete solution, so I will not (for now, anyway) accept one. As for the bounty, I'll award it to the most up-voted answer before the bounty expires, wishing there were a way to split it more equally.

asked May 26 '12 by baptiste


2 Answers

(This addresses point 3.)

You can convert the texinfo file to XML

wget http://cran.r-project.org/doc/FAQ/R-FAQ.texi
makeinfo --xml R-FAQ.texi

and then read it with the XML package.

library(XML)
doc <- xmlParse("R-FAQ.xml")
r <- xpathSApply(doc, "//node", function(u) {
  list(list(
    title    = xpathSApply(u, "nodename", xmlValue),
    contents = as(u, "character")
  ))
})
free(doc)
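A quick way to check what was picked up (assuming the structure of r returned above) is to pull out the node titles:

## Peek at the extracted titles (assumes `r` from the previous step).
titles <- sapply(r, function(u) paste(u$title, collapse = " "))
head(titles)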

But it is much easier to convert it to text

makeinfo --plaintext R-FAQ.texi > R-FAQ.txt 

and parse the result manually.

doc <- readLines("R-FAQ.txt")

# Split the document into questions,
# i.e., around lines like ****** or ======.
i <- grep("[*=]{5}", doc) - 1
i <- c(1, i)
j <- rep(seq_along(i)[-length(i)], diff(i))
stopifnot(length(j) == length(doc))
faq <- split(doc, j)

# Clean the result: since the questions
# are in the subsections, we can discard the sections.
faq <- faq[sapply(faq, function(u) length(grep("[*]", u[2])) == 0)]

# Use the result
cat(faq[[sample(seq_along(faq), 1)]], sep = "\n")
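Once the entries are in that list, a keyword search in the spirit of the faq() interface sketched in the question takes only a couple of lines (again, just an illustration):

## Illustration only: search the parsed entries for a keyword.
faq_search <- function(pattern, entries = faq) {
  hits <- sapply(entries, function(u) any(grepl(pattern, u, ignore.case = TRUE)))
  entries[hits]
}
length(faq_search("lattice"))   # how many entries mention "lattice"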
answered Sep 18 '22 by Vincent Zoonekynd


I'm a little unclear on your goals. You seem to want all the R-related documentation converted into some format which R can manipulate, presumably so that one can write R routines to extract information from the documentation more effectively.

There seem to be three assumptions here.

1) That it will be easy to convert these different document formats (texinfo, Rd files, etc.) to some standard form with (I emphasize) some implicit uniform structure and semantics.
Because if you cannot map them all to a single structure, you'll have to write separate R tools for each type, and perhaps for each individual document, and then the post-conversion tool work will overwhelm the benefit.

2) That R is the right language in which to write such document processing tools; I suspect you're a little biased towards R because you work in R and don't want to contemplate "leaving" the development environment to get better at working with R. I'm not an R expert, but I think R is mainly a numerical language that does not offer any special help for string handling, pattern recognition, natural language parsing or inference, all of which I'd expect to play an important part in extracting information from the converted documents, which largely contain natural language. I'm not suggesting a specific alternative language (Prolog??), but if you succeed with the conversion to a normal form (task 1), you might be better off carefully choosing the target language for processing.

3) That you can actually extract useful information from those structures. Library science was what the 20th century tried to push; now we're all into "Information Retrieval" and "Data Fusion" methods. But in fact reasoning about informal documents has defeated most of the attempts to do it. There are no obvious systems that organize raw text and extract deep value from it (IBM's Jeopardy-winning Watson system being the apparent exception, but even there it isn't clear what Watson "knows"; would you want Watson to answer the question, "Should the surgeon open you with a knife?", no matter how much raw text you gave it?). The point is that you might succeed in converting the data, but it isn't clear what you can successfully do with it.

All that said, most markup systems on text have markup structure and raw text. One can "parse" those into tree-like structures (or graph-like structures if you assume certain things are reliable cross-references; texinfo certainly has these). XML is widely pushed as a carrier for such parsed structures, and being able to represent arbitrary trees or graphs it is ... OK ... for capturing such trees or graphs. [People then push RDF or OWL or some other knowledge-encoding system that uses XML, but this isn't changing the problem; you pick a canonical target independent of R.] So what you really want is something that will read the various marked-up structures (texinfo, Rd files) and spit out XML or equivalent trees/graphs. Here I think you are doomed into building separate O(N) parsers to cover all the N markup styles; how otherwise would a tool know what the markup (and therefore the parse) was? (You can imagine a system that could read marked-up documents when given a description of the markup, but even this is O(N): somebody still has to describe the markup.) Once this parsing into a uniform notation is done, you can then use an easily built R parser to read the XML (assuming one doesn't already exist), or, if R isn't the right answer, parse it with whatever the right answer is.

There are tools that help you build parsers and parse trees for arbitrary languages (and even translators from the parse trees to other forms). ANTLR is one; it is used by enough people that you might even accidentally find a texinfo parser somebody already built. Our DMS Software Reengineering Toolkit is another; after parsing, DMS will export an XML document with the parse tree directly (but it won't necessarily be in that uniform representation you ideally want). These tools will likely make it relatively easy to read the markup and represent it in XML.

But I think your real problem will be deciding what you want to extract and do, and then finding a way to do that. Unless you have a clear idea of how to do the latter, building all the up-front parsers just seems like a lot of work with unclear payoff. Maybe you have a simpler goal ("manage and extend", but those words can hide a lot) that's more doable.

answered Sep 19 '22 by Ira Baxter