
Data format for text annotations

The diff format is more or less the de facto standard for representing differences between texts and is widely used by programmers to distribute source code changes. Most version control systems can output diffs, and diffs are used to discuss proposed changes to text (e.g. source code) since they are very powerful in illustrating the changes.

However, I would often like to simply comment on a text without changing it, and would like a data format which can represent annotations to text in a way that is as powerful as diff is for changes. A typical use case would be a code review where I want to comment on the code but not (yet) propose any changes. Another use case would be to annotate an article with my own thoughts and reminders. In Word, I can annotate text by marking it and creating a comment balloon beside the text. But Word is cumbersome in other ways – I would like to have just the annotations in a separate file and keep the originals as they are.

What data formats exist that can represent annotations to text in a way that is as exact as a diff is for changes?

I'm not looking for general answers like "XML". I'm looking for formats that explicitly represent annotations to text. (Perhaps no such format exists except the application-specific formats of certain programs like Word.)

Fabian Fagerholm asked Dec 02 '12 17:12


1 Answer

Excellent question.

Most people would throw XML, or a subset like HTML, into the discussion. These markup languages store annotations as (data-) attributes embedded in the original text, which is not what you are looking for. I'm therefore excluding XML/HTML, RDF, and microformats.

In general

You need to keep the original text, clone it, and then add the annotations to the clone via a custom markup language. This lets you diff the original text against the annotated text. The important part is storing the original text separately from the revisions of the annotated text.

This allows multiple diffs:

  • diff between "original text" and "annotated text revision 1..n"
  • diff between "annotated text rev n" and "annotated text rev n+1".

This is quite powerful.
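A minimal sketch of the first kind of diff, using Python's standard difflib. The `#@ ` annotation prefix, the file names, and the `annotation_diff` helper are made up for illustration; any custom markup that adds lines without touching the original ones would work the same way.

```python
import difflib

# The untouched original text, one list entry per line.
original = [
    "def area(r):\n",
    "    return 3.14 * r * r\n",
]

# Annotated clone. Assumption: a line starting with "#@ " is an annotation
# (hypothetical markup, not part of any standard).
annotated_rev1 = [
    "def area(r):\n",
    "#@ Consider math.pi instead of 3.14.\n",
    "    return 3.14 * r * r\n",
]


def annotation_diff(before, after, before_name, after_name):
    """Return a unified diff between two lists of lines as one string."""
    return "".join(
        difflib.unified_diff(before, after, fromfile=before_name, tofile=after_name)
    )


diff_text = annotation_diff(
    original, annotated_rev1, "original.txt", "annotated.rev1.txt"
)
print(diff_text)
```

Since the annotations only add lines, the diff contains additions only, so the annotations can be shipped as an ordinary patch against the original. Diffing revision n against revision n+1 is the same call with two annotated versions.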

What data formats exist?

  • BioNLP formats, text-mining formats for information extraction
    • http://2011.bionlp-st.org/home/file-formats
  • Brat standoff format
    • http://brat.nlplab.org/standoff.html

In the standoff representation, the texts of the documents are kept separate from annotations, which are connected to specific spans of texts through character offsets. The annotations are associated with their texts by the file naming convention that their base name (file name without suffix) is the same: for example, the file PMID-1000.a1 contains annotations for the file PMID-1000.txt.

As you can see, it's a "filename-based annotation association". Lots of room for academic research and improvement ;)
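To make the standoff idea concrete, here is a rough sketch of reading such annotations in Python. It assumes the line layout of brat's text-bound annotations as I understand it (ID, TAB, type and start/end character offsets, TAB, covered text) and skips everything else, including discontinuous spans. The sample sentence, the `TextBound` class, and the helper names are invented for this example.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TextBound:
    id: str
    type: str
    start: int  # offset of the first character of the span
    end: int    # offset just past the last character of the span
    text: str   # the covered text, repeated in the annotation file


def annotation_path(text_path, suffix=".a1"):
    """Apply the naming convention: same base name, different suffix."""
    return Path(text_path).with_suffix(suffix)


def parse_text_bound(ann_source):
    """Parse text-bound lines (IDs starting with 'T'); ignore all others."""
    annotations = []
    for line in ann_source.splitlines():
        if not line.startswith("T"):
            continue
        ann_id, span, covered = line.split("\t", 2)
        if ";" in span:
            continue  # discontinuous spans are out of scope for this sketch
        ann_type, start, end = span.split(" ")
        annotations.append(TextBound(ann_id, ann_type, int(start), int(end), covered))
    return annotations


# Made-up document and annotation content for illustration.
document = "Sony announced a new camera."
ann_source = "T1\tOrganization 0 4\tSony\nT2\tProduct 17 27\tnew camera\n"

annotations = parse_text_bound(ann_source)
for ann in annotations:
    # The offsets must select exactly the covered text in the original.
    assert document[ann.start:ann.end] == ann.text

print(annotation_path("PMID-1000.txt"))
```

The check inside the loop is the point of the format: the original stays untouched, and each annotation is valid only as long as its offsets still select the text it claims to cover.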

Jens A. Koch answered Oct 25 '22 20:10