The diff format is more or less the de facto standard for representing differences between texts and is widely used by programmers to distribute source code changes. Most version control systems can output diffs, and diffs are used to discuss proposed changes to text (e.g. source code) since they are very powerful in illustrating the changes.
However, I would often like to simply comment on a text without changing it, and would like a data format which can represent annotations to text in a way that is as powerful as diff is for changes. A typical use case would be a code review where I want to comment on the code but not (yet) propose any changes. Another use case would be to annotate an article with my own thoughts and reminders. In Word, I can annotate text by marking it and creating a comment balloon beside the text. But Word is cumbersome in other ways – I would like to have just the annotations in a separate file and keep the originals as they are.
What data formats exist that can represent annotations to text in a way that is as exact as a diff is for changes?
I'm not looking for general answers like "XML". I'm looking for formats that explicitly represent annotations to text. (Perhaps no such format exists except the application-specific formats of certain programs like Word.)
Excellent question.
Most people would throw XML, or a related markup language like HTML, into the discussion. Markup languages store annotations as elements or (data-) attributes embedded in the original text. But that's not what you are looking for, so I'm excluding XML/HTML, RDF and microformats.
In general
You need to keep the original text, clone it, and then add the annotations to the clone via a custom markup language. This lets you run a text diff of the original text against the annotated text. The important part is storing the original text separately from the revisions of its annotations.
This allows multiple diffs:
This is quite powerful.
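As a rough illustration of the clone-and-annotate idea, here is a minimal Python sketch using the standard library's `difflib`. The `#@ ` prefix for annotation lines and the file names `area.py` / `area.annotated.py` are assumptions made up for this example, not an existing convention.

```python
import difflib

# The untouched original text, kept as-is.
original = [
    "def area(r):\n",
    "    return 3.14 * r * r\n",
]

# The annotated clone. The "#@ " prefix is a hypothetical marker
# for annotation lines, invented for this sketch.
annotated = [
    "def area(r):\n",
    "#@ reviewer: consider math.pi here\n",
    "    return 3.14 * r * r\n",
]

# Diffing original against the clone isolates the annotations.
diff = list(
    difflib.unified_diff(
        original,
        annotated,
        fromfile="area.py",
        tofile="area.annotated.py",
    )
)

# Added lines start with "+"; the "+++" file header is skipped.
added = [
    line[1:]
    for line in diff
    if line.startswith("+") and not line.startswith("+++")
]

print("".join(diff))
```

Because the annotations only ever add lines, the diff contains no removals, and the `added` list is exactly the set of annotations. Diffing two revisions of the annotated clone against each other would show how the annotations themselves changed.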
What data formats exist?
In the standoff representation, the texts of the documents are kept separate from annotations, which are connected to specific spans of texts through character offsets. The annotations are associated with their texts by the file naming convention that their base name (file name without suffix) is the same: for example, the file PMID-1000.a1 contains annotations for the file PMID-1000.txt.
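To make the standoff idea concrete, here is a small Python sketch. It assumes a simplified, hypothetical line layout of `<start> <end><TAB><note>` for the annotation file; the real layout of `.a1` files may differ, and only the shared base name and the character offsets are taken from the description above. The names `Annotation`, `annotation_file_for`, `parse_line` and `annotated_spans` are made up for the example.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Annotation:
    start: int  # offset of the first annotated character
    end: int    # offset one past the last annotated character
    note: str   # the comment attached to the span


def annotation_file_for(text_file: Path) -> Path:
    """Same base name, different suffix: PMID-1000.txt -> PMID-1000.a1."""
    return text_file.with_suffix(".a1")


def parse_line(line: str) -> Annotation:
    """Parse one line in the hypothetical '<start> <end>\\t<note>' layout."""
    span, note = line.rstrip("\n").split("\t", 1)
    start, end = (int(n) for n in span.split())
    return Annotation(start, end, note)


def annotated_spans(text: str, annotations: list[Annotation]) -> list[tuple[str, str]]:
    """Resolve each annotation to the text it covers, leaving text untouched."""
    return [(text[a.start:a.end], a.note) for a in annotations]


text = "return 3.14 * r * r"
annotations = [parse_line("7 11\tuse math.pi instead")]

for covered, note in annotated_spans(text, annotations):
    print(f"{covered!r}: {note}")
```

The original text is never modified; the annotation file only points into it. The price is that offsets go stale as soon as the text changes, so annotations have to be tied to a specific revision of the text.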
As you can see, it's a "filename-based annotation association". Lots of room for academic research and improvement ;)