Why should I use a human readable file format in preference to a binary one? Is there ever a situation when this isn't the case? EDIT: I did have this as an explanation when initially posting the question, but it's not so relevant now: When answering this question I wanted to refer the asker to a standard SO answer on why using a human readable file format is a good idea. Then I searched for one and couldn't find one. So here's the question

<h3> It depends </h3> The right answer is it depends. If you are writing audio/video data for instance, if you crowbar it into a human readable format, it won't be very readable! And word documents are the classic example where people have wished they were human readable, so more flexible, and by moving to XML MS are going that way. Much more important than binary or text is a standard or not a standard. If you use a standard format, then chances are you and the next guy won't have to write a parser, and that's a win for everyone. Following this are some opinionated reasons why you might want to choose one over the other, if you have to write your own format (and parser). <h3> Why use human readable? </h3> <ol> <li> The next guy. Consider the maintaining developer looking at your code 30 years or six months from now. Yes, he should have the source code. Yes he should have the documents and the comments. But he quite likely won't. And having been that guy, and had to rescue or convert old, extremely, valuable data, I'll thank you for for making it something I can just look at and understand.</li> <li> Let me read AND WRITE it with my own tools. If I'm an emacs user I can use that. Or Vim, or notepad or ... Even if you've created great tools or libraries, they might not run on my platform, or even run at all any more. Also, I can then create new data with my tools.</li> <li> The tax isn't that big - storage is free. Nearly always disc space is free. And if it isn't you'll know. Don't worry about a few angle brackets or commas, usually it won't make that much difference. Premature optimisation is the root of all evil. And if you are really worried just use a standard compression tool, and then you have a small human readable format - anyone can run unzip.</li> <li> The tax isn't that big - computers are quick. It might be a faster to parse binary. Until you need to add an extra column, or data type, or support both legacy and new files. (though this is mitigated with Protocol Buffers)</li> <li> There are a lot of good formats out there. Even if you don't like XML. Try CSV. Or JSON. Or .properties. Or even XML. Lots of tools exist for parsing these already in lots of languages. And it only takes 5mins to write them again if mysteriously all the source code gets lost.</li> <li> Diffs become easy. When you check in to version control it is much easier to see what has changed. And view it on the Web. Or your iPhone. Binary, you know something has changed, but you rely on the comments to tell you what.</li> <li> Merges become easy. You still get questions on the web asking how to append one PDF to another. This doesn't happen with Text.</li> <li> Easier to repair if corrupted. Try and repair a corrupt text document vs. a corrupt zip archive. Enough said.</li> <li> Every language (and platform) can read or write it. Of course, binary is the native language for computers, so every language will support binary too. But a lot of the classic little tool scripting languages work a lot better with text data. I can't think of a language that works well with binary and not with text (assembler maybe) but not the other way round. And that means your programs can interact with other programs you haven't even thought of, or that were written 30 years before yours. There are reasons Unix was successful.</li> </ol> <h3>Why not, and use binary instead?</h3> <ol> <li> You might have a lot of data - terabytes maybe. And then a factor of 2 could really matter. But premature optimization is still the root of all evil. How about use a human one now, and convert later? It won't take much time.</li> <li> Storage might be free but bandwidth isn't (Jon Skeet in comments). If you are throwing files around the network then size can really make a difference. Even bandwidth to and from disc can be a limiting factor.</li> <li> Really performance intensive code. Binary can be seriously optimised. There is a reason databases don't normally have their own plain text format.</li> <li> A binary format might be the standard. So use PNG, MP3 or MPEG. It makes the next guys job easier (for at least the next 10 years).</li> <li> There are lots of good binary formats out there. Some are global standards for that type of data. Or might be a standard for hardware devices. Some are standard serialization frameworks. A great example is Google Protocol Buffers. Another example: Bencode </li> <li> Easier to embed binary. Some data already is binary and you need to embed it. This works naturally in binary file formats, but looks ugly and is very inefficient in human readable ones, and usually stops them being human readable.</li> <li> Deliberate obscurity. Sometimes you don't want it obvious what your data is doing. Encryption is better than accidental security through obscurity, but if you are encrypting you might as well make it binary and be done with it.</li> </ol> <h3>Debatable</h3> <ol> <li> Easier to parse. People have claimed that both text and binary are easier to parse. Now clearly the easiest to parse is when your language or library supports parsing, and this is true for some binary and some human readable formats, so doesn't really support either. Binary formats can clearly be chosen so they are easy to parse, but so can human readable (think CSV or fixed width) so I think this point is moot. Some binary formats can just be dumped into memory and used as is, so this could be said to be the easiest to parse, especially if numbers (not just strings are involved. However I think most people would argue human readable parsing is easier to debug, as it is easier to see what is going on in the debugger (slightly).</li> <li> Easier to control. Yes, it is more likely someone will mangle text data in their editor, or will moan when one Unicode format works and another doesn't. With binary data that is less likely. However, people and hardware can still mangle binary data. And you can (and should) specify a text encoding for human-readable data, either flexible or fixed.</li> </ol> At the end of the day, I don't think either can really claim an advantage here. <h3> Anything else </h3> Are you sure you really want a file? Have you considered a database? :-) Credits A lot of this answer is merging together stuff other people wrote in other answers (you can see them there). And especially big thanks to Jon Skeet for his comments (both here and offline) for suggesting ways it could be improved.

Why should I use a human readable file format?

Tags:

file

formatting

abstraction

encoding

binary

Why should I use a human readable file format in preference to a binary one? Is there ever a situation when this isn't the case?

EDIT: I did have this as an explanation when initially posting the question, but it's not so relevant now:

When answering this question I wanted to refer the asker to a standard SO answer on why using a human readable file format is a good idea. Then I searched for one and couldn't find one. So here's the question

998

asked Feb 20 '09 08:02

Nick Fortescue

2 Answers

It depends

The right answer is it depends. If you are writing audio/video data for instance, if you crowbar it into a human readable format, it won't be very readable! And word documents are the classic example where people have wished they were human readable, so more flexible, and by moving to XML MS are going that way.

Much more important than binary or text is a standard or not a standard. If you use a standard format, then chances are you and the next guy won't have to write a parser, and that's a win for everyone.

Following this are some opinionated reasons why you might want to choose one over the other, if you have to write your own format (and parser).

Why use human readable?

The next guy. Consider the maintaining developer looking at your code 30 years or six months from now. Yes, he should have the source code. Yes he should have the documents and the comments. But he quite likely won't. And having been that guy, and had to rescue or convert old, extremely, valuable data, I'll thank you for for making it something I can just look at and understand.
Let me read AND WRITE it with my own tools. If I'm an emacs user I can use that. Or Vim, or notepad or ... Even if you've created great tools or libraries, they might not run on my platform, or even run at all any more. Also, I can then create new data with my tools.
The tax isn't that big - storage is free. Nearly always disc space is free. And if it isn't you'll know. Don't worry about a few angle brackets or commas, usually it won't make that much difference. Premature optimisation is the root of all evil. And if you are really worried just use a standard compression tool, and then you have a small human readable format - anyone can run unzip.
The tax isn't that big - computers are quick. It might be a faster to parse binary. Until you need to add an extra column, or data type, or support both legacy and new files. (though this is mitigated with Protocol Buffers)
There are a lot of good formats out there. Even if you don't like XML. Try CSV. Or JSON. Or .properties. Or even XML. Lots of tools exist for parsing these already in lots of languages. And it only takes 5mins to write them again if mysteriously all the source code gets lost.
Diffs become easy. When you check in to version control it is much easier to see what has changed. And view it on the Web. Or your iPhone. Binary, you know something has changed, but you rely on the comments to tell you what.
Merges become easy. You still get questions on the web asking how to append one PDF to another. This doesn't happen with Text.
Easier to repair if corrupted. Try and repair a corrupt text document vs. a corrupt zip archive. Enough said.
Every language (and platform) can read or write it. Of course, binary is the native language for computers, so every language will support binary too. But a lot of the classic little tool scripting languages work a lot better with text data. I can't think of a language that works well with binary and not with text (assembler maybe) but not the other way round. And that means your programs can interact with other programs you haven't even thought of, or that were written 30 years before yours. There are reasons Unix was successful.

Why not, and use binary instead?

You might have a lot of data - terabytes maybe. And then a factor of 2 could really matter. But premature optimization is still the root of all evil. How about use a human one now, and convert later? It won't take much time.
Storage might be free but bandwidth isn't (Jon Skeet in comments). If you are throwing files around the network then size can really make a difference. Even bandwidth to and from disc can be a limiting factor.
Really performance intensive code. Binary can be seriously optimised. There is a reason databases don't normally have their own plain text format.
A binary format might be the standard. So use PNG, MP3 or MPEG. It makes the next guys job easier (for at least the next 10 years).
There are lots of good binary formats out there. Some are global standards for that type of data. Or might be a standard for hardware devices. Some are standard serialization frameworks. A great example is Google Protocol Buffers. Another example: Bencode
Easier to embed binary. Some data already is binary and you need to embed it. This works naturally in binary file formats, but looks ugly and is very inefficient in human readable ones, and usually stops them being human readable.
Deliberate obscurity. Sometimes you don't want it obvious what your data is doing. Encryption is better than accidental security through obscurity, but if you are encrypting you might as well make it binary and be done with it.

Debatable

Easier to parse. People have claimed that both text and binary are easier to parse. Now clearly the easiest to parse is when your language or library supports parsing, and this is true for some binary and some human readable formats, so doesn't really support either. Binary formats can clearly be chosen so they are easy to parse, but so can human readable (think CSV or fixed width) so I think this point is moot. Some binary formats can just be dumped into memory and used as is, so this could be said to be the easiest to parse, especially if numbers (not just strings are involved. However I think most people would argue human readable parsing is easier to debug, as it is easier to see what is going on in the debugger (slightly).
Easier to control. Yes, it is more likely someone will mangle text data in their editor, or will moan when one Unicode format works and another doesn't. With binary data that is less likely. However, people and hardware can still mangle binary data. And you can (and should) specify a text encoding for human-readable data, either flexible or fixed.

At the end of the day, I don't think either can really claim an advantage here.

Anything else

Are you sure you really want a file? Have you considered a database? :-)

Credits

A lot of this answer is merging together stuff other people wrote in other answers (you can see them there). And especially big thanks to Jon Skeet for his comments (both here and offline) for suggesting ways it could be improved.

179

answered Oct 20 '22 02:10

12 revs, 2 users 98%

It entirely depends on the situation.

Benefits of a human readable format:

You can read it in its "native" format
You can write it yourself, e.g. for unit tests - or even for real content, depending on what it's for

Probable benefits of a binary format:

Easier to parse (in terms of code)
Faster to parse
More efficient in terms of space
Easier to control (any time you need text in there, you can ensure it's UTF-8 encoded, and length prefixed etc)
Easier to include opaque binary data efficiently (images, etc - with a text format you'd be getting into base64)

Don't forget that you can always implement a binary format but produce tools to convert to/from a human-readable format as well. That's what the Protocol Buffers framework does - it's actually pretty rare IME to need to parse a text version of a protocol buffer, but it's really handy to be able to write it out as text.

EDIT: Just in case this ends up being an accepted answer, you should also bear in mind the point made by starblue: Human readable forms are much better for diffing. I suspect it would be feasible to design a binary format which is appropriate for diffing (and where a human-readable diff could be generated) but out-of-the-box support from existing diff tools will be better for text.

answered Oct 20 '22 03:10

Jon Skeet

Related questions
                            
                                Exclude certain file extensions when getting files from a directory
                            
                                Run .jar from batch-file
                            
                                GIT list of new/modified/deleted files
                            
                                Change File Encoding to utf-8 via vim in a script
                            
                                Are there performance issues storing files in PostgreSQL?
                            
                                Splitting a file in linux based on content [duplicate]
                            
                                File opening mode in Ruby
                            
                                Best way to choose a random file from a directory in a shell script
                            
                                What is the best/simplest way to read in an XML file in Java application? [closed]
                            
                                Why does Java read a big file faster than C++?
                            
                                Read, edit, and write a text file line-wise using Ruby
                            
                                Create file with given size in Java
                            
                                Adding folders to a zip file using python
                            
                                What is the best way to slurp a file into a string in Perl?
                            
                                Standard File Naming Conventions in Ruby
                            
                                WPF/C#: Where should I be saving user preferences files?
                            
                                Symfony2 file upload step by step
                            
                                Count the number of times a word appears in a file
                            
                                How to create a file -- including folders -- for a given path?
                            
                                Is there a need to close files that have no reference to them?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With