Looking for large text files for testing compression in all sizes

I am looking for large text files for testing compression and decompression at all sizes from 1 KB to 100 MB. Can someone point me to a link where I can download them?

asked Jun 12 '17 by Siranjeevi Rajendran


3 Answers

And don't forget the collection of compression corpora:

The Canterbury Corpus
The Artificial Corpus
The Large Corpus
The Miscellaneous Corpus
The Calgary Corpus

See: https://corpus.canterbury.ac.nz/descriptions/

There are download links for the files in each set.
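
For example, here is a minimal sketch of fetching and unpacking one of the sets from the command line. The archive path cantrbry.tar.gz under /resources/ is an assumption; check the descriptions page above for the actual links:

# Hypothetical archive path -- verify it on the descriptions page first
wget https://corpus.canterbury.ac.nz/resources/cantrbry.tar.gz
mkdir -p cantrbry && tar -xzf cantrbry.tar.gz -C cantrbry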

answered Oct 21 '22 by Phillip Williams

You can download enwik8 and enwik9 from here. They are respectively 100,000,000 and 1,000,000,000 bytes of text for compression benchmarks. You can always pull subsets of those for smaller tests.
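
Pulling a subset is a one-liner; for example, this sketch takes the first megabyte of enwik8 (the output file name is just an example):

# Take the first 1,000,000 bytes of enwik8 as a smaller test file
head -c 1000000 enwik8 > test_1mb.txt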

answered Oct 21 '22 by Mark Adler


*** Linux users only ***

Arbitrarily large text files can be generated on Linux with the following command:

tr -dc "A-Za-z 0-9" < /dev/urandom | fold -w100|head -n 100000 > bigfile.txt

This command generates a text file containing 100,000 lines of random text, which looks like this:

NsQlhbisDW5JVlLSaZVtCLSUUrkBijbkc5f9gFFscDkoGnN0J6GgIFqdCLyhbdWLHxRVY8IwDCrWF555JeY0yD0GtgH21NotZAEe
iWJR1A4 bxqq9VKKAzMJ0tW7TCOqNtMzVtPB6NrtCIg8NSmhrO7QjNcOzi4N b VGc0HB5HMNXdyEoWroU464ChM5R Lqdsm3iPo
1mz0cPKqobhjDYkvRs5LZO8n92GxEKGeCtt oX53Qu6T7O2E9nJLKoUeJI6Ul7keLsNGI2BC55qs7fhqW8eFDsGsLPaImF7kFJiz
...
...

On my Ubuntu 18 machine the resulting file is about 10 MB, as expected: 100,000 lines of 100 characters plus a newline each. Bumping up the number of lines, and thereby the size, is easy: just increase the head -n 100000 part. So, say, this command:

tr -dc "A-Za-z 0-9" < /dev/urandom | fold -w100|head -n 1000000 > bigfile.txt

will generate a file with 1,000,000 lines of random text, coming to around 100 MB. On my commodity hardware the latter command takes about 3 seconds to finish.
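
Since the question asks for sizes from 1 KB to 100 MB, here is a minimal sketch of a loop that produces the whole range and round-trips each file through gzip to test compression and decompression (the file names and the choice of gzip are illustrative assumptions):

# GNU head accepts K/M size suffixes, so one loop covers the whole range
for size in 1K 10K 100K 1M 10M 100M; do
    tr -dc "A-Za-z 0-9" < /dev/urandom | fold -w 100 | head -c "$size" > "test_$size.txt"
    # Round-trip: compress, decompress, and verify the output matches the input
    gzip -c "test_$size.txt" > "test_$size.txt.gz"
    gunzip -c "test_$size.txt.gz" | cmp - "test_$size.txt" && echo "$size OK"
done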

answered Oct 21 '22 by codemonkey