How to split an mbox file into n-MB big chunks using the terminal?

I've read through this question on SO, but it doesn't quite solve my problem. I want to import a Gmail-generated mbox file into another webmail service, but that service only accepts files of up to 40 MB per import.

So I somehow have to split the mbox file into files of at most 40 MB each and import them one after another. How would you do this?

My initial thought was to use the other script (formail) to save each mail as a single file and then run a script to combine them into files of up to 40 MB, but I still wouldn't know how to do this using the terminal.

I also looked at the split command, but I'm afraid it would cut mails apart in the middle. Thanks for any help!

Alex, asked Jan 23 '15 13:01



3 Answers

If your mbox is in standard format, each message will begin with From and a space:

From [email protected]

So, you could COPY YOUR MBOX TO A TEMPORARY DIRECTORY and try using awk to process it, on a message-by-message basis, only splitting at the start of any message. Let's say we went for 1,000 messages per output file:

awk 'BEGIN{chunk=0} /^From /{msgs++;if(msgs==1000){msgs=0;chunk++}}{print > "chunk_" chunk ".txt"}' mbox

then you will get output files called chunk_0.txt to chunk_n.txt, each containing up to 1,000 messages.
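To check the result, one could count the messages in each output file. This is a minimal sketch, assuming every message starts with "From " at the beginning of a line; the helper name count_messages is made up for illustration.

```shell
# count_messages FILE...: print "<file>: <number of messages>" for each file.
# Assumes each message starts with "From " at the beginning of a line.
count_messages() {
    for f in "$@"; do
        printf '%s: %s\n' "$f" "$(grep -c '^From ' "$f")"
    done
}
```

It could then be called as count_messages chunk_*.txt in the directory that holds the chunks.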

If you are unfortunate enough to be on Windows (whose cmd.exe shell does not treat single quotes as quoting characters), you will need to save the following in a file called awk.txt

BEGIN{chunk=0} /^From /{msgs++;if(msgs==1000){msgs=0;chunk++}}{print > "chunk_" chunk ".txt"}

and then type

awk -f awk.txt mbox

Mark Setchell, answered Oct 03 '22 16:10


This is an improved version of the script from Mark Setchell's answer. That script splits the mbox file by the number of messages per chunk; this one splits it by a defined maximum size per chunk.
So, if there is a size limit on uploading or importing the mbox file, you can try the script below to split it into chunks of a specified size.
Save the script below to a text file, e.g. mboxsplit.txt, in the directory that contains the mbox file (here named mbox):

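BEGIN { chunk = 0; filesize = 0 }
/^From / {
    # Start a new chunk once the current one has reached the size limit.
    if (filesize >= 40000000) {    # chunk size limit in bytes
        close("chunk_" chunk ".txt")
        filesize = 0
        chunk++
    }
}
{
    filesize += length($0) + 1     # +1 for the newline that length() does not count
    print > ("chunk_" chunk ".txt")
}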

Then run this command in the directory that contains mboxsplit.txt and the mbox file:

  awk -f mboxsplit.txt mbox

Please note:

  • A chunk may end up larger than the defined size, because the size is only checked when a new message starts: the last message is written in full before the limit is checked.
  • A message is never split across chunks.
  • A chunk may contain only one message if that message is larger than the specified chunk size.

I suggest setting the chunk size somewhat below the maximum upload/import size.
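If chunks must stay under the limit whenever possible, one option is to buffer each message and decide where it goes before writing it. The following is only a sketch of that idea, not the script above: the function name split_mbox_strict, its arguments and the default prefix chunk_ are assumptions made for illustration. LC_ALL=C is set so that length() counts bytes rather than characters.

```shell
# split_mbox_strict MBOX LIMIT_BYTES [PREFIX]
# Buffers one message at a time and starts a new chunk *before* writing a
# message that would push the current chunk over LIMIT_BYTES.
# A single message larger than LIMIT_BYTES still gets a chunk of its own.
split_mbox_strict() {
    LC_ALL=C awk -v limit="$2" -v prefix="${3:-chunk_}" '
        BEGIN { chunk = 0; filesize = 0; msgsize = 0; msg = "" }

        # Write the buffered message to the current (or a new) chunk.
        function flush(   out) {
            if (msgsize == 0) return
            if (filesize > 0 && filesize + msgsize > limit) {
                close(prefix chunk ".txt")
                chunk++
                filesize = 0
            }
            out = prefix chunk ".txt"
            printf "%s", msg > out
            filesize += msgsize
            msg = ""
            msgsize = 0
        }

        /^From / { flush() }                    # a new message begins
        { msg = msg $0 "\n"; msgsize += length($0) + 1 }
        END { flush() }                         # write the last message
    ' "$1"
}
```

A call such as split_mbox_strict mbox 40000000 would write chunk_0.txt, chunk_1.txt and so on. The trade-off is that one whole message is held in memory at a time, which may matter for very large attachments.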

Oki Erie Rinaldi, answered Oct 03 '22 16:10


formail is perfectly suited for this task. You may look at formail's +skip and -total options:

Options
...
+skip
Skip the first skip messages while splitting.
-total
Output at most total messages while splitting.

Depending on the size of your mailbox and mails, you may try

formail -100 -s <google.mbox >import-01.mbox
formail +100 -100 -s <google.mbox >import-02.mbox
formail +200 -100 -s <google.mbox >import-03.mbox

etc.

The parts need not be of equal size, of course. If there's one large e-mail, you may have only formail +100 -60 -s <google.mbox >import-02.mbox, or if there are many small messages, maybe formail +100 -500 -s <google.mbox >import-02.mbox.
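For equally sized parts, the sequence of commands could be generated rather than typed by hand. This is a rough sketch: the function name print_formail_cmds is made up, it counts messages with grep (which only approximates how formail itself detects message boundaries), it assumes formail accepts +0 as "skip nothing", and it does not quote file names containing spaces.

```shell
# print_formail_cmds MBOX PER_CHUNK
# Prints one formail command per chunk; pipe the output to sh to run them.
# Assumes each message starts with "From " at the beginning of a line.
print_formail_cmds() {
    mbox=$1
    per=$2
    total=$(grep -c '^From ' "$mbox" || :)
    skip=0
    n=1
    while [ "$skip" -lt "$total" ]; do
        printf 'formail +%d -%d -s <%s >import-%02d.mbox\n' \
            "$skip" "$per" "$mbox" "$n"
        skip=$((skip + per))
        n=$((n + 1))
    done
}
```

Running print_formail_cmds google.mbox 100 first shows the commands for review; appending | sh would execute them.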

To find a suitable initial number of mails per chunk, try the following; the last number wc prints is the size in bytes:

formail -100 -s <google.mbox | wc
formail -500 -s <google.mbox | wc
formail -1000 -s <google.mbox | wc

You may need to experiment a bit to find counts that fit your mailbox. On the other hand, since this seems to be a one-time task, you may not want to spend too much time on it.
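Once the parts are written, a quick check can list any that are still too large. A small sketch, with the helper name report_oversized invented for illustration; whether the import service counts 40 MB as 40000000 or 41943040 bytes is not known here, so the limit is passed in explicitly.

```shell
# report_oversized LIMIT_BYTES FILE...: print every file larger than LIMIT_BYTES.
report_oversized() {
    limit=$1
    shift
    for f in "$@"; do
        size=$(($(wc -c <"$f")))
        if [ "$size" -gt "$limit" ]; then
            printf '%s: %d bytes\n' "$f" "$size"
        fi
    done
}
```

For example, report_oversized 40000000 import-*.mbox prints nothing when every part fits.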

Olaf Dietsche, answered Oct 03 '22 17:10