How to parse a tar file in C++

Tags:

c++

tar

What I want to do is download a .tar file with multiple directories with 2 files each. The problem is I can't find a way to read the tar file without actually extracting the files (using tar).

The perfect solution would be something like:

#include <easytar>

Tarfile tar("somefile.tar");
std::string currentFile, currentFileName;
for(int i=0; i<tar.size(); i++){
  currentFile = tar.getFileText(i);
  currentFileName = tar.getFileName(i);
  // do stuff with it
}

I'm probably going to have to write this myself, but any ideas would be appreciated.

Brendan Long asked Mar 24 '10 02:03


2 Answers

I figured this out myself after a bit of work. The tar file spec actually tells you everything you need to know.

First off, every entry in the archive starts with a 512-byte header, so you can represent it with a char[512] or a char* pointing somewhere into a larger char array (if you have the entire tar file loaded into one array, for example).

The header looks like this:

location  size  field
0         100   File name
100       8     File mode
108       8     Owner's numeric user ID
116       8     Group's numeric user ID
124       12    File size in bytes
136       12    Last modification time in numeric Unix time format
148       8     Checksum for header block
156       1     Link indicator (file type)
157       100   Name of linked file
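As a rough sketch, the table above can be mirrored by a struct of char arrays. The names TarHeader and read_header are hypothetical, and everything past offset 257 is lumped into one padding member here, even though ustar archives store further fields in that space.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical struct; field order and sizes follow the table above.
struct TarHeader {
    char name[100];      // offset 0
    char mode[8];        // offset 100
    char uid[8];         // offset 108
    char gid[8];         // offset 116
    char size[12];       // offset 124
    char mtime[12];      // offset 136
    char checksum[8];    // offset 148
    char typeflag;       // offset 156
    char linkname[100];  // offset 157
    char rest[255];      // offset 257, pads the struct to 512 bytes
};

// Every member is a char, so no padding is inserted by the compiler.
static_assert(sizeof(TarHeader) == 512, "header must be exactly one block");
static_assert(offsetof(TarHeader, size) == 124, "size field offset");
static_assert(offsetof(TarHeader, typeflag) == 156, "type flag offset");

// Copies one 512-byte block out of a larger buffer into the struct.
TarHeader read_header(const char *block) {
    TarHeader header;
    std::memcpy(&header, block, sizeof header);
    return header;
}
```

Copying with memcpy avoids casting the buffer pointer to a struct pointer; indexing the raw char array by offset, as the rest of this answer does, works just as well.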

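So if you want the file name, you can grab it with std::string filename(&buffer[0], 100); (note the &: buffer[0] on its own is a single char, which would select the fill constructor instead). The file name is NUL-padded, so that string carries the trailing NULs with it. If you check that there is at least one NUL in the field, you can leave off the size and use std::string filename(&buffer[0]);, which stops at the first NUL.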
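One way to handle both cases (a name shorter than 100 bytes, and a name that fills the field with no NUL at all) is to search for the first NUL within the field. This is only a sketch; header_name is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the entry name from a 512-byte header block.
std::string header_name(const char *header) {
    const std::size_t field_size = 100;
    // Stop at the first NUL, or take all 100 bytes if there is none.
    const char *end = std::find(header, header + field_size, '\0');
    return std::string(header, end);
}
```

The result never contains padding bytes and never reads past the name field.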

Now we want to know if it's a file or a folder. The "link indicator" field has this information, so:

// Note that we're comparing to ascii numbers, not ints
switch(buffer[156]){
    case '0': // intentionally dropping through
    case '\0':
        // normal file
        break;
    case '1':
        // hard link
        break;
    case '2':
        // symbolic link
        break;
    case '3':
        // character device (special file)
        break;
    case '4':
        // block device
        break;
    case '5':
        // directory
        break;
    case '6':
        // named pipe
        break;
}

At this point, we already have all of the information we need about directories, but we need one more thing from normal files: the actual file contents.

The length of the file can be stored in two different ways, either as a 0-or-space-padded null-terminated octal string, or "a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field".

Numeric values are encoded in octal numbers using ASCII digits, with leading zeroes. For historical reasons, a final NUL or space character should be used. Thus although there are 12 bytes reserved for storing the file size, only 11 octal digits can be stored. This gives a maximum file size of 8 gigabytes on archived files. To overcome this limitation, star in 2001 introduced a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field. GNU-tar and BSD-tar followed this idea. Additionally, versions of tar from before the first POSIX standard from 1988 pad the values with spaces instead of zeroes.
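The 8 gigabyte figure follows from the field width: 11 octal digits carry 3 bits each, so 33 bits in total. A small compile-time check of that arithmetic (max_octal_size is a name made up for this sketch):

```cpp
#include <cstdint>

// Largest value that fits in 11 octal digits: 2^33 - 1.
constexpr std::uint64_t max_octal_size = (std::uint64_t{1} << 33) - 1;

// The same number written as an octal literal with eleven 7s.
static_assert(max_octal_size == 077777777777ULL, "11 octal digits hold 33 bits");
```

So the largest size the octal form can record is one byte short of 8 GiB.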

Here's how you would read the octal format, but I haven't written code for the base-256 version:

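// in one function (std::uint64_t needs <cstdint>; an int can't hold sizes up to 8 GB)
std::uint64_t size_of_file = octal_string_to_int(&buffer[124], 11);

// elsewhere
// Skips leading spaces (pre-POSIX padding) and stops at the first
// character that isn't an octal digit (the trailing NUL or space).
std::uint64_t octal_string_to_int(const char *current_char, unsigned int size){
    std::uint64_t output = 0;
    while(size > 0 && *current_char == ' '){
        current_char++;
        size--;
    }
    while(size > 0 && *current_char >= '0' && *current_char <= '7'){
        output = output * 8 + (*current_char - '0');
        current_char++;
        size--;
    }
    return output;
}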
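For the base-256 variant, a minimal sketch might look like the following. It assumes the value is non-negative and fits in 64 bits, it does not handle the negative form that some tar implementations can write, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// True if the numeric field uses the base-256 (binary) encoding,
// signalled by the high-order bit of the leftmost byte.
bool is_base256(const char *field) {
    return (static_cast<unsigned char>(field[0]) & 0x80) != 0;
}

// Decodes a big-endian base-256 field. Assumption: non-negative value
// that fits in 64 bits; higher-order bits are silently dropped.
std::uint64_t base256_to_uint(const char *field, std::size_t size) {
    // Mask off the marker bit in the first byte.
    std::uint64_t output = static_cast<unsigned char>(field[0]) & 0x7F;
    for (std::size_t i = 1; i < size; ++i) {
        output = (output << 8) | static_cast<unsigned char>(field[i]);
    }
    return output;
}
```

A caller would check is_base256(&buffer[124]) first and fall back to the octal parser otherwise.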

Ok, so now we have everything except the actual file contents. All we have to do is grab the next size bytes of data from the tar file and we'll have our file contents:

// Get to the next block after the header ends
location += 512;
file_contents = new char[size];
memcpy(file_contents, &buffer[location], size);
// Go to the next header by rounding the size up to a multiple of 512
// This isn't necessarily the most efficient way to do this,
// but it's the most obvious.
location += (int)ceil(size / 512.0) * 512;
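Putting the steps together, here is a sketch that walks an archive already loaded into memory and records where each entry's contents live. TarEntry, parse_octal and list_entries are hypothetical names. It assumes octal size fields only, and it relies on the fact that a tar archive ends with zero-filled 512-byte blocks.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record describing one archive member.
struct TarEntry {
    std::string name;
    char type;                // link indicator byte at offset 156
    std::uint64_t size;       // content length in bytes
    std::size_t data_offset;  // where the contents start in the buffer
};

// Parses an octal field: skips leading spaces, stops at the first
// byte that is not an octal digit.
std::uint64_t parse_octal(const char *p, std::size_t n) {
    std::uint64_t value = 0;
    std::size_t i = 0;
    while (i < n && p[i] == ' ') ++i;
    for (; i < n && p[i] >= '0' && p[i] <= '7'; ++i)
        value = value * 8 + static_cast<std::uint64_t>(p[i] - '0');
    return value;
}

std::vector<TarEntry> list_entries(const std::vector<char> &archive) {
    const std::size_t block = 512;
    std::vector<TarEntry> entries;
    std::size_t location = 0;
    while (location + block <= archive.size()) {
        const char *header = archive.data() + location;
        // An all-zero block marks the end of the archive.
        if (std::all_of(header, header + block,
                        [](char c) { return c == 0; }))
            break;
        TarEntry entry;
        entry.name.assign(header, std::find(header, header + 100, '\0'));
        entry.type = header[156];
        entry.size = parse_octal(header + 124, 12);
        entry.data_offset = location + block;
        // Stop if the header claims more data than the buffer holds.
        if (entry.size > archive.size() - entry.data_offset) break;
        entries.push_back(entry);
        // Contents are padded up to a multiple of 512 bytes.
        const std::size_t padded =
            static_cast<std::size_t>((entry.size + block - 1) / block) * block;
        location = entry.data_offset + padded;
    }
    return entries;
}
```

The contents of entry i can then be read as archive.data() + entries[i].data_offset for entries[i].size bytes, without extracting anything to disk.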
Brendan Long answered Sep 23 '22 06:09


Have you looked at libtar?

From the fink package info:

libtar-1.2-1: Tar file manipulation API
libtar is a C library for manipulating POSIX tar files. It handles adding and extracting files to/from a tar archive. libtar offers the following features:
* Flexible API - you can manipulate individual files or just extract a whole archive at once.
* Allows user-specified read() and write() functions, such as zlib's gzread() and gzwrite().
* Supports both POSIX 1003.1-1990 and GNU tar file formats.

Not C++ per se, but you can link to C pretty easily.

dmckee --- ex-moderator kitten answered Sep 26 '22 06:09