
How to store dynamic data (unknown number of fields) to a file?

Tags:

delphi

I need to store some data in a file. FOR THE MOMENT, each record (data set) consists of:

  • a string (variable length),
  • an array of integers (variable length),
  • an array of bytes (variable length),
  • some integer values.

It wouldn't be difficult at all to save all this stuff in a binary file. However, I know for sure that (unfortunately) my data format will change over time, and I want the possibility to add more fields to each "record". So, obviously, my file format cannot be fixed. I suppose the best solution would be to save my data in a (DB) table, but I don't want to mess with the big guns (SQL, ADO, BDE, Nexus...). I need a rudimentary library (if possible a single PAS file) that can do that. Since the purpose of this is storing data rather than working with data, can it be done without a DB table?

Requirements for this library:

  • it needs to easily support more than 1 million rows
  • really lightweight
  • single PAS file if possible
  • MANDATORY: easy to install on a new machine (together with the project into which it compiles)
  • MANDATORY: in order to use it I don't need to redistribute anything
  • MANDATORY: in order to use it the user doesn't have to install/setup stuff
  • can be freeware/shareware
  • it doesn't have to support SQL queries or similar advanced stuff

I use D7

asked Feb 11 '11 by Server Overflow


5 Answers

Take a look at our Synopse Big Table unit.

With its recent upgrade, it could fit your needs perfectly.

Here is how you create your field layout:

var Table: TSynBigTableRecord;
    FieldText, FieldInt: TSynTableFieldProperties;
begin
  Table := TSynBigTableRecord.Create('FileName.ext','TableName');
  FieldText := Table.AddField('text',tftWinAnsi,[tfoIndex]);
  FieldInt := Table.AddField('Int',tftInt32,[tfoIndex,tfoUnique]);
  Table.AddFieldUpdate;
end;

For storing an array of bytes or Integers, just use the tftWinAnsi or even better tftBlobInternal kind of field (this is a true variable-length field), then map it to or from a dynamic array, just like a RawByteString.
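
For example, here is a minimal sketch (my own helper functions, not part of the BigTable API) of packing a dynamic integer array into a raw string for such a variable-length field and unpacking it again; it assumes the Types unit's TIntegerDynArray, and on Delphi 7 a RawByteString is effectively an AnsiString:

uses Types; // for TIntegerDynArray = array of Integer

function IntArrayToRaw(const A: TIntegerDynArray): AnsiString;
begin
  // copy the raw bytes of the dynamic array into an AnsiString "blob"
  SetLength(Result, Length(A) * SizeOf(Integer));
  if Length(A) > 0 then
    Move(A[0], Result[1], Length(Result));
end;

procedure RawToIntArray(const S: AnsiString; out A: TIntegerDynArray);
begin
  // rebuild the dynamic array from the raw bytes
  SetLength(A, Length(S) div SizeOf(Integer));
  if Length(A) > 0 then
    Move(S[1], A[0], Length(A) * SizeOf(Integer));
end;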

You can safely add fields later, the data file will be processed for you.

There are several ways of handling the data, but I've implemented a variant-based way of using a record, with true late-binding:

var vari: Variant;
    aID: Integer; // assuming an integer record ID
begin
  // initialize the variant
  vari := Table.VariantVoid;
  // create record content, and add it to the database
  vari.text := 'Some text';
  vari.int := 12345;
  aID := Table.VariantAdd(vari);
  if aID=0 then
    ShowMessage('Error adding record');
  // how to retrieve it
  vari := Table.VariantGet(aID);
  assert(vari.ID=aID);
  assert(vari.INT=12345);
  assert(vari.Text='Some text');
end;

About speed, you can't find anything faster IMHO. Creating 1,000,000 records with some text and an integer value, both fields using an index and the integer field set as unique, takes less than 880 ms on my laptop. It will use very little disk space, because all storage is variable-length encoded (similar to Google's Protocol Buffers).

It needs only two units and works with Delphi 6 up to XE (it's already Unicode ready, so using this unit you could safely upgrade to a newer Delphi version whenever you want). There is nothing to install, and it adds only a few KB to your executable. It's a small but powerful NoSQL engine written in pure Delphi, but with the convenience of a database (i.e. a defined field layout) and the speed of an in-memory engine, with no limit in size.

And it's fully open source, with a permissive licence.

Note that we also provide a SQLite3 wrapper, but it's another project. Slower but more powerful, with SQL support and an integrated Client/Server ORM.

answered by Arnaud Bouchez


Use Synopse BigTable, http://synopse.info/ a key=>value database; the value in this case is the serialization of your data (JSON, binary, XML, ...).

This is CRAZY fast, lightweight and free.

answered by arthurprs


I don't think you need a database for this. If you use a database I don't see how it solves the problem of your data structure changing.

I personally would store to YAML format, which is very easily extensible. That requires quite a bit of work linking to LibYAML, so a very lightweight alternative would be to store to INI files. These are easily extensible whilst maintaining compatibility with old files.

You can quite easily roll your own binary format that is extensible. What you do is write each record as a block. Each block has a short header which includes its length.

When you read the data, you read up to the end of the block. If you reach the end of the block while still expecting more data, you simply stop reading and use default values for the missing fields. If you have read all the data you know about but are not yet at the end of the block, the file must have come from a later version of your program, and you just skip to the end of the block. Perhaps you warn that the file contained data which you didn't know about.

Extensibility is achieved by always writing data out in the same order as previous versions. Any new data goes at the end of each block.
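
Here is a minimal sketch of that block idea, using my own names and just one string plus one integer field (requires the Classes unit); a reader that expects more data than the block contains falls back to defaults, and extra data from a newer version is skipped:

procedure WriteRecordBlock(Stream: TStream; const Text: AnsiString; Int1: Integer);
var
  BlockStart, BlockEnd: Int64;
  Len, StrLen: Integer;
begin
  BlockStart := Stream.Position;
  Len := 0;
  Stream.WriteBuffer(Len, SizeOf(Len));   // placeholder for the block length
  // fields are always written in the same order as previous versions;
  // any field added in a newer version goes at the end of the block
  StrLen := Length(Text);
  Stream.WriteBuffer(StrLen, SizeOf(StrLen));
  if StrLen > 0 then
    Stream.WriteBuffer(Text[1], StrLen);
  Stream.WriteBuffer(Int1, SizeOf(Int1));
  BlockEnd := Stream.Position;
  Len := BlockEnd - BlockStart - SizeOf(Len);
  Stream.Position := BlockStart;
  Stream.WriteBuffer(Len, SizeOf(Len));   // patch in the real block length
  Stream.Position := BlockEnd;
end;

procedure ReadRecordBlock(Stream: TStream; out Text: AnsiString; out Int1: Integer);
var
  BlockEnd: Int64;
  Len, StrLen: Integer;
begin
  Stream.ReadBuffer(Len, SizeOf(Len));
  BlockEnd := Stream.Position + Len;
  Stream.ReadBuffer(StrLen, SizeOf(StrLen));
  SetLength(Text, StrLen);
  if StrLen > 0 then
    Stream.ReadBuffer(Text[1], StrLen);
  if Stream.Position + SizeOf(Int1) <= BlockEnd then
    Stream.ReadBuffer(Int1, SizeOf(Int1)) // field present in this file version
  else
    Int1 := 0;                            // older file: use a default value
  Stream.Position := BlockEnd;            // skip fields from newer versions
end;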

answered by David Heffernan


In order of your level of effort to implement, I suggest, in this order:

  1. CSV or INI files (TMemIniFile, or TJvCsvDataSet). This is the least work for you. You can support millions of lines in a single file, but the memory consumed will be enormous. I have envisioned a "data writer" component to replace my TJvCsvDataSet with something that only appends records and does not load them into memory. This would allow you to write out to CSV files, and even read them back, row by row, but not load them all at once. This approach might be ideal for you: a simple CSV reader/writer class that is NOT a dataset object.

  2. one-XML-tag-per-line files. This is more flexible than INI files and can be hierarchical; INI files are non-hierarchical. Neither SAX nor DOM is required if you simply open a file stream and append a line of text in this form, ending in CR+LF (see the sketch after this list):

    <logitem attrib1="value1" attrib2="value2" />

  3. Some kind of binary NoSQL db like bsddb, couchdb, etc.
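
For option 2, a minimal sketch of appending such a line (my own helper, using SysUtils; note that the attribute values are not XML-escaped here):

procedure AppendLogItem(const FileName, Value1, Value2: string);
var
  F: TextFile;
begin
  AssignFile(F, FileName);
  if FileExists(FileName) then
    Append(F)   // keep adding lines to the existing log file
  else
    Rewrite(F); // or create it on first use
  try
    // one self-contained XML tag per line; WriteLn ends the line with CR+LF
    WriteLn(F, Format('<logitem attrib1="%s" attrib2="%s" />', [Value1, Value2]));
  finally
    CloseFile(F);
  end;
end;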

answered by Warren P


Any time you need to store variable length data into a binary format, you should store the data's length in front of the actual data.

Since you also need to add new fields later, you should store the number of fields per record (or at least an end-of-record marker at the end) so you can maintain the correct position while moving around the file during reading and seeking operations.

As for the actual record data, I would suggest a type-length-data format for each field so you can add new fields without knowing what their data types will be ahead of time, and to allow code to recognize and read/skip individual fields as needed regardless of content (i.e., if an old app tries to read a file with newer fields, it can skip what it does not recognize).

In the end, you will end up with something like this as a start, which you can then expand on, optimize, etc as needed:

const
  cTypeUnknown     = $00;
  cTypeString      = $01;
  cTypeInteger     = $02;
  cTypeByte        = $03;

  cTypeArray        = $80;
  cTypeStringArray  = cTypeString or cTypeArray;
  cTypeIntegerArray = cTypeInteger or cTypeArray;
  cTypeByteArray    = cTypeByte or cTypeArray;

type
  Streamable = class
  public
    procedure Read(Stream: TStream); virtual; abstract;
    procedure Write(Stream: TStream); virtual; abstract;
  end;

  Field = class(Streamable)
  public
    function GetType: Byte; virtual; abstract;
  end;

  FieldClass = class of Field;

  StringField = class(Field)
  public
    Data: String;
    function GetType: Byte; override;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  StringArrayField = class(Field)
  public
    Data: array of String;
    function GetType: Byte; override;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  IntegerField = class(Field)
  public
    Data: Integer;
    function GetType: Byte; override;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  IntegerArrayField = class(Field)
  public
    Data: array of Integer;
    function GetType: Byte; override;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  ByteField = class(Field)
  public
    Data: Byte;
    function GetType: Byte; override;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  ByteArrayField = class(Field)
  public
    Data: array of Byte;
    function GetType: Byte; override;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  AnyField = class(ByteArrayField)
  public
    FieldType: Byte; // 'Type' is a reserved word in Delphi, so use FieldType
    function GetType: Byte; override;
  end;

  // 'Record' is a reserved word in Delphi, so the class is named TRecord
  TRecord = class(Streamable)
  public
    Fields: array of Field;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  RecordArray = class(Streamable)
  public
    Records: array of TRecord;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

procedure WriteByte(Stream: TStream; Value: Byte);
begin
  Stream.WriteBuffer(Value, SizeOf(Byte));
end;

function ReadByte(Stream: TStream): Byte;
begin
  Stream.ReadBuffer(Result, SizeOf(Byte));
end;

procedure WriteInteger(Stream: TStream; Value: Integer);
begin
  Stream.WriteBuffer(Value, SizeOf(Integer));
end;

function ReadInteger(Stream: TStream): Integer;
begin
  Stream.ReadBuffer(Result, SizeOf(Integer));
end;

procedure WriteString(Stream: TStream; Value: String);
var
  S: UTF8String;
begin
  S := UTF8Encode(Value);
  WriteInteger(Stream, Length(S));
  if Length(S) > 0 then
    Stream.WriteBuffer(S[1], Length(S));
end;

function ReadString(Stream: TStream): String;
var
  S: UTF8String;
begin
  SetLength(S, ReadInteger(Stream));
  if Length(S) > 0 then
    Stream.ReadBuffer(S[1], Length(S));
  Result := UTF8Decode(S);
end;

function StringField.GetType: Byte;
begin
  Result := cTypeString;
end;

procedure StringField.Read(Stream: TStream);
begin
  Data := ReadString(Stream);
end;

procedure StringField.Write(Stream: TStream);
begin
  WriteString(Stream, Data);
end;

function StringArrayField.GetType: Byte;
begin
  Result := cTypeStringArray;
end;

procedure StringArrayField.Read(Stream: TStream);
var
  I: Integer;
begin
  SetLength(Data, ReadInteger(Stream));
  for I := 0 to High(Data) do
    Data[I] := ReadString(Stream);
end;

procedure StringArrayField.Write(Stream: TStream);
var
  I: Integer;
begin
  WriteInteger(Stream, Length(Data));
  for I := 0 to High(Data) do
    WriteString(Stream, Data[I]);
end;

function IntegerField.GetType: Byte;
begin
  Result := cTypeInteger;
end;

procedure IntegerField.Read(Stream: TStream);
begin
  Assert(ReadInteger(Stream) = SizeOf(Integer));
  Data := ReadInteger(Stream);
end;

procedure IntegerField.Write(Stream: TStream);
begin
  WriteInteger(Stream, SizeOf(Integer));
  WriteInteger(Stream, Data);
end;

function IntegerArrayField.GetType: Byte;
begin
  Result := cTypeIntegerArray;
end;

procedure IntegerArrayField.Read(Stream: TStream);
var
  Num: Integer;
begin
  // the length prefix holds the element count, matching Write below
  Num := ReadInteger(Stream);
  SetLength(Data, Num);
  if Num > 0 then
    Stream.ReadBuffer(Data[0], Num * SizeOf(Integer));
end;

procedure IntegerArrayField.Write(Stream: TStream);
begin
  WriteInteger(Stream, Length(Data));
  if Length(Data) > 0 then
    Stream.WriteBuffer(Data[0], Length(Data) * SizeOf(Integer));
end;

function ByteField.GetType: Byte;
begin
  Result := cTypeByte;
end;

procedure ByteField.Read(Stream: TStream);
begin
  Assert(ReadInteger(Stream) = SizeOf(Byte));
  Data := ReadByte(Stream);
end;

procedure ByteField.Write(Stream: TStream);
begin
  WriteInteger(Stream, SizeOf(Byte));
  WriteByte(Stream, Data);
end;

function ByteArrayField.GetType: Byte;
begin
  Result := cTypeByteArray;
end;

procedure ByteArrayField.Read(Stream: TStream);
begin
  SetLength(Data, ReadInteger(Stream));
  if Length(Data) > 0 then
    Stream.ReadBuffer(Data[0], Length(Data));
end;

procedure ByteArrayField.Write(Stream: TStream);
begin
  WriteInteger(Stream, Length(Data));
  if Length(Data) > 0 then
    Stream.WriteBuffer(Data[0], Length(Data));
end;

function AnyField.GetType: Byte;
begin
  Result := FieldType;
end;

procedure TRecord.Read(Stream: TStream);
const
  PlainTypes: array[1..3] of FieldClass = (StringField, IntegerField, ByteField);
  ArrayTypes: array[1..3] of FieldClass = (StringArrayField, IntegerArrayField, ByteArrayField);
var
  I: Integer;
  RecType, PlainType: Byte;
begin
  SetLength(Fields, ReadInteger(Stream));
  for I := 0 to High(Fields) do
  begin
    RecType := ReadByte(Stream);
    PlainType := RecType and (not cTypeArray);
    if (PlainType >= cTypeString) and (PlainType <= cTypeByte) then
    begin
      if (RecType and cTypeArray) <> cTypeArray then
        Fields[I] := PlainTypes[PlainType].Create
      else
        Fields[I] := ArrayTypes[PlainType].Create;
    end else
    begin
      // unknown field type: read it as a raw byte array, remembering its type
      Fields[I] := AnyField.Create;
      AnyField(Fields[I]).FieldType := RecType;
    end;
    Fields[I].Read(Stream);
  end;
end;

procedure TRecord.Write(Stream: TStream);
var
  I: Integer;
begin
  WriteInteger(Stream, Length(Fields));
  for I := 0 to High(Fields) do
  begin
    WriteByte(Stream, Fields[I].GetType);
    Fields[I].Write(Stream);
  end;
end;

procedure RecordArray.Read(Stream: TStream);
var
  I: Integer;
begin
  SetLength(Records, ReadInteger(Stream));
  for I := 0 to High(Records) do
  begin
    Records[I] := TRecord.Create;
    Records[I].Read(Stream);
  end;
end;

procedure RecordArray.Write(Stream: TStream);
var
  I: Integer;
begin
  WriteInteger(Stream, Length(Records));
  for I := 0 to High(Records) do
    Records[I].Write(Stream);
end;

answered by Remy Lebeau