
Read/write to a large file in Java

I have a binary file with the following format:

[N bytes identifier & record length] [n1 bytes data] 
[N bytes identifier & record length] [n2 bytes data] 
[N bytes identifier & record length] [n3 bytes data]

As you can see, I have records of different lengths. Each record starts with a fixed N bytes that contain an id and the length of the data in the record.

This file is very big and can contain 3 million records.

I want to open this file in an application and let the user browse and edit the records (insert / update / delete records).

My initial plan is to create an index file from the original file and, for each record, keep the address of the next and previous record so I can navigate forward and backward easily (some sort of linked list, but in a file rather than in memory).

  • Is there a (Java) library to help me implement this requirement?

  • Do you have any recommendations or experience that you think would be useful?
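
To make the format concrete, here is a minimal sketch of the sequential scan such an index would be built from. The header layout used here (a 4-byte int id followed by a 4-byte int data length) is purely an assumption for illustration; the real third-party format will differ in its details.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Builds a map from record id to the byte offset of that record's header.
    // The 4-byte id + 4-byte length header is an ASSUMED layout for illustration.
    public class RecordIndexer {

        public static Map<Integer, Long> buildIndex(String path) throws IOException {
            Map<Integer, Long> index = new LinkedHashMap<>();
            try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
                long offset = 0;
                long fileLength = file.length();
                while (offset < fileLength) {
                    file.seek(offset);
                    int id = file.readInt();          // assumed 4-byte id
                    int dataLength = file.readInt();  // assumed 4-byte data length
                    index.put(id, offset);
                    offset += 8 + dataLength;         // skip header (8 bytes) + data
                }
            }
            return index;
        }
    }

Because LinkedHashMap preserves insertion order, iterating over the entries already gives forward navigation; keeping the offsets in an array as well would give cheap backward navigation without storing a linked list in the file.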

----------------- EDIT ----------------------------------------------

Thanks for the guidance and suggestions.

Some more info:

The original file and its format are out of my control (it's a third-party file) and I can't change the file format. But I have to read it, let the user navigate over the records and edit some of them (insert a new record / update an existing record / delete a record), and at the end save it back in the original file format.

Do you still recommend a database instead of a normal index file?

----------------- SECOND EDIT ----------------------------------------------

The record size in update mode is fixed. That means an updated (edited) record has the same length as the original record, unless the user deletes the record and creates another record with a different format.

Many Thanks

asked Apr 01 '11 by mhshams




3 Answers

Seriously, you should NOT be using a binary file for this. You should use a database.

The problems with trying to implement this as a regular file stem from the fact that operating systems do not allow you to insert extra bytes into the middle of an existing file. So if you need to insert a record (anywhere but the end), update a record (with a different size) or remove a record, you would need to:

  • rewrite other records (after the insertion/update/deletion point) to make or reclaim space, or
  • implement some kind of free space management within the file.

All of this is complicated and/or expensive.

Fortunately, there is a class of software that implements this kind of thing. It is called database software. There is a wide range of options, from full-scale RDBMSs to lightweight solutions like BerkeleyDB files.


In response to your 1st and 2nd edits, a database will still be simpler.

However, here's an alternative that might perform better for this use case than using a DB, without doing complicated free-space management.

  1. Read the file and build an in-memory index that maps ids to file locations.

  2. Create a second file to hold new and updated records.

  3. Perform the record adds/updates/deletes:

    1. An addition is handled by writing the new record to the end of the second file, and adding an index entry for it.

    2. An update is handled by writing the updated record to the end of the second file, and changing the existing index entry to point to it.

    3. A delete is handled by deleting the index entry for the record's key.

  4. Compact the file as follows:

    1. Create a new file.

    2. Read each record in the old file in order, and check the index for the record's key. If the entry still points to the location of the record, copy the record to the new file. Otherwise skip it.

    3. Repeat step 4.2 for the second file.

  5. If all of the above completed successfully, delete the old file and the second file.

Note this relies on being able to keep the index in memory. If that is not feasible, then the implementation is going to be more complicated ... and more like a database.
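
Here is a rough sketch of that scheme, reusing the hypothetical 4-byte-id / 4-byte-length header from the question (the real format's header will differ). Each index entry records which file holds the live copy of a record and at what offset; everything else is an append or a sequential copy.

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.HashMap;
    import java.util.Map;

    // A sketch of the index-plus-overflow-file scheme described above.
    // Record headers use the ASSUMED 4-byte id + 4-byte data length layout.
    public class RecordStore implements AutoCloseable {

        // Which file holds the live copy of a record, and at what offset.
        private static final class Location {
            final boolean inSecondFile;
            final long offset;
            Location(boolean inSecondFile, long offset) {
                this.inSecondFile = inSecondFile;
                this.offset = offset;
            }
        }

        private final Map<Integer, Location> index = new HashMap<>();
        private final RandomAccessFile original; // third-party file, read-only
        private final RandomAccessFile second;   // overflow file for adds/updates

        public RecordStore(File originalFile, File secondFile) throws IOException {
            original = new RandomAccessFile(originalFile, "r");
            second = new RandomAccessFile(secondFile, "rw");
            // Step 1: scan the original file once and index every record.
            long offset = 0;
            while (offset < original.length()) {
                original.seek(offset);
                int id = original.readInt();
                int dataLength = original.readInt();
                index.put(id, new Location(false, offset));
                offset += 8 + dataLength;
            }
        }

        // Steps 3.1 and 3.2: adds and updates both append to the second
        // file and (re)point the index entry at the new copy.
        public void put(int id, byte[] data) throws IOException {
            long offset = second.length();
            second.seek(offset);
            second.writeInt(id);
            second.writeInt(data.length);
            second.write(data);
            index.put(id, new Location(true, offset));
        }

        // Step 3.3: a delete just drops the index entry; compaction will
        // skip any record the index no longer points at.
        public void delete(int id) {
            index.remove(id);
        }

        // Step 4: copy every still-live record into a fresh file.
        public void compactTo(File newFile) throws IOException {
            try (RandomAccessFile out = new RandomAccessFile(newFile, "rw")) {
                copyLive(original, false, out);
                copyLive(second, true, out);
            }
        }

        private void copyLive(RandomAccessFile in, boolean isSecond,
                              RandomAccessFile out) throws IOException {
            long offset = 0;
            while (offset < in.length()) {
                in.seek(offset);
                int id = in.readInt();
                int dataLength = in.readInt();
                byte[] data = new byte[dataLength];
                in.readFully(data);
                Location loc = index.get(id);
                // Copy only if the index still points at this exact copy.
                if (loc != null && loc.inSecondFile == isSecond
                        && loc.offset == offset) {
                    out.writeInt(id);
                    out.writeInt(dataLength);
                    out.write(data);
                }
                offset += 8 + dataLength;
            }
        }

        @Override
        public void close() throws IOException {
            original.close();
            second.close();
        }
    }

The appeal of this design is that adds, updates, and deletes are each a single append or map operation; all of the expensive rewriting is deferred to one sequential compaction pass.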

answered Oct 01 '22 by Stephen C


Having a data file and an index file would be the general idea for such an implementation, but you'd pretty quickly find yourself dealing with data fragmentation after repeated updates and deletions. This kind of project should be a separate project in itself, not part of your main application. Essentially, a database is what you need, as it is designed specifically for such operations and use cases; it will also let you search, sort, and extend (alter) your data structure without refactoring an in-house (custom) solution.

May I suggest you download Apache Derby and create a local embedded database (Derby does this for you when you create a new embedded connection at run time). It will not only be faster than anything you'll write yourself, but it will also make your application easier to maintain.

Apache Derby is a single jar file that you can simply include and distribute with your project (check the license in case any legal issues apply to your app). There is no need for a database server or third-party software; it's all pure Java.
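
For instance, here is a minimal sketch of opening an embedded Derby database. Derby creates the database on the first connection when ";create=true" is passed; the database name and table schema below are illustrative only.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class DerbyExample {
        public static void main(String[] args) throws SQLException {
            // Derby creates the "recordsDB" database directory on first
            // connection because of ";create=true". No server process needed.
            try (Connection conn =
                     DriverManager.getConnection("jdbc:derby:recordsDB;create=true");
                 Statement stmt = conn.createStatement()) {
                // Illustrative schema; this fails if the table already exists.
                stmt.executeUpdate(
                    "CREATE TABLE records (id INT PRIMARY KEY, data BLOB)");
            }
        }
    }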

The bottom line is that it all depends on how large your application is, whether you need to share the data across many clients, whether speed is a critical aspect of your app, etc.

For a stand-alone, single-user project, I recommend Apache Derby. For an n-tier application, you might want to look into MySQL, PostgreSQL or (hrm) even Oracle. Using ready-made, tested solutions is not only smart but will also cut down your development time (and maintenance effort).

Cheers.

answered Oct 01 '22 by Yanick Rochon


Generally you are better off letting a library or database do the work for you.

You may not want to have an SQL database and there are plenty of simple databases which don't use SQL. http://nosql-database.org/ lists 122 of them.

At a minimum, if you are going to write this I suggest you read the source for one of these databases to see how they work.


Depending on the size of the records, 3 million isn't that much, and I would suggest you keep as much in memory as possible. (For a sense of scale, an index mapping 3 million int ids to long offsets is only a few tens of megabytes of raw data, although Java object overhead adds to that.)

The first problem you are likely to have is ensuring the data is consistent and can be recovered when corruption occurs. The second is dealing with fragmentation efficiently (something the brightest minds working on garbage collectors deal with). The third is maintaining the index transactionally with the source data to ensure there are no inconsistencies.

While this may appear simple at first, there are significant complexities in making sure the data is reliable, maintainable and can be accessed efficiently. This is why most developers use an existing database/datastore library and concentrate on the features that are unique to their application.

answered Oct 01 '22 by Peter Lawrey