How are these sequence files generated? I saw a link about sequence files here:
http://wiki.apache.org/hadoop/SequenceFile
Are these written using the default Java serializer? And how do I read a sequence file?
Sequence files are generated by MapReduce tasks and can be used as a common format to transfer data between MapReduce jobs.
You can read them in the following manner:
Configuration config = new Configuration();
Path path = new Path(PATH_TO_YOUR_FILE);
SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config);
WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
Writable value = (Writable) reader.getValueClass().newInstance();
while (reader.next(key, value)) {
    // process the current key and value here
}
reader.close();
Also you can generate sequence files by yourself using SequenceFile.Writer.
The classes used in the example are the following:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
They are contained in the hadoop-core Maven dependency:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>1.2.1</version>
</dependency>
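On the serializer question: the reader code above casts keys and values to Writable, which is Hadoop's own serialization interface and, by default, what sequence files use instead of java.io.Serializable. Each Writable class writes its fields to a DataOutput and reads them back from a DataInput. The following is a minimal, stdlib-only sketch of that contract. SimpleWritable, StringRecord and IntRecord are hypothetical stand-ins (loosely analogous to Writable, Text and IntWritable), and the byte layout is simplified: a real sequence file also has a header, sync markers and optional compression, none of which are modeled here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class WritableSketch {

    /** Hypothetical stand-in for org.apache.hadoop.io.Writable (same two methods). */
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Hypothetical string record, loosely analogous to Hadoop's Text. */
    static class StringRecord implements SimpleWritable {
        String text = "";
        StringRecord() {}
        StringRecord(String text) { this.text = text; }
        public void write(DataOutput out) throws IOException { out.writeUTF(text); }
        public void readFields(DataInput in) throws IOException { text = in.readUTF(); }
    }

    /** Hypothetical int record, loosely analogous to Hadoop's IntWritable. */
    static class IntRecord implements SimpleWritable {
        int value;
        IntRecord() {}
        IntRecord(int value) { this.value = value; }
        public void write(DataOutput out) throws IOException { out.writeInt(value); }
        public void readFields(DataInput in) throws IOException { value = in.readInt(); }
    }

    /** Writes key/value pairs back to back; no header or sync markers (simplification). */
    public static byte[] writePairs(String[] keys, int[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < keys.length; i++) {
                new StringRecord(keys[i]).write(out);
                new IntRecord(values[i]).write(out);
            }
        }
        return bytes.toByteArray();
    }

    /** Reads the pairs back, reusing one key and one value object for every record. */
    public static List<String> readPairs(byte[] data) throws IOException {
        List<String> result = new ArrayList<>();
        StringRecord key = new StringRecord();
        IntRecord value = new IntRecord();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                key.readFields(in);   // overwrites the previous key's fields
                value.readFields(in); // overwrites the previous value's fields
                result.add(key.text + "=" + value.value);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = writePairs(new String[] {"a", "b"}, new int[] {1, 2});
        for (String pair : readPairs(data)) {
            System.out.println(pair);
        }
    }
}
```

Like reader.next(key, value), readPairs reuses a single key object and a single value object, so a record has to be copied if it must outlive the current loop iteration.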
Thanks to Lev Khomich's answer, my problem has been solved.
However, that approach has been deprecated for a while, and the new API offers more features and is also easy to use.
See the source code of hadoop.io.SequenceFile for details. With the new API, reading looks like this:
Configuration config = new Configuration();
Path path = new Path("/Users/myuser/sequencefile");
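// Reader is nested in SequenceFile, so qualify it unless SequenceFile.Reader is imported directly
SequenceFile.Reader reader = new SequenceFile.Reader(config, SequenceFile.Reader.file(path));
WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();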
Writable value = (Writable) reader.getValueClass().newInstance();
while (reader.next(key, value)) {
System.out.println(key);
System.out.println(value);
System.out.println("------------------------");
}
reader.close();
As extra info, here is sample output from running this against a data file generated by the Nutch injector:
------------------------
https://wiki.openoffice.org/wiki/Ru/FAQ
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun Apr 13 16:12:59 MDT 2014
Modified time: Wed Dec 31 17:00:00 MST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
------------------------
https://www.bankhapoalim.co.il/
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun Apr 13 16:12:59 MDT 2014
Modified time: Wed Dec 31 17:00:00 MST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
Thanks!