I am trying to run a simple NaiveBayesClassifer using hadoop, getting this error
Exception in thread "main" java.io.IOException: No FileSystem for scheme: file at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:180) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175) at org.apache.mahout.classifier.naivebayes.NaiveBayesModel.materialize(NaiveBayesModel.java:100) Code :
Configuration configuration = new Configuration(); NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), configuration);// error in this line.. modelPath is pointing to NaiveBayes.bin file, and configuration object is printing - Configuration: core-default.xml, core-site.xml
I think its because of jars, any ideas?
This is a typical case of the maven-assembly plugin breaking things.
Different JARs (hadoop-commons for LocalFileSystem, hadoop-hdfs for DistributedFileSystem) each contain a different file called org.apache.hadoop.fs.FileSystem in their META-INFO/services directory. This file lists the canonical classnames of the filesystem implementations they want to declare (This is called a Service Provider Interface implemented via java.util.ServiceLoader, see org.apache.hadoop.FileSystem#loadFileSystems).
When we use maven-assembly-plugin, it merges all our JARs into one, and all META-INFO/services/org.apache.hadoop.fs.FileSystem overwrite each-other. Only one of these files remains (the last one that was added). In this case, the FileSystem list from hadoop-commons overwrites the list from hadoop-hdfs, so DistributedFileSystem was no longer declared.
After loading the Hadoop configuration, but just before doing anything FileSystem-related, we call this:
hadoopConfig.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName() ); hadoopConfig.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName() ); It has been brought to my attention by krookedking that there is a configuration-based way to make the maven-assembly use a merged version of all the FileSystem services declarations, check out his answer below.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With