I have an input folder that contains over 100,000 files.
I would like to do a batch operation on them, e.g. rename all of them in a certain way, or move each one to a new path based on information in its name.
I would like to use Spark to do that, but unfortunately when I tried the following piece of code:
final org.apache.hadoop.fs.FileSystem ghfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI(args[0]), new org.apache.hadoop.conf.Configuration());
org.apache.hadoop.fs.FileStatus[] paths = ghfs.listStatus(new org.apache.hadoop.fs.Path(args[0]));
List<String> pathsList = new ArrayList<>();
for (FileStatus path : paths) {
    pathsList.add(path.getPath().toString());
}

JavaRDD<String> rddPaths = sc.parallelize(pathsList);

rddPaths.foreach(new VoidFunction<String>() {
    @Override
    public void call(String path) throws Exception {
        Path origPath = new Path(path);
        Path newPath = new Path(path.replace("taboola", "customer"));
        ghfs.rename(origPath, newPath);
    }
});
I get an error saying that org.apache.hadoop.fs.FileSystem is not Serializable (and therefore presumably cannot be used in parallel operations).
Any idea how I can work around this, or another way to get it done?
In Spark we can't control the name of the file written to a directory. First write the data to the HDFS directory; then, to change the file's name, use the HDFS API. If you also want to get rid of the success marker in the directory, use fs.delete to remove the _SUCCESS file.
Use fs.rename() by passing source and destination paths to rename a file. It is also a good idea to check that the source exists first with fs.exists(path), as in the sketch below.
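A minimal driver-side sketch of that pattern using the Hadoop FileSystem API. The /data/out directory, the part-00000 file and the report.csv target are hypothetical placeholders, not paths from the question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Hypothetical paths: a part file produced by a Spark job and its desired final name.
Path src = new Path("/data/out/part-00000");
Path dst = new Path("/data/out/report.csv");

// Only rename if the source part file actually exists.
if (fs.exists(src)) {
    fs.rename(src, dst);
}

// Optionally remove the _SUCCESS marker (second argument: non-recursive delete).
fs.delete(new Path("/data/out/_SUCCESS"), false);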
The problem is that you are trying to serialize the ghfs object. If you use mapPartitions and recreate the ghfs object in each partition, you will be able to run your code with just a couple of minor changes.
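A minimal sketch of that idea, assuming the same Spark Java API and the same "taboola" -> "customer" rename as in the question. It uses foreachPartition, the per-partition analogue of the foreach in the question; mapPartitions works the same way if you need to return results. The key point is that the FileSystem is obtained inside the function, on the executors, so nothing non-serializable is captured from the driver:

import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.function.VoidFunction;

rddPaths.foreachPartition(new VoidFunction<Iterator<String>>() {
    @Override
    public void call(Iterator<String> paths) throws Exception {
        // Built on the executor, once per partition, instead of being shipped from the driver.
        Configuration conf = new Configuration();
        while (paths.hasNext()) {
            String path = paths.next();
            Path origPath = new Path(path);
            // getFileSystem resolves the FileSystem for the path's scheme; instances are cached, so this is cheap.
            FileSystem fs = origPath.getFileSystem(conf);
            fs.rename(origPath, new Path(path.replace("taboola", "customer")));
        }
    }
});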