
How to rename huge amount of files in Hadoop/Spark?

I have an input folder that contains over 100,000 files.

I would like to do a batch operation on them, i.e. rename all of them in a certain way, or move them to a new path based on information in each file's name.

I would like to use Spark to do this, but when I try the following piece of code:

    final org.apache.hadoop.fs.FileSystem ghfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI(args[0]), new org.apache.hadoop.conf.Configuration());
    org.apache.hadoop.fs.FileStatus[] paths = ghfs.listStatus(new org.apache.hadoop.fs.Path(args[0]));
    List<String> pathsList = new ArrayList<>();
    for (FileStatus path : paths) {
        pathsList.add(path.getPath().toString());
    }
    JavaRDD<String> rddPaths = sc.parallelize(pathsList);

    rddPaths.foreach(new VoidFunction<String>() {
        @Override
        public void call(String path) throws Exception {
            Path origPath = new Path(path);
            Path newPath = new Path(path.replace("taboola", "customer"));
            ghfs.rename(origPath, newPath);
        }
    });

I get an error saying that hadoop.fs.FileSystem is not Serializable (and therefore presumably cannot be used in parallel operations).

Any idea how I can work around this, or do it another way?

asked Jul 08 '14 by Yaniv Donenfeld


People also ask

How do I rename a lot of files at once?

To rename multiple files from File Explorer, select all the files you wish to rename, then press the F2 key. The name of the last file will become highlighted. Type the new name you wish to give to every file, then press Enter.

What methods will rename a file the fastest?

We feel the F2 keyboard shortcut is the fastest way to rename a bunch of files, whether you are trying to add different names for each of them or change all their names in one go.

How do I rename a file in Spark?

Spark does not let you control the names of the files it writes to a directory. First write the data to an HDFS directory, then use the HDFS API to rename the output files, as sketched below. If you also want to remove the _SUCCESS marker file from the directory, use fs.delete.
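
For example, a minimal sketch of that pattern (the output directory and the target file name here are hypothetical; Spark names its output files part-00000, part-00001, and so on):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical directory that a Spark job has just written to
    Path outputDir = new Path("hdfs:///tmp/output");
    FileSystem fs = FileSystem.get(new Configuration());

    // Give the single part file a friendlier name
    Path partFile = new Path(outputDir, "part-00000");
    if (fs.exists(partFile)) {
        fs.rename(partFile, new Path(outputDir, "result.csv"));
    }

    // Remove the _SUCCESS marker file if it is not wanted
    fs.delete(new Path(outputDir, "_SUCCESS"), false);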

How do I rename a file in Hadoop?

Use fs.rename(), passing the source and destination paths, to rename a file. It is also worth checking that the file exists first with fs.exists(path).
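
A short illustration of that call (both paths are made up for the example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    FileSystem fs = FileSystem.get(new Configuration());
    Path src = new Path("/data/old-name.txt");
    Path dst = new Path("/data/new-name.txt");

    // rename() returns false if the source is missing or the rename fails
    if (fs.exists(src)) {
        boolean renamed = fs.rename(src, dst);
    }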


1 Answer

The problem is that you are trying to serialize the ghfs object. If you use mapPartitions and recreate the FileSystem object in each partition, you will be able to run your code with just a couple of minor changes.
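
A minimal sketch of that change, assuming the same sc, args[0] and rddPaths as in the question and the Spark 1.x Java API (where FlatMapFunction.call returns an Iterable):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;

    final String uri = args[0];

    JavaRDD<Boolean> results = rddPaths.mapPartitions(
            new FlatMapFunction<Iterator<String>, Boolean>() {
                @Override
                public Iterable<Boolean> call(Iterator<String> paths) throws Exception {
                    // Recreate the FileSystem on the executor, so the closure no longer
                    // captures the non-serializable ghfs object from the driver
                    FileSystem fs = FileSystem.get(new java.net.URI(uri), new Configuration());
                    List<Boolean> renamed = new ArrayList<>();
                    while (paths.hasNext()) {
                        String p = paths.next();
                        renamed.add(fs.rename(new Path(p), new Path(p.replace("taboola", "customer"))));
                    }
                    return renamed;
                }
            });

    results.count(); // mapPartitions is lazy; calling an action forces the renames to run

Returning the boolean result of each rename also makes it easy to spot failures afterwards, for example by filtering or counting the false values.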

answered Sep 28 '22 by David