I have an input folder that contains over 100,000 files.
I would like to do a batch operation on them, e.g. rename all of them in a certain way, or move each one to a new path based on information in its name.
I would like to use Spark to do that, but unfortunately when I tried the following piece of code:
final org.apache.hadoop.fs.FileSystem ghfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI(args[0]), new org.apache.hadoop.conf.Configuration());
org.apache.hadoop.fs.FileStatus[] paths = ghfs.listStatus(new org.apache.hadoop.fs.Path(args[0]));
List<String> pathsList = new ArrayList<>();
for (FileStatus path : paths) {
    pathsList.add(path.getPath().toString());
}

JavaRDD<String> rddPaths = sc.parallelize(pathsList);

rddPaths.foreach(new VoidFunction<String>() {
    @Override
    public void call(String path) throws Exception {
        Path origPath = new Path(path);
        Path newPath = new Path(path.replace("taboola", "customer"));
        ghfs.rename(origPath, newPath);
    }
});
I get an error saying that org.apache.hadoop.fs.FileSystem is not Serializable (and therefore presumably cannot be used in parallel operations).
Any idea how I can work around this, or another way to get it done?
In Spark we can't control the name of the file written to a directory. First write the data to the HDFS directory; then, to change the file's name, use the HDFS API. If you also want to get rid of the success marker in the directory, use fs.delete to remove the _SUCCESS file.
Use fs.rename() by passing source and destination paths to rename a file. It is also a good idea to check that the source exists first with fs.exists(path), as in the sketch below.
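A minimal driver-side sketch of that pattern using the Hadoop FileSystem API. The /data/out directory, the part-00000 file and the report.csv target are hypothetical placeholders, not paths from the question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Hypothetical paths: a part file produced by a Spark job and its desired final name.
Path src = new Path("/data/out/part-00000");
Path dst = new Path("/data/out/report.csv");

// Only rename if the source part file actually exists.
if (fs.exists(src)) {
    fs.rename(src, dst);
}

// Optionally remove the _SUCCESS marker (second argument: non-recursive delete).
fs.delete(new Path("/data/out/_SUCCESS"), false);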
The problem is that you are trying to serialize the ghfs object. If you use mapPartitions and recreate the ghfs object in each partition, you will be able to run your code with just a couple of minor changes.
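A minimal sketch of that idea, assuming the same Spark Java API and the same "taboola" -> "customer" rename as in the question. It uses foreachPartition, the per-partition analogue of the foreach in the question; mapPartitions works the same way if you need to return results. The key point is that the FileSystem is obtained inside the function, on the executors, so nothing non-serializable is captured from the driver:

import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.function.VoidFunction;

rddPaths.foreachPartition(new VoidFunction<Iterator<String>>() {
    @Override
    public void call(Iterator<String> paths) throws Exception {
        // Built on the executor, once per partition, instead of being shipped from the driver.
        Configuration conf = new Configuration();
        while (paths.hasNext()) {
            String path = paths.next();
            Path origPath = new Path(path);
            // getFileSystem resolves the FileSystem for the path's scheme; instances are cached, so this is cheap.
            FileSystem fs = origPath.getFileSystem(conf);
            fs.rename(origPath, new Path(path.replace("taboola", "customer")));
        }
    }
});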