I have 300,000+ files in an HDFS data directory.
When I run hadoop fs -ls on that directory, I get an out of memory error saying the GC overhead limit has been exceeded. The cluster nodes have 256 GB of RAM each. How do I fix this?
You can make more memory available to the hdfs command by setting HADOOP_CLIENT_OPTS:
HADOOP_CLIENT_OPTS="-Xmx4g" hdfs dfs -ls /
Found here: http://lecluster.delaurent.com/hdfs-ls-and-out-of-memory-gc-overhead-limit/
This fixed the problem for me; I had over 400k files in one directory and needed to delete most, but not all, of them.
First of all, what are you trying to achieve when you know you have 300,000+ files in one directory? If you want to concatenate them, it is better to arrange them into sub-directories first. Write a Python script to split the files into multiple directories and run through them, as sketched below.
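In case it helps, here is a rough sketch of that approach, assuming the hdfs CLI is on the PATH; the source directory, destination prefix, and batch sizes are placeholders you would adjust. It lists the directory with a larger client heap (per the answer above) and then moves the files into numbered sub-directories with hdfs dfs -mv:

#!/usr/bin/env python3
# Hypothetical sketch: move files from one large HDFS directory into
# numbered sub-directories of at most BATCH_SIZE files each, by shelling
# out to the hdfs CLI. All paths and sizes below are assumptions.
import os
import subprocess

SRC_DIR = "/data/big_dir"             # assumed source directory
DEST_PREFIX = "/data/big_dir_split"   # assumed destination prefix
BATCH_SIZE = 10000                    # files per sub-directory
MV_CHUNK = 500                        # paths per hdfs dfs -mv call, to keep the command line short

def list_files(path):
    # Give the client JVM extra heap, as in the HADOOP_CLIENT_OPTS answer,
    # so listing a directory with hundreds of thousands of entries does not OOM.
    env = dict(os.environ, HADOOP_CLIENT_OPTS="-Xmx4g")
    out = subprocess.run(
        ["hdfs", "dfs", "-ls", path],
        check=True, capture_output=True, text=True, env=env,
    ).stdout
    paths = []
    for line in out.splitlines():
        parts = line.split()
        # Skip the "Found N items" header; the last column of each entry is the path.
        if len(parts) >= 8 and not line.startswith("Found"):
            paths.append(parts[-1])
    return paths

def main():
    files = list_files(SRC_DIR)
    for i in range(0, len(files), BATCH_SIZE):
        batch = files[i:i + BATCH_SIZE]
        dest = f"{DEST_PREFIX}/part_{i // BATCH_SIZE:04d}"
        subprocess.run(["hdfs", "dfs", "-mkdir", "-p", dest], check=True)
        # Move several paths per invocation to limit JVM start-up overhead.
        for j in range(0, len(batch), MV_CHUNK):
            subprocess.run(["hdfs", "dfs", "-mv", *batch[j:j + MV_CHUNK], dest], check=True)

if __name__ == "__main__":
    main()

Once the files are spread across sub-directories, you can list, concatenate, or delete them one sub-directory at a time without hitting the client-side GC limit.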