Optimizing s3distcp by maximizing cluster memory allocation

Context: I’m running s3distcp, and the container only uses a small portion of the cluster’s memory. I can see that exact amount specified as mapreduce.reduce.memory.mb. But manually tweaking that explicit number in a cluster or job config feels like the wrong approach, as if I’m overlooking a “use everything” switch.
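
For reference, here’s roughly what that manual tweak looks like on my end right now (just a sketch with placeholder bucket names and memory sizes, assuming the standard s3-dist-cp CLI on EMR):

```
# Sketch only: bucket names and memory sizes are placeholders.
# Overriding the reducer container size for a single s3-dist-cp run via -D
# is the "explicit number" tweak described above.
s3-dist-cp \
  -D mapreduce.reduce.memory.mb=8192 \
  -D mapreduce.reduce.java.opts=-Xmx6g \
  --src s3://my-source-bucket/input/ \
  --dest s3://my-dest-bucket/output/
```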

EMR doesn’t really provide anything beyond auto-installing Bigtop Hadoop. It sounds like it’s s3distcp itself that’s limiting you; the memory-per-mapper/reducer configs shouldn’t matter. Not sure exactly how that works, but if the job decides it only needs one mapper or reducer, then it doesn’t matter how big the cluster is. There might be some settings on s3distcp itself you can play with.

"-D mapreduce.job.reduces=${no_of_reduce }" is probably something you could increase if theres lots of files

Thanks for the response :slightly_smiling_face:

there’s a maximizeResourceAllocation config for Spark

but not sure there’s a corresponding one for hadoop MR jobs
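for reference it’s just an EMR classification on the Spark config, set at cluster creation, something like this (a sketch; the release label and instance settings are placeholders, and as far as I know there’s no MR equivalent):

```
# Sketch only: release label, instance type, and count are placeholders.
# Write the Spark classification to a file and pass it at cluster creation.
cat > maximize-spark.json <<'EOF'
[
  {
    "Classification": "spark",
    "Properties": { "maximizeResourceAllocation": "true" }
  }
]
EOF

aws emr create-cluster \
  --release-label emr-6.15.0 \
  --applications Name=Hadoop Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --configurations file://maximize-spark.json \
  --use-default-roles
```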

Oh interesting, didn’t know that one (have only played a bit with Spark ETL jobs through Glue so far).