Tuesday, February 23, 2016

S3-dist-cp and recursive subdirectory groupings

When working with AWS (specifically Hadoop on AWS EMR), you can use S3DistCp to concatenate files together with the --groupBy option.  What is really cool is that this works even on already-compressed (gzip) files!
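Concatenating compressed files works because the gzip format allows multiple compressed members back to back in a single file, and gunzip decompresses them in sequence. A quick local demonstration (hypothetical content, no EMR needed):

```shell
# Two separately gzipped pieces, appended into one .gz file
printf 'line1\n' | gzip >  combined.gz
printf 'line2\n' | gzip >> combined.gz

# gunzip reads both members and emits the full concatenated content
gunzip -c combined.gz
```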

However, S3DistCp does not natively support recursive sub-directories, so you need to stage the files first. For staging, we will use DistCp, the tool S3DistCp originated from, because it has some useful features that s3-dist-cp lacks.

Using AWS EMR, you can create a Custom JAR step and either use /usr/lib/hadoop/hadoop-distcp.jar or upload your own version of hadoop-distcp.jar to S3 and reference that. For the step arguments, copy the contents with -update to a destination staging area, where the individual files land in a flattened directory structure. In this example, I'll filter to just .csv.gz files.

  -update s3://test/raw/**/*.csv.gz s3://test/staging
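If you prefer scripting over the console, the staging step can also be attached with the AWS CLI. This is only a sketch: the cluster id and bucket names are hypothetical placeholders.

```shell
# Sketch: attach the DistCp staging step to a running EMR cluster.
# j-XXXXXXXXXXXXX and the s3://test/... paths are placeholders.
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  'Type=CUSTOM_JAR,Name=StageRawFiles,Jar=/usr/lib/hadoop/hadoop-distcp.jar,Args=[-update,s3://test/raw/**/*.csv.gz,s3://test/staging]'
```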

After that, you can use command-runner.jar to run s3-dist-cp and concatenate into any grouping defined by a regular expression.  The example below groups on a 4-digit sequence in the filenames (years, for example), so that all the daily/monthly files are combined into a single file per year.  The --outputCodec gzip option ensures that the resulting file is also compressed.

s3-dist-cp --src=s3://test/staging --dest=s3://test/grouped/ --groupBy '.*([0-9][0-9][0-9][0-9]).*' --outputCodec gzip
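The capture group in the --groupBy regex is what names the output buckets: every file whose captured text is equal gets concatenated into one output file. A small simulation of that bucketing with sed (hypothetical filenames):

```shell
# Simulate how the --groupBy capture group buckets filenames by year:
# files sharing a captured 4-digit group end up in the same output file.
for f in sales_2014_01.csv.gz sales_2014_02.csv.gz sales_2015_01.csv.gz; do
  year=$(echo "$f" | sed -E 's/.*([0-9][0-9][0-9][0-9]).*/\1/')
  echo "$f -> $year.csv.gz"
done
```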

If you get errors like "ERROR: Skipping key XYZ because it ends with '/'", it is usually because either there are no source files, or the regex in your groupBy is not quite right and matches no files.
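Before re-running the whole step, you can sanity-check the regex against your staged filenames with grep. The filenames below are hypothetical; in practice you would feed it the listing from `aws s3 ls s3://test/staging/`.

```shell
# Check which filenames the groupBy pattern would actually match;
# anything not printed here will be skipped by s3-dist-cp.
printf '%s\n' daily_2016.csv.gz notes.txt monthly_2015.csv.gz \
  | grep -E '[0-9][0-9][0-9][0-9]'
```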

1 comment:

Arwin Tio said...

Hi Darren,

How do you figure that S3DistCp doesn't support recursive subdirectories? Why not just stage the files onto HDFS rather than S3?