Using custom .NET activities in Azure Data Factory
September 22, 2015
Azure Data Factory provides a great number of data processing activities out of the box (for example running Hive or Pig scripts on Hadoop / HDInsight).
In many case though, you just need to run an activity that you already have built or know how to build in .NET. So, how would you go about that? Would you need to convert all those items to Hive scripts?
Actually, no. Enter Custom .NET activities. Using this you can run a .NET library on Azure Batch or HDInsight (whatever you like) and make it part of your Data Factory pipeline. Regardless of whether you use Batch or HDInsight you can just run your .NET code on it. I prefer using Batch since it provides more auto-scaling options, is cheaper and makes more sense to me in general; I mean, why run .NET code on a HDInsight service that runs Hive and Pig? It feels weird. However, if you already have HDInsight running and prefer to minimize the number of components to manage, choosing HDInsight might make more sense than using Batch.
Switching from Batch to HDInsight means to changing the LinkedServiceName for the activity to point to your HDInsight or HDInsight on demand cluster.
Tables are passed to the .NET activity using a connection string, so essentially if you have both input and output tables defined as blob storage items, your custom assembly will get a connection string to the blob storage items, read the input files, do its processing and write the output files before passing on the control to ADF.
Using this framework the sky is the limit: anything you can run in .NET can now be part of your ADF processing pipeline…pretty cool!