Category Archives

5 Articles

Implementing incremental data load using Azure Data Factory

Azure Data Factory is a fully managed data processing solution offered in Azure. It connects to many sources, both in the cloud as well as on-premises. One of the basic tasks it can do is copying data over from one source to another – for example from a table in Azure Table Storage to an Azure SQL Database table. To get the best performance and avoid unwanted duplicates in the target table, we need to include incremental data load or delta’s. Also, we can build mechanisms to further avoid unwanted duplicates when a data pipeline is restarted.

In this post I will explain how to cover both scenario’s using a pipeline that takes data from Azure Table Storage, copies it over into Azure SQL and finally brings a subset of the columns over to another Azure SQL table. The result looks like this:

The components are as follows:

  • MyAzureTable: the source table in Azure Table Storage
  • CopyFromAzureTableToSQL: the pipeline copying data over into the first SQL table
  • Orders: the first SQL Azure database table
  • CopyFromAzureSQLOrdersToAzureSQLOrders2: the pipeline copying data from the first SQL table to the second – leaving behind certain columns
  • Orders2: the second and last SQL Azure database table

Setting up the basics is relatively easy. The devil is in the details, however.

  1. The linked services

Every data pipeline in Azure Data Factory begins with setting up linked services. In this case, we need two; one to the Azure Table storage and one to SQL Azure. The definition of the linked service to Azure Table Storage is as follows:

The SQL Azure linked service definition looks like this:

Note the name property – we will need to refer to it later.

  1. The datasets

Datasets define tables or queries that return data that we will process in the pipeline. The first dataset we need to define is the source dataset (called MyAzureTable). The definition is as follows:

Note that, again, this item has a name. We will use it in the pipeline later. Also, you will need to specify the name of your Azure Table in the “tablename” property. Note that the “LinkedServiceName” property is set to the name of the linked service we definied earlier. This way, Azure Data Factory knows where to find the table. Also, the “availability” property specifies the slices Azure Data Factory uses to process the data. This defines how long ADF waits before processing the data as it waits for the specified time to pass before processing. The settings above specify hourly slices, which means that data will be processed every hour. We will later set up the pipeline in such a way that ADF will just process the data that was added or changed in that hour, not all data available (as is the default behavior). Minimum slice size currently is 15 minutes. Also note that the dataset is specified as being external (“external”:true). This means that ADF will not try to coördinate tasks for this table as assumes the data will be written from somewhere outside ADF (your application for example) and will be ready for pickup when the slice size is passed.

The target dataset in SQL Azure follows the same definition:

Important to note is that we defined the structure explicitly – it is not required for the working of the first pipeline, but it is for the second, which will use this same table as source. Also note that presence of the column ‘ColumnForADuseOnly’ in the table. This column is later used by ADF to make sure data that is already processed is not again appended to the target table. Of course, the SQL table itself will need to have (at least) the same columns and matching data types:

  1. The first pipeline (from Azure Table to SQL)

The first pipeline takes the order data in the Azure table and copies it into the Orders table in SQL Azure. It does that incrementally and with repeatability – which means that a) each slice will only process a specific subset of the data and b) if a slice is restarted the same data will not be copied over twice. This results in a fast processing engine without duplication in the target table – data is copied over once, regardless of the number of restarts. Note that by default ADF copies all data over to the target so you would get so many rows in the table as there are orders in the Azure Table times the number of slices that ran (each slice bringing over the full Azure table). The definition is as follows:

Note that the pipeline consists of a single activity, which is a Copy activity. I could have specified another activity in the same pipeline – I have not done so for simplicity. The Copy activity takes as input the Azure Table (MyAzureTable) and outputs into the SQL Azure Table “Orders”. The source Query is very important – as this is used to select just the data we want! We use the column ‘OrderTimestamp’ which and select only the orders from MyAzureTable where the OrderTimestamp is greater than or equal to the starting time of the slice and less than the end time of the slice. A sample query against the Azure Table executed in this way looks like this:

OrderTimestamp ge datetime’2017-03-20T13:00:00Z’ and OrderTimestamp lt datetime’2017-03-20T15:00:00Z’

Also, look at the specification of the “sliceIdentifierColumnName” property on the target (sink) – this column is in the target SQL Azure table and is used by ADF to keep track of what data is already copied over so if the slice is restarted the same data is not copied over twice.

This pipeline will run each hour (“scheduler” properties), starting at 09:00:00 local clock (“specified by the “start” property) and can run 10 slices in parallel (specified by the “concurrency” property).

  1. The second pipeline (from SQL to SQL)

The second pipeline is there to prove the mapping of specific columns to others as well as showing how to do an incremental load from SQL Azure to another target. Note that I use the same linked service so this exercise is not really useful – the same effect could be retrieved by creating a view. The definition is as follows:

Note that we specify a “sqlReaderQuery” this time which selects the right subset of data for the slice. We use WindowStart and WindowEnd this time instead of SliceStart and SliceEnd earlier. At this point is does not matter as ADF requires both to be the same. WindowStart and WindowEnd refer to the pipeline start and end times, while SliceStart and SliceEnd refer to the slice start and end times. Using the “translator” properties we specify which columns to map – note that we copy over SalesAmount and OrderTimestamp exclusively.

There you have it – a fully incremental, repeatable data pipeline in Azure Data Factory, thanks to setting up a smart source query and using the “sliceIdentifierColumnName” property. The full source code is available on Github. More info on how this works is available in the official documentation.

Questions? Remarks? Let me know!

Power BI Refresh scenarios against data storage solutions explained

A recurring theme with customers and partners is Power BI data refresh. More specifically, there is some confusion on what refresh scenario requires a Pro version and what can be done with the Free version of Power BI. I made the diagram below to help explain this. It shows the refresh scenarios against data storage solutions, such as SQL Azure, SQL in a virtual machine in Azure or SQL server on premises. I used these as examples, there are other options as well. I think the overall time carries over to other data storage solutions. The diagram shows the refresh that can be done using a Power BI Free account as orange and the refresh scenarios that need Power BI Pro as green lines. As shown in the diagram, if  you want to refresh against on-premises sources or a database running in a VM in Azure you will need a gateway and Power BI Pro. This applies not only to the creator of the report and schedule but also to every consumer. If you use PAAS solutions for data storage in Azure such as SQL Azure, it becomes a bit more difficult and it is really dependent on the type of refresh required. If you need a refresh cycle higher than once a day (either max 8 times per 24 hours or live) you will need Power BI Pro. If you just want to refresh against such as SQL Azure and once a day is enough you can do that using Power BI Free. Again, the license requirement carries over from author to viewer; if the author of the report requires Pro, then the viewers also need Pro.

Power BI Refresh scenarios against data storage solutions

Hope this helps. If you have any questions or feedback, please comment below!

Azure SQL Database almost on par with the on-premise version

Hi blogreaders J

I started the blog with the title: “Azure SQL Database almost on par with the on-premise version”.

And looking from a T-SQL perspective it is.
Our latest version of Azure SQL Database (let’s call it SQL Azure for now) is for 95% compatible with SQL 2014.
So not a bad title!

But there is more to it…

There are two perspectives to take in account here:

  1. Microsoft is not only working on T-SQL compatibility, but is working on al fronts to enhance functionality.
    So actually… the engine powering SQL Azure is already beyond SQL 2014 and more towards SQL 2016.
  2. SQL Azure is a PAAS offering in Azure.
    So… you don’t only get the database, but also services around it in maintenance like:
    1. Automatic replication of your date 3 times (and 3 more times if geo-redundant)
    2. Point in time restore up to 14 days
    3. Patching, security updates

Thinking of SQL Azure of being on par with SQL 2014 is kind of old skool thinking.
Thinking of SQL Azure should be about thinking about using a relational database as a service.
And those services will enhance in time.

So no more thinking about versions, but thinking about functions.
This changes the paradigm of developing big time!
You will have to develop is a much more pure way.
Keeping your application architecture clean and crisp.

For example: No more business logic in database triggers, but logic in webservices and the call’s to the database in as plain as ANSI-SQL as possible.
Code examples can be found at: https://msdn.microsoft.com/en-us/library/azure/ee621787.aspx
And to keep up to speed on the latest developments of SQL Azure keep checking the blog: http://azure.microsoft.com/blog/

Happy programming!

Link to a one-slide Powerpoint to show the latest services in Azure

Use below link to download a very usefull slide which shows you all IAAS and PAAS services within Azure:

Overall services in Azure

Have fun,

Harry

Platform as a Service (PAAS) moving a rocket speed!

As an Enterprise Architect in an organization life has always been dynamic to say the least! It is your responsibility to keep up with the latest developments in ICT both in technique as in architecture. In the old days of on-premise only that was a big challenge. But with the Cloud as a integral part of your information systems it became even more complex.

But still… The Cloud was moving vm’s to Amazon or Microsoft. So architecturally not that complex. Identity & access off course, but that’s about it. Then came Platform as a service (PAAS). That was something completely different! Not moving vm’s to the Cloud, but move complete technical workloads to the Cloud like an ESB in the Cloud, Media Services, Federated identity, Storage, etc, etc..

This does impact your architecture!

A blazing 78 new PAAS services were introduced in 2014 within Azure. So it’s moving rocket fast! And to be fair: not only at Microsoft, also are other Cloud vendors moving into the PAAS area with new services.

What is the impact for you as Enterprise Architect?

In your normal day to day work you make choices based on software you can purchase and implement at your data center. But now you should at least ask yourself for every choice you have to make: Do I want to do this myself or shall I take this as a service from one of the Cloud vendors.

An example: Your organization wants to use Cloud services from multiple Cloud vendors but you want a single sign on experience for your users. Now you can buy a federated identity server, do research on all Cloud vendors on how to connect and then build the connections. But you can also use The Windows Azure Active Directory Federation Service (ADFS) from Microsoft with over 2600 Cloud vendors already pre-installed.

Second example: You have a new web application that you need to deploy. Again you can buy a few servers, install IIS, SQL Server, the application and install everything and schedule things like backup, patch management, storage, etc., etc. But you can also take a web-role to host the web-application, Azure SQL database to host you data and let Microsoft worry about backup’s, 3 replica’s for DR, patching the server, etc.., etc.

So my message to all you Enterprise Architects out there: Examine carefully the PAAS offerings from the Cloud vendors before making expensive buy decisions. My recommendations to checkout:

Azure Service Bus, Azure Machine Learning, WAAS, BizTalk Services and Azure SQL Database. Next blog-post I will dig deeper on Azure SQL Database.

%d bloggers like this: