
Implementing incremental data load using Azure Data Factory

Azure Data Factory is a fully managed data processing solution offered in Azure. It connects to many sources, both in the cloud and on-premises. One of the basic tasks it can do is copy data from one source to another – for example, from a table in Azure Table Storage to an Azure SQL Database table. To get the best performance and avoid unwanted duplicates in the target table, we need an incremental data load (deltas). We can also build in mechanisms to avoid unwanted duplicates when a data pipeline is restarted.

In this post I will explain how to cover both scenarios using a pipeline that takes data from Azure Table Storage, copies it into Azure SQL and finally brings a subset of the columns over to another Azure SQL table. The result looks like this:

The components are as follows:

  • MyAzureTable: the source table in Azure Table Storage
  • CopyFromAzureTableToSQL: the pipeline copying data over into the first SQL table
  • Orders: the first SQL Azure database table
  • CopyFromAzureSQLOrdersToAzureSQLOrders2: the pipeline copying data from the first SQL table to the second – leaving behind certain columns
  • Orders2: the second and last SQL Azure database table

Setting up the basics is relatively easy. The devil is in the details, however.

  1. The linked services

Every data pipeline in Azure Data Factory begins with setting up linked services. In this case, we need two; one to the Azure Table storage and one to SQL Azure. The definition of the linked service to Azure Table Storage is as follows:
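A minimal sketch of what such an ADF (v1) linked service definition typically looks like – the service name and the connection string placeholders are illustrative, not taken from the original setup:

```json
{
    "name": "StorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        }
    }
}
```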

The SQL Azure linked service definition looks like this:
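A comparable sketch for the SQL Azure side, again with an assumed name and placeholder connection string:

```json
{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<database>;User ID=<user>@<server>;Password=<password>;Encrypt=True"
        }
    }
}
```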

Note the name property – we will need to refer to it later.

  2. The datasets

Datasets define tables or queries that return data that we will process in the pipeline. The first dataset we need to define is the source dataset (called MyAzureTable). The definition is as follows:
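A sketch of such an ADF (v1) dataset definition – the linked service name is an assumed placeholder:

```json
{
    "name": "MyAzureTable",
    "properties": {
        "type": "AzureTable",
        "linkedServiceName": "StorageLinkedService",
        "typeProperties": {
            "tableName": "<your Azure Table name>"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
```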

Note that, again, this item has a name; we will use it in the pipeline later. You will need to specify the name of your Azure Table in the "tableName" property. The "linkedServiceName" property is set to the name of the linked service we defined earlier, so Azure Data Factory knows where to find the table. The "availability" property specifies the slices Azure Data Factory uses to process the data: ADF waits for the specified period to pass before processing the data in that slice. The settings above specify hourly slices, which means that data will be processed every hour. We will later set up the pipeline in such a way that ADF processes only the data that was added or changed in that hour, not all available data (as is the default behavior). The minimum slice size is currently 15 minutes. Also note that the dataset is specified as being external ("external": true). This means that ADF will not try to coordinate tasks for this table; it assumes the data is written by something outside ADF (your application, for example) and will be ready for pickup when the slice period has passed.

The target dataset in SQL Azure follows the same definition:
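A sketch of what the target dataset could look like – the OrderId column and the linked service name are assumptions; SalesAmount, OrderTimestamp and ColumnForADuseOnly come from the description below:

```json
{
    "name": "Orders",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": "AzureSqlLinkedService",
        "structure": [
            { "name": "OrderId", "type": "String" },
            { "name": "SalesAmount", "type": "Decimal" },
            { "name": "OrderTimestamp", "type": "Datetime" },
            { "name": "ColumnForADuseOnly", "type": "String" }
        ],
        "typeProperties": {
            "tableName": "Orders"
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
```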

Important to note is that we defined the structure explicitly. It is not required for the first pipeline to work, but it is for the second, which will use this same table as its source. Also note the presence of the column 'ColumnForADuseOnly' in the table. This column is later used by ADF to make sure data that has already been processed is not appended to the target table again. Of course, the SQL table itself will need to have (at least) the same columns, with matching data types:
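For example, a matching table could be created like this (the OrderId column and the exact types and lengths are assumptions, not from the original schema):

```sql
CREATE TABLE [dbo].[Orders] (
    [OrderId]            NVARCHAR (50)   NULL,
    [SalesAmount]        DECIMAL (18, 2) NULL,
    [OrderTimestamp]     DATETIME        NULL,
    [ColumnForADuseOnly] NVARCHAR (200)  NULL  -- written by ADF via sliceIdentifierColumnName
);
```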

  3. The first pipeline (from Azure Table to SQL)

The first pipeline takes the order data in the Azure table and copies it into the Orders table in SQL Azure. It does that incrementally and with repeatability, which means that a) each slice will only process a specific subset of the data and b) if a slice is restarted, the same data will not be copied over twice. This results in fast processing without duplication in the target table: data is copied over once, regardless of the number of restarts. Note that by default ADF copies all data over to the target, so you would end up with as many rows in the table as there are orders in the Azure Table times the number of slices that ran (each slice bringing over the full Azure table). The definition is as follows:
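A sketch of such an ADF (v1) pipeline definition; the activity name, start/end dates and exact query formatting are illustrative:

```json
{
    "name": "CopyFromAzureTableToSQL",
    "properties": {
        "activities": [
            {
                "name": "CopyOrders",
                "type": "Copy",
                "inputs": [ { "name": "MyAzureTable" } ],
                "outputs": [ { "name": "Orders" } ],
                "typeProperties": {
                    "source": {
                        "type": "AzureTableSource",
                        "azureTableSourceQuery": "$$Text.Format('OrderTimestamp ge datetime\\'{0:yyyy-MM-ddTHH:mm:ssZ}\\' and OrderTimestamp lt datetime\\'{1:yyyy-MM-ddTHH:mm:ssZ}\\'', SliceStart, SliceEnd)"
                    },
                    "sink": {
                        "type": "SqlSink",
                        "sliceIdentifierColumnName": "ColumnForADuseOnly"
                    }
                },
                "scheduler": { "frequency": "Hour", "interval": 1 },
                "policy": { "concurrency": 10 }
            }
        ],
        "start": "2017-03-20T09:00:00",
        "end": "2017-03-31T00:00:00"
    }
}
```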

Note that the pipeline consists of a single activity, a Copy activity. I could have specified more activities in the same pipeline; I have not done so for simplicity. The Copy activity takes the Azure Table (MyAzureTable) as input and outputs into the SQL Azure table "Orders". The source query is very important, as it is used to select just the data we want: we use the column 'OrderTimestamp' and select only the orders from MyAzureTable where the OrderTimestamp is greater than or equal to the start time of the slice and less than the end time of the slice. A sample query against the Azure Table executed in this way looks like this:

OrderTimestamp ge datetime'2017-03-20T13:00:00Z' and OrderTimestamp lt datetime'2017-03-20T15:00:00Z'

Also, look at the "sliceIdentifierColumnName" property on the target (sink): this column in the target SQL Azure table is used by ADF to keep track of what data has already been copied over, so that if the slice is restarted the same data is not copied over twice.

This pipeline will run each hour (specified by the "scheduler" properties), starting at 09:00:00 local time (specified by the "start" property), and can run 10 slices in parallel (specified by the "concurrency" property).

  4. The second pipeline (from SQL to SQL)

The second pipeline is there to demonstrate mapping specific columns to others, as well as to show how to do an incremental load from SQL Azure to another target. Note that I use the same linked service, so this exercise is not really useful in itself; the same effect could be achieved by creating a view. The definition is as follows:
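A sketch of the second pipeline (ADF v1) – the Orders2 dataset is assumed to be defined analogously to Orders, and the activity name is illustrative:

```json
{
    "name": "CopyFromAzureSQLOrdersToAzureSQLOrders2",
    "properties": {
        "activities": [
            {
                "name": "CopyOrdersToOrders2",
                "type": "Copy",
                "inputs": [ { "name": "Orders" } ],
                "outputs": [ { "name": "Orders2" } ],
                "typeProperties": {
                    "source": {
                        "type": "SqlSource",
                        "sqlReaderQuery": "$$Text.Format('SELECT SalesAmount, OrderTimestamp FROM Orders WHERE OrderTimestamp >= \\'{0:yyyy-MM-dd HH:mm}\\' AND OrderTimestamp < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
                    },
                    "sink": { "type": "SqlSink" },
                    "translator": {
                        "type": "TabularTranslator",
                        "columnMappings": "SalesAmount: SalesAmount, OrderTimestamp: OrderTimestamp"
                    }
                },
                "scheduler": { "frequency": "Hour", "interval": 1 }
            }
        ]
    }
}
```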

Note that this time we specify a "sqlReaderQuery", which selects the right subset of data for the slice. We use WindowStart and WindowEnd here instead of the SliceStart and SliceEnd used earlier. At this point it does not matter, as ADF requires both to be the same: WindowStart and WindowEnd refer to the pipeline start and end times, while SliceStart and SliceEnd refer to the slice start and end times. Using the "translator" properties we specify which columns to map; note that we copy over only SalesAmount and OrderTimestamp.

There you have it – a fully incremental, repeatable data pipeline in Azure Data Factory, thanks to setting up a smart source query and using the "sliceIdentifierColumnName" property. The full source code is available on GitHub. More info on how this works is available in the official documentation.

Questions? Remarks? Let me know!

Dutch elections analyzed in Power BI

Just in time for the elections in the Netherlands: I made an analysis in Power BI of the results of the elections since 1918. It is published (in Dutch) on the official Microsoft Netherlands blog: https://blogs.microsoft.nl/technologieenmaatschappij/tweede-kamerverkiezingen-technologie-brengt-uitslagen-tot-leven/.

Enjoy!

Annual radio countdown Top 2000 in Power BI


Every year a list of 2000 songs is voted on by people from the Netherlands. This list is then broadcast by Radio 2 in the last days of the year. I decided to do a Power BI analysis on it and included all the lists since the start (1999). I also included an R-based forecast. If you want some insight into the musical taste of the Netherlands, go ahead and have fun!

The Top 2000. A phenomenon. A tradition. One of those "it belongs to this time of year" feelings. At least it is for me. It is, in any case, the best reflection of the musical taste of the Netherlands that I know of. But what good is it if you cannot compare it with last year? 2000 songs is quite a lot to remember. And who was the biggest climber last year again? What is the oldest song in the list by your favorite artist? Which song took the longest to enter the list? Is there any truth to the claim that the list responds to artists passing away? Now you can find the answer to these and many other questions!

I have included every Top 2000 since the very beginning (1999) in this Power BI report, as a Christmas present for you, dear reader. You can find the analysis below. I created three views: the first focuses on artists, so you can learn more about how your favorite artist is doing in the list. The second gives insight into everything about one "edition" of the Top 2000. As a bonus, there is also a forecast view (third page), where you can see how the positions of artists and songs are expected to develop based on their history. Best of all: everything is interactive.

So we can finally answer the question of whether the list responds to artists passing away. The answer is yes: just look at Michael Jackson (died in 2009, with his best ranking that year: #27) or Amy Winehouse (died in 2011, with her best ranking that year: #72).

Make the most of it. I, for one, think it is a beautiful reflection of the zeitgeist. I wish you all a merry Christmas!

Passing filters to Power BI reports using the URL / query string

It was only recently that I discovered this: you can pass filters to Power BI reports by deep linking to them and adding a filter in the URL (also called the query string). I am not even sure this is an intended feature, since it only seems to work for 'equals' and does not work on dashboards in Power BI.

First of all, we need to understand deep linking in Power BI. Most artifacts in Power BI (for example, reports and dashboards) have their own URL. This means we can open them directly when we know the URL. The URL has the following format:

https://[tenant].powerbi.com/groups/[workspace]/[artifact type]/[artifact id]/[optional part to be opened]

Thus, the URL may look like this:

https://app.powerbi.com/groups/me/reports/dc121d4b-9aad-4494-b1de-8037f53d8355/ReportSection3

This would open page ReportSection3 in the report with id dc121d4b-9aad-4494-b1de-8037f53d8355 in my personal workspace in the 'app' tenant (app.powerbi.com).

If you know anything about web development, you know that you can pass things through the URL to the application. This is normally done by adding a question mark to the end of the URL and specifying whatever you want to pass. Multiple items can be added by separating them with ampersands, like this:

https://myurl.com/?firstname=Jeroen&Lastname=ter%20Heerdt

(notice the %20, which is a URL encoded space).

Combining these two things, you can pass parameters to Power BI reports if you have the report URL (simply open the report to get that). Once you have it, add the following:

?filter=[tablename]/[columnname] eq '[filter value]'

So, if I wanted to filter the activity column in my workitem table so it only shows items where the activity is blogging, I would add the following:

?filter=workitem/activity eq 'blogging'

(eq is short for equals here).

This needs, however, to be encoded for the URL. You can easily find tools online to do that for you, or if you know a little bit about what you are doing, you can do it by yourself. The addition above becomes:

?filter=workitem%252Factivity%20eq%20%27blogging%27

(/ becomes %252F, space becomes %20 and ' becomes %27)
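If you want to automate this encoding, here is a small sketch in Python; the helper name is mine, not part of any Power BI API, and the double-encoded slash (%252F) follows the observation above:

```python
from urllib.parse import quote

def powerbi_filter(table: str, column: str, value: str) -> str:
    """Build a Power BI URL filter query string (hypothetical helper).

    The slash between table and column is double-encoded (%252F), as
    noted above; the space and quotes are percent-encoded once."""
    expr = quote(f" eq '{value}'")  # space -> %20, ' -> %27
    return f"?filter={table}%252F{column}{expr}"

print(powerbi_filter("workitem", "activity", "blogging"))
# ?filter=workitem%252Factivity%20eq%20%27blogging%27
```

Append the result to the report URL to deep link with the filter applied.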

This would open page ReportSection3 in the report with id dc121d4b-9aad-4494-b1de-8037f53d8355 in my personal workspace in the 'app' tenant, with a filter on the workitem table's activity column set to equal blogging:

https://app.powerbi.com/groups/me/reports/dc121d4b-9aad-4494-b1de-8037f53d8355/ReportSection3?filter=workitem%252Factivity%20eq%20%27blogging%27

By the way, you would probably only want to use this in very specific scenarios. It is far better to look at Power BI Embedded to integrate Power BI into your applications. Note that Power BI Embedded is targeted at organizations providing software to others (hosted or not); it is not for internal applications.

Power BI Custom Visual Development Tip: VisualPlugin.ts: Property ‘Visual’ does not exist error

So here you are, creating your very own Power BI custom visual. You have read the documentation and run the tutorial (https://github.com/Microsoft/PowerBI-visuals/blob/master/Readme.md and https://github.com/Microsoft/PowerBI-visuals-sampleBarChart). You feel proud because you are done creating your awesome looking 3d-piechart-barchart-mashup visual. Then it happens. You run pbiviz start to view your visual and….BAM:

Ouch. Now, before you start banging your head against the wall until it hurts, here is the solution:

Most probably you have (as good practice dictates) changed the class name of your visual from the default ‘Visual’ to something more interesting, such as MyAwesome3DpieChartBarChartMashupTheDutchDataDudeIsSoGoingToHateThisVisual.

Well, you forgot to change the visualClassName specified in pbiviz.json, so the code can actually find the entry point for your awesome visual. Quick fix: open pbiviz.json and change the visualClassName property to your class name (which is hopefully nothing like the one above). Save the file, re-run pbiviz start, and done!

(I know this is a very newbie / getting started type of error, but it took me more than 5 minutes of searching when I first encountered it. I figured it is worthwhile saving everyone's time and logging it for my own future reference ;))

First look: Esri ArcGIS Maps in Power BI

Esri is a leader in the GIS industry and ArcGIS is a very popular product to build great maps. Now, you can use ArcGIS maps in Power BI (in preview). See the official information here: https://powerbi.microsoft.com/en-us/blog/announcing-arcgis-maps-for-power-bi-by-esri-preview/. This is really cool, I know a lot of you have been asking for this for a long time!

You will find the option to enable this preview in PowerBI.com, not in the Power BI Desktop. Log in to PowerBI and open the settings. You can find the ArcGIS preview there and enable it by simply selecting the checkbox:

With that enabled, create a report with some geographical information (or edit an existing one). I used the Google Analytics data that keeps track of my blog. Google Analytics data can be loaded into Power BI simply by using the content pack. In edit mode in the report you will find the ArcGIS component in the Visualizations list:

Click it and create your map as you would with the normal map. I noticed it needs some time to build the map (probably due to the preview) but once it is done it is fully interactive with the other items on your report as you would expect:

You can change a lot of the ArcGIS options, such as switching out maps, changing symbol styles, adding reference layers, etc.

I love this – the awesome power of ArcGIS and Power BI combined! I cannot wait to see what you will create with this.

Power BI learning resources – follow up 3

Another great resource for learning about Power BI is the course on edX: Analyzing and Visualizing Data with Power BI. Granted, it has been around for a while, but I forgot to blog about it; maybe it is a bit easier to find now.

Enjoy the new skills you will learn with this!


Power BI Refresh scenarios against data storage solutions explained

A recurring theme with customers and partners is Power BI data refresh. More specifically, there is some confusion about which refresh scenarios require a Pro license and which can be done with the Free version of Power BI. I made the diagram below to help explain this. It shows refresh scenarios against data storage solutions, such as SQL Azure, SQL Server in an Azure virtual machine, and SQL Server on premises. I used these as examples; there are other options as well, but I think the overall picture carries over to other data storage solutions. The diagram shows the refresh scenarios possible with a Power BI Free account in orange, and the scenarios that need Power BI Pro as green lines. As shown in the diagram, if you want to refresh against on-premises sources or a database running in a VM in Azure, you will need a gateway and Power BI Pro. This applies not only to the creator of the report and refresh schedule, but also to every consumer. If you use PaaS solutions for data storage in Azure, such as SQL Azure, it becomes a bit more nuanced and really depends on the type of refresh required. If you need a refresh cycle of more than once a day (either up to 8 times per 24 hours or live), you will need Power BI Pro. If refreshing against, say, SQL Azure once a day is enough, you can do that with Power BI Free. Again, the license requirement carries over from author to viewer: if the author of the report requires Pro, then the viewers also need Pro.

Power BI Refresh scenarios against data storage solutions

Hope this helps. If you have any questions or feedback, please comment below!

Roles in a data driven organization

All this talk about Big Data and Advanced Analytics is all well and good; in fact, it is something I spend most of my time on. The technology is there and has great potential. The biggest question now is how to use these technologies to their full extent and maximize their benefits for your organization. The answer lies in becoming a data driven organization.

A data driven organization is an organization that breathes data: not only producing data, but also analyzing, consuming and really understanding data, both its own and the data others can provide. To become a data driven organization, you will need to change People, Process and Technology. There is enough talk about the Technology in the market already (and on this blog), so I will come back to that later and not go into much detail now. Let's look at the other two: People and Process. I view Process as very much related to People: bringing in new skills without the proper Process in place for how to work with them, and for the new People to work together, will not be very useful.

So, what People do you need? In other words: what roles do you need in a data driven organization? I see four required roles in any organization that wants to be more data driven. This is not to say that these four roles should be four different people; it is very well possible that someone takes on more than one role. I am, however, confident that very few people will be able to fulfil all four roles, since each requires specific skills, focus and passion.

The four roles are: Wrangler, Scientist, Artist and Communicator. Let’s look at the four roles in more detail.

Wrangler

The Wrangler, or data wrangler as others call this role, is responsible for identifying, qualifying and providing access to data sets. In this sense the data is the wild horse that the wrangler tames. This role is needed so the Scientist can work with qualified, trustable and managed data sources. In many situations, this looks a lot like the data management roles already present in organizations, and it lives mostly in IT. Keywords here are databases, connection strings, Hadoop, protocols, file formats, data quality, master data management and data classification.

Scientist

More popularly called the Data Scientist. A lot of people seem to believe that as long as you hire a Data Scientist, you are a data driven company. This is much the same as saying that if you have Hadoop you 'do Big Data', and about as smart as saying that if you got your driver's license you make an excellent Formula 1 driver. It is just not true, sorry. Note also that the opposite applies: a great Formula 1 driver could be a very bad driver on open roads. Running Hadoop does not mean you 'do Big Data', and hiring a Data Scientist does not mean you are a data driven company.

A Scientist is someone who applies maths, a lot of maths, to convert data into information. He or she applies statistical models and techniques such as deep learning, data mining and machine learning to make this happen. Scientists are the rock stars of this data-focused world, since they are the ones actually making the magic happen. However, they cannot do it alone: they need good quality, trustable data, which is what the Wrangler supplies. Also, these Scientists tend to be ill-understood by the rest of the organization. Try this experiment: have your (Data) Scientist stick around the water cooler for 15 minutes every day and let him or her talk to people (I know, for some this is hard already). Then check how quickly the person the Scientist is talking to disconnects. My experience is that someone who is not a fellow Scientist or Communicator will not make it to 15 minutes. Just try it, you will see what I mean.

Keywords here are data mining, R, Python, machine learning, statistics, algorithms.

Artist

The Artist converts the information the Scientist brewed up into insight that consumers can understand and use. This role focuses on aesthetics and on finding the best way to visualize data so the message comes across as well as possible. While the Wrangler is a very IT-focused role and the Scientist a very mathematical one, the Artist often comes from the creative arts world. The Artist loves making things understandable and making the world a better place by creating beautiful things, such as great looking reports and dashboards. Artists often employ storytelling and other powerful visual methods, such as infographics, to convey their message to the consumers.

Keywords: data visualization, dashboard design, signaling colors, storytelling.

Communicator

The last role in data driven organizations is a chameleon. If you look at the types of people in the Wrangler, Scientist and Artist roles, it is clear that these are very different people, with different backgrounds and different passions. Just as much as some of them find it hard to talk to the rest of the organization, they can find it difficult to talk among themselves and work together. To make sure there is no communication breakdown, many organizations invest in a Communicator: someone who has enough understanding of the passions of the people in the other roles to be able to level with them, understand their needs and explain the needs of others to them. Subtypes of the Communicator are the Wrangler-Scientist communicator and the Scientist-Artist communicator.

This concludes the roles I see in a data driven organization. Of course, these roles will need to be supported with the right Processes and Technology. Having a Technology platform instead of disparate tools will help you achieve this and make the best of the investments you are making in these roles.

Automatically building a Microsoft BI machine using PowerShell – Final post (post #14)

This post is #14 in the series to automatically build a Microsoft BI machine using PowerShell – see the start of series.

In this series:

Start of series – introduction and layout of subjects
Post #2 – Preparation: install files using Azure disk
Post #3 – Preparation: install files using Azure File Service
Post #4 – Preparation: logging infrastructure
Post #5 – Master script
Post #6 – Disabling Internet Explorer Enhanced Security Configuration
Post #7 – Active Directory setup
Post #8 – Configuring Password policy
Post #9 – Installing System Center Endpoint Protection
Post #10 – Installing SQL Server
Post #11 – Installing SharePoint Server
Post #12 – Installing PowerPivot for SharePoint
Post #13 – Configuring PowerPivot for SharePoint

Wow. This has been a long and wild ride. But you and I made it together. We now have the full recipe to automatically configure a Microsoft BI demo machine with PowerShell. Of course there is more to be done, such as configuring other Service Accounts and deploying demo content; this script however saves me a lot of time every time I need to stand up a new demo machine.

You can download the script on GitHub. Please note (again) that the code is provided as-is and you should use it at your own risk. It is probably still buggy, but it should give you a good starting point to adapt it to your needs.

I enjoyed the ride with you; I hope I made your life a bit easier over the course of this series. Enjoy!
