Jeroen ter Heerdt

Machine learning is like washing clothes

Automating a data wrangling pipeline using chained Azure Functions. This post and its code were co-authored with Codeprincess.

Machine Learning is all about data… and algorithms. But first and most importantly it’s about data. And not just any data: we need enough high-quality and (in many cases) “cleaned” data. When we talk about cleaning data, we normally use the word ‘wrangling’. Wrangling is like washing clothes: your dirty data goes in and clean, neatly ironed and folded data comes out. We know from experience and research that data scientists spend 80% of their time wrangling (that is, cleaning) data – no wonder people call them “data janitors” from time to time!

Waste of talent
OK, your data scientists are spending most of their time on janitor and wrangler tasks. So what? Well, it turns out you normally do not need actual data scientists to do wrangling. You can leave that to others and have data scientists do what they do best: casting obscure mathematical spells to teach data tricks.
Wrangling data is mostly a sequential process of steps that need to be executed. In many cases this is done by hand, either using visual tools or by writing scripts. It all starts with data exploration, to get a grasp of the data before taking steps to wrangle it.

Automating the data wrangling process
To speed up the wrangling process on the one hand and make it less painful on the other, we can automate it. A great technology to use for this is Azure Functions. Azure Functions are stateless, serverless, short-running pieces of code. You can write them in many languages; we used C# and JavaScript in this case. Each step of the wrangling process goes into its own Azure Function, and the functions are then chained together to form a pipeline where dirty data goes in and clean data comes out.
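To give an idea of what one link in such a chain looks like, here is a minimal C# sketch of a blob-triggered function. The function, container and parameter names are illustrative; the actual functions are in the Github repo linked at the end of this post.

    using System.IO;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Host;

    public static class PdfToPngSketch
    {
        // Fires whenever a PDF lands in the "documents" container and writes a PNG
        // to the "images" container, which in turn triggers the next function in the chain.
        [FunctionName("PdfToPng")]
        public static void Run(
            [BlobTrigger("documents/{name}.pdf")] Stream inputPdf,
            [Blob("images/{name}.png", FileAccess.Write)] Stream outputPng,
            string name,
            TraceWriter log)
        {
            log.Info($"Converting {name}.pdf to PNG");
            // ...convert the PDF (e.g. with GhostScript) and write the result to outputPng...
        }
    }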

The scenario
Let’s have a look at a real-life scenario in which we set up this “make the life of data scientists easier” process. But first, let’s look at the task our process should take care of.

A lovely customer of Microsoft Italy wants to retrieve certain information from printed receipts, like the total price, the sender and the receiver. Sounds easy enough, right? Well, think again. Receipts come in many different shapes and sizes. Some are taxi receipts or those little slips you get from parking machines or handheld payment terminals. Others might be meticulously designed hotel bills or even span multiple pages. In other words, the receipts in question have almost nothing in common. What they do have in common, though, is that they are saved as PDFs.

The approach
Our approach – as we sat down together in Milan during a week of hacking – was the following. We first collected a big bunch of keywords that indicate that the information we are looking for can probably be found near them. For example, the keyword “Totale” was always near the total amount – which is information we want to retrieve from the receipt. In this manner we went on and collected as many keywords matching patterns like this as we could.

With the idea of “the keyword indicates where our info is probably located” in mind, we started looking through services that can give us location info for text on a document. The search was quite short, though, because we remembered that the Cognitive Services toolbox has something useful here: the Computer Vision OCR API. It returns OCRed text from a picture and gives additional info on the location of that text.

The location data of the text is structured in regions, which contain lines, which in turn contain words. So not only do we get text that sits together packed together – we might even get semantic info out of this if we assume that info that is close together also belongs together 🙂

So our first step was to convert our input PDF receipts into images, because the Computer Vision OCR API only takes images as input. Those images were then scanned by the service for text, which was returned together with its location metadata in JSON format (remember the region-line-word trio). See the image above for a sample result of the OCR API.

For further analysis the JSON format was not that practical yet, so we decided to flatten it. We cleaned out the detailed info about the regions and just packed all the text within one region together into a single string. The advantage of this was that we could then easily scan the strings for keywords. Imagine scanning the strings for the keyword “Totale” and, if it matches, looking for a number in that same string. The chance is then quite high that we have found the total amount of the receipt. Victory!
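As an illustration, here is a small C# sketch of this flattening-and-scanning idea. The classes mirror the region-line-word nesting of the OCR response, but the class and property names are illustrative, and the actual flattening function in the repo is written in Node.js.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    public class OcrWord   { public string Text { get; set; } }
    public class OcrLine   { public List<OcrWord> Words { get; set; } }
    public class OcrRegion { public List<OcrLine> Lines { get; set; } }
    public class OcrResult { public List<OcrRegion> Regions { get; set; } }

    public static class FlattenAndScan
    {
        // Pack all text of one OCR region into a single string per region.
        public static IEnumerable<string> Flatten(OcrResult ocr) =>
            ocr.Regions.Select(r =>
                string.Join(" ", r.Lines.SelectMany(l => l.Words).Select(w => w.Text)));

        // Heuristic: if a region mentions "Totale", the first amount-like number
        // in that region is probably the receipt total.
        public static decimal? FindTotal(IEnumerable<string> regions)
        {
            foreach (var text in regions)
            {
                if (text.IndexOf("Totale", StringComparison.OrdinalIgnoreCase) < 0) continue;
                var match = Regex.Match(text, @"\d+[.,]\d{2}");
                if (match.Success)
                    return decimal.Parse(match.Value.Replace(",", "."),
                        System.Globalization.CultureInfo.InvariantCulture);
            }
            return null;
        }
    }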

That’s the basic process for preparing the data for the Machine Learning that follows, with which the data can be analyzed further; the algorithm even learns from the structure of the documents and can provide better detection than heuristics alone.

Technically we developed the entire process with just two different cloud components:

–       Azure Functions

–       Azure Blob Storage

For each processing step we created an Azure Function to host and run the logic, and blob storage was put in place to save the result of each step. To make things complete, we kick off the pipeline from blob storage too – because all the receipt PDFs are saved in a blob storage. When a new PDF is uploaded to that blob storage, the pipeline kicks into action and processes the PDF.

Final architecture
The little illustration should give a good overview of what we have done here.

But to be on the safe side let’s look at the workflow in detail so that we don’t miss anything essential.

–       Suddenly! A wild document appears!

–       The document is saved in a blob storage, let’s say in folder “documents”.

–       The first Azure Function (called PdfToPng, written in C#) starts running. It is triggered by blob storage and converts the PDF to a PNG, saving it in the blob storage folder “images”. This function depends on GhostScript to do the conversion; have a look at this post for a step-by-step guide on how to set up the dependencies.

–       The creation of this PNG file triggers the next Azure Function (called PngToOcr, written in C#). This function sends the image through the OCR service and saves the JSON result in the blob storage folder “json” – a sketch of that OCR call follows this list.

–       By now it should be no surprise that another Azure Function is triggered (called OcrToFlat, written in Node.js). It takes the JSON data and creates a single string out of all the text within each detected OCR region, saving the result in the blob storage folder “flatjson”.
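For the PngToOcr step, the call to the Computer Vision OCR API is essentially an HTTP POST of the image bytes. Here is a hedged C# sketch – the region, subscription key and language parameter are placeholders, and the function in the repo may differ in detail:

    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Threading.Tasks;

    public static class OcrClientSketch
    {
        // Sends a PNG to the Computer Vision OCR endpoint and returns the raw JSON
        // (the region/line/word structure described earlier).
        public static async Task<string> RunOcrAsync(byte[] pngBytes, string subscriptionKey)
        {
            using (var client = new HttpClient())
            using (var content = new ByteArrayContent(pngBytes))
            {
                client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);
                content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");

                var url = "https://westeurope.api.cognitive.microsoft.com/vision/v1.0/ocr?language=it";
                var response = await client.PostAsync(url, content);
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
        }
    }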

Done! And the incredible thing is – Azure Functions run “serverless”, which means we had no setup overhead. All we did was “just” put our code in them, define the trigger and say where the output should go. We even mixed and matched programming languages, using whatever suited that step best and/or was the preference of the developer. Our chain was set up! Another big advantage is that Functions run in parallel – or better said, “scale out” – which saves us a lot of time. The chain is triggered individually for each document that is put into the “documents” folder, so documents do not have to wait for each other to be processed.

In the illustration there is one additional component that has not been mentioned yet: LUIS. LUIS is the Language Understanding Service from the same toolbox we are already using for the Computer Vision OCR API. But unlike the OCR API, LUIS does not detect text in images; it detects intent and meaning in text. This is great for our data set, because with LUIS we can find out what a flat string – which we produce in the last step of our chain – is all about. So in addition to plain keyword matching we can use this smart service to get higher hit rates on finding the info we want from our receipts.

The Code
All of the code is available on this Github repo.

You must admit – it’s a little bit awesome 🙂 And technology just rocks 😀 What do you think?

 

Ultimate Time Based Calculations Cheat Sheet for DAX / Power BI (including Week based calculations)

Power BI provides great time intelligence features to calculate Year-to-date (YTD), Month-to-date (MTD) and Quarter-to-date (QTD) totals. There is no such thing as Week-to-date (WTD) or Period-to-date (PTD), where a period can be any arbitrary period definition (I used two-month periods in my example below). If you want those, you will have to create the calculations yourself. I was inspired by this excellent blog post and created an ultimate time-intelligence calculations Power BI file. I used Matt Masson’s excellent date dimension generation script to generate the date table for my example.

Download the full Power BI file here or get just the DAX formulas. Enjoy!
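To give a flavour of the pattern used throughout the file, here is a minimal week-to-date sketch. It assumes a Sales table with an Amount column and a 'Date' table with Date, Year and WeekNumber columns (illustrative names), and it ignores weeks that straddle a year boundary; the formulas in the download are the authoritative versions.

    Total Amount WTD :=
    VAR CurrentDate = MAX ( 'Date'[Date] )
    VAR CurrentYear = MAX ( 'Date'[Year] )
    VAR CurrentWeek = MAX ( 'Date'[WeekNumber] )
    RETURN
        CALCULATE (
            [Total Amount],
            FILTER (
                ALL ( 'Date' ),
                'Date'[Year] = CurrentYear
                    && 'Date'[WeekNumber] = CurrentWeek
                    && 'Date'[Date] <= CurrentDate
            )
        )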

Basic Measures

  • Total Amount:
  • Total Quantity:

Day Measures

  • Total Amount for Last Day (Yesterday)
  • Total Amount for same Day last Year
  • Variance of total Amount compared to Total Amount for Last Day
  • Variance % of total Amount compared to Total Amount for Last Day
  • Variance of total Amount compared to Total Amount for same Day last Year
  • Variance % of total Amount compared to Total Amount same Day last Year
  • Total Quantity for Last Day (Yesterday)
  • Total Quantity for same Day last Year
  • Variance of total Quantity compared to Total Quantity for Last Day
  • Variance % of total Quantity compared to Total Quantity for Last Day
  • Variance of total Quantity compared to total Quantity for same Day last Year
  • Variance % of total Quantity compared to total Quantity for same Day last Year

Week Measures

  • Total Amount Week To Date
  • Total Amount for Last Week
  • Total Amount for same Week Last Year
  • Variance of total Amount Week To Date compared to Total Amount for Last Week
  • Variance % of total Amount Week To Date compared to Total Amount for Last Week
  • Variance of total Amount Week to Date compared to Total Amount for same Week last Year
  • Variance % of total Amount Week to Date compared to Total Amount for same Week last Year
  • Total Quantity Week To Date
  • Total Quantity for same Week Last Year
  • Total Quantity for Last Week
  • Variance of total Quantity Week To Date compared to Total Quantity for Last Week
  • Variance % of total Quantity Week To Date compared to Total Quantity for Last Week
  • Variance of total Quantity Week to Date compared to Total Quantity for same Week last Year
  • Variance % of total Quantity Week to Date compared to Total Quantity for same Week last Year

Month Measures

  • Total Amount Month To Date
  • Total Amount for same Month Last Year
  • Total Amount for Last Month
  • Variance of total Amount Month To Date compared to Total Amount for Last Month
  • Variance % of total Amount Month To Date compared to Total Amount for Last Month
  • Variance of total Amount Month to Date compared to Total Amount for same Month last Year
  • Variance % of total Amount Month to Date compared to Total Amount for same Month last Year
  • Total Quantity Month To Date
  • Total Quantity for same Month Last Year
  • Total Quantity for Last Month
  • Variance of total Quantity Month To Date compared to Total Quantity for Last Month
  • Variance % of total Quantity Month To Date compared to Total Quantity for Last Month
  • Variance of total Quantity Month to Date compared to Total Quantity for same Month last Year
  • Variance % of total Quantity Month to Date compared to Total Quantity for same Month last Year

Period Measures

  • Total Amount Period To Date
  • Total Amount for same Period Last Year
  • Total Amount for Last Period
  • Variance of total Amount Period To Date compared to Total Amount for Last Period
  • Variance % of total Amount Period To Date compared to Total Amount for Last Period
  • Variance of total Amount Period to Date compared to Total Amount for same Period last Year
  • Variance % of total Amount Period to Date compared to Total Amount for same Period last Year
  • Total Quantity Period To Date
  • Total Quantity for same Period Last Year
  • Total Quantity for Last Period
  • Variance of total Quantity Period To Date compared to Total Quantity for Last Period
  • Variance % of total Quantity Period To Date compared to Total Quantity for Last Period
  • Variance of total Quantity Period to Date compared to Total Quantity for same Period last Year
  • Variance % of total Quantity Period to Date compared to Total Quantity for same Period last Year

Quarter Measures

  • Total Amount Quarter To Date
  • Total Amount for same Quarter Last Year
  • Total Amount for Last Quarter
  • Variance of total Amount Quarter To Date compared to Total Amount for Last Quarter
  • Variance % of total Amount Quarter To Date compared to Total Amount for Last Quarter
  • Variance of total Amount Quarter to Date compared to Total Amount for same Quarter last Year
  • Variance % of total Amount Quarter to Date compared to Total Amount for same Quarter last Year
  • Total Quantity Quarter To Date
  • Total Quantity for same Quarter Last Year
  • Total Quantity for Last Quarter
  • Variance of total Quantity Quarter To Date compared to Total Quantity for Last Quarter
  • Variance % of total Quantity Quarter To Date compared to Total Quantity for Last Quarter
  • Variance of total Quantity Quarter To Date compared to Total Quantity for same Quarter last Year
  • Variance % of total Quantity Quarter To Date compared to Total Quantity for same Quarter last Year

Year Measures

  • Total Amount Year To Date
  • Total Amount for Last Year
  • Variance of total Amount Year To Date compared to Total Amount for Last Year
  • Variance % of total Amount Year To Date compared to Total Amount for Last Year
  • Total Quantity Year To Date
  • Total Quantity for Last Year
  • Variance of total Quantity Year To Date compared to Total Quantity for Last Year
  • Variance % of total Quantity Year To Date compared to Total Quantity for Last Year

Mapping Dutch geographical data using Power BI

I have written before on the subject of using Power BI’s mapping capabilities (see the post on ArcGIS) and using the maps with Dutch geographical data (see using Dutch postal codes in Power Map). However, I had never done a more thorough exploration of the mapping capabilities of Power BI using Dutch data. Until now! (Niels, thanks for asking me to do this.)

Please see the Power BI report below – the first tab shows the dataset I used. The other tabs show the way(s) to work with the various maps to plot Dutch data. Unfortunately the ArcGIS visual is not supported in the mode I use here – please download the original Power BI desktop file below. Enjoy!

Download Power BI Desktop file

Updated: Annual radio countdown Top 2000 in Power BI

Oops, I did it again. It’s that time of the year. A day later than last year (sorry about that ;))….

Back by popular demand: the Top 2000 visualized in Power BI. Read all about it in last year’s post.

Full screen

Enjoy and happy holidays! See you next year.

SQL Server 2017: It’s here!

You have anxiously waited for this day – the day you could finally get your hands on the latest version of the best database platform available. Today is that day. SQL Server 2017 is available and it runs on Windows, Linux and Docker. Find more info here: https://www.microsoft.com/en-us/sql-server/sql-server-2017. What changed? So many things that it is hard to mention them all. Have a look at the Ignite sessions to get a feel for what happened.

 

Power BI Connector for CBS Open Data / Statline

Since this post primarily concerns a Dutch data source, a short English summary is included at the end.

The open data that Statistics Netherlands (Centraal Bureau voor de Statistiek, CBS) publishes through Statline (http://statline.cbs.nl) is invaluable for analysts. Information is published on a wide range of themes, such as housing, demographics, income and labour. Very handy, for example, to compare your sales per province with the composition of the population. Getting the right information out of Statline used to be quite a hassle, but thanks to the Power BI Connector for CBS Open Data / Statline that is no longer the case.

Usage

Once you have installed the connector (see below), you will find it in the list of data sources in Power BI Desktop:

Click CBSOpenData and click Connect. Enter one or more search terms (for example ‘arbeid’) and click OK:

The connector talks to the CBS Statline website and returns the themes and any associated tables. This can take a while, depending on the number of themes and tables that match the terms you entered. Pick one or more tables and you can load or edit them! It does not get any easier than that!

Installation

Because custom connectors are still very new, a workaround is currently needed to use them in Power BI Desktop. Follow these steps:

  1. Enable the Custom data connectors preview feature in Power BI Desktop via File > Options:

  2. Close Power BI Desktop.
  3. Read the documentation on Github to see which directory you need to create so that Power BI can load custom connectors. At the moment that is [My Documents]\Microsoft Power BI Desktop\Custom Connectors.
  4. Download the .mez file from the CBS Open Data connector’s Github and save it in the directory you just created. You can of course also have a look at the full source code.
  5. Start Power BI Desktop. If all went well, the connector now appears under ‘Other’ in the list (see the screenshot at the top of this post). Retrieval is not very fast, because the public API is used, but with a bit of patience it should work. Good luck!

English

The CBS Open Data / Statline website (http://statline.cbs.nl) is a very popular source of all types of statistical information about the Netherlands, provided by the government. In order to use this connector in Power BI, download the .mez file from Github and save the file in your custom connectors directory. Please see the documentation to find which directory to use. Enable the Custom Connectors preview feature in Power BI Desktop, restart Power BI and the connector should show up. You can of course also look at the source code on Github. Enjoy!

Microsoft’s Language Understanding Service now speaks Dutch

Over the weekend Microsoft’s Language Understanding Service (LUIS), which is part of the Cognitive Services and the Cortana Intelligence Suite, was updated to support the Dutch (Nederlands) language! Great news for anyone who needs their applications to support and understand Dutch. We are the 12th language on the list 🙂

Implementing incremental data load using Azure Data Factory

Azure Data Factory is a fully managed data processing solution offered in Azure. It connects to many sources, both in the cloud and on-premises. One of the basic tasks it can do is copy data from one source to another – for example from a table in Azure Table Storage to an Azure SQL Database table. To get the best performance and avoid unwanted duplicates in the target table, we need to implement incremental data loads, or deltas. We can also build in mechanisms to further avoid unwanted duplicates when a data pipeline is restarted.

In this post I will explain how to cover both scenarios using a pipeline that takes data from Azure Table Storage, copies it over into Azure SQL and finally brings a subset of the columns over to another Azure SQL table. The result looks like this:

The components are as follows:

  • MyAzureTable: the source table in Azure Table Storage
  • CopyFromAzureTableToSQL: the pipeline copying data over into the first SQL table
  • Orders: the first SQL Azure database table
  • CopyFromAzureSQLOrdersToAzureSQLOrders2: the pipeline copying data from the first SQL table to the second – leaving behind certain columns
  • Orders2: the second and last SQL Azure database table

Setting up the basics is relatively easy. The devil is in the details, however.

  1. The linked services

Every data pipeline in Azure Data Factory begins with setting up linked services. In this case we need two: one to Azure Table Storage and one to SQL Azure. The definition of the linked service to Azure Table Storage is as follows:
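A sketch of what such an Azure Storage linked service typically looks like in ADF’s JSON – the name and connection string are placeholders, and the actual definition is in the Github repo linked at the end of this post:

    {
      "name": "StorageLinkedService",
      "properties": {
        "type": "AzureStorage",
        "typeProperties": {
          "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        }
      }
    }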

The SQL Azure linked service definition looks like this:
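And a sketch of the SQL Azure linked service, again with placeholder names and credentials:

    {
      "name": "AzureSqlLinkedService",
      "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
          "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<database>;User ID=<user>;Password=<password>;Encrypt=True"
        }
      }
    }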

Note the name property – we will need to refer to it later.

  2. The datasets

Datasets define tables or queries that return data that we will process in the pipeline. The first dataset we need to define is the source dataset (called MyAzureTable). The definition is as follows:
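A sketch of what this dataset definition typically looks like – the table name is a placeholder, and the real definition is in the repo:

    {
      "name": "MyAzureTable",
      "properties": {
        "type": "AzureTable",
        "linkedServiceName": "StorageLinkedService",
        "typeProperties": {
          "tableName": "<your Azure Table name>"
        },
        "external": true,
        "availability": {
          "frequency": "Hour",
          "interval": 1
        }
      }
    }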

Note that, again, this item has a name; we will use it in the pipeline later. Also, you will need to specify the name of your Azure Table in the "tableName" property. The "linkedServiceName" property is set to the name of the linked service we defined earlier, so Azure Data Factory knows where to find the table. The "availability" property specifies the slices Azure Data Factory uses to process the data: ADF waits for the specified period to pass and then processes the data that arrived in that window. The settings above specify hourly slices, which means the data will be processed every hour. We will later set up the pipeline in such a way that ADF processes only the data that was added or changed in that hour, not all available data (which is the default behavior). The minimum slice size is currently 15 minutes. Also note that the dataset is marked as external ("external": true). This means that ADF will not try to coordinate tasks for this table; it assumes the data is written by something outside ADF (your application, for example) and will be ready for pickup once the slice period has passed.

The target dataset in SQL Azure follows the same definition:
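A sketch of the target dataset – apart from SalesAmount, OrderTimestamp and ColumnForADuseOnly, which are used later in this post, the column names and types are illustrative:

    {
      "name": "Orders",
      "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": "AzureSqlLinkedService",
        "structure": [
          { "name": "OrderId", "type": "String" },
          { "name": "SalesAmount", "type": "Decimal" },
          { "name": "OrderTimestamp", "type": "Datetime" },
          { "name": "ColumnForADuseOnly", "type": "String" }
        ],
        "typeProperties": {
          "tableName": "Orders"
        },
        "availability": {
          "frequency": "Hour",
          "interval": 1
        }
      }
    }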

Important to note is that we defined the structure explicitly – it is not required for the first pipeline to work, but it is for the second, which will use this same table as its source. Also note the presence of the column ‘ColumnForADuseOnly’ in the table. This column is later used by ADF to make sure data that has already been processed is not appended to the target table again. Of course, the SQL table itself needs to have (at least) the same columns, with matching data types.

  3. The first pipeline (from Azure Table to SQL)

The first pipeline takes the order data in the Azure table and copies it into the Orders table in SQL Azure. It does that incrementally and with repeatability – which means that a) each slice will only process a specific subset of the data and b) if a slice is restarted, the same data will not be copied over twice. This results in fast processing without duplication in the target table – data is copied over once, regardless of the number of restarts. Note that by default ADF copies all data over to the target, so you would end up with as many rows in the table as there are orders in the Azure Table, multiplied by the number of slices that ran (each slice bringing over the full Azure table). The definition is as follows:
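A sketch of what this pipeline definition typically looks like – names and dates are illustrative, and the source query and the sink’s "sliceIdentifierColumnName" are the parts discussed below:

    {
      "name": "CopyFromAzureTableToSQL",
      "properties": {
        "activities": [
          {
            "name": "AzureTableToSqlAzure",
            "type": "Copy",
            "inputs": [ { "name": "MyAzureTable" } ],
            "outputs": [ { "name": "Orders" } ],
            "typeProperties": {
              "source": {
                "type": "AzureTableSource",
                "azureTableSourceQuery": "$$Text.Format('OrderTimestamp ge datetime\\'{0:yyyy-MM-ddTHH:mm:ssZ}\\' and OrderTimestamp lt datetime\\'{1:yyyy-MM-ddTHH:mm:ssZ}\\'', SliceStart, SliceEnd)"
              },
              "sink": {
                "type": "SqlSink",
                "sliceIdentifierColumnName": "ColumnForADuseOnly"
              }
            },
            "policy": {
              "concurrency": 10
            },
            "scheduler": {
              "frequency": "Hour",
              "interval": 1
            }
          }
        ],
        "start": "2017-03-20T09:00:00",
        "end": "2017-12-31T00:00:00"
      }
    }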

Note that the pipeline consists of a single activity, which is a Copy activity. I could have specified more activities in the same pipeline – I have not done so for simplicity. The Copy activity takes the Azure Table (MyAzureTable) as input and outputs into the SQL Azure table “Orders”. The source query is very important, as this is what selects just the data we want! We use the column ‘OrderTimestamp’ and select only the orders from MyAzureTable whose OrderTimestamp is greater than or equal to the start time of the slice and less than the end time of the slice. A sample query executed against the Azure Table in this way looks like this:

OrderTimestamp ge datetime’2017-03-20T13:00:00Z’ and OrderTimestamp lt datetime’2017-03-20T15:00:00Z’

Also, look at the "sliceIdentifierColumnName" property on the target (sink) – this column is in the target SQL Azure table and is used by ADF to keep track of which data has already been copied over, so that if the slice is restarted the same data is not copied over twice.

This pipeline will run each hour (the "scheduler" properties), starting at 09:00:00 local time (specified by the "start" property), and can run 10 slices in parallel (specified by the "concurrency" property).

  4. The second pipeline (from SQL to SQL)

The second pipeline is there to demonstrate mapping specific columns to others, as well as to show how to do an incremental load from SQL Azure to another target. Note that I use the same linked service, so this exercise is not really useful in itself – the same effect could be achieved by creating a view. The definition is as follows:
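A sketch of the second pipeline – the "sqlReaderQuery" and the "translator" section are the interesting parts; names and dates are again illustrative:

    {
      "name": "CopyFromAzureSQLOrdersToAzureSQLOrders2",
      "properties": {
        "activities": [
          {
            "name": "SqlToSql",
            "type": "Copy",
            "inputs": [ { "name": "Orders" } ],
            "outputs": [ { "name": "Orders2" } ],
            "typeProperties": {
              "source": {
                "type": "SqlSource",
                "sqlReaderQuery": "$$Text.Format('SELECT SalesAmount, OrderTimestamp FROM Orders WHERE OrderTimestamp >= \\'{0:yyyy-MM-dd HH:mm}\\' AND OrderTimestamp < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
              },
              "sink": {
                "type": "SqlSink"
              },
              "translator": {
                "type": "TabularTranslator",
                "columnMappings": "SalesAmount: SalesAmount, OrderTimestamp: OrderTimestamp"
              }
            },
            "scheduler": {
              "frequency": "Hour",
              "interval": 1
            }
          }
        ],
        "start": "2017-03-20T09:00:00",
        "end": "2017-12-31T00:00:00"
      }
    }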

Note that we specify a "sqlReaderQuery" this time, which selects the right subset of data for the slice. We use WindowStart and WindowEnd here instead of the SliceStart and SliceEnd used earlier. At this point it does not matter, as ADF requires both to be the same: WindowStart and WindowEnd refer to the pipeline start and end times, while SliceStart and SliceEnd refer to the slice start and end times. Using the "translator" properties we specify which columns to map – note that we copy over only SalesAmount and OrderTimestamp.

There you have it – a fully incremental, repeatable data pipeline in Azure Data Factory, thanks to setting up a smart source query and using the “sliceIdentifierColumnName” property. The full source code is available on Github. More info on how this works is available in the official documentation.

Questions? Remarks? Let me know!

Dutch elections analyzed in Power BI

Just in time for the elections in the Netherlands: I made an analysis using Power BI of the results of the elections since 1918. It is published (in Dutch) on the official Microsoft the Netherlands blog: https://blogs.microsoft.nl/technologieenmaatschappij/tweede-kamerverkiezingen-technologie-brengt-uitslagen-tot-leven/.

Enjoy!

Annual radio countdown Top 2000 in Power BI

Full screen

Every year a list of 2000 songs is voted on by people in the Netherlands. This list is then broadcast by Radio 2 during the last days of the year. I decided to do a Power BI analysis of it and included all the lists since the start (1999). I also included an R-based forecast. If you want to get some insight into the musical taste of the Netherlands, go ahead and have fun!

The Top 2000. A phenomenon. A tradition. One of those ‘it just belongs to this time of year’ feelings. At least it is for me. In any case, it is the best reflection of the musical taste of the Netherlands that I know of. But what good is it if you cannot compare it with last year? 2,000 songs is quite a lot to remember. And who was the biggest climber last year again? What is the oldest song in the list by your favourite artist? Which song took the longest to make it into the list? And is there any truth to the remark that the list responds to artists passing away? Now you can find the answer to these and many other questions!

I have included every Top 2000 since the beginning (1999) in this Power BI report, as a little Christmas present for you, dear reader. You can find the analysis below. I created three views: the first focuses on artists, so you can find out how your favourite artist is doing in the list. The second gives insight into everything about a single ‘edition’ of the Top 2000. As a bonus there is also a forecast view (third page), where you can see how the positions of artists and songs are expected to develop based on their history. Best of all: everything is interactive.

So we finally have the answer to the question of whether the list responds to artists passing away. The answer is yes; just look at Michael Jackson (died in 2009, his best position that year: #27) or Amy Winehouse (died in 2011, her best position that year: #72).

Make the most of it. I think it is a beautiful reflection of the zeitgeist, in any case. I wish you all a merry Christmas!
