
Machine learning is like washing clothes

Automating a data wrangling pipeline using chained Azure Functions. This post and its code were co-authored with Codeprincess.

Machine Learning is all about data… and algorithms. But first and most importantly, it’s about data. And not just any data: we need enough high-quality and (in many cases) “cleaned” data. When we talk about cleaning data, we normally use the word ‘wrangling’. Wrangling is like washing clothes: your dirty data goes in and clean, neatly ironed and folded data comes out. We know from experience and research that data scientists spend 80% of their time wrangling (i.e. cleaning) data – no wonder people call them “data janitors” from time to time!

Waste of talent
OK, your data scientists are spending most of their time on janitor and wrangler tasks. So what? Well, it turns out you normally do not need actual data scientists to do wrangling. You can leave that to others and have data scientists do what they do best: casting obscure mathematical spells to teach data tricks.
Wrangling data is mostly a sequential process of steps that need to be executed. In many cases this is done by hand, either using visual tools or by writing scripts. It all starts with data exploration, to get a grasp of the data before taking steps to wrangle it.

Automating the data wrangling process
To speed up the wrangling process on the one hand and make it less painful on the other, we can automate it. A great technology to use for this is Azure Functions. Azure Functions are stateless, serverless, short-running pieces of code. You can write them in many languages; we used C# and JavaScript in this case. Each step of the wrangling process goes into its own Azure Function, and the functions are then chained together to form a pipeline where dirty data goes in and clean data comes out.
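To make the chaining idea concrete, here is a minimal sketch of what one link in such a chain can look like in C# – loosely modeled on the first function described later in this post (PdfToPng). It uses the attribute-based programming model; the binding details are illustrative, not necessarily how the actual project is wired up:

using System.IO;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Host;

public static class PdfToPng
{
    // Fires when a new PDF lands in the "documents" folder and writes a PNG with
    // the same name to the "images" folder, which in turn triggers the next function.
    [FunctionName("PdfToPng")]
    public static void Run(
        [BlobTrigger("documents/{name}.pdf")] Stream inputPdf,
        [Blob("images/{name}.png", FileAccess.Write)] Stream outputPng,
        string name,
        TraceWriter log)
    {
        log.Info($"Converting {name}.pdf to PNG");
        // conversion logic (a PDF rasterizer) goes here; see the workflow below
    }
}

The output blob of one function is simply the trigger blob of the next – that is all the “chaining” there is to it.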

The scenario
Let’s have a look at a real-life scenario and at how we set up this “make the life of data scientists easier” process. But first, let’s look at the task our process needs to take care of.

A lovely customer of Microsoft Italy wants to retrieve certain information from printed receipts, like the total price, the sender and the receiver. Sounds easy enough, right? Well, think again. Receipts come in many different shapes and sizes. Some are taxi receipts or those little receipts you get from parking machines or handheld payment terminals. Others might be meticulously designed hotel bills or even span multiple pages. In other words, the receipts in question have hardly anything in common. What they do have in common, though, is that they were all saved as PDFs.

The approach
Our approach – as we sat down together in Milan during a week of hacking – was the following. We first collected a big bunch of keywords that indicate the information we are looking for can probably be found nearby. For example, the keyword “Totale” was always near the total amount – which is information we want to retrieve from the receipt. In this manner we went on and collected as many keywords matching patterns like this as we could.

Starting from the idea that “the keyword indicates where our info is probably located”, we looked through different services that can give us the location of text on a document. The search was quite short, though, because we remembered that the Cognitive Services toolbox has something useful here: the Computer Vision OCR API. It returns the OCRed text from a picture together with additional info on the location of that text.

The location data of the text is structured in regions, which contain lines, which in turn contain words. So not only do we get text that sits together on the page packed together – we might even get semantic info out of this if we assume that info which is close together belongs together 🙂
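For illustration, the OCR response is shaped roughly like this (abridged; the bounding-box values are made up):

{
  "language": "it",
  "regions": [
    {
      "boundingBox": "56,120,335,48",
      "lines": [
        {
          "boundingBox": "56,120,335,20",
          "words": [
            { "boundingBox": "56,120,80,20", "text": "Totale" },
            { "boundingBox": "142,120,64,20", "text": "42,50" }
          ]
        }
      ]
    }
  ]
}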

So our first step was to convert the input PDF receipts into images, because the Computer Vision OCR API only takes images as input. The service then scanned those images for text and returned it with its location metadata in JSON format (remember the region-line-word trio). See the image above for a sample result of the OCR API.

For further analysis the JSON format was not that practical yet, so we decided to flatten it. We cleaned out the detailed info about the regions and just packed all the text within one region together into a single string. The advantage of this is that we can then easily scan the strings for keywords. Imagine scanning a string for the keyword “Totale” and, if it matches, looking for a number in that same string. The chance is then quite high that we have found the total amount of the receipt. Victory!
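A rough sketch of this flatten-and-scan idea in C# (using Newtonsoft.Json; the property names follow the OCR response shown above, while the keyword handling and the number regex are purely illustrative):

using System;
using System.Linq;
using System.Text.RegularExpressions;
using Newtonsoft.Json.Linq;

public static class FlattenAndScan
{
    // Collapses every OCR region into a single string, e.g. "Totale 42,50".
    public static string[] Flatten(string ocrJson)
    {
        var ocr = JObject.Parse(ocrJson);
        return ocr["regions"]
            .Select(region => string.Join(" ",
                region["lines"]
                    .SelectMany(line => line["words"])
                    .Select(word => (string)word["text"])))
            .ToArray();
    }

    // Naive heuristic: if a region mentions the keyword, grab the first amount-like number in it.
    public static string FindTotal(string[] flatRegions)
    {
        foreach (var text in flatRegions)
        {
            if (text.IndexOf("Totale", StringComparison.OrdinalIgnoreCase) < 0) continue;
            var match = Regex.Match(text, @"\d+[.,]\d{2}");
            if (match.Success) return match.Value;
        }
        return null;
    }
}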

That’s the basic process for preparing the data for the machine learning that follows, in which the data can be analyzed further and the algorithm can even learn from the structure of the documents, providing better detection than heuristics alone.

Technically we developed the entire process with just two different cloud components:

–       Azure Functions

–       Azure Blob Storage

For each processing step we created an Azure Function to host and run the logic, and blob storage was put in place to save the result of each step. To make things complete, the pipeline itself is also kicked off by blob storage, because all the receipt PDFs are saved in a blob storage folder. When a new PDF is uploaded to that folder, the pipeline kicks into action and processes it.

Final architecture
The little illustration should give a good overview of what we have done here.

But to be on the safe side let’s look at the workflow in detail so that we don’t miss anything essential.

–       Suddenly! A wild document appears!

–       The document is saved in a blob storage, let’s say in folder “documents”.

–       The first Azure Function (called PdfToPng, written in C#) starts running. It is blob-storage-triggered and converts the PDF to a PNG, which it saves in the blob storage folder “images”. This function depends on GhostScript to do the conversion. Have a look at this post for a step-by-step guide on how to set up the dependencies.

–       The creation of this PNG file triggers the next Azure Function (called PngToOcr, written in C#). This function runs the image through the OCR service and saves the JSON result in the blob storage folder “json” (a minimal sketch of this call follows the list).

–       By now it should be no surprise that another Azure Function is then triggered (called OcrToFlat, written in Node.js). It takes the JSON data, creates a single string out of all the text within each detected OCR region, and saves the result in the blob storage folder “flatjson”.
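For reference, here is a minimal sketch of the OCR call made by PngToOcr. The endpoint region, the hard-coded language parameter and the way the key is passed in are assumptions, not the project’s actual code:

using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class OcrClient
{
    // Sends a PNG to the Computer Vision OCR endpoint and returns the raw JSON response.
    public static async Task<string> OcrImageAsync(Stream image, string subscriptionKey)
    {
        using (var client = new HttpClient())
        using (var content = new StreamContent(image))
        {
            client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);
            content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");

            // language=it because the receipts are Italian; detectOrientation helps with skewed scans
            var uri = "https://westeurope.api.cognitive.microsoft.com/vision/v1.0/ocr?language=it&detectOrientation=true";
            var response = await client.PostAsync(uri, content);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}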

Done! And the incredible thing is: Azure Functions run “serverless”, which means we had no overhead work setting them up. All we did was “just” put our code in them and define the trigger and where the output should go. We even mixed and matched programming languages, using whatever suited a step best and/or was the preference of the developer. Our chain was set up! Another big advantage is that Functions run in parallel – or, better said, “scale out” – which saves us a lot of time. The chain is triggered individually for each document that is put into the “documents” folder, so no document has to wait for another to get processed.

In the illustration there is one additional component which has not been mentioned yet: LUIS. LUIS is the Language Understanding service from the same toolbox we are already using for the Computer Vision OCR API. But unlike the OCR API, LUIS does not detect text in images; it detects intent and meaning in text. This is great for our data set, because with LUIS we can find out what a flat string – the one we produce in the last step of our chain – is all about. So in addition to plain keyword matching, we can use this smart service to get higher hit rates when extracting the info we want from our receipts.
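Querying LUIS with one of the flat strings looks roughly like this (the app ID, key and region are placeholders, and the v2.0 endpoint format is an assumption based on the LUIS service at the time of writing):

using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class LuisClient
{
    // Sends a flattened region string to LUIS and returns its JSON analysis,
    // i.e. the top scoring intent plus any entities it recognized.
    public static async Task<string> AnalyzeAsync(string flatText, string appId, string subscriptionKey)
    {
        using (var client = new HttpClient())
        {
            var uri = $"https://westeurope.api.cognitive.microsoft.com/luis/v2.0/apps/{appId}" +
                      $"?subscription-key={subscriptionKey}&verbose=true&q={WebUtility.UrlEncode(flatText)}";
            var response = await client.GetAsync(uri);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}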

The Code
All of the code is available on this Github repo.

You must admit – it’s a little bit awesome 🙂 And technology just rocks 😀 What do you think?

 

Updated: Annual radio countdown Top 2000 in Power BI

Oops, I did it again. It’s that time of the year. A day later than last year (sorry about that ;))….

Back by popular demand: the Top 2000 visualized in Power BI. Read all about it in last year’s post.


Enjoy and happy holidays! See you next year.

First look: Esri ArcGIS Maps in Power BI

Esri is a leader in the GIS industry and ArcGIS is a very popular product for building great maps. Now you can use ArcGIS maps in Power BI (in preview). See the official information here: https://powerbi.microsoft.com/en-us/blog/announcing-arcgis-maps-for-power-bi-by-esri-preview/. This is really cool; I know a lot of you have been asking for this for a long time!

You will find the option to enable this preview in PowerBI.com, not in the Power BI Desktop. Log in to PowerBI and open the settings. You can find the ArcGIS preview there and enable it by simply selecting the checkbox:

With that enabled, create a report with some geographical information (or edit an existing one). I used the Google Analytics data that keeps track of my blog. Google Analytics data can be loaded into Power BI simply by using the content pack. In edit mode in the report you will find the ArcGIS component in the Visualizations list:

Click it and create your map as you would with the normal map. I noticed it needs some time to build the map (probably due to the preview) but once it is done it is fully interactive with the other items on your report as you would expect:

You can change a lot of the ArcGIS options, such as switching out maps, changing symbol styles, adding reference layers, etc.

I love this – the awesome power of ArcGIS and Power BI combined! I cannot wait to see what you will create with this.

Speaking at Experts Live 2015 on BI and Big Data

Just a quick post: inviting everyone to visit Experts Live 2015 on the 19th of November in Ede, the Netherlands. I will be speaking on BI and Big Data.
See http://www.expertslive.nl/ for more info on the event.
Hope to see you there!

Automatically building a Microsoft BI machine using PowerShell – Start of Series

I used to spend quite some time on building and re-building Microsoft BI demo machines. As you can imagine this manual process takes a lot of time and effort. Therefore (and also for my own education on PowerShell) I decided to look into automating the whole process. I will explain this in this series of posts.

The goal

In the end, we want to have a virtual machine that is configured as follows: Windows Server 2012 R2 with the Active Directory Domain Controller role. Additionally, SQL Server 2014 is installed and configured, as well as SharePoint 2013. Finally, the BI tools like Power Pivot and Power View are configured.

Ok, but how do we build such a machine?

Here are the steps to take. I always do them in this order, partly because there are some dependencies and partly because it stops me from going insane.

  0. Install Windows (doh). I will skip this step (therefore it is number 0) since I use Azure and a VM in Azure comes with Windows Server pre-installed. I happen to use Windows Server 2012 R2, by the way.
  1. Disable Internet Explorer Enhanced Security Configuration. Although it is a great idea (see http://technet.microsoft.com/en-us/library/dd883248(v=WS.10).aspx for more info on this), it is hard to give a good demo on a machine with this thing on. So the first step is disabling it (see the sketch after this list).
  2. Set up Active Directory; AD is required for the PowerPivot service.
  3. After AD has been set up we need to promote the Domain Controller.
  4. After promotion we configure a very unrestrictive password policy; remember, this is just a demo machine!
  5. Virus protection is important, even for a demo machine; therefore set up System Center Endpoint Protection.
  6. Install SQL Server 2014.
  7. Install SharePoint.
  8. Install the PowerPivot Service.
  9. Configure the PowerPivot Service.
  10. Configure the last parts of the PowerPivot Service.
  11. Configure Master Data Services.
  12. Configure Data Quality Services.
  13. Configure other SharePoint Service Applications.
  14. Activate SharePoint site features.
  15. Add favorites in Internet Explorer pointing to the MDS and SharePoint sites.
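As a small taste of what is coming in this series, here is roughly what step 1 boils down to in PowerShell – my sketch, not necessarily the exact code I will share later: flip two well-known registry values and restart Explorer.

# Disable Internet Explorer Enhanced Security Configuration for admins and users
$adminKey = "HKLM:\SOFTWARE\Microsoft\Active Setup\Installed Components\{A509B1A7-37EF-4b3f-8CFC-4F3A74704073}"
$userKey  = "HKLM:\SOFTWARE\Microsoft\Active Setup\Installed Components\{A509B1A8-37EF-4b3f-8CFC-4F3A74704073}"

Set-ItemProperty -Path $adminKey -Name "IsInstalled" -Value 0
Set-ItemProperty -Path $userKey  -Name "IsInstalled" -Value 0

# Restart Explorer so the change takes effect immediately
Stop-Process -Name Explorer -Force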

In this blog series I will share my PowerShell code to accomplish this. Please note that I am not a developer so things can probably be done a lot smarter 🙂

Next step is preparation: the install files.

R Plotting using Azure Machine Learning

Azure Machine Learning is Microsoft’s cloud data mining and machine learning solution. It features a studio that is fully web-based. One of its best features is integration with R through the ‘Execute R Script’ component. One of the best things about R is its plotting capability, and I recently decided to try to make R plots from Azure ML Studio. It is amazing how easily this works, and it really brings the power of Azure ML together with the great exploration, plotting and data manipulation capabilities of R.

Here is a very simple sample I made:

I used the Flight Delays sample dataset from Azure ML to make this. In the ML Studio you will need to create a new experiment and drag the ‘Flight Delays Data’ component onto the canvas. The only other component you will need to drop on the canvas is ‘Execute R Script’ (I told you this was a very simple example). Drag a line from the data to the leftmost input port of the R script container, like so:

Click on the R script component and edit the R script on the right. Here is my script:

 

This script gets the data from the input port and rbinds it into data.set. Then I executed a very simple plot using plot() from base R to create the plot shown above. The last line of this code is not even necessary, but it was there by default.
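For readers who cannot see the screenshot: the script follows the default ‘Execute R Script’ template, and a rough equivalent looks like this (the column I plot here is just an example, not necessarily the one I used):

# Map the first input port to a data frame (this is part of the default template)
dataset1 <- maml.mapInputPort(1)

# Combine the input into data.set
data.set <- rbind(dataset1)

# A very simple base R plot; pick whichever column you want to explore
plot(data.set$DepDelay)

# Send data.set to the output port (present in the default template, not strictly needed here)
maml.mapOutputPort("data.set")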

After running the experiment, the plot can be seen by selecting the right-hand output port (the R Device port) of the ‘Execute R Script’ container and selecting ‘Visualize’:

The plot will be at the bottom of this page.

Pretty cool huh? Stay tuned for more as I will continue experimenting with R integration in Azure ML as well as other ML things.

Happy new year!

I would like to take this opportunity to wish you all a very happy and data-filled 2015. Thanks for visiting my blog this past year!
Data nerd as I am, I could not resist sharing with you this blog’s annual stats report:
http://jetpack.me/annual-report/53080601/2014/

Enjoy!

Power BI for partners

Frequently, partners ask me about getting access to Power BI for a little longer than the trial period. Up until recently the only answer I could give was that there was nothing available. However, there is news! Partners can now get access to Power BI through the Partner benefits portal. This blog explains it all: http://blogs.technet.com/b/uspartner_ts2team/archive/2014/07/10/power-bi-for-office-365-now-available-to-competency-partners.aspx

Hope this helps!

Power BI Pro Tip: Name your linked tables

Just a quick Power BI Pro Tip this time: if you use linked tables to add data to your data model in Excel, then before you press the ‘Add To Data Model’ button, be sure to go to the table properties in Excel and give your table a better name (better than Table X). This makes figuring out which data you are looking at so much easier. You will thank me later 🙂

Power BI Pro Tip: Show Top x results with RANKX() function

One of the more frequent scenarios is listing the top X results, such as most profitable products, biggest customers, top 10 best selling stores, etc. Also doing a top X selection helps reduce clutter in charts: a lot of data points can work as noise and obscure the data points that really matter and make the biggest impact.

In this post I describe an approach to implementing these scenarios using Power Pivot’s RANKX() function.

Let’s start with a simple dataset consisting of products (P1…P20 in my sample), Cities, Sales Amount and Number of products sold:

 

After adding this table to the Power Pivot data model, we can use the RANKX() function to get the best selling products / cities etc. I added the following measures to my table:

Sum of Sales Amount:=SUM([Sales Amount])

Sum of Number Sold:=SUM([Number Sold])

Rank of products by sales amount:=RANKX(ALL(Sales[Product]);[Sum of Sales Amount])

Rank of city by number sold:=RANKX(ALL(Sales[City]);[Sum of Number Sold])

 

These measures allow me to determine the top selling products by sales amount and best cities by number of products sold.

The only thing left to do is to use a Pivot Table / Pivot Chart or a Power View / Power Map visualization to display the results.

 

If you create a new Pivot Table and add the Product column and the ‘Rank of products by sales amount’ measure, you get the following:


 

So how do we get the top 10 selling products by sales amount in a nicely ordered fashion? Very easy: it is just a matter of the right sorting and filtering. Click the little downward-pointing triangle button at Row Labels and choose ‘More Sort Options’. There I chose Ascending and then selected the rank measure:


 

The Pivot Table is now sorted by rank, with the highest-ranking product at the top. To show only the top ten, press the same button again and choose Value Filters and then Top 10. Here I made the following selections:


 

This may seem a bit counterintuitive, but what this does is return the lowest ten ranks (which would be 1 to 10, i.e. the highest-ranking products). Alternatively, I could have used a Less Than Or Equal To value filter with these settings to produce the same result:

And here it is: a top 10 of products by sales amount.

 

Of course, you can also use Power View or Power Map to visualize these results. Here is a Power View based on the same information:

The trick here is to create the visualization just as normal (as above). The view above displays the sales amount by product and the number sold by city. The catch is that both the chart and the map have a filter on them that uses the rank measures I created. Here is the filter for the chart: the ‘Rank of products by sales amount’ measure is filtered to show only values less than or equal to 10, i.e. the top 10.

What’s best about this is that it is very easy to change from top 10 to top 15 to top 5 or anything you desire. Also, the Power View is fully interactive. For example, clicking on one of the cities on the right shows which products are sold in that city. Note that it does not show the top 10 products in that city however.

Hope you liked this Power BI Pro Tip!
