A bit later than usual this year, but even after my move to the US I decided to keep up the tradition of updating my Power BI analysis of the Top 2000.
Enjoy and happy holidays! See you next year.
I am happy to share with you all that I am stepping back into the ring. After a ‘brief’ period (“It has been 6 months, how can you call that brief?”) I am back. And better than before – or at least, I hope.
Ever since my move to the US to work at Microsoft headquarters I have been waiting for the opportunity to be more visible ‘outside’. And now I will be! We are getting started with YouTube: we just re-launched the Data Exposed show!
Data Exposed is all about data: relational and non-relational, on-premises and in the cloud, big and small. Join us as we demonstrate features, discuss the latest news, and share the love for data technology including SQL Server, Azure Data Services like SQL DB, Cosmos DB, SQL DW, Data Factory, Databricks and more.
I am proud to say that my team and I will make this a reality – expect to see me in future episodes. For more details, see my blog post on Tech Communities.
It feels good to be part of this. Stay tuned, there will be more to come, including videos with me (goodbye comfort zone!).
This is my first blog post since I moved to Redmond. A lot of time has passed, sorry for that! I kept busy filling out forms, forms and some more forms. After that came more forms. I dream about forms, I breathe forms. I am a form. Wait, let’s stop there.
Recently I had to figure out how to pass in settings to an SQL Server Integration Services (SSIS) package when calling it using dtexec from bash on Linux.
The dtexec utility executes SSIS packages. The name goes back to the time when SSIS was called Data Transformation Services (DTS) – which is also why SSIS packages carry the .dtsx extension. The dtexec utility is available for both Windows and Linux.
One of the interesting things you can do with dtexec is pass in values to variables (using the /Par[ameter] option) or even change a connection string at runtime (using the /Conn[ection] option). I had to do the latter. Here is where the fun started. The /Par and /Conn options (and maybe others) expect something like this: [name];[value]. For example:
dtexec /F myfile.dtsx /Conn "MasterSQL";"Data Source=myserver;User ID=myuser;Initial Catalog=master;Password=mypassword"
Notice the ‘;’ between the name of the source connection and the connection string. This is all great, unless you are trying to call dtexec from the bash command line on Linux, because ; actually means something in bash: when bash sees ;, it assumes the current command has ended and the next instruction follows. That is not what we want here.
I had to escape a number of items to get this to work correctly in bash:
/CONN "MasterSQL"\;"\"Data Source=myserver;User ID=myuser;Initial Catalog=mydb;Password=mypassword\""
See what I did there? I had to escape the first ; (between ‘MasterSQL’ and ‘Data Source’) and also escape the quotes that surround the connection string itself, wrapping the whole thing in an extra pair of double quotes.
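Putting it all together, the full call from bash would look like this (same placeholder server and credentials as in the examples above):
dtexec /F myfile.dtsx /Conn "MasterSQL"\;"\"Data Source=myserver;User ID=myuser;Initial Catalog=mydb;Password=mypassword\""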
If you need to pass in a value to a string parameter, you need to follow the same escaping logic.
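For example, for a hypothetical string parameter named MyStringParam (the name and value here are just illustrations), you can pass in the value like this:
/Par "MyStringParam"\;"\"my string value\""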
Hope this helps!
| Date | Time | Location | Host | Audience | Registration |
| --- | --- | --- | --- | --- | --- |
| 17 October | 10:00 | Bussum | CraniumConnect | Members only | Website |
| 17 October | 17:00 | Den Haag | TopBI | Internal | n/a |
| 30 October | 13:30 | Veenendaal | HSO | By invitation | Website |
| 30 October | 16:00 | Huizen | Sigma Data Consulting | Internal | n/a |
| 2 November | 09:00 | Amsterdam | Breinwave | By invitation | n/a |
| 7 November | 19:00 | Veenendaal | Infosupport | Public (limited seats) | Register |
| 13 November | 17:30 | Utrecht | Caesar Experts | Public | Register |
| 14 November | 10:00 | Bussum | CraniumConnect | Members only | Website |
| 15 November | 15:00 | Amsterdam | Ireckonu | By invitation | n/a |
| 29 November | 17:00 | Den Bosch | Power BI Gebruikersgroep | Public | Register |
I have updated the instructions and the code for the Custom Connector for CBS Open Data / Statline. Please see the original post:
Automating a data wrangling pipeline using chained Azure Functions. This post and its code were co-authored with Codeprincess.
Machine Learning is all about data… and algorithms. But first and most importantly it’s about data. And it’s not just any data: we need enough high-quality and (in many cases) “cleaned” data. When we talk about cleaning data, we normally use the word ‘wrangling’. Wrangling is like washing clothes: your dirty data goes in and clean, neatly ironed and folded data comes out. We know from experience and research that data scientists spend 80% of their time wrangling (= cleaning) data – no wonder people call them “data janitors” from time to time!
Waste of talent
OK, your data scientists are spending most of their time on janitor and wrangler tasks. So what? Well, it turns out you normally do not need actual data scientists to do wrangling. You can leave that to others and have data scientists do what they do best: casting obscure mathematical spells to teach data tricks.
Wrangling data is mostly a sequential process of steps that need to be executed. In many cases it is done by hand, either using visual tools or by creating scripts. It all starts with data exploration to get a grasp of the data before taking steps to wrangle it.
Automating the data wrangling process
Let’s start and have a look at a real-life scenario of how we set up a process to make the life of data scientists easier. But first let’s look at the task our process shall take care of.
A lovely customer of Microsoft Italy wanted to retrieve certain information from printed receipts, like the total price, the sender and the receiver. Sounds easy enough, right? Well, think again. Receipts come in many different shapes and sizes. Some are taxi-like receipts or those little receipts you get from parking machines or handheld payment terminals. Others might be meticulously designed hotel bills, which can even span multiple pages. In other words, the receipts in question have almost nothing in common. What they did have in common, though, is that they were all saved as PDFs.
Our approach – as we sat down together in Milan during a week of hacking – was the following. We first collected a big bunch of keywords that indicate the information we are looking for can probably be found nearby. For example, the keyword “Totale” was always near the total amount – which is info we want to retrieve from the receipt. In this manner we went on and collected as many keywords matching patterns like this as we could.
With the idea that “the keyword indicates where our info is probably located”, we started looking through different services that can give us the location of text on a document. The search was quite short, though, because we remembered that the Cognitive Services toolbox has something useful here: the Computer Vision OCR API. It returns OCRed text from a picture and gives additional info on the location of that text.
The location data of the text is structured in regions, which contain lines, which in turn contain words. So not only do we get text that sits together on the page packed together – we might even get semantic info out of this, if we assume that info which is close together belongs together 🙂
So our first step was to convert our input PDF receipts into images, because the Computer Vision OCR API only takes images as input. Those images were then scanned by the service for text, which was returned together with its location metadata in JSON format (remember the region-line-word trio). See the image above for a sample result of the OCR API.
For further analysis the JSON format was not that practical yet. That’s why we decided to flatten it: we dropped the detailed per-region info and just packed all the text within one region together into a single string. The advantage is that we can then easily scan the strings for keywords. Imagine scanning the strings for the keyword “Totale” and, when it matches, looking for a number in that same string. The probability is then quite high that we have found the total amount of the receipt. Victory!
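To make the flattening idea concrete, here is a minimal sketch in C#, assuming Newtonsoft.Json; the class and method names are just illustrations, and the field names follow the OCR API’s regions-lines-words schema. (The actual pipeline step, described below, was written in Node.js.)

using System.Linq;
using Newtonsoft.Json.Linq;

public static class OcrFlattener
{
    // Turns the OCR result (regions -> lines -> words) into one string per region.
    public static string[] FlattenRegions(string ocrJson)
    {
        JObject doc = JObject.Parse(ocrJson);
        return doc["regions"]
            .Select(region => string.Join(" ",
                region["lines"]
                    .SelectMany(line => line["words"])
                    .Select(word => (string)word["text"])))
            .ToArray();
    }
}

Each resulting string can then be scanned for keywords like “Totale” and a nearby number.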
That’s the basic process for preparing the data for the Machine Learning that follows, with which the data can be analyzed further – the algorithm even learns from the structure of the documents and can provide better detection than heuristics alone.
Technically we developed the entire process with just two different cloud components:
– Azure Functions
– Azure Blob Storage
For each processing step we created an Azure Function to host and run the logic, and blob storage was put in place to save the result of each step. To make things complete, we kick off the pipeline with blob storage too – because all the receipt PDFs are saved in a blob container. When a new PDF is uploaded to that container, the pipeline kicks into action and processes the PDF.
The little illustration shall give a good overview of what we have done here.
But to be on the safe side let’s look at the workflow in detail so that we don’t miss anything essential.
– Suddenly! A wild document appears!
– The document is saved in a blob storage, let’s say in folder “documents”.
– The first Azure Function (called PdfToPng, written in C#) starts running. It is blob storage triggered and converts the PDF to a PNG, saving it in the blob storage folder “images”. This function depends on GhostScript to do the conversion. Have a look at this post for a step-by-step process on how to set up the dependencies.
– The creation of this PNG file triggers the next Azure Function (called PngToOcr, written in C#). This function takes the image through the OCR service to save the JSON result in the blob storage folder “json” (a minimal sketch of this step follows the list).
– By now it should be no surprise that another Azure Function is then triggered (called OcrToFlat, written in Node.js). It takes the JSON data and just creates a single string out of all the text within one detected OCR region to save it in the blob storage folder “flatjson”.
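To give an impression of what such a function looks like, here is a minimal sketch of the PngToOcr step, assuming the C# blob bindings of Azure Functions; the OCR endpoint region, container paths and key handling are placeholders you would adapt to your own setup.

using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class PngToOcr
{
    private static readonly HttpClient http = new HttpClient();

    [FunctionName("PngToOcr")]
    public static async Task Run(
        [BlobTrigger("images/{name}.png")] Stream image,            // fires when a PNG lands in "images"
        [Blob("json/{name}.json", FileAccess.Write)] Stream output, // result goes to the "json" folder
        string name,
        ILogger log)
    {
        // Placeholder endpoint and key -- substitute your own region and subscription key.
        var request = new HttpRequestMessage(HttpMethod.Post,
            "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/ocr");
        request.Headers.Add("Ocp-Apim-Subscription-Key", "<your-key>");
        request.Content = new StreamContent(image);
        request.Content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");

        // The OCR service answers with the region-line-word JSON described above.
        HttpResponseMessage response = await http.SendAsync(request);
        response.EnsureSuccessStatusCode();
        await (await response.Content.ReadAsStreamAsync()).CopyToAsync(output);

        log.LogInformation($"OCR result for {name}.png written to json/{name}.json");
    }
}

The blob trigger and output binding do all the plumbing: we never poll for new files or write storage code ourselves.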
Done! And the incredible thing is – Azure Functions run “serverless”, which means we had no overhead work setting them up. All we did was “just” put our code in them and define the trigger and where the output should go. We even mixed and matched programming languages, using whatever suited each step best and/or was the preference of the developer. Our chain was set up! Another big advantage is that Functions run in parallel – or, better said, “scale out” – which saves us a lot of time. The chain is triggered individually for each document that is put into the “documents” folder, so there are no actual waiting times for documents to get processed.
In the illustration there is one additional component which has not been mentioned yet: LUIS. LUIS is the Language Understanding service from the same toolbox we’re already using for the Computer Vision OCR API. But unlike the OCR API, LUIS does not detect text in images; it detects intent and meaning in text. This is great for our data set, because with LUIS we can find out what a flat string – which we produce in the last step of our chain – is all about. So in addition to plain keyword matching, we can use this smart service to get higher hit rates on finding the info we want from our receipts.
All of the code is available in this GitHub repo.
You must admit – it’s a little bit awesome 🙂 And technology just rocks 😀 What do you think?
Power BI provides great time intelligence features to calculate Year-to-date (YTD), Month-to-date (MTD) and Quarter-to-date (QTD) totals. There is no such thing as Week-to-date (WTD) or Period-to-date (PTD), where a period can be any arbitrary period definition (I used two-month periods in my example below). If you want those, you will have to create the calculations yourself. I was inspired by this excellent blog post and created a Power BI file with the ultimate set of time-intelligence calculations. I used Matt Masson’s excellent date dimension generation script to generate the date table for my example.
TotalAmount = SUM(Sales[Amount])
TotalQuantity = SUM(Sales[Quantity])
Amount_LastDay = CALCULATE([TotalAmount],PREVIOUSDAY('Date'[Date]))
Amount_SameDayLastYear = CALCULATE([TotalAmount],SAMEPERIODLASTYEAR('Date'[Date]))
Amount_DOD_Variance = [TotalAmount]-[Amount_LastDay]
Amount_DOD_Variance% = DIVIDE([Amount_DOD_Variance],[Amount_LastDay])
Amount_YOY_Variance = [TotalAmount]-[Amount_SameDayLastYear]
Amount_YOY_Variance% = DIVIDE([Amount_YOY_Variance],[Amount_SameDayLastYear])
Quantity_LastDay = CALCULATE([TotalQuantity],PREVIOUSDAY('Date'[Date]))
Quantity_SameDayLastYear = CALCULATE([TotalQuantity],SAMEPERIODLASTYEAR('Date'[Date]))
Quantity_DOD_Variance = [TotalQuantity]-[Quantity_LastDay]
Quantity_DOD_Variance% = DIVIDE([Quantity_DOD_Variance],[Quantity_LastDay])
Quantity_YOY_Variance = [TotalQuantity]-[Quantity_SameDayLastYear]
Quantity_YOY_Variance% = DIVIDE([Quantity_YOY_Variance],[Quantity_SameDayLastYear])
Amount_WTD =
IF (
    HASONEVALUE ( 'Date'[Year] ) && HASONEVALUE ( 'Date'[WeekNumber] ),
    CALCULATE (
        [TotalAmount],
        FILTER (
            ALL ( 'Date' ),
            'Date'[Year] = VALUES ( 'Date'[Year] )
                && 'Date'[WeekNumber] = VALUES ( 'Date'[WeekNumber] )
                && 'Date'[Date] <= MAX ( 'Date'[Date] )
        )
    )
)
Amount_LastWeek =
SUMX (
    FILTER (
        ALL ( 'Date' ),
        IF (
            SELECTEDVALUE ( 'Date'[WeekNumber] ) = 1,
            'Date'[WeekNumber] = CALCULATE ( MAX ( 'Date'[WeekNumber] ), ALL ( 'Date' ) )
                && 'Date'[Year] = FORMAT ( VALUE ( SELECTEDVALUE ( 'Date'[Year] ) ) - 1, "" ),
            'Date'[WeekNumber] = SELECTEDVALUE ( 'Date'[WeekNumber] ) - 1
                && 'Date'[Year] = FORMAT ( VALUE ( SELECTEDVALUE ( 'Date'[Year] ) ), "" )
        )
    ),
    [TotalAmount]
)
Amount_SameWeekLastYear =
IF (
    HASONEVALUE ( 'Date'[Year] ) && HASONEVALUE ( 'Date'[WeekNumber] ),
    CALCULATE (
        SUM ( Sales[Amount] ),
        FILTER (
            ALL ( 'Date' ),
            'Date'[Year] = FORMAT ( VALUES ( 'Date'[Year] ) - 1, "" )
                && 'Date'[WeekNumber] = VALUES ( 'Date'[WeekNumber] )
                && 'Date'[Date] <= MAX ( 'Date'[Date] )
        )
    )
)
Amount_WTD_WOW_Variance = [Amount_WTD]-[Amount_LastWeek]
Amount_WTD_WOW_Variance% = DIVIDE([Amount_WTD_WOW_Variance],[Amount_LastWeek])
Amount_WTD_YOY_Variance = [Amount_WTD]-[Amount_SameWeekLastYear]
Amount_WTD_YOY_Variance% = DIVIDE([Amount_WTD_YOY_Variance],[Amount_SameWeekLastYear])
Quantity_WTD =
IF (
    HASONEVALUE ( 'Date'[Year] ) && HASONEVALUE ( 'Date'[WeekNumber] ),
    CALCULATE (
        [TotalQuantity],
        FILTER (
            ALL ( 'Date' ),
            'Date'[Year] = VALUES ( 'Date'[Year] )
                && 'Date'[WeekNumber] = VALUES ( 'Date'[WeekNumber] )
                && 'Date'[Date] <= MAX ( 'Date'[Date] )
        )
    ),
    BLANK ()
)
Quantity_SameWeekLastYear =
IF (
    HASONEVALUE ( 'Date'[Year] ) && HASONEVALUE ( 'Date'[WeekNumber] ),
    CALCULATE (
        SUM ( Sales[Quantity] ),
        FILTER (
            ALL ( 'Date' ),
            'Date'[Year] = FORMAT ( VALUES ( 'Date'[Year] ) - 1, "" )
                && 'Date'[WeekNumber] = VALUES ( 'Date'[WeekNumber] )
                && 'Date'[Date] <= MAX ( 'Date'[Date] )
        )
    )
)
Quantity_LastWeek =
SUMX (
    FILTER (
        ALL ( 'Date' ),
        IF (
            SELECTEDVALUE ( 'Date'[WeekNumber] ) = 1,
            'Date'[WeekNumber] = CALCULATE ( MAX ( 'Date'[WeekNumber] ), ALL ( 'Date' ) )
                && 'Date'[Year] = FORMAT ( VALUE ( SELECTEDVALUE ( 'Date'[Year] ) ) - 1, "" ),
            'Date'[WeekNumber] = SELECTEDVALUE ( 'Date'[WeekNumber] ) - 1
                && 'Date'[Year] = FORMAT ( VALUE ( SELECTEDVALUE ( 'Date'[Year] ) ), "" )
        )
    ),
    [TotalQuantity]
)
Quantity_WTD_WOW_Variance = [Quantity_WTD]-[Quantity_LastWeek]
Quantity_WTD_WOW_Variance% = DIVIDE([Quantity_WTD_WOW_Variance],[Quantity_LastWeek])
Quantity_WTD_YOY_Variance = [Quantity_WTD]-[Quantity_SameWeekLastYear]
Quantity_WTD_YOY_Variance% = DIVIDE([Quantity_WTD_YOY_Variance],[Quantity_SameWeekLastYear])
Amount_MTD = TOTALMTD([TotalAmount],'Date'[Date])
Amount_SameMonthLastYear = CALCULATE([Amount_MTD],SAMEPERIODLASTYEAR('Date'[Date]))
Amount_LastMonth = CALCULATE([TotalAmount],PREVIOUSMONTH('Date'[Date]))
Amount_MTD_MOM_Variance = [Amount_MTD]-[Amount_LastMonth]
Amount_MTD_MOM_Variance% = DIVIDE([Amount_MTD_MOM_Variance],[Amount_LastMonth])
Amount_MTD_YOY_Variance = [Amount_MTD]-[Amount_SameMonthLastYear]
Amount_MTD_YOY_Variance% = DIVIDE([Amount_MTD_YOY_Variance],[Amount_SameMonthLastYear])
Quantity_MTD = TOTALMTD([TotalQuantity],'Date'[Date])
Quantity_SameMonthLastYear = CALCULATE([Quantity_MTD],SAMEPERIODLASTYEAR('Date'[Date]))
Quantity_LastMonth = CALCULATE([TotalQuantity],PREVIOUSMONTH('Date'[Date]))
Quantity_MTD_MOM_Variance = [Quantity_MTD]-[Quantity_LastMonth]
Quantity_MTD_MOM_Variance% = DIVIDE([Quantity_MTD_MOM_Variance],[Quantity_LastMonth])
Quantity_MTD_YOY_Variance = [Quantity_MTD]-[Quantity_SameMonthLastYear]
Quantity_MTD_YOY_Variance% = DIVIDE([Quantity_MTD_YOY_Variance],[Quantity_SameMonthLastYear])
Amount_PTD =
IF (
    HASONEVALUE ( 'Date'[Year] ) && HASONEVALUE ( 'Date'[TwoMonthPeriod] ),
    CALCULATE (
        [TotalAmount],
        FILTER (
            ALL ( 'Date' ),
            'Date'[Year] = VALUES ( 'Date'[Year] )
                && 'Date'[TwoMonthPeriod] = VALUES ( 'Date'[TwoMonthPeriod] )
                && 'Date'[Date] <= MAX ( 'Date'[Date] )
        )
    ),
    BLANK ()
)
Amount_SamePeriodLastYear = CALCULATE([Amount_PTD],SAMEPERIODLASTYEAR('Date'[Date]))
Amount_LastPeriod = CALCULATE([TotalAmount],DATEADD('Date'[Date],-2,MONTH))
Amount_PTD_POP_Variance = [Amount_PTD]-[Amount_LastPeriod]
Amount_PTD_POP_Variance% = DIVIDE([Amount_PTD_POP_Variance],[Amount_LastPeriod])
Amount_PTD_YOY_Variance = [Amount_PTD]-[Amount_SamePeriodLastYear]
Amount_PTD_YOY_Variance% = DIVIDE([Amount_PTD_YOY_Variance],[Amount_SamePeriodLastYear])
Quantity_PTD =
IF (
    HASONEVALUE ( 'Date'[Year] ) && HASONEVALUE ( 'Date'[TwoMonthPeriod] ),
    CALCULATE (
        [TotalQuantity],
        FILTER (
            ALL ( 'Date' ),
            'Date'[Year] = VALUES ( 'Date'[Year] )
                && 'Date'[TwoMonthPeriod] = VALUES ( 'Date'[TwoMonthPeriod] )
                && 'Date'[Date] <= MAX ( 'Date'[Date] )
        )
    ),
    BLANK ()
)
Quantity_SamePeriodLastYear = CALCULATE([Quantity_PTD],SAMEPERIODLASTYEAR('Date'[Date]))
Quantity_LastPeriod = CALCULATE([TotalQuantity],DATEADD('Date'[Date],-2,MONTH))
Quantity_PTD_POP_Variance = [Quantity_PTD]-[Quantity_LastPeriod]
Quantity_PTD_POP_Variance% = DIVIDE([Quantity_PTD_POP_Variance],[Quantity_LastPeriod])
Quantity_PTD_YOY_Variance = [Quantity_PTD]-[Quantity_SamePeriodLastYear]
Quantity_PTD_YOY_Variance% = DIVIDE([Quantity_PTD_YOY_Variance],[Quantity_SamePeriodLastYear])
Amount_QTD = TOTALQTD([TotalAmount],'Date'[Date])
Amount_SameQuarterLastYear = CALCULATE([Amount_QTD],SAMEPERIODLASTYEAR('Date'[Date]))
Amount_LastQuarter = CALCULATE([TotalAmount],PREVIOUSQUARTER('Date'[Date]))
Amount_QTD_QOQ_Variance = [Amount_QTD]-[Amount_LastQuarter]
Amount_QTD_QOQ_Variance% = DIVIDE([Amount_QTD_QOQ_Variance],[Amount_LastQuarter])
Amount_QTD_YOY_Variance = [Amount_QTD]-[Amount_SameQuarterLastYear]
Amount_QTD_YOY_Variance% = DIVIDE([Amount_QTD_YOY_Variance],[Amount_SameQuarterLastYear])
Quantity_QTD = TOTALQTD([TotalQuantity],'Date'[Date])
Quantity_SameQuarterLastYear = CALCULATE([Quantity_QTD],SAMEPERIODLASTYEAR('Date'[Date]))
Quantity_LastQuarter = CALCULATE([TotalQuantity],PREVIOUSQUARTER('Date'[Date]))
Quantity_QTD_QOQ_Variance = [Quantity_QTD]-[Quantity_LastQuarter]
Quantity_QTD_QOQ_Variance% = DIVIDE([Quantity_QTD_QOQ_Variance],[Quantity_LastQuarter])
Quantity_QTD_YOY_Variance = [Quantity_QTD]-[Quantity_SameQuarterLastYear]
Quantity_QTD_YOY_Variance% = DIVIDE([Quantity_QTD_YOY_Variance],[Quantity_SameQuarterLastYear])
Amount_YTD = TOTALYTD([TotalAmount],'Date'[Date])
Amount_LastYear = CALCULATE([Amount_YTD],PREVIOUSYEAR('Date'[Date]))
Amount_YTD_YOY_Variance = [Amount_YTD]-[Amount_LastYear]
Amount_YTD_YOY_Variance% = DIVIDE([Amount_YTD_YOY_Variance],[Amount_LastYear])
Quantity_YTD = TOTALYTD([TotalQuantity],'Date'[Date])
Quantity_LastYear = CALCULATE([Quantity_YTD],PREVIOUSYEAR('Date'[Date]))
Quantity_YTD_YOY_Variance = [Quantity_YTD]-[Quantity_LastYear]
Quantity_YTD_YOY_Variance% = DIVIDE([Quantity_YTD_YOY_Variance],[Quantity_LastYear])
I have written before on the subject of Power BI’s mapping capabilities (see the post on ArcGIS) and using the maps with Dutch geographical data (see using Dutch postal codes in Power Map). However, I had never done a more thorough exploration of the mapping capabilities of Power BI using Dutch data. Until now! (Niels, thanks for asking me to do this.)
Please see the Power BI report below – the first tab shows the dataset I used, and the other tabs show the ways to work with the various maps to plot Dutch data. Unfortunately the ArcGIS visual is not supported in the embedded mode I use here – please download the original Power BI Desktop file below. Enjoy!