Azure Machine Learning Studio: the best place for initial data analysis?

While looking at a big data example of New York Yellow Cab taxi trips (relatively small, at 1.7 million records), I am coming to the conclusion that, if like us you use Microsoft tools, the best place for initial analysis, including the all-important first step of finding outliers and errors, is Azure Machine Learning Studio (Azure ML), rather than Excel, Power BI or bespoke analysis using e.g. Kendo UI.

Why Azure ML for initial analysis?

  1. It loads data quite quickly (e.g. just over a minute to import almost 2 million records from an Azure SQL database). This is currently much quicker than Power BI.
  2. It automatically produces histograms and box plots of numeric fields (see the images below, and above, where the field FareAmount has been selected). We can tell immediately from the box plot that there are several outliers, and in fact probable errors that will need to be either corrected or removed, since FareAmount should not have negative values.
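
For readers who want to reproduce the box plot check outside Azure ML, here is a minimal sketch in Python. It assumes the common 1.5 x IQR "whisker" rule for flagging outliers, which may not be exactly what Azure ML uses to draw its plots. The function name and the sample fares are made up for illustration and are not taken from the taxi data set.

```python
import statistics


def box_plot_outliers(values):
    """Return (lower_fence, upper_fence, outliers) using the 1.5 * IQR rule."""
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Hypothetical FareAmount values, including one negative and one very large fare
fares = [5.5, 7.0, 8.5, 9.0, 10.5, 12.0, 13.5, 15.0, -3.0, 250.0]

lower, upper, outliers = box_plot_outliers(fares)

# A separate domain check: a fare should never be negative,
# whether or not the statistical rule happens to flag it
negative = [v for v in fares if v < 0]

print("fences:", lower, upper)
print("statistical outliers:", outliers)
print("negative fares (probable errors):", negative)
```

The two checks are kept separate on purpose: a statistical rule finds unusual values, while a domain rule (no negative fares) finds values that are simply wrong.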

Creating a corporate dashboard (using OData and Microsoft PowerBI)

Building a corporate dashboard so that you have key management information at a glance

(This article was first posted on 17 June 2017 on a different blog site, and was migrated here on 05 Feb 2018.)

I have recently been building some corporate dashboards, as recommended by Daniel Priestley in his best-selling book “24 Assets: Create a digital, scalable, valuable and fun business that will thrive in a fast changing world”. From chapter 15 of the book:

A key asset is a dashboard that allows the team to see how the business is performing. Carefully select some of the metrics that drive performance and make sure they show up prominently on your dashboard. You might select metrics like cash at bank, payments collected, expected invoices, revenue per employee or monthly users; the general rule is that whatever you measure will improve.

Accessing your valuable and key data

Dashboards need data, and this data will almost certainly need to come from a variety of sources in your organisation. There are lots of different ways of exposing your data sources so that the key information can be pulled into your dashboard. I reviewed several options, including direct connections to databases, WebApi or MVC from websites, and OData. My conclusion was that OData seemed the best current approach. Your data is valuable, so whatever method you use needs to be secure (i.e. with access protected via encryption and passwords). You can do this with OData, and with the other methods I have mentioned too. (Contact us if you need help with this.)
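
To give a feel for what pulling data over OData looks like, here is a small Python sketch that builds an OData query URL using the standard system query options `$select`, `$filter` and `$top`. The service address, the `Invoices` entity set and the field names are all hypothetical, and the helper `build_odata_url` is my own illustration rather than part of any library. It refuses anything other than HTTPS, as a simple stand-in for the encryption requirement; password protection would be added on top when the request is actually sent.

```python
from urllib.parse import quote, urlencode


def build_odata_url(service_root, entity_set, select=None, filter_expr=None, top=None):
    """Build an OData query URL from a few standard system query options."""
    # Insist on HTTPS so the data is encrypted in transit
    if not service_root.startswith("https://"):
        raise ValueError("service_root must use https://")

    options = {}
    if select:
        options["$select"] = ",".join(select)   # which fields to return
    if filter_expr:
        options["$filter"] = filter_expr        # which rows to return
    if top is not None:
        options["$top"] = str(top)              # maximum number of rows

    url = f"{service_root.rstrip('/')}/{entity_set}"
    if options:
        # Keep "$" and "," readable; encode spaces as %20
        url += "?" + urlencode(options, quote_via=quote, safe="$,")
    return url


# Hypothetical service and fields, for illustration only
url = build_odata_url(
    "https://example.com/odata",
    "Invoices",
    select=["InvoiceDate", "Amount"],
    filter_expr="Amount gt 1000",
    top=10,
)
print(url)
```

A dashboard tool such as Power BI builds this kind of URL for you once you point it at the OData feed, but seeing the query spelled out makes it easier to test the feed in a browser first.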