While looking at a (relatively small, 1.7 million records) big data example of New York Yellow Cab taxi trips, I am coming to the conclusion that the best place (if as we do you are using Microsoft tools) for initial analysis, including the all important first step of finding outliers/errors, is Azure Machine Learning Studio (Azure ML, as opposed to Excel, Power BI or bespoke analysis using e.g. Kendo UI).
Why Azure ML for initial analysis?
- It loads data quite quickly (e.g. just over a minute to import almost 2 million records from an Azure SQL database). This is currently much quicker than Power BI.
- It automatically produces histograms and box plots of numeric fields (see the images below, and above, where the field FareAmount has been selected). We can tell immediately from the box plot that there are several outliers (and in fact probable errors that will need to be either corrected or removed, in that FareAmount should not have negative values!).
Kendo UI is useful for producing powerful filterable/sortable grids such as the one shown below:
and also useful charts, e.g. a scatter plot of Total (Fare) Amount by Trip Distance:
However, it takes time and effort to set up such grids and charts, a bespoke webpage or site with the Kendo UI components installed is needed, and using more than a few thousand points in the charts can be very slow.
Start in Azure ML, move to Power BI for reporting
So while the Kendo UI components or Power BI are useful in my view for doing final, interactive reporting in a dashboard to senior management, my preference at the moment is to load the data in Azure ML. I get box plots and histograms without having to do any work, and when I want to look at scatter plots and faceted histograms, I can quickly produce these from Azure ML by running Python or R scripts.