While looking at a (relatively small, at 1.7 million records) big data example of New York Yellow Cab taxi trips, I have come to the conclusion that, if like us you are using Microsoft tools, the best place for initial analysis, including the all-important first step of finding outliers and errors, is Azure Machine Learning Studio (Azure ML), rather than Excel, Power BI or bespoke analysis using e.g. Kendo UI.
Why Azure ML for initial analysis?
- It loads data quite quickly (e.g. just over a minute to import almost 2 million records from an Azure SQL database). This is currently much quicker than Power BI.
- It automatically produces histograms and box plots of numeric fields (see the images below, and above, where the field FareAmount has been selected). We can tell immediately from the box plot that there are several outliers, and in fact probable errors that will need to be either corrected or removed, since FareAmount should never be negative!
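The checks that the Azure ML box plot makes visible can also be done numerically. Below is a minimal sketch in Python using pandas: the sample values are made up, and the column name FareAmount is taken from the field described above. Negative fares are certainly errors, while values far outside the interquartile range (the standard box-plot whisker rule) are outlier candidates.

```python
import pandas as pd

# Hypothetical sample of taxi fares; the real data would come from the
# Azure SQL database mentioned above (values here are illustrative only).
fares = pd.Series([6.5, 8.0, 12.5, 9.0, 7.5, -4.0, 1000.0, 10.0],
                  name="FareAmount")

# Fares below zero are definite errors to correct or remove.
negative = fares[fares < 0]

# The 1.5 * IQR rule that box-plot whiskers use, applied directly.
q1, q3 = fares.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = fares[(fares < q1 - 1.5 * iqr) | (fares > q3 + 1.5 * iqr)]

print(negative.tolist())
print(outliers.tolist())
```

Here both the negative fare and the suspiciously large one fall outside the whiskers, which is exactly what the box plot shows at a glance.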
The above screenshot shows an initial analysis (in Microsoft Power BI) of 1,723,099 records of New York taxi trip records uploaded to the cloud. The top chart shows a scatter plot of Trip Distance in miles against the Total Fare Amount (in US $). This useful chart shows straightaway that there are some outliers in the data (e.g. some trips cost over $1,000 despite covering only short distances). These records are almost certainly errors (where e.g. the fare was entered with the decimal point in the wrong place, say $1000.00 instead of $10.00) and should be corrected or removed. Similar errors in the Trip Distance field had already been removed: 2 records had implausible distance values (e.g. 300,833 miles for a total fare of $14.16, and 1,666 miles for a total fare of $10.30).
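The two error patterns described above (decimal-point fare errors and implausible distances) translate into simple filters. A sketch in Python/pandas, where the column names TripDistance and TotalAmount are assumptions based on the fields mentioned, the rows mimic the error types found in the real data, and the cutoffs are illustrative choices rather than the ones actually used:

```python
import pandas as pd

# Hypothetical trip records; the last three rows imitate the errors
# described above (implausible distances, misplaced decimal point).
trips = pd.DataFrame({
    "TripDistance": [1.2, 2.5, 300833.0, 1666.0, 0.8],
    "TotalAmount":  [7.5, 12.0, 14.16, 10.30, 1050.00],
})

# Short trips with huge fares, the pattern the scatter plot reveals;
# $100 per mile is an illustrative cutoff.
decimal_errors = trips[trips["TotalAmount"] / trips["TripDistance"] > 100]

# Drop implausible distances, mirroring the clean-up already applied to
# the Trip Distance field; 200 miles is again an illustrative cutoff.
cleaned = trips[trips["TripDistance"] < 200]

print(len(decimal_errors), len(cleaned))
```

The scatter plot remains the quickest way to spot these patterns; the filters are just how you act on what it shows.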
In order to analyse big data, it often needs to be moved from its original sources (e.g. separate csv or txt files, or a stream) to somewhere where it can be collated and processed (e.g. an online database, Microsoft Power BI, or an XDF (external data frame) file that can be analysed by Microsoft R Server).
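A minimal sketch of that collation step in Python: several csv sources are streamed chunk by chunk into one queryable database. In-memory csv text stands in for the real files, and a local SQLite database stands in for an online database such as Azure SQL; all names here are illustrative.

```python
import io
import sqlite3

import pandas as pd

# In-memory csv text standing in for separate source files.
sources = [
    io.StringIO("TripDistance,TotalAmount\n1.2,7.5\n2.5,12.0\n"),
    io.StringIO("TripDistance,TotalAmount\n0.8,6.0\n"),
]

conn = sqlite3.connect(":memory:")  # a real target would be a server database
for src in sources:
    # Chunked reads keep memory flat when the real files are large.
    for chunk in pd.read_csv(src, chunksize=1000):
        chunk.to_sql("taxi_trips", conn, if_exists="append", index=False)

total = conn.execute("SELECT COUNT(*) FROM taxi_trips").fetchone()[0]
print(total)
conn.close()
```

Once the data is collated like this, the outlier checks described earlier can be run with ordinary SQL or loaded into whichever analysis tool you prefer.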
As a “proper” programmer, used to programming in heavy-duty, compiled languages like C# (and before that C++ and C), my reaction on discovering during my Data Science journey that R and Python are heavily used by data scientists was: why??
Why would anyone use an interpreted language, which is therefore bound to be slower, and why would anyone go to the trouble of using yet another language when there are perfectly good compiled languages around like C#, F# and VB.net?
The answer seems to be partly that R and Python are free (open source), and partly that they have excellent visualisation tools, which the other languages currently lack.
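To illustrate that last point: in Python a chart like the box plots discussed earlier takes only a few lines (matplotlib here; ggplot2 plays the same role in R). The sample values and file name are made up for the sketch.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so this also runs headless
import matplotlib.pyplot as plt

fares = [6.5, 8.0, 12.5, 9.0, 7.5, 10.0, 11.0]  # made-up sample values

fig, ax = plt.subplots()
ax.boxplot(fares)                 # whiskers and outlier fliers for free
ax.set_ylabel("FareAmount ($)")
fig.savefig("fares_boxplot.png")
```

Doing the equivalent in C# means pulling in a third-party charting library and considerably more ceremony, which is exactly the gap the question above is getting at.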
I mentioned a couple of days ago (here) that I had completed the 10 courses required for the Microsoft Professional Program for Data Science. I was delighted to receive confirmation earlier today from Microsoft via a nice certificate (see pic above), or you can view it here.
I am delighted to have now completed the Microsoft Professional Program for Data Science. It comprised 10 online courses (taking a total of 322 hours) over a period of just over 11 months, and my average (mean) mark over the 10 courses was 96.6%. The final course was a capstone project which involved analysing data from the 2015 earthquake in Nepal, building a model to predict the degree of damage to buildings (to help, amongst other things, emergency response teams prioritise rescue efforts) and producing a report on this. This was an extremely practical way to complete the course.
I have created a series of slides (collected together in a Microsoft Sway online document) showing the main stages of my journey. You can see them at https://sway.com/lsUjwGITuGFpsHIM?ref=Link