Uploading Big Data: it’s very different from normal data

The above screenshot shows an initial analysis (in Microsoft Power BI) of 1,723,099 records of New York taxi trip records uploaded to the cloud.  The top chart shows a scatter plot of Trip Distance in miles against the Total Fare Amount (in US $).  This useful chart shows straightaway that there are some outliers in the data (e.g. some trips cost over $1,000 despite being only for short distances).  These records are almost certainly errors (where e.g. the fare was entered with the decimal point in the wrong place, e.g. $1000.00 instead of $10.00) and should be corrected or removed. Similar errors in the Trip Distance fields had already been removed in that 2 records had implausible distance values (e.g. 300,833 miles for a total fare of $14.16, and 1,666 miles for a total fare of $10.30).

In order to analyse big data, it often needs to be moved from its original sources (e.g. separate csv or txt files, or a stream) to somewhere where it can be collated and processed (e.g. an online database, or Microsoft PowerBI, or an xdf, extensible data format, file that can be analysed by Microsoft R Server).