The above screenshot shows an initial analysis (in Microsoft Power BI) of 1,723,099 New York taxi trip records uploaded to the cloud. The top chart is a scatter plot of Trip Distance (in miles) against Total Fare Amount (in US $). This useful chart shows straight away that there are some outliers in the data (e.g. some trips cost over $1,000 despite covering only short distances). These records are almost certainly errors (where, for example, the fare was entered with the decimal point in the wrong place, e.g. $1000.00 instead of $10.00) and should be corrected or removed. Similar errors in the Trip Distance field had already been removed: two records had implausible distance values (300,833 miles for a total fare of $14.16, and 1,666 miles for a total fare of $10.30).
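As an illustration only (and not part of the original analysis), assuming the records had been parsed into a simple in-memory class, implausible records like these could be filtered out with a few lines of C#. The class and field names below are placeholders, and the plausibility thresholds would in practice be chosen after inspecting the scatter plot:

using System.Collections.Generic;
using System.Linq;

// Hypothetical record type - the field names are assumptions, not the actual schema.
public class TaxiTrip
{
    public double TripDistance { get; set; }   // miles
    public double TotalAmount { get; set; }    // US $
}

public static class TripCleaner
{
    // Keep only trips whose distance and fare fall within crude plausibility bounds.
    // The thresholds are illustrative; in practice choose them after inspecting the data.
    public static IEnumerable<TaxiTrip> RemoveImplausibleTrips(IEnumerable<TaxiTrip> trips)
    {
        return trips.Where(t =>
            t.TripDistance >= 0 && t.TripDistance <= 200 &&   // rules out the 300,833 mile record
            t.TotalAmount >= 0 && t.TotalAmount <= 500);      // rules out $1,000+ fares for short trips
    }
}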
In order to analyse big data, it often needs to be moved from its original sources (e.g. separate csv or txt files, or a stream) to somewhere it can be collated and processed (e.g. an online database, Microsoft Power BI, or an xdf (eXternal Data Frame) file that can be analysed by Microsoft R Server).
Smaller datasets can often just be copied and pasted, or even read directly from, csv or txt files. But when the number of records runs to hundreds of thousands or more, this becomes slow and error prone. For example, what if the internet connection breaks partway through the upload? How many records were uploaded, and whether they were uploaded correctly, is then unpredictable and time consuming to check.
Using the Bulk Copy Utility (bcp)
So other techniques are often needed, e.g. using the bulk copy utility (bcp) to upload the data in batches (of say 10,000 to 50,000 records at a time) to an Azure SQL Database. (See https://docs.microsoft.com/en-us/sql/tools/bcp-utility?view=sql-server-2017).
An example of such a command that could be run in a command prompt such as the one above (after the bcp utility has been installed) is
bcp MyDatabase.dbo.MyTable in D:\Dropbox\ExampleBigDataFilesForUpload\yellow_tripsample_2016-01.txt -c -b 10000 -S servernamehere.database.windows.net -U usernamehere -P passwordhere -h "TABLOCK"
This command uploads the contents of the file yellow_tripsample_2016-01.txt (which needs to be a tab separated text file whose records have their fields in the order and format that the database table expects, and no first row of field headings) and adds them, in batches of 10,000 records, to table MyTable in database MyDatabase on the specified Azure SQL server. Normally this appends the records to the table, so you can upload the contents of several files in turn. This will be necessary if, as in this case, the original data files came from Excel comma separated values files (an Excel worksheet has a maximum of 1,048,576 rows).
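For example (this is an illustrative sketch only, not code from the actual upload; the file names, server, database and credentials are the same placeholders as in the bcp command above), a small C# program could run the same bcp command once per monthly file so that each file's records are appended in turn:

using System;
using System.Diagnostics;

class BulkUploadAllFiles
{
    static void Main()
    {
        // One entry per monthly extract to be appended to the same table.
        string[] files =
        {
            @"D:\Dropbox\ExampleBigDataFilesForUpload\yellow_tripsample_2016-01.txt",
            @"D:\Dropbox\ExampleBigDataFilesForUpload\yellow_tripsample_2016-02.txt"
        };

        foreach (string file in files)
        {
            var startInfo = new ProcessStartInfo
            {
                FileName = "bcp",
                Arguments = $"MyDatabase.dbo.MyTable in \"{file}\" -c -b 10000 " +
                            "-S servernamehere.database.windows.net " +
                            "-U usernamehere -P passwordhere -h \"TABLOCK\"",
                UseShellExecute = false
            };

            using (var process = Process.Start(startInfo))
            {
                process.WaitForExit();
                Console.WriteLine($"{file}: bcp exited with code {process.ExitCode}");
            }
        }
    }
}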
Using the bcp utility, I was able to transfer just over 1.7 million records of the publicly available New York Yellow Cab taxi trip data (see http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml) at a rate of about 500 records per second. The whole process took about an hour. (I hope to discover techniques to speed this up further, perhaps during the Microsoft Professional Program for Big Data, which I am currently about 40% of the way through. The Data Science program did not cover this area, perhaps unsurprisingly, as it relates to Big Data rather than data science/data analytics.)
Here is a screenshot of one of the tab delimited text files that I used to upload the data (note that Excel can save a csv file as a tab delimited text file):
Note that I had to add a key field (which I called RecordID, the leftmost field) to the raw taxi data records, as Azure SQL database tables require such an index field.
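For completeness, here is a minimal sketch (not the actual preparation code) of how such a RecordID column could be prepended while converting a raw comma separated file to the tab delimited layout that bcp expects. The file paths are placeholders, and the naive Split(',') assumes there are no quoted fields containing commas:

using System;
using System.IO;

class AddRecordIdColumn
{
    static void Main()
    {
        string inputPath = @"D:\Data\yellow_tripsample_2016-01.csv";   // placeholder paths
        string outputPath = @"D:\Data\yellow_tripsample_2016-01.txt";

        long recordId = 1;
        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(outputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // RecordID first, then the original fields separated by tabs.
                string[] fields = line.Split(',');
                writer.WriteLine(recordId + "\t" + string.Join("\t", fields));
                recordId++;
            }
        }

        Console.WriteLine($"Wrote {recordId - 1} records to {outputPath}");
    }
}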
Alternatively, write custom code and upload the data programmatically in batches
Or this can be done programmatically via C# code, and that is the route we adopt in some parts of Pensions Concerto, our online pensions valuation system.
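As an outline of the sort of approach involved (this is a simplified sketch using assumed names, not the actual Pensions Concerto code), the .NET SqlBulkCopy class can stream rows to an Azure SQL table in batches, much as bcp does:

using System.Data;
using System.Data.SqlClient;

class ProgrammaticBulkUpload
{
    static void Main()
    {
        // Placeholder connection string, table and column names.
        string connectionString =
            "Server=tcp:servernamehere.database.windows.net;Database=MyDatabase;" +
            "User ID=usernamehere;Password=passwordhere;Encrypt=True;";

        // In practice this DataTable would be filled from the source files or stream.
        var table = new DataTable();
        table.Columns.Add("RecordID", typeof(long));
        table.Columns.Add("TripDistance", typeof(double));
        table.Columns.Add("TotalAmount", typeof(double));
        table.Rows.Add(1L, 2.5, 14.16);

        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var bulkCopy = new SqlBulkCopy(connection))
            {
                bulkCopy.DestinationTableName = "dbo.MyTable";
                bulkCopy.BatchSize = 10000;       // commit in batches, as with bcp -b
                bulkCopy.BulkCopyTimeout = 600;   // seconds; large uploads can take a while
                bulkCopy.WriteToServer(table);
            }
        }
    }
}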
Alternative places to store the data
Note that the above example shows uploading to an Azure SQL database table. Many alternatives are possible: just within the Microsoft technologies arena, the data could be uploaded to Azure storage tables, to an Azure DocumentDB database, to Azure Machine Learning Studio, or to several other places.
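For instance, a hedged sketch of batch-inserting records into an Azure storage table using the classic WindowsAzure.Storage library might look like the following (the entity shape, table name and connection string are placeholders; Azure table batches are limited to 100 entities sharing the same partition key):

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

// Hypothetical entity - the property names are illustrative only.
public class TripEntity : TableEntity
{
    public TripEntity() { }
    public TripEntity(string partitionKey, string rowKey)
    {
        PartitionKey = partitionKey;
        RowKey = rowKey;
    }
    public double TripDistance { get; set; }
    public double TotalAmount { get; set; }
}

class TableStorageUpload
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("storage connection string here");
        var client = account.CreateCloudTableClient();
        var table = client.GetTableReference("TaxiTrips");
        table.CreateIfNotExists();

        // Up to 100 inserts per batch, all with the same partition key (here the trip month).
        var batch = new TableBatchOperation();
        batch.Insert(new TripEntity("2016-01", "1") { TripDistance = 2.5, TotalAmount = 14.16 });
        batch.Insert(new TripEntity("2016-01", "2") { TripDistance = 1.1, TotalAmount = 10.30 });
        table.ExecuteBatch(batch);
    }
}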
Be prepared to use different techniques
The point of this article is that techniques which work fine with a few tens or hundreds of thousands of records seem to break down when you have millions of records or more, particularly when you want to store the records in the cloud rather than in a local database on your PC or network.
Note also that in Big Data terms the data sample in this article is quite small, at 1.7 million records, and there was no particular urgency to upload it. Different techniques again are needed when millions of records need to be saved to a cloud storage location as quickly as possible (e.g. Twitter, Facebook or Instagram posts, telemetry records from the millions of modern cars on the road, or measurements from the increasing number of IoT – Internet of Things – devices).
As I continue my Big Data journey (through the Microsoft Professional Program route, and before that the Data Science program, and my own experiences with data), I plan to add more posts on this topic.