Uploading Big Data: it’s very different from normal data

The above screenshot shows an initial analysis (in Microsoft Power BI) of 1,723,099 New York taxi trip records uploaded to the cloud. The top chart is a scatter plot of Trip Distance (in miles) against Total Fare Amount (in US $). It shows straightaway that there are some outliers in the data: some trips cost over $1,000 despite covering only short distances. These records are almost certainly errors (for example, a fare entered with the decimal point in the wrong place, such as $1000.00 instead of $10.00) and should be corrected or removed. Similar errors in the Trip Distance field had already been removed: 2 records had implausible distance values (300,833 miles for a total fare of $14.16, and 1,666 miles for a total fare of $10.30).
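As a rough illustration of how such outliers might be flagged before charting, here is a minimal Python sketch. The sample rows, the two thresholds and the `is_suspect` helper are all assumptions made for illustration (only the two implausible distance records come from the text above); sensible limits would depend on the actual data.

```python
# Hypothetical sample rows: (trip_distance_miles, total_fare_usd).
trips = [
    (2.1, 10.30),
    (0.8, 1000.00),     # short trip, huge fare: likely misplaced decimal point
    (300833.0, 14.16),  # implausible distance quoted in the text
    (1666.0, 10.30),    # implausible distance quoted in the text
    (5.4, 21.50),
]

# Assumed plausibility limits, chosen only for this sketch.
MAX_DISTANCE_MILES = 100.0
MAX_FARE_PER_MILE = 50.0


def is_suspect(distance, fare):
    """Flag a record whose distance or fare-per-mile looks implausible."""
    if distance <= 0 or distance > MAX_DISTANCE_MILES:
        return True
    return fare / distance > MAX_FARE_PER_MILE


suspect = [t for t in trips if is_suspect(*t)]
clean = [t for t in trips if not is_suspect(*t)]
```

Flagged records would normally be reviewed rather than silently dropped, since some may be correctable (e.g. by moving the decimal point).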

Before big data can be analysed, it often needs to be moved from its original sources (e.g. separate csv or txt files, or a stream) to somewhere it can be collated and processed. Possible destinations include an online database, Microsoft Power BI, or an xdf (extensible data format) file that can be analysed by Microsoft R Server.
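One common way to move a large csv file into a database is to stream it in batches rather than reading it all into memory. The sketch below assumes Python with the standard library only, uses an in-memory SQLite database as a stand-in for an online database, and invents the column names `trip_distance` and `total_amount` and the function `load_csv_in_chunks` purely for illustration.

```python
import csv
import io
import itertools
import sqlite3


def load_csv_in_chunks(lines, conn, chunk_size=50_000):
    """Stream CSV rows into SQLite in batches; return the number of rows loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trips (trip_distance REAL, total_amount REAL)"
    )
    reader = csv.DictReader(lines)
    total = 0
    while True:
        # Pull at most chunk_size rows from the reader without loading the rest.
        chunk = [
            (float(row["trip_distance"]), float(row["total_amount"]))
            for row in itertools.islice(reader, chunk_size)
        ]
        if not chunk:
            break
        conn.executemany("INSERT INTO trips VALUES (?, ?)", chunk)
        conn.commit()
        total += len(chunk)
    return total


# Tiny hypothetical sample standing in for a multi-gigabyte file.
sample = io.StringIO("trip_distance,total_amount\n2.1,10.30\n5.4,21.50\n0.8,7.00\n")
conn = sqlite3.connect(":memory:")
rows_loaded = load_csv_in_chunks(sample, conn, chunk_size=2)
```

For a real file, `lines` would be an open file handle (e.g. `open(path, newline="")`) and the connection would point at the target database.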


How corporate pension liabilities could vary by 10% or more, even on an agreed set of assumptions

(Posted by Patrick Lee on 1 August 2017 at a different location, but migrated here on 05 Feb 2018).

Why is there a range of answers, even using a given set of assumptions? Are these differences real, or artificial?

It would clearly make a difference whether a company’s pension liabilities were £475m, £500m or £525m …

The value of an organisation’s defined benefit (final salary or CARE, i.e. career average revalued earnings) pension plan promises normally depends on many uncertainties, including:

  • how long the plan members and their partners are expected to live
  • what proportions of active members will leave service, or retire on ill health, before reaching normal retirement age
  • what the future rates of salary growth and price inflation (and hence pension increases) will be
  • what the future rates of reinvestment (for cashflow mismatches) will be, given that a perfectly matching asset portfolio normally doesn’t exist.
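To give a feel for how the value of a pension promise responds to inputs like these, here is a deliberately simplified Python sketch for a single member. Every figure, the `liability` function and its parameters are assumptions made for illustration; a real valuation would use mortality tables, decrements and full cashflow projections, not a fixed payment period.

```python
def liability(annual_pension, years_until_retirement, years_in_payment,
              discount_rate, pension_increase_rate):
    """Present value of a pension paid annually in arrears from retirement."""
    total = 0.0
    for k in range(years_in_payment):
        t = years_until_retirement + k + 1  # time of the payment, in years from now
        payment = annual_pension * (1 + pension_increase_rate) ** k
        total += payment / (1 + discount_rate) ** t
    return total


# Hypothetical member: 10,000 a year, retiring in 10 years, paid for 25 years.
base = liability(10_000, 10, 25, discount_rate=0.03, pension_increase_rate=0.02)

# Living two years longer, or earning less on the assets, both raise the value.
longer_life = liability(10_000, 10, 27, discount_rate=0.03, pension_increase_rate=0.02)
lower_discount = liability(10_000, 10, 25, discount_rate=0.025, pension_increase_rate=0.02)

print(f"base: {base:,.0f}")
print(f"two more years in payment: {longer_life / base - 1:+.1%}")
print(f"discount rate 0.5% lower:  {lower_discount / base - 1:+.1%}")
```

The sketch only shows sensitivity to the inputs themselves; it does not model the calculation differences that can arise on an agreed set of assumptions.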