Why do data scientists use R and Python, as opposed to other languages like C#?

As a “proper” programmer, used to programming in heavy duty, compiled languages like C# (and before that C++ and C), my reaction on discovering during my Data Science journey that R and Python are heavily used by data scientists was: why??

Why would anyone use an interpreted language, which is therefore bound to be slower, and why would anyone go to the trouble of using yet another language when there are perfectly good compiled languages around like C#, F# and VB.net?

The answer seems to be partly that R and Python are free (open source), and also because R and Python have excellent visualisation tools, which the other languages currently lack.

In particular, in classification problems, it is very easy in R or Python to produce conditioned or faceted charts, e.g. histograms of features (which is what independent variables are often called in Data Science) conditioned on the values of a categorical label (the name often used for the dependent variable, or the variable which we are trying to predict).  This conditioning, together with using aesthetics (e.g. the use of colour or shapes) to project additional dimensions on to what is of necessity usually a two dimensional chart, really helps when exploring the data to try and find features which are likely to help predict the label.

The following chart shows a scatter plot (in an analysis of the prices of cars in the USA) of price versus city-mpg (i.e. the fuel consumption in miles per gallon when driving in city conditions), but with an additional dimension (fuel type) shown by colour (red for diesel, and blue for gas/petrol):


The separate clustering of the blue and red dots indicate that fuel type has a significant influence on the price of cars, and in particular that for a given price, diesel cars tend to produce more miles per gallon in city driving. This is as one might expect, but sometimes such use of aesthetics can help modellers spot features which may be important much more easily, or more quickly than without such visualisations.

Similarly, in regression problems (where the label is a real number – e.g. the price of a car, as opposed to a discrete set of values – e.g. whether an individual is a good or bad credit risk), R and python make it very easy to produce a matrix of pairwise scatter plots, such as the image at the very top of this post.

Both R and Python also include libraries for performing many machine learning algorithms/statistical models.

If someone were to produce similar visualisation tools in .net (maybe there are some already? I couldn’t find any on a quick search), then my suspicion is that many .net data scientists would switch from R or Python to C#, particularly as many machine learning algorithms are available via Azure Machine Learning Studio (known popularly as Azure ML).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s