Introduction to data science in R and Python

A video recording of the session is available here on the Skills Matter website.

Plan for the session

Kirill Egorov will use R (and RStudio) to:
1. get S&P 500 data from Quandl,
2. browse through and clean the data,
3. carry out a clustering analysis to identify groupings in the data.
Robert Hardy will then show some data munging in Python:
1. a brief look at installing the Anaconda Python distribution,
2. using Pandas to clean data from Quandl.
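
As a rough idea of what the Pandas cleaning step can look like, here is a minimal sketch. It is not the code from the session: the `raw` table holds made-up prices standing in for a Quandl download, and `clean_prices` is a hypothetical helper name. It assumes a table with a `Date` column and a `Close` column.

```python
import numpy as np
import pandas as pd

# Made-up prices standing in for a Quandl download; not real market data.
raw = pd.DataFrame(
    {
        "Date": ["2016-01-05", "2016-01-04", "2016-01-05", "2016-01-06", "2016-01-07"],
        "Close": [101.0, 100.0, 101.0, np.nan, 103.0],
    }
)


def clean_prices(frame):
    """Return a tidy, date-indexed copy of a raw price table (hypothetical helper)."""
    out = frame.copy()
    out["Date"] = pd.to_datetime(out["Date"])  # strings -> timestamps
    out = out.drop_duplicates(subset="Date")   # keep the first row for each date
    out = out.set_index("Date").sort_index()   # chronological order
    out["Close"] = out["Close"].ffill()        # carry the last known price forward
    return out.dropna()                        # drop rows that are still missing


clean = clean_prices(raw)
print(clean)
```

Forward-filling is only one way to treat gaps; dropping the rows or interpolating may suit other data better.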

The whole session is on GitHub at https://github.com/robert-hardy/fsq_data_science_intro.

Installing the Python data-science stack

R and RStudio have long been substantially easier to install on Windows and OS X than the Python equivalents.

Continuum Analytics is working hard to fix this, and its Anaconda distribution is widely used.

[ Installing the Python data-science stack on Linux is much easier than on Windows: you can install everything with Python’s built-in pip install. ]

Hans Fangohr has written a good guide to all the pieces we will use: http://www.southampton.ac.uk/~fangohr/blog/installation-of-python-spyder-numpy-sympy-scipy-pytest-matplotlib-via-anaconda.html

Briefly, these instructions are:

Download the Anaconda installer (instructions here). I chose the graphical installer for the Python 3.5 version.
Note that to ‘Install for me only’ you have to hit the Change Install Location… button in the ‘Installation Type’ part of the installer.
Finally, I open a terminal and run:
```
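# Reload the shell profile so the newly installed conda is on the PATH
source ~/.bash_profile
# Install the Quandl Python module
conda install quandl
# Start Anaconda Navigator in the background
anaconda-navigator &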
```
to install the Quandl module and to start the Anaconda Navigator.
I am going to use the Spyder IDE that already comes installed and can be started from the navigator. Read here if you want to find out more about other IDEs with Anaconda.
See here if you ever need to know a bit more about the directory layout of the Anaconda distribution.
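
Once the installer has finished, a quick way to confirm that the pieces are visible to Python is to ask the import system whether it can find them. The sketch below uses only the standard library; the package list is my assumption about what the session needs, and `check_stack` is a hypothetical helper name.

```python
from importlib.util import find_spec

# Assumed list of packages for this session; adjust to taste.
PACKAGES = ["numpy", "pandas", "matplotlib", "quandl"]


def check_stack(names):
    """Map each top-level package name to True if Python can locate it."""
    return {name: find_spec(name) is not None for name in names}


status = check_stack(PACKAGES)
for name, found in status.items():
    print(f"{name:12s} {'found' if found else 'MISSING'}")
```

Run it from the Spyder console or from the terminal used above. A package reported as MISSING can usually be added with conda install.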

Some useful links

Introduction to Pandas data structures: http://pandas.pydata.org/pandas-docs/stable/dsintro.html.
Plotting data from Pandas: http://pandas.pydata.org/pandas-docs/version/0.18.1/visualization.html.
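
For a first taste of the two data structures that the Pandas introduction covers, here is a minimal sketch with made-up values. A Series is a labelled one-dimensional array, and a DataFrame is a table whose columns are Series sharing one index.

```python
import pandas as pd

# A Series: values plus an index of labels (made-up numbers).
s = pd.Series([1.5, 2.5, 4.0], index=["a", "b", "c"], name="value")

# A DataFrame: several columns aligned on the same index.
df = pd.DataFrame({"value": s, "doubled": s * 2})

print(s["b"])               # look up one element by label
print(df.loc["c"])          # look up one row by label
print(df["doubled"].sum())  # column-wise arithmetic
```

With matplotlib installed, calling `df.plot()` on a frame like this draws one line per column, which is where the plotting guide picks up.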