Wednesday, December 16, 2015

How to Learn R

December 10, 2015
There are tons of resources to help you learn the different aspects of R, and as a beginner this can be overwhelming. It’s also a dynamic language and rapidly changing, so it’s important to keep up with the latest tools and technologies.
That’s why R-bloggers and DataCamp have worked together to bring you a learning path for R. Each section points you to relevant resources and tools to get you started and keep you engaged to continue learning. It’s a mix of materials ranging from documentation, online courses, books, and more.
Just like R, this learning path is a dynamic resource. We want to continually evolve and improve it to provide the best possible learning experience. So if you have suggestions for improvement, please email tal.galili@gmail.com with your feedback.

Learning Path

Getting started:  The basics of R

The best way to learn R is by doing. If you are just getting started with R, this free introduction to R tutorial by DataCamp is a great resource, as is its successor, Intermediate R programming (subscription required). Both courses teach you R programming and data science interactively, at your own pace, in the comfort of your browser. You get immediate feedback during exercises, with helpful hints along the way so you don’t get stuck.
Another free online interactive learning tutorial for R is available on O’Reilly’s Code School website, called Try R. An offline interactive learning resource is swirl, an R package that makes it fun and easy to become an R programmer. You can take a swirl course by (i) installing the package in R, and (ii) selecting a course from the course library. If you want to start right away without installing anything, you can also opt for the online version of swirl.
There are also some very good MOOC’s available on edX and Coursera that teach you the basics of R programming. On edX you can find Introduction to R Programming by Microsoft, an 8 hour course that focuses on the fundamentals and basic syntax of R. At Coursera there is the very popular R Programming course by Johns Hopkins. Both are highly recommended!
If you instead prefer to learn R via a written tutorial or book there is plenty of choice. There is the introduction to R manual by CRAN, as well as some very accessible books like Jared Lander’s R for Everyone or R in Action by Robert Kabacoff.
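To give a flavor of what these introductory resources teach, here is a minimal base-R session (the variable and function names are just illustrative):

```r
# A numeric vector and some built-in summaries
ages <- c(23, 35, 41, 29)
length(ages)   # 4
mean(ages)     # 32

# Defining and calling your own function
celsius_to_fahrenheit <- function(temp_c) {
  temp_c * 9 / 5 + 32
}
celsius_to_fahrenheit(100)   # 212
```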

Setting up your machine

You can download a copy of R from the Comprehensive R Archive Network (CRAN). There are binaries available for Linux, Mac and Windows.
Once R is installed you can choose to either work with the basic R console, or with an integrated development environment (IDE). RStudio is by far the most popular IDE for R and supports debugging, workspace management, plotting and much more (make sure to check out the RStudio shortcuts).
Next to RStudio there is also Architect, an Eclipse-based IDE for R. If you prefer to work with a graphical user interface, you can have a look at R Commander (aka Rcmdr) or Deducer.

R packages

R packages are the fuel that drives the growth and popularity of R. R packages are bundles of code, data, documentation, and tests that are easy to share with others. Before you can use a package, you will first have to install it. Some packages, like the base package, are automatically installed when you install R. Other packages, such as ggplot2, don’t come bundled with the R installation and need to be installed separately.
Many (but not all) R packages are organized and available from CRAN, a network of servers around the world that store identical, up-to-date versions of code and documentation for R. You can easily install these packages from inside R using the install.packages() function. CRAN also maintains a set of Task Views that identify all the packages associated with a particular task, such as TimeSeries.
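In practice, installing and loading a CRAN package takes two calls. A sketch using ggplot2 purely as an example (install.packages() downloads from your default CRAN mirror, so it needs an internet connection):

```r
# One-time install from CRAN (guarded so it only runs when needed)
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Attach the package in the current session
library(ggplot2)
```

install.packages() is run once per machine; library() is run once per session.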
Next to CRAN there is also Bioconductor, which has packages for the analysis of high-throughput genomic data, as well as, for example, the GitHub and Bitbucket repositories of R package developers. You can easily install packages from these repositories using the devtools package.
Finding a package can be hard, but luckily you can easily search packages from CRAN, GitHub, and Bioconductor using Rdocumentation or inside-R, or you can have a look at this quick list of useful R packages.
To end, once you start working with R, you’ll quickly find out that R package dependencies can cause a lot of headaches. Once you are confronted with that issue, make sure to check out packrat (see video tutorial) or checkpoint. When you need to update R on Windows, you can use the updateR() function from the installr package.

Importing your data into R

The data you want to import into R can come in all sorts of formats: flat files, statistical software files, databases, and web data.
Getting different types of data into R often requires a different approach. To learn more generally about how to get different data types into R, you can check out this online Importing Data into R tutorial (subscription required), this post on data importing, or this webinar by RStudio.
  • Flat files are typically simple text files that contain tabular data. The standard distribution of R provides functionality to import these flat files into R as a data frame with functions such as read.table() and read.csv() from the utils package. Specific R packages to import flat files are readr, a fast and very easy to use package that is less verbose than utils and multiple times faster (more information), and data.table, whose fread() function imports and munges data into R.
  • In case you want to get your Excel files into R, it’s a good idea to have a look at the readxl package. Alternatively, there is the gdata package, which has functions that support the import of Excel data, and the XLConnect package. The latter acts as a real bridge between Excel and R, meaning you can do any action you could do within Excel, but from inside R. Read more on importing your Excel files into R.
  • Software packages such as SAS, Stata, and SPSS use and produce their own file types. The haven package by Hadley Wickham can import SAS, Stata, and SPSS data files into R and is very easy to use. Alternatively there is the foreign package, which is able to import not only SAS, Stata, and SPSS files but also more exotic formats like Systat and Weka, and it can export data back to various formats as well. (Tip: if you’re switching from SAS, SPSS, or Stata to R, check out Bob Muenchen’s tutorial (subscription required).)
  • The packages used to connect to and import from a relational database depend on the type of database you want to connect to. To connect to a MySQL database, for example, you need the RMySQL package; others are the RPostgreSQL and ROracle packages. The R functions you then use to access and manipulate the database are specified in another R package called DBI.
  • If you want to harvest web data using R, you need to connect R to resources online using APIs, or through scraping with packages like rvest. To get started with all of this, there is a great resource freely available on the blog of Rolf Fredheim.
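As a small, self-contained sketch of the flat-file case above (the example writes a temporary CSV first, so it runs anywhere):

```r
# Create a tiny CSV in a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "ana,90", "bob,85"), csv_path)

# read.csv() from the utils package returns a data frame
scores <- read.csv(csv_path, stringsAsFactors = FALSE)
nrow(scores)     # 2
scores$score     # 90 85
```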

Data Manipulation

Turning your raw data into well-structured data is important for robust analysis and makes the data suitable for processing. R has many built-in functions for data processing, but they are not always that easy to use. Luckily, there are some great packages that can help you:
  • The tidyr package allows you to “tidy” your data. Tidy data is data where each column is a variable and each row an observation. As such, it turns your data into data that is easy to work with. Check this excellent resource on how you can tidy your data using tidyr.
  • If you want to do string manipulation, you should learn about the stringr package. The vignette is very readable and full of useful examples to get you started.
  • dplyr is a great package when working with data frame like objects (in memory and out of memory). It combines speed with a very intuitive syntax. To learn more on dplyr you can take this data manipulation course (subscription required) and check out this handy cheat sheet.
  • When performing heavy data wrangling tasks, the data.table package should be your “go-to” package. It’s blazingly fast, and once you get the hang of its syntax you will find yourself using data.table all the time. Check this data analysis course (subscription required) to discover the ins and outs of data.table, and use this cheat sheet as a reference.
  • Chances are you’ll find yourself working with times and dates at some point. This can be a painful process, but luckily lubridate makes it a bit easier. Check its vignette to better understand how you can use lubridate in your day-to-day analysis.
  • Base R has limited functionality for handling time series data. Fortunately, there are packages like zoo, xts, and quantmod. Take this tutorial by Eric Zivot to better understand how to use these packages and how to work with time series data in R.
If you want to have a general overview of data manipulation with R, you can read more in the book Data Manipulation with R or see the Data Wrangling with R video by RStudio. In case you run into troubles with handling your data frames, check 15 easy solutions to your data frame problems.
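Before reaching for these packages, it helps to see what the equivalent base-R calls look like; a minimal sketch (the comments note rough dplyr counterparts):

```r
df <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 2, 3, 4)
)

# Keep rows matching a condition (roughly dplyr::filter(df, value > 1))
subset(df, value > 1)

# Group-wise mean (roughly summarise(group_by(df, group), mean(value)))
aggregate(value ~ group, data = df, FUN = mean)
```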

Data Visualization

One of the things that makes R such a great tool is its data visualization capabilities. For visualization in R, ggplot2 is probably the most well-known package and a must-learn for beginners! You can find all relevant information to get started with ggplot2 on http://ggplot2.org/, and make sure to check out the cheat sheet and the upcoming book. Next to ggplot2, you also have packages such as ggvis for interactive web graphics (see tutorial (subscription required)), googleVis to interface with Google Charts (learn to re-create this TED talk), Plotly for R, and many more. See the task view for some hidden gems, and if you have issues with plotting your data this post might help you out.
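If you already have ggplot2 installed, a first plot is only a few lines; a sketch using the built-in mtcars data set:

```r
library(ggplot2)

# Scatter plot of car weight vs. fuel efficiency, colored by cylinder count
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")
```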
In R there is a whole task view dedicated to handling spatial data, which allows you to create beautiful maps such as this famous one:
To get started, look at a package such as ggmap, which allows you to visualize spatial data and models on top of static maps from sources such as Google Maps and OpenStreetMap. Alternatively, you can start playing around with maptools, choroplethr, and the tmap package. If you need a great tutorial, take this Introduction to visualising spatial data in R.
You’ll often see that visualizations in R make use of magnificent color schemes that fit the graph or map like a glove. If you want to achieve this for your own visualizations, dive into the RColorBrewer package and ColorBrewer.
One of the latest visualization tools in R is HTML widgets. HTML widgets work just like R plots, but they create interactive web visualizations such as dynamic maps (leaflet), time-series charts (dygraphs), and interactive tables (DataTables). There are some very nice examples of HTML widgets in the wild, and solid documentation on how to create your own (not in a reading mood? Then just watch this video).
If you want to get some inspiration on what visualization to create next, you can have a look at blogs dedicated to visualizations such as FlowingData.

Data Science & Machine Learning with R

There are many beginner resources on how to do data science with R. A list of available online courses:
Alternatively, if you prefer a good read:
Once you start doing some machine learning with R, you will quickly find yourself using packages such as caret, rpart, and randomForest. Luckily, there are some great learning resources for these packages and for machine learning in general. If you are just getting started, this guide will get you going in no time. Alternatively, you can have a look at the books Mastering Machine Learning with R and Machine Learning with R. If you are looking for step-by-step tutorials that guide you through a real-life example, there is the Kaggle Machine Learning course, or you can have a look at Wiekvoet’s blog.

Reporting Results in R

R Markdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R. It is a great tool for reporting your data analysis in a reproducible manner, thereby making the analysis more useful and understandable. R Markdown is based on knitr and pandoc. With R Markdown, R generates a final document that replaces the R code with its results. This document can be in HTML, Word, PDF, ioslides, and other formats. You can even create interactive R Markdown documents using Shiny. This 4-hour tutorial on Reporting with R Markdown (subscription required) gets you going with R Markdown, and in addition you can use this nice cheat sheet for future reference.
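A minimal R Markdown document has three parts: a YAML header, prose, and R code chunks whose output replaces the code in the rendered file. A sketch (render it with rmarkdown::render("report.Rmd") or the Knit button in RStudio):

````markdown
---
title: "A Minimal Report"
output: html_document
---

## Summary

The chunk below is replaced by its output when the document is rendered:

```{r}
summary(mtcars$mpg)
```
````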
Next to R Markdown, you should also make sure to check out Shiny. Shiny makes it incredibly easy to build interactive web applications with R. It allows you to turn your analysis into interactive web applications without needing to know HTML, CSS, or JavaScript. RStudio maintains a great learning portal to get you started with Shiny, including this set of video tutorials (click on the essentials of the Shiny Learning Roadmap). More advanced topics are available, as well as a great set of examples.

Next steps

After spending some time writing R code (and becoming an R addict), you’ll reach a point where you want to start writing your own R package. Hilary Parker from Etsy has written a short tutorial on how to create your first package, and if you’re really serious about it you should read R Packages, an upcoming book by Hadley Wickham that is already available for free on the web.
Once you become more fluent in writing R syntax (and consequently addicted to R), you will want to unlock more of its power (read: do some really nifty stuff). In that case, make sure to check out Rcpp, an R package that makes it easier to integrate C++ code with R, or RevoScaleR (start the free tutorial).
Finally, if you want to learn about the inner workings of R and improve your understanding of it, the best way to get started is by reading Advanced R.

Monday, November 2, 2015

50+ free online resources to learn more about data science and analysis


Machine learning

I selected those resources that are more suitable for beginners together with the parts of machine learning that I like the most.

Statistics

Python

Once you are familiar with Python, the following resources for machine learning and data analysis can take your skills to the next level:

R

I’ve been trying hard to like R. It’s been in fact more than 5 years of trying to like it and I just simply prefer Python. In any case, I still frequently launch an R prompt to use some fantastic packages that R has.

Applying data science to your organization

To close, some examples of how data science and machine learning can be used to add value to your organization:

Monday, October 5, 2015

R interview questions

1.) What is the difference between a matrix and a dataframe?
Answer: A dataframe can contain heterogeneous inputs and a matrix cannot. (You can have a dataframe of characters, integers, and even other dataframes, but you can't do that with a matrix -- a matrix must be all the same type.)
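A quick demonstration of the difference in the console:

```r
# A matrix forces everything to a single type...
m <- matrix(c(1, 2, "a", "b"), nrow = 2)
typeof(m)          # "character" -- the numbers were coerced to strings

# ...while a data frame keeps a separate type per column
df <- data.frame(x = c(1, 2), y = c("a", "b"), stringsAsFactors = FALSE)
sapply(df, class)  # x: "numeric", y: "character"
```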
2.) What is the difference between sapply and lapply? When should you use one versus the other? Bonus: When should you use vapply?
Answer: Use lapply when you want the output to be a list, and sapply when you want the output simplified to a vector or matrix. Generally vapply is preferred over sapply because you can specify the output type of vapply (but not sapply). The drawback is that vapply is more verbose and harder to use.
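The difference is easy to see on a small list:

```r
x <- list(a = c(1, 2, 3), b = c(4, 5, 6))

lapply(x, sum)   # a list: list(a = 6, b = 15)
sapply(x, sum)   # simplified to a named vector: c(a = 6, b = 15)

# vapply states the expected type/shape of each result up front
vapply(x, sum, numeric(1))   # also c(a = 6, b = 15), but type-checked
```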
3.) What is the difference between seq(4) and seq_along(4)?
Answer: seq(4) produces a vector from 1 to 4 (c(1, 2, 3, 4)), whereas seq_along(4) produces a vector whose length equals length(4), i.e. 1 (c(1)).
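The distinction matters most when the input might be empty, which is why seq_along is the safer choice in loops:

```r
seq(4)         # 1 2 3 4
seq_along(4)   # 1 (because length(4) is 1)

# The classic trap with an empty vector:
x <- c()
seq(length(x))   # 1 0  -- counts down, so a loop over this runs twice!
seq_along(x)     # integer(0), so a loop body runs zero times
```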
4.) What is f(3) where:
y <- 5
f <- function(x) { y <- 2; y^2 + g(x) }
g <- function(x) { x + y }
Why?
Answer: 12. In f(3), y is 2, so y^2 is 4. When evaluating g(3), y is the globally scoped y (5) instead of the y that is locally scoped to f, so g(3) evaluates to 3 + 5 or 8. The rest is just 4 + 8, or 12.
5.) I want to know all the values in c(1, 4, 5, 9, 10) that are not in c(1, 5, 10, 11, 13). How do I do that with one built-in function in R? How could I do it if that function didn't exist?
Answer: setdiff(c(1, 4, 5, 9, 10), c(1, 5, 10, 11, 13)), and without it: c(1, 4, 5, 9, 10)[!c(1, 4, 5, 9, 10) %in% c(1, 5, 10, 11, 13)].
6.) Can you write me a function in R that replaces all missing values of a vector with the mean of that vector?
Answer:
mean_impute <- function(x) { x[is.na(x)] <- mean(x, na.rm = TRUE); x }
7.) How do you test R code? Can you write a test for the function you wrote in #6?
Answer: You can use Hadley's testthat package. A test might look like this:
test_that("it imputes the mean correctly", {
  expect_equal(mean_impute(c(1, 2, NA, 6)), c(1, 2, 3, 6))
})
8.) Say I have...
fn <- function(a, b, c, d, e) a + b * c - d / e
How do I call fn on the vector c(1, 2, 3, 4, 5) so that I get the same result as fn(1, 2, 3, 4, 5)? (No need to tell me the result, just how to do it.)
Answer: do.call(fn, as.list(c(1, 2, 3, 4, 5)))
9.)
dplyr <- "ggplot2"
library(dplyr)
Why does the dplyr package get loaded and not ggplot2?
Answer: library() uses non-standard evaluation: internally it quotes its argument with deparse(substitute(package)), so the symbol dplyr itself is used rather than the string "ggplot2" stored in it. To load the package named by the variable's value, use library(dplyr, character.only = TRUE).
10.)
mystery_method <- function(x) { function(z) Reduce(function(y, w) w(y), x, z) }
fn <- mystery_method(c(function(x) x + 1, function(x) x * x))
fn(3)
What is the value of fn(3)? Can you explain what is happening at each step?
Answer:
Best seen in steps.
fn(3) requires mystery_method to be evaluated first.
mystery_method(c(function(x) x + 1, function(x) x * x)) evaluates to...
function(z) Reduce(function(y, w) w(y), c(function(x) x + 1, function(x) x * x), z)
Now, we can see the 3 in fn(3) is supposed to be z, giving us...
Reduce(function(y, w) w(y), c(function(x) x + 1, function(x) x * x), 3)
This Reduce call is wonky, taking three arguments. A three argument Reduce call will initialize at the third argument, which is 3.
The inner function, function(y, w) w(y) is meant to take an argument and a function and apply that function to the argument. Luckily for us, we have some functions to apply.
That means we initialize at 3 and apply the first function, function(x) x + 1. 3 + 1 = 4.
We then take the value 4 and apply the second function. 4 * 4 = 16.


Q: How would you calculate the variance of the columns of a matrix (called mat) in R without using for loops?
A: This question establishes familiarity with R by indirectly asking about one of the biggest flaws of the language. If the candidate has used it for any non-trivial application, they will know the apply function and will bitch about the slowness of for loops in R. The solution is:
apply(mat, 2, var)
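You can sanity-check the answer on a toy matrix:

```r
# Two observations (rows) of three variables (columns)
mat <- matrix(1:6, nrow = 2)   # columns: (1,2), (3,4), (5,6)

# Margin 2 means "apply var over columns"
apply(mat, 2, var)   # 0.5 0.5 0.5
```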
Q: Suppose you have a .csv file with two columns, the 1st of first names and the 2nd of last names. Write some code to create a .csv file with last names as the 1st column and first names as the 2nd column.
A: You should know basic cat, awk, grep, sed, etc.
cat names.csv | awk -F "," '{print $2","$1}' > flipped_names.csv
Q: Explain map/reduce and then write a simple one in your favorite programming language.
A: This establishes familiarity with map/reduce. See my previous blog post.
Q: Suppose you are Google and want to estimate the click through rates (CTR) on your ads. You have 1000 queries, each of which has been issued 1000 times. Each query shows 10 ads and all ads are unique. Estimate the CTR for each ad.
A: This is my favorite interview question for a statistician. It doesn’t tackle one specific area, but gets at the depth of statistical knowledge they possess. Only good candidates receive this question. The candidate should immediately recognize this as a binomial trial, so the maximum likelihood estimator of the CTR is simply (# clicks)/(# impressions). This question is easily followed up by mentioning that click through rates are empirically very low, so this will estimate many CTRs at 0, which doesn’t really make sense. The candidate should then suggest altering the estimate by adding pseudo counts: (# clicks + 2)/(# impressions + 4). This is called the Wilson estimator and shrinks your estimate towards .5. Empirically, this does much better than the MLE. You should then ask if this can be interpreted in the context of Bayesian priors, to which they should respond, “Yes, this is equivalent to a prior of beta(2,2), which is the conjugate prior for the binomial distribution.”
The discussion can be led multiple places from here. You can discuss: a) other shrinkage estimators (this is an actual term in statistics, not a Seinfeld reference; see Stein estimators for further reading), b) pooling results from similar queries, c) use of covariates (position, ad text, query length, etc.) to assist in prediction, d) methods for prediction: logistic regression, complicated ML models, etc. A strong candidate can talk about this problem for at least 15 of the 20 minutes.
Q: Suppose you run a regression with 10 variables and 1 is significant at the 95% level. Suppose you then find 10% of the data had been left out randomly and had their y values deleted. How would you predict their y values?
A: I would be very careful about doing this unless it’s sensationally predictive. If one generates 10 variables of random noise and regresses them against white noise, there is a ~40% chance that at least one will be significant at a 95% confidence level. This question helps me understand whether the individual understands regression. I also usually ask about regression diagnostics and assumptions.
Q: Suppose you have the option to go into one of two bank branches. Branch one has 10 tellers, each with a separate queue of 10 customers, and branch two has 10 tellers, sharing one queue of 100 customers. Which do you choose?
A: This question establishes familiarity with a wide range of basic stat concepts: mean, variance, waiting times, central limit theorem, and the ability to model and then analyze a real world situation. Both options have the same mean wait time. The latter option has smaller variance, because you are averaging the wait times of 100 individuals before you rather than 10. One can fairly argue about utility functions and the merits of risk seeking behavior over risk averse behavior, but I’d go for same mean with smaller variance (think about how maddening it is when another line at the grocery store is faster than your own).
Q: Explain how Random Forests differs from a normal regression tree.
A: This question establishes familiarity with two popular ML algorithms. “Normal” regression trees, have some splitting rule based on decrease in mean squared error or some other measure of error or misclassification. The tree grows until the next split decreases error by less than some threshold. This often leads to overfitting and trees fit on data sets with large numbers of variables can completely leave out many variables from the data set. Random Forests are an ensemble of fully grown trees. For each tree, a subsample of the variables and bootstrap sample of data are taken, fit, and then averaged together. Generally this prevents overfitting and allows all variables to “shine”. If the candidate is familiar with Random Forests, they should also know about partial dependence plots and variable importance plots. I generally ask this question of candidates that I fear may not be up to speed with modern techniques. Some implementations do not grow trees fully, but the original implementation of Random Forests does.

Bad Interview Questions

The following are probability and intro stat questions that are not appropriate for a data scientist or statistician role. They should have learned this in intro statistics. These would be like asking an engineering candidate the complexity of binary search (O(log n)).
Q: Suppose you are playing a dice game; you roll a single die, then are given the option to re-roll a single time after observing the outcome. What is the expected value of the dice roll?
A: The expected value of a dice roll is 3.5 = (1+2+3+4+5+6)/6, so you should opt to re-roll only if the initial roll is a 1, 2, or 3. If you re-roll (which occurs with probability .5), the expected value of that roll is 3.5, so the expected value is:
4 * 1/6 + 5 * 1/6 + 6 * 1/6 + 3.5 * .5 = 4.25
Q: Suppose you have two variables, X and Y, each with standard deviation 1. Then, X + Y has standard deviation 2. If instead X and Y had standard deviations of 3 and 4, what is the standard deviation of X + Y?
A: Variances are additive (for independent variables), not standard deviations. The first example was a trick -- there sd(X+Y) is actually sqrt(2), not 2! For the second: sd(X+Y) = sqrt(Var(X+Y)) = sqrt(Var(X) + Var(Y)) = sqrt(sd(X)*sd(X) + sd(Y)*sd(Y)) = sqrt(3*3 + 4*4) = 5.

1) Explain what is R?
R is data analysis software which is used by analysts, quants, statisticians, data scientists and others.
2) List out some of the functions that R provides?
Some of the functions that R provides are:
• Mean
• Median
• Distribution
• Covariance
• Regression
• Non-linear
• Mixed Effects
• GLM
• GAM. etc.
3) Explain how you can start the R Commander GUI?
Typing the command library("Rcmdr") into the R console starts the R Commander GUI (the Rcmdr package must be installed first).
4) In R, how can you import data?
You can use R Commander to import data into R, and there are three ways to enter data through it:
• You can enter data directly via Data → New Data Set
• Import data from a plain text (ASCII) or other files (SPSS, Minitab, etc.)
• Read a data set either by typing the name of the data set or selecting the data set in the dialog box
5) Mention what the ‘R’ language does not do?
• Though R can easily connect to a DBMS, it is not a database itself
• Base R does not include a graphical user interface (GUIs such as Rcmdr are add-ons)
• Though it connects to Excel/MS Office easily, the R language does not provide any spreadsheet view of data
6) Explain how comments are written in R?
In R, a comment is written by prefacing the line of code with a # sign; everything after the # on that line is ignored. For example:
• # subtraction
• # division
• # note order of operations exists
7) How can you save your data in R?
There are many ways to save data in R, but the easiest (via R Commander) is:
Go to Data > Active Data Set > Export Active Data Set; a dialog box will appear, and clicking OK lets you save your data in the usual way.
8) Mention how you can produce correlations and covariances?
You can use the cor() function to produce correlations and the cov() function to produce covariances.
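For example, on two perfectly linearly related vectors:

```r
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)

cor(x, y)   # 1: y is an exact linear function of x
cov(x, y)   # 10/3: covariance on the original scale of the data
```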

9) Explain what t-tests are in R?
In R, the t.test() function produces a variety of t-tests. The t-test is the most common test in statistics, used to determine whether the means of two groups are equal to each other.
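A quick two-sample example (the numbers are made up for illustration):

```r
group_a <- c(5.1, 4.9, 5.3, 5.0, 5.2)
group_b <- c(6.0, 6.2, 5.9, 6.1, 6.3)

result <- t.test(group_a, group_b)   # Welch two-sample t-test by default
result$p.value                        # well below 0.05 for these data
```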
10) Explain what the with() and by() functions in R are used for?
• The with() function is similar to DATA in SAS: it applies an expression to a dataset.
• The by() function applies a function to each level of a factor. It is similar to BY processing in SAS.
11) What are the data structures in R that are used to perform statistical analyses and create graphs?
R has data structures like
• Vectors
• Matrices
• Arrays
• Data frames
12) Explain the general format of matrices in R?
The general format is:
mymatrix <- matrix(vector, nrow = r, ncol = c, byrow = FALSE,
                   dimnames = list(char_vector_rownames, char_vector_colnames))
13) In R, how are missing values represented?
In R, missing values are represented by NA (Not Available), while impossible values are represented by NaN (Not a Number).
14) Explain what transpose is?
R provides various methods for reshaping data before analysis, and transposing is the simplest of them. To transpose a matrix or a data frame, the t() function is used.
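For example:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)       # 2 3
dim(t(m))    # 3 2

# Transposing twice gives back the original matrix
identical(t(t(m)), m)   # TRUE
```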
15) Explain how data is aggregated in R?
By collapsing data in R by using one or more BY variables, it becomes easy. When using the aggregate() function the BY variable should be in the list.
16) What is the function used for adding datasets in R?
The rbind() function can be used to join two data frames (datasets). The two data frames must have the same variables, but they do not have to be in the same order.
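A small sketch showing that rbind() matches data frame columns by name, not by position:

```r
df1 <- data.frame(id = c(1, 2), name = c("ana", "bob"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = "cara", id = 3,   # same variables, different order
                  stringsAsFactors = FALSE)

combined <- rbind(df1, df2)
nrow(combined)   # 3
```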
17) What is the use of the subset() function and the sample() function in R?
In R, the subset() function helps you select variables and observations, while the sample() function lets you choose a random sample of size n from a dataset.
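For example:

```r
df <- data.frame(x = 1:10, y = letters[1:10])

# subset(): pick observations (rows) and variables (columns)
small <- subset(df, x > 7, select = y)
nrow(small)   # 3

# sample(): a random sample of size n (set.seed makes it reproducible)
set.seed(42)
draw <- sample(df$x, size = 4)
length(draw)  # 4
```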
18) Explain how you can create a table in R without an external file?
Use the code:
myTable <- data.frame()
myTable <- edit(myTable)
This code opens an Excel-like spreadsheet editor where you can easily enter your data, and the result is assigned back to myTable.