data scientist: October 2015

1.) What is the difference between a matrix and a dataframe?

Answer: A dataframe can contain heterogeneous inputs and a matrix cannot. (You can have a dataframe whose columns are characters, integers, and even other dataframes, but you can't do that with a matrix -- a matrix must be all the same type.)
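A quick sketch of the difference, using made-up example data (the names df and m are just for illustration):

```r
# A data frame can hold a different type in each column.
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
col_types <- sapply(df, class)

# A matrix has a single type, so mixed input is coerced (here to character).
m <- cbind(1:3, c("a", "b", "c"))
matrix_type <- typeof(m)

print(col_types)
print(matrix_type)
```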

2.) What is the difference between sapply and lapply? When should you use one versus the other? Bonus: When should you use vapply?

Answer: Use lapply when you want the output to be a list, and sapply when you want the output simplified to a vector or a matrix where possible. Generally vapply is preferred over sapply because you can specify the output type of vapply (but not sapply), so it fails loudly instead of silently returning an unexpected type. The drawback is vapply is more verbose and harder to use.
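A small sketch of the three side by side, on a toy list. The empty-input case at the end is where sapply and vapply visibly disagree:

```r
x <- list(a = 1:3, b = 4:6)

as_list   <- lapply(x, mean)               # always a list
as_vector <- sapply(x, mean)               # simplified to a named numeric vector
checked   <- vapply(x, mean, numeric(1))   # same values, but the type is enforced

# With empty input, sapply cannot guess the type and falls back to a list;
# vapply still returns the type that was asked for.
empty_sapply <- sapply(list(), mean)
empty_vapply <- vapply(list(), mean, numeric(1))
```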

3.) What is the difference between seq(4) and seq_along(4)?

Answer: seq(4) produces a vector from 1 to 4 (c(1, 2, 3, 4)), whereas seq_along(4) produces a vector of indices as long as its argument. Since length(4) is 1, the result is just c(1).
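To see the difference, and why seq_along is the safer choice for loop indices, here is a short sketch (the variable names are illustrative):

```r
from_seq       <- seq(4)          # counts up to the value 4
from_seq_along <- seq_along(4)    # indexes a vector of length 1

# The classic pitfall: indexing an empty vector.
empty <- numeric(0)
safe_index  <- seq_along(empty)   # zero-length, so a loop over it never runs
risky_index <- 1:length(empty)    # 1:0 counts down, so a loop would run twice
```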

4.) What is f(3) where:

y <- 5
f <- function(x) { y <- 2; y^2 + g(x) }
g <- function(x) { x + y }

Why?

Answer: 12. In f(3), y is 2, so y^2 is 4. When evaluating g(3), y is the globally scoped y (5) instead of the y that is locally scoped to f, so g(3) evaluates to 3 + 5 or 8. The rest is just 4 + 8, or 12.
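Running the code confirms this. For contrast, the sketch below adds a hypothetical variant, f_nested, where the helper is defined inside f and therefore sees f's local y:

```r
y <- 5
f <- function(x) { y <- 2; y^2 + g(x) }
g <- function(x) { x + y }    # y is looked up where g was defined, not where it is called

result <- f(3)

# Hypothetical variant: the helper is defined inside f, so its y is f's y (2).
f_nested <- function(x) {
  y <- 2
  g_local <- function(x) x + y
  y^2 + g_local(x)
}
nested_result <- f_nested(3)
```

This is lexical scoping: a function resolves free variables in the environment where it was created.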

5.) I want to know all the values in c(1, 4, 5, 9, 10) that are not in c(1, 5, 10, 11, 13). How do I do that with one built-in function in R? How could I do it if that function didn't exist?

Answer: setdiff(c(1, 4, 5, 9, 10), c(1, 5, 10, 11, 13)) and c(1, 4, 5, 9, 10)[!c(1, 4, 5, 9, 10) %in% c(1, 5, 10, 11, 13)].
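Both approaches, written out with named vectors so they are easier to read (a and b are just illustrative names):

```r
a <- c(1, 4, 5, 9, 10)
b <- c(1, 5, 10, 11, 13)

with_setdiff <- setdiff(a, b)    # built-in set difference
with_in      <- a[!a %in% b]     # keep the elements of a that do not appear in b
```

One difference worth knowing: setdiff also drops duplicates from its result, while the %in% version keeps them.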

6.) Can you write me a function in R that replaces all missing values of a vector with the mean of that vector?

Answer:

mean_impute <- function(x) { x[is.na(x)] <- mean(x, na.rm = TRUE); x }

7.) How do you test R code? Can you write a test for the function you wrote in #6?

Answer: You can use Hadley's testthat package. A test might look like this:

test_that("It imputes the mean correctly", {
  expect_equal(mean_impute(c(1, 2, NA, 6)), c(1, 2, 3, 6))
})

8.) Say I have...

fn <- function(a, b, c, d, e) a + b * c - d / e

How do I call fn on the vector c(1, 2, 3, 4, 5) so that I get the same result as fn(1, 2, 3, 4, 5)? (No need to tell me the result, just how to do it.)

Answer: do.call(fn, as.list(c(1, 2, 3, 4, 5)))
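A runnable sketch of the same idea, checking that the do.call version agrees with the direct call:

```r
fn <- function(a, b, c, d, e) a + b * c - d / e

args <- c(1, 2, 3, 4, 5)

# do.call expects a list of arguments, so the vector is converted first.
via_do_call <- do.call(fn, as.list(args))
direct      <- fn(1, 2, 3, 4, 5)
```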

9.)

dplyr <- "ggplot2"
library(dplyr)

Why does the dplyr package get loaded and not ggplot2?

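Answer: library() uses non-standard evaluation. It does not evaluate its argument; it captures the unevaluated name, in the spirit of deparse(substitute(dplyr)), which gives "dplyr". To load the package named by the variable instead, use library(dplyr, character.only = TRUE).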
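The mechanism can be sketched without loading any packages. The function show_name below is a hypothetical stand-in that captures its argument's name the same way:

```r
# Hypothetical helper: returns the name of whatever was passed, not its value.
show_name <- function(package) deparse(substitute(package))

dplyr <- "ggplot2"

captured  <- show_name(dplyr)   # the unevaluated symbol, as text
evaluated <- dplyr              # the variable's value
```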

10.)

mystery_method <- function(x) { function(z) Reduce(function(y, w) w(y), x, z) }
fn <- mystery_method(c(function(x) x + 1, function(x) x * x))
fn(3)

What is the value of fn(3)? Can you explain what is happening at each step?

Answer:

Best seen in steps.

fn(3) requires mystery_method to be evaluated first.

mystery_method(c(function(x) x + 1, function(x) x * x)) evaluates to...

function(z) Reduce(function(y, w) w(y), c(function(x) x + 1, function(x) x * x), z)

Now, we can see the 3 in fn(3) is supposed to be z, giving us...

Reduce(function(y, w) w(y), c(function(x) x + 1, function(x) x * x), 3)

This Reduce call is wonky, taking three arguments. A three argument Reduce call will initialize at the third argument, which is 3.

The inner function, function(y, w) w(y) is meant to take an argument and a function and apply that function to the argument. Luckily for us, we have some functions to apply.

That means we initialize at 3 and apply the first function, function(x) x + 1. 3 + 1 = 4.

We then take the value 4 and apply the second function. 4 * 4 = 16, so fn(3) is 16.
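One way to watch these steps happen is Reduce's accumulate argument, which keeps every intermediate value instead of only the last one. A sketch:

```r
steps <- Reduce(
  function(y, w) w(y),                             # apply the next function to the running value
  list(function(x) x + 1, function(x) x * x),
  3,                                               # the initial value
  accumulate = TRUE
)
print(steps)
```

The result holds the initial value followed by the output of each function in turn.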

Q: How would you calculate the variance of the columns of a matrix (called mat) in R without using for loops?

A: This question establishes familiarity with R by indirectly asking about one of the biggest flaws of the language. If the candidate has used it for any non-trivial application, they will know the apply function and will bitch about the slowness of for loops in R. The solution is:

apply(mat, 2, var)
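A small check on a made-up matrix, comparing apply against the sample variance formula computed by hand (still without for loops):

```r
mat <- matrix(c(1, 2, 3, 4,
                2, 4, 6, 8), ncol = 2)

col_vars <- apply(mat, 2, var)   # MARGIN = 2 means "over columns"

# Same thing by hand: center each column, square, sum, divide by n - 1.
centered <- sweep(mat, 2, colMeans(mat))
manual   <- colSums(centered^2) / (nrow(mat) - 1)
```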

Q: Suppose you have a .csv file with two columns, the 1st of first names and the 2nd of last names. Write some code to create a .csv file with last names as the 1st column and first names as the 2nd column.

A: You should know basic cat, awk, grep, sed, etc.

cat names.csv | awk -F "," '{print $2","$1}' > flipped_names.csv
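For comparison, here is a sketch of the same column swap in R. It assumes the file has no header row and no quoted commas, and it uses temporary files with made-up names so it can run anywhere:

```r
in_path  <- tempfile(fileext = ".csv")
out_path <- tempfile(fileext = ".csv")
writeLines(c("Ada,Lovelace", "Alan,Turing"), in_path)   # sample input, purely illustrative

names_df <- read.csv(in_path, header = FALSE, stringsAsFactors = FALSE)

# Write column 2 first, then column 1.
write.table(names_df[, c(2, 1)], out_path, sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

flipped <- readLines(out_path)
```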

Q: Explain map/reduce and then write a simple one in your favorite programming language.

A: This establishes familiarity with map/reduce. See my previous blog post.
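As a rough illustration (not the implementation from the blog post mentioned above), a single-machine word count in R shows the two phases. The helpers map_step and reduce_step are hypothetical names:

```r
lines <- c("the cat", "the dog", "a cat")

# Map: turn one line into (word, 1) pairs, stored as a named vector.
map_step <- function(line) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  setNames(rep(1, length(words)), words)
}

# Reduce: merge two sets of counts by summing values that share a word.
reduce_step <- function(a, b) {
  combined <- c(a, b)
  words <- unique(names(combined))
  vapply(words, function(w) sum(combined[names(combined) == w]), numeric(1))
}

mapped <- lapply(lines, map_step)
totals <- Reduce(reduce_step, mapped)
```

In a real map/reduce system the map calls run in parallel on different machines and the pairs are grouped by key before reducing; this sketch only shows the shape of the computation.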

Q: Suppose you are Google and want to estimate the click through rates (CTR) on your ads. You have 1000 queries, each of which has been issued 1000 times. Each query shows 10 ads and all ads are unique. Estimate the CTR for each ad.

A: This is my favorite interview question for a statistician. It doesn’t tackle one specific area, but gets at the depth of statistical knowledge they possess. Only good candidates receive this question. The candidate should immediately recognize this as a binomial trial, so the maximum likelihood estimator of the CTR is simply (# clicks)/(# impressions). This question is easily followed up by mentioning that click through rates are empirically very low, so this will estimate many CTRs at 0, which doesn’t really make sense. The candidate should then suggest altering the estimate by adding pseudo counts: (# clicks + 2)/(# impressions + 4). This is called the Wilson estimator and shrinks your estimate towards .5. Empirically, this does much better than the MLE. You should then ask if this can be interpreted in the context of Bayesian priors, to which they should respond, “Yes, this is equivalent to a prior of beta(2,2), which is the conjugate prior for the binomial distribution.”

The discussion can be led multiple places from here. You can discuss: a) other shrinkage estimators (this is an actual term in Statistics, not a Seinfeld reference, see Stein estimators for further reading) b) pooling results from similar queries c) use of covariates (position, ad text, query length, etc.) to assist in prediction d) methods for prediction (logistic regression, complicated ML models, etc.). A strong candidate can talk about this problem for at least 15 or 20 minutes.
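The pseudo-count estimate and its Bayesian reading can be checked numerically. The click counts below are invented for illustration:

```r
clicks      <- c(0, 1, 12)            # hypothetical clicks for three ads
impressions <- c(1000, 1000, 1000)

mle    <- clicks / impressions              # maximum likelihood estimate
shrunk <- (clicks + 2) / (impressions + 4)  # pseudo-count estimate

# With a Beta(2, 2) prior, the posterior is Beta(2 + clicks, 2 + misses),
# and its mean is alpha / (alpha + beta).
alpha <- 2 + clicks
beta  <- 2 + (impressions - clicks)
posterior_mean <- alpha / (alpha + beta)
```

The ad with zero clicks gets an MLE of exactly 0 but a small positive shrunk estimate, and the shrunk estimate matches the posterior mean.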

Q: Suppose you run a regression with 10 variables and 1 is significant at the 95% level. Suppose you then find 10% of the data had been left out randomly and had their y values deleted. How would you predict their y values?

A: I would be very careful about doing this unless it's sensationally predictive. If one generates 10 variables of random noise and regresses them against white noise, there is a ~40% chance at least one will be significant at a 95% confidence level. This question helps me understand if the individual understands regression. I also usually ask about regression diagnostics and assumptions.
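The ~40% figure follows from a one-line calculation, assuming the 10 significance tests are independent:

```r
alpha  <- 0.05
n_vars <- 10

# P(at least one false positive) = 1 - P(no false positives)
p_any <- 1 - (1 - alpha)^n_vars

# Same number via the binomial distribution.
p_any_binom <- 1 - dbinom(0, size = n_vars, prob = alpha)

print(round(p_any, 3))
```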

Q: Suppose you have the option to go into one of two bank branches. Branch one has 10 tellers, each with a separate queue of 10 customers, and branch two has 10 tellers, sharing one queue of 100 customers. Which do you choose?

A: This question establishes familiarity with a wide range of basic stat concepts: mean, variance, waiting times, central limit theorem, and the ability to model and then analyze a real world situation. Both options have the same mean wait time. The latter option has smaller variance, because you are averaging the wait times of 100 individuals before you rather than 10. One can fairly argue about utility functions and the merits of risk seeking behavior over risk averse behavior, but I’d go for same mean with smaller variance (think about how maddening it is when another line at the grocery store is faster than your own).

Q: Explain how Random Forests differs from a normal regression tree.

A: This question establishes familiarity with two popular ML algorithms. “Normal” regression trees have some splitting rule based on decrease in mean squared error or some other measure of error or misclassification. The tree grows until the next split decreases error by less than some threshold. This often leads to overfitting, and trees fit on data sets with large numbers of variables can completely leave out many variables from the data set. Random Forests are an ensemble of fully grown trees. For each tree, a subsample of the variables and a bootstrap sample of the data are taken, the tree is fit, and the trees' predictions are then averaged together. Generally this prevents overfitting and allows all variables to “shine”. If the candidate is familiar with Random Forests, they should also know about partial dependence plots and variable importance plots. I generally ask this question of candidates that I fear may not be up to speed with modern techniques. Some implementations do not grow trees fully, but the original implementation of Random Forests does.

Bad Interview Questions

The following are probability and intro stat questions that are not appropriate for a data scientist or statistician role. They should have learned this in intro statistics. These would be like asking an engineering candidate the complexity of binary search (O(log n)).

Q: Suppose you are playing a dice game; you roll a single die, then are given the option to re-roll a single time after observing the outcome. What is the expected value of the dice roll?

A: The expected value of a dice roll is 3.5 = (1+2+3+4+5+6)/6, so you should opt to re-roll only if the initial roll is a 1, 2, or 3. If you re-roll (which occurs with probability .5), the expected value of that roll is 3.5, so the expected value is:

4 * 1/6 + 5 * 1/6 + 6 * 1/6 + 3.5 * .5 = 4.25
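The same arithmetic in R, as a short sketch: keep the first roll when it beats the expected value of a fresh roll, otherwise take the re-roll.

```r
faces <- 1:6
single_roll <- mean(faces)   # expected value of one roll

# For each possible first roll, the player gets the better of
# "keep this face" and "re-roll, worth single_roll on average".
value_with_reroll <- mean(pmax(faces, single_roll))

print(value_with_reroll)
```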

Q: Suppose you have two variables, X and Y, each with standard deviation 1. Then, X + Y has standard deviation 2. If instead X and Y had standard deviations of 3 and 4, what is the standard deviation of X + Y?

A: Variances are additive (for independent variables), not standard deviations. The first example was a trick! Assuming X and Y are independent, sd(X+Y) = sqrt(Var(X+Y)) = sqrt(Var(X) + Var(Y)) = sqrt(sd(X)*sd(X) + sd(Y)*sd(Y)) = sqrt(3*3 + 4*4) = 5.
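A sketch of the calculation, extended with the general formula Var(X+Y) = Var(X) + Var(Y) + 2*rho*sd(X)*sd(Y) so the effect of correlation is visible. The function name sd_of_sum is made up for this example:

```r
sd_of_sum <- function(sd_x, sd_y, rho = 0) {
  # rho is the correlation between X and Y; rho = 0 covers the independent case.
  sqrt(sd_x^2 + sd_y^2 + 2 * rho * sd_x * sd_y)
}

independent   <- sd_of_sum(3, 4)            # the answer above
fully_aligned <- sd_of_sum(3, 4, rho = 1)   # standard deviations add only when rho = 1
trick_case    <- sd_of_sum(1, 1, rho = 1)   # the "sd 2" claim needs perfect correlation
```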

1) Explain what R is.

R is data analysis software which is used by analysts, quants, statisticians, data scientists and others.

2) List some of the functionality that R provides.

The functionality that R provides includes

• Mean
• Median
• Distribution
• Covariance
• Regression
• Non-linear
• Mixed Effects
• GLM
• GAM, etc.

3) Explain how you can start the R Commander GUI.

Typing the command library("Rcmdr") into the R console starts the R Commander GUI.

4) How can you import data in R?

You can use R Commander to import data in R, and there are three ways to enter data into it

• You can enter data directly via Data > New Data Set
• Import data from a plain text (ASCII) file or from other files (SPSS, Minitab, etc.)
• Read a data set either by typing the name of the data set or selecting the data set in the dialog box

5) Mention what the R language does not do.

• Although R can easily connect to a DBMS, it is not a database
• R itself does not provide a point-and-click graphical user interface (add-ons such as R Commander supply one)
• Although it connects to Excel/MS Office easily, the R language does not provide any spreadsheet view of data

6) Explain how R comments are written.

In R, anywhere in the program you can write a comment by prefacing it with a # sign; everything after the # on that line is ignored. For example

• # subtraction
• # division
• # note order of operations exists

7) How can you save your data in R?

There are many ways to save data in R, but the easiest way in R Commander is

Go to Data > Active Data Set > Export Active Data Set. A dialog box will appear; when you click OK, it lets you save your data in the usual way.

8) Mention how you can produce correlations and covariances.

You can use the cor() function to produce correlations and the cov() function to produce covariances.
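A minimal sketch with made-up data where the answer is easy to verify by hand, since y is an exact linear function of x:

```r
x <- 1:5
y <- 2 * x + 1

correlation <- cor(x, y)   # perfectly linear, so the correlation is 1
covariance  <- cov(x, y)   # equals 2 * var(x), because y = 2x + 1
```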

9) Explain what t-tests are in R.

In R, the t.test() function produces a variety of t-tests. The t-test is one of the most common tests in statistics and is used to determine whether the means of two groups are equal to each other.
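A short sketch on two invented samples. Note that with two samples t.test() defaults to the Welch test, which does not assume equal variances:

```r
group_a <- c(5.1, 4.9, 5.6, 5.8, 6.0)   # made-up measurements
group_b <- c(6.2, 6.8, 7.1, 6.5, 7.4)

res <- t.test(group_a, group_b)

test_used <- res$method     # which variant was run
p_value   <- res$p.value    # small values suggest the group means differ
```

Passing var.equal = TRUE gives the classic pooled-variance test, and paired = TRUE gives a paired test.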

10) Explain what the with() and by() functions in R are used for.

• The with() function is similar to DATA in SAS; it evaluates an expression using the columns of a dataset.
• The by() function applies a function to each level of a factor or factors. It is similar to BY processing in SAS.
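A small sketch of both on a toy data frame (the column names are made up):

```r
df <- data.frame(group = c("a", "a", "b"), value = c(1, 3, 10))

# with(): refer to columns by name inside the expression.
overall_mean <- with(df, mean(value))

# by(): split the rows by group and apply a function to each piece.
group_means <- by(df, df$group, function(piece) mean(piece$value))

print(group_means)
```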

11) What are the data structures in R that are used to perform statistical analyses and create graphs?

R has data structures like

• Vectors
• Matrices
• Arrays
• Data frames

12) Explain the general format of matrices in R.

The general format is
mymatrix <- matrix(vector, nrow = r, ncol = c, byrow = FALSE,
  dimnames = list(char_vector_rownames, char_vector_colnames))
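A concrete instance of that template, with illustrative row and column names:

```r
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

# byrow = TRUE fills across rows, so the second row starts at 4.
second_row_first_col <- m["r2", "c1"]
print(m)
```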

13) How are missing values represented in R?

In R, missing values are represented by NA (Not Available), while impossible values are represented by the symbol NaN (not a number).
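The two behave slightly differently under the standard checks, as this sketch shows:

```r
x <- c(1, NA, 0 / 0)   # 0 / 0 produces NaN

missing_flags <- is.na(x)    # TRUE for both NA and NaN
nan_flags     <- is.nan(x)   # TRUE only for NaN

# na.rm = TRUE drops both kinds before computing.
clean_mean <- mean(x, na.rm = TRUE)
```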

14) Explain what transpose is.

R provides various methods for re-shaping data before analysis, and transposing is the simplest of them. To transpose a matrix or a data frame, the t() function is used.

15) Explain how data is aggregated in R.

Data in R is easily collapsed using one or more BY variables. When using the aggregate() function, the BY variables must be supplied as a list.
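A minimal sketch on a toy data frame. Note the by argument is a list, and naming its element names the grouping column in the output:

```r
df <- data.frame(group = c("a", "a", "b"), value = c(1, 3, 10))

agg <- aggregate(df$value, by = list(group = df$group), FUN = mean)

# The result has one row per group; the aggregated column is called x by default.
print(agg)
```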

16) What is the function used for adding datasets in R?

The rbind() function can be used to join two data frames (datasets) by stacking their rows. The two data frames must have the same variables, but they do not have to be in the same order.
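A short sketch showing that columns are matched by name, not position (the data is invented):

```r
df1 <- data.frame(a = 1:2, b = c("x", "y"), stringsAsFactors = FALSE)
df2 <- data.frame(b = "z", a = 3L, stringsAsFactors = FALSE)   # same columns, different order

stacked <- rbind(df1, df2)   # the result follows the column order of df1
print(stacked)
```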

17) What is the use of the subset() function and the sample() function in R?

In R, the subset() function helps you select variables and observations, while the sample() function lets you choose a random sample of size n from a dataset.
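Both in one sketch, on a made-up data frame. The seed is set only so the random draw is repeatable:

```r
df <- data.frame(id = 1:10, score = seq(10, 100, by = 10))

# subset(): filter rows with a condition and pick columns with select.
high_scores <- subset(df, score > 50, select = c(id, score))

# sample(): draw 3 row indices at random, then use them to pick rows.
set.seed(42)
picked <- df[sample(nrow(df), 3), ]
```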

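18) Explain how you can create a table in R without an external file.

Use the code
myTable <- data.frame()
myTable <- edit(myTable)

This code will open an Excel-like spreadsheet where you can easily enter your data. The result of edit() is assigned back because edit() returns the edited copy rather than modifying myTable in place.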

Monday, October 5, 2015

R interview questions