# Data Scientist Interview Questions in United States

Data scientist interview questions shared by candidates

## Top Interview Questions

### Data Scientist at Facebook was asked...

You're about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that "Yes" it is raining. What is the probability that it's actually raining in Seattle?

33 Answers

Bayesian stats: you should estimate the prior probability that it's raining on any given day in Seattle. If you mention this or ask, the interviewer will tell you to use 25%. Then it's straightforward:

    P(raining | Yes,Yes,Yes) = Prior(raining) * P(Yes,Yes,Yes | raining) / P(Yes,Yes,Yes)

    P(Yes,Yes,Yes) = P(raining) * P(Yes,Yes,Yes | raining) + P(not-raining) * P(Yes,Yes,Yes | not-raining)
                   = 0.25*(2/3)^3 + 0.75*(1/3)^3
                   = 0.25*(8/27) + 0.75*(1/27)

    P(raining | Yes,Yes,Yes) = 0.25*(8/27) / (0.25*8/27 + 0.75*1/27)

Bonus points if you notice that you don't need a calculator, since all the 27's cancel out and you can multiply top and bottom by 4.

    P(raining | Yes,Yes,Yes) = 8 / (8 + 3) = 8/11

But honestly, you're going to Seattle, so the answer should always be: "YES, I'm bringing an umbrella!" (yeah yeah, unless your friends mess with you ALL the time ;)

I thought about this a little differently, from a non-Bayes perspective. It's raining if any ONE of the friends is telling the truth, because if they are telling the truth then it is raining. If all of them are lying, then it isn't raining, because they told you that it was raining. So what you want is the probability that any one person is telling the truth, which is simply 1 - Pr(all lie) = 26/27. Anyone let me know if I'm wrong here!

Here's another perspective on how to answer a question like this: Bring an umbrella. It's Seattle. If it's not raining right now, it probably will be by the time you get there.
I flagged Nub data scientist's answer as useful, because it shows an interesting flaw in reasoning. The 3 random variables are not to be treated as intrinsically independent. Only conditioned on the truth (raining/not raining) are they independent.

Isn't the answer 2/3? The key thing is that they are ALL saying "Yes". You can't have all 3 say yes and have some people lying and some people telling the truth. It either is raining or it isn't. Not both. They either are all lying or all telling the truth. Since they are all in agreement (all lying or all truthful), they are essentially voting as one person. What is the probability that one person is telling the truth? 2/3

Answer from a frequentist perspective: Suppose there was one person. P(YES|raining) is twice (2/3 / 1/3) as likely as P(LIE|notraining), so the P(raining) is 2/3. If instead n people all say YES, then they are either all telling the truth, or all lying. The outcome that they are all telling the truth is (2/3)^n / (1/3)^n = 2^n times as likely as the outcome that they are not. Thus P(raining | ALL YES) = 2^n / (2^n + 1) = 8/9 for n=3. Notice that this corresponds exactly to the Bayesian answer when prior(raining) = 1/2.

I'm not sure why it's not just as simple as this: All three friends say it is raining. Each friend has prob. 1/3 of lying. Since the friends all say the same thing, they are either all telling the truth or all lying. The question asks what is the probability that it is raining. This is equivalent to asking, what is the probability that all three friends are telling the truth. And that is equivalent to asking, what is the probability that not one of them is lying. Since the friends were asked independently, this should equal 1 - (1/3 * 1/3 * 1/3) = 0.962. Ah. Looks like my answer agrees with "nub data scientist". What is the probability that both he and I are wrong?
:-)

TLP and nub data scientists, your answers include possibilities which are not feasible; we cannot have any combination of 2/3 and 1/3 together... what about (2/3)^3?

I agree with TLP and nub scientist. For me, the question is really (1 - the odds that all three of your friends are lying to you). Clearly 1 - 1/3 * 1/3 * 1/3. It's convenient that they all gave the same answer, otherwise it would be more difficult.

Let Y denote rain, N denote no rain. Probability of each set of answers given the actual weather:

| Actual | YYY | NNN | YYN | YNN |
|--------|------|------|-------|-------|
| Y | 8/27 | 1/27 | 12/27 | 6/27 |
| N | 1/27 | 8/27 | 6/27 | 12/27 |

So, P(Y|YYY) = 8/(8+1) = 8/9

The probability of raining is that they are all telling the truth, therefore, (2/3)^3.

    P(rain | yes yes yes) = (2/3)^3 / ((2/3)^3 + (1/3)^3) = (8/27) / ((8/27) + (1/27)) = 8 / (8 + 1) = 8/9

26/27 is incorrect. That is the number of times that at least one friend would tell you the truth (i.e., 1 - probability that they would all lie: 1/27). What you have to figure out is the odds of it raining | (i.e., given) all 3 friends told you the same thing. Because they all say the same thing, they must all either be lying or they must all be telling the truth. What are the odds that they would all lie and that they would all tell the truth? In 1/27 times, they would all lie, and in 8/27 times they would all tell the truth. So there are 9 ways in which all your friends would tell you the same thing. And in 8 of them (8 out of 9) they would be telling you the truth.

There is an obvious conceptual reason as to why several answers here (ones that don't use Bayes' formula) are incorrect. The probability in question has to depend on the probability of rain in Seattle. If, for the sake of discussion, it ALWAYS rains in Seattle, i.e. P(rain)=1, then the required prob. is always 1 as well. Likewise if it's a place where it never rains, or if the question asks about the prob. of it raining elephants given the 3 friends said yes, it'd still be 0.
I believe this is a std. textbook example of Bayes' formula; anything short of that I don't think will work out. Please correct me if incorrect.

But I would just prefer to condition. Either they are all telling the truth and it is raining, or they are all lying and it is not raining.

    P(rain) = P(rain|truth,truth,truth)*P(truth,truth,truth) + P(rain|lie,lie,lie)*P(lie,lie,lie)

Notice that truth does not mean yes it is raining, it simply corresponds to them telling the truth. Since they said yes, IF they were lying and we knew they were lying then the probability of rain would be zero, thus eliminating the second term.

    P(rain) = P(rain|3xtruth)*P(3xtruth)

and the probability of the truth is (2/3)^3 and the probability of rain if they are telling the truth is 1.

I did a little skipping of steps, since truth doesn't equal yes, but I just sort of meshed it together towards the end.

    YES = yes,yes,yes
    T = truth,truth,truth
    L = lie,lie,lie

    P(Rain|YES) = P(Rain|YES,T)*P(T) + P(Rain|YES,L)*P(L)
    P(Rain|YES,L) = 0   ==> what's the probability of rain given we know that they are lying and they've told us it is raining
    P(Rain|YES) = P(Rain|YES,T)*P(T)
    P(Rain|YES,T) = 1   ==> what's the probability of it raining given that they are telling the truth and have told us it's raining
    then P(T) = (2/3)^3

It's obvious. Why in the world would I do Bayesian methods when it's certain?

I think the first answer is incorrect.
The basic flaw is that it is assumed that all three friends lie together or are honest together, so it does not take the cases of Yes.no.Yes or Yes.Yes.no ...etc. For the correct answer we need to update the posterior probability after each yes, so assuming a prior probability P(raining) = 0.75:

    P(raining | yes)         = (2/3)*0.75 / ((2/3)*0.75 + (1/3)*0.25)          = 6/7
    P(raining | yes,yes)     = (6/7)*(2/3) / ((6/7)*(2/3) + (1/7)*(1/3))       = 12/13
    P(raining | yes,yes,yes) = (12/13)*(2/3) / ((12/13)*(2/3) + (1/13)*(1/3))  = 24/25

I don't see the interview saying that all friends are sitting together, so they are independent, which means they can lie separately.

I agree with (2/3)^3.

Interview Candidate solves this problem using Bayesian stats despite the fact that not enough information is given to do Bayesian probability analysis, i.e. he had to pull the probability of it raining in Seattle out of thin air when it was not given in the interview question. With only the information from the interview question, we have to assume that friends are either all lying or all telling the truth. Let truth=T and lie=L. P(TTT)=8/27, P(LLL)=1/27, P(TLL)=2/27, P(TTL)=4/27. But we know that they all had the same answer, so we must compare P(TTT) to P(LLL). P(TTT) is 8 times more likely than P(LLL), so we have P(TTT|All same answers)=8/9, P(LLL|All same answers)=1/9. Therefore the solution given ONLY THE INFORMATION GIVEN is P(Rain)=8/9, P(Dry)=1/9.

This problem requires the marginal probability of rain to solve, following Interview Candidate's answer. M.B. provides the rationale behind why the Bayes approach is necessary: if the pr(rain) = 0, then the pr(rain|y, y, y) = 0 (maybe it is July in Seattle). A few conceptual problems in many answers that I want to point out:

1) There is lots of conflation between Pr(truth) and Pr(Y). Pr(truth) = Pr(Y|R) does not equal Pr(Y).
2) Consider there is only a single friend and they say yes. The logical conclusion from a lot of these answers is that Pr(Rain|Yes) = Pr(Yes|Rain) = 2/3, which is not correct. Bayes' rule is very clear in this simpler case.

3) The friends' answers are conditionally independent assuming no collusion. The combinations of their honesty/lying add no additional information. The marginal probabilities are not independent: Pr(y,y,y) does not equal pr(y)^3, it equals pr(y,y,y,rain) + pr(y,y,y,no rain), the integration of the joint space over rain. Using conditional independence and Bayes' rule, this becomes: pr(y|rain)^3*pr(rain) + pr(y|no rain)^3*(1-pr(rain)).

A more general solution using Pr(rain) = r:

    Pr(rain|y,y,y) = Pr(y,y,y|rain)*pr(rain)/pr(y,y,y)                        # Bayes' formula
    pr(y,y,y|rain) = pr(y|rain)^3 = (2/3)^3                                   # conditional independence
    pr(y,y,y) = pr(y|rain)^3*pr(rain) + pr(y|no rain)^3*pr(no rain)           # by definition, see point 3

    the answer: r*(2/3)^3 / [r*(2/3)^3 + (1 - r)*(1/3)^3]

It should be (2/3)^3, I think zen and todo are correct.

As a big dumb animal, I have to write out a probability tree and think about this simply. You only have 2 scenarios where all three say it is raining (all three are telling the truth and it's raining, OR all three are lying and it's not raining). Assume the probability of rain is 0.5 for simplicity.

    P(Rain and YYY)     = 1/2 * 2/3 * 2/3 * 2/3 = 8/54
    P(Not Rain and YYY) = 1/2 * 1/3 * 1/3 * 1/3 = 1/54

    Thus P(Rain | YYY) = P(Rain and YYY) / [P(Rain and YYY) + P(Not Rain and YYY)] = 8 / (8+1) = 8/9

I know it isn't the most mathematically rigorous or syntactically correct solution, but I'd bet a pretty penny that the answer is 8/9 with the following assumptions (P(rain) = 0.5 and naive Bayes, i.e. friends didn't collaborate).

Most of the answers/comments made all unconditional assumptions, except a few reasonings that lead to the 8/9 probability. Note that the question states that "Each of your friends has a 2/3 chance of telling you the truth".
This essentially means P(raining, yes) + P(non-raining, no) = 2/3. Any attempts to interpret this as conditional probability P(raining | yes) = 2/3 or P(yes | raining) = 2/3 are making other assumptions.

8/27 is not the answer. For the weather to be nice in this case, all 3 of your friends NEED to have lied to you. Therefore the odds are 1/27.

It's really shocking to see how many people post incorrect answers here with such confidence. That said, Bayes' rule is somewhat counterintuitive if you're not familiar with probability theory. Let P(y|r) = prob of each yes given raining = 2/3, P(y|n) = prob yes given not raining = 1/3. Let P(r) = probability of rain = 1/4 given the prior knowledge. P(n) = probability of no rain = 3/4.

    P(r | y^3) = ( P(y^3 | r) P(r) ) / ( P(y^3 | r) P(r) + P(y^3 | n) P(n) )
               = ( P(y | r)^3 P(r) ) / ( P(y | r)^3 P(r) + P(y | n)^3 P(n) )
               = ( (2/3)^3 (1/4) ) / ( (2/3)^3 (1/4) + (1/3)^3 (3/4) )
               = (2/27) / ( (2/27) + (.75/27) )
               = 2/2.75
               = 8/11

What if the answer is 50%, since the chance of rain and not rain does not depend on what your friends tell you?

In the absence of further information, the only correct answer is that the posterior probability of rain p is in the interval (0, 1). In the absence of further information any prior is as good as any other, so by implication the posterior can take any value as well. The interval for p can be restricted to [0, 1] on the assumption that the question to the friends would not be posed if the prior is absolute certainty whether it will rain or not. With the further assumption that the prior probability is measured with limited precision (e.g. rounded to a percentage point), the posterior would be in the interval (0.075, 1). If the alternative assumption is made that information from the friends will be requested only if it had any chance to move the posterior below or above 0.5, the posterior interval for the probability is (0.5, 1).
Any more precise answer than that requires further information about the prior, which is not supplied in the original problem formulation. Also note that even a precise answer about the probability of rain is not sufficient to answer the question whether an umbrella should be brought or not.

Assume probability of raining in Seattle P(R) = 1/4. Assume friend says Y 50% of the time (theoretical probability): P(Y) = 1/2.

    Probability of friend saying yes given it's raining:    P(Y/R) = 2/3
    Probability of 3 friends saying yes given it's raining: P(YYY/R) = 8/27
    Probability of 3 friends saying yes:                    P(YYY) = 1/8

    P(R/YYY) * P(YYY) = P(YYY/R) * P(R)
    P(R/YYY) = 8/27 * 1/4 / (1/8) = 16/27 (about 59%)

A posterior probability of 59% given 3 yes and a prior probability of 25% sounds reasonable to me.

The probability of each of the friends saying "YES" is 2/3 * 2/3 * 2/3 = 8/27. Now the probability that it is actually raining in Seattle depends on how I select them to phone. There are only three ways to select and phone them. So, the probability that it is actually raining in Seattle is 3 * (8/27) = 8/9.
Probability that it is raining given that all 3 of them said "yes" = P(AT LEAST one of them is telling the truth) = P(exactly 1 of them telling the truth) + P(2 of them telling the truth) + P(all 3 of them telling the truth)

    P(exactly 1 of them telling the truth) = P(1st person telling truth) * P(2nd person telling lie) * P(3rd person telling lie)
                                           = (2/3) * (1/3) * (1/3) = 2/27
    P(exactly 2 of them telling the truth) = P(1st person telling truth) * P(2nd person telling truth) * P(3rd person telling lie)
                                           = (2/3) * (2/3) * (1/3) = 4/27
    P(exactly 3 of them telling the truth) = P(1st person telling truth) * P(2nd person telling truth) * P(3rd person telling truth)
                                           = (2/3) * (2/3) * (2/3) = 8/27

ANSWER: Probability that it is raining given that all 3 of them said "yes" = P(AT LEAST one of them is telling the truth) = (2/27) + (4/27) + (8/27) = 14/27

Rule of conditional probability states P(A|B) = P(A & B) / P(B). Reformulating to this case:

    P(Rain | 3Y) = P(R & 3Y) / P(3Y)
    P(R & 3Y) = (2/3)^3 = 8/27          (if it is raining, then they must all speak the truth; one could multiply by the probability of rain here, I assumed it as prior)
    P(3Y) = all truth or all lie = (2/3)^3 + (1/3)^3 = 9/27
    hence P(R | 3Y) = 8/9

Let X be the probability it's raining. Obviously we want P(X|all three say yes). Now let Y be the probability at least one of them is lying. If Y = 0 it's easy to solve, if not then not so easy. Now you keep going.

Obviously, Bayesian is the way to go...
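The Bayesian answers in this thread differ only in the prior they plug in, so the general formula r*(2/3)^3 / [r*(2/3)^3 + (1 - r)*(1/3)^3] can be checked with a few lines of Python. This is a minimal sketch that assumes the friends answer independently given the weather; the function name `posterior_rain` and its parameters are illustrative and do not come from any of the answers.

```python
from fractions import Fraction


def posterior_rain(prior, n_yes=3, p_truth=Fraction(2, 3)):
    """P(rain | n_yes friends all say yes), by Bayes' rule.

    Assumes each friend tells the truth with probability p_truth,
    independently of the others once the weather is fixed.
    """
    prior = Fraction(prior)
    like_rain = p_truth ** n_yes        # raining: every "yes" is the truth
    like_dry = (1 - p_truth) ** n_yes   # not raining: every "yes" is a lie
    evidence = prior * like_rain + (1 - prior) * like_dry
    return prior * like_rain / evidence


# The three priors that appear in the answers: 25%, 50% and 75%.
for prior in (Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)):
    print(prior, "->", posterior_rain(prior))
```

With exact fractions, the priors 1/4, 1/2 and 3/4 give 8/11, 8/9 and 24/25, matching the worked answers that use those priors. A prior of 0 gives a posterior of 0, which is the point made about the result having to depend on P(rain).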

### Data Scientist Intern at LinkedIn was asked...

Find the second largest element in a Binary Search Tree

16 Answers

Find the rightmost element. If this is a right node with no children, return its parent. If it is not, return the largest element of its left child.

One addition is the situation where the tree has no right branch (root is largest). In this special case, it does not have a parent. So it's better to keep track of parent and current pointers: if different, the original method by the candidate works well; if the same (which means the root situation), find the largest of its left branch.

    if (root == null || !root.hasRightChild()) {
        return null;
    } else {
        return findSecondGreatest(root, root.getValue());
    }

    value findSecondGreatest(Node curr, value oldValue) {
        if (curr.hasRightChild()) {
            return findSecondGreatest(curr.getRightChild(), curr.value);
        } else {
            return oldValue;
        }
    }

Above answer is wrong. It has to be something like this.

    public static Node findSecondLargest(Node node) {
        Node secondLargest = null;
        Node parent = null;
        Node child = node;
        if (node != null && (node.hasLeftChild() || node.hasRightChild())) {
            if (node.hasRightChild()) {
                while (child.hasRightChild()) {
                    parent = child;
                    child = child.rightChild();
                }
                secondLargest = parent;
            } else if (node.hasLeftChild()) {
                child = node.leftChild();
                while (child.hasRightChild()) {
                    child = child.rightChild();
                }
                secondLargest = child;
            }
        }
        return secondLargest;
    }

The above answer is also wrong;

    Node findSecondLargest(Node root) {
        // If tree is null or is single node only, return null (no second largest)
        if (root == null || (root.left == null && root.right == null))
            return null;
        Node parent = null, child = root;
        // find the rightmost child
        while (child.right != null) {
            parent = child;
            child = child.right;
        }
        // if the rightmost child has no left child, then its parent is second largest
        if (child.left == null)
            return parent;
        // otherwise, return left child's rightmost child as second largest
        child = child.left;
        while (child.right != null)
            child = child.right;
        return child;
    }

Soln by "mindpower" works. Thank you. I am trying to solve a similar problem: find the 2nd nearest higher value (in in-order traversal) for a given node. E.g., given nums 12 7 14 3, construct a BST. If the given value is 7, then we should return 14 (in the sort order: 3, 7, 12, 14). If the given value is 3, then we should return 12 (in the sort order: 3, 7, 12, 14).

Generic solution in C# for any k. Notice that this example can be easily changed to find the k-th smallest node by doing a depth-first recursion on root.Left first, and then a tail recursion on root.Right.

    public Node GetKthLargest(int k) {
        return GetKthLargest(ref k, this.Root);
    }

    Node GetKthLargest(ref int k, Node root) {
        if (root == null || k < 1)
            return null;
        var node = GetKthLargest(ref k, root.Right);
        if (node != null)
            return node;
        if (--k == 0)
            return root;
        return GetKthLargest(ref k, root.Left);
    }

Recursion is not needed.

    Node SecondLargest(Node root, Node secondLarge) {
        if (root.right == null)
            return root.left;
        Node secondLargest = root;
        // walk right until the next node is the rightmost one
        while (secondLargest.right.right != null)
            secondLargest = secondLargest.right;
        return secondLargest;
    }

Another one, in C:

    int getmax(node *root) {
        if (root->right == NULL) {
            return root->d;
        }
        return getmax(root->right);
    }

    int secondmax(node *root) {
        if (root == NULL) {
            return -1;
        }
        if (root->right == NULL && root->left != NULL) {
            return getmax(root->left);
        }
        if (root->right != NULL) {
            if (root->right->right == NULL && root->right->left == NULL) {
                return root->d;
            }
        }
        return secondmax(root->right);
    }

In-order traverse the tree. The second last element in the array is the answer.

In Python:

    def find_second_largest_bst_element(root, parent=None):
        if parent is None:
            # BST root
            if root.right is None:  # no right subtree
                if root.left is not None:  # if a left subtree exists...
                    return root.left
                else:  # root is the only element of the BST
                    return False
        else:
            if root.right is None:  # right-most element
                if root.left is not None:  # left subtree exists
                    return root.left
                else:  # leaf
                    return parent
        # check right subtree
        return find_second_largest_bst_element(root.right, root)

    find_second_largest_bst_element(root)

For kth smallest, descend the left subtree first.

    class Node:
        def __init__(self, value, left=None, right=None):
            self.value = value
            self.left = left
            self.right = right

    def findKthLargest(root, k):
        global count
        if root is None:
            return
        findKthLargest(root.right, k)
        count += 1
        if count == k:
            print(root.value)
            return
        findKthLargest(root.left, k)

    count = 0
    r = Node(10, Node(5, Node(2), Node(7)), Node(30, Node(22), Node(32)))
    findKthLargest(r, 3)

Another solution, in Java:

    // solution in java
    // main routine
    Node findSecondMax(Node root) {
        if (root == null || (root.left == null && root.right == null))
            return null;
        else {
            Node max = findMax(root);
            return (max.parent == null) ? findMax(max.left) : max.parent;
        }
    }

    // helper routine, recursive implementation.... can also be done non-recursively
    Node findMax(Node root) {
        return (root.right == null) ? root : findMax(root.right);
    }

Find the largest number in the binary tree and delete it. And again find the largest number. Short and fast.

Reverse in-order traversal of the BST, keeping a count of # of visited nodes. This method works great to return the kth largest element in a BST.

mindpower's solution looks right
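The approach most replies converge on (walk to the rightmost node while remembering its parent, then use the rightmost node of its left subtree if it has one) can be written iteratively in Python. This is a sketch under the assumption that all values in the tree are distinct; the `Node` class and the name `second_largest` are illustrative.

```python
class Node:
    """Minimal BST node, used only for this illustration."""

    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def second_largest(root):
    """Return the second-largest value, or None if the tree has fewer than two nodes."""
    if root is None or (root.left is None and root.right is None):
        return None

    # Walk to the largest node (the rightmost one), remembering its parent.
    parent, node = None, root
    while node.right is not None:
        parent, node = node, node.right

    # No left subtree under the largest node: its parent is the answer.
    if node.left is None:
        return parent.value

    # Otherwise the answer is the largest node of that left subtree.
    node = node.left
    while node.right is not None:
        node = node.right
    return node.value


tree = Node(10, Node(5, Node(2), Node(7)), Node(30, Node(22), Node(32)))
print(second_largest(tree))
```

The walk touches at most one root-to-leaf path plus one more descent, so it runs in time proportional to the tree height and needs no extra storage.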

### Data Scientist at Facebook was asked...

Write a function that takes in two sorted lists and outputs a sorted list that is their union.

10 Answers

    f(a,b) { return sort(unique(a,b)) }

Another version:

    def sortedUnion(list1, list2):
        list3 = list1 + list2
        return sorted(list(set(list3)))

google merge sort

Write 2 helpers: 1) INSERT(A, b) = put element b within A in the sort order, 2) DEL(A, a) = delete element a from A. Then do this recursion:

    f(A, B):
        if max(A) <= min(B)
            return [A B]
        else {
            B = INSERT(B, max(A));
            A = DEL(A, max(A));
            f(A, B);
        }

Something like that. Try coding and testing. I haven't.

Oops, check/write a termination condition

In Python, you could do:

    def merge_sort(a, b):
        return sorted(set(a).union(set(b)))

Or:

    def sorted_union(list1, list2):
        union = set(list1).union(set(list2))
        sorted_union = sorted(list(union))
        return sorted_union

Second part of merge sort. Don't answer with sort(a), etc. Anyone can do that...

    def merge(A, B):
        i = 0
        j = 0
        sorted_list = []
        while i < len(A) and j < len(B):
            if A[i] <= B[j]:
                sorted_list.append(A[i])
                i += 1
            else:
                sorted_list.append(B[j])
                j += 1
        if i < len(A):
            sorted_list.extend(A[i:])
        elif j < len(B):
            sorted_list.extend(B[j:])
        return sorted_list

I assumed that we cannot use any "sort" function and we want it in linear time. So here it is:

    def my_sort(list_a, list_b):
        if len(list_a) == 0:
            return list_b
        elif len(list_b) == 0:
            return list_a
        else:
            if list_a[-1] > list_b[-1]:
                return my_sort(list_a[:-1], list_b) + [list_a[-1]]
            else:
                return my_sort(list_a, list_b[:-1]) + [list_b[-1]]

In SQL:

    SELECT List1 FROM Table1
    UNION
    SELECT List2 FROM Table2
    ORDER BY List1;
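Several replies either sort from scratch or merge without removing duplicates. A linear-time version that does both in one pass might look like the sketch below, assuming both inputs are sorted in ascending order; the name `sorted_union` here is an illustrative choice for this sketch.

```python
def sorted_union(a, b):
    """Merge two ascending lists into one ascending list without duplicates."""
    result = []
    i = j = 0
    while i < len(a) or j < len(b):
        # Take from a when b is exhausted or a's head is not larger than b's.
        if j >= len(b) or (i < len(a) and a[i] <= b[j]):
            x = a[i]
            i += 1
        else:
            x = b[j]
            j += 1
        # Because the output is sorted, a duplicate can only equal the last item.
        if not result or result[-1] != x:
            result.append(x)
    return result


print(sorted_union([1, 2, 2, 5], [2, 3, 5, 8]))
```

Each element of either list is visited once, so the cost is O(len(a) + len(b)), compared with O(n log n) for the `sorted(set(...))` answers.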

### Data Scientist at LinkedIn was asked...

Generating a sorted vector from two sorted vectors.

3 Answers

Keep two pointers and compare the two numbers they point to. Move the pointer which points to the smaller or equal number. End the loop when both pointers reach the end.

Look at merge in mergesort, does exact same thing.

Merge sort is the best...many languages have this function built in...else this can also be done manually. Assume two vectors A [1,2,3,4] and B [5,6,7,8]...merge them...compare the last value of A and first value of B...in our case 4<5 is true...thus the result...if it is false then move the number up and then compare it with the previous number and so on...
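As one reply notes, many languages ship this merge step. In Python the standard library provides it as `heapq.merge`, which takes already-sorted inputs and keeps duplicates. A short sketch, where the wrapper name `merge_sorted` is illustrative:

```python
import heapq


def merge_sorted(a, b):
    """Merge two ascending sequences into one ascending list, keeping duplicates."""
    # heapq.merge returns a lazy iterator over the merged inputs.
    return list(heapq.merge(a, b))


print(merge_sorted([1, 2, 3, 4], [5, 6, 7, 8]))
print(merge_sorted([1, 4, 9], [2, 4, 10]))
```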

### Data Scientist at Square was asked...

How do you test whether a new credit risk scoring model works? What data would you look at?

2 Answers

I think I did fairly well on the data side, but I think I should have connected this to a model or something. Not fully sure on this one.

One could use the machine learning concept known as cross validation as an element to solve for this case... Assuming that in the development of the model, borrower data has already been broken into several subsets (a training, a validation, and a test set) and part of this data has already been used to fit and tune the model (the training and validation sets), the test set can then be used to provide an unbiased and independent assessment of the model's performance. In this case, we would be interested in comparing the MSEs of the training and test sets, which should be roughly equivalent if the model is good.
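The train-versus-test comparison described in the second answer can be sketched in plain Python. The labels and scores below are made-up numbers for illustration only, and the AUC function is an addition that the answer does not mention: it is a common way to check how well a score ranks defaulters above non-defaulters.

```python
def mse(y_true, y_pred):
    """Mean squared error between 0/1 outcomes and predicted probabilities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Chance that a random defaulter (1) scores above a random non-defaulter (0).

    Ties count as half. Quadratic in the sample size, fine for a small sketch.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = borrower defaulted, score = predicted default probability.
train_y, train_p = [0, 0, 1, 1, 0, 1], [0.1, 0.2, 0.8, 0.7, 0.3, 0.9]
test_y, test_p = [0, 1, 0, 1], [0.2, 0.6, 0.4, 0.9]

print("train MSE:", mse(train_y, train_p), "test MSE:", mse(test_y, test_p))
print("train AUC:", auc(train_y, train_p), "test AUC:", auc(test_y, test_p))
```

A test error far above the training error would suggest overfitting, which is the signal the answer is looking for when it compares the two MSEs.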

### Security Guard at Murray Guard was asked...

Why'd you leave your old job?

1 Answer

Just don't say 'personal reasons'

The interviewer asked details about k-means clustering

1 Answer

Explained the basics of k-means clustering

### Paralegal at Federal Government was asked...

How would you handle a situation in which a co-worker repeatedly makes a similar mistake in his or her work and you are not this person's supervisor? (wording was something like this)

1 Answer

Make sure that there is an actual pattern of a repeated mistake and not just a one-time thing. Ask the person if there is a new process for this task that you haven't been made aware of. Then try to demonstrate that there is a different way of doing the task that you were shown and try to resolve which one is right.

### Data Scientist at zulily was asked...

What is A/B testing?

1 Answer

Central limit theorem, etc.
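The answer only gestures at the central limit theorem. One common way it enters A/B testing is the two-proportion z-test, where the difference in conversion rates between variants is treated as approximately normal for large samples. The sketch below assumes that setup; the function name and the conversion counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal approximation (justified by the CLT): p = 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 100/1000 conversions for A, 150/1000 for B.
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```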

Why Client Services?

1 Answer

I listed the perks of working at C.S.

