First published in 数据应用学院
109 Must-Know Data Scientist Interview Questions

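Preparing for an interview is not easy. No matter how much work experience or technical skill you have, an interviewer can throw you off with questions you did not expect.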


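In data science interviews, interviewers cover a wide range of topics, which demands both strong technical knowledge and good communication skills from the candidate.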

This guide collects the data science interview questions a candidate should expect when interviewing for a data scientist position.


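With this list of data science interview questions, candidates should be able to prepare for tough questions, learn which answers resonate positively with employers, and build confidence for the interview.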


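We divide the data science interview questions into six categories: statistics, programming, modeling, behavioral, culture, and problem-solving.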

Source: springboard.com/blog/da


Data Scientist Interview Questions

Statistics

Programming

General

Big Data

Python

R

SQL


Modeling

Behavioral

Culture Fit

Problem-Solving


Statistics

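Without advanced knowledge of statistics, it is hard to become a data scientist. But explaining statistics to someone with no statistical background is harder than you might think. We suggest memorizing a few simple examples and practicing how to explain them in plain terms to the people around you.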


1 What is the Central Limit Theorem and why is it important?

2 What is sampling? How many sampling methods do you know?

3 What is the difference between Type I vs Type II error?

4 What is linear regression? What do the terms P-value, coefficient, R-Squared value mean? What is the significance of each of these components?

5 What are the assumptions required for linear regression?

There are four major assumptions:

1. There is a linear relationship between the dependent variable and the regressors, meaning the model you are creating actually fits the data,

2. The errors or residuals of the data are normally distributed and independent from each other,

3. There is minimal multicollinearity between explanatory variables, and

4. Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variable.
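
A minimal sketch of how the linearity and homoscedasticity assumptions above might be checked informally, using only the Python standard library. The data, the helper names `fit_line` and `residuals`, and the split-half variance comparison are illustrative assumptions, not a formal diagnostic test.

```python
from statistics import mean, pvariance


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residuals(xs, ys, slope, intercept):
    """Observed value minus fitted value for every data point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Hypothetical data: y = 2x + 1 plus a small alternating disturbance
xs = list(range(10))
noise = [0.5, -0.5] * 5
ys = [2 * x + 1 + e for x, e in zip(xs, noise)]

slope, intercept = fit_line(xs, ys)
res = residuals(xs, ys, slope, intercept)

# Rough homoscedasticity check: the residual spread should be similar
# for low and high values of the predictor (ratio close to 1).
half = len(xs) // 2
var_low = pvariance(res[:half])
var_high = pvariance(res[half:])
ratio = var_high / var_low

print(f"slope={slope:.3f} intercept={intercept:.3f} variance ratio={ratio:.3f}")
```

In practice a residual plot or a formal test would replace the variance ratio; the point here is only that each assumption translates into something you can inspect in the residuals.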


6 What is a statistical interaction?

7 What is selection bias?

8 What is an example of a dataset with a non-Gaussian distribution?

9 What is the Binomial Probability Formula?


Programming

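To test your programming skills, interviewers ask two kinds of questions: how you would solve a programming problem in theory without writing code, and live coding exercises on a whiteboard.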

General

1 With which programming languages and environments are you most comfortable working?

2 What are some pros and cons about your favorite statistical software?

3 Tell me about an original algorithm you’ve created.

4 Describe a data science project in which you worked with a substantial programming component. What did you learn from that experience?

5 Do you contribute to any open source projects?

6 How would you clean a dataset in (insert language here)?

7 Tell me about the coding you did during your last project.


Big Data

1 What are the two main components of the Hadoop Framework?

2 Explain how MapReduce works as simply as possible.

3 How would you sort a large list of numbers?

4 Here is a big dataset. What is your plan for dealing with outliers? How about missing values? How about transformations?


Python

1 What modules/libraries are you most familiar with? What do you like or dislike about them?

2 What are the supported data types in Python?

3 What is the difference between a tuple and a list in Python?


R

1 What are the different types of sorting algorithms available in R language?

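Classic sorting algorithms such as insertion, bubble, and selection sort can be implemented by hand in R. The built-in sort.int() function, which sort() relies on for plain vectors, offers the "shell", "quick", and "radix" methods through its method argument.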
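
Insertion sort, one of the algorithms named in the answer above, is language-independent. A minimal sketch follows, written in Python for illustration rather than in R; the function name `insertion_sort` is hypothetical.

```python
def insertion_sort(values):
    """Return a new ascending list, built by inserting each item into place."""
    result = []
    for v in values:
        i = len(result)
        # Walk left past every element that is larger than v
        while i > 0 and result[i - 1] > v:
            i -= 1
        result.insert(i, v)
    return result


data = [5, 2, 9, 1, 5]
sorted_data = insertion_sort(data)
print(sorted_data)
```

The input list is left unchanged; a whiteboard answer would usually also mention the quadratic worst-case running time.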


2 What are the different data objects in R?

3 What packages are you most familiar with? What do you like or dislike about them?

4 How do you access the element in the 2nd column and 4th row of a matrix named M?

5 What is the command used to store R objects in a file?

6 What is the best way to use Hadoop and R together for analysis?

7 How do you split a continuous variable into different groups/ranks in R?

8 Write a function in R language to replace the missing value in a vector with the mean of that vector.


SQL

1 What is the purpose of the group functions in SQL? Give some examples of group functions.

Group functions are necessary to get summary statistics of a dataset. COUNT, MAX, MIN, AVG, and SUM are all group functions. DISTINCT is a keyword rather than a function, though it is often combined with them, as in COUNT(DISTINCT column).


2 Tell me the difference between an inner join, left join/right join, and union.

3 What does UNION do? What is the difference between UNION and UNION ALL?

4 What is the difference between SQL and MySQL or SQL Server?

5 If a table contains duplicate rows, does a query result display the duplicate values by default? How can you eliminate duplicate rows from a query result?
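
The group functions and the duplicate-row questions above can be tried out with a small sketch using Python's built-in `sqlite3` module. The `orders` table and its contents are hypothetical, and the SQL is written for SQLite, so other database systems may differ in details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 10.0), ("ann", 30.0), ("bob", 5.0), ("bob", 5.0)],
)

# Group functions collapse each group of rows into one summary row
summary = conn.execute(
    "SELECT customer, COUNT(*), MIN(amount), MAX(amount), AVG(amount), SUM(amount) "
    "FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

# A plain SELECT keeps duplicate rows; DISTINCT removes them
all_rows = conn.execute("SELECT customer, amount FROM orders").fetchall()
unique_rows = conn.execute("SELECT DISTINCT customer, amount FROM orders").fetchall()

# UNION removes duplicates from the combined result; UNION ALL keeps them
union_rows = conn.execute(
    "SELECT customer FROM orders UNION SELECT customer FROM orders"
).fetchall()
union_all_rows = conn.execute(
    "SELECT customer FROM orders UNION ALL SELECT customer FROM orders"
).fetchall()
conn.close()

print(summary)
print(len(all_rows), len(unique_rows), len(union_rows), len(union_all_rows))
```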


Modeling

1 Tell me about how you designed the model you created for a past employer or client.

2 What are your favorite data visualization techniques?

3 How would you effectively represent data with 5 dimensions?

4 How is kNN different from k-means clustering?

kNN, or k-nearest neighbors, is a supervised classification algorithm, where k is an integer describing the number of neighboring data points that influence the classification of a given observation. K-means is an unsupervised clustering algorithm, where k is an integer describing the number of clusters to be created from the given data. The two accomplish different tasks.
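
The contrast can be made concrete with a minimal pure-Python sketch. The toy points, the labels, the fixed starting centers, and the function names `knn_predict` and `kmeans` are all illustrative assumptions; real implementations normally pick starting centers at random and check for convergence.

```python
from collections import Counter
from math import dist


def knn_predict(train, labels, point, k):
    """kNN: majority label among the k labeled points nearest to `point`."""
    order = sorted(range(len(train)), key=lambda i: dist(train[i], point))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def kmeans(points, centers, iterations=10):
    """k-means: assign points to the nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            clusters[nearest].append(p)
        # New center = mean of its cluster; an empty cluster keeps its old center
        centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["a", "a", "a", "b", "b", "b"]

# kNN needs the labels; here k is the number of neighbors that vote
predicted = knn_predict(points, labels, (2, 2), k=3)

# k-means ignores the labels; here k is the number of centers (two)
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])

print(predicted, centers)
```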


5 How would you create a logistic regression model?

6 Have you used a time series model? Do you understand cross-correlations with time lags?

7 Explain the 80/20 rule, and tell me about its importance in model validation.

8 Explain what precision and recall are. How do they relate to the ROC curve?

Recall describes what percentage of actual positives are identified as positive by the model. Precision describes what percentage of positive predictions were correct. The ROC curve shows the relationship between model recall and specificity, plotting recall against the false positive rate (1 minus specificity) across classification thresholds. Specificity is the percentage of actual negatives identified as negative by the model. Recall, precision, and the ROC curve are all used to judge how useful a given classification model is.
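
A short worked example may help when explaining these metrics aloud. The labels below are made up, and the helper name `confusion_counts` is hypothetical; the sketch computes one point of the ROC curve (false positive rate, recall) at a single threshold.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical ground truth and model predictions
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)

precision = tp / (tp + fp)      # share of positive predictions that were correct
recall = tp / (tp + fn)         # share of actual positives that were found
specificity = tn / (tn + fp)    # share of actual negatives that were found
false_positive_rate = 1 - specificity  # x-axis of the ROC curve

print(precision, recall, specificity, false_positive_rate)
```

Sweeping the classification threshold and recomputing the last two values at each step traces out the full ROC curve.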


9 Explain the difference between L1 and L2 regularization methods.

10 What is root cause analysis?

11 What are hash table collisions?

12 What is an exact test?

13 In your opinion, which is more important when designing a machine learning model: Model performance? Or model accuracy?

14 What is one way that you would handle an imbalanced dataset that’s being used for prediction? (i.e. vastly more negative classes than positive classes.)

15 How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?

16 I have two models of comparable accuracy and computational performance. Which one should I choose for production and why?

17 How do you deal with sparsity?

18 Is it better to spend 5 days developing a 90% accurate solution, or 10 days for 100% accuracy?

19 What are some situations where a general linear model fails?

20 Do you think 50 small decision trees are better than a large one? Why?

21 When modifying an algorithm, how do you know that your changes are an improvement over not doing anything?

22 Is it better to have too many false positives, or too many false negatives?


Past Behavior

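With these questions, interviewers want to see how a candidate reacted to past situations, how you describe your own role, and what you learned from the experience. Before the interview, write down examples from your work history that relate to these questions to refresh your memory. Being able to tell a concise story that describes your experience in detail is very important.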


1 Tell me about a time when you took initiative.

2 Tell me about a time where you had to overcome a dilemma.

3 Tell me about a time where you resolved a conflict.

4 Tell me about a time you failed, and what you have learned from it.

5 Tell me about (a job on your resume). Why did you choose to do it and what do you like most about it?

6 Tell me about a challenge you have overcome while working on a group project.

7 When you encounter a tedious, boring task, how would you deal with it and motivate yourself to complete it?

8 What have you done in the past to make a client satisfied/happy?

9 What have you done in your previous job that you are really proud of?

10 What do you do when your personal life is running over into your work life?


Culture

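Interviewers are trying to understand who you are and how you would fit in with the company. They want to know where your interest in data science and in the hiring company comes from. Look over these examples and think about what your best answers would be, but remember that it is important to answer honestly. These questions have no right answers, but the best answers are delivered with confidence and a smile.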


1 Which data scientists do you admire most? Which startups?

2 What do you think makes a good data scientist?

3 How did you become interested in data science?

4 Give a few examples of “best practices” in data science.

5 What is the latest data science book or article you read, and when did you read it? What is the latest data mining conference, webinar, class, workshop, or training you attended, and when?

6 What’s a project you would want to work on at our company?

7 What unique skills do you think you’d bring to the team?

8 What data would you love to acquire if there were no limitations?

9 Have you ever thought about creating a startup? Around which idea / concept?

10 What can your hobbies tell me that your resume can’t?

11 What are your top 5 predictions for the next 20 years?

12 What did you do today? Or what did you do this week / last week?

13 If you won a million dollars in the lottery, what would you do with the money?

14 What is one thing you believe that most people do not?

15 What personality traits do you butt heads with?

16 What are you passionate about?


Problem Solving

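During the interview, interviewers want to test your problem-solving ability. When answering, remember to always talk through your thought process, because to the interviewer the process is often more important than the result itself.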


1 How would you come up with a solution to identify plagiarism?

2 How many “useful” votes will a Yelp review receive?

3 How do you detect individual paid accounts shared by multiple users?

4 You are about to send one million emails. How do you optimize delivery? How do you optimize response?

5 You have a dataset containing 100K rows and 100 columns, with one of those columns being our dependent variable for a problem we’d like to solve. How can we quickly identify which columns will be helpful in predicting the dependent variable? Identify two techniques and explain them to me as though I were 5 years old.

6 How would you detect bogus reviews, or bogus Facebook accounts used for bad purposes?

7 How would you perform clustering on one million unique keywords, assuming you have 10 million data points – each one consisting of two keywords, and a metric measuring how similar these two keywords are? How would you create this 10 million data points table in the first place?

8 How would you optimize a web crawler to run much faster, extract better information, and better summarize data to produce cleaner databases?
