Do you write production code as a data scientist?

Over the past month, I posed this question to friends, peers, and an online tech forum, and got responses from more than 30 data scientists across various industries, with different academic backgrounds and career paths. The responses show a wide spectrum of data scientists’ involvement in production, and reveal some shared concerns about career development.

This is the fourth post in the Connect the Dots series. See the full table of contents.

I have received permission from the individuals mentioned in this post to share their ideas and comments, either anonymously or with their forum ID. Some comments were originally in Chinese and have been translated into English.

Who are the sampled data scientists?

Their official job titles include data scientist, research scientist, applied scientist, machine learning scientist, machine learning engineer, and quantitative researcher.
They work in finance, healthcare, real estate, online advertising, marketing, telecommunications, transportation, and e-commerce.
Some work in small startups and some work in large international companies.
They research, experiment with, build, and evaluate machine learning algorithms and models for real-world problems.
They hold master's or PhD degrees in physics, biology, neuroscience, statistics, and computer science.
Most of them are individual contributors.

Research

20% of sampled data scientists describe their work as pure research.

Researcher A is working in a large tech company:

“We read papers to come up with research ideas, perform preliminary analysis, write research proposals with motivation, and ask the company for funding to start a project. This “idea-proposal-funding” cycle is essentially like academic research in a university. I’m working on a team with a well-known professor as our head and a few extremely talented post-docs. For me, it feels like I am still in graduate school.

When the project is done, we give a presentation to the product team, try to sell the ideas to their head, and convince them to assign engineers to do the implementation. It is crucial to persuade their head, who has the power and resources to push the implementation.

It is the best situation if a research project is not only published as a paper, but also implemented in production.

Meanwhile, most of us have open-source projects as our side projects. In case the main research project is blocked or delayed, we can work on other interesting projects in parallel.”

Researcher B is working in a small marketing company:

“We should always think about our core technical competency as data scientists. Is it the ability to come up with innovative research ideas and design experiments to test them? Or is it the ability to implement and deploy code in production at scale? We also need to think about the future, 5 or 10 years from now. If you decide to stay on the technical track, I personally think the former, the ability to innovate, will differentiate you from others.”

Researcher C is working in a small e-commerce company:

“What do you mean by “research”? In my opinion, if you do not aim to publish papers, you are not a researcher.”

Researcher D on what is engineering:

“Production and engineering are much more than writing clean code or understanding data structures and algorithms. You may nail the coding interview, but we all know the interview is usually not what you will be doing at work, especially when it comes to production. Production also involves network protocols, memory optimization, concurrency, system design for OOP (Object Oriented Programming), etc. At the end of the day, as data scientists, you still have to collaborate with the engineering team, which has its own frameworks and codebases. Even if a software engineer is hired under the title of “data scientist” by a data science team, without on-boarding by the engineering team, that person will almost inevitably need to ask the engineering team for collaboration when it comes to production. Most data science teams do not push their code to production directly. Code review and refactoring by the engineering team are often required.”

Engineering

20% of sampled data scientists describe their work as intensive engineering.

Engineer A is working in a small tech company:

“I consider myself an engineer. When designing an algorithm, you always have to consider how to eventually deploy it to the company’s system in a distributed and scalable way. It can take quite some time to learn all the tooling and frameworks in a company. I have been told that if your features and labels are correlated, it does not really matter which algorithm you choose as long as it is simple, scalable, and transferable.”

Engineer B is working in a small company:

“I’m working in a small company with the title Quantitative Researcher. What I do every day is a hybrid of research and engineering. I start a project with some research, build models, push them to production, and I also do model monitoring and on-call. I’ve improved my coding skills a lot through code review, and learned a lot about system design from senior engineers on the team. When I was initially looking for jobs, I was aiming for machine learning engineer positions, but my coding skills were not strong enough at the time. During my onsite interview, I was lucky to solve an algorithm question, but my interviewer looked at my code and said “you don’t seem to code a lot because your variable naming is not consistent.” I enjoy what I’m doing now, although I plan to transition to a software engineer position. By the way, all my teammates except me are software engineers.”

Engineer C is working in a small biotech company:

“Industry cares more about actual production than theory and research, and everything you do at work is essentially for production. In fact, as employees, we are doing what upper management expects us to do.”

Hybrid

60% of sampled data scientists describe their work as a hybrid of research and engineering.

Hybrid A on transition:

“My title is research scientist, and I am facing similar confusion. Our team is a hybrid of research and engineering. On the research side, we are solving open-ended problems; on the engineering side, we do not have the luxury of pair coding with software engineers, and thus we have to do production completely by ourselves. Most of the team members are STEM PhDs, including myself, and I personally think this hybrid approach is pretty good. One caveat of pure research is that it may be too theoretical and idealistic, and may not consider the practical requirements and constraints of production. And pure engineering, well, is essentially the job of a software engineer.

The hybrid approach is not simply “research + engineering”. I’ve learned a lot about the transition and conversion between the two. From a personal development standpoint, being able to do both research and engineering is a rare and valuable skill set.

I understand that some research-focused data scientists do not own the features end-to-end; they depend heavily on the engineering team and need to coordinate with the engineers’ priorities. Most of the time, this is more of an organizational decision. I still think that even if data scientists do not do production themselves, it is helpful to stay close to the engineering team and align research goals with features in production.”

Hybrid B (1point3acres BBS, 小K) on pair coding:

“It is easier to set up pair programming in a small company. When I was working on data analysis and model building, I was paired up with one of the most experienced software engineers in the company, and I was very glad to learn so much from him.”

Hybrid C on interview:

“A data scientist who does engineering tasks intensively without much research is essentially a software engineer. The interview for such an engineering-focused data scientist position often has a similar format to a software engineer interview, plus some questions on machine learning models, and the compensation package is similar to that of data scientists. The interview for a hybrid data scientist position also has coding questions, but they are simpler than those in a software engineer interview. It is for people who would like to work closely with production but do not have strong programming skills.”

How about business impact?

Without direct involvement in production, how can a data scientist demonstrate business impact?

An experienced researcher working in a large tech company:

“It depends on what you are looking for: breadth or depth. Business impact is largely related to the established technical framework and how to make things happen at large scale. Large and mature companies tend to offer a large platform for big impact. Bluntly speaking, your position (title) decides your impact. Early in your career, it is usually more difficult to have a direct and big impact.”

An experienced engineer working in a large tech company:

“Data science is a very broad field, and you need to explore, find out what you truly enjoy, and position yourself properly. If you are interested in business, you need to learn how to build business models, how to negotiate with different parties, how to manage product development, and how to deliver products to users and react to their feedback. We should always focus on one particular path (research, engineering, or business) and develop our own competitive edge. At the same time, we should stay informed about the other aspects.”

Researcher X is working in an e-commerce company:

“Honestly speaking, it depends on your performance review system. If you are evaluated on research, there is nothing wrong with working on theoretical and less practical research projects. If you are evaluated on production and your impact is judged by how much code you push, you probably should do more production, even if your title is not engineer.”

Additional resources

While discussing this with my data scientist friends, I also started to read blogs and articles written by other data scientists who share their opinions on and understanding of production and engineering. Here are some blogs and posts I find quite insightful:

Final thoughts

Coming from a STEM research background, I personally find exploratory and innovative research intriguing and exciting, and I enjoy experimenting with different ideas and making discoveries. At the same time, as I collaborate more often with software engineers in my company, I have started to understand and appreciate engineering more. I try to apply engineering best practices to data science research and implementation, including version control, naming conventions, modular code, abstraction, and unit tests. I also talk with engineers about how we can efficiently transfer research insights into production. I have learned that in order to make machine learning work at scale in the real world, we need to build high-quality engineering frameworks and infrastructure, and the machine learning model is only one part of this big picture.
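As a small illustration of what "modular code" and "unit test" can look like in a data science setting, here is a minimal Python sketch. The function `min_max_scale` and its test cases are hypothetical examples made up for this post, not code from any team mentioned above; the point is only that a preprocessing step pulled out of a notebook into a named, single-purpose function becomes easy to review and test.

```python
import unittest


def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1].

    Hypothetical example: an empty list returns an empty list, and a
    constant list maps to all zeros to avoid dividing by zero.
    """
    if not values:
        return []
    lowest, highest = min(values), max(values)
    value_range = highest - lowest
    if value_range == 0:
        return [0.0 for _ in values]
    return [(value - lowest) / value_range for value in values]


class MinMaxScaleTest(unittest.TestCase):
    def test_scales_to_unit_interval(self):
        self.assertEqual(min_max_scale([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_constant_input_maps_to_zeros(self):
        self.assertEqual(min_max_scale([3, 3, 3]), [0.0, 0.0, 0.0])

    def test_empty_input(self):
        self.assertEqual(min_max_scale([]), [])
```

Saved in a file, the tests can be run with `python -m unittest`. Edge cases such as the constant input are the kind of thing that tends to surface during code review with engineers.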

Ju Yang

Ph.D. / Machine Learning Practitioner in New York