Time flies. It’s been 7 months since I started to work as a full-time Data Scientist. Sometimes it feels much shorter than 7 months: imaging neurons in the laboratory as a graduate student and walking on the stairs in front of Alma Mater on campus feel just like yesterday. Sometimes it feels much longer than that: working is so vastly different from academic study, and I’ve learned so much about so many aspects of data science within this short-long period of time, that my mind and my understanding of the industry and the world hardly resemble those of the graduate-student me.
Here I summarize a few lessons I have learned at work.
Forget about linearity
How wonderful it would be if things all went well according to our grandiose plan. How satisfying it would be if we could cross out all the bullet points in our to-do list in one shot and never visit the “done” items again. How simple it would be if project development were a purely linear process and all we needed to do was meet one milestone after another, and BOOM, the project is completed.
I would have thought so if I had never worked in the real world. School projects and coursework have well-defined objectives and assignments, as well as a linear, week-by-week road map towards the final exam, called a “syllabus”. To solve a course problem, no matter how complex it is, there is always a clear step 1, 2, 3. Professors and instructors let the students know, in advance, what the tasks are, what the key learning points are, and what to expect in the evaluation. As long as you cross out all the items in the to-do list, you can be sure to receive a good reward of knowledge and high scores.
In more open-ended research projects, such as my thesis research, the instructions are less well defined and the road map is less clear than in courses. However, a research project is usually conducted under over-simplified, ideal conditions. For example, theoretical studies in math and physics may set up strict ideal assumptions which rarely hold in the real world, and experimental research in biology and chemistry may rely on arbitrary buffer solutions and convenient cell types which represent what scientists believe to be a close resemblance of real life. Thanks to such idealization and simplification, scientific research can be done in a very rigorous and reproducible manner. On the other hand, such simplification may overlook actual complexity, especially in the generation of ideas and the development of projects. In a scientific research proposal, it is relatively easy to define the hypothesis and the steps or experiments to test it. Once all the steps are done, it is fair to conclude whether the hypothesis holds or not. In addition, research projects are usually performed by a very small group of experts from the same field, speaking the same jargon and looking in the same direction. It is far easier to communicate about your research with your lab mates than with your parents.
With this “linearity and to-do list” mindset, it was quite shocking to me that project development in industry is far from linear. There are a lot of ongoing back-and-forth discussions and goal-setting among different interest groups: engineers, product managers, sales, strategy, and so on. Each group tends to speak its own language, have its own agenda, and view the project from its particular perspective. Engineers may be interested in improving the precision of the machine learning algorithms; product may be pressured to meet the deadline; sales may be motivated to develop new features for upselling. When it comes to planning, the first step is to set a clear objective and define the problem. This usually requires multiple meetings across different groups in order to reach a consensus. Sometimes my problem may seem trivial in your eyes, and vice versa. Then come the tasks. Most of the time, the problem is only vaguely defined in business, and it is not until some initial analysis and trials are done that the problem definition becomes more concrete. This means you may have to face the fact that there is no to-do list at certain stages of a project, and no clear hypothesis to test. Sometimes your task depends on others’ tasks, and it is difficult to foresee your own until they finish their part. This is quite challenging if you have a student mindset and are used to being told what to do and what to expect under strict guidelines. In business, you may have to define your own tasks and come up with ideas to test, as well as use your expertise to tell other groups what to expect and how to evaluate it. As the market and customer demand are dynamic, you may have to revise the same task after “finishing” it. Oftentimes the to-do list just keeps getting longer and longer, and it seems like you can never cross out a single task because it is never done: there is always revision and improvement.
To students fresh out of college, the working environment and vague project settings may seem chaotic and messy, as there are no well-defined syllabus and learning objectives any more. However, the real world never comes in bullet points. It is more like a loosely structured network. It requires human effort to define clusters of problems and tasks, and to clarify goals in the midst of vagueness and uncertainty. Coming up with an actionable task list is in itself a big task!
Linearity is simple and nice, but forget about it in the real world.
Local optimum is a good solution
In a discussion with a friend about modern college education, he held the opinion that the 4-year college education is deeply flawed and expensive, and that small-group teaching and MOOCs are the future. I thought that there is no better alternative to the 4-year college and classroom teaching at this moment, so however flawed it is, the 4-year college is still the “best” we can have. There is a great deal of discussion about how fast technology is developing and how what we are learning in school now may not be relevant in decades. That may be true. However, no matter how dissatisfied you are with the 4-year college, your best bet (statistically speaking) right now for the future is still to go to college.
The real world is full of trade-offs and constraints. You know some solutions are not the best, some plans are not the most efficient, some structures are not the most organized. It is not difficult to point out flaws and shortcomings because they always exist, but it is more difficult to improve the status quo than you may have thought. There could be multiple stakeholders and different interest groups involved, there could be budget limitations, people may have different priorities and values, or you just do not have time! Of course, it is better to have all your code commented and to check naming consistency. However, when you have more important tasks to do, would you still spend your afternoon checking the spaces after each parenthesis? It is nice to be organized, but not to the point of getting lost in it.
With limited time, resources, and energy, sometimes you just have to accept a local optimum as a solution that is good enough for you to proceed to the next step. Accepting a local optimum does not mean giving up on improvement. Through future iterations and feedback from others, we may reach a better local optimum, and then an even better one. It would be much less efficient to try to find a so-called “global optimum” in one shot.
PS: “Time flies. It’s been 6 months since I started to work as a full-time Data Scientist.” I wrote this line a month ago, at the 6-month milestone. I had so many thoughts in my mind, so many notes in my notebook, and so many personal experiences to share, that I never got to finish this post. Now another month has passed, and I still think this one single post does not fully record my mental journey through these first few months of working. That being said, this post serves as a pilot episode in a new season of my data science blog (season 1 being the internship), and many more are coming in the future!