“Which language do you use at work?” I get this question quite often. My short answer is usually “Python for research and Scala for production”. In this post, I will give a more detailed answer, with examples.
SQL for data preprocessing
The first step in exploring a new research idea or testing a new hypothesis is to get the relevant data. Engineers in the company have built well-established ETL (Extract, Transform and Load) pipelines and various sophisticated aggregation jobs to process and store data in Google BigQuery. After rounds of group discussions, literature review, and planning meetings, we kick off a project. To learn which types of data I can use and where to find them, I prefer to have in-person discussions with knowledgeable engineers who are familiar with the data schema and specification.
Once I learn where the data is stored and how it is structured, my next step is usually exploratory data analysis (EDA) on BigQuery using SQL. EDA gives me an overview of the data and lets me identify distributions, unusual trends, outliers, and, in the simplest form, counts of the properties of interest. For example, there may be data ingestion errors in the table, resulting in empty values in certain fields; there may be an unusually high number of “others” values in a column, suggesting that this value is the default “fall-back” placeholder. When in doubt, I investigate and discuss with colleagues until I am certain that I understand the data and its meaning.
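As a rough illustration of these sanity checks, here is a minimal Python sketch using the google-cloud-bigquery client; the project, dataset, table, and column names are made up for the example.

```python
from google.cloud import bigquery

# Hypothetical project/dataset/table and column names, for illustration only
client = bigquery.Client()  # uses application-default credentials

query = """
SELECT
  COUNT(*) AS n_rows,
  COUNTIF(category IS NULL OR category = '') AS n_empty_category,
  COUNTIF(category = 'others') AS n_others
FROM `my_project.my_dataset.events`
"""

# Run the query and pull the (tiny) aggregate result into a pandas DataFrame
df = client.query(query).to_dataframe()
print(df)
```

A suspiciously large n_others relative to n_rows is exactly the kind of signal I would bring to a colleague who knows the data well.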
In my projects, I often work with more than one dataset, and most of the time I am interested in a combination of several datasets. To achieve this, I write SQL queries to join, filter, and select the data of interest, and run sanity checks on the results.
In addition, for preliminary research I may not need the whole dataset (which can easily exceed 10 TB!). So I may create a randomly sampled dataset using SQL, export the sampled dataset (less than 10 GB) to Google Cloud Storage, and then download the data to a local computer for further research.
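As a hedged sketch of what this join–sample–export step might look like with the BigQuery Python client (all project, dataset, table, bucket, and column names are hypothetical, and the 0.1% sampling rate is just an example):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Join two hypothetical tables, keep roughly 0.1% of the rows, and materialize the result
sample_sql = """
CREATE OR REPLACE TABLE `my_project.scratch.events_sample` AS
SELECT e.user_id, e.event_type, u.country
FROM `my_project.my_dataset.events` AS e
JOIN `my_project.my_dataset.users` AS u USING (user_id)
WHERE RAND() < 0.001
"""
client.query(sample_sql).result()  # block until the query job finishes

# Export the sampled table to Cloud Storage as sharded CSV files
extract_job = client.extract_table(
    "my_project.scratch.events_sample",
    "gs://my-bucket/events_sample/part-*.csv",
)
extract_job.result()
```

From there, a tool like gsutil can pull the exported files down to a local machine.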
On a separate note, for transparency, reproducibility, and visibility, I copy all the queries I used in data preprocessing into a Google Doc and the intermediate results into a Google Spreadsheet as I work, organize the queries and results in a readable order, and summarize the analysis results and visualizations on the company’s internal wiki (Confluence), along with a problem statement or background description. Current and future colleagues can easily pick up where I left off and continue working on the project.
Git for version control
You do not want to mess up the master branch (a deadly crime), and you do not want to lose your code revision history when something unexpectedly breaks. Version control with Git is essential. It allows teammates to work on separate branches simultaneously without disturbing the master branch, and it gives you a full copy of the code history. After I finish a new feature, it is deeply satisfying to create a pull request, complete code review, and merge my branch into master, conflict free!
In addition, I use bash commands in the terminal to download, upload, and organize files, run shell scripts, and modify basic configuration.
Python and R for research
I use Python as my playground during research and development to test and explore new ideas. In particular, I like using Jupyter notebooks with all kinds of (un)common data science packages (pandas, numpy, scipy, matplotlib, sklearn, etc.). At this stage, the major goal is to explore the data, write decent working code, implement machine learning models, and test new machine learning algorithms and feature engineering methods from the latest research. The data I use at this stage is usually a sample of the original data from BigQuery, and the sampled data is relatively small (a few GB), so I can run the code quickly on my local computer.
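As a minimal sketch of this prototyping stage, assuming the BigQuery sample has been downloaded as a CSV with a binary label column (the file name, columns, and model choice are all hypothetical):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical sampled dataset exported from BigQuery; assume numeric feature columns
df = pd.read_csv("events_sample.csv")
X = df.drop(columns=["label"])
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Quick baseline model; this is where new features and algorithms get tried out
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```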
A lot of exciting data mining, modeling, and machine learning research happens at this stage. The versatility and ease of use of Python make it possible to get preliminary results within a few days (or even hours), and Jupyter notebooks with visualizations make it easy to share the analysis and results with coworkers for feedback and further iteration.
Some colleagues use R as their research tool, in addition to Python.
Scala for research at scale and production
While Python and sklearn are convenient for research and prototyping, results from a small subset of the data may not be representative of the whole dataset. In addition, Python and sklearn do not scale well to large datasets: if the input data is more than a few GB, training a model can take a very long time, with the risk of a crash in the middle of the training job.
Our current machine learning pipeline is written in Scala, a scalable functional programming language, on top of the Spark framework, which is designed for large-scale data processing. I described my first encounter with Spark in the post Time to Spark. In practice, after an initial assessment and analysis of a research topic in Python, I write the code in Scala and integrate it into the current pipeline, so that I can run Spark jobs to generate features and train machine learning models.
I started using Scala in production three months ago and have found it extremely powerful and enjoyable to use for data science projects. The functional side of Scala makes me think differently when I write a loop (in fact, most of the time you do not need to write loops at all; the map function will do it for you!) or when I want to transform data. Scala’s immutability and type safety may also catch you by surprise (bug alert!).
This whole process involves collaboration and pair programming with engineers, debugging, unit tests, code reviews, refactoring, and more debugging.
We also have machine learning modules written in Python, which provide a more flexible and diverse selection of machine learning algorithms. My internship project was in fact written in Python and ran at scale on the cluster as Spark jobs. However, Scala is still the dominant language in production in the company.
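As an illustration of how a Python module can still run at scale as a Spark job, here is a minimal PySpark sketch; the input path, feature columns, and model choice are hypothetical and only meant to show the general shape of such a job.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("python-on-spark-example").getOrCreate()

# Hypothetical feature table produced by the pipeline
df = spark.read.parquet("gs://my-bucket/features/")

# Assemble the (hypothetical) numeric feature columns into a single vector column
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = assembler.transform(df)

# Fit a simple model; the heavy lifting is distributed across the cluster
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
print("Training AUC:", model.summary.areaUnderROC)

spark.stop()
```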
Language of your audience for efficient communication
Last but not least, always, always use the language of your audience. I wrote a post on this topic during my internship (Take the audience on a journey), and I think the language of your audience is as important as any other programming language at work.
If I am talking with data science colleagues about technical issues, it is generally okay to use some jargon and acronyms. Nevertheless, I try not to take it for granted that anything is well-known or obvious, even when I am talking with very senior and experienced data scientists. After all, nobody knows everything, especially when the conversation is about your own project: no one but you has dedicated as much time and thought to it.
If I am talking with coworkers from other teams, I pay extra attention to my language and resist the temptation to throw out jargon or expedient mental shortcuts. Although you might think that using technical acronyms and “data sciencey” terms makes you look smart and professional, the fact is that no one likes to feel stupid or “condescendingly educated” when they cannot follow what you are saying or do not understand the meaning of a particular word. The purpose of a conversation, in any format, one-on-one or group meeting, is not to show off how much we know, but to listen and learn something we do not know, and to open the discussion and ask for feedback. If the people we are speaking with feel shut down, lost, or that they have nothing to contribute, our monologue will certainly hurt the teamwork, as well as our own growth. As the saying often attributed to Albert Einstein goes, “If you can’t explain it to a six year old, you don’t understand it yourself.”