(meme) Welcome to dark side of science - data science

After posting What I do or: science to data science, I got a lot of emails asking how to make this transition.

In this post I try to summarize my advice. I don’t intend to write a complete walkthrough, but to provide a starting point, with links to further materials. I target it at people with an academic, quantitative background (e.g. physics, mathematics, statistics), regardless of whether they are undergraduate students, PhDs or after a few postdocs. Some points may be valid for other backgrounds [1] (but then - use it at your own risk).

Here and everywhere else: please don’t take the approach of “learn from book[s], then play” - start with playing!

My story

In short:

  • I had a strong background in physics and an interest in complex systems; I did a lot of academic programming and none of the practical kind.
  • After the 1st year of my PhD studies I started learning Python (for web scraping and plotting) on my own time.
  • 9 months later I participated in a 1-month data science school (Big Dive in Turin).
  • 8 months later I went on a summer internship in data science in San Francisco (for 4 months).
  • I started part-time freelancing (as I was finishing my PhD).
  • After finishing my PhD I made it my main activity.

All projects required me to learn something new - be it a library, a machine learning model or a software tool.

What is data science?

Analyzing real, and often dirty, data using a mixture of programming and statistics. Or, as Josh Wills put it:

Data scientist is a person who is better at statistics than any programmer and better at programming than any statistician.

From my perspective the whole process looks like this:

  • ask a question that is relevant to the project
  • get data (CSV, SQL, plain text)
  • process it (joining, cleaning, supplementing it)
  • run analysis (statistical tests or machine learning)
  • interpret and use results (being able to understand the above)
  • present results (a report, plot, interactive data visualization)
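To make the steps above concrete, here is a minimal sketch in plain Python (standard library only, so it runs anywhere). The city temperatures, the lookup table and all function names are made up for illustration; in a real project you would more likely reach for Pandas and a proper data source.

```python
import csv
import io
import statistics

# Hypothetical raw data, standing in for a CSV file and a lookup table.
RAW_CSV = """city,temperature_c
Warsaw,21.5
Turin,
San Francisco,17.0
Warsaw,23.5
Turin,28.0
"""
COUNTRY_BY_CITY = {"Warsaw": "Poland", "Turin": "Italy", "San Francisco": "USA"}


def load_rows(text):
    """Get data: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_and_join(rows, lookup):
    """Process: drop rows with missing values, convert types, add the country."""
    cleaned = []
    for row in rows:
        if not row["temperature_c"]:
            continue  # dirty data: the measurement is missing
        cleaned.append({
            "city": row["city"],
            "country": lookup[row["city"]],
            "temperature_c": float(row["temperature_c"]),
        })
    return cleaned


def mean_by_city(rows):
    """Analyze: mean temperature per city."""
    groups = {}
    for row in rows:
        groups.setdefault(row["city"], []).append(row["temperature_c"])
    return {city: statistics.mean(values) for city, values in groups.items()}


def report(means):
    """Present: a plain-text summary, sorted from warmest to coolest."""
    ordered = sorted(means.items(), key=lambda item: -item[1])
    return "\n".join(f"{city}: {mean:.1f} C" for city, mean in ordered)


rows = clean_and_join(load_rows(RAW_CSV), COUNTRY_BY_CITY)
means = mean_by_city(rows)
print(report(means))
```

Each step is a separate function on purpose: you can test it, swap the data source, or rerun the whole thing from scratch.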

And everything needs to be done in a reproducible way - so others can interact with your code, or even run it on a server. Depending on the job, there may be more emphasis on one part or the other. Or even look at this tweet - while humorous [2], it shows a balanced list of typical skills and activities of a data scientist:

a data scientist should be able to (by Joel Grus)

If you want to learn more about what data science is, look at the following links:

On the transition

When you have some academic title, no-one will question your intelligence. But they are justified in questioning your practical skills. From my experience, you need to fulfill two requirements:

  • have minimal skills so that you are useful starting from day 1 (e.g. you can get data and present summary statistics; they don’t want to start with teaching you Python and Git),
  • be able and eager to learn (both in general and their technologies in particular; be self-driven to discover and solve new problems even without being explicitly guided).

Most data science things are simple, and once you are able to use R or Python you can start working, gradually increasing your knowledge and experience. That is, after a few months of learning you should be ready to start an entry-level job.

Initially, I was afraid that my lack of 10+ years of experience with C++ and Java would be a problem. How could I compete with serious software engineers, who did their computer science major? But it turned out that most of my commercial projects are for IT companies - they have wonderful programmers but often no-one proficient at dealing with real data. So (quoting From Academia to Industry, linked below):

While having a strong coding ability is important, data science isn’t all about software engineering (in fact, have a good familiarity with Python and you’re good to go). Data scientists live at the intersection of coding, statistics, and critical thinking.

See also:

Priorities

In academia, you are allowed to cherry-pick an artificial problem and work on it for 2 years. The result needs to be novel, and you need to research previous and similar solutions. The solution needs to be perfect, even if not on time.

In industry, you should solve a given problem end-to-end. Things need to work, and there is little difference if it is based on an academic paper, usage of an existing library, your own code or an impromptu hack. The solution needs to be on time, even if just good enough and based on shady and poorly understood assumptions.

So, contrary to its name, it’s rarely science [3]. That is, in data science the emphasis is on practical results (like in engineering) - not proofs, mathematical purity or the rigor characteristic of academic science.

Resume vs academic CV

In the software industry a resume plays a different role than a CV in academia. Rather than being a complete record of all positions, awards and publications, it is a short (typically 1 page) summary of the main skills and the most important positions/accomplishments. It is used to screen candidates, not as the final judgement. To see the difference, compare and contrast my data science resume with my academic CV.

Interviews

Applying for a job involves being asked technical questions - on the phone or Skype. For software engineering it involves both conceptual questions and whiteboard coding; for data science it may vary. In any case, take a look at:

If you need to learn basic algorithms and data structures, I recommend:

If you get no technical questions, it may be a red flag. If you get only software engineering questions, it may be a sign that they want to hire a programmer, not a data scientist (no matter what their job posting says); and given your background you want to be a Type A data scientist (i.e. more a statistician than a regular programmer), according to this taxonomy.

Programming languages

Most likely, practical programming is the main skill you are missing. For general data science, the standard tools are Python and R. If you already know some other languages it will help; still, learn one of the above.

But… Python or R? There are some crazy fights, right?

tl;dr: both are good choices. Pick one you prefer for any reason; two really good ones are:

  • This thing is great! I want to apply it to [some other data]. Oh, it is in [a language]!
  • Having a community of people from whom you can learn.

I mean, there are use cases when one is better than the other. But in the majority of tasks both are fine. And well (some may disagree), but they are tools, not religions (no need to fight, no need to use one exclusively).

I won’t point to general tutorials - there are tons of them and personal preferences vary (MOOCs, interactive courses, websites, textbooks, …) - and I tried to link only to things I recommend myself. When I provide links, they usually point to web materials rather than classical books. And it is for a reason:

  • things change fast; a 2-year-old book on a programming language may well be out of date,
  • it is important how much you use in practice; dry-reading won’t teach you a thing.

R

R is a tool for statistics turned into a language. The standard way of using it is via RStudio (though you can use Jupyter). Be sure to learn the basics of dplyr and ggplot2 (I almost always load them by default; especially dplyr, which makes operations on dataframes much easier, faster and more readable). Then everything else depends on the problems you are solving.

If you go the R way, at least:

Some R pearls:

  • R Markdown - dynamic documents, presentations, and reports from R
  • Shiny - turn your analyses into interactive web applications

Python

Python is a much better general-purpose language (with the pros and cons of not being statistics-oriented).

For Python, I would suggest installing it (Python 3) through Anaconda, and using Jupyter Notebook. Main packages are NumPy, SciPy (numerics), Pandas (like R dataframes), matplotlib (plots, but not as nice as ggplot2) and scikit-learn (for machine learning). Learn to be comfortable with Python (installing packages, loading, saving and transforming data, etc) - links below may help:

Statistics and Machine Learning

You need some basic linear algebra (vectors, matrices, SVD, …), calculus (exp, log, differentiation, integration, …) and probability (independence, conditional probability, …), but if you are from a natural science background, you already know that. It does not mean that you know everything - it just means that right now you have mathematical skills sufficient to be an employable data scientist and you are able to read about other methods, algorithms, etc.
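As a quick check of whether conditional probability is still fresh in your mind, here is a small worked example of Bayes’ rule. The numbers (1% prevalence, 90% sensitivity, 5% false positive rate) are made up for illustration, and `posterior` is a hypothetical helper, not a library function.

```python
from fractions import Fraction


def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test), computed with Bayes' rule."""
    true_positive = prior * sensitivity
    false_positive = (1 - prior) * false_positive_rate
    return true_positive / (true_positive + false_positive)


# Made-up numbers: 1% prevalence, 90% sensitivity, 5% false positive rate.
p = posterior(Fraction(1, 100), Fraction(9, 10), Fraction(5, 100))
print(p)                   # exact fraction: 2/13
print(f"{float(p):.3f}")   # 0.154
```

So even with a fairly accurate test, a positive result here means only about a 15% chance of actually having the condition. If this feels obvious, you are fine; if it is surprising, it is worth refreshing the basics.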

If you need to get a real dataset suitable for working with a given machine learning algorithm, there is a wonderful collection:

For statistics, screw learning by heart various statistical distributions and tests - you can easily look them up later. What is crucial is to understand the idea of tests, cross-validation, bootstrapping and Bayesian inference. For the latter I recommend:

It’s a fast-changing field - I am constantly tracking new libraries and updates to the ones I am using. I read a lot of academic papers - not just to stretch my intellectual muscles, but to solve a particular problem.
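Going back to bootstrapping mentioned above: the idea fits in a dozen lines, which is a good reason to understand it rather than memorize formulas. Below is a minimal sketch of a percentile bootstrap confidence interval for the mean, using only the standard library. The sample values, the function name `bootstrap_ci` and its default parameters are my assumptions for illustration.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)  # fixed seed, so the result is reproducible
    n = len(sample)
    # Resample with replacement, recompute the statistic each time, then sort.
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up measurements.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 3.0]
low, high = bootstrap_ci(sample)
print(f"mean = {statistics.mean(sample):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

The same function works for the median or any other statistic: pass it as `stat`. No assumption about the underlying distribution is needed, only that the sample is representative.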

Other software skills

Often you will need to install something, collaborate with others and do other tasks. The crucial point is to know what is possible - especially so as not to reinvent the wheel.

Don’t be afraid of learning new technologies (e.g. this data is in MongoDB, a NoSQL database; can you fetch it?) - often you can get the basics in a day. Most technologies, from the user’s perspective, are easy (at least compared to algebraic geometry or quantum field theory).

Practicing and building a showcase

Some people recommend Kaggle as a starting point but I would take it with a grain of salt. Don’t get me wrong - there are great resources, it provides feedback (otherwise it is hard to tell if your solution is good) and some people find it really engaging. But if you start with a goal of winning - you will end up disappointed, with neither fame nor gold (competitions with prizes are not beginner-level). Moreover, beware that industrial problems rarely look like that (e.g. in all of mine data cleaning was a big thing, and in none did a 5% score improvement matter). More on that:

Personally, I most enjoy working on data I care about and find genuinely interesting. It drives my motivation much more than any competition could. Also, this way it is complete data science - from asking questions and getting data to presenting the results in a meaningful form.

Making results public, including code, is a great opportunity for both feedback and building a showcase. It can be an IPython Notebook, or a website, or even just a plot (but then be sure to sign it - if it goes viral you want to get due recognition!). E.g. some of mine (see also Projects):

So, once again, be sure to get a GitHub account (for hosting code, notebooks and websites). Mine looks like this: github.com/stared. And don’t be afraid to publish premature code: if it is not good yet then no-one will notice (or care) anyway. Also, some people like writing about things they have just learnt (e.g. How gzip uses Huffman coding - Julia Evans). If it is your thing - just do it (see my post on Jekyll)!

Data science boot camps

It’s totally fine to learn things on your own. But going to a boot camp may be a huge boost - in motivation, access to tutors/experts, and job opportunities. Here are some camps I am aware of:

Internships

If you are still a student - doing an internship may be a great way to get a lot of experience, feedback, confidence and contacts. I did mine during my PhD studies (in Europe it is not common to take a break, and a lot of people in academia dissuaded me, but I consider it a wonderful, life-changing experience) [4].

To search for offers try googling data science/scientist intern/internship and visit some job listings (e.g. Indeed). Sometimes it makes sense to mail a company even if they don’t use the words intern or internship - especially smaller ones may be flexible. Some bigger tech companies (Facebook, Google, IBM, Microsoft) offer internships [5], see:

Aim at tech companies (to actually work in data science). In the [San Francisco] Bay Area (i.e. north of Silicon Valley) there are plenty of opportunities to learn data science - it should be your primary destination. To work in the US you need to get a J-1 visa (of course, after they decide they want you); it’s relatively easy, but takes ~2-3 months.

Once on-site, start looking for various meetings and hackathons, especially via Meetups. Search for anything that may fit (data science, R communities, big data etc) and try to visit a lot of events. In the Bay Area it is an advantage to be “bold”. So don’t be afraid of asking about or for anything, starting to talk to people etc - on average it will be much better than taking a passive posture. See also:

Feed

Never stop learning. Some feeds:

And if you have a question, a good place to ask (and search for answers) is:

Advanced stuff

Since you have a maths background, it may be possible for you to take a shortcut and get into advanced topics. Here is a random list of starting points I consider interesting:

About

This blog post started as emails, and went through a stage of an extract of emails (shared on Google Docs). It took me way more time than I expected to present it in the current form.

There are many people who helped me with this post, at its various stages (starting from asking me questions!). But I would like to especially thank Adam Goliński, Sebastian Jaszczur, Kasia Kulma and Robert Bogucki for their remarks on the final version.

I would love to hear your feedback! Did you find it useful? Or maybe you would recommend another learning strategy? Or additional links?

Or maybe your company needs data science training? I would be happy to provide it! See workshops.deepsense.io for the menu (and we are happy to make custom workshops) and fill in the form or contact me directly!

  1. For instance, if you don’t have a quantitative background, you need to focus on it (and it may be the hardest part). Since it was not my path, I can’t help.

  2. In particular, p-value hacking is wrong. But you should be aware of what a p-value is and why it can be hacked (accidentally or purposefully).

  3. But if you come from a non-academic background (e.g. web dev), then from your perspective data science is science. Or to make it precise - it is engineering, but more like designing new engines than building a house.

  4. Great thanks to Adam Zadrożny for showing me this possibility (he interned at Facebook while doing his PhD in gravity waves) and to Jacek Migdał for convincing me to apply to the Bay Area, rather than somewhere else.

  5. If you have a background in computer science, it will be like playing on the easy level (it was not my case, though). It may be possible to apply as a software engineer expressing interest in data - and learn from that point.

  6. Hacker News is my best general-purpose non-personal feed, complemented by The Economist.