To a biologist or biomedical scientist starting out in Data Science, it may seem as if Python and R are interchangeable, and for some tasks they are. However, there are clear distinctions between them too, and the answer to which one you should learn is “it depends on what you’re trying to do”.
Python is a general-purpose software development language (its reference implementation, CPython, is written in C) which is deep and huge and intuitive. It is easier to learn than many other languages, and you don’t need to be totally fluent in order to make use of it for genomics or other biological data analysis. It can do some statistics, and is a great scripting language to help you link your workflow or pipeline components together. Think of it as connective tissue.
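As a small illustration of that connective-tissue role, here is a minimal Python sketch that chains two steps (a per-sequence calculation and a summary statistic) using only the standard library. The function names and the four short reads are made up for this example; a real pipeline would more likely read its sequences from a file.

```python
from statistics import mean, stdev


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence is empty")
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def summarize_gc(sequences):
    """Return (mean, sample standard deviation) of GC content across sequences."""
    values = [gc_content(seq) for seq in sequences]
    return mean(values), stdev(values)


# Hypothetical reads, invented purely for illustration.
reads = ["ATGC", "GGCC", "ATAT", "GCGA"]
gc_mean, gc_sd = summarize_gc(reads)
print(f"mean GC = {gc_mean:.4f}, sd = {gc_sd:.4f}")
```

Each function does one small job, and the script simply passes the output of one step to the next, which is the kind of glue work Python is well suited to.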
R is a statistical and visualization language which is deep and huge and mathematical. Once you know your way around it you can conduct statistical analysis of almost anything, but you need a good degree of fluency in order to make best use of it. It can be used as a scripting language but that is not its strength. Think of it as muscle fibers.
We surveyed several respected tech and data science websites, and came to the following general consensus:
- Python is a general-purpose programming language with which you can build things, and R is a great statistical platform with which you can analyze and plot things.
- In the context of biomedical data science, learn Python first, then learn enough R to be able to get your analysis done, unless the lab that you’re in is R-dependent, in which case learn R and fill in the gaps with enough Python for easier scripting purposes.
- If you learn both, you can call R code from Python using rpy (now maintained as rpy2).
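To give a feel for that last point, here is a minimal sketch of calling an R function from Python. It assumes the rpy2 package (the maintained successor to the original rpy) and a working R installation are both available; the helper name `mean_via_r` is hypothetical and chosen just for this example.

```python
def mean_via_r(values):
    """Compute the mean of a list of numbers by calling R's built-in mean()."""
    # Imported inside the function so this file still loads on machines
    # where rpy2 or R is not installed (an assumption of this sketch).
    import rpy2.robjects as ro

    r_mean = ro.r["mean"]              # look up the R function by name
    r_vector = ro.FloatVector(values)  # convert the Python list to an R numeric vector
    result = r_mean(r_vector)          # R returns a vector of length 1
    return float(result[0])
```

Calling `mean_via_r([2.0, 4.0, 9.0])` hands the numbers to R and brings the result back as an ordinary Python float. The same pattern extends to R's statistical tests and plotting functions, so you can keep the scripting in Python and leave the heavy statistics to R.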
For more discussion on Python and R for data science, please visit the following websites:
- KDnuggets has an extensive comparison
- ReadWrite explains the cultural differences between them
- DataCamp has a summary infographic
- The Data School blog approaches it from the teaching perspective