
Essential skills of a Data Scientist [closed]

Tags:

r



To quote from the intro to Hadley Wickham's PhD thesis:

First, you get the data in a form that you can work with ... Second, you plot the data to get a feel for what is going on ... Third, you iterate between graphics and models to build a succinct quantitative summary of the data ... Finally, you look back at what you have done, and contemplate what tools you need to do better in the future

Step 1 almost certainly involves data munging, and may involve database accessing or web scraping. Knowing people who create data is also useful. (I'm filing that under 'networking'.)

Step 2 means visualisation/plotting skills.

Step 3 means stats or modelling skills. Since that is a stupidly broad category, the ability to delegate to a modeller is also a useful skill.

The final step is mostly about soft skills like introspection and management-type skills.

Software skills were also mentioned in the question, and I agree that they come in very handy. Software Carpentry has a good list of all the basic software skills you should have.
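As a rough illustration of steps 1 to 3, here is a minimal sketch in Python using only the standard library. The data, the function names (`munge`, `text_plot`, `fit_line`) and the choice of a straight-line model are all assumptions made up for this example; they are not taken from the thesis. Real work would normally use a proper plotting and modelling package.

```python
import csv
import io

# Hypothetical raw data: stray whitespace and one missing value.
RAW = """x, y
1, 2.1
2, 3.9
3,
4, 8.2
5, 9.8
"""


def munge(text):
    """Step 1: get the data into a usable form (drop incomplete rows)."""
    rows = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    next(reader)  # skip the header row
    for record in reader:
        fields = [field.strip() for field in record]
        if len(fields) != 2 or "" in fields:
            continue  # incomplete row, leave it out
        rows.append((float(fields[0]), float(fields[1])))
    return rows


def text_plot(rows):
    """Step 2: a crude text 'plot' to get a feel for the data."""
    return [f"{x:>4} | {'*' * round(y)}" for x, y in rows]


def fit_line(rows):
    """Step 3: ordinary least squares fit of y = intercept + slope * x."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


rows = munge(RAW)
for line in text_plot(rows):
    print(line)
slope, intercept = fit_line(rows)
print(f"y ~ {intercept:.2f} + {slope:.2f} * x")
```

The iteration in step 3 would then mean looking at the plot again with the fitted line in mind, and changing the model if the two disagree.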


Just to throw in some ideas for others to expound upon:

At some ridiculously high level of abstraction, all data work involves the following steps:

  • Data Collection
  • Data Storage/Retrieval
  • Data Manipulation/Synthesis/Modeling
  • Result Reporting
  • Story Telling

A data scientist should have at least some skill in each of these areas. But depending on specialty, one might spend a lot more time in a limited range.


JD's points are great, and for a bit more depth on these ideas read Michael Driscoll's excellent post The Three Sexy Skills of Data Geeks:

  1. Skill #1: Statistics (Studying)
  2. Skill #2: Data Munging (Suffering)
  3. Skill #3: Visualization (Story telling)

At dataist the question is addressed in a general way with a nice Venn diagram:

[Image: Venn diagram]


JD hit it on the head: Storytelling. Although he did forget the OTHER important story: the story of why you used <insert fancy technique here>. Being able to answer that question is far and away the most important skill you can develop.

The rest is just hammers. Don't get me wrong, stuff like R is great. R is a whole bag of hammers, but the important bit is knowing how to use your hammers and whatnot to make something useful.


I think it's important to have command of a commercial database or two. In the finance world that I consult in, I often see DB2 and Oracle on large iron and SQL Server on the distributed servers. This basically means being able to read and write SQL code. You need to be able to get data out of storage and into your analytic tool.
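To make "get data out of storage and into your analytic tool" concrete, here is a small sketch. It uses SQLite through Python's standard `sqlite3` module purely as a stand-in for a commercial database, and the `trades` table, its columns, and the helper names are hypothetical. The SQL itself is plain enough that it should carry over to most databases with little change.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway in-memory database with a hypothetical trades table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trades ("
        "trade_date TEXT, symbol TEXT, quantity INTEGER, price REAL)"
    )
    conn.executemany(
        "INSERT INTO trades VALUES (?, ?, ?, ?)",
        [
            ("2024-01-02", "ABC", 10, 5.0),
            ("2024-01-02", "XYZ", 4, 2.5),
            ("2024-01-03", "ABC", 2, 5.5),
        ],
    )
    conn.commit()
    return conn


def load_daily_totals(conn):
    """Let the database do the aggregation, then pull the result into Python."""
    query = """
        SELECT trade_date, SUM(quantity * price) AS notional
        FROM trades
        GROUP BY trade_date
        ORDER BY trade_date
    """
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    return [dict(row) for row in conn.execute(query)]


conn = make_demo_db()
for row in load_daily_totals(conn):
    print(row["trade_date"], row["notional"])
```

Doing the `GROUP BY` in the database rather than in the analysis tool is a deliberate choice here: it keeps the amount of data moved across the connection small, which tends to matter on large tables.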

In terms of analytical tools, I believe R is increasingly important. I also think it's very advantageous to know how to use at least one other stat package as well. That could be SAS or SPSS... it really depends on the company or client that you are working for and what they expect.

Finally, you can have an incredible grasp of all these packages and still not be very valuable. It's extremely important to have a fair amount of subject matter expertise in a specific field and be able to communicate to relevant users and managers what the issues are surrounding your analysis as well as your findings.


Matrix algebra is my top pick