Data science

Data Science is an emerging area that develops processes and practices for exploring and analyzing data and for building models that describe and predict from a wide range of data types. Ultimately, these processes and practices support better performance and effectiveness in organizations and a better quality of life for citizens.

Data Science models and transforms data to inform the decision process through computational thinking, moving towards data-driven decision making.

Data Scientist

Professional of the decade

Profile:

  • Analytical ability
  • Investigative capacity
  • Entrepreneurship
  • Business understanding
  • Programming skills

Data Science in Practice

If you torture the data long enough, it will confess. - Ronald Coase

Data management: several general or specialized platforms for all kinds of data

Data mining: several implementations of each technique

User expertise: does the data scientist need to program?

NO! (S)he just needs to think algorithmically.

Lemonade in the context of data science

Enablers:

  • Wide availability of algorithm implementations
  • Broad spectrum of databases and storage technologies
  • Massively parallel processing commercial solutions
  • Mature virtualization technology
  • Mature real-time transpiling technology
  • Awareness of the data potential

Motivations

  • Data scientists do not literally need to program
  • Data scientists need to abstract tasks algorithmically
  • Cloud-style, web-based platforms provide good interactive support
  • Visual programming is needed

Data mining

Machine learning

Data science 101

Techniques, algorithms and models

How to choose between the different available techniques?

Is my data set ready for what I want to do?

How to formulate the correct question about data?

Predict and evaluate an answer
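
To make "predict and evaluate" concrete, here is a minimal sketch in plain Python. It is not tied to any particular platform. The toy data set, the helper names (train_test_split, fit_line, mean_squared_error) and the 25% hold-out fraction are all assumptions made for illustration. It fits a straight line on one part of the data, predicts on the held-out part, and compares the error against a naive baseline that always predicts the training mean.

```python
import random
from statistics import mean


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them for evaluation."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Ordinary least squares for y = intercept + slope * x on (x, y) pairs."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def mean_squared_error(rows, predict):
    """Average squared difference between observed and predicted y."""
    return mean((y - predict(x)) ** 2 for x, y in rows)


# Hypothetical toy data: y is roughly 2x + 1 plus Gaussian noise.
noise = random.Random(42)
data = [(float(x), 2.0 * x + 1.0 + noise.gauss(0, 1.0)) for x in range(40)]

# Predict: fit on the training rows only.
train, test = train_test_split(data)
intercept, slope = fit_line(train)

# Evaluate: compare against a baseline that always answers the training mean.
baseline = mean(y for _, y in train)
model_mse = mean_squared_error(test, lambda x: intercept + slope * x)
baseline_mse = mean_squared_error(test, lambda x: baseline)
print(f"model MSE: {model_mse:.2f}, baseline MSE: {baseline_mse:.2f}")
```

Evaluating on rows the model never saw, and against a trivial baseline, is one simple way to check whether the answer is better than guessing.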

Standing on the shoulders of giants (or Ctrl+C, Ctrl+V)

Copy workflows

Use external tutorials

Repositories of machine learning experiments

Resources

Kaggle

"Cortana Intelligence Gallery enables our growing community of developers and data scientists to share their analytics solutions".

Graph analysis

https://blog.cloudera.com/blog/2016/10/how-to-do-scalable-graph-analytics-with-apache-spark/

Regression

https://hortonworks.com/tutorial/predicting-airline-delays-using-sparkr/

Sentiment analysis

https://hortonworks.com/tutorial/sentiment-analysis-with-apache-spark/