Monday, February 25, 2013

Data Science as a science

For some time I have felt unsatisfied every time I visit the Wikipedia article on Data Science, the canonical Radar article, and some other sources about Data Science. It does not feel that there are any actual definitions of the term "Data Science". We see mostly descriptions of what Data Science includes, requires, or produces. Most of those seem to have a healthy dose of marketing or job security built in. It is much like the feeling around "Big Data" - a term which may have some meaning in a narrow circle of academics, yet completely lost it once abused by marketing teams.

If you look at Wikipedia's Data Science illustration, you will find that at least three of the leaves "feeding" the Data Science concept are not themselves defined on Wikipedia as of January 22nd, 2013 - half a year after the file was uploaded. Specifically, "Domain Expertise", "Hacker Mindset", and "Advanced Computing" are intuitively familiar concepts, but they have no definitions of their own. Strictly speaking, being complex concepts themselves, they are not fit to serve as the basis of another definition.

I think the reason for that is simple: the definition is overly complex. We should make it simple if we want a chance to solve problems. My suggestion is to apply Occam's razor: cut off everything but the essentials and see if that is enough.

A simple definition

Data Science is a data science.

In other words, consider defining the term Data Science through the combination of notions of data and science.

More practically, we may put it more verbosely:

Data Science is the accumulated knowledge about transforming data streams using the scientific method, for the purpose of ongoing value extraction from similar data streams.

Data sets in this context need to have recurring commonalities to be worth considering. With enough commonalities, they can effectively be treated as discrete forms of data streams.

An (incomplete) breakdown of the Scientific Method

One common breakdown of the scientific method (among others) is given in the Overview section of the Scientific Method article on Wikipedia:

  • Formulate a question
  • Hypothesis
  • Prediction
  • Test (Experiment)
  • Analysis

(I was writing this when an excellent post by +Daniel Tunkelang came along.)

Later parts of the Wikipedia article and other resources (including Daniel Tunkelang's post) add extra steps to the process, most importantly:

  • Replication
  • Application

Application is the practical use of the obtained knowledge to create value.

Applying Occam's Razor tests

We now need to walk through the common definition (rather, description) of Data Science and see if this short definition of "Data Science" as "data science" implies what is expressed in the wider description from Wikipedia. I have selected the claims about Data Science from the Wikipedia article that I feel are most representative. Clearly, this is my biased selection, and I do not address repeated concerns.

Data science seeks to use all available and relevant data to effectively tell a story that can be easily understood by non-practitioners.

A result of the scientific method is what is considered a "proven fact", meaning that a consumer of that fact can use it without possessing the special skills needed for the proof itself. It is not entirely clear what "telling a story" means in the article, but I count that as a check.

Data scientists solve complex data problems through employing deep expertise in some scientific discipline.

Complex data problems - check; deep expertise in some scientific discipline - check (considering that Data Science is a scientific discipline by definition).

It is generally expected that data scientists are able to work with various elements of mathematics, statistics and computer science… Data science must be practiced as a team, where across the membership of the team there is expertise and proficiency across all the disciplines

All the mentioned areas of knowledge are necessary in the curriculum of anyone claiming to be a Data Scientist under this definition - check. Must scientific research in this context be a team effort? Yes (see below on hardware knowledge and data size) - check.

Good data scientists are able to apply their skills to achieve a broad spectrum of end results [followed by examples of that spectrum]

Given a data taxonomy or lingo (see below) at the right level of abstraction, results of data science research can apply uniformly to data sets from diverse fields of knowledge - check.

What is the next step?

To live up to being a science, Data Science needs to be able to pose questions in a way that allows hypotheses to be tested on multiple data streams. As it stands right now, we cannot do this in the general case. It is very rare that we can predict that "a data stream with qualities X, Y, and Z satisfies hypothesis H with confidence α." When we can make such a statement, it is great. However, there are no commonly accepted ways to describe a data stream. Without being able to describe the subject of our experiments precisely, in its nature and scope, we will not enable others to reasonably replicate or analyze any experiment we conduct.
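As a sketch of what such a statement might look like in practice, here is a minimal, hypothetical example: a hypothesis H of the form "the stream mean exceeds a threshold", tested at a chosen one-sided confidence level using a simple normal approximation. The function name, the stream, and the threshold are all invented for illustration; real data stream qualities would need the taxonomy discussed below.

```python
import math
import statistics

def supports_hypothesis(samples, threshold):
    """True if H: 'the stream mean exceeds threshold' holds at ~95%
    one-sided confidence, using a normal approximation (z-test)."""
    n = len(samples)
    mean = statistics.mean(samples)
    se = statistics.stdev(samples) / math.sqrt(n)  # standard error of the mean
    z = (mean - threshold) / se
    return z > 1.645  # one-sided z critical value for 95% confidence

# A hypothetical sample drawn from some data stream:
stream = [5.1, 4.9, 5.3, 5.0, 5.2, 5.4, 4.8, 5.1, 5.2, 5.0]
print(supports_hypothesis(stream, threshold=4.5))  # H is supported
print(supports_hypothesis(stream, threshold=5.2))  # H is not supported
```

The point of the sketch is not the statistics, which are textbook, but the precondition: to make the result transferable, we would have to state exactly which streams the descriptor "samples like these" covers.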

With the need to describe our subject of study - data streams - it seems to me that the most dire need of Data Science is for a modern Carl Linnaeus to come onto the scene and create some sort of taxonomy of data and data streams. Although the foundations of such a taxonomy may already be present in the branches of semiotics (syntax, semantics, and pragmatics), there is no unifying data taxonomy effort I am aware of that would enable posing a meaningful hypothesis. Not to say that I am right on this one :-) .

It does not have to be a taxonomy. It may turn out to be a common lingo, specific to each sub-field, as happened in mathematics. The point is that to meaningfully follow the scientific method when exploring data streams, we need the ability to describe the nature and scope of the data elements in the streams being tested against a hypothesis.
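To make the idea concrete, here is a purely hypothetical sketch of what a vendor-neutral "data stream descriptor" might look like, with fields loosely following the semiotic branches mentioned above. Every field name here is invented; an actual taxonomy is precisely the missing piece being argued for.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StreamDescriptor:
    """Hypothetical descriptor of a data stream's nature and scope."""
    syntax: str       # encoding / record shape, e.g. "json-records"
    semantics: str    # what a record denotes, e.g. "user-click-event"
    pragmatics: str   # context of use, e.g. "web-analytics"
    cardinality: str  # scope of the stream, e.g. "~1e6 records/day"

clicks = StreamDescriptor(
    syntax="json-records",
    semantics="user-click-event",
    pragmatics="web-analytics",
    cardinality="~1e6 records/day",
)

# A hypothesis could then be scoped to "all streams matching this descriptor",
# making replication on other streams a well-defined exercise.
print(clicks.semantics)
```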

Does size matter?

With all the disrespect to marketing uses of the "Big Data" label, it is still important to understand whether data size matters for Data Science.

As in every other science, it does. A sample may be too small to prove anything, or so big that it wastes valuable resources. Think of it this way: you do not need a whole bowl of soup to decide whether it is overly salted. If it is reasonably mixed, a tablespoon is enough. The same goes for data samples in any science - they need to be above the threshold of statistical significance for us to be confident in the results.

Consider the confidence illustration in the Wikipedia article on statistical significance. If the signal-to-noise ratio is determined by data quality and the quality of our data taxonomy or lingo, then the rest of the confidence comes from a less-than-linear correlation with sample size. Bigger samples improve confidence, but processing more data once a comfortable confidence level is achieved is just a waste of compute resources.
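The less-than-linear relationship can be illustrated with a small simulation, a sketch under the assumption of independent, identically distributed samples: the spread of a sample mean shrinks only as 1/sqrt(n), so a tenfold increase in data buys roughly a threefold gain in precision.

```python
import math
import random

random.seed(42)

def stderr_of_mean(n, trials=2000):
    """Empirical spread of the sample mean over repeated draws of size n
    from a known source (standard normal here), i.e. the standard error."""
    means = []
    for _ in range(trials):
        sample = [random.gauss(0, 1) for _ in range(n)]
        means.append(sum(sample) / n)
    mu = sum(means) / trials
    return math.sqrt(sum((m - mu) ** 2 for m in means) / trials)

for n in (10, 100, 1000):
    print(n, round(stderr_of_mean(n), 3))
```

Each tenfold jump in sample size shrinks the standard error by only about sqrt(10), which is the diminishing return described above.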

Given the poor signal-to-noise ratio of some internet-generated streams, as well as potentially very selective hypotheses (ones which make a statement about a small subset of the data), there is and will be a need to process massive amounts of data for the sake of a proof, especially on the intake end. That pulls in hardware, data layout, and architecture expertise as prerequisites for projects under those conditions.

Finally, consider the application step of the scientific process loop. As the results of the Data Science process are applied in practice, the value of the process increases with each chunk of data processed. By the nature of the way Data Science produces value, it encourages processing more data. Even if a hypothesis did not take much data to prove, its application, the engineering chunk of the process, may end up dealing with cumulatively large data sets.

A contentious issue of tools

Whether a prediction is tested in Hadoop or MongoDB should matter in only one sense - that the results are replicable using the technology of any capable vendor. Likewise, when chemists test a spectral analysis prediction, it is not ultimately important which brand of spectrometer is used in the experiment, but it is important that the prediction is confirmed outside of a single tool vendor's ecosystem.

Multiple-vendor considerations may put extra constraints on how data stream parameters are specified in a hypothesis.

Is it going to happen?

Science is traditionally carried out in academia and proprietary research labs. However, corporate research labs most often focus on engineering innovations, not progress in theoretical science - so the likes of LinkedIn, Facebook, or Google are unlikely to pick up this fairly abstract topic.

Some colleges offer what they consider Data Science training. Judging by the course descriptions, though, those lean toward practical number-crunching skills rather than the application of the scientific method to data. It remains to be seen whether any of them tackle the generic abilities of Data Science - starting with precise definitions of data streams.

I will cite my lack of understanding of how academia works and withhold a prediction on whether a widespread scientific approach in Data Science is possible. Given the current commercial focus of universities, my expectations are low.

And that is a pity, because we do not know a more effective methodology than the scientific method for achieving reliable knowledge.

This work is licensed to the general public under the Creative Commons Attribution 3.0 Unported License.
