Fathom That!: Dichotomous Views of Data

For some time now I have been toying with an idea about one aspect of our thinking on data organization. One issue has come up repeatedly of late, and at its heart is the distinction between the unit of analysis and the unit of observation. Here is my attempt to organize my thoughts and see whether they fit anywhere in our article.

Dichotomy of Data Organization

Expertise in dealing with data is scattered across many professions. The statistician knows how to synthesize and summarize data; the computer scientist has known for decades how to structure and store data for fast search and retrieval; the graphic designer is learning how to communicate insights from data; and journalists are increasingly making data representations the thrust of how they present evidence. All of this points to the emergence of a new profession or discipline: data science. For the purposes of this section, we set aside the communication of data and concentrate on its analysis and organization.

As we struggle to identify the process by which students organize the data that they encounter in unstructured forms, we come across the same dichotomy characterized in many different ways:
* Analysis vs. Observation
* Record vs. Represent
* Collect vs. Store

Different concerns justify different organizations. We observe that students routinely create many different data organizations when asked to record data from a traffic protocol.

Snapshot of Road at 8am

Snapshot of Road at 4pm

Specifically:

* About half organized data in fully normalized "flat" tables with repeated values for attributes. We had assumed this was the only structure that could capture all the data comprehensively while still allowing later questions about relations among attributes. That assumption turned out to be wrong.
* A majority used an organizational method that kept the information about individual vehicles together, so that it is possible to determine, e.g., the correlation between distance and speed. Their method was consistent with a hierarchical data model: they partitioned information spatially to reflect different case levels (date/time, lane direction, vehicle information).

This motivates a new way to think about how users can record hierarchical data. Note that this is distinct from representing or analyzing hierarchical data.

Motivation for New Ways to Collect Data
The question that motivates the data collection is often too vague to proactively create a fully normalized flat structure of the kind required by traditional computer programs. Instead, the nature of the situation about which data is being gathered may imply an organizational structure that is very different from the one needed for analysis to answer the motivating question.

We draw inspiration from the student representations to create prototypes for two new data structures: Nested Tables and Partitioned Plot Spaces. Below are the two prototype data structures populated with the data from the traffic snapshots above.

Nested Table

Partitioned Plot Space

Compare them to the "fully flat" table structure required by traditional computer programs.
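To make the contrast concrete, here is a minimal Python sketch of the same traffic data in a nested form and in a fully flat form. This is only an illustration, not the prototype itself: the field names (`time`, `direction`, `type`, `distance`, `speed`) and every value are invented for the example and are not taken from the snapshots. The `flatten` helper is likewise hypothetical.

```python
# Hypothetical nested organization: snapshot -> lane -> vehicle.
# All field names and values below are invented for illustration.
nested = [
    {
        "time": "8am",
        "lanes": [
            {"direction": "east", "vehicles": [
                {"type": "car", "distance": 10, "speed": 30},
                {"type": "truck", "distance": 45, "speed": 25},
            ]},
            {"direction": "west", "vehicles": [
                {"type": "car", "distance": 20, "speed": 35},
            ]},
        ],
    },
    {
        "time": "4pm",
        "lanes": [
            {"direction": "east", "vehicles": [
                {"type": "car", "distance": 5, "speed": 15},
            ]},
        ],
    },
]


def flatten(snapshots):
    """Return one row per vehicle, repeating the parent-level attributes."""
    rows = []
    for snap in snapshots:
        for lane in snap["lanes"]:
            for vehicle in lane["vehicles"]:
                # Parent attributes (time, direction) are copied into every row.
                rows.append({
                    "time": snap["time"],
                    "direction": lane["direction"],
                    **vehicle,
                })
    return rows


flat = flatten(nested)
for row in flat:
    print(row)
```

In the nested form each time and lane direction is written once; in the flat form they repeat on every vehicle row, which is the "repeated values for attributes" pattern described earlier.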

Our next task is to create seamless transitions between the two views, transitions that are illuminating while being obvious. Data "munging", or fitting data to the needs of the analysis, is a task that, while arduous, offers many opportunities to learn about the structure of the data. We have to be careful not to lose this opportunity when creating the transitions for the student.
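One direction of such a transition, from flat rows back to a hierarchy, might be sketched in Python as below. This is an assumption about what a transition could look like, not a description of the prototype: the `nest` helper, the field names, and the data values are all hypothetical.

```python
# Hypothetical flat table: one row per vehicle, parent attributes repeated.
# All field names and values are invented for illustration.
flat_rows = [
    {"time": "8am", "direction": "east", "type": "car", "distance": 10, "speed": 30},
    {"time": "8am", "direction": "east", "type": "truck", "distance": 45, "speed": 25},
    {"time": "8am", "direction": "west", "type": "car", "distance": 20, "speed": 35},
    {"time": "4pm", "direction": "east", "type": "car", "distance": 5, "speed": 15},
]


def nest(rows, levels=("time", "direction")):
    """Group flat rows into nested dicts, one nesting level per name in `levels`."""
    root = {}
    for row in rows:
        node = root
        # Walk down through every grouping level except the last.
        for level in levels[:-1]:
            node = node.setdefault(row[level], {})
        # The last level holds the list of leaf cases (vehicles).
        leaves = node.setdefault(row[levels[-1]], [])
        leaves.append({k: v for k, v in row.items() if k not in levels})
    return root


by_time_then_lane = nest(flat_rows)
by_lane_then_time = nest(flat_rows, levels=("direction", "time"))
```

Changing the order of `levels` regroups the same rows under a different hierarchy, which hints at why the choice of nesting is itself a decision about the unit of analysis.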

to be continued...

Monday, January 3, 2011
