“Everything should be made as simple as possible, but not simpler.” A. Einstein
While investigating the vigorous stirring of scikit-Learn with Unity3D, I am struggling to maintain a clear goal and avoid a chaotic jumble of assorted technologies.
Einstein’s quote above continually echoes in my mind. This quote has an interesting background. Apparently Einstein never wrote or said those words; however, in numerous ways he implied and endorsed this theme of simplicity. The best documentation indicates that the quote arose in the context of evaluating music, not physics. On the other hand, it is an accurate paraphrase of his 1933 lecture on the Method of Theoretical Physics.
If immersive data worlds are to be effective in understanding the complexities of data analytics, then these worlds must be as simple as possible, but not simpler. They must be intuitive, given human experience and human endeavors. Yet, they must vividly show the fundamental complexities inherent in the analytics. A few initial insights follow…
Immersive Implies Experiencing
Most equate the term immersive with technology, rather than its objectives. This year, an explosion of inexpensive and reliable immersive VR technology will inundate us. VR hype will pervade! However, we need to use this technology to ‘experience analytics’. That is, we have to sense, think, and behave as citizens of an immersive data world, in the spirit of Flatland by Edwin Abbott [1]. How would you (as a dataset) sense, think, and behave as you are being processed by some learning algorithm that sucks the informational essence from your bits? Would it feel like a gentle massage or quite the opposite? The point is not to witness the raw matrix operations, but to sense those discerning eigenvectors (for example) pop out, revealing simpler dimensions.
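For example, a tiny scikit-learn sketch (purely illustrative, using made-up toy data) shows what it means for a few eigenvectors to ‘pop out’: the explained-variance ratios of a PCA reveal that most of the structure lives in just two dimensions.

```python
# A minimal sketch (not the actual data-world code): using scikit-learn's PCA
# to watch the "discerning eigenvectors pop out" as a few dominant dimensions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Toy data: 200 samples that really live on a 2-dimensional plane inside 10 dimensions.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

pca = PCA(n_components=10).fit(X)
# The explained-variance ratios reveal that two eigenvectors carry nearly all the "essence".
print(np.round(pca.explained_variance_ratio_, 3))
```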
It is all about experiencing the analytics as simple as possible, but not simpler. It is ironic that our objective is similar to that of our video gaming colleagues when using VR technology. It is about creating an experience, one that is engaging and even entertaining. However, our objective has a more rigorous standard of enhancing human judgment.
Generalizing Beyond Known Data
As previously noted, the purpose of immersive analytics is to augment human judgment with analytical reasoning that generalizes (statistical inference) beyond known data. It is in the spirit of Decision Support Systems from several decades ago.
I have been reading The Master Algorithm [2] by Pedro Domingos (which I plan to address in a later blog). As contrasted with ‘generalizing’, the objective of ‘learning’ seems more robust. In this book and earlier writings, Domingos explains the components of (machine) learning as Representation + Evaluation + Optimization where:
- Representation is the fit/predict model, both its structure (supervised classification/regression or unsupervised clustering) and its hyperparameters.
- Evaluation (e.g., via cross-validation) is a method to “distinguish good learners from bad ones”.
- Optimization is a process of generating and evaluating candidate models, for example by combining them into ensembles with techniques like bagging, boosting, and stacking.
The point is that a data world should render many and varied models of the same ground truth, and that the best solution may be a combination of several models. Further, this may continually change and evolve over time. …While hopefully keeping it as simple as possible, but not simpler.
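As a rough illustration (my own mapping onto scikit-learn idioms, not code from the book), the three components might look like this: several candidate representations, cross-validation as the evaluation, and a simple voting ensemble as one crude stand-in for optimization by combination.

```python
# A hedged sketch of Domingos' Representation + Evaluation + Optimization,
# expressed in scikit-learn terms (my mapping, not Domingos' code).
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Representation: candidate fit/predict models, including bagged and boosted ensembles.
candidates = {
    "tree": DecisionTreeClassifier(max_depth=3),
    "bagged trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=25),
    "boosted trees": AdaBoostClassifier(n_estimators=25),
    "logistic": LogisticRegression(max_iter=1000),
}

# Evaluation: cross-validation distinguishes good learners from bad ones.
for name, model in candidates.items():
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))

# Optimization (here, simple combination): a voting ensemble over several representations.
ensemble = VotingClassifier([(name, model) for name, model in candidates.items()])
print("ensemble", cross_val_score(ensemble, X, y, cv=5).mean().round(3))
```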
Data Physics
It is essential that we get the physics correct for a data world. What is the analogy for weight, gravity, momentum and the like?
What’s up? It seems that the ‘up’ dimension should be a measure of abstraction, with the ground being the ‘ground truth’ of observational data. And, below the ground is the unknowable complexity of reality. So, the analogy is to build skyscrapers that allow us to generalize beyond the ground truth. The taller the skyscraper, the better (and hopefully simpler) the generalization. Learning would appear as a vigorous forest of skyscrapers, constantly striving upward.
Information entropy measures the uncertainty or surprise in information (i.e., a prediction as generated by the model). Hence, going in the ‘up’ direction should imply more certainty and less surprise. We might even find a specific measure that uses units of bits or shannons. The fewer the shannons, the higher our analytic models would rise!
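One hypothetical mapping (the function names and the scaling rule are mine, just a sketch) would convert the entropy of a model’s predicted class distribution into the height of its skyscraper:

```python
# A hypothetical mapping from predictive entropy (in shannons) to skyscraper height.
# Lower entropy (more certainty) => the model rises higher above the ground truth.
import numpy as np

def entropy_bits(probabilities):
    """Shannon entropy of a predicted class distribution, in bits (shannons)."""
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]                      # ignore zero-probability classes
    return float(-np.sum(p * np.log2(p)))

def skyscraper_height(probabilities, max_height=100.0):
    """Toy rule: full height for a certain prediction, ground level for a uniform one."""
    h_max = np.log2(len(probabilities))   # worst case: uniform distribution
    return max_height * (1.0 - entropy_bits(probabilities) / h_max)

print(skyscraper_height([0.98, 0.01, 0.01]))   # confident prediction -> tall
print(skyscraper_height([0.34, 0.33, 0.33]))   # uncertain prediction -> near the ground
```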
Amid this upward dynamic, another physics component for data worlds would be an attractive force between clusters of similar data. In fact, if two datasets were identical, they should blend into one dataset. Using the analogy of parallel coordinate plots, another example is that two attribute axes would attract each other if their data exhibit a high correlation.
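As a sketch of one possible attraction rule (again, the names and scaling are hypothetical), the absolute correlation between two attributes could set the strength of the force pulling their axes together:

```python
# A hypothetical attraction rule for the data-world physics: the absolute
# correlation between two attributes sets the strength of the force that pulls
# their parallel-coordinate axes toward each other.
import numpy as np

def attraction_strength(a, b):
    """Force scale in [0, 1]: 1 for perfectly correlated attributes, 0 for uncorrelated."""
    return abs(float(np.corrcoef(a, b)[0, 1]))

rng = np.random.RandomState(1)
x = rng.normal(size=500)
noisy_copy = x + 0.1 * rng.normal(size=500)     # highly correlated attribute
unrelated = rng.normal(size=500)                # independent attribute

print(attraction_strength(x, noisy_copy))       # close to 1 -> axes drift together
print(attraction_strength(x, unrelated))        # close to 0 -> axes stay apart
```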
These are a few initial thoughts on the physics of a data world. Much more thought is needed! …While hopefully keeping it as simple as possible, but not simpler.
Clean Interface
The final insight is the requirement for a clean interface between analytics and its rendering as a data world. It seems that all interaction should pass through a structured (SQL) database, at least initially. In other words, the analytic processes create results at various levels of abstraction and insert them into an SQL database. These results are retrieved and rendered as the data world.
Why? It will force us to define the data structures required to create data worlds as interlinking SQL tables. Everything will be driven by data from the analytics, and the data world can focus on properly rendering those results from the database.
For prototype development, SQLite is a good choice, being a stable component of both Python (via pandas and sqlalchemy) and Unity3D (via SimpleSQL). I have made good progress with this approach, as will be documented in future blogs.
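As a hedged sketch of that interface (the table and column names here are hypothetical), the Python side might write its results into SQLite with pandas and sqlalchemy, leaving Unity3D (via SimpleSQL) to simply query and render them:

```python
# A minimal sketch of the "clean interface" idea: the analytics side writes its
# results, at a chosen level of abstraction, into a SQLite table that the Unity3D
# side (via SimpleSQL) can later read and render. Table/column names are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///data_world.db")

results = pd.DataFrame({
    "model_id":    ["pca_2d", "pca_2d", "kmeans_3"],
    "abstraction": [1, 1, 2],             # 0 = ground truth, higher = more abstract
    "entity":      ["component_1", "component_2", "cluster_centers"],
    "value":       [0.72, 0.21, 3.0],     # e.g., explained variance or cluster count
})

# The analytics process inserts (or replaces) its results...
results.to_sql("analytic_results", engine, if_exists="replace", index=False)

# ...and the rendering side simply queries them back.
print(pd.read_sql("SELECT * FROM analytic_results WHERE abstraction = 1", engine))
```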
– – –
In conclusion, the above insights are several disconnected ideas that emerged from initial work with scikit-Learn, some of which were merged into the IEEE VR2016 paper.