Looking at Data

Infovis 2007 tutorial

Sacramento, California, 31 October 2007

Workshop resources

See the GGobi book site for all slides, code, and data sets used in the tutorial.

Introduction

This tutorial will be useful for anyone doing data mining or working with multivariate data. The methods presented are currently the most useful ways to look at and explore high-dimensional data using interactive and dynamic graphics.

The approaches grew out of work in exploratory data analysis, within the broader context of statistical data analysis. Underlying the graphics is a solid foundation that incorporates sampling variation and probability.

A book on the material will be published in July 2007 by Springer. The tutorial provides a live illustration of the material in the book, and the book provides a follow-up source of information for attendees.

You will get the most out of the tutorial if you have struggled with visualising high-dimensional data in the past. We will be presenting techniques from a different ancestry than most infovis tools, so expect to learn some new ideas.

The methods taught are readily available in open source software, enabling all participants to reproduce, extend and use them with their own data after the workshop.

Outline

  • Toolbox (20 mins). An overview of the statistical graphics toolbox, including multiple linked plots, grand, guided and manual tours, and categorical variable linking.
  • Missing values (30 mins). How are missing values distributed in the data? Are they missing at random, completely at random or not at random? Do the imputed values match the distribution of the complete data?
  • Supervised classification (45 mins). How do we check that the data is consistent with the assumptions of the classification method? How do we assess the results of black box methods such as support vector machines (SVM) and neural networks using graphics? How can we explore class boundaries in high dimensions?
  • Unsupervised classification (45 mins). How do we examine the cluster structure in multivariate data? How do self-organizing maps (SOM) compare to multidimensional scaling (MDS) as a method for summarizing the interpoint distances? How can we visualise how the map wraps itself into the data space?
  • Inference (30 mins). Is the structure we see real or likely due to sampling variability?

If time is available we'll touch on methods for exploring multivariate longitudinal and spatio-temporal data, and multidimensional scaling. The techniques will be demonstrated using R and GGobi.
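
As a rough illustration of the inference question in the outline (is a structure real, or likely due to sampling variability?), here is a minimal sketch of a permutation test. It is written in Python rather than the R and GGobi used in the tutorial, it is not taken from the tutorial materials, and the function names and simulated data are hypothetical. The idea is to shuffle one variable to break any real association, then see how often the shuffled data shows a correlation as strong as the observed one.

```python
import random
import statistics


def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Share of shuffles whose |correlation| is at least the observed one.

    Hypothetical helper for illustration: shuffling ys destroys any real
    link to xs, so the shuffled correlations show what sampling
    variability alone can produce.
    """
    rng = random.Random(seed)
    observed = abs(correlation(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(correlation(xs, shuffled)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)


# Simulated data (an assumption, not one of the tutorial data sets).
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(50)]
y_related = [xi + rng.gauss(0, 0.5) for xi in x]   # built to depend on x
y_noise = [rng.gauss(0, 1) for _ in x]             # independent of x

p_related = permutation_p_value(x, y_related)
p_noise = permutation_p_value(x, y_noise)
print("p-value, related pair:", p_related)
print("p-value, noise pair:", p_noise)
```

A small p-value suggests the observed structure is unlikely to arise from sampling variability alone. The same shuffling idea carries over to graphics: plots of the shuffled data show what "no structure" looks like, for comparison with the plot of the real data.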

About the instructors

Dianne Cook is a full professor at Iowa State University. She has been an active researcher in the field of interactive and dynamic graphics for 16 years, and regularly teaches information visualization, multivariate analysis and data mining. Contact: dicook@iastate.edu

Heike Hofmann is an associate professor at Iowa State University. She is a prolific international researcher on interactive graphical methods for multivariate data, with emphasis on categorical data, modeling and exploratory data analysis. Contact: hofmann@iastate.edu

Michael Lawrence is completing his PhD ("Biologist-accessible software for integrated exploratory analysis and interactive visualization of transcriptomic, metabolomic and biochemical networks") at Iowa State University. Contact: lawremi@iastate.edu

Hadley Wickham is a New Zealander completing his PhD ("A grammar of interactive graphics") at Iowa State University. He is interested in the use of graphics to reveal interesting and unexpected features of data, as well as practical tools to make dealing with real-life data easier. He won the John Chambers Award for Statistical Computing for his work on the ggplot and reshape packages. Contact: hadley@iastate.edu

If you have any more questions, please do not hesitate to contact any one of us.