Looking at Data

Learning to explore data with graphics

Washington DC, 30-31 July 2009

Graphics are a fundamental part of data analysis, used in initial data inspection and exploration, in model building and checking, and in communicating information. In this course we will teach the basics of static graphics and then move on to new developments in direct manipulation and dynamic graphics that facilitate exploratory data analysis. The methods taught are readily available in open source software, enabling all participants to reproduce, extend and use them with their own data after the workshop.

Who should take this course?

This course is targeted at anyone interested in learning a new way of looking at their data or learning about tools that make producing graphics easier. We will use R to demonstrate static graphics and to link analysis and exploratory graphics, so a basic knowledge of R will be helpful, but not necessary. Ideally, you should have read a book by Bill Cleveland, Naomi Robbins, Stephen Few or Edward Tufte, as these authors all touch on important themes in statistical graphics.

If you are already familiar with GGobi or ggplot2, this course may be too basic, although you will receive expert hands-on instruction that you wouldn't otherwise get.

Please bring your own laptop. Closer to the course we'll let you know what you need to install beforehand.

What will you learn?

The course will be split into two roughly equal parts: static graphics, and direct manipulation/dynamic graphics. We will alternate between instructional and hands-on components. The presentations will provide a solid foundation in the use of graphics, and the hands-on components will give you the practical skills needed to apply these techniques to your own data.

Static graphics (day one)

You will learn how to create a wide variety of static graphics using the ggplot2 R package. In particular, you will learn:

  • The basics of data manipulation. To plot data you need to first get it into a usable format. We will discuss two R packages, reshape and plyr, that provide a toolbox of useful functions for manipulating data.
  • The building blocks of a plot, and the formal grammar that can be used to describe (almost) all statistical graphics. This will help you to critique and reproduce existing graphics, and create new graphics specially tailored for your problems.
  • Geometric objects and statistics, which control exactly what data is displayed and what it looks like.
  • Scales, which adjust how data values are mapped to aesthetic values. We will cover default scales, adjusting scales, and defining your own scales.
  • Facetting (also called conditioning or trellising): displaying the same graph for different subsets of your data is often useful. How can we do this with ggplot2?
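
As a rough illustration of the facetting idea in the list above, the sketch below splits a small data set into one panel per subset and computes shared axis limits so the panels stay comparable. It is written in plain Python so that it runs without any packages; the course itself uses R and ggplot2, and this is not how ggplot2 implements facetting. The function name `facet` and the fuel-economy records are hypothetical and were invented for this example.

```python
from collections import defaultdict


def facet(rows, by):
    """Group rows (dicts) into panels keyed by the value of column `by`."""
    panels = defaultdict(list)
    for row in rows:
        panels[row[by]].append(row)
    return dict(panels)


# Hypothetical fuel-economy records, invented for illustration only.
cars = [
    {"cyl": 4, "displ": 1.8, "hwy": 29},
    {"cyl": 4, "displ": 2.0, "hwy": 31},
    {"cyl": 6, "displ": 2.8, "hwy": 26},
    {"cyl": 8, "displ": 5.3, "hwy": 20},
]

panels = facet(cars, "cyl")

# Every panel shares the same x range, computed from the full data set,
# so that positions can be compared across panels.
x_range = (min(r["displ"] for r in cars), max(r["displ"] for r in cars))

for key in sorted(panels):
    points = [(r["displ"], r["hwy"]) for r in panels[key]]
    print(f"panel cyl = {key}, x range {x_range}: {points}")
```

Each panel here is just a list of points; a plotting layer would draw the same kind of graph once per panel.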

With these basic tools in hand, we will explore how we can apply ggplot2 to different problem domains:

  • Geographic data, including map backgrounds and choropleth maps: choosing good colours, breaking continuous values into bins, and using small multiples.
  • Model diagnostics and summaries for linear and mixed models.
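
To make the binning step for choropleth maps more concrete, here is a minimal sketch that cuts a continuous variable into equal-width bins, so that each region can be drawn in one of a small number of colours. It is plain Python for the sake of a self-contained example (the course works in R), equal-width breaks are only one of several possible choices, and the helper names and the regional rates are hypothetical.

```python
from bisect import bisect_right


def equal_width_breaks(values, n_bins):
    """Return the n_bins - 1 interior break points of equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def assign_bin(value, breaks):
    """Return the 0-based index of the bin that `value` falls into."""
    return bisect_right(breaks, value)


# Hypothetical rates for six regions, invented for illustration only.
rates = [2.1, 3.4, 5.0, 7.8, 9.9, 4.2]

breaks = equal_width_breaks(rates, 3)
bins = [assign_bin(r, breaks) for r in rates]

# Placeholder shade labels; a real map would use a sequential colour palette.
shades = ["light", "medium", "dark"]
for rate, b in zip(rates, bins):
    print(f"{rate:>4}: bin {b} ({shades[b]})")
```

Quantile-based breaks would be a natural alternative when the values are heavily skewed, since equal-width bins can then leave most regions in a single colour.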

Data examples will include diamond prices, movie ratings, and automobile fuel economy, with sizes ranging from 200 to 50,000 rows.

The day will conclude with a discussion of inference for data graphics. Inference for graphics helps you to confirm that you've found something real with your exploratory graphics, not just a random fluctuation. This is an important tool, and creating these graphics will tie together many of the themes of the day.
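
One way to make this idea concrete is a permutation-based "lineup": the plot of the real data is hidden among plots of data in which any relationship has been destroyed by shuffling, and the viewer is asked to pick out the real one. The sketch below is an assumption about how such a procedure could look, not a description of the course materials. It is plain Python (the course uses R), and the function name `lineup` and the example values are hypothetical.

```python
import random


def lineup(x, y, n_panels=20, seed=None):
    """Hide the real (x, y) pairs among panels with permuted y values.

    Returns the list of panels and the index of the real one.
    """
    rng = random.Random(seed)
    true_pos = rng.randrange(n_panels)
    panels = []
    for i in range(n_panels):
        if i == true_pos:
            # The real data, left untouched.
            panels.append(list(zip(x, y)))
        else:
            # A null panel: shuffling y breaks any x-y association
            # while keeping both marginal distributions intact.
            shuffled = list(y)
            rng.shuffle(shuffled)
            panels.append(list(zip(x, shuffled)))
    return panels, true_pos


# Hypothetical values, invented for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.0, 8.3]

panels, true_pos = lineup(x, y, n_panels=5, seed=1)
print(f"{len(panels)} panels generated; the real data is in panel {true_pos}")
```

If a viewer who has not seen the data can reliably pick the real panel out of the lineup, the visible structure is unlikely to be a random fluctuation.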

Direct manipulation/dynamic graphics (day two)

Direct manipulation and dynamic graphics will be demonstrated with GGobi and the R package rggobi, which provides access to GGobi from R. GGobi is an open source visualization program for exploring high-dimensional data. It provides highly interactive and dynamic graphics, such as linked windows and tours, on the familiar scatterplot, barchart and parallel coordinates plots. Direct manipulation on the plots includes scaling, moving points, linked brushing and identification using categorical variables. In this section of the course you will learn about:

  • The toolbox, which contains a collection of basic plot types, ways to link multiple plots and tour methods for examining multivariate data.
  • How to use direct manipulation and dynamic graphics to rapidly explore data and uncover new and unexpected features.

These techniques will be applied to multiple application areas including:

  • Missing values: How are missing values distributed in the data? Are they missing at random, completely at random or not at random? Do the imputed values match the distribution of the complete data?
  • Supervised classification: How can we explore the class structure in a labelled data set in multiple dimensions? How do we check that the data is consistent with the assumptions of the classification method? How do we assess the results of black box methods such as support vector machines (SVM) and neural networks using graphics?
  • Cluster analysis: How do we examine the cluster structure in multivariate data? How can we compare the results from several clustering algorithms? Does the model parameterization in model-based clustering match the variance-covariance present in the data? How do self-organizing maps (SOM) compare to multidimensional scaling (MDS) as a method for summarizing the interpoint distances?
  • Multivariate longitudinal data analysis: Using functional data analysis tools in R, we will explore how to connect data modelling and exploration.