Honours project for Daniel Ballinger
19 March 2002
The objective of this project will be to apply program visualisation techniques to a corpus of spreadsheets to analyse the dependency structures and layout patterns/organisation. More specifically (based on my current understanding or corpus analysis), to investigate styles and approaches of how people program visual spreadsheets, such as Microsoft Excel, to help improve understanding of what makes a high-quality layout. This reason as to why the corpus analysis is carried out will be reviewed as the project is undertaken.
In recent years there have been several projects relating to the visualisation of visual programs performed by James Noble and Robert Biddle. My project will apply a similar process to the analysis of a large body of spreadsheets.
Based on definitions from [WFE]:
Spreadsheets are a visual programming language designed to perform general computation tasks using spatial relationships rather than time as the primary organising principle. Many programs designed to perform general computation use timing, the ordering of computational steps, as their primary way to organise a program. A well-defined entry point is used to determine the first instructions, and all other instructions must be reachable from that point.
In a spreadsheet, however, a set of cells is defined, with a spatial relation to one another. In the earliest spreadsheets, this arrangement was a simple two-dimensional grid. Over time, the model has been expanded to include a third dimension, and in some cases a series of named grids. The most advanced examples allow inversion and rotation operations that can slice and project the data set in various ways.
It is the spatial relationships or dependencies between cells, both in two and three dimensions that are of interest to visualise. Excel being a visual programming environment, generally has acyclic (can't contain circular references) relationships between cells [DSGPASL] [QCS], which creates a tree like dependency structure (or a forest of trees due to multiple roots). It should be noted that this is not always the case [Refer. Extract from Excel Help File].
Spreadsheets have a wealth of uses in businesses, physical sciences, statistics and many other areas. A good understanding of how people use them could encourage better design, layout, and functionality of spreadsheet programs.
Major steps in the project:
Some of these phases may need to be carried out in parallel, go through several iterations, and be revisited later in the project.
One simple way in which it may be possible to collect the spreadsheets would be to add filetype:xls and num:100 as terms in a search using G o o g l e . This would return a list of links to 100 Excel files. A Download manager like GetRight v4.3 could be given the resulting search page then download and store all the Excel files with no programming required on my part. By choosing appropriate search terms it may be possible to narrow down the search results to those of a particular category. For example, using terms like "Fiscal year", "revenue", and "statement" would mainly produce financial records. I consider this method of collecting spreadsheets valid as how they are collected is not as interesting as their actual contents in terms of this project.
Research will be undertaken into the general methodologies of corpus analysis with the intent of looking for generally accepted guidelines. This will involve looking for relevant literature, web pages, and research papers.
The most suitable style of graphics tool will most likely be vector based with a file format that can be easily produced/read (ASCII). SVG is one possibility that has been used in other honours projects [SVG]. The ability to create three dimensional images using native support from the graphics tool would be a huge advantage.
Data will exist in large clumps or islands in many spreadsheets and this clustering (pockets of data) of cells should be visible in an analysis of numerous spreadsheets. I suspect that because many spreadsheets would be arranged to facilitate printing that an A4 (paper size) ratio pattern would emerge.
Figure 1: A topographical map with contouring lines.
To show this property I intend to create a diagram based on how many times
each cell is occupied over all the spreadsheets, creating a diagram similar
to the Real Estate Utilisation figure from [VVP].
This could be shown in a diagram analogous to a topographic map with elevation/contour
information. A more ambitious approach would be to create a 3D image. Sectioning
of this contouring map could be used to show distribution of cells in two directions
from the origin.
This diagram could also be applied to different layers of worksheet to examine layer utilisation.
Figure 2: Scatterplot from [PVVP]
The rectangular grid layout of a spreadsheet would cause many of the references
between cells to be vertical or horizontal. Fixed cell references will be one
of the main causes of long vectors and angled vectors.
I suspect that summation will be one of the most common commands, leading to large clusters of data being referenced by very few cells in many situations.
The principle reason behind this will be syntax for block selections/references (Ranges) of cells being rectangular in Excel.
The regular patterns used in cell referencing could be an area of interest
in visualising. Like the $ replication command that explicitly encourages this.
Related work has been done by John T. Stasko on execution patterns and by the IBM Jinsight project about finding patterns with visualisation [IBMJ].
The three dimensional property of many worksheets would show a flow of dependencies between the worksheets. I suspect that most dependencies would flow down from the front sheets towards the lower sheets.
Data flow pipes/valleys and ridges. Creating another diagram similar to the topographical maps again but this time increasing the count on each cell whenever a reference passes through it. I suspect this may create a strong ridging and valley effect. Drawing diagrams that show all the vectors at once with their relative start and finish positions.
It may be possible to create a tree style diagram using the dependency structures of many spreadsheets.
The final three phases should flow on from the earlier work. It makes sense to do smaller scale testing of the visualisations if they end up being computation intensive.
The type of data being stored in the worksheets could immensely effect the layout. I'd like to categorise the spreadsheets so it would be possible to compare styles between, say, accounting and scientific sheets.
It may be necessary to limit or crop the size of the spreadsheets as data may stretch out indefinitely. Some consideration will have to be given to the compression and expansion of column width and row height. Ideally any formatting could be ignored (merged cells?) along with the actual value of the cell. Only that the cell is occupied and any references it has to other cells is of interest. Some elements of the spreadsheet may have to be ignored for simplicity, such as tables, links to external documents, macros, embedded images, etc. Any imbedded programs, like Visual Basic for Applications, that allow excel to go beyond normal spreadsheet programming will be excluded, at least initially, from the corpus.
(Dates marked in bold are official deadlines to meet course requirements, other dates are considered as a guide only.)
|15 March||Project Proposal (this document)|
|22 March||Phase 1 - Collection - Complete collecting a large body of spread sheets to look at the type of data will be working with. Also look at categorising.|
|5 April||Phase 2 - Guidelines - Complete investigation into the generally accepted guidelines of corpus analysis.|
|19 April||Phase 3 - Tools - Complete investigation into visualisation tools.|
|26 April||Phase 4 - File Format - Based on understanding of visualisation tool.|
|11 May||Phase 5 - Visualisations - Looking at possible diagrams to produce.|
|24 May||Phase 6 - Possible Merits - Speculating as to the use of various diagrams.|
|31 May||Honours Progress report due|
|21 June||Phase 7 - Testing - Plausibility of visualisations.|
|12 July||Phase 8 - Experiment - Producing full scale diagrams using full corpus.|
|16 August||Phase 9 - Results - Analysing and drawing conclusions.|
|3 September||Honours Draft report due (to supervisor)|
|2nd half of trimester 2||Honours Seminar 20 - 30 minutes|
|18 October||Honours Final report due (to graduate coordinator)|
Some time will also need to be set aside for document production.
[WFE] - Wikipedia (Free Encyclopaedia).
[DSGPASL] Alan G. Yoder, David L. Cohn (2002).
Domain-Specific and General-Purpose Aspects of Spreadsheet Languages, Distributed Computing Research Lab, University of Notre Dame.
[QCS] David Chadwick, Brian Knight, Kamalasen Rajalingham (2002).
Quality Control in Spreadsheets: A Visual Approach using Color Codings To Reduce Errors In Formulae, Information Integrity Research Centre,
School of Computing & Mathematical Sciences, University of Greenwich.
[VVP] James Noble, Robert Biddle. (2001).
Visualising 1,051 Visual Programs - Module Choice and Layout in the Nord Modular Patch Language, Australian Computer Society, Inc.
[PVVP] James Noble, Robert Biddle. (August 2001).
Program Visualisation for Visual Programs, Victoria University of Wellington.
http://www.mcs.vuw.ac.nz/~elvis/db/hfiles/circular_ref.txt - Extract from Excel Help File.
http://www.mcs.vuw.ac.nz/~elvis/db/calendar.html - Calendar of relevant dates.