Royal Statistical Society
Statistical Computing Section
New Approaches to Software for Statistical Processing
Joint seminar of the RSS-SCS and the ASC
The seminar aims to explain and illustrate how advances in both data structures and technology are starting to impact on statistical computing in its many shapes and forms.
The day is divided into three modules. In the first, introductory session we examine how metadata structures have affected statistical processing and look at what can be achieved by adopting this technology. In the second, we look at how SPSS MR's Dimensions Data Model has been adopted within the commercial market research sector, and at some of the challenges and benefits it presents. Finally, in the third, we focus on the dissemination of data and results, and on some of the software tools now available.
1) Introduction - process integration through formal structures
11:00am - 12:30pm
Simon Musgrave (University of Essex)
"New approaches to structuring data and metadata in statistical systems, with implications for usability and functionality."
Synopsis: Data are so much more than the content of a table in statistical systems. In the web-centric world, data act both as information components in their own right and as a gateway to knowledge.
As information, data are meaningless without good contextual explanations. With a growing reliance on performance indicators, it is essential that data are accompanied by clear metadata that enable an increasingly diverse audience to understand what the numbers mean, how robust they are, and where they come from.
As signposts, numbers can be a gateway to more or less aggregated information enabling a user to interact with the data and understand not just their context but also their relevance via, for example, expanded tables, links to underlying surveys or registers and comparative data.
Taking a user-driven perspective, we require systems that enable the user to navigate through the data life cycle. Consequently we need systems that have the potential to 'join up' the data via robust metadata. These systems have to meet the complex, and sometimes contradictory, goals of transparency and simplicity on the one hand and, on the other, the need to handle and integrate very diverse and complex data structures and manifestations. The existing government metadata standards are a start but make little accommodation for the demands of statistical information, so it behoves us to develop improved metadata models and structures that allow us to build systems that are high in usability and functionality.
- Slides from presentation (PPS format - 963Kb)
Jerome Sopocko (Askia)
"Reusability of survey and questionnaire designs. From design through to dissemination"
Synopsis: Software design has embraced the Object Oriented Design (OOD) paradigm. Will survey design follow? The Market Research industry moved away from machine code (card column punch manipulation) to compiled languages and then to Graphical User Interface software - the next big step could be OOD.
OOD means reusability, and therefore cost savings, through better design across all parts of the survey process and through being able to plan ahead for questionnaire creation, cleaning, weighting and analysis. It could also mean interoperability between software providers.
The presentation will be illustrated by an overview of an OOD inspired system, BMRB's Publisher's Toolkit, which is the software behind the TGI.
- Slides from presentation (PPS format - 370Kb)
Hans Rutjes & Jannes Hartkamp (Desan Research Solutions)
"Issues in 'metadata' standards aiming at the reuse of questionnaires. Experiences from IQML, IT, and others."
Synopsis: The variety of CAI packages used by data collectors, together with the lack of standards and interfaces for sharing data and metadata between them, is an important current issue in data collection. Questionnaire 'scripts' from one package cannot easily be re-used by other packages. Moreover, survey data collected with different CAI systems are usually merged via export to one 'external' data format (e.g. SPSS or PC-Axis), which results in a loss of data collection metadata and other metadata.
Given the collective benefits of having a common standard and the private benefits of having one's own language adopted as the common standard, standardising initiatives naturally exist. Some of these attempt to develop a standard from existing software (e.g. SPSS), others try to build a standard to which existing software could adapt (e.g. IQML).
The presentation explores the difficulties in developing a standard via either of the two ways. The ideal standard is rich enough to be practically useful without being so complex that it is hard to implement. A crucial issue here is that the adoption of a standard needs to have benefits for those who implement it.
- Slides from presentation (PPS format - 91Kb)
2) Using the SPSS MR Data Model
1:30pm - 3:00pm
Sam Winstanley & Richard Kottler (SPSS MR)
"The SPSS MR data model."
Synopsis: SPSS MR was formed by the amalgamation of 4 companies, all offering products and services to the Market Research industry. In bringing these companies together SPSS faced both a challenge and an opportunity. The challenge was getting 4 (or more) sets of software to "talk" to one another. Each was built, like most MR software in the world today, on efficient but proprietary file systems, which makes conversions and imports very difficult if one is to achieve a high degree of match. The concentration of a powerful set of development minds in a centre of excellence gave the opportunity to create a new paradigm in survey software.
Underpinning all these activities is the SPSS MR Data Model, an interface between data and applications that pushes the MR software industry to focus on applications and enterprise data integration instead of reinventing incompatible wheels.
Sam Winstanley, one of the developers of the Data Model, and currently a solutions implementation expert, will describe some of the early thinking and the richness and power of the data model. He will also give an insight into some of the trials and tribulations that were faced in developing the model to its current state.
- Slides from presentation (PPS format - 314Kb)
John Lyon (ATP)
"How Data Model has benefited application development and methods at ATP."
Synopsis: This presentation will be a general overview of how ATP have used the data model to help build a range of reporting solutions, from automated charting to complete web-based cross-tabulation systems. It will cover use of the data model from both directions - use of data model components in end user applications and low level integration with the data model through the development of specialised DSCs and MDSCs.
It will discuss the obvious benefits of the data model in both these scenarios but also point out a few areas we have found frustrating. It will also cover the impact of new releases of the data model on our development process and in particular the new features available for hierarchical data in V2.8.
- Slides from presentation (PPS format - 557Kb)
3) Exchange and dissemination of data and results
3:30pm - 5:00pm
Andrew Westlake (S&SC)
"UXML for data exchange - StatDataML and triple-s."
Synopsis: The movement of statistical micro-data between systems is the most pressing need for many statisticians. This is not a new need, and for many years has been addressed by using fixed or free format data files, plus a data dictionary for layout and labelling information. These two proposals update that idea and implement it using XML. StatDataML is focussed on 'R', while triple-s is widely used for exchange between MR systems.
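The idea of a data file accompanied by a data dictionary can be sketched briefly. The example below is only an illustration under stated assumptions: the element and attribute names (dictionary, variable, start, width, value) are invented for this sketch and do not follow the actual triple-s or StatDataML schemas. It reads a small XML dictionary and uses it to decode a fixed-format data file.

```python
import xml.etree.ElementTree as ET

# Hypothetical data dictionary: element and attribute names are invented
# for illustration and do NOT follow the triple-s or StatDataML schemas.
DICTIONARY_XML = """
<dictionary>
  <variable name="age" start="1" width="2" type="integer" label="Age in years"/>
  <variable name="sex" start="3" width="1" type="coded" label="Sex of respondent">
    <value code="1">Male</value>
    <value code="2">Female</value>
  </variable>
</dictionary>
"""

# Fixed-format micro-data: one record per line, columns defined by the dictionary.
DATA = "341\n272\n"


def read_records(dictionary_xml, data):
    """Decode fixed-format records using the layout held in the XML dictionary."""
    root = ET.fromstring(dictionary_xml)
    records = []
    for line in data.splitlines():
        record = {}
        for var in root.findall("variable"):
            start = int(var.get("start")) - 1  # dictionary columns are 1-based
            raw = line[start:start + int(var.get("width"))]
            if var.get("type") == "integer":
                record[var.get("name")] = int(raw)
            else:
                # Coded variable: replace the code with its label when one is defined.
                labels = {v.get("code"): v.text for v in var.findall("value")}
                record[var.get("name")] = labels.get(raw, raw)
        records.append(record)
    return records


records = read_records(DICTIONARY_XML, DATA)
```

Because the layout and labelling travel with the data in a machine-readable form, a receiving system can rebuild typed, labelled records without any knowledge of the package that produced them.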
- Slides from presentation (PPS format - 282Kb)
Ewan Hunter (Silicon Genetics)
"Statistics, Java, XML, & Biology!"
Synopsis: Silicon Genetics is a private commercial company based at Stanford University, and is the market leader in gene expression data analysis and management.
GeneSpring is programmed in Java, making the application independent of the operating system, and it has an XML backend for better data management at the client level. Due to the complexity and noise (both biological and experimental) in gene expression data, powerful mining techniques are required. Simple interfaces make these complex methods accessible to non-statisticians. Routine data analyses can be automated via our scripting language, reducing complex procedures into single-step processes. A flexible interface allows programmatic exchange of data between GeneSpring and applications, so that in-house analysis routines written in R, SAS, Java, C++ etc. can be incorporated.
The presentation will describe and illustrate some of these capabilities.
- Slides from presentation (PPS format - 4691Kb)
Jostein Ryssevik (Nesstar Ltd)
"Designing and building web-based infrastructures for data sharing and collaboration (with examples from Nesstar)."
Synopsis: What if statistical data could be published as easily on the Web as documents and pictures? Is there a "data Web" on the horizon where numbers, estimates and measures from a variety of publishers could start to integrate and interoperate as easily on the Web as texts and pixels? And what are the requirements for such a "data Web" and how can it be built?
These are the questions that will be addressed in this presentation, which will draw heavily on experiences from the parallel development of NESSTAR (a web-based system for data publishing and on-line analysis) and the DDI (the Data Documentation Initiative). Special focus will be placed on the Semantic Web and Web Services and how these new ideas and concepts might help us to build the statistical dream machine.
- Slides from presentation (PPS format - 1125Kb)
Questions and Answers
5:00pm - 5:30pm
Date & Time
Wednesday 21st January 2004, 11:00am to 5:30pm
Coffee will be available at 10:30am and a buffet lunch will be served.
The cost of attendance at the workshop, which includes refreshments and lunch, is as follows:
- £40.00 for RSS/ASC members
- £35.00 for C.Stats/GradStats
- £25.00 for RSS student members
- £60.00 for non-members
Booking details (and form) are included in the December issue of RSS News and are also available as a PDF file here.
Phil Bowtell (+44 (0)1784 498374)
Andrew Westlake (+44 (0) 20 8374 4723)