Technical Information

Images

The Glasgow Guardian Digital Archive is a 'film first' newspaper digitisation project: the newspapers were microfilmed first and the digital images were then generated from the microfilm. This contrasts with the Computer Output Microfilm (COM) or 'scan first' approach, whereby microfilm is created from the digital images.

The microfilm was created to National Preservation Office standards, and 300ppi 8-bit greyscale TIFF master images were generated from it. For online delivery the TIFF images were converted to JPEG format: 150ppi (a 50% reduction) for full-size files, and 72ppi at 650 pixels wide for page-view files.
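
The effect of these conversions on pixel dimensions can be sketched with simple arithmetic. The example below is illustrative only: it reads the 50% reduction as a halving of the master's linear resolution (300ppi to 150ppi), and the master size of 4800 x 6600 pixels is a hypothetical figure, not a measurement taken from the archive.

```python
def resample_pixels(pixels: int, source_ppi: int, target_ppi: int) -> int:
    """Pixel length after resampling the same physical length to a new ppi."""
    return round(pixels * target_ppi / source_ppi)


def fit_to_width(width: int, height: int, target_width: int) -> tuple[int, int]:
    """Scale an image to a fixed width, keeping its aspect ratio."""
    return target_width, round(height * target_width / width)


# Hypothetical 300ppi master image (assumed size, for illustration only).
master_width, master_height = 4800, 6600

# Full-size delivery file: 300ppi reduced to 150ppi halves each dimension.
full_size = (
    resample_pixels(master_width, 300, 150),
    resample_pixels(master_height, 300, 150),
)

# Page-view delivery file: fixed 650 pixel width.
page_view = fit_to_width(master_width, master_height, 650)

print("full size:", full_size)
print("page view:", page_view)
```

Under these assumptions the full-size file has a quarter of the master's pixel count, which is the main saving for online delivery.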

Text

Searches are carried out on an uncorrected full-text database generated by Optical Character Recognition (OCR) from the master TIFF files. The accuracy of OCR is particularly sensitive to the quality of the source material, and newspaper print, especially historic newsprint, is of poor quality. The papers in this collection all suffer, to varying degrees, from print bleed-through, noise, fractured characters, joined characters and skewed pages. Holes, tears and fold lines are further problems, and the relatively complex, multi-column layout of newspapers is inherently challenging.

As a consequence of these factors, the level of OCR accuracy within the archive varies considerably, between approximately 90% and 97%. Ideally, uncorrected OCR should have an accuracy rate between 95% and 99% to provide consistently reliable results for full-text searching.

In the sample text below there are 395 characters, including spaces, with 20 errors, giving an accuracy rate of approximately 95%.


With the object of helping to give
freshmen a more complete picture of
JjpJ\rersjty life, The Student s' Repre-
sentative Council held a camp for
c freshers " at Balmaha from rj Ies
clay to Friday of last week.
The project proved a grqat success,
and fufry maintained the standards
of recent years. Nearly sixty
students accepted this opportunity
of meeting their fellow freshmen
under informal conditions.
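
The accuracy figure quoted for this sample can be checked with a short calculation. This is a minimal sketch that assumes accuracy is measured per character, as the share of characters that are not errors; the function name is illustrative.

```python
def character_accuracy(total_chars: int, error_chars: int) -> float:
    """Fraction of characters recognised correctly."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    if not 0 <= error_chars <= total_chars:
        raise ValueError("error_chars must be between 0 and total_chars")
    return (total_chars - error_chars) / total_chars


# Figures quoted for the sample text: 395 characters, 20 errors.
accuracy = character_accuracy(395, 20)
print(f"{accuracy:.1%}")  # prints 94.9%
```

The result, 94.9%, rounds to the 95% stated above.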

It is the project's long-term goal to correct the text manually to improve its accuracy. In the meantime, users should be aware of the current limitations when viewing their search results: not all instances of a word or phrase may be found, and false hits may be returned.
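
Both kinds of failure can be illustrated with the first sentence of the sample text. The sketch below uses a simple whole-word match, which is an assumption and not a description of how the archive's own search works. It also assumes, from context, that the printed words behind the garbled passages were "University" and "Tuesday".

```python
import string

# OCR output copied from the first sentence of the sample text.
OCR_SAMPLE = r"""With the object of helping to give
freshmen a more complete picture of
JjpJ\rersjty life, The Student s' Repre-
sentative Council held a camp for
c freshers " at Balmaha from rj Ies
clay to Friday of last week."""


def tokens(text: str) -> set[str]:
    """Lowercased whitespace-separated words with edge punctuation stripped."""
    stripped = (word.strip(string.punctuation).lower() for word in text.split())
    return {word for word in stripped if word}


def is_hit(term: str, text: str) -> bool:
    """True if the term appears as a whole word in the OCR text."""
    return term.lower() in tokens(text)


for term in ("Friday", "Tuesday", "University", "clay"):
    print(term, is_hit(term, OCR_SAMPLE))
```

Here "Friday" is found correctly, while "Tuesday" and "University" are missed because the OCR garbled them. A search for "clay" returns a hit even though the word was probably never printed: it is a fragment of the misread "Tuesday".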

Database

The text and images in the archive are housed in a bespoke MySQL database, with search queries and results handled by PHP scripts.