Digitizing Archives on the Cheap
By Brian Greene, Columbia College Librarian
Last October a beloved retired welding instructor at Columbia College named Mack Frost passed away. One of the remembrances that was shared included a partial clipping from a decades-old student newspaper profile of him, along with a request for the complete article, if anyone happened to have it. The archive in our library includes scores of old student newspapers and I started poking around to try and find the original piece. Alas I didn’t find it, but looking through the fascinating stories from throughout the college’s history did make me think we should try and digitize the collection. Not only would that facilitate searching, but it would also make the collection available to a much broader audience and serve as a way of preserving and safeguarding the collection in case of a disaster.
Our budget is modest in general and we have very little ongoing funding for preservation projects. That meant an expensive third-party digitization effort and hosted solutions were non-starters. Instead, I brainstormed with one of our savvy student workers about doing everything in-house.1 She was game and we decided to test out our plan, which was exceedingly basic: 1) scan the papers by hand on our lone public-access scanner; 2) OCR the files to make them searchable; 3) upload everything to our website; and 4) add some sort of search functionality. We got started right away and had a couple of test documents uploaded later the same day.
Over the next two months the student worker assigned to the project spent several hours of their time each week scanning student newspapers. Because we were using a public access scanner there were numerous interruptions to work around, but that time was used to check the quality of the scans and OCR the files. By early January all of the papers in our archive were scanned and uploaded to the webpage we created for the project. This included more than 100 issues from eight publications covering a period from 1971 to 2016.
For search functionality we implemented a Google Programmable Search Engine (formerly Google Custom Search Engine). To get an ad-free experience we took advantage of the Google for Education integration that provides ad-free search to non-profits for free. This was easily the most maddening part of the project as the seemingly straightforward registration process included several verification hoops to jump through that were challenging because of our specific organizational structure and who has access to our website. Eventually I got it working and the ads disappeared, which makes for a much more professional looking search results list.
As part of our digitization process we ran OCR on all of the PDFs to make them more accessible and searchable. As we loaded them onto the website and started running searches we quickly realized there were a large number of errors with the OCR results. This was expected on the issues with unusual graphics and fonts, but in fact is widespread throughout the collection. The upshot is that the resulting documents were not optimized for screenreaders and therefore did not meet our accessibility goals. We started to manually fix the errors but our already slow pace for this phase of the project has been exacerbated by the Covid-19 shutdowns and the reality is it will take time to correct everything.
Another concern arose over the summer when a former student contacted the college to request that an issue he was quoted in be taken down. I’m sympathetic to his request and suspect waivers were never acquired from any students over the years. Still, all of the publications were created to be published (albeit not necessarily online) and contributors would not have had an expectation of privacy at the time. In the short term we’ve removed the issue in question until we can determine a policy on how to proceed going forward. If you’ve dealt with this challenge and have suggestions I would appreciate hearing from you.
On a positive note, the introduction to the project on our website encourages people to send us copies of any issues they may have in their possession that we don’t already have in our collection. In response, we have already received missing issues from three different people, one on campus, another in the Bay Area and a third out of state.
While we still have work to do in order to make the collection more accessible and to improve search functionality, all in all this has been a successful project and the feedback from our community has been encouraging. Most importantly, access to the collection has been greatly expanded, with people able to search the titles and read the content from anywhere in the world. Project costs have been close to zero, with literally the only expense being the wages for the student worker involved. And with this collection now preserved alongside previous digitization projects, a sizable portion of our archive has at least rudimentary digital versions backed up in the cloud, an initial safeguard in case of a disaster.