
I. A. 4. "Backfiles": Processing Large Numbers of Journal Issues at a Time

 

As with Euclid current issues, Metadata Services works with ePublishing Technologies staff to process this metadata. We have done "batch jobs" or "backfiles" for both Euclid and DPubs titles.


1. What's a Backfile?
When we work with large numbers of back issues, a.k.a. a "batch job" or "backfile," we try to automate the process as much as possible.  To do that, the data sent by the publisher has to be evaluated: What file types is the publisher providing?  Is any existing form of metadata included?  Are there any special concerns, such as diacritics, scientific equations, or special issues?  Based on an evaluation done by the ePubs Project Manager (PM), a quote is drawn up for the job, which specifies the work to be done, the amount of time it will take, and the costs involved.  Metadata Services is responsible for correcting any errors or problems in the .xml files that cannot be fixed in an automated way.  The work we do is usually referred to as "handwork."  Our job is to meet our part of the quote, or to stay under the quoted price if possible.  If an unforeseen issue arises that may cause us to exceed the quote, the project manager should be notified as soon as possible.  We may have to find a workaround or a compromise in order to satisfy the customer's needs.

2. What does the workflow look like?
The process of completing a "backfile" project is very complex, with much work done by the ePubs PM, a programmer, and members of the Digital Media Group before any data is ready for Metadata Services staff to work on it.  For an idea of what the general workflow looks like, refer to the "Backfile Digitization Project Checklist" on the e-Publishing Technologies wiki.  There you can see just how involved this process is, and where our workflow fits into the grand scheme.  Because every backfile project is different, it is not practical to write a set of step-by-step instructions detailing the process.  Instead, it is important to know that our work is given to us by the ePubs PM after a programmer has manipulated the data through several iterations.  The ePubs PM will give the Metadata Services staff member verbal and/or written instructions plus a list of .xml records to be reviewed and edited.  Much of the work involves creating or correcting cross-references between related_items and reviewed_items.  Any questions or problems encountered during the editing process should be referred to the ePubs PM.  When finished, the Metadata Services staff member should notify the ePubs PM and the programmer.  They will then work with the data some more, and there may be more refinements to be done by the MS staff member on the same records, or a new set of records with a different kind of problem may be given to the MS staff member to work on.

3. Types of workflows for projects that involve MS: Below is an edited excerpt of a 5/3/2007 email from the current ePubs PM to the Head of Metadata Services (MS) discussing the different types of workflows.

a. "Standard" math journal backfile project
Examples: TMJ, OJM/HMJ, PRIMS

Start with print copies of journal backfile and end with issue directories that contain Euclid metadata .xml files and .pdfs and .tiffs.

Primary metadata source: MR data

Primary Euclid metadata creator: programmer

The programmer initially uses MR data to get a count of the number of articles and pages before a quote is made to the client.

The programmer uses MR data to "predict" page ranges of text units.  After the images are received from the scanning vendor, the programmer compares these predictions with the filenames of .pdfs and multipage .tiffs to detect image files that have been misnamed or have been split incorrectly by the scanning vendor.  The report of "mismatches" is used by DMG (Digital Media Group) in their quality assurance (QA) process on the images; they either fix the image files or, if the number of errors is high, request that the vendor fix them or rescan them.
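The comparison of predicted page ranges against image filenames could be sketched roughly as follows.  This is only an illustration, not the programmer's actual script: the filename convention (first page, hyphen, last page, e.g. "0001-0015.pdf") and the function names are assumptions made for the example.

```python
import re

# Assumed (hypothetical) filename convention: <first page>-<last page>.<ext>,
# for example "0001-0015.pdf". The real vendor convention may differ.
FILENAME_RE = re.compile(r"^(\d+)-(\d+)\.(?:pdf|tiff?)$", re.IGNORECASE)


def parse_page_range(filename):
    """Return (first_page, last_page) from a filename, or None if it does not fit the pattern."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def find_mismatches(predicted_ranges, filenames):
    """Compare predicted page ranges with the ranges encoded in image filenames.

    Returns a dict with three lists:
      missing     - predicted ranges with no matching file
      unexpected  - files whose page range was not predicted
      unparseable - files whose names do not follow the assumed convention
    """
    predicted = set(predicted_ranges)
    found = {}
    unparseable = []
    for name in filenames:
        page_range = parse_page_range(name)
        if page_range is None:
            unparseable.append(name)
        else:
            found.setdefault(page_range, []).append(name)
    missing = sorted(predicted - set(found))
    unexpected = sorted(
        name
        for page_range, names in found.items()
        if page_range not in predicted
        for name in names
    )
    return {"missing": missing, "unexpected": unexpected, "unparseable": unparseable}
```

A file that was split one page too early shows up twice in such a report: the predicted range is listed as "missing" and the delivered file as "unexpected," which is the kind of pairing DMG would then check against the scanned images.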

Once the image files have been finalized, we send a copy of the .pdf's to a vendor for reference linking.  But as a result of problems with our vendor recently, we have asked the programmer to do the reference linking.

The programmer uses MR data to create Euclid metadata issue .xml files that are valid according to the Euclid DTD.  The programmer does this without any mapping supplied by ePublishing Technologies (EPT). (We used to have Keith Dennis do this for us; he was familiar with the MR data and cleaned it up. We haven't had the programmer do a job using raw MR data yet, so we don't know what he might need from us.)

After the first cut at the data, the ePublishing Technologies Project Manager (ePubs PM) looks at the Euclid metadata and suggests changes that can be made in batch.  This stage might involve asking the programmer for a report of files that meet some condition to help assess whether there are problems.  After some back and forth, the programmer and the ePubs PM agree that no more changes will be made to the conversion script; now handwork can be done.  The programmer and the ePubs PM develop a list of files that need attention and perhaps handwork.  This stage might also involve asking the programmer for a report of files that meet some condition to help assess whether there are problems.  For known problems, the programmer or the ePubs PM might assign MS staff some of the handwork; for more complicated handwork, the ePubs PM does it.
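A "report of files that meet some condition" might look something like the sketch below, which lists the .xml files that contain a given element (for instance related_item or reviewed_item).  It is a minimal illustration that assumes the issue directories hold plain .xml files; the function name is hypothetical, and the programmer's real reports may be produced quite differently.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def files_with_element(root_dir, tag):
    """List .xml files under root_dir that contain at least one <tag> element.

    Returns (matches, unparsed). Files that fail to parse are returned
    separately so that they can be checked by hand.
    """
    matches, unparsed = [], []
    for path in sorted(Path(root_dir).rglob("*.xml")):
        try:
            tree = ET.parse(path)
        except ET.ParseError:
            # For example malformed markup, or an entity the parser cannot resolve.
            unparsed.append(path)
            continue
        # iter(tag) walks the whole tree; one hit is enough to flag the file.
        if next(tree.iter(tag), None) is not None:
            matches.append(path)
    return matches, unparsed
```

Running it once per problem area (related_item, reviewed_item, and so on) gives a starting list of records that may need handwork.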

Areas that typically cause problems are related_item(s), reviewed_item(s), supplements, special issues, multiple series, obituaries, nonarticle or review types of records, divs.



b. "Hindawi"-style math journal conversion project
Examples: PJA, BAMS

Start with print copies of journal backfile and end with issue directories that contain Euclid metadata .xml files and .pdfs and .tiffs.

Vendor (Hindawi) supplies both scanned images and simplified Euclid issue- and article-level metadata in rationalized issue directories.

Once image QA on the scanned images is completed, a copy of the .pdfs is sent to a vendor for reference linking (but lately we have asked the programmer to do this).

EPT supplies the programmer with directions to create more fleshed-out article-level metadata from the vendor's metadata.

EPT does metadata QA on the Euclid metadata (to check on quality of vendor's work).  EPT might ask for some assistance from MS on the metadata QA.

EPT works with the programmer to refine the output metadata (first in batch) and later with handwork as in the standard-style project (above).



c. JSTOR-style conversion projects
Examples: IMS journals such as AOP, AOS, AOMS, SS, and ASL journals such as BSL, JSL

Start with article .pdf's and issue- and article-level metadata xml files supplied by JSTOR using TEI format.

EPT supplies MS with a detailed mapping from JSTOR data to Euclid metadata.

The programmer writes a conversion script that produces Euclid issue- and article-level metadata.
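In outline, a mapping-driven conversion script could resemble the sketch below.  Everything specific in it is an assumption made for illustration: the source paths, the target element names, and the "record" wrapper are hypothetical and are not taken from the JSTOR data or the Euclid DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: source path (relative to the article element) -> target element name.
# Neither column reflects the real JSTOR files or the real Euclid DTD.
FIELD_MAP = {
    "title": "title",
    "author": "author",
    "biblScope[@type='pages']": "pages",
}


def convert_article(source, field_map=FIELD_MAP):
    """Build a target record from a source article element using field_map.

    Returns (record, missing), where missing lists the source paths that
    matched nothing; those records are candidates for handwork.
    """
    record = ET.Element("record")
    missing = []
    for src_path, target_tag in field_map.items():
        elements = source.findall(src_path)
        if not elements:
            missing.append(src_path)
        for element in elements:
            # Copy the text content, trimming stray whitespace.
            ET.SubElement(record, target_tag).text = (element.text or "").strip()
    return record, missing
```

Keeping the mapping in a table separate from the conversion logic means that batch refinements agreed between EPT and the programmer can often be made by editing the table rather than the script.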

EPT works with the programmer to refine the output metadata (first in batch) and later with handwork as in the standard-style project.



d. Other publisher-supplied metadata conversion projects
Examples: Berkeley Mathematics Symposium

Start with issue- and article-level metadata .xml files supplied by the publisher in some standard metadata scheme (e.g., METS, MODS, TEI) and article .pdf's from the publisher.

EPT supplies MS with detailed mapping from the publisher-supplied metadata to Euclid metadata.

The programmer writes a conversion script that produces Euclid issue- and article-level metadata.

EPT works with the programmer to refine the output metadata (first in batch) and later with handwork as in the standard-style project.



e. Manual metadata creation projects
Examples: International Press journal issues; Notre Dame mathematical lectures Monograph Series.

A catch-all category for now.  It includes creating Euclid issue- and article-level metadata from publisher-supplied article .pdf's, and creating monograph- and chapter-level metadata from .pdf's of entire monographs.

Since the metadata is created by hand, there is minimal supervision of the work by EPT after the initial project is explained.