
Preparing Project Euclid Journal Issues (LTS Procedure # 69)

 

08/01/07 PLEASE NOTE: Due to the flexible nature of online publishing practices, this document is constantly in the process of being updated, and may not accurately reflect current practices.  If you have any questions regarding this procedure, please contact Nancy Solla.

Scope: This document outlines the procedures for the creation and maintenance of Project Euclid Journal Issue metadata.

Contact: Nancy Solla

Unit: Metadata Services

Date created: 08/08/07

Date of next review: August 2009 

 

I. Introduction

A. What is Euclid and what is our part in it?

1. From the Project Euclid Website:

"Project Euclid's mission is to advance scholarly communication in the field of theoretical and applied mathematics and statistics. Project Euclid is designed to address the unique needs of low-cost independent and society journals. Through a collaborative partnership arrangement, these publishers join forces and participate in an online presence with advanced functionality, without sacrificing their intellectual or economic independence or commitment to low subscription prices. Full-text searching, reference linking, interoperability through the Open Archives Initiative, and long-term retention of data are all important components of the project.

The end result is a vibrant online information community for independent and society journals. This will assure that mathematics and statistics will continue to benefit from a healthy balance of commercial enterprises, scholarly societies, and independent publishers."

2. Workflow Between DLIT and MS: This section walks through the workflow that transacts between DLIT staff and MS staff when working on a "current issue" for Project Euclid. There may be variations of this workflow, depending on the agreements made with the publisher, filetypes and data available to us, etc.

3. Doing our part: Our task is to create an online version of the journal issue which will essentially duplicate its printed counterpart. We may choose not to include certain pages (for example, any advertising pages which may be part of the print version), focusing instead on the "meat" of the content. The table of contents, notes from the editor, the list of editorial staff for the issue, the articles themselves, indices, corrections, and addenda are all examples of content to be included in the online version we are creating. We receive TeX and PDF files for the journal's articles from the publishers, or we harvest them from online sources. (Occasionally there will be TIFF or DjVu files included as well. There are also often files included which are extraneous for our purposes. See File Structure, and see also Data Submission Guidelines, below.) We then create an XML document to accompany these TeX and PDF files. This XML document is created in XMLSpy or in Oxygen, and contains links to the TeX and PDF files. The whole "package" is compressed into a .zip file and submitted to the Euclid parser, which will accept or reject the package, depending upon the validity of the XML and the completeness of the package. The metadata becomes the content on the Project Euclid site. This procedure outlines the protocol for preparing the metadata for Euclid submission.
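The packaging step described above can be sketched in a few lines of Python. This is only an illustration, not the unit's actual tooling: the function name build_package, the choice to include only .xml, .tex, and .pdf files, and the flat layout inside the archive are all assumptions, and supplementary files requested by a publisher would need to be added by hand.

```python
import zipfile
from pathlib import Path

# Assumption: a package holds the XML file plus the TeX and PDF files it links to.
PACKAGE_SUFFIXES = {".xml", ".tex", ".pdf"}

def build_package(issue_folder, zip_name):
    """Zip the XML, TeX, and PDF files of an issue folder into one flat archive."""
    folder = Path(issue_folder)
    members = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in PACKAGE_SUFFIXES
    )
    if not any(p.suffix.lower() == ".xml" for p in members):
        raise ValueError("no XML metadata file found in the issue folder")
    zip_path = folder / zip_name
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in members:
            # arcname=path.name stores each file at the top level of the archive.
            archive.write(path, arcname=path.name)
    return zip_path
```

A call such as build_package("aop031001", "aop031001.zip") would leave the archive in the issue folder alongside the source files.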

4. The Publisher's Part: Below is a copy of the memo sent to journal publishers explaining our requirements for submitting journal issue content to Metadata Services for Euclid article-metadata production.

Data Submission Guidelines for Project Euclid

CD's containing these files can be sent to Nancy Solla, Metadata Services, 107E Olin Library, Cornell University, Ithaca NY 14853.

ZIP files containing these files can be sent to metaserve-L@cornell.edu. Any questions regarding the delivery of raw data may also be sent to this address.

  1. Send one web-ready PDF file (i.e., with no printer marks) for each article, review, addendum, etc., in the issue. We can accept one high-resolution PDF and one reduced-resolution PDF for each article, but not multiples of each PDF type for the same article.
  2. Send TeX for each article in the issue.
  3. If available, you may send a DjVu file for each article.
  4. Send a PDF (or some eye-readable version) of the table of contents for the issue. This allows us to verify the files we have against the table of contents and to confirm the order of the articles in the issue.
  5. Please name PDF and TeX files consistently. It saves time if we can easily match corresponding PDF and TeX files and can associate file names with the articles they represent.
  6. To expedite our work, folder structures should be as simple as possible. Please do not separate the files for different articles into separate folders.
  7. You may include "supplemental materials", which are loosely defined as any ancillary files related to, or which in some way supplement, the main article. Things like tables, charts, figures, or even small software programs are included in this category. For an example, see: http://projecteuclid.org/getRecord?id=euclid.em/1128371753. Please include documentation explaining with which article(s) these supplemental materials are associated, and any special instructions necessary to ensure their proper rendering in the Project Euclid interface.

5. The Difference Between Euclid and DPubs Issues: Metadata Services staff create metadata for journals besides Project Euclid titles. These non-Euclid journals are colloquially referred to as "DPubs" journals. There are a few differences between Euclid and DPubs journals.

6. The Difference Between Euclid "Backfiles" and Current Issues: See this document for an outline of the batch job process. There are times when we are contracted to harvest data and create metadata for all the back issues of a journal. This process is colloquially referred to as a "backfile" or "batch job," and its process can be quite different from that of a single, current issue. Backfiles may be done for both Euclid and DPubs titles.

7. Project Euclid Monographs: See this document for an outline of the procedures used for processing a monographic series in Project Euclid.

B. Basic Definitions

1. TeX

2. Unicode

3. DTD's

4. PDF's

5. Parser

6. XML

7. TIFF's

8. DjVu

9. ZIP

 


  

II. Data Harvesting and Management

A. How do we receive and store Euclid data?

1. Explanation of file structure: All Metadata Services staff have access to a shared drive on the "Library21" server. Depending upon when the drive was mapped and who mapped it, it may be assigned a different drive letter on your computer than the one used in the examples in this document. (See also "Using a Previous Issue as a Template.") In the MSU shared folder on Library21, there is a folder for each Euclid journal title, using the official Euclid journal abbreviation as the folder name. Within each of those folders are folders for each issue; sometimes these issue folders are tucked within a folder for the entire volume, but not always. The issue folder name should have the volume and issue numbers entered in a 6-digit format, preceded by the Euclid journal abbreviation (although you will find some that are not entered that way).

Ex: The correct path to the issue folder for Annals of Probability, volume 31, issue 1 would be: Library21/metadata/aop/aop31/aop031001.
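The naming convention can be expressed as a short Python sketch. It assumes, based only on the single example above, that the 6-digit format is three zero-padded digits for the volume followed by three for the issue, and that the volume folder uses the unpadded volume number. The function names are illustrative.

```python
from pathlib import PurePosixPath

def issue_folder_name(abbrev: str, volume: int, issue: int) -> str:
    # Abbreviation followed by a zero-padded 3-digit volume and 3-digit issue.
    return f"{abbrev.lower()}{volume:03d}{issue:03d}"

def issue_folder_path(abbrev: str, volume: int, issue: int) -> PurePosixPath:
    # Layout taken from the example: <root>/<abbrev>/<abbrev><volume>/<issue folder>.
    # In practice the volume-level folder is not always present.
    abbrev = abbrev.lower()
    root = PurePosixPath("Library21/metadata")
    return root / abbrev / f"{abbrev}{volume}" / issue_folder_name(abbrev, volume, issue)

print(issue_folder_path("aop", 31, 1))  # Library21/metadata/aop/aop31/aop031001
```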

The folder should contain the TeX and PDF files for each article in the issue. The folder will also be home to the final XML and the zipped file containing the XML file and all the accompanying TeX and PDF files to be submitted to the Euclid server. The issue folder (and sometimes the volume folder) is created at the time the data comes from the publisher and is loaded onto the shared drive. Whoever is creating the XML for an issue is responsible for loading the data, creating the necessary folders, organizing them, and analyzing the data to determine which files are to be used in the final submission. (Often the publishers send files which are necessary for their processes, but are extraneous for our purposes. When receiving files from the publisher via email, it is advisable to place all files received from them in this folder. You may not need everything they send us for your work, but it should be saved in the folder if you are not receiving the data on a disk. That having been said, sometimes publishers send us supplementary files, beyond TeX and PDFs, which should be referred to in the .xml and should be in the submission package. Publishers have been requested to include instructions with these files, but they may forget.)

2. Disks: Many journals are sent to us on disks. Sometimes more than one issue is sent on a disk, in which case a separate file structure has to be set up for each issue. All the PDF and TeX files on the disk should be loaded into the folder for the issue(s). "Extraneous" files such as .eps and .aux files do not have to be loaded onto Library21, unless of course you have received instructions from the publisher to include them in the issue. (If you have need of them, you can always go back to the disk for them, so don't worry if you don't put them up there immediately.) You may find that the data is incomplete: there may not be TeX or PDF files for some articles. If that is the case, you will need to check the MathSciNet website to see if you can harvest the missing TeX files from there. (See "Downloading TeX Files from MathSciNet.") Any TeX files not available on MathSciNet, and any missing PDF files, will have to be obtained from the publisher. Ask the DLIT Project Euclid Project Manager to contact the publisher and request any missing files.
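A quick way to spot incomplete data is to compare the TeX and PDF files in the issue folder. The sketch below assumes the publisher has named corresponding files consistently, so that "Butler.tex" and "Butler.pdf" belong to the same article. The function name is hypothetical, and the result still has to be checked against the table of contents.

```python
from pathlib import Path

def missing_counterparts(issue_folder):
    """Return (stems with a PDF but no TeX, stems with a TeX but no PDF)."""
    folder = Path(issue_folder)
    tex = {p.stem for p in folder.iterdir() if p.suffix.lower() == ".tex"}
    pdf = {p.stem for p in folder.iterdir() if p.suffix.lower() == ".pdf"}
    # Set differences give the file stems that lack a partner.
    return sorted(pdf - tex), sorted(tex - pdf)
```

Stems in the first list are candidates for harvesting from MathSciNet; stems in the second list are PDFs to request from the publisher.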

3. A Note about Filenames: There are times when the filenames created by the publisher are confusing or poorly formed, and the files need to be renamed. For example, TeX files often have a "double file extension" on the end, e.g. "Butler.tex.txt". You should remove the ".txt" from the end of the filename; otherwise the Euclid server will not recognize it as a TeX file. Other times you may receive files with names containing tildes, dashes, or other punctuation or diacritics. These characters may cause the server to reject the file, so it is good practice to remove them from the filename.
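The renaming rules above could be applied with a small helper like the following sketch. The list of "real" extensions and the decision to keep only ASCII letters, digits, and underscores are assumptions made for illustration; the exact set of characters the Euclid server rejects is not documented here. Because characters are dropped, two different names could collapse to the same result, so check for collisions before renaming.

```python
import re
import unicodedata

# Assumption: extensions that a stray trailing ".txt" might be hiding.
KNOWN_EXTENSIONS = {".tex", ".pdf", ".xml", ".djvu", ".tif", ".tiff"}

def clean_filename(name: str) -> str:
    # Drop a trailing ".txt" when it hides a real extension, e.g. "Butler.tex.txt".
    if name.lower().endswith(".txt"):
        inner = name[:-4]
        dot = inner.rfind(".")
        if dot != -1 and inner[dot:].lower() in KNOWN_EXTENSIONS:
            name = inner
    stem, dot, ext = name.rpartition(".")
    if not dot:
        stem, ext = name, ""
    # Decompose accented letters and drop the combining marks.
    stem = unicodedata.normalize("NFKD", stem)
    stem = "".join(ch for ch in stem if not unicodedata.combining(ch))
    # Remove tildes, dashes, spaces, and any other punctuation.
    stem = re.sub(r"[^A-Za-z0-9_]", "", stem)
    return f"{stem}.{ext}" if ext else stem

print(clean_filename("Butler.tex.txt"))  # Butler.tex
```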

4. Downloading TeX Files from MathSciNet

5. Creating Metadata for International Press Titles: We have a special agreement with International Press (IP) in which they contact us via email when the newest issues have been mounted on their website.  We then add them to Euclid by following the procedure below. IP publishes AJM, ATMP, CAG, CIS, CMS, HHA, JDG, JSG, and MAA.

Harvesting the PDF's:

  •  Go to www.intlpress.com.
  •  On the left side of the page is a navigation bar with the titles listed in alphabetical order by title abbreviation. Click on the abbreviation of the journal you wish to work on.
  •  Once the page for the journal title has opened, click on “Browse Journal,” found in the middle of the page.
  •  Select the issue you wish to work on.
  •  Open each .pdf by double-clicking on the “View .pdf” link. It will open Adobe Acrobat within your browser window.
  •  Click the “Save a copy” icon in the top left corner of your window. Browse to find the folder for the journal title. If no folder exists for this particular issue of the journal, create one, using our naming convention. (See "Using a Previous Issue as a Template.")
  •  Click “save.”
  •  You will need to do this for every .pdf for the issue. To get out of the .pdf and get back to the issue, click the “Back” button on your browser toolbar.

You will have to create the .xml file without any .tex files. The .pdfs are OCR’d, however, so you can select, copy, and paste things from them into your .xml. If there are complicated math expressions in the abstract, you will probably have to leave the abstract out. If there’s math in the title, you will have to find the .tex somehow. You may be able to find the .tex expressions you need by searching in MathSciNet. You can search by keyword, or try searching for other articles by the same author. If you are having problems with this, ask the Metadata Services Euclid Project Manager for assistance.

There aren’t any .pdfs of the tables of contents for these issues. If you want a copy of the TOC information for your own reference while you’re working on the issue, you can copy and paste the content from the issue’s webpage into a Word file.

6. Harvesting Data for A. K. Peters, Ltd. Titles: We have a special agreement with A. K. Peters, Ltd. (AKP) in which they contact us via email when .zip files containing the .tex and .pdf files for the newest issues have been mounted on their website.  We then add them to Euclid by following the procedure below. AKP publishes Experimental Math (EM) and Internet Math (IM).

Harvesting Datafiles:

  • Alice or Charlotte will send an email saying that the files are ready to be loaded.  The email will contain a link to the .zip file, which is loaded on their website.
  • Click on the link to the .zip file. A dialog box will open. Choose "Open with...(your .zip program)" and click ok.
  • The .zip file will open in your .zip program.  Select all the individual files within it, click "Extract" and browse to the folder for the issue in hand.  (You can create this folder now if necessary.)
  • The files should appear in the folder for the issue in hand.  Be aware that they may be contained within a subfolder, as .zip files sometimes create extra folder layers.  If necessary, rearrange the issue folder so that the data files sit in the root folder.  They're easier to work with that way.
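For anyone who prefers to script the extraction, the sketch below unpacks an already-downloaded .zip file straight into the issue folder and discards any extra folder layers, so that the data files sit in the root folder. The function name is hypothetical, and the refusal to overwrite existing files is a design choice, not part of the procedure.

```python
import zipfile
from pathlib import Path, PurePosixPath

def extract_flat(zip_path, issue_folder):
    """Extract every file in the archive directly into issue_folder."""
    dest = Path(issue_folder)
    dest.mkdir(parents=True, exist_ok=True)  # create the issue folder if necessary
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Keep only the final path component, dropping any subfolders.
            name = PurePosixPath(info.filename).name
            target = dest / name
            if target.exists():
                raise FileExistsError(f"{name} would be overwritten")
            target.write_bytes(archive.read(info))
            written.append(name)
    return sorted(written)
```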

B. Managing PDF's

1. About PDF's: Each article must be accompanied by a PDF (Portable Document Format) file of the article as it appeared in the print version of the journal. Occasionally you will need to harvest PDF's from an online source, but usually they are provided by the publisher on a disk. You must confirm that each PDF is complete, and verify the pagination by comparing the PDF's to the Table of Contents for the issue. The publisher may divide the PDF of an article into two (or more) PDF documents, especially if the article has complex images and diagrams. Euclid cannot accept more than one PDF per article. You will therefore need to combine the PDF files, using Adobe Acrobat Standard 7.0. (See II. B. 3., below.)

2. Harvesting PDF's from an Online Source: If the PDF's needed for your issue are available via an online source, you can save them to the folder for your issue. With the PDF open, select the "File/Save As" menu option and browse to find the issue folder location to save it into.

3. Combining PDF Files: See Adobe Acrobat Standard 7.0 help topic, "Combining Adobe PDF Documents."

4. Converting Adobe PDF images to a TIFF: See Adobe Acrobat Standard 7.0 help topic, "Converting Adobe PDF Documents to Other File Formats."

 


  

III. Creating XML Documents: (Note: The most comprehensive reference for Euclid XML documents is the "Data Element Dictionary for euclid_issue.dtd version 2.0 (2007-03-29)," written by the Director of ePublishing Technologies.)

A. Processes

1. Processing Data Harvested from an Online Source: For some journals, you must retrieve basic TeX files from the publishers' own site(s) or from MathSciNet. (See "Downloading TeX Files from MathSciNet") Once you save these files as a NotePad document, you then run a PERL script against it to set up a basic XML document populated with the TeX data for each article. The PERL script sets up a skeletal XML document which follows the Euclid DTD. The TeX now populating this new XML document includes author, title and pagination metadata. Your next task is to cut and paste the TeX for the articles' abstracts from the files provided by the publisher into the appropriate place in the XML file. You must also insert the links to the PDFs and the TeX files for each article.

2. Creating a Draft Metadata XML File Using a PERL Script

3. Using a Previous Issue as a Template: Another method of creating a new XML document is to open a previously-created document for the same journal title, "Save As" the issue in hand, and then edit the metadata content. This method has the advantage that the basic header and issue data are already set up - only a few things will need to be edited to match the new issue in hand. You will need to be very careful, however, to delete all the metadata from the previous issue before you start entering the new information. You will also need to be sure that the document uses the correct DTD version, named

Z:\Euclid_DPUBS\dtds\euclid\euclid_issue-m_internal.dtd.

In this case, "Z" is the letter assigned to the mapped drive to Library21 on my computer. You may need to change this in your XML document to another letter, in order to match it to the drive letter on your computer. You can check this by going to "My Computer" and looking at the list of drives on your computer. (See also "How is this XML Document Formatted?" below.)

4. Creating an XML Document Using Oxygen's "File/New" Menu: You may, of course, start the creation of your XML document by opening Oxygen and selecting the "File/New" menu. Oxygen will then ask you what type of document you wish to create. Select "XML document" and click "OK." The "Create an XML Document" window will appear:

[Screenshot: the "Create an XML Document" window in Oxygen]

Select the DTD tab. Browse to find the DTD in the Euclid_DPubs folder on Library21. It is "Euclid_DPubs/dtds/euclid/euclid_issue-m_internal.dtd." The value for "Document Root" should be euclid_issue. Click "OK." Oxygen will create a skeletal XML file for you to work in (see below). This rough document will not contain all the elements necessary for creating a valid XML file for submission. See III. B., Syntax, and III. C., Special Protocols, for more information regarding proper XML syntax for Project Euclid files.

[Screenshot: the skeletal XML file created by Oxygen]

B. Syntax

1. How is this XML Document Formatted? Any XML document created for Project Euclid must be checked against the Euclid DTD, or Document Type Definition, for validation. This DTD is a description of the required syntax for Euclid issues. In order to have XMLSpy check your XML document against the DTD, the DOCTYPE declaration in your XML document must read as follows:

<!DOCTYPE euclid_issue SYSTEM "Z:\Euclid_DPUBS\dtds\euclid\euclid_issue-m_internal.dtd">

If your XML document has been created by running a PERL script or by using a template, the content for this element should already be properly set. Refer to the "Data Element Dictionary for euclid_issue.dtd version 2.0 (2007-03-29)," written by the Director of ePublishing Technologies, for details concerning the proper syntax for Euclid submissions.
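Because the drive letter mapped to Library21 differs from one computer to another, it can be useful to check the DOCTYPE declaration before validating. The following sketch looks for the declaration shown above and reports, or swaps, the drive letter. It is a plain text check only and does not validate the document against the DTD; the function names are illustrative.

```python
import re

# Matches the DOCTYPE declaration shown above, with any single drive letter.
DOCTYPE_PATTERN = re.compile(
    r'<!DOCTYPE\s+euclid_issue\s+SYSTEM\s+'
    r'"(?P<drive>[A-Za-z]):\\Euclid_DPUBS\\dtds\\euclid\\euclid_issue-m_internal\.dtd"\s*>'
)

def find_doctype_drive(xml_text):
    """Return the drive letter used in the DOCTYPE, or None if it is missing."""
    match = DOCTYPE_PATTERN.search(xml_text)
    return match.group("drive") if match else None

def set_doctype_drive(xml_text, drive):
    """Return xml_text with the DOCTYPE drive letter replaced by `drive`."""
    def swap(match):
        old = f'"{match.group("drive")}:'
        return match.group(0).replace(old, f'"{drive}:', 1)
    return DOCTYPE_PATTERN.sub(swap, xml_text, count=1)

sample = r'<!DOCTYPE euclid_issue SYSTEM "Z:\Euclid_DPUBS\dtds\euclid\euclid_issue-m_internal.dtd">'
print(find_doctype_drive(sample))  # Z
```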

2. Sample XML File: (Note: The most comprehensive reference for Euclid XML documents is the "Data Element Dictionary for euclid_issue.dtd version 2.0 (2007-03-29),"  written by the Director of ePublishing Technologies. The sample XML document below is meant to be used as a quick basic reference only.)

Click here to download a sample XML file, without commentary. (pdf)

Click here to download the annotated version of that same sample XML. (pdf)

3. Managing TeX Expressions: TeX is a typesetting system created by Donald Knuth. It is especially useful for rendering the unusual characters needed for mathematical formulae, also called math expressions. A TeX expression should always begin and end with a $ sign. In order to display the abstracts for articles in Euclid, we open the .tex files supplied by the publisher and cut and paste the abstract into the .xml file. You may also copy titles, authors' names, subject heading codes, and keywords from the .tex files. (See "TeX Resources.")

If you find any odd-looking characters or expressions which are not bracketed by $ signs, look at the .pdf for the article in question. Chances are, these odd-looking characters are other typesetting codes used to create the print version of the article, and they should be removed from the material you are pasting into the .xml file. A classic example of this type of spurious code is "\\" - this is a typesetter's code which calls for a line break. Euclid can generate its own line breaks, so we would want to remove these characters. You may see other expressions, in the abstract especially, which begin with "\". These are often macro codes for commonly used expressions, used by typesetters to save keystrokes. The best way to ensure that you are removing all spurious typesetter's code is to actually read the abstract and compare the .tex version to the .pdf version.
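A script can help with a first pass over an abstract, although it does not replace reading the abstract against the .pdf. The sketch below removes "\\" line breaks that fall outside $...$ and lists any backslash commands left outside math for you to review by hand. It assumes math is always delimited by single $ signs, with no "$$" display math and no escaped "\$"; the function name is hypothetical.

```python
import re

def clean_abstract(tex: str):
    # Splitting on "$" puts ordinary text at even indexes and math at odd indexes.
    parts = tex.split("$")
    leftovers = []
    for i in range(0, len(parts), 2):
        # Replace a line-break code (two backslashes) and surrounding space with one space.
        text = re.sub(r"[ \t]*\\\\\s*", " ", parts[i])
        # Collect remaining commands such as \cite or publisher macros for manual review.
        leftovers += re.findall(r"\\[A-Za-z]+", text)
        parts[i] = text
    return "$".join(parts), leftovers

source = r"We study $\alpha$-stable laws \\" + "\n" + r"and \myabbr processes."
cleaned, to_review = clean_abstract(source)
print(cleaned)
print(to_review)
```

Here "\myabbr" stands in for a publisher-defined macro; it is reported rather than removed, because only the .pdf shows what text it should become.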

4. Unicode, or "HEX" Codes: In order to render language diacritics properly on the Euclid website, you will need to use Unicode hexadecimal codes. Charts for all the diacritics you may need are available at the Unicode website. These codes must begin with "&#x" and close with ";" in order to be recognized and rendered properly. If you were to look on the chart for the code for "<", for example, it would simply say "003C." You must therefore enter "&#x" before the "003C" and ";" after it, in the space where the character should appear, in order to correctly formulate the code.
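The same codes can be generated rather than looked up. The sketch below builds the "&#x....;" form for a single character and, as a convenience, converts every non-ASCII character in a string. The function names are illustrative, and the second function assumes accented letters are stored as single precomposed characters.

```python
def hex_reference(char: str) -> str:
    """Return the hexadecimal code for one character, e.g. '<' -> '&#x003C;'."""
    if len(char) != 1:
        raise ValueError("expected a single character")
    # ord() gives the Unicode code point; 04X pads it to at least four hex digits.
    return f"&#x{ord(char):04X};"

def encode_non_ascii(text: str) -> str:
    """Replace every non-ASCII character in text with its hexadecimal code."""
    return "".join(ch if ord(ch) < 128 else hex_reference(ch) for ch in text)

print(hex_reference("<"))            # &#x003C;
print(encode_non_ascii("Poincaré"))  # Poincar&#x00E9;
```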

5. Citations in Abstracts: Sometimes an article's abstract contains citations for works which are in the bibliography for the article. These citations may come in the form of a simple reference to the work's number in the bibliography, e.g. [12]. The citation may have some code around it, which would look strange to the reader if it were left as is, e.g. \cite{Poor} Poor 1994}. Citations may be an integral part of the sentence structure, or they may be a parenthetical "aside." If the citation is an "aside," the removal of which would not alter the meaning of any sentences, you may simply remove it from the abstract. If the citation is "built in" to the sentence, however, alterations will have to be made. One way of determining how to deal with this situation is to start by looking at the .pdf for the abstract. For the second example above, the citation is an important part of the sentence and should not be removed. The abstract shows this citation in this form: [Poor 1994]. This is perfectly acceptable. In the case of the first example above, where only a reference number is given in the .tex file, the best option is to check the article's bibliography and find the actual citation. In place of the [12], one would insert the title of the article in italics, with the name(s) of the author(s) listed in parentheses afterwards.

C. Special Protocols

1. Special Kinds of Euclid Issues. This document contains sections on the following topics:

a. Supplemental Issues

b. Issues with  Special Titles

c. Index Issues

2. Special Types of Records in Euclid Journal Issues. This document contains sections on the following topics:

a. Miscellaneous Frontmatter

b. Table of Contents

c. Editorial Staff Lists

d. Editorial Statements

e. Prefaces and Dedications

f. Obituaries

g. Articles in Foreign Languages

h. Related Items

i. Reviewed Items

j. Indices

k. Miscellaneous Backmatter 

3. Special Kinds of Elements within Records. This document contains sections on the following topics:

a. Records with No Page Numbers

b. Records with Roman Numeral Page Numbers

c. Non-alphabetical Author Order

d. Use of the <div> Tag


  

IV. Submission and Follow-Up

A. Creating a Zip File

B. Submitting a File for Project Euclid

C. Entering Data in the Euclid Production Chart

D. Making Submission Packages Available to TMJ Staff:  After completing any issue of Tohoku Math. Journal (TMJ), tell the Metadata Services webmaster that it is completed. The MS webmaster will need to put a copy of the .zip file up in a special website for the people who publish TMJ so that they can download it. This is done for TMJ ONLY. Once the .zip file is available online, the webmaster should notify the Metadata Services Project Euclid Production Manager so that they can in turn email TMJ and tell them the file is available.




 

V. Resources

A. TeX Resources

1. Charts

a. Basic LaTeX symbols

b. LaTeX symbols prepared by L. Kocbach

c. Fonts and Symbols from tex.loria.fr

2. Tutorials

a. Beginning LaTeX - Cornell Computer Science's tutorial

b. An Introduction to Using TeX in the Harvard Math Department

c. Getting Started with LaTeX by David R. Wilkins

B. Unicode Resources

C. Software Information

1. Oxygen

2. XMLSpy

3. MathPlayer (download)

4. XMLNotePad (download)

D. Who's Who in Project Euclid

1. Euclid Partners

2. Publishers' Sites for "Current Issue" Euclid Journals whose metadata is created by Metadata Services

a. Annals of Mathematics (ANNM, published by Princeton University, Department of Mathematics)

b. Asian Journal of Mathematics (AJM, published by International Press)

c. Advances in Theoretical and Mathematical Physics (ATMP, published by International Press)

d. Communications in Analysis and Geometry (CAG, published by International Press)

e. Communications in Information and Systems (CIS, published by International Press)

f. Communications in Mathematical Sciences (CMS, published by International Press)

g. Experimental Mathematics (EM, published by A.K. Peters)

h. Hiroshima Mathematical Journal (HMJ, published by Hiroshima University, Department of Mathematics)

i. Homology, Homotopy and Applications (HHA, published by International Press)

j. Internet Mathematics (IM, published by A.K. Peters)

k. Journal of Differential Geometry (JDG, published by International Press)

l. Journal of Symplectic Geometry (JSG, published by International Press)

m. Lecture Notes - Monograph Series (LNMS, a monographic series published by the Institute of Mathematical Statistics)

n. Methods and Applications of Analysis (MAA, published by International Press)

o. Publications of the Research Institute for Mathematical Sciences (PRIMS, published by the Research Institute for Mathematical Sciences, Kyoto University)

p. Tohoku Mathematical Journal (TMJ, published by the Mathematical Institute, Tohoku University)

3. Publishers of "Backfile" Project Euclid Journals - See ePublishing Technologies wiki section on backfiles.

4. Examples of DPubs Journals

a. Medieval Philosophy and Theology (MPAT)

b. Cornell Real Estate Review (CRER)

c. Indonesia (INDO)