Harvey et al. J Cheminform (2015) 7:43 DOI 10.1186/s13321-015-0093-3
Standards-based curation of a decade-old digital repository dataset of molecular information
Matthew J Harvey1, Nicholas J Mason2, Andrew McLean1, Peter Murray-Rust3, Henry S Rzepa2* (http://orcid.org/0000-0002-8635-8390) and James J P Stewart4
Background
Research data repositories based on platforms such as DSpace [1] were introduced about 10 years ago, and their use in domains such as chemistry and molecular sciences has gradually increased [2, 3]. Their importance has recently come to the fore, with funding agencies in the USA, Europe and Asia all indicating that open deposition of research data will become a mandatory aspect of their funding, and many universities are now starting to consider the implications of implementing research data management, or RDM [4–6]. An early example of such RDM is a project to produce a library of quantum-mechanically-optimised molecular coordinates derived from a computable subset of the National Cancer Institute's (NCI) collection of small molecules [7]. The information for each molecule was originally annotated by optimising the coordinates with respect to the energy obtained using the semi-empirical PM5 parameter set in MOPAC [8] (then the most current parameter set) and creating a DSpace collection. At the commencement of the present project, the original deposition of this information for 175,356 molecules into the institutional repository of the University of Cambridge [9] represented the only openly accessible copy.
An issue frequently raised in the context of research data management relates to the prospects of being able to access and use such digitally held information in the future. Relatively recently, such questions were largely directed towards the expected longevity of physical media such as punched cards and floppy disks (both now effectively extinct), hard drives, CD-ROMs, DVDs, magnetic tape etc. Few of these media have proven lifetimes
*Correspondence: [email protected]
2 Department of Chemistry, Imperial College London, South Kensington Campus, London SW7 2AZ, UK. Full list of author information is available at the end of the article
© 2015 Harvey et al. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
exceeding 20 years and the real problem would be locating working devices capable of reading such physical media in the future. Quite different problems are associated with virtual collections, where the physical medium is less important than the information associated with the data itself. In this context, it is becoming increasingly accepted that successful long-term preservation of digital data depends upon repeated incremental improvements or curations taking place in 5–10 year cycles. Such operations can in principle be repeated indefinitely, thus creating a long-term mechanism with an anticipated lifetime of 100+ years if required. These curation cycles can track the evolution of data storage hardware, data formats and the introduction of new software, so ensuring that the data remain accessible and in a usable form. The purpose of this project was to explore the viability of the long-term preservation of the 10-year-old Cambridge dataset through such an incremental curation by performing its migration to the SPECTRa repository hosted at Imperial College London [2]. Specific benefits of undertaking such a curation include re-filtering the original source data for errors not previously eliminated, producing an enhanced metadata record for each entry, and recomputing the optimised molecular coordinates using the newer PM7 method. The original PM5 method used to obtain the molecular geometries was never formally published and is now unavailable, whereas the succeeding PM7 method has been formally peer reviewed and published [10].
We will also compare our approach with two other examples drawn from computational chemistry. The first [11] is typical of how almost all datasets derived from molecular computations are currently curated; in this case the stochastic generation of all possible stable molecular structures from an initial set of specified atoms. The trend in scientific publication in recent years has required authors reporting such studies to include more extensive data in the form of supporting information (SI) to accompany the scientific narrative from which their models are constructed and their conclusions drawn. We will argue here that these SI-based mechanisms for depositing, retrieving and re-using the data components of journal articles are no longer fit for this purpose (if indeed they ever were) and should be urgently replaced by repositories of data and closely-coupled metadata as a fundamentally different model for research data management. The second example [12] describes such a deposition of a dataset containing the quantum mechanically computed structures and properties of 134,000 molecules into the Figshare digital repository. We will ask here what the attributes of such a deposition must be in order to enable efficient formal re-curation 10 years after the original creation of the dataset, arguing that there are some essential structures and standards that must be fulfilled for such a process to be properly enabled.
Methods
The migration of the original dataset was performed in three sequential phases: retrieval from the original repository, technical validation, and re-deposition into the SPECTRa repository.
Retrieval
Both the Cambridge and Imperial-SPECTRa repositories are implemented using DSpace [1]. Although this software contains a component that can provide structured data representations of entries for harvesting (OAI-ORE [13] resource maps), this was not enabled on the Cambridge repository when we started our migration in July 2014. However, since the human-readable landing pages for each entry all conform to a structured HTML template, it nevertheless proved possible to extract all the data using ad hoc scripting and HTML processing (a process often informally referred to as web-scraping). This process was markedly inefficient, requiring three separate HTTP requests to the server per record, and took several days to complete. This approach is by no means unique; most large existing collections of (chemical) data require similar processes whereby a human has to initially read the documentation (if available) for the templates used to access the items and then to write appropriate custom codes or scripts to retrieve them. Any unexpected change in the template resulting from, for example, the release of a new version of the dataset therefore carries the risk of breaking these scripts. Stated more formally, the inferred uniform resource locators (URLs) for such collections of data are not persistent. The principal aim of our curation was therefore to eliminate the need for such ad hoc scripting and replace it with a more efficient and standards-based workflow for achieving this persistence.
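The fragility of template-based harvesting can be illustrated with a minimal sketch. The landing-page fragment, the handle 1810/0000 and the class name BitstreamLinkParser are all invented for illustration (the actual Cambridge template is not reproduced here), and the assumption that data files sit under a /bitstream/ path is ours; only the Python standard library is used.

```python
from html.parser import HTMLParser


class BitstreamLinkParser(HTMLParser):
    """Collect href values of anchors that point at '/bitstream/' paths."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: data files are exposed under a '/bitstream/' path.
        # If the template changes, this test silently stops matching.
        if href and "/bitstream/" in href:
            self.links.append(href)


def extract_file_links(landing_page_html):
    """Return the data-file links found in one landing page."""
    parser = BitstreamLinkParser()
    parser.feed(landing_page_html)
    return parser.links


# Invented landing-page fragment, for illustration only.
SAMPLE_PAGE = """
<html><body>
  <a href="/handle/1810/0000">Item</a>
  <a href="/bitstream/1810/0000/1/nsc138467_original.cml">original</a>
  <a href="/bitstream/1810/0000/2/nsc138467_post-mopac.cml">PM5</a>
</body></html>
"""

print(extract_file_links(SAMPLE_PAGE))
```

The point of the sketch is the hidden dependency: the script encodes knowledge of the page layout that is documented nowhere except in the script itself.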
The following were retrieved from the original deposition [9] at Cambridge:
The source URL for 175,356 records.
175,356 documents in XML-CML syntax encoded using chemical mark-up language (CML) [14], containing a molecular structure from the NCI database and some metadata describing the entry.
158,879 XML-CML documents containing the PM5 optimized coordinates of the NCI database structure and basic metadata, including the NCI identifier for version 3 of the NCI Open Database and the computed InChI and InChIKey [15]. Of these, 158,122 were found to be unique. The remaining 16,477 entries had no reported PM5 calculation. These entries were previously identified [7] as having additional complexities such as the presence of metal atoms or problems with correctly adding hydrogen atoms and charges, and so a PM5 calculation had not been attempted. Here we have adopted the same strategy of not recovering these entries in the present curation.
Technical validation
No metadata were provided in the original depositions that gave an unambiguous description of the two XML-CML documents present in the form of a CML schema declaration, and no MOPAC version information or MOPAC input or output files were saved to act as alternative sources of this information. The CML syntax corresponding to the annotation derived from PM5 optimisation, in the form of files named e.g. nsc138467_post-mopac.cml in the original collection, was incomplete; bond connection terms were missing and the CML documents failed validation according to the CML Schema version 2.4 [16]. The first task was therefore to develop a protocol to produce a reliable and valid input file suitable for re-calculating the properties using the newer PM7 method [10]. Many entries in the NCI collection comprise two or more disconnected components, of which only the larger component was retained in the original editing [7]. The resulting missing component in the starting structure was predominantly a counter ion and its removal requires a charge to be assigned to the remaining fragment. This information was originally captured in both original XML-CML documents, the first as part of an identifier element containing an early form of the InChI string:
<identifier version="0.932Beta" tautomeric="0"> <basic>C13H21N2O,1H312H(2H3)15(13H(3H3)4H3)11(16)107H6H8H14(5H3)9H10</basic> <charge>+1</charge> </identifier>
The second is declared more formally in the CML molecule element associated with the PM5 calculation:
<molecule id="NSC138467" formalCharge="1" name="mol1">
Of the 158,122 unique documents in the latter category, the formalCharge declarations were distributed as follows: 153,127 (0), 28 (-1), 4,456 (+1), 18 (-2), 483 (+2), 2 (-3), 3 (+3), 1 (+4), 4 (-5). Manual inspection of the species with very large formal charges (>|3|) indicates these are all errors arising from the original curation process because of incorrect interpretations of e.g. metal centres. Our original attempt to transform this information into a MOPAC input involved the standard OpenBabel [17] program, version 2.3.2. It transpired that OpenBabel did not correctly propagate the charge information in either of the original CML files by transformation into an appropriate MOPAC keyword declaration such as CHARGE=1. Instead the generic statement PUT KEYWORDS HERE was the only content of the MOPAC keyword line. This raises some interesting issues:
1. Absolute fidelity in any syntactic transformation of data from one format to another is very difficult to achieve. There are often multiple syntaxes for any given information field, such as the two shown above for expressing the charge on a molecule, and all such variations must be honoured with complete fidelity to achieve reliability. Although some forms can be quickly deprecated (such as the first example above), these forms cannot be ignored and they must be processed.
2. The MOPAC program does not mandate the presence of all keywords. A calculation may still succeed on the assumption that a missing keyword simply defaults to a pre-determined value. In this case, MOPAC will assume that the value of an undeclared CHARGE keyword corresponds to zero, which is a clear error if the charge was intended to be non-zero. This issue of implicit semantics is perhaps the single largest problem in ensuring validation. It can be very difficult, if not impossible, to find complete definitions of what implicit assumptions are made in any system. Often the only source of these is the actual computer code itself.
3. A further implicit rule for MOPAC keywords is that the spin-multiplicity of the system is computed from the total electron count after the appropriate charge is applied. For a system where a charge of e.g. +1 is left undeclared, that will result in a molecule with an odd number of electrons, and this is then treated implicitly as a molecule with a DOUBLET spin state. We also note that these implicit rules are not universal; other programs such as Gaussian use different conventions.
4. If the explicit keyword SINGLET (spin state) is declared, a safe assumption for virtually all real molecules that exist as physical samples, this can act as a checksum. The MOPAC program will then throw an error and the calculation will not proceed if this spin state conflicts with any declared or undeclared/implicit charge.
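The parity argument in points 3 and 4 can be checked with a few lines of arithmetic. The sketch below is ours, not part of MOPAC: the function names and the small table of atomic numbers are illustrative, and the rule encoded (even electron count implies SINGLET, odd implies DOUBLET) is the simplified default described above. It uses the formula C13H21N2O from the identifier element shown earlier, whose declared charge is +1.

```python
import re

# Atomic numbers for a few common elements (illustrative subset only).
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8, "F": 9,
                  "P": 15, "S": 16, "Cl": 17, "Br": 35, "I": 53}


def electron_count(formula, charge=0):
    """Total electrons for a simple formula such as 'C13H21N2O'."""
    total = 0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_NUMBERS[symbol] * (int(count) if count else 1)
    # A positive charge removes electrons, a negative charge adds them.
    return total - charge


def implied_spin_state(formula, charge=0):
    """Lowest spin state implied by the parity of the electron count."""
    return "SINGLET" if electron_count(formula, charge) % 2 == 0 else "DOUBLET"


# With the charge declared, the cation is closed shell.
print(implied_spin_state("C13H21N2O", charge=1))
# With the charge left undeclared (defaulting to zero), parity flips.
print(implied_spin_state("C13H21N2O"))
```

Because the two calls disagree, an explicit SINGLET keyword combined with a forgotten charge is self-contradictory, which is what allows it to act as a checksum.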
Instead of using OpenBabel, we wrote a custom conversion of the original post-MOPAC PM5 CML files to ensure the correct keywords were written to the MOPAC input file. The atom positions were expressed in internal coordinates rather than Cartesian coordinates. This is not a critical decision, since the final atom positions do not in general depend on the initial coordinate system selected. A PM7 geometry optimisation was then performed using the resources of the Imperial College High Performance Computing Service. The majority of calculations completed within tens of seconds and the total required approximately 20 CPU days of computer time.
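The keyword-writing step of such a conversion might look like the following sketch. It is not the tool used for the curation, which is not reproduced in this article; the function name mopac_keyword_line is hypothetical, and the decision to raise an error when formalCharge is absent (rather than default to zero) is our illustration of avoiding implicit semantics. PM7, CHARGE=n and SINGLET are MOPAC keywords.

```python
import xml.etree.ElementTree as ET


def mopac_keyword_line(cml_text, method="PM7"):
    """Build a MOPAC keyword line from the formalCharge of a CML molecule."""
    root = ET.fromstring(cml_text)
    molecule = None
    for element in root.iter():
        # Compare local names so that documents with or without a
        # CML namespace declaration are both handled.
        if element.tag.rsplit("}", 1)[-1] == "molecule":
            molecule = element
            break
    if molecule is None:
        raise ValueError("no molecule element found")
    charge_text = molecule.get("formalCharge")
    if charge_text is None:
        # Refuse to fall back on an implicit charge of zero.
        raise ValueError("formalCharge is not declared")
    charge = int(charge_text)
    # SINGLET acts as a checksum against a wrongly propagated charge.
    return f"{method} CHARGE={charge} SINGLET"


SAMPLE_CML = '<molecule id="NSC138467" formalCharge="1" name="mol1"/>'
print(mopac_keyword_line(SAMPLE_CML))
```

A production converter would also need to write the geometry section and handle the older identifier/charge syntax; only the keyword line is sketched here.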
InChI identifiers
An InChI identifier [15] is a canonicalization based on the atom connectivity of a molecule, which in turn is derived from Cartesian coordinates for each atom using simple heuristic rules specifying a range of atom pair distances for any element combination. These distance ranges are built into OpenBabel [17]. Unfortunately, atom connection distances are not formally defined as accepted standards, and the precise values are ultimately the choice of the designers of any program implementing them. The limits however are usually sufficiently tolerant to cover the vast majority of real molecules without any disagreement, and this would especially be true of the NCI set, which covers real systems rather than hypothetical or computed molecules. This does not entirely exclude there being a very small number of molecules where specific atom-pair distances might fall within e.g. a bond range using PM5-optimised coordinates but which are e.g. outside such a range using PM7 values. We note that whilst it is possible to replace these relatively arbitrary rules by using a quantum mechanically derived property of the electron density topology called the BCP (bond critical point) to define an atom-pair connectivity [18], this is not currently used for determining InChI identifiers.
We proceeded to derive InChI identifiers using the following OpenBabel [17] commands:
1. babel -i cml nsc383508_original.cml -o xyz out.xyz --canonical (for the original NCI-based data)
2. babel -i cml nsc383508_post-mopac.cml -o xyz out.xyz --canonical (for the PM5-computed data [7])
3. babel -i mopout MOPACPM7.out -o xyz out.xyz --canonical (for the newly generated PM7-computed data)
4. babel -i xyz in.xyz -o inchi out.inchi
Commands 1–3 convert all the data into Cartesian coordinates to remove any possible atom connection data that might have been generated by MOPAC or other sources. Command 4 generates a canonical InChI identifier [15] using these coordinates. This process ensures that the connectivities created using the last command and then used to create the InChI are normalised against a single connection algorithm (being the one contained in OpenBabel, version 2.3.2). These InChI strings are then compared with those derived in a similar manner using the original NCI and the original PM5 computed coordinates (Table 1).
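The three-way comparison reduces to simple string equality once the InChI strings have been generated. The sketch below is our illustration of that bookkeeping; the function names are hypothetical and the placeholder strings stand in for real InChI values. The category labels follow the five column headings of Table 1.

```python
from collections import Counter


def classify_agreement(nci, pm5, pm7):
    """Assign one (NCI, PM5, PM7) InChI triple to a Table 1 category."""
    if nci == pm5 == pm7:
        return "NCI=PM5=PM7"
    if pm5 == pm7:
        return "(PM5=PM7)≠NCI"
    if pm7 == nci:
        return "(PM7=NCI)≠PM5"
    if pm5 == nci:
        return "(PM5=NCI)≠PM7"
    return "PM7≠PM5≠NCI"


def tally(triples):
    """Count how many triples fall into each category."""
    return Counter(classify_agreement(*triple) for triple in triples)


# Placeholder strings standing in for InChI identifiers.
example = [("A", "A", "A"), ("B", "A", "A"), ("A", "B", "C")]
print(tally(example))
```

The categories are mutually exclusive, so the five counts must sum to the number of entries compared.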
Of the 158,122 unique values (Table 1), 97.7% matched for all three instances, which provides a good measure of confidence that the atom-connection algorithm is robust. To identify the origin of the 2.3% of InChI mismatches, we have to dissect the InChI identifier itself into its component layers:
1. The molecular formula layer (1131).
2. The pairwise atom connectivity layer, determined as described above (127).
3. The hydrogen layer, in which hydrogen atoms are added to all heavy atoms where a valence is perceived to be unsatisfied if the hydrogens are not already declared. Because we have subjected all the systems to computational quantum modelling, all hydrogen atoms are already explicitly defined in our coordinates (1252).
4. A charge layer, also defined for all the molecules in our collection (9).
5. A stereochemical layer. Because our coordinates are all specified in 3D space, the stereochemistry is always defined. This layer includes double-bond isomerism (292) and tetrahedral configurations (267).
6. An isotope layer (22).
The distribution of the 2,997 differences between the PM7 and the NCI InChI identifiers (2,041 + 470 + 486, Table 1) is shown in parentheses in the listing above, and each is very briefly discussed below:
1. The discrepancies in the formula layer originate from mismatches in the hydrogen count. This is because, historically, molecules were not always defined with explicit coordinates for all hydrogen atoms. Instead they were inferred from residual valences, these in turn inferred from bonding angles and other geometric and heuristic information. The process of replacing such implicit hydrogens with explicit ones is not always exact.
2. The connection layer mismatch originates from bonds that are on the verge of connection and derives from (possibly small) geometric changes from the quantum mechanical re-optimisation. A typical example of such uncertainty is the putative S–S bond in sulfur species [19].
3. The hydrogen layer and the formula layer together account for the great majority of the mismatches.
4. The small number of mismatches in the charge layer may result from the InChI code heuristic for deciding the appropriate charge for a molecule. As noted above, we detected some unreasonably high charges resulting from this process.
5. Because traditionally molecules were expressed in the MolFile V2 format, which allows just 2D coordinates to be defined, stereochemistry had to be added using an additional parameter associated with each bond connection and equivalent to the stereochemical wedge notations used in organic chemistry. This information is not free of ambiguity, since the stereochemistry is defined relative to other atoms and can lead to logical contradictions. When such two-dimensional coordinates and this additional information are converted into 3D coordinates (a process carried out during the original deposition [7]), ambiguities can result.
6. Isotopes were not included in the MOPAC-PM7 calculation.
Table 1 Comparison of generated InChI identifiers
NCI=PM5=PM7 | (PM5=PM7)≠NCI | (PM7=NCI)≠PM5 | (PM5=NCI)≠PM7 | PM7≠PM5≠NCI
154,552 | 2,041 | 573 | 470 | 486
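Locating the layer in which two InChI strings diverge can be sketched as follows. This is a simplified illustration of ours rather than the analysis code used for the curation: it relies on the standard single-letter layer prefixes (/c, /h, /q, /p, /b, /t, /m, /s, /i), keeps only the first occurrence of each prefix, and ignores the subtleties of repeated sub-layers. The two example strings are the standard InChIs of ethanol and dimethyl ether, chosen because they share a formula but differ in connectivity.

```python
# Standard InChI layer prefixes mapped to descriptive names.
LAYER_NAMES = {
    "c": "connectivity", "h": "hydrogen", "q": "charge",
    "p": "protonation", "b": "double-bond stereo",
    "t": "tetrahedral stereo", "m": "stereo parity",
    "s": "stereo type", "i": "isotope",
}


def split_layers(inchi):
    """Split an InChI string into a dict of named layers (simplified)."""
    parts = inchi.split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if not part:
            continue
        name = LAYER_NAMES.get(part[0], part[0])
        # Keep the first occurrence only; later repeats are ignored here.
        layers.setdefault(name, part[1:])
    return layers


def differing_layers(inchi_a, inchi_b):
    """Return the sorted names of layers that differ between two InChIs."""
    a, b = split_layers(inchi_a), split_layers(inchi_b)
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
DIMETHYL_ETHER = "InChI=1S/C2H6O/c1-3-2/h1-2H3"
print(differing_layers(ETHANOL, DIMETHYL_ETHER))
```

A single pair of identifiers can differ in more than one layer, as this example does, so per-layer counts need not sum to the number of mismatched entries.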
Redeposition
For each remaining entry, the PM7-derived InChI strings and keys were added to the SMILES strings and the NCI and CAS accession identifiers obtained from the original data and propagated as metadata. We note that the NCI identifiers themselves may not necessarily persist across different versions of the NCI database, which was version 3 at the time of the original curation and has subsequently been updated to version 4 in 2012 [20].
Prior to import to SPECTRa, each entry was packaged individually to produce an archive file, termed a SWORD [21, 22] bundle. SWORD (Simple Web-service Offering Repository Deposit) is an interoperability standard for data ingest into digital repositories, rendering these bundles suitable for import into any SWORD-compliant repository, not just DSpace-based SPECTRa. The bundles contain a METS manifest [23] and data files and were created using a locally written tool.
The METS manifest contained the following metadata:
InChI, InChIKey and SMILES strings.
CAS and NCI accession IDs, NCI entry name.
Back-link to the entry in the Cambridge repository.
DOI link to the published description [7].
ORCID [24] identifiers for the contributing authors.
Link to Creative Commons License terms.
The data files included within the bundle were:
Two CML files [14] containing unaltered copies of the NCI coordinates [20] and PM5 computed MOPAC output documents obtained from the original source repository.
A third CML file conflating the three structures: the original NCI structure, the original PM5 structure from the original repository and the newly computed PM7 structure.
MOPAC input and output files for the new PM7 calculation.
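The general shape of such a bundle, a compressed archive holding a METS manifest alongside the data files, can be sketched as below. This is not the locally written tool referred to above. The function name build_bundle is hypothetical, the manifest name mets.xml is an assumption, and the manifest contains only a skeletal fileSec; a manifest acceptable to a real repository would also need descriptive metadata and structural sections.

```python
import hashlib
import io
import xml.etree.ElementTree as ET
import zipfile

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_bundle(files):
    """Zip data files together with a skeletal METS manifest.

    `files` maps a file name to a (MIME type, content bytes) pair.
    """
    mets = ET.Element(f"{{{METS_NS}}}mets")
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp")
    for index, name in enumerate(sorted(files), start=1):
        mime, data = files[name]
        entry = ET.SubElement(file_grp, f"{{{METS_NS}}}file", {
            "ID": f"file_{index}",
            "MIMETYPE": mime,
            "SIZE": str(len(data)),
            "CHECKSUMTYPE": "MD5",
            "CHECKSUM": hashlib.md5(data).hexdigest(),
        })
        ET.SubElement(entry, f"{{{METS_NS}}}FLocat", {
            "LOCTYPE": "URL",
            f"{{{XLINK_NS}}}href": name,
        })
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Assumed manifest name; check the target repository's profile.
        archive.writestr("mets.xml", ET.tostring(mets, encoding="utf-8"))
        for name, (_, data) in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


bundle = build_bundle({"PM7.xml": ("chemical/x-cml", b"<cml/>")})
print(len(bundle) > 0)
```

Recording a checksum and MIME type per file at packaging time is what later allows a harvester to verify and select files without inspecting their contents.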
Import of this fileset to the destination SPECTRa repository was performed using the SWORD web service interface. Owing to a limitation of the DSpace-SPECTRa SWORD interface, no bulk-import function was available and all of the new packages had to be uploaded individually, a process that took approximately 60 days. Doubtless this exceptionally long time resulted from some undiagnosed server misconfiguration and should not be considered a representative characteristic.
Exposing the metadata structures on DSpace-SPECTRa
The outcome of the curation process resides in a new collection on the SPECTRa repository comprising 158,122 entries. The new curation has two persistent identifiers for the collection itself [25] and, within that collection, individual molecular entries are themselves also assigned two persistent identifiers, as for example the entry shown in Figs. 1 and 2 [26, 27]. The first of these is the handle with a registered prefix 10042 associated with the SPECTRa DSpace server. The second is the DataCite DOI associated with the prefix 10.14469 as registered to Imperial College, with individual entries prefixed with the common string ch/ to indicate the chemistry department at that institution. The individual items in the collection also have a full set of associated metadata descriptors (Fig. 1).
Newly introduced metadata since the creation of the original collection include the following:
The contributors are listed individually, with each name linked to their corresponding ORCID [24] entry page.
The computational resource used for annotation is also linked with a non-persistent identifier; currently to the Web landing page for the organisation.
Several chemical identifiers are included such as SMILES, InChI and the CAS accession number. The significance of including such metadata is that it is registered automatically with DataCite.org [28], and hence available for fielded searches [29].
The ORCID entries [24] for all collaborators are explicitly listed, and again become available for searching [29].
A back-link to the original item deposition [9] allows comparison of the original and the newly curated entry. Because this handle prefix (1810) is unregistered with CNRI, the central authority for Handle registration, it cannot be treated formally for resolution as a persistent identifier. This is one of the aspects we wished to rectify in the current curation.
There is also a persistent identifier link to the journal article [7] describing the original work. In due time, the present article could itself be so-referenced in a future curation.
A pair of new persistent identifiers for each molecule has been minted as part of the curation. The first is a handle assigned using the Handle manager tool in DSpace itself and which can be resolved using either of the services http://hdl.handle.net/ or http://doi.org/. This handle is internally annotated with 10320/loc records [30] to enable automated retrieval of individually requested files from the deposition. The prefix 10042 is registered to the SPECTRa server.
The second persistent identifier [27] is assigned using the DataCite API [28], and serves as a mechanism to allow DataCite to acquire the metadata for this entry. The prefix 10.14469 and the suffix ch are as described above.
A (non-persistent) link to the original publisher is included.
A (non-persistent) link to the open license for the data, in this instance Creative Commons Attribution (CC0) [31]. It is perhaps surprising that this license is itself not identified by its own persistent identifier, but the URIs for the CC licenses and the corresponding resources are however machine-processable.
This metadata describes the contents of the data les resident for each entry (Fig.2).
The fileset for the deposition comprises two so-called bundles. The first item is identified internally as the SWORD bundle. This compressed archive contains the METS manifest [23] for the deposition, expressed syntactically as an XML file containing a number of declared namespaces defining various metadata schemas. The METS manifest, along with another internal XML document, the OAI-ORE resource map [13], defines the contents, locations and properties of the documents comprising the collection.
The second item (Fig. 2) includes three files expressed syntactically as XML documents declaring the CML schema [16]. One can find a semantically rich encoding of the molecular information within each file. Also included in this fileset are three files relating to the MOPAC program: the input file, the corresponding PM7 output and the PM7 archive file summarising the computed properties. In principle, all the information in these files could also be absorbed into the CML descriptors, although this has not been done in the present instance. These files in turn have associated MIME types [32], information that allows automated retrieval of the files using one of the mechanisms briefly described below.
Metadata interfaces to DataCite
In curating the original collection at Cambridge by relocating it to a separate DSpace server, we wished to ensure that new persistent identifiers for each entry could be minted using DataCite. That in turn required the metadata held on the DSpace-SPECTRa repository, which follow the Dublin Core Schema, to be mapped onto the DataCite Schema using an XSLT-based crosswalk transform. The following procedure was used to achieve this.
A recent release of DSpace (DSpace 4) largely automates the minting of DOIs using DataCite. Our target DSpace (SPECTRa) is running version 1.8; the DOI module for DSpace 4 is confined to a few distinct packages that were implemented into version 1.8 without affecting the other components. The following Java packages were extracted from DSpace 4 and used within DSpace 1.8:
org.dspace.identifier.doi, org.dspace.services, org.dspace.versioning, org.apache.httpcomponents.httpclient4.2.jar, org.apache.httpcomponents.httpcore4.3.1.jar
DOI-specific properties in the existing install were configured via dspace.cfg. An auxiliary configuration file spring-dspace-addon-identifier-services.xml is packaged within the org.dspace.identifier package and used for connection details.
The XML schema transformation that translates, or crosswalks, between the DSpace Dublin Core metadata schema and the DataCite metadata schema was then configured. DSpace 4 delivered the requisite crosswalk, DIM2Datacite.xsl, for version 2 of the DataCite schema.
A requirement was to provide metadata that described the locations, filenames and file types of the individual data files associated with each DOI, in order to provide a machine discoverable and operable path from the DOI directly to the files containing chemical data. To achieve this, the DSpace 4 XML schema transformation (crosswalk) was extended to include the locations of the METS and OAI-ORE metadata files that are generated by DSpace, as relatedIdentifiers. These related identifiers used the HasMetadata relation type which was introduced in version 3.0 of the DataCite Schema:
<relatedIdentifier relatedIdentifierType="URL" relationType="HasMetadata" relatedMetadataScheme="METS" schemeURI="http://www.loc.gov/METS/">
https://spectradspace.lib.imperial.ac.uk:8443/metadata/handle/10042/159060/mets.xml
</relatedIdentifier>
Both the METS and OAI-ORE files contain the desired metadata and can be processed as required. As an example, the fileSec section of the METS is shown in part below:
<mets:file CHECKSUMTYPE="MD5" GROUPID="group_file_1367638" ID="file_1367638" MIMETYPE="chemical/x-cml" SIZE="18955" CHECKSUM="88761c87f8f090182d910f33a7467435">
<mets:FLocat LOCTYPE="URL" xlink:title="PM7.xml" xlink:type="locator" xlink:href="/bitstream/handle/10042/159060/PM7.xml?sequence=3"/>
</mets:file>
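A consumer of this metadata could recover the file list from such a fileSec along the following lines. This is a minimal sketch of ours using the Python standard library; the function name list_files is hypothetical, and the sample document wraps the single file entry quoted above in an otherwise empty METS root so that it parses on its own.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}


def list_files(mets_xml):
    """Return MIME type, size, checksum and location for each METS file."""
    root = ET.fromstring(mets_xml)
    records = []
    for file_el in root.iter(f"{{{NS['mets']}}}file"):
        locat = file_el.find("mets:FLocat", NS)
        href = None
        if locat is not None:
            href = locat.get(f"{{{NS['xlink']}}}href")
        records.append({
            "mime": file_el.get("MIMETYPE"),
            "size": int(file_el.get("SIZE")),
            "md5": file_el.get("CHECKSUM"),
            "href": href,
        })
    return records


# The file entry quoted above, wrapped in a minimal METS root.
SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec><mets:fileGrp>
    <mets:file CHECKSUMTYPE="MD5" GROUPID="group_file_1367638"
               ID="file_1367638" MIMETYPE="chemical/x-cml" SIZE="18955"
               CHECKSUM="88761c87f8f090182d910f33a7467435">
      <mets:FLocat LOCTYPE="URL" xlink:title="PM7.xml" xlink:type="locator"
                   xlink:href="/bitstream/handle/10042/159060/PM7.xml?sequence=3"/>
    </mets:file>
  </mets:fileGrp></mets:fileSec>
</mets:mets>
"""

for record in list_files(SAMPLE_METS):
    print(record["mime"], record["href"])
```

Filtering the returned records on the MIME type is then enough to select the CML file without any knowledge of the repository's page layout.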
The crosswalk was also extended to add metadata for ORCID as name identifiers for the contributors and various chemical identifiers (InChI, InChIKey, CAS, NCI and SMILES) as a set of alternate identifiers.
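The crosswalk itself is an XSLT transform and is not reproduced here; the Python sketch below only illustrates the shape of the output it is described as producing. The element and attribute names (alternateIdentifier, alternateIdentifierType, nameIdentifier, nameIdentifierScheme, schemeURI) are those of the DataCite Schema, but the function names are hypothetical, the DataCite namespace is omitted for brevity, and the type labels used for the chemical identifiers are our assumption. The InChIKey and ORCID values are those quoted elsewhere in this article.

```python
import xml.etree.ElementTree as ET


def alternate_identifiers(identifiers):
    """Build an alternateIdentifiers block from a {type: value} mapping."""
    container = ET.Element("alternateIdentifiers")
    for id_type, value in identifiers.items():
        child = ET.SubElement(
            container, "alternateIdentifier",
            {"alternateIdentifierType": id_type},
        )
        child.text = value
    return container


def orcid_name_identifier(orcid):
    """Build a nameIdentifier element carrying a contributor's ORCID."""
    element = ET.Element("nameIdentifier", {
        "nameIdentifierScheme": "ORCID",
        "schemeURI": "http://orcid.org/",
    })
    element.text = orcid
    return element


block = alternate_identifiers({"InChIKey": "LQPOSWKBQVCBKS-PGMHMLKASA-N"})
print(ET.tostring(block, encoding="unicode"))
print(ET.tostring(orcid_name_identifier("0000-0002-8635-8390"),
                  encoding="unicode"))
```

Typing each identifier explicitly is what makes fielded searches on, for example, the InChIKey alone possible downstream.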
In addition, the PM7.xml file containing the newly computed structures and properties was registered against its chemical MIME type [32], chemical/x-cml, for each DOI, using the DataCite Media API. The DataCite content resolver [28] then allows this CML file to be directly retrieved from the associated DOI through content negotiation using the resource http://www.crosscite.org/cn/ or directly by URL.
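Content negotiation of this kind amounts to an ordinary HTTP request whose Accept header names the registered media type. The sketch below only constructs such a request and does not send it. The function name cml_request is hypothetical, the DOI suffix shown is a placeholder rather than a real entry, and the default resolver address is an assumption; the appropriate endpoint should be taken from the resolver documentation.

```python
import urllib.request


def cml_request(doi, resolver="https://doi.org/", media_type="chemical/x-cml"):
    """Prepare (but do not send) a content-negotiation request for a DOI."""
    # The Accept header asks the resolver for the registered media type
    # instead of the human-readable landing page.
    return urllib.request.Request(resolver + doi,
                                  headers={"Accept": media_type})


# Placeholder DOI under the prefix and suffix described in this article.
request = cml_request("10.14469/ch/example")
print(request.full_url, request.get_header("Accept"))
```

Passing the prepared request to urllib.request.urlopen would perform the retrieval; that step is left out so that the sketch has no network dependency.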
It took around 8 h to mint DOIs for an initial run of 23,240 items, and further subsequent updates took only a few hours. Each update required about 24 h to become visible in DataCite. New items in the repository are now synchronised hourly using the DSpace 4 programme, RegisterDOI.
At this stage, the DataCite Search API [29] proved to be a useful tool for checking the quality and validity of the curation and its metadata. Search queries were used to retrieve lists of all entries belonging to the new DSpace-SPECTRa collection in an easily parsed format and with the necessary metadata to identify discrepancies, such as duplicate DSpace depositions, duplicate assigned DataCite DOIs or corrupted or invalid metadata. Some examples of such use are collected in Table 2 and are also described below. An advantage is that this kind of analysis can be done without privileged access to the host repository and its underlying databases, which makes it easier for peers and users to scrutinize the quality of large open data collections and flag any potential errors.
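Queries of this kind are plain URLs and are easily generated programmatically. The following sketch assembles one from its parts; the function name search_url is hypothetical, the parameter names (q, fq, fl, wt, rows) and the endpoint are those appearing in the query URLs of Table 2, and no request is actually sent, since the service address may have changed since this article was written.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the query URLs of Table 2.
SEARCH_API = "http://search.datacite.org/api"


def search_url(query, fields=("doi", "title"), rows=10, fmt="xml",
               filter_query=None):
    """Assemble a search URL from a query, field list and row limit."""
    params = [("q", query)]
    if filter_query:
        params.append(("fq", filter_query))
    params += [("fl", ",".join(fields)), ("wt", fmt), ("rows", str(rows))]
    # urlencode percent-escapes the backslash, colon and asterisk.
    return SEARCH_API + "?" + urlencode(params)


# First three entries under the Imperial College prefix that carry any
# NCI alternate identifier, returning doi, title and relatedIdentifier.
url = search_url(r"alternateIdentifier:NCI\:*",
                 fields=("doi", "title", "relatedIdentifier"),
                 rows=3, filter_query="prefix:10.14469")
print(url)
```

Restricting the returned fields and the number of rows keeps the responses small enough to be checked routinely as part of a curation run.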
Results and discussion
The configured metadata infrastructures now associated with each item in the collection enable individual data files to be accessed based only on knowledge of the persistent identifiers and media type, which can be allowed to default to a specific type. We have implemented three procedures for doing this; these are fully described elsewhere with discussion of the pros and cons of each approach [33, 34]. Here we provide only a summary of these methods.
1. The first access method to be developed [33] is based on extensions to CNRI Handle record types known as 10320/loc [30]. These allow the handle record to be
Table 2 Examples of data discovery and data metrics using metadata
Entry | URL | Purpose of search
1 | http://search.datacite.org/api?&q=prefix:10.14469&alternateIdentifier:NCI\:*&fl=doi,title,relatedIdentifier&wt=xml&rows=3 | Returns the first three entries for which any NCI descriptor is specified, restricted to the Imperial College prefix, containing values for the title, doi and RelatedIdentifier and expressed in XML syntax
2 | http://search.datacite.org/ui?&q=alternateIdentifier:smiles\:*.*+alternateIdentifier:NCI\:* | This returns all entries for which both a SMILES and NCI molecular descriptor is specified and which contains a period in the SMILES string
4 | http://search.datacite.org/ui?q=ORCID:*+prefix:10.14469 | A variation of the preceding example, illustrating all entries at Imperial College that have an associated ORCID for their creator
5 | http://search.datacite.org/ui?q=ORCID:*+doi:10.14469\/CH\/* | A second variation of the preceding example, illustrating all entries at Imperial College that have an associated ORCID for their creator and a DOI assigned to the Chemistry department
6 | http://search.datacite.org/ui?q=has_media:true&fq=prefix:10.14469 | Searches for any entry associated with a declared media type. The prefix is that registered by Imperial College London; the media type found for this prefix is chemical/x-cml
7 | http://search.datacite.org/ui?q=InChIKey=LQPOSWKBQVCBKS-PGMHMLKASA-N | A search using a specified value for the InChI chemical identifier associated with the dataset. Our repository was constructed along the lines that each deposition describes a single molecular collection, where a single InChI descriptor relates to the important molecular entity in that collection
8 | http://search.datacite.org/ui?q=alternateIdentifier:InChIKey\:* | A variation on the preceding specific search, where all entries containing an InChIKey are returned
This returns the metadata about the ORCID associated with the creator of a data set, along with a specified period for its creation
This provides a URL resolution report for all DOIs associated with the Imperial College London prefix
This returns the number of datasets associated with Imperial College as a whole
Web End =http://search.datacite.org/api?&q=prex:10.14469&alternateIdentier:NCI\:
http://search.datacite.org/api%3f%26q%3dprefix:10.14469%26alternateIdentifier:NCI%5c:*%26fl%3ddoi%2ctitle%2crelatedIdentifier%26wt%3dxml%26rows%3d3
Web End =*&=doi,title,relatedIdentier&wt=xml&rows=3
3http://search.datacite.org/ui%3fq%3dORCID:0000-0002-8635-8390%2bpublicationYear:%5b2014%2bTO%2b2014%5d
Web End =http://search.datacite.org/ui?q=ORCID:0000000286358390+publicationYear:[2014
http://search.datacite.org/ui%3fq%3dORCID:0000-0002-8635-8390%2bpublicationYear:%5b2014%2bTO%2b2014%5d
Web End =+TO+2014]
9http://stats.datacite.org/%3ffq%3ddatacentre_facet%253A%2522BL.IMPERIAL%2b-%2bImperial%2bCollege%2bLondon%2522%26fq%3dallocator_facet%253A%2522BL%2b-%2bThe%2bBritish%2bLibrary%2522%26q%3d%23tab-resolution-report
Web End =http://stats.datacite.org/?fq=datacentre_facet%3A%22BL.IMPERIAL++Imperial
http://stats.datacite.org/%3ffq%3ddatacentre_facet%253A%2522BL.IMPERIAL%2b-%2bImperial%2bCollege%2bLondon%2522%26fq%3dallocator_facet%253A%2522BL%2b-%2bThe%2bBritish%2bLibrary%2522%26q%3d%23tab-resolution-report
Web End =+College+London%22&fq=allocator_facet%3A%22BL+
http://stats.datacite.org/%3ffq%3ddatacentre_facet%253A%2522BL.IMPERIAL%2b-%2bImperial%2bCollege%2bLondon%2522%26fq%3dallocator_facet%253A%2522BL%2b-%2bThe%2bBritish%2bLibrary%2522%26q%3d%23tab-resolution-report
Web End =+The+British+Library%22&q=#tabresolutionreport
10http://stats.datacite.org/%3ffq%3ddatacentre_facet%253A%2522BL.IMPERIAL%2b-%2bImperial%2bCollege%2bLondon%2522%26q%3d%23tab-prefixes
Web End =http://stats.datacite.org/?fq=datacentre_facet%3A%22BL.IMPERIAL++Imperial
http://stats.datacite.org/%3ffq%3ddatacentre_facet%253A%2522BL.IMPERIAL%2b-%2bImperial%2bCollege%2bLondon%2522%26q%3d%23tab-prefixes
Web End =+College+London%22&q=#tabprexes
Harvey et al. J Cheminform (2015) 7:43
Page 10 of 14
retrieved using the Handle REST API, which allows programmatic access to handle resolution using HTTP. A typical invocation would be using a URL of the type http://doi.org/10042/31117%3flocatt%3dmimetype:chemical/x-cml
Web End =http://doi.org/10042/31117?locatt=mim
http://doi.org/10042/31117%3flocatt%3dmimetype:chemical/x-cml
Web End =etype:chemical/x-cml where the string 10042/31117 is the assigned Handle identier and chemical/x-cml the requested media type.2. The DataCite Media API also allows a DOI to be resolved based on the media type of the required document, typically a URL of form http://data.datacite.org/chemical/x-cml/10.14469/ch/153690
Web End =http://data.data- http://data.datacite.org/chemical/x-cml/10.14469/ch/153690
Web End =cite.org/chemical/x-cml/10.14469/ch/153690 , where the string 10.14469/ch/153690 is the assigned Data-Cite identier, and chemical/x-cml the requested media type [34]. This URL can be passed to any requesting program and the le associated with this information will then be retrieved from the repository.
3. OAI-ORE Resource Maps exposed through DataCite metadata. We have made the OAI-ORE Resource Map [13] and the METS manifest [23] (both generated internally by DSpace) discoverable by including their locations as relatedIdentifiers within the DataCite metadata for the dataset [35], as described above. This allows a script to query, for example, the resource map to retrieve the URL associated with the data file. Again, the only information required by the script is the DOI and the media type, as in the call datacite_jmol(10.14469/ch/153690?chemical/x-cml), where datacite_jmol is the JavaScript function written to process the responses [36].
Any of the above methods [34] can be used in conjunction with, e.g., a visualisation program that converts the data contained in the retrieved file into a graphical representation, or as part of a script that retrieves a larger number of files for purposes such as data mining.
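The first two access methods amount to composing a URL from a persistent identifier and a media type. A minimal Python sketch of that composition follows; the function names are hypothetical, the base URLs are those quoted in the text for 2015 and may no longer respond, and no network request is made.

```python
def handle_locatt_url(handle, media_type, resolver="http://doi.org"):
    """Compose a URL asking the resolver for the 10320/loc location
    whose mimetype attribute matches `media_type` (method 1)."""
    return f"{resolver}/{handle}?locatt=mimetype:{media_type}"


def datacite_media_url(doi, media_type, base="http://data.datacite.org"):
    """Compose a DataCite Media API style URL in which the media type
    precedes the DOI in the path (method 2)."""
    return f"{base}/{media_type}/{doi}"


handle_url = handle_locatt_url("10042/31117", "chemical/x-cml")
media_url = datacite_media_url("10.14469/ch/153690", "chemical/x-cml")
print(handle_url)
print(media_url)
```

Either string could then be handed to a visualisation program or a download loop; only the identifier and the media type need to be known in advance.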
Data discovery and datametrics
Enhancement of the original Cambridge dataset with the features described above greatly improves the discoverability of the data. Enriching the metadata and then exposing it in a manner that allows the DataCite organisation to harvest it enables exploitation using the DataCite interface [29] and allows statistics to be collected [37]. Examples of both are shown in Table 2.
The current DataCite search resource is still styled "beta", and the features offered are likely to be greatly enhanced in the future.
The benefits of achieving SWORD/OAI-ORE and METS-enabled endpoints
Perhaps the most significant technical improvement realised as a result of this activity is the facilitation of future curation efforts, as part of a strategy to address the issue of what has graphically been described as link rot [38], whereby a worryingly large proportion of non-persistent identifiers used to cite data and associated information are found not to link correctly after just a few years, or in some cases months. Digital repositories are intrinsically designed to enable replication of content to other locations whilst preserving essential information such as persistent identifiers. Here we focus on the DSpace repository, which provides an OAI-ORE endpoint implementing the Open Archives Initiative's Object Reuse and Exchange standards [13] to achieve such replication. The ORE manifest for the deposition illustrated in Figs. 1 or 2, for example, is declared in metadata as https://spectradspace.lib.imperial.ac.uk:8443/metadata/handle/10042/159060/ore.xml, or as https://spectradspace.lib.imperial.ac.uk:8443/metadata/handle/10042/159060/mets.xml for the METS manifest (see above). These locators derive from the assigned handle for this entry, 10042/159060. For each entry, a structured XML representation of the data (for example PM7.xml), including a declared standard XML schema (CML 2.4), is included. This allows the data to be directly parsed using a generic XML import/export tool, so enhancing any future wholesale export of the dataset. The use of XML is to be preferred to older legacy chemical formats, for which no explicit schemas are, or indeed can be, declared.
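Two points from the preceding paragraph lend themselves to a short illustration: the manifest locators follow mechanically from the handle, and a CML document can be read with a generic XML parser. The Python sketch below assumes the URL pattern quoted in the text and the CML namespace http://www.xml-cml.org/schema; the function names are hypothetical and the water molecule is an invented toy fragment, not an entry from the collection.

```python
import xml.etree.ElementTree as ET

CML_NS = {"cml": "http://www.xml-cml.org/schema"}


def manifest_urls(handle, base="https://spectradspace.lib.imperial.ac.uk:8443"):
    """Derive the ORE and METS manifest locators from a DSpace handle."""
    root = f"{base}/metadata/handle/{handle}"
    return {"ore": f"{root}/ore.xml", "mets": f"{root}/mets.xml"}


def read_cml_atoms(cml_text):
    """Return (element, x, y, z) tuples from a CML document using only
    the standard-library XML parser."""
    root = ET.fromstring(cml_text)
    return [
        (
            atom.get("elementType"),
            float(atom.get("x3")),
            float(atom.get("y3")),
            float(atom.get("z3")),
        )
        for atom in root.findall(".//cml:atom", CML_NS)
    ]


# Invented toy fragment; coordinates are illustrative only.
TOY_CML = """<molecule xmlns="http://www.xml-cml.org/schema" id="m1">
  <atomArray>
    <atom id="a1" elementType="O" x3="0.000" y3="0.000" z3="0.117"/>
    <atom id="a2" elementType="H" x3="0.000" y3="0.757" z3="-0.469"/>
    <atom id="a3" elementType="H" x3="0.000" y3="-0.757" z3="-0.469"/>
  </atomArray>
</molecule>"""

urls = manifest_urls("10042/159060")
atoms = read_cml_atoms(TOY_CML)
print(urls["ore"])
print(atoms)
```

No chemistry-specific library is needed to reach the coordinates, which is the practical benefit of a declared XML schema over legacy formats.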
The following illustrates a programmatic method for a curation procedure that could be employed if starting from SWORD [21, 22]/OAI-ORE [13] and/or METS-enabled endpoints.
Obtain a list of all the individual entries for the collection. This is accomplished by using DataCite to search for any unique identifier associated with the collection, which is defined in this example by the string alternateIdentifier:NCI:
This can be accomplished using the command:
curl "http://search.datacite.org/api?&q=prefix:10.14469&alternateIdentifier:NCI\:*&fl=doi,title,relatedIdentifier&wt=csv&rows=170000" -o NCI.csv
The value 170,000 in this string is the expected upper bound. The prefix 10.14469 restricts the search to collections at Imperial College only (to disambiguate from any other collections with the same name elsewhere).
This returns the following information for each entry (in this example in csv format, with other options being XML, OAI-PMH or json):
doi,title,relatedIdentifier
10.14469/CH/123315,NSC5396,HasMetadata:URL:https://spectradspace.lib.imperial.ac.uk:8443/metadata/handle/10042/130536/mets.xml,HasMetadata:URL:https://spectradspace.lib.imperial.ac.uk:8443/metadata/handle/10042/130536/ore.xml,IsPartOf:Handle:10042/31117
This reveals that ORE and METS manifests are associated with the relatedIdentifier metadata element, and the direct path to each is obtained from the value of the HasMetadata child.
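Extracting the manifest locations from such a row is a small parsing exercise. The Python sketch below is one possible reading of the layout shown above and is an assumption rather than a documented format: it treats the first two fields as DOI and title and every remaining field as a relation:type:value triple. The function name parse_entry is hypothetical.

```python
import csv
import io


def parse_entry(fields):
    """Split one harvested row into DOI, title, manifest URLs and parents."""
    doi, title, *related = fields
    manifests = {}
    part_of = []
    for item in related:
        # Split on the first two colons only, because the value may
        # itself contain colons (for example a URL with a port number).
        relation, id_type, value = item.strip().split(":", 2)
        if relation == "HasMetadata" and id_type == "URL":
            manifests[value.rsplit("/", 1)[-1]] = value  # keyed by file name
        elif relation == "IsPartOf":
            part_of.append(value)
    return {"doi": doi, "title": title, "manifests": manifests, "part_of": part_of}


ROW = (
    "10.14469/CH/123315,NSC5396,"
    "HasMetadata:URL:https://spectradspace.lib.imperial.ac.uk:8443"
    "/metadata/handle/10042/130536/mets.xml,"
    "HasMetadata:URL:https://spectradspace.lib.imperial.ac.uk:8443"
    "/metadata/handle/10042/130536/ore.xml,"
    "IsPartOf:Handle:10042/31117"
)
entry = parse_entry(next(csv.reader(io.StringIO(ROW))))
print(entry["manifests"]["mets.xml"])
print(entry["part_of"])
```

If the real export quotes the multi-valued relatedIdentifier field as a single CSV cell, the related items would need to be split out of that cell first.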
These provide programmatic access (using XSLT transforms or other methods) to the METS bitstream itself, which contains all the files in the deposition as a compressed archive. The METS bitstream has the URL: https://spectradspace.lib.imperial.ac.uk:8443/dspace/bitstream/handle/10042/31117/mets.zip?sequence=8
This is retrievable using:
curl "https://spectradspace.lib.imperial.ac.uk:8443/dspace/bitstream/handle/10042/31117/mets.zip?sequence=8"
Each METS manifest can then be injected into the destination repository, with the string 10042/31118 defining the Handle for the entire new collection (not that for the individual entries):
curl -s -i --data-binary "@mets.zip" \
-H "Content-Disposition:filename=mets.zip" \
-H "Content-Type:application/zip" \
-H "X-Packaging:http://purl.org/net/sword-types/METSDSpaceSIP" \
-H "X-No-Op:false" -H "X-Verbose:true" \
https://USER:[email protected]:8443/sword/deposit/10042/31118
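For scripted re-deposition of many entries, the same request can be assembled in Python. The sketch below mirrors the headers of the curl command above using only the standard library. It is an illustration under assumptions: the function name build_sword_deposit is hypothetical, the host repository.example.org is a placeholder, and the credentials are sent as an HTTP Basic Authorization header rather than embedded in the URL. The request is built but not sent.

```python
import base64
import urllib.request


def build_sword_deposit(url, package, user, password, filename="mets.zip"):
    """Build (but do not send) a SWORD deposit request for a METS package."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "Content-Disposition": f"filename={filename}",
        "Content-Type": "application/zip",
        "X-Packaging": "http://purl.org/net/sword-types/METSDSpaceSIP",
        "X-No-Op": "false",    # "true" would request a dry run
        "X-Verbose": "true",
        "Authorization": f"Basic {token}",
    }
    return urllib.request.Request(url, data=package, headers=headers, method="POST")


# Placeholder endpoint and payload; 10042/31118 is the collection handle.
request = build_sword_deposit(
    "https://repository.example.org/sword/deposit/10042/31118",
    b"placeholder bytes standing in for mets.zip",
    "USER",
    "PASSWORD",
)
print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would perform the deposit; in a bulk migration the package bytes would be read from each downloaded mets.zip in turn.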
Finally in this section, we note PREMIS, another international standard for metadata supporting the preservation of digital objects to help ensure their long-term usability [39]. Currently, the PREMIS schema is used in DSpace instances only to represent technical metadata about DSpace bitstreams (i.e. files), being generated by a PREMIS crosswalk.
Comparison with other repositories
Here we compare our approach for data deposition with that of two alternative existing data repositories, one of which is also based on DSpace (Dryad [40]) and a second, Figshare [41], that is not. The first is run as a not-for-profit organisation that offers data deposition services, with persistent identifiers provided by both the DSpace handle manager and also via DataCite. Dryad deploys a subset of the metadata configured for our SPECTRa server, but significantly this does include [42] an OAI-PMH based programmatic method for access to the data object via the METS manifest, allowing a procedure similar to the OAI-ORE resource map outlined above to be used to access the data file. Dryad differs in one significant regard from our approach in terms of the granularity of the deposition. Since our data is based on the computed properties of discrete molecules, we have adopted the strategy of one data record per molecule, and hence the dataset for each molecule is also assigned its own persistent identifiers. In contrast, the primary model used by Dryad offers the coarser granularity of one data record per associated publication, whereby the complete Dryad data set is linked with a peer-reviewed journal publication. The net result is a pair of persistent identifiers, one for the article and one for the data, with the data component embargoed until the article itself is released to the public after peer review. We do not regard this approach as an optimal one when dealing with molecular data, since it cannot permit any discovery process for individual molecules contained in such a collection.
DOIs can also be minted using the current (2015) version of Figshare using the DataCite API. This commercial repository is not currently OAI-PMH/OAI-ORE compliant and so no standard ORE or METS resource maps are declared to DataCite using e.g. the relatedIdentifier element of the DataCite metadata schema. This lack of OAI-PMH/OAI-ORE compliance would render a lossless curation of our SPECTRa collection to e.g. Figshare more difficult to achieve programmatically, but such an operation is not excluded in the future when the functionality becomes available.
Comparison with two other collections of molecular quantum mechanical calculation data
We first return to discussing the article reporting the results of a stochastic exploration of the structures predicted using quantum mechanical procedures [11]. Initial approximations based on approximate methods are refined using much higher levels of theory. The molecular coordinates for unexpected, unusual or interesting outcomes from this procedure were then deposited into the supporting information (SI) associated with the published article. This contains just 10 species, although clearly far more molecules were computed at various levels of theory and these now appear lost to science. The SI itself takes the form of a paginated PDF file downloadable from the article landing page, and which contains no exposed associated metadata for any individual entry. Discussion is included here because it is very typical of the data associated with studies of this type. Curation of such data is really only worthwhile if it is first aggregated into a larger collection, a process that is never attempted because of this formal lack of metadata. The resulting fragmentation and hence loss of valuable data is, we argue, one of the broken aspects of current publishing models that require urgent attention.
As with the previous example, the next article [12] describes quantum mechanics based procedures to obtain the molecular structures of a much larger collection of 134 kilo-molecules and the subsequent methods involved in creating a digital repository based collection of these. Depositing all the calculations recovered from this process goes one important stage beyond the previous example, and is therefore to be welcomed. However, an important unanswered question is how easy it would be to curate this collection a decade from now. In fact, several fundamental design features [12] have made such an operation unnecessarily difficult.
The entire dataset is associated with a single persistent identifier [43] expressed as a compressed archive that a user can download and expand into a folder containing 133,886 individual files. The collected metadata however does not refer to these files, but to the folder containing them, which in turn means that the contents of this folder are in effect not discoverable using the mechanisms described above.
In general, it is quite difficult on most computer systems to navigate a single folder containing 133,886 items. One would have to resort to using specialised software to do this, and this would probably restrict inspection to individual files and not to a sub-collection with specified properties.
The individual entries adopt the original XMol XYZ syntax. That syntax has then been annotated with a number of other properties, both to the individual atoms and to the molecule as a whole, the latter including both SMILES and InChI strings. Unfortunately, this annotation is in effect ad hoc, applied in a manner that was not envisaged for the original XYZ format. A human has to read the associated documentation to establish the precise meaning of the annotations, and then write suitable code to extract the annotations to render them useable for e.g. metadata. It is in general uncertain whether software that has been written to process standard XYZ files lacking annotations could successfully cope with this additional content. At best, one might expect the annotations to be simply discarded, since their semantics are not accessible to such a program, only to a human. At worst, it could render the document entirely unreadable by standard software.
The individual files themselves contain no information about the procedure used to compute the coordinates. In this regard, it would be quite difficult to use these files to reproduce the original calculation; thus the original program inputs are not available, nor indeed are the original program outputs from the quantum mechanical calculation. Curating such a collection therefore would require bespoke interpretation by a human, which always tends to be an expensive and error-prone solution.
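The fragility of ad hoc annotation of the XYZ format, noted above, can be illustrated with a toy example. The Python sketch below is not the format of the collection in question: the two parsers are hypothetical, and the annotated file simply adds an invented extra per-atom column and an invented trailing line to show the "at worst" and "at best" outcomes. All coordinates are illustrative.

```python
def _atom(parts):
    """Convert the first four whitespace-separated columns to an atom tuple."""
    return (parts[0], float(parts[1]), float(parts[2]), float(parts[3]))


def parse_xyz_strict(text):
    """Accept only plain XYZ: atom count, comment line, then 'El x y z' lines."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])
    if len(lines) != n_atoms + 2:
        raise ValueError("unexpected number of lines for plain XYZ")
    atoms = []
    for line in lines[2:]:
        parts = line.split()
        if len(parts) != 4:
            raise ValueError(f"expected 4 columns, found {len(parts)}")
        atoms.append(_atom(parts))
    return atoms


def parse_xyz_lenient(text):
    """Keep the coordinates and silently discard any extra columns or lines."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])
    return [_atom(line.split()) for line in lines[2:2 + n_atoms]]


PLAIN_XYZ = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""

# Invented annotations: a fifth per-atom column and a trailing descriptor line.
ANNOTATED_XYZ = """3
water
O 0.000 0.000 0.117 -0.66
H 0.000 0.757 -0.469 0.33
H 0.000 -0.757 -0.469 0.33
O
"""

try:
    parse_xyz_strict(ANNOTATED_XYZ)
except ValueError as err:
    print("strict parser rejected the annotated file:", err)
print(parse_xyz_lenient(ANNOTATED_XYZ))
```

The lenient parser recovers the geometry but loses every annotation, which is why a schema-declared format is preferable when the annotations are meant to serve as metadata.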
The Harvard Clean Energy project [44] is another recent deposition based on quantum chemical calculations, with a claimed 2.3 million molecules associated with an even more impressive 150,000,000 DFT calculations. Access to any individual calculation on any specific molecule however is available only via a search front-end to the database based on specified search parameters. No metadata is exposed on any molecule or its calculation parameters in any standard form and it is difficult to envisage any type of curation that could be successfully applied to such a collection. We think it unlikely that enabling open curation was a design feature of the system, although we also believe that this should be included in future designs of such collections.
The recently announced CERN OpenData Portal [45] is also included here, since the data described is very different from the chemical information described above, both in terms of the cost of its acquisition, and of its size and granularity. The organisational prefix for the collection is 10.7483 and this reveals (in December 2014) 53 entries. A typical entry [46] itself contains 3211 data files totalling 3.4 TB in size. Analysing this data requires very specialised software, which is itself assigned a persistent identifier [47]. The software is distributed as a virtual image and is designed to be used in the form of a virtual machine containing all the tools required to acquire and analyse the data. The equivalent in our own implementation is the virtual JSmol container for the chemical data [48], which is made available indirectly in the web browser document object model (DOM) as an HTML5 canvas, rather than as a virtual instance on a computer. Working outside the virtual containers provided by the CERN data portal is unlikely to be useful, whereas for chemistry the JSmol container could be replaced by other containers such as Avogadro 2 [49].
Conclusions
This brief survey of two recently published molecular data collections indicates that each subject domain will benefit from specifically optimising the features of repository collections for its own needs. We believe that in the chemistry domain, it is useful to adopt a molecular granularity and to develop metadata, search and acquisition mechanisms appropriate for this granularity, even at a scale of 2.3 million molecules. We think it less useful to aggregate the molecules into single containers for which metadata about individual molecules is not exposed. It is also essential that the procedures adopted are programmatic, in that all the required information to re-curate the dataset is available for machine processing. If this is so, then there is no reason why the process could not scale well beyond 2.3 million molecules if required.
Code availability
The MOPAC software, including the latest PM7 parameter set [10], can be obtained and licensed from http://openmopac.net. The DSpace software itself is open source [1]. The SpectraDSpace DIM2DataCite crosswalk is archived [50]. The JavaScript routines implementing [36] the functionality described in the results [34] section are available via the repository entries cited in ref 36.
Authors' contributions
The manuscript was written through contributions from all the authors, who have given approval to the final version of the manuscript. All authors read and approved the final manuscript.
Author details
1 High Performance Computing Service, Imperial College London, London SW7 2AZ, UK. 2 Department of Chemistry, Imperial College London, South Kensington Campus, London SW7 2AZ, UK. 3 Department of Chemistry, Centre for Molecular Informatics, Lensfield Road, Cambridge CB2 1EW, UK. 4 Stewart Computational Chemistry, 15210 Paddington Circle, Colorado Springs, CO 80921, USA.
Acknowledgements
One of us (JJPS) thanks the National Institute of General Medical Sciences of the National Institutes of Health (Award Number R44GM108085) for funding. The project was funded by an Imperial College Green shoots RDM grant.
Competing interests
The authors declare that they have no competing interests.
Received: 13 May 2015 Accepted: 2 August 2015
References
1. Smith M, Barton M, Bass M, Branschofsky M, McClellan G, Stuve D et al (2003) DSpace: an open source dynamic digital repository. D-Lib Magazine 9. http://doi.org/10.1045/january2003-smith. The latest release of the software is available via http://www.dspace.org/latest-release
2. Downing J, Murray-Rust P, Tonge AP, Morgan P, Rzepa HS, Cotterill F et al (2008) SPECTRa: the deposition and validation of primary chemistry research data in digital repositories. J Chem Inf Mod 48:1571–1581
3. Rzepa HS (2013) Chemical datuments as scientific enablers. J Cheminform 5:6
4. See for example the UK policy at EPSRC policy framework on research data. http://www.epsrc.ac.uk/about/standards/researchdata/. (Retrieved 9 May, 2015)
5. Frey JG, Bird CL (2014) Scientific and technical data sharing: a trading perspective. J Comput Aided Mol Des 28:989–996
6. Badiola KA, Bird C, Brocklesby WS, Casson J, Chapman RT, Coles SJ et al (2015) Experiences with a researcher-centric ELN. Chem Sci 6:1614–1629
7. Murray-Rust P, Rzepa HS, Stewart JJP, Zhang Y (2005) A global resource for computational chemistry. J Mol Model 11:532–541
8. Stewart JJP (1990) MOPAC: a semiempirical molecular orbital program. J Comput Aided Mol Des 4:1–103
9. The link for this collection is The WorldWideMolecularMatrix, an Open collection of information on small molecules. https://www.repository.cam.ac.uk/handle/1810/724. (Retrieved 9 May, 2015). The handle prefix 1810 is not registered for this repository, and so the handle 1810/724/ cannot be resolved using http://hdl.handle.net/1810/724/ or http://doi.org/1810/724/
10. Stewart JJP (2013) Optimization of parameters for semiempirical methods VI: more modifications to the NDDO approximations and re-optimization of parameters. J Mol Model 19:1–32
11. Bera PP, Sattelmeyer KW, Saunders M, Schaefer HF, Schleyer PVR (2006) Mindless chemistry. J Phys Chem A 110:4287–4290
12. Ramakrishnan R, Dral PO, Rupp M, von Lilienfeld OA (2014) Quantum chemistry structures and properties of 134 kilo molecules. Sci Data 1, article 140022
13. Open Archives Initiative Object Reuse and Exchange. See http://www.openarchives.org/ore/. (Retrieved 9 May, 2015)
14. Murray-Rust P, Rzepa HS (1999) Chemical Markup Language and XML Part I. Basic principles. J Chem Inf Comp Sci 39:928
15. Heller S, McNaught A, Stein S, Tchekhovskoi D, Pletnev I (2013) InChI - the worldwide chemical structure identifier standard. J Cheminform 5:7. Technical documentation can be found at http://www.inchi-trust.org/technical-faq/. (Retrieved 9 May, 2015)
16. CML Schema version 2.4. http://www.xml-cml.org/schema/schema24/. (Retrieved 9 May, 2015)
17. O'Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR (2011) Open Babel: an open chemical toolbox. J Cheminform 3:33. Documentation is found at http://openbabel.org/docs/dev/FileFormats/Overview.html. (Retrieved 9 May, 2015)
18. Jenkins S, Liu Z, Kirk SR (2013) A bond, ring and cage resolved Poincaré–Hopf relationship for isomerisation reaction pathways. Mol Phys 111:3104–3116
19. Rzepa HS (2009) The importance of being bonded. Nat Chem 1:510–512
20. Downloadable Structure Files of NCI Open Database Compounds. http://cactus.nci.nih.gov/download/nci/. (Retrieved 9 May, 2015)
21. Allinson J, François S, Lewis S (2008) SWORD: Simple Web-service Offering Repository Deposit. Ariadne, vol 54, 30 January
22. Lewis S (2012) SWORD: facilitating deposit scenarios. D-Lib Magazine 18. doi:10.1045/january2012-lewis. (Retrieved 9 May, 2015). See also http://swordapp.org. (Retrieved 22 July, 2015)
23. Metadata encoding and transmission standard (METS). http://www.loc.gov/standards/mets/. (Retrieved 9 May, 2015)
24. Haak LL, Fenner M, Paglione L, Pentz E, Ratner H (2012) ORCID: a system to uniquely identify researchers. Learn Publish 25:259–264
25. Zang T, Rzepa HS, Murray-Rust P, Harvey MJ, Mason NJ, McLean A (2015) Revised Cambridge NCI database. hdl:10042/31117 and doi:10.14469/ch/2, shortDOI:6cw. (Retrieved 9 May, 2015)
26. Zang T, Rzepa HS, Murray-Rust P, Harvey MJ, Mason NJ, McLean A (2015) NSC92832, NSC92832, hdl:10042/159060. (Retrieved 9 May, 2015)
27. Zang T, Rzepa HS, Murray-Rust P, Harvey MJ, Mason NJ, McLean A (2015) NSC92832, NSC92832, doi:10.14469/ch/153690, shortDOI:6cv. (Retrieved 9 May, 2015)
28. DataCite: http://www.datacite.org/. (Retrieved 9 May, 2015)
29. DataCite metadata search interface: http://search.datacite.org. (Retrieved 9 May, 2015)
30. See DOI Name Values http://doi.org/10320/loc; Handle REST API http://www.handle.net/overviews/rest-api.html; 3 Resolution http://0-www.doi.org.libcat.lafayette.edu/doi_handbook/3_Resolution.html#3.8.4.3. (Retrieved 9 May, 2015)
31. Creative Commons CC0 public domain dedication: http://creativecommons.org/publicdomain/zero/1.0/. (Retrieved 9 May, 2015)
32. Rzepa HS, Murray-Rust P, Whitaker BJ (1998) The application of chemical multipurpose internet mail extensions (Chemical MIME) internet standards to electronic mail and world-wide web information exchange. J Chem Inf Comput Sci 38:976–982
33. Harvey MJ, Mason NJ, Rzepa HS (2014) Digital data repositories in chemistry and their integration with journals and electronic laboratory notebooks. J Chem Inf Mod 54:2627–2635
34. Harvey MJ, McLlean A, Mason NJ, Rzepa HS (2015) Standardsbased metadata procedures for retrieving data for display or mining utilizing Persistent (dataDOI) Identiers. J Cheminform. doi: http://dx.doi.org/10.1186/s13321-015-0081-7
Web End =10.1186/s13321015 http://dx.doi.org/10.1186/s13321-015-0081-7
Web End =00817 . See also demonstration presented at the FORCE2015 Conference, Oxford, England, January 1213, 2015. doi:http://dx.doi.org/10.6084/m9.figshare.1266197
Web End =10.6084/m9.gshare.1266197 & shortDOI:xn3. (Retrieved 9 May, 2015)
35. For example this page represents DataCites metadata for doi:http://dx.doi.org/10.14469/ch/153690
Web End =10.14469/ http://dx.doi.org/10.14469/ch/153690
Web End =ch/153690 . http://data.datacite.org/10.14469/ch/153690
Web End =http://data.datacite.org/10.14469/ch/153690 reveals the metadata associated with the entry shown in Figures 1 and 2. (Retrieved 9 May, 2015)36. Harvey MJ, Mason N, McLean A, Rzepa HS (2015) The JavaScripts are archived Figshare. doi:http://dx.doi.org/10.6084/m9.figshare.1342036,shortDOI:2zb
Web End =10.6084/m9.gshare.1342036,shortDOI:2zb
37. Datecite statistics search interface http://stats.datacite.org
Web End =http://stats.datacite.org . (Retrieved 9 May, 2015)
38. Zittrain J, Albert K, Lessig L, Perma (2015) Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations, Harvard Public Law Working Paper No. 1342. Available at SSRN: http://ssrn.com/abstract=2329161
Web End =http://ssrn.com/ http://ssrn.com/abstract=2329161
Web End =abstract=2329161 or doi:http://dx.doi.org/10.2139/ssrn.2329161
Web End =10.2139/ssrn.2329161 . (Retrieved 9 May, 2015)
39. PREMIS (Preservation Metadata: Implementation Strategies) see http://www.loc.gov/standards/premis/
Web End =http:// http://www.loc.gov/standards/premis/
Web End =www.loc.gov/standards/premis/ . (Retrieved 22 July, 2015)
40. Dryad (2015) http://www.datadryad.org. (Retrieved 9 May, 2015)
41. Figshare, see http://figshare.com/. (Retrieved 9 May, 2015)
42. Programmatic access to data files: http://wiki.datadryad.org/Data_Access#Programmatic_access_to_individual_data_files_using_OAI-PMH. (Retrieved 9 May, 2015)
43. Raghunathan R, Dral PO, Rupp M, von Lilienfeld OA (2014) Quantum chemistry structures and properties of 134 kilo molecules, Figshare. doi:10.6084/m9.figshare.978904, shortDOI:6cr. (Retrieved 9 May, 2015)
44. Hachmann J, Olivares-Amaya R, Atahan-Evrenk S, Amador-Bedolla C, Sánchez-Carrera RS, Gold-Parker A et al (2011) The Harvard clean energy project: large-scale computational screening and design of organic photovoltaics on the world community grid. J Phys Chem Lett 2:2241-2251
45. The CERN OpenData Portal: http://opendata.cern.ch/ and an associated data repository: http://zenodo.org. (Retrieved 9 May, 2015)
46. A typical CERN OpenData collection: doi:10.7483/OPENDATA.CMS.PDY4.7H2H, shortDOI:6cs. (Retrieved 9 May, 2015)
47. A software object in the CERN OpenData collection: doi:10.7483/OPENDATA.CMS.GS6N.54B9.2,short . (Retrieved 9 May, 2015)
48. Hanson RM, Prilusky J, Zhou R, Nakane T, Sussman JL (2013) JSmol and the next-generation web-based representation of 3D molecular structure as applied to Proteopedia. Israel J Chem 53:207-216
49. Hanwell MD, Curtis DE, Lonie DC, Vandermeersch T, Zurek E, Hutchison GR (2012) Avogadro: an advanced semantic chemical editor, visualization and analysis platform. J Cheminform 4:17
50. Rzepa HS, Harvey MJ, Mason NJ, McLean A, Murray-Rust P, Stewart JJP (2015) Standards-based curation of a decade-old digital repository data set of molecular information. Figshare. doi:10.6084/m9.figshare.1330063, shortDOI:6cq. (Retrieved 9 May, 2015)
Journal of Cheminformatics is a copyright of Springer, 2015.
Abstract
Background
We report the curation of 158,122 molecular geometries derived from the NCI set of reference molecules, together with associated properties computed using the MOPAC semi-empirical quantum mechanical method. The data were originally deposited in 2005 into the Cambridge DSpace repository as a data collection.
Results
The curation procedures included annotating the original data using new MOPAC methods, updating the syntax of the CML documents used to express the data to ensure schema conformance, and adding new metadata describing the entries, together with an XML schema transformation to map the metadata schema to that used by the DataCite organisation. We have adopted a granularity model in which a DataCite persistent identifier (DOI) is created for each individual molecule, enabling data discovery and data metrics at this level using DataCite tools.
Conclusions
We recommend that the future research data management (RDM) of the scientific and chemical data components associated with journal articles (the "supporting information") should be conducted in a manner that facilitates automatic periodic curation.