Digital Preservation
General Statement and System Strategies for Digital Preservation
As stated in Maintenance and Development of the Technological Infrastructure, the Library System primarily stores and backs up its digital data within its storage and backup infrastructure. It uses the services of the University of Padua concerning digital asset replication. To preserve integrity, digital assets are replicated remotely and validated by checksum or similar methods.
Our university data centre, the ASIT infrastructure (ITA), is equipped with modern systems to detect hardware deterioration and failures, so hardware can be replaced and serviced quickly. Redundancy is built into the infrastructure and allows it to remain operational even in the presence of multiple failures.
Data recovery and business continuity plans can be carried out quickly thanks to redundant backups, UPS systems, servers, networking equipment and service systems.
The software and hardware system monitors for problems and can report malfunctions, as well as disks or parts that need to be replaced. The hardware infrastructure is continuously monitored to minimise risks. Backup and restoration of server images are guaranteed, and the infrastructure has an adequate number of servers in case of malfunctions, as well as primary and secondary storage. The infrastructure can react quickly to malfunctions and setbacks. In case of catastrophic events, ASIT will step in to bring its services back online in a short time. Adequate physical storage capacity on servers is guaranteed to support Phaidra and Research Data Unipd activities.
As regards storage, advanced architectures are available for multiple replications of data and automatic backup in multiple locations at the University of Padua. Moreover, a copy of the repository is scheduled every night and can be restored within a few hours.
With regard to security, the network is protected by a firewall and access to the machines is restricted to the operators.
The Library System ensures an adequate infrastructure and expertise to transfer, manage and publish data in a safe and traceable way. The data managers have documented and clear procedures in the internal wiki, so every storage operation can be performed by the team with the same results.
Regarding the electrical components, all equipment is fitted with safety devices compliant with legal standards. The system is divided by switches for each zone and type of use. In addition, the implementation of UPS systems ensures continuity. Concerning safety, the premises are regularly monitored by the University of Padua's Security Unit.
As regards connectivity, our provider is the University of Padua, which uses the GARR network, the Italian Research Network. The University can provide and restore connectivity in a short time. The devices are all equipped with UPS systems to allow continuity of service in the event of a blackout.
Concerning storage, enterprise-level storage systems (ITA) are used that implement all redundancy measures (RAID, power supply) to cope with failures.
Public evidence:
- The section below on the general strategies for digital preservation
- The internal wiki (https://wiki.cab.unipd.it/wiki-itcab/index.php/Servizi/Storage) describes the procedures. Below is a summary of the content.
Procedures for managing the storage in the University Library System
Data management in the Library System is mainly done by managing the virtual machines that contain it. The main operations are:
Creation of a VM
Through the hosting environment (VMware), data managers can create a new VM. The network and storage parts require the most attention. For the network, the machine must be inserted into the relevant network, making sure that it uses an appropriate address. For the storage part, the data manager can allocate the needed disk space.
Moving a VM
VMs are usually assigned to servers in the available pool. Using VMware tools, it is possible to move a machine from one management server to another, with minimal downtime, in case of necessity (planned or unplanned maintenance, for example). This part is managed by ASIT internally.
Restoring a VM
At our request, ASIT can restore a VM from a dedicated backup storage. The restored machine can either replace the existing one, as in disaster recovery, or run alongside it, for partial recovery of data or to compare the evolution of the data over time.
Allocating an additional disk to a VM
An additional disk can be attached to a VM. This is recommended for machines with a lot of data. The disk is backed up together with the VM.
Security
The Library System has IT staff who manage its information services: nine computer scientists from the Information Systems sector of the Digital Library Office, two of whom work on Phaidra.
The Library System includes the security of its spaces among its aims. The University of Padua's security plan provides for a local security officer, who can assess risks and prevent dangers. In any case, safety management is the responsibility of the University of Padua, which carries out periodic checks to prevent risks to the facilities, intrusion, manipulation or data theft. The technological infrastructure is housed in dedicated locations to which only authorised and qualified people have access. In addition to local and remote backups at the University, we have a local backup system based on Bacula. Devices are locked in supervised rooms. Technical support for ordinary and extraordinary operations is guaranteed by a dedicated service of the University of Padua.
In summary, the infrastructures are protected by security plans from a physical point of view, as well as being monitored and accessible only by authorised staff. This ensures that the data is safe for long-term preservation and use.
From an IT point of view, the data is protected by firewalls that limit access to the infrastructure. Servers can only be accessed via the local network or VPN, and access is restricted to authorised operators using their own accounts. Experts assess the risk of cyber-attacks, and monitoring tools are in place for the network and the servers.
In addition to the backups, there are internal procedures with instructions for the operators in case of data or infrastructure recovery, and for the regulation of access to systems by authorised staff.
Phaidra
Procedures
Phaidra is based on Fedora, "a robust, modular, open-source repository system for the management and dissemination of digital content", which includes OAIS-compliance capabilities. The two steps of the deposit are now described.
In the first step, the ingestion, a SIP (Submission Information Package) is received for the selection, evaluation and organisation of the content. Two default procedures are available, the Phaidra Importer tool and the bulk upload script, but content is usually submitted through the simple and user-friendly Phaidra web interface.
Producers must authenticate with a local account issued to them; they can then upload the digital objects.
The second step, the submission, is performed automatically by the platform: the metadata is validated, the binary data (octets) is stored, and the checksums are created. An AIP (Archival Information Package) is then generated for storing and archiving the data. The platform also automatically generates the DIP (Dissemination Information Package), as the dissemination of data is open, except in cases where the producers have closed access to the binary content. Even in those cases, the platform still allows the dissemination of metadata.
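The submission step described above can be pictured with a small sketch. It is only an illustration under stated assumptions, not Phaidra's or Fedora's actual code: the class names, the `submit` and `disseminate` functions, and the `REQUIRED_FIELDS` schema are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

# Hypothetical minimal schema; the real metadata validation rules are not shown here.
REQUIRED_FIELDS = ("title", "creator", "license")


@dataclass
class SIP:
    """Submission Information Package: what the producer uploads."""
    metadata: dict
    payload: bytes
    open_access: bool = True


@dataclass
class AIP:
    """Archival Information Package: validated metadata, stored octets, checksum."""
    metadata: dict
    payload: bytes
    checksum: str
    open_access: bool


@dataclass
class DIP:
    """Dissemination Information Package: what a user receives."""
    metadata: dict
    payload: Optional[bytes]  # None when access to the binary content is closed


def submit(sip: SIP) -> AIP:
    """Validate the metadata, keep the octets and record their checksum."""
    missing = [name for name in REQUIRED_FIELDS if not sip.metadata.get(name)]
    if missing:
        raise ValueError(f"metadata validation failed, missing: {missing}")
    checksum = hashlib.md5(sip.payload).hexdigest()
    return AIP(dict(sip.metadata), sip.payload, checksum, sip.open_access)


def disseminate(aip: AIP) -> DIP:
    """Metadata is always disseminated; the payload only when access is open."""
    payload = aip.payload if aip.open_access else None
    return DIP(dict(aip.metadata), payload)
```

In this sketch a closed-access object still yields a DIP, but one that carries metadata only, which mirrors the behaviour described in the text.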
The AIP consists of metadata and data in formats suitable for long-term storage. All the data in the System is regularly backed up. All the servers are located in a room that is monitored and whose physical access is limited to the authorised staff. When the staff is not present, an alarm system and remote security monitoring are guaranteed. In regard to safety, the premises are regularly monitored by the University of Padua Security Unit.
Phaidra has an automated backup system. Each element (metadata and file) is saved together with its MD5 checksum for integrity verification. Phaidra has the tools to perform regular analyses and intervene if necessary.
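As an illustration of the MD5-based integrity verification mentioned above, the following minimal sketch recomputes the checksum of each file and compares it with a stored value. The manifest layout (relative path mapped to expected MD5) and the function names are assumptions made for the example; they do not describe Phaidra's internal tools.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def failed_fixity(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the relative paths that are missing or whose MD5 differs.

    `manifest` maps a relative path to the MD5 recorded at deposit time
    (a hypothetical format chosen for this sketch).
    """
    failures = []
    for relative_path, expected in sorted(manifest.items()):
        target = root / relative_path
        if not target.is_file() or md5_of(target) != expected:
            failures.append(relative_path)
    return failures
```

A scheduled job could call `failed_fixity` on each storage location and trigger a restore from backup for every path it returns.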
As mentioned above, users with a verified account can upload digital objects to the platform.
Digital Preservation
Digital preservation involves the combination of policies, strategies and actions to ensure the authenticity of the content and long-term preservation, regardless of future technological changes. Digital preservation applies to both native digital and digitised content.
Strategic activities in support of digital preservation are aligned with the rules of the Library System, i.e. the preservation, updating and enjoyment of bibliographic and documentary heritage and access to scientific information through the development of the University Digital Library (Regulations of the University of Padua Library System, Heading I, Art. 1, Paragraphs 1 and 2).
Strategies and actions for digital preservation apply to the creation, integrity and maintenance of content.
The main actions pursued by the SBA for the long-term preservation of digital collections are as follows:
- Development and maintenance of digital archives for the long-term preservation of digital objects
- Management of different file and metadata formats
- Implementation of robust processes and automated mechanisms to ensure good management and preservation of content
- Continuous and reliable access to the content of digital objects for the designated community
Our long-term preservation strategy is based mainly on the standardisation of input data.
The document Recommended file formats for long-term archiving and web dissemination in Phaidra provides an overview of the file formats to be used for long-term archiving and uploading to Phaidra. The document reads:
“There are no absolute criteria for choosing the file format. The choice is always dependent on different evaluations that the person who archives will have to make promptly, case by case, and will often result in a compromise between the best achievable quality and the limits imposed by the costs of production, processing and storage of files, as well as, for the previous, by the opportunity of a conversion to a new format.
This choice is particularly significant from the perspective of long-term archiving, for which a quality that respects the authenticity and integrity of the original document and a format that guarantees long-term access to data is desirable.
Some general criteria can be followed when choosing the most suitable format for archiving: Openness, Portability, Quality and Functionality, Development Support, Dissemination, Transparency, Self-documentation.”
For our designated community, the priority is the preservation of the information content (images, books, videos, etc.). We have a single level of preservation, given the variety of content archived on the platform.
Important steps to guarantee the preservation of the platform are:
- Control over the entire data supply chain. All the necessary actions are taken with the producer to integrate the missing information, paying close attention to the completeness of the data, both for long-term preservation (provenance, holding, rights, technical characteristics) and any formats suitable for long-term preservation. For example, even if we keep an original proprietary and undocumented file format, we convert the images into a TIFF file format.
- Control of data integrity. Every data modification is registered, using mostly the built-in mechanisms of Fedora. In particular, every metadata change is saved and is available to the repository. Checksums are applied to data and metadata, so we can monitor and intervene in case of accidental changes, software errors or incidents of another nature, through the recovery of data from the storage management.
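The idea of recording every metadata change and protecting each saved state with a checksum can be sketched as an append-only history. This is a simplified illustration only: Fedora's built-in versioning works differently, and the `MetadataHistory` and `MetadataVersion` names, as well as the choice of SHA-256, are assumptions made for the example.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _serialise(content: dict) -> bytes:
    """Canonical JSON form, so the same content always gives the same checksum."""
    return json.dumps(content, sort_keys=True).encode("utf-8")


@dataclass(frozen=True)
class MetadataVersion:
    number: int
    saved_at: str
    content: dict
    checksum: str


@dataclass
class MetadataHistory:
    """Append-only list of metadata versions; nothing is overwritten."""
    versions: list = field(default_factory=list)

    def save(self, content: dict) -> MetadataVersion:
        serialised = _serialise(content)
        version = MetadataVersion(
            number=len(self.versions) + 1,
            saved_at=datetime.now(timezone.utc).isoformat(),
            content=json.loads(serialised),  # private copy of the metadata
            checksum=hashlib.sha256(serialised).hexdigest(),
        )
        self.versions.append(version)
        return version

    def is_intact(self, number: int) -> bool:
        """Recompute the checksum of a stored version and compare it."""
        version = self.versions[number - 1]
        return hashlib.sha256(_serialise(version.content)).hexdigest() == version.checksum
```

When `is_intact` returns `False` for a version, that version would be the one to recover from backup storage.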
- Readability of the data. Phaidra's objective is that the data is always readable and interpretable by the designated community. We interact with the designated community (refer to Brief description of the designated community) on the use of data, and we keep up to date with the evolution of text, image, audio and video formats.
Together with the producer, Phaidra considers legal, ethical and copyright issues, preserving this information so that the usability of the data will be known in the future. The usability is guaranteed by the preservation of the content and by the wealth of information for the study and comprehension of the designated community.
There is a clear agreement between the producer and Phaidra through the Terms of Use. By accepting the Terms of Use, the producer agrees that the platform manages and disseminates the contents. Furthermore, the producer states that he/she owns the rights to deposit the object, that copyright and data confidentiality have been cleared, and that he/she has taken the ethical issues into consideration, according to the ethical code of the University of Padua.
Phaidra provides a simple and secure ingestion of data (files and metadata), including the provenance and life cycle of the digital object.
Phaidra is committed to the long-term custody of the items deposited in the repository and strives to adopt the current best practices in digital preservation.
As explained in Maintenance and Development of the Technological Infrastructure, all conditions for continuity assurance are met. The files and metadata formats respect the characteristics of long-term preservation and satisfy our designated community. Below is a description of the Phaidra migration plan:
- Submission of the plan to stakeholders for approval
- Analysis of the functionalities of the target platform for migration
- Preliminary analysis of file formats and of the level of preservation required for data and metadata. This phase is facilitated by the choices made on the formats and the format analysis document, as well as by the general nature of the metadata format.
- Determination of preservation actions based on format analysis
- Metadata analysis, vocabulary normalisation, and other preliminary operations
- Drafting of the necessary actions, including a possible transformation of the formats, copying of the data and, if necessary, mapping of the metadata to new formats, according to the required level of preservation
- Definition of the test criteria to evaluate the success of the migration
- Definition and planning of any costs and verification of the availability of the planned resources
- Identification of responsibilities within the plan and definition of migration workflows
In the migration phase, it will be necessary to:
- Prepare for the cleaning of the data
- Carry out the migration tests. This phase is very important because it serves to evaluate and verify the planned actions and to make the necessary corrections
- Modify the plan where needed and update the documentation at the same time
A migration plan requires a high level of collaboration among all parties involved, from the stakeholders to all those who have to provide information (e.g., the level of preservation, verification of the success of the migration, and information on the destination platform).
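One possible test criterion for verifying the success of a migration is to compare an inventory of the source platform with an inventory of the target platform. The sketch below assumes each inventory is a mapping from object identifier to checksum; this format, the identifiers used in the usage note and the `MigrationReport` name are hypothetical. Where the plan transforms file formats, checksums will legitimately differ, so for those objects the comparison would have to rely on other criteria.

```python
from dataclasses import dataclass


@dataclass
class MigrationReport:
    missing: list      # in the source but not in the target
    unexpected: list   # in the target but not in the source
    altered: list      # present in both, with different checksums

    @property
    def passed(self) -> bool:
        return not (self.missing or self.unexpected or self.altered)


def compare_inventories(source: dict, target: dict) -> MigrationReport:
    """Compare {identifier: checksum} inventories taken before and after migration."""
    missing = sorted(set(source) - set(target))
    unexpected = sorted(set(target) - set(source))
    altered = sorted(
        identifier
        for identifier in source.keys() & target.keys()
        if source[identifier] != target[identifier]
    )
    return MigrationReport(missing, unexpected, altered)
```

For example, `compare_inventories({"o:1": "aa"}, {"o:1": "aa"}).passed` is `True`, while any missing, unexpected or altered object makes the report fail and points to the objects that need attention.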
To date, responsibility for digital preservation is described in Maintenance and Development of the Technological Infrastructure at the institutional level, and in the Security section for the technical part.
The Terms of Use specify the level of responsibility defined by Phaidra towards users and the needs related to long-term preservation (see: "The University undertakes to preserve to the best of its ability the digital objects stored in Phaidra and to make them accessible and usable over time") and authorize Phaidra to manage data in a way appropriate to the purpose (see: "Authorized users can deposit digital objects in the platform making them available - according to the licenses issued - to third parties").
Infrastructure
Phaidra software architecture is based on Fedora version 6.0.X, one of the best-known open-source platforms for creating digital repositories. In line with the requirements of the designated community, Phaidra has developed a data model based on JSON-LD, LOM, Dublin Core and the Italian museum standards of the ICCD (Central Institute for Cataloguing and Documentation of the Ministry of Culture). Fedora follows the OAIS reference model. Both Fedora and Phaidra are supported by the developer community. Phaidra's public wiki documents the platform’s technical specifications (see GitHub).
As regards the software, a tool called Phaidra Book Importer has been developed for the creation and archiving of books. All Phaidra components run on well-known open-source platforms, such as Apache and Nginx, using Perl, Java and shell scripts as programming languages, and modern web technologies (HTML5, JavaScript frameworks such as Vue.js, CSS3) for the web interface. We use Debian Linux as the operating system for our servers.
The documentation is available on GitHub; it contains information on Phaidra's technical specifications.
Research Data Unipd
Research Data Unipd is the service for the management and long-term archiving of research data produced by the designated community, i.e. all users with an institutional account (Single Sign-On) who produce research data.
It allows the data needed to validate the results presented in a scientific publication to be archived, made accessible and reusable, as required by major funding bodies and numerous international journals.
Deposit
The data management workflow starts with the deposit action performed by the members of the research community, via access to the platform.
For specifics on the data submission workflow, the user can consult the HowTo page, where detailed guides can be found. For the pre-ingesting phase, specifications and recommendations can be found in the "Before you start to upload data" section.
The video guide in English and Italian explains the main steps, from login to the completion of the submission. The walkthrough guide accompanies the user by explaining in detail all the options and functionalities of the archive, as well as the different viewing modes depending on whether you are logged in as a depositor or an external user.
File formats
The "Recommended formats and data files" page of the repository provides information on the preferred file formats to ensure long-term accessibility of data and guidance on file sizes. The document "Recommended file formats for long-term archiving and web dissemination in Phaidra" is also available to users.
Licences and re-use
All published metadata is released under a CC0 licence.
The depositing user is encouraged to license the datasets to promote the re-use of the research data.
The "Access and re-use of data" page provides information on available licences.
Archive licence
By accepting the data deposit agreement, the user agrees that the platform will manage and disseminate the content. Furthermore, the depositor declares that he or she has the right to deposit the object, that the copyright and confidentiality of the data have been verified, and that he or she has assessed ethical issues, following the guidelines of the University of Padua's code of ethics.
Pre-publication control
Submitted metadata and research data are reviewed by library staff specialised in metadata curation and then published. Metadata for long-term preservation and the format of deposited files are also assessed.
Data Authenticity
Every change in the data is recorded, mainly using EPrints' built-in mechanisms. In particular, every metadata change is saved and is available to the repository.
Infrastructure
Research Data Unipd is based on the EPrints software, "the world-leading open-source digital repository platform", widely used by universities and other institutions for the dissemination of digital content, e.g. publications, theses or research data, with a historical presence in the Open Access world.