Digital preservation
General Statement
As stated in Maintenance and Development of the Technological Infrastructure, the Library System retains its digital data mainly in its storage and backup infrastructure. It uses the services of the University of Padua concerning digital asset replication. To preserve integrity, digital assets are replicated remotely and validated by MD5 checksum or similar methods.
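As an illustration of the checksum validation mentioned above, the following minimal sketch compares the MD5 digest of an original file with that of its replica. It is not the University's actual replication tooling: the function names `md5_of_file` and `replica_matches` and the chunk size are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large assets never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replica_matches(original, replica):
    """Hypothetical helper: True when both files have the same MD5 digest."""
    return md5_of_file(original) == md5_of_file(replica)
```

A replica whose digest differs from the original would be treated as damaged and copied again from a known good source.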
The hardware infrastructure is equipped with modern systems to detect hardware deterioration and failures and can replace and service the hardware quickly. Redundancy is implemented in the infrastructure and allows it to be operational even in the presence of multiple failures.
Data recovery and business continuity plans can be put into effect quickly thanks to redundant backups, UPS units, servers, networking equipment and service systems.
Phaidra
Procedures
Phaidra is based on Fedora, "a robust, modular, open-source repository system for the management and dissemination of digital content", which includes OAIS-compliance capabilities. The two steps of the deposit are now described.
In the first step, the ingest, a SIP (Submission Information Package) is received for the selection, evaluation and organization of the content. There are two default procedures, the Phaidra Importer tool and the bulk upload script, but the content is usually sent through the simple and user-friendly Phaidra web interface.
Producers must authenticate with a local account issued to them; they can then upload the digital objects.
The second step, the submission, is performed automatically by the platform: the metadata is validated, the binary data (octets) is stored and the checksums are created. An AIP (Archival Information Package) is then generated for storing and archiving the data. The platform automatically generates the DIP (Dissemination Information Package), since the dissemination of data is open, except in cases where the producers have closed access to the binary content. Even in those cases, the platform still disseminates the metadata.
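The submission step can be sketched as follows. This is a simplified illustration, not Phaidra's or Fedora's actual code: the `SIP` class, the `REQUIRED_FIELDS` tuple and the functions `build_aip` and `build_dip` are hypothetical names, and the real metadata validation is far richer than a check for three fields.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical required fields; the real Phaidra metadata schema is richer.
REQUIRED_FIELDS = ("title", "creator", "license")


@dataclass
class SIP:
    metadata: dict
    octets: bytes
    open_access: bool = True


def validate_metadata(metadata):
    """Raise ValueError when a required field is missing or empty."""
    missing = [name for name in REQUIRED_FIELDS if not metadata.get(name)]
    if missing:
        raise ValueError("missing metadata fields: " + ", ".join(missing))


def build_aip(sip):
    """Validate the SIP and return an AIP-like record with an MD5 checksum."""
    validate_metadata(sip.metadata)
    return {
        "metadata": dict(sip.metadata),
        "octets": sip.octets,
        "md5": hashlib.md5(sip.octets).hexdigest(),
        "open_access": sip.open_access,
    }


def build_dip(aip):
    """Metadata is always disseminated; octets only when access is open."""
    dip = {"metadata": dict(aip["metadata"])}
    if aip["open_access"]:
        dip["octets"] = aip["octets"]
    return dip
```

For an object whose producer has closed access, `build_dip` returns the metadata only, which mirrors the rule that metadata stays available even when the binary content is restricted.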
The software and hardware systems are monitored for problems and can report malfunctions or the need to replace disks or other parts. To minimize risk, the hardware infrastructure is monitored via iLO (HP servers) and the software through VMware tools. The Library System uses Veeam software for the backup and restoration of server images, has enough servers to cope with malfunctions, and has secondary storage. The infrastructure can react quickly to malfunctions and setbacks. In case of catastrophic events, the Library System relies on ASIT (the University Data Centre) to bring its services back online in a short time. Adequate physical storage capacity on servers is guaranteed to support Phaidra's activities.
The AIP consists of metadata and data in formats suitable for long-term storage. All the data of the System is regularly backed up. All the servers are located in a monitored room to which physical access is limited to authorized staff. When the staff is not present, an alarm system and remote security monitoring are in operation. With regard to safety, the premises are regularly monitored by the University of Padua Security Unit.
Phaidra has an automated backup system. Each element (metadata and file) is saved with the MD5 for integrity verification. Phaidra has the tools to perform regular analyses and intervene if necessary.
As regards storage, we have advanced architectures for multiple replication of data and automatic backup in multiple locations at the University of Padua. Moreover, a copy of the repository is scheduled every night and can be restored within a few hours.
With regard to security, the network is protected by a firewall and access to the machines is restricted to the operators. As mentioned above, users with a verified account can upload digital objects to the platform. The Library System provides adequate hardware and software to transfer, manage and publish data in a safe and traceable way. The data managers keep clear, documented procedures in the internal wiki, so every storage operation can be performed by any member of the team with the same results.
Regarding the electrical components, all equipment is fitted with safety devices compliant with legal standards. The electrical system is divided by switches for each zone and type of use. In addition, there are two UPS units to ensure continuity. Concerning safety, the premises are regularly monitored by the University of Padua's Security Unit.
Public evidence
- The section below on the general strategies for digital preservation
- The internal wiki (https://wiki.cab.unipd.it/wiki-itcab/index.php/Servizi/Storage) describes the procedures. Below is a summary of the content.
Procedures for managing storage in the Library System
Data management in the Library System is mainly done by managing the virtual machines that contain them. The main operations are:
Creation of a VM
Through the VMware environment, data managers can create a new VM. The network and storage parts require the most attention. For the network, the machine must be inserted into the relevant network, making sure that it uses an appropriate address. For the storage part, the default profile must be selected at the storage selection stage, except for virtual machines with special tasks.
Moving of a VM
VMs are usually assigned to servers in the available pool. Using VMware tools, it is possible to move a machine from one management server to another, with minimal downtime, in case of necessity (planned or unplanned maintenance, for example).
Restoring a VM
Through Veeam, it is possible to restore a VM from dedicated backup storage. The restored machine can replace the existing one, for example after a disaster, or run alongside it, for partial recovery of data or to compare how the data has evolved over time.
Allocating an additional disk to a VM
An additional disk can be attached to a VM. This is recommended for machines holding a lot of data. Using the VMware tools, a new disk can be created on the machine (select the default policy unless otherwise indicated). Under the basic policy, the disk is backed up together with the VM.
Digital Preservation
Digital preservation involves the combination of policies, strategies and actions to ensure the authenticity of the content and long-term preservation, regardless of future technological changes. Digital preservation applies to both native digital and digitised content.
Strategic activities in support of digital preservation are aligned with the rules of the Library System, i.e. the preservation, updating and enjoyment of bibliographic and documentary heritage and access to scientific information through the development of the University Digital Library (Regulations of the University of Padua Library System, Heading I, Art. 1, Paragraphs 1 and 2).
Strategies and actions for digital preservation apply to the creation, integrity and maintenance of content.
The main actions pursued by the SBA for the long-term preservation of digital collections are as follows:
- Development and maintenance of digital archives for the long-term preservation of digital objects
- Management of different file and metadata formats
- Implementation of robust processes and automated mechanisms to ensure good management and preservation of content
- Continuous and reliable access to the content of digital objects for the designated community
Our long-term preservation strategy is based mainly on the standardization of input data.
The document Recommended file formats for long-term archiving and web dissemination in Phaidra gives an overview of the file formats to be used for long-term archiving and uploading to Phaidra. The document reads:
“There are no absolute criteria for choosing the file format. The choice is always dependent on different evaluations that the person who archives will have to make promptly, case by case, and will often result in a compromise between the best achievable quality and the limits imposed by the costs of production, processing and storage of files, as well as, for the previous, by the opportunity of a conversion to a new format.
This choice is particularly significant from the perspective of long-term archiving, for which a quality that respects the authenticity and integrity of the original document and a format that guarantees long-term access to data is desirable.
Some general criteria can be followed when choosing the most suitable format for archiving: Openness, Portability, Quality and Functionality, Development Support, Dissemination, Transparency, Self-documentation.”
For our designated community the priority is the preservation of the information content (images, books, videos, etc.). We have a single level of preservation, given the variety of content archived on the platform.
Important steps to guarantee the preservation of the platform are:
- Control over the entire data supply chain. All the necessary actions are taken with the producer to supply any missing information, paying close attention to the completeness of the data, both the information needed for long-term preservation (provenance, holding, rights, technical characteristics) and the availability of formats suitable for long-term preservation. For example, when the original is in a proprietary and undocumented file format, we keep it and also convert the images to TIFF.
- Control of data integrity. Every data modification is registered, using mostly the built-in mechanisms of Fedora. In particular, every metadata change is saved and is available to the repository. Checksums are applied to data and metadata, so we can monitor and intervene in case of accidental changes, software errors or incidents of another nature, through the recovery of data from the storage management.
- Readability of the data. Phaidra's objective is that the data is always readable and interpretable by the designated community. We interact with the designated community (refer to Brief description of the designated community) on the use of data and we keep up-to-date about the evolution of the text, image, audio and video formats.
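The data-integrity item above (every metadata change saved, with checksums to detect accidental changes) can be illustrated with a small sketch. It does not reproduce Fedora's built-in versioning: the `MetadataHistory` class and the `checksum` function are hypothetical, and MD5 over a canonical JSON serialization is an assumption chosen for the example.

```python
import hashlib
import json


def checksum(metadata):
    """MD5 of a canonical JSON serialization (sorted keys)."""
    canonical = json.dumps(metadata, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


class MetadataHistory:
    """Append-only list of metadata versions, each stored with its checksum."""

    def __init__(self):
        self.versions = []

    def save(self, metadata):
        """Record a new version and return its 1-based version number."""
        snapshot = json.loads(json.dumps(metadata))  # deep copy via JSON
        self.versions.append({"metadata": snapshot, "md5": checksum(snapshot)})
        return len(self.versions)

    def corrupted_versions(self):
        """Return version numbers whose content no longer matches its checksum."""
        return [
            number
            for number, version in enumerate(self.versions, start=1)
            if checksum(version["metadata"]) != version["md5"]
        ]
```

Any version reported by `corrupted_versions` would be a candidate for recovery from backup storage.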
Together with the producer, Phaidra considers legal, ethical and copyright issues, preserving this information so that the usability of the data will be known in the future. The usability is guaranteed by the preservation of the content and by the wealth of information for the study and comprehension of the designated community.
There is a clear agreement between the producer and Phaidra through the Terms of use. By accepting the Terms of Use, the producer agrees that the platform manages and disseminates the contents. Furthermore, the producer states that he/she owns the rights to deposit the object, the copyright and data confidentiality have been cleared, and he/she has taken into consideration the ethical issues, according to the ethical code of the University of Padua.
Phaidra provides a simple and secure ingesting of data (files and metadata), including the provenance and life cycle of the digital object.
Phaidra is committed to the long-term custody of the items deposited in the repository and strives to adopt the current best practices in digital preservation.
As explained in Maintenance and Development of the Technological Infrastructure (ITA), all conditions for continuity assurance are met. The files and metadata formats respect the characteristics of long-term preservation and satisfy our designated community. Below is a description of the Phaidra migration plan:
- Submission of the plan to stakeholders for approval
- Analysis of the functionalities of the target platform for migration
- Preliminary analysis of the file formats and of the level of preservation required for data and metadata. This phase is facilitated by the choices already made on formats and by the format analysis document, as well as by the general nature of the metadata format.
- Determining preservation actions based on format analysis
- Metadata analysis, vocabulary normalization and other preliminary operations
- Drafting of the necessary actions according to the required level of preservation, including a possible transformation of the formats, copying of the data and, if necessary, mapping of the metadata to new formats
- Definition of the test criteria to evaluate the success of the migration
- Definition and planning of any costs and verification of the availability of the planned resources
- Identification of responsibilities within the plan and definition of migration workflows
In the migration phase it will be necessary to:
- Prepare the cleaning of the data
- Carry out the migration tests. This phase is very important because it serves to evaluate and verify the planned actions and to make the necessary corrections
- Modify the plan where needed and update the documentation at the same time
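One possible test criterion for the migration tests is that every object present before the migration is present afterwards with an unchanged checksum. The sketch below expresses that check. The inventory format (a mapping from object identifier to checksum), the function names and the success rule are all assumptions made for illustration, not part of the documented plan.

```python
def compare_inventories(source, target):
    """Compare {identifier: checksum} inventories taken before and after migration."""
    missing = sorted(set(source) - set(target))
    unexpected = sorted(set(target) - set(source))
    changed = sorted(
        identifier
        for identifier in set(source) & set(target)
        if source[identifier] != target[identifier]
    )
    return {"missing": missing, "unexpected": unexpected, "changed": changed}


def migration_succeeded(report):
    """Assumed success criterion: nothing missing, unexpected or changed."""
    return not any(report.values())
```

Where the plan includes a deliberate format transformation, checksums of the converted files will differ by design, so those objects would need a separate criterion (for example, a comparison of the extracted content).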
A migration plan requires a high level of collaboration of all parties involved, from the stakeholders to all those who have to provide information (e.g., the level of preservation, verification of the success of the migration, and information on the destination platform).
To date, the responsibility for digital preservation is defined in Maintenance and development of the technological infrastructure (ITA) at the institutional level, and in the Security section (below) for the technical part.
The Terms of Use specify the level of responsibility defined by Phaidra towards users and the needs related to long-term preservation (see: "The University undertakes to preserve to the best of its ability the digital objects stored in Phaidra and to make them accessible and usable over time") and authorize Phaidra to manage data in a way appropriate to the purpose (see: "Authorized users can deposit digital objects in the platform making them available - according to the licenses issued - to third parties").
Infrastructure
The basic infrastructure of Phaidra is based on Fedora Commons 3.8.X, one of the most used open-source platforms to create digital repositories. Considering the designated community, Phaidra has developed a data model based on the LOM, Dublin Core and ICCD metadata schemas (the ICCD schemas are the Italian museum standards of the Central Institute for Cataloguing and Documentation of the Ministry of Cultural Heritage). Fedora follows the OAIS reference model. Fedora and Phaidra are both supported by developer communities. The public wiki of Phaidra documents the technical specifications of the platform (refer to GitHub).
The Phaidra metadata is mostly textual and descriptive. Spatial standards, such as Google KML, are implemented in part. JSON and XML are used as exchange formats between platforms and internal components. For low-level data security, Fedora's granular XACML permissions ensure that only designated roles and accounts can access the objects, and they also enforce authorization checks on modification, creation and deletion. Accounts are managed using Active Directory LDAP.
As regards connectivity, our provider is the University of Padua, which uses the GARR network, the Italian research network. The University can provide and restore connectivity in a short time. The devices are all equipped with UPS units to allow continuity of service in the event of a blackout.
As regards the software, we have developed a Java-based utility named Phaidra Importer for the bulk import of collections of images, PDF documents, videos and books. Phaidra's components run on well-known open-source platforms, such as Apache and Nginx, using Perl, Java, shell scripts and modern web technologies (HTML5, JavaScript frameworks, CSS3) for web frontends. We use Debian Linux as the operating system for our server appliances.
As regards the storage, hyperconverged storage by VMware is used. This technology aggregates the physical space of our current 6 servers and provides it as a virtual resource to all guests in the VMware system. In this way, the data is distributed across multiple servers and the system keeps working even if 2 servers fail. When a server fails, the data can be redistributed to the other servers to guard against the possible failure of a second server. The low-level storage technology is based on RAID 5 on SSDs.
The documentation is available on GitHub; there is information on Phaidra’s technical specifications.
Security
The Library System has computer staff who manage the information services of the Library System. These are nine computer scientists from the Information systems sector of the Digital Library Office. Two computer scientists from this sector work on Phaidra.
The Library System includes the security of its spaces among its aims. The University of Padua's security plan provides for a local security officer, who can assess the risks and prevent dangers. In any case, safety management is the responsibility of the University of Padua, which carries out periodic checks to prevent risks to the plants and risks of intrusion, manipulation or data theft. The technological infrastructure is housed in dedicated rooms to which only authorized and qualified people have access. In addition to local and remote backups at the University, we have a local tape backup system (LTP3). The tapes are kept in fireproof containers and locked in a supervised room. Technical support for ordinary and extraordinary operations is guaranteed by a special service of the University of Padua.
In summary, the infrastructures are protected by security plans from a physical point of view, as well as being monitored and accessible only by authorized staff. This ensures that the data is safe for long-term preservation and use.
From an IT point of view, the data is protected by firewalls that limit access to the infrastructure. Servers can only be accessed via the local network or VPN, and access is restricted to authorized operators using their own accounts. Experts assess the risks of cyber-attacks, and we have monitoring tools for the network and the servers.
In addition to the backups, there are internal procedures with instructions for the operators in case of data or infrastructure recovery, and for the regulation of access to systems by authorized staff.
With regard to recovery procedures, if there are any problems with our main IT infrastructure, we can rely on a secondary IT infrastructure and also on the Veeam system, which can put the image back online directly from the backup. We can also recover the image from the University infrastructure, where we already have some other guests running. In the event of a security incident, the issue is reported to ASIT (University Computer Center https://www.ict.unipd.it/), the designated office for IT security.