Data and Metadata Management

Welcome! If you are involved in the data or metadata standards aspects of establishing a Federated EGA node, you are in the right place. The information here covers topics including, but not limited to, best practices for data security, data access management, (meta)data standards, and data flow.

You might find this page useful if you are:

  • a bioinformatician
  • a data steward
  • a data protection officer
  • a support officer

By exploring these materials, you will be able to:

  • Define data security and access management best practices
  • Understand the EGA metadata standard and its minimum requirements
  • Identify data models specific to a domain (e.g. COVID-19) and apply them when needed
  • Comprehend the (meta)data flow within the Federated EGA Network

1. Learn about data security best practices

Central EGA has a Security Strategy that defines best practices for ensuring that data are stored securely. The EGA has helped develop the recommendations outlined in the GA4GH Data Security Infrastructure Policy, which defines guidelines, best practices, and standards for building and operating an infrastructure that promotes responsible data sharing in accordance with the GA4GH Privacy and Security Policy.

Summary of best practices recommended for Federated EGA nodes:

Item | Description | Examples/Templates
Breach response protocol | A protocol for addressing potential security breaches, including consideration of other FEGA nodes, Central EGA, key contacts, and institutional/organisational policies. | Coming soon!
Risk register | A risk management tool used to track identified risks including information such as the nature of the risk, reference and owner, and mitigation measures. | Coming soon!
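
Until an official template is available, a node could prototype a risk register as a small data structure. The following is a minimal sketch only, not an EGA template: the class names, field names, and the "R-001" example entry are hypothetical, and the fields simply mirror the information listed above (nature of the risk, reference, owner, and mitigation measures).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RiskEntry:
    """One identified risk (hypothetical structure, not an EGA template)."""
    reference: str                 # unique reference for the risk
    nature: str                    # short description of the nature of the risk
    owner: str                     # person or role responsible for the risk
    mitigation_measures: List[str] = field(default_factory=list)


@dataclass
class RiskRegister:
    """Collection of risk entries keyed by their reference."""
    entries: Dict[str, RiskEntry] = field(default_factory=dict)

    def add(self, entry: RiskEntry) -> None:
        # Refuse duplicate references so each risk stays uniquely traceable.
        if entry.reference in self.entries:
            raise ValueError(f"duplicate risk reference: {entry.reference}")
        self.entries[entry.reference] = entry

    def without_mitigation(self) -> List[RiskEntry]:
        # Risks that have been identified but have no mitigation recorded yet.
        return [e for e in self.entries.values() if not e.mitigation_measures]


# Illustrative values only.
register = RiskRegister()
register.add(RiskEntry(
    reference="R-001",
    nature="Loss of an encryption key for archived files",
    owner="Node security officer",
    mitigation_measures=["Keep an offline backup of the key"],
))
register.add(RiskEntry(
    reference="R-002",
    nature="Unauthorised access to the submission inbox",
    owner="Node technical lead",
))
```

Calling `register.without_mitigation()` then lists the risks that still need a mitigation plan, which is one simple way to review the register periodically.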

2. Explore implemented standards

Central EGA largely adheres to GA4GH standards. Specific standards already implemented are summarised below:

Standard | Purpose | Specification Version | Supported Version | Implementation | Publication/Preprint
Beacon | Supports discovery of genomic variants and individuals. | v1.0.1 | v0.3 | Specification, Documentation, Endpoint | N/A
Crypt4GH | Enables direct byte-level compatible random access to encrypted genetic data stored in community standards (e.g. CRAM, VCF). | v1.0 | v1.0 | Specification, Documentation, Endpoint | DOI
Data Use Ontology (DUO) | Allows users to semantically tag datasets with usage restrictions so datasets can be automatically discoverable based on a researcher’s authorization level or intended use. | 2021-02-23 | 2021-02-23 | Specification, Documentation, Endpoint | DOI
htsget | Enables secure, efficient, and reliable access to sequencing read and variation data including specific genomic regions. | v1.3.0 | v1.0.0 | Specification, Documentation, Endpoint | DOI
refget | Enables access to reference sequences using an identifier derived from the sequence itself. | v1.2.6 | N/A | Specification, Documentation, Endpoint | DOI
Researcher IDs (Passports and visas) | Specifies the collection of researchers that may access a dataset at any given time, and the credentials they must supply. | v1.0.1 | v1.0.1 | Specification, Documentation, Endpoint | DOI
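
To illustrate what "an identifier derived from the sequence itself" means for refget, the sketch below computes two sequence checksums with the Python standard library: an MD5 hex digest and a truncated SHA-512 digest (the first 24 bytes, base64url encoded). It is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, the normalisation (whitespace removed, uppercase) is assumed here, and the exact identifier forms a given server accepts depend on the refget version, so please check the specification before relying on it.

```python
import base64
import hashlib


def normalise(sequence: str) -> bytes:
    """Remove whitespace and uppercase the sequence (assumed normalisation)."""
    return "".join(sequence.split()).upper().encode("ascii")


def md5_checksum(sequence: str) -> str:
    """MD5 hex digest of the normalised sequence."""
    return hashlib.md5(normalise(sequence)).hexdigest()


def sha512_truncated(sequence: str) -> str:
    """First 24 bytes of the SHA-512 digest, base64url encoded (32 characters)."""
    digest = hashlib.sha512(normalise(sequence)).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


if __name__ == "__main__":
    example = "ACGT"  # toy sequence, for illustration only
    print(md5_checksum(example))
    print(sha512_truncated(example))
```

Because the checksum depends only on the sequence content, two nodes holding the same reference sequence derive the same identifier without coordinating on names.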

Data file standards

Recommended file formats for:

  • Sequencing data (unaligned or aligned reads): CRAM, BAM
  • Variant data: VCF
  • Phenotype/clinical data: Phenopackets

Metadata standards

The following resources represent EGA and community guidelines for submitted metadata:

Quality control

Coming soon!

3. Understand data definitions and flow

As defined in the Federated EGA Collaboration Agreement, and in alignment with definitions outlined in the GDPR, the following data definitions are used in the context of Federated EGA:

  • Administrative Data. Data which are generated through the operation of Federated EGA. This may include personal data according to Art. 4 Nr. 1 GDPR which is directly identifying, such as names and email addresses which are used to communicate with, and support, service users. It may also include personal data and business data which are used internally by staff working on behalf of EGA Central or Federated EGA Nodes and exchanged between them.
  • Non-personal Metadata. Information that describes or annotates research data to facilitate its interpretation or to describe the relationship between data elements that cannot be used to identify a data subject. For example, the name of the instrument used to generate the data. Non-personal metadata generated or processed by the Node will be shared with EGA Central for inclusion in a searchable, online, public metadata catalogue.
  • Personal Metadata. Information that describes or annotates research data to facilitate its interpretation or to describe the relationship between data elements that has the potential to identify a data subject. For example, demographic or ancestry information that can be used to identify individuals. Personal metadata are not made available through a public metadata catalogue and are not shared between EGA Central and the Node.
  • Research Data. Omics or other forms of genetic (according to Art. 4 Nr. 13 GDPR) and health data (according to Art. 4 Nr. 15 GDPR) that are used for scientific research purposes. This is considered to be special category personal data under Art. 9(1) in conjunction with Art. 4 Nr. 1 GDPR.
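
The sharing rules in these definitions can be sketched as a simple filter. This is a hedged illustration only: the enum, the function names, and the sample records are hypothetical. It encodes the one rule stated above, that non-personal metadata is included in the public metadata catalogue, and it assumes a default-deny stance for every other category.

```python
from enum import Enum
from typing import Iterable, List, Tuple


class DataCategory(Enum):
    """The four data categories defined above."""
    ADMINISTRATIVE = "Administrative Data"
    NON_PERSONAL_METADATA = "Non-personal Metadata"
    PERSONAL_METADATA = "Personal Metadata"
    RESEARCH = "Research Data"


def may_enter_public_catalogue(category: DataCategory) -> bool:
    # Only non-personal metadata is described as going into the public
    # catalogue; excluding everything else is an assumption (default deny).
    return category is DataCategory.NON_PERSONAL_METADATA


def select_for_catalogue(records: Iterable[Tuple[DataCategory, str]]) -> List[str]:
    """Keep only the payloads whose category may be published."""
    return [payload for category, payload in records
            if may_enter_public_catalogue(category)]


# Illustrative records only.
records = [
    (DataCategory.NON_PERSONAL_METADATA, "instrument name"),
    (DataCategory.PERSONAL_METADATA, "ancestry information"),
    (DataCategory.ADMINISTRATIVE, "submitter email address"),
    (DataCategory.RESEARCH, "aligned reads file"),
]
publishable = select_for_catalogue(records)
```

With these sample records, only the instrument name is selected for the catalogue.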

Here you can view how the different types of data flow within the Federated EGA network.

4. What’s next?

Coming soon!