Data and Metadata Management
Welcome! If you are involved in the data or metadata standards aspects of establishing a Federated EGA node, you are in the right place. The information here covers topics including, but not limited to, best practices for data security, data access management, (meta)data standards, and data flow.
You might find this page useful if you are:
- a bioinformatician
- a data steward
- a data protection officer
- a support officer
By exploring these materials, you will be able to:
- Define data security and access management best practices
- Understand the EGA metadata standard and its minimal requirements
- Identify data models specific to a domain (e.g. COVID-19) and apply them when needed
- Comprehend the (meta)data flow within the Federated EGA Network
1. Learn about data security best practices
Central EGA has a Security Strategy which defines best practices for ensuring data are stored securely. The EGA has helped develop the recommendations outlined in the GA4GH Data Security Infrastructure Policy, which defines guidelines, best practices, and standards for building and operating an infrastructure that promotes responsible data sharing in accordance with the GA4GH Privacy and Security Policy.
Summary of best practices recommended for Federated EGA nodes:
Item | Description | Examples/Templates |
---|---|---|
Breach response protocol | A protocol for addressing potential security breaches, including consideration of other FEGA nodes, Central EGA, key contacts, and institutional/organisational policies. | Coming soon! |
Risk register | A risk management tool used to track identified risks including information such as the nature of the risk, reference and owner, and mitigation measures. | Coming soon! |
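Until the official templates are published, the risk register described in the table can be pictured as a small data structure. The following is a minimal sketch only, not the FEGA template: the class names, field names, and the example entry are all hypothetical, chosen to mirror the attributes listed above (nature of the risk, reference, owner, mitigation measures).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Risk:
    """One row of a risk register (field names are illustrative assumptions)."""
    reference: str                      # unique identifier for the risk
    nature: str                         # short description of the risk
    owner: str                          # person or role accountable for it
    mitigations: List[str] = field(default_factory=list)


@dataclass
class RiskRegister:
    """An in-memory register that keeps references unique."""
    risks: List[Risk] = field(default_factory=list)

    def add(self, risk: Risk) -> None:
        # Refuse duplicates so each reference points at exactly one risk.
        if any(existing.reference == risk.reference for existing in self.risks):
            raise ValueError(f"duplicate risk reference: {risk.reference}")
        self.risks.append(risk)

    def unmitigated(self) -> List[Risk]:
        # Risks with no recorded mitigation measures need attention first.
        return [risk for risk in self.risks if not risk.mitigations]


# Hypothetical example entry, for illustration only.
register = RiskRegister()
register.add(Risk(
    reference="R-001",
    nature="Loss of an encryption key for archived files",
    owner="Node security officer",
    mitigations=["Store a sealed backup of the key offline"],
))
```

A node would normally keep such a register in a shared, access-controlled document or tracker; the sketch only shows which pieces of information belong together.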
2. Explore implemented standards
Central EGA largely adheres to GA4GH standards. The standards already implemented are summarised below:
Standard | Purpose | Specification Version | Supported Version | Implementation | Publication/Preprint |
---|---|---|---|---|---|
Beacon | Supports discovery of genomic variants and individuals. | v1.0.1 | v0.3 | Specification, Documentation, Endpoint | N/A |
Crypt4GH | Enables direct byte-level compatible random access to encrypted genetic data stored in community standards (e.g. CRAM, VCF). | v1.0 | v1.0 | Specification, Documentation, Endpoint | DOI |
Data Use Ontology (DUO) | Allows users to semantically tag datasets with usage restrictions so datasets can be automatically discoverable based on a researcher’s authorization level or intended use. | 2021-02-23 | 2021-02-23 | Specification, Documentation, Endpoint | DOI |
htsget | Enables secure, efficient, and reliable access to sequencing read and variation data including specific genomic regions. | v1.3.0 | v1.0.0 | Specification, Documentation, Endpoint | DOI |
refget | Enables access to reference sequences using an identifier derived from the sequence itself. | v1.2.6 | N/A | Specification, Documentation, Endpoint | DOI |
Researcher IDs (Passports and visas) | Specifies the collection of researchers that may access a dataset at any given time, and the credentials they must supply. | v1.0.1 | v1.0.1 | Specification, Documentation, Endpoint | DOI |
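To make the refget row more concrete, here is a minimal sketch of what "an identifier derived from the sequence itself" can look like. It assumes the two checksum schemes described in the refget v1 specification (MD5, and TRUNC512, the first 24 bytes of the SHA-512 digest, hex-encoded) computed over an uppercased sequence with whitespace removed. The function names are illustrative, and this is not a description of how any EGA service computes identifiers.

```python
import hashlib


def normalise(sequence: str) -> bytes:
    """Remove all whitespace and uppercase the sequence before hashing."""
    return "".join(sequence.split()).upper().encode("ascii")


def md5_id(sequence: str) -> str:
    """MD5 checksum of the normalised sequence, as a hex string."""
    return hashlib.md5(normalise(sequence)).hexdigest()


def trunc512_id(sequence: str) -> str:
    """First 24 bytes of the SHA-512 digest, hex-encoded (48 characters)."""
    digest = hashlib.sha512(normalise(sequence)).digest()
    return digest[:24].hex()


if __name__ == "__main__":
    seq = "acgt\nACGT"
    print(md5_id(seq))
    print(trunc512_id(seq))
```

Because the identifier depends only on the sequence content, two parties holding the same reference sequence arrive at the same identifier without coordinating names.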
Data file standards
Recommended file formats for:
- Sequencing data (unaligned or aligned reads): CRAM, BAM
- Variant data: VCF
- Phenotype/clinical data: Phenopackets
Metadata standards
The following resources represent EGA and community guidelines for submitted metadata:
- EGA metadata model
- General FEGA standards
  - Introduction to metadata video produced by CSC
  - Introduction to phenotypic data video produced by CSC
- Community-specific FEGA standards
  - COVID-19 metadata mapping model across COVID-19 studies in Federated EGA (ELIXIR-CONVERGE)
  - COVID-19 Host Genetics Initiative data dictionary
  - Phenotypic metadata: the COVID-19 example video produced by CSC
  - Building a metadata model for COVID-19 video produced by CSC
- Recommended ontologies to search for concepts and terms using the Ontology Look-up Service (OLS)
  - Experimental Factor Ontology (EFO): ontology record at OLS; EFO GitHub repository.
  - Data Use Ontology (DUO): ontology record at OLS; DUO GitHub repository.
Quality control
Coming soon!
3. Understand data definitions and flow
As defined in the Federated EGA Collaboration Agreement, and in alignment with definitions outlined in the GDPR, the following data definitions are used in the context of Federated EGA:
- Administrative Data. Data which are generated through the operation of Federated EGA. This may include personal data according to Art. 4 Nr. 1 GDPR which is directly identifying, such as names and email addresses which are used to communicate with, and support, service users. It may also include personal data and business data which are used internally by staff working on behalf of EGA Central or Federated EGA Nodes and exchanged between them.
- Non-personal Metadata. Information that describes or annotates research data to facilitate its interpretation or to describe the relationship between data elements that cannot be used to identify a data subject. For example, the name of the instrument used to generate the data. Non-personal metadata generated or processed by the Node will be shared with EGA Central for inclusion in a searchable, online, public metadata catalogue.
- Personal Metadata. Information that describes or annotates research data to facilitate its interpretation or to describe the relationship between data elements that has the potential to identify a data subject. For example, demographic or ancestry information that can be used to identify individuals. Personal metadata are not made available through a public metadata catalogue and are not shared between EGA Central and the Node.
- Research Data. Omics or other forms of genetic (according to Art. 4 Nr. 13 GDPR) and health data (according to Art. 4 Nr. 15 GDPR) that are used for scientific research purposes. This is considered to be special category personal data under Art. 9(1) in conjunction with Art. 4 Nr. 1 GDPR.
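The distinction between non-personal and personal metadata determines what a node may send to EGA Central. As a minimal sketch, assuming the node has already decided which fields of a metadata record count as personal, the split could look like the code below. The function name and the field names are hypothetical; the two example fields simply echo the examples given in the definitions (instrument name, ancestry).

```python
from typing import Any, Dict, Iterable, Tuple


def split_metadata(
    record: Dict[str, Any],
    personal_fields: Iterable[str],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split a metadata record into (non_personal, personal) parts.

    Only the non-personal part is a candidate for sharing with EGA Central;
    the personal part stays at the node.
    """
    personal_keys = set(personal_fields)
    non_personal = {k: v for k, v in record.items() if k not in personal_keys}
    personal = {k: v for k, v in record.items() if k in personal_keys}
    return non_personal, personal


# Hypothetical record and classification, for illustration only.
record = {
    "instrument_model": "Example Sequencer 1000",
    "ancestry": "example value",
}
shareable, kept_at_node = split_metadata(record, personal_fields={"ancestry"})
```

Classifying a field as personal or non-personal is a governance decision that depends on context; the code only applies a decision that has already been made.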
Here you can view how the different types of data flow within the Federated EGA network.
4. What’s next?
Coming soon!