Reference Data Archive (RDA) - User Manual
1 General information about the RDA
The Reference Data Archive (RDA) is a publicly accessible dataset archive and analytics infrastructure. It was originally intended as a data resource to assist in developing and validating tools for cause of death ascertainment using verbal autopsy data augmented by pathology information from minimally invasive tissue samples (MITS): deaths with a verbal autopsy, MITS, and a trustworthy reference cause. When the RDA launches it will contain reference deaths of this type. Over time the RDA will grow to be a more general-purpose mortality and cause of death data resource, and perhaps more.
1.1 What exactly is the RDA?
The RDA is computational and IT infrastructure that allows or provides:
- Data hosting and curating services
- A secure data analysis environment
- A searchable, documented data repository
- Tools for data sharing
- Clear acknowledgement of data providers and analysts who prepared data, through explicit attribution in the data repository and permanent digital object identifiers (DOIs) to locate and cite the data
1.2 Who made and operates the RDA
- The WHO’s Data, Digital Health, Analytics and AI Department (DDA) has overall responsibility for the RDA.
- The RDA Development Team is a collaborative consortium including the openVA Team, the Africa Health Research Institute (AHRI), and the WHO - see Acknowledgments below. Samuel Clark at The Ohio State University is responsible for the original RDA concept and has led the RDA Development Team.
- The RDA is operated jointly by the DDA Department at WHO and the openVA Team.
1.3 Where does the RDA store data
The RDA computational infrastructure is hosted on-premise at WHO headquarters in Geneva, Switzerland. The WHO provides security, IT infrastructure, and ongoing maintenance. Data do not leave WHO IT resources and are never stored in the ‘cloud’.
1.4 Who owns and controls data in the RDA
Ownership and control of data in the RDA remain with the original data providers, as specified in the RDA Participation Agreement.
1.5 How does the RDA attribute data providers and/or analysts
All data in the RDA are required to have extensive metadata describing how the data came to be and who was responsible. When data are added to the RDA, a new data document describing the data is created in the data repository. That document contains all of the metadata and a digital object identifier (DOI) pointing to itself. This ensures that the data can always be located and cited, so that the data collectors and/or analysts responsible for creating the data are properly acknowledged, both directly and through data users’ ability to cite the data.
1.6 How to cite the RDA
1.6.1 Citation for the RDA itself
Template for citing the RDA:
Samuel J. Clark, Doris Ma Fat, Kobus Herbst, Yue Chu, Brendan Gilbert, Philippe Boucher, Jason Thomas, Dan-George Vasilache, David Plotner, Norman Goco, Mona Sharan, Joven Larin. “WHO Reference Data Repository”. 2026. https://data.who.int/rda. Accessed [YYYY-MM-DD].
Substitute the correct values between brackets and omit the brackets themselves. ‘YYYY’ is a four-digit year, ‘MM’ is a two-digit month, and ‘DD’ is a two-digit day.
1.6.2 RDA dataset citation
Each dataset in the RDA has a data documentation initiative (DDI) element containing the citation to the dataset and the permanent DOI for the dataset. All dataset citations should follow this template:
[first name 1] [last name 1], [first name N] [last name N]. [YYYY]. “[dataset title]”, WHO Reference Data Repository, V[v]. DOI: https://dx.doi.org/[doi]. Accessed [YYYY-MM-DD].
Substitute the correct values between brackets and omit the brackets themselves. ‘YYYY’ is a four-digit year, ‘MM’ is a two-digit month, ‘DD’ is a two-digit day, ‘v’ is the dataset version number, and ‘doi’ is the full digital object identifier (DOI) prefix/suffix.
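As an illustration, filling in the template can be automated. This Python sketch uses entirely hypothetical author, title, version, and DOI values:

```python
# Sketch: fill in the RDA dataset citation template (section 1.6.2).
# All field values below are hypothetical placeholders, not a real RDA dataset.
def format_rda_citation(authors, year, title, version, doi, accessed):
    """Build a citation string following the RDA dataset citation template."""
    author_list = ", ".join(authors)
    return (f'{author_list}. {year}. "{title}", WHO Reference Data Repository, '
            f'V{version}. DOI: https://dx.doi.org/{doi}. Accessed {accessed}.')

citation = format_rda_citation(
    authors=["Jane Doe", "John Smith"],
    year=2026,
    title="Example aggregate dataset",
    version=1,
    doi="10.0000/example-doi",   # placeholder DOI prefix/suffix
    accessed="2026-01-15",
)
print(citation)
```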
1.7 How to join the RDA
To join the RDA, indicate your interest by sending an email to assistance-RDA@who.int.
The process involves the following steps:
Agree to and sign the WHO RDA Participation Agreement. This is a data use agreement that specifies the responsibilities of both WHO and you and records both parties’ agreement to those terms. As part of this you will need to:
- decide if you want to share individual-level data
- if sharing individual-level data, decide if a data use agreement is required for those who want to use the individual-level data
- if needed, supply the requirements for the individual-level data use agreement
Work with the RDA Team to create and save a ‘data ingestion’ package for your data. This involves:
- defining the data, metadata, and paradata that will be included
- evaluating the data for general structure, volume, etc.
- defining data quality and consistency checks
- defining data recoding, reshaping, etc. necessary to ingest your data into the RDA
- fixing any data issues identified during the process
The ingestion package will be saved so that it can be rerun in an automated fashion for future data deposits.
Ingest the first batch of data.
Decide on a frequency and process to submit regular updates to your data.
2 Overview and core concepts
2.1 Overview of RDA infrastructure
The secure RDA infrastructure comprises two components:
- RDA Data Repository: A publicly accessible, browsable, searchable repository for all metadata describing the raw reference deaths, and for all data products and aggregate datasets derived from the raw reference deaths.
- RDA Analytics Hub: A secure, vetted-access trusted researcher environment that contains the raw reference deaths, allows researchers to manipulate, analyze, and produce new aggregate (not individual level) datasets from the reference deaths, and provides tools to create fully documented datasets that can be shared on the RDA Data Repository.
2.2 Available data sources at launch
A reference death minimally contains a complete standard verbal autopsy (VA) and a reference cause established without using an automated VA cause-coding algorithm. A reference death may additionally contain information from minimally invasive tissue samples (MITS), medical records, or other sources that may help to identify the cause of death.
- Demographics: basic demographics for all deaths registered in the study
- VA: verbal autopsy data reporting circumstances leading to the death
- Reference cause: reference cause established without using an automated VA cause-coding algorithm, usually a physician-coded cause of death
- MITS: results from minimally invasive tissue sampling, a postmortem diagnostic technique in which a series of biopsies collects tissue samples from key organs (e.g., lungs, liver, brain), usually conducted when a full autopsy is not practical.
RDA currently stores reference deaths along with their supplementary metadata documents (e.g. protocols, ethical documents, and survey instruments) from one study:
- Mortality surveillance system in the city of São Paulo in Brazil (Brazil SVOC)
- PHMRC short form VA
- Reference cause (autopsy)
Shortly after launch, the following studies are likely to join:
- Mortality surveillance system in the city of São Paulo in Brazil (Brazil SVOC)
- Demographics
- WHO 2022 VA
- Reference cause (autopsy)
- MITS pathology results
- Child Health and Mortality Prevention Surveillance (CHAMPS)
- Demographics
- VA
- Reference cause (DeCoDe)
- MITS pathology results
- Countrywide Mortality Surveillance for Action-Mozambique (COMSA-Mozambique)
- VA
- Healthy Sierra Leone (HEAL-SL), formerly the COMSA-Sierra Leone study
- VA
- Reference cause
Please note that the RDA only provides access to de-identified data, which have been cleaned, harmonized, and transformed for easier user access. Individual-level data are restricted to the trusted research environment within the RDA Analytics Hub and cannot be taken outside of the Hub. To access identifiable data, please visit the original study sites for each study: CHAMPS, COMSA-Mozambique, HEAL-SL. The data from Brazil are only available in de-identified form.
3 RDA data repository
3.1 Access to RDA data repository
The RDA Data Repository can be accessed here. Anyone can visit the repository, browse or search the metadata describing the source studies contributing to RDA raw data, as well as the data products verified and published as part of the repository.
3.2 Download materials and datasets
A user can view downloadable materials in the DOWNLOADS tab of each data product. Some materials, such as PDFs of protocols or survey instruments, can be directly downloaded without registration or request if permitted by source studies.
To download a dataset and its associated code used to generate the dataset, users must first log in via ORCID authentication, and submit an Application for Access to a Licensed Dataset in the “GET MICRODATA” tab. If you do not have an ORCID, please register at https://orcid.org.
After logging in, users can review and agree to the data use agreement, fill in the application form and request access to the dataset directly on the catalog page of the desired data product. The application will be reviewed by an RDA administrator to confirm that the user is a scientist, researcher, or analyst with an appropriate intended use of the data. Once approved, the user will receive an email with a download link for the dataset, along with the appropriate citations for the data product and RDA.
4 RDA Analytics Hub
4.1 Access to RDA Analytics Hub
The RDA Analytics Hub can be accessed here.
To gain access and analyze raw reference deaths, users must fill out an application form and provide their ORCID, to be whitelisted to the server. If you do not have an ORCID, you can register at https://orcid.org. The application will be reviewed by an RDA administrator to confirm that the user is a scientist, researcher, or analyst with a valid analysis plan. Upon approval, the user can sign in to the RDA Analytics Hub using her/his ORCID credentials.
4.2 Launcher page
After logging in to the RDA Analytics Hub server, the user will first see the main launcher page as below.
On the left of the page, users will see two folders available upon log in:
- /data: a read-only folder containing the RDA.SQLite database
- /examples: a read-only folder containing example notebooks for demo purposes
Each user’s instance on the RDA Analytics Hub is private, meaning that any files or folders created outside the designated shared folders (/data, /examples) are visible and accessible only to the user. All data, files, and installed packages will remain available upon the user’s next log in.
RDA supports multiple coding languages, including R, Python, and Julia, which can be accessed through a notebook, console, or a script. Users can create new notebooks/scripts or start console sessions by clicking the corresponding icon on the right side of the launcher page.
4.3 RDA API packages
The RDA team developed packages in Julia, Python, and R providing basic functions for users to view and load data from the RDA. These packages are pre-installed in the Analytics Hub and available to all users.
List of functions available in the API packages
- rda_sources(): returns a data frame with the ID and name of the available data sources in the RDA
- rda_countries(): returns a data frame with the country name, country ISO3 code, and the corresponding data source name and ID
  - arguments: source_name & source_id (all are strings) filter the results to only return countries for a particular source
- rda_sites(): returns a data frame with the sites (i.e., geographic locations) with data in the RDA, along with the corresponding country and source information
  - arguments: source_name, source_id, country_name, & country_iso3 (all are strings) filter the results by source and/or country
- rda_deaths(): returns a data frame containing the death records for each data source in the RDA
  - arguments: source_name, source_id, site_name, site_id, country_name, & country_iso3 (all are strings) filter the results by source, site, and/or country
- rda_datasets(): returns a data frame with basic metadata for the available data sets in the RDA, which includes the name, ID, description, and the corresponding unit of analysis
  - arguments: doi & repo_id (both are logical, i.e., TRUE/FALSE) supplement the data frame with the DOI and repository ID from NADA
- rda_data_dict(): returns a data frame with the data dictionary for a given data set identified by the dataset_id argument (numeric) as given by the rda_datasets() output
- rda_data(): returns a data frame with the actual data from a particular data set identified with either the dataset_id argument (numeric) or the dataset_name argument (string)
- rda_tables(): returns a data frame containing the available tables (with a description) in the RDA
  - arguments: fields (logical/boolean) supplements the function output with the field names contained in each RDA table
Demo scripts using the available functions can also be found in the section below.
# Demonstration of the R Package: rRDA
# Load package -- (this will take about 10 seconds)
library(rRDA)
# List Data Sources
rda_sources()
# List the Countries with Data in the RDA
rda_countries()
rda_countries(source_name = "CHAMPS")
rda_countries(source_id = 2)
# List Sites (geographic locations)
rda_sites()
rda_sites(source_name = "CHAMPS")
# List Deaths
rda_deaths()
rda_deaths(source_name = "CHAMPS")
# Data Sets Available in the RDA
rda_datasets(doi = TRUE, repo_id = TRUE)
# Data Dictionaries for RDA Data Sets
rda_data_dict(1)
# Extract Data
rda_data(1)
# List available Tables and Descriptions
rda_tables()
# Demonstration of the Python Package: pyRDA
import pyRDA as rda
# List Data Sources
rda.sources()
# List the Countries with Data in the RDA
rda.countries()
rda.countries(source_name = "CHAMPS")
rda.countries(source_id = 2)
# List Sites (geographic locations)
rda.sites()
rda.sites(source_name = "CHAMPS").head()
# List Deaths
rda.deaths()
rda.deaths(source_name = "CHAMPS").head()
# Data Sets Available in the RDA
rda.datasets(doi = True, repo_id = True)
# Data Dictionaries for RDA Data Sets
rda.data_dict(1)
# Extract Data
rda.data(1).head()
# List available Tables and Descriptions
rda.tables()
# Demonstration of the Julia Package: juRDA
# Load package
using juRDA
# Load RDA SQLite database
db = load_rda()
# List Data Sources
rda_sources()
# List the Countries with Data in the RDA
rda_countries()
rda_countries(source_name = "CHAMPS")
rda_countries(source_id = 2)
# List Sites (geographic locations)
rda_sites()
rda_sites(source_name = "CHAMPS")
# List Deaths
rda_deaths()
rda_deaths(source_name = "CHAMPS")
# Data Sets Available in the RDA
rda_datasets()
rda_datasets(source_name = "CHAMPS")
# Data Dictionaries for RDA Data Sets
dict = rda_data_dict(dataset_id = 1)
dict = rda_data_dict(dataset_name = "CHAMPS_deid_decode_results")
# Extract Data
data=rda_data(dataset_name = "CHAMPS_deid_decode_results")
# List available Tables and Descriptions
rda_tables()
# List RDA schema
rda_schema() #show tables
rda_schema(fields=true) # show fields in each table
For more information on the R, Python, and Julia packages, please visit the RDA GitHub repositories (respectively): rRDA, pyRDA, and juRDA. If you have any issues or suggestions for the user API, please report them on the RDA GitHub repository, or reach out to the admin team.
4.4 Advanced data queries
RDA stores and manages all data in a relational SQLite database. In addition to the user API packages, users can directly run SQL queries with the SQLite database for advanced data exploration, transformation, and engineering tasks.
library(RSQLite)
# Connect to SQLite database
db_path <- "/srv/data/RDA.sqlite"
con <- dbConnect(SQLite(), db_path)
# Some SQL queries, e.g. list tables
tables <- dbListTables(con)
print(tables)
# Close connection
dbDisconnect(con)
import sqlite3
# Connect to SQLite database
db_path = '/srv/data/RDA.sqlite'
conn = sqlite3.connect(db_path)
# Some SQL queries, e.g. list tables
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()
print(tables)
# Close connection
conn.close()
using SQLite, DBInterface, DataFrames
# Connect to SQLite database
db_path = "/srv/data/RDA.sqlite"
db = SQLite.DB(db_path)
# Some SQL queries, e.g. list tables
tables = SQLite.tables(db)
println(tables)
# Some SQL queries, e.g. get ingestions for selected source
source_id = 1
sql = """
SELECT
di.data_ingestion_id, di.date_received, di.description
FROM data_ingestions di
JOIN deaths d ON di.data_ingestion_id = d.data_ingestion_id
WHERE source_id = @source_id
ORDER BY di.date_received DESC;
"""
ingestions = DBInterface.execute(db, sql, (source_id = source_id,)) |> DataFrame
# Close connection
close(db)
For additional information about the data in RDA.SQLite, please refer to the RDA.SQLite section below.
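The parameterized ingestion query shown for Julia can also be sketched with Python's sqlite3 module. The in-memory database and its contents below are illustrative stand-ins for /srv/data/RDA.sqlite:

```python
import sqlite3

# Sketch: the parameterized ingestion query in Python, against a tiny
# in-memory stand-in for /srv/data/RDA.sqlite (table contents are illustrative).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE data_ingestions (data_ingestion_id INTEGER, date_received TEXT, description TEXT);
CREATE TABLE deaths (data_ingestion_id INTEGER, source_id INTEGER);
INSERT INTO data_ingestions VALUES (1, '2025-01-10', 'first batch'), (2, '2025-06-01', 'update');
INSERT INTO deaths VALUES (1, 1), (2, 1), (2, 2);
""")

# Named placeholders keep the query safe against injection and easy to reuse.
sql = """
SELECT DISTINCT di.data_ingestion_id, di.date_received, di.description
FROM data_ingestions di
JOIN deaths d ON di.data_ingestion_id = d.data_ingestion_id
WHERE d.source_id = :source_id
ORDER BY di.date_received DESC;
"""
rows = cur.execute(sql, {"source_id": 1}).fetchall()
print(rows)  # both ingestions contain deaths for source 1
conn.close()
```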
5 Export and release user data product
5.1 Export user data product
Direct downloading of data or files is disabled in the RDA Analytics Hub to ensure data security and compliance with access policies. To export a finalized data product outside the RDA Analytics Hub, the user needs to submit the required information and files through the Data Submission Form, accessible via the graphical user interface (GUI) on the RDA Analytics Hub launcher page.
Only aggregate-level data products may be exported outside the RDA Analytics Hub. Public release of individual-level data is not permitted unless explicitly approved by the RDA data producer for release through the RDA Data Repository.
In addition to the data product(s), the requester must also provide the following information:
- the data dictionaries labeling all variables in the data products;
- the complete codes generating the data products for reproducibility;
- the metadata briefly describing the manipulation process, the producer, and other necessary information.
The RDA admin team will review the user submission and make sure that the data products comply with WHO/RDA cybersecurity regulations and data policies. More information on GUI access and required files can be found here. More information on the admin review process can be found here.
5.2 Data submission via user GUI
When the data products and associated notebook are finalized and ready for RDA admin review and public release, the user can submit her/his request by completing the “Request for public release” form via the user GUI. The user GUI for data submission is accessible on the launcher page of the RDA Analytics Hub.
After clicking the “Data Submission Form” icon, the GUI will open in a new tab containing section labels, widgets for entering data (e.g., text boxes, dropdown menus, calendars for selecting dates, etc.), and buttons for entering and processing data submissions. When users need to revise a previous submission or update a published data set, the top-right section of the form (shown in the following figure) provides access to the previously submitted metadata. Users can select the task of interest (i.e., revision or update) and the “Select project” dropdown menu will populate with the names of projects whose metadata have been submitted and processed. After identifying the project of interest, clicking the “Load project” button will populate the form with the previously submitted metadata.
The following information is required for admin review and final release on RDA Data Repository:
- Metadata of the final data products for RDA Data Repository catalog page.
- For details about metadata input supported in RDA Data Repository, please refer to Metadata for RDA Data Repository.
- Notebook with all necessary codes and appropriate comments generating the data product(s);
- include dependencies for running the notebook (if any)
- Data products and corresponding data dictionaries;
- Data dictionary: data dictionary describing the variables in the data product
- If you have multiple data products produced in a single notebook, please provide the above for each data product respectively
The labels of required fields also have red text to help users identify the necessary inputs. Requests can be submitted by clicking the “SUBMIT” button located at the bottom of the form. If a required input is missing upon submission, the GUI will print a message indicating which field needs to be filled in (as depicted in the following figure).
Every submission must include a data file, a data dictionary, and a Jupyter notebook containing the code that produces the data file and dictionary. This group of files, referred to as a “data bundle,” can be added using buttons found in the “Data files & data dictionaries” section of the GUI. The “Select” buttons open a file browser allowing the user to choose the appropriate file. To replace a data file or data dictionary, simply select a new file with the corresponding “Select” button. Multiple Jupyter notebooks can be added to a single data bundle, and additional widgets allow users to select and remove a notebook in a particular data bundle. If the user would like to add more than one data file, then they must add another data bundle by clicking on the “Add data bundle” button (as shown in the following figure). Again, each data bundle should include a data file, data dictionary, and at least one Jupyter notebook.
Users can save the form in progress for later edits with the “Save form” button. All user inputs will be saved as a JSON file in the user’s submission directory (/srv/submissions/user_id/saved_submissions). The user can later load the JSON file with the “Load form” button, to continue with editing.
If the form and files are successfully submitted, the user receives a confirmation in the email provided in the submission form.
5.3 Admin review
Once the Data Submission Form has been successfully submitted, the RDA admin team will review the data product, metadata and supporting documents to ensure that they comply with the WHO/RDA data and cybersecurity policies.
Specifically, the RDA admin team will check:
- Unit of analysis: by default, only aggregated data or modeled estimates are available for public release. Access to individual-level data products outside the Analytics Hub is restricted and may only be granted with explicit permission from the original source studies.
- Reproducibility: the notebook can run and produce desired data products.
- Methodology: the methodology is appropriately described in the data processing notes provided in the metadata.
- Intended use: the analysis loosely matches the intended use of the data stated in user’s request for access.
If it is necessary for the user to revise a submission (as communicated by the administrator reviewing the submission), the Data Submission GUI can be used to select the project in need of revision and populate the form with the previously submitted metadata.
5.4 Public release of user data product(s)
Once the data submission is approved, the RDA admin team will notify the requester via email, providing:
- the URL for the approved data product(s) on the RDA Data Repository for external downloads;
- the DOI of the dataset with appropriate citation for future reference.
Upon approval, an entry for the data product(s) will be created in the RDA Data Repository for public access. On the page, the following information will be publicly released based on user submission:
- metadata describing the data product(s) and methodologies;
- an official citation with an assigned DOI for future reference;
- a unique identifier of the page and DDI document in RDA Data Repository;
- data dictionaries of the data product(s);
- supplementary documents if available, such as protocols and instruments;
- licensed access to download data product(s) and notebook upon request.
The repository ID, DOI, DDI metadata, and notebooks will also be stored in the RDA SQLite database for future reference.
5.5 Example notebooks
Example notebooks have been provided for user reference in the /examples folder on the RDA Analytics Hub. Demo code is available in R, Python, and Julia, covering basic functions such as loading data, processing data, and saving a data product as a .csv file.
6 RDA.SQLite
6.1 Data tables in RDA
RDA stores and manages all data in a relational SQLite database that supports SQL queries. The latest data model, illustrating its design and primary keys establishing relationships between datatables, can be found here. A full list of all fields within each data table in RDA can be found here.
List of data tables in RDA
- data: long-format table storing values for data as identified by row id and variable id
- data_ingestions: documenting ingestion of the data, with brief description, received date of data
- datarows: row ids for each dataset
- dataset_variables: ids for variables in each dataset
- datasets: all available datasets in RDA with brief descriptions, created date, unit of analysis as numeric id, NADA repository id for metadata page, registered doi for reference
- death_rows: row id for id of each death registered in RDA
- deaths: id of deaths registered in RDA, with site, external id as identifier in raw source, and id for ingestion
- domains: namespace in which variable names are unique; can be the source name for source inputs, or the user ORCID for user inputs
- ethics: list of ethics document associated with study protocols
- ethics_documents: ethics documents. The document files are stored as binary large objects (BLOBs), encoded as Base64 strings using Julia’s Base64 module.
- ingest_datasets: documenting ingestion information for each dataset including ingestion id and transformation id
- instrument_datasets: list of survey instrument associated with datasets
- instrument_documents: documents associated with each instrument. The document files are stored as binary large objects (BLOBs), encoded as Base64 strings using Julia’s Base64 module.
- instruments: description of instruments
- protocol_instruments: list of survey instruments under each protocol
- protocols: information of protocols, including name, description and associated ethics document ids
- repository: data in NADA repositories, including DDI and RDF documents, and the DDI document id for each repository
- site_protocols: list of protocols for each study site
- sites: name and country code of study sites for each source
- sources: name and id of each data source
- study_types: available study types, 1=“Demographic surveillance”, 2=“Cohort study”, 3=“Cross-sectional survey”, 4=“Panel data”
- transformation_inputs: the input datasets for each transformation
- transformation_outputs: the output datasets for each transformation
- transformation_statuses: available transformation status, 1=“Unverified”, 2=“Verified”
- transformation_types: available transformation types, 1=“Raw data ingest”, 2=“Dataset transform”
- transformations: details of each transformation, including the description, type and status of transformation, the date and producer of the transformation, along with reference code
- unit_of_analysis_types: available unit of analysis types, 1=“Individual”, 2=“Aggregation”
- value_types: available value types, 1=“Integer”, 2=“Float”, 3=“String”, 4=“Date”, 5=“Datetime”, 6=“Time”, 7=“Category”
- variablemappings: the tables mapping the transformation from one domain to another domain, e.g. from raw VA collected using WHO 2016 ODK to InSilicoVA-compatible input format
- variablemaps: the table indicates the source and destination domains of the mapping
- variables: all variables in the datasets, with variable names, descriptions, notes, domain id, value type, id of vocabulary if applicable
- vocabularies: vocabularies for categorical variable
- vocabulary_items: vocabulary items in categorical variables, with value, code and description for each item
- vocabulary_mapping: mapping between original vocabulary and newly-generated vocabulary
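As a sketch of how the long-format data model fits together, the following Python example pivots a toy data table back to wide records by joining variables. The column names used here (datarow_id, variable_id, value, name) are assumptions for illustration and may differ from the actual RDA.SQLite schema, so check the data model first:

```python
import sqlite3

# Sketch: pivot the long-format `data` table to wide rows by joining `variables`.
# Column names (datarow_id, variable_id, value, name) are assumptions and may
# differ from the actual RDA.SQLite schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE variables (variable_id INTEGER, name TEXT);
CREATE TABLE data (datarow_id INTEGER, variable_id INTEGER, value TEXT);
INSERT INTO variables VALUES (1, 'age'), (2, 'sex');
INSERT INTO data VALUES (10, 1, '34'), (10, 2, 'F'), (11, 1, '57'), (11, 2, 'M');
""")

# One long row per (row, variable); group by row id to rebuild wide records.
sql = """
SELECT d.datarow_id, v.name, d.value
FROM data d JOIN variables v ON d.variable_id = v.variable_id
ORDER BY d.datarow_id, v.variable_id;
"""
wide = {}
for row_id, var_name, value in cur.execute(sql):
    wide.setdefault(row_id, {})[var_name] = value
print(wide)  # {10: {'age': '34', 'sex': 'F'}, 11: {'age': '57', 'sex': 'M'}}
conn.close()
```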
7 Metadata for RDA Data Repository release
The RDA Data Repository provides structured metadata for all publicly released data products, which enables the users to discover and understand the datasets effectively, ensures transparency and reproducibility of research, and facilitates decision-making and collaboration across settings.
The repository is built using the NADA Data Catalog, an open-source web-based tool to structure and standardize documentation about data products. Therefore, the metadata published on the RDA Data Repository follow the NADA Data Documentation Initiative (DDI) XML standard.
The RDA User GUI offers an intuitive and user-friendly interface for collecting metadata associated with user-generated data products, and other critical information for RDA Data Repository release.
Below is a full list of metadata supported on RDA Data Repository:
List of metadata for data repository publication
(Required fields are marked in red in GUI submission form)
Project name: a short name which uniquely identifies the user’s project. The maximum length of the project name is 64 characters, combining letters, numbers, and underscores.
Title: the longer descriptive name of the project, which is the title of the project page in the data catalog.
Country: countries covered in the data product(s); please check all that apply from the list.
Series information: details about any series or collection to which the project belongs, e.g. the project is derived from the CHAMPS study.
Abstract: a brief overview of the project, usually includes the scope, population, objective, and/or structure etc. of the project.
Type of study: select study type from dropdown list, e.g. demographic surveillance.
Kind of data: type of data products selected from dropdown list, e.g. modeled estimates.
Unit of analysis: the primary entity analyzed in the study. Only aggregate-level data are allowed to be released on the RDA Data Repository, according to RDA data policy.
Version:
- version of study: current version of the project. e.g. “v1: Level-2 deidentified data for public release”
- production date of current version: date that the current version of the project is created.
- additional notes of current version: accumulated notes of updates or changes made to the project over time.
Scope:
- Notes: details on the scope of analysis.
- Topics: key topics covered by the data product(s).
- Keywords: keywords describing the main topics covered by the data product(s).
Coverage:
- Geographic setting: details on geographic coverage of the project.
- Study population, eligibility for inclusion etc.: details on study population included in the analysis.
Producers and sponsors:
- Primary investigators: name and affiliations of the primary investigators (PIs) of the project.
- Producer: information about the producer(s) of the data product(s), including name, affiliation, and role.
- Funding agency: name, abbreviation and role for funding agencies of the project.
- Other identifications / acknowledgements: name, affiliation and role for other stakeholders.
Sampling:
- Sampling procedure: overview of sampling design, if applicable.
Data processing:
- Data editing: details about preprocessing of the data to prepare them for the analysis or modeling.
Data appraisal:
- Estimates of sampling error: discussion of sampling errors in the project, if applicable.
Data access:
- Access authority: the organization or individual responsible for granting access to the dataset; by default the RDA team for user-generated data product(s), and the source organization for raw source data.
- Access conditions: any restrictions, permissions, or requirements for accessing the dataset, such as licensing terms, embargo periods, or registration processes.
Disclaimer and copyrights:
- Disclaimer: The user of the data acknowledges that the original collector of the data and the relevant funding agencies bear no responsibility for the data’s use or interpretation or inferences based upon it.
- Copyright: The dataset(s) documentation is licensed under a Creative Commons Attribution-Non Commercial 4.0 International License. The dataset(s) is shared in terms of the data-use agreement accepted at the time of data download.
Metadata production:
- Metadata producers: name, abbreviation, affiliation and role of the individual who submits the metadata.
- Date of metadata production: date that the current version of project metadata is produced.
7.1 Data preprocessing in RDA
7.1.1 De-identification
All individual-level data from the study sources were de-identified by each study team before being ingested into RDA.
Standard data transformations were performed to de-identify these datasets, including:
- case identifiers were replaced with surrogates;
- narrative and sensitive fields such as names were removed;
- dates of births/deaths were removed or shifted.
More details of the de-identification process are documented in the study protocols, available for each study source in the RDA Data Repository.
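A minimal sketch of these standard transformations, assuming hypothetical field names and a fixed date shift (the actual procedures differ by study team and are described in the study protocols):

```python
import hashlib
from datetime import date, timedelta

# Illustrative de-identification sketch; field names, the salt, and the fixed
# shift are hypothetical, not any study team's actual procedure.
def deidentify(record, shift_days=-100, salt="example-salt"):
    """Replace the case identifier with a surrogate, drop narrative fields,
    and shift the date of death by a fixed offset."""
    surrogate = hashlib.sha256((salt + record["case_id"]).encode()).hexdigest()[:12]
    return {
        "surrogate_id": surrogate,
        # narrative and sensitive fields such as names are removed entirely
        "death_date": record["death_date"] + timedelta(days=shift_days),
        "cause": record["cause"],
    }

raw = {"case_id": "BR-0001", "name": "Jane Doe",
       "death_date": date(2024, 5, 1), "cause": "sepsis"}
clean = deidentify(raw)
print(clean)
```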
7.1.2 Data cleaning
We briefly review the quality of data received from source studies. If obvious errors or inconsistencies are identified, we communicate with the source studies to try to correct them. Otherwise, we recode garbage codes and invalid input to missing, and format variables into the appropriate formats indicated in the data dictionaries of each study source. Details of the data cleaning can be found in the source code of RDAClean.jl.
Cleaned data are also stored in RDA SQLite database with suffix “_clean” in dataset names.
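A minimal sketch of the recoding step, using hypothetical garbage codes; the authoritative cleaning rules live in the RDAClean.jl source code:

```python
# Sketch of recoding garbage codes and invalid input to missing, per the data
# cleaning step above. The codes treated as garbage here are hypothetical.
GARBAGE_CODES = {"-99", "DK", "N/A", ""}   # assumed placeholder values

def clean_value(value, valid=None):
    """Return None (missing) for garbage codes or values outside the valid set."""
    if value is None or str(value).strip() in GARBAGE_CODES:
        return None
    if valid is not None and value not in valid:
        return None
    return value

ages = ["34", "-99", "57", "DK", ""]
print([clean_value(a) for a in ages])  # ['34', None, '57', None, None]
```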
7.1.3 Convert VA data for algorithms
Raw VA data collected using WHO standardized questionnaires were mapped to an InSilicoVA-compatible format, using a mapping configuration identical to that of pyCrossVA developed by the openVA team, for downstream cause of death assignment.
Detailed mapping configuration files can be found in vocabulary_mapping data table in RDA SQLite database, or on our RDA GitHub repository.
8 Acknowledgements
The RDA development team is a collaborative effort of researchers from multiple institutions led by Samuel Clark that includes Yue Chu and Jason Thomas from The Ohio State University (OSU); Kobus Herbst and Brendan Gilbert from the Africa Health Research Institute (AHRI); Doris Ma Fat, Dan-George Vasilache, Philippe Boucher, Joven Larin, and Mona Sharan from the World Health Organization (WHO); and David Plotner and Norman Goco from RTI International.
The Gates Foundation funded this work.
9 Contact us
For further information about the RDA, please contact the RDA Team or the openVA Team.