Data Management in the ALCHIMIA Project

Artificial intelligence (AI) techniques can handle huge amounts of data and reveal information that may elude human perception, especially when analysing unstructured data, a task that would be time consuming even for skilled data scientists.
In the ALCHIMIA project, various types of data and information are collected, managed, analysed, used, processed and generated. AI is very helpful for handling these large amounts of industrial data, but rules must be established to avoid security and privacy violations and to ensure compliance with ethical principles.
For this reason, the project has a “Data Management Plan (DMP)”, a detailed document describing all the types of data involved in the project and how they should be classified, managed, stored and published. In particular, ALCHIMIA data can be classified by type (e.g., image, number, text), source (e.g., industry, experiment, modelling, literature, survey) and accessibility (confidential or public data), as shown in Figure 1.
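The three classification axes described above can be sketched as a small data structure. This is a hypothetical illustration only; the concrete category values and record fields used in the project are defined in the DMP and may differ.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical tags mirroring the three classification axes
# (type, source, accessibility) from Figure 1.
class DataType(Enum):
    IMAGE = "image"
    NUMBER = "number"
    TEXT = "text"

class DataSource(Enum):
    INDUSTRY = "industry"
    EXPERIMENT = "experiment"
    MODELLING = "modelling"
    LITERATURE = "literature"
    SURVEY = "survey"

class Accessibility(Enum):
    CONFIDENTIAL = "confidential"
    PUBLIC = "public"

@dataclass
class DatasetRecord:
    """One dataset entry, tagged along the three classification axes."""
    name: str
    data_type: DataType
    source: DataSource
    accessibility: Accessibility

# Example: a confidential numerical dataset coming from an experiment.
record = DatasetRecord("melt_temperatures", DataType.NUMBER,
                       DataSource.EXPERIMENT, Accessibility.CONFIDENTIAL)
print(record.accessibility.value)  # → confidential
```

Tagging every dataset this way lets downstream tooling decide, for instance, whether a file may be published openly or must stay in a restricted repository.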

Figure 1: Data classification

The DMP also establishes roles and responsibilities among the partners and defines all aspects of the allocation of resources, data security and ethics. The ALCHIMIA project is guided by a human-centric philosophy; privacy and personal data protection are therefore key principles. Furthermore, the highest level of trust, safety and collaboration between workers and AI-based solutions must be guaranteed.
The development and the application of the DMP are based on the four pillars of the “FAIR Data guiding principles”: Findability, Accessibility, Interoperability and Reusability.

⦁ Findability: making data as easy as possible to find for both humans and computers, whether consortium partners or external users (e.g., by establishing a file naming convention for internal and external use).
⦁ Accessibility: data should be accessible to all users according to its level of confidentiality. This can be achieved through standardised communication protocols and authentication and authorization procedures (e.g., by defining upload steps for both open access and confidential data on platforms such as Zenodo, OwnCloud or private repositories).
⦁ Interoperability: facilitating data integration (e.g., common dataset formats, SI units, vocabularies and acronyms).
⦁ Reusability: optimising the reuse of collected and generated datasets (e.g., specific licences for public use of datasets generated during the project).
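As an illustration of the findability point above, a file naming convention can be made machine-checkable. The pattern below is purely hypothetical; the actual convention adopted by the consortium is specified in the DMP.

```python
import re
from datetime import date

# Hypothetical convention: <project>_<partner>_<topic>_<YYYYMMDD>_v<NN>.<ext>
# e.g. alchimia_partner1_melt-data_20240301_v01.csv
PATTERN = re.compile(
    r"^(?P<project>[a-z]+)_(?P<partner>[a-z0-9]+)_(?P<topic>[a-z0-9-]+)"
    r"_(?P<date>\d{8})_v(?P<version>\d{2})\.(?P<ext>[a-z0-9]+)$"
)

def build_name(partner: str, topic: str, when: date,
               version: int, ext: str, project: str = "alchimia") -> str:
    """Assemble a file name that conforms to the hypothetical convention."""
    return f"{project}_{partner}_{topic}_{when:%Y%m%d}_v{version:02d}.{ext}"

name = build_name("partner1", "melt-data", date(2024, 3, 1), 1, "csv")
print(name)                      # → alchimia_partner1_melt-data_20240301_v01.csv
print(bool(PATTERN.match(name))) # → True
```

A validation step like this can run automatically when files are uploaded to a shared repository, so that non-conforming names are rejected before they pollute the shared storage.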
If you are interested in this topic, we recommend reading the article by Wilkinson et al.