A repository containing the vocabulary for datasets and models developed in the InesData project. The goal is to have a lightweight vocabulary with a model-centric view of ML models. The vocabulary extends schema.org and CodeMeta.
This vocabulary aims to be compatible with the ongoing efforts of the FAIR4ML RDA community, which are currently under discussion.
- Namespace URI: https://w3id.org/inesdata#
- Version IRI: https://w3id.org/inesdata/0.0.1#
- Preferred prefix: `ind`
The following diagram shows the classes, object properties, and data properties of the InesDATA ML schema. Note that a union is used for classes such as author and funder to denote that the domain/range includes both classes.
The following prefixes are used throughout this document:

- `codemeta`: https://w3id.org/codemeta/
- `ind`: https://w3id.org/inesdata#
- `prov`: http://www.w3.org/ns/prov#
- `schema`: http://schema.org/
- `rdfs`: http://www.w3.org/2000/01/rdf-schema#

The classes for the InesDATA machine learning schema are defined below. `SubClassOf` stands for `rdfs:subClassOf`. Likewise, `domain` and `range` stand for `rdfs:domain` and `rdfs:range`.
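For reference, the namespaces above can be declared in a Turtle document as follows (this is just the prefix header; the triples of the ontology itself live in the repository's ontology file):

```turtle
@prefix ind:      <https://w3id.org/inesdata#> .
@prefix schema:   <http://schema.org/> .
@prefix codemeta: <https://w3id.org/codemeta/> .
@prefix prov:     <http://www.w3.org/ns/prov#> .
@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
```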
Class representing a downloadable ML model object (models may be distributed in different sizes and quantizations)
Class aimed at representing metadata records for ML models.
Class to represent Machine Learning models that can be run for some task (e.g., those available in Hugging Face). A Machine Learning model may have more than one model distribution.
Action/Activity identifying that a model was trained, and its circumstances (training hours, hardware used, etc.)
Organization that provided the cloud infrastructure used to train the model (e.g., Amazon)
Dataset used for evaluating the model. The dataset used for evaluation may not have been part of the train/test/validation splits (e.g., a benchmark)
Description of the metrics used for evaluating the ML model
Description of the evaluation results obtained from the model (comparison, metric tables, etc.)
Relationship pointing to the source model used for fine-tuning (if this model was fine-tuned from another one)
Description of the GPU requirements needed to run the model
Description of the type of hardware used when training the model, so it can be used to report emissions.
Amount of CO2 equivalent emissions produced by the model. The unit should be included in the field (e.g., 10 tonnes)
Category of the model (e.g., SVM, Transformer, Supervised, etc.)
Description of the risks and biases of the model, in a human-readable manner
Brief description of the parameter size used to train the model (e.g., 7B). The unit (e.g., billions) must be included in the description
Person or Organization who shared the model online (e.g., by uploading it to Hugging Face)
Link to the dataset used to test the model (following train/test/validation splits)
Task for which the model was trained or fine-tuned (e.g., image classification, sentiment analysis)
Link to the dataset(s) used for training the model.
Region where the training of a model took place (e.g., Europe, UK, etc.)
Description of the instructions needed to run the model (e.g., to do inference on a task). Code snippets may be used for illustration
Link to the dataset used to validate the model. Typically the validation dataset is a separate set from the train/test sets.
Method used for quantizing the distribution of a model. Quantization is often needed to reduce the size of a large language model.
Number of bits used for model distribution quantization. E.g., 2 bits.
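To give a sense of how the terms described above fit together, the following Turtle sketch describes a hypothetical fine-tuned model. All `ind:` local names below (`MLModel`, `mlTask`, `fineTunedFrom`, `trainingDataset`, `parameterSize`, `co2Emitted`) are illustrative placeholders for the terms documented in this section; their exact spelling should be checked against the ontology file:

```turtle
@prefix ind:    <https://w3id.org/inesdata#> .
@prefix schema: <http://schema.org/> .

# Hypothetical example: the ind: property and class names are
# illustrative placeholders, not confirmed vocabulary terms.
<https://example.org/models/my-7b-chat> a ind:MLModel ;
    schema:name "my-7b-chat" ;
    ind:mlTask "sentiment analysis" ;            # task the model was fine-tuned for
    ind:fineTunedFrom <https://example.org/models/base-7b> ;
    ind:trainingDataset <https://example.org/datasets/reviews> ;
    ind:parameterSize "7B (billions)" ;          # unit included in the field
    ind:co2Emitted "10 tonnes" .                 # unit included in the field
```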