A Study of the Quality of Wikidata

Datasets


  • Claims file - Wikidata dump of Dec 7, 2020.

    • Description: Reference dump containing the claims.tsv.gz file with 1.15B statements, used for finding constraint violations. The dump is in KGTK format (https://kgtk.readthedocs.io/)
    • License: 'https://creativecommons.org/licenses/by/4.0/legalcode' and 'info:eu-repo/semantics/openAccess'
  • Labels file - Wikidata dump of Dec 7, 2020

    • Description: File with all labels of nodes used in the analysis in KGTK format
    • License: 'https://creativecommons.org/licenses/by/4.0/legalcode' and 'info:eu-repo/semantics/openAccess'
  • Claim and qualifier properties file - Wikidata dump of Dec 7, 2020

    • Description: Two files, both needed to compute the constraints in our analysis. The claim properties file (Wikidata dump of Dec 7, 2020) is used to determine the constraints. The qualifiers file (Wikidata dump of Dec 7, 2020) is used to fetch further details on the constraints for each property, such as exceptions.
    • License: 'https://creativecommons.org/licenses/by/4.0/legalcode' and 'info:eu-repo/semantics/openAccess'
  • Wikidata removed statements from Oct 2014 - Jan 2021

    • Description: This dataset contains 76.5M statements that have been permanently removed (i.e., removed from a Wikidata dump and not added again) since the first available dump of Wikidata in October 2014. The dataset was generated by downloading all available weekly JSON Wikidata dumps from the Internet Archive, converting them to the KGTK format, and extracting only those statements that had been removed between each pair of dumps (i.e., calculating the deltas). The dataset is a TSV file in KGTK format (https://kgtk.readthedocs.io/en/latest/specification/).
    • License: 'https://creativecommons.org/licenses/by/4.0/legalcode' and 'info:eu-repo/semantics/openAccess'
  • Wikidata deprecated statements by Jan 2021.

    • Description: All statements tagged as deprecated in the Knowledge Graph by Jan 2021.
    • License: 'https://creativecommons.org/licenses/by/4.0/legalcode' and 'info:eu-repo/semantics/openAccess'
  • Wikidata instanceOf, subclassOf, isa-relation dumps of Feb 15, 2021.

    • Description: The combination of these files is used to determine the ancestors of each node at 1 or more levels above it. The files cover:
      • P279 (subclass of)
      • P31 (instance of)
      • isa (instance of or subclass of)
      • P279 star (recursive subclass of)
    • License: 'https://creativecommons.org/licenses/by/4.0/legalcode' and 'info:eu-repo/semantics/openAccess'
  • Constraint violation summaries (Dump: Dec 7th, 2020)

    • Description: Wikidata constraint violation summary files (dumps from 7 Dec, 2020), calculated by running the notebooks in https://doi.org/10.5281/zenodo.5119983 and the scripts in https://github.com/usc-isi-i2/wd-quality/tree/main/Scripts. The files are:
      • Item requires statement constraint: codepConstDFAnalysis.csv
      • Inverse constraint: invConstDFAnalysis.csv
      • Symmetric constraint: symmConstDFAnalysis.csv
      • Type constraint (domain): typeConstDFAnalysis.csv
      • Value type (range): valueTypeConstDFAnalysis.csv
    • License: 'https://creativecommons.org/licenses/by/4.0/legalcode' and 'info:eu-repo/semantics/openAccess'
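
Two of the descriptions above outline procedures that a short sketch may make more concrete: extracting permanently removed statements by comparing successive dumps, and determining the ancestors of a node from the P31 and P279 files. The Python below is a minimal illustration under stated assumptions, not the code used to build these datasets. It assumes each file is a KGTK edge file with at least the node1, label and node2 columns, compares statements as (node1, label, node2) triples, reads "permanently removed" as "present in some earlier dump and absent from the latest one", and reads "ancestors" as one instance-of or subclass-of step followed by any number of subclass-of steps. The function names read_edges, permanently_removed and ancestors are hypothetical.

```python
import csv
import gzip
from collections import defaultdict


def read_edges(path):
    """Yield (node1, label, node2) triples from a KGTK edge file.

    Assumption: the file is a tab-separated file (optionally gzipped)
    whose header names the columns node1, label and node2.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row["node1"], row["label"], row["node2"]


def permanently_removed(dumps):
    """Return statements seen in an earlier dump but absent from the latest.

    `dumps` is a chronologically ordered list of statement collections.
    Assumption: a statement that disappears and later reappears in the
    latest dump does not count as permanently removed.
    """
    if not dumps:
        return set()
    seen_earlier = set()
    for dump in dumps[:-1]:
        seen_earlier.update(dump)
    return seen_earlier - set(dumps[-1])


def ancestors(node, edges):
    """Return the classes above `node`.

    Assumption: the first step may follow P31 (instance of) or P279
    (subclass of); every further step follows P279 only.
    """
    isa = defaultdict(set)       # node -> parents via P31 or P279
    subclass = defaultdict(set)  # class -> parents via P279 only
    for node1, label, node2 in edges:
        if label in ("P31", "P279"):
            isa[node1].add(node2)
        if label == "P279":
            subclass[node1].add(node2)

    found = set()
    frontier = list(isa.get(node, ()))
    while frontier:
        current = frontier.pop()
        if current in found:
            continue  # already visited; also guards against cycles
        found.add(current)
        frontier.extend(subclass.get(current, ()))
    return found
```

Holding whole dumps in Python sets is only practical for small samples; at the scale of the files listed above (1.15B statements in the claims file), a disk-based tool such as KGTK is the more realistic choice.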

Software


Pointers to the main software used are listed below:

    Knowledge Graph Toolkit

    License: MIT License
    Citation: KGTK: A Toolkit for Large Knowledge Graph Manipulation and Analysis
    Languages: Python, HTML, Shell

    Notebooks for generating and validating constraints in Wikidata

    Language: Shell

Bibliography


    • Vrandecic, D., Krotzsch, M.: Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10), 78–85 (2014)

About the authors


    Daniel Garijo

    Distinguished Researcher

    Universidad Politécnica de Madrid, University of Southern California

    http://w3id.org/people/dgarijo

    I am a researcher at Universidad Politécnica de Madrid. My research activities focus on e-Science and the Semantic Web, specifically on how to increase the ease of use of software and scientific workflows using provenance, metadata, intermediate results and Linked Data.

    Daniel Schwabe

    Author

    Invited researcher

    Professor at the Pontifícia Universidade Católica do Rio de Janeiro and invited researcher at the Information Sciences Institute, University of Southern California.

    Pedro Szekely

    Author

    Research Director

    Research Director at the Center on Knowledge Graphs, Information Sciences Institute, University of Southern California.

    Filip Ilievski

    Author

    Research Scientist

    Researcher at the Information Sciences Institute, University of Southern California.

    Kartik Shenoy

    Author

    Student worker

    Master's student at the University of Southern California.