From 30 November 2017 (the datathon start date), the organisers will make available the collections of Datathon Datasets. These datasets are described below, together with a set of potential data analysis challenges. The challenges should be taken as inspiration and vision: other ideas, if well motivated by the intent of the datathon, are welcome.
The data collections are derived from the original OpenAIRE Information Space by applying simplification, selection, and normalization techniques, so as to ease interpretation for participating teams without compromising the usefulness of their solutions or the possibility of integrating them into the production services of OpenAIRE.
|Dataset: the data will be provided as a set of RDF triples (SPARQL end point queries), together with the high-level schema and the RDF schema |
Challenges: enrichment by interlinking with other LOD datasets, enrichment by mining and analysis, identifying interesting patterns or research networks in the graph, mashups, etc.
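As a starting point for graph analysis, the triples can be loaded and grouped by a shared property. The sketch below parses a small N-Triples excerpt with only the standard library; the URIs and the `author` predicate are illustrative assumptions, not the actual OpenAIRE vocabulary, and a real solution would query the SPARQL endpoint or use an RDF library instead.

```python
import re

# Hypothetical N-Triples excerpt; the URIs are illustrative, not the
# actual OpenAIRE vocabulary.
NTRIPLES = """\
<http://example.org/pub/1> <http://example.org/prop/author> <http://example.org/person/a> .
<http://example.org/pub/2> <http://example.org/prop/author> <http://example.org/person/a> .
<http://example.org/pub/1> <http://example.org/prop/title> "Linked data at scale" .
"""

# A simple pattern for triples whose terms are URIs or plain literals.
TRIPLE = re.compile(r'<([^>]+)>\s+<([^>]+)>\s+(<[^>]+>|"[^"]*")\s*\.')

def parse_triples(text):
    """Yield (subject, predicate, object) tuples from N-Triples lines."""
    for line in text.splitlines():
        m = TRIPLE.match(line.strip())
        if m:
            s, p, o = m.groups()
            yield s, p, o.strip('<>')

# Group publications by shared author: a first step toward spotting
# research networks in the graph.
by_author = {}
for s, p, o in parse_triples(NTRIPLES):
    if p.endswith('/author'):
        by_author.setdefault(o, []).append(s)

print(by_author)
```

Publications clustered around the same author (or project, or organization) node are exactly the kind of pattern the interlinking and network challenges ask for.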
|Dataset: the datasets will be made available as collections of XML records (one for each type of entity), together with the corresponding XML schema |
Challenges: deduplication of entities; cleaning and enrichment of metadata; interlinking of papers via projects and organizations; identification of similar documents across languages; semantic, language-agnostic search, etc.
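A minimal sketch of the deduplication challenge: cluster XML records by a normalised title key, using only the standard library. The element names (`record`, `title`, `year`) are assumptions for illustration, not the actual schema, and production deduplication would combine several fields and fuzzier matching.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical metadata records; element names are illustrative,
# not the actual OpenAIRE XML schema.
RECORDS = [
    '<record><title>Open Science  in Europe</title><year>2016</year></record>',
    '<record><title>open science in europe.</title><year>2016</year></record>',
    '<record><title>A different paper</title><year>2017</year></record>',
]

def normalise(title):
    """Lowercase, drop punctuation, and collapse whitespace."""
    title = re.sub(r'[^\w\s]', '', title.lower())
    return ' '.join(title.split())

# Cluster records by normalised title: records sharing a key are
# deduplication candidates to be inspected more closely.
clusters = defaultdict(list)
for xml in RECORDS:
    rec = ET.fromstring(xml)
    key = normalise(rec.findtext('title'))
    clusters[key].append(rec)

for key, recs in clusters.items():
    print(key, len(recs))
```

The normalisation step is where most of the design choices live: how aggressively to strip punctuation, diacritics, and stopwords trades recall against precision.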
|Dataset: the data will consist of a set of Scholix JSON triples conforming to the Scholix schema (available in GitHub), collected from the OpenAIRE Scholexplorer Service (aka DLI Service) operated by OpenAIRE |
Challenges: identifying additional links that could improve discovery network patterns, metadata enrichment for enhancing discovery, identifying interesting patterns or networks in the graph, etc.
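One simple way to look for patterns in the link graph is to count how often each object appears as a link target. The sketch below uses a deliberately simplified record shape (`source`, `target`, `relation`); the real Scholix schema carries much richer source/target descriptions, so these field names are an assumption for illustration only.

```python
import json
from collections import Counter

# Hypothetical Scholix-style link records, simplified: the actual
# schema describes each source and target with full metadata.
LINKS_JSON = """[
 {"source": "doi:10.1000/a", "target": "doi:10.5000/data1", "relation": "references"},
 {"source": "doi:10.1000/b", "target": "doi:10.5000/data1", "relation": "references"},
 {"source": "doi:10.1000/a", "target": "doi:10.5000/data2", "relation": "isSupplementedBy"}
]"""

links = json.loads(LINKS_JSON)

# Count how often each object appears as a link target: heavily linked
# datasets are natural seeds for discovery patterns in the graph.
in_degree = Counter(link["target"] for link in links)

print(in_degree.most_common())
```

From here, the same structure generalises to full graph analysis: high in-degree nodes, co-citation clusters, or literature-data bridges are all candidates for the "interesting patterns" challenge.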
|Dataset: a collection of article full texts will be provided as .txt files together with the corresponding XML metadata records (full-text file names will be correlated with the names of the corresponding metadata records) |
Challenges: mining for enrichment of article metadata (e.g. ORCID identifiers, ISNI identifiers, VIVO identifiers, and other persistent identifiers); mining for identification of links between articles and other objects on the web, such as datasets, ontologies and topics, DOIs, software, author IDs, and organization IDs.
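A minimal sketch of the metadata-mining challenge: extract ORCID iDs and DOIs from a full-text file with regular expressions. The snippet text is hypothetical; the ORCID pattern follows the documented format (four groups of four digits, final character possibly X), while the DOI pattern is a pragmatic approximation that a real pipeline would refine.

```python
import re

# A hypothetical full-text snippet; the identifiers are illustrative.
FULLTEXT = """Corresponding author ORCID: 0000-0002-1825-0097.
The dataset is archived at doi:10.5281/zenodo.12345 for reuse."""

# ORCID iDs are four groups of four digits; the final character may be X.
ORCID = re.compile(r'\b\d{4}-\d{4}-\d{4}-\d{3}[\dX]\b')
# A pragmatic DOI pattern; real-world DOIs need more careful trimming
# of trailing punctuation and markup.
DOI = re.compile(r'\b10\.\d{4,9}/[^\s]+')

orcids = ORCID.findall(FULLTEXT)
dois = [d.rstrip('.,;') for d in DOI.findall(FULLTEXT)]

print(orcids, dois)
```

Matched identifiers can then be attached to the corresponding XML metadata record (the file-name correlation described above makes the join straightforward), turning raw full text into enriched, linkable metadata.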