Data challenges

From 30 November 2017 (the datathon start date), the organisers will make the collection of Datathon Datasets available. The datasets are described below, together with a set of potential data analysis challenges. These challenges should be taken as inspiration and vision; other ideas are welcome, provided they are well motivated by the intent of the datathon.

The data collections are obtained from the original OpenAIRE Information Space by applying simplification, selection, and normalization techniques. The aim is to make the data easier for participating teams to interpret, without compromising the usefulness of their solutions or the possibility of integrating them into the production services of OpenAIRE.

Dataset #1: OpenAIRE Information Space as LOD

 

Dataset: the data will be provided as a set of RDF triples (accessible via SPARQL endpoint queries), together with the high-level schema and the RDF schema.
Challenges: enrichment by interlinking with other LOD datasets, enrichment by mining and analysis, identifying interesting patterns or research networks in the graph, mashups, etc. 
 

Dataset #3: OpenAIRE records relative to the entities authors, publications, datasets, and projects

Dataset: the data will be made available as collections of XML records (one collection for each type of entity), together with the corresponding XML schema.
Challenges: deduplication of entities; cleaning and enrichment of metadata; interlinking of papers via projects and organizations; identification of similar documents across languages; semantic, language-agnostic search; etc.
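
As an illustration of the deduplication challenge, the following is a minimal sketch of title-based blocking, assuming that an identifier and a title have already been extracted from each XML record. The function names and the sample records are hypothetical, and exact matching on a normalized title is only one simple starting point, not a method prescribed by the organisers.

```python
import re
import unicodedata
from collections import defaultdict


def title_key(title: str) -> str:
    """Build a crude blocking key: strip accents, lowercase, drop punctuation."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(re.findall(r"[a-z0-9]+", no_accents.lower()))


def group_duplicates(records):
    """Group record ids whose titles share a key; keep only groups of size > 1."""
    groups = defaultdict(list)
    for record_id, title in records:
        groups[title_key(title)].append(record_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical (id, title) pairs, as might be extracted from the XML records.
sample = [
    ("r1", "Open Access: A Survey"),
    ("r2", "open access - a survey"),
    ("r3", "Étude sur le libre accès"),
    ("r4", "Etude sur le libre acces"),
    ("r5", "An unrelated paper"),
]
duplicate_groups = group_duplicates(sample)
print(duplicate_groups)
```

Records that land in the same group are only candidates; a real solution would likely compare further fields (authors, year, identifiers) before merging them.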
 

Dataset #2: OpenAIRE Information Space as Scholix.org links

Dataset: the data will consist of a set of Scholix JSON triples conforming to the Scholix schema (available on GitHub), collected from the OpenAIRE Scholexplorer Service (aka DLI Service).
Challenges: identifying extra links which could improve discovery network patterns, metadata enrichment for enhancing discovery, identifying interesting patterns or networks in the graph, etc. 
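
One way to approach the "extra links" challenge is to look for objects that link to the same target. The sketch below assumes each link has been reduced to a source, a relation, and a target. The field names `source`, `relation`, and `target` and the sample identifiers are simplified stand-ins chosen for illustration; they are not the actual Scholix schema fields, which should be taken from the published schema.

```python
import json
from collections import defaultdict
from itertools import combinations

# Hypothetical, simplified link records (NOT the real Scholix field names).
raw_links = """
[
  {"source": "pub:A", "relation": "references", "target": "data:X"},
  {"source": "pub:B", "relation": "references", "target": "data:X"},
  {"source": "pub:C", "relation": "references", "target": "data:Y"},
  {"source": "pub:A", "relation": "references", "target": "data:Y"}
]
"""


def co_linked_pairs(links):
    """Return sorted pairs of sources that link to at least one common target."""
    sources_by_target = defaultdict(set)
    for link in links:
        sources_by_target[link["target"]].add(link["source"])
    pairs = set()
    for sources in sources_by_target.values():
        pairs.update(combinations(sorted(sources), 2))
    return sorted(pairs)


links = json.loads(raw_links)
candidate_pairs = co_linked_pairs(links)
print(candidate_pairs)
```

Each resulting pair is a candidate for a new "related to" link between two publications; whether such a link is meaningful would need to be checked against the metadata.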
 

Dataset #4: OpenAIRE publications full-text and metadata

Dataset: a collection of article full-texts will be provided as .txt files, together with the corresponding XML metadata records (the name of each full-text file will correspond to the name of its metadata record).
Challenges: mining for enrichment of article metadata (e.g. ORCID identifiers, ISNI identifiers, persistent identifiers, VIVO identifiers); mining for identification of links between articles and other objects on the web, such as datasets, ontologies and topics, DOIs, software, author IDs, and organization IDs.
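
As a possible starting point for the mining challenge, the sketch below extracts DOI and ORCID candidates from a full-text string with regular expressions and validates each ORCID with its check digit (ISO 7064 MOD 11-2). The patterns are deliberately simple, the function names are hypothetical, and the DOI in the sample text is made up for illustration.

```python
import re

# Deliberately simple patterns; real full-texts will need more careful handling.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
ORCID_RE = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{3}[\dX]\b")


def orcid_checksum_ok(orcid: str) -> bool:
    """Validate the final ORCID character using ISO 7064 MOD 11-2."""
    digits = orcid.replace("-", "")
    total = 0
    for ch in digits[:-1]:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return digits[-1] == expected


def mine_identifiers(text: str) -> dict:
    """Return de-duplicated, sorted DOI and ORCID candidates found in text."""
    # Trailing sentence punctuation is stripped from DOI matches.
    dois = sorted({m.rstrip(".,;)") for m in DOI_RE.findall(text)})
    orcids = sorted({m for m in ORCID_RE.findall(text) if orcid_checksum_ok(m)})
    return {"doi": dois, "orcid": orcids}


# Sample text: the DOI is invented; the second ORCID has a wrong check digit.
sample_text = (
    "Data are available at doi:10.1234/example.5678. "
    "Contact: orcid.org/0000-0002-1825-0097 (not 0000-0002-1825-0098)."
)
found = mine_identifiers(sample_text)
print(found)
```

The extracted identifiers could then be written back into the matching XML metadata record, located through the shared file name.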