RubenVerborgh / WebsiteToRDF

Licence: MIT license
A simple ETL pipeline for HTML+RDFa websites

Convert HTML+RDFa to Turtle

This repository contains a simple pipeline that extracts HTML+RDFa data from webpages and combines it into a single Turtle file. Semantic gaps are filled by reasoning.

As a result, your website's data can be queried with SPARQL at 100% completeness and without worrying about vocabularies.

The article “Piecing the puzzle – Self-publishing queryable research data on the Web” explains in detail what the pipeline does and how it works.

Requirements

Running the pipeline

$ ./extract-website-data https://example.org/ /var/www/example.org/

where https://example.org/ is the URL of your homepage and /var/www/example.org/ the location of its HTML files.
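Before a long extraction run, it can help to confirm that both arguments look sensible. The following is a minimal sketch of such a pre-flight check; the function name check_site_args is hypothetical and not part of this repository, and the assumption that the site's pages end in .html may not hold for every website.

```shell
#!/bin/sh
# Hypothetical pre-flight check; not part of the repository.
# Succeeds only if the URL looks like http(s) and the folder holds HTML files.
check_site_args() {
  url=$1
  dir=$2
  case "$url" in
    http://*|https://*) ;;
    *) echo "Not an http(s) URL: $url" >&2; return 1 ;;
  esac
  if [ ! -d "$dir" ]; then
    echo "Not a directory: $dir" >&2
    return 1
  fi
  # Look for at least one .html file anywhere below the folder
  # (assumption: pages are stored with an .html extension).
  if [ -z "$(find "$dir" -type f -name '*.html' | head -n 1)" ]; then
    echo "No .html files found in: $dir" >&2
    return 1
  fi
}

# Usage (assumes the script is run from the repository root):
# check_site_args https://example.org/ /var/www/example.org/ &&
#   ./extract-website-data https://example.org/ /var/www/example.org/
```

The check only guards against obvious mistakes such as swapped arguments; it does not verify that the pages actually contain RDFa.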

Customizing the pipeline

Place the ontologies you want to reason on in the ontologies folder.

Rules for common RDFS and OWL constructs are available at the EYE website.

Run via Docker

  • Build the Docker image with docker build -t websitetordf . (Docker requires image names to be lowercase).
  • Run the container with docker run -v /path/to/site:/data -v /path/to/results/folder:/result -i --rm websitetordf https://example.org/.
  • The RDF triples will be available in /path/to/results/folder/website.nt.
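Because N-Triples stores one triple per line, the resulting file can be given a quick sanity check with standard shell tools before loading it into a SPARQL engine. The sketch below is an illustration only: the helper names count_triples and predicate_counts are hypothetical, and the sample data is made up.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting an N-Triples file such as website.nt.

# Count triples: lines that are neither empty nor comments.
count_triples() {
  grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | wc -l | tr -d ' '
}

# List each distinct predicate with its number of occurrences,
# most frequent first. In N-Triples the predicate is the second
# whitespace-separated field, because subjects and predicates
# never contain spaces.
predicate_counts() {
  grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" |
    awk '{ print $2 }' | sort | uniq -c | sort -rn
}

# Demo on a small made-up sample; in practice, point the helpers
# at /path/to/results/folder/website.nt instead.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<https://example.org/#me> <http://xmlns.com/foaf/0.1/name> "Example Person" .
<https://example.org/#me> <http://xmlns.com/foaf/0.1/made> <https://example.org/article1> .
<https://example.org/#me> <http://xmlns.com/foaf/0.1/made> <https://example.org/article2> .
EOF

count_triples "$sample"
predicate_counts "$sample"
rm -f "$sample"
```

An unexpectedly low triple count, or a missing predicate you know your pages use, suggests that some RDFa markup was not picked up.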

License

©2017 Ruben Verborgh. MIT License.