bloomonkey / oai-harvest

Licence: other
Python package for harvesting records from OAI-PMH provider(s).

Programming Languages

python

Projects that are alternatives of or similar to oai-harvest

webumenia.sk
An online collection platform to explore digitised art collections from galleries and museums
Stars: ✭ 42 (-26.32%)
Mutual labels:  oai-pmh
minutes
Sync worklogs between multiple time trackers, invoicing, and bookkeeping software.
Stars: ✭ 19 (-66.67%)
Mutual labels:  harvest
python-harvest
A Python wrapper for the Harvest time-tracking API.
Stars: ✭ 53 (-7.02%)
Mutual labels:  harvest
oai
OAI-PMH R client
Stars: ✭ 13 (-77.19%)
Mutual labels:  oai-pmh
xoai
OAI-PMH Java Toolkit
Stars: ✭ 28 (-50.88%)
Mutual labels:  oai-pmh
hepcrawl
Scrapy project for feeds into INSPIRE-HEP
Stars: ✭ 16 (-71.93%)
Mutual labels:  harvest-data
chiadog
A watch dog providing a peace in mind that your Chia farm is running smoothly 24/7.
Stars: ✭ 466 (+717.54%)
Mutual labels:  harvester
harvesting
Ruby wrapper for the Harvest API v2
Stars: ✭ 24 (-57.89%)
Mutual labels:  harvest
SHARE
SHARE is building a free, open, data set about research and scholarly activities across their life cycle.
Stars: ✭ 93 (+63.16%)
Mutual labels:  harvest-data
frisbee
Collect email addresses by crawling search engine results.
Stars: ✭ 29 (-49.12%)
Mutual labels:  harvester
hapi
PHP Wrapper Library for the Harvest API
Stars: ✭ 41 (-28.07%)
Mutual labels:  harvest
osint
Docker image for osint
Stars: ✭ 92 (+61.4%)
Mutual labels:  harvester
PoE-HarvestVendor
Tool for getting the list of crafts out of Horticrafting station in Path of exile
Stars: ✭ 68 (+19.3%)
Mutual labels:  harvest
Tire-a-part
Digital repository for the papers of a research organization.
Stars: ✭ 24 (-57.89%)
Mutual labels:  oai-pmh
rdryad
R client for Dryad web services
Stars: ✭ 25 (-56.14%)
Mutual labels:  oai-pmh
Striker
Striker is an offensive information and vulnerability scanner.
Stars: ✭ 1,851 (+3147.37%)
Mutual labels:  harvester
polkascan-pre-harvester
Polkascan PRE Harvester
Stars: ✭ 23 (-59.65%)
Mutual labels:  harvester
article-dataset-builder
Open Access PDF harvester, metadata aggregator and full-text ingester
Stars: ✭ 13 (-77.19%)
Mutual labels:  harvester

OAI-PMH Harvest

Description

A harvester to collect records from an OAI-PMH enabled provider.

The harvester can carry out a one-time harvest of all records from a particular OAI-PMH provider, given its base URL. It can also harvest selectively, e.g. only records updated after or before specified dates.

To assist in regular harvesting from one or more OAI-PMH providers, there is a provider registry. A short, memorable name can be associated with a provider's base URL, the destination directory for harvested records, and the format (metadataPrefix) in which records should be harvested. The registry also records the date and time of the most recent harvest, and automatically adds this to subsequent requests to avoid repeatedly harvesting unmodified records.
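As a rough illustration of the registry idea, the sketch below keeps providers in an SQLite table and remembers when each was last harvested. The table layout and the function names (`open_registry`, `add_provider`, `record_harvest`, `last_harvest`) are assumptions made for this example; they are not taken from the package's own code.

```python
import sqlite3
from datetime import datetime, timezone


def open_registry(path=":memory:"):
    """Open (or create) a registry database. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS providers ("
        "name TEXT PRIMARY KEY, "
        "url TEXT NOT NULL, "
        "metadata_prefix TEXT NOT NULL, "
        "directory TEXT NOT NULL, "
        "last_harvest TEXT)"
    )
    return conn


def add_provider(conn, name, url, metadata_prefix="oai_dc", directory="."):
    """Register a provider under a short name; it has no harvest date yet."""
    conn.execute(
        "INSERT INTO providers (name, url, metadata_prefix, directory) "
        "VALUES (?, ?, ?, ?)",
        (name, url, metadata_prefix, directory),
    )
    conn.commit()


def record_harvest(conn, name, when=None):
    """Store the time of the latest harvest as a UTC timestamp string."""
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime("%Y-%m-%dT%H:%M:%SZ")
    conn.execute(
        "UPDATE providers SET last_harvest = ? WHERE name = ?", (stamp, name)
    )
    conn.commit()
    return stamp


def last_harvest(conn, name):
    """Return the stored timestamp, or None if never harvested."""
    row = conn.execute(
        "SELECT last_harvest FROM providers WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    return row[0]
```

A harvester built on such a table would pass the value returned by `last_harvest` as the lower date bound of its next request, and call `record_harvest` once the run completes.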

This can be used in conjunction with a scheduler (e.g. cron) to maintain a reasonably up-to-date copy of the records held by one or more providers. Examples of how to accomplish these tasks are available below.

Latest Version


The latest stable release version is available in the Python Packages Index:

https://pypi.python.org/pypi/oaiharvest

Source code is under version control and available from:

http://github.com/bloomonkey/oai-harvest

Documentation

All executable commands are self-documenting, i.e. you can get help on how to use them with the -h or --help option.

At present, the only additional documentation is this README file.

Requirements / Dependencies

  • Python >= 2.7 or Python 3.x
  • pyoai
  • lxml
  • sqlite3

Installation

Users

pip install oaiharvest

Developers

I recommend that you use virtualenv to isolate your development environment from system Python and any packages that may be installed there.

  1. In GitHub, fork the repository

  2. Clone your fork:

    git clone git@github.com:<username>/oai-harvest.git
    
  3. Set up a development virtualenv using tox:

    pip install tox
    tox -e dev
    
  4. Activate development virtualenv:

    *nix:

    source env/bin/activate
    

    Windows:

    env\Scripts\activate
    

Bugs, Feature requests etc.

Bug reports and feature requests can be submitted to the GitHub issue tracker: http://github.com/bloomonkey/oai-harvest/issues

If you'd like to contribute code, patches etc. please email the author, or submit a pull request on GitHub.

Copyright And Licensing

Copyright (c) University of Liverpool, 2013-2014

This project is licensed under the terms of the 3-Clause BSD License.

Examples

Harvesting records from an OAI-PMH provider URL

All records

oai-harvest http://example.com/oai

Records modified since a certain date

oai-harvest --from 2013-01-01 http://example.com/oai
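At the protocol level, a date-limited harvest like the one above corresponds to an OAI-PMH request carrying a `from` argument. The sketch below builds such a `ListRecords` request URL using only the standard library. The helper name `list_records_url` is made up for this example, and it is an assumption that the tool issues this exact request; the argument names (`verb`, `metadataPrefix`, `from`, `until`, `set`) come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None,
                     until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (illustrative helper only)."""
    params = [("verb", "ListRecords"), ("metadataPrefix", metadata_prefix)]
    # Optional selective-harvesting arguments are added only when supplied.
    if from_date:
        params.append(("from", from_date))
    if until_date:
        params.append(("until", until_date))
    if set_spec:
        params.append(("set", set_spec))
    return base_url + "?" + urlencode(params)


print(list_records_url("http://example.com/oai", from_date="2013-01-01"))
# → http://example.com/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2013-01-01
```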

Records from a named set

oai-harvest --set "some:set" http://example.com/oai

Limit the number of records to harvest

oai-harvest --limit 50 http://example.com/oai

Get help on all available options

oai-harvest --help

OAI-PMH Provider Registry

Add a provider

oai-reg add provider1 http://example.com/oai/1

If you don't supply --metadataPrefix and --directory options, you will be interactively prompted to supply alternatives, or accept the defaults.

Remove an existing provider

oai-reg rm provider1 [provider2]

List existing providers

oai-reg list

Harvesting from OAI-PMH providers in the registry

Harvest from one or more providers in the registry using the short names that they were registered with:

oai-harvest provider1 [provider2]

By default, this will harvest all records modified since the last harvest from each provider. You can override this behavior using the --from and --until options.

Harvest from all providers in the registry:

oai-harvest all

Scheduling Regular Harvesting

To maintain a reasonably up-to-date copy of all the records held by registered providers, one could configure a scheduler to periodically harvest from all of them. For example, to tell cron to harvest from all providers at 2am every day, one might add the following to crontab:

0 2 * * * oai-harvest all
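
If more control is wanted than a bare crontab entry gives, the scheduled job could instead be a small wrapper script. The following is only a sketch: `harvest_command` and `run_harvest` are hypothetical helpers, and the only options they use (`--from`, `--until`) and the `all` argument are the ones shown in this README. It assumes the `oai-harvest` command is on the PATH of the user running the job.

```python
import subprocess
import sys


def harvest_command(providers=("all",), from_date=None, until_date=None):
    """Assemble the oai-harvest argument list (options first, then names)."""
    cmd = ["oai-harvest"]
    if from_date:
        cmd += ["--from", from_date]
    if until_date:
        cmd += ["--until", until_date]
    cmd += list(providers)
    return cmd


def run_harvest(providers=("all",), runner=subprocess.run, **dates):
    """Run the harvest and report a non-zero exit status on stderr."""
    result = runner(harvest_command(providers, **dates))
    if result.returncode != 0:
        print("oai-harvest exited with status %d" % result.returncode,
              file=sys.stderr)
    return result.returncode
```

A script scheduled by cron would end with `sys.exit(run_harvest())`; the `runner` parameter exists only so the function can be exercised without actually launching the command.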