NYUHSL Data Catalog

Welcome to the NYU Health Sciences Library's Data Catalog project. The NYU Data Catalog facilitates researchers' access to large datasets available either publicly or through institutional or individual licensing, and it also includes descriptions of internally-generated research datasets from NYU researchers. Our aim is to encourage the sharing and reuse of research data among institutions and individuals by providing a simple yet powerful search platform that exposes existing datasets to the researchers who can use them. A basic backend interface lets administrators manage the metadata that describes these datasets.

Components

The Data Catalog runs on Symfony2, a popular PHP application framework. Installation and management of this package should be performed by a PHP developer familiar with this framework. Typically, Symfony is run with an HTTP server such as Apache and a database such as MySQL; installing the Data Catalog requires a working knowledge of these packages.

The search functionality is powered by Apache Solr, which will need to be installed separately from this project. Solr comes packaged with its own web server (Jetty) and can be run on the same machine as this website, or on its own machine. We recommend using Solr version 6; version 7 should also work but we have not tested this. Detailed information on installing Solr is outside the scope of this documentation, but the basic steps are as follows (this is also covered in the general installation steps below):

  1. Download and install the Solr package
  2. Start the Solr server and create a Solr core for this project
  3. Configure Solr to use our custom schema, which is included in the root directory of this project (SolrV6SchemaExample.xml)
  4. Add the URL of your new Solr core to Symfony's parameters.yml file (step 4 in the install instructions below).
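For reference, on a Unix-like system these steps might look roughly like the sketch below (the Solr version and core name are examples; adjust them for your environment). Configuring the schema itself is covered in step 9 of the installation instructions below.

# download and unpack Solr, then start the bundled Jetty server
tar xzf solr-6.6.6.tgz
cd solr-6.6.6
bin/solr start

# create a core for the catalog; its name becomes part of your Solr URL
bin/solr create -c datacatalog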

Datasets are added using the Data Catalog's administrative interface, and then sent to Solr for indexing. Solr's index therefore needs to be kept in sync with any changes made in the Data Catalog. We've provided a sample indexing script ("SolrIndexerExample") in the root directory of this project. We recommend setting this up to run automatically either daily or weekly depending on your usage.

IMPORTANT NOTE: This package comes with a very basic form of authentication that should only be used in a local development environment. There are methods in place to use your institution's LDAP server, or you can use Symfony's built-in user management. Please read app/config/common/security.yml for more info.
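For orientation only, a bare-bones security.yml using Symfony's built-in in-memory user provider might look something like the sketch below. This is not taken from the bundled example file, so treat app/config/security.yml.example as the authoritative reference, and never use a hard-coded login like this outside a local development environment.

# illustrative sketch only -- see app/config/security.yml.example
security:
    encoders:
        Symfony\Component\Security\Core\User\User: plaintext
    providers:
        in_memory:
            memory:
                users:
                    admin: { password: changeme, roles: 'ROLE_ADMIN' }
    firewalls:
        main:
            pattern: ^/
            http_basic: ~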

Installation

This repository is essentially a Symfony2 distribution (i.e. it is not simply a Symfony "bundle"). As such, you should be able to install this site fairly easily, after configuring it for your environment.

  1. Install Composer and Apache Solr (we have tested Solr v4 and v6), and set up suitable database software such as MySQL. Create an empty database schema for this application. Ensure that the PHP modules php-curl and php-dom are installed on your system. In production, data is cached using the APC extension or, for newer versions of PHP, APCu.
  2. Clone this repository into a directory your web server can serve.
git clone https://github.com/nyuhsl/data-catalog.git
  3. Start Solr and create a new core for your project. Your core's name will become part of the URL that goes into the parameters.yml file in the next step. For example, if you create a core called "datacatalog", your Solr URL would look something like "http://localhost:8983/solr/datacatalog".
  4. Next we'll set up the Symfony configuration files. Check the Symfony documentation for background on how these files work; in this project, there is additional information in app/config/parameters.yml.example. Fill in the details of your MySQL server and the URL where your Solr installation lives (the solrsearchr_server parameter). You'll need a version of this file in app/config/dev and, later, in app/config/prod, and remember to choose a "secret" as described in the Symfony documentation (a sketch of a filled-in parameters.yml appears after this list). Then read through app/config/security.yml.example and copy it to app/config/common/security.yml. Please also read the README file in app/config, which contains some more information.
  5. On a command line, navigate to your project's root directory and run composer install to install Symfony and any dependencies.
  6. Configure your web server to work with Symfony. NOTE: You will eventually have to require HTTPS connections on the login and administrative pages (at least), so remember to set up an SSL certificate for your server when you move the site to production. There is some sample code in app/config/common/security.yml that will tell Symfony to require HTTPS connections.
  7. Configure the file system. At the very least, this means making app/config/cache and app/config/logs writable by the Apache web server and by your account.
  8. To set up the database, there are two options. First, there is a "starter database" prepopulated with several public datasets, which can be loaded directly into the empty database schema you created in step 1; we recommend this option. Just extract the file starterDatabase.sql.tar.gz in the root of this repo and import the *.sql file into your schema (see the sketch after this list). However, due to updates to the metadata, this file may become out of date. In that case, or if you'd simply prefer to start with an empty database, you can create the table structure with a Symfony console command: navigate to the root of your Symfony installation and run php app/console doctrine:schema:update --force. If you have configured your database correctly in parameters.yml, this will set up your empty database to match the data model used in this application; if you haven't, this command will let you know.
  9. If using Solr v6+, you will need to switch from the "managed-schema" to our custom schema, which is defined in SolrV6SchemaExample.xml. This involves some minor changes to solrconfig.xml, as described in the Solr documentation. Then place SolrV6SchemaExample.xml in the Solr config directory, named schema.xml. Perform any customizations you require, or leave it as is.
  10. At this point the site should function, but it may not look right yet, and chances are you won't see any search results because there is nothing in the database and nothing has been indexed in Solr. Click on the "Admin" tab, click "Add a New Dataset" in the sidebar menu, and get going!
  11. Once you've added some test datasets, you'll have to index them in Solr for them to become visible in the search interface. Navigate to your site's base directory and edit the file SolrIndexerExample.py (or SolrIndexerExample.php if you prefer PHP) to specify the URL of your Solr server where indicated. Then run the script.
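As promised in step 4, here is a sketch of a filled-in app/config/dev/parameters.yml. The values are placeholders, and apart from solrsearchr_server the parameter names shown are the standard Symfony2 database settings, so check parameters.yml.example for the exact list your version expects:

parameters:
    database_driver: pdo_mysql
    database_host: 127.0.0.1
    database_name: datacatalog
    database_user: catalog_user
    database_password: changeme
    solrsearchr_server: http://localhost:8983/solr/datacatalog
    secret: some-long-random-string

Likewise, loading the starter database from step 8 might look like this on a typical MySQL setup (the schema and user names are examples):

tar xzf starterDatabase.sql.tar.gz
mysql -u catalog_user -p datacatalog < starterDatabase.sql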

Follow-up Tasks

  1. You'll most likely want to re-index Solr regularly to account for datasets you add, delete, or update using the Admin section. In the root directory of this repo there are PHP and Python examples of a script that can update a Solr index, called SolrIndexerExample. You'll probably want to call this script (or something similar) from a cron job nightly or weekly, depending on how much updating you do. We recommend weekly, since you can also run the script on demand from the command line (see the sketch after this list).
  2. You'll most likely want to brand the site with your institution's logo or color scheme. Some placeholders have been left in app/Resources/views/base.html.twig that should get you started.
  3. In production, the site is configured to use the APC cache, which requires the installation of the APCu PHP module.
  4. There are currently three metadata fields ("Study Type", "Subject Gender" and "Subject Sex") which look up the options they should display in the database. When you first load the data entry form, these fields will appear blank until some options are added to their database tables. Please feel free to contact NYUHSL for examples of how to do this. Alternatively, if you use the starter database, these fields will be pre-populated.
  5. You'll most likely want to have some datasets to search. Get to it!!
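As a sketch of the cron job mentioned in task 1, the following crontab entry would run the Python indexer every Sunday at 2am (the interpreter and install path are examples; adjust both for your system):

# re-index Solr weekly to pick up added, updated, or deleted datasets
0 2 * * 0 /usr/bin/python /var/www/data-catalog/SolrIndexerExample.py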

Using the API

The Data Catalog provides an API which can create and retrieve entities in JSON format.

Listing Entities

Existing datasets and related entities can be retrieved by calling the appropriate endpoints. Each type of entity has a URL which matches its class name. You can use the filenames in the src/AppBundle/Entity directory as a reference since they also match the class names. For example, the Dataset entity is defined in Dataset.php, so a list of datasets in your catalog can be found at /api/Dataset/all.json. Subject Keywords are defined in SubjectKeyword.php, so a list of all your subject keywords can be found at /api/SubjectKeyword/all.json. NOTE: The "all.json" is optional here, so /api/Dataset or /api/SubjectKeyword would work as well.

A specific dataset (or other entity) can be retrieved using its "slug" property (which you'd need to know beforehand). So, the URL /api/Dataset/ama-physician-masterfile will return the JSON representation of the AMA Physician Masterfile dataset.

In addition, the Dataset endpoint has an optional output_format parameter, which allows you to choose from three different output formats depending on your use case (all are returned as JSON):

  • default - the default output format can be ingested directly by other data catalogs using this codebase
  • solr - this format is suitable for indexing by Solr, and is used by our SolrIndexer scripts
  • complete - this format returns a more complete representation of the dataset, including full information about its related entities. For example, to retrieve the complete representation of all your datasets, visit the URL /api/Dataset/all.json?output_format=complete
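As a quick illustration (the hostname here is a placeholder, and any HTTP client will do), a few lines of PHP can fetch and decode one of these listings:

<?php
// Fetch all datasets in the "complete" output format and report how many
// came back. Replace the hostname with your own installation's base URL.
$url = 'https://datacatalog.example.org/api/Dataset/all.json?output_format=complete';
$datasets = json_decode(file_get_contents($url), true);
printf("Retrieved %d datasets\n", count($datasets));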

Ingesting Entities

New entities can also be ingested using the API, but there are some extra steps:

  1. Grant API Privileges - Each user wishing to upload via the API must be granted this privilege in the user management area (at /update/User). Choose the user in the list and check the "API User" role. When you save your changes, a unique API key will be generated, which will be used to verify the user's identity; the new key will be displayed the next time you view this form. The key is generated using the cryptographically secure random_bytes() function. Please do not generate your own keys (except for testing), and PLEASE enforce HTTPS for all POST requests to the API, as this keeps your unique API key encrypted in transit.
  2. Set X-AUTH-TOKEN Header - All POST requests to the API must include the user's API key as the X-AUTH-TOKEN header. Requests with missing API keys, or API keys corresponding to users who no longer have "API User" permissions will be rejected.
  3. Format your JSON - The entities you wish to ingest should be formatted in JSON in a way that Symfony can understand. We have provided a file in the base directory of this project called JSON_sample.json. This is a sample Dataset entity showing all the fields accepted by the API and the types of values those fields accept. Note that many of the related-entity fields (e.g. Subject Keywords) must already exist in the database before they can be applied to a new dataset via the API. For example, if you want to apply the keyword "Adolescent Health" to a dataset, you have to add "Adolescent Health" as a keyword before trying to ingest the dataset. There is more information about this in the APITester.php script. In that file you will see a sample PHP array which, like the sample JSON, shows the format required by the API (in case you're starting with your data in PHP), along with comments that explain in more detail which fields require what.
  4. Perform the POST Request - The APITester.php script is a simple example of how to put together a POST request suitable for our API. Fill in the base URL of your data catalog installation (line 6), set the $data variable to contain the data you wish to ingest, and set the X-AUTH-TOKEN header to your API key (line 146). Please again note that most related entities can only be applied to new datasets if their values already exist in the database!

Luckily, these other entities can also be ingested via the API. Just as we retrieved a list of Subject Keywords by going to /api/SubjectKeyword, we can add new keywords by performing a POST request to /api/SubjectKeyword.

The API uses Symfony's form system to validate all incoming data, so the field names in your JSON should match the field names specified in each entity's form class. These files are located in src/AppBundle/Form/Type. Any fields that are required in the form class (or by database constraints) must be present in your JSON.

For example, if we check src/AppBundle/Form/Type/SubjectKeywordType.php, we can see which fields are required and what they should be called. Two fields are defined in this file, named "keyword" and "mesh_code". The MeSH code is set to 'required'=>false. So, a new Subject Keyword can be added by submitting a POST request to /api/SubjectKeyword with the body:

{
  "keyword": "Test keyword"
}

If we want to add the MeSH code as well, the request body would look like:

{
  "keyword": "Test keyword",
  "mesh_code": "the mesh code"
}
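Putting these pieces together, a minimal POST request using PHP's curl extension might look like the sketch below. The hostname and API key are placeholders, and APITester.php remains the fuller working example:

<?php
// Sketch: ingest a new Subject Keyword via the API.
$payload = json_encode(array(
    'keyword'   => 'Test keyword',
    'mesh_code' => 'the mesh code',
));

$ch = curl_init('https://datacatalog.example.org/api/SubjectKeyword');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $payload);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'Content-Type: application/json',
    'X-AUTH-TOKEN: your-api-key-here', // the key generated in step 1
));
$response = curl_exec($ch);
curl_close($ch);

echo $response, PHP_EOL;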

Licensing

All files in this repository that are NOT components of the main Symfony distribution are Copyright 2016 NYU Health Sciences Library. This application is distributed under the GNU General Public License v3.0. For more information see the LICENSE file included in this repository.
