wikimedia / Html Metadata

Licence: MIT

MetaData html scraper and parser for Node.js (supports Promises and callback style)

Programming Languages

javascript

184084 projects - #8 most used programming language

Labels

nodejs web-scraping node-module web-scraper

Projects that are alternatives to, or similar to, Html Metadata

Php Curl Class

PHP Curl Class makes it easy to send HTTP requests and integrate with web APIs

Stars: ✭ 2,903 (+2150.39%)

Mutual labels: web-scraping, web-scraper

Spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

Stars: ✭ 656 (+408.53%)

Mutual labels: web-scraping, web-scraper

Basketball reference web scraper

NBA Stats API via Basketball Reference

Stars: ✭ 279 (+116.28%)

Mutual labels: web-scraping, web-scraper

Linkedin-Client

Web scraper for grabbing data from Linkedin profiles or company pages (personal project)

Stars: ✭ 42 (-67.44%)

Mutual labels: web-scraper, web-scraping

Cascadia

Go cascadia package command line CSS selector

Stars: ✭ 67 (-48.06%)

Mutual labels: web-scraping, web-scraper

OLX Scraper

📻 An OLX Scraper using Scrapy + MongoDB. It Scrapes recent ads posted regarding requested product and dumps to NOSQL MONGODB.

Stars: ✭ 15 (-88.37%)

Mutual labels: web-scraper, web-scraping

Faster Than Requests

Faster requests on Python 3

Stars: ✭ 639 (+395.35%)

Mutual labels: web-scraping, web-scraper

Scrapple

A framework for creating semi-automatic web content extractors

Stars: ✭ 464 (+259.69%)

Mutual labels: web-scraping, web-scraper

Social Media Profile Scrapers

Fetch user's data across social media

Stars: ✭ 60 (-53.49%)

Mutual labels: web-scraping, web-scraper

Scrapy Craigslist

Web Scraping Craigslist's Engineering Jobs in NY with Scrapy

Stars: ✭ 54 (-58.14%)

Mutual labels: web-scraping, web-scraper

Scrape Linkedin Selenium

`scrape_linkedin` is a python package that allows you to scrape personal LinkedIn profiles & company pages - turning the data into structured json.

Stars: ✭ 239 (+85.27%)

Mutual labels: web-scraping, web-scraper

Detect Cms

PHP Library for detecting CMS

Stars: ✭ 78 (-39.53%)

Mutual labels: web-scraping, web-scraper

Web Scraping

Detailed web scraping tutorials for dummies with financial data crawlers on Reddit WallStreetBets, CME (both options and futures), US Treasury, CFTC, LME, SHFE and news data crawlers on BBC, Wall Street Journal, Al Jazeera, Reuters, Financial Times, Bloomberg, CNN, Fortune, The Economist

Stars: ✭ 153 (+18.6%)

Mutual labels: web-scraping, web-scraper

top-github-scraper

Scrape top GitHub repositories and users based on keywords

Stars: ✭ 40 (-68.99%)

Mutual labels: web-scraper, web-scraping

Phpscraper

PHP Scraper - a highly opinionated web-interface for PHP

Stars: ✭ 148 (+14.73%)

Mutual labels: web-scraping, web-scraper

Project Tauro

A Router WiFi key recovery/cracking tool with a twist.

Stars: ✭ 52 (-59.69%)

Mutual labels: web-scraping, web-scraper

Arachnid

Powerful web scraping framework for Crystal

Stars: ✭ 68 (-47.29%)

Mutual labels: web-scraping, web-scraper

Daftlistings

A library that enables programmatic interaction with daft.ie. Daft.ie has nationwide coverage and contains about 80% of the total available properties in Ireland.

Stars: ✭ 86 (-33.33%)

Mutual labels: web-scraping, web-scraper

Rod

A Devtools driver for web automation and scraping

Stars: ✭ 1,392 (+979.07%)

Mutual labels: web-scraping

Dat8

General Assembly's 2015 Data Science course in Washington, DC

Stars: ✭ 1,516 (+1075.19%)

Mutual labels: web-scraping

html-metadata

This library aims to be a comprehensive tool for extracting all metadata embedded in HTML. It currently supports Schema.org microdata through a third-party library; has native implementations for BEPress, Dublin Core, Highwire Press, JSON-LD, Open Graph, Twitter, EPrints, PRISM, and COinS; and extracts some general metadata that does not belong to a particular standard (for instance, the content of the title tag or of meta description tags).

Support is planned for RDFa, AGLS, and other metadata types not yet covered. Contributions and requests for other metadata types are welcome!

Install

npm install html-metadata

Usage

Promise-based:

var scrape = require('html-metadata');

var url = "http://blog.woorank.com/2013/04/dublin-core-metadata-for-seo-and-usability/";

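scrape(url).then(function(metadata){
	console.log(metadata);
}).catch(function(error){
	// Report request or parsing failures instead of leaving the rejection unhandled
	console.error(error);
});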

Callback-based:

var scrape = require('html-metadata');

var url = "http://blog.woorank.com/2013/04/dublin-core-metadata-for-seo-and-usability/";

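scrape(url, function(error, metadata){
	if (error) {
		// metadata is not usable when the request or parsing failed
		return console.error(error);
	}
	console.log(metadata);
});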

The scrape method used here invokes the parseAll() method, which runs every parser registered in metadataFunctions(). Each of these parsers can also be used separately, for example:

Promise-based:

var cheerio = require('cheerio');
var preq = require('preq'); // Promisified request library
var parseDublinCore = require('html-metadata').parseDublinCore;

var url = "http://blog.woorank.com/2013/04/dublin-core-metadata-for-seo-and-usability/";

preq(url).then(function(response){
	var $ = cheerio.load(response.body); // declare $ locally rather than as an implicit global
	return parseDublinCore($).then(function(metadata){
		console.log(metadata);
	});
});

Callback-based:

var cheerio = require('cheerio');
var request = require('request');
var parseDublinCore = require('html-metadata').parseDublinCore;

var url = "http://blog.woorank.com/2013/04/dublin-core-metadata-for-seo-and-usability/";

request(url, function(error, response, html){
	var $ = cheerio.load(html); // declare $ locally rather than as an implicit global
	parseDublinCore($, function(error, metadata){
		console.log(metadata);
	});
});

Options object:

Instead of a URL string, you can pass an options object as the first argument to supply extra request parameters. Some websites only respond when a user-agent header or cookies are set.

var scrape = require('html-metadata');
var request = require('request');

var options =  {
	url: "http://blog.woorank.com/2013/04/dublin-core-metadata-for-seo-and-usability/",
	jar: request.jar(), // Cookie jar
	headers: {
		'User-Agent': 'webscraper'
	}
};

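scrape(options, function(error, metadata){
	if (error) {
		// metadata is not usable when the request or parsing failed
		return console.error(error);
	}
	console.log(metadata);
});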
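
Accepting either a URL string or an options object as the first argument is a common pattern. The sketch below shows one way such an argument could be normalized before a request is made. It is only an illustration: normalizeOptions is a hypothetical helper invented for this example and is not part of html-metadata, whose internal handling may differ.

```javascript
// Hypothetical helper (not part of html-metadata): turn a URL string or an
// options object into a plain options object with a "url" property.
function normalizeOptions(urlOrOptions) {
	if (typeof urlOrOptions === 'string') {
		return { url: urlOrOptions };
	}
	if (urlOrOptions !== null && typeof urlOrOptions === 'object' &&
			typeof urlOrOptions.url === 'string') {
		// Shallow copy so the caller's object is not modified later on
		return Object.assign({}, urlOrOptions);
	}
	throw new TypeError('Expected a URL string or an options object with a url property');
}

// Both call styles end up with the same shape
console.log(normalizeOptions('http://example.com/'));
console.log(normalizeOptions({
	url: 'http://example.com/',
	headers: { 'User-Agent': 'webscraper' }
}));
```

Copying the object keeps later changes, such as adding default headers, from leaking back into the caller's options.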

The method parseGeneral obtains the following general metadata:

<link rel="apple-touch-icon" href="" sizes="" type="">
<link rel="icon" href="" sizes="" type="">
<meta name="author" content="">
<link rel="author" href="">
<link rel="canonical" href="">
<meta name="description" content="">
<link rel="publisher" href="">
<meta name="robots" content="">
<link rel="shortlink" href="">
<title></title>
<html lang="en">
<html dir="rtl">
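
As a rough illustration of what collecting these fields involves, the sketch below pulls a handful of them out of an HTML string. It is a simplified, dependency-free stand-in and not the library's implementation (the library's parsers take a cheerio object, as in the examples above). The names sketchParseGeneral and getAttr, and the keys of the returned object, are assumptions made for this example. Regular expressions are fragile on real-world HTML, so a proper parser is preferable in practice.

```javascript
// Hypothetical helper: read one double-quoted attribute value from a tag string.
function getAttr(tag, name) {
	var match = new RegExp('\\b' + name + '\\s*=\\s*"([^"]*)"', 'i').exec(tag);
	return match ? match[1] : undefined;
}

// Hypothetical stand-in for a "general metadata" parser; output keys are assumed.
function sketchParseGeneral(html) {
	var result = {};

	// <title></title>
	var title = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
	if (title) {
		result.title = title[1].trim();
	}

	// <html lang="..." dir="...">
	var htmlTag = /<html\b[^>]*>/i.exec(html);
	if (htmlTag) {
		var lang = getAttr(htmlTag[0], 'lang');
		var dir = getAttr(htmlTag[0], 'dir');
		if (lang) { result.lang = lang; }
		if (dir) { result.dir = dir; }
	}

	// <meta name="..." content="..."> for author, description, and robots
	(html.match(/<meta\b[^>]*>/gi) || []).forEach(function (tag) {
		var name = (getAttr(tag, 'name') || '').toLowerCase();
		if (['author', 'description', 'robots'].indexOf(name) !== -1) {
			result[name] = getAttr(tag, 'content');
		}
	});

	// <link rel="..." href="..."> for canonical, shortlink, and publisher
	(html.match(/<link\b[^>]*>/gi) || []).forEach(function (tag) {
		var rel = (getAttr(tag, 'rel') || '').toLowerCase();
		if (['canonical', 'shortlink', 'publisher'].indexOf(rel) !== -1) {
			result[rel] = getAttr(tag, 'href');
		}
	});

	return result;
}

var sample = '<html lang="en" dir="ltr"><head><title> Example </title>' +
	'<meta name ="description" content="A page">' +
	'<link rel="canonical" href="http://example.com/"></head></html>';
console.log(sketchParseGeneral(sample));
```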

Tests

npm test runs the mocha tests

npm run-script coverage runs the tests and reports code coverage

Contributing

Contributions are welcome! All contributions should use bluebird promises instead of callbacks and be .nodeify()-ed in index.js, so that the functions can be used with either callbacks or Promises.
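
The dual callback/Promise style described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: it uses native Promises rather than bluebird so that it runs without dependencies, the nodeify function here is a hand-written stand-in for bluebird's .nodeify(), and parseExample is a hypothetical parser.

```javascript
// Hand-written stand-in for bluebird's .nodeify(): if a callback is given,
// forward the promise's outcome to it in Node's (error, value) style.
function nodeify(promise, callback) {
	if (typeof callback === 'function') {
		promise.then(
			// process.nextTick keeps exceptions thrown by the callback
			// out of the promise chain
			function (value) { process.nextTick(callback, null, value); },
			function (error) { process.nextTick(callback, error); }
		);
	}
	return promise;
}

// Hypothetical parser: written with promises, exposed in both styles.
function parseExample(input, callback) {
	var promise = new Promise(function (resolve, reject) {
		if (typeof input !== 'string') {
			reject(new Error('input must be a string'));
			return;
		}
		resolve({ length: input.length });
	});
	return nodeify(promise, callback);
}

// Promise style
parseExample('abc').then(function (result) {
	console.log(result.length);
});

// Callback style
parseExample('abc', function (error, result) {
	if (error) { return console.error(error); }
	console.log(result.length);
});
```

Writing the logic once with promises and adapting it at the boundary means the two styles do not need to be maintained separately.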

Stars: ✭ 129