
scotteh / Php Goose

Licence: apache-2.0
Readability / HTML Content / Article Extractor & Web Scraping library written in PHP

Projects that are alternatives of or similar to Php Goose

PyScholar
A 'supervised' parser for Google Scholar
Stars: ✭ 74 (-81.12%)
Mutual labels:  scraper, article
Article Parser
To extract main article from given URL with Node.js
Stars: ✭ 179 (-54.34%)
Mutual labels:  article, readability
Clean Mark
Convert an article into a clean text
Stars: ✭ 414 (+5.61%)
Mutual labels:  article, readability
Graby
Graby helps you extract article content from web pages
Stars: ✭ 281 (-28.32%)
Mutual labels:  composer, readability
Micro Open Graph
A tiny Node.js microservice to scrape open graph data with joy.
Stars: ✭ 371 (-5.36%)
Mutual labels:  scraper
Freshonions Torscraper
Fresh Onions is an open source TOR spider / hidden service onion crawler hosted at zlal32teyptf4tvi.onion
Stars: ✭ 348 (-11.22%)
Mutual labels:  scraper
Blog
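Liu Bowen (Berwin), nicknamed "Jiuwu": author of the bestselling book 《深入浅出Vue.js》, speaker, front-end technical expert at Alibaba Group, and "firefighter" for the Tmall Double 11 promotion venues. Currently responsible for the client-side rendering architecture and dedicated project management of very large marketing campaigns, including Tmall Double 11.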
Stars: ✭ 3,773 (+862.5%)
Mutual labels:  article
Idea Composer Plugin
PhpStorm plugin that adds code completion in composer.json file
Stars: ✭ 346 (-11.73%)
Mutual labels:  composer
Kotlin Tutorials
[Kotlin video tutorials] Little material is available in China, so I recorded a set of videos to get things started.
Stars: ✭ 14 (-96.43%)
Mutual labels:  article
Coastercms
The repository for Coaster CMS (coastercms.org), a full featured, Laravel based Content Management System
Stars: ✭ 380 (-3.06%)
Mutual labels:  composer
Unused Scanner
Detect unused composer dependencies
Stars: ✭ 363 (-7.4%)
Mutual labels:  composer
Awesome Django Admin
Curated List of Awesome Django Admin Panel Articles, Libraries/Packages, Books, Themes, Videos, Resources.
Stars: ✭ 356 (-9.18%)
Mutual labels:  article
Epub Press Clients
📦 Clients for building books with EpubPress.
Stars: ✭ 370 (-5.61%)
Mutual labels:  article
Flex
Composer plugin for Symfony
Stars: ✭ 3,731 (+851.79%)
Mutual labels:  composer
Easytbk
5-in-1 Taoke (affiliate marketing) SDK supporting Taobao Union, JD Union, Duoduo Jinbao, Vipshop, and Suning
Stars: ✭ 383 (-2.3%)
Mutual labels:  composer
Composer Doc Cn
Composer documentation in Chinese (the documentation for the new version is being retranslated; see the 1.6 branch)
Stars: ✭ 346 (-11.73%)
Mutual labels:  composer
Arrayy
🗃 Array manipulation library for PHP, called Arrayy!
Stars: ✭ 363 (-7.4%)
Mutual labels:  composer
Osi.ig
Information Gathering Instagram.
Stars: ✭ 377 (-3.83%)
Mutual labels:  scraper
Katana
A Python Tool For Google Hacking
Stars: ✭ 355 (-9.44%)
Mutual labels:  scraper
Codeigniter Composer Installer
Installs the official CodeIgniter 3 with secure folder structure via Composer
Stars: ✭ 357 (-8.93%)
Mutual labels:  composer

PHP Goose - Article Extractor


Intro

PHP Goose is a port of Goose, which was originally developed in Java and later converted to Scala by GravityLabs. Portions have also been ported from the Python port, python-goose. Its mission is to take any news article or article-type web page and extract not only the main body of the article but also all metadata and the most probable image candidate.

The goal is to get the purest possible extraction from the beginning of the article, in order to serve Flipboard/Pulse-type applications that need to show the first snippet of a web article along with an image.

Goose will try to extract the following information:

  • Main text of an article
  • Main image of the article
  • Any YouTube/Vimeo videos embedded in the article
  • Meta Description
  • Meta tags
  • Publish Date

The PHP version was rewritten by:

  • Andrew Scott

Requirements

  • PHP 7.1 or later
  • PSR-4 compatible autoloader

The older 0.x versions with PHP 5.5+ support are still available under releases.

Install

This library is designed to be installed via Composer.

Add the dependency to your project's composer.json:

{
  "require": {
    "scotteh/php-goose": "^1.0"
  }
}

Download composer.phar:

curl -sS https://getcomposer.org/installer | php

Install the library.

php composer.phar install

Autoloading

This library requires an autoloader. If you aren't already using one, you can include Composer's autoloader:

require('vendor/autoload.php');

Usage

use \Goose\Client as GooseClient;

$goose = new GooseClient();
$article = $goose->extractContent('http://url.to/article');

$title = $article->getTitle();
$metaDescription = $article->getMetaDescription();
$metaKeywords = $article->getMetaKeywords();
$canonicalLink = $article->getCanonicalLink();
$domain = $article->getDomain();
$tags = $article->getTags();
$links = $article->getLinks();
$videos = $article->getVideos();
$articleText = $article->getCleanedArticleText();
$entities = $article->getPopularWords();
$image = $article->getTopImage();
$allImages = $article->getAllImages();
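
For the "first snippet plus an image" use case described in the intro, the getters above can be combined into a small preview helper. This is only a sketch: `buildPreview()` and its `$maxWords` parameter are hypothetical names, not part of the library, and it assumes that `getTopImage()` returns either `null` or an object exposing `getImageSrc()`.

```php
/**
 * Build a small preview (title, snippet, image URL) from an extracted article.
 *
 * $article is expected to be the object returned by extractContent().
 * Assumption: getTopImage() returns null or an object with getImageSrc().
 */
function buildPreview($article, int $maxWords = 40): array
{
    // Split the cleaned text on whitespace; the "u" flag keeps UTF-8 intact.
    $text = trim((string) $article->getCleanedArticleText());
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Keep the first $maxWords words and mark truncation with an ellipsis.
    $snippet = implode(' ', array_slice($words, 0, $maxWords));
    if (count($words) > $maxWords) {
        $snippet .= '...';
    }

    // The top image may be missing, so guard before reading its source URL.
    $image = $article->getTopImage();
    $imageUrl = ($image !== null && method_exists($image, 'getImageSrc'))
        ? $image->getImageSrc()
        : null;

    return [
        'title'   => $article->getTitle(),
        'snippet' => $snippet,
        'image'   => $imageUrl,
    ];
}
```

With a real client this might be called as `buildPreview($goose->extractContent('http://url.to/article'))`. Truncating by words rather than bytes avoids cutting a multi-byte character in half.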

Configuration

All configuration options are optional. The default (fallback) values are shown below.

use \Goose\Client as GooseClient;

$goose = new GooseClient([
    // Language - Selects common word dictionary
    //   Supported languages (ISO 639-1):
    //     ar, cs, da, de, en, es, fi, fr, hu, id, it, ja,
    //     ko, nb, nl, no, pl, pt, ru, sv, vi, zh
    'language' => 'en',
    // Minimum image size (bytes)
    'image_min_bytes' => 4500,
    // Maximum image size (bytes)
    'image_max_bytes' => 5242880,
    // Minimum image width (pixels)
    'image_min_width' => 120,
    // Minimum image height (pixels)
    'image_min_height' => 120,
    // Fetch best image
    'image_fetch_best' => true,
    // Fetch all images
    'image_fetch_all' => false,
    // Guzzle configuration - All values are passed directly to Guzzle
    //   See: http://guzzle.readthedocs.io/en/stable/request-options.html
    'browser' => [
        'timeout' => 60,
        'connect_timeout' => 30
    ]
]);
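
When several sources need slightly different settings, one option is to keep a shared base array and layer per-source overrides on top of it before constructing the client. The helper below is a sketch under that assumption: `gooseConfig()` is a hypothetical name, and the base values are copied from the defaults listed above.

```php
/**
 * Merge per-source overrides onto a shared base configuration.
 * The base values are copied from the documented defaults.
 */
function gooseConfig(array $overrides = []): array
{
    $base = [
        'language' => 'en',
        'image_fetch_best' => true,
        'image_fetch_all' => false,
        'browser' => [
            'timeout' => 60,
            'connect_timeout' => 30,
        ],
    ];

    // array_replace_recursive keeps nested keys (such as
    // browser.connect_timeout) that the override does not mention.
    return array_replace_recursive($base, $overrides);
}

// A German-language source with a shorter request timeout.
$config = gooseConfig([
    'language' => 'de',
    'browser' => ['timeout' => 10],
]);

// The merged array can then be passed to the client:
// $goose = new \Goose\Client($config);
```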

Licensing

PHP Goose is licensed by Gravity.com under the Apache 2.0 license. See the LICENSE file for more details.
