fivefilters / Ftr Site Config

Licence: other
Site-specific article extraction rules to aid content extractors, feed readers, and 'read later' applications.

Projects that are alternatives of or similar to Ftr Site Config

Markup
A Swift package for working with HTML, XML, and other markup languages, based on libxml2.
Stars: ✭ 93 (-59.74%)
Mutual labels:  xpath
Goxpath
An XPath 1.0 implementation written in the Go programming language.
Stars: ✭ 148 (-35.93%)
Mutual labels:  xpath
Zson
A JSON parser built for testers
Stars: ✭ 181 (-21.65%)
Mutual labels:  xpath
Graphquery
GraphQuery is a query language and execution engine tied to any backend service.
Stars: ✭ 112 (-51.52%)
Mutual labels:  xpath
Cssplus
CSSplus is a collection of CSS Reprocessor plugins that dynamically update CSS variables
Stars: ✭ 141 (-38.96%)
Mutual labels:  xpath
Didom
Simple and fast HTML and XML parser
Stars: ✭ 1,939 (+739.39%)
Mutual labels:  xpath
Internettools
XPath/XQuery 3.1 interpreter for Pascal with compatibility modes for XPath 2.0/XQuery 1.0/3.0, custom and JSONiq extensions, XML/HTML parsers and classes for HTTP/S requests
Stars: ✭ 82 (-64.5%)
Mutual labels:  xpath
Nokogiri
HTML parser for PHP
Stars: ✭ 214 (-7.36%)
Mutual labels:  xpath
Xsltdev.ru
A web developer's reference with examples
Stars: ✭ 148 (-35.93%)
Mutual labels:  xpath
Jquery Xpath
jQuery XPath plugin (with full XPath 2.0 language support)
Stars: ✭ 173 (-25.11%)
Mutual labels:  xpath
Docs
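Source code for "Data Collection: From Getting Started to Giving Up". Contents: introduction to web crawlers, the job market, and crawler engineer interview questions; introduction to the HTTP protocol; using Requests; introduction to the XPath parser; MongoDB and MySQL; multi-threaded crawlers; introduction to Scrapy; introduction to Scrapy-redis; deploying with Docker; managing a Docker cluster with Nomad; querying Docker logs with EFK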
Stars: ✭ 118 (-48.92%)
Mutual labels:  xpath
Harser
Easy way for HTML parsing and building XPath
Stars: ✭ 135 (-41.56%)
Mutual labels:  xpath
Xquery
Extract data or evaluate value from HTML/XML documents using XPath
Stars: ✭ 155 (-32.9%)
Mutual labels:  xpath
Pythonstudy
Python related technologies used in work: crawler, data analysis, timing tasks, RPC, page parsing, decorator, built-in functions, Python objects, multi-threading, multi-process, asynchronous, redis, mongodb, mysql, openstack, etc.
Stars: ✭ 103 (-55.41%)
Mutual labels:  xpath
Xmlquery
xmlquery is Golang XPath package for XML query.
Stars: ✭ 209 (-9.52%)
Mutual labels:  xpath
Domquery
PHP library for easy 'jQuery like' DOM traversing and manipulation.
Stars: ✭ 84 (-63.64%)
Mutual labels:  xpath
Html Agility Pack
Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
Stars: ✭ 2,014 (+771.86%)
Mutual labels:  xpath
Pugixml
Light-weight, simple and fast XML parser for C++ with XPath support
Stars: ✭ 2,809 (+1116.02%)
Mutual labels:  xpath
Xembly
Assembly for XML: imperative language to modify XML documents
Stars: ✭ 212 (-8.23%)
Mutual labels:  xpath
Astpath
A command-line search utility for Python ASTs using XPath syntax.
Stars: ✭ 167 (-27.71%)
Mutual labels:  xpath

Full-Text RSS site config files

Full-Text RSS, our article extraction tool, makes use of site-specific extraction rules to improve results. Each time a URL is processed, it checks whether there are extraction rules for that site. If no rules are found, it tries to detect the content block automatically.

This repository contains the site-specific extraction rules we rely on in Full-Text RSS.

Contributing changes

We run automated tests on these files to detect issues. If you'd like to help keep these up to date, please look at the test results and see which files you'd like to contribute fixes for.

We chose GitHub for this set of files because they offer one feature which we hope will make contributing changes easier: file editing through the web interface.

You can now make changes to any of our site config files and request that your changes be pulled into the main set we maintain. This is what GitHub calls the Fork and Pull model:

The Fork & Pull Model lets anyone fork an existing repository and push changes to their personal fork without requiring access be granted to the source repository. The changes must then be pulled into the source repository by the project maintainer. This model reduces the amount of friction for new contributors and is popular with open source projects because it allows people to work independently without upfront coordination.

When we receive a pull request, we'll review the changes and, if everything's okay, update our copy.

If a site is not in our set, you can create a file for it in the same way. See Creating files on GitHub.

How to write a site config file

The quickest and simplest way is to use our point-and-click interface. It's a simple tool intended only to create a rule that extracts the correct content block.

For further refinements, e.g. selecting the title, stripping elements, dealing with multi-page articles, please see our help page.

File naming

Use example.com.txt for

  • example.com
  • www.example.com

Use .example.com.txt for

  • sport.example.com
  • news.example.com
  • environment.example.com
  • etc.

Use sport.example.com.txt to target just that sub-domain:

  • sport.example.com

Note: .example.com.txt will not match www.example.com or example.com
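
The naming rules above can be sketched in a few lines of Python. This is only an illustration of the rules as described here, not the lookup code Full-Text RSS actually uses. The function names candidate_config_files and pick_config_file are hypothetical. The sketch assumes that example.com.txt also serves www.example.com (as the note implies), that an exact sub-domain file takes priority over a wildcard file, and that a wildcard file also covers deeper sub-domains.

```python
from typing import Iterable, List, Optional


def candidate_config_files(host: str) -> List[str]:
    """Return config file names to try for `host`, most specific first."""
    host = host.lower().rstrip(".")
    # Assumption: a leading "www." is ignored, so www.example.com
    # is served by example.com.txt.
    if host.startswith("www."):
        host = host[len("www."):]
    candidates = [host + ".txt"]  # exact match, e.g. sport.example.com.txt
    labels = host.split(".")
    # Wildcard files cover sub-domains only: drop leading labels one at a
    # time, but never reduce to a bare top-level label such as ".com.txt".
    for i in range(1, len(labels) - 1):
        candidates.append("." + ".".join(labels[i:]) + ".txt")
    return candidates


def pick_config_file(host: str, available: Iterable[str]) -> Optional[str]:
    """Return the first candidate present in `available`, or None."""
    available = set(available)
    for name in candidate_config_files(host):
        if name in available:
            return name
    return None  # no site rules; a caller would fall back to auto-detection


if __name__ == "__main__":
    files = {"example.com.txt", ".example.com.txt", "sport.example.com.txt"}
    for h in ("www.example.com", "sport.example.com", "news.example.com"):
        print(h, "->", pick_config_file(h, files))
```

With these assumptions, www.example.com resolves to example.com.txt, sport.example.com to its own file, and news.example.com to the wildcard file .example.com.txt. A host with no matching file returns None, which corresponds to the case where the content block has to be detected automatically.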

Instapaper

When we introduced site patterns, we chose to adopt the same format used by Instapaper. This allows us to make use of the existing extraction rules contributed by Instapaper users.

Marco, Instapaper's creator, graciously opened up the database of contributions to everyone:

And, recognizing that your efforts could be useful to a wide range of other tools and services, I'll make the list of all of these site-specific configurations available to the public, free, with no strings attached.

Most of the extraction rules in our set are borrowed from Instapaper. The list maintained by Instapaper used to be published at instapaper.com/bodytext/ (no longer available since Instapaper was sold).

Testing site config files

Currently, you need a copy of Full-Text RSS to test changes to the site config files. We will try to make this process easier in the future.
