
DBeath / feedsearch-crawler

License: MIT
Crawl sites for RSS, Atom, and JSON feeds.


Projects that are alternatives to or similar to feedsearch-crawler

Linkedin Learning Downloader
Linkedin Learning videos downloader
Stars: ✭ 171 (+643.48%)
Mutual labels:  scraping, aiohttp, asyncio
Easy Scraping Tutorial
Simple but useful Python web scraping tutorial code.
Stars: ✭ 583 (+2434.78%)
Mutual labels:  scraping, crawling, asyncio
pomp
Screen scraping and web crawling framework
Stars: ✭ 61 (+165.22%)
Mutual labels:  scraping, crawling, asyncio
Elixir Scrape
Scrape any website, article or RSS/Atom Feed with ease!
Stars: ✭ 306 (+1230.43%)
Mutual labels:  rss, scraping, feed
tomodachi
💻 Microservice library / framework using Python's asyncio event loop with full support for HTTP + WebSockets, AWS SNS+SQS, RabbitMQ / AMQP, middleware, etc. Extendable for GraphQL, protobuf, gRPC, among other technologies.
Stars: ✭ 170 (+639.13%)
Mutual labels:  aiohttp, asyncio
proxycrawl-python
ProxyCrawl Python library for scraping and crawling
Stars: ✭ 51 (+121.74%)
Mutual labels:  scraping, crawling
Cyca
Web-based bookmarks and feeds manager
Stars: ✭ 15 (-34.78%)
Mutual labels:  feed, feeds
overflow-news
📚 Don't waste time searching for good dev blog posts. Get the latest news here.
Stars: ✭ 32 (+39.13%)
Mutual labels:  rss, feed
pytest-aiohttp
pytest plugin for aiohttp support
Stars: ✭ 110 (+378.26%)
Mutual labels:  aiohttp, asyncio
docker-ttrss
Tiny Tiny RSS feed reader as a Docker image.
Stars: ✭ 55 (+139.13%)
Mutual labels:  rss, feeds
asyncio-socks-server
A SOCKS proxy server implemented with the powerful python cooperative concurrency framework asyncio.
Stars: ✭ 154 (+569.57%)
Mutual labels:  pypi, asyncio
python3-concurrency
Theoretical verification for a Python 3 crawler series. First studies I/O models, implementing TCP servers and clients in Python under the blocking I/O, nonblocking I/O, and I/O multiplexing models; then compares the efficiency of synchronous I/O approaches (sequential download, multiprocess concurrency, multithread concurrency) against asynchronous I/O (asyncio).
Stars: ✭ 49 (+113.04%)
Mutual labels:  aiohttp, asyncio
python-logi-circle
Python 3.6+ API for Logi Circle cameras
Stars: ✭ 23 (+0%)
Mutual labels:  aiohttp, asyncio
netunnel
A tool to create network tunnels over HTTP/S written in Python 3
Stars: ✭ 19 (-17.39%)
Mutual labels:  aiohttp, asyncio
duckpy
A simple Python library for searching on DuckDuckGo.
Stars: ✭ 20 (-13.04%)
Mutual labels:  pypi, asyncio
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Stars: ✭ 52 (+126.09%)
Mutual labels:  scraping, crawling
go-scrapy
Web crawling and scraping framework for Golang
Stars: ✭ 17 (-26.09%)
Mutual labels:  scraping, crawling
awesome-rss-feeds
Awesome RSS feeds - A curated list of RSS feeds (and OPML files) used in Recommended Feeds and local news sections of Plenary - an RSS reader, article downloader and a podcast player app for android
Stars: ✭ 114 (+395.65%)
Mutual labels:  rss, feed
aiohttp-mako
mako template renderer for aiohttp.web
Stars: ✭ 32 (+39.13%)
Mutual labels:  aiohttp, asyncio
scrapy-distributed
A series of distributed components for Scrapy. Including RabbitMQ-based components, Kafka-based components, and RedisBloom-based components for Scrapy.
Stars: ✭ 38 (+65.22%)
Mutual labels:  scraping, crawling

Feedsearch Crawler


Feedsearch Crawler is a Python library for searching websites for RSS, Atom, and JSON feeds.

It is a continuation of my work on Feedsearch, which is itself a continuation of the work done by Dan Foreman-Mackey on Feedfinder2, which in turn is based on feedfinder - originally written by Mark Pilgrim and subsequently maintained by Aaron Swartz until his untimely death.

Feedsearch Crawler differs from all of the above in that it is built as an asynchronous Web crawler for Python 3.7 and above, using asyncio and aiohttp, which allows much more rapid scanning of possible feed URLs.

An implementation using this library to provide a public Feed Search API is available at https://feedsearch.dev

Pull requests and suggestions are welcome.

Installation

The library is available on PyPI:

pip install feedsearch-crawler

The library requires Python 3.7+.

Usage

Feedsearch Crawler is called with a single function, search:

>>> from feedsearch_crawler import search
>>> feeds = search('xkcd.com')
>>> feeds
[FeedInfo('https://xkcd.com/rss.xml'), FeedInfo('https://xkcd.com/atom.xml')]
>>> feeds[0].url
URL('https://xkcd.com/rss.xml')
>>> str(feeds[0].url)
'https://xkcd.com/rss.xml'
>>> feeds[0].serialize()
{'url': 'https://xkcd.com/rss.xml', 'title': 'xkcd.com', 'version': 'rss20', 'score': 24, 'hubs': [], 'description': 'xkcd.com: A webcomic of romance and math humor.', 'is_push': False, 'self_url': '', 'favicon': 'https://xkcd.com/s/919f27.ico', 'content_type': 'text/xml; charset=UTF-8', 'bozo': 0, 'site_url': 'https://xkcd.com/', 'site_name': 'xkcd: Chernobyl', 'favicon_data_uri': '', 'content_length': 2847}

If you are already running in an asyncio event loop, then you can import and await search_async instead. The search function is only a wrapper that runs search_async in a new asyncio event loop.

from feedsearch_crawler import search_async

feeds = await search_async('xkcd.com')
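
From a standalone script, search_async can be driven with asyncio.run, which is available in Python 3.7+. A minimal sketch (the feed URL is illustrative):

import asyncio

from feedsearch_crawler import search_async

async def main():
    # Await the crawl without blocking other tasks in the event loop.
    feeds = await search_async('xkcd.com')
    for feed in feeds:
        print(feed.url)

asyncio.run(main())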

A search will always return a list of FeedInfo objects, each of which will always have a url property: a URL object that can be converted to a string with str(url). The returned FeedInfo objects are sorted by score from highest to lowest, with a higher score theoretically indicating a feed more relevant to the originally provided URL. A FeedInfo can also be serialized to a JSON-compatible dictionary by calling its .serialize() method.
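
Because .serialize() returns a plain dictionary, an entire result list can be dumped to JSON with the standard library. A minimal sketch:

import json

from feedsearch_crawler import search

feeds = search('xkcd.com')
# Each serialized FeedInfo is a JSON-compatible dict, so the list dumps cleanly.
print(json.dumps([feed.serialize() for feed in feeds], indent=2))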

The crawl logs can be accessed with:

import logging

logger = logging.getLogger("feedsearch_crawler")
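
The logger follows standard Python logging conventions, so attaching a handler is enough to see crawl output. A sketch using only the standard library (the log format is illustrative):

import logging

logger = logging.getLogger("feedsearch_crawler")
logger.setLevel(logging.DEBUG)
# Send the crawl logs to the console.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)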

Feedsearch Crawler also provides a handy function to output the returned feeds as an OPML subscription list, encoded as a UTF-8 bytestring.

from feedsearch_crawler import output_opml

output_opml(feeds).decode()
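
Since output_opml returns a UTF-8 bytestring, it can also be written directly to a file opened in binary mode. A minimal sketch (the filename is illustrative):

from feedsearch_crawler import output_opml

# feeds is the list returned by search() above.
# Write the OPML subscription list to disk without decoding it first.
with open("feeds.opml", "wb") as f:
    f.write(output_opml(feeds))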

Search Arguments

search and search_async take the following arguments:

search(
    url: Union[URL, str, List[Union[URL, str]]],
    crawl_hosts: bool=True,
    try_urls: Union[List[str], bool]=False,
    concurrency: int=10,
    total_timeout: Union[float, aiohttp.ClientTimeout]=10,
    request_timeout: Union[float, aiohttp.ClientTimeout]=3,
    user_agent: str="Feedsearch Bot",
    max_content_length: int=1024 * 1024 * 10,
    max_depth: int=10,
    headers: dict={"X-Custom-Header": "Custom Header"},
    favicon_data_uri: bool=True,
    delay: float=0
)
  • url: Union[URL, str, List[Union[URL, str]]]: The initial URL or list of URLs at which to search for feeds. Both str and URL objects are accepted.
  • crawl_hosts: bool: (default True): An optional argument to add the site host origin URL to the list of initial crawl URLs. (e.g. add "example.com" if crawling "example.com/path/rss.xml"). If False, site metadata and favicon data may not be found.
  • try_urls: Union[List[str], bool]: (default False): An optional list of URL paths to query for feeds. Takes the origins of the url parameter and appends the provided paths. If no list is provided, but try_urls is True, then a list of common feed locations will be used.
  • concurrency: int: (default 10): An optional argument to specify the maximum number of concurrent HTTP requests.
  • total_timeout: float: (default 10.0): An optional argument to specify the total time this function may run before timing out.
  • request_timeout: float: (default 3.0): An optional argument that controls how long before each individual HTTP request times out.
  • user_agent: str: An optional argument to override the default User-Agent header.
  • max_content_length: int: (default 10MB): An optional argument to specify the maximum size in bytes of each HTTP Response.
  • max_depth: int: (default 10): An optional argument to limit the maximum depth of requests while following URLs.
  • headers: dict: An optional dictionary of headers to pass to each HTTP request.
  • favicon_data_uri: bool: (default True): Optionally control whether to fetch found favicons and return them as a Data URI.
  • delay: float: (default 0.0): An optional argument to delay each HTTP request by the specified time in seconds. Used in conjunction with the concurrency setting to avoid overloading sites.
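
Putting several of these arguments together, a politer crawl of a slower site might look like the following sketch (all values are illustrative):

from feedsearch_crawler import search

feeds = search(
    'example.com',
    try_urls=True,     # also query a built-in list of common feed locations
    concurrency=5,     # at most 5 concurrent HTTP requests
    delay=0.5,         # wait half a second between requests
    total_timeout=30,  # allow the whole search up to 30 seconds
    user_agent="MyFeedBot/1.0",  # hypothetical User-Agent override
)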

FeedInfo Values

In addition to the url, FeedInfo objects may have the following values:

  • bozo: int: Set to 1 when feed data is not well formed or may not be a feed. Defaults to 0.
  • content_length: int: Current length of the feed in bytes.
  • content_type: str: Content-Type value of the returned feed.
  • description: str: Feed description.
  • favicon: URL: URL of feed or site Favicon.
  • favicon_data_uri: str: Data URI of the favicon.
  • hubs: List[str]: List of the feed's WebSub hubs, if available.
  • is_podcast: bool: True if the feed contains valid podcast elements and enclosures.
  • is_push: bool: True if the feed contains valid WebSub data.
  • item_count: int: Number of items currently in the feed.
  • last_updated: datetime: Date of the latest published entry.
  • score: int: Computed relevance of feed url value to provided URL. May be safely ignored.
  • self_url: URL: rel="self" value returned from the feed's links. In some cases may be different from the feed url.
  • site_name: str: Name of feed's website.
  • site_url: URL: URL of feed's website.
  • title: str: Feed Title.
  • url: URL: URL location of feed.
  • velocity: float: Mean number of items per day in the feed at the current time.
  • version: str: Feed version XML values, or JSON feed.
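
As a quick illustration of these fields, the following sketch prints a small report for each result using only the documented attributes:

from feedsearch_crawler import search

feeds = search('xkcd.com')
for feed in feeds:
    # score, version, title, and url are all documented FeedInfo values.
    print(f"{feed.score:>4}  {feed.version:<8}  {feed.title}  {feed.url}")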