scrapehero / selectorlib

License: MIT
A library that reads a YAML file of XPath or CSS selectors and uses them to extract data from HTML pages

Projects that are alternatives to or similar to selectorlib

Parsel
Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
Stars: ✭ 628 (+1084.91%)
Mutual labels:  scraping, xpath
Sqrape
Simple Query Scraping with CSS and Go Reflection (MOVED to Gitlab)
Stars: ✭ 144 (+171.7%)
Mutual labels:  scraping, web-scraping
Webhere
HTML scraping for Objective-C.
Stars: ✭ 16 (-69.81%)
Mutual labels:  scraping, xpath
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (+422.64%)
Mutual labels:  scraping, web-scraping
reapr
🕸→ℹ️ Reap Information from Websites
Stars: ✭ 14 (-73.58%)
Mutual labels:  web-scraping, xpath
Autoscraper
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
Stars: ✭ 4,077 (+7592.45%)
Mutual labels:  scraping, web-scraping
Humanoid
Node.js package to bypass CloudFlare's anti-bot JavaScript challenges
Stars: ✭ 88 (+66.04%)
Mutual labels:  scraping, web-scraping
top-github-scraper
Scrape top GitHub repositories and users based on keywords
Stars: ✭ 40 (-24.53%)
Mutual labels:  scraping, web-scraping
Scrape Linkedin Selenium
`scrape_linkedin` is a Python package that lets you scrape personal LinkedIn profiles and company pages, turning the data into structured JSON.
Stars: ✭ 239 (+350.94%)
Mutual labels:  scraping, web-scraping
Xquery
Extract data or evaluate value from HTML/XML documents using XPath
Stars: ✭ 155 (+192.45%)
Mutual labels:  scraping, xpath
Apify Js
Apify SDK — The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.
Stars: ✭ 3,154 (+5850.94%)
Mutual labels:  scraping, web-scraping
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Stars: ✭ 711 (+1241.51%)
Mutual labels:  scraping, web-scraping
raspagem-de-dados-fatec
📓 Mini-course on web data scraping with Python, taught at the Technology Week of FATEC Jundiaí
Stars: ✭ 22 (-58.49%)
Mutual labels:  scraping, web-scraping
Scrapple
A framework for creating semi-automatic web content extractors
Stars: ✭ 464 (+775.47%)
Mutual labels:  scraping, web-scraping
papercut
Papercut is a scraping/crawling library for Node.js built on top of JSDOM. It provides basic selector features together with features like Page Caching and Geosearch.
Stars: ✭ 15 (-71.7%)
Mutual labels:  scraping, web-scraping
Detect Cms
PHP Library for detecting CMS
Stars: ✭ 78 (+47.17%)
Mutual labels:  scraping, web-scraping
codechef-rank-comparator
Web application hosted on the Heroku cloud platform, based on web scraping in Python using the lxml library (XPath, the XML Path Language).
Stars: ✭ 23 (-56.6%)
Mutual labels:  web-scraping, xpath
browser-pool
A Node.js library to easily manage and rotate a pool of web browsers, using any of the popular browser automation libraries like Puppeteer, Playwright, or SecretAgent.
Stars: ✭ 71 (+33.96%)
Mutual labels:  scraping, web-scraping
Phpscraper
PHP Scraper - a highly opinionated web-interface for PHP
Stars: ✭ 148 (+179.25%)
Mutual labels:  scraping, web-scraping
PythonScrapyBasicSetup
Basic setup with random user agents and IP addresses for Python Scrapy Framework.
Stars: ✭ 57 (+7.55%)
Mutual labels:  scraping, web-scraping

selectorlib

A library that reads a YAML file of XPath or CSS selectors and uses them to extract data from HTML pages

Example

>>> from selectorlib import Extractor
>>> yaml_string = """
    title:
        css: "h1"
        type: Text
    link:
        css: "h2 a"
        type: Link
    """
>>> extractor = Extractor.from_yaml_string(yaml_string)
>>> html = """
    <h1>Title</h1>
    <h2>Usage
        <a class="headerlink" href="http://test">¶</a>
    </h2>
    """
>>> extractor.extract(html)
{'title': 'Title', 'link': 'http://test'}
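
To make the idea behind the example concrete, here is a minimal standard-library sketch of how a field-to-selector mapping can drive extraction. It is not selectorlib's implementation: the `FIELDS` dict and the `extract` function are hypothetical names introduced for illustration, the selectors use the limited XPath subset supported by `xml.etree.ElementTree`, and the markup is assumed to be well-formed XML wrapped in a single root element (real-world HTML usually needs a forgiving parser).

```python
import xml.etree.ElementTree as ET

# Hypothetical field config mirroring the YAML above, written as a plain dict.
# Each field names a selector and a type that decides what to return.
FIELDS = {
    "title": {"xpath": ".//h1", "type": "Text"},
    "link": {"xpath": ".//h2/a", "type": "Link"},
}


def extract(markup, fields):
    """Apply each field's selector to the markup and collect the results."""
    root = ET.fromstring(markup)  # assumes well-formed markup with one root
    result = {}
    for name, spec in fields.items():
        node = root.find(spec["xpath"])  # first matching element, or None
        if node is None:
            result[name] = None
        elif spec["type"] == "Link":
            result[name] = node.get("href")  # a link field yields the href
        else:
            # A text field yields the element's text, including nested text
            result[name] = "".join(node.itertext()).strip()
    return result


html = (
    "<div><h1>Title</h1>"
    '<h2>Usage <a class="headerlink" href="http://test">¶</a></h2></div>'
)
data = extract(html, FIELDS)
print(data)
```

Keeping the selectors in data rather than in code is the design choice the library's YAML file reflects: the extraction rules can change without touching the program that applies them.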