danieldotnl / ha-multiscrape

Licence: MIT license
Home Assistant custom component for scraping (html, xml or json) multiple values (from a single HTTP request) with a separate sensor/attribute for each value. Support for (login) form-submit functionality.

Projects that are alternatives of or similar to ha-multiscrape

diffbot-php-client
[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library
Stars: ✭ 53 (-48.54%)
Mutual labels:  scraper, scraping, scrape
scrapers
scrapers for building your own image databases
Stars: ✭ 46 (-55.34%)
Mutual labels:  scraper, scraping, scrape
Autoscraper
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
Stars: ✭ 4,077 (+3858.25%)
Mutual labels:  scraper, scraping, scrape
readability-cli
A CLI for Mozilla Readability. Get clean, uncluttered, ready-to-read HTML from any webpage!
Stars: ✭ 41 (-60.19%)
Mutual labels:  scraping, scrape
google-scraper
This class can retrieve search results from Google.
Stars: ✭ 33 (-67.96%)
Mutual labels:  scraper, scraping
Pahe.ph-Scraper
Pahe.ph [Pahe.in] Movies Website Scraper
Stars: ✭ 57 (-44.66%)
Mutual labels:  scraper, scraping
Goose Parser
Universal scrapping tool, which allows you to extract data using multiple environments
Stars: ✭ 211 (+104.85%)
Mutual labels:  scraper, scraping
homeassistant-afvalwijzer
Provides sensors for some Dutch waste collectors
Stars: ✭ 119 (+15.53%)
Mutual labels:  sensor, hacs
crawler-chrome-extensions
爬虫工程师常用的 Chrome 插件 | Chrome extensions used by crawler developer
Stars: ✭ 53 (-48.54%)
Mutual labels:  scraper, scraping
stweet
Advanced python library to scrap Twitter (tweets, users) from unofficial API
Stars: ✭ 287 (+178.64%)
Mutual labels:  scraper, scrape
fansly
Simply scrape / download all the media from an fansly account
Stars: ✭ 351 (+240.78%)
Mutual labels:  scraper, scrape
balboa homeassistan
Balboa spa integration for home-assistant
Stars: ✭ 21 (-79.61%)
Mutual labels:  hacs, home-assistant-custom
Anniversaries
Anniversary Countdown Sensor for Home Assistant
Stars: ✭ 128 (+24.27%)
Mutual labels:  sensor, hacs
Scrape Linkedin Selenium
`scrape_linkedin` is a python package that allows you to scrape personal LinkedIn profiles & company pages - turning the data into structured json.
Stars: ✭ 239 (+132.04%)
Mutual labels:  scraper, scraping
midea-ac-py
This is a library to allow communicating to a Midea appliance via the Midea cloud.
Stars: ✭ 72 (-30.1%)
Mutual labels:  hacs, home-assistant-custom
Scrapysharp
reborn of https://bitbucket.org/rflechner/scrapysharp
Stars: ✭ 226 (+119.42%)
Mutual labels:  scraper, scraping
gochanges
**[ARCHIVED]** website changes tracker 🔍
Stars: ✭ 12 (-88.35%)
Mutual labels:  scraper, scraping
Jsonframe Cheerio
simple multi-level scraper json input/output for Cheerio
Stars: ✭ 196 (+90.29%)
Mutual labels:  scraper, scraping
Colly
Elegant Scraper and Crawler Framework for Golang
Stars: ✭ 15,535 (+14982.52%)
Mutual labels:  scraper, scraping
Indego
Home Assistant Custom Component for Bosch Indego Lawn Mower
Stars: ✭ 42 (-59.22%)
Mutual labels:  sensor, hacs

HA Multiscrape

Important note: troubleshooting

If you cannot scrape the value you are looking for, enable debug logging and log_response. Together they provide a lot of information for further investigation. log_response writes all responses to files. If the value you want to scrape is not in the files containing the BeautifulSoup output (*-soup.txt), Multiscrape will not be able to scrape it. Most likely the value is retrieved in the background by JavaScript. Your best chance in that case is to inspect the network traffic in the developer tools of your browser and look for a JSON response containing the value.

If none of this helps, use the Home Assistant forum. I cannot give everyone personal assistance, so please don't create GitHub issues unless you are sure there is a bug. Check the wiki for a scraping guide and other details on the functionality of this component.

HA MultiScrape custom component

This Home Assistant custom component can scrape multiple fields (using CSS selectors) from a single HTTP request (the existing scrape sensor can scrape a single field only). The scraped data becomes available in separate sensors.

It is based on both the existing Rest sensor and the Scrape sensor. Most properties of the Rest and Scrape sensor apply.

Installation

Install via HACS (default store) or install manually by copying the files into a new 'custom_components/multiscrape' directory.

Example configuration (YAML)

multiscrape:
  - resource: https://www.home-assistant.io
    scan_interval: 3600
    sensor:
      - unique_id: ha_latest_version
        name: Latest version
        select: ".current-version > h1:nth-child(1)"
        value_template: '{{ (value.split(":")[1]) }}'
      - unique_id: ha_release_date
        icon: >-
          {% if is_state('binary_sensor.ha_version_check', 'on') %}
            mdi:alarm-light
          {% else %}
            mdi:bat
          {% endif %}
        name: Release date
        select: ".release-date"
    binary_sensor:
      - unique_id: ha_version_check
        name: Latest version == 2021.7.0
        select: ".current-version > h1:nth-child(1)"
        value_template: '{{ (value.split(":")[1]) | trim == "2021.7.0" }}'
        attributes:
          - name: Release notes link
            select: "div.links:nth-child(3) > a:nth-child(1)"
            attribute: href

Options

Based on latest (pre) release.

| name | description | required | default | type |
|---|---|---|---|---|
| name | The name for the integration. | False | | string |
| resource | The URL for retrieving the site, or a template that outputs a URL. Not required when resource_template is provided. | True | | string |
| resource_template | A template that outputs a URL after being rendered. Only required when resource is not provided. | True | | template |
| authentication | Configure HTTP authentication: basic or digest. Use this with the username and password fields. | False | | string |
| username | The username for accessing the URL. | False | | string |
| password | The password for accessing the URL. | False | | string |
| headers | The headers for the requests. | False | | string - list |
| params | The query params for the requests. | False | | string - list |
| method | The method for the request. Either POST or GET. | False | GET | string |
| payload | Optional payload to send with a POST request. | False | | string |
| verify_ssl | Verify the SSL certificate of the endpoint. | False | True | boolean |
| log_response | Log the HTTP responses and the HTML parsed by BeautifulSoup to files. Files are written to /config/multiscrape/name_of_config. | False | False | boolean |
| timeout | Maximum time to wait for data from the endpoint. | False | 10 | int |
| scan_interval | Determines how often the URL will be requested. | False | 60 | int |
| parser | Determines the parser to be used with BeautifulSoup. Either lxml or html.parser. | False | lxml | string |
| form_submit | See Form-submit | False | | |
| sensor | See Sensor | False | | list |
| binary_sensor | See Binary sensor | False | | list |
| button | See Refresh button | False | | list |
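
As an illustration of how several of these request options combine, here is a minimal sketch. The URL, header value, query parameter and selector are hypothetical placeholders, and writing headers and params as key/value mappings is an assumption based on the Rest sensor this component builds on.

```yaml
multiscrape:
  - name: Example scraper                  # hypothetical name
    resource: https://example.com/status   # placeholder URL
    method: GET
    headers:
      User-Agent: Mozilla/5.0              # assumed mapping format (as in the Rest sensor)
    params:
      lang: en                             # hypothetical query parameter
    timeout: 30                            # max time to wait for data
    scan_interval: 600                     # how often the URL is requested
    parser: html.parser                    # lxml is the default
    verify_ssl: true
    sensor:
      - unique_id: example_status
        name: Example status
        select: "#status"                  # hypothetical selector
```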

Sensor/Binary Sensor

Configure the sensors that will scrape the data.

| name | description | required | default | type |
|---|---|---|---|---|
| unique_id | Will be used as entity_id and enables editing the entity in the UI | False | | string |
| name | Friendly name for the sensor | False | | string |
| select | CSS selector used for retrieving the value of the sensor. Only required when select_list is not provided. | True | | string/template |
| select_list | CSS selector for multiple values of multiple elements, which will be returned as csv. Only required when select is not provided. | True | | string/template |
| attribute | Attribute from the selected element to read as value | False | | string |
| value_template | Defines a template applied to the result of the selector to extract the value. For binary sensors, the sensor is on if the template evaluates to True | False | | string/template |
| attributes | See Sensor attributes | False | | list |
| unit_of_measurement | Defines the unit of measurement of the sensor | False | | string |
| device_class | Sets the device_class for sensors or binary sensors | False | | string |
| state_class | Defines the state class of the sensor, if any (measurement, total or total_increasing). Not for binary_sensor. | False | None | string |
| icon | Defines the icon or a template for the icon of the sensor. The value of the selector is provided as input for the template. For binary sensors, the value is parsed to a boolean. | False | | string/template |
| picture | Contains a path to a local image and will set it as entity picture | False | | string |
| force_update | Sends update events even if the value hasn't changed. Useful if you want to have meaningful value graphs in history. | False | False | boolean |
| on_error | See On-error | False | | |
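
The difference between select and select_list can be sketched as follows. The URL and selectors are hypothetical, and the numeric conversion in value_template is only one possible way to clean up a scraped value.

```yaml
multiscrape:
  - resource: https://example.com/prices   # placeholder URL
    scan_interval: 900
    sensor:
      # Single value: the selector is expected to match one element.
      - unique_id: example_current_price
        name: Current price
        select: "#price"                   # hypothetical selector
        value_template: "{{ value | trim | float(0) }}"
        unit_of_measurement: "EUR"
        state_class: measurement
      # Multiple values: all matching elements are returned as csv.
      - unique_id: example_headlines
        name: Headlines
        select_list: "ul.news > li"        # hypothetical selector
```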

Refresh button

Configure a refresh button to manually trigger scraping.

| name | description | required | default | type |
|---|---|---|---|---|
| unique_id | Will be used as entity_id and enables editing the entity in the UI | False | | string |
| name | Friendly name for the button | False | | string |
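
A minimal sketch of a refresh button next to a sensor, reusing the resource and selector from the example configuration above. The unique_id and name values are hypothetical.

```yaml
multiscrape:
  - name: HA scraper
    resource: https://www.home-assistant.io
    scan_interval: 3600
    button:
      - unique_id: ha_scraper_refresh      # hypothetical id
        name: Refresh HA scraper
    sensor:
      - unique_id: ha_latest_version
        name: Latest version
        select: ".current-version > h1:nth-child(1)"
```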

Sensor attributes

Configure the attributes on the sensor that can be set with additional scraping values.

| name | description | required | default | type |
|---|---|---|---|---|
| name | Name of the attribute (will be slugified) | True | | string |
| select | CSS selector used for retrieving the value of the attribute. Only required when select_list is not provided. | True | | string/template |
| select_list | CSS selector for multiple values of multiple elements, which will be returned as csv. Only required when select is not provided. | True | | string/template |
| attribute | Attribute from the selected element to read as value | False | | string |
| value_template | Defines a template applied to the result of the selector to extract the value | False | | string/template |
| on_error | See On-error | False | | |

Form-submit

Configure the form-submit functionality which enables you to submit a (login) form before scraping a site. More details on how this works can be found on the wiki.

| name | description | required | default | type |
|---|---|---|---|---|
| resource | The URL for the site with the form | False | | string |
| select | CSS selector used for selecting the form in the HTML | True | | string |
| input | A dictionary with name/values which will be merged with the input fields on the form | False | | string - dictionary |
| input_filter | A list of input fields that should not be submitted with the form | False | | string - list |
| submit_once | Submit the form only once on startup instead of each scan interval | False | False | boolean |
| resubmit_on_error | Resubmit the form after a scraping error is encountered | False | True | boolean |
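
A login flow could be sketched like this. The URLs, form selector, input field names and credentials are hypothetical placeholders; the actual field names depend on the form of the site being scraped.

```yaml
multiscrape:
  - resource: https://example.com/page_with_data   # page scraped after login (placeholder)
    scan_interval: 3600
    form_submit:
      resource: https://example.com/login          # page containing the form (placeholder)
      select: "#loginform"                         # hypothetical form selector
      input:
        username: my_user                          # merged with the form's own input fields
        password: my_password
      input_filter:
        - remember_me                              # hypothetical field to leave out
      submit_once: true                            # log in only once on startup
      resubmit_on_error: true                      # log in again after a scraping error
    sensor:
      - unique_id: example_account_balance
        name: Account balance
        select: ".balance"                         # hypothetical selector
```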

On-error

Configure what should happen in case of a scraping error (the css selector does not return a value).

| name | description | required | default | type |
|---|---|---|---|---|
| log | Determines if and how something should be logged in case of a scraping error. Value can be either 'false', 'info', 'warning' or 'error'. | False | error | string |
| value | Determines what value the sensor/attribute should get in case of a scraping error. The value can be 'last', meaning that the value does not change, 'none', which results in HA showing 'Unknown' on the sensor, or 'default', which will show the specified default value. | False | none | string |
| default | The default value to be used when the on-error value is set to 'default'. | False | | string |
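
The three on_error options can be combined as in the sketch below, applied to a sensor and to one of its attributes. The URL, selectors and the default value are hypothetical.

```yaml
multiscrape:
  - resource: https://example.com/status   # placeholder URL
    sensor:
      - unique_id: example_temperature
        name: Example temperature
        select: ".temperature"             # hypothetical selector
        on_error:
          log: warning                     # log scraping errors as warnings
          value: default                   # fall back to the default below
          default: "0"
        attributes:
          - name: Last updated
            select: ".updated"             # hypothetical selector
            on_error:
              log: "false"                 # do not log errors for this attribute
              value: last                  # keep the previous value
```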

Services

For each multiscrape instance, a service will be created to trigger a scrape run through an automation. (For manual triggering, the button entity can now be configured.) The services are named multiscrape.trigger_{name of integration}.
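
As a sketch, an automation that triggers a scrape run at the start of every hour might look like this. It assumes a multiscrape instance configured with name "HA scraper" and assumes the name is slugified to ha_scraper in the service name; check the Developer Tools in Home Assistant for the actual service name.

```yaml
automation:
  - alias: "Trigger multiscrape every hour"
    trigger:
      - platform: time_pattern
        minutes: 0                             # fires at minute 0 of every hour
    action:
      - service: multiscrape.trigger_ha_scraper   # assumed service name
```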

Debug logging

Debug logging can be enabled as follows:

logger:
  default: info
  logs:
    custom_components.multiscrape: debug

Depending on your issue, also consider enabling log_response.
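
For instance, log_response can be switched on for a single multiscrape instance as sketched below, reusing the resource and selector from the example configuration. The instance name is hypothetical.

```yaml
multiscrape:
  - name: HA scraper                       # hypothetical name
    resource: https://www.home-assistant.io
    log_response: true                     # write responses and BeautifulSoup output to files
    sensor:
      - unique_id: ha_latest_version
        name: Latest version
        select: ".current-version > h1:nth-child(1)"
```

According to the options table, the files end up under /config/multiscrape/ in a directory named after the configuration.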

Contributions are welcome!

If you want to contribute to this project, please read the Contribution guidelines.

Credits

This project was generated from @oncleben31's Home Assistant Custom Component Cookiecutter template.

Code template was mainly taken from @Ludeeus's integration_blueprint template
