
kamome283 / AngleParse

Licence: Apache-2.0 license
HTML parsing and processing tool for PowerShell.

AngleParse

Easy-to-use HTML parsing and processing tool for PowerShell.

# Articles from PowerShell dev blog
iwr 'https://devblogs.microsoft.com/powershell/' |
Select-HtmlContent 'div.entry-box.container', @{
  Title = 'h5.entry-title'
  Author = 'span.entry-author-link', ([regex]'(\w+) \w+')
  PostDate = 'span.entry-post-date', { [DateTime]::Parse($_) }
} | select -first 3

# PostDate           Title                                                   Author
# --------           -----                                                   ------
# 2020/07/07 0:00:00 PowerShellGet 3.0 Preview 6 Release                     Sydney
# 2020/06/26 0:00:00 Native Commands in PowerShell – A New Approach – Part 2 Jim
# 2020/06/22 0:00:00 Native Commands in PowerShell – A New Approach          Jim

Package

https://www.powershellgallery.com/packages/AngleParse

Before Use

  1. Install-Package AngleParse
  2. Import-Module AngleParse

Usage

gc ./foobar.html -raw | Select-HtmlContent ([AngleParse.Attr]::Class)

The Select-HtmlContent command receives string content from the pipeline, interprets it as an HTML DOM tree, and then processes it with the selectors given as the command's first argument.

About Selector

There are 5 kinds of selectors: CSS selector, attribute selector, regex selector, scriptblock selector, and hashtable selector. Every selector receives one input and outputs zero or more items. When you specify multiple selectors, they are chained like a PowerShell pipeline, with each selector processing the output of the previous one.

'<div><span>abc</span></div>' | Select-HtmlContent "div > span", ([regex]'a(bc)')
# bc

# Works similarly to the following.
'<div><span>abc</span></div>' |
Select-HtmlContent "div > span" |
% { $_ | Select-HtmlContent ([regex]'a(bc)') }
# bc

When a selector outputs a single item, the result is unwrapped from its array, matching PowerShell's default behaviour, for ease of use.

iwr "https://b.hatena.ne.jp/" | Select-HtmlContent "div.entrylist-contents", @{ 
    Title = "h3.entrylist-contents-title > a"
    Tags = "a[rel=tag]"
} | select -first 1 | Format-List

# Title contains only one string item, so the array is unwrapped.
# Title : Go To トラベル 感染を広げないためには(忽那賢志) - 個人 - Y...
# Tags  : {COVID-19, 旅行, 社会, 医療…}    
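
The unwrapping described above mirrors how PowerShell itself treats command output. The following sketch uses plain PowerShell only (no AngleParse call), and Get-Sample is a hypothetical helper defined just for this illustration. It shows that a single output item is stored as a bare value, while several items become an array.

```powershell
# Plain PowerShell, no AngleParse required.
# Get-Sample is a hypothetical helper used only for this illustration.
function Get-Sample {
    param([int]$Count)
    1..$Count | ForEach-Object { "item$_" }
}

$single = Get-Sample -Count 1     # one output item: stored as a bare string
$many   = Get-Sample -Count 3     # several items: collected into an array
$forced = @(Get-Sample -Count 1)  # @() keeps a one-element array

$single.GetType().Name   # String
$many.GetType().Name     # Object[]
$forced.Count            # 1
```

If downstream code always expects an array, wrapping the value in @( ) is the usual PowerShell idiom.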

CSS Selector

'<div><span class="foo">text content here</span></div>' | Select-HtmlContent "div > span.foo"
# text content here

A string is interpreted as a CSS selector. This selector receives a DOM element and outputs the DOM elements that match the given selector.

Attribute Selector

'<a href="https://foo.go.jp">bar</a>' | Select-HtmlContent ([AngleParse.Attr]::Href)
# https://foo.go.jp

An enum value of AngleParse.Attr is interpreted as an attribute selector. The following 11 attributes are available.

Element
InnerHtml
OuterHtml
TextContent
Id
Class
SplitClasses
Href
Src
Title
Name

This selector receives a DOM element and outputs the matched attribute as a string. The exception is the Element attribute, which is introduced in the ScriptBlock Selector section. The SplitClasses attribute outputs the classes split on spaces, whereas the Class attribute outputs the attribute value unchanged.

# Class
'<div class="foo bar">' | Select-HtmlContent ([AngleParse.Attr]::Class)
# foo bar

# SplitClasses
'<div class="foo bar">' | Select-HtmlContent ([AngleParse.Attr]::SplitClasses)
# foo
# bar

Regex Selector

A regex value is interpreted as a regex selector. This selector receives a DOM element or a string and outputs the captured strings. When you pass a DOM element, the selector matches against the element's inner text content.

# The day part is not captured, so the outputs are the year and the month.
'<span>2020/07/22</span>' | Select-HtmlContent ([regex]'(\d{4})/(\d{2})/\d{2}')
# 2020
# 07
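
To see why only the year and month come out, it may help to look at the capture groups directly. The sketch below uses the plain .NET regex API from PowerShell, not AngleParse. It is an analogy for the behaviour described above and makes no claim about how the selector is implemented.

```powershell
# Plain .NET regex, no AngleParse required; an analogy only.
$pattern = [regex]'(\d{4})/(\d{2})/\d{2}'
$match = $pattern.Match('2020/07/22')

# Groups[0] is the whole match; the numbered groups after it are the captures.
$captures = $match.Groups | Select-Object -Skip 1 | ForEach-Object { $_.Value }

$captures
# 2020
# 07
```

Wrapping the final \d{2} in parentheses would add the day as a third capture.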

ScriptBlock Selector

'<span>7</span>' | Select-HtmlContent { [int]$_ * 6; [int]$_ * 7 }
# 42
# 49

A ScriptBlock is interpreted as a scriptblock selector. This selector receives any kind of object and outputs the evaluated objects. The passed object is bound to $_. When you pass a DOM element to this selector, the element is implicitly converted to its inner text content, which is a string. If you do not want this conversion, put the Element attribute selector before the scriptblock selector.

'<div><span>a</span><span>b</span></div>' | 
Select-HtmlContent ([AngleParse.Attr]::Element), { $_.ChildElementCount }
# 2

Hashtable Selector

'<div class="a">1a</div><div class="b">2b</div>' |
Select-HtmlContent "> div",
@{ Class = ([AngleParse.Attr]::Class);
   NumPlus1 = ([regex]'(\d)\w'), { [int]$_ + 1 } }

# Class NumPlus1
# ----- --------
# a            2
# b            3

A hashtable is interpreted as a hashtable selector. Each value of the hashtable must be a valid selector or list of selectors. This selector processes the input with the selectors given for each key and binds the result to the corresponding key.
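
Because each key becomes a column in the output, the results can be filtered with ordinary PowerShell commands. The sketch below does not call AngleParse. It builds stand-in rows with [pscustomobject] that mirror the table above. Treating the real output as objects with one property per key is an assumption, based only on how the table is displayed.

```powershell
# Stand-in rows mirroring the table above; no AngleParse call is made here.
# Assumption: the real output exposes one property per hashtable key.
$rows = @(
    [pscustomobject]@{ Class = 'a'; NumPlus1 = 2 }
    [pscustomobject]@{ Class = 'b'; NumPlus1 = 3 }
)

# Keep rows whose NumPlus1 exceeds 2, then read their Class value.
$selected = $rows | Where-Object { $_.NumPlus1 -gt 2 } | ForEach-Object { $_.Class }

$selected
# b
```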

Other Resources

PowerShellから簡単にスクレイピングするためのツールを作った ("I made a tool for easy scraping from PowerShell", article in Japanese)

Special Thanks To

and all the support.
