aantron / Lambdasoup

License: MIT
Functional HTML scraping and rewriting with CSS in OCaml

Projects that are alternatives to or similar to Lambdasoup

TorScrapper
A Scraper made 100% in Python using BeautifulSoup and Tor. It can be used to scrape both normal and onion links. Happy Scraping :)
Stars: ✭ 24 (-91.43%)
Mutual labels:  scraping
raspagem-de-dados-fatec
📓 Short course on web data scraping with Python, taught at the FATEC Jundiaí Technology Week
Stars: ✭ 22 (-92.14%)
Mutual labels:  scraping
facebook-discussion-tk
A collection of tools to (semi-)automatically collect and analyze data from online discussions on Facebook groups and pages.
Stars: ✭ 33 (-88.21%)
Mutual labels:  scraping
Babler
Data Collection System For NLP/Speech Recognition
Stars: ✭ 21 (-92.5%)
Mutual labels:  scraping
python-overwatch
A simple API for scraping Overwatch stats
Stars: ✭ 14 (-95%)
Mutual labels:  scraping
policy-data-analyzer
Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.
Stars: ✭ 22 (-92.14%)
Mutual labels:  scraping
whatsapp-tracking
Scraping the status of WhatsApp contacts
Stars: ✭ 49 (-82.5%)
Mutual labels:  scraping
Apify Js
Apify SDK: the scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.
Stars: ✭ 3,154 (+1026.43%)
Mutual labels:  scraping
memes-api
API for scraping common meme sites
Stars: ✭ 17 (-93.93%)
Mutual labels:  scraping
jazz
The Scripting Engine that Combines Speed, Safety, and Simplicity
Stars: ✭ 132 (-52.86%)
Mutual labels:  scraping
PyLex
Perform lexical analysis on words, one word at a time.
Stars: ✭ 60 (-78.57%)
Mutual labels:  scraping
webdext
Intelligent Web Data Extractor
Stars: ✭ 75 (-73.21%)
Mutual labels:  scraping
bots-zoo
No description or website provided.
Stars: ✭ 59 (-78.93%)
Mutual labels:  scraping
Zeiver
A Scraper, Downloader, & Recorder for static open directories.
Stars: ✭ 14 (-95%)
Mutual labels:  scraping
instagram explorer
📷 An app to scrape Instagram posts and analyze data.
Stars: ✭ 17 (-93.93%)
Mutual labels:  scraping
papercut
Papercut is a scraping/crawling library for Node.js built on top of JSDOM. It provides basic selector features together with features like Page Caching and Geosearch.
Stars: ✭ 15 (-94.64%)
Mutual labels:  scraping
scraper
Nodejs web scraper. Contains a command line, docker container, terraform module and ansible roles for distributed cloud scraping. Supported databases: SQLite, MySQL, PostgreSQL. Supported headless clients: Puppeteer, Playwright, Cheerio, JSdom.
Stars: ✭ 37 (-86.79%)
Mutual labels:  scraping
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (-1.07%)
Mutual labels:  scraping
schedule-tweet
Schedules tweets using TweetDeck
Stars: ✭ 14 (-95%)
Mutual labels:  scraping
ARGUS
ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of different websites. On the websites, ARGUS is able to perform tasks like scraping texts or collecting hyperlinks between websites. See: https://link.springer.com/article/10.1007/s11192-020-03726-9
Stars: ✭ 68 (-75.71%)
Mutual labels:  scraping

Lambda Soup

Lambda Soup is a functional HTML scraping and manipulation library for OCaml aimed at being easy to use.

[GIF: Lambda Soup usage example]

Lambda Soup is simple. It provides a set of elementary traversals for getting from node to node, familiar functional combinators such as filter, map, and fold, and support for all CSS selectors that still make sense when not running in a browser (and a few obvious extensions on top of that).

Here is a trivial self-contained example:

(parse "<p class='Hello'>World!</p>") $ ".Hello" |> R.leaf_text;;
- : string = "World!"

And a mutation:

let soup = parse "<p class='Hello'>World!</p>" in
wrap (soup $ ".Hello" |> R.child) (create_element "strong");
soup |> to_string;;
- : string = "<p class=\"Hello\"><strong>World!</strong></p>"
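The combinators mentioned above can be combined with CSS selection in the same style. The following is a minimal sketch, not taken from the Lambda Soup documentation: the HTML fragment and the names hrefs, li_count, and classed are made up for illustration, and the file is assumed to be built against the lambdasoup package.

```ocaml
open Soup

(* A small, made-up document to query. *)
let soup =
  parse
    "<ul>\
       <li class='item'><a href='/a'>Alpha</a></li>\
       <li class='item'><a href='/b'>Beta</a></li>\
       <li>Gamma</li>\
     </ul>"

(* Select with a CSS selector, force the lazy node sequence into a list,
   and read the href attribute of each link. *)
let hrefs =
  soup $$ "li.item a"
  |> to_list
  |> List.map (R.attribute "href")

(* fold: count every <li> element in the document. *)
let li_count = soup $$ "li" |> fold (fun n _ -> n + 1) 0

(* filter: keep only the <li> elements that carry a class attribute. *)
let classed =
  soup $$ "li"
  |> filter (fun li -> attribute "class" li <> None)
  |> count
```

Here $$ returns a lazy sequence of matching nodes, so nothing is traversed until to_list, fold, or count consumes it. The R module holds the variants that raise instead of returning an option, which keeps toplevel experiments short.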

For some more examples, see the Lambda Soup postprocessor that runs on Lambda Soup's own documentation after it is generated by ocamldoc.

The library is tested thoroughly.

Lambda Soup is based on Markup.ml. As a consequence, it resolves entity references, detects character encodings automatically, and converts everything to UTF-8. You can also use Lambda Soup on XML by parsing the XML with Markup.ml and feeding the resulting signals to Lambda Soup.
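As a rough sketch of the XML route, the snippet below parses a string with Markup.ml and hands the signal stream to Soup.from_signals. The XML content and the names xml, soup, and titles are invented for illustration, and the code assumes both the markup and lambdasoup packages are available to the build.

```ocaml
open Soup

(* A made-up XML document; any well-formed XML should do. *)
let xml =
  "<feed><entry id='1'>First</entry><entry id='2'>Second</entry></feed>"

(* Parse as XML with Markup.ml, then build a Lambda Soup tree
   from the resulting stream of signals. *)
let soup =
  Markup.string xml
  |> Markup.parse_xml
  |> Markup.signals
  |> from_signals

(* From here on, the usual selectors and traversals apply. *)
let titles = soup $$ "entry" |> to_list |> List.map R.leaf_text
```

The only difference from the HTML case is the parsing step: once the tree is built, the same selectors and combinators are used on it.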


Installing

opam install lambdasoup

Starting from scratch

To use Lambda Soup interactively as in the GIF at the top of this README, you need to have done something like this:

your-package-manager install ocaml opam
opam init
eval `opam config env`          # Or restart your shell
opam install lambdasoup

and make sure your ~/.ocamlinit file looks something like this:

let () =
  try Topdirs.dir_directory (Sys.getenv "OCAML_TOPLEVEL_PATH")
  with Not_found -> ()
;;

#use "topfind";;

Then, run ocaml -short-paths to start the top-level, and scrape away!


Depending

Lambda Soup uses semantic versioning, but is currently in 0.x.x. For now, the minor version number will be incremented on breaking changes. So, to give yourself a chance to review the changelog before your code breaks, put the following constraint on Lambda Soup: lambdasoup {< "0.7.0"}.


Documentation

Lambda Soup's interface consists of one module Soup, whose signature is documented here.


Developing

See CONTRIBUTING. All feedback is welcome – open an issue on GitHub, or send me an email at [email protected]. If you find yourself repeatedly writing the same helper on top of Lambda Soup's functions, perhaps we should add it to Lambda Soup.


History

Lambda Soup was originally written to answer a Stack Overflow question in November 2015.