Scraper
This project provides a web scraping library built around the JavaFX WebEngine, which in turn is built on top of WebKit. The goal of this project is to provide a robust and easy-to-use web scraper that doesn't require an external binary in order to function. With the introduction of Java 8, this is finally beginning to seem feasible.
If you find this code useful in any way, please feel free to...
Usage
It's still early days: this project hasn't reached the point where we're releasing builds of the library. Still, you can check out the project, build it yourself and add it to your own project's dependencies.
[com.nervestaple/scraper "0.1.0-SNAPSHOT"]
Probably more fun is to check out the project and then interact with it directly via the REPL.
$ cd scraper
$ lein repl
From there it's easy to get a handle on a WebEngine instance and scrape out some content.
user> (def we (scraper/get-web-engine))
#'user/we
user> (scraper/load-url we "http://twitch.nervestaple.com")
{:state :ready}
user> (scraper/load-artoo we)
{:state :ready}
user> (scraper/scrape we "h1" {:title "text"})
({"title" "Work In Progress"}
 {"title" "Linux Is All About Choices"}
 {"title" "Real Life Web App Integration Testing (IT) with Spring"}
 {"title" "Bishop: Makes Your Web Service Shiny"}
 {"title" "Why Is My Web Service API Crappy?"}
 {"title" "All Your HBase Are Belong to Clojure"})
As you can see in the example above, the Artoo.js JavaScript scraping library is injected into the loaded page in order to make your scraping easier. You are welcome! ;-)
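The scrape call above hands back a sequence of maps, and judging by that output the keys are plain strings rather than keywords. If all you want is the text itself, a small helper along these lines might do the trick. Note that titles is a hypothetical name and not part of this library. It's only a sketch, and it assumes every map looks like the ones shown above.

```clojure
;; Hypothetical helper, not part of the scraper library.
(defn titles
  "Returns the distinct \"title\" values from a sequence of scraped maps,
  in the order in which they first appear."
  [results]
  (->> results
       (map #(get % "title"))  ; keys appear to be strings, not keywords
       (remove nil?)           ; skip maps that have no "title" entry
       distinct                ; drop headings that show up more than once
       vec))

(titles [{"title" "Work In Progress"}
         {"title" "Linux Is All About Choices"}
         {"title" "Work In Progress"}])
;; => ["Work In Progress" "Linux Is All About Choices"]
```

At the REPL you could then wrap the scrape call directly, as in (titles (scraper/scrape we "h1" {:title "text"})).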
If you're interested in being able to see the content that your WebEngine instance is loading, you can get a handle on a WebView instead. This will bring up a new window displaying the WebView.
user> (def wv (scraper/get-web-view))
#'user/wv
user> (def we (:web-engine wv))
#'user/we
Work on the project continues, but this should be enough to get you started.