
platonai / Pulsar

License: Apache-2.0
Turn large Web sites into tables and charts using simple SQLs.

Projects that are alternatives of or similar to Pulsar

Sillynium
Automate the creation of Python Selenium Scripts by drawing coloured boxes on webpage elements
Stars: ✭ 100 (+0%)
Mutual labels:  web-scraping, selenium
OLX Scraper
📻 An OLX scraper using Scrapy + MongoDB. It scrapes recently posted ads for a requested product and dumps them into a NoSQL MongoDB database.
Stars: ✭ 15 (-85%)
Mutual labels:  web-crawler, web-scraping
Trump Lies
Tutorial: Web scraping in Python with Beautiful Soup
Stars: ✭ 201 (+101%)
Mutual labels:  data-science, web-scraping
Dat8
General Assembly's 2015 Data Science course in Washington, DC
Stars: ✭ 1,516 (+1416%)
Mutual labels:  data-science, web-scraping
Ache
ACHE is a web crawler for domain-specific search.
Stars: ✭ 320 (+220%)
Mutual labels:  web-scraping, web-crawler
Learnpythonforresearch
This repository provides everything you need to get started with Python for (social science) research.
Stars: ✭ 163 (+63%)
Mutual labels:  data-science, web-scraping
WaWebSessionHandler
(DISCONTINUED) Save WhatsApp Web Sessions as files and open them everywhere!
Stars: ✭ 27 (-73%)
Mutual labels:  selenium, web-scraping
Zillow
Zillow Scraper for Python using Selenium
Stars: ✭ 141 (+41%)
Mutual labels:  web-scraping, selenium
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (+177%)
Mutual labels:  web-scraping, web-crawler
Stock-Fundamental-data-scraping-and-analysis
Project on building a web crawler to collect the fundamentals of the stock and review their performance in one go
Stars: ✭ 40 (-60%)
Mutual labels:  selenium, web-scraping
Scrape Linkedin Selenium
`scrape_linkedin` is a python package that allows you to scrape personal LinkedIn profiles & company pages - turning the data into structured json.
Stars: ✭ 239 (+139%)
Mutual labels:  web-scraping, selenium
Terpene Profile Parser For Cannabis Strains
Parser and database to index the terpene profile of different strains of Cannabis from online databases
Stars: ✭ 63 (-37%)
Mutual labels:  data-science, web-crawler
Selenium Python Helium
Selenium-python but lighter: Helium is the best Python library for web automation.
Stars: ✭ 2,732 (+2632%)
Mutual labels:  web-scraping, selenium
Web Database Analytics
Web scraping and related analytics using Python tools
Stars: ✭ 175 (+75%)
Mutual labels:  data-science, web-scraping
Bet On Sibyl
Machine Learning Model for Sport Predictions (Football, Basketball, Baseball, Hockey, Soccer & Tennis)
Stars: ✭ 190 (+90%)
Mutual labels:  web-scraping, selenium
WeReadScan
A crawler that scans books purchased on WeRead (微信读书) and downloads them as local PDFs
Stars: ✭ 273 (+173%)
Mutual labels:  web-crawler, selenium
30 Days Of Python
Learn Python for the next 30 (or so) Days.
Stars: ✭ 1,748 (+1648%)
Mutual labels:  web-scraping, selenium
SchweizerMesser
🎯 A collection of hands-on Python 3 web crawling and data analysis projects: Dangdang, NetEase Cloud Music, Unsplash, Pizza Hut, Maoyan, and more
Stars: ✭ 89 (-11%)
Mutual labels:  web-crawler, selenium
Spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Stars: ✭ 656 (+556%)
Mutual labels:  web-scraping, web-crawler
Splashr
💦 Tools to Work with the 'Splash' JavaScript Rendering Service in R
Stars: ✭ 93 (-7%)
Mutual labels:  web-scraping, selenium

Pulsar README

Pulsar focuses on web data processing. It extends SQL to handle the entire life cycle of web data: crawling, web scraping, data mining, BI, and more.

Other language

Chinese

product-screenshot

Features

  • X-SQL: eXtends SQL to manage web data: crawling, web scraping, data mining, BI, etc.
  • Web spider: browser rendering, AJAX, scheduling, page scoring, monitoring, distributed, high performance, indexing by Solr/Elasticsearch
  • BI Integration: turn Web sites into tables and charts using just one simple SQL
  • Big data: large scale, various storage: HBase/MongoDB

For more information check out platon.ai

X-SQL

Crawl and scrape a single page:

select
    dom_text(dom) as title,
    dom_abs_href(dom) as link
from
    load_and_select('https://en.wikipedia.org/wiki/topology', '.references a.external');

The SQL above downloads a web page from Wikipedia, finds the References section, and extracts all external reference links.
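The `dom_abs_href` call returns each link's absolute URL rather than the raw, possibly relative `href` attribute. Conceptually this is resolution against the page's base URI, which plain Java can sketch (the `absHref` helper below is a hypothetical illustration, not Pulsar's API):

```java
import java.net.URI;

public class AbsHref {
    // Resolve a possibly-relative href against the page's base URI --
    // conceptually what dom_abs_href does for each matched <a> element.
    static String absHref(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        // A relative link on the topology page becomes absolute:
        System.out.println(absHref("https://en.wikipedia.org/wiki/topology", "/wiki/Geometry"));
        // An already-absolute link is returned unchanged:
        System.out.println(absHref("https://en.wikipedia.org/wiki/topology", "https://example.com/paper"));
    }
}
```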

Crawl out pages from a portal and scrape each one:

select
    dom_first_text(dom, '.sku-name') as name,
    dom_first_number(dom, '.p-price .price', 0.00) as price,
    dom_first_number(dom, '#page_opprice', 0.00) as tag_price,
    dom_first_text(dom, '#comment-count .count') as comments,
    dom_first_text(dom, '#summary-service') as logistics,
    dom_base_uri(dom) as baseuri
from
    load_out_pages('https://list.jd.com/list.html?cat=652,12345,12349 -i 1s -ii 100d', 'a[href~=item]', 1, 100)
where
    dom_first_number(dom, '.p-price .price', 0.00) > 0
order by
    dom_first_number(dom, '.p-price .price', 0.00);

The SQL above visits a portal page on jd.com, downloads the detail pages it links to, and then scrapes data from each of them.
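The second argument, `a[href~=item]`, selects which out-links to follow: anchor elements whose `href` matches the regular expression `item`. A plain-Java sketch of that filtering step (a simplified stand-in for the CSS-selector engine, not Pulsar's implementation):

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class OutLinkFilter {
    // Keep only the hrefs matching the given regex, mirroring how an
    // a[href~=regex] selector narrows a portal page's links to item pages.
    static List<String> filter(List<String> hrefs, String regex) {
        Pattern p = Pattern.compile(regex);
        return hrefs.stream()
                .filter(h -> p.matcher(h).find())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> links = List.of(
                "https://item.jd.com/100.html",   // detail page: kept
                "https://list.jd.com/list.html",  // portal page: dropped
                "https://item.jd.com/200.html");  // detail page: kept
        System.out.println(filter(links, "item"));
    }
}
```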

You can clone the Pulsar source and run the SQLs yourself, or run them from our online demo.

Check sql-history.sql to see more example SQLs. All SQL functions can be found under ai.platon.pulsar.ql.h2.udfs.

Use pulsar as a library

Scrape out pages from a portal URL using the native API. First add the Maven dependency to your project:

<dependency>
    <groupId>ai.platon.pulsar</groupId>
    <artifactId>pulsar-protocol</artifactId>
    <version>1.5.7-SNAPSHOT</version>
</dependency>

Then scrape web pages using the simple native API:

val url = "https://list.jd.com/list.html?cat=652,12345,12349"

val session = PulsarContexts.createSession()
session.scrapeOutPages(url,
            "-expires 1d -itemExpires 7d -outLink a[href~=item]",
            ".product-intro",
            listOf(".sku-name", ".p-price"))
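
The `-expires 1d -itemExpires 7d` options take compact duration strings: roughly, the cached portal page expires after one day and each item page after seven. A minimal sketch of how such number-plus-suffix durations map onto `java.time.Duration` (a hypothetical parser for illustration, not Pulsar's actual option parser):

```java
import java.time.Duration;

public class CompactDuration {
    // Parse compact durations like "1s", "30m", "1d", "100d" into a
    // java.time.Duration. Hypothetical helper, not Pulsar's option parser.
    static Duration parse(String s) {
        long n = Long.parseLong(s.substring(0, s.length() - 1));
        switch (s.charAt(s.length() - 1)) {
            case 's': return Duration.ofSeconds(n);
            case 'm': return Duration.ofMinutes(n);
            case 'h': return Duration.ofHours(n);
            case 'd': return Duration.ofDays(n);
            default:  throw new IllegalArgumentException("bad duration: " + s);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("1d"));   // the -expires value above
        System.out.println(parse("7d"));   // the -itemExpires value above
    }
}
```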

Scrape out pages from a portal URL using X-SQL. First add the Maven dependency to your project:

<dependency>
    <groupId>ai.platon.pulsar</groupId>
    <artifactId>pulsar-ql-server</artifactId>
    <version>1.5.7-SNAPSHOT</version>
</dependency>

Then scrape web pages using:

select
    dom_first_text(dom, '.sku-name') as name,
    dom_first_text(dom, '.p-price') as price
from
    load_out_pages('$url -i 1d -ii 7d', 'a[href~=item]')

Check out Tutorials for details.

Use pulsar as an X-SQL server

Once Pulsar runs in X-SQL server mode, the web can be queried just like a normal database. You can use our customized Metabase to write X-SQLs and turn web sites into tables and charts immediately. Everyone in your company can now ask questions and learn from web data for the first time.

Build & Run

Check & install dependencies

bin/tools/install-depends.sh

Install MongoDB

MongoDB is optional but recommended. If you skip this step, all data will be lost after Pulsar shuts down. On Ubuntu/Debian:

sudo apt install mongodb

Build from source

git clone https://github.com/platonai/pulsar.git
cd pulsar && mvn -DskipTests=true

Run the native api demo

bin/pulsar example ManualKt

Start pulsar server

bin/pulsar

Use sql console

bin/pulsar sql

Now you can execute any X-SQL from the command line.

Use web console

Open the web console at http://localhost:8082 in your favourite browser and enjoy playing with X-SQL.

Use Metabase

Metabase is the easy, open source way for everyone in your company to ask questions and learn from data. With X-SQL support, everyone can organize knowledge not just from the company's internal data, but also from the web.

git clone https://github.com/platonai/pulsar-metabase.git
cd pulsar-metabase
bin/build && bin/start

Enterprise Edition:

Pulsar Enterprise Edition supports Auto Web Mining: advanced machine learning that turns web sites into tables automatically, with no rules or training required. Here are some examples: Auto Web Mining Examples

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].