
platonai / Pulsar

License: Apache-2.0
Turn large Web sites into tables and charts using simple SQLs.

Projects that are alternatives of or similar to Pulsar

Sillynium
Automate the creation of Python Selenium Scripts by drawing coloured boxes on webpage elements
Stars: ✭ 100 (+0%)
Mutual labels:  web-scraping, selenium
OLX Scraper
📻 An OLX scraper using Scrapy + MongoDB. It scrapes recently posted ads for a requested product and dumps them into a NoSQL MongoDB database.
Stars: ✭ 15 (-85%)
Mutual labels:  web-crawler, web-scraping
Trump Lies
Tutorial: Web scraping in Python with Beautiful Soup
Stars: ✭ 201 (+101%)
Mutual labels:  data-science, web-scraping
Dat8
General Assembly's 2015 Data Science course in Washington, DC
Stars: ✭ 1,516 (+1416%)
Mutual labels:  data-science, web-scraping
Ache
ACHE is a web crawler for domain-specific search.
Stars: ✭ 320 (+220%)
Mutual labels:  web-scraping, web-crawler
Learnpythonforresearch
This repository provides everything you need to get started with Python for (social science) research.
Stars: ✭ 163 (+63%)
Mutual labels:  data-science, web-scraping
WaWebSessionHandler
(DISCONTINUED) Save WhatsApp Web Sessions as files and open them everywhere!
Stars: ✭ 27 (-73%)
Mutual labels:  selenium, web-scraping
Zillow
Zillow Scraper for Python using Selenium
Stars: ✭ 141 (+41%)
Mutual labels:  web-scraping, selenium
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (+177%)
Mutual labels:  web-scraping, web-crawler
Stock-Fundamental-data-scraping-and-analysis
Project on building a web crawler to collect the fundamentals of the stock and review their performance in one go
Stars: ✭ 40 (-60%)
Mutual labels:  selenium, web-scraping
Scrape Linkedin Selenium
`scrape_linkedin` is a python package that allows you to scrape personal LinkedIn profiles & company pages - turning the data into structured json.
Stars: ✭ 239 (+139%)
Mutual labels:  web-scraping, selenium
Terpene Profile Parser For Cannabis Strains
Parser and database to index the terpene profile of different strains of Cannabis from online databases
Stars: ✭ 63 (-37%)
Mutual labels:  data-science, web-crawler
Selenium Python Helium
Selenium-python but lighter: Helium is the best Python library for web automation.
Stars: ✭ 2,732 (+2632%)
Mutual labels:  web-scraping, selenium
Web Database Analytics
Web scraping and related analytics using Python tools
Stars: ✭ 175 (+75%)
Mutual labels:  data-science, web-scraping
Bet On Sibyl
Machine Learning Model for Sport Predictions (Football, Basketball, Baseball, Hockey, Soccer & Tennis)
Stars: ✭ 190 (+90%)
Mutual labels:  web-scraping, selenium
WeReadScan
A crawler that scans books purchased on WeRead (微信读书) and downloads them as local PDFs
Stars: ✭ 273 (+173%)
Mutual labels:  web-crawler, selenium
30 Days Of Python
Learn Python for the next 30 (or so) Days.
Stars: ✭ 1,748 (+1648%)
Mutual labels:  web-scraping, selenium
SchweizerMesser
🎯 A collection of hands-on Python 3 web crawling and data analysis projects: Dangdang, NetEase Cloud Music, Unsplash, Pizza Hut, Maoyan, and more
Stars: ✭ 89 (-11%)
Mutual labels:  web-crawler, selenium
Spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Stars: ✭ 656 (+556%)
Mutual labels:  web-scraping, web-crawler
Splashr
💦 Tools to Work with the 'Splash' JavaScript Rendering Service in R
Stars: ✭ 93 (-7%)
Mutual labels:  web-scraping, selenium

Pulsar README

Pulsar focuses on web data processing. It extends SQL to handle the entire life cycle of web data: crawling, web scraping, data mining, BI, and more.

Other language

Chinese

product-screenshot

Features

  • X-SQL: eXtends SQL to manage web data: crawling, web scraping, data mining, BI, etc.
  • Web spider: browser rendering, AJAX, scheduling, page scoring, monitoring, distributed, high performance, indexing by Solr/Elasticsearch
  • BI Integration: turn Web sites into tables and charts using just one simple SQL
  • Big data: large scale, various storage: HBase/MongoDB

For more information check out platon.ai

X-SQL

Crawl and scrape a single page:

select
    dom_text(dom) as title,
    dom_abs_href(dom) as link
from
    load_and_select('https://en.wikipedia.org/wiki/topology', '.references a.external');

The SQL above downloads a web page from Wikipedia, finds the References section, and extracts all external reference links.
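The `dom_abs_href` call returns each link's absolute URL rather than the raw, possibly relative `href` attribute. Conceptually this is resolution against the page's base URI, which plain Java can sketch (the `absHref` helper below is a hypothetical illustration, not Pulsar's API):

```java
import java.net.URI;

public class AbsHref {
    // Resolve a possibly-relative href against the page's base URI --
    // conceptually what dom_abs_href does for each matched <a> element.
    static String absHref(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        // A relative link on the topology page becomes absolute:
        System.out.println(absHref("https://en.wikipedia.org/wiki/topology", "/wiki/Geometry"));
        // An already-absolute link is returned unchanged:
        System.out.println(absHref("https://en.wikipedia.org/wiki/topology", "https://example.com/paper"));
    }
}
```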

Crawl out pages from a portal and scrape each one:

select
    dom_first_text(dom, '.sku-name') as name,
    dom_first_number(dom, '.p-price .price', 0.00) as price,
    dom_first_number(dom, '#page_opprice', 0.00) as tag_price,
    dom_first_text(dom, '#comment-count .count') as comments,
    dom_first_text(dom, '#summary-service') as logistics,
    dom_base_uri(dom) as baseuri
from
    load_out_pages('https://list.jd.com/list.html?cat=652,12345,12349 -i 1s -ii 100d', 'a[href~=item]', 1, 100)
where
    dom_first_number(dom, '.p-price .price', 0.00) > 0
order by
    dom_first_number(dom, '.p-price .price', 0.00);

The SQL above visits a portal page on jd.com, downloads the detail pages it links to, and then scrapes data from each of them.
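The second argument, `a[href~=item]`, selects which out-links to follow: anchor elements whose `href` matches the regular expression `item`. A plain-Java sketch of that filtering step (a simplified stand-in for the CSS-selector engine, not Pulsar's implementation):

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class OutLinkFilter {
    // Keep only the hrefs matching the given regex, mirroring how an
    // a[href~=regex] selector narrows a portal page's links to item pages.
    static List<String> filter(List<String> hrefs, String regex) {
        Pattern p = Pattern.compile(regex);
        return hrefs.stream()
                .filter(h -> p.matcher(h).find())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> links = List.of(
                "https://item.jd.com/100.html",   // detail page: kept
                "https://list.jd.com/list.html",  // portal page: dropped
                "https://item.jd.com/200.html");  // detail page: kept
        System.out.println(filter(links, "item"));
    }
}
```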

You can clone the Pulsar source and run the SQLs yourself, or run them from our online demo.

Check sql-history.sql to see more example SQLs. All SQL functions can be found under ai.platon.pulsar.ql.h2.udfs.

Use pulsar as a library

Scrape out pages from a portal URL using the native API. First add the Maven dependency to your project:

<dependency>
    <groupId>ai.platon.pulsar</groupId>
    <artifactId>pulsar-protocol</artifactId>
    <version>1.5.7-SNAPSHOT</version>
</dependency>

Then scrape web pages using the simple native API:

val url = "https://list.jd.com/list.html?cat=652,12345,12349"

val session = PulsarContexts.createSession()
session.scrapeOutPages(url,
            "-expires 1d -itemExpires 7d -outLink a[href~=item]",
            ".product-intro",
            listOf(".sku-name", ".p-price"))
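
The `-expires 1d -itemExpires 7d` options take compact duration strings: roughly, the cached portal page expires after one day and each item page after seven. A minimal sketch of how such number-plus-suffix durations map onto `java.time.Duration` (a hypothetical parser for illustration, not Pulsar's actual option parser):

```java
import java.time.Duration;

public class CompactDuration {
    // Parse compact durations like "1s", "30m", "1d", "100d" into a
    // java.time.Duration. Hypothetical helper, not Pulsar's option parser.
    static Duration parse(String s) {
        long n = Long.parseLong(s.substring(0, s.length() - 1));
        switch (s.charAt(s.length() - 1)) {
            case 's': return Duration.ofSeconds(n);
            case 'm': return Duration.ofMinutes(n);
            case 'h': return Duration.ofHours(n);
            case 'd': return Duration.ofDays(n);
            default:  throw new IllegalArgumentException("bad duration: " + s);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("1d"));   // the -expires value above
        System.out.println(parse("7d"));   // the -itemExpires value above
    }
}
```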

Scrape out pages from a portal URL using X-SQL. First add the Maven dependency to your project:

<dependency>
    <groupId>ai.platon.pulsar</groupId>
    <artifactId>pulsar-ql-server</artifactId>
    <version>1.5.7-SNAPSHOT</version>
</dependency>

Then scrape web pages using:

select
    dom_first_text(dom, '.sku-name') as name,
    dom_first_text(dom, '.p-price') as price
from
    load_out_pages('$url -i 1d -ii 7d', 'a[href~=item]')

Check out Tutorials for details.

Use pulsar as an X-SQL server

Once Pulsar runs in X-SQL server mode, the web can be queried just like a normal database. You can use our customized Metabase to write X-SQLs and turn web sites into tables and charts immediately. Everyone in your company can now ask questions and learn from web data for the first time.

Build & Run

Check & install dependencies

bin/tools/install-depends.sh

Install MongoDB

MongoDB is optional but recommended. If you skip this step, all data will be lost after Pulsar shuts down. On Ubuntu/Debian:

sudo apt install mongodb

Build from source

git clone https://github.com/platonai/pulsar.git
cd pulsar && mvn -DskipTests=true

Run the native api demo

bin/pulsar example ManualKt

Start pulsar server

bin/pulsar

Use sql console

bin/pulsar sql

Now you can execute any X-SQL from the command line.

Use web console

Open the web console at http://localhost:8082 in your favourite browser and enjoy playing with X-SQL.

Use Metabase

Metabase is the easy, open source way for everyone in your company to ask questions and learn from data. With X-SQL support, everyone can organize knowledge not just from the company's internal data, but also from the web.

git clone https://github.com/platonai/pulsar-metabase.git
cd pulsar-metabase
bin/build && bin/start

Enterprise Edition:

Pulsar Enterprise Edition supports Auto Web Mining: advanced machine learning that turns web sites into tables automatically, with no rules or training required. Here are some examples: Auto Web Mining Examples

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].