
sensiblecodeio / Scraperwiki Python

Licence: BSD-2-Clause
ScraperWiki Python library for scraping and saving data

Programming Languages

python

Projects that are alternatives to, or similar to, Scraperwiki Python

Headlesschrome
A Go package for working with headless Chrome. Run interactive JavaScript commands on web pages with Go and Chrome.
Stars: ✭ 112 (-23.29%)
Mutual labels:  scraper
Scraper
A scraper that switches between normal mode and gentleman mode, built on Electron and React
Stars: ✭ 127 (-13.01%)
Mutual labels:  scraper
Go Jd
Automatic JD.com login and automatic ordering of online products
Stars: ✭ 139 (-4.79%)
Mutual labels:  scraper
Ridereceipts
🚕 Simple automation desktop app to download and organize your receipts from Uber/Lyft. Try out our new Ride Receipts PRO!
Stars: ✭ 117 (-19.86%)
Mutual labels:  scraper
Arxivscraper
A Python module to scrape arxiv.org for a specific date range and categories
Stars: ✭ 121 (-17.12%)
Mutual labels:  scraper
Newspaper
News, full-text, and article metadata extraction in Python 3.
Stars: ✭ 11,545 (+7807.53%)
Mutual labels:  scraper
Google Play Scraper
Node.js scraper to get data from Google Play
Stars: ✭ 1,606 (+1000%)
Mutual labels:  scraper
Youtube Projects
This repository contains all the code I use in my YouTube tutorials.
Stars: ✭ 144 (-1.37%)
Mutual labels:  scraper
Mwoffliner
Scrape any online MediaWiki-powered wiki (like Wikipedia) to your local filesystem
Stars: ✭ 121 (-17.12%)
Mutual labels:  scraper
Bandcamp Scraper
A scraper for https://bandcamp.com
Stars: ✭ 137 (-6.16%)
Mutual labels:  scraper
Cum
comic updater, mangafied
Stars: ✭ 117 (-19.86%)
Mutual labels:  scraper
Youtube Comment Suite
Download YouTube comments from numerous videos, playlists, and channels for archiving, general search, and showing activity.
Stars: ✭ 120 (-17.81%)
Mutual labels:  scraper
Udemycoursegrabber
Your will to enroll in a Udemy course is here, but the money isn't? Search no more! This Python program searches for your desired course on more than [insert big number here] websites, compares their last-updated dates, and gives you the download link of the most recent one, with the option to see the others as well!
Stars: ✭ 137 (-6.16%)
Mutual labels:  scraper
Instagram Python Scraper
An Instagram scraper written in Python, similar to instagram-php-scraper. Usage examples are in example.py. Enjoy it!
Stars: ✭ 115 (-21.23%)
Mutual labels:  scraper
Zillow
Zillow Scraper for Python using Selenium
Stars: ✭ 141 (-3.42%)
Mutual labels:  scraper
Jobfunnel
Scrape job websites into a single spreadsheet with no duplicates.
Stars: ✭ 1,528 (+946.58%)
Mutual labels:  scraper
Proxyscrape
Python library for retrieving free proxies (HTTP, HTTPS, SOCKS4, SOCKS5).
Stars: ✭ 134 (-8.22%)
Mutual labels:  scraper
Google2csv
Google2Csv, a simple Google scraper that saves the results to a CSV/XLSX/JSONL file
Stars: ✭ 145 (-0.68%)
Mutual labels:  scraper
Google Play Scraper
Google play scraper for Python inspired by <facundoolano/google-play-scraper>
Stars: ✭ 143 (-2.05%)
Mutual labels:  scraper
Onegram
This repository is no longer maintained.
Stars: ✭ 137 (-6.16%)
Mutual labels:  scraper

ScraperWiki Python library
==========================

.. image:: https://travis-ci.org/scraperwiki/scraperwiki-python.png?branch=master
  :alt: Build Status
  :target: https://travis-ci.org/scraperwiki/scraperwiki-python

This is a Python library for scraping web pages and saving data. It is the easiest way to save data on the ScraperWiki platform, and it can also be used locally or on your own servers.

Installing
----------

::

  pip install scraperwiki

Scraping
--------

scraperwiki.scrape(url[, params][, user_agent])
  Returns the downloaded string from the given url.

  params are sent as a POST if set.

  user_agent sets the user-agent string if provided.
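A minimal sketch of fetching a page; the URL and user-agent string here are illustrative::

  import scraperwiki

  # Plain GET request; returns the page body as a string.
  html = scraperwiki.scrape("http://example.com/")

  # Setting params sends them as a POST instead.
  response = scraperwiki.scrape("http://example.com/search",
                                params={"q": "widgets"},
                                user_agent="my-scraper/0.1")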

Saving data
-----------

Helper functions for saving to and querying an SQL database. The schema is updated automatically according to the data you save.

Currently only SQLite is supported; a local SQLite database is created. It is based on `SQLAlchemy <https://pypi.python.org/pypi/SQLAlchemy>`_. You should expect it to support other SQL databases at a later date.

scraperwiki.sql.save(unique_keys, data[, table_name="swdata"])
  Saves a data record into the datastore, into the table given by table_name.

  data is a dict object with field names as keys; unique_keys is a subset of data.keys() which determines when a record is overwritten. For large numbers of records data can be a list of dicts.

  scraperwiki.sql.save is entitled to buffer an arbitrary number of rows until the next read via the ScraperWiki API, until an exception is hit, or until process exit; an effort is made to flush periodically in a timely way. Records can be lost if the process suffers a hard crash, a power outage, or a SIGKILL during an out-of-memory condition. The buffer can be flushed manually with scraperwiki.sql.flush().
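A minimal sketch of saving records; the field names and values are illustrative::

  import scraperwiki

  # "id" is the unique key: saving a record with an existing id
  # overwrites that row instead of inserting a duplicate.
  scraperwiki.sql.save(unique_keys=["id"],
                       data={"id": 1, "name": "Alice", "score": 10})

  # A list of dicts works for larger batches.
  scraperwiki.sql.save(["id"], [{"id": 2, "name": "Bob", "score": 8},
                                {"id": 3, "name": "Carol", "score": 9}])

  # Force any buffered rows to disk now rather than at process exit.
  scraperwiki.sql.flush()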

scraperwiki.sql.execute(sql[, vars])
  Executes any arbitrary SQL command, for example CREATE, DELETE, INSERT or DROP.

  vars is an optional list of parameters, inserted when the SQL command contains ‘?’s. For example::

    scraperwiki.sql.execute("INSERT INTO swdata VALUES (?,?,?)", [a, b, c])

  The ‘?’ convention is like "paramstyle qmark" from Python's `DB API 2.0 <http://www.python.org/dev/peps/pep-0249/>`_ (but note that the API to the datastore is nothing like Python's DB API). In particular the ‘?’ does not itself need quoting, and can in general only be used where a literal would appear. (Note that you cannot substitute in, for example, table or column names.)
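A short sketch of the pattern, creating a table explicitly and then inserting with parameters; the table and column names here are illustrative::

  import scraperwiki

  scraperwiki.sql.execute(
      "CREATE TABLE IF NOT EXISTS prices (item TEXT, price REAL)")
  scraperwiki.sql.execute(
      "INSERT INTO prices VALUES (?, ?)", ["apple", 0.50])
  scraperwiki.sql.commit()  # see scraperwiki.sql.commit() below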

scraperwiki.sql.select(sqlfrag[, vars])
  Executes a select command on the datastore. For example::

    scraperwiki.sql.select("* FROM swdata LIMIT 10")

  Returns a list of dicts that have been selected.

  vars is an optional list of parameters, inserted when the select command contains ‘?’s. This is like the same feature in the .execute command, above.
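For example, a parameterised select against the illustrative swdata rows saved earlier::

  rows = scraperwiki.sql.select("* FROM swdata WHERE score > ?", [8])
  for row in rows:
      print(row["name"])  # each row is a dict keyed by column name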

scraperwiki.sql.commit()
  Commits to the file after a series of execute commands. (sql.save auto-commits after every action.)

scraperwiki.sql.show_tables([dbname])
  Returns an array of tables and their schemas in the current database.

scraperwiki.sql.table_info(name)
  Returns an array of attributes for each element of the table.
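A quick sketch of inspecting the schema; "swdata" is the default table name used by sql.save::

  import scraperwiki

  print(scraperwiki.sql.show_tables())
  print(scraperwiki.sql.table_info("swdata"))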

scraperwiki.sql.save_var(key, value)
  Saves an arbitrary single value into a table called swvariables. Intended to store scraper state so that a scraper can continue after an interruption.

scraperwiki.sql.get_var(key[, default])
  Retrieves a single value that was saved by save_var. Only works for string, float, or int types. For anything else, use the `pickle library <http://docs.python.org/library/pickle.html>`_ to turn it into a string.
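A sketch of checkpointing scraper state so a rerun can resume; the variable name and page loop are illustrative::

  import scraperwiki

  # On the first run, before the variable exists, the default 0 is returned.
  last_page = scraperwiki.sql.get_var("last_page", 0)
  for page in range(last_page + 1, 101):
      # ... scrape page `page` and sql.save() its records ...
      scraperwiki.sql.save_var("last_page", page)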

Miscellaneous
-------------

scraperwiki.status(type, message=None)
  If run on the ScraperWiki platform (the new one, not Classic), updates the visible status of the dataset; if not on the platform, does nothing. type can be 'ok' or 'error'. If no message is given, it will show the time since the update. See the `dataset status API <https://scraperwiki.com/help/developer#boxes-status>`_ documentation for details.
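For example (these calls do nothing when run outside the ScraperWiki platform)::

  scraperwiki.status("ok")
  scraperwiki.status("error", "Source site unreachable")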

scraperwiki.pdftoxml(pdfdata)
  Converts a byte string containing a PDF file into XML giving the coordinates and font of each text string (see the `pdftohtml documentation <http://linux.die.net/man/1/pdftohtml>`_ for details). This requires pdftohtml, which is part of poppler-utils.
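A minimal sketch, assuming pdftohtml is installed and a local document.pdf exists::

  with open("document.pdf", "rb") as f:
      xml = scraperwiki.pdftoxml(f.read())
  print(xml[:500])  # position and font of each text string, as XML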

Environment Variables
---------------------

SCRAPERWIKI_DATABASE_NAME (default: scraperwiki.sqlite)
  Name of the database file.

SCRAPERWIKI_DATABASE_TIMEOUT (default: 300)
  Number of seconds the database will wait for a lock.
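For example, pointing a scraper at a different database file from the shell before running it; the file name, timeout value, and script name are illustrative::

  export SCRAPERWIKI_DATABASE_NAME=mydata.sqlite
  export SCRAPERWIKI_DATABASE_TIMEOUT=60
  python myscraper.py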
