scrapy / scurl

License: Apache-2.0
Performance-focused replacement for Python urllib

Programming Languages

Python
139,335 projects - #7 most used programming language
Makefile
30,231 projects

Projects that are alternatives of or similar to scurl

ungoogled-chromium-portable
🚀 Ungoogled Chromium portable for Windows
Stars: ✭ 96 (+433.33%)
Mutual labels:  chromium
cordova-plugin-x5-tbs
Use Tencent Browser Service (TBS) instead of System WebView for Cordova App
Stars: ✭ 65 (+261.11%)
Mutual labels:  chromium
node-headless-chrome
⚠️ 🚧 Install precompiled versions of the Chromium/Chrome headless shell using npm or yarn
Stars: ✭ 20 (+11.11%)
Mutual labels:  chromium
seb-win-refactoring
Safe Exam Browser for Windows.
Stars: ✭ 98 (+444.44%)
Mutual labels:  chromium
throughout
🎪 End-to-end testing made simple (using Jest and Puppeteer)
Stars: ✭ 16 (-11.11%)
Mutual labels:  chromium
ubuntu-vnc-xfce-chromium
Retired. Headless Ubuntu/Xfce container with VNC/noVNC and Chromium (Generation 1)
Stars: ✭ 20 (+11.11%)
Mutual labels:  chromium
LInkedIn-Reverese-Lookup
🔎 Search LinkedIn profiles by email address 📧
Stars: ✭ 20 (+11.11%)
Mutual labels:  chromium
chromium-all-old-stable-versions
Collection of all old/historical Chromium stable versions and releases. Support me via Bitcoin: bc1qqgkmph9cvygzxfpupv4jr4n0nfx3qumwg39j5w
Stars: ✭ 50 (+177.78%)
Mutual labels:  chromium
Uranium
Fast and versatile implementation of CEF for Unreal Engine
Stars: ✭ 51 (+183.33%)
Mutual labels:  chromium
JxBrowser-Examples
JxBrowser Examples & Tutorials
Stars: ✭ 49 (+172.22%)
Mutual labels:  chromium
headless-chrome-alpine
A Docker container running headless Chrome
Stars: ✭ 26 (+44.44%)
Mutual labels:  chromium
quic vs tcp
A Survey and Benchmark of QUIC
Stars: ✭ 41 (+127.78%)
Mutual labels:  chromium
browserexport
Backup and parse browser history databases (Chrome, Firefox, Safari, and other Chrome/Firefox derivatives)
Stars: ✭ 54 (+200%)
Mutual labels:  chromium
labyrinth
[DEPRECATED] Labyrinth is an anti-censorship web browser created to bypass DPI, blocklists, port filtering, firewalls and DNS censorship all in one
Stars: ✭ 17 (-5.56%)
Mutual labels:  chromium
cefHtmlSnapshot
Command-line utility for Windows that takes snapshots of HTML pages and saves them as images or PDF
Stars: ✭ 23 (+27.78%)
Mutual labels:  chromium
SAML-tracer
Browser extension for examining SAML messages
Stars: ✭ 104 (+477.78%)
Mutual labels:  chromium
crx3
Node.js module to create CRX3 files (web extension package v3 format) for Chromium, Google Chrome and Opera browsers.
Stars: ✭ 39 (+116.67%)
Mutual labels:  chromium
rubium
Rubium is a lightweight alternative to Selenium/Capybara/Watir if you need to perform some operations (like web scraping) using headless Chromium and Ruby
Stars: ✭ 65 (+261.11%)
Mutual labels:  chromium
Recorder
A browser extension that generates Cypress, Playwright and Puppeteer test scripts from your interactions 🖱 ⌨
Stars: ✭ 277 (+1438.89%)
Mutual labels:  chromium
NotionAI-MyMind
This repo uses AI and the wonderful Notion to enable you to add anything on the web to your "Mind" and forget about everything else.
Stars: ✭ 181 (+905.56%)
Mutual labels:  chromium

Scurl

About Scurl

Scurl is a library meant to replace some functions in urllib, namely urlparse, urlsplit and urljoin. It is built on Chromium's URL parsing source, known as GURL.
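For reference, the functions Scurl re-implements behave like the snippet below (shown with the stdlib urllib.parse so it runs without Scurl installed; Scurl aims to be a drop-in replacement for these calls):

```python
from urllib.parse import urlparse, urljoin

# urlparse splits a URL into scheme, netloc, path, params, query and fragment
parts = urlparse("https://example.com/path;params?q=1#frag")
print(parts.scheme, parts.netloc, parts.path)  # https example.com /path
print(parts.query, parts.fragment)             # q=1 frag

# urljoin resolves a relative reference against a base URL
print(urljoin("https://example.com/a/b", "../c"))  # https://example.com/c
```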

In addition, this library is built to support the Scrapy project (hence the name Scurl). It therefore also provides canonicalize_url, a bottleneck function in Scrapy spiders, which uses GURL's canonicalization to canonicalize the path, fragment and query of URLs.
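Canonicalization, roughly speaking, normalizes a URL so that equivalent spellings compare equal: lowercase the host, sort the query parameters, and normalize percent-encoding. A simplified pure-Python sketch of the idea (not Scurl's actual implementation, which delegates to GURL's C++ code):

```python
from urllib.parse import (parse_qsl, quote, unquote, urlencode, urlsplit,
                          urlunsplit)

def canonicalize_sketch(url):
    """Toy canonicalizer: lowercase host, sort query params, re-encode path."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    query = urlencode(sorted(parse_qsl(query, keep_blank_values=True)))
    path = quote(unquote(path)) or "/"
    return urlunsplit((scheme, netloc.lower(), path, query, fragment))

print(canonicalize_sketch("http://Example.com/a%7Eb?b=2&a=1"))
# http://example.com/a~b?a=1&b=2
```

Different spellings of the same URL now map to one canonical form, which is what lets Scrapy detect duplicate requests.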

Since the library is built on Chromium source, performance is greatly improved: urlparse, urlsplit and urljoin run 2-3 times faster than their urllib counterparts.

At the moment, we run the test suites from urllib and w3lib. Nearly all the tests from urllib pass (we are still working on passing the rest :) ).

Credits

We want to give special thanks to urlparse4, since this project is based on it.

GSoC 2018

This project was built under the funding of the Google Summer of Code 2018 program. More detail about the program can be found here.

The final report, which contains more detail on how this project was made, can be found here.

Supported functions

Since Scurl is meant to replace those functions in urllib, the following are supported: urlparse, urljoin, urlsplit and canonicalize_url.

Installation

Scurl has not been published to PyPI yet. Currently, the only way to install Scurl is to clone this repository and build it from source:

git clone https://github.com/scrapy/scurl
cd scurl
pip install -r requirements.txt
make clean
make build_ext
make install

Available Make commands

The Make commands provide a shorter way to type common commands while developing :)

make clean

This will clean the build directory and the files generated by the build_ext command

make test

This will run all the tests found in the /tests folder

make build_ext

This will run the command python setup.py build_ext --inplace, which builds Cython code for this project.

make sdist

This will run python setup.py sdist command on this project.

make install

This will run python setup.py install command on this project.

make develop

This will run python setup.py develop command on this project.

make perf

Run the performance tests on urlparse, urlsplit and urljoin.

make cano

Run the performance tests on canonicalize_url.

Profiling

The Scurl repository has a built-in profiling tool, which you can turn on by adding this line to the top of the *.pyx files in scurl/scurl:

# cython: profile=True

Then you can run python benchmarks/cython_profile.py --func [function-name] to get the cProfile result. Currently, Scurl supports profiling urlparse, urlsplit and canonicalize.

This is not the most convenient way to profile Scurl with cProfile, but we will come up with a way to improve this soon!
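For reference, this is roughly what cProfile-based profiling of a URL-parsing function looks like (shown here against the stdlib urllib.parse, since it runs without building Scurl; benchmarks/cython_profile.py does the equivalent for the Cython code):

```python
import cProfile
import io
import pstats
from urllib.parse import urlparse

def parse_many():
    # Parse a batch of URLs so the profile has something to measure
    for i in range(1000):
        urlparse(f"https://example.com/path/{i}?q={i}")

profiler = cProfile.Profile()
profiler.enable()
parse_many()
profiler.disable()

# Print the five most expensive calls by cumulative time
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```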

Benchmarking result report

urlparse, urlsplit and urljoin

This shows the performance difference between urlparse, urlsplit and urljoin from urllib.parse and those of Scurl (measured by running these functions on the URLs from the file chromiumUrls.txt, which can also be found in this project):

The chromiumUrls.txt file contains ~83k URLs. This measures the time it takes to run the performance_test.py test.

             urlparse   urlsplit   urljoin
urllib.parse 0.52 sec   0.39 sec   1.33 sec
Scurl        0.19 sec   0.10 sec   0.17 sec
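This kind of measurement can be reproduced with the stdlib along these lines (absolute numbers depend on the machine; synthetic URLs stand in for chromiumUrls.txt here):

```python
import timeit
from urllib.parse import urljoin, urlparse, urlsplit

urls = [f"https://example.com/a/b/{i}?q={i}" for i in range(1000)]

# Each timing runs the full URL batch 10 times
t_parse = timeit.timeit(lambda: [urlparse(u) for u in urls], number=10)
t_split = timeit.timeit(lambda: [urlsplit(u) for u in urls], number=10)
t_join = timeit.timeit(lambda: [urljoin(u, "../c") for u in urls], number=10)
print(f"urlparse {t_parse:.2f}s  urlsplit {t_split:.2f}s  urljoin {t_join:.2f}s")
```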

Canonicalize urls

The speed of canonicalize_url from scrapy/w3lib compared to that of canonicalize_url from Scurl (measured by running canonicalize_url on the URLs from the file chromiumUrls.txt, which can also be found in this project):

This measures the speed of both functions. The test can be found in canonicalize_test.py file.

             canonicalize_url
scrapy/w3lib 22,757 items/sec
Scurl        46,199 items/sec
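An items/sec figure like the one above is simply the URL count divided by elapsed wall time. A minimal sketch (using urlsplit as a stand-in for canonicalize_url, so the snippet runs without w3lib or Scurl installed):

```python
import time
from urllib.parse import urlsplit

urls = [f"https://example.com/p/{i}?b=2&a={i}" for i in range(10_000)]

# Throughput = items processed / wall-clock seconds
start = time.perf_counter()
for u in urls:
    urlsplit(u)
elapsed = time.perf_counter() - start

print(f"{len(urls) / elapsed:,.0f} items/sec")
```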

Feedback

Any feedback is highly appreciated :) Please feel free to report any errors or suggestions in the repository's issue tab!
