
writepython / web-crawler

Licence: other
Python Web Crawler with Selenium and PhantomJS

Programming Languages

python
139335 projects - #7 most used programming language
Roff
2310 projects

Projects that are alternatives to or similar to web-crawler

kick-off-web-scraping-python-selenium-beautifulsoup
A tutorial-based introduction to web scraping with Python.
Stars: ✭ 18 (-5.26%)
Mutual labels:  scraper, phantomjs
crawlkit
A crawler based on Phantom. Allows discovery of dynamic content and supports custom scrapers.
Stars: ✭ 23 (+21.05%)
Mutual labels:  scraper, phantomjs
Lambda Phantom Scraper
PhantomJS/Node.js web scraper for AWS Lambda
Stars: ✭ 93 (+389.47%)
Mutual labels:  scraper, phantomjs
Mimo-Crawler
A web crawler that uses Firefox and js injection to interact with webpages and crawl their content, written in nodejs.
Stars: ✭ 22 (+15.79%)
Mutual labels:  scraper, webcrawler
Goose Parser
Universal scraping tool that lets you extract data using multiple environments
Stars: ✭ 211 (+1010.53%)
Mutual labels:  scraper, phantomjs
yellowpages-scraper
Yellowpages.com Web Scraper written in Python and LXML to extract business details available based on a particular category and location.
Stars: ✭ 56 (+194.74%)
Mutual labels:  scraper
node-mocha-extjs
Framework for testing ExtJs applications
Stars: ✭ 19 (+0%)
Mutual labels:  phantomjs
MangDL
The most inefficient Manga downloader for PC
Stars: ✭ 40 (+110.53%)
Mutual labels:  scraper
tv grab fr telerama
XMLTV Grabber using telerama api data
Stars: ✭ 36 (+89.47%)
Mutual labels:  scraper
Facebook-Profile-Pictures-Downloader
😆 Download public profile pictures from Facebook.
Stars: ✭ 23 (+21.05%)
Mutual labels:  scraper
wikipedia-reference-scraper
Wikipedia API wrapper for references
Stars: ✭ 34 (+78.95%)
Mutual labels:  scraper
lezhin-comics-downloader
📥 Downloader for lezhin comics
Stars: ✭ 30 (+57.89%)
Mutual labels:  scraper
google-scraper
This class can retrieve search results from Google.
Stars: ✭ 33 (+73.68%)
Mutual labels:  scraper
file-extensions
JSON collection of scraped file extensions, along with their description and type, from FileInfo.com
Stars: ✭ 15 (-21.05%)
Mutual labels:  scraper
TradeTheEvent
Implementation of "Trade the Event: Corporate Events Detection for News-Based Event-Driven Trading." In Findings of ACL2021
Stars: ✭ 64 (+236.84%)
Mutual labels:  scraper
nyt-first-said
Tweets when words are published for the first time in the NYT
Stars: ✭ 222 (+1068.42%)
Mutual labels:  scraper
fiveN1-rent-scraper
🏠 a.k.a. 591 rent scraper (a crawler for the 591 rental site)
Stars: ✭ 51 (+168.42%)
Mutual labels:  scraper
jd-autobuy
A Python scraper that logs in to JD.com automatically and snaps up products online
Stars: ✭ 1,262 (+6542.11%)
Mutual labels:  scraper
karma-detect-browsers
Karma runner plugin for detecting all browsers installed on the current system.
Stars: ✭ 44 (+131.58%)
Mutual labels:  phantomjs
proxy-scraper
⭐️ A proxy scraper made using Protractor | Proxy list updates every three hours 🔥
Stars: ✭ 201 (+957.89%)
Mutual labels:  scraper
=== About ===

- This Python web crawler reads a configuration file containing seed URLs to crawl and parameters that filter what is downloaded.
- It then crawls each seed URL in succession, adding any newly discovered URLs to a queue of URLs to visit (see the sketch after this list).
- As each URL is visited, it is downloaded if it satisfies the given filtering parameters, and the directory structure of the website is preserved.
- The queue of URLs to visit therefore grows at first, levels off once no new URLs are being discovered, and eventually shrinks to zero, at which point the program moves on to the next seed URL, or exits if there are no more URLs to process.
- Pages that return only JavaScript with a text/html mimetype are requested again with Selenium using the PhantomJS browser.
- Additional functionality is available to handle an input file containing a list of files to download (see DOWNLOAD.PY below).
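
The overall flow can be pictured with the rough sketch below.  It is illustrative only: it uses Python 3 syntax for readability, the HTTP client and the heuristic for spotting JavaScript-only pages are stand-ins for whatever run.py actually does, and it assumes a Selenium 2/3 release where the PhantomJS driver is still available.

import os
import re
from collections import deque
from urllib.parse import urljoin

import requests                        # assumption: any HTTP client would do here
from selenium import webdriver         # assumes Selenium 2/3, which still ships webdriver.PhantomJS

def crawl_seed(entry, mimetypes_list, output_dir):
    """Crawl one urls_to_crawl entry (see Config.py Variables below)."""
    queue = deque([entry["url"]])                  # URLs still to visit
    seen = {entry["url"]}
    while queue:                                   # grows at first, then drains to zero
        url = queue.popleft()
        response = requests.get(url)
        content_type = response.headers.get("Content-Type", "")
        html = response.text

        # Pages that return only JavaScript with a text/html mimetype are
        # re-requested through PhantomJS so the rendered DOM can be crawled.
        if "text/html" in content_type and "<body" not in html.lower():
            browser = webdriver.PhantomJS()
            browser.get(url)
            html = browser.page_source
            browser.quit()

        # Download the page if it passes the mimetype and regex filters,
        # mirroring the directory structure of the website.
        filters = entry.get("regex_filters")
        if (not filters or any(re.search(p, url) for p in filters)) \
                and any(mt in content_type for mt in mimetypes_list):
            rel = url.split("://", 1)[1]
            if rel.endswith("/"):
                rel += "index.html"                # give directory URLs a file name
            path = os.path.join(output_dir, rel)
            os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
            with open(path, "w", encoding="utf-8") as fh:
                fh.write(html)

        # Queue newly discovered links that contain the follow string.
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            if entry["follow_links_containing"] in link and link not in seen:
                seen.add(link)
                queue.append(link)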

=== Requirements ===

Curl for downloading binary files

=== RUN.PY Usage ===

Edit config.py (the variables are explained below), then run:
python run.py -o <output_dir>

=== DOWNLOAD.PY Usage ===

python download.py -i <input_file> -o <output_dir>
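
A rough idea of what this does, assuming the input file lists one URL per line; the option handling and helper logic here are illustrative, not the script's actual code:

import argparse                       # assumption: the real script may parse options differently
import os
import subprocess

parser = argparse.ArgumentParser(description="Download every URL listed in an input file.")
parser.add_argument("-i", dest="input_file", required=True)
parser.add_argument("-o", dest="output_dir", required=True)
args = parser.parse_args()

with open(args.input_file) as fh:
    urls = [line.strip() for line in fh if line.strip()]

for url in urls:
    target = os.path.join(args.output_dir, url.split("://", 1)[1])
    os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
    # curl does the actual transfer, matching the Requirements section above.
    subprocess.run(["curl", "-sSL", "-o", target, url], check=True)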

=== Config.py Variables ===

mimetypes_list is an array of mimetypes that determines which files will be downloaded, provided they pass the regular expression filters.

binary_mimetypes_list is an array of mimetypes that determines which files will be downloaded as binary files using Curl, provided they pass the regular expression filters.

file_extensions_list is an array of file extensions that determines which files will be downloaded, provided they pass the regular expression filters.

*Note: Each URL will take less time to process if only one of the above lists is used rather than both.
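
As a rough picture of how these lists might steer the download decision (the helper name choose_download_method is hypothetical, not from the project):

import subprocess

def choose_download_method(url, content_type,
                           mimetypes_list, binary_mimetypes_list, file_extensions_list):
    """Return 'binary', 'text', or None for a URL that already passed the regex filters."""
    if any(mt in content_type for mt in binary_mimetypes_list):
        return "binary"                        # handed off to curl (see Requirements)
    if any(mt in content_type for mt in mimetypes_list):
        return "text"
    if any(url.endswith(ext) for ext in file_extensions_list):
        return "text"
    return None                                # skip this URL

# With the lists from the example config below, a PDF response is routed to the curl path:
print(choose_download_method("http://example.com/report.pdf", "application/pdf",
                             ['text/html'], ['pdf', 'video', 'audio', 'image'], ['.txt']))
# -> binary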

request_delay is a float describing how long to wait in seconds before making the next request.

urls_to_crawl is an array of dictionaries, each containing the keys url and follow_links_containing, plus (optionally) regex_filters and ignore_query_strings.

url is a URL string such as http://www.main.russia.org

follow_links_containing is a string that determines which links are followed.  For example, www.main.russia.org will follow all links containing www.main.russia.org, while russia.org will follow all links containing russia.org.  www.main.russia.org is thus more restrictive and will take less time to process.

regex_filters is an optional array of Perl-style regular expression patterns.  Files matching any one of the patterns will be downloaded.  "\d" matches a single digit and "." matches any character except the newline character.  Prefix regex strings with an "r" so that backslash sequences such as "\d" reach the regular expression engine unchanged, as in: r"/2014/07\d\d/"  See http://docs.python.org/2/howto/regex.html#regex-howto
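
For instance, with the filter used in the last entry of the example config below, matching might look like this (the URLs are made up purely to show the pattern at work):

import re

regex_filters = [r"/2014/07\d\d/"]

# Hypothetical URLs, invented for illustration only.
url_hit  = "http://politics.people.com.cn/2014/0715/c0001-12345678.html"
url_miss = "http://politics.people.com.cn/2013/0715/c0001-12345678.html"

print(any(re.search(p, url_hit)  for p in regex_filters))   # True  -> downloaded
print(any(re.search(p, url_miss) for p in regex_filters))   # False -> skipped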

ignore_query_strings is a boolean.  Setting this to True means that when new URLs are discovered, their query strings will be stripped before further processing.  If the resulting URL redirects to a URL with a query string, that query string will be preserved.
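
Conceptually the stripping amounts to something like the following (a minimal sketch using the standard library, not the project's exact code; the URL is invented for illustration):

from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    parts = urlsplit(url)
    # Keep scheme, host and path; drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

print(strip_query("http://www.china.com.cn/news/index.htm?page=2"))
# -> http://www.china.com.cn/news/index.htm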

=== An Example config.py ===

mimetypes_list = [ 'text/html' ]

binary_mimetypes_list = [ 'pdf', 'video', 'audio', 'image' ]

file_extensions_list = [ '.txt' ]

request_delay = 0

urls_to_crawl = [
    {
        "url": "http://madeinheights.com",
        "follow_links_containing": "madeinheights.com",
        "regex_filters": [ r"st.ry" ]
    },
    {
        "url": "http://www.china.com.cn",
        "follow_links_containing": "www.china.com.cn",
        "regex_filters": [ r"/2014-07/\d\d/" ],
        "ignore_query_strings": True,
    },
    {
        "url": "http://politics.people.com.cn",
        "follow_links_containing": "politics.people.com.cn",
        "regex_filters": [ r"/2014/07\d\d/" ]
    }    
]