
MarHai / ScrapeBot

Licence: GPL-2.0
A Selenium-driven tool for automated website interaction and scraping.

Programming Languages

Python, HTML, JavaScript, CSS

Projects that are alternatives of or similar to ScrapeBot

Panther
A browser testing and web crawling library for PHP and Symfony
Stars: ✭ 2,480 (+15400%)
Mutual labels:  scraping, selenium-webdriver
Seleniumcrawler
An example using Selenium webdrivers for python and Scrapy framework to create a web scraper to crawl an ASP site
Stars: ✭ 117 (+631.25%)
Mutual labels:  scraping, selenium-webdriver
Linkedin
Linkedin Scraper using Selenium Web Driver, Chromium headless, Docker and Scrapy
Stars: ✭ 309 (+1831.25%)
Mutual labels:  scraping, selenium-webdriver
Scrape Linkedin Selenium
`scrape_linkedin` is a python package that allows you to scrape personal LinkedIn profiles & company pages - turning the data into structured json.
Stars: ✭ 239 (+1393.75%)
Mutual labels:  scraping, selenium-webdriver
etf4u
📊 Python tool to scrape real-time information about ETFs from the web and mixing them together by proportionally distributing their assets allocation
Stars: ✭ 29 (+81.25%)
Mutual labels:  scraping
AppiumGrid
A framework for running appium tests in parallel across devices and also on desktop browser... U like it STAR it !!
Stars: ✭ 17 (+6.25%)
Mutual labels:  selenium-webdriver
SHAFT ENGINE
SHAFT is an MIT licensed test automation engine. Powered by best-in-class frameworks like Selenium WebDriver, Appium & RestAssured it provides a wizard-like syntax to increase productivity, and built-in wrappers to eliminate boilerplate code and to ensure your tests are extra stable and your results are extra reliable.
Stars: ✭ 170 (+962.5%)
Mutual labels:  selenium-webdriver
linkedin-scraper
Tool to scrape linkedin
Stars: ✭ 74 (+362.5%)
Mutual labels:  scraping
google-meet-bot
Bot for scheduling and entering google meet sessions automatically
Stars: ✭ 33 (+106.25%)
Mutual labels:  selenium-webdriver
nightwatch-vrt
Visual Regression Testing tools for nightwatch.js
Stars: ✭ 59 (+268.75%)
Mutual labels:  selenium-webdriver
NBA-Fantasy-Optimizer
NBA Daily Fantasy Lineup Optimizer for FanDuel Using Python
Stars: ✭ 21 (+31.25%)
Mutual labels:  scraping
info-bot
🤖 A Versatile Telegram Bot
Stars: ✭ 37 (+131.25%)
Mutual labels:  scraping
scrapers
scrapers for building your own image databases
Stars: ✭ 46 (+187.5%)
Mutual labels:  scraping
shale
A Clojure-backed replacement for Selenium hubs.
Stars: ✭ 14 (-12.5%)
Mutual labels:  selenium-webdriver
Architeuthis
MITM HTTP(S) proxy with integrated load-balancing, rate-limiting and error handling. Built for automated web scraping.
Stars: ✭ 35 (+118.75%)
Mutual labels:  scraping
Goirate
Pillaging the seven seas for torrents, pieces of eight and other bounty.
Stars: ✭ 20 (+25%)
Mutual labels:  scraping
double-agent
A test suite of common scraper detection techniques. See how detectable your scraper stack is.
Stars: ✭ 123 (+668.75%)
Mutual labels:  scraping
socials
👨‍👩‍👦 Social account detection and extraction in Python, e.g. for crawling/scraping.
Stars: ✭ 37 (+131.25%)
Mutual labels:  scraping
gochanges
**[ARCHIVED]** website changes tracker 🔍
Stars: ✭ 12 (-25%)
Mutual labels:  scraping
crawler-chrome-extensions
爬虫工程师常用的 Chrome 插件 | Chrome extensions used by crawler developer
Stars: ✭ 53 (+231.25%)
Mutual labels:  scraping


ScrapeBot

ScrapeBot is a tool for so-called "agent-based testing" to automatically visit, modify, and scrape a defined set of webpages regularly. It was built to automate various web-based tasks and keep track of them in a controllable way for academic research, primarily in the realm of computational social science.

This repository allows for actual agent-based testing across a variety of instances (e.g., to vary locations, languages, operating systems, browsers ...) as well as for configuring and maintaining these instances. As such, ScrapeBot consists of three major parts, each of which is included in this repository:

# | Part | Required | Accessibility | Technology
1 | Database | yes, once | needs to be accessed by all instances and the web frontend | MySQL
2 | Instance | yes, as often as you like | should not be accessible to anybody | Python (+ Selenium)
3 | Web frontend | no, but if you fancy it, then set it up once | should be served through a web server | Python (+ Flask)

1. Installing the database

There is actually not much to do here apart from installing a MySQL server somewhere and making it accessible from outside. Yes, this is the part everybody on the internet warns you about, but hey ¯\_(ツ)_/¯. So go ahead, install it, and remember those credentials. Then proceed with part 2, installing a new instance. Once you specify the database credentials there, it will create the necessary database tables if they do not exist yet.
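
For reference, here is a minimal sketch of such a MySQL setup on Ubuntu. The database name, user name, password, and the decision to accept remote connections are assumptions; adjust them to your setup:

    sudo apt-get install -y mysql-server
    # allow connections from other machines (MySQL listens on localhost only by default)
    sudo sed -i 's/^bind-address.*/bind-address = 0.0.0.0/' /etc/mysql/mysql.conf.d/mysqld.cnf
    sudo service mysql restart
    # create a database and a user for ScrapeBot (hypothetical names)
    sudo mysql -e "CREATE DATABASE scrapebot CHARACTER SET utf8mb4;"
    sudo mysql -e "CREATE USER 'scrapebot'@'%' IDENTIFIED BY 'choose-a-password';"
    sudo mysql -e "GRANT ALL PRIVILEGES ON scrapebot.* TO 'scrapebot'@'%'; FLUSH PRIVILEGES;"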

2. Installing a new Instance

Installation varies depending on your operating system. The most common way to use it, though, is on a *nix server, such as one from Amazon Web Services (AWS). Hence, this installation tutorial is geared toward that, although ScrapeBot also runs on other operating systems, including Windows.

Installing on Linux/Unix

  1. The easiest server version to get started with is a 64-bit Ubuntu EC2 instance, such as AWS' "Ubuntu Server 18.04 LTS" free tier. Launch that and SSH into it.
  2. Update the available package-manager repositories. Afterwards, we need to install four things:
    • pip for the Python 3.x environment
    • A browser of your choice. I'll use Firefox, which is easily available through Ubuntu's apt repositories. Note that version 60 or above is required.
    • If you are on a Unix system without a GUI, such as our free EC2 tier, you need to provide one (or simulate one, for that matter). I'll use Xvfb, a virtual X11 display.
    • Git to later fetch the latest version of ScrapeBot.
    sudo apt-get update
    sudo apt-get install -y python3-pip firefox xvfb git
    
  3. Now get (i.e., clone) the latest version of ScrapeBot and change into the newly generated directory.
    git clone https://github.com/MarHai/ScrapeBot.git
    cd ScrapeBot/
    
  4. If you are using either Chrome or Firefox, ChromeDriver or Geckodriver is also required. Luckily, both are already included in the ScrapeBot repository. All you need to do is provide them with execution rights.
    chmod u+x lib/*
    
  5. Next, we install all Python requirements that ScrapeBot has.
    pip3 install -r requirements.txt
    
  6. That's it, we should be good to go. Hence, let's configure the newly created ScrapeBot instance by following the step-by-step wizard. On Linux, it will end by setting up a cronjob that runs every two minutes. On other systems, you need to set that up yourself (see the example cron entry sketched after this list).
    python3 setup.py
    
    By the way, running setup.py again on an already configured instance also lets you easily create new users.
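
If your system does not get the cronjob set up by setup.py, a crontab entry along the following lines should do. This is a sketch only: the path and the main script name are assumptions, so on Linux check what setup.py actually wrote (e.g., via crontab -l) for the exact command:

    # run ScrapeBot every two minutes (hypothetical path and script name)
    */2 * * * * cd /home/ubuntu/ScrapeBot && python3 scrapebot.py >> cron.log 2>&1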

Installing on Windows

It should work fine, but keep in mind to either have your preferred browser available in your PATH environment variable or to specify the path to its executable in the Instance section of your config.ini, like so:

BrowserBinary = C:\Program Files\Mozilla Firefox\firefox.exe

or

BrowserBinary = C:\Program Files (x86)\Google\Chrome\Application\chrome.exe

Installing on a Raspberry Pi (2B+)

Currently available Firefox versions do not match the currently available Geckodriver versions for ARM systems, such as the Raspberry Pi. In other words, as long as apt-get install firefox-esr yields a version below 57, do not bother.

Instead, you can use Chrome. Bear in mind, though, that the required Chromedriver is somewhat outdated (as Google has stopped deploying it for ARM systems) and is thus not capable of taking screenshots. A hedged install sketch follows below.
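
For reference, a sketch of installing Chromium and its driver from the Raspbian repositories; the package names are assumptions and differ between releases (newer Debian-based images ship chromium and chromium-driver instead):

    sudo apt-get update
    sudo apt-get install -y chromium-browser chromium-chromedriver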

Finally, note that Selenium and ScrapeBot require RAM, something the Raspberry Pi is rather short of. As a general rule, you should only use ScrapeBot on a Raspberry Pi in exceptional cases and without too many recipes running.

tl;dr: Don't do it unless necessary. And if so, use Chrome but do not take screenshots.

3. Installing the web frontend

By following the installation guidelines above, you have also installed all the prerequisites for the web frontend. Beyond these prerequisites, however, no web server is yet in place to serve it. The web frontend is only required on one instance, and it can even run on a separate machine that does not run the instance regularly.

The web frontend allows you to overlook instances (see screenshots from the dashboard and the instance detail view), configure recipes (see screenshots from the recipe detail view and the recipe-step configuration), export the recipes to replicable .sbj files (see a screenshot of a .sbj file), and check the log files from individual runs (see the screenshot of a run log).

Here is a short installation guide for the web frontend (again, after (!) you have successfully finished setting up the above).

  1. First, we need to install the actual web server, or servers for that matter. As we are dealing with a Python Flask app on Unix, we'll use gunicorn to serve the Python app internally, nginx as an external web server, and supervisor to ensure availability.
    sudo apt-get -y install gunicorn3 supervisor nginx
    
  2. To make everything accessible via HTTPS, we need SSL certificates. For the sake of completeness, we'll self-sign our certificates here; for a production server, you may want to use Let's Encrypt, for example through Certbot (see the sketch after this list).
    mkdir certs/
    openssl req -new -newkey rsa:4096 -days 365 -nodes -x509 -keyout certs/key.pem -out certs/cert.pem
    
  3. Now, we prepare the nginx web server to use these certificates and to listen for incoming requests. I've included an example configuration for this, which we can simply copy and use. Be warned, however, that this configuration file requires absolute paths to your home directory and assumes a user named "ubuntu" (the default user on AWS Ubuntu instances).
    sudo rm /etc/nginx/sites-enabled/default
    sudo cp nginx.conf /etc/nginx/sites-enabled/scrapebot
    sudo service nginx reload
    
  4. The outward-facing web server is now in place. For it to be reachable, however, firewall settings may also need adjustment. On AWS, for instance, you may need to add inbound security rules for both HTTP and HTTPS.
  5. Finally, we can configure gunicorn to be handled by supervisor. Again, this relies on absolute paths to your home directory, so if your user is not called "ubuntu" you should double-check these configuration files.
    sudo cp supervisor.conf /etc/supervisor/conf.d/scrapebot.conf
    sudo supervisorctl reload
    
  6. That is it: you should now be able to open and work with the web interface (which is incredibly slow under AWS' free tier).
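
As mentioned in step 2, a production server would typically use Let's Encrypt certificates instead of self-signed ones. A minimal Certbot sketch follows; the domain name is hypothetical, and the plugin package may be called python-certbot-nginx on older Ubuntu releases:

    sudo apt-get install -y certbot python3-certbot-nginx
    # request and install a certificate for your (hypothetical) domain
    sudo certbot --nginx -d scrapebot.example.com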

The almighty "config.ini"

ScrapeBot is configured in two ways. First and foremost, a config.ini file sets the connection to the central database and the like; it ensures that everything is running and that instances and recipes can interact with each other. Second, all the actual agent-based-testing configuration is done via the web frontend and thus stored inside the central database. While the web frontend helps you understand what you can and cannot do, the config.ini file is not as self-explanatory. The setup.py wizard helps you create one; if you want to know more, here is an overview of all available settings.

Database

This section holds the options for connecting to the central database, plus a few optional extras.

  • Host is the central database host to connect to.
  • User must hold the username to connect to the central database.
  • Password holds, well, the corresponding password.
  • Database represents the database name. Small side note: the step-by-step wizard (i.e., setup.py) will generate the necessary tables if (and only if) they do not exist yet.
  • Due to long recipe runtimes, ScrapeBot sometimes struggles with MySQL server timeouts (at least if servers close connections rather strictly). To overcome this problem, you may set Timeout here to a number of seconds after which the database connection is automatically renewed. Best practice is to do nothing until you run into problems. If you do, check your MySQL server's timeout and set ScrapeBot's Database/Timeout to a value slightly below that number (e.g., 10 seconds less):
    SHOW SESSION VARIABLES LIKE 'wait_timeout';
    
  • If you intend to take lots of screenshots, you might want to store them not locally but in an Amazon S3 bucket. For this to happen, you need to specify your Amazon S3 bucket user's credentials (i.e., its access and secret keys) via AWSaccess, AWSsecret, and AWSbucket here. Alternatively (and by default), screenshots are stored locally in the directory specified under Instance. A sample Database section is sketched below.
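
For orientation, a sample Database section could look as follows (all values are placeholders, and the exact section-header spelling is an assumption; setup.py generates the real thing):

    [Database]
    Host = my-database-host.example.com
    User = scrapebot
    Password = choose-a-password
    Database = scrapebot
    ; optional: renew connections slightly before the server's wait_timeout (here assumed to be 600 seconds)
    Timeout = 590
    ; optional: upload screenshots to an Amazon S3 bucket instead of storing them locally
    AWSaccess = my-access-key
    AWSsecret = my-secret-key
    AWSbucket = my-scrapebot-screenshots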

Email

The web frontend will send emails from time to time. So if you want an instance to serve as the web frontend, you need to configure an SMTP server here so that it can actually send those emails (a sample section is sketched after the following list).

  • Again, Host is the address of the SMTP (!) server.
  • Port represents the port through which to connect (typically, this is 25 for non-TLS and 465 or 587 for TLS servers).
  • TLS should indicate whether a secure TLS connection should be used (1) or not (0).
  • User is the user to connect to the SMTP server.
  • And Password, well, again, holds the corresponding password.
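
Again, a sample Email section for orientation (placeholder values; the section-header spelling is an assumption):

    [Email]
    Host = smtp.example.com
    Port = 587
    TLS = 1
    User = scrapebot@example.com
    Password = choose-a-password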

Instance

Finally, a config.ini file is always unique to one instance. As such, it specifies some of that instance's environmental settings. And while these settings differ (depending, for example, on the browser and the operating system), I tried to keep them as unified as possible to make configuration as convenient as possible.

  • Name is especially easy, as you can use whatever name you prefer. This is the name the instance uses to register itself with the database; it will thus appear in the web frontend as well as in all downloaded datasets. Keep in mind that it should be unique, or the instance will pretend to be something (or somebody) else.
  • Timeout makes agents more human-like in that it specifies the number of seconds to wait between recipe steps (after a page has finished loading). As such, it also affects the time an agent needs to perform a recipe. A good balance is a timeout of 1 second. Side note: actual timeouts vary randomly by +/-25% to mimic human surfing behavior more thoroughly.
  • Browser is the Selenium webdriver to use. See its documentation on drivers to find out more. Whatever driver you choose, though, it needs to be installed correctly.
  • BrowserBinary is the path to the browser binary. If your browser can be launched from PATH directly, this is not necessary.
  • BrowserUserAgent overwrites, if set, the default user-agent string.
  • BrowserLanguage sets the accept_languages setting. You can use either languages (e.g., "en", "de") or language+region (e.g., "en-us", "en-gb") settings.
  • BrowserWidth and BrowserHeight define (in pixels) the size of the browser window to emulate. Use 1024 and 768 if unsure.
  • Using Firefox, you can also use BrowserGeoLatitude (e.g., 51.09102) and BrowserGeoLongitude (e.g., 6.5827) to set a specific browser location (most websites/platforms overwrite that with information derived from your IP address, though). Leave unset or set to "0.0" to ignore.
  • For screenshots to be taken and stored locally, a ScreenshotDirectory can be specified; the default is the screenshots/ subdirectory. Alternatively, you can upload screenshots to an Amazon S3 bucket: in that case, configure AWSaccess, AWSsecret, and AWSbucket under Database, and this setting is ignored. A sample Instance section is sketched below.
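
To round things off, a sample Instance section (placeholder values again; the section-header spelling is an assumption):

    [Instance]
    Name = aws-ubuntu-firefox-01
    Timeout = 1
    Browser = Firefox
    ; only needed if the browser cannot be launched via PATH
    ; BrowserBinary = /usr/bin/firefox
    BrowserLanguage = en-us
    BrowserWidth = 1024
    BrowserHeight = 768
    ScreenshotDirectory = screenshots/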

Retrieving collected data

Depending on the amount of data collected, you can then easily download it through the web frontend as CSV file(s) or use the R package ScrapeBotR to access your data directly. While the first option requires the web frontend to be set up, the latter only asks you to specify your ScrapeBot database using the same credentials as in the almighty "config.ini" under Database.

Replicability

ScrapeBot can export recipes into JSON-encapsulated files. These files are called .sbj files (as in ScrapeBot JSON) and include all necessary specifications for a recipe, its individual steps, and their values. Note that these files do not include any runtime information, such as instances, runs, logs, or collected data.

However, .sbj files can also be imported into the system. As such, they provide an easy way to publish replicable recipes for other scholars to build upon.

To get you started easily on ScrapeBot, you can find a couple of import-ready .sbj files under recipes/.

Further information

There is a publication available to read up on agent-based testing, including some instructive advice on how to use ScrapeBot:

Haim, M. (2020). Agent-based testing: An automated approach toward artificial reactions to human behavior. Journalism Studies, 21(7), 895-911. https://dx.doi.org/10.1080/1461670x.2019.1702892

ScrapeBot uses Selenium WebDriver for its browser emulation. As such, it is capable of running with a broad variety of browsers.

Very brief history

For quite some time, ScrapeBot was available as a CasperJS-based tool controlled through a bash script. The latest stable release of that version is available as v1.1 of this repository.

Citation

Haim, Mario (2019). ScrapeBot. A Selenium-based tool for agent-based testing. Source code available at https://github.com/MarHai/ScrapeBot/.

Haim, Mario (2020). Agent-based testing: An automated approach toward artificial reactions to human behavior. Journalism Studies, 21(7), 895-911. https://dx.doi.org/10.1080/1461670x.2019.1702892
