kevintpeng / Web Crawler And Scraper
ruby-scraper
Ruby crawler and scraper for Postgres or other SQL-based databases. Generates ActiveRecord objects.
General Usage:
- Require or include the ActiveRecord models.
- Create a new scraper object: `obj = Scraper.new`.
- Use `Scraper.connectDatabase` to connect to the remote Heroku database.
- `obj.testPath "http://example.com", :get` prints the raw HTTP response from the URL. (In a terminal, redirect the script's output to a text file for easy regular-expression analysis: `$ ruby scraper.rb > output.txt`.)
- Create a list of links: `obj.getLinks "http://example.com", :get, /urlRegex/`. The getLinks method takes a regex whose matches are the URLs to collect.
- Scrape each link and build an ActiveRecord object: `obj.scrapeLinks! Job, {attributes_hash}, {:sanitize => [list of keys for attributes that should be HTML-safe]}`. The attributes hash maps each attribute of the ActiveRecord object to the regex that matches it in the HTML response. Passing the symbol `:site` as a value sets that attribute to the current URL.
Crawlify Documentation
Create a new crawler object: `my_crawler = Crawlify::Crawler.new("website")`
`#crawl(resource_path, url)`
Crawls from a base path, retrieving all linked resources starting from `url`. `resource_path` is the root directory that will be created in the output directory.
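The traversal `#crawl` describes can be sketched as a breadth-first walk from the starting URL. This is not Crawlify's actual code: the `PAGES` map stands in for real HTTP fetches so the example stays self-contained, and the `crawl` helper is hypothetical:

```ruby
require "set"

# Hypothetical sketch of the crawl loop: breadth-first traversal from a
# starting URL, following links until no unvisited pages remain.
# PAGES stands in for real HTTP fetches so the example is self-contained.
PAGES = {
  "http://example.com/"  => ["http://example.com/a", "http://example.com/b"],
  "http://example.com/a" => ["http://example.com/b"],
  "http://example.com/b" => []
}

def crawl(start_url)
  visited = Set.new
  queue = [start_url]
  until queue.empty?
    url = queue.shift
    next if visited.include?(url)
    visited << url
    queue.concat(PAGES.fetch(url, [])) # enqueue this page's links
  end
  visited.to_a
end

crawl("http://example.com/")
# => ["http://example.com/", "http://example.com/a", "http://example.com/b"]
```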
`#save(resource_path, body)`
Saves a body of text to the specified resource path.
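A minimal sketch of what saving under an output directory could look like, assuming the crawler writes each resource to a file rooted at its resource path (the `save_resource` helper is hypothetical, not Crawlify's API):

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical sketch of #save: write a body of text under an
# output directory, creating intermediate directories as needed.
def save_resource(output_dir, resource_path, body)
  path = File.join(output_dir, resource_path)
  FileUtils.mkdir_p(File.dirname(path)) # ensure parent directories exist
  File.write(path, body)
  path
end

Dir.mktmpdir do |out|
  saved = save_resource(out, "website/index.html", "<html></html>")
  File.read(saved) # the body we just wrote
end
```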
Configurations
Configuration settings can be passed to a new crawler object.
`Crawlify::Crawler.new("website", stop: "myregex")` specifies a stop pattern. If any web page matches the regular expression, the program terminates rather than continuing to crawl.
`Crawlify::Crawler.new("website", doc_type: "xml")` specifies a different file type to crawl. The default is html.
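The `stop:` option above amounts to a per-page check before following links. A stdlib-only sketch of that check (the `stop_crawling?` helper is hypothetical; Crawlify's internals may differ):

```ruby
# Hypothetical sketch of the stop-pattern check: the crawl loop would
# consult this after fetching each page and halt on the first match.
def stop_crawling?(page_body, stop_pattern)
  return false if stop_pattern.nil? # no stop: option was given
  page_body.match?(Regexp.new(stop_pattern))
end

stop_crawling?("<p>404 Not Found</p>", "404")  # => true
stop_crawling?("<p>All good here</p>", "404")  # => false
stop_crawling?("<p>All good here</p>", nil)    # => false
```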