vedhavyas / scrape

License: MIT
Depth controllable Web scraper and Sitemap Generator in Go

Programming Languages

go
31211 projects - #10 most used programming language

Projects that are alternatives of or similar to scrape

X.Web.Sitemap
Simple sitemap generator for .NET
Stars: ✭ 66 (+247.37%)
Mutual labels:  sitemap, sitemap-generator
grav-plugin-sitemap
Grav Sitemap Plugin
Stars: ✭ 34 (+78.95%)
Mutual labels:  sitemap, sitemap-generator
express-sitemap-xml
Serve sitemap.xml from a list of URLs in Express
Stars: ✭ 56 (+194.74%)
Mutual labels:  sitemap, sitemap-generator
php-sitemap
PHP Simple Sitemap Generator
Stars: ✭ 16 (-15.79%)
Mutual labels:  sitemap, sitemap-generator
sitewriter
A rust library to generate sitemaps.
Stars: ✭ 18 (-5.26%)
Mutual labels:  sitemap, sitemap-generator
jsitemapgenerator
Java sitemap generator. This library generates a web sitemap, can ping Google, generate RSS feed, robots.txt and more with friendly, easy to use Java 8 functional style of programming
Stars: ✭ 38 (+100%)
Mutual labels:  sitemap, sitemap-generator
sitemap
A simple sitemap generator for Laravel Framework.
Stars: ✭ 32 (+68.42%)
Mutual labels:  sitemap, sitemap-generator
sitemap-plugin
Sitemap Plugin for Sylius eCommerce platform
Stars: ✭ 68 (+257.89%)
Mutual labels:  sitemap, sitemap-generator
TrollHunter
Twitter Troll & Fake News Hunter - Crawls news websites and Twitter to identify fake news
Stars: ✭ 38 (+100%)
Mutual labels:  sitemap
Sitemap
Bolt Sitemap extension - create XML sitemaps for your Bolt website.
Stars: ✭ 19 (+0%)
Mutual labels:  sitemap
Linkedin-Client
Web scraper for grabbing data from LinkedIn profiles or company pages (personal project)
Stars: ✭ 42 (+121.05%)
Mutual labels:  web-scraper
Web Scraper
This project is a web scraper built with Ruby retrieving data from the "Movies | Netflix official website"
Stars: ✭ 14 (-26.32%)
Mutual labels:  web-scraper
top-github-scraper
Scrape top GitHub repositories and users based on keywords
Stars: ✭ 40 (+110.53%)
Mutual labels:  web-scraper
gatsby-blog-mdx
A ready-to-use, customizable personal blog with minimalist design
Stars: ✭ 61 (+221.05%)
Mutual labels:  sitemap
SitemapParser
XML Sitemap parser class compliant with the Sitemaps.org protocol.
Stars: ✭ 57 (+200%)
Mutual labels:  sitemap
plugins
Elder.js plugins and community plugins.
Stars: ✭ 80 (+321.05%)
Mutual labels:  sitemap
siteshooter
📷 Automate full website screenshots and PDF generation with multiple viewport support.
Stars: ✭ 63 (+231.58%)
Mutual labels:  sitemap
sitemap-checker
A tool for validating XML sitemap and sitemap index files for broken links
Stars: ✭ 21 (+10.53%)
Mutual labels:  sitemap
OLX Scraper
📻 An OLX scraper using Scrapy + MongoDB. It scrapes recent ads posted regarding the requested product and dumps them to NoSQL MongoDB.
Stars: ✭ 15 (-21.05%)
Mutual labels:  web-scraper
Instagram-Giveaways-Winner
Instagram Bot which when given a post url will spam mentions to increase the chances of winning. Win Instagram Giveaways!
Stars: ✭ 95 (+400%)
Mutual labels:  web-scraper

Scrape

Scrape is a minimalistic, depth-controlled web scraping project. It can be used as a command-line tool or integrated into your own project. Scrape also supports sitemap generation as an output.

Scrape Response

Once scraping of the given URL is done, the API returns the following structure.

package scrape

import (
	"net/url"
	"regexp"
)

// Response holds the scraped response
type Response struct {
	BaseURL      *url.URL            // starting url at maxDepth 0
	UniqueURLs   map[string]int      // UniqueURLs holds the map of unique urls crawled and the number of times each was seen
	URLsPerDepth map[int][]*url.URL  // URLsPerDepth holds the urls found at each depth
	SkippedURLs  map[string][]string // SkippedURLs holds urls extracted from source urls that failed domainRegex (if given) or are invalid
	ErrorURLs    map[string]error    // ErrorURLs holds the reason why a url was not crawled
	DomainRegex  *regexp.Regexp      // restricts crawling to urls matching the given domain
	MaxDepth     int                 // MaxDepth of the crawl; -1 means no depth limit
	Interrupted  bool                // true if the scraping was interrupted
}
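
For example, a caller can walk the crawl results depth by depth and inspect failures. A minimal sketch, assuming resp is a *Response returned by one of the Start functions documented below:

// Print the urls found at each depth, then any crawl failures.
for depth, urls := range resp.URLsPerDepth {
	fmt.Printf("depth %d: %d urls\n", depth, len(urls))
}
for u, err := range resp.ErrorURLs {
	fmt.Printf("could not crawl %s: %v\n", u, err)
}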

Command line:

Installation:

go get github.com/vedhavyas/scrape/cmd/scrape/

Available command line options:

Usage of ./scrape:
 -domain-regex string (optional)
        Domain regex to limit crawls to. Defaults to the base url domain
 -max-depth int (optional)
        Max depth to crawl (default -1)
 -sitemap string (optional)
        File location to write the sitemap to
 -url string (required)
        Starting URL (default "https://vedhavyas.com")
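
For example, to crawl a site two levels deep and also write a sitemap (the URL and file name here are placeholders):

./scrape -url "https://example.com" -max-depth 2 -sitemap sitemap.xml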

Output

Scrape supports two types of output:

  1. Printing all of the above collected data from the Response to stdout
  2. Generating a sitemap XML file (if a file location is passed) from the Response

As a Package

Scrape can be integrated into any Go project through the given APIs. As a package, you will have access to the above-mentioned Response and all the data in it. At this point, the following APIs are available.

Start

func Start(ctx context.Context, url string) (resp *Response, err error)

Start will start the scraping with no depth limit (-1), restricted to the base url's domain.
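
A minimal end-to-end sketch, assuming the import path github.com/vedhavyas/scrape matches the repository:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/vedhavyas/scrape"
)

func main() {
	// Crawl with no depth limit, restricted to the base url's domain.
	resp, err := scrape.Start(context.Background(), "https://example.com")
	if err != nil {
		log.Fatal(err)
	}
	// Report every unique url and how often it was seen.
	for u, count := range resp.UniqueURLs {
		fmt.Println(u, count)
	}
}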

StartWithDepth

func StartWithDepth(ctx context.Context, url string, maxDepth int) (resp *Response, err error)

StartWithDepth will start the scraping with the given max depth, restricted to the base url's domain.

StartWithDepthAndDomainRegex

func StartWithDepthAndDomainRegex(ctx context.Context, url string, maxDepth int, domainRegex string) (resp *Response, err error) 

StartWithDepthAndDomainRegex will start the scraping with the given max depth and domain regex.
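
For instance, to crawl only two levels deep while restricting the crawl to one domain; a sketch in which the url and regex are illustrative placeholders (the regex's exact matching semantics are up to the package):

// Depth- and domain-limited crawl.
resp, err := scrape.StartWithDepthAndDomainRegex(
	context.Background(), "https://example.com", 2, `example\.com`)
if err != nil {
	log.Fatal(err)
}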

StartWithDomainRegex

func StartWithDomainRegex(ctx context.Context, url, domainRegex string) (resp *Response, err error)

StartWithDomainRegex will start the scraping with no depth limit (-1) and the given domain regex.

Sitemap

func Sitemap(resp *Response, file string) error 

Sitemap generates a sitemap from the given response and writes it to the given file.
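
Putting the pieces together, a crawl followed by sitemap generation could look like this sketch (url and file name are placeholders):

// Crawl the site, then write a sitemap from the collected response.
resp, err := scrape.Start(context.Background(), "https://example.com")
if err != nil {
	log.Fatal(err)
}
if err := scrape.Sitemap(resp, "sitemap.xml"); err != nil {
	log.Fatal(err)
}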

Feedback and Contributions

  1. If you think something is missing, please feel free to raise an issue.
  2. If you would like to work on an open issue, feel free to announce yourself in the issue's comments.