crawler-commons / Crawler Commons

Licence: apache-2.0
A set of reusable Java components that implement functionality common to any web crawler

Overview

Crawler-Commons is a set of reusable Java components that implement functionality common to any web crawler. These components benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.

User Documentation

Javadocs

Mailing List

There is a mailing list on Google Groups.

Issue Tracking

If you find an issue, please file a report in the project's issue tracker.

Crawler-Commons News

29th June 2020 - crawler-commons 1.1 released

We are glad to announce the 1.1 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details.

21st March 2019 - crawler-commons 1.0 released

We are glad to announce the 1.0 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. Among other bug fixes and improvements, this version adds support for parsing sitemap extensions (image, video, news, alternate links).

7th June 2018 - crawler-commons 0.10 released

We are glad to announce the 0.10 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. Among other things, this version improves sitemap parsing and removes the Tika dependency.

31st October 2017 - crawler-commons 0.9 released

We are glad to announce the 0.9 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. The main change is the removal of the DOM-based sitemap parser, as the SAX equivalent introduced in the previous version performs better and is more robust. You might need to change your code to replace SiteMapParserSAX with SiteMapParser. The parser is now namespace-aware and, by default, does not require the namespace to be the one recommended in the specification (http://www.sitemaps.org/schemas/sitemap/0.9), since variants can be found in the wild. You can control this behaviour with the method setStrictNamespace(boolean).

As usual, version 0.9 contains numerous improvements and bug fixes, and all users are encouraged to upgrade.
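
The strict versus lenient namespace handling described in the 0.9 notes can be illustrated with the JDK's own SAX parser. The following is a minimal sketch of the idea only, not the Crawler-Commons implementation: the class name SitemapNamespaceSketch, the method extractLocs and the example.com namespace are hypothetical, and the strictNamespace flag merely mimics what setStrictNamespace(boolean) is described as controlling.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; not part of Crawler-Commons.
class SitemapNamespaceSketch {
    // Namespace recommended by the sitemaps.org specification.
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    /** Collects the text of <loc> elements; in strict mode only those in SITEMAP_NS. */
    static List<String> extractLocs(String xml, boolean strictNamespace) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so that uri/localName are populated
        List<String> locs = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside an accepted <loc>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                boolean namespaceOk = !strictNamespace || SITEMAP_NS.equals(uri);
                if ("loc".equals(localName) && namespaceOk) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (text != null && "loc".equals(localName)) {
                    locs.add(text.toString().trim());
                    text = null;
                }
            }
        };

        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        factory.newSAXParser().parse(new ByteArrayInputStream(bytes), handler);
        return locs;
    }

    public static void main(String[] args) throws Exception {
        // A sitemap declared with a non-standard (made-up) namespace.
        String variant = "<urlset xmlns=\"http://example.com/schemas/sitemap\">"
                + "<url><loc>http://example.com/page</loc></url></urlset>";
        System.out.println("lenient: " + extractLocs(variant, false));
        System.out.println("strict:  " + extractLocs(variant, true));
    }
}
```

In lenient mode the URL is collected even though the document declares a non-standard namespace; in strict mode the same document yields no URLs. This is the trade-off the release notes describe: leniency tolerates the variants found in the wild.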

9th June 2017 - crawler-commons 0.8 released

We are glad to announce the 0.8 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. The main change is the removal of HTTP fetcher support, which has been moved to a separate project. We also added a SAX-based parser for processing sitemaps, which requires less memory and is more robust to malformed documents than its DOM-based counterpart. The latter has been kept for now but might be removed in the future.

24th November 2016 - crawler-commons 0.7 released

We are glad to announce the 0.7 release of Crawler-Commons. See the CHANGES.txt file included with the release for a full list of details. The main changes are that Crawler-Commons now requires Java 8 and that the package crawlercommons.url has been replaced with crawlercommons.domains. If your project uses Crawler-Commons, you might want to run the following command on it:

find . -type f -print0 | xargs -0 sed -i 's/import crawlercommons\.url\./import crawlercommons\.domains\./'

Please also note that this is the last release containing HTTP fetcher support, which is deprecated and will be removed in the next version.

Version 0.7 contains numerous improvements and bug fixes, and all users are encouraged to upgrade.

11th June 2015 - crawler-commons 0.6 is released

We are glad to announce the 0.6 release of Crawler Commons. See the CHANGES.txt file included with the release for a full list of details.

We suggest that all users upgrade to this version. Details of how to do so can be found on Maven Central. Please note that the groupId has changed to com.github.crawler-commons.

The Java documentation can be found here.

22nd April 2015 - crawler-commons has moved

The crawler-commons project is now hosted on GitHub, following the shutdown of Google Code hosting.

15th October 2014 - crawler-commons 0.5 is released

We are glad to announce the 0.5 release of Crawler Commons. This release mainly improves sitemap parsing and upgrades to Apache Tika 1.6.

See the CHANGES.txt file included with the release for a full list of details. Additionally the Java documentation can be found here.

We suggest that all users upgrade to this version. The Crawler Commons project artifacts are released as Maven artifacts and can be found at Maven Central.

11th April 2014 - crawler-commons 0.4 is released

We are glad to announce the 0.4 release of Crawler Commons. Amongst other improvements, this release includes support for Googlebot-compatible regular expressions in URL specifications, further improvements to robots.txt parsing and an upgrade of httpclient to v4.2.6.

See the CHANGES.txt file included with the release for a full list of details.

We suggest that all users upgrade to this version. Details of how to do so can be found on Maven Central.
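
For readers unfamiliar with the Googlebot-compatible patterns mentioned in the 0.4 notes: in robots.txt rules, '*' matches any sequence of characters and a trailing '$' anchors the pattern to the end of the URL path, while a pattern without '$' acts as a prefix match. The sketch below illustrates those semantics by translating a rule into a java.util.regex pattern. It is an assumption-laden illustration, not the code Crawler-Commons uses; the class RobotsPatternSketch and the method matches are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not part of Crawler-Commons.
class RobotsPatternSketch {

    /**
     * Returns true if the URL path matches the robots.txt rule pattern.
     * '*' matches any run of characters; a trailing '$' requires the path
     * to end there; without '$' the pattern only has to match a prefix.
     */
    static boolean matches(String rulePattern, String path) {
        boolean anchored = rulePattern.endsWith("$");
        String body = anchored
                ? rulePattern.substring(0, rulePattern.length() - 1)
                : rulePattern;

        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '*') {
                regex.append(".*");                                   // wildcard
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));       // literal character
            }
        }
        if (!anchored) {
            regex.append(".*");                                       // prefix match
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(path).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("/fish", "/fish.html"));          // prefix rule
        System.out.println(matches("/*.php$", "/index.php"));        // anchored wildcard rule
        System.out.println(matches("/*.php$", "/index.php?x=1"));    // anchored rule, longer path
    }
}
```

Translating to a regular expression keeps the sketch short, but patterns with many wildcards can backtrack heavily; a production matcher would more likely scan the pattern and path directly.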

11 Oct 2013 - crawler-commons 0.3 is released

This release improves robots.txt and sitemap parsing support, updates Tika to the latest released version (1.4), and removes some left-over cruft from the pre-Maven build setup.

See the CHANGES.txt file included with the release for a full list of details.

24 Jun 2013 - Nutch 1.7 now uses crawler-commons for robots.txt parsing

Similar to the previous note about Nutch 2.2, there's now a version of Nutch in the 1.x tree that also uses crawler-commons. See Apache Nutch v1.7 Released for more details.

08 Jun 2013 - Nutch 2.2 now uses crawler-commons for robots.txt parsing

See Apache Nutch v2.2 Released for more details.

02 Feb 2013 - crawler-commons 0.2 is released

This release improves robots.txt and sitemap parsing support.

See the CHANGES.txt file included with the release for a full list of details.
