All Projects → chatnoir-eu → chatnoir-resiliparse

chatnoir-eu / chatnoir-resiliparse

Licence: Apache-2.0 license
A robust web archive analytics toolkit

Programming Languages

cython
566 projects
python
139335 projects - #7 most used programming language
c
50402 projects - #5 most used programming language

Projects that are alternatives of or similar to chatnoir-resiliparse

node-warc
Parse And Create Web ARChive (WARC) files with node.js
Stars: ✭ 69 (+165.38%)
Mutual labels:  warc, webarchive
mixnode-warcreader-php
Read Web ARChive (WARC) files in PHP.
Stars: ✭ 20 (-23.08%)
Mutual labels:  warc, webarchive
Aws Etl Orchestrator
A serverless architecture for orchestrating ETL jobs in arbitrarily-complex workflows using AWS Step Functions and AWS Lambda.
Stars: ✭ 245 (+842.31%)
Mutual labels:  bigdata
young-examples
java学习和项目中一些典型的应用场景样例代码
Stars: ✭ 21 (-19.23%)
Mutual labels:  bigdata
vandal
Navigator for Web Archive
Stars: ✭ 146 (+461.54%)
Mutual labels:  webarchive
bigdatatutorial
bigdatatutorial
Stars: ✭ 34 (+30.77%)
Mutual labels:  bigdata
hayabusa
Hayabusa: Simple and Fast Full-Text Search Engine for Massive System Log Data
Stars: ✭ 43 (+65.38%)
Mutual labels:  bigdata
Simple It English
Simple-IT-English: smart wordbook from community for community
Stars: ✭ 233 (+796.15%)
Mutual labels:  bigdata
PersonNotes
个人笔记集中营,快糙猛的形式记录技术性Notes .. 📚☕️⌨️🎧
Stars: ✭ 61 (+134.62%)
Mutual labels:  bigdata
xkcd-2048
No description or website provided.
Stars: ✭ 12 (-53.85%)
Mutual labels:  extraction
twitter-archive-reader
Full featured TypeScript Twitter archive reader and browser
Stars: ✭ 43 (+65.38%)
Mutual labels:  bigdata
Spark-MLlib-Tutorial
大数据框架 Spark MLlib 机器学习库基础算法全面讲解,附带齐全的测试文件
Stars: ✭ 32 (+23.08%)
Mutual labels:  bigdata
workflUX
An open-source, cloud-ready web application for simplified deployment of big data workflows.
Stars: ✭ 26 (+0%)
Mutual labels:  bigdata
php-article-extractor
A PHP library to extract article text from web pages
Stars: ✭ 28 (+7.69%)
Mutual labels:  extraction
Every Single Day I Tldr
A daily digest of the articles or videos I've found interesting, that I want to share with you.
Stars: ✭ 249 (+857.69%)
Mutual labels:  bigdata
intersect
一道面试题的思考 - 6000万数据包和300万数据包在50M内存使用环境中求交集
Stars: ✭ 54 (+107.69%)
Mutual labels:  bigdata
Dpark
Python clone of Spark, a MapReduce alike framework in Python
Stars: ✭ 2,668 (+10161.54%)
Mutual labels:  bigdata
optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Stars: ✭ 1,351 (+5096.15%)
Mutual labels:  bigdata
pnextract
Pore network extraction from micro-CT images of porous media
Stars: ✭ 43 (+65.38%)
Mutual labels:  extraction
Clustering4Ever
C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.
Stars: ✭ 126 (+384.62%)
Mutual labels:  bigdata

ChatNoir Resiliparse

Build Wheels Codecov Documentation Status

A collection of robust and fast processing tools for parsing and analyzing web archive data.

Usage Instructions

For detailed information about the build process, dependencies, APIs, or usage instructions, please read the Resiliparse Documentation

Resiliparse Module Summary

The Resiliparse collection encompasses the following two modules at the moment:

1. Resiliparse

The Resiliparse main module with the following subcomponents:

Parsing Utilities

The Resiliparse Parsing Utilities are highly optimized tools for dealing with encodings, detecting content types of raw protocol payloads, parsing HTML web pages, performing language detection, and more.

Main documentation: Resiliparse Parsing Utilities

Extraction Utilities

The Resiliparse Extraction Utilities are a set of performance-optimized and highly efficient tools for extracting structural or semantic information from noisy raw web data for further processing, such as (main) content extraction / boilerplate removal, schema extraction, general web data cleansing, and more.

Main documentation: Resiliparse Extraction Utilities

Process Guards

The Resiliparse Process Guard module is a set of decorators and context managers for guarding a processing context to stay within pre-defined limits for execution time and memory usage. Process Guards help to ensure the (partially) successful completion of batch processing jobs in which individual tasks may time out or use abnormal amounts of memory, but in which the success of the whole job is not threatened by (a few) individual failures. A guarded processing context will be interrupted upon exceeding its resource limits so that the task can be skipped or rescheduled.

Main documentation: Resiliparse Process Guards

Itertools

Resiliparse Itertools are a collection of convenient and robust helper functions for iterating over data from unreliable sources using other tools from the Resiliparse toolkit.

Main documentation: Resiliparse Itertools

2. FastWARC

FastWARC is a high-performance WARC parsing library for Python written in C++/Cython. The API is inspired in large parts by WARCIO, but does not aim at being a drop-in replacement. FastWARC supports compressed and uncompressed WARC/1.0 and WARC/1.1 streams. Supported compression algorithms are GZip and LZ4.

Main documentation: FastWARC and FastWARC CLI

Installation

The main Resiliparse package can be installed from PyPi as follows:

pip install resiliparse

FastWARC is being distributed as its own package and can be installed like so:

pip install fastwarc

For optimal performance, however, it is recommended to build FastWARC from sources instead of relying on the pre-built binaries. See below for more information.

Building From Source

To build Resiliparse or FastWARC from sources, you need to install all required build-time dependencies first. On Ubuntu, this is done as follows:

# Add Lexbor repository
curl -L https://lexbor.com/keys/lexbor_signing.key | sudo apt-key add -
echo "deb https://packages.lexbor.com/ubuntu/ $(lsb_release -sc) liblexbor" | \
    sudo tee /etc/apt/sources.list.d/lexbor.list

# Install build dependencies
sudo apt update
sudo apt install build-essential python3-dev zlib1g-dev \
    liblz4-dev libuchardet-dev liblexbor-dev libre2-dev

Then, to build the actual packages, run:

# Optional: Create a fresh venv first
python3 -m venv venv && source venv/bin/activate

# Build and install Resiliparse
pip install -e resiliparse

# Build and install FastWARC
pip install -e fastwarc

Instead of building the packages from this repository, you can also build them from the PyPi source packages:

# Build Resiliparse from PyPi
pip install --no-binary resiliparse resiliparse

# Build FastWARC from PyPi
pip install --no-binary fastwarc fastwarc

Cite Us

Resiliparse is part of the ChatNoir web analytics toolkit. If you use ChatNoir or any of its tools for a publication, you can make us happy by citing our ECIR 2018 demo paper:

@InProceedings{bevendorff:2018,
  address =             {Berlin Heidelberg New York},
  author =              {Janek Bevendorff and Benno Stein and Matthias Hagen and Martin Potthast},
  booktitle =           {Advances in Information Retrieval. 40th European Conference on IR Research (ECIR 2018)},
  editor =              {Leif Azzopardi and Allan Hanbury and Gabriella Pasi and Benjamin Piwowarski},
  month =               mar,
  publisher =           {Springer},
  series =              {Lecture Notes in Computer Science},
  site =                {Grenoble, France},
  title =               {{Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl}},
  year =                2018
}

If you use FastWARC, you can also cite our OSSYM 2021 abstract paper:

@InProceedings{bevendorff:2021,
  author =                {Janek Bevendorff and Martin Potthast and Benno Stein},
  booktitle =             {3rd International Symposium on Open Search Technology (OSSYM 2021)},
  editor =                {Andreas Wagner and Christian Guetl and Michael Granitzer and Stefan Voigt},
  month =                 oct,
  publisher =             {International Open Search Symposium},
  site =                  {CERN, Geneva, Switzerland},
  title =                 {{FastWARC: Optimizing Large-Scale Web Archive Analytics}},
  year =                  2021
}
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].