N0taN3rd / node-warc

Licence: MIT license
Parse And Create Web ARChive (WARC) files with node.js

node-warc

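Parse Web Archive (WARC) files or create WARC files using browser automation tools such as chrome-remote-interface, chrome-remote-interface-extra, or Puppeteer.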

Run npm install node-warc or yarn add node-warc to get started

Documentation

Full documentation available at n0tan3rd.github.io/node-warc

Parsing

Using async iteration

Requires node 10 or greater

const fs = require('fs')
const zlib = require('zlib')
// recordIterator only exported if async iteration on readable streams is available
const { recordIterator } = require('node-warc')

async function iterateRecords (warcStream) {
  for await (const record of recordIterator(warcStream)) {
    console.log(record)
  }
}

iterateRecords(
  fs.createReadStream('<path-to-gzipd-warcfile>').pipe(zlib.createGunzip())
).then(() => {
  console.log('done')
})
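
As a small illustration of consuming the iterator, the sketch below tallies records by their WARC-Type header. It assumes each record exposes a warcHeader object keyed by WARC header names, which should be checked against the node-warc documentation. countRecordTypes is a hypothetical helper, not part of node-warc. It accepts any async iterable, so the value returned by recordIterator(warcStream) could be passed in.

```javascript
// Hypothetical helper (not part of node-warc): tally records by WARC-Type.
// Assumption: each record has a `warcHeader` object keyed by WARC header names.
async function countRecordTypes (records) {
  const counts = {}
  for await (const record of records) {
    const header = record.warcHeader || {}
    // Records without a WARC-Type header are grouped under 'unknown'
    const type = header['WARC-Type'] || 'unknown'
    counts[type] = (counts[type] || 0) + 1
  }
  return counts
}
```

Because the helper only relies on async iteration, it never holds more than one record at a time.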

Or using one of the parsers

const { AutoWARCParser } = require('node-warc')
;(async () => {
  for await (const record of new AutoWARCParser('<path-to-warcfile>')) {
    console.log(record)
  }
})()

Using Stream Transform

const fs = require('fs')
const { WARCStreamTransform } = require('node-warc')

fs
  .createReadStream('<path-to-warcfile>')
  .pipe(new WARCStreamTransform())
  .on('data', record => {
    console.log(record)
  })

Both .warc and .warc.gz

const { AutoWARCParser } = require('node-warc')

const parser = new AutoWARCParser('<path-to-warcfile>')
parser.on('record', record => { console.log(record) })
parser.on('done', () => { console.log('finished') })
parser.on('error', error => { console.error(error) })
parser.start()
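
The event-based interface can also be wrapped in a Promise when async/await is more convenient. The sketch below relies only on the record, done, and error events and the start() method shown above. collectRecords is a hypothetical helper, not part of node-warc, and it buffers every record in memory, so it is only reasonable for small WARC files.

```javascript
// Hypothetical helper (not part of node-warc): run an event-based parser and
// resolve with every record once 'done' fires, or reject on 'error'.
function collectRecords (parser) {
  return new Promise((resolve, reject) => {
    const records = []
    parser.on('record', record => { records.push(record) })
    parser.on('done', () => { resolve(records) })
    parser.on('error', reject)
    // Attach listeners first, then begin parsing
    parser.start()
  })
}
```

With this in place, a call such as const records = await collectRecords(new AutoWARCParser('<path-to-warcfile>')) would gather all records before continuing.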

Only gzip'd warc files

const { WARCGzParser } = require('node-warc')

const parser = new WARCGzParser('<path-to-gzipd-warcfile>')
parser.on('record', record => { console.log(record) })
parser.on('done', () => { console.log('finished') })
parser.on('error', error => { console.error(error) })
parser.start()

Only non gzip'd warc files

const { WARCParser } = require('node-warc')

const parser = new WARCParser('<path-to-warcfile>')
parser.on('record', record => { console.log(record) })
parser.on('done', () => { console.log('finished') })
parser.on('error', error => { console.error(error) })
parser.start()

WARC Creation

Environment

  • NODEWARC_WRITE_GZIPPED - enable writing gzipped records to WARC outputs.

Examples

Using chrome-remote-interface

const CRI = require('chrome-remote-interface')
const { RemoteChromeWARCWriter, RemoteChromeCapturer } = require('node-warc')

;(async () => {
  const client = await CRI()
  await Promise.all([
    client.Page.enable(),
    client.Network.enable(),
  ])
  const cap = new RemoteChromeCapturer(client.Network)
  cap.startCapturing()
  await client.Page.navigate({ url: 'http://example.com' })
  // actual code should wait for a better stopping condition, e.g. network idle
  await client.Page.loadEventFired()
  const warcGen = new RemoteChromeWARCWriter()
  await warcGen.generateWARC(cap, client.Network, {
    warcOpts: {
      warcPath: 'myWARC.warc'
    },
    winfo: {
      description: 'I created a warc!',
      isPartOf: 'My awesome pywb collection'
    }
  })
  await client.close()
})()

Using chrome-remote-interface-extra

const { CRIExtra, Events, Page } = require('chrome-remote-interface-extra')
const { CRIExtraWARCGenerator, CRIExtraCapturer } = require('node-warc')

;(async () => {
  let client
  try {
    // connect to endpoint
    client = await CRIExtra({ host: 'localhost', port: 9222 })
    const page = await Page.create(client)
    const cap = new CRIExtraCapturer(page, Events.Page.Request)
    cap.startCapturing()
    await page.goto('https://example.com', { waitUntil: 'networkIdle' })
    const warcGen = new CRIExtraWARCGenerator()
    await warcGen.generateWARC(cap, {
      warcOpts: {
        warcPath: 'myWARC.warc'
      },
      winfo: {
        description: 'I created a warc!',
        isPartOf: 'My awesome pywb collection'
      }
    })
  } catch (err) {
    console.error(err)
  } finally {
    if (client) {
      await client.close()
    }
  }
})()

Using Puppeteer

const puppeteer = require('puppeteer')
const { Events } = require('puppeteer')
const { PuppeteerWARCGenerator, PuppeteerCapturer } = require('node-warc')

;(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  const cap = new PuppeteerCapturer(page, Events.Page.Request)
  cap.startCapturing()
  await page.goto('http://example.com', { waitUntil: 'networkidle0' })
  const warcGen = new PuppeteerWARCGenerator()
  await warcGen.generateWARC(cap, {
    warcOpts: {
      warcPath: 'myWARC.warc'
    },
    winfo: {
      description: 'I created a warc!',
      isPartOf: 'My awesome pywb collection'
    }
  })
  await page.close()
  await browser.close()
})()

Note

The generateWARC method used in the preceding examples is a helper function that simplifies the WARC generation process. See its implementation for a full example of WARC generation using node-warc.

Or see one of the crawler implementations provided by Squidwarc.
