helgeho / Archivespark

License: MIT

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

Programming language: Scala

Labels: spark, web-archiving

ArchiveSpark

ArchiveSpark is a framework / toolkit / library / API to facilitate efficient data processing, extraction as well as derivation for archival collections.

Although it was originally developed for web archives, which remain its main focus, ArchiveSpark can be used with any (archival) data collection thanks to its modular architecture and customizable data specifications.

What can you do with it?

The main use case of ArchiveSpark is efficient access to archival data in order to derive corpora: filters and tools are applied to extract information from the original raw data, and the results are stored in a more accessible format, such as JSON, that reflects the data lineage of each derived value.
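
To make the idea of lineage-preserving derivation concrete, here is a minimal, self-contained sketch in plain Python. It does not use ArchiveSpark or its API (ArchiveSpark itself is a Scala library running on Apache Spark); the record fields, the `select` and `derive_title` helpers, and the output layout are all hypothetical and chosen only for illustration. The sketch filters a few in-memory records, derives a title from each payload, and emits JSON in which the nesting path shows where the derived value came from.

```python
import json
import re

# Hypothetical in-memory records standing in for archived web captures.
# Field names are assumptions for this sketch, not ArchiveSpark's schema.
records = [
    {"url": "http://example.org/", "timestamp": "20150101000000",
     "status": 200, "mime": "text/html",
     "payload": "<html><head><title>Example</title></head><body>Hi</body></html>"},
    {"url": "http://example.org/missing", "timestamp": "20150102000000",
     "status": 404, "mime": "text/html",
     "payload": "<html><head><title>Not Found</title></head></html>"},
    {"url": "http://example.org/logo.png", "timestamp": "20150103000000",
     "status": 200, "mime": "image/png", "payload": ""},
]


def select(items):
    """Filter step: keep only successful HTML captures."""
    return [r for r in items if r["status"] == 200 and r["mime"] == "text/html"]


def derive_title(record):
    """Derivation step: extract the title and nest it under its source."""
    match = re.search(r"<title>(.*?)</title>", record["payload"],
                      re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else None
    return {
        "record": {k: record[k] for k in ("url", "timestamp", "status", "mime")},
        # The path payload -> html -> title documents the value's lineage.
        "payload": {"html": {"title": title}},
    }


corpus = [derive_title(r) for r in select(records)]

# Store the derived corpus in an accessible format: one JSON object per line.
for entry in corpus:
    print(json.dumps(entry))
```

Nesting each derived value beneath the value it was computed from is one simple way to keep lineage visible in the output; other representations, such as a separate provenance field, would serve the same purpose.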

Examples of what you can do with it include the following (see the recipes for code examples):

Selecting a subset of your data and extracting desired properties (e.g., title, entities, ...)
Running a (temporal) data analysis on the filtered / extracted / derived data
Generating hyperlink or knowledge graphs for downstream applications
Processing archived webpages and extracting embedded resources
Downloading remote WARC/CDX data from the Internet Archive's Wayback Machine

New in 3.0

Namespace changed to org.archive.archivespark.
Extensive overhaul to build on Sparkling, the Internet Archive's internal data processing library, which is now partially included under org.archive.archivespark.sparkling.
ArchiveSpark will evolve as Sparkling evolves and automatically benefit from new features and bugfixes.
Streamlined by removing all unused / unnecessary / academic / experimental features.
Refactored, cleaned up, and simplified ArchiveSpark's public APIs.

For more information and instructions, please read the docs:

License

The MIT License (MIT)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
