
helgeho / Archivespark

License: MIT
An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

Programming language: Scala

Projects that are alternatives of or similar to Archivespark

Bigdata Notes
A beginner's guide to big data ⭐
Stars: ✭ 10,991 (+9801.8%)
Mutual labels:  spark
Seldon Server
Machine Learning Platform and Recommendation Engine built on Kubernetes
Stars: ✭ 1,435 (+1192.79%)
Mutual labels:  spark
Bigdataclass
Two-day workshop that covers how to use R to interact with databases and Spark
Stars: ✭ 110 (-0.9%)
Mutual labels:  spark
Spark Ffm
FFM (Field-aware Factorization Machine) on Spark
Stars: ✭ 101 (-9.01%)
Mutual labels:  spark
Spark On K8s Operator
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Stars: ✭ 1,780 (+1503.6%)
Mutual labels:  spark
Hnswlib
Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs
Stars: ✭ 108 (-2.7%)
Mutual labels:  spark
Logisland
Scalable stream processing platform for advanced real-time analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink is on the roadmap). The platform does complex event processing and is suitable for time series analysis. A large set of valuable ready-to-use processors, data sources and sinks is available.
Stars: ✭ 97 (-12.61%)
Mutual labels:  spark
Lambda Arch
Applying Lambda Architecture with Spark, Kafka, and Cassandra.
Stars: ✭ 111 (+0%)
Mutual labels:  spark
Logigsk
A Linux-based software package to control LEDs on Logitech G910, G810, G610 and G410.
Stars: ✭ 107 (-3.6%)
Mutual labels:  spark
Parquet Index
Spark SQL index for Parquet tables
Stars: ✭ 109 (-1.8%)
Mutual labels:  spark
Spark Terasort
Spark Terasort
Stars: ✭ 101 (-9.01%)
Mutual labels:  spark
Sparktutorial
Source code for James Lee's Apache Spark with Java course
Stars: ✭ 105 (-5.41%)
Mutual labels:  spark
Pyspark Cheatsheet
🐍 Quick reference guide to common patterns & functions in PySpark.
Stars: ✭ 108 (-2.7%)
Mutual labels:  spark
Bigdata Notebook
Stars: ✭ 100 (-9.91%)
Mutual labels:  spark
Java learning practice
The road to advanced Java: frequently asked interview algorithms, Akka, multithreading, NIO, Netty, Spring Boot, Spark && Flink, and more
Stars: ✭ 110 (-0.9%)
Mutual labels:  spark
Almond
A Scala kernel for Jupyter
Stars: ✭ 1,354 (+1119.82%)
Mutual labels:  spark
Flink Learning
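Flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink basics, concepts, internals, hands-on practice, performance tuning, source code analysis, and more. Includes learning examples for Flink Connector, Metrics, Library, DataStream API, Table API & SQL, as well as case studies of large-scale Flink projects in production (PVUV, log storage, real-time deduplication of tens of billions of records, monitoring and alerting). You are welcome to support my column "Big Data Real-Time Compute Engine: Flink in Practice and Performance Optimization".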
Stars: ✭ 11,378 (+10150.45%)
Mutual labels:  spark
Elephas
Distributed Deep learning with Keras & Spark
Stars: ✭ 1,521 (+1270.27%)
Mutual labels:  spark
Waterdrop
Production Ready Data Integration Product, documentation:
Stars: ✭ 1,856 (+1572.07%)
Mutual labels:  spark
Distributed Dataset
A distributed data processing framework in Haskell.
Stars: ✭ 108 (-2.7%)
Mutual labels:  spark

ArchiveSpark


ArchiveSpark is a framework / toolkit / library / API to facilitate efficient data processing, extraction, and derivation for archival collections.

While originally developed for use with web archives, which are still its main focus, ArchiveSpark can be used with any (archival) data collection thanks to its modular architecture and customizable data specifications.

What can you do with it?

The main use case of ArchiveSpark is efficient access to archival data in order to derive corpora: you apply filters and tools to extract information from the original raw data and store it in a more accessible format, such as JSON, while preserving the data lineage of each derived value.
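To make the filter / extract / store-with-lineage idea concrete, here is a minimal conceptual sketch in plain Python. It does not use ArchiveSpark or its Scala API; the record layout, the `select` and `derive` helpers, and the sample data are all hypothetical and only illustrate the workflow described above: filter on lightweight metadata first, derive a value from the raw payload, and nest that value under the data it was computed from so the lineage stays visible in the JSON output.

```python
import json
import re

# Hypothetical in-memory stand-ins for archived records (metadata plus raw payload).
records = [
    {"url": "http://example.org/", "timestamp": "20150101000000",
     "mime": "text/html", "status": 200,
     "payload": "<html><head><title>Example Home</title></head><body>Hi</body></html>"},
    {"url": "http://example.org/logo.png", "timestamp": "20150101000001",
     "mime": "image/png", "status": 200, "payload": ""},
    {"url": "http://example.org/gone", "timestamp": "20150101000002",
     "mime": "text/html", "status": 404,
     "payload": "<html><head><title>Not Found</title></head></html>"},
]

def select(recs):
    """Filter on metadata only, without inspecting the payload."""
    return [r for r in recs if r["mime"] == "text/html" and r["status"] == 200]

def derive(rec):
    """Extract the page title and nest it under the values it was derived from."""
    match = re.search(r"<title>(.*?)</title>", rec["payload"], re.I | re.S)
    title = match.group(1).strip() if match else None
    return {
        "record": {k: rec[k] for k in ("url", "timestamp", "mime", "status")},
        # The path payload -> html -> title -> text documents the lineage of the value.
        "payload": {"html": {"title": {"text": title}}},
    }

corpus = [derive(r) for r in select(records)]
print(json.dumps(corpus, indent=2))
```

The nested path in the output is the point of the sketch: a reader of the JSON can see that the title text came from the title element of the HTML of the record's payload, rather than receiving a flat, unexplained `title` field.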

Examples of what you can do with it (see the recipes for code examples):

  • Selecting a subset of your data and extracting desired properties (e.g., title, entities, ...)
  • Running a (temporal) data analysis on the filtered / extracted / derived data
  • Generating hyperlink or knowledge graphs for downstream applications
  • Processing archived webpages and extracting embedded resources
  • Downloading remote WARC/CDX data from the Internet Archive's Wayback Machine
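As an illustration of the hyperlink-graph use case in the list above, the following is a small stdlib-only Python sketch. It is not ArchiveSpark code; the `LinkCollector` class, the `link_edges` helper, and the page dictionaries are hypothetical. It shows one plausible shape for the derived data: one edge per link, carrying the capture timestamp so the graph can later be analyzed over time.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects the href values of all anchor tags in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def link_edges(pages):
    """Return (source URL, capture timestamp, absolute target URL) tuples."""
    edges = []
    for page in pages:
        parser = LinkCollector()
        parser.feed(page["html"])
        for href in parser.hrefs:
            # Resolve relative links against the URL of the archived page.
            edges.append((page["url"], page["timestamp"], urljoin(page["url"], href)))
    return edges

# Hypothetical archived page used only for demonstration.
pages = [
    {"url": "http://example.org/a/", "timestamp": "20150101000000",
     "html": '<a href="b.html">B</a> <a href="http://example.com/">ext</a>'},
]
edges = link_edges(pages)
for edge in edges:
    print(edge)
```

Keeping the timestamp on each edge is a design choice for this sketch: it lets the same edge list feed both a static graph and a temporal analysis of when links appeared.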

New in 3.0

  • Namespace changed to org.archive.archivespark.
  • Extensive overhaul to be based on Sparkling, Internet Archive's internal data processing library, which is now partially included under org.archive.archivespark.sparkling.
  • ArchiveSpark will evolve as Sparkling evolves and automatically benefit from new features and bugfixes.
  • Streamlined, with all unused, unnecessary, academic, and experimental features removed.
  • Refactored / cleaned up / simplified ArchiveSpark's public APIs.

For more information and instructions, please read the docs:

ArchiveSpark Documentation

License

The MIT License (MIT)

Copyright (c) 2015-2019 [email protected]>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
