jwoschitz / avrocount

License: Apache-2.0
Count records in Avro files efficiently

Programming Languages

java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to avrocount

mpu
Martins Python Utilities - Stuff that comes in Handy
Stars: ✭ 47 (+193.75%)
Mutual labels:  utility
DataProfiler
What's in your data? Extract schema, statistics and entities from datasets
Stars: ✭ 843 (+5168.75%)
Mutual labels:  avro
i2c-exp-driver
Driver to program I2C based Onion Expansions
Stars: ✭ 33 (+106.25%)
Mutual labels:  utility
cra-tailwindcss-in-js
Integrate Tailwind CSS in a Create React App setup using css-in-js solutions
Stars: ✭ 35 (+118.75%)
Mutual labels:  utility
OP1GO
Ultraportable backups for Teenage Engineering's OP-1
Stars: ✭ 34 (+112.5%)
Mutual labels:  utility
leak
Show info about package releases on PyPI.
Stars: ✭ 15 (-6.25%)
Mutual labels:  utility
linearmouse
🖱 The mouse and trackpad utility for Mac.
Stars: ✭ 1,151 (+7093.75%)
Mutual labels:  utility
Windows10Tools
Tools for Windows 10
Stars: ✭ 45 (+181.25%)
Mutual labels:  utility
grizzly
Extra utilities for Bear 🐻
Stars: ✭ 20 (+25%)
Mutual labels:  utility
mik
The Move to Islandora Kit is an extensible PHP command-line tool for converting source content and metadata into packages suitable for importing into Islandora (or other digital repository and preservations systems).
Stars: ✭ 32 (+100%)
Mutual labels:  utility
tableize
Turn lists into tables with ease
Stars: ✭ 12 (-25%)
Mutual labels:  utility
goroutines
provides utilities to perform common tasks on goroutines
Stars: ✭ 19 (+18.75%)
Mutual labels:  utility
gut
🍱 yet another collection of go utilities & tools
Stars: ✭ 24 (+50%)
Mutual labels:  utility
darwin
Avro Schema Evolution made easy
Stars: ✭ 26 (+62.5%)
Mutual labels:  avro
tracked
Header-only C++17 library enables to track object instances with varied policies and gives you to control exceptions on policy rule break.
Stars: ✭ 12 (-25%)
Mutual labels:  utility
TestCards
A simple test pattern generator.
Stars: ✭ 46 (+187.5%)
Mutual labels:  utility
kyanite
A small purely functional library of curried functions, with great piping possibilities!
Stars: ✭ 26 (+62.5%)
Mutual labels:  utility
sharyn
🌹 Sharyn – A collection of JavaScript / TypeScript packages that make your life easier and reduce your boilerplate code
Stars: ✭ 30 (+87.5%)
Mutual labels:  utility
DaFlow
Apache-Spark based Data Flow(ETL) Framework which supports multiple read, write destinations of different types and also support multiple categories of transformation rules.
Stars: ✭ 24 (+50%)
Mutual labels:  avro
schema-registry
📙 json & avro http schema registry backed by Kafka
Stars: ✭ 23 (+43.75%)
Mutual labels:  avro

Avrocount

This tool efficiently counts the records in Apache Avro (https://avro.apache.org/docs/current/) data files.

It works with single files or whole folders, on the local filesystem or on HDFS.

Usage

Quickstart

Get the latest released version or build it from source.

Then invoke the tool with:

java -jar avrocount.jar /path/to/myfile.avro

It prints the number of records found in the file to stdout.
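For context on why such a count can be cheap: this README does not describe how the tool counts internally, so the following is only a sketch of one plausible technique, not the project's actual code. It relies on the Avro object container file layout, in which every data block starts with the number of objects it holds and the byte size of its payload. Summing those counts and skipping the payloads avoids decoding any record. The class name `AvroBlockCounter` and its methods are hypothetical, the sketch uses only the JDK instead of the Avro library, and it does not validate sync markers.

```java
import java.io.BufferedInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical sketch: counts records by reading Avro block headers only.
public class AvroBlockCounter {
    private static final int SYNC_SIZE = 16; // sync marker length in bytes

    /** Sums the per-block object counts; block payloads are skipped, never decoded. */
    public static long countRecords(InputStream in) throws IOException {
        byte[] magic = new byte[4];
        readFully(in, magic);
        if (magic[0] != 'O' || magic[1] != 'b' || magic[2] != 'j' || magic[3] != 1) {
            throw new IOException("Not an Avro object container file");
        }
        skipMetadata(in);          // header metadata map (schema, codec, ...)
        skipFully(in, SYNC_SIZE);  // header sync marker

        long total = 0;
        while (true) {
            int first = in.read();
            if (first < 0) {
                break; // clean end of file between blocks
            }
            long count = readLong(in, first);     // objects in this block
            long size = readLong(in, in.read());  // payload size in bytes
            if (count < 0 || size < 0) {
                throw new IOException("Corrupt block header");
            }
            skipFully(in, size + SYNC_SIZE);      // payload plus trailing sync marker
            total += count;
        }
        return total;
    }

    /** Skips the header metadata, which is encoded as an Avro map of string to bytes. */
    private static void skipMetadata(InputStream in) throws IOException {
        while (true) {
            long entries = readLong(in, in.read());
            if (entries == 0) {
                return; // a zero count terminates the map
            }
            if (entries < 0) {
                // A negative count is followed by the byte size of the map block.
                skipFully(in, readLong(in, in.read()));
                continue;
            }
            for (long i = 0; i < entries; i++) {
                skipFully(in, readLong(in, in.read())); // key: length-prefixed string
                skipFully(in, readLong(in, in.read())); // value: length-prefixed bytes
            }
        }
    }

    /** Decodes one variable-length zigzag long whose first byte was already read. */
    private static long readLong(InputStream in, int first) throws IOException {
        long raw = 0;
        int shift = 0;
        int b = first;
        while (true) {
            if (b < 0) {
                throw new EOFException("Truncated varint");
            }
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
            if (shift > 63) {
                throw new IOException("Varint too long");
            }
            b = in.read();
        }
        return (raw >>> 1) ^ -(raw & 1); // undo zigzag encoding
    }

    private static void readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                throw new EOFException("Truncated header");
            }
            off += n;
        }
    }

    private static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped > 0) {
                n -= skipped;
            } else if (in.read() < 0) {
                throw new EOFException("Truncated file");
            } else {
                n--; // one byte was consumed by read()
            }
        }
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            System.out.println(countRecords(in));
        }
    }
}
```

Because the block payloads are never decompressed or deserialized, the cost of this approach depends mainly on the number of blocks and on how cheaply the underlying stream can skip bytes.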

Folders containing Avro files

You can provide either a path to a single file or a path to a folder containing Avro files:

java -jar avrocount.jar /path/to/folder

The tool considers only files ending with .avro and ignores all other files. Currently, only files directly inside the given directory are processed; sub-directories are not traversed.

The total number of records found in all Avro files within the folder is printed to stdout.
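The selection rule described above (regular files ending with .avro, no descent into sub-directories) can be sketched as follows. This is an illustration only: the class `AvroFileLister` and its method are hypothetical names, and the sketch uses `java.nio.file` on the local filesystem, whereas the tool itself resolves paths through the Hadoop Filesystem API.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the file selection rule, local filesystem only.
public class AvroFileLister {

    /**
     * Returns the input itself if it is a regular file; otherwise returns the
     * regular files directly inside the directory whose names end with ".avro".
     * Sub-directories are not traversed. The result is sorted by path.
     */
    public static List<Path> listAvroFiles(Path input) throws IOException {
        List<Path> result = new ArrayList<>();
        if (Files.isRegularFile(input)) {
            result.add(input);
            return result;
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(input)) {
            for (Path entry : entries) {
                boolean isAvro = entry.getFileName().toString().endsWith(".avro");
                if (isAvro && Files.isRegularFile(entry)) {
                    result.add(entry);
                }
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) throws IOException {
        for (Path file : listAvroFiles(Paths.get(args[0]))) {
            System.out.println(file);
        }
    }
}
```

A total for the folder would then be the sum of the per-file record counts over the returned list.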

HDFS integration

The tool uses the Hadoop Filesystem API to resolve paths. As long as the proper Hadoop configuration is provided via PATH, it should be able to connect to HDFS file paths.

You can execute the tool directly with the yarn binary, which should pick up the necessary configuration automatically:

yarn jar avrocount.jar /path/to/myfile.avro

In older Hadoop distributions, you need to replace yarn with hadoop.

Alternatively, you can explicitly point to an HDFS instance by specifying the protocol:

java -jar avrocount.jar hdfs://<namenode>/path/to/myfile.avro

Build from source

If you do not want to build the project yourself, you can also get the already compiled artifacts from the latest release.

This project relies on Gradle for dependency management and build automation.

To build the project, run:

gradle build

This generates an uber-jar (a jar containing all relevant dependencies) in ./build/libs/.

Motivation

The initial idea was submitted as a patch in 2015 to the Apache Avro project (https://issues.apache.org/jira/browse/AVRO-1720) as an addition to the already existing avro-tools.

However, for several reasons, this patch has not been merged yet.

Unfortunately, to date there is no convenient and efficient way to count the records in an Avro data file using avro-tools from the command line.

This project tries to fill that gap, at least until similar functionality is provided by avro-tools.

Over time, this project has also received several improvements compared to the original patch.

It would be great if these improvements found their way back into the Apache Avro project in the long term. Until then, this project can be used alongside the existing avro-tools.
