titorenko / quick-csv-streamer

Licence: GPL-2.0 license
Quick CSV Parser with Java 8 Streams API

Quick CSV Streamer

Quick CSV Streamer is a high-performance CSV parsing library built on the Java 8 Stream API. The library operates in "zero-copy" mode and parses only the fields the client actually requests. It also keeps the amount of garbage it produces low, reducing pressure on the garbage collector. Parallel, multi-core parsing is supported transparently via the Java Stream API.

Compared to other open source Java CSV parsing libraries, Quick CSV achieves speedups in the 2x to 10x range in sequential (single-threaded) mode. Parallel mode improves performance further. See the benchmark results below for details.

The library is limited to so-called "line-optimal" charsets such as UTF-8, US-ASCII, ISO-8859-1 and some others. In a line-optimal charset, line feed ('\n'), carriage return ('\r') and the CSV separator are easily distinguishable from other encoded characters.
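
To make the charset restriction concrete, the following sketch counts how many bytes in an encoded string equal the comma byte (0x2C). It uses only the JDK; the class name LineOptimalCheck and its method are made up for illustration and are not part of the library. In UTF-8 every byte of a multi-byte sequence is 0x80 or higher, so a 0x2C byte is always a real comma, whereas in UTF-16 the same byte value can be one half of an unrelated character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LineOptimalCheck {

    /** Counts bytes equal to the ASCII comma (0x2C) in the encoded form of text. */
    static int countCommaBytes(String text, Charset charset) {
        int count = 0;
        for (byte b : text.getBytes(charset)) {
            if (b == ',') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // U+012C encodes as bytes 0x01 0x2C in UTF-16BE, so its low byte looks like a comma.
        // The string contains exactly one real comma.
        String text = "\u012C,x";
        System.out.println("UTF-8:    " + countCommaBytes(text, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE: " + countCommaBytes(text, StandardCharsets.UTF_16BE));
    }
}
```

With UTF-8 the count matches the single real separator. With UTF-16BE a byte-level scan sees a second, false separator, which is why a parser that scans raw bytes cannot support such an encoding.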

Maven dependency

Available from Maven Central:

<dependency>
    <groupId>uk.elementarysoftware</groupId>
    <artifactId>quick-csv-streamer</artifactId>
    <version>0.2.4</version>
</dependency>

Example usage

Suppose the following CSV file needs to be parsed:

Country,City,AccentCity,Region,Population,Latitude,Longitude
ad,andorra,Andorra,07,,42.5,1.5166667
gb,city of london,City of London,H9,,51.514125,-.093689
ua,kharkiv,Kharkiv,07,,49.980814,36.252718

First, define a Java class to represent the records:

public class City {
    private final String city;
    private final int population;
    private final double latitude;
    private final double longitude;

    ...
}

Here we source 4 fields from the file and ignore the other 3.

Parsing the file is simple:

import uk.elementarysoftware.quickcsv.api.*;

CSVParser<City> parser = CSVParserBuilder.aParser(City::new, City.CSVFields.class).forRfc4180().build();

The parser will use CSV separators as per RFC 4180 and the default encoding, and will expect a header as the first record in the source. Custom separators, quotes, encodings and header sources are supported.

The actual mapping is done in the City constructor:

public class City {

    public static enum CSVFields {
        AccentCity,
        Population,
        Latitude,
        Longitude
    }

    public City(CSVRecordWithHeader<CSVFields> r) {
        this.city = r.getField(CSVFields.AccentCity).asString();
        this.population = r.getField(CSVFields.Population).asInt();
        this.latitude = r.getField(CSVFields.Latitude).asDouble();
        this.longitude = r.getField(CSVFields.Longitude).asDouble();
    }
}

First, the CSVFields enum specifies which fields should be sourced; only these fields are actually parsed. The CSVRecordWithHeader instance is then used to populate the City instance fields, referring to CSV fields by enum value.

The mapping can also be done outside the domain class constructor: just pass a different Function<CSVRecordWithHeader, City> to CSVParserBuilder.

The resulting stream can be processed in parallel or sequentially with the usual Java Stream API. For example, to parse sequentially on a single thread:

Stream<City> stream = parser.parse(source).sequential();
stream.forEach(System.out::println);    

By default the parser operates in parallel mode.

Please see the sample project for the full source code of the above example.
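
Because the parser hands back an ordinary Stream<City>, the standard stream operations apply to it in either mode. The sketch below does not call the library. It uses a hand-built list as a stand-in for parser output and a minimal, hypothetical City class (only two fields) to show that the same pipeline gives the same answer sequentially and in parallel:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class CityStreamExample {

    /** Minimal stand-in for the City class from the example above. */
    static final class City {
        final String city;
        final double latitude;

        City(String city, double latitude) {
            this.city = city;
            this.latitude = latitude;
        }
    }

    /** Returns the name of the city with the highest latitude, if any. */
    static Optional<String> northernmost(List<City> cities, boolean parallel) {
        // The same pipeline works in both modes; only the stream source differs.
        return (parallel ? cities.parallelStream() : cities.stream())
                .max(Comparator.comparingDouble((City c) -> c.latitude))
                .map(c -> c.city);
    }

    public static void main(String[] args) {
        // Names and latitudes taken from the sample CSV rows above.
        List<City> cities = Arrays.asList(
                new City("Andorra", 42.5),
                new City("City of London", 51.514125),
                new City("Kharkiv", 49.980814));
        System.out.println(northernmost(cities, false).orElse("none"));
        System.out.println(northernmost(cities, true).orElse("none"));
    }
}
```

Reductions such as max are safe to run in parallel because they do not depend on encounter order. Order-sensitive terminal operations such as forEach may observe elements in a different order on a parallel stream.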

Special cases for headers

When the header contains special characters, the fields cannot simply be encoded as enum literals. In such cases toString should be overridden, for example:

enum Fields {
    Latitude("City Latitude"),
    Longitude("City Longitude"),
    City("City name"),
    Population("City Population");

    private final String headerFieldName;

    private Fields(String headerFieldName) {
        this.headerFieldName = headerFieldName;
    }

    @Override public String toString() {
        return headerFieldName;
    }
}
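
The override matters because the text returned by toString, rather than the constant name, is what has to match the column names in the file. The following sketch shows one way such a lookup could work. It is an illustration built only on the JDK, not the library's actual implementation, and HeaderLookup and columnIndexes are hypothetical names:

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

class HeaderLookup {

    enum Fields {
        Latitude("City Latitude"),
        Population("City Population");

        private final String headerFieldName;

        Fields(String headerFieldName) {
            this.headerFieldName = headerFieldName;
        }

        @Override public String toString() {
            return headerFieldName;
        }
    }

    /** Maps each enum constant to the index of the header column whose text equals toString(). */
    static <E extends Enum<E>> Map<E, Integer> columnIndexes(Class<E> fields, List<String> header) {
        Map<E, Integer> result = new EnumMap<>(fields);
        for (E field : fields.getEnumConstants()) {
            int index = header.indexOf(field.toString());
            if (index < 0) {
                throw new IllegalArgumentException("Header has no column named " + field);
            }
            result.put(field, index);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList(
                "City name", "City Population", "City Latitude", "City Longitude");
        System.out.println(columnIndexes(Fields.class, header));
    }
}
```

Columns that have no matching enum constant are simply never looked up, which fits the idea that only the requested fields need to be parsed.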

If the header is missing from the source, it can be supplied during parser construction:

CSVParserBuilder
    .aParser(City::new, City.CSVFields.class)
    .usingExplicitHeader("Country", "City", "AccentCity", "Region", "Population", "Latitude", "Longitude")
    .build();

Advanced usage

Referencing fields by position instead of by name can yield about a 10% performance improvement over normal usage. In this case parser construction is even simpler:

CSVParser<City> parser = CSVParserBuilder.aParser(City::new).build();

The enumeration specifying field names is no longer needed. However, the constructor now uses the CSVRecord interface:

public City(CSVRecord r) {
    r.skipFields(2);                              // skip Country and City
    this.city = r.getNextField().asString();      // AccentCity
    r.skipField();                                // skip Region
    this.population = r.getNextField().asInt();   // Population
    this.latitude = r.getNextField().asDouble();  // Latitude
    this.longitude = r.getNextField().asDouble(); // Longitude
}

Effectively, this encodes the field order of the CSV source in the constructor.
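
To illustrate why skipping by position is cheap, here is a deliberately naive cursor over a single line. It is not the library's implementation: it ignores quoting, handles only a comma separator, and FieldCursor and its methods are hypothetical names. The point is that a skipped field is only scanned for the next separator and never decoded:

```java
import java.nio.charset.StandardCharsets;

class FieldCursor {
    private final byte[] line;
    private int pos = 0;

    FieldCursor(String line) {
        this.line = line.getBytes(StandardCharsets.UTF_8);
    }

    /** Advances past the next field without decoding it. */
    void skipField() {
        while (pos < line.length && line[pos] != ',') {
            pos++;
        }
        pos++; // step over the separator (or past the end of the line)
    }

    /** Skips n consecutive fields. */
    void skipFields(int n) {
        for (int i = 0; i < n; i++) {
            skipField();
        }
    }

    /** Decodes only the bytes of the next field into a String. */
    String nextField() {
        int start = Math.min(pos, line.length);
        skipField();
        int end = Math.min(pos - 1, line.length);
        return new String(line, start, end - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        FieldCursor r = new FieldCursor("gb,city of london,City of London,H9,,51.514125,-.093689");
        r.skipFields(2);              // Country, City
        String city = r.nextField();  // AccentCity
        r.skipFields(2);              // Region, Population (empty in this row)
        double latitude = Double.parseDouble(r.nextField());
        System.out.println(city + " " + latitude);
    }
}
```

The trade-off is the one noted above: the reading code is now coupled to the column order, so a reordered file silently yields wrong values instead of failing on a missing header name.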

Performance

The best way to check the performance of the library is to run the benchmark on your target system with

gradle jmh

The reports can then be found in build/reports/jmh.

It is important to appreciate that performance may vary dramatically depending on the actual CSV content. As a very rough guideline, see below the sample output of "gradle jmh" on an i7 2700k Ubuntu system. The benchmark uses a cities.txt file similar to the example above, expanded to 3173800 rows and 157 MB in size:

Benchmark                        Mode  Cnt     Score     Error  Units
OpenCSVParser                    avgt    5  2393.921 ± 262.347  ms/op
Quick CSV Parallel with header   avgt    5   205.013 ±   1.739  ms/op
Quick CSV Parallel (advanced)    avgt    5   177.262 ±   1.739  ms/op
Quick CSV Sequential             avgt    5   648.462 ±  45.991  ms/op

The comparison is done with the OpenCSV library v3.8. Performance of other libraries can be extrapolated using the chart from https://github.com/uniVocity/csv-parsers-comparison
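
The ratios implied by this sample run can be worked out directly from the scores. The snippet below performs only that arithmetic on the numbers quoted above; the class and method names are illustrative:

```java
import java.util.Locale;

class BenchmarkRatios {

    /** How many times faster a run is than the baseline, given average times in ms/op. */
    static double speedup(double baselineMs, double candidateMs) {
        return baselineMs / candidateMs;
    }

    public static void main(String[] args) {
        // Scores from the sample benchmark output above (ms/op, lower is better).
        double openCsv = 2393.921;
        double parallelWithHeader = 205.013;
        double parallelAdvanced = 177.262;
        double sequential = 648.462;

        System.out.println(String.format(Locale.ROOT,
                "Sequential vs OpenCSV:           %.1fx", speedup(openCsv, sequential)));
        System.out.println(String.format(Locale.ROOT,
                "Parallel with header vs OpenCSV: %.1fx", speedup(openCsv, parallelWithHeader)));
        System.out.println(String.format(Locale.ROOT,
                "Parallel (advanced) vs OpenCSV:  %.1fx", speedup(openCsv, parallelAdvanced)));
    }
}
```

On these figures the sequential mode is roughly 3.7x faster than OpenCSV and the parallel modes roughly 11.7x and 13.5x faster. These ratios describe this one run on this one machine and file, so they should be read as a rough guide only.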

Prerequisites

The Quick CSV Streamer library requires Java 8 and has no other dependencies.

License

The library is licensed under the terms of the GPL v2.0 license. Please contact me if you wish to use this library under a more commercially friendly license or want to extend it, for example to add async parsing or support for different file formats.
