Rynaro / estratto

Licence: MIT license
Parsing fixed-width file content made easy

Estratto

Estratto is an easy-to-use parser driven by YAML layout templates. It gives developers and non-developers alike a simple interface for extracting data from fixed-width files.

Motivation

In many scenarios, data processing is a crucial step when integrating with partner systems or storing data. But writing code to parse and import these text formats is tedious, and it leads to duplicated code in every project. This project was born to help developers reduce the time spent on this task, or to let them hand it over entirely to another team.

Installation

Add this line to your application's Gemfile:

gem 'estratto'

And then execute:

$ bundle

Or install it yourself as:

$ gem install estratto

Usage

Estratto takes two inputs: the data file to parse and a matching YAML layout.

Example of a default call for parsing:

Estratto::Document.process(file: 'path/to/data.txt', layout: 'path/to/layout.yml')

Layout specifications

Fixed-width files are often painful for humans to read, and the layout manual usually comes as a PDF or spreadsheet.

Here, we'll try to make things fun again, or at least less painful. 😂

The base layout for the YAML file is:

layout:
  name: 'jojo stand users'
  multi-register: true
  prefix: 0..1
  registers:
    - register: '01'
      fields:
        - name: name
          range: 2..45
          type: String
        - name: stand
          range: 46..75
          type: String

The output is an array of hashes that mirrors your fields:

[
    {
        name: 'Jotaro Kujo',
        stand: 'Star Platinum'
    },
    {
        name: 'Giorno Giovanna',
        stand: 'Golden Experience Requiem'
    },
    {
        name: 'Jobin Higashikata',
        stand: 'Speed King'
    }
]
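
To make the mapping from layout to output concrete, here is a minimal plain-Ruby sketch of the idea. It is not Estratto's implementation: the to_range and parse_line helpers and the sample line are hypothetical, and the sketch assumes ranges are inclusive and zero-based, like Ruby's own a..b.

```ruby
require 'yaml'

# Hypothetical layout, trimmed from the example above.
LAYOUT = YAML.safe_load(<<~YAML)
  layout:
    prefix: 0..1
    registers:
      - register: '01'
        fields:
          - name: name
            range: 2..45
          - name: stand
            range: 46..75
YAML

# YAML loads "2..45" as a plain string, so turn it into a Ruby Range.
def to_range(text)
  first, last = text.to_s.split('..').map(&:to_i)
  first..last
end

# Slice one fixed-width line into a hash, using the register whose
# identifier matches the line's prefix. Returns nil for unknown prefixes.
def parse_line(line, layout)
  prefix = line[to_range(layout['prefix'])]
  register = layout['registers'].find { |r| r['register'] == prefix }
  return nil unless register

  register['fields'].to_h do |field|
    [field['name'].to_sym, line[to_range(field['range'])].to_s.strip]
  end
end

# Sample line: 2-char prefix, 44-char name, 30-char stand.
line = '01' + 'Jotaro Kujo'.ljust(44) + 'Star Platinum'.ljust(30)
record = parse_line(line, LAYOUT['layout'])
```

Here record holds one hash with the keys name and stand; running the same helper over every line of a file would give the array shown above.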

The structure must strictly follow this form:

layout:
    (base configuration)
    registers:
        (layouts)

Currently, Estratto supports these types of fixed-width layouts:

  • Batch prefix based registers
  • Mono layout based registers (in development)

UTF-8 Conversion

Estratto uses the CharlockHolmes gem to detect the encoding of the file content and convert it to UTF-8. This approach prevents invalid characters from appearing in the output.

CharlockHolmes uses ICU for charset detection, so you need libicu in your environment.

Linux

RedHat, CentOS, Fedora:

yum install libicu-devel

Debian based:

apt-get install libicu-dev

macOS (Homebrew):

brew install icu4c
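
For readers unfamiliar with the convert-to-UTF-8 step described at the top of this section, the sketch below shows the conversion half using only Ruby's built-in String#encode. It is a stand-in, not what Estratto runs: it assumes the source encoding (ISO-8859-1) is already known, whereas Estratto relies on CharlockHolmes to detect it.

```ruby
# Bytes for "São Paulo" as they would appear in an ISO-8859-1 file
# (0xE3 is "ã" in that encoding).
raw = "S\xE3o Paulo".b

# Tag the bytes with their real encoding, then transcode to UTF-8.
# The replace options keep stray bytes from raising an exception.
utf8 = raw.force_encoding(Encoding::ISO_8859_1)
          .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
```

After this, utf8 is a valid UTF-8 string that can be sliced by character position without tripping over multi-byte sequences.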

Type Coercion

Estratto supports type coercion in the layout file, with some extra options called formats.

Data types supported by Estratto:

  • String
  • Integer
  • Float
  • DateTime
  • Date

If no type is set for a field, it defaults to String.

A register's field list always follows this base structure:

  fields:
    - name: name
      range: 2..12
      type: String
      formats:
        strip: true

name is the field's identifier; it becomes the symbol key in the parsed hash.

range is where the data sits within the line. (The first index is 0.)

type is the data type the value is coerced to.

formats holds configuration specific to the data type. Here you can format Strings and adjust the precision of unformatted Float data.
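
As a rough illustration of how these four keys work together, the sketch below applies a list of field specs to one line. The COERCERS table and the extract helper are hypothetical names, not part of Estratto, and only two types are covered.

```ruby
# Hypothetical coercers keyed by the type name used in the layout.
COERCERS = {
  'String'  => ->(raw, formats) { formats['strip'] ? raw.strip : raw },
  'Integer' => ->(raw, _formats) { raw.to_i }
}.freeze

# Apply one field spec to one line and return a [key, value] pair.
def extract(line, field)
  raw     = line[field['range']]
  type    = field.fetch('type', 'String') # String when no type is set
  formats = field.fetch('formats', {})
  [field['name'].to_sym, COERCERS.fetch(type).call(raw, formats)]
end

fields = [
  { 'name' => 'name', 'range' => 2..12, 'formats' => { 'strip' => true } },
  { 'name' => 'id', 'range' => 13..18, 'type' => 'Integer' }
]

# Sample line: 2-char prefix, 11-char name, 6-char id.
line = '01' + 'Kakyoin'.ljust(11) + '000123'
row  = fields.to_h { |field| extract(line, field) }
```

The first field has no type, so it falls back to String and is stripped; the second is coerced to an Integer.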

Formats

Formats are the way to deal with the "surprises" this kind of file can throw at us: very long string fields padded with blank space, DateTime values with odd formatting, or Float values with no decimal point even though the manual describes them as "Decimal(15, 2)".

String

strip

Works like Ruby's String#strip method.

strip: true

Output example:

# raw data
'Hierophant Green         '
# with strip clause
'Hierophant Green'

Integer

A simple integer converter. Useful when you need to deal with ids.

Currently there are no formats for Integer. :)

# raw data
'000123'
# coerced
123

# raw data
'123'
# coerced
123

# raw data
'a'
# coerced
0
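
The three results above are exactly what Ruby's own String#to_i returns, so the behaviour is easy to reproduce in plain Ruby. Whether Estratto calls to_i internally is an assumption. The strict_integer helper is hypothetical and only shows one way to tell a real zero from unparseable text if your data needs that distinction.

```ruby
# Lenient conversion: mirrors the outputs shown above.
def lenient_integer(raw)
  raw.to_i
end

# Strict alternative: nil instead of 0 when the text is not a number.
# Base 10 is passed explicitly so leading zeros are not read as octal.
def strict_integer(raw)
  Integer(raw, 10, exception: false)
end

lenient_integer('000123') # => 123
lenient_integer('a')      # => 0
strict_integer('000123')  # => 123
strict_integer('a')       # => nil
```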

Float

Float is one of the most important types here. Fixed-width files often deliver numbers in a non-obvious format, for example with the decimal point left out.

precision

precision: <integer>

Examples:

precision: 2
# raw data
'12345'
# with precision
123.45

precision: 3
# raw data
'12345'
# with precision
12.345

comma_format

comma_format: <boolean>

Examples:

comma_format: true
# raw data
'123,45'
# with comma format
123.45
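
Both Float formats boil down to a little arithmetic or string handling. The sketch below reproduces the examples above in plain Ruby; the helper names with_precision and with_comma_format are hypothetical, and this is not necessarily how Estratto computes the values.

```ruby
# Place the implied decimal point `precision` digits from the right.
def with_precision(raw, precision)
  raw.to_i.fdiv(10**precision)
end

# Treat the comma as the decimal separator.
def with_comma_format(raw)
  raw.tr(',', '.').to_f
end

with_precision('12345', 2)  # => 123.45
with_precision('12345', 3)  # => 12.345
with_comma_format('123,45') # => 123.45
```

Parsing the digits as an Integer first and dividing once avoids building an intermediate string with the decimal point spliced in.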

DateTime and Date

DateTime and Date share the same format attributes. The difference is that one returns a DateTime, while the other always returns a Date.

format

format: <ruby strptime format pattern>

Examples:

format: '%Y%m%d'
# raw data
'20180101'
# with format
#<DateTime: 2018-01-01T00:00:00+00:00 ...>

format: '%d/%m/%Y'
# raw data
'01/01/2018'
# with format
#<DateTime: 2018-01-01T00:00:00+00:00 ...>
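
The patterns are standard strptime directives, so the examples can be reproduced with Ruby's date library alone. This sketch assumes nothing about Estratto's internals; it only shows what DateTime.strptime and Date.strptime return for the same inputs.

```ruby
require 'date'

# DateTime keeps a time part, which defaults to midnight UTC here.
as_datetime = DateTime.strptime('20180101', '%Y%m%d')

# Date carries only the calendar day.
as_date = Date.strptime('01/01/2018', '%d/%m/%Y')

as_datetime.iso8601 # => "2018-01-01T00:00:00+00:00"
as_date.iso8601     # => "2018-01-01"
```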

General Formats Properties

Sometimes we need to deal with general patterns in third-party files, such as missing information or unexpected export formats.

Allow Empty

The allow_empty property was designed to deal with unexpected data exported by third parties. For example, a DateTime field uses the %Y%m%d format, but in the third-party file some lines contain only blank spaces or 00000000.

When allow_empty is set on a field, such values are returned as nil.

Tip: allow_empty can be omitted when you do not need this safeguard.

Example:

  fields:
    - name: birthdate
      range: 2..10
      type: DateTime
      formats:
        allow_empty: true
        format: '%d/%m/%Y'
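
The effect of allow_empty can be pictured as a guard that runs before the normal coercion. The sketch below is a hypothetical plain-Ruby version: coerce_datetime is an invented helper, and treating any mix of whitespace and zeros as "empty" is an assumption about what counts as empty, not a documented rule.

```ruby
require 'date'

# Hypothetical helper: with allow_empty, blank or all-zero input
# becomes nil instead of reaching the date parser.
def coerce_datetime(raw, format:, allow_empty: false)
  return nil if allow_empty && raw.match?(/\A[\s0]*\z/)

  DateTime.strptime(raw, format)
end

coerce_datetime('        ', format: '%Y%m%d', allow_empty: true) # => nil
coerce_datetime('00000000', format: '%Y%m%d', allow_empty: true) # => nil
coerce_datetime('20180101', format: '%Y%m%d', allow_empty: true) # a DateTime
```

Without the guard, DateTime.strptime raises on input like 00000000, which is the failure this property is meant to absorb.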

Tests

Simply run rake spec.

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/Rynaro/estratto.

License

The gem is available as open source under the terms of the MIT License.
