kororo / Excelcy

Licence: MIT
Excel Integration with spaCy. Training NER using Excel/XLSX from PDF, DOCX, PPT, PNG or JPG.

Programming languages: python, python3

ExcelCy


ExcelCy is a NER trainer that takes its input from XLSX, PDF, DOCX, PPT, PNG or JPG files. It uses the spaCy framework to match entities with PhraseMatcher, or with Matcher using regular expressions.

ExcelCy is convenient

This example is taken from the spaCy documentation, Simple Style Training. It demonstrates how to train NER using spaCy:

import spacy
import random

TRAIN_DATA = [
     ("Uber blew through $1 million a week", {'entities': [(0, 4, 'ORG')]}), # note: it is required to supply the character position
     ("Google rebrands its business apps", {'entities': [(0, 6, "ORG")]})] # note: it is required to supply the character position

nlp = spacy.blank('en')
optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)

nlp.to_disk('test_model')

TRAIN_DATA describes the sentences and the annotated entities to be trained. Counting character positions by hand every time is cumbersome. With ExcelCy, the (start, end) character positions can be omitted.

# install excelcy
# pip install excelcy

# download the en model from spaCy
# python -m spacy download en

# run this inside python or file
from excelcy import ExcelCy

# Test: John is the CEO of this_is_a_unique_company_name
excelcy = ExcelCy()
# by default, nlp_base is assumed to be the model en_core_web_sm
# (Config is imported with: from excelcy.storage import Config)
# excelcy.storage.config = Config(nlp_base='en_core_web_sm')
# if you have an existing model, use this instead
# excelcy.storage.config = Config(nlp_path='/path/model')
doc = excelcy.nlp('John is the CEO of this_is_a_unique_company_name')
# it will show no company entities
print([(ent.label_, ent.text) for ent in doc.ents])
# run this from the root of the repo; the file is also available at https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx
excelcy = ExcelCy.execute(file_path='tests/data/test_data_01.xlsx')
# use the nlp object as per spaCy API
doc = excelcy.nlp('John is the CEO of this_is_a_unique_company_name')
# now it recognises the company name
print([(ent.label_, ent.text) for ent in doc.ents])
# NOTE: if the entity is not shown, try increasing "train_iteration" or lowering "train_drop" in the "config" sheet of the Excel file
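
To see what is being saved here, the character offsets that the spaCy training format expects can be derived mechanically from a list of phrases. The following is a minimal sketch in plain Python, not ExcelCy's actual implementation; the helper name annotate and its arguments are hypothetical.

```python
def annotate(text, phrases):
    """Build a spaCy-style training tuple from a {phrase: label} mapping.

    Hypothetical helper for illustration only; just the first occurrence
    of each phrase is annotated.
    """
    entities = []
    for phrase, label in phrases.items():
        start = text.find(phrase)
        if start != -1:
            # The end offset is exclusive, as in the TRAIN_DATA example.
            entities.append((start, start + len(phrase), label))
    # Sort by start offset so the output order is stable.
    entities.sort()
    return (text, {"entities": entities})


print(annotate("Uber blew through $1 million a week", {"Uber": "ORG"}))
```

The result has the same shape as one entry of TRAIN_DATA, with the (start, end) pair filled in from the phrase instead of being counted by hand.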

ExcelCy is friendly

By default, ExcelCy training is divided into four phases. An example Excel file can be found in tests/data/test_data_01.xlsx:

1. Discovery

The first phase collects sentences from the data sources listed in sheet "source". A data source can be either:

  • Text: Direct sentence values.
  • Files: PDF, DOCX, PPT, PNG or JPG will be parsed using textract.

Note: See the textract source examples in tests/data/test_data_03.xlsx.

Note: The "textract" dependency is not included in ExcelCy; it must be installed manually.
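
The two kinds of data source can be pictured with a small dispatcher. This is a sketch under assumptions, not ExcelCy's code: the kind name "file" and the naive sentence splitter are hypothetical, while kind "text" and textract.process (which returns bytes) come from the examples in this README and the textract library respectively.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def discover(kind, value):
    """Collect sentences from a direct text value or from a file path."""
    if kind == "text":
        text = value
    elif kind == "file":  # hypothetical kind name for file sources
        # Imported lazily because textract is an optional dependency.
        import textract
        text = textract.process(value).decode("utf-8")
    else:
        raise ValueError("unknown source kind: %r" % kind)
    return split_sentences(text)


print(discover("text", "Uber blew through $1 million a week. Google rebrands its business apps."))
```

A real sentence splitter would rely on the spaCy pipeline rather than a regular expression; the regex is used here only to keep the sketch dependency-free.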

2. Preparation

In the next phase, the Gold annotations are defined in sheet "prepare", based on:

  • Current Data Model: Using spaCy API of nlp(sentence).ents
  • Phrase pattern: Robbie, Uber, Google, Amazon
  • Regex pattern: ^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$

All annotations defined here are treated as Gold annotations.
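
The regex pattern listed above describes a time in 24-hour notation and, because of ^ and $, matches only when the time is the whole string. The sketch below checks this with Python's re module. How ExcelCy applies the pattern to a sentence is not shown in this README, so the variant with word boundaries is an assumption added purely for illustration.

```python
import re

# Pattern from the list above: HH:MM in 24-hour notation.
ANCHORED = re.compile(r"^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$")
# Same pattern with word boundaries instead of ^ and $, so that it can be
# found inside a longer sentence (an assumption for illustration).
EMBEDDED = re.compile(r"\b([0-1]?[0-9]|2[0-3]):[0-5][0-9]\b")


def find_times(sentence):
    """Return (start, end, text) for each time-like span in the sentence."""
    return [(m.start(), m.end(), m.group()) for m in EMBEDDED.finditer(sentence)]


print(bool(ANCHORED.match("23:59")))  # the whole string is a valid time
print(bool(ANCHORED.match("24:00")))  # the hour is out of range
print(find_times("The meeting starts at 09:30 and ends at 17:45"))
```

The (start, end) pairs returned by find_times are the same kind of character offsets that the training data requires.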

3. Training

This is the main phase of NER training, as described in Simple Style Training. The data is iterated from sheet "train"; the parameters are controlled in sheet "config".

4. Consolidation

The last phase is to test and save the results, and to repeat the phases if required.

ExcelCy is flexible

Need more specific exports and phases? They can be controlled using the phase API. The following illustrates a real-world scenario:

  1. Train from tests/data/test_data_05.xlsx

    # download the dataset
    $ wget https://github.com/kororo/excelcy/raw/master/tests/data/test_data_05.xlsx
    # this will create a directory and file "export/train_05.xlsx"
    $ excelcy execute test_data_05.xlsx
    
  2. Open the result in "export/train_05.xlsx". It shows all sentences identified from the given source. However, there is an error: "Himalayas" is identified as "PRODUCT".

  3. To fix this, add a phrase matcher for "Himalayas = FAC". This is illustrated in tests/data/test_data_05a.xlsx.

  4. Train again and check the result in "export/train_05a.xlsx":

    # download the dataset
    $ wget https://github.com/kororo/excelcy/raw/master/tests/data/test_data_05a.xlsx
    # this will create a directory "nlp/data" and file "export/train_05a.xlsx"
    $ excelcy execute test_data_05a.xlsx
    
  5. Check that a backup of the nlp data model has been created in "nlp" and that the result is corrected in "export/train_05a.xlsx".

  6. Keep training the data model. If unexpected behaviour appears, the backup data model is available if needed.

ExcelCy is comprehensive

Under the hood, ExcelCy has strong and well-defined data storage. The data can be inspected at any of the phases above.

from excelcy import ExcelCy
from excelcy.storage import Config

# Test: John is the CEO of this_is_a_unique_company_name
excelcy = ExcelCy()
excelcy.storage.config = Config(nlp_base='en_core_web_sm', train_iteration=10, train_drop=0.2)
doc = excelcy.nlp('John is the CEO of this_is_a_unique_company_name')
# showing no ORG
print([(ent.label_, ent.text) for ent in doc.ents])
excelcy.storage.source.add(kind='text', value='John is the CEO of this_is_a_unique_company_name')
excelcy.discover()
excelcy.storage.prepare.add(kind='phrase', value='this_is_a_unique_company_name', entity='ORG')
excelcy.prepare()
excelcy.train()
doc = excelcy.nlp('John is the CEO of this_is_a_unique_company_name')
# ORG now is recognised
print([(ent.label_, ent.text) for ent in doc.ents])
# NOTE: if the entity is not shown, try increasing "train_iteration" or lowering "train_drop" (set in Config above, or in the "config" sheet when using Excel)

Features

  • Load multiple data sources such as Word documents, PowerPoint presentations, PDF or images.
  • Import/Export configuration with JSON, YML or Excel.
  • Add custom Entity labels.
  • Rule-based phrase matching using PhraseMatcher.
  • Rule-based matching using regex + Matcher.
  • Train a Named Entity Recogniser with ease.

Install

Either install with pip, or clone this repository and run the setup.py file.

$ pip install excelcy
# ensure the language model is installed first
$ python -m spacy download en

Train

To train the spaCy model:

from excelcy import ExcelCy
excelcy = ExcelCy.execute(file_path='test_data_01.xlsx')

Note: the example file is tests/data/test_data_01.xlsx in the repository.

CLI

ExcelCy has a basic CLI command, execute:

$ excelcy execute https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx

Test

Run the tests by installing the packages and running tox:

$ pip install poetry tox
$ tox
$ tox -e py36 -- tests/test_readme.py

For hot-reload during development:

$ npm i -g nodemon
$ nodemon

Data Definition

ExcelCy has a data definition, which is expressed in api.yml. As long as the data is given in this format and structure, ExcelCy is able to support any type of data format. See api.xlsx for the Excel file format. Data classes are defined with attrs; check storage.py for more detail.
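
As a rough illustration of what a declarative data class makes possible, the sketch below models one row of sheet "source" and round-trips it through JSON. It uses the standard library's dataclasses in place of attrs to stay dependency-free, and the class name SourceRow and its exact fields are assumptions, not the definitions in storage.py; only kind and value appear in the examples above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SourceRow:
    """Hypothetical stand-in for one row of sheet "source"."""
    idx: int    # row reference (assumed field)
    kind: str   # for example "text"
    value: str  # the sentence itself, or a file path


def export_rows(rows):
    """Serialise rows to a JSON string."""
    return json.dumps([asdict(row) for row in rows])


def import_rows(payload):
    """Rebuild rows from a JSON string produced by export_rows."""
    return [SourceRow(**item) for item in json.loads(payload)]


rows = [SourceRow(idx=1, kind="text", value="Uber blew through $1 million a week")]
payload = export_rows(rows)
print(payload)
print(import_rows(payload) == rows)
```

Because each class lists its fields explicitly, the same rows could be written to YAML or Excel by swapping only the serialisation step.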

Publishing

# this is a note for contributors
# ensure all tests pass locally
npm run test

# prepare for the new version
poetry version 0.4.1
npm run export

# commit the changes in git, especially on the release branch, and check the build in Travis
# https://travis-ci.com/github/kororo/excelcy

# if all goes well, push to master

FAQ

What are the idx columns in the Excel sheets?

They provide a reference between two things. For example, in sheet "train" you may want to know which row of sheet "source" a sentence was generated from. Also, because rows in Excel can be sorted freely, the idx column is a safeguard that keeps things in the correct order.
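
The role of such a column can be sketched in a few lines of Python. The rows and the column names below (idx, source_idx, text, value) are hypothetical and chosen for illustration; they are not ExcelCy's actual sheet schema.

```python
# Hypothetical rows standing in for sheets "source" and "train".
source = [
    {"idx": 1, "value": "Uber blew through $1 million a week"},
    {"idx": 2, "value": "Google rebrands its business apps"},
]
train = [
    {"idx": 1, "source_idx": 1, "text": "Uber blew through $1 million a week"},
    {"idx": 2, "source_idx": 2, "text": "Google rebrands its business apps"},
]

# Simulate a user sorting the sheet alphabetically in Excel.
shuffled = sorted(train, key=lambda row: row["text"])

# The idx column restores the original order.
restored = sorted(shuffled, key=lambda row: row["idx"])

# The reference column finds the source row a sentence came from.
source_by_idx = {row["idx"]: row for row in source}
origin = source_by_idx[restored[0]["source_idx"]]["value"]

print([row["idx"] for row in shuffled])
print(restored == train)
print(origin)
```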

Can ExcelCy import/export to X, Y, Z data format?

ExcelCy has strong and well-defined data storage, thanks to attrs. This makes it possible to import/export data in any format.

Error: ModuleNotFoundError: No module named 'pip'

There are many possible causes. Try downgrading pip (version 19.0.3 was buggy).

Does ExcelCy accept suggestions/ideas?

Yes! Please submit them as a new issue with the label "enhancement".

Acknowledgement

This project uses other awesome projects:

  • attrs: Python Classes Without Boilerplate.
  • pyexcel: Single API for reading, manipulating and writing data in csv, ods, xls, xlsx and xlsm files.
  • pyyaml: The next generation YAML parser and emitter for Python.
  • spacy: Industrial-strength Natural Language Processing (NLP) with Python and Cython.
  • textract: extract text from any document. no muss. no fuss.