taojinmin / spparser

License: GPL-3.0
An async ETL tool written in Python.

Introduction in Chinese

Introduction

spparser aims to provide a concise and efficient way to read, write, and process text data. It supports both synchronous and asynchronous file reading and writing, and it can extract data with regular expressions, XPath, and CSS selectors. Future plans include read and write support for databases and NLP-based processing for more flexible handling. The architecture diagram is as follows:
[Image: spparser architecture diagram]

AsyncReader and AsyncWriter are inspired by @zpoint's idataapi_transform.

Installation

pip3 install spparser

Quick Start

from spparser import Reader, Writer, Extractor

def main():
    # example.csv:
    #   a,b
    #   122github,2
    #   -8spparser999,4
    data = Reader.read_csv(file_path="./example.csv", each_line_type="dict", max_read_lines=10)
    # data == [{'a': '122github', 'b': '2'}, {'a': '-8spparser999', 'b': '4'}]

    alist = []
    for item in data:
        # Pull the alphabetic part out of column "a".
        res = Extractor.regex(r"[a-zA-Z]+", item["a"], flags=0, trim_mode=True, return_all=False)
        alist.append(res)
    # alist == ["github", "spparser"]

    Writer.write(alist, "result.json")

if __name__ == "__main__":
    main()
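
To see what the regex step in the Quick Start produces without installing spparser, here is a minimal standard-library sketch. It assumes that `Extractor.regex(..., return_all=False)` returns the first match of the pattern, which is what the commented result above suggests. The helper name `first_match` is hypothetical and is not part of spparser.

```python
import re

def first_match(pattern, text, flags=0):
    """Return the first substring of `text` matching `pattern`, or None."""
    match = re.search(pattern, text, flags)
    return match.group(0) if match else None

# Same rows as the read_csv result shown in the Quick Start.
data = [{"a": "122github", "b": "2"}, {"a": "-8spparser999", "b": "4"}]

# Keep only the first run of ASCII letters from column "a".
alist = [first_match(r"[a-zA-Z]+", item["a"]) for item in data]
print(alist)
```

This is only an illustration of the pattern; spparser's own handling of `trim_mode` and `flags` may differ.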

Use Extractor.xpath() to extract text from HTML

from spparser import Reader, Writer, Extractor

def main():
    '''
    demo.html
    <html lang="en">
    <head>
        <title>spparser</title>
    </head>
    <body>
        <ul id="container">
            <li class="object-1" tag="1"/>
            <li class="object-2"/>
            <li class="object-3"/>
        </ul>
    </body>
    </html>
    '''
    # Read the whole file as one string, then select the text of <title>.
    html_text = Reader.read_anyfile("demo.html", line_by_line=False)
    res = Extractor.xpath("//title/text()", html_text)
    print(res)

if __name__ == "__main__":
    main()
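
Because demo.html happens to be well-formed XML, the same kind of query can be tried with Python's standard library alone. The sketch below is not how spparser implements `Extractor.xpath()`; it uses `xml.etree.ElementTree`, which supports only a subset of XPath (no `text()` step), so the element is selected first and its text is read afterwards.

```python
import xml.etree.ElementTree as ET

DEMO_HTML = """<html lang="en">
<head>
    <title>spparser</title>
</head>
<body>
    <ul id="container">
        <li class="object-1" tag="1"/>
        <li class="object-2"/>
        <li class="object-3"/>
    </ul>
</body>
</html>"""

root = ET.fromstring(DEMO_HTML)

# ElementTree's XPath subset has no text() step, so select the element
# and read its .text attribute instead.
title = root.find(".//title").text

# Attribute predicates are supported: collect the class of every <li>
# directly under the <ul id="container"> element.
classes = [li.get("class") for li in root.findall(".//ul[@id='container']/li")]

print(title)
print(classes)
```

Real-world HTML is rarely well-formed XML, so a lenient HTML parser is usually needed in practice.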

Reading files asynchronously

from spparser import Reader,Writer, AsyncReader, AsyncWriter
import asyncio

async def main():
    reader = AsyncReader.async_csv_reader("./src.csv",batch_size=10,each_line_type="dict",max_read_lines=100, debug=True)
    with AsyncWriter.async_csv_writer("./dest.csv") as writer:
        async for items in reader:
            # Process each batch here before writing, for example:
            # for item in items:
            #     ...
            await writer.write(items)

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

When debug is set to True, the reader and writer log each batch:

[2020-07-17  14:54:04] AsyncReader.py[line:70] INFO: from source: ./src.csv, this batch get 10 lines
[2020-07-17  14:54:04] AsyncWriter.py[line:63] INFO: to destination: ./dest.csv, write 10 lines.
[2020-07-17  14:54:04] AsyncReader.py[line:70] INFO: from source: ./src.csv, this batch get 10 lines
[2020-07-17  14:54:04] AsyncWriter.py[line:63] INFO: to destination: ./dest.csv, write 10 lines.
[2020-07-17  14:54:04] AsyncReader.py[line:70] INFO: from source: ./src.csv, this batch get 10 lines
[2020-07-17  14:54:04] AsyncWriter.py[line:63] INFO: to destination: ./dest.csv, write 10 lines.
[2020-07-17  14:54:04] AsyncReader.py[line:70] INFO: from source: ./src.csv, this batch get 10 lines
[2020-07-17  14:54:04] AsyncWriter.py[line:63] INFO: to destination: ./dest.csv, write 10 lines.
...
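
The log above shows the reader handing over fixed-size batches. A minimal sketch of that batching pattern, using only the standard library, might look like the following. It is not spparser's implementation: the function name `batched_csv_rows` is hypothetical, the CSV parsing itself is synchronous, and an in-memory string stands in for ./src.csv.

```python
import asyncio
import csv
import io

async def batched_csv_rows(stream, batch_size=10, max_read_lines=None):
    """Yield lists of up to `batch_size` dict rows from a CSV text stream."""
    reader = csv.DictReader(stream)
    batch, seen = [], 0
    for row in reader:
        if max_read_lines is not None and seen >= max_read_lines:
            break
        batch.append(row)
        seen += 1
        if len(batch) == batch_size:
            yield batch
            batch = []
            # Give other coroutines a chance to run between batches.
            await asyncio.sleep(0)
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch

async def main():
    # 25 in-memory rows stand in for ./src.csv.
    src = io.StringIO("a,b\n" + "".join(f"{i},{i * 2}\n" for i in range(25)))
    sizes = []
    async for items in batched_csv_rows(src, batch_size=10, max_read_lines=100):
        sizes.append(len(items))
    return sizes

if __name__ == "__main__":
    print(asyncio.run(main()))
```

With 25 rows and a batch size of 10, the loop sees two full batches followed by one batch of 5.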

Asynchronous read and write for MongoDB:

import asyncio
from spparser import AsyncReader, AsyncWriter

async def main():
    reader = AsyncReader.async_mongo_reader(query={}, collection="src_col", host="my_address", port=27017, database="my_db", username="my_name", password="my_pwd", batch_size=100, max_read_lines=1000)
    with AsyncWriter.async_mongo_writer(collection="dest_col", host="my_address", port=27017, database="my_db", username="my_name", password="my_pwd") as writer:
        # Copy each batch from the source collection to the destination.
        async for items in reader:
            await writer.write(items)

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

Version 0.4.10 added support for asynchronous read and write with MySQL:

import asyncio
from spparser import AsyncReader, AsyncWriter

async def main():
    # The writer creates the target table if it does not exist yet.
    sql = "CREATE TABLE IF NOT EXISTS TARGET_TABLE (field1 type1, field2 type2) DEFAULT CHARSET=utf8;"
    getter = AsyncReader.async_mysql_reader(query_sql="SELECT * FROM SRC_TABLE", host="localhost", port=None, database="test", username="username", password="password", batch_size=100, max_read_lines=1000)
    with AsyncWriter.async_mysql_writer(create_table_sql=sql, host="localhost", port=None, database="test", username="username", password="password") as writer:
        async for items in getter:
            await writer.write(items)

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

History

0.2.10

  • async_anyfile_reader, async_anyfile_writer, async_csv_reader, async_csv_writer support.
  • xpath, css, regex selectors in Extractor support.

0.3.30

  • async_mongo_reader, async_mongo_writer support

0.4.10

  • async_mysql_reader, async_mysql_writer support