cligs / toolbox

Licence: other
Collection of small tools for text processing.

Programming languages: HTML, Python, XSLT

toolbox

ARRAS should not be thought of as a black box into which one inserts a text along with a set of commands and out of which one receives a completed analysis. A better analogy is a toolbox containing a set of tools, each designed for a particular task. The ARRAS design always presumes a human inquirer at the center. This ARRAS amplifies, rather than replaces, specific perceptual and cognitive functions. (John B. Smith, "A New Environment for Literary Analysis", Perspectives in Computing 4.2/3, 1984)

What is the toolbox?

The toolbox is a collection of small work-in-progress scripts and code snippets for text processing produced by CLiGS.

Note that all functions are designed for Python 3 and are experimental in both design and quality. Each folder contains one or more Python scripts and some sample texts for testing. We are currently transitioning towards using the toolbox as a module (see below).

Experimental feature: toolbox as module

This feature lets you use the scripts as a repo-based module. The basic idea is to clone the toolbox repository from GitHub and add the path of the folder containing the toolbox to your Python sys.path (using the script "activate_toolbox.py", which is included here). You can then import modules and submodules from the toolbox in your own text processing scripts anywhere on your computer and use the functions they provide. You may want to create your own branch of the toolbox to customize the functions as necessary.
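The sys.path step described above can be sketched roughly as follows. This is only an illustration of the general technique, not the contents of "activate_toolbox.py"; the helper name `activate` and the folder path are hypothetical.

```python
import sys
from pathlib import Path


def activate(toolbox_parent: str) -> str:
    """Add the folder that contains the cloned toolbox to sys.path.

    Hypothetical helper for illustration; the real activate_toolbox.py
    may work differently.
    """
    path = str(Path(toolbox_parent).expanduser().resolve())
    # Avoid adding the same entry twice when the script is run repeatedly.
    if path not in sys.path:
        sys.path.insert(0, path)
    return path


# Assumed location: the folder *containing* the cloned "toolbox" directory.
activated = activate("/path/to/folder/containing/toolbox")
```

Once the parent folder is on sys.path, a statement such as `from toolbox import extract` can be resolved from any script on the same machine.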

Requirements

  • pandas
  • numpy
  • requests
  • lxml
  • ...
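Before importing from the toolbox, it may help to check that the packages named above are installed. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the list covers only the four packages named explicitly, since the full set of requirements is not spelled out here.

```python
from importlib.util import find_spec

# Only the packages named explicitly in the list above.
REQUIRED = ["pandas", "numpy", "requests", "lxml"]


def missing_packages(names):
    """Return the names that cannot be found by the import system."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```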

Module structure

To use the module efficiently, you need to know which submodules it includes and which functions each submodule provides. The following is intended as a quick overview; please see the submodules themselves for details.

  • extract.py
    • read_tei5
    • read_tei4
    • get_metadata
    • get_metadataP4
  • crawl.py
    • crawl_tc
    • convert_encoding
  • annotate
    • annotate_fw.py
      • use_freeling
      • use_wordnet
      • annotate_fw
    • fw2txm.py
      • fw2txm
    • prepare_tei.py
      • prepare_anno
      • postpare_anno
      • prepare
    • use_heideltime.py
      • apply_ht
    • workflow_teifw.py
  • check_quality
    • spellchecking.py
      • check_collection
      • correct_words
    • validate_tei.py
      • validate_tei
    • elements_used.py
  • extract
    • tei2pdf.py
      • convert2pdf
    • tei2pdf.xsl

To get more information about a submodule, especially what each function does and which parameters it takes, use Python's built-in help command, for example:

help(extract)

or

help(extract.read_tei5)
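If you only want a compact list of the functions a submodule defines, rather than the full help text, the standard inspect module can be used. The sketch below uses the standard-library json module as a stand-in so that it runs without the toolbox; the helper name `list_functions` is hypothetical, and you would pass a toolbox submodule such as extract instead.

```python
import inspect
import json  # stand-in module; pass a toolbox submodule here instead


def list_functions(module):
    """Return the public functions defined directly in the given module."""
    return sorted(
        name
        for name, obj in inspect.getmembers(module, inspect.isfunction)
        # Skip functions that were merely imported from elsewhere.
        if obj.__module__ == module.__name__ and not name.startswith("_")
    )


for name in list_functions(json):
    print(name)
```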

Example

If you want to read text from a TEI P5 file, you could use the following import statement and function call in your script:

from toolbox import extract

extract.read_tei5("/folder/with/tei/files/", "/folder/for/text/files", "bodytext")            
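To give a rough idea of what extracting body text from a TEI P5 document involves, here is a minimal standalone sketch using only the standard library. It is not the toolbox's implementation of read_tei5, whose behavior and parameters are documented in the submodule itself; the function name `body_text` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# TEI P5 elements live in this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Made-up sample document for illustration.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Demo</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>Hello <hi>world</hi>.</p></body></text>
</TEI>"""


def body_text(tei_xml: str) -> str:
    """Return the plain text inside the first <body> element, whitespace-normalized."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return ""
    # itertext() walks all nested elements, so inline markup is dropped.
    return " ".join("".join(body.itertext()).split())


print(body_text(SAMPLE))
```

The header content is ignored because only the descendants of the body element are visited.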