quanteda / readtext

Licence: other
an R package for reading text files

readtext: Import and handling for plain and formatted text files

An R package for reading text files in all their various formats, by Ken Benoit, Adam Obeng, Paul Nulty, Aki Matsuo, Kohei Watanabe, and Stefan Müller.

Introduction

readtext is a one-function package that does exactly what it says on the tin: it reads files containing text, along with any associated document-level metadata, which we call “docvars” (short for document variables). Plain text files do not have docvars, but other formats such as .csv, .tab, .xml, and .json files usually do.

readtext accepts filemasks, so that you can specify a pattern to load multiple texts, and these texts can even be of multiple types. readtext is smart enough to process them correctly, returning a data.frame with a primary field “text” containing a character vector of the texts, and additional columns of the data.frame as found in the document variables from the source files.

Because encoding can be a challenging issue when reading in texts, we include functions for diagnosing encodings on a file-by-file basis, and allow you to specify a vector of input encodings so that files with different encodings can be read in a single call. (All encoding conversions are handled by the stringi package.)
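As a minimal sketch of the vectorized encoding input described above, the following writes two small files in different encodings and reads them in one call. The file names and contents are hypothetical and exist only for illustration; the example assumes that the `encoding` argument of `readtext()` accepts one encoding per file, matched in the order of the files.

```r
library(readtext)

# Hypothetical demo files, written to a temporary directory
demo_dir <- tempfile("readtext_enc_")
dir.create(demo_dir)
utf8_file   <- file.path(demo_dir, "czech_utf8.txt")
latin1_file <- file.path(demo_dir, "french_latin1.txt")

# Write one file as UTF-8 bytes and one as ISO-8859-1 bytes;
# \u escapes keep this script itself ASCII-only
writeBin(charToRaw(enc2utf8("V\u0161eobecn\u00e1 deklarace\n")), utf8_file)
writeBin(
  iconv("D\u00e9claration universelle\n",
        from = "UTF-8", to = "ISO-8859-1", toRaw = TRUE)[[1]],
  latin1_file
)

# One encoding per file, listed in the same order as the files
rt_enc <- readtext(
  c(utf8_file, latin1_file),
  encoding = c("UTF-8", "ISO-8859-1")
)
print(rt_enc)
```

If the encodings are matched correctly, both texts should display with their accented characters intact rather than as garbled bytes.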

How to Install

  1. From CRAN

    install.packages("readtext")
  2. From GitHub, if you want the latest development version.

    # the devtools package is required to install readtext from GitHub
    devtools::install_github("quanteda/readtext")

Linux note: Some system dependencies may not be installed by default on Linux systems. On Debian/Ubuntu, try installing the required library by running this command at the command line:

sudo apt-get install libpoppler-cpp-dev   # for pdftools (PDF support)

Demonstration: Reading one or more text files

readtext supports plain text files (.txt), data in some form of JavaScript Object Notation (.json), comma- or tab-separated values (.csv, .tab, .tsv), XML documents (.xml), as well as PDF, Microsoft Word, and other formatted document types (.pdf, .doc, .docx, .odt, .rtf). readtext also handles multiple files and file types, for instance through a “glob” expression, as well as files from a URL or an archive file (.zip, .tar, .tar.gz, .tar.bz).

File formats are determined automatically from the filename extensions. If a file has no extension, or its extension is not recognized, readtext will assume that it is plain text. The following command, for instance, will load all of the files from the subdirectory txt/UDHR/:

require(readtext)
## Loading required package: readtext
# get the data directory from readtext
DATA_DIR <- system.file("extdata/", package = "readtext")

# read in all files from a folder
readtext(paste0(DATA_DIR, "/txt/UDHR/*"))
## readtext object consisting of 13 documents and 0 docvars.
## # Description: df[,2] [13 × 2]
##   doc_id            text                         
##   <chr>             <chr>                        
## 1 UDHR_chinese.txt  "\"世界人权宣言\n联合国\"..."
## 2 UDHR_czech.txt    "\"VŠEOBECNÁ \"..."          
## 3 UDHR_danish.txt   "\"Den 10. de\"..."          
## 4 UDHR_english.txt  "\"Universal \"..."          
## 5 UDHR_french.txt   "\"Déclaratio\"..."          
## 6 UDHR_georgian.txt "\"FLFVBFYBC \"..."          
## # … with 7 more rows
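A glob can also match files of more than one type. The sketch below is a hypothetical illustration (the file names, contents, and the `year` column are invented for the example): it writes a plain text file and a CSV file into a temporary directory and reads both with a single pattern, relying on the extension-based format detection described above.

```r
library(readtext)

# Hypothetical demo directory holding two file types
demo_dir <- tempfile("readtext_mixed_")
dir.create(demo_dir)

# A plain text file: one document, no docvars
writeLines("A short plain text document.", file.path(demo_dir, "note.txt"))

# A CSV file: one document per row, plus a "year" docvar
speeches <- data.frame(
  text = c("First speech text.", "Second speech text."),
  year = c(2001L, 2005L),
  stringsAsFactors = FALSE
)
write.csv(speeches, file.path(demo_dir, "speeches.csv"), row.names = FALSE)

# The glob matches both files; each is parsed according to its extension,
# and text_field names the CSV column that holds the texts
rt_mixed <- readtext(file.path(demo_dir, "*"), text_field = "text")
print(rt_mixed)
```

The plain text file has no `year` value of its own, so that docvar is expected to be missing for its row.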

For files that contain multiple documents, such as comma-separated value (.csv) files, you will need to specify the name of the column containing the texts, using the text_field argument:

# read in comma-separated values and specify text field
readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
## readtext object consisting of 5 documents and 3 docvars.
## # Description: df[,5] [5 × 5]
##   doc_id            text                 Year President  FirstName
##   <chr>             <chr>               <int> <chr>      <chr>    
## 1 inaugCorpus.csv.1 "\"Fellow-Cit\"..."  1789 Washington George   
## 2 inaugCorpus.csv.2 "\"Fellow cit\"..."  1793 Washington George   
## 3 inaugCorpus.csv.3 "\"When it wa\"..."  1797 Adams      John     
## 4 inaugCorpus.csv.4 "\"Friends an\"..."  1801 Jefferson  Thomas   
## 5 inaugCorpus.csv.5 "\"Proceeding\"..."  1805 Jefferson  Thomas

For a more complete demonstration, see the package vignette.

Inter-operability with other packages

With quanteda

readtext was originally developed in early versions of the quanteda package for the quantitative analysis of textual data. Because quanteda’s corpus constructor recognizes the data.frame format returned by readtext(), it can construct a corpus directly from a readtext object, preserving all docvars and other meta-data.

require(quanteda)
## Loading required package: quanteda
## Package version: 2.1.2
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
# read in comma-separated values with readtext
rt_csv <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
# create quanteda corpus
corpus_csv <- corpus(rt_csv)
summary(corpus_csv, 5)
## Corpus consisting of 5 documents, showing 5 documents:
## 
##               Text Types Tokens Sentences Year  President FirstName
##  inaugCorpus.csv.1   625   1539        23 1789 Washington    George
##  inaugCorpus.csv.2    96    147         4 1793 Washington    George
##  inaugCorpus.csv.3   826   2577        37 1797      Adams      John
##  inaugCorpus.csv.4   717   1923        41 1801  Jefferson    Thomas
##  inaugCorpus.csv.5   804   2380        45 1805  Jefferson    Thomas

Text Interchange Format compatibility

Because readtext returns a data.frame formatted according to the corpus structure of the Text Interchange Format (TIF), its output can easily be used by other packages that accept a corpus in data.frame format.

If you only want a named character object, readtext also defines an as.character() method that takes a readtext data.frame and returns just the named character vector of texts, conforming to the TIF definition of the character version of a corpus.
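The two corpus representations mentioned above can be inspected with a small example. This is a sketch only, using a hypothetical one-line demo file written to the session's temporary directory:

```r
library(readtext)

# Hypothetical demo file
demo_file <- file.path(tempdir(), "tif_demo.txt")
writeLines("A single demo document.", demo_file)

rt_tif <- readtext(demo_file)

# data.frame form: the first two columns are the document id and the text
print(names(rt_tif)[1:2])

# character form: texts as values, document ids as names
txt <- as.character(rt_tif)
print(names(txt))
```

The character form drops any docvars, so it is best suited to packages that only need the texts themselves.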