Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

Created with love in Canada, visit hostnodejs.com today

Feel like to post an Ad? Learn Details

All Projects → hrbrmstr → Docxtractr

hrbrmstr / Docxtractr

Licence: other

✂️ Extract Tables from Microsoft Word Documents with R

Programming Languages

7636 projects

Labels

rstats docx

Projects that are alternatives of or similar to Docxtractr

Ralger

ralger makes it easy to scrape a website. Built on the shoulders of titans: rvest, xml2.

Stars: ✭ 130 (-6.47%)

Mutual labels: rstats

Net Core Docx Html To Pdf Converter

.NET Core library to create custom reports based on Word docx or HTML documents and convert to PDF

Stars: ✭ 133 (-4.32%)

Mutual labels: docx

Tidygeocoder

Geocoding Made Easy

Stars: ✭ 137 (-1.44%)

Mutual labels: rstats

Ncaahoopr

An R package for working with NCAA Basketball Play-by-Play Data

Stars: ✭ 132 (-5.04%)

Mutual labels: rstats

Shinyfiles

A shiny extension for server side file access

Stars: ✭ 133 (-4.32%)

Mutual labels: rstats

D3r

d3.js helpers for R

Stars: ✭ 133 (-4.32%)

Mutual labels: rstats

Analogsea

Digital Ocean R client

Stars: ✭ 128 (-7.91%)

Mutual labels: rstats

Dataspice

🌶 Create lightweight schema.org descriptions of your datasets

Stars: ✭ 137 (-1.44%)

Mutual labels: rstats

Rstudiothemes

A curated list of RStudio themes found on Github

Stars: ✭ 134 (-3.6%)

Mutual labels: rstats

Rmapshaper

An R wrapper for the mapshaper javascript library

Stars: ✭ 136 (-2.16%)

Mutual labels: rstats

Data Science In Tidyverse

Data Science in the tidyverse, a two-day workshop @ rstudio:conf(2018)

Stars: ✭ 132 (-5.04%)

Mutual labels: rstats

Rsuite

The most complete R development and data science platform.

Stars: ✭ 133 (-4.32%)

Mutual labels: rstats

Gganimate

A Grammar of Animated Graphics

Stars: ✭ 1,744 (+1154.68%)

Mutual labels: rstats

Cicerone

🏛️ Give tours of your Shiny apps

Stars: ✭ 131 (-5.76%)

Mutual labels: rstats

Crrri

A Chrome Remote Interface written in R

Stars: ✭ 137 (-1.44%)

Mutual labels: rstats

Packagemetrics

A Package for Helping You Choose Which Package to Use

Stars: ✭ 129 (-7.19%)

Mutual labels: rstats

Hrconvert2

A self-hosted, drag-and-drop, & nosql file conversion server that supports 62x file formats.

Stars: ✭ 132 (-5.04%)

Mutual labels: docx

Streamgraph

〰️ htmlwidget for creating streamgraph visualizations in R

Stars: ✭ 137 (-1.44%)

Mutual labels: rstats

Rcrossref

R client for various CrossRef APIs

Stars: ✭ 137 (-1.44%)

Mutual labels: rstats

Tic

Tasks Integrating Continuously: CI-Agnostic Workflow Definitions

Stars: ✭ 135 (-2.88%)

Mutual labels: rstats

View All Similar Projects ➔

docxtractr

Extract Data Tables and Comments from ‘Microsoft’ ‘Word’ Documents

Description

An R package for extracting tables & comments out of Word documents (docx). Development versions are available here and production versions are on CRAN.

Microsoft Word docx files provide an XML structure that is fairly straightforward to navigate, especially when it applies to Word tables. The docxtractr package provides tools to determine table count, table structure and extract tables from Microsoft Word docx documents.

Many tables in Word documents are in twisted formats where there may be labels or other oddities mixed in that make it difficult to work with the underlying data. docxtractr provides a function—assign_colnames—that makes it easy to identify a particular row in a scraped (or any, really) data.frame as the one containing column names and have it become the column names, removing it and (optionally) all of the rows before it (since that’s usually what needs to be done).

What’s in the tin?

The following functions are implemented:

read_docx: Read in a Word document for table extraction
docx_describe_tbls: Returns a description of all the tables in the Word document
docx_describe_cmnts: Returns a description of all the comments in the Word document
docx_extract_tbl: Extract a table from a Word document
docx_extract_all_cmnts: Extract comments from a Word document
docx_extract_all_tbls: Extract all tables from a Word document (docx_extract_all is now deprecated)
docx_tbl_count: Get number of tables in a Word document
docx_cmnt_count: Get number of comments in a Word document
assign_colnames: Make a specific row the column names for the specified data.frame
mcga : Make column names great again
set_libreoffice_path: Point to Local soffice.exe File

The following data file are included:

system.file("examples/data.docx", package="docxtractr"): Word docx with 1 table
system.file("examples/data3.docx", package="docxtractr"): Word docx with 3 tables
system.file("examples/none.docx", package="docxtractr"): Word docx with 0 tables
system.file("examples/complex.docx", package="docxtractr"): Word docx with non-uniform tables
system.file("examples/comments.docx", package="docxtractr"): Word docx with comments
system.file("examples/realworld.docx", package="docxtractr"): A “real world” Word docx file with tables of all shapes and sizes
system.file("examples/trackchanges.docx", package="docxtractr"): Word docx with track changes in a table

Installation

# devtools::install_github("hrbrmstr/docxtractr")
# OR 
install.packages("docxtractr")

Usage

library(docxtractr)
library(tibble)
library(dplyr)

# current version
packageVersion("docxtractr")
#> [1] '0.6.0'

# one table
doc <- read_docx(system.file("examples/data.docx", package="docxtractr"))

docx_tbl_count(doc)
#> [1] 1

docx_describe_tbls(doc)
#> Word document [/Library/Frameworks/R.framework/Versions/3.5/Resources/library/docxtractr/examples/data.docx]
#> 
#> Table 1
#>   total cells: 16
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [This, Is, A, Column]

docx_extract_tbl(doc, 1)
#> # A tibble: 3 x 4
#>   This  Is      A     Column  
#>   <chr> <chr>   <chr> <chr>   
#> 1 1     Cat     3.4   Dog     
#> 2 3     Fish    100.3 Bird    
#> 3 5     Pelican -99   Kangaroo

docx_extract_tbl(doc)
#> # A tibble: 3 x 4
#>   This  Is      A     Column  
#>   <chr> <chr>   <chr> <chr>   
#> 1 1     Cat     3.4   Dog     
#> 2 3     Fish    100.3 Bird    
#> 3 5     Pelican -99   Kangaroo

docx_extract_tbl(doc, header=FALSE)
#> NOTE: header=FALSE but table has a marked header row in the Word document
#> # A tibble: 4 x 4
#>   V1    V2      V3    V4      
#>   <chr> <chr>   <chr> <chr>   
#> 1 This  Is      A     Column  
#> 2 1     Cat     3.4   Dog     
#> 3 3     Fish    100.3 Bird    
#> 4 5     Pelican -99   Kangaroo

# url 

budget <- read_docx("http://rud.is/dl/1.DOCX")

docx_tbl_count(budget)
#> [1] 2

docx_describe_tbls(budget)
#> Word document [http://rud.is/dl/1.DOCX]
#> 
#> Table 1
#>   total cells: 24
#>   row count  : 6
#>   uniform    : likely!
#>   has header : unlikely
#> 
#> Table 2
#>   total cells: 28
#>   row count  : 4
#>   uniform    : likely!
#>   has header : unlikely

docx_extract_tbl(budget, 1)
#> # A tibble: 5 x 4
#>   ``                                 `Short-term Portfolio` `Long-term Portfolio` `Total Portfolio Values`
#>   <chr>                              <chr>                  <chr>                 <chr>                   
#> 1 Portfolio Balance (Market Value) * $  123,651,911         $ 294,704,136         $ 418,356,047           
#> 2 Effective Yield                    0.16 %                 1.42 %                1.05 %                  
#> 3 Avg. Weighted Maturity             11 Days                2.4 Years             1.7 Years               
#> 4 Net Earnings                       $      18,470          $      350,554        $      369,024          
#> 5 Benchmark**                        0.02 %                 0.41 %                0.27 %

docx_extract_tbl(budget, 2) 
#> # A tibble: 3 x 7
#>   ``                   `Amount of Funds … Maturity  `Effective Yiel… `Interpolated Y… `Total Return  … `Total Return  …
#>   <chr>                <chr>              <chr>     <chr>            <chr>            <chr>            <chr>           
#> 1 Short-Term Portfolio $ 123,651,911      11 days   0.16 %           0.01 %           0.013            0.160           
#> 2 Long-Term Portfolio  $ 294,704,136      2.4 years 1.42 %           0.41 %           0.437            0.250           
#> 3 Total Portfolio      $ 418,356,047      1.7 years 1.05 %           0.27 %           0.298            0.222

# three tables
doc3 <- read_docx(system.file("examples/data3.docx", package="docxtractr"))

docx_tbl_count(doc3)
#> [1] 3

docx_describe_tbls(doc3)
#> Word document [/Library/Frameworks/R.framework/Versions/3.5/Resources/library/docxtractr/examples/data3.docx]
#> 
#> Table 1
#>   total cells: 16
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [This, Is, A, Column]
#> 
#> Table 2
#>   total cells: 12
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [Foo, Bar, Baz]
#> 
#> Table 3
#>   total cells: 14
#>   row count  : 7
#>   uniform    : likely!
#>   has header : likely! => possibly [Foo, Bar]

docx_extract_tbl(doc3, 3)
#> # A tibble: 6 x 2
#>   Foo   Bar  
#>   <chr> <chr>
#> 1 Aa    Bb   
#> 2 Dd    Ee   
#> 3 Gg    Hh   
#> 4 1     2    
#> 5 Zz    Jj   
#> 6 Tt    ii

# no tables
none <- read_docx(system.file("examples/none.docx", package="docxtractr"))

docx_tbl_count(none)
#> [1] 0

# wrapping in try since it will return an error
# use docx_tbl_count before trying to extract in scripts/production
try(docx_describe_tbls(none))
#> No tables in document
try(docx_extract_tbl(none, 2))

# 5 tables, with two in sketchy formats
complx <- read_docx(system.file("examples/complex.docx", package="docxtractr"))

docx_tbl_count(complx)
#> [1] 5

docx_describe_tbls(complx)
#> Word document [/Library/Frameworks/R.framework/Versions/3.5/Resources/library/docxtractr/examples/complex.docx]
#> 
#> Table 1
#>   total cells: 16
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [This, Is, A, Column]
#> 
#> Table 2
#>   total cells: 12
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [Foo, Bar, Baz]
#> 
#> Table 3
#>   total cells: 14
#>   row count  : 7
#>   uniform    : likely!
#>   has header : likely! => possibly [Foo, Bar]
#> 
#> Table 4
#>   total cells: 11
#>   row count  : 4
#>   uniform    : unlikely => found differing cell counts (3, 2) across some rows
#>   has header : likely! => possibly [Foo, Bar, Baz]
#> 
#> Table 5
#>   total cells: 21
#>   row count  : 7
#>   uniform    : likely!
#>   has header : unlikely

docx_extract_tbl(complx, 3, header=TRUE)
#> # A tibble: 6 x 2
#>   Foo   Bar  
#>   <chr> <chr>
#> 1 Aa    Bb   
#> 2 Dd    Ee   
#> 3 Gg    Hh   
#> 4 1     2    
#> 5 Zz    Jj   
#> 6 Tt    ii

docx_extract_tbl(complx, 4, header=TRUE)
#> # A tibble: 3 x 3
#>   Foo   Bar   Baz  
#>   <chr> <chr> <chr>
#> 1 Aa    BbCc  <NA> 
#> 2 Dd    Ee    Ff   
#> 3 Gg    Hh    ii

docx_extract_tbl(complx, 5, header=TRUE)
#> # A tibble: 6 x 3
#>   Foo   Bar   Baz  
#>   <chr> <chr> <chr>
#> 1 Aa    Bb    Cc   
#> 2 Dd    Ee    Ff   
#> 3 Gg    Hh    Ii   
#> 4 Jj88  Kk    Ll   
#> 5 ""    Uu    Ii   
#> 6 Hh    Ii    h

# a "real" Word doc
real_world <- read_docx(system.file("examples/realworld.docx", package="docxtractr"))

docx_tbl_count(real_world)
#> [1] 8

# get all the tables
tbls <- docx_extract_all_tbls(real_world)

# see table 1
tbls[[1]]
#> # A tibble: 9 x 9
#>   V1                V2        V3         V4                     V5                     V6        V7      V8     V9     
#>   <chr>             <chr>     <chr>      <chr>                  <chr>                  <chr>     <chr>   <chr>  <chr>  
#> 1 Lesson 1:  Step 1 <NA>      <NA>       <NA>                   <NA>                   <NA>      <NA>    <NA>   <NA>   
#> 2 Country           Birthrate Death Rate Population Growth 2005 Population Growth 2050 Relative… Social… Socia… Social…
#> 3 USA               2.06      0.51%      0.92%                  -0.06%                 Post- In… Female… Stabl… Good t…
#> 4 China             1.62      0.3%       0.6%                   -0.58%                 Post- In… Govern… Techn… Urbani…
#> 5 Egypt             2.83      0.41%      2.0%                   1.32%                  Mature I… Not ye… More … Slight…
#> 6 India             2.35      0.34%      1.56%                  0.76%                  Post Ind… Econom… Pover… Becomi…
#> 7 Italy             1.28      0.72%      0.35%                  -1.33%                 Late Pos… Stable… Peopl… Better…
#> 8 Mexico            2.43      0.25%      1.41%                  0.96%                  Mature I… Better… Emigr… Econom…
#> 9 Nigeria           4.78      0.26%      2.46%                  3.58%                  End of M… Disease Peopl… People…

# make table 1 better
assign_colnames(tbls[[1]], 2)
#> # A tibble: 7 x 9
#>   Country Birthrate `Death Rate` `Population Grow… `Population Grow… `Relative place… `Social Factors… `Social Factors…
#>   <chr>   <chr>     <chr>        <chr>             <chr>             <chr>            <chr>            <chr>           
#> 1 USA     2.06      0.51%        0.92%             -0.06%            Post- Industrial Female Independ… Stable Birth Ra…
#> 2 China   1.62      0.3%         0.6%              -0.58%            Post- Industrial Government inte… Technology      
#> 3 Egypt   2.83      0.41%        2.0%              1.32%             Mature Industri… Not yet industr… More children n…
#> 4 India   2.35      0.34%        1.56%             0.76%             Post Industrial  Economic growth  Poverty         
#> 5 Italy   1.28      0.72%        0.35%             -1.33%            Late Post indus… Stable birth ra… People marry la…
#> 6 Mexico  2.43      0.25%        1.41%             0.96%             Mature Industri… Better health c… Emigration      
#> 7 Nigeria 4.78      0.26%        2.46%             3.58%             End of Mechaniz… Disease          People marry ea…
#> # ... with 1 more variable: `Social Factors 3` <chr>

# make table 1's column names great again 
mcga(assign_colnames(tbls[[1]], 2))
#> # A tibble: 7 x 9
#>   country birthrate death_rate population_growt… population_growt… relative_place_in… social_factors_1 social_factors_2
#>   <chr>   <chr>     <chr>      <chr>             <chr>             <chr>              <chr>            <chr>           
#> 1 USA     2.06      0.51%      0.92%             -0.06%            Post- Industrial   Female Independ… Stable Birth Ra…
#> 2 China   1.62      0.3%       0.6%              -0.58%            Post- Industrial   Government inte… Technology      
#> 3 Egypt   2.83      0.41%      2.0%              1.32%             Mature Industrial  Not yet industr… More children n…
#> 4 India   2.35      0.34%      1.56%             0.76%             Post Industrial    Economic growth  Poverty         
#> 5 Italy   1.28      0.72%      0.35%             -1.33%            Late Post industr… Stable birth ra… People marry la…
#> 6 Mexico  2.43      0.25%      1.41%             0.96%             Mature Industrial  Better health c… Emigration      
#> 7 Nigeria 4.78      0.26%      2.46%             3.58%             End of Mechanizat… Disease          People marry ea…
#> # ... with 1 more variable: social_factors_3 <chr>

# see table 5
tbls[[5]]
#> # A tibble: 5 x 6
#>   V1                V2      V3            V4        V5        V6      
#>   <chr>             <chr>   <chr>         <chr>     <chr>     <chr>   
#> 1 Lesson 2:  Step 1 <NA>    <NA>          <NA>      <NA>      <NA>    
#> 2 Nigeria           Default Prediction    + 5 years +15 years -5 years
#> 3 Birth rate        4.78    Goes Down     4.76      4.72      4.79    
#> 4 Death rate        0.36%   Stay the Same 0.42%     0.52%     0.3%    
#> 5 Population growth 3.58%   Goes Down     3.02%     2.32%     4.38%

# make table 5 better
assign_colnames(tbls[[5]], 2)
#> # A tibble: 3 x 6
#>   Nigeria           Default Prediction    `+ 5 years` `+15 years` `-5 years`
#>   <chr>             <chr>   <chr>         <chr>       <chr>       <chr>     
#> 1 Birth rate        4.78    Goes Down     4.76        4.72        4.79      
#> 2 Death rate        0.36%   Stay the Same 0.42%       0.52%       0.3%      
#> 3 Population growth 3.58%   Goes Down     3.02%       2.32%       4.38%

# preserve lines
intracell_whitespace <- read_docx(system.file("examples/preserve.docx", package="docxtractr"))
docx_extract_all_tbls(intracell_whitespace, preserve=TRUE)
#> [[1]]
#> # A tibble: 6 x 2
#>   `Test1:` Apple                                  
#>   <chr>    <chr>                                  
#> 1 Test2:   Banana                                 
#> 2 Test3:   "Cranberry\nDark"                      
#> 3 Test4:   "Elephant, Farm\nGrandpa"              
#> 4 Test5:   "Hat\nIgloo\nJackrabbit"               
#> 5 Test6:   " \nQuestion1\n[ ] Underwear\n[ ] VM\n"
#> 6 Test7:   Warm                                   
#> 
#> [[2]]
#> # A tibble: 2 x 4
#>   ``    Kite  Lemur      Madagascar
#>   <chr> <chr> <chr>      <chr>     
#> 1 Nanny Open  Port       Quarter   
#> 2 Rain  Sand  Television Unicorn   
#> 
#> [[3]]
#> # A tibble: 2 x 2
#>   `Test8:` `Xylophone\nYew`             
#>   <chr>    <chr>                        
#> 1 Test9:   Zebra                        
#> 2 Test10:  "Apple2\nBanana2\nCranberry2"

docx_extract_all_tbls(intracell_whitespace)
#> [[1]]
#> # A tibble: 6 x 2
#>   `Test1:` Apple                                                                                        
#>   <chr>    <chr>                                                                                        
#> 1 Test2:   Banana                                                                                       
#> 2 Test3:   CranberryDark                                                                                
#> 3 Test4:   Elephant, FarmGrandpa                                                                        
#> 4 Test5:   HatIglooJackrabbit                                                                           
#> 5 Test6:   KiteLemurMadagascarNannyOpenPortQuarterRainSandTelevisionUnicorn Question1[ ] Underwear[ ] VM
#> 6 Test7:   Warm                                                                                         
#> 
#> [[2]]
#> # A tibble: 2 x 4
#>   ``    Kite  Lemur      Madagascar
#>   <chr> <chr> <chr>      <chr>     
#> 1 Nanny Open  Port       Quarter   
#> 2 Rain  Sand  Television Unicorn   
#> 
#> [[3]]
#> # A tibble: 2 x 2
#>   `Test8:` XylophoneYew           
#>   <chr>    <chr>                  
#> 1 Test9:   Zebra                  
#> 2 Test10:  Apple2Banana2Cranberry2

# comments
cmnts <- read_docx(system.file("examples/comments.docx", package="docxtractr"))

print(cmnts)
#> No tables in document
#> Word document [/Library/Frameworks/R.framework/Versions/3.5/Resources/library/docxtractr/examples/comments.docx]
#> 
#> Found 3 comments.
#> # A tibble: 1 x 2
#>   author    `# Comments`
#>   <chr>            <int>
#> 1 boB Rudis            3

glimpse(docx_extract_all_cmnts(cmnts))
#> Observations: 3
#> Variables: 5
#> $ id           <chr> "0", "1", "2"
#> $ author       <chr> "boB Rudis", "boB Rudis", "boB Rudis"
#> $ date         <chr> "2016-07-01T21:09:00Z", "2016-07-01T21:09:00Z", "2016-07-01T21:09:00Z"
#> $ initials     <chr> "bR", "bR", "bR"
#> $ comment_text <chr> "This is the first comment", "This is the second comment", "This is a reply to the second comm...

Track Changes (depends on `pandoc` being available)

# original
read_docx(
  system.file("examples/trackchanges.docx", package="docxtractr")
) %>% 
  docx_extract_all_tbls(guess_header = FALSE)
#> NOTE: header=FALSE but table has a marked header row in the Word document
#> [[1]]
#> # A tibble: 1 x 1
#>   V1   
#>   <chr>
#> 1 21

# accept
read_docx(
  system.file("examples/trackchanges.docx", package="docxtractr"),
  track_changes = "accept"
) %>% 
  docx_extract_all_tbls(guess_header = FALSE)
#> [[1]]
#> # A tibble: 1 x 1
#>   V1   
#>   <chr>
#> 1 2

# reject
read_docx(
  system.file("examples/trackchanges.docx", package="docxtractr"),
  track_changes = "reject"
) %>% 
  docx_extract_all_tbls(guess_header = FALSE)
#> [[1]]
#> # A tibble: 1 x 1
#>   V1   
#>   <chr>
#> 1 1

Test Results

library(docxtractr)
library(testthat)
#> 
#> Attaching package: 'testthat'
#> The following object is masked from 'package:dplyr':
#> 
#>     matches

date()
#> [1] "Tue Oct 23 08:10:10 2018"

test_dir("tests/")
#> ✔ | OK F W S | Context
#> ══ testthat results  ═════════════════════════════════════════════════
#> OK: 16 SKIPPED: 0 FAILED: 0
#> 
#> ══ Results ═══════════════════════════════════════════════════════════
#> Duration: 0.2 s
#> 
#> OK:       0
#> Failed:   0
#> Warnings: 0
#> Skipped:  0

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].

Stars: ✭ 139

Visit Git Page 🔗Visit User Page 🔗Visit Issues Page (4) 🔗

Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

hrbrmstr / Docxtractr

Programming Languages

Labels

Projects that are alternatives of or similar to Docxtractr

docxtractr

Description

What’s in the tin?

Installation

Usage

Track Changes (depends on pandoc being available)

Test Results

Code of Conduct

Track Changes (depends on `pandoc` being available)