1. 990 Xml Reader (IRSx): Turn the IRS' versioned XML 990 nonprofit annual tax returns into standardized Python objects, JSON, or human-readable text with the original line number and description.
2. Whatwordwhere: Tooling to extract data from scanned paper forms OCR-ed by Tesseract using the hOCR standard.
3. Pdf bbox utils: Helpers to create .csv files of word-level bounding boxes from text-based PDFs, or from hOCR output.
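
To make the idea of word-level bounding boxes concrete, here is a minimal sketch of turning hOCR output into CSV rows using only the Python standard library. It is not the API of any of the projects above: the names `HocrWordParser`, `hocr_to_csv`, and `SAMPLE` are hypothetical, and the sketch assumes each word is a `span` with class `ocrx_word` whose `title` attribute carries a `bbox x0 y0 x1 y1` entry, which is the layout Tesseract's hOCR output typically uses.

```python
import csv
import io
import re
from html.parser import HTMLParser

# hOCR stores geometry in the title attribute, e.g. "bbox 10 20 80 40; x_wconf 96".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Hypothetical helper: collect (text, x0, y0, x1, y1) per ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open word span.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), *self._bbox))
            self._bbox = None


def hocr_to_csv(hocr_text):
    """Return CSV text with one row per recognized word."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["text", "x0", "y0", "x1", "y1"])
    writer.writerows(parser.words)
    return out.getvalue()


# Tiny made-up hOCR fragment, for illustration only.
SAMPLE = """<div class='ocr_page'>
<span class='ocr_line' title='bbox 10 20 200 40'>
<span class='ocrx_word' title='bbox 10 20 80 40; x_wconf 96'>Form</span>
<span class='ocrx_word' title='bbox 90 20 200 40; x_wconf 91'>990</span>
</span></div>"""

if __name__ == "__main__":
    print(hocr_to_csv(SAMPLE))
```

The sketch ignores word confidence (`x_wconf`) and page numbers; a real pipeline would probably want both as extra columns.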