
dhondta / webgrep

Licence: GPL-3.0 license
Grep Web pages with extra features like JS deobfuscation and OCR

Programming language: Python

Projects that are alternatives to, or similar to, webgrep

penelope
Penelope Shell Handler
Stars: ✭ 291 (+238.37%)
Mutual labels:  ctf-tools
Korean-OCR-Model-Design-based-on-Keras-CNN
Korean OCR Model Design (Hangul OCR model design)
Stars: ✭ 34 (-60.47%)
Mutual labels:  ocr
jochre
Java Optical CHaracter Recognition
Stars: ✭ 18 (-79.07%)
Mutual labels:  ocr
python-tinyscript
Devkit for quickly building CLI tools with Python
Stars: ✭ 39 (-54.65%)
Mutual labels:  ctf-tools
kuzushiji-recognition
5th place solution for the Kaggle Kuzushiji Recognition Challenge
Stars: ✭ 41 (-52.33%)
Mutual labels:  ocr
so stupid search
It's my honor to drive you fucking fire faster, to have more time with your Family and Sunshine.This tool is for those who often want to search for a string Deeply into a directory in Recursive mode, but not with the great tools: grep, ack, ripgrep .........every thing should be Small, Thin, Fast, Lazy....without Think and Remember too much ...一…
Stars: ✭ 135 (+56.98%)
Mutual labels:  grep
tmpleak
Leak other players' temporary workspaces for ctf and wargames.
Stars: ✭ 76 (-11.63%)
Mutual labels:  ctf-tools
Multi-Type-TD-TSR
Extracting Tables from Document Images using a Multi-stage Pipeline for Table Detection and Table Structure Recognition:
Stars: ✭ 174 (+102.33%)
Mutual labels:  ocr
writable search.vim
Grep for something, then write the original files directly through the search results.
Stars: ✭ 47 (-45.35%)
Mutual labels:  grep
memento
Organize your meme image cluster in a better format using OCR from the meme to sort them using tesseract along with editing memes by segmenting them using OpenCV within a directory
Stars: ✭ 70 (-18.6%)
Mutual labels:  ocr
deep-license-plate-recognition
Automatic License Plate Recognition (ALPR) or Automatic Number Plate Recognition (ANPR) software that works with any camera.
Stars: ✭ 309 (+259.3%)
Mutual labels:  ocr
javascript-deobfuscator
A deobfuscator for JavaScript codes generated by Obfuscator.io
Stars: ✭ 136 (+58.14%)
Mutual labels:  js-deobfuscator
baidu-chain-dog
Baidu Laici Dog (莱茨狗) crawler.
Stars: ✭ 52 (-39.53%)
Mutual labels:  ocr
ocr-machine-learning
OCR Machine Learning in python
Stars: ✭ 42 (-51.16%)
Mutual labels:  ocr
saram
Get OCR in txt form from an image or pdf extension supporting multiple files from directory using pytesseract with auto rotation for wrong orientation. PYPI:
Stars: ✭ 51 (-40.7%)
Mutual labels:  ocr
MouseTooltipTranslator
chrome extension - When mouse hover on text, it shows translated tooltip using google translate
Stars: ✭ 93 (+8.14%)
Mutual labels:  ocr
doctr
docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.
Stars: ✭ 1,409 (+1538.37%)
Mutual labels:  ocr
paperbase
Open source document organizer with automatic OCR and full text search
Stars: ✭ 21 (-75.58%)
Mutual labels:  ocr
pjs
An awk-like command-line tool for processing text, CSV, JSON, HTML, and XML.
Stars: ✭ 21 (-75.58%)
Mutual labels:  grep
SynthText Chinese
Modify from https://github.com/JarveeLee/SynthText_Chinese_version.git with python3 and cv3.
Stars: ✭ 35 (-59.3%)
Mutual labels:  ocr

WebGrep

Grep Web pages and their resources.

This self-contained tool relies on the well-known grep tool for grepping Web pages. It binds nearly every option of the original tool and adds features such as deobfuscating JavaScript or applying OCR to images before grepping the downloaded resources.

$ pip install webgrep-tool

Quick Start

  1. Help
$ webgrep --help
usage: webgrep [OPTION]... PATTERN [URL]...

Search for PATTERN in each input URL and its related resources
(images, scripts and style sheets).
By default,
- resources are NOT downloaded
- response HTTP headers are NOT included in grepping ; use '--include-headers'
- PATTERN is a basic regular expression (BRE) ; use '-E' for extended (ERE)
Important note: webgrep does not handle recursion (in other words, it does not
               spider additional web pages).
Examples:
 webgrep example http://www.example.com     # will only grep on HTML code
 webgrep -r example http://www.example.com  # will only grep on LOCAL images, ...
 webgrep -R example http://www.example.com  # will only grep on ALL images, ...

Regexp selection and interpretation:
 -e REGEXP, --regexp REGEXP
                       use PATTERN for matching
 -f FILE, --file FILE  obtain PATTERN from FILE
 -E, --extended-regexp
                       PATTERN is an extended regular expression (ERE)
 -F, --fixed-strings   PATTERN is a set of newline-separated fixed strings
 -G, --basic-regexp    PATTERN is a basic regular expression (BRE)
 -P, --perl-regexp     PATTERN is a Perl regular expression
 -i, --ignore-case     ignore case distinctions
 -w, --word-regexp     force PATTERN to match only whole words
 -x, --line-regexp     force PATTERN to match only whole lines
 -z, --null-data       a data line ends in 0 byte, not newline

Miscellaneous:
 -s, --no-messages     suppress error messages
 -v, --invert-match    select non-matching lines
 -V, --version         print version information and exit
 --help                display this help and exit
 --verbose             verbose mode
 --keep-files          keep temporary files in the temporary directory
 --temp-dir TMP        define the temporary directory (default: /tmp/webgrep)

Output control:
 -m NUM, --max-count NUM
                       stop after NUM matches
 -b, --byte-offset     print the byte offset with output lines
 -n, --line-number     print line number with output lines
 --line-buffered       flush output on every line
 -H, --with-filename   print the file name for each match
 -h, --no-filename     suppress the file name prefix on output
 --label LABEL         use LABEL as the standard input filename prefix
 -o, --only-matching   show only the part of a line matching PATTERN
 -q, --quiet, --silent
                       suppress all normal output
 --binary-files TYPE   assume that binary files are TYPE;
                       TYPE is 'binary', 'text', or 'without-match'
 -a, --text            equivalent to --binary-files=text
 -I                    equivalent to --binary-files=without-match
 -L, --files-without-match
                       print only names of FILEs containing no match
 -l, --files-with-match
                       print only names of FILEs containing matches
 -c, --count           print only a count of matching lines per FILE
 -T, --initial-tab     make tabs line up (if needed)
 -Z, --null            print 0 byte after FILE name

Context control:
 -B NUM, --before-context NUM
                       print NUM lines of leading context
 -A NUM, --after-context NUM
                       print NUM lines of trailing context
 -C NUM, --context NUM
                       print NUM lines of output context

Web options:
 -r, --local-resources
                       also grep local resources (same-origin)
 -R, --all-resources   also grep all resources (even non-same-origin)
 --include-headers     also grep HTTP headers
 --cookie COOKIE       use a session cookie in the HTTP headers
 --referer REFERER     provide the referer in the HTTP headers

Proxy settings (by default, system proxy settings are used):
 -d, --disable-proxy   manually disable proxy
 --http-proxy HTTP     manually set the HTTP proxy
 --https-proxy HTTPS   manually set the HTTPS proxy

Please report bugs on GitHub: https://github.com/dhondta/webgrep
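
The help text distinguishes local (same-origin) resources, selected with `-r`, from all resources, selected with `-R`. The sketch below shows one way such a split could be made with the Python standard library. It is not taken from webgrep's source: the function names `origin` and `split_resources` are hypothetical, and it assumes "same-origin" means the same scheme, host and port, which may be stricter than what the tool actually checks.

```python
from urllib.parse import urljoin, urlsplit

# Default ports, so that http://host and http://host:80 compare as equal.
_DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple identifying a URL's origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port or _DEFAULT_PORTS.get(scheme)
    return scheme, (parts.hostname or "").lower(), port


def split_resources(page_url, resource_urls):
    """Resolve each link against the page URL and split local vs. external."""
    local, external = [], []
    page_origin = origin(page_url)
    for link in resource_urls:
        absolute = urljoin(page_url, link)  # handles relative links
        if origin(absolute) == page_origin:
            local.append(absolute)
        else:
            external.append(absolute)
    return local, external
```

Under this assumption, `-r` would correspond to processing only the `local` list, while `-R` would process both lists.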

  2. Example
$ ./webgrep -R Welcome https://github.com
      Welcome home, <br>developers
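
When webgrep is driven from a script, the same kind of command line can be assembled programmatically. The following is a minimal sketch that uses only options listed in the help text above; the helper names `build_webgrep_command` and `run_webgrep` are hypothetical, and it assumes the `webgrep` executable installed by pip is on the `PATH` (the example above runs a local `./webgrep` instead).

```python
import shutil
import subprocess


def build_webgrep_command(pattern, urls, *, all_resources=False,
                          local_resources=False, ignore_case=False,
                          include_headers=False, cookie=None):
    """Assemble an argument list from options documented in the help text."""
    cmd = ["webgrep"]
    if all_resources:
        cmd.append("-R")            # also grep all resources
    elif local_resources:
        cmd.append("-r")            # also grep same-origin resources
    if ignore_case:
        cmd.append("-i")
    if include_headers:
        cmd.append("--include-headers")
    if cookie is not None:
        cmd += ["--cookie", cookie]
    # Usage is: webgrep [OPTION]... PATTERN [URL]...
    cmd.append(pattern)
    cmd.extend(urls)
    return cmd


def run_webgrep(cmd):
    """Run the command if webgrep is installed; return its stdout, else None."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting issues with patterns that contain spaces or metacharacters. For instance, `build_webgrep_command("Welcome", ["https://github.com"], all_resources=True)` produces the arguments of the example above.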

📌 Resource Handlers

Definitions:

  • Resource (what is being processed): Web page, images, JavaScript, CSS
  • Handler (how a resource is processed): CSS unminifying, OCR, deobfuscation, EXIF data retrieval, ...
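
The relationship between resources and handlers can be pictured as a registry that maps a resource kind to the functions applied to it before grepping. The sketch below illustrates that idea only; it is not webgrep's actual code. The names `HANDLERS`, `handler` and `process` are hypothetical, and the two handlers are simplified stand-ins (a pure-Python approximation of the `strings` tool and a plain decoder with no deobfuscation).

```python
import re
from collections import defaultdict

# Hypothetical registry: resource kind -> list of handler functions.
HANDLERS = defaultdict(list)


def handler(kind):
    """Decorator registering a function as a handler for a resource kind."""
    def register(func):
        HANDLERS[kind].append(func)
        return func
    return register


@handler("image")
def printable_strings(data: bytes) -> str:
    """Rough stand-in for `strings`: printable ASCII runs of 4+ bytes."""
    runs = re.findall(rb"[\x20-\x7e]{4,}", data)
    return "\n".join(run.decode("ascii") for run in runs)


@handler("script")
def decode_script(data: bytes) -> str:
    """Placeholder script handler: decodes the bytes, nothing more."""
    return data.decode("utf-8", errors="replace")


def process(kind: str, data: bytes) -> list:
    """Run every handler registered for `kind`; the outputs are then grepped."""
    return [func(data) for func in HANDLERS.get(kind, [])]
```

Each handler turns raw resource bytes into text, so a line-oriented search can find content that is not visible in the raw download.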

The handlers are defined in the "# --...-- HANDLERS SECTION --...--" part of the code. The currently available handlers are:

  1. Images
     • EXIF: using exiftool
     • Steganography: using steghide (with a blank password)
     • Strings: using strings
     • OCR: using tesseract
  2. Scripts
     • JavaScript beautifying and deobfuscation: using jsbeautifier
  3. Styles
     • Unminifying: using regular expressions

Note: images found in the CSS files are also processed.
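
Unminifying matters because grep works line by line: a minified style sheet is often a single long line, so any match would print the whole file. The sketch below shows how CSS could be unminified with regular expressions. It is an illustration and not webgrep's own implementation; the function name `unminify_css` is hypothetical. It is deliberately naive: braces or semicolons inside strings or `url(data:...)` values would be split too.

```python
import re


def unminify_css(css: str) -> str:
    """Put each rule opening, declaration and closing brace on its own line."""
    # Opening brace: keep it on the selector line, then start an indented line.
    css = re.sub(r"\s*\{\s*", " {\n  ", css)
    # Each declaration ends its line; the next one is indented.
    css = re.sub(r"\s*;\s*", ";\n  ", css)
    # Closing brace goes on its own line, swallowing any leftover indentation.
    css = re.sub(r"\s*\}\s*", "\n}\n", css)
    return css.strip()


if __name__ == "__main__":
    print(unminify_css("a{color:red;margin:0}b{top:1px}"))
```

After this transformation, a search for `margin` returns only the declaration line instead of the entire style sheet.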
