145 open source projects that are alternatives to or similar to Corpuscrawler

Beta
An open source reimplementation of Benny Brodda's BETA in Python
Stars: ✭ 65 (-48.82%)
Mutual labels:  linguistics
TextDatasetCleaner
🔬 Cleaning junk out of datasets (normalization, preprocessing)
Stars: ✭ 27 (-78.74%)
Mutual labels:  linguistics
Crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
Stars: ✭ 440 (+246.46%)
Mutual labels:  crawling
Mimo-Crawler
A web crawler that uses Firefox and JS injection to interact with webpages and crawl their content, written in Node.js.
Stars: ✭ 22 (-82.68%)
Mutual labels:  crawling
Elpis
🙊 WIP software for creating speech recognition models.
Stars: ✭ 101 (-20.47%)
Mutual labels:  linguistics
crawlkit
A crawler based on Phantom. Allows discovery of dynamic content and supports custom scrapers.
Stars: ✭ 23 (-81.89%)
Mutual labels:  crawling
Pynlpl
PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language models. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).
Stars: ✭ 426 (+235.43%)
Mutual labels:  linguistics
custom-crawler
🌌 High productivity semi-automatic crawler generator 🛠️🧰
Stars: ✭ 33 (-74.02%)
Mutual labels:  crawling
Yesterday I Learned
Brainfarts are caused by the rupturing of the cerebral sphincter.
Stars: ✭ 50 (-60.63%)
Mutual labels:  linguistics
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Stars: ✭ 52 (-59.06%)
Mutual labels:  crawling
Webster
A reliable high-level web crawling & scraping framework for Node.js.
Stars: ✭ 364 (+186.61%)
Mutual labels:  crawling
pumba
Fetch, store and access user agent strings for different browsers
Stars: ✭ 12 (-90.55%)
Mutual labels:  crawling
Colibri Core
Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.
Stars: ✭ 112 (-11.81%)
Mutual labels:  linguistics
lingvo--Ner-ru
Named entity recognition (NER) in Russian texts
Stars: ✭ 38 (-70.08%)
Mutual labels:  linguistics
Sasila
A flexible, friendly crawler framework
Stars: ✭ 286 (+125.2%)
Mutual labels:  crawling
zcrawl
An open source web crawling platform
Stars: ✭ 21 (-83.46%)
Mutual labels:  crawling
Python Datamuse
Python 3 wrapper for the Datamuse API
Stars: ✭ 47 (-62.99%)
Mutual labels:  linguistics
NatLang
NatLang is an English parser with an extensible grammar
Stars: ✭ 20 (-84.25%)
Mutual labels:  linguistics
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (+118.11%)
Mutual labels:  crawling
crawling-framework
Easily crawl news portals or blog sites using Storm Crawler.
Stars: ✭ 22 (-82.68%)
Mutual labels:  crawling
Wikipron
Massively multilingual pronunciation mining
Stars: ✭ 99 (-22.05%)
Mutual labels:  linguistics
mlconjug3
A Python library to conjugate verbs in French, English, Spanish, Italian, Portuguese and Romanian (more soon) using Machine Learning techniques.
Stars: ✭ 47 (-62.99%)
Mutual labels:  linguistics
Spidy
The simple, easy to use command line web crawler.
Stars: ✭ 257 (+102.36%)
Mutual labels:  crawling
expletives
Expletives vomiting library...
Stars: ✭ 12 (-90.55%)
Mutual labels:  linguistics
Phonemes
Jason Riggle's chart of phonological features in JSON format + extras
Stars: ✭ 33 (-74.02%)
Mutual labels:  linguistics
langua
A suite of language tools
Stars: ✭ 29 (-77.17%)
Mutual labels:  linguistics
ARGUS
ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of different websites. On the websites, ARGUS is able to perform tasks like scraping texts or collecting hyperlinks between websites. See: https://link.springer.com/article/10.1007/s11192-020-03726-9
Stars: ✭ 68 (-46.46%)
Mutual labels:  crawling
diffbot-php-client
[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library
Stars: ✭ 53 (-58.27%)
Mutual labels:  crawling
Awesome Puppeteer
A curated list of awesome puppeteer resources.
Stars: ✭ 1,728 (+1260.63%)
Mutual labels:  crawling
auctus
Dataset search engine, discovering data from a variety of sources, profiling it, and allowing advanced queries on the index
Stars: ✭ 34 (-73.23%)
Mutual labels:  crawling
bots-zoo
No description or website provided.
Stars: ✭ 59 (-53.54%)
Mutual labels:  crawling
the-seinfeld-chronicles
A dataset for textual analysis on arguably the best written comedy television show ever.
Stars: ✭ 14 (-88.98%)
Mutual labels:  crawling
Awesome Sentiment Analysis
😀😄😂😭 A curated list of Sentiment Analysis methods, implementations and misc. 😥😟😱😤
Stars: ✭ 816 (+542.52%)
Mutual labels:  linguistics
dev
PHOIBLE data and development.
Stars: ✭ 90 (-29.13%)
Mutual labels:  linguistics
concepticon-data
The curation repository for the data behind Concepticon.
Stars: ✭ 25 (-80.31%)
Mutual labels:  linguistics
corpusexplorer2.0
Corpus linguistics has never been so easy...
Stars: ✭ 16 (-87.4%)
Mutual labels:  linguistics
Flat
FoLiA Linguistic Annotation Tool -- Flat is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.github.io/folia), a rich XML-based format for linguistic annotation. Flat allows users to view annotated FoLiA documents and enrich these documents with new annotations; a wide variety of linguistic annotation types is supported through the FoLiA paradigm.
Stars: ✭ 93 (-26.77%)
Mutual labels:  linguistics
tech-seo-crawler
Build a small, 3-domain internet using GitHub Pages and Wikipedia and construct a crawler to crawl, render, and index.
Stars: ✭ 57 (-55.12%)
Mutual labels:  crawling
wikipron
Massively multilingual pronunciation mining
Stars: ✭ 167 (+31.5%)
Mutual labels:  linguistics
Onset
A language evolution simulator, using realistic phonetic changes.
Stars: ✭ 30 (-76.38%)
Mutual labels:  linguistics
Nltk data
NLTK Data
Stars: ✭ 675 (+431.5%)
Mutual labels:  linguistics
event-embedding-multitask
*SEM 2018: Learning Distributed Event Representations with a Multi-Task Approach
Stars: ✭ 22 (-82.68%)
Mutual labels:  linguistics
flink-crawler
Continuous scalable web crawler built on top of Flink and crawler-commons
Stars: ✭ 48 (-62.2%)
Mutual labels:  crawling
lingtypology
R package for linguistic cartography and typological databases search
Stars: ✭ 47 (-62.99%)
Mutual labels:  linguistics
Skycaiji
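Skycaiji is a free crawler for collecting and publishing data, developed with PHP + MySQL. It can be deployed on cloud servers, can collect almost any type of web page, integrates seamlessly with all kinds of CMS site-building programs, publishes data in real time without logging in, and runs fully automatically with no manual intervention. It is a fully cross-platform cloud crawler system among web big-data collection software.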
Stars: ✭ 1,514 (+1092.13%)
Mutual labels:  crawling
scrape-github-trending
Tutorial for web scraping / crawling with Node.js.
Stars: ✭ 42 (-66.93%)
Mutual labels:  crawling
img-cli
An interactive command-line interface built in Node.js for downloading one or more images to disk from a URL
Stars: ✭ 15 (-88.19%)
Mutual labels:  crawling
proiel-treebank
Official releases of the PROIEL treebank of ancient Indo-European languages
Stars: ✭ 30 (-76.38%)
Mutual labels:  linguistics
Easy Scraping Tutorial
Simple but useful Python web scraping tutorial code.
Stars: ✭ 583 (+359.06%)
Mutual labels:  crawling
feminizator.github.io
Word feminizer
Stars: ✭ 29 (-77.17%)
Mutual labels:  linguistics
SlackWebhooksGithubCrawler
Search for Slack webhook tokens publicly exposed on GitHub
Stars: ✭ 21 (-83.46%)
Mutual labels:  crawling
puppet-master
Puppeteer as a service hosted on Saasify.
Stars: ✭ 25 (-80.31%)
Mutual labels:  crawling
Textannotationgraphs
A modular annotation system that supports complex, interactive annotation graphs embedded on top of sequences of text.
Stars: ✭ 73 (-42.52%)
Mutual labels:  linguistics
talospider
talospider - A simple, lightweight scraping micro-framework
Stars: ✭ 57 (-55.12%)
Mutual labels:  crawling
Squidwarc
Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
Stars: ✭ 125 (-1.57%)
Mutual labels:  crawling
Ichiran
Linguistic tools for texts in Japanese language
Stars: ✭ 120 (-5.51%)
Mutual labels:  linguistics
Pyconll
A minimal, pure Python library to interface with CoNLL-U format files.
Stars: ✭ 104 (-18.11%)
Mutual labels:  linguistics
Arachnid
Powerful web scraping framework for Crystal
Stars: ✭ 68 (-46.46%)
Mutual labels:  crawling
Ferret
Declarative web scraping
Stars: ✭ 4,837 (+3708.66%)
Mutual labels:  crawling
folia
FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for proces…
Stars: ✭ 56 (-55.91%)
Mutual labels:  linguistics