tastyminerals / Ccrawl

Licence: MIT
Simple CORPORA list crawler

ccrawl

Simple CORPORA list crawler

The CORPORA list is open for information and questions about text corpora such as availability, aspects of compiling and using corpora, software, tagging, parsing, bibliography, conferences etc. The list is also open for all types of discussion with a bearing on corpora.

CORPORA list: http://clu.uni.no/corpora/welcome.html

Usage:

ccrawl is a Python script; run it with python2 ccrawl.py followed by arguments. Before the first use, synchronize with the CORPORA list: python2 ccrawl.py --sync. Depending on the sync option you choose, this can take anywhere from a few seconds to about 20 minutes. ccrawl saves a local copy of the list as .corpora_list.pickle and reads it each time you run the script.
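
The local-copy behaviour described above can be pictured with a minimal sketch. Only the file name .corpora_list.pickle comes from this README; the save_cache and load_cache helpers and the list-of-dicts layout of the cached threads are assumptions made for illustration, not ccrawl's actual internals.

```python
import os
import pickle

CACHE_FILE = ".corpora_list.pickle"  # file name taken from the README


def save_cache(threads, path=CACHE_FILE):
    """Write the synced thread list to disk (hypothetical helper)."""
    with open(path, "wb") as fh:
        # Protocol 2 can be read by both Python 2 and Python 3.
        pickle.dump(threads, fh, protocol=2)


def load_cache(path=CACHE_FILE):
    """Return the cached thread list, or None if no sync has been done yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Returning None for a missing file lets the caller tell "never synced" apart from "synced, but the list is empty".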

  • To search CORPORA thread titles:
python2 ccrawl.py -f corpus
python2 ccrawl.py -f "chinese corpus"
  • To search CORPORA emails (available only if you performed deep sync):
python2 ccrawl.py -df corpus
python2 ccrawl.py -df "chinese corpus"
  • To add older archives (1995-2004):
python2 ccrawl.py -old
  • To see help:
python2 ccrawl.py -h
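
As a rough illustration of what a title search such as -f corpus might do with the cached data, here is a minimal sketch. The README does not say how ccrawl matches queries, so the case-insensitive substring match, the find_threads helper, and the sample threads are all assumptions.

```python
def find_threads(threads, query):
    """Return threads whose title contains the query, ignoring case."""
    needle = query.lower()
    return [t for t in threads if needle in t["title"].lower()]


# Hypothetical cached data: each thread is a dict with a "title" key.
threads = [
    {"title": "Chinese corpus of news text"},
    {"title": "CFP: parsing workshop"},
    {"title": "New learner CORPUS released"},
]

matches = find_threads(threads, "corpus")
print([t["title"] for t in matches])
```

A quoted multi-word query such as "chinese corpus" is treated here as one phrase, so it only matches titles that contain the two words next to each other.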

Install:

No installation needed. Make sure you have python2 installed on your system before running.

The script uses the requests and beautifulsoup4 libraries.
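
If you are unsure whether both libraries are present, a small check along these lines can help. This is a sketch and not part of ccrawl: the missing_packages helper is hypothetical, and it relies on beautifulsoup4 being imported under the module name bs4.

```python
import importlib

# Maps the importable module name to the package name used for installation.
REQUIRED = {"requests": "requests", "bs4": "beautifulsoup4"}


def missing_packages(required=REQUIRED):
    """Return the package names whose modules cannot be imported."""
    missing = []
    for module_name, package_name in sorted(required.items()):
        try:
            importlib.import_module(module_name)
        except ImportError:
            missing.append(package_name)
    return missing


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
        print("Try: pip install " + " ".join(missing))
    else:
        print("All dependencies are available.")
```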