ankanch / tieba-zhuaqu

Licence: GPL-3.0 license
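A distributed crawler for Baidu Tieba, built for Tieba data mining. It analyzes the data along two dimensions: the forum (tieba) and the user.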

Programming Languages

Python, C++, C, Batchfile

Baidu Tieba Distributed Crawler


Version

[v0.9] @ May 6 2017 -> 0813bc127125438b71dfee6dc9a3153661c8d629

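Introduction

This distributed crawler scrapes the content of Tieba posts and runs data analysis on it (see the data analysis examples for details).

The system currently ships with 4 built-in data analysis plugins, and you are welcome to contribute more (plugins are written in Python).

The crawler system consists of 3 main parts: TaskManager, the task management server; KCrawlerManager, the client-side management software (KCrawlerController); and Crawler, the crawler program.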
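The README does not describe how work is divided between the task server and the crawlers. As a rough illustration only, the sketch below shows one way a task manager could cut a forum's list pages into chunks and hand them out one at a time. `CrawlTask`, `split_into_tasks`, `next_task`, and the page-range scheme are all hypothetical and are not taken from the project's code.

```python
from collections import deque, namedtuple

# Hypothetical unit of work: an inclusive range of list pages for one forum.
CrawlTask = namedtuple("CrawlTask", ["forum", "first_page", "last_page"])


def split_into_tasks(forum, total_pages, pages_per_task):
    """Cut pages 1..total_pages into consecutive chunks of at most pages_per_task."""
    tasks = deque()
    for start in range(1, total_pages + 1, pages_per_task):
        end = min(start + pages_per_task - 1, total_pages)
        tasks.append(CrawlTask(forum, start, end))
    return tasks


def next_task(queue):
    """Give the next pending task to a crawler, or None when nothing is left."""
    return queue.popleft() if queue else None
```

In this sketch each crawler would call `next_task` whenever it is idle, so faster machines naturally take on more chunks.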

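Before you read on:

If you simply want to use this software to scrape information and analyze it, you need to download the following:

tieba-zhuaqu: the main Tieba scraping program (run RunTest.bat)
KCrawlerControal: you will need the data analysis module included in this software

Before you start, make sure you have installed Python 3.5 and the third-party libraries listed below.

** The database version (the folders whose names start with DSV) is recommended.

** Note: you must copy the ktieba folder found under AttachImport into the root of drive C, or the program will not run correctly.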


Languages and Environment

Python 3.5.1

C++

Visual Studio 2015

A 64-bit Python installation is recommended; otherwise you may run into a MemoryError.
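If you are unsure which build you have, the interpreter can report its own pointer width. The following is a minimal sketch that uses only the standard library; `python_bitness` is a helper name chosen here, not part of the project.

```python
import struct
import sys


def python_bitness():
    """Return the pointer width of the running interpreter in bits (32 or 64)."""
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    bits = python_bitness()
    print("Python {}.{} ({}-bit)".format(sys.version_info[0], sys.version_info[1], bits))
    if bits < 64:
        print("Warning: a 32-bit interpreter may raise MemoryError on large datasets.")
```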

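File Structure

Every folder whose name starts with DSV is the database version (Database Support Version) of the corresponding component. (The default version stores task results in files.)

shareLib: library shared by the three parts of the system; defines the messages and the network operations
task-manager: TaskManager, the task management server
tieba-zhuaqu: KCrawler, the crawler itself
user-application: KCrawlerManager, the client-side management software (KCrawlerController)
DataAnalyzer: the data analysis suite (split out of user-application)

The database structure is shown in the figure below: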


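Third-Party Libraries

matplotlib: used for data visualization and analysis

numpy: used for data visualization and analysis

jieba (Chinese word segmentation): used for Chinese word segmentation and keyword extraction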
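A quick way to confirm that the three libraries are importable before running anything is sketched below. It relies only on the standard library's `importlib.util.find_spec`; the helper name `missing_modules` is an assumption made for this example, and the suggested `pip` command assumes the packages are installed from PyPI under the same names.

```python
import importlib.util

# Top-level module names of the libraries listed above.
REQUIRED = ("matplotlib", "numpy", "jieba")


def missing_modules(names=REQUIRED):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing: " + ", ".join(missing))
        print("Try: pip install " + " ".join(missing))
    else:
        print("All required libraries are installed.")
```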


Data Analysis Module

** Test data download: http://pan.cuit.edu.cn/share/7FF9yiO5 (extraction code: cm8p)

Data analysis examples are at the end of this document.


Development Status

In development...


License: GPL

GPL


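Data Analysis Examples

The built-in data analysis plugins currently support the following types of analysis:

Compare the counts of several words (multiwords)

Plot the frequency of a word over time (wordstimeline)

Analyze a specific user

Analyze a user's activity level on Tieba (userX)

Analyze a user's most frequent keywords (userX)

Analyze the times of day a user is active on Tieba (userX: computed by overlaying each day's active periods)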
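To make the analysis types above more concrete, here is a minimal sketch of three of them in plain Python. It is not the plugins' actual code: the `Post` record and the function names are hypothetical, and plain substring counting stands in for the jieba-based segmentation that the project lists as a dependency.

```python
from collections import Counter, namedtuple
from datetime import datetime

# Hypothetical record for one scraped post; the real storage format may differ.
Post = namedtuple("Post", ["author", "time", "text"])


def count_words(posts, words):
    """multiwords-style comparison: total occurrences of each word across all posts."""
    return {w: sum(p.text.count(w) for p in posts) for w in words}


def word_timeline(posts, word):
    """wordstimeline-style series: (date, occurrences) pairs sorted by date."""
    daily = Counter()
    for p in posts:
        hits = p.text.count(word)
        if hits:
            daily[p.time.date()] += hits
    return sorted(daily.items())


def active_hours(posts, author):
    """userX-style activity: posts per hour of day, overlaid across all days."""
    return Counter(p.time.hour for p in posts if p.author == author)


if __name__ == "__main__":
    sample = [
        Post("alice", datetime(2017, 5, 1, 9, 30), "python crawler python"),
        Post("bob", datetime(2017, 5, 1, 21, 0), "crawler"),
        Post("alice", datetime(2017, 5, 2, 9, 5), "python"),
    ]
    print(count_words(sample, ["python", "crawler"]))
    print(word_timeline(sample, "python"))
    print(active_hours(sample, "alice"))
```

The output of `word_timeline` is a list of date and count pairs, so it can be unzipped and passed straight to matplotlib's `plt.plot` to draw a frequency-over-time chart.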
