ankanch / tieba-zhuaqu

License: GPL-3.0

A distributed crawler for Baidu Tieba, built for mining Tieba data. It analyzes the data along two dimensions: by forum (tieba) and by user.

Programming Languages

Python

C++

Batchfile

Projects that are alternatives to or similar to tieba-zhuaqu

Fraud-Detection-in-Online-Transactions

Detecting fraud in online transactions using anomaly detection techniques such as over-sampling and under-sampling. Because the ratio of frauds is less than 0.00005, simply applying a classification algorithm may result in overfitting.

Stars: ✭ 41 (-26.79%)

Mutual labels: data-analysis

FDBeye

R tools for eyetracker workflows.

Stars: ✭ 101 (+80.36%)

Mutual labels: data-analysis

meta-csv

A Clojure smart reader for CSV files

Stars: ✭ 20 (-64.29%)

Mutual labels: data-analysis

tianchi-diabetes

Tianchi Precision Medicine Competition: AI-assisted prediction of genetic risk for diabetes, season 1

Stars: ✭ 20 (-64.29%)

Mutual labels: data-analysis

metrics

📈 What to measure, how to measure it.

Stars: ✭ 14 (-75%)

Mutual labels: data-analysis

osm-data-classification

Migrated to: https://gitlab.com/Oslandia/osm-data-classification

Stars: ✭ 23 (-58.93%)

Mutual labels: data-analysis

site-audit-seo

Web service and CLI tool for SEO site audit: crawl site, lighthouse all pages, view public reports in browser. Also output to console, json, csv, xlsx, Google Drive.

Stars: ✭ 91 (+62.5%)

Mutual labels: scraper

aliexscrape

Get Aliexpress product details in JSON

Stars: ✭ 80 (+42.86%)

Mutual labels: scraper

akshare

AKShare is an elegant and simple financial data interface library for Python, built for human beings! An open-source financial data interface library.

Stars: ✭ 5,155 (+9105.36%)

Mutual labels: data-analysis

yt-videos-list

Create and **automatically** update a list of all videos on a YouTube channel (in txt/csv/md form) via YouTube bot with end-to-end web scraping - no API tokens required. Multi-threaded support for YouTube videos list updates.

Stars: ✭ 64 (+14.29%)

Mutual labels: scraper

OLX Scraper

📻 An OLX scraper using Scrapy + MongoDB. It scrapes recent ads posted for the requested product and dumps them to MongoDB (NoSQL).

Stars: ✭ 15 (-73.21%)

Mutual labels: scraper

LeTourDataSet

Every cyclist and stage of the Tour de France in two CSV files.

Stars: ✭ 61 (+8.93%)

Mutual labels: data-analysis

dflib

In-memory Java DataFrame library

Stars: ✭ 50 (-10.71%)

Mutual labels: data-analysis

stock-market-scraper

Scrapes historical stock market data from Yahoo Finance (https://finance.yahoo.com/)

Stars: ✭ 110 (+96.43%)

Mutual labels: scraper

document-dl

Command line program to download documents from web portals.

Stars: ✭ 14 (-75%)

Mutual labels: scraper

InstagramLocationScraper

No description or website provided.

Stars: ✭ 13 (-76.79%)

Mutual labels: scraper

youtube-unofficial

Access parts of your account unavailable through normal YouTube API access.

Stars: ✭ 33 (-41.07%)

Mutual labels: scraper

crazy-awesome-crypto

A list of awesome crypto and blockchain projects

Stars: ✭ 35 (-37.5%)

Mutual labels: data-analysis

OpenScraper

An open source webapp for scraping: towards a public service for webscraping

Stars: ✭ 80 (+42.86%)

Mutual labels: scraper

scraper

A web scraper starter project

Stars: ✭ 18 (-67.86%)

Mutual labels: scraper

Baidu Tieba Distributed Crawler

Version

【v0.9】 @ May 6 2017 -> 0813bc127125438b71dfee6dc9a3153661c8d629

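Introduction

This distributed crawler scrapes the content of Tieba posts and runs data analysis on it (see the data analysis examples for details).

The system currently ships with 4 built-in data analysis plugins, and you can contribute more (plugins are written in Python).

The crawler system consists of three main parts: TaskManager, the task management server; KCrawlerManager, the client-side management software (KCrawlerController); and Crawler, the crawler program.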
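
The README does not say how work is divided between the task management server and the crawlers. One plausible arrangement, sketched below, is for the server to split a forum into page ranges and hand one range to each crawler that asks for work. Everything in this sketch is an assumption made for illustration: the names `CrawlTask`, `next_task`, `submit`, `run_crawler` and `fetch_pages`, and the idea of splitting by page range. It also runs in a single process, whereas the real components talk to each other over the network.

```python
from collections import deque, namedtuple

# Hypothetical unit of work: an inclusive range of list pages in one forum.
CrawlTask = namedtuple("CrawlTask", ["tieba_name", "first_page", "last_page"])


class TaskManager:
    """Toy stand-in for a task management server (not the project's class)."""

    def __init__(self, tieba_name, total_pages, pages_per_task):
        self._pending = deque(
            CrawlTask(tieba_name, start, min(start + pages_per_task - 1, total_pages))
            for start in range(1, total_pages + 1, pages_per_task)
        )
        self.results = {}

    def next_task(self):
        """Return the next pending task, or None once all work is handed out."""
        return self._pending.popleft() if self._pending else None

    def submit(self, task, posts):
        """Store the posts a crawler collected for one task."""
        self.results[task] = posts


def run_crawler(manager, fetch_pages):
    """Pull tasks until none remain; fetch_pages does the actual scraping."""
    completed = 0
    while True:
        task = manager.next_task()
        if task is None:
            return completed
        manager.submit(task, fetch_pages(task))
        completed += 1


def fake_fetch(task):
    """Placeholder scraper that returns one made-up post per page."""
    return [
        "{} page {}".format(task.tieba_name, page)
        for page in range(task.first_page, task.last_page + 1)
    ]


manager = TaskManager("python", total_pages=10, pages_per_task=4)
completed = run_crawler(manager, fake_fetch)
print(completed, "tasks completed")
```

With 10 pages and 4 pages per task, the manager creates three tasks (pages 1-4, 5-8 and 9-10). Several crawlers could call `next_task` on the same manager, which is what makes the pull model easy to distribute.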

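Before you read on:

If you just want to use this software to crawl information and analyze it, you need to download the following:

tieba-zhuaqu: the main Tieba crawler program (run RunTest.bat)
KCrawlerControal: you need the data analysis module included in this software

Before you start, make sure you have installed Python 3.5 and the third-party libraries listed below.

** The database version (folders whose names start with DSV) is recommended.

** Note: you must copy the ktieba folder from the AttachImport folder into the root of drive C: for the program to run correctly.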

Languages and environment

Python 3.5.1

C++

Visual Studio 2015

Installing 64-bit Python is recommended; otherwise you may run into a MemoryError.
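
To confirm that an interpreter meets these recommendations, a short check like the one below can be run before starting the crawler. It is a minimal sketch and not part of the project; the names `MIN_VERSION`, `pointer_bits` and `environment_problems` are made up for illustration.

```python
import struct
import sys

MIN_VERSION = (3, 5)  # the README asks for Python 3.5


def pointer_bits():
    """Return the interpreter's pointer width in bits (32 or 64)."""
    return struct.calcsize("P") * 8


def environment_problems():
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if sys.version_info[:2] < MIN_VERSION:
        problems.append("Python %d.%d or newer is required" % MIN_VERSION)
    if pointer_bits() != 64:
        problems.append("32-bit Python detected; large datasets may raise MemoryError")
    return problems


for problem in environment_problems():
    print("WARNING:", problem)
```

A 32-bit process has a much smaller address space than a 64-bit one, which is why loading a large crawl result can fail there even when the machine has plenty of RAM.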

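File structure

Every folder whose name starts with DSV is the database version (Database Support Version) of the corresponding component (the default is the version that stores task results in files).

shareLib: library shared by the three parts of the system; it defines the message formats and the network operations
task-manager: TaskManager, the task management server
tieba-zhuaqu: KCrawler, the crawler itself
user-application: KCrawlerManager, the client-side management software (KCrawlerController)
DataAnalyzer: the data analysis suite (split out from user-application)

The database structure is shown in the figure below: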

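Third-party libraries

matplotlib: used for visual analysis of the data

numpy: used for visual analysis of the data

jieba (Chinese word segmentation): used for Chinese word segmentation and keyword extraction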
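
Before running the analysis plugins, it can help to check that these three libraries are importable. The snippet below is a small sketch using only the standard library; `REQUIRED` and `missing_packages` are illustrative names, not part of the project.

```python
import importlib.util

# Top-level import names of the libraries listed above.
REQUIRED = ["matplotlib", "numpy", "jieba"]


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
```

Anything reported as missing can be installed with `pip install matplotlib numpy jieba`.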

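Data analysis module

**Test data download: http://pan.cuit.edu.cn/share/7FF9yiO5 (extraction code: cm8p)

See the end of this document for data analysis examples.

Development status

In development...

License: GPL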

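Data analysis examples

The built-in data analysis plugins currently support the following kinds of analysis:

Compare statistics for several words (multiwords)

Show a word-frequency-over-time plot for a given word (wordstimeline)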
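
The plugins' source is not shown here, so the following is only a sketch of the computation that the two analyses above suggest. It assumes posts are available as `(datetime, text)` pairs and counts plain substring matches rather than segmented words. The function names `count_words` and `word_timeline` and the sample posts are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def count_words(posts, words):
    """multiwords-style comparison: total occurrences of each word in all posts."""
    return {word: sum(text.count(word) for _, text in posts) for word in words}


def word_timeline(posts, word):
    """wordstimeline-style series: (date, occurrences) pairs sorted by date."""
    per_day = Counter()
    for posted_at, text in posts:
        hits = text.count(word)
        if hits:
            per_day[posted_at.date()] += hits
    return sorted(per_day.items())


# Made-up sample data standing in for crawled posts.
posts = [
    (datetime(2017, 5, 1, 9, 0), "python 爬虫 入门"),
    (datetime(2017, 5, 1, 21, 30), "python 和 python3"),
    (datetime(2017, 5, 3, 12, 0), "学 python 写 爬虫"),
    (datetime(2017, 5, 4, 8, 0), "无关内容"),
]

totals = count_words(posts, ["python", "爬虫", "C++"])
timeline = word_timeline(posts, "python")
print(totals)
print(timeline)
```

The totals map directly onto a bar chart and the timeline onto a line chart, which is presumably where matplotlib comes in. Substring counting also matches a word inside longer words (`python` inside `python3` here), so segmenting the text first would give cleaner counts.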

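Analyzing a specific user

Analyze a user's activity level on Tieba (userX)

Analyze a user's high-frequency keywords (userX)

Analyze the time periods in which a user is active on Tieba (userX: by overlaying each day's active periods)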
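
The per-user analyses above can be approximated with two small functions. This is a hedged sketch and not the userX plugin itself: `active_hours` and `top_keywords` are illustrative names, the input is assumed to be one user's post timestamps and post texts, and the default tokenizer simply splits on whitespace.

```python
from collections import Counter
from datetime import datetime


def active_hours(post_times):
    """Overlay all days: number of posts per hour of day, as a 24-item list."""
    by_hour = Counter(t.hour for t in post_times)
    return [by_hour.get(hour, 0) for hour in range(24)]


def top_keywords(texts, tokenize=str.split, top_n=2):
    """Most frequent tokens across a user's posts, as (token, count) pairs."""
    counts = Counter(token for text in texts for token in tokenize(text))
    return counts.most_common(top_n)


# Made-up sample data for a single user.
post_times = [
    datetime(2017, 5, 1, 21, 5),
    datetime(2017, 5, 2, 21, 40),
    datetime(2017, 5, 2, 9, 0),
]
texts = ["爬虫 教程 爬虫", "数据 爬虫", "数据"]

hours = active_hours(post_times)
keywords = top_keywords(texts)
print(hours[21], hours[9])
print(keywords)
```

For real Chinese text, whitespace splitting is not enough. Since the project lists jieba for segmentation, passing `jieba.lcut` as `tokenize` would be a natural substitution, and filtering stop words before counting would keep common particles out of the top keywords.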
