LouisYZK / pythonCrawlDemo

Licence: other

Some Python crawling demos to share.

Labels: json, id3, echarts, requests-module

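Internship Log

This repository is the work log from my data-mining internship in Wuhan during the 2018 winter break. My main tasks were:

Writing crawlers that collect data in the format required by the data department's modeling work. The key tools were:
- python3 + requests/urllib + BeautifulSoup/regular expressions
- Charles for packet capture and analysis
- Other Python libraries for request handling and encoding
- Editor: VS Code (convenient for pushing to my server and GitHub)
- Company version control and code submission: SVN

Front-end data visualization: writing visualization JavaScript to meet specific requirements, using ECharts, the Baidu Maps API, Leaflet, and similar libraries.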

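1.23 Crawling a Flex-based page

The China Agricultural Information Network publishes price quotes every day, but this government site is dated and renders its data with Flash. URL: http://jgsb.agri.cn/controller?SERVICE_ID=REGISTRY_JCSJ_MRHQ_SHOW_SERVICE&recordperpage=15&newsearch=true&login_result_sign=nologin

The analysis is the same as for an ordinary Ajax site, except that the request and response bodies are encoded in AMF, so they cannot be read directly. Capturing the traffic with Charles reveals the decoded plaintext, from which the forged request headers and the expected response format can be written. I used the third-party Python library PyAMF for this. The original PyAMF does not support Python 3.x, so Py3AMF has to be installed instead. The import also raises an error, which is fixed by editing the package's __init__.py file.

See crawl_1.23.py for the code.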

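1.24 Crawling the National Forest Germplasm Resources Platform

Today's main task was crawling forest germplasm resource data. The page data is loaded via Ajax, and the analysis is not difficult.

However, unexpected problems came up, such as JSON parsing failures. The fix was to abandon the json() method of requests and parse the raw response text directly with regular expressions.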
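The regex fallback can be sketched as follows. This is only an illustration, not the code from the repository: the sample response, the field names `id` and `name`, and the helper `extract_records` are all hypothetical. The sample contains a raw tab inside a string and a trailing comma, two defects that a strict JSON parser rejects.

```python
import re

# Hypothetical response body: field names and values are invented for this sketch.
# The raw tab inside a string and the trailing comma both break strict JSON parsing.
RAW = '{"rows":[{"id":"1001","name":"Pinus massoniana"},{"id":"1002","name":"Ginkgo\tbiloba"},]}'

# One record is an "id" string immediately followed by a "name" string.
RECORD = re.compile(
    r'"id"\s*:\s*"(?P<id>[^"]*)"\s*,\s*"name"\s*:\s*"(?P<name>[^"]*)"'
)

def extract_records(text):
    """Pull (id, name) pairs straight out of the raw text, ignoring JSON structure."""
    return [(m.group("id"), m.group("name")) for m in RECORD.finditer(text)]

records = extract_records(RAW)
```

The trade-off is that the pattern is tied to the field order seen in the captured responses, so it needs adjusting if the server changes its output.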

Because today's dataset was large (about 80,000 records), I turned on Python multiprocessing, and it felt like riding a rocket...
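A minimal sketch of splitting such a crawl across processes with the standard `multiprocessing.Pool`. The page size of 100, the worker count, and the placeholder `fetch_page` are assumptions made for illustration; the log does not say how the real crawler divided the work.

```python
from multiprocessing import Pool

PAGE_SIZE = 100  # assumed number of records returned per request

def page_numbers(total_records, page_size=PAGE_SIZE):
    """Return the 1-based page numbers needed to cover total_records."""
    pages = (total_records + page_size - 1) // page_size  # ceiling division
    return list(range(1, pages + 1))

def fetch_page(page):
    """Placeholder worker: a real crawler would request and parse this page."""
    return {"page": page, "rows": []}

def crawl(total_records, workers=4):
    """Fan the page numbers out to a pool of worker processes."""
    with Pool(processes=workers) as pool:
        return pool.map(fetch_page, page_numbers(total_records))
```

When run as a script, `crawl(80000)` should be called under an `if __name__ == "__main__":` guard so that worker processes do not re-execute the entry point.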

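1.25 Crawling core metadata from the National Domestic Animal Resources Platform

An aside: yesterday's work spilled over into today, because a colleague in the data department told me the data quality was not good enough. My guess is that they are not used to handling JSON files. From what I saw, most of them use Matlab, SPSS, Stata, and a pile of other data-analysis software I had never heard of, so unstructured data really is a bit hard on them.

But that is not my fault... After some negotiation, they agreed to teach themselves how to process unstructured data...

Today's page was fairly simple: an Ajax POST request. After capturing it with Charles, I only had to construct a matching request.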
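Replaying a captured Ajax POST can be sketched with the standard library alone. The endpoint, form fields, and header values below are placeholders rather than the ones captured for this site, and `build_post` is a hypothetical helper that builds the request without sending it.

```python
from urllib import parse, request

def build_post(url, form, referer):
    """Build (but do not send) a form-encoded POST that mimics a browser Ajax call."""
    body = parse.urlencode(form).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "X-Requested-With": "XMLHttpRequest",  # header commonly sent by Ajax libraries
        "Referer": referer,
        "User-Agent": "Mozilla/5.0",
    }
    return request.Request(url, data=body, headers=headers, method="POST")

req = build_post(
    "http://example.com/api/list",  # hypothetical endpoint
    {"page": 1, "rows": 20},        # hypothetical form fields
    "http://example.com/",
)
```

Passing `req` to `urllib.request.urlopen(req, timeout=10)` would send it. In practice the form fields and headers are copied from the request shown in Charles.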

1.26 Finance office data transformation

Heavy snow fell in Wuhan today, and it was Friday, so my fellow interns were restless all day~!

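1.27 Saturday: bored and looking for something to do

For fun, I analyzed and crawled the course information from the new academic affairs system. The new system presents course information in a more user-friendly way, which makes choosing courses across majors easier. On analysis, it mainly relies on Ajax rendering, which is simple and convenient.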

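1.28 The bottomless pit of crawling: Baidu Index

Even before I learned crawling, I had heard about several notoriously hard targets for web data collection. Baidu Index is one of them: because of the odd way it generates its data, many people call it a bottomless pit... I was curious and wanted to try it... This project will probably take a long time, so I will update it slowly.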

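1.29 Simulating Sogou WeChat search

Searching WeChat official account articles is a commonly used search method, and the data is the basis for studying trends and basic conditions in self-media publishing. The only technical hurdle for Sogou WeChat search is proxy rotation. In testing, each IP is blocked after about 100 requests, at which point the proxy has to be changed.

So I rushed out a script that collects free, usable proxies.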
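The rotation logic can be sketched like this. `ProxyRotator` is a hypothetical helper, the proxy addresses are placeholders, and the default of 90 uses per proxy is an assumed safety margin below the roughly 100 requests observed before a block.

```python
from itertools import cycle

class ProxyRotator:
    """Hand out one proxy at a time, moving on after a fixed number of requests."""

    def __init__(self, proxies, max_uses=90):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)      # loop over the proxy list indefinitely
        self._max_uses = max_uses
        self._current = next(self._pool)
        self._uses = 0

    def get(self):
        """Return the proxy to use for the next request."""
        if self._uses >= self._max_uses:
            self._current = next(self._pool)  # quota spent: switch proxy
            self._uses = 0
        self._uses += 1
        return self._current

# Placeholder addresses; a small quota makes the rotation easy to see.
rotator = ProxyRotator(["10.0.0.1:8080", "10.0.0.2:8080"], max_uses=3)
used = [rotator.get() for _ in range(7)]
```

Each value returned by `get()` would be passed to the HTTP client as the proxy for that request. A fuller version would also drop proxies that stop responding, since free proxies tend to be short-lived.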

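1.30 Using PyAMF again to crawl Flash data

This was the longest project of the internship, because it required digging into how Flash makes requests and how AMF encoding and decoding work, then applying that in a Python crawler. Documentation on this is scarce, so much of it was self-directed exploration.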

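1.31 Crawling spatial distribution data for agricultural wholesale markets

Many pages are rendered with JavaScript map frameworks, and they produce their data in different ways. This time the map data sat directly in JavaScript text in the page head (probably because there is not much of it...), which called for regular expressions and heavy string processing.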
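A minimal sketch of pulling an embedded array out of inline JavaScript. The page snippet, the variable name `markets`, and its fields are invented for illustration. It also assumes the array literal happens to be valid JSON; the real page may need more string cleanup first.

```python
import json
import re

# Hypothetical page: the variable name `markets` and its fields are invented.
HTML = """
<html><head><script>
var map = new BMap.Map("container");
var markets = [{"name": "Market A", "lng": 114.30, "lat": 30.59},
               {"name": "Market B", "lng": 114.27, "lat": 30.55}];
</script></head><body></body></html>
"""

def extract_js_array(html, var_name):
    """Find `var <name> = [...];` in the page and parse the array as JSON."""
    pattern = re.compile(
        r"var\s+" + re.escape(var_name) + r"\s*=\s*(\[.*?\])\s*;",
        re.DOTALL,  # the literal may span several lines
    )
    match = pattern.search(html)
    if match is None:
        return []
    return json.loads(match.group(1))

markets = extract_js_array(HTML, "markets")
```

The non-greedy match stops at the first `]` followed by a semicolon, which is enough for a flat array of objects like this one. If the literal uses single quotes or unquoted keys, those would have to be normalized before `json.loads` can parse it.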
