wdkwdkwdk / Fuck_illness

Everything used to write a data analysis of diseases

fuck_illness

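WeChat official account: 超级王登科

Article link: https://greatdk.com/1513.html

Overview

To write an article analyzing disease data, I crawled 1.5 million disease-related question-and-answer records and analyzed them with Python. This README documents the whole process and provides the code and data.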

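Crawler

The crawler files are health.py and m.py. There isn't much to say about the crawler itself; the code covers the basics. One caveat: after I added multithreading, the crawler's throughput dropped after running for a while, and it sometimes hung completely. I spent a long time on this without finding a good fix, so I wrote a monitor program, m.py. Every five seconds it checks how many new records have been added, and if the number is below a threshold it restarts the crawler.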
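The monitoring idea can be sketched roughly as follows. This is not the code in m.py; the function names, the `min_new_rows` threshold, and the injected `get_row_count` and `start_crawler` callables are all assumptions made for illustration.

```python
import time


def should_restart(previous_count, current_count, min_new_rows):
    """Return True when too few rows were added since the last check."""
    return (current_count - previous_count) < min_new_rows


def watch(get_row_count, start_crawler, min_new_rows=10, interval=5.0,
          max_checks=None, sleep=time.sleep):
    """Poll the row count and restart the crawler when progress stalls.

    get_row_count: hypothetical callable returning the current number of rows.
    start_crawler: hypothetical callable that launches the crawler and returns
                   an object with terminate() and wait(), e.g. subprocess.Popen.
    max_checks:    stop after this many checks (None means run forever).
    Returns the number of restarts performed.
    """
    process = start_crawler()
    previous = get_row_count()
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        sleep(interval)
        current = get_row_count()
        if should_restart(previous, current, min_new_rows):
            # Progress stalled: stop the old crawler and start a fresh one.
            process.terminate()
            process.wait()
            process = start_crawler()
            restarts += 1
        previous = current
        checks += 1
    return restarts
```

In practice `start_crawler` could be something like `lambda: subprocess.Popen([sys.executable, "health.py"])`, and `get_row_count` a small query against whatever database the crawler writes to.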

jieba

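Out of habit I started with jieba. I later realized that all I needed was word-frequency counting against a given dictionary, which doesn't require jieba at all. Since I had already pulled it in, I implemented it with jieba anyway. I hit more and more pitfalls along the way but got it working in the end, and when I compared the two approaches, the speed was fine too.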
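For comparison, a jieba-free count against a fixed dictionary might look like the sketch below. It is only an illustration, not code from this repository: `count_terms` is a hypothetical helper, and it relies on plain substring matching, so a term that is contained in a longer term is counted in both.

```python
from collections import Counter


def count_terms(text, terms):
    """Count how often each dictionary term occurs in text.

    Uses str.count, which counts non-overlapping occurrences of each term.
    Terms that never occur are left out of the result.
    """
    counts = Counter()
    for term in terms:
        if not term:
            continue  # skip empty entries, e.g. blank lines in a dictionary file
        n = text.count(term)
        if n:
            counts[term] = n
    return counts
```

Because no segmentation happens, words outside the dictionary can never appear in the result.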

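jieba's function for this kind of word-frequency statistic is jieba.analyse.extract_tags (it ranks keywords by TF-IDF weight), so at first I used it directly:

import jieba.analyse
jieba.analyse.set_idf_path('dic_for_idf.txt')  # use the custom IDF dictionary
tags = jieba.analyse.extract_tags(content, topK=200, withWeight=True)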

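But the results were a mess. Even with the custom IDF dictionary configured, segmentation produces many words that are not in the dictionary. Those words get weights too, and because these general-purpose words occur more often, they completely drown out the words from the custom dictionary, so the statistics end up dominated by words outside it.

So I added one more line to configure jieba's segmentation dictionary as well:

import jieba.analyse
jieba.set_dictionary('dic_for_use.txt')  # use the custom segmentation dictionary
jieba.analyse.set_idf_path('dic_for_idf.txt')  # use the custom IDF dictionary
tags = jieba.analyse.extract_tags(content, topK=200, withWeight=True)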

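That still didn't work. After some searching I found that jieba also has a new-word discovery feature, and turning it off means disabling the hidden Markov model (HMM). jieba.cut has a switch for the HMM, but the jieba.analyse.extract_tags function I was calling has no such parameter. So I had to edit the jieba source and disable the HMM by hand. The changes go in posseg/__init__.py under the jieba library directory: search for HMM, which turns up many occurrences, and change them all to False.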
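If plain counts are acceptable instead of TF-IDF weights, one alternative to patching the library is to segment with `jieba.cut(..., HMM=False)` and count the tokens directly. The sketch below is an assumption about how that could look, not what this project does; `count_known_words` and its `tokenize` parameter are hypothetical names, and the tokenizer is injectable so the function can be exercised without jieba installed.

```python
from collections import Counter


def count_known_words(text, vocabulary, tokenize=None):
    """Count tokens of text that appear in vocabulary.

    vocabulary: a set (or dict) of the words to keep.
    tokenize:   optional callable turning a string into an iterable of tokens.
                Defaults to jieba.cut with the HMM disabled, so no new words
                are invented during segmentation.
    """
    if tokenize is None:
        import jieba  # third-party; imported lazily

        def tokenize(s):
            return jieba.cut(s, HMM=False)

    return Counter(tok for tok in tokenize(text) if tok in vocabulary)
```

A call such as `count_known_words(content, vocabulary).most_common(200)` would then give the 200 most frequent dictionary words.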

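To be safe, I also added a check to the core word-frequency file, around line 103:

if w not in self.idf_freq:
    continue

This completely filters out any word that is not in the custom dictionary.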
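The effect of that check can be illustrated with a simplified, standalone TF-IDF weighting. This is a sketch for explanation only and not jieba's actual implementation; `tfidf_known_only` is a hypothetical name, and `idf` stands in for the table loaded from the custom IDF file.

```python
from collections import Counter


def tfidf_known_only(tokens, idf):
    """Weight tokens by term frequency times IDF, ignoring unknown words.

    tokens: iterable of already-segmented words.
    idf:    dict mapping each dictionary word to its IDF value.
    Returns (word, weight) pairs sorted by descending weight.
    """
    # Words missing from the IDF table are dropped before counting,
    # which mirrors the `if w not in self.idf_freq: continue` check.
    freq = Counter(t for t in tokens if t in idf)
    total = sum(freq.values())
    if total == 0:
        return []
    weights = {word: n * idf[word] / total for word, n in freq.items()}
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
```

Since unknown words never enter the frequency table, they can neither receive a weight nor dilute the total used for normalization.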

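Corpus

Sogou's lexicon library turned out to be a really good resource, with a large amount of medical vocabulary. Note that the downloaded files can't be used directly; they need to be decoded with a tool first. I recommend 深蓝词库转换 (the 深蓝词库 converter), which is very easy to use.

A few things to watch out for

  • dict is faster than list, but if you only need to read the data out without any other processing, don't use a for loop
  • Filter out stop words to save time
  • Assorted Chinese text-encoding problems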
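Two of these points can be sketched in a few lines. The helpers below are illustrative assumptions rather than code from this repository: `remove_stop_words` uses a set for constant-time membership tests, and `decode_text` tries a list of candidate encodings (the default list here is a guess at what converted lexicon files typically use).

```python
def remove_stop_words(tokens, stop_words):
    """Drop stop words from a token list using a set for fast lookups."""
    stop_set = set(stop_words)  # set membership is O(1) on average; list is O(n)
    return [t for t in tokens if t not in stop_set]


def decode_text(raw, encodings=("utf-8", "gb18030")):
    """Decode bytes by trying each candidate encoding in order."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")
```

Reading files in binary mode and decoding them with `decode_text` keeps the encoding decision in one place instead of scattering it across every `open` call.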

Database data

https://c-t.work/s/3443c377e7814c
