wdkwdkwdk / Fuck_illness

Everything I used while writing a data analysis of diseases

fuck_illness

WeChat official account: 超级王登科

Article link: https://greatdk.com/1513.html

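Overview

To write an article analyzing disease data, I crawled 1.5 million disease-related question-and-answer records and analyzed them with Python. This README records the whole process and provides the code and data.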

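Crawler

The crawler lives in health.py and m.py. There is not much to say about the crawler itself; the code covers the basics. One caveat: after I added multithreading, the crawler's throughput would drop after a while, and sometimes it hung entirely. I spent a long time on this without finding a good fix, so I wrote a monitor program, m.py. Every five seconds it checks how much new data has been added, and if the amount falls below a threshold it restarts the crawler.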
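A minimal sketch of such a monitor, not the actual contents of m.py. The names `should_restart`, `watch`, `count_rows`, and `start_health_crawler` are hypothetical, as is the default threshold of 10 new rows; only the five-second interval and the script name health.py come from the description above.

```python
import subprocess
import time


def should_restart(previous_count, current_count, min_new_rows):
    """Return True when too few rows were added since the last check."""
    return (current_count - previous_count) < min_new_rows


def watch(count_rows, start_crawler, min_new_rows=10, interval=5.0, max_checks=None):
    """Poll the row count and restart the crawler whenever progress stalls.

    count_rows: callable returning the current number of stored rows (hypothetical).
    start_crawler: callable that launches the crawler and returns a process-like
        object with terminate() and wait(), for example a subprocess.Popen.
    max_checks: stop after this many checks; None means run forever.
    Returns the number of restarts performed.
    """
    process = start_crawler()
    previous = count_rows()
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        time.sleep(interval)
        current = count_rows()
        if should_restart(previous, current, min_new_rows):
            # Progress stalled: stop the old crawler and launch a fresh one.
            process.terminate()
            process.wait()
            process = start_crawler()
            restarts += 1
        previous = current
        checks += 1
    return restarts


def start_health_crawler():
    # Assumption: the crawler is launched as a plain script named health.py.
    return subprocess.Popen(["python", "health.py"])
```

With a real `count_rows` that queries the database, `watch(count_rows, start_health_crawler)` would run until interrupted. Passing the two callables in keeps the restart logic testable without a live crawler.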

jieba

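At first I reached for jieba out of habit. Later I realized that all I really needed was a word-frequency count against a given dictionary, which can be done without jieba. Since I had already pulled it in, though, I went ahead and implemented the count with jieba. I ran into more and more pitfalls along the way, but it worked in the end, and when I compared, the speed was good too.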
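For reference, a dictionary-only count without jieba could look roughly like the sketch below. This is an illustration rather than code from this repository: `count_terms` is a hypothetical helper, and it uses plain substring matching, so a term contained in a longer term is counted for both.

```python
from collections import Counter


def count_terms(text, terms):
    """Count how often each dictionary term occurs in text as a substring."""
    counts = Counter()
    for term in terms:
        occurrences = text.count(term)  # non-overlapping matches of this term
        if occurrences:
            counts[term] = occurrences
    return counts
```

Calling `count_terms(content, terms).most_common(200)` would give the 200 most frequent dictionary terms, assuming `terms` holds the words from the custom dictionary.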

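The jieba function I used for this is jieba.analyse.extract_tags, which ranks keywords by TF-IDF weight, so I started by calling it directly:

import jieba.analyse

jieba.analyse.set_idf_path('dic_for_idf.txt')  # use the custom IDF dictionary
tags = jieba.analyse.extract_tags(content, topK=200, withWeight=True)  # content is the text to analyze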

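But the results were a mess. Even with the custom IDF dictionary configured, segmentation still produces many words that are not in the dictionary, and those words get weights too. These general-purpose words occur far more often, so they completely drown out the words from the custom dictionary, and the top results were all words from outside it.

So next I added one line to also set the dictionary jieba uses for segmentation:

jieba.set_dictionary('dic_for_use.txt')  # use the custom segmentation dictionary
jieba.analyse.set_idf_path('dic_for_idf.txt')  # use the custom IDF dictionary
tags = jieba.analyse.extract_tags(content, topK=200, withWeight=True)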

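That still did not work. After searching online I found that jieba also has a new-word discovery feature, and turning it off means disabling the hidden Markov model (HMM). jieba.cut takes a parameter that switches the HMM on or off, but jieba.analyse.extract_tags, the function I was calling, has no such parameter. So I had to edit jieba's source and disable the HMM by hand. The change is in posseg/__init__.py under the jieba library directory: search for HMM, which turns up many occurrences, and set them all to False.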

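To be safe, I also added a check to the core word-frequency file, around line 103:

if w not in self.idf_freq:
    continue

With this, words outside the custom dictionary are filtered out completely.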
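The same filtering can also be sketched outside the library, without patching jieba's source. The sketch below is an alternative, not what this project does: it segments with `jieba.cut(content, HMM=False)` and keeps only tokens found in the dictionary. It returns raw counts, not the TF-IDF weights that extract_tags returns. The helper names are hypothetical, and the dictionary lines are assumed to follow jieba's `term [freq] [tag]` format.

```python
from collections import Counter


def load_vocabulary(lines):
    """Build a set of terms from dictionary lines of the form 'term [freq] [tag]'."""
    vocabulary = set()
    for line in lines:
        parts = line.split()
        if parts:  # skip blank lines
            vocabulary.add(parts[0])
    return vocabulary


def count_in_vocabulary(tokens, vocabulary, top_k=200):
    """Count only tokens present in the vocabulary; return the top_k most common."""
    counts = Counter(token for token in tokens if token in vocabulary)
    return counts.most_common(top_k)


def tokenize_without_hmm(content):
    """Segment text with jieba's new-word discovery (HMM) switched off."""
    import jieba  # imported lazily so the counting helpers work without jieba
    return jieba.cut(content, HMM=False)
```

A possible call, assuming dic_for_use.txt is UTF-8 encoded: open the file, pass the file object to `load_vocabulary`, then run `count_in_vocabulary(tokenize_without_hmm(content), vocabulary)`.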

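Corpus

I found Sogou's lexicon library to be a really good resource, with a wealth of medical vocabulary. Note that the downloaded files cannot be used directly; they have to be decoded with a tool first. I recommend 深蓝词库转换, which is very easy to use.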

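A few things to watch out for

dict lookups are faster than list lookups, but if you only need to read the data out and do nothing else with it, do not use a for loop
Filter out stop words to save time
Various Chinese text-encoding problems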
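The first two points can be illustrated with a small sketch. Nothing here is taken from the project's code: `filter_tokens` and `compare_lookup_speed` are hypothetical names, and the sizes are arbitrary. The comparison times a worst-case membership test, where the list has to scan every element while the dict does a single hash lookup.

```python
import timeit


def filter_tokens(tokens, stop_words):
    """Drop stop words; stop_words should be a set so each lookup is O(1)."""
    return [token for token in tokens if token not in stop_words]


def compare_lookup_speed(size=20000, repeats=200):
    """Time a worst-case membership test against a list and against a dict."""
    words = [f"word{i}" for i in range(size)]
    as_list = words
    as_dict = dict.fromkeys(words, 1)
    target = words[-1]  # last element: the list must scan everything to find it
    list_time = timeit.timeit(lambda: target in as_list, number=repeats)
    dict_time = timeit.timeit(lambda: target in as_dict, number=repeats)
    return list_time, dict_time
```

Running `compare_lookup_speed()` returns the two timings in seconds; the exact numbers depend on the machine.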

Database data

https://c-t.work/s/3443c377e7814c
