Jacen789 / Hotnewsanalysis

Analyzing hot news topics with text mining techniques

Programming Languages

python

Projects that are alternatives of or similar to Hotnewsanalysis

Woid
Simple news aggregator displaying top stories in real time
Stars: ✭ 204 (+119.35%)
Mutual labels:  news, crawler
Newspaper
News, full-text, and article metadata extraction in Python 3. Advanced docs:
Stars: ✭ 11,545 (+12313.98%)
Mutual labels:  news, crawler
Nlp chinese corpus
Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+7056.99%)
Mutual labels:  news, word2vec
N2h4
A tool for collecting Naver News articles
Stars: ✭ 177 (+90.32%)
Mutual labels:  news, crawler
Taiwan News Crawlers
Scrapy-based Crawlers for news of Taiwan
Stars: ✭ 83 (-10.75%)
Mutual labels:  news, crawler
Ttbot
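Toutiao (今日头条) bot that supports user login, following, unfollowing, fetching followed accounts and followers, posting articles, posting Wukong Q&A, liking, commenting, and collecting various kinds of news; implemented with the Toutiao web API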
Stars: ✭ 338 (+263.44%)
Mutual labels:  news, crawler
News Please
news-please - an integrated web crawler and information extractor for news that just works.
Stars: ✭ 969 (+941.94%)
Mutual labels:  news, crawler
Is Google
Verify that a request is from Google crawlers using Google's DNS verification steps
Stars: ✭ 82 (-11.83%)
Mutual labels:  crawler
Glove As A Tensorflow Embedding Layer
Taking a pretrained GloVe model, and using it as a TensorFlow embedding weight layer **inside the GPU**. Therefore, you only need to send the index of the words through the GPU data transfer bus, reducing data transfer overhead.
Stars: ✭ 85 (-8.6%)
Mutual labels:  word2vec
Cw2vec
Character-based training of word vectors
Stars: ✭ 80 (-13.98%)
Mutual labels:  word2vec
Ja.text8
Japanese text8 corpus for word embedding.
Stars: ✭ 79 (-15.05%)
Mutual labels:  word2vec
Newspaper
An aggregated newspaper app containing news from 10+ local news publishers in Hong Kong. Made with ❤
Stars: ✭ 82 (-11.83%)
Mutual labels:  news
Nlp Journey
Documents, papers and code related to Natural Language Processing, including Topic Model, Word Embedding, Named Entity Recognition, Text Classification, Text Generation, Text Similarity, Machine Translation, etc. All code is implemented in TensorFlow 2.0.
Stars: ✭ 1,290 (+1287.1%)
Mutual labels:  word2vec
Work crawler
Download tool for comics and novels: 腾讯漫画 大角虫漫画 有妖气 知音漫客 咪咕 SF漫画 哦漫画 看漫画 漫画柜 汗汗酷漫 動漫伊甸園 快看漫画 微博动漫 733动漫网 大古漫画网 漫画DB 無限動漫 動漫狂 卡推漫画 动漫之家 动漫屋 古风漫画网 36漫画网 亲亲漫画网 乙女漫画 comico webtoons 咚漫 ニコニコ静画 ComicWalker ヤングエースUP モアイ pixivコミック サイコミ; アルファポリス カクヨム ハーメルン 小説家になろう 起点中文网 八一中文网 顶点小说 落霞小说网 努努书坊 笔趣阁 → epub.
Stars: ✭ 1,224 (+1216.13%)
Mutual labels:  crawler
Proxy Pool
Proxy IP pool service for crawlers; other crawler programs can fetch proxies from it through a REST API
Stars: ✭ 91 (-2.15%)
Mutual labels:  crawler
Wombat
Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
Stars: ✭ 1,220 (+1211.83%)
Mutual labels:  crawler
Ktspeechcrawler
Automatically constructing corpus for automatic speech recognition from YouTube videos
Stars: ✭ 92 (-1.08%)
Mutual labels:  crawler
Dict2vec
Dict2vec is a framework to learn word embeddings using lexical dictionaries.
Stars: ✭ 91 (-2.15%)
Mutual labels:  word2vec
Geziyor
Geziyor, a fast web crawling & scraping framework for Go. Supports JS rendering.
Stars: ✭ 1,246 (+1239.78%)
Mutual labels:  crawler
Code news
Daily selection of articles from Diycode
Stars: ✭ 1,303 (+1301.08%)
Mutual labels:  news

HotNewsAnalysis

Analyzing hot news topics with text mining techniques


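Hot-Topic Analysis

This work uses text mining to analyze hot topics in the news. Financial news crawled from the web is clustered by content to find the hot topics; each hot topic is then analyzed by clustering the words related to it, which yields the people, industries, or organizations involved. The main content is shown in Figure 1-1: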

Figure 1-1: Overall tasks of the news hot-topic analysis

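As Figure 1-1 shows, this work covers the following:

  1. Use news APIs, crawlers, and multithreading to fetch a large number of financial news reports from three specialized financial news sites: Sina Finance (新浪财经), Sohu Finance (搜狐财经), and Xinhuanet Finance (新华网财经);

  2. Deduplicate the news and filter it by time range, then segment the news text with jieba and tag parts of speech, keeping nouns, verbs, abbreviations, and similar categories. Before segmentation, a custom user dictionary is loaded to improve segmentation accuracy; after segmentation, a stop-word dictionary, a disambiguation dictionary, and a dictionary of single-character words to keep are used to filter out words that are irrelevant to the topic and hurt clustering accuracy. This builds a vocabulary for each article. After TF-IDF feature extraction, the news is clustered with DBSCAN, and the clusters are sorted by size;

  3. For each cluster of news, to obtain the topic of that hot spot, extract the article titles, rank them by importance with the TextRank algorithm, and use the highest-ranked title to describe the hot topic;

  4. Segment all news content with jieba and train a word2vec word-embedding model. Then, for each cluster of news, take the segmented content, look up each word's vector in the word2vec model, and group similar words with the k-Means clustering algorithm.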
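
The clustering stage of step 2 can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual code: it assumes the articles have already been segmented and filtered (for example with jieba), and the function names (`tfidf_vectors`, `cosine_distance`, `dbscan`, `rank_clusters`), the toy English tokens, and the `eps` and `min_samples` values are all hypothetical. A real pipeline would more likely rely on a library implementation of TF-IDF and DBSCAN.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into L2-normalized TF-IDF dicts."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF, so terms found in every document keep a small weight.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity; both inputs are assumed to be unit length."""
    if len(a) > len(b):
        a, b = b, a
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def dbscan(vectors, eps, min_samples):
    """Plain O(n^2) DBSCAN; returns one label per point, -1 meaning noise."""
    n = len(vectors)
    # Each point counts as its own neighbour (distance 0).
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep expanding
    return labels


def rank_clusters(labels):
    """Return (cluster_id, size) pairs, largest cluster first, noise excluded."""
    return Counter(label for label in labels if label != -1).most_common()


# Toy, already-tokenized "articles" (hypothetical data).
docs = [
    ["central", "bank", "cuts", "interest", "rate"],
    ["bank", "interest", "rate", "cut", "announced"],
    ["central", "bank", "interest", "rate", "decision"],
    ["oil", "price", "rises", "opec"],
    ["opec", "oil", "price", "output"],
    ["football", "final", "tonight"],
]
labels = dbscan(tfidf_vectors(docs), eps=0.7, min_samples=2)
print(labels)                 # [0, 0, 0, 1, 1, -1]
print(rank_clusters(labels))  # [(0, 3), (1, 2)]
```

DBSCAN suits this step because the number of hot topics is not known in advance and articles that match no topic are simply labeled as noise. The titles of the largest clusters would then feed the TextRank ranking described in step 3.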

The system's visual interface is shown in Figure 1-2:

Figure 1-2: Main interface of the news hot-topic analysis system

Features: see the paper.
