stanzhai / Html2article
Licence: other
Main-text (article body) extraction from HTML web pages
Stars: ✭ 441
Projects that are alternatives of or similar to Html2article
Hacker News Digest
📰 A responsive interface of Hacker News with summaries and thumbnails.
Stars: ✭ 278 (-36.96%)
Mutual labels: content, crawler, spider, article, topic
Weixin Spider
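WeChat official account crawler: historical articles, article comments, article read counts and "Wow" (在看) counts, with a visual web interface; deployable on Windows servers. Built on Python 3 with flask/mysql/redis/mitmproxy/pywin32. An efficient WeChat crawler covering official accounts, historical articles, article comments, and data updates.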
Stars: ✭ 287 (-34.92%)
Mutual labels: crawler, spider, article
Weibo Topic Spider
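Weibo super-topic crawler with word-frequency statistics, sentiment analysis, and simple classification; now also includes data crawled from the pneumonia super topic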
Stars: ✭ 128 (-70.98%)
Mutual labels: crawler, spider, topic
Crawlertutorial
A minimal crawler tutorial (fetch, parse, search, multiprocessing, API), using PTT as the example
Stars: ✭ 282 (-36.05%)
Mutual labels: crawler, spider
Gospider
A crawler framework implemented in Golang. Users only need to define page rules, and a web management interface is provided. Built on colly.
Stars: ✭ 285 (-35.37%)
Mutual labels: crawler, spider
galer
A fast tool to fetch URLs from HTML attributes by crawl-in.
Stars: ✭ 138 (-68.71%)
Mutual labels: crawler, spider
Ttbot
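Toutiao (今日头条) bot supporting user login, follow, unfollow, fetching followings and followers, posting articles, posting Wukong Q&A, likes, comments, and collecting various types of news; implemented with the Toutiao web API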
Stars: ✭ 338 (-23.36%)
Mutual labels: crawler, spider
Webster
a reliable high-level web crawling & scraping framework for Node.js.
Stars: ✭ 364 (-17.46%)
Mutual labels: crawler, spider
Fictiondown
Novel download | novel crawling | Qidian | Biquge | export to Markdown | export to txt | convert to epub | ad filtering | automatic proofreading
Stars: ✭ 362 (-17.91%)
Mutual labels: crawler, spider
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (-37.19%)
Mutual labels: crawler, spider
91porn Api
🌭💦 91porn crawler with an unrestricted online API (permanently valid, access token updated daily) and an online web preview
Stars: ✭ 341 (-22.68%)
Mutual labels: crawler, spider
Bilili
🍻 bilibili video (including bangumi) and danmaku downloader
Stars: ✭ 379 (-14.06%)
Mutual labels: crawler, spider
Freshonions Torscraper
Fresh Onions is an open source TOR spider / hidden service onion crawler hosted at zlal32teyptf4tvi.onion
Stars: ✭ 348 (-21.09%)
Mutual labels: crawler, spider
Html2Article
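An efficient tool for extracting the main text (article body) from HTML on the .NET platform.
Extraction uses a text-density-based algorithm and supports compressed HTML documents. The average extraction time is about 30 ms per page, and accuracy is above 95%.
Html2Article features
- Tag-independent: extraction does not rely on specific tags;
- Supports extracting body text from compressed HTML documents;
- Can output the original body text with its tags preserved;
- The core algorithm is simple and efficient, with an average extraction time of about 30 ms.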
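To make the idea of tag-independent, text-density-based extraction concrete, here is a minimal sketch. It is not the Html2Article source: `DensitySketch`, `ExtractBody`, and the `depth` and `limitCount` parameters are hypothetical names, loosely modeled on the library's Depth and LimitCount settings. It assumes the HTML still contains line breaks, so unlike the library it would not cope with compressed documents, and short lines that sit directly above the body can leak into the result.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Illustrative only: NOT the Html2Article implementation. All names are hypothetical.
public static class DensitySketch
{
    public static string ExtractBody(string html, int depth = 5, int limitCount = 180)
    {
        // Drop script/style blocks entirely, then strip every remaining tag.
        // Newlines outside tags are kept, so line positions remain meaningful.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        text = Regex.Replace(text, @"<[^>]+>", "");
        string[] lines = text.Split('\n').Select(l => l.Trim()).ToArray();

        var body = new StringBuilder();
        bool inBody = false;
        for (int i = 0; i < lines.Length; i++)
        {
            // Text density at line i: total characters in the window [i, i + depth).
            int windowChars = 0;
            for (int j = i; j < Math.Min(i + depth, lines.Length); j++)
                windowChars += lines[j].Length;

            if (!inBody && windowChars >= limitCount)
                inBody = true;   // dense enough: treat this as the start of the body
            else if (inBody && windowChars == 0)
                break;           // a run of empty lines: treat this as the end of the body

            if (inBody && lines[i].Length > 0)
                body.AppendLine(lines[i]);
        }
        return body.ToString().Trim();
    }
}
```

Calling `DensitySketch.ExtractBody(html)` on a page whose article is separated from navigation and footer text by several blank lines returns only the dense middle block. No tag names are inspected after the initial stripping, which is the sense in which such a heuristic is tag-independent.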
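Add HTML body extraction to your project
- Install the package from the NuGet Package Manager Console:
PM> Install-Package Html2Article
- Import the namespace:
using StanSoft;
- Add the following code:
// html is the HTML text you want to extract from
string html = "<html>....</html>";
// The Article object exposes four properties: Title, PublishDate, Content (body text), and ContentWithTags (body text with tags)
Article article = Html2Article.GetArticle(html);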
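Html2Article class
- The Html2Article class is the core class for body extraction.
Html2Article configuration
- AppendMode: whether to use append mode. Defaults to false; when set to true, more text that meets the criteria is appended to the body.
- Depth: the analysis depth. Defaults to 5; increase it for pages with large gaps between lines.
- LimitCount: the character threshold. Once the amount of analyzed text reaches this count, the extractor treats it as the start of the body. Defaults to 180 characters.
- GetArticle(string html): extracts an Article from the HTML text.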
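The settings above might be combined as in the following sketch. It assumes that AppendMode, Depth, and LimitCount are static members of Html2Article, which is consistent with GetArticle being called statically but is not confirmed by this README. The file name `page.html` and the chosen values are purely illustrative.

```csharp
using System;
using System.IO;
using StanSoft;

class Program
{
    static void Main()
    {
        // Assumption: these settings are static properties on Html2Article.
        Html2Article.AppendMode = true;  // append more qualifying text to the body
        Html2Article.Depth = 8;          // larger than the default of 5, for pages with wide line gaps
        Html2Article.LimitCount = 120;   // lower than the default of 180 characters

        // Hypothetical input file; any string containing an HTML document works.
        string html = File.ReadAllText("page.html");

        Article article = Html2Article.GetArticle(html);
        Console.WriteLine(article.Title);
        Console.WriteLine(article.Content);
    }
}
```

Raising Depth and lowering LimitCount both make the extractor more permissive, so it is worth changing one at a time and comparing the extracted Content against the page.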
License