
sparrow629 / Tumblr_crawler

License: GPL-3.0
This is a multi-threaded crawler for Tumblr.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to Tumblr_crawler

Tumblr Crawler
Easily download all the photos/videos from Tumblr blogs. (Download the images and videos from a specified Tumblr blog.)
Stars: ✭ 1,118 (+333.33%)
Mutual labels:  tumblr, crawler
TumblTwo
TumblTwo, an Improved Fork of TumblOne, a Tumblr Downloader.
Stars: ✭ 57 (-77.91%)
Mutual labels:  crawler, tumblr
Tumblthree
A Tumblr Backup Application
Stars: ✭ 211 (-18.22%)
Mutual labels:  tumblr, crawler
Tumblthree
A Tumblr Blog Backup Application
Stars: ✭ 923 (+257.75%)
Mutual labels:  tumblr, crawler
Tumblr crawler
A website for parsing Tumblr content
Stars: ✭ 83 (-67.83%)
Mutual labels:  tumblr, crawler
Media Scraper
Scrapes all photos and videos in a web page / Instagram / Twitter / Tumblr / Reddit / pixiv / TikTok
Stars: ✭ 206 (-20.16%)
Mutual labels:  tumblr, crawler
Annie
👾 Fast and simple video download library and CLI tool written in Go
Stars: ✭ 16,369 (+6244.57%)
Mutual labels:  tumblr, crawler
snapcrawl
Crawl a website and take screenshots
Stars: ✭ 37 (-85.66%)
Mutual labels:  crawler
eastmoney
python requests + Django+ nodejs koa+ mysql to crawl eastmoney fund and stock data,for data analysis and visualiaztion .
Stars: ✭ 56 (-78.29%)
Mutual labels:  crawler
bots-zoo
No description or website provided.
Stars: ✭ 59 (-77.13%)
Mutual labels:  crawler
WebCrawler
A lightweight, fast, multi-threaded, multi-pipeline, flexibly configurable web crawler.
Stars: ✭ 39 (-84.88%)
Mutual labels:  crawler
ZhengFang System Spider
🐛 A small crawler that logs into the ZhengFang educational administration system and scrapes data
Stars: ✭ 21 (-91.86%)
Mutual labels:  crawler
weibo-scraper
Simple Weibo Scraper
Stars: ✭ 50 (-80.62%)
Mutual labels:  crawler
octopus
Recursive and multi-threaded broken link checker
Stars: ✭ 19 (-92.64%)
Mutual labels:  crawler
lightnovel epub
🍭 EPUB generator for (light) novels; supported sites: 轻之国度 and 轻小说文库
Stars: ✭ 89 (-65.5%)
Mutual labels:  crawler
PY-Login
Simulates logins to all kinds of websites and drives their APIs to do all sorts of indescribable things
Stars: ✭ 26 (-89.92%)
Mutual labels:  crawler
JQScrollNumberLabel
JQScrollNumberLabel: a scrolling number label modeled on Tumblr's note counter; a control for displaying numbers that animates a scroll whenever its value changes, with configurable limits on animation and digit count, optional dynamic creation and instantiation, customizable font styles, and more.
Stars: ✭ 29 (-88.76%)
Mutual labels:  tumblr
dijnet-bot
All your bills in yet another place :)
Stars: ✭ 17 (-93.41%)
Mutual labels:  crawler
rankr
🇰🇷 Realtime integrated information analysis service
Stars: ✭ 21 (-91.86%)
Mutual labels:  crawler
MyCrawler
My collection of crawlers
Stars: ✭ 55 (-78.68%)
Mutual labels:  crawler

Tumblr_Crawler

This is a multi-threaded crawler for Tumblr. It can download an entire blog or any single post that you like.

There are two crawler modules: one for video and one for images (including GIFs). The main file is the Crawler.
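
A minimal sketch of what that video/image split might look like in one place (the function name and regular expressions below are illustrative assumptions, not the code shipped in the two modules):

```python
import re
import requests

def classify_and_extract(post_url):
    """Fetch a post page and pull out its media URLs, deciding between video and images.

    Illustrative sketch only; the real project splits this into separate video and
    image crawler modules that are driven by the main Crawler file.
    """
    html = requests.get(post_url, timeout=10).text
    video_urls = re.findall(r'<source[^>]+src="([^"]+)"', html)
    if video_urls:
        return "video", video_urls
    image_urls = re.findall(r'<img[^>]+src="([^"]+\.(?:jpg|jpeg|png|gif))"', html)
    return "image", image_urls
```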

Change Log

Update 2.0: download any post

This version of TumblrCrawler combines video and image (including GIF) handling in the same file. What's more, it can recognize whether a post's main content is video or photos. The current version only downloads a post page directly.
The whole-blog search function is still in progress. That search will be simple and will ignore the JavaScript: the idea is to use the archive page to collect all the post pages, then visit every page and download its content.

Update 3.0

This version adds the whole-blog crawling function, which means the crawler can download all the files of one blog, both images and videos, in a single run.
The crawler uses the threading.Thread module. Each Tumblr listing page of 10 posts is handled by its own thread, and the multi-threading speeds up the whole process considerably. It needs no cookies and can crawl any account. Of course, the more posts there are, the longer it takes to crawl them all.
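
A minimal sketch of that one-page-per-thread fan-out, assuming the default theme paginates at /page/N with about 10 posts per page (the blog URL and helper names are placeholders):

```python
import threading
import requests

def crawl_page(blog_url, page_number):
    """Fetch one listing page (about 10 posts on the default theme) in its own thread."""
    page_url = "{}/page/{}".format(blog_url.rstrip('/'), page_number)
    resp = requests.get(page_url, timeout=10)
    if resp.ok:
        # ...parse the post URLs out of resp.text and download each one...
        print("fetched {}: {} bytes".format(page_url, len(resp.text)))

def crawl_blog(blog_url, pages):
    """Spawn one thread per listing page, mirroring the one-page-per-thread design."""
    threads = [threading.Thread(target=crawl_page, args=(blog_url, n))
               for n in range(1, pages + 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    crawl_blog("http://staff.tumblr.com/", pages=5)
```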

Update 4.0

I found that some blogs install a personal theme, which means they use a different stylesheet from the default one. That makes crawling from the home page unreliable, so I switched to searching the default Archive page instead. Every archive page uses the same stylesheet, but each one lists 50 posts, which means a single thread has to process 50 post URL downloads. This definitely slows the whole process down a lot, but it has to be done this way.

The PersonalThemeSearch.py module determines whether a blog uses the default stylesheet or a personal one.
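
A rough sketch of the kind of check such a module could make; the marker string below is purely an assumption, not what PersonalThemeSearch.py actually looks for:

```python
import requests

# Assumed marker of the default stylesheet; the real PersonalThemeSearch.py may test
# for something entirely different.
DEFAULT_THEME_MARKER = 'id="posts"'

def uses_default_theme(blog_url):
    """Guess whether a blog still uses the default stylesheet (illustrative heuristic)."""
    html = requests.get(blog_url, timeout=10).text
    return DEFAULT_THEME_MARKER in html
```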

ArchiveSearch.py is the module that crawls all the post URLs from the archive pages; every archive page holds 50 post URLs, whereas the original main-page approach yields only 10 posts per page.
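
A sketch of what collecting those URLs might look like, under the assumption that post permalinks can be pulled out of the archive HTML with a regular expression (this is not the module's actual code):

```python
import re
import requests

def archive_post_urls(blog_url):
    """Collect post permalinks from a blog's /archive page.

    Illustrative only: this fetches a single archive page, while the real
    ArchiveSearch.py walks all archive pages (about 50 posts each).
    """
    archive_url = blog_url.rstrip('/') + '/archive'
    html = requests.get(archive_url, timeout=10).text
    # Post permalinks generally look like http(s)://<blog>/post/<id>/<slug>
    return sorted(set(re.findall(r'https?://[^"\']+/post/\d+[^"\']*', html)))
```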

This version only solves finding all the posts on blogs with any kind of stylesheet. A more universal function for crawling the content of posts on personal-theme blogs still needs to be designed.

What's more, this version fixes some exceptions on non-post pages and a small logic problem with input handling. There are some special cases of URL formats, like "https://.*?" or "http://wanimal1983.org/" (WTF? A redirection? It points to http://wanimal1983.tumblr.com).
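
One way to smooth over such special cases is to resolve the input URL before crawling; a small sketch (not the project's actual input handling, and with no error handling):

```python
import requests

def resolve_blog_url(raw_url):
    """Follow redirects so a custom domain (e.g. http://wanimal1983.org/) resolves to
    the underlying *.tumblr.com address when the blog redirects there."""
    if '://' not in raw_url:
        raw_url = 'http://' + raw_url
    resp = requests.get(raw_url, timeout=10, allow_redirects=True)
    return resp.url
```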

Update 5.0

This may be the final version. It fixes the problem of not being able to download content from blogs with special stylesheets, along with all the problems from the last version. It adds detection of whether the input is a homepage or a post page, which means the user can download either a whole blog or a specific post.

The main function works for lots of blogs, including ones with special URLs or themes. Of course, there may still be some unusual blog stylesheets that are incompatible. You are welcome to let me know if you find one. :)

Update 5.5 (stable version)

Fixes the URL decoding problem, so there should be no more 'URL not found' errors for posts that can be viewed in the browser.
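
A fix of this kind typically amounts to percent-decoding the URL before requesting it; a minimal standard-library example (the URL below is hypothetical):

```python
from urllib.parse import unquote

# Hypothetical percent-encoded post URL
encoded = "https://example.tumblr.com/post/123/%E5%9B%BE%E7%89%87"
print(unquote(encoded))   # https://example.tumblr.com/post/123/图片
```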

Update 6.0

Tumblr updated the format of video URLs, so versions before 6.0 may fail to download videos. I modified the regular expression accordingly.
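
The exact pattern depends on Tumblr's current markup; the idea is simply to adjust a regular expression such as the one below (both the sample HTML and the pattern are illustrative assumptions, not the repository's actual regex):

```python
import re

sample_html = '<source src="https://va.media.tumblr.com/tumblr_example_720.mp4" type="video/mp4">'
video_pattern = re.compile(r'src="(https?://[^"]+\.mp4[^"]*)"')
print(video_pattern.findall(sample_html))   # ['https://va.media.tumblr.com/tumblr_example_720.mp4']
```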

Environment

Developed under Python 3.5 with some basic packages, such as requests.
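
A quick way to sanity-check the interpreter and the requests dependency (other basic packages may also be needed):

```python
import sys
import requests  # one of the basic third-party packages mentioned above

assert sys.version_info >= (3, 5), "developed against Python 3.5"
print("requests", requests.__version__)
```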

Run

Run TumblrCrawler.py directly. The input can be a blog's URL, such as http://name.tumblr.com/, or the URL of any single post that you like.
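
A typical invocation might look like this (the exact prompt behavior and the blog name are placeholders):

```
$ python TumblrCrawler.py
http://name.tumblr.com/
```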

Finally, enjoy your exciting downloads! :)

You can support me by scanning the WeChat Wallet QR code.

(WeChat Wallet QR code image)
