All Projects → Charleswyt → crawler

Charleswyt / crawler

Licence: GPL-3.0 license
Crawler with Python 3.

Programming Languages

Jupyter Notebook
11667 projects
python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to crawler

facebook-cleaner
It is almost spring, so time for a pre spring cleaning. This time: taking care of your Facebook. This script can safe you a lot of time if you would try to do that by hand.
Stars: ✭ 52 (+52.94%)
Mutual labels:  facebook, selenium
Spam Bot 3000
Social media research and promotion, semi-autonomous CLI bot
Stars: ✭ 79 (+132.35%)
Mutual labels:  facebook, selenium
Memedensity
CLI tool to let you know amount of memes in facebook feed.
Stars: ✭ 44 (+29.41%)
Mutual labels:  facebook, selenium
facebook-data-extraction
Experiences in extracting data from Facebook with these 3 methods: Facebook Graph API, Automation tools, DevTools Console
Stars: ✭ 81 (+138.24%)
Mutual labels:  facebook, selenium
Facebook-Messenger-Statistics
Discover fun stats about your Facebook Messenger usage such as: most used words, most active year, most active hours, etc. using this Jupyter Notebook
Stars: ✭ 75 (+120.59%)
Mutual labels:  facebook
YoukuDownload
youku 视频下载
Stars: ✭ 12 (-64.71%)
Mutual labels:  selenium
ffeed
A simple page that returns the feed of a specific facebook page and group.
Stars: ✭ 19 (-44.12%)
Mutual labels:  facebook
OAuthLogin.AspNetCore
第三方平台联合登陆(facebook、微信、微博、QQ、Kakao)
Stars: ✭ 43 (+26.47%)
Mutual labels:  facebook
robotframework-preso
Introduction to Robot Framework - TriTAUG Presentation
Stars: ✭ 16 (-52.94%)
Mutual labels:  selenium
facebook-node-sdk
Modeled from the (Facebook Javascript SDK), now with the facebook-node-sdk you can now easily write the same code and share between your server (nodejs) and the client (Facebook Javascript SDK).
Stars: ✭ 519 (+1426.47%)
Mutual labels:  facebook
OpenYspider
千万级图片爬虫、视频爬虫 [开源版本] Image Spider
Stars: ✭ 122 (+258.82%)
Mutual labels:  selenium
selenium-grid-docker-swarm-test
Distribute automated tests with Selenium Grid and Docker Swarm
Stars: ✭ 28 (-17.65%)
Mutual labels:  selenium
sharon
A lightweight and modular social sharing library
Stars: ✭ 16 (-52.94%)
Mutual labels:  facebook
people-also-ask
People also ask Google scraper. Get as many questions as you need to optimize your site for voice or new content ideas or answering questions about your desired topic.
Stars: ✭ 39 (+14.71%)
Mutual labels:  selenium
MyTools
🛠️一些小工具脚本
Stars: ✭ 23 (-32.35%)
Mutual labels:  selenium
Instagram-Giveaways-Winner
Instagram Bot which when given a post url will spam mentions to increase the chances of winning. Win Instagram Giveaways!
Stars: ✭ 95 (+179.41%)
Mutual labels:  selenium
java-opensdk
TestProject Java OpenSDK
Stars: ✭ 45 (+32.35%)
Mutual labels:  selenium
todo-list
A practical web application built with Node.js, Express, and MySQL for you to readily record, view, and manage your tasks with an account: Create, view, edit, delete, filter, and sort expenses are as easy as pie 🥧
Stars: ✭ 18 (-47.06%)
Mutual labels:  facebook
chesf
CHeSF is the Chrome Headless Scraping Framework, a very very alpha code to scrape javascript intensive web pages
Stars: ✭ 18 (-47.06%)
Mutual labels:  selenium
yt-videos-list
Create and **automatically** update a list of all videos on a YouTube channel (in txt/csv/md form) via YouTube bot with end-to-end web scraping - no API tokens required. Multi-threaded support for YouTube videos list updates.
Stars: ✭ 64 (+88.24%)
Mutual labels:  selenium

Facebook Crawler

Facebook crawler with python3.x (the "ladder" need solving by yourself).
The official website of Python is https://www.python.org.

Packages

selenium, requests, BeautifulSoup4 and other built-in packages, such as re, shutil, time, random, tkinter and so on.

You can use pychram, spyder or jupyter notebook to debug the code. Here, jupyter notebook and IPython is recommanded as the tool for step debug.

Install them with "pip" command.

The homepage url of selenium is https://www.seleniumhq.org/download/.

Browser

Chrome | Firefox (In this project, Chrome is applied.)

How to use

  1. Install the packages and browser mentioned above.

  2. Check whether your browser can be opened via the code.

    • If you use Chrome browser

    Copy the Chromedriver into the python path or other path in the enviroment variables. The download link of Chromedriver is http://chromedriver.storage.googleapis.com/index.html.
    Download the suitable version according to the version of Chrome browser.
    The version of Chrome browser can be seen in "帮助 -> 关于Google Chrome", either upgrade the Chrome browser or download corresponding Chromedriver is OK.

    • If you use Firefox browser

    You just control Firefox at an appropriate version in order to avoid some incompatible error.

  3. Run the code

    • For Test: to the path of the scripts and open the command line (Win + R and input "cmd"). Input "jupyter notebook", enter.
    • For Run: Use arbitrary IDE such as PyCharm. Or, use C++ or C# to call the scripts.

Critical Technology

This part is written in Chinese, and can be seen optionally.

Main procedure

1. 模拟登录

模拟登录分为初次登录cookie登录 Facebook的所有动态上传与数据下载均需在登录状态下完成,因此需先实现模拟登录 * 初次登录 在初次登录中,使用selenium寻找Facebook的账号与密码输入框以及确定项,模拟手工操作实现登录。 * cookie登录 cookie主要用于解决Facebook反爬检测。若用户频繁登录,则会被Facebook判定为异常,易被封号,因此在初次登录后保存cookie信息,用于后续的操作。

cookie格式(JSON):
{
	"secure": true,
	"httpOnly": true, 
	"path": "/", 
	"value": "0J0JP757BRZglDevJ.AWU44Pk-lv2MkD1KmUbFSABhp58.Ba10h6.L8.AAA.0.0.Ba10h_.AWVfmtWL", 
	"name": "fr", 
	"domain": ".facebook.com", 
	"expiry": 1531834239.141339
}
其中,expiry为cookie失效日期,格式为Unix时间戳

2. 状态发布

对网页进行解析,找到状态发布(Make Post)对应的元素,使用selenium对该元素进行定位,并在定位成功后实现点击,从而完成状态的发布,发布过程与直接使用浏览器登录Facebook并发布内容相同。

3. URL解析

Facebook的用户链接分为两种:1. 用户名; 2.用户ID
两种形式的用户主页链接分别为:
用户名形式:https://www.facebook.com/qiao.fengchun
用户ID形式:https://www.facebook.com/profile.php?id=100005030479034

  • 用户名

    Facebook的用户名从本质上来讲是唯一的,用户名分为两种,显示用户名实际用户名,如显示用户名为haha时,实际用户名则可能为haha.521,因此直接通过显示用户名对用户进行检索会存在误差,实际用户名需在登录用户主页后才可获得。此外,Facebook会对用户名中的空格直接填充".",如qiao fenchun的实际用户名为qiao.fengchun。

  • 用户ID

    用户ID是唯一的,但是不直接对用户可见,需要对页面进行解析才可获取。

    在用户主页链接的基础上,用户的照片,视频等数据的链接为:
    https://www.facebook.com/qiao.fengchun/friends
    https://www.facebook.com/profile.php?id=100005030479034&sk=photos
    可选关键字:["about", "photos", "friends", "videos", "music", "movies", "books", "tv"]

4. 下拉刷新

Facebook的页面均为Ajax动态加载,使用selenium模拟鼠标行为对页面进行下拉刷新,页面下拉刷新过程中可以预先设定下拉次数,但是会存在页面信息预估不准无效下拉两种异常情况。因此,需要找到页面底端标识

  • 页面信息预估不准:当需要获取的用户量大于页面显示的用户量时会出现该类问题
  • 无效下拉:当所有信息全部显示后仍执行额外的下拉操作
    以上两种异常情况会影响程序的执行效率,应尽量避免。
  • 解决方案
    Facebook有两类页面底端标识,搜索页面的End of Results和用户信息页面的More about you/Username,考虑到不同语言的情况,不使用关键字检索,使用页面元素检索。
    以上两种底端标识对应的元素标识分别为:
Flag Class Name id XPath
End of Results uiHeader - //*[@id="timeline-medley"]/div/div[2]/div[1]
More about you/Username - browse_end_of_results_footer //*[@id="browse_end_of_results_footer"]

5. 原图链接获取

Facebook的图像采用多级缩略图形式,用户主页用户图片主页以及图片预览三种模式下的图片链接不完全相同,且从用户主页和用户图片主页均可进入到图片预览模式,两种模式下的链接不同,但均可实现图片的下载。本项目中使用从图片主页窗口进入的方式。
图片预览状态下获取图片链接易捕获无效链接,预览状态下的图片链接与全屏放大后的链接虽然相同,但是直接在预览状态下下载的图片仍不是原图,因此采用全屏放大后再获取链接的方式。

6. 图片信息获取

我们主要获取图片的链接发布时间对应文字尺寸以及发布位置,对应文字和发布位置有可能为空

7. 图片下载

使用requests和shutil库对图片进行下载

if response.status_code == 200:
	photo_file_name = os.path.join("***.jpg")
	with open(photo_file_name, "wb") as file:
		shutil.copyfileobj(response.raw, file)

8. 检索

Facebook采用的是模糊检索,随机输入一个关键字后会返回零个或多个相近结果,包括用户,照片,主页等, 不同的检索内容对应不同的URL: URL: https://www.facebook.com/search/str/"**keyword**"/**type**

  • keyword为待检索的字符串

  • type为待检索的类型 在各类检索中,还可进一步进行条件筛选,筛选条件在URL中为JSON形式

    用户 (People):keywords_users
    • 筛选类型:

      • CITY

        • Any city
        • Choose a City
      • EDUCTION

        • Any school
        • Choose a School
      • WORK

        • Any company
        • Choose a Company
      • MUTUAL FRIENDS

        • Anyone
        • Friends
        • Friends of Friends
        • Mutual Friends with

      全部选Any时,URL为上述URL,当筛选条件存在时,则为:

      • filters_city={"name":"users_location", "args":"110730292284790"}
      • filters_school={"name":"user_school", "args": "7701216166"}
      • filters_employer={"name":"users_employer**","args":"20528438720"}
      • filters_friends={"name":"users_friends","args":"100024373853102"} or filters_friends={"name":"users_friends_of_friends","args":"100024373853102"} or filters_friends={"name":"users_friends","args":"100024373853102"} args的键值均为ID,使用filters_friends筛选时,ID为用户ID,可通过网页解析获取,其余ID均为先验 各JSON字段之间用"&"连接
    主页 (pages):keywords_pages
    • 筛选类型:

      • VERIFIED

        • Verified
      • CATEGORY

        • Any category
        • Local Business or Place
        • Company, Organization or Institution
        • Brand or Product
        • Artist, Band or Public Figure
        • Entertainment
        • Cause or Community

      全部选Any时,URL为上述URL,当筛选条件存在时,则为:

      • filters_verified={"name":"pages_verified","args":""}

      • filters_category={"name":"pages_category","args":"1006"}

        args:

        • 1006 (Local Business or Place)
        • 1013 (Local Business or Place)
        • 1009 (Brand or Product)
        • 1007,180164648685982 (Artist, Band or Public Figure)
        • 1019 (Entertainment)
        • 2612 (Cause or Community)
    照片 (Photos):photos-keyword
    • 筛选类型:

      • POSTED BY

        • Anyone
        • You
        • Your Friends
        • Your Friends and Groups
        • Choose a Source
      • TAGGED LOCATION

        • Anywhere
        • Choose a Location
      • DATE POSTED

        • Any date
        • 2018
        • 2017
        • 2016
        • Choose a Date

      全部选Any时,URL为上述URL,当筛选条件存在时,则为:

      • filters_rp_author={"name":"author_me","args":""}

        name:

        • author_me (You)
        • author_friends (Your Friends)
        • author_friends_groups (Your Friends and Groups)
        • author (Choose a Source) args有值
      • filters_rp_location={"name":"location","args":"106324046073002"}

      • filters_rp_creation_time={"name":"creation_time","args":"{"start_year":"2018","start_month":"2018-01","end_year":"2018","end_month":"2018-12"}"} aregs

        • 指定年份,如2017,则键值为:

            {
            	"start_year":"2017",
            	"start_month":"2017-01",
            	"end_year":"2017",
            	"end_month":"2017-12"
            }
          
        • 不指定年份,选定的时间需具体到月份,键值为:

            {
            	"start_month":"2017-05",
            	"end_month":"2017-05"
            }
          
    小组 (Group) :keywords_groups
    • 筛选类型

      • SHOW ONLY

        • Any group
        • Public Groups
        • Closed groups
      • MEMBERSHIP

        • Any group
        • Friends' groups
        • My groups

      全部选Any时,URL为上述URL,当筛选条件存在时,则为:

      • filters_groups_show_only={"name":"public_groups","args":""}

        name:

        • public_groups (Public Groups)
        • closed_groups (Closed groups)
      • filters_groups_memebership={"name":"friends_groups","args":""}

        name:

        • friends_groups (Friends' groups)
        • my_groups (My groups)
    视频 (Videos):keywords_videos

    无筛选类型

    事件 (Events):keywords_events
    • 筛选类型:

      • SHOW ONLY

        • Popular with Friends
      • LOCATION

        • Anywhere
        • Choose a City
        • My groups
      • DATE

        • Any date
        • Today
        • Tomorrow
        • This week
        • This weekend
        • Next week

      全部选Any时,URL为上述URL,当筛选条件存在时,则为:

      • filters_rp_events_popular_with_friends={"name":"filter_events_popular_with_friends","args":""}

      • filters_rp_events_location={"name":"filter_events_location","args":"106283509403187"}

      • filters_rp_events_date={"name":"filter_events_date","args":"2018-04-26"}

        args:

        • 2018-04-26 (Today)
        • 2018-04-27 (Tomorrow)
        • 2018-04-23~2018-04-29 (This week)
        • 2018-04-28~2018-04-29 (This weekend)
        • 2018-04-30~2018-05-06 (Next week)
    链接 (Links) : links-keyword/articles-links

    无筛选类型

    应用 (App) : keywords_apps

    无筛选类型

9. 页面分析

页面分析使用seleniumBeautifulSoup4库 selenium用于动态解析,如页面加载等待,查找输入框等 BeautifulSoup4用于静态解析

10. 页面返回信息为空

在用户检索和查看用户媒体内容时,会出现内容为空的情况,Facebook会给用户以相应的提示,通过该类标识元素即可对页面内容进行判断。

  • 用户搜索返回为空 通过查找id=empty_result_error的元素即可完成判断
  • 用户媒体内容返回为空 通过查找No photos to show字符串进行判断

Flag Description

Variable Type Value Function
browse_results_container_id ID BrowseResultsContainer 用于在用户检索时获取用户信息块的类名
clearfix_flag Class Name clearfix 用于在用户检索时获取用户
post_class_name Class Name _3jk (may be changed regularly) 用于实现用户动态发布
user_cover_class_name Class Name cover 用户封面类名,用于从用户主页获取用户ID
bottom_class_name Class Name uiHeaderTitle 用户照片,视频等媒体内容页面的底线,用于下拉刷新
browse_end_of_results_footer ID browse_end_of_results_footer 用户搜索页面的底线,用于下拉刷新
full_screen_id ID fbPhotoSnowliftFullScreenSwitch 用于图片全屏点击,方便获取图片URL
homeSideNav Class Name homeSideNav facebook用户登录页导航栏,用于判断是否登录成功
friends_list_class_name Class Name uiProfileBlockContent 用于获取好友信息
friends_number_id_name ID pagelet_timeline_medley_friends 用于获取当前用户的好友数量

Screenshot

Some Screenshots are shown below, the order is the same as the table in Flag Description.

用户信息获取

由于存储用户信息的类名**_4p2o**有可能定期变化,ID更改的几率较低,因此为了降低后期维护成本,通过改ID获取到类名后再进一步获取用户信息。

好友信息获取

图片信息获取

图片链接与尺寸

图片发布位置

图片发布时间

图片发布对应文字

Usage

A run demo can be saw in demo.ipynb file, the usage of jupyter can be seen in the "How to use"

user_name = "your user name"
password = "your password"
fb = Facebook(user_name, password)
fb.sign_in()	# 账户登录
fb.make_post()	# 状态发布
fb.driver.quit()# 关闭浏览器 (选择使用)

user_name = "qiaofengchun"	# 待检索用户昵称
user_number = 3			# 待检索用户数量
user_info_list = fb.search_users(user_name=user_name, user_number=user_number) 	# 获取用户信息

# 打印检索得到的用户信息
index = 1
for user_info in user_info_list:
	print("No.", index, "user info:", user_info)
	index += 1

# 图像的批量下载
fb.download_photos_batch(user_info_list, "photos")

API Documention

class Facebook

__init__(self, _email=None, _password=None, _browser_type="Chrome", _is_headless=False)
function: constructor function
parameters:
	_email: user email used for Facebook login
	_password: password used for Facebook login
	_browser_type: browser type (Chrome | Firefox)
	_is_headless: whether use headless browser or not (True | False)
return:
	browser_state:
		0 - init fail
        1 - init success

def get(self, url)
function: web pages hop, avoid repeated hop. If the current url is the same as the url, no operation.
parameters:
	url: the forthcoming url
return:
	 

login_with_account(self)
function: facebook login via email and password
parameters:
return:

make_post(self)
function: post photos, videos, text or others to Facebook
parameters:
return:

page_refresh(self, _refresh_times=0)
function: drop-down refresh of the page
parameters:
	_refresh_times: the times of refreshing
return:

enter_homepage_self(self)
function: enter the homepage of the current user
parameter:
return:

get_user_id(self, _user_homepage_url)
function: get the user id via homepage url
parameter:
	_user_homepage_url: the url of user's homepage
return:
	user id

get_friends_list(self, _friends_number=None)
function: get friends information
parameters:
	_friends_number: the number of friends to be searched, if None, return all friends information
return:
	self.user_info_friends: a built-in list 
			Formtat: [user_name, user_id, homepage_url]


search_users(self, _keyword, user_number)
function: get user's information according to key words
parameters:
	_keyword: user name to be searched
	user_number: the number of uesrs to be searched
return:
	self.user_info_search: a built-in list
			Formtat: [user_name, user_id, location, homepage_url]

get_photos_list(self, _homepage_url)
function: get all photos href via the given homepage url
parameters:
	_homepage_url: the given homepage url
return:
	photos_href_list: a photos href list

get_photo_info(self, _photo_href)
function: get photo info including url, data, location, text and size
parameters:
	_photo_href: a photo href
return: all information about the photo1
			Format: [link, date, location, text, width, height]

download_photos_one(self, _homepage_url, satrt_date=None, end_date=None,  _folder_name="./")
function: downloada photos of one user according to the homepage url and date, if the start_date, end_start and other which can be added further are None, download all photos of this user	
parameters:
	_homepage_url: the homepage url
	start_date: the start of the publish time
	end_date: the end of the publish time
	_folder_name: the floder name of the downloading photo, default is "./"
return:

download_photos_batch(self, _homepage_url_list, satrt_date=None, end_date=None,  _folder_name="./")
function: downloada photos of  user according to the homepage url and date, if the start_date, end_start and other which can be added further are None, download all photos of this user	
parameters:
	_homepage_url: the homepage url
	start_date: the start of the publish time
	end_date: the end of the publish time
	_folder_name: the floder name of the downloading photo, default is "./"
return:


utils.py

get_packages()
function: get all installed packages in the current machine
parameters:
return:
	_packages: all installed packages in the current machine

package_check(packages=None)
function: Check whether the current packages are installed or not
parameters:
	packages: the packages which need checking
return:

folder_make(_folder_name="./")
function: the encapsulation of function mkdir and exist
parameters:
	_folder_name: the folder name which need creating
return:


get_time(_unix_time_stamp)
function: exchange the expression of the from Unix timestamp to Beijing time
parameters:
	_unix_time_stamp: Unix timestamp
return:
	Beijing time in the format of "%Y-%m-%d %H:%M:%S"

get_unix_stamp(_time_string)
function: exchange the expression of the from  Beijing time to Unix timestamp
parameters:
	_time_string: Beijing time in the format of "%Y-%m-%d %H:%M:%S" 
return:
	Unix timestamp

get_size(_style_string)
function: get the width and height of the photo according to the style string extracted from the html page source
parameters:
	_style_string: the style string extracted from the html page source
return:
	_width, _height: the width and height of the photo	

url_type_judge(_url)
function: url type judgement
there are two types of url:
	1. https://www.facebook.com/erlyn.jumawan.7
    2. https://www.facebook.com/profile.php?id=100025029671192
the process mode of these two urls is different, if we want to enter the photos or friends page of the user, the urls are:
	1. https://www.facebook.com/erlyn.jumawan.7/photos
	   https://www.facebook.com/erlyn.jumawan.7/friends

	2. https://www.facebook.com/profile.php?id=100025029671192&sk=photos
	   https://www.facebook.com/profile.php?id=100025029671192&sk=friends
Thus, we need to get the url type first, so it is convenient for us to get the data of the user
parameters:
	_url: facebook url
return:
	_url_type: 1 | 2

get_jump_url(_main_page_url, _key)
function: get the jump url according to the key words such as photos, friends, videos, music, books and so on
	_main_page_url: the base url
	_key: the keyword
return:
	_url: the jump url
e.g.
	_main_page_url: https://www.facebook.com/erlyn.jumawan.7
	_key: friends
	output: https://www.facebook.com/erlyn.jumawan.7/friends

	_main_page_url: https://www.facebook.com/profile.php?id=100025029671192
	_key: photos
	output: https://www.facebook.com/profile.php?id=100025029671192&sk=photos

download_photos(_link, _folder_name="./", _name=None)
function: download the photos
parameters:
	_link: the photo href
	_folder_name: the folder name of downloading photo
	_name: the name of the downloading photo
return:
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].