ruichongliu / Crawler_pubg.op.gg

Licence: Unlicense license
This is a web crawler for pubg.op.gg, written by Ruichong Liu. It scrapes PUBG (PlayerUnknown's Battlegrounds) game data.

Web Crawler for pubg.op.gg

In this file, I will walk you through the code I wrote for this web crawler. I did not document much inside each source file, because most of the code is tedious but straightforward if you have done any scraping. Instead, I will introduce the idea of this project and clarify some of the code here. If the whole thing still doesn't make sense after reading this file, don't hesitate to leave a comment below or somewhere else in this repo.

Before the detailed introduction, check that you have Python 3.6 installed on your machine, because that is the version I used. If you do, move on to installing the dependencies.

pip install bs4
pip install lxml
pip install requests
pip install selenium

Composition

This entire project consists of four separate files, three of which are functional files and the remaining one is a simple wrapper.

main.py

This is a simple wrapper. Just remember to replace zsda123 (in line 13) with a real PUBG user ID.

finder.py

This file plays an important role in the project, because the userID of each player on pubg.op.gg is a seemingly random hexadecimal number. However, I still managed to find a pattern in the previous version of finder.py:

  • UserID has a cluster-like distribution. In each cluster, there are 40 to 80 users.

I had my server run for a whole day, and it could not capture any cluster besides the starting one. Therefore, I decided to abandon this brute-force approach and go with a new one:

Find the other players in the same games as user0, in this case zsda123. If you want to capture a larger sample, change line 73 to the following.

q = nameToId(userMap(userMap(q)))

By calling userMap twice, I was able to capture 4970 distinct users, and it took me an hour. The size of the data grows exponentially! Suppose there are 80 players in each game and you decide to call userMap 3 times. You should expect your server to do work proportional to 80 × 80 × 80 = 512,000 users, which could eat up all of the memory of your server/PC. Be aware of that!
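To get a feel for this growth before picking a depth, the bound can be computed ahead of time. The sketch below is not taken from finder.py: estimate_users and expand are hypothetical helpers, and neighbors stands in for whatever returns the other players found in a user's games (the role userMap plays above). It also shows why the real count stays below the bound, since users who were already seen are not added again.

```python
def estimate_users(players_per_game, depth):
    """Upper bound on users reached after `depth` hops, ignoring overlap."""
    return players_per_game ** depth


def expand(seed, neighbors, depth):
    """Collect distinct users reachable from `seed` in at most `depth` hops.

    `neighbors` is a hypothetical callable that maps a username to an
    iterable of usernames seen in the same games.
    """
    seen = {seed}
    frontier = {seed}
    for _ in range(depth):
        next_frontier = set()
        for user in frontier:
            for other in neighbors(user):
                if other not in seen:  # skip users we already captured
                    seen.add(other)
                    next_frontier.add(other)
        frontier = next_frontier
    return seen


if __name__ == "__main__":
    print(estimate_users(80, 3))  # 512000
```

Keeping only the set of names already seen means memory grows with the number of distinct users rather than with the raw bound.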

A few more notes on the code. In Line 19 and Line 42, the query variable server is set to as. You can change it to another server if you like. Since the company I work for focuses on Greater China, most of my server settings are Asia, Southeast Asia, or Japan/Korea.

In Line 28, remember to change that value to a real op.gg player ID. I have this line because a player named #unknown sometimes appears, due to their rubbish servers. Here is how to find a player ID.

  • Go to pubg.op.gg
  • Search for a real PUBG player username
  • Wait for the whole webpage to be fully loaded
  • Inspect the webpage and go to the Network section
  • Scroll all the way to the bottom and click on More
  • You should see a request that begins with recent?
  • Check the Request URL; the player ID is the string after https://pubg.op.gg/api/users/
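If you copy the Request URL out of the Network panel, the last step can be done in code. This is a minimal sketch, not part of the project: player_id_from_request_url is a hypothetical helper, and the /matches/recent?server=as tail in the example URL is made up for illustration. Only the https://pubg.op.gg/api/users/ prefix comes from the steps above.

```python
from urllib.parse import urlparse

API_PREFIX = "/api/users/"


def player_id_from_request_url(url):
    """Return the path segment that follows /api/users/ in `url`, or None."""
    path = urlparse(url).path
    if not path.startswith(API_PREFIX):
        return None
    rest = path[len(API_PREFIX):]
    # The ID is everything up to the next slash (or the end of the path).
    player_id = rest.split("/", 1)[0]
    return player_id or None


if __name__ == "__main__":
    # Illustrative URL; the part after the ID is an assumption.
    example = ("https://pubg.op.gg/api/users/"
               "5a0c5e93f0eb7800013cd191/matches/recent?server=as")
    print(player_id_from_request_url(example))
```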

In Line 40, don't forget to change executable_path to the path of your driver. Since El Capitan, Mac users can no longer write to /usr/bin. You can put the driver under /usr/local/bin and set executable_path as I did. You also need the matching browser installed on your machine: if you use geckodriver without Firefox installed, you will receive a bunch of warnings! If you do use Firefox, make the change as follows.

driver = webdriver.Firefox(executable_path = "/usr/local/bin/geckodriver")

In Lines 44, 45 and 46, you will see a big chunk of code. Don't worry about it, I did not type it up. I would be mad if I had. It was generated by Xpath Finder, a plugin for Chrome. Line 44 selects the games ranked outside the Top 10, Line 45 the Top 10 games, and Line 46 the Chicken Dinners. Clicking those buttons loads the game details into the webpage. Since the loading is asynchronous, I make the machine wait for a while (2 seconds proved unstable in testing, so I go with 3 seconds).
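A fixed sleep is the simplest option, but it either wastes time or is too short on a slow connection. One alternative, which is not what finder.py does, is to poll for a condition with a timeout. The helper below is a generic sketch with hypothetical names (wait_until, condition); Selenium also ships its own WebDriverWait class for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition` until it returns a truthy value or `timeout` elapses.

    Returns the truthy value, or raises TimeoutError if time runs out.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "condition not met within %.1f seconds" % timeout)
        time.sleep(interval)
```

With a browser driver, condition could be a small function that checks whether the game details have appeared in the page yet.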

The rest of the file is basically some beautiful soups, and I believe no further explanation is needed for that part.

scraper.py

I want to explain why I have a try statement in Line 46. Some users just don't play solo games, so there is no data to get for them. You could also change the parameters passed into query() to extract something different. Besides these, we have another bunch of beautiful soups.

reader.py

The data I collect is quite primitive, so I also have a simple reader to summarize my data.

[userId, x['participant']['user']['nickname'], x['season'], x['server'], 
x['queue_size'], x['mode'], x['participant']['stats']['combat']['kda']['kills']]

If you need something more complicated than what I did in this project, I will talk you through it here. Recall the Request URL from the finder section. If you visit a Request URL, you will see a JSON object like this.

{"params":{"server":"na","season":null,"queue_size":0,"mode":"tpp"},
"matches":{"summary":{"matches_cnt":20,"win_matches_cnt":1,"topten_matches_cnt":7,"ranks_avg":16.95,
"ranks_list":[6,24,18,43,42,14,11,16,3,1,29,7,19,18,2,12,9,2,16,47],
"kills_avg":2,"deaths_avg":0.95,"kills_max":6,"damage_avg":230.261890915,
"time_survived_avg":964.2993499999999,"modes":{"2":{"matches_cnt":5,"win_matches_cnt":0,"topten_matches_cnt":1,
"rating_delta_sum":4.185349880000004},"4":{"matches_cnt":13,"win_matches_cnt":0,"topten_matches_cnt":5,
"rating_delta_sum":10.264028235999994},"1":{"matches_cnt":2,"win_matches_cnt":1,"topten_matches_cnt":1,
"rating_delta_sum":151.97761028000002}}},"items":[{"season":"2017-pre6","server":"na","queue_size":2,"mode":"tpp",
"started_at":"2017-12-06T03:27:29+0000","total_rank":42,"offset":101,
"match_id":"2U4GBNA0YmnSRjFPiSEp6LaN-bpuG8kRbg6Rdt5PZpPKmHyludByUMHwbLTOzeEO",
"participant":{"_id":"5a276bd059e73b0001e5b828","user":{"nickname":"LexWynnZzWw",
"profile_url":"https:\/\/pubg.op.gg\/user\/LexWynnZzWw?server=na"},"stats":{"rank":6,
"rating_delta":40.009836480000004,"combat":{"time_survived":1802.317,"vehicle_destroys":0,
"win_place":6,"kill_place":4,"heals":5,"weapon_acquired":9,"boosts":4,"death_type":"byplayer",
"most_damage":0,"kda":{"kills":4,"assists":2,"kill_steaks":1,"road_kills":0,"team_kills":0,"headshot_kills":2,
"longest_kill":49.3916779},"distance_traveled":{"walk_distance":2522.22559,"ride_distance":3938.64038},
"damage":{"damage_dealt":482.832336},"dbno":{"knock_downs":2,"revives":0}}}},
"team":{"_id":24,"stats":{"rank":6},"participants":[]}}
...
]}}

You can parse the object and get something interesting to you from there.
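As a starting point, here is a minimal sketch that walks matches.items and pulls out the same seven fields listed in the reader.py section. It is not the project's own code: rows_from_payload is a hypothetical name, SAMPLE is a cut-down copy of the object above, and the player ID is passed in by the caller because the response does not need to contain it.

```python
import json

# Cut-down version of the response shown above, keeping only the used keys.
SAMPLE = """
{"matches": {"items": [
  {"season": "2017-pre6", "server": "na", "queue_size": 2, "mode": "tpp",
   "participant": {"user": {"nickname": "LexWynnZzWw"},
                   "stats": {"combat": {"kda": {"kills": 4}}}}}
]}}
"""


def rows_from_payload(user_id, payload):
    """Build one [id, nickname, season, server, queue_size, mode, kills]
    row per match in the parsed response `payload`."""
    rows = []
    for x in payload["matches"]["items"]:
        rows.append([
            user_id,
            x["participant"]["user"]["nickname"],
            x["season"],
            x["server"],
            x["queue_size"],
            x["mode"],
            x["participant"]["stats"]["combat"]["kda"]["kills"],
        ])
    return rows


if __name__ == "__main__":
    for row in rows_from_payload("<player id>", json.loads(SAMPLE)):
        print(row)
```

Other statistics, such as damage_dealt or walk_distance, sit under the same combat object and can be added to the row in the same way.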

userIdList.txt

The following files might look messy on Windows machines. On a *nix machine, you should see something like this in this file.

5a3befa88676120001104e8d
5a307bafc284c1000169e7db
59feb54368c1ea00019c056b
5a0c5e93f0eb7800013cd191
...

log.txt

Do you still remember Player #unknown? You can always find something interesting in the log file. The last Finder message I show below happens to be for #unknown. The overall log should look like this.

Master: START--START--START--START--START
Finder: Starting with User YechenDetoxic...
Finder: Collecting Friends of User YechenDetoxic...
Finder: Collecting Friends of User ashingboomORZ...
Finder: Collecting Friends of User Thinktomuch...
Finder: Collecting Friends of User miaomiao-3-...
Finder: Collecting Friends of User dujun211...
Finder: Collecting Friends of User SUSHAOLEI...
...
Finder: Translating User kanchao_ge...
Finder: Translating User QingFeng141...
Finder: Translating User Clearloveccp...
Finder: Translating User 980010...
Finder: Translating User with164...
Finder: Translating User E-RomanA...
Finder: Translating User #unknown...
...
Finder: DONE!!
Time Used: 3866 seconds
User Captured: 4941
Scraper: Scraper Starts
Scraper: Working on User 5a0c5e93f0eb7800013cd191
Scraper: Working on User 5a0bed2905279f00011d10f5
Scraper: Working on User 5a2e49d4e358310001185431
Scraper: Working on User 59fd962cab1fff00019e0759
Scraper: Working on User 59fdb0a699392b0001608809
Scraper: Working on User 59fd958031e4c1000157b475
Scraper: Working on User 59fe352cb503ad0001f16526
...
Scraper: DONE!!
Time Used: 15435 seconds
Reader: Reader Starts
Reader: DONE!!
Time Used: 0 seconds
Master: DONE--DONE--DONE--DONE--DONE

data.csv

You should see the data collected in this format:

Player ID                 Username  Season     Server  Queue_Size  Mode  Kills
59fd96dddfa2830001fb24aa  Kev666--  2018-01    sea     1           tpp   9
59fd96dddfa2830001fb24aa  Kev666--  2018-01    sea     1           tpp   0
59fd96dddfa2830001fb24aa  Kev666--  2018-01    sea     1           tpp   0
59fd96dddfa2830001fb24aa  Kev666--  2018-01    sea     1           tpp   0
59fd96dddfa2830001fb24aa  Kev666--  2017-pre6  sea     1           tpp   1
59fd96dddfa2830001fb24aa  Kev666--  2017-pre6  sea     1           tpp   3
59fd96dddfa2830001fb24aa  Kev666--  2017-pre5  sea     1           tpp   0
...

result.csv

You should see a summary like this:

kills   Frequency   Relative Frequency
0	158768	0.574224839
1	62972	0.227754249
2	27958	0.101117215
3	13143	0.047535001
5	3330	0.012043792
6	1776	0.006423356
4	6399	0.02314361
8	515	0.001862628
7	945	0.003417833
10	175	0.000632932
12	45	0.000162754
9	285	0.001030775
11	91	0.000329125
14	26	9.40E-05
26	1	3.62E-06
15	9	3.26E-05
13	35	0.000126586
16	4	1.45E-05
18	3	1.09E-05
17	7	2.53E-05
29	1	3.62E-06
20	1	3.62E-06
19	2	7.23E-06
...
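A table of this shape can be produced from the Kills column of data.csv with a simple counter. The sketch below only illustrates the calculation and is an assumption about what the reader does, not a copy of reader.py; kill_frequencies is a hypothetical name, and it orders rows by frequency, which is not the order shown above.

```python
from collections import Counter


def kill_frequencies(kills):
    """Return (kills, frequency, relative_frequency) tuples,
    most common kill count first."""
    counts = Counter(kills)
    total = sum(counts.values())
    return [(k, n, n / total) for k, n in counts.most_common()]


if __name__ == "__main__":
    # Kills column of the data.csv rows shown earlier.
    for row in kill_frequencies([9, 0, 0, 0, 1, 3, 0]):
        print(row)
```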

Acknowledgement

I thank my co-workers here at Bullup Inc. for their generous help. I thank op.gg for not banning my IP because, as you can see, I did not set up a proxy. All credit goes to op.gg, because I am using their backdoor APIs and database, and I feel obligated to say so.

Footnote

I know I said a lot. Today is the last business day of year 2017, and I finished this project on PUBG, my favorite game so far. To give something back to the world, I decided to make this repo public and write nice documentation for it XD.

On AWS, it took me 3866 seconds to run finder with two degrees of userMap, 15435 seconds to run scraper, 0 seconds to run reader.

</2017>
<2018>