chenjiandongx / Stackoverflow Spider
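Crawling 1 million Stack Overflow Q&As
As a college student who loves programming, how could I not know about Stack Overflow-driven programming?
Open the Stack Overflow home page, go to the questions page and sort by votes, then crawl the first 20,000 pages with the page size set to 50, for 1 million questions in total. (I originally wanted to crawl all 13 million, but past the first 1 million the questions mostly have only 1 or 0 answers, so the first 1 million will do.)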
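As a rough illustration of the paging scheme described above, the following sketch only builds the listing URLs and makes no network requests. The query parameter names (`sort`, `page`, `pagesize`) and the helper `page_urls` are assumptions for illustration, not taken from the crawler's source.

```python
from urllib.parse import urlencode

# Assumed listing endpoint; the parameter names below are also assumptions.
BASE_URL = "https://stackoverflow.com/questions"


def page_urls(pages=20000, page_size=50):
    """Yield one listing URL per page, sorted by votes (hypothetical helper)."""
    for page in range(1, pages + 1):
        query = urlencode({"sort": "votes", "page": page, "pagesize": page_size})
        yield f"{BASE_URL}?{query}"


# Build only the first three URLs as a demonstration.
urls = list(page_urls(pages=3))
for url in urls:
    print(url)
```

With the defaults, 20,000 pages of 50 questions each give the 1 million questions mentioned above.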
After deduplication in the database, only 999,654 Q&A records actually remain.
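The text does not say which database or schema was used. As a minimal sketch of database-side deduplication, assuming SQLite and a hypothetical `questions` table keyed by question id, duplicates can be dropped at insert time:

```python
import sqlite3


def save_questions(conn, rows):
    """Insert rows, silently skipping ids that are already stored.

    The table layout is an assumption made for this example.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS questions ("
        "id INTEGER PRIMARY KEY, title TEXT, "
        "votes INTEGER, answers INTEGER, views INTEGER)"
    )
    # The PRIMARY KEY constraint plus OR IGNORE discards repeated ids.
    conn.executemany(
        "INSERT OR IGNORE INTO questions VALUES (?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM questions").fetchone()[0]


# Made-up rows: the third one repeats the id of the first.
rows = [(1, "a", 10, 2, 100), (2, "b", 5, 1, 50), (1, "a", 10, 2, 100)]
conn = sqlite3.connect(":memory:")
total = save_questions(conn, rows)
print(total)
```

A question can show up on two listing pages when the ranking shifts during a long crawl, which is one plausible reason the stored total falls slightly short of 1 million.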
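A simple analysis of the crawled data
votes analysis
Vote counts sorted in descending order and plotted as a line chart
After the first 2k questions the vote count is already mostly below 400, and from there on the curve hugs the ground
Most votes: Why is it faster to process a sorted array than an unsorted array?
Continuous distribution of vote counts
The bulk is still concentrated in the 1-2k range, and from 6k onward there is essentially a gap
Detailed data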
description | count |
---|---|
votes >= 500 | 1630 |
votes >= 400 | 2325 |
votes >= 300 | 3782 |
votes >= 200 | 7062 |
votes >= 100 | 19781 |
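The cumulative counts in the table above could be computed along these lines. This is a sketch with a made-up sample; `count_at_least` is a hypothetical helper, not part of the original project.

```python
def count_at_least(values, thresholds):
    """Return {threshold: number of values >= threshold}."""
    return {t: sum(1 for v in values if v >= t) for t in thresholds}


# Made-up vote counts, only to show the shape of the result.
sample_votes = [620, 450, 310, 250, 120, 90, 15, 3]
counts = count_at_least(sample_votes, [500, 400, 300, 200, 100])
for threshold, count in counts.items():
    print(f"votes >= {threshold} | {count}")
```

The same helper applies unchanged to the answers and views columns.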
Taking 100 votes as the dividing line gives the following pie chart
Now look at the data at the low end
description | count |
---|---|
1 <= votes <= 5 | 211804 |
6 <= votes <= 10 | 430935 |
11 <= votes <= 15 | 136647 |
16 <= votes <= 20 | 64541 |
votes <= 20 | 843927 |
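So questions with 20 votes or fewer number about 840,000 (843,927)
Here are the overall proportions
answers analysis
Answer counts sorted in descending order and plotted as a line chart
Clearly, after the first 3k questions the answer count is mostly below 20
Most answers: What is the best comment in source code you have ever encountered? [closed]
Continuous distribution of answer counts
Detailed data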
description | count |
---|---|
answers >= 5 | 218059 |
answers >= 10 | 34500 |
answers >= 20 | 3808 |
answers >= 30 | 968 |
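views analysis
View counts sorted in descending order and plotted as a line chart
The maximum reaches 4.5 million; beyond the first 100,000 questions, the view count is mostly below 28,000
Most views: How to undo last commit(s) in Git?
Continuous distribution of view counts
Detailed data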
description | count |
---|---|
views >= 5000 | 486466 |
views >= 10000 | 315576 |
views >= 20000 | 171873 |
views >= 50000 | 59363 |
views >= 100000 | 22224 |
views >= 200000 | 7030 |
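Most Q&As still have fewer than 20,000 views
It is still worth looking at the overall distribution
Next, scatter plots of votes, views, and answers against each other
votes - views
votes - answers
views - answers
Overall, the relationship among the three resembles a pyramid. In all three plots the lower-left region near the origin is essentially filled in, meaning the vast majority of questions sit in the bottom tier for votes, answers, and views. High-quality, active questions sit at the top of the pyramid. The maxima of the three metrics do not seem to correspond in any obvious way, and the top question for each metric is a different one.
The top 200 keywords by total count were extracted from the tags of all questions (roughly the first 50 are shown below); c# takes 1st place and python ranks 5th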
('c#', 94614),
('java', 93244),
('javascript', 76722),
('android', 69321),
('python', 62502),
('c++', 58173),
('php', 42596),
('ios', 37773),
('jquery', 37405),
('.net', 36180),
('html', 28536),
('css', 26174),
('c', 24699),
('objective-c', 23253),
('iphone', 22171),
('ruby-on-rails', 20143),
('sql', 19171),
('asp.net', 18060),
('mysql', 17559),
('ruby', 16397),
('r', 15670),
('git', 13139),
('linux', 13080),
('asp.net-mvc', 12857),
('angularjs', 12606),
('sql-server', 12473),
('node.js', 12212),
('django', 11576),
('arrays', 11006),
('algorithm', 10959),
('wpf', 10631),
('performance', 10619),
('xcode', 10613),
('string', 10426),
('windows', 10132),
('eclipse', 10117),
('scala', 9942),
('regex', 9685),
('multithreading', 9601),
('json', 9266),
('swift', 8950),
('c++11', 8939),
('haskell', 8823),
('osx', 8159),
('visual-studio', 8140),
('html5', 7627),
('database', 7567),
('xml', 7478),
('spring', 7464),
('unit-testing', 7253),
('bash', 6825)
This list is not very intuitive to read, so a word cloud was generated from the tag frequencies
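A tag ranking in the `(tag, count)` form shown above can be produced with `collections.Counter`. The sketch below assumes each question's tags are available as a list of strings; `top_tags` and the sample data are hypothetical.

```python
from collections import Counter


def top_tags(tag_lists, n=200):
    """Count tags across all questions and return the n most common."""
    counter = Counter()
    for tags in tag_lists:
        counter.update(tags)
    return counter.most_common(n)


# Made-up tag lists, one per question.
sample = [["python", "list"], ["c#", ".net"], ["python", "django"], ["c#"], ["python"]]
ranking = top_tags(sample, n=2)
print(ranking)
```

The resulting pairs can also serve as the frequency input for a word cloud library.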
Since the crawler is written in Python, the focus here is on Python questions
Top 10 by votes
- 6162 : What does the “yield” keyword do in Python?
- 3529 : What is a metaclass in Python?
- 3098 : How do I check whether a file exists using Python?
- 3035 : Does Python have a ternary conditional operator?
- 2620 : Calling an external command in Python
- 2605 : What does if __name__ == "__main__": do?
- 2194 : How to merge two Python dictionaries in a single expression?
- 2123 : Sort a Python dictionary by value
- 2058 : How to make a chain of function decorators?
- 1984 : How to check if a directory exists and create it if necessary?
Top 10 by answers
- 191 : Hidden features of Python [closed]
- 87 : Best ways to teach a beginner to program? [closed]
- 55 : Favorite Django Tips & Features?
- 50 : How do you split a list into evenly sized chunks?
- 44 : Calling an external command in Python
- 43 : How can I represent an 'Enum' in Python?
- 38 : How to merge two Python dictionaries in a single expression?
- 38 : Finding local IP addresses using Python's stdlib
- 37 : Reverse a string in python without using reversed or [::-1]
- 37 : How do I check whether a file exists using Python?
Top 10 by views
- 2121621 : Parse String to Float or Int
- 1905938 : Using global variables in a function other than the one that created them
- 1888666 : How do I check whether a file exists using Python?
- 1827126 : Calling an external command in Python
- 1699574 : Converting integer to string in Python?
- 1686230 : How do I read a file line-by-line into a list?
- 1682307 : Iterating over dictionaries using 'for' loops in Python
- 1569205 : How to get the size of a list
- 1554755 : How do I install pip on Windows?
- 1515505 : Finding the index of an item given a list containing it in Python
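The three top-10 lists above amount to filtering questions by tag and taking the largest values of one metric. A minimal sketch, assuming each question is a dict with `tags`, `votes`, `answers`, and `views` keys (a made-up record layout):

```python
import heapq


def top_n(questions, tag, key, n=10):
    """Return the n questions carrying `tag` with the largest `key` value."""
    tagged = (q for q in questions if tag in q["tags"])
    return heapq.nlargest(n, tagged, key=lambda q: q[key])


# Made-up records, only to show the call.
sample = [
    {"title": "A", "tags": ["python"], "votes": 5, "answers": 2, "views": 300},
    {"title": "B", "tags": ["java"], "votes": 50, "answers": 9, "views": 900},
    {"title": "C", "tags": ["python", "list"], "votes": 8, "answers": 1, "views": 100},
]
best = top_n(sample, "python", "votes", n=2)
for q in best:
    print(f"- {q['votes']} : {q['title']}")
```

`heapq.nlargest` avoids sorting the full data set, which helps when only 10 rows out of roughly a million are needed.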