gaalcaras / Mailinglistscraper

Licence: gpl-3.0
A python web scraper for public email lists.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Mailinglistscraper

Django Dynamic Scraper
Creating Scrapy scrapers via the Django admin interface
Stars: ✭ 1,024 (+5289.47%)
Mutual labels:  spider, scraper, scrapy, webscraping
Goribot
[Crawler/Scraper for Golang] 🕷 A lightweight, distributed-friendly Golang crawler framework.
Stars: ✭ 190 (+900%)
Mutual labels:  spider, scraper, scrapy
robotstxt
robots.txt file parsing and checking for R
Stars: ✭ 65 (+242.11%)
Mutual labels:  scraper, spider, webscraping
OpenScraper
An open source webapp for scraping: towards a public service for webscraping
Stars: ✭ 80 (+321.05%)
Mutual labels:  scraper, spider, scrapy
Fbcrawl
A Facebook crawler
Stars: ✭ 536 (+2721.05%)
Mutual labels:  spider, scraper, scrapy
scrapy facebooker
Collection of scrapy spiders which can scrape posts, images, and so on from public Facebook Pages.
Stars: ✭ 22 (+15.79%)
Mutual labels:  scraper, spider, scrapy
Xcrawler
A fast, concise, and powerful PHP crawler framework.
Stars: ✭ 344 (+1710.53%)
Mutual labels:  spider, scraper
Freshonions Torscraper
Fresh Onions is an open source TOR spider / hidden service onion crawler hosted at zlal32teyptf4tvi.onion
Stars: ✭ 348 (+1731.58%)
Mutual labels:  spider, scraper
Gosint
OSINT Swiss Army Knife
Stars: ✭ 401 (+2010.53%)
Mutual labels:  spider, scraper
Funpyspidersearchengine
Personalized search powered by Word2vec + Scrapy 2.3.0 (data crawling) + ElasticSearch 7.9.1 (data storage with an external RESTful API) + Django 3.1.1 (search)
Stars: ✭ 782 (+4015.79%)
Mutual labels:  spider, scrapy
Linkedin
Linkedin Scraper using Selenium Web Driver, Chromium headless, Docker and Scrapy
Stars: ✭ 309 (+1526.32%)
Mutual labels:  scraper, scrapy
Crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
Stars: ✭ 440 (+2215.79%)
Mutual labels:  spider, scraper
Seeker
Seeker - another job board aggregator.
Stars: ✭ 16 (-15.79%)
Mutual labels:  spider, scrapy
Xidel
Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
Stars: ✭ 335 (+1663.16%)
Mutual labels:  scraper, webscraping
Autoscraper
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
Stars: ✭ 4,077 (+21357.89%)
Mutual labels:  scraper, webscraping
Advanced Web Scraping Tutorial
The Zipru scraper developed in the Advanced Web Scraping Tutorial.
Stars: ✭ 384 (+1921.05%)
Mutual labels:  scraper, scrapy
Elves
🎊 Design and implement of lightweight crawler framework.
Stars: ✭ 315 (+1557.89%)
Mutual labels:  spider, scrapy
Awesome Crawler
A collection of awesome web crawlers and spiders in different languages
Stars: ✭ 4,793 (+25126.32%)
Mutual labels:  spider, scraper
Python Spider
Assorted spider demos: Douban Top 250 movies, Douyu JSON data and image scraping, Taobao, Youyuan, CrawlSpider scraping of basic profile data from the Hongniang dating site plus distributed crawling with Redis storage, small crawler demos, Selenium, Duodian scraping, Django API development, Youyuan data scraping, simulated logins to Zhihu, GitHub and Tuchong, full-site scraping of the Duodian mall, WeChat official account article history, articles shared in WeChat groups or by WeChat friends, and itchat monitoring of articles shared by a given WeChat official account
Stars: ✭ 615 (+3136.84%)
Mutual labels:  spider, scrapy
Icrawler
A multi-thread crawler framework with many builtin image crawlers provided.
Stars: ✭ 629 (+3210.53%)
Mutual labels:  spider, scrapy

Mailing List Scraper


mailingListScraper is a tool to extract data from public email lists in a format suitable for data analysis.

Introduction

If you want to do data analysis on a mailing list, you first need a dataset. mailingListScraper is a Python tool that lets you process the unstructured data available in public mailing list archives. The data is saved as .csv and .xml files for easy statistical analysis, data modeling, text mining, machine learning, etc.

mailingListScraper is organized around Mailing List Archives. They usually store many mailing lists and provide a web interface to browse and read the emails.

Supported archives include:

Email Archive    Lists            # Emails    Default Mailing List
Hypermail        3 (list)         2.5m+       Linux Kernel Mailing List
MARC             3,500+ (list)    80m+        git

User guide

Installation

Clone the repo and cd into it, then install scrapy and the other dependencies:

pip install -r requirements.txt

You're done!

Quick start

I strongly recommend that you identify yourself in the user-agent (mailingListScraper/settings.py) so that people can contact you if needed. Also, be mindful of the potential impact of your scraper on the server's load.
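
For example, the relevant part of mailingListScraper/settings.py might look something like this (the contact details are placeholders, and the throttling settings are standard Scrapy options suggested here rather than project defaults):

# mailingListScraper/settings.py -- illustrative values only
USER_AGENT = 'mailingListScraper (yourname@example.org)'  # let server admins know who to contact

# Optional: standard Scrapy settings to limit the load you put on the archive
DOWNLOAD_DELAY = 1           # wait at least one second between requests
AUTOTHROTTLE_ENABLED = True  # adapt the crawl rate to the server's response times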

mailingListScraper is composed of several spiders. Each spider targets a specific email archive, which can host one or several mailing lists.

You can launch a spider by running this command at the root level of the repo:

scrapy crawl {archiveName}

For instance, if I want to collect data from the Hypermail archive:

scrapy crawl hypermail

If the archive hosts multiple mailing lists, the spider only crawls one of them by default and lets you know which one. In the Hypermail case, that's the Linux Kernel Mailing List:

[hypermail] INFO: Crawling the LKML by default.

That's it! The spider is collecting data.

Collected data

The spider stores extracted emails in a data folder, containing:

  • {ArchiveName}ByEmail.csv: all collected metadata is stored in this file, with each row corresponding to an email. If you only crawl one mailing list, the file is named {mailingList}ByEmail.csv (and the mailingList column is dropped).
  • {ArchiveName}{year}Bodies.xml: an XML file with the email bodies, in which each item is an email. If you only crawl one mailing list, the file is named {mailingList}{year}Bodies.xml (and the mailingList node is dropped).

CSV file

Each row corresponds to an email, each column to one of the following fields:

Field Example Comment
mailingList lkml Might be dropped if you only crawl one mailing list.
emailId 20161017142556 The timestamp of the received time ("received on 2016-10-17 at 14:25:56"). If two or more emails were received at the same time, one or more 0s are appended to the timestamp.
senderName Linus Torvalds If no name is found, this falls back to the email address.
senderEmail [email protected] Might not be complete.
timestampSent 20161017142556+0500 Based on the timeSent field, a timestamp with timezone (if available). Will be "NA" if timeSent is "NA" or cannot be parsed. Some mailing lists don't provide it, in which case the whole column is dropped.
timestampReceived 20161017142556+0500 Based on the timeReceived field, a timestamp with timezone (if available). Will be "NA" if timeReceived is "NA" or cannot be parsed.
subject Re: [PATCH v1] oops Pretty obvious :)
url http://archive.org/mailingList/msg2.html The url of the message.
replyto http://archive.org/mailingList/msg1.html The url of the message the current email replies to.

When the scraper fails to extract the relevant information from the email, the field is marked as "NA".
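
For instance, here is a minimal sketch of loading such a file with pandas (the path data/lkmlByEmail.csv is hypothetical, following the naming convention above; adjust it to your own crawl):

import pandas as pd

# Keep emailId as text: duplicated timestamps get extra zeros appended,
# so the column does not parse cleanly as numbers or dates.
emails = pd.read_csv('data/lkmlByEmail.csv', dtype={'emailId': str}, na_values='NA')

# Timestamps look like 20161017142556+0500; unparseable values become NaT.
emails['timestampReceived'] = pd.to_datetime(
    emails['timestampReceived'], format='%Y%m%d%H%M%S%z', utc=True, errors='coerce')

print(emails[['senderName', 'subject', 'timestampReceived']].head())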

XML file

The body of the emails is stored in an XML file, with some metadata, to make text mining easier. It's organized like this:

<?xml version="1.0" encoding="utf-8"?>
<emails>
  <email>
    <emailId>20060130061212</emailId>
    <senderName>Linus Torvalds</senderName>
    <senderEmail>[email protected]</senderEmail>
    <timestampReceived>2006-01-30 06:12:12-0400</timestampReceived>
    <subject>Re: [PATCH v1] oops</subject>
    <body>bla bla bla</body>
  </email>
</emails>

Each email is stored as an <email> node.

This is how you would load the data with the tm package in R:

library('tm')

# Define custom XML reader
ml_reader <- readXML(
  spec = list(id = list("node", "/email/emailId"),
              content = list("node", "/email/body"),
              datetimestamp = list("node", "/email/timestampReceived"),
              subject = list("node", "/email/subject"),
              author = list("node", "/email/senderName"),
              author_email = list("node", "/email/senderEmail")
              ),
  doc = PlainTextDocument()
)

# Create custom source
ml_source <- function(x) {
  XMLSource(x, function(tree) XML::xmlChildren(XML::xmlRoot(tree)), ml_reader)
}

# Load documents as a VCorpus for example
ml <- VCorpus(ml_source("./lkml2017Bodies.xml"))
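
If you prefer Python, a comparable sketch with the standard library (same hypothetical file name as in the R example):

import xml.etree.ElementTree as ET

# Each <email> node becomes a plain dict of its child elements.
tree = ET.parse('./lkml2017Bodies.xml')
emails = [{child.tag: child.text for child in email}
          for email in tree.getroot().findall('email')]

print(emails[0]['subject'], emails[0]['senderName'])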

Options

The spiders accept arguments from the command line. You can combine them to adjust the scope of your crawl.

Say I only care about the metadata of the emails sent in 1995, but I want to crawl all the lists in the Hypermail archive:

scrapy crawl hypermail -a mlist=all -a body=false -a year=1995

mlist

You can provide a comma-separated list of mailing lists for a specific spider:

scrapy crawl archiveName -a mlist=mailinglist1,mailinglist2

To print the available mailing lists in an archive:

scrapy crawl hypermail -a mlist=print

To crawl every mailing list in the archive:

scrapy crawl hypermail -a mlist=all

body

Since downloading the body of each email can take up a lot of disk space, you can disable it:

scrapy crawl archiveName -a body=false

year, month

By default, the spiders crawl through every message in the mailing list. If you're only interested in a specific period of time, you can use the year and/or month arguments.

You can focus on one year/month:

scrapy crawl marc -a year=2006
scrapy crawl marc -a month=01

Or you can give it a comma-separated list of years/months:

scrapy crawl marc -a year=2006,2011
scrapy crawl marc -a month=01,06

Or even a range of years/months:

scrapy crawl marc -a year=2006:2008
scrapy crawl marc -a month=01:06

You can also combine the year and month arguments:

scrapy crawl marc -a year=2006:2008 -a month=01:06

Development and testing

I am currently developing this scraper to collect data for my PhD (EHESS, Paris, France). If you see problems with the code, I'll be glad to review your pull requests ;-)

Development

This scraper is developed in Python 3.5.2 with the scrapy framework.

Before you start working on your own spiders, you should set the LOG_LEVEL setting (mailingListScraper/settings.py) to DEBUG (or just uncomment the line).
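
In settings.py, that is a single line:

# mailingListScraper/settings.py
LOG_LEVEL = 'DEBUG'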

Testing

Running tests and git pre-commit hook

You can run all tests with a simple ./run-tests.sh (see below for a detailed explanation of how the tests work).

If you want to only commit code that passes tests, you should install the git pre-commit hook by running:

ln -s ../../pre-commit.sh .git/hooks/pre-commit

How the tests work

You can run the scrapy check command to run simple tests, with the built-in Scrapy contracts.
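
A contract is just a few annotations in a parse method's docstring. Here is an illustrative sketch (the method name, URL and checked fields are made up for the example, not taken from the repository):

def parse_email(self, response):
    """Extract a single email from an archive page.

    @url http://example.org/hypermail/0001.html
    @returns items 1 1
    @scrapes senderName senderEmail subject url
    """
    ...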

While contracts are fine for small verifications, they are not enough: you need to make sure that the collected data is consistent with your expectations. That's why some basic unit testing is provided in mailingListScraper/tests. Each spider and pipeline is tested against "real world" test cases.

The data for these test cases is provided in the pages directory. Cases are organized into subfolders, named after their spiders. In these subfolders, you'll find the test cases, each consisting of three files:

  • emailId.html: the page used as a Scrapy response to test the spider's methods.
  • emailId.json: the data you expect to get from the spider. You can test the items extracted after they've been processed by the ItemLoader (itemOutput) and after they've been processed by the pipelines (pipelineOutput).
  • emailId.txt: the body you expect to retrieve from the page.

Each time you test a spider, it iterates through a number of test cases. You can test a specific spider by running this command at the root level of the repo:

python -m unittest mailingListScraper.tests.hypermail

Pipelines can also be tested:

python -m unittest mailingListScraper.tests.pipelines

Privacy

The data is already publicly available online; I am merely organizing it in a form that is convenient for data analysis. For instance, I cannot collect email addresses if the email archive hides them.

But when data is available and adds valuable information to the dataset, I will collect it. If your email address is not hidden, I do extract it to improve my chances of following an individual user over the years. Specifically, the same user might change her name but not her email address, or the other way around. Collecting both the name and the email address increases the probability of attributing these emails to the same person.

Keep in mind that I will only use the data collected with this scraper for research. In particular, I will never use this data for spamming or targeting users with ads.

If you think I might have collected some of your emails from these public lists, feel free to contact me (@gaalcaras) with any questions or requests regarding your personal data.
