
skydome20 / crawler_CIA_CREST

License: MIT
R-crawler for CIA website (CREST)


Projects that are alternatives of or similar to crawler CIA CREST

Html Agility Pack
Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
Stars: ✭ 2,014 (+13326.67%)
Mutual labels:  parse, xpath
carsBase
Database of cars with makes and models in JSON, CSV, XLSX and MySQL
Stars: ✭ 49 (+226.67%)
Mutual labels:  parse
CROHME extractor
CROHME dataset extractor for OFFLINE-text-recognition task.
Stars: ✭ 77 (+413.33%)
Mutual labels:  parse
easy-json-parse
Parse your json safely and easily.
Stars: ✭ 33 (+120%)
Mutual labels:  parse
gitsum
parse and summarise git repository history
Stars: ✭ 43 (+186.67%)
Mutual labels:  parse
Z-Spider
Tips and examples for web-crawler development
Stars: ✭ 33 (+120%)
Mutual labels:  xpath
astutils
Bare essentials for building abstract syntax trees, and skeleton classes for PLY lexers and parsers.
Stars: ✭ 13 (-13.33%)
Mutual labels:  parse
web-data-extractor
Extracting and parsing structured data with jQuery Selector, XPath or JsonPath from common web formats like HTML, XML and JSON.
Stars: ✭ 52 (+246.67%)
Mutual labels:  xpath
codechef-rank-comparator
Web application hosted on Heroku cloud platform based on web scraping in python using lxml library (XML Path Language).
Stars: ✭ 23 (+53.33%)
Mutual labels:  xpath
Splain
small parser to create more interesting language/sentences
Stars: ✭ 15 (+0%)
Mutual labels:  parse
vgprompter
C# library to parse a subset of Ren'Py script syntax
Stars: ✭ 17 (+13.33%)
Mutual labels:  parse
DouBanReptile
Multi-threaded crawler for Douban housing-rental groups; after crawling, it generates a Markdown file automatically sorted by time.
Stars: ✭ 31 (+106.67%)
Mutual labels:  xpath
parse-cloud-class
Extendable way to set up Parse Cloud classes behaviour
Stars: ✭ 40 (+166.67%)
Mutual labels:  parse
der-parser
BER/DER parser written in pure Rust. Fast, zero-copy, safe.
Stars: ✭ 73 (+386.67%)
Mutual labels:  parse
expresol
Library for executing customizable script-languages in python
Stars: ✭ 11 (-26.67%)
Mutual labels:  parse
go-xmldom
XML DOM processing for Golang, supports xpath query
Stars: ✭ 38 (+153.33%)
Mutual labels:  xpath
Android-Shortify
An Android library used for making an Android application more faster with less amount of code. Shortify for Android provides basic functionalities of view and resource binding, view customization, JSON parsing, AJAX, various readymade dialogs and much more.
Stars: ✭ 21 (+40%)
Mutual labels:  parse
eval-estree-expression
Safely evaluate JavaScript (estree) expressions, sync and async.
Stars: ✭ 22 (+46.67%)
Mutual labels:  parse
XPath2.Net
Lightweight XPath2 for .NET
Stars: ✭ 26 (+73.33%)
Mutual labels:  xpath
cmd-ts
💻 A type-driven command line argument parser
Stars: ✭ 92 (+513.33%)
Mutual labels:  parse


Introduction

On 2017/01/18, the Central Intelligence Agency (CIA) released its CIA Records Search Tool (CREST) database online, including 930,000 declassified documents.

Out of interest, I wrote a web crawler for the public CIA CREST website ( https://www.cia.gov/library/readingroom/collection/crest-25-year-program-archive ), making it convenient to quickly browse the results of a query and automatically download documents to your own machine.

Queries are run against the CIA Freedom of Information Act (FOIA) Electronic Reading Room (ERR), and the crawler is written in R.

crawler_CIA_CREST.R

This is an R script which provides 3 functions:

  1. basic.info.query.CIA_CREST(query) : get the basic information for a given query.

  2. parsing.pages.CIA_CREST(query, pages) : return a parse.table for the given query and the range of pages you want to search; this table should be provided to the next function.

  3. download.doc.CIA_CREST(parse.table) : automatically download documents based on the parse.table, and return a reference.table that matches document titles to the downloaded files (.pdf).

main.R

I provide several examples in this script; below is a walkthrough of one.

1. basic.info.query.CIA_CREST(query)

For example, if you are interested in "secret letter" and want to search for documents:

basic.info.query.CIA_CREST(query = "secret letter") 
# Response 
The search query is for CIA Freedom of Information Act (FOIA) Electronic Reading Room (ERR)
URL: https://www.cia.gov/library/readingroom/collection/crest-25-year-program-archive

Your query is : secret letter
Search found 388350 items
The results contain 0 ~ 19417 pages

The response reports 388350 matching items, spread over result pages 0~19417.

(Note that page 0 corresponds to the first page on the website.)
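The page range follows from the item count. Each result page appears to hold 20 documents (an assumption inferred from the numbers above, not stated by the site), and pages are counted from 0, so a small sketch recovers the last page index:

```r
# Pages are 0-based; 20 documents per page is an assumption inferred
# from the example response above.
items     <- 388350
per.page  <- 20
last.page <- ceiling(items / per.page) - 1  # subtract 1 because page 0 is first

last.page  # 19417, matching "0 ~ 19417 pages"
```
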

2. parsing.pages.CIA_CREST(query, pages)

The next step is to decide which pages you want to search.

For example, to check documents about "secret letter" in the top 10 pages:

your.query = 'secret letter'
page.nums = c(0:9)   # the top 10 pages

parse.table = parsing.pages.CIA_CREST(query = your.query, 
									  pages = page.nums)

The returned parse.table includes 4 columns:

  1. title : titles of documents.

  2. download.url : the URLs from which the documents can be downloaded.

  3. page : the result page on which the document appears.

  4. correspond.page : the URL of that result page.

This parse.table should be supplied to download.doc.CIA_CREST(), the function which automatically downloads every document listed in parse.table into a local folder.
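Since parse.table is an ordinary data frame, you can also subset it before downloading, for example keeping only titles that match a keyword. The rows below are made-up placeholders for illustration; real rows come from parsing.pages.CIA_CREST():

```r
# Hypothetical parse.table rows for illustration only; real ones come
# from parsing.pages.CIA_CREST(). Column names follow the list above.
parse.table <- data.frame(
  title           = c("SECRET LETTER TO THE DIRECTOR", "WEEKLY SUMMARY"),
  download.url    = c("https://example.org/a.pdf", "https://example.org/b.pdf"),
  page            = c(0, 0),
  correspond.page = c("https://example.org/page0", "https://example.org/page0"),
  stringsAsFactors = FALSE
)

# Keep only documents whose title mentions "LETTER" before downloading
letters.only <- parse.table[grepl("LETTER", parse.table$title), ]
```

The filtered data frame can then be passed to download.doc.CIA_CREST() in place of the full table.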

3. download.doc.CIA_CREST(parse.table)

Say we want to download the documents (.pdf) about "secret letter" from the top 10 pages:

your.query = 'secret letter'
page.nums = c(0:9)   # the top 10 pages

parse.table = parsing.pages.CIA_CREST(query = your.query, 
                                      pages = page.nums)
									  
reference.table = download.doc.CIA_CREST(parse.table)

Or, to download only the top 10 documents (.pdf) about "UFO" on the first page:

your.query = 'UFO'
page.nums = c(0)   # the first page

parse.table = parsing.pages.CIA_CREST(query = your.query, 
                                      pages = page.nums)
									  
reference.table = download.doc.CIA_CREST(parse.table[1:10,]) # only the top 10 documents  

Note that the returned reference.table includes 2 columns:

  1. title : titles of documents

  2. pdf.name : file names of the downloaded documents (.pdf)

Downloaded files are named in the CIA's own encoded style, so the reference.table is necessary for matching titles to the documents.
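For example, looking up which downloaded file corresponds to a given title is a simple data-frame lookup. The rows below are hypothetical; real ones come from download.doc.CIA_CREST(), with pdf.name values that merely mimic the CIA's encoded naming style:

```r
# Hypothetical reference.table for illustration; real rows come from
# download.doc.CIA_CREST(). The pdf.name values are made up.
reference.table <- data.frame(
  title    = c("SECRET LETTER TO THE DIRECTOR", "WEEKLY SUMMARY"),
  pdf.name = c("CIA-RDP00X00000R000100010001-1.pdf",
               "CIA-RDP00X00000R000100010002-2.pdf"),
  stringsAsFactors = FALSE
)

# Which downloaded file corresponds to a given title?
match.row <- reference.table$title == "WEEKLY SUMMARY"
reference.table$pdf.name[match.row]
```
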

R Note for more detail

I wrote an article in Chinese with more detail about how I implemented this crawler.

(Sorry, there is no English version)

http://rpubs.com/skydome20/R-Note13-Web-Crawler-on-CIA-CREST-by-xml2
