
skydome20 / crawler_CIA_CREST

License: MIT
R-crawler for CIA website (CREST)


Projects that are alternatives of or similar to crawler CIA CREST

Html Agility Pack
Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
Stars: ✭ 2,014 (+13326.67%)
Mutual labels:  parse, xpath
carsBase
Database of cars with makes and models in JSON, CSV, XLSX and MySQL
Stars: ✭ 49 (+226.67%)
Mutual labels:  parse
CROHME extractor
CROHME dataset extractor for OFFLINE-text-recognition task.
Stars: ✭ 77 (+413.33%)
Mutual labels:  parse
easy-json-parse
Parse your json safely and easily.
Stars: ✭ 33 (+120%)
Mutual labels:  parse
gitsum
parse and summarise git repository history
Stars: ✭ 43 (+186.67%)
Mutual labels:  parse
Z-Spider
Tips and examples for web-crawler development
Stars: ✭ 33 (+120%)
Mutual labels:  xpath
astutils
Bare essentials for building abstract syntax trees, and skeleton classes for PLY lexers and parsers.
Stars: ✭ 13 (-13.33%)
Mutual labels:  parse
web-data-extractor
Extracting and parsing structured data with jQuery Selector, XPath or JsonPath from common web formats like HTML, XML and JSON.
Stars: ✭ 52 (+246.67%)
Mutual labels:  xpath
codechef-rank-comparator
Web application hosted on Heroku cloud platform based on web scraping in python using lxml library (XML Path Language).
Stars: ✭ 23 (+53.33%)
Mutual labels:  xpath
Splain
small parser to create more interesting language/sentences
Stars: ✭ 15 (+0%)
Mutual labels:  parse
vgprompter
C# library to parse a subset of Ren'Py script syntax
Stars: ✭ 17 (+13.33%)
Mutual labels:  parse
DouBanReptile
Multi-threaded crawler for Douban housing-rental groups; after crawling, it generates a Markdown file automatically sorted by time.
Stars: ✭ 31 (+106.67%)
Mutual labels:  xpath
parse-cloud-class
Extendable way to set up Parse Cloud classes behaviour
Stars: ✭ 40 (+166.67%)
Mutual labels:  parse
der-parser
BER/DER parser written in pure Rust. Fast, zero-copy, safe.
Stars: ✭ 73 (+386.67%)
Mutual labels:  parse
expresol
Library for executing customizable script-languages in python
Stars: ✭ 11 (-26.67%)
Mutual labels:  parse
go-xmldom
XML DOM processing for Golang, supports xpath query
Stars: ✭ 38 (+153.33%)
Mutual labels:  xpath
Android-Shortify
An Android library used for making an Android application more faster with less amount of code. Shortify for Android provides basic functionalities of view and resource binding, view customization, JSON parsing, AJAX, various readymade dialogs and much more.
Stars: ✭ 21 (+40%)
Mutual labels:  parse
eval-estree-expression
Safely evaluate JavaScript (estree) expressions, sync and async.
Stars: ✭ 22 (+46.67%)
Mutual labels:  parse
XPath2.Net
Lightweight XPath2 for .NET
Stars: ✭ 26 (+73.33%)
Mutual labels:  xpath
cmd-ts
💻 A type-driven command line argument parser
Stars: ✭ 92 (+513.33%)
Mutual labels:  parse


Introduction

On 2017/01/18, the Central Intelligence Agency (CIA) released its CIA Records Search Tool (CREST) database online, including 930,000 declassified documents.

Out of interest, I wrote a web crawler for the public CIA CREST website ( https://www.cia.gov/library/readingroom/collection/crest-25-year-program-archive ), making it convenient to quickly browse the results of a query and automatically download documents to your own machine.

Queries are run against the CIA Freedom of Information Act (FOIA) Electronic Reading Room (ERR), and the crawler is written in R.

crawler_CIA_CREST.R

This is an R script which provides 3 functions:

  1. basic.info.query.CIA_CREST(query) : get the basic information for a given query.

  2. parsing.pages.CIA_CREST(query, pages) : return a parse.table for the given query and the range of pages you want to search; this table should be provided to the next function.

  3. download.doc.CIA_CREST(parse.table) : automatically download documents based on the parse.table, and return a reference.table that matches document titles to the downloaded files (.pdf).

main.R

I provide several examples in this script; below is a walkthrough of one.

1. basic.info.query.CIA_CREST(query)

For example, if you are interested in "secret letter" and want to search for documents:

basic.info.query.CIA_CREST(query = "secret letter") 
# Response 
The search query is for CIA Freedom of Information Act (FOIA) Electronic Reading Room (ERR)
URL: https://www.cia.gov/library/readingroom/collection/crest-25-year-program-archive

Your query is : secret letter
Search found 388350 items
The results contain 0 ~ 19417 pages

The response reports 388350 matching items, spread over result pages 0~19417.

(Note that page 0 corresponds to the first page on the website.)
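The page range follows from the item count. Each result page appears to hold 20 documents (an assumption inferred from the numbers above, not stated by the site), and pages are counted from 0, so a small sketch recovers the last page index:

```r
# Pages are 0-based; 20 documents per page is an assumption inferred
# from the example response above.
items     <- 388350
per.page  <- 20
last.page <- ceiling(items / per.page) - 1  # subtract 1 because page 0 is first

last.page  # 19417, matching "0 ~ 19417 pages"
```
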

2. parsing.pages.CIA_CREST(query, pages)

The next step is to decide which pages you want to search.

For example, to check documents about "secret letter" in the top 10 pages:

your.query = 'secret letter'
page.nums = c(0:9)   # the top 10 pages

parse.table = parsing.pages.CIA_CREST(query = your.query, 
									  pages = page.nums)

The returned parse.table includes 4 columns:

  1. title : titles of documents.

  2. download.url : the URLs from which the documents can be downloaded.

  3. page : the result page on which the document appears.

  4. correspond.page : the URL of that result page.

This parse.table should be supplied to download.doc.CIA_CREST(), the function which automatically downloads every document listed in parse.table into a local folder.
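Since parse.table is an ordinary data frame, you can also subset it before downloading, for example keeping only titles that match a keyword. The rows below are made-up placeholders for illustration; real rows come from parsing.pages.CIA_CREST():

```r
# Hypothetical parse.table rows for illustration only; real ones come
# from parsing.pages.CIA_CREST(). Column names follow the list above.
parse.table <- data.frame(
  title           = c("SECRET LETTER TO THE DIRECTOR", "WEEKLY SUMMARY"),
  download.url    = c("https://example.org/a.pdf", "https://example.org/b.pdf"),
  page            = c(0, 0),
  correspond.page = c("https://example.org/page0", "https://example.org/page0"),
  stringsAsFactors = FALSE
)

# Keep only documents whose title mentions "LETTER" before downloading
letters.only <- parse.table[grepl("LETTER", parse.table$title), ]
```

The filtered data frame can then be passed to download.doc.CIA_CREST() in place of the full table.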

3. download.doc.CIA_CREST(parse.table)

Say we want to download the documents (.pdf) about "secret letter" from the top 10 pages:

your.query = 'secret letter'
page.nums = c(0:9)   # the top 10 pages

parse.table = parsing.pages.CIA_CREST(query = your.query, 
                                      pages = page.nums)
									  
reference.table = download.doc.CIA_CREST(parse.table)

Or, to download only the top 10 documents (.pdf) about "UFO" on the first page:

your.query = 'UFO'
page.nums = c(0)   # the first page

parse.table = parsing.pages.CIA_CREST(query = your.query, 
                                      pages = page.nums)
									  
reference.table = download.doc.CIA_CREST(parse.table[1:10,]) # only the top 10 documents  

Note that the returned reference.table includes 2 columns:

  1. title : titles of documents

  2. pdf.name : file names of the downloaded documents (.pdf)

Downloaded files are named in the CIA's own encoded style, so the reference.table is necessary for matching titles to the documents.
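For example, looking up which downloaded file corresponds to a given title is a simple data-frame lookup. The rows below are hypothetical; real ones come from download.doc.CIA_CREST(), with pdf.name values that merely mimic the CIA's encoded naming style:

```r
# Hypothetical reference.table for illustration; real rows come from
# download.doc.CIA_CREST(). The pdf.name values are made up.
reference.table <- data.frame(
  title    = c("SECRET LETTER TO THE DIRECTOR", "WEEKLY SUMMARY"),
  pdf.name = c("CIA-RDP00X00000R000100010001-1.pdf",
               "CIA-RDP00X00000R000100010002-2.pdf"),
  stringsAsFactors = FALSE
)

# Which downloaded file corresponds to a given title?
match.row <- reference.table$title == "WEEKLY SUMMARY"
reference.table$pdf.name[match.row]
```
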

R Note for more detail

I wrote an article in Chinese with more detail about how I implemented this crawler.

(Sorry, there is no English version)

http://rpubs.com/skydome20/R-Note13-Web-Crawler-on-CIA-CREST-by-xml2
