nlpaueb / edgar-crawler

Licence: GPL-3.0 license
Download financial reports from SEC's EDGAR. Extract clean textual data from specific item sections and bootstrap your financial NLP research. Software from the research paper published in ECONLP 2021.

Projects that are alternatives of or similar to edgar-crawler

fortune500
Fortune 500 company lists since 1955 in CSV format, mostly parsed using Beautiful Soup
Stars: ✭ 78 (+9.86%)
Mutual labels:  finance, business, economics
priceR
Economics and Pricing in R
Stars: ✭ 32 (-54.93%)
Mutual labels:  finance, economics
Fecon235
Notebooks for financial economics. Keywords: Jupyter notebook pandas Federal Reserve FRED Ferbus GDP CPI PCE inflation unemployment wage income debt Case-Shiller housing asset portfolio equities SPX bonds TIPS rates currency FX euro EUR USD JPY yen XAU gold Brent WTI oil Holt-Winters time-series forecasting statistics econometrics
Stars: ✭ 708 (+897.18%)
Mutual labels:  finance, economics
mxfactorial
a payment application intended for deployment by the united states treasury
Stars: ✭ 36 (-49.3%)
Mutual labels:  finance, economics
akshare
AKShare is an elegant and simple financial data interface library for Python, built for human beings! 开源财经数据接口库
Stars: ✭ 5,155 (+7160.56%)
Mutual labels:  finance, economics
ESL
​The Economic Simulation Library provides an extensive collection of tools to develop, test, analyse and calibrate economic and financial agent-based models. The library is designed to take advantage of different computer architectures. In order to facilitate rapid iteration during model development the library can use parallel computation. Econ…
Stars: ✭ 36 (-49.3%)
Mutual labels:  finance, economics
Akshare
AKShare is an elegant and simple financial data interface library for Python, built for human beings! 开源财经数据接口库
Stars: ✭ 4,334 (+6004.23%)
Mutual labels:  finance, economics
Bootstrapping Calculator
Do you have enough savings to fund your business?
Stars: ✭ 465 (+554.93%)
Mutual labels:  finance, business
Fecon236
Tools for financial economics. Curated wrapper over Python ecosystem. Source code for fecon235 Jupyter notebooks.
Stars: ✭ 72 (+1.41%)
Mutual labels:  finance, economics
Ta Rs
Technical analysis library for Rust language
Stars: ✭ 248 (+249.3%)
Mutual labels:  finance
housing-model
Agent-based model of the UK housing market.
Stars: ✭ 29 (-59.15%)
Mutual labels:  economics
Trading Backtest
A stock backtesting engine written in modern Java. And a pairs trading (cointegration) strategy implementation using a bayesian kalman filter model
Stars: ✭ 247 (+247.89%)
Mutual labels:  finance
Alpaca Backtrader Api
Alpaca Trading API integrated with backtrader
Stars: ✭ 246 (+246.48%)
Mutual labels:  finance
finance-news-aggregator
A news aggregator in python, that focuses primarily on business and market news sources.
Stars: ✭ 59 (-16.9%)
Mutual labels:  business
Dash.jl
Dash for Julia - A Julia interface to the Dash ecosystem for creating analytic web applications in Julia. No JavaScript required.
Stars: ✭ 248 (+249.3%)
Mutual labels:  finance
exactonline
Exact Online (accounting software) REST API Library in Python
Stars: ✭ 35 (-50.7%)
Mutual labels:  finance
News Emotion
📉 金融文本情感分析模型
Stars: ✭ 239 (+236.62%)
Mutual labels:  finance
Stock Bot
An application that allows you to design and test your own stock trading algorithms in an attempt to beat the market.
Stars: ✭ 240 (+238.03%)
Mutual labels:  finance
rRofex
R library to connect to Matba Rofex's Trading API. Functionality includes accessing account data and current holdings, retrieving investment quotes, placing and canceling orders, and getting reference data for instruments.
Stars: ✭ 21 (-70.42%)
Mutual labels:  finance
starling-roundup
Round-up your Starling Bank transactions and transfer the proceeds to a savings goal
Stars: ✭ 17 (-76.06%)
Mutual labels:  finance

EDGAR-CRAWLER

Crawl and fetch the annual reports of all publicly traded companies from SEC's EDGAR database.

edgar-crawler is an optimized toolkit that retrieves textual information from financial reports, such as 10-K, 10-Q or 8-K filings.

More specifically, it can:

  • Crawl and download financial reports for each publicly-traded company, for specified years, through the edgar_crawler.py module.
  • Extract and clean specific text sections, such as Risk Factors, MD&A, and others, through the extract_items.py module. Currently, we only support extraction of 10-K filings (i.e., annual reports).

The purpose of EDGAR-CRAWLER is to speed up research and experiments that rely on financial information, which are common in the research literature of economics, finance, business and management.

Install

  • Before starting, it's recommended to create a new virtual environment using Python 3.8 or greater. We recommend installing and using Anaconda for this.
  • Install dependencies via pip install -r requirements.txt

Usage

  • Before running any script, you should edit the config.json file.

    • Arguments for edgar_crawler.py, the module to download financial reports:
      • --start_year XXXX: the first year of the range to download (default is 2021)
      • --end_year YYYY: the last year of the range to download (default is 2021)
      • --quarters: the quarters that you want to download filings from (List).
        Default value is: [1, 2, 3, 4].
      • --filing_types: list of filing types to download.
        Default value is: ['10-K', '10-K405', '10-KT'].
      • --cik_tickers: a list of CIKs or tickers, or the path of a file containing them, e.g. [789019, "1018724", "AAPL", "TWTR"].
        If you use a file, put each CIK or ticker on a separate line.
        If this argument is not provided, the toolkit downloads annual reports for all U.S. publicly traded companies.
      • --user_agent: the User-agent (name/email) that will be declared to SEC EDGAR.
      • --raw_filings_folder: the name of the folder where downloaded filings will be stored.
        Default value is 'RAW_FILINGS'.
      • --indices_folder: the name of the folder where EDGAR TSV files will be stored. These are used to locate the annual reports. Default value is 'INDICES'.
      • --filings_metadata_file: CSV filename to save metadata from the reports.
      • --skip_present_indices: Whether to skip already downloaded EDGAR indices or download them nonetheless.
        Default value is True.
    • Arguments for extract_items.py, the module to clean and extract textual data from already-downloaded 10-K reports:
      • --raw_filings_folder: the name of the folder where the downloaded documents are stored.
        Default value is 'RAW_FILINGS'.
      • --extracted_filings_folder: the name of the folder where extracted documents will be stored.
        Default value is 'EXTRACTED_FILINGS'.
        For each downloaded report, a corresponding JSON file will be created containing the item sections as key-value pairs.
      • --filings_metadata_file: CSV filename to load reports metadata (Provide the same csv file as in edgar_crawler.py)
      • --items_to_extract: a list of the item sections to extract.
        e.g. ['7','8'] to extract 'Management’s Discussion and Analysis' and 'Financial Statements' section items.
        The default list contains all item sections.
      • remove_tables: Whether to remove tables containing mostly numerical (financial) data. This toolkit mainly targets NLP research, where numerical tables are often not useful.
      • skip_extracted_filings: Whether to skip already extracted filings or extract them nonetheless.
        Default value is True.
  • To download financial reports from EDGAR, run python edgar_crawler.py.

  • To clean and extract specific item sections from already-downloaded 10-K documents, run python extract_items.py.

    • Reminder: Extraction currently supports only 10-K documents.
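
As a rough illustration of the settings listed above, the sketch below builds a config.json from Python and reads the extracted JSON files back. The key names are taken from the argument list, but several things here are assumptions rather than facts about the repository: the nesting of settings under "edgar_crawler" and "extract_items", the metadata filename FILINGS_METADATA.csv, the value shown for remove_tables, and the helper names write_config and load_extracted_items. Check the config.json shipped with the project before relying on any of it.

```python
import json
from pathlib import Path

# Assumption: settings are grouped per script. The key names mirror the
# documented arguments; the nesting itself is a guess.
CONFIG = {
    "edgar_crawler": {
        "start_year": 2021,
        "end_year": 2021,
        "quarters": [1, 2, 3, 4],
        "filing_types": ["10-K", "10-K405", "10-KT"],
        "cik_tickers": [789019, "1018724", "AAPL", "TWTR"],
        "user_agent": "Your Name (your.email@example.com)",
        "raw_filings_folder": "RAW_FILINGS",
        "indices_folder": "INDICES",
        "filings_metadata_file": "FILINGS_METADATA.csv",  # hypothetical filename
        "skip_present_indices": True,
    },
    "extract_items": {
        "raw_filings_folder": "RAW_FILINGS",
        "extracted_filings_folder": "EXTRACTED_FILINGS",
        "filings_metadata_file": "FILINGS_METADATA.csv",  # same file as above
        "items_to_extract": ["7", "8"],
        "remove_tables": True,  # assumed value; the default is not documented
        "skip_extracted_filings": True,
    },
}


def write_config(path="config.json"):
    """Write the example configuration to `path` as indented JSON."""
    Path(path).write_text(json.dumps(CONFIG, indent=4), encoding="utf-8")


def load_extracted_items(folder="EXTRACTED_FILINGS"):
    """Yield (filename, contents) for every JSON file in `folder`.

    Each file is expected to hold the item sections of one report as
    key-value pairs; the exact key names depend on the toolkit.
    """
    for json_path in sorted(Path(folder).glob("*.json")):
        with json_path.open(encoding="utf-8") as handle:
            yield json_path.name, json.load(handle)
```

Calling write_config() overwrites any existing config.json in the working directory, so back up your own file first. After running extract_items.py, iterating over load_extracted_items() gives one dictionary per report.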

Citation

If this work helps or inspires you in any way, please consider citing the relevant paper published at the 3rd Economics and Natural Language Processing (ECONLP) workshop at EMNLP 2021 (Punta Cana, Dominican Republic):

@inproceedings{loukas-etal-2021-edgar,
    title = "{EDGAR}-{CORPUS}: Billions of Tokens Make The World Go Round",
    author = "Loukas, Lefteris  and
      Fergadiotis, Manos  and
      Androutsopoulos, Ion  and
      Malakasiotis, Prodromos",
    booktitle = "Proceedings of the Third Workshop on Economics and Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.econlp-1.2",
    pages = "13--18",
}

Read the paper here: https://arxiv.org/abs/2109.14394

Contributing

PRs and contributions are accepted.

Please use the Feature Branch Workflow.

Issues

Please create an issue on GitHub instead of emailing us directly so all possible users can benefit from the troubleshooting.

License

Please see the GNU General Public License v3.0
