
sudhamstarun / Understanding Financial Reports Using Natural Language Processing

Licence: MIT
Investigate how mutual funds leverage credit derivatives by studying their routine filings to the SEC using NLP techniques 📈🤑

Projects that are alternatives of or similar to Understanding Financial Reports Using Natural Language Processing

Awesome Hungarian Nlp
A curated list of NLP resources for Hungarian
Stars: ✭ 121 (+236.11%)
Mutual labels:  natural-language-processing, named-entity-recognition, information-extraction
Nested Ner Tacl2020 Transformers
Implementation of Nested Named Entity Recognition using BERT
Stars: ✭ 76 (+111.11%)
Mutual labels:  natural-language-processing, named-entity-recognition, information-extraction
Nlp Progress
Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
Stars: ✭ 19,518 (+54116.67%)
Mutual labels:  natural-language-processing, named-entity-recognition
Transformers Tutorials
GitHub repo with tutorials to fine-tune transformers for different NLP tasks
Stars: ✭ 384 (+966.67%)
Mutual labels:  natural-language-processing, named-entity-recognition
Awesome Persian Nlp Ir
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Stars: ✭ 460 (+1177.78%)
Mutual labels:  natural-language-processing, named-entity-recognition
Vncorenlp
A Vietnamese natural language processing toolkit (NAACL 2018)
Stars: ✭ 354 (+883.33%)
Mutual labels:  natural-language-processing, named-entity-recognition
Spacy Streamlit
👑 spaCy building blocks and visualizers for Streamlit apps
Stars: ✭ 360 (+900%)
Mutual labels:  natural-language-processing, named-entity-recognition
Spacy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Stars: ✭ 21,978 (+60950%)
Mutual labels:  natural-language-processing, named-entity-recognition
Medacy
🏥 Medical Text Mining and Information Extraction with spaCy
Stars: ✭ 287 (+697.22%)
Mutual labels:  natural-language-processing, information-extraction
Hanlp
Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labelling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarisation, text classification and clustering, pinyin and simplified-traditional conversion, natural language processing
Stars: ✭ 24,626 (+68305.56%)
Mutual labels:  natural-language-processing, named-entity-recognition
Ner Lstm
Named Entity Recognition using multilayered bidirectional LSTM
Stars: ✭ 532 (+1377.78%)
Mutual labels:  natural-language-processing, named-entity-recognition
Stanza
Official Stanford NLP Python Library for Many Human Languages
Stars: ✭ 5,887 (+16252.78%)
Mutual labels:  natural-language-processing, named-entity-recognition
Snips Nlu
Snips Python library to extract meaning from text
Stars: ✭ 3,583 (+9852.78%)
Mutual labels:  named-entity-recognition, information-extraction
Gcn Over Pruned Trees
Graph Convolution over Pruned Dependency Trees Improves Relation Extraction (authors' PyTorch implementation)
Stars: ✭ 312 (+766.67%)
Mutual labels:  natural-language-processing, information-extraction
Usc Ds Relationextraction
Distantly Supervised Relation Extraction
Stars: ✭ 378 (+950%)
Mutual labels:  natural-language-processing, information-extraction
Ner
Named Entity Recognition
Stars: ✭ 288 (+700%)
Mutual labels:  natural-language-processing, named-entity-recognition
Neuronlp2
Deep neural models for core NLP tasks (Pytorch version)
Stars: ✭ 397 (+1002.78%)
Mutual labels:  natural-language-processing, named-entity-recognition
Named Entity Recognition
Named entity recognition with a recurrent neural network (RNN) in TensorFlow
Stars: ✭ 20 (-44.44%)
Mutual labels:  natural-language-processing, named-entity-recognition
Chatbot ner
chatbot_ner: Named Entity Recognition for chatbots.
Stars: ✭ 273 (+658.33%)
Mutual labels:  natural-language-processing, named-entity-recognition
Oie Resources
A curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.
Stars: ✭ 283 (+686.11%)
Mutual labels:  natural-language-processing, information-extraction

Understanding Financial Reports using Natural Language Processing

This project serves as my undergraduate Computer Science thesis in Natural Language Processing.

Background

This project investigates how mutual funds leverage credit derivatives by studying their routine filings to the U.S. Securities and Exchange Commission (SEC). Credit derivatives are used to transfer the credit risk of an underlying entity from one party to another without transferring the underlying entity itself.

Instead of studying all credit derivatives, we focus on the Credit Default Swap (CDS), one of the most popular credit derivatives and one widely considered a culprit of the 2007-2008 financial crisis. A credit default swap is a particular type of swap designed to transfer the credit exposure of fixed-income products between two or more parties. In a credit default swap, the buyer of the swap makes payments to the swap's seller up until the maturity date of the contract. In return, the seller agrees that, in the event that the debt issuer defaults or experiences another credit event, the seller will pay the buyer the security's premium as well as all interest payments that would have been paid between that time and the security's maturity date.
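To make these cash flows concrete, here is a toy Python sketch that simply enumerates the quarterly premium payments the buyer owes and the contingent payout the seller owes on default. All figures (notional, spread, default quarter, recovery rate) are hypothetical and chosen only for illustration.

# Toy illustration of CDS cash flows (all figures hypothetical).
notional = 10_000_000      # protected notional in USD
spread = 0.01              # 100 bps annual premium
years = 5                  # contract maturity
default_quarter = 14       # assume a credit event in quarter 14
recovery_rate = 0.4        # assumed recovery on the reference obligation

quarterly_premium = notional * spread / 4

for q in range(1, years * 4 + 1):
    if q < default_quarter:
        print(f"Q{q}: buyer pays seller {quarterly_premium:,.0f}")
    else:
        # On the credit event, the seller covers the loss given default.
        payout = notional * (1 - recovery_rate)
        print(f"Q{q}: credit event -- seller pays buyer {payout:,.0f}")
        break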

CDS contracts are traded over the counter, so little public information about their trading activity is available to outside investors. Yet such information is valuable. The CDS is designed as a hedging tool that buyers use to protect themselves from potential default events of the reference entity, but it is also used for speculation and liquidity management, especially during a crisis.

Before the SEC mandated more frequent and detailed fund holdings reporting at the end of 2016, mutual funds filed these forms in inconsistent formats, which made it extremely difficult to extract information from the reports for further analysis. Previous studies have explored how mutual funds make use of CDS (Adam and Guttler, 2015; Jiang and Zhu, 2016), but they examined only a fraction of institutions over a short period of time. In this project, we aim to extract as much CDS-related information as possible from all the filings available to date to enable more thorough downstream analysis. Because this information appears not only in charts but also in words, Natural Language Processing (NLP) is the key.

Tools Used

  1. The core of this project can be recognised as a Named Entity Recognition task, so we implemented a BiLSTM-CRF model and a CRF model to conduct sequence labelling on unstructured data (see the sketch after this list). The implementation is still in progress and can be found here: https://github.com/sudhamstarun/AwesomeNER
  2. A RESTful web application was developed to serve as a Credit Default Swap search engine, making it easy for researchers and analysts to access all historical mentions of credit default swaps by simply searching for a counterparty or reference entity: https://github.com/sudhamstarun/Credit-Default-Swap-Search-Engine
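A minimal sketch of the CRF variant of this sequence-labelling setup, assuming the sklearn-crfsuite package is installed; the example sentence, BIO tags, and feature choices below are illustrative and not the project's actual feature set.

import sklearn_crfsuite

def token_features(sentence, i):
    # Simple per-token features: the word itself, casing, and neighbours.
    word = sentence[i]
    return {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "prev.lower": sentence[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": sentence[i + 1].lower() if i < len(sentence) - 1 else "<EOS>",
    }

# One hand-labelled sentence in BIO format (hypothetical example).
sentence = ["Fund", "sold", "protection", "to", "Morgan", "Stanley", "."]
labels   = ["O", "O", "O", "O", "B-COUNTERPARTY", "I-COUNTERPARTY", "O"]

X = [[token_features(sentence, i) for i in range(len(sentence))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)
print(crf.predict(X))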

Basic Folder Structure

  1. The Data Crawling folder contains the web crawling scripts, written in Python, that extract the N-CSR, N-CSRS, and N-Q reports from the SEC website (a sketch of this step follows the list below).
  2. The Data Preprocessing folder contains two further folders dedicated to:
    1. Restructuring Scripts: These scripts restructure the data extracted from the SEC website (148 GB) into its current folder hierarchy, shown in the image below. A noteworthy script is:
      1. restructure.sh: This script restructures the initial folder layout into three different folders for N-CSR, N-CSRS, and N-Q filings.
    2. Sentence Extraction: Python scripts that parse the HTML tags present in the reports and perform other tasks such as removing stop words and extracting the sentences that contain unstructured CDS information.
  3. Rule-Based Extraction: This folder contains the Python-based rule-based framework that extracts the tables containing CDS information and saves them in .csv format. This makes it extremely easy to convert reports from .NET format to .csv, simplifying visualisation and analysis of the data.
  4. Finally, the website folder contains the code for the landing page created for course requirements.
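A minimal sketch of the crawling step, assuming the SEC's public quarterly EDGAR form index; the form types match the filings named above, but the year, quarter, and output handling are illustrative (EDGAR also expects a descriptive User-Agent header, so the contact address here is a placeholder).

import requests

# Quarterly EDGAR form index; year and quarter are illustrative.
INDEX_URL = "https://www.sec.gov/Archives/edgar/full-index/2016/QTR1/form.idx"
HEADERS = {"User-Agent": "research project [email protected]"}  # hypothetical contact

FORM_TYPES = ("N-CSR", "N-CSRS", "N-Q")

index = requests.get(INDEX_URL, headers=HEADERS).text
for line in index.splitlines():
    # Each entry line begins with the form type; note the "N-CSR" prefix
    # also matches amendments such as N-CSR/A, which a real crawler
    # would filter explicitly.
    if line.startswith(FORM_TYPES):
        path = line.split()[-1]  # the last column is the filing path
        print("https://www.sec.gov/Archives/" + path)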

Installation and Demo

  1. Before running any of the scripts, make sure you set up a virtual environment and activate it.
  2. Then install all the necessary Python dependencies with:
pip3 install -r requirements.txt
  3. To run the sentence extraction script (sketched after this section), simply run:
python3 sentenceExtraction.py [name of the .txt or .html file]
  4. To run the HTML tag parsing script, run:
python3 HTML_Parser.py [name of the .txt or .html file]
  5. Finally, to run the table extractor script, run the following command:
python3 parserExtractor.py [name of the .txt or .html file]

The output of the table extractor script will be saved in the sample output folder.
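A minimal sketch of the kind of HTML stripping and CDS-sentence filtering the preprocessing scripts perform, assuming BeautifulSoup and NLTK are installed; the keyword pattern and tokenisation choices are illustrative rather than the project's exact logic.

import re
import sys
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")

# Illustrative pattern for CDS mentions in filing text.
KEYWORDS = re.compile(r"credit default swap|\bCDS\b", re.IGNORECASE)

def extract_cds_sentences(path):
    # Strip HTML tags from the filing, then keep sentences mentioning CDS.
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = BeautifulSoup(f.read(), "html.parser").get_text(" ")
    return [s for s in sent_tokenize(text) if KEYWORDS.search(s)]

if __name__ == "__main__":
    for sentence in extract_cds_sentences(sys.argv[1]):
        print(sentence)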

Authors:

Tarun Sudhams and Varun Vamsi
