liuaiting / Financial-News-Analysis

Licence: Apache-2.0 license
China Merchants Bank FinTech competition, second round: financial news analysis


Financial News Analysis


Background

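  • Financial news is an important but voluminous source of investment data, and it constantly influences investors' decisions. To better alert clients to the investment opportunities and risks tied to current news events, this task aims to build a "historical event matching" (历史事件连连看) feature: given the content of a current news item, retrieve similar news reports from historical events. Later, events can be combined with market data to help clients choose an investment strategy.
  • For each item in the test set, participants must find the 20 most similar news items (excluding the test item itself). Submissions are compared against the ground truth, with mAP as the evaluation metric.

Evaluation metric

Submissions are scored by mAP over the top 20 results returned for each test item. The scoring formula uses the following notation.

D is the total number of news items in the test set. Yd is the unordered set of the n true similar news items for news item d, Yd = {Yd1, Yd2, Yd3, ..., Ydn}. Zd is the ordered set of the m similar news items submitted by the participant (m = 20 in this task), Zd = {Zd1, Zd2, Zd3, ..., Zdm}. The elements of Zd have ranks 1, 2, 3, ..., m, denoted ri. For a set K, |K| is the number of elements in K, so |Yd| = n.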
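The scoring formula itself is not reproduced in the text, so the sketch below shows the commonly used definition of mAP at cutoff m rather than the organizers' exact formula. It assumes that precision is accumulated at each rank where a submitted item is in Yd and that the sum is normalized by min(|Yd|, m); the function names are hypothetical.

```python
def average_precision(predicted, relevant, m=20):
    """Average precision of one ranked list Zd against the true set Yd."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    seen = set()
    for rank, item in enumerate(predicted[:m], start=1):
        # Count each true item at most once, at the rank where it first appears.
        if item in relevant and item not in seen:
            seen.add(item)
            hits += 1
            total += hits / rank  # precision at this rank
    # Assumption: normalize by min(|Yd|, m); the official formula may differ.
    return total / min(len(relevant), m)


def mean_average_precision(predictions, truths, m=20):
    """Mean of average_precision over all D test items.

    predictions: dict mapping test id -> ranked list of candidate ids
    truths:      dict mapping test id -> collection of true similar ids
    """
    scores = [
        average_precision(ranked, truths.get(test_id, ()), m)
        for test_id, ranked in predictions.items()
    ]
    return sum(scores) / len(scores)
```

For a ranked list [1, 2, 3] with true set {1, 3}, the hits at ranks 1 and 3 contribute 1/1 and 2/3, which gives (1 + 2/3) / 2 under this normalization.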

Data

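Training data: train_data.csv
id: ID of the news item in the training set
title: news item (headline) in the training set
Test data: test_data.csv
id: ID of the news item in the test set
title: news item (headline) in the test set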

Setup

  • Python 3.6

Requirements

pip install -r requirements.txt

Results

Algorithm                          mAP
sequence-overlap                   0.0587
average-word2vec                   0.0770
LSI(num_topics=1000)               0.0854
LSI(num_topics=2000)               0.0870
simhash                            0.0310
bm25(jieba)                        0.1137
bm25(jieba, stop words removed)    0.1058
bm25(char)                         0.0850
bm25(thulac)                       0.0932
bm25(NLPIR)                        0.0957

Models

bm25

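bm25_model.py: this model performs best, so it is the only one whose code is provided.
Place train_data.csv and test_data.csv in the path_corpus folder, then run bm25_model.py.
  • Build a bag-of-words model

  • Build the BM25 model with gensim

  • Compute the average inverse document frequency, following the gensim source code

  • Use the BM25 model to score the relevance of every document to the query (using the gensim library)

  • Select the 20 most relevant documents

  • Tuning the parameters k1 and b can give better results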
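The steps above can be illustrated with a minimal, dependency-free sketch of Okapi BM25 scoring. This is not the code in bm25_model.py and does not use gensim. The class name `BM25`, the method `top_k`, and the smoothed idf variant are assumptions made for illustration, and pre-tokenized word lists stand in for jieba output.

```python
import math
from collections import Counter


class BM25:
    """Illustrative Okapi BM25 scorer over a tokenized corpus (hypothetical helper)."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.term_freqs = [Counter(doc) for doc in corpus]  # bag of words per document
        self.doc_lens = [len(doc) for doc in corpus]
        self.avg_len = sum(self.doc_lens) / len(corpus)
        n_docs = len(corpus)
        doc_freq = Counter(term for doc in corpus for term in set(doc))
        # Assumption: a smoothed idf that stays positive; gensim may compute idf differently.
        self.idf = {
            term: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def score(self, query, index):
        """BM25 relevance of document `index` to the tokenized query."""
        tf = self.term_freqs[index]
        # Length normalization controlled by b; term saturation controlled by k1.
        norm = self.k1 * (1 - self.b + self.b * self.doc_lens[index] / self.avg_len)
        return sum(
            self.idf[term] * tf[term] * (self.k1 + 1) / (tf[term] + norm)
            for term in query
            if term in tf
        )

    def top_k(self, query, k=20, exclude=None):
        """Indices of the k highest-scoring documents, optionally skipping one index."""
        scored = [
            (self.score(query, i), i)
            for i in range(len(self.term_freqs))
            if i != exclude
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [i for _, i in scored[:k]]


corpus = [
    ["bank", "rate", "cut"],
    ["stock", "market", "rally"],
    ["bank", "stock", "merger"],
]
model = BM25(corpus, k1=1.5, b=0.75)
ranking = model.top_k(["bank", "rate"], k=2)
```

Here k1 limits how much repeated occurrences of a term can add to the score, and b sets how strongly long documents are penalized. The `exclude` argument is one way to keep a query document out of its own result list.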

bm25               mAP
k1=1.5, b=0.75     0.1137
k1=1.5, b=0.85     0.1182
k1=1, b=1          0.12
k1=1.2, b=0.9
k1=1.4, b=0.85     0.1185

average-word2vec

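  1. Segment the training data into words with jieba
  2. Train word vectors on the segmented training corpus with Google's word2vec tool (command line: ./word2vec -train ../path_corpus/corpus_train.txt -output ../path_corpus/vec.txt -cbow 1 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -iter 10 -binary 0 -min-count 0 -save-vocab ../path_corpus/vocab.txt)
  3. For each sample in the segmented training corpus, sum the vectors of the words in the sentence element-wise and take the average as the sentence vector
  4. Use cosine similarity as the text similarity measure and take the top 20 as the final result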
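Steps 3 and 4 can be sketched in plain Python as follows. The function names are hypothetical, the tiny two-dimensional `vectors` dictionary stands in for the 200-dimensional vectors in vec.txt, and skipping out-of-vocabulary words is an assumption about how they are handled.

```python
import math


def sentence_vector(tokens, vectors, dim):
    """Element-wise average of the word vectors of the tokens found in `vectors`."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim  # assumption: sentences with no known words get a zero vector
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_vec, doc_vecs, k=20):
    """Indices of the k document vectors closest to the query by cosine similarity."""
    ranked = sorted(range(len(doc_vecs)), key=lambda i: -cosine(query_vec, doc_vecs[i]))
    return ranked[:k]


vectors = {"bank": [1.0, 0.0], "stock": [0.0, 1.0]}  # toy embeddings for illustration
doc_vecs = [
    sentence_vector(["stock"], vectors, 2),
    sentence_vector(["bank"], vectors, 2),
    sentence_vector(["bank", "stock"], vectors, 2),
]
nearest = most_similar(sentence_vector(["bank"], vectors, 2), doc_vecs, k=2)
```

Averaging discards word order, which may be part of why this approach scores below BM25 on short headlines.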

sequence-overlap

  1. This model does not require word segmentation of the training corpus
  2. Text similarity is measured by the degree of string overlap
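The README does not say which overlap measure is used. As one possible reading, the sketch below scores overlap with the matching-block ratio from the standard library's `difflib.SequenceMatcher`, working directly on characters so no segmentation is needed. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def overlap(a, b):
    """Character-level overlap in [0, 1]: 2 * matched characters / total characters."""
    return SequenceMatcher(None, a, b).ratio()


def rank_by_overlap(query, titles, k=20):
    """Indices of the k titles that overlap most with the query string."""
    ranked = sorted(range(len(titles)), key=lambda i: -overlap(query, titles[i]))
    return ranked[:k]


titles = ["rate cut expected", "stocks rally", "central bank cuts rates"]
best = rank_by_overlap("central bank rate cut", titles, k=1)
```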

LSI

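  1. Segment the text into words and remove stop words
  2. Vectorize the text with a bag-of-words model
  3. Vectorize the text with a TF-IDF model
  4. Vectorize the text with an LSI model (using the gensim library)
  5. Compute the similarity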

simhash

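  1. Clean and filter the text, then extract n feature keywords
  2. Weight the features with tf-idf
  3. Hash each keyword into a signature made of 0s and 1s (6 bits in the example)
  4. Apply the weights: for each bit of each 6-bit signature, add the keyword's weight if the bit is 1 and subtract it if the bit is 0. This gives a weighted vector for each feature.
  5. Sum all feature vectors into one final vector, then reduce it: each position greater than 0 becomes 1 and every other position becomes 0. The result is the final simhash fingerprint (using the simhash library).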
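A minimal sketch of steps 3 to 5, using only the standard library rather than the simhash package the project relies on. The function names are hypothetical, the 64-bit width and the MD5-based feature hash are assumptions chosen for illustration, and the (keyword, weight) pairs are taken as given from steps 1 and 2.

```python
import hashlib


def feature_hash(term, bits=64):
    """Hash a keyword to a `bits`-bit integer (bits must be a multiple of 8, at most 128)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(weighted_terms, bits=64):
    """Fingerprint of an iterable of (keyword, weight) pairs."""
    totals = [0.0] * bits
    for term, weight in weighted_terms:
        h = feature_hash(term, bits)
        for i in range(bits):
            # Bit set: add the weight. Bit clear: subtract the weight.
            totals[i] += weight if (h >> i) & 1 else -weight
    # Positions with a positive total become 1, all others 0.
    return sum(1 << i for i in range(bits) if totals[i] > 0)


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


doc_a = [("bank", 2.0), ("rate", 1.5), ("cut", 1.0)]
doc_b = [("bank", 2.0), ("rate", 1.5), ("rise", 1.0)]
distance = hamming_distance(simhash(doc_a), simhash(doc_b))
```

Two texts are treated as similar when the Hamming distance between their fingerprints is small. With only a handful of keywords per headline the fingerprint carries little information, which is consistent with simhash having the lowest mAP in the results above.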