NLP-Based Stock Price Prediction

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
data = pd.read_csv('Combined_News_DJIA.csv')
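  • Each row holds one trading day of market data (the file name points to the Dow Jones Industrial Average): Label indicates whether the market rose or fell that day, and Top1 to Top25 are that day's news headlines ranked by importance.
  • NLP processing turns these strings into a numeric representation that a model can work with.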
data.head()
Date Label Top1 Top2 Top3 Top4 Top5 Top6 Top7 Top8 ... Top16 Top17 Top18 Top19 Top20 Top21 Top22 Top23 Top24 Top25
0 2008-08-08 0 b"Georgia 'downs two Russian warplanes' as cou... b'BREAKING: Musharraf to be impeached.' b'Russia Today: Columns of troops roll into So... b'Russian tanks are moving towards the capital... b"Afghan children raped with 'impunity,' U.N. ... b'150 Russian tanks have entered South Ossetia... b"Breaking: Georgia invades South Ossetia, Rus... b"The 'enemy combatent' trials are nothing but... ... b'Georgia Invades South Ossetia - if Russia ge... b'Al-Qaeda Faces Islamist Backlash' b'Condoleezza Rice: "The US would not act to p... b'This is a busy day: The European Union has ... b"Georgia will withdraw 1,000 soldiers from Ir... b'Why the Pentagon Thinks Attacking Iran is a ... b'Caucasus in crisis: Georgia invades South Os... b'Indian shoe manufactory - And again in a se... b'Visitors Suffering from Mental Illnesses Ban... b"No Help for Mexico's Kidnapping Surge"
1 2008-08-11 1 b'Why wont America and Nato help us? If they w... b'Bush puts foot down on Georgian conflict' b"Jewish Georgian minister: Thanks to Israeli ... b'Georgian army flees in disarray as Russians ... b"Olympic opening ceremony fireworks 'faked'" b'What were the Mossad with fraudulent New Zea... b'Russia angered by Israeli military sale to G... b'An American citizen living in S.Ossetia blam... ... b'Israel and the US behind the Georgian aggres... b'"Do not believe TV, neither Russian nor Geor... b'Riots are still going on in Montreal (Canada... b'China to overtake US as largest manufacturer' b'War in South Ossetia [PICS]' b'Israeli Physicians Group Condemns State Tort... b' Russia has just beaten the United States ov... b'Perhaps *the* question about the Georgia - R... b'Russia is so much better at war' b"So this is what it's come to: trading sex fo...
2 2008-08-12 0 b'Remember that adorable 9-year-old who sang a... b"Russia 'ends Georgia operation'" b'"If we had no sexual harassment we would hav... b"Al-Qa'eda is losing support in Iraq because ... b'Ceasefire in Georgia: Putin Outmaneuvers the... b'Why Microsoft and Intel tried to kill the XO... b'Stratfor: The Russo-Georgian War and the Bal... b"I'm Trying to Get a Sense of This Whole Geor... ... b'U.S. troops still in Georgia (did you know t... b'Why Russias response to Georgia was right' b'Gorbachev accuses U.S. of making a "serious ... b'Russia, Georgia, and NATO: Cold War Two' b'Remember that adorable 62-year-old who led y... b'War in Georgia: The Israeli connection' b'All signs point to the US encouraging Georgi... b'Christopher King argues that the US and NATO... b'America: The New Mexico?' b"BBC NEWS | Asia-Pacific | Extinction 'by man...
3 2008-08-13 0 b' U.S. refuses Israel weapons to attack Iran:... b"When the president ordered to attack Tskhinv... b' Israel clears troops who killed Reuters cam... b'Britain\'s policy of being tough on drugs is... b'Body of 14 year old found in trunk; Latest (... b'China has moved 10 *million* quake survivors... b"Bush announces Operation Get All Up In Russi... b'Russian forces sink Georgian ships ' ... b'Elephants extinct by 2020?' b'US humanitarian missions soon in Georgia - i... b"Georgia's DDOS came from US sources" b'Russian convoy heads into Georgia, violating... b'Israeli defence minister: US against strike ... b'Gorbachev: We Had No Choice' b'Witness: Russian forces head towards Tbilisi... b' Quarter of Russians blame U.S. for conflict... b'Georgian president says US military will ta... b'2006: Nobel laureate Aleksander Solzhenitsyn...
4 2008-08-14 1 b'All the experts admit that we should legalis... b'War in South Osetia - 89 pictures made by a ... b'Swedish wrestler Ara Abrahamian throws away ... b'Russia exaggerated the death toll in South O... b'Missile That Killed 9 Inside Pakistan May Ha... b"Rushdie Condemns Random House's Refusal to P... b'Poland and US agree to missle defense deal. ... b'Will the Russians conquer Tblisi? Bet on it,... ... b'Bank analyst forecast Georgian crisis 2 days... b"Georgia confict could set back Russia's US r... b'War in the Caucasus is as much the product o... b'"Non-media" photos of South Ossetia/Georgia ... b'Georgian TV reporter shot by Russian sniper ... b'Saudi Arabia: Mother moves to block child ma... b'Taliban wages war on humanitarian aid workers' b'Russia: World "can forget about" Georgia\'s... b'Darfur rebels accuse Sudan of mounting major... b'Philippines : Peace Advocate say Muslims nee...

5 rows × 27 columns

1 Simple data preprocessing and splitting

# Split into training and test sets by date
train = data[data['Date'] < '2015-01-01']
test = data[data['Date'] > '2014-12-31']
example = train.iloc[3,10]
print(example)
b"The commander of a Navy air reconnaissance squadron that provides the President and the defense secretary the airborne ability to command the nation's nuclear weapons has been relieved of duty"
example2 = example.lower()
print(example2)
b"the commander of a navy air reconnaissance squadron that provides the president and the defense secretary the airborne ability to command the nation's nuclear weapons has been relieved of duty"
example3 = CountVectorizer().build_tokenizer()(example2)
print(example3)
['the', 'commander', 'of', 'navy', 'air', 'reconnaissance', 'squadron', 'that', 'provides', 'the', 'president', 'and', 'the', 'defense', 'secretary', 'the', 'airborne', 'ability', 'to', 'command', 'the', 'nation', 'nuclear', 'weapons', 'has', 'been', 'relieved', 'of', 'duty']
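The single-character tokens "a" and "s" (from "nation's") are missing from the list above, and so is the leading b. This is consistent with CountVectorizer's default token pattern, which keeps only runs of two or more word characters. A minimal pure-Python sketch of that behavior, assuming the default pattern r"(?u)\b\w\w+\b" and using a shortened, made-up sentence rather than the real headline:

```python
import re

# CountVectorizer's default token pattern: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical shortened headline, already lowercased, with the b"..." wrapper kept.
text = "b\"the commander of a navy squadron commands the nation's nuclear weapons\""

tokens = TOKEN_PATTERN.findall(text)
print(tokens)
```

Because one-letter tokens are discarded, the stray b prefix in the raw strings does not end up in the vocabulary.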
pd.DataFrame([[x,example3.count(x)] for x in set(example3)], columns = ['Word', 'Count'])
Word Count
0 the 5
1 command 1
2 secretary 1
3 weapons 1
4 has 1
5 defense 1
6 commander 1
7 squadron 1
8 relieved 1
9 navy 1
10 of 2
11 air 1
12 reconnaissance 1
13 provides 1
14 president 1
15 been 1
16 to 1
17 and 1
18 ability 1
19 nation 1
20 that 1
21 duty 1
22 nuclear 1
23 airborne 1
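The list comprehension above calls example3.count once per distinct word, which rescans the whole list each time. As an alternative sketch (not the original author's code), collections.Counter from the standard library tallies the tokens in a single pass. The token list here is a shortened stand-in for example3:

```python
from collections import Counter

# Shortened stand-in for example3 (tokens of one lowercased headline).
tokens = ['the', 'commander', 'of', 'navy', 'air', 'squadron',
          'the', 'president', 'and', 'the', 'defense', 'of', 'duty']

counts = Counter(tokens)

# most_common() returns (word, count) pairs sorted by descending count.
for word, count in counts.most_common(3):
    print(word, count)
```

The resulting counts could be passed to pd.DataFrame in the same way as the list of pairs above.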

2 Frequency-based feature extraction: building the term-frequency matrix

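1) Build a list of strings (trainheadlines in the code), where each element is the long string formed by joining all Top headline strings of the corresponding row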

trainheadlines = []
for row in range(0, len(train.index)):
    trainheadlines.append(' '.join(str(x) for x in train.iloc[row, 2:27]))
print(trainheadlines[0:1])
['b"Georgia \'downs two Russian warplanes\' as countries move to brink of war" b\'BREAKING: Musharraf to be impeached.\' b\'Russia Today: Columns of troops roll into South Ossetia; footage from fighting (YouTube)\' b\'Russian tanks are moving towards the capital of South Ossetia, which has reportedly been completely destroyed by Georgian artillery fire\' b"Afghan children raped with \'impunity,\' U.N. official says - this is sick, a three year old was raped and they do nothing" b\'150 Russian tanks have entered South Ossetia whilst Georgia shoots down two Russian jets.\' b"Breaking: Georgia invades South Ossetia, Russia warned it would intervene on SO\'s side" b"The \'enemy combatent\' trials are nothing but a sham: Salim Haman has been sentenced to 5 1/2 years, but will be kept longer anyway just because they feel like it." b\'Georgian troops retreat from S. Osettain capital, presumably leaving several hundred people killed. [VIDEO]\' b\'Did the U.S. Prep Georgia for War with Russia?\' b\'Rice Gives Green Light for Israel to Attack Iran: Says U.S. has no veto over Israeli military ops\' b\'Announcing:Class Action Lawsuit on Behalf of American Public Against the FBI\' b"So---Russia and Georgia are at war and the NYT\'s top story is opening ceremonies of the Olympics?  What a fucking disgrace and yet further proof of the decline of journalism." b"China tells Bush to stay out of other countries\' affairs" b\'Did World War III start today?\' b\'Georgia Invades South Ossetia - if Russia gets involved, will NATO absorb Georgia and unleash a full scale war?\' b\'Al-Qaeda Faces Islamist Backlash\' b\'Condoleezza Rice: "The US would not act to prevent an Israeli strike on Iran." 
Israeli Defense Minister Ehud Barak: "Israel is prepared for uncompromising victory in the case of military hostilities."\' b\'This is a busy day:  The European Union has approved new sanctions against Iran in protest at its nuclear programme.\' b"Georgia will withdraw 1,000 soldiers from Iraq to help fight off Russian forces in Georgia\'s breakaway region of South Ossetia" b\'Why the Pentagon Thinks Attacking Iran is a Bad Idea - US News &amp; World Report\' b\'Caucasus in crisis: Georgia invades South Ossetia\' b\'Indian shoe manufactory  - And again in a series of "you do not like your work?"\' b\'Visitors Suffering from Mental Illnesses Banned from Olympics\' b"No Help for Mexico\'s Kidnapping Surge"']

2) Convert this list of strings into a term-frequency matrix so that it can serve as the training set

basicvectorizer = CountVectorizer()
basictrain = basicvectorizer.fit_transform(trainheadlines)
print(basictrain.shape)  # term-frequency matrix: 1611 samples, 31675 distinct words
(1611, 31675)
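To make the shape of this matrix concrete, here is a minimal sketch on two made-up documents (not taken from the dataset). Each row is a document, each column is a vocabulary word, and each cell is the number of times that word occurs in that document:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents standing in for the joined headline strings.
docs = ["russia georgia war", "georgia war war ends"]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each word to its column index; sort words by that index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(matrix.toarray())
```

On the real data the same call yields 1611 rows (one per training day) and 31675 columns (one per distinct word).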

3 Train a logistic regression, then inspect the accuracy and the per-word weights coef_

# Fit a logistic regression on the training term-frequency matrix
basicmodel = LogisticRegression()
basicmodel = basicmodel.fit(basictrain, train["Label"])
testheadlines = []
for row in range(0, len(test.index)):
    testheadlines.append(' '.join(str(x) for x in test.iloc[row, 2:27]))
basictest = basicvectorizer.transform(testheadlines)
# Predict on the test term-frequency matrix with the fitted logistic regression
predictions = basicmodel.predict(basictest)
# Build a simple confusion matrix
pd.crosstab(test["Label"], predictions, rownames=["Actual"], colnames=["Predicted"])
# accuracy: (61 + 100) / 378 ≈ 0.426
Predicted 0 1
Actual
0 61 125
1 92 100
Observation: the accuracy computed from the confusion matrix is only about 42.6%, which is not satisfactory.
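The accuracy figure follows directly from the confusion matrix: correct predictions sit on the diagonal, and the total is the sum of all four cells. A short sketch of that arithmetic, using the numbers printed above:

```python
# Confusion matrix from the crosstab above: rows = actual, columns = predicted.
confusion = [[61, 125],
             [92, 100]]

correct = confusion[0][0] + confusion[1][1]   # predictions on the diagonal
total = sum(sum(row) for row in confusion)    # all test samples
accuracy = correct / total

print(f"{correct}/{total} = {accuracy:.3f}")
```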
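# All words (features) in the vectorizer's vocabulary.
# On scikit-learn older than 1.0, use get_feature_names() instead.
basicwords = basicvectorizer.get_feature_names_out()
basiccoeffs = basicmodel.coef_.tolist()[0]  # weight learned by the logistic model for each word
coeffdf = pd.DataFrame({'Word': basicwords,
                        'Coefficient': basiccoeffs})
coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[False, True])  # largest coefficient first
coeffdf.head(10)  # top 10: most positive coefficients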
Coefficient Word
19419 0.497924 nigeria
25261 0.452526 self
29286 0.428011 tv
15998 0.425863 korea
20135 0.425716 olympics
15843 0.411636 kills
26323 0.411267 so
29256 0.394855 turn
10874 0.388555 fears
28274 0.384031 territory
coeffdf.tail(10)  # bottom 10: most negative coefficients
Coefficient Word
27299 -0.424441 students
8478 -0.427079 did
6683 -0.431925 congo
12818 -0.444069 hacking
7139 -0.448570 country
16949 -0.463116 low
3651 -0.470454 begin
25433 -0.494555 sex
24754 -0.549725 sanctions
24542 -0.587794 run
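One way to read these weights: in a logistic regression, a coefficient w on a count feature means that each extra occurrence of the word multiplies the predicted odds of Label = 1 by exp(w), with all other features held fixed. A small sketch using two coefficients copied from the tables above:

```python
import math

# Coefficients copied from the tables above.
coefficients = {"nigeria": 0.497924, "run": -0.587794}

# exp(w) is the multiplicative change in the odds of Label = 1 per occurrence.
odds_ratios = {word: math.exp(w) for word, w in coefficients.items()}

for word, ratio in odds_ratios.items():
    print(f"{word}: odds multiplied by {ratio:.3f} per occurrence")
```

With 31675 features and only 1611 training rows, individual weights like these may well reflect noise rather than a real relationship, so they are best treated as a diagnostic rather than a finding.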

4 Improve the feature extraction: use two-word phrases (bigrams) as tokens and build a new frequency matrix

advancedvectorizer = CountVectorizer(ngram_range=(2,2))
advancedtrain = advancedvectorizer.fit_transform(trainheadlines)
print(advancedtrain.shape)
(1611, 366721)
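The jump from 31675 to 366721 columns follows from how bigrams are formed: every pair of adjacent tokens becomes its own feature, and most pairs are rare. A pure-Python sketch of the idea, meant as an illustration of what ngram_range=(2,2) produces for word tokens and not as scikit-learn's actual implementation (the helper name bigrams is hypothetical):

```python
import re

# CountVectorizer's default token pattern: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def bigrams(text):
    """Return the space-joined pairs of adjacent lowercase tokens in text."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(bigrams("Russia ends Georgia operation"))
```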
advancedmodel = LogisticRegression()
advancedmodel = advancedmodel.fit(advancedtrain, train["Label"])
testheadlines = []
for row in range(0, len(test.index)):
    testheadlines.append(' '.join(str(x) for x in test.iloc[row, 2:27]))
advancedtest = advancedvectorizer.transform(testheadlines)
advpredictions = advancedmodel.predict(advancedtest)
pd.crosstab(test["Label"], advpredictions, rownames=["Actual"], colnames=["Predicted"])
# accuracy: (66 + 147) / 378 ≈ 0.563
Predicted 0 1
Actual
0 66 120
1 45 147
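To judge whether this result is meaningful, it helps to compare it with a trivial baseline that always predicts the more frequent class. The sketch below derives both numbers from the confusion matrix printed above. The baseline is an added point of comparison, not something computed in the original notebook:

```python
# Bigram-model confusion matrix from the crosstab above:
# rows = actual, columns = predicted.
confusion = [[66, 120],
             [45, 147]]

total = sum(sum(row) for row in confusion)
accuracy = (confusion[0][0] + confusion[1][1]) / total

# Baseline: always predict whichever actual class is more frequent.
class_sizes = [sum(row) for row in confusion]   # actual 0s and actual 1s
baseline = max(class_sizes) / total

print(f"accuracy = {accuracy:.3f}, majority baseline = {baseline:.3f}")
```

The bigram model beats the baseline by only a few percentage points, and the matrix shows that it predicts class 1 for most test days.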
# On scikit-learn older than 1.0, use get_feature_names() instead.
advwords = advancedvectorizer.get_feature_names_out()
advcoeffs = advancedmodel.coef_.tolist()[0]
advcoeffdf = pd.DataFrame({'Words': advwords,
                           'Coefficient': advcoeffs})
advcoeffdf = advcoeffdf.sort_values(['Coefficient', 'Words'], ascending=[False, True])
advcoeffdf.head(10)
Coefficient Words
272047 0.286533 right to
24710 0.275274 and other
285392 0.274698 set to
316194 0.262873 the first
157511 0.227943 in china
159522 0.224184 in south
125870 0.219130 found in
124411 0.216726 forced to
173246 0.211137 it has
322590 0.209239 this is
advcoeffdf.tail(10)
Coefficient Words
326846 -0.198495 to help
118707 -0.201654 fire on
155038 -0.209702 if he
242528 -0.211303 people are
31669 -0.213362 around the
321333 -0.215699 there is
327113 -0.221812 to kill
340714 -0.226289 up in
358917 -0.227516 with iran
315485 -0.331153 the country