NLP-Based Stock Price Prediction
import pandas as pd
data = pd.read_csv('Combined_News_DJIA.csv')
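- Each row holds one trading day of market data (the file name points to the Dow Jones Industrial Average, DJIA): Label records whether the market rose or fell that day, and Top1 to Top25 are that day's news headlines ordered by importance.
- NLP processing turns these headline strings into a numeric representation that a model can work with.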
data.head()
| | Date | Label | Top1 | Top2 | Top3 | Top4 | Top5 | Top6 | Top7 | Top8 | ... | Top16 | Top17 | Top18 | Top19 | Top20 | Top21 | Top22 | Top23 | Top24 | Top25 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2008-08-08 | 0 | b"Georgia 'downs two Russian warplanes' as cou... | b'BREAKING: Musharraf to be impeached.' | b'Russia Today: Columns of troops roll into So... | b'Russian tanks are moving towards the capital... | b"Afghan children raped with 'impunity,' U.N. ... | b'150 Russian tanks have entered South Ossetia... | b"Breaking: Georgia invades South Ossetia, Rus... | b"The 'enemy combatent' trials are nothing but... | ... | b'Georgia Invades South Ossetia - if Russia ge... | b'Al-Qaeda Faces Islamist Backlash' | b'Condoleezza Rice: "The US would not act to p... | b'This is a busy day: The European Union has ... | b"Georgia will withdraw 1,000 soldiers from Ir... | b'Why the Pentagon Thinks Attacking Iran is a ... | b'Caucasus in crisis: Georgia invades South Os... | b'Indian shoe manufactory - And again in a se... | b'Visitors Suffering from Mental Illnesses Ban... | b"No Help for Mexico's Kidnapping Surge" |
1 | 2008-08-11 | 1 | b'Why wont America and Nato help us? If they w... | b'Bush puts foot down on Georgian conflict' | b"Jewish Georgian minister: Thanks to Israeli ... | b'Georgian army flees in disarray as Russians ... | b"Olympic opening ceremony fireworks 'faked'" | b'What were the Mossad with fraudulent New Zea... | b'Russia angered by Israeli military sale to G... | b'An American citizen living in S.Ossetia blam... | ... | b'Israel and the US behind the Georgian aggres... | b'"Do not believe TV, neither Russian nor Geor... | b'Riots are still going on in Montreal (Canada... | b'China to overtake US as largest manufacturer' | b'War in South Ossetia [PICS]' | b'Israeli Physicians Group Condemns State Tort... | b' Russia has just beaten the United States ov... | b'Perhaps *the* question about the Georgia - R... | b'Russia is so much better at war' | b"So this is what it's come to: trading sex fo... |
2 | 2008-08-12 | 0 | b'Remember that adorable 9-year-old who sang a... | b"Russia 'ends Georgia operation'" | b'"If we had no sexual harassment we would hav... | b"Al-Qa'eda is losing support in Iraq because ... | b'Ceasefire in Georgia: Putin Outmaneuvers the... | b'Why Microsoft and Intel tried to kill the XO... | b'Stratfor: The Russo-Georgian War and the Bal... | b"I'm Trying to Get a Sense of This Whole Geor... | ... | b'U.S. troops still in Georgia (did you know t... | b'Why Russias response to Georgia was right' | b'Gorbachev accuses U.S. of making a "serious ... | b'Russia, Georgia, and NATO: Cold War Two' | b'Remember that adorable 62-year-old who led y... | b'War in Georgia: The Israeli connection' | b'All signs point to the US encouraging Georgi... | b'Christopher King argues that the US and NATO... | b'America: The New Mexico?' | b"BBC NEWS | Asia-Pacific | Extinction 'by man... |
3 | 2008-08-13 | 0 | b' U.S. refuses Israel weapons to attack Iran:... | b"When the president ordered to attack Tskhinv... | b' Israel clears troops who killed Reuters cam... | b'Britain\'s policy of being tough on drugs is... | b'Body of 14 year old found in trunk; Latest (... | b'China has moved 10 *million* quake survivors... | b"Bush announces Operation Get All Up In Russi... | b'Russian forces sink Georgian ships ' | ... | b'Elephants extinct by 2020?' | b'US humanitarian missions soon in Georgia - i... | b"Georgia's DDOS came from US sources" | b'Russian convoy heads into Georgia, violating... | b'Israeli defence minister: US against strike ... | b'Gorbachev: We Had No Choice' | b'Witness: Russian forces head towards Tbilisi... | b' Quarter of Russians blame U.S. for conflict... | b'Georgian president says US military will ta... | b'2006: Nobel laureate Aleksander Solzhenitsyn... |
4 | 2008-08-14 | 1 | b'All the experts admit that we should legalis... | b'War in South Osetia - 89 pictures made by a ... | b'Swedish wrestler Ara Abrahamian throws away ... | b'Russia exaggerated the death toll in South O... | b'Missile That Killed 9 Inside Pakistan May Ha... | b"Rushdie Condemns Random House's Refusal to P... | b'Poland and US agree to missle defense deal. ... | b'Will the Russians conquer Tblisi? Bet on it,... | ... | b'Bank analyst forecast Georgian crisis 2 days... | b"Georgia confict could set back Russia's US r... | b'War in the Caucasus is as much the product o... | b'"Non-media" photos of South Ossetia/Georgia ... | b'Georgian TV reporter shot by Russian sniper ... | b'Saudi Arabia: Mother moves to block child ma... | b'Taliban wages war on humanitarian aid workers' | b'Russia: World "can forget about" Georgia\'s... | b'Darfur rebels accuse Sudan of mounting major... | b'Philippines : Peace Advocate say Muslims nee... |
5 rows × 27 columns
1 Simple data preprocessing and train/test split
# Split into a training set and a test set by date
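Only the comment of the split cell survives above, so the following is a minimal sketch of a date-based split rather than the original code. The cutoff date and the toy frame are assumptions made purely for illustration.

```python
import pandas as pd

# Toy stand-in for the real CSV (values are made up for illustration).
toy = pd.DataFrame({
    "Date": ["2014-12-30", "2014-12-31", "2015-01-02", "2015-01-05"],
    "Label": [1, 0, 1, 0],
    "Top1": ["a b", "c d", "e f", "g h"],
})

# ISO-formatted date strings sort chronologically, so a plain string
# comparison is enough. The cutoff itself is an assumed value.
cutoff = "2015-01-01"
toy_train = toy[toy["Date"] < cutoff]
toy_test = toy[toy["Date"] >= cutoff]

print(len(toy_train), len(toy_test))
```

Splitting by date rather than at random keeps every test day strictly later than every training day, which matches how such a model would be used in practice.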
example = train.iloc[3,10]
b"The commander of a Navy air reconnaissance squadron that provides the President and the defense secretary the airborne ability to command the nation's nuclear weapons has been relieved of duty"
example2 = example.lower()
b"the commander of a navy air reconnaissance squadron that provides the president and the defense secretary the airborne ability to command the nation's nuclear weapons has been relieved of duty"
from sklearn.feature_extraction.text import CountVectorizer
example3 = CountVectorizer().build_tokenizer()(example2)
['the', 'commander', 'of', 'navy', 'air', 'reconnaissance', 'squadron', 'that', 'provides', 'the', 'president', 'and', 'the', 'defense', 'secretary', 'the', 'airborne', 'ability', 'to', 'command', 'the', 'nation', 'nuclear', 'weapons', 'has', 'been', 'relieved', 'of', 'duty']
pd.DataFrame([[x,example3.count(x)] for x in set(example3)], columns = ['Word', 'Count'])
| | Word | Count |
---|---|---|
0 | the | 5 |
1 | command | 1 |
2 | secretary | 1 |
3 | weapons | 1 |
4 | has | 1 |
5 | defense | 1 |
6 | commander | 1 |
7 | squadron | 1 |
8 | relieved | 1 |
9 | navy | 1 |
10 | of | 2 |
11 | air | 1 |
12 | reconnaissance | 1 |
13 | provides | 1 |
14 | president | 1 |
15 | been | 1 |
16 | to | 1 |
17 | and | 1 |
18 | ability | 1 |
19 | nation | 1 |
20 | that | 1 |
21 | duty | 1 |
22 | nuclear | 1 |
23 | airborne | 1 |
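The token list above has no "a" and no "s" (from "nation's"). That follows from the default `token_pattern` of `CountVectorizer`, which keeps only runs of two or more word characters. A small sketch that reproduces the same counts with the standard library, reusing that regular expression:

```python
import re
from collections import Counter

sentence = ("The commander of a Navy air reconnaissance squadron that provides "
            "the President and the defense secretary the airborne ability to "
            "command the nation's nuclear weapons has been relieved of duty")

# Same regular expression as CountVectorizer's default token_pattern:
# tokens are runs of at least two word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

tokens = token_pattern.findall(sentence.lower())
counts = Counter(tokens)

print(len(tokens), len(counts))
print(counts.most_common(2))
```

The 29 tokens collapse to 24 distinct words, which matches the 24 rows of the table above.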
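2 Word-frequency feature extraction: building the term-frequency matrix
1) Build a list of strings, trainheadlines, in which each element is one long string made by joining all the Top headline strings of the corresponding row
trainheadlines = []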
['b"Georgia \'downs two Russian warplanes\' as countries move to brink of war" b\'BREAKING: Musharraf to be impeached.\' b\'Russia Today: Columns of troops roll into South Ossetia; footage from fighting (YouTube)\' b\'Russian tanks are moving towards the capital of South Ossetia, which has reportedly been completely destroyed by Georgian artillery fire\' b"Afghan children raped with \'impunity,\' U.N. official says - this is sick, a three year old was raped and they do nothing" b\'150 Russian tanks have entered South Ossetia whilst Georgia shoots down two Russian jets.\' b"Breaking: Georgia invades South Ossetia, Russia warned it would intervene on SO\'s side" b"The \'enemy combatent\' trials are nothing but a sham: Salim Haman has been sentenced to 5 1/2 years, but will be kept longer anyway just because they feel like it." b\'Georgian troops retreat from S. Osettain capital, presumably leaving several hundred people killed. [VIDEO]\' b\'Did the U.S. Prep Georgia for War with Russia?\' b\'Rice Gives Green Light for Israel to Attack Iran: Says U.S. has no veto over Israeli military ops\' b\'Announcing:Class Action Lawsuit on Behalf of American Public Against the FBI\' b"So---Russia and Georgia are at war and the NYT\'s top story is opening ceremonies of the Olympics? What a fucking disgrace and yet further proof of the decline of journalism." b"China tells Bush to stay out of other countries\' affairs" b\'Did World War III start today?\' b\'Georgia Invades South Ossetia - if Russia gets involved, will NATO absorb Georgia and unleash a full scale war?\' b\'Al-Qaeda Faces Islamist Backlash\' b\'Condoleezza Rice: "The US would not act to prevent an Israeli strike on Iran." 
Israeli Defense Minister Ehud Barak: "Israel is prepared for uncompromising victory in the case of military hostilities."\' b\'This is a busy day: The European Union has approved new sanctions against Iran in protest at its nuclear programme.\' b"Georgia will withdraw 1,000 soldiers from Iraq to help fight off Russian forces in Georgia\'s breakaway region of South Ossetia" b\'Why the Pentagon Thinks Attacking Iran is a Bad Idea - US News & World Report\' b\'Caucasus in crisis: Georgia invades South Ossetia\' b\'Indian shoe manufactory - And again in a series of "you do not like your work?"\' b\'Visitors Suffering from Mental Illnesses Banned from Olympics\' b"No Help for Mexico\'s Kidnapping Surge"']
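The loop that fills the list is not shown in the cell above. One way to build such a list, sketched on a toy frame (the frame, its two headline columns, and the name `toy_headlines` are illustrative assumptions):

```python
import pandas as pd

# Toy frame with the same layout as the real data: Date, Label, then headlines.
toy = pd.DataFrame({
    "Date": ["2008-08-08", "2008-08-11"],
    "Label": [0, 1],
    "Top1": ["Georgia downs two warplanes", "Bush puts foot down"],
    "Top2": ["Musharraf to be impeached", "Olympic fireworks faked"],
})

# Columns from position 2 onward are headline columns. str(x) guards against
# missing cells, which pandas reads as float NaN and str.join cannot handle.
toy_headlines = [
    " ".join(str(x) for x in toy.iloc[row, 2:])
    for row in range(len(toy))
]

print(toy_headlines[0])
```

Each row becomes a single "document", so the vectorizer later sees one document per trading day.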
2) Convert this list of strings into a term-frequency matrix so that it can be used as the training set
basicvectorizer = CountVectorizer()
(1611, 31675)
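The shape (1611, 31675) reads as 1611 documents (training days) by 31675 distinct words. A minimal sketch of the same transformation on three made-up documents shows how the matrix is laid out:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three tiny made-up documents, one per "day".
docs = ["russia georgia war", "georgia olympics", "war war iran"]

vec = CountVectorizer()
X = vec.fit_transform(docs)      # sparse matrix: rows = documents, columns = words

print(X.shape)                   # (number of documents, vocabulary size)
print(sorted(vec.vocabulary_))   # columns follow this alphabetical order
print(X.toarray())               # each cell is a word count
```

The result is a sparse matrix, which matters at the real scale: almost every cell of a 1611 by 31675 matrix is zero.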
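3 Train with logistic regression, then inspect the accuracy and the per-word weights in coef_
# Fit logistic regression on the training term-frequency matrix
testheadlines = []
# Build a simple confusion matrix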
Actual (rows) / Predicted (columns) | 0 | 1 |
---|---|---|
0 | 61 | 125 |
1 | 92 | 100 |
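Observation: from the confusion matrix, accuracy is only (61 + 100) / 378 ≈ 42.6%, which is worse than a coin flip and clearly unsatisfactory.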
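The cells for this step are only partly shown, so here is a self-contained sketch of the whole sequence on made-up headlines: fit the vectorizer on the training text, reuse it on the test text, fit the model, predict, and cross-tabulate. All data and variable names below are illustrative assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up headlines and labels, for illustration only.
train_docs = ["markets rally on strong earnings", "stocks rally as growth returns",
              "war fears hit markets", "sanctions and war fears grow"]
train_labels = [1, 1, 0, 0]
test_docs = ["earnings growth lifts stocks", "new sanctions raise war fears"]
test_labels = pd.Series([1, 0], name="Label")

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)   # learn the vocabulary on training text
X_test = vec.transform(test_docs)         # reuse it; unseen words are ignored

model = LogisticRegression()
model.fit(X_train, train_labels)
preds = model.predict(X_test)

table = pd.crosstab(test_labels, preds, rownames=["Actual"], colnames=["Predicted"])
accuracy = (preds == test_labels.to_numpy()).mean()
print(table)
print(accuracy)
```

The important detail is calling `transform`, not `fit_transform`, on the test text, so that the test matrix has exactly the same columns as the training matrix.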
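# All words (features) in the vectorizer's vocabulary.
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use it instead.
basicwords = basicvectorizer.get_feature_names_out()
| | Coefficient | Word |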
---|---|---|
19419 | 0.497924 | nigeria |
25261 | 0.452526 | self |
29286 | 0.428011 | tv |
15998 | 0.425863 | korea |
20135 | 0.425716 | olympics |
15843 | 0.411636 | kills |
26323 | 0.411267 | so |
29256 | 0.394855 | turn |
10874 | 0.388555 | fears |
28274 | 0.384031 | territory |
coeffdf.tail(10)  # the words with the most negative coefficients
| | Coefficient | Word |
---|---|---|
27299 | -0.424441 | students |
8478 | -0.427079 | did |
6683 | -0.431925 | congo |
12818 | -0.444069 | hacking |
7139 | -0.448570 | country |
16949 | -0.463116 | low |
3651 | -0.470454 | begin |
25433 | -0.494555 | sex |
24754 | -0.549725 | sanctions |
24542 | -0.587794 | run |
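4 Improving feature extraction: tokenize into two-word phrases (bigrams) and build a new frequency matrix
advancedvectorizer = CountVectorizer(ngram_range=(2,2))
advancedtrain = advancedvectorizer.fit_transform(trainheadlines)
print(advancedtrain.shape)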
(1611, 366721)
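The jump from 31675 to 366721 columns comes from counting adjacent word pairs instead of single words. A small sketch on three made-up documents shows what `ngram_range=(2, 2)` extracts:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents that reuse the same words in different orders.
docs = ["the war in georgia", "georgia in the war", "war in the news"]

unigram = CountVectorizer()
bigram = CountVectorizer(ngram_range=(2, 2))   # only two-word phrases

uni_X = unigram.fit_transform(docs)
bi_X = bigram.fit_transform(docs)

print(uni_X.shape, bi_X.shape)
print(sorted(bigram.vocabulary_))
```

Even here the same five words yield six distinct pairs, because word order now matters ("in georgia" and "georgia in" are different features).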
advancedmodel = LogisticRegression()
testheadlines = []
pd.crosstab(test["Label"], advpredictions, rownames=["Actual"], colnames=["Predicted"])
Actual (rows) / Predicted (columns) | 0 | 1 |
---|---|---|
0 | 66 | 120 |
1 | 45 | 147 |
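To compare the two models on the same footing, accuracy can be read straight off the two confusion matrices. The helper below is a small illustrative function, not part of the original notebook; the numbers are copied from the tables in this post.

```python
def accuracy(matrix):
    """Share of correct predictions: diagonal sum over total count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Rows are actual labels 0 and 1, columns are predicted labels 0 and 1.
unigram_cm = [[61, 125], [92, 100]]
bigram_cm = [[66, 120], [45, 147]]

# Share of test days with Label 1, i.e. the accuracy of always predicting 1.
majority_baseline = sum(bigram_cm[1]) / sum(sum(row) for row in bigram_cm)

print(round(accuracy(unigram_cm), 3))
print(round(accuracy(bigram_cm), 3))
print(round(majority_baseline, 3))
```

The bigram model reaches about 56.3% against about 42.6% for single words. Always predicting 1 would already score about 50.8% on this test set, so the improvement is real but modest.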
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use it instead.
advwords = advancedvectorizer.get_feature_names_out()
| | Coefficient | Words |
---|---|---|
272047 | 0.286533 | right to |
24710 | 0.275274 | and other |
285392 | 0.274698 | set to |
316194 | 0.262873 | the first |
157511 | 0.227943 | in china |
159522 | 0.224184 | in south |
125870 | 0.219130 | found in |
124411 | 0.216726 | forced to |
173246 | 0.211137 | it has |
322590 | 0.209239 | this is |
advcoeffdf.tail(10)
| | Coefficient | Words |
---|---|---|
326846 | -0.198495 | to help |
118707 | -0.201654 | fire on |
155038 | -0.209702 | if he |
242528 | -0.211303 | people are |
31669 | -0.213362 | around the |
321333 | -0.215699 | there is |
327113 | -0.221812 | to kill |
340714 | -0.226289 | up in |
358917 | -0.227516 | with iran |
315485 | -0.331153 | the country |