














Selene Yue Xu (UC Berkeley)

Abstract: Stock price forecasting is a popular and important topic in financial and academic studies. Time series analysis is the most common and fundamental method used to perform this task. This paper combines conventional time series analysis with information from the Google trend website and the Yahoo finance website to predict weekly changes in stock price. Important news/events related to a selected stock over a five-year span are recorded, and the weekly Google trend index values for this stock are used to measure the magnitude of these events. The result of this experiment shows a significant correlation between the changes in weekly stock prices and the news/event values computed from the Google trend website. The algorithm proposed in this paper can potentially outperform conventional time series analysis in stock price forecasting.

Introduction: There are two main schools of thought in the financial markets: technical analysis and fundamental analysis. Fundamental analysis attempts to determine a stock's value by focusing on underlying factors that affect a company's actual business and its future prospects; it can also be performed on industries or on the economy as a whole. Technical analysis, on the other hand, looks at the price movement of a stock and uses this data to predict its future price movements. In this paper, both fundamental and technical data on a selected stock are collected from the Internet. Our selected company is Apple Inc. (aapl). We choose this stock mainly because it is popular and there is a large amount of information online that is relevant to our research and can help us evaluate ambiguous news. Our fundamental data takes the form of news articles and analyst opinions, whereas our technical data consists of historical stock prices.
Scholars and researchers have developed many techniques to evaluate online news in recent years. The most popular technique is text mining, but this method is complicated and subject to language biases. Hence we attempt to use information from the Yahoo finance website and the Google trend website to simplify the evaluation of online news information.
In this paper, we first apply conventional ARMA time series analysis to the historical weekly stock prices of aapl and obtain forecasting results. We then propose an algorithm to evaluate news/events related to the aapl stock using information from the Yahoo finance website and the Google trend website, and regress the changes in weekly stock prices on the news values at the beginning of each week. We aim to use this regression result to study the relationship between news and stock price changes and to improve the performance of the conventional stock price forecasting process.

Literature review: The basic theory regarding stock price forecasting is the Efficient Market Hypothesis (EMH), which asserts that the price of a stock reflects all available information and that everyone has some degree of access to that information. The implication of the EMH is that the market reacts instantaneously to news and that no one can outperform the market in the long run. However, the degree of market efficiency is controversial, and many believe that one can beat the market over a short period of time^1.

Time series analysis covers a large number of forecasting methods. Researchers have developed numerous modifications of the basic ARIMA model and found considerable success with these methods. The modifications include clustering time series from ARMA models with clipped data^2, a fuzzy neural network approach^3, and a support vector machines model^4. Almost all of these studies suggest that additional factors should be taken into account on top of the basic or unmodified model. The most common and important such factor is the online news information related to the stock. Many researchers attempt to use textual information in public media to evaluate news. To perform this task, various mechanisms have been developed, such as the AZFin text system^5, a matrix-form text mining system^6 and a named-entities representation scheme^7.
All of these processes require complex algorithms that perform text extraction and evaluation from online sources.

Data: Weekly stock prices of aapl from the first week of September 2007 to the last week of August 2012 are extracted from the Yahoo finance website. This data set contains the open, high, low, close and adjusted close prices of the aapl stock on every Monday throughout these five years, as well as the trading volumes on these days. For consistency, the close prices are used as the general measure of the aapl stock price over the past five years.
In other words, we adopt the criterion used by the Yahoo finance website when judging whether certain news is important enough to be included in our analysis. Note that this might not be a comprehensive list of the important news related to Apple Inc. over the past five years. But again, this paper attempts to simplify the news selection process, hence a simple criterion from the Yahoo finance website is used.
Analysis of Data:
1. The basic ARIMA model analysis of the historical stock prices: To perform the basic ARIMA time series analysis on the historical stock prices, we first plot the raw data, i.e. the weekly close prices of aapl over time. The plot shows that the close price of aapl generally increases over the past five years. However, there is no apparent pattern in the movement of the stock price. The variance of the stock price seems to increase slightly with time, and the stock price is especially volatile near the end. These observations imply that a log or square-root transformation of the raw data might be appropriate in order to stabilize the variance.
Next we plot the autocorrelation function and the partial autocorrelation function of the first differences of the transformed data. From these plots, the first differences of the transformed data appear random: there does not seem to be an ARMA relationship in the first differences over time. The transformed stock prices essentially follow an ARIMA(0,1,0) process, which is a random walk. The random walk model in stock price forecasting has been commonly used and studied throughout history^10. The random walk model has implications similar to the efficient market hypothesis, as both suggest that one cannot outperform the market by analyzing the historical prices of a certain stock.
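An ARIMA(0,1,0) process is one whose first differences are white noise, which is why the ACF and PACF of the differenced series show no structure. As a quick illustration (a Python sketch on simulated data, not the paper's R analysis):

```python
import numpy as np

rng = np.random.default_rng(0)

# ARIMA(0,1,0): the series is a cumulative sum of white-noise steps,
# so first-differencing recovers the uncorrelated steps exactly
steps = rng.normal(size=1000)
walk = np.cumsum(steps)
diffs = np.diff(walk)

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation."""
    x = x - x.mean()
    return float(x[:-1] @ x[1:] / (x @ x))

r1 = lag1_autocorr(diffs)
# for white noise, r1 should be near zero: no ARMA structure to exploit
```

Because the differenced series carries no autocorrelation, past prices offer no forecasting edge, matching the random-walk reading of the plots.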
2. Algorithm for computing the value of news at a certain time: As mentioned earlier, important news and events related to Apple Inc. over the past five years are recorded and assessed to be either +1 or -1. An exponential algorithm is used to simulate the decaying impact of a piece of news over time. The details of the algorithm are as follows:

a. To start with, each day in the past five years is assigned a value of +1, -1 or 0, depending on whether there is an important news/event on that day and whether the news is positive, negative or neutral, respectively.

b. The impact of the news decreases exponentially, such that n days later the absolute value of the news/event becomes exp(-n/7), while the sign still follows the original sign of the news/event. This exponential form means that on the day a news/event occurs, its absolute value is always exp(0) = 1. One day later, the absolute value becomes exp(-1/7) = 86.69% of the original absolute value. A week later, it becomes exp(-7/7) = 36.79%, and two weeks later exp(-14/7) = 13.53% of the original absolute value. The algorithm is designed so that a news/event almost dies off after two weeks.

c. On any day within the past five years, the news/event value on that date is the sum of all news/event values on and before that date, calculated using the aforementioned decay.

d. After computing the news/event value for every single day in the past five years, the days corresponding to the Google trend dates (every Sunday in the past five years) are selected. The news/event value computed from the key developments feature on the Yahoo finance website gives the sign of the news/event at a particular time, and the Google trend index data gives its magnitude. They are multiplied together to give the final value of the news/event at that point in time.
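The decay-and-sum rule in steps (a)-(c) can be sketched directly. The following Python snippet is an illustrative re-implementation (the paper's own code is in R, and the event dates and signs below are hypothetical):

```python
import math

def news_value(events, day):
    """Cumulative news value on `day` (step c): each event of sign +1/-1
    occurring n days earlier contributes sign * exp(-n/7)."""
    return sum(sign * math.exp(-(day - event_day) / 7)
               for event_day, sign in events if event_day <= day)

# hypothetical events: a positive event on day 0, a negative one on day 3
events = [(0, +1), (3, -1)]

v0 = news_value(events, 0)   # only the day-0 event contributes: exp(0) = 1
v7 = news_value(events, 7)   # exp(-7/7) - exp(-4/7), both events decayed
```

In the paper the resulting daily sums are then sampled on the Google trend dates and multiplied by the trend index, as described in step (d).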
3. Regression of weekly stock price changes on the news values at the beginning of each week: In order to study the relationship between the news/event value at a certain time and the stock price at a later time, we regress the weekly stock price changes on the news values from the previous week. The summary of the regression is shown below:
summary(mod)
Call: lm(formula = d1 ~ inf2, data = d)
Residual standard error: 12.28 on 258 degrees of freedom
Multiple R-squared: 0.09243
F-statistic: 26.28 on 1 and 258 DF, p-value: 5.808e-07

The regression result shows a very significant positive relationship between the weekly changes in stock prices and the news values at the beginning of each week, with a slope estimate of 6.3290. In words, a one-unit increase in the news value is associated with roughly a 6.33-unit increase in the stock price change by the end of the week. However, we notice that the R-squared value, which shows the proportion of the variability in the weekly stock price changes explained by the model, is very small (0.09243). This means that the model is not doing a good job of predicting the changes in stock prices.
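For reference, the slope and R-squared reported by R's lm() can be reproduced with plain ordinary least squares. A minimal Python sketch (illustrative, using made-up data rather than the paper's):

```python
import numpy as np

def slope_and_r2(x, y):
    """OLS fit of y on x with an intercept; returns (slope, R-squared)."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return float(beta[1]), float(1.0 - ss_res / ss_tot)

# perfectly linear toy data: slope 6.33, R-squared 1
x = np.arange(10.0)
slope, r2 = slope_and_r2(x, 6.33 * x + 1.0)
```

A small R-squared with a significant slope, as in the paper, means the direction of the relationship is reliable but most week-to-week variation remains unexplained.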
In order to further examine the regression data, we plot the weekly stock price changes against the news values from the previous week, with the regression line drawn. From the plot, we notice potential outliers in the data, such as the point well above the regression line near a news value of -2. An outlier is a data point that is markedly distant from the rest of the data and does not fit the current model. Identifying outliers helps to distinguish truly unusual points from residuals that are large but not exceptional. We use jackknife residuals and the Bonferroni inequality to construct a conservative test at the 10% level to filter out outliers in our data. This procedure picks out the data point just mentioned as the sole outlier in our data set, and we exclude it from the rest of our analysis. The regression result and the plot of the new data set without the outlier are shown below:
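The test described above, jackknife (externally studentized) residuals compared against a Bonferroni-adjusted t cutoff, can be sketched as follows. This is a generic Python implementation for simple regression, not the paper's R code, and the data below are synthetic:

```python
import numpy as np
from scipy import stats

def bonferroni_outliers(x, y, alpha=0.10):
    """Indices of points whose jackknife residual exceeds the
    Bonferroni cutoff t(1 - alpha/(2n); n - p - 1)."""
    n, p = len(x), 2                                 # intercept + slope
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)    # leverages
    s2 = e @ e / (n - p)
    r = e / np.sqrt(s2 * (1 - h))                    # standardized residuals
    t = r * np.sqrt((n - p - 1) / (n - p - r**2))    # jackknife residuals
    cutoff = stats.t.ppf(1 - alpha / (2 * n), df=n - p - 1)
    return np.where(np.abs(t) > cutoff)[0]

# synthetic data with one gross outlier injected at index 25
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + rng.normal(scale=1.0, size=50)
y[25] += 15.0
flagged = bonferroni_outliers(x, y)
```

Dividing alpha by the number of points tested is what makes the procedure conservative: with n candidate outliers, each is tested at level alpha/n.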
The regression on the reduced data set yields a similar estimate of the coefficient on the news variable (7.3542), which is also significant at level 0.001. This shows that the estimated relationship between weekly stock price changes and the news values at the beginning of each week is relatively stable. There is a slight improvement in the R-squared value (0.1205): a little over 10% of the variability in the weekly stock price changes can be explained by the model. Although the R-squared value shows a slight increase, it is still very small, implying that the model is still not doing a great job of predicting the changes in stock prices.

We further examine the model by analyzing the leverage of each data point. If the leverage is large, the data point pulls the fitted value toward itself. Such observations often have small residuals and do not show up as outliers on residual plots: the fit looks good, but is dominated by a single observation. If the leverage of a data point is substantially greater than k/n, where k is the number of parameters being estimated and n is the sample size, we say that it has high leverage. A rule of thumb says that if the leverage is bigger than 2k/n, the data point is a candidate for further consideration. Checking for high-leverage points in our data set using R, we find 24 influential points. We take out these points and perform the regression on the remaining data; the regression summary and plot are shown here:
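The 2k/n rule of thumb is easy to apply in simple regression, where the leverage has the closed form h_i = 1/n + (x_i - x̄)²/Sxx. A small Python sketch with made-up data:

```python
import numpy as np

def high_leverage_points(x, mult=2.0):
    """Indices with leverage above mult*k/n (k = 2: intercept + slope).
    For simple regression, h_i = 1/n + (x_i - xbar)^2 / Sxx."""
    x = np.asarray(x, dtype=float)
    n, k = len(x), 2
    dev = x - x.mean()
    h = 1.0 / n + dev**2 / (dev @ dev)
    return np.where(h > mult * k / n)[0]

# one x-value far from the rest dominates the fit
flagged = high_leverage_points([1.0, 2.0, 3.0, 4.0, 100.0])  # flags index 4
```

Note that leverage depends only on the x-values, which is why high-leverage points can have small residuals yet still dominate the fitted line.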
summary(mod2)
Call: lm(formula = d1 ~ inf2, data = d2)
Residual standard error: 12.11 on 234 degrees of freedom
Multiple R-squared: 0.01076
F-statistic: 2.545 on 1 and 234 DF
The estimate of the coefficient on the news values is no longer significant. In fact, the distribution of weekly stock price changes over news values now seems rather random. Hence we suspect that the previous regression result is dominated by a number of highly influential data points in our sample and does not represent the general situation very well.
Next we divide the news values into nine finer intervals: (-inf, -2), [-2, -1.5), [-1.5, -1), [-1, -0.5), [-0.5, 0), [0, 0.5), [0.5, 1), [1, 1.5), and finally [1.5, inf). The regression result over these nine dummy variables (without an intercept term) is shown below:
summary(m2)
Call: lm(formula = d1 ~ dum1 + dum2 + dum3 + dum4 + dum5 + dum6 + dum7 + dum8 + dum9 - 1)
Residual standard error: 12.22 on 251 degrees of freedom
Multiple R-squared: 0.147
F-statistic: 4.807 on 9 and 251 DF, p-value: 6.274e-06

From the regression summary, we see that the averages of weekly stock price changes for news values falling into the first four intervals are negative. But only the averages in the first interval (<-2) and the fourth interval ([-1, -0.5)) are significant, both at level 0.1; the other two are not significant at any level smaller than 0.1. We would also have expected the mean of weekly stock price changes to be more negative in the second interval ([-2, -1.5)) than in the third ([-1.5, -1)) and
the fourth interval ([-1, -0.5)). The average of weekly stock price changes in the fifth interval ([-0.5, 0)) is very interesting: contrary to our expectation that it should be negative, it is actually positive, and significantly so (at level 0.05). We will discuss this in the next section. The averages of weekly stock price changes for news values falling into the last four intervals, except for the seventh interval, are positive, just as we would have expected. The average in the sixth interval ([0, 0.5)) is significant at level 0.01, and the average in the last interval (>=1.5) is also significant. The mean in the eighth interval ([1, 1.5)) is not significant at any level smaller than 0.1. The mean in the seventh interval ([0.5, 1)) is negative, which contradicts our expectation, but it is not significant at any level smaller than 0.1. All in all, when dividing the news values into different intervals, the trend of average weekly stock price changes is similar to what we would have expected, with a couple of exceptions.
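Since the nine dummies are mutually exclusive and there is no intercept, the fitted coefficient for each dummy is simply the mean weekly price change within that news-value bin. A Python sketch of the binning (interval edges as defined above; the sample data are hypothetical):

```python
import numpy as np

# interior edges of the nine intervals: (-inf,-2), [-2,-1.5), ..., [1.5,inf)
EDGES = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]

def bin_means(news, changes):
    """No-intercept regression on exclusive dummies == within-bin means."""
    news = np.asarray(news, dtype=float)
    changes = np.asarray(changes, dtype=float)
    idx = np.digitize(news, EDGES)   # bin index 0..8, bins closed on the left
    return [float(changes[idx == b].mean()) if np.any(idx == b) else float('nan')
            for b in range(9)]

# hypothetical news values and weekly price changes
means = bin_means([-3.0, -1.7, 0.2, 2.0], [-5.0, -2.0, 3.0, 8.0])
# bins 0, 1, 5 and 8 receive the single observations -5, -2, 3 and 8
```

Reading the dummy coefficients as bin means is what allows the interval-by-interval comparison of average price changes made above.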
Reference:
Appendix (R code):

# load weekly price and Google trend data (local CSV paths)
price=read.csv("/Users/selenexu/Desktop/Econ Honor/aapl table.csv",header=TRUE)
price=price[order(price$Date,decreasing=FALSE),]  # sort by date, ascending
trend=read.csv("/Users/selenexu/Desktop/Econ Honor/aapl trends.csv",header=TRUE)
plot(trend$aapl,type='l')
plot(price$Close,type='l',xlab='Time',ylab='Close Prices',main='Weekly Close Prices of aapl')

# first differences of the raw, logged and square-root transformed prices
d1=diff(price$Close)
logd1=diff(log(price$Close))
sd1=diff(sqrt(price$Close))
par(mfrow=c(3,1))
plot(d1,type='l',xlab='Time',ylab='Difference',main='First Degree Differencing on Raw Data')
plot(logd1,type='l',xlab='Time',ylab='Difference',main='First Degree Differencing on Logged Data')
plot(sd1,type='l',xlab='Time',ylab='Difference',main='First Degree Differencing on Square-root Data')

# ACF/PACF of the first and second differences of the square-root data
par(mfrow=c(2,1))
acf(sd1,main='Autocorrelation Function of the First Differences')
pacf(sd1,main='Partial Autocorrelation Function of the First Differences')
sd2=diff(sd1)
par(mfrow=c(2,1))
acf(sd2,main='Autocorrelation Function of the Second Differences')
pacf(sd2,main='Partial Autocorrelation Function of the Second Differences')
arima(sqrt(price$Close),order=c(0,2,1))

# regress weekly price changes on the previous week's news values;
# info holds the daily news values computed by the algorithm in section 2
inf2=info[1:(length(info)-1)]
d=data.frame(d1,inf2)
mod=lm(d1~inf2,data=d)