Basic LSTM model for predicting stock prices (Python)

Federico M. Glancszpigel
5 min read · Jan 30, 2020

In this article I present a simplified Recurrent Neural Network model for stock price prediction. After extensive research on machine learning and neural networks, I wanted to write a guide to building, understanding, and using a model for predicting the price of a stock. Keep in mind that I won't explain the basics of RNNs and LSTMs here; I will go directly to the model itself. The article is divided into three sections: 1-Data preprocessing, 2-Creating and training the model & 3-Evaluating the model.

1-Data Preprocessing

First, I import a few essential libraries.

import pandas as pd
import yfinance as yf
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Yfinance is a good library for downloading data from Yahoo Finance now that Pandas Datareader's Yahoo support stopped working. I will download the historical adjusted close time series of Apple stock. We won't need the datetime index, so I drop it.

data = yf.download('AAPL')[['Adj Close']]
data.reset_index(inplace=True)
data.drop('Date', axis=1, inplace=True)
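
Before moving on, it never hurts to peek at what was downloaded; this is just an optional sanity check, not part of the model:

# Inspect the shape and the first rows of the series.
print(data.shape)
print(data.head())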

With our data frame in hand, the next thing to do is split it into a train set and a test set manually. Here we can't use scikit-learn's train_test_split function because we don't want a random split: the order of the observations matters. I arbitrarily chose a split percentage of 90%.

split_percentage = 0.9
split_point = round(len(data) * split_percentage)
train_data = data.iloc[:split_point]
test_data = data.iloc[split_point:]
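
An optional quick check confirms the split sizes and that the test rows come strictly after the training rows (using the integer index left by reset_index):

# Roughly 90% of the rows go to training; the rest, the most
# recent observations, form the test set.
print(len(train_data), len(test_data))
assert train_data.index.max() < test_data.index.min()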

Next we have to normalize our data. This step and the ones that follow are important because we have to transform the data frame into a NumPy array with a specific shape so that the model can understand our data.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(train_data)
scaled_train = scaler.transform(train_data)
scaled_test = scaler.transform(test_data)
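
One detail worth highlighting: the scaler is fit on the training set only, so the test data is transformed with the training minimum and maximum and no future information leaks into the fit. An optional check makes this visible:

# The training set maps to [0, 1] by construction; the test set
# uses the same parameters, so it can fall outside that range.
print(scaled_train.min(), scaled_train.max())
print(scaled_test.min(), scaled_test.max())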

Since the Keras TimeseriesGenerator doesn't work well with stock data, we have to organize the data manually so that it behaves like a time series object.

def timeseries_preprocessing(scaled_train, scaled_test, lags):
    # Training samples: each X is a window of `lags` consecutive
    # prices and each Y is the price that immediately follows it.
    X, Y = [], []
    for t in range(len(scaled_train) - lags - 1):
        X.append(scaled_train[t:(t + lags), 0])
        Y.append(scaled_train[(t + lags), 0])

    # The same sliding-window construction for the test set.
    Z, W = [], []
    for t in range(len(scaled_test) - lags - 1):
        Z.append(scaled_test[t:(t + lags), 0])
        W.append(scaled_test[(t + lags), 0])

    X_train, Y_train, X_test, Y_test = np.array(X), np.array(Y), np.array(Z), np.array(W)

Above, we create a matrix of shape (#samples, #lags) for the Xs and a matrix of shape (#samples,) for the Ys. I arbitrarily chose 10 lags. Then we have to reshape the X matrices to (#samples, #lags, #features). The number of features will be 1, because we are dealing only with the adjusted close.

    X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
    X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

    return X_train, Y_train, X_test, Y_test

X_train, Y_train, X_test, Y_test = timeseries_preprocessing(scaled_train, scaled_test, 10)
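
Printing the shapes is a cheap way to verify the preprocessing did what we expect; the exact sample counts depend on the day you download the data, so treat them as indicative:

# Expected: (#samples, 10, 1) for the Xs and (#samples,) for the Ys.
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)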

Our data is now in a shape Keras can understand. Next, we build the model.

2-Creating and training the model

Because the goal of this article is to present a simplified model, ours will consist of one LSTM layer of 256 neurons (the input layer) and one Dense layer of a single neuron (the output). I chose the Adam optimizer because it has proven to give good results, and since we are solving a regression problem, Mean Squared Error is the natural choice of loss metric.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

model = Sequential()
model.add(LSTM(256, input_shape=(X_train.shape[1], 1)))  # (lags, features)
model.add(Dense(1))  # single-value regression output
model.compile(optimizer='adam', loss='mse')
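
If you want to inspect the resulting architecture and parameter count, you can ask Keras for a summary (optional):

# Print the layer stack and the number of trainable parameters.
model.summary()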

Then we train the model. To monitor overfitting you can pass validation data. With history.history you can see the loss and validation loss for each epoch of the training process. Graphically, overfitting occurs when the validation loss starts to increase while the training loss keeps decreasing; that is the epoch at which you should stop training. You can also use EarlyStopping, but keep in mind that the validation loss fluctuates a lot in the early epochs, so I don't recommend relying on it blindly in cases like this; a sketch follows the training call below.

history = model.fit(x=X_train, y=Y_train, epochs=300, validation_data=(X_test, Y_test), shuffle=False)
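
If you do want to automate the stopping rule despite that caveat, a minimal sketch with Keras's EarlyStopping callback would replace the fit call above; the patience value here is an arbitrary choice of mine, not something tuned for this data:

from tensorflow.keras.callbacks import EarlyStopping

# Stop once val_loss has not improved for `patience` epochs and
# roll back to the best weights seen so far. A generous patience
# helps ride out the noisy validation loss of the early epochs.
early_stop = EarlyStopping(monitor='val_loss', patience=25, restore_best_weights=True)
history = model.fit(x=X_train, y=Y_train, epochs=300, validation_data=(X_test, Y_test), shuffle=False, callbacks=[early_stop])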

3-Evaluating the model

The first thing we can look at to evaluate performance is the history of the loss and validation loss. The plot below shows that we are not overfitting.

axes = plt.axes()
axes.plot(pd.DataFrame(history.history)['loss'], label='Loss')
axes.plot(pd.DataFrame(history.history)['val_loss'], label='Validation Loss')
axes.legend(loc=0)
axes.set_title('Model fitting performance')

Next, we evaluate the predictive power of the model. To do so, we first have to transform the outputs back to the original price scale.

Y_predicted = scaler.inverse_transform(model.predict(X_test))
Y_true = scaler.inverse_transform(Y_test.reshape(Y_test.shape[0], 1))

Now we can plot the predictions against the true values.

axes = plt.axes()
axes.plot(Y_true, label='True Y')
axes.plot(Y_predicted, label='Predicted Y')
axes.legend(loc=0)
axes.set_title('Prediction adjustment')

The prediction seems accurate at the beginning, but as the days go by, the predicted Y starts to deviate. To be more precise, let's compute some metrics:

from sklearn import metrics

print('Model accuracy (%)')
Y_p = scaler.inverse_transform(model.predict(X_train))
Y_t = scaler.inverse_transform(Y_train.reshape(Y_train.shape[0], 1))
# Accuracy defined here as 1 - MAE relative to the mean price.
print((1 - (metrics.mean_absolute_error(Y_t, Y_p) / Y_t.mean())) * 100)
print('')
print('Prediction performance')
print('MAE in %', (metrics.mean_absolute_error(Y_true, Y_predicted) / Y_true.mean()) * 100)
print('MSE', metrics.mean_squared_error(Y_true, Y_predicted))
print('RMSE', np.sqrt(metrics.mean_squared_error(Y_true, Y_predicted)))
print('R2', metrics.r2_score(Y_true, Y_predicted))

Model accuracy (%)
90.72455130726455

Prediction performance
MAE in % 6.057973590680105
MSE 222.46304370005493
RMSE 14.915195060744425
R2 0.9085727425185598

These results state that our predictions deviate from the mean by about 6%, and if we predict on the training set and compute the mean absolute error relative to the mean price, we obtain about 90% accuracy for this model.

Conclusion

There's a lot more to improve in this field. In this article I presented a basic model that achieved a non-trivial 90% accuracy. Compared to models like ARIMA, the DDM, or Monte Carlo simulations, this number is quite good. Many variables affect the results of the model, such as the stock that was chosen, the number of layers in the model, the number of neurons in each layer, the number of epochs trained, and so on. The next challenge will be improving this basic model to gain more accuracy. Deep learning is a young field and has already proven to be a very powerful tool for finance.
