Factorization Machine on AWS: The best algorithm for recommender systems
Item Recommendation with Explicit Feedback Data
by Mariela Bisso, Data Scientist
This post shows end-to-end how SageMaker’s built-in Factorization Machines algorithm works using the MovieLens dataset (F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872).
To download the dataset or for more information, see the following link. If you already know the details of the algorithm and its history, you can skip the first part and jump straight to the implementation below (see the SageMaker Studio section).
Recommendation systems model users’ preferences for items based on their own past interactions (for example, ratings or clicks) and on the past interactions of “similar” users, an approach known as collaborative filtering.
There are two main approaches to recommendation systems: explicit feedback and implicit feedback. Explicit feedback usually appears as a rating on a scale from 1 to 5, with 1 being bad and 5 being excellent. Implicit feedback, on the other hand, could be the number of clicks a user made on a web page or the number of transactions in e-commerce. The key difference is that with explicit feedback the user states what they do not like, while with implicit feedback the fact that a user does not buy a certain item does not necessarily mean they dislike it.
Let’s start by talking about some algorithms or techniques for these types of problems.
Matrix Factorization (MF)
There are many different collaborative filtering techniques, matrix factorization (MF) being one of the most widely used. It projects users and items into a shared latent space, representing each user and each item as a vector of latent features. A user’s interaction with an item is then modeled as the inner product of their latent vectors.
Let’s start from the very beginning and suppose we have 4 users and 5 movies, where each row represents a user and each column represents an item (a movie). To simplify, think of each check mark as a binary 0–1 variable, where 1 means the user is interested in that movie.
Let’s briefly explain what we mean by latent features. Imagine two variables that somehow separate both movies and users: the audience they are intended for (children vs. adults) and the budget used for filming (blockbuster vs. arthouse). Each user has a preference for each of these characteristics, which we measure on the interval [-1, 1], and we do the same for each film, assigning it a weight in [-1, 1]. We can then estimate a user’s preference for a movie by multiplying the weights found for the user and the movie within the latent factor space.
Of course, this is just an example; in practice the latent feature space is found automatically, with a dimension specified by a model hyperparameter (num_factors).
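As a quick illustration of that inner product, here is a minimal sketch in Python; the latent vectors below are made-up numbers for the two hypothetical factors, not learned values:
import numpy as np
# Hypothetical 2-factor latent vectors: [audience, budget], each in [-1, 1]
user = np.array([0.8, -0.3])   # leans adult-oriented, prefers arthouse
movie = np.array([0.9, -0.5])  # an adult-oriented arthouse film
# The predicted affinity is the inner product of the two latent vectors
score = np.dot(user, movie)    # 0.8*0.9 + (-0.3)*(-0.5) = 0.87
print(score)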
In summary, this algorithm is one of the most popular in recommendation systems, but it has the disadvantage that it cannot incorporate additional information. To overcome this limitation, we need a more general model that extends the latent factor approach to incorporate arbitrary auxiliary features and specialized loss functions.
Factorization Machine (FM)
Factorization Machine is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. It is an extension of a linear model that is designed to parsimoniously capture interactions between features in high dimensional sparse datasets.
Amazon SageMaker’s Factorization Machine algorithm provides a robust, highly scalable implementation of this algorithm, which has become extremely popular in ad click prediction and recommender systems. The main purpose of this notebook is to quickly show the basics of implementing Amazon SageMaker Factorization Machines.
(Source: https://docs.aws.amazon.com/es_es/sagemaker/latest/dg/fact-machines.html)
The difference with FM models is that they represent user-item interactions as tuples: the features are binary indicators for the user and for the item the user interacted with, with the possibility of adding auxiliary features.
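To make this concrete, here is a small sketch of what a single FM input row looks like; the catalog sizes and indices are made up for illustration:
import numpy as np
# Toy example: 4 users and 5 movies -> a 9-dimensional binary feature row
n_users, n_movies = 4, 5
def feature_row(user_idx, movie_idx):
    row = np.zeros(n_users + n_movies, dtype=np.float32)
    row[user_idx] = 1.0             # one-hot user indicator
    row[n_users + movie_idx] = 1.0  # one-hot movie indicator
    return row
print(feature_row(0, 2))  # user 0 with movie 2 -> [1. 0. 0. 0. 0. 0. 1. 0. 0.]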
The model equation is:
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j
where w_0 is the global bias, the w_i are the linear weights, and \langle v_i, v_j \rangle is the inner product of the latent factor vectors of features i and j, which models their pairwise interaction.
With the algorithm introduced, we can move on to the main objective of this post: quickly showing the basics of implementing Amazon SageMaker Factorization Machines.
SageMaker Studio
First we’ll create a notebook in our SageMaker Studio. To find out what SageMaker Studio is, you can visit the following link.
Once we have our notebook ready we can start to get our hands dirty. In this post I describe only the fundamental steps. If you want a complete description of the solution and the notebooks, you can download the repository from GitHub.
1- Dataset
Let’s start by visualizing the dataset:
import pandas as pd
# MovieLens 100K ships pre-split into ua.base (train) and ua.test (test)
ratings = pd.read_csv('ua.base', sep='\t', names=['userId','movieId','rating','timestamp'])
ratings_test = pd.read_csv('ua.test', sep='\t', names=['userId','movieId','rating','timestamp'])
print('Shape of ratings dataset for training: {}'.format(ratings.shape))
ratings.head()
2- Preprocessing of the data
Before implementing a factorization machine classification model, we need to transform our target variable to binary 0–1. The cut-off point chosen is 4: ratings of 4–5 are treated as high scores.
ratings['rating_bin'] = (ratings.rating>=4).astype('float32')
ratings_test['rating_bin'] = (ratings_test.rating>=4).astype('float32')
ratings.head()
We transform the dataset into a sparse matrix using scikit-learn’s OneHotEncoder:
from sklearn.preprocessing import OneHotEncoder
# One-hot encode userId and movieId into a single sparse feature matrix
enc = OneHotEncoder(handle_unknown='ignore', sparse=True)
enc.fit(ratings[['userId','movieId']])
X_train_OH = enc.transform(ratings[['userId','movieId']]).astype('float32')
Y_train_OH = ratings['rating_bin']
X_test_OH = enc.transform(ratings_test[['userId','movieId']]).astype('float32')
Y_test_OH = ratings_test['rating_bin']
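The estimator configured later expects the total number of one-hot features as feature_dim; it can be read straight off the encoded matrix (the variable name columns matches its use below):
columns = X_train_OH.shape[1]  # total one-hot features: users + movies
print('Feature dimension: {}'.format(columns))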
3- Upload training data
Since algorithms have particular input and output requirements, converting the dataset is also part of the process that a data scientist goes through prior to initiating training. In this particular case, the Amazon SageMaker implementation of Factorization Machines takes recordIO-wrapped protobuf, whereas the data we have right now is a SciPy sparse matrix in memory. The writeDatasetToProtobuf function below converts the data into the desired format and uploads it to an S3 bucket. By specifying the bucket and prefix parameters we can control the location of these files.
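The cells that follow assume a few session variables; here is a minimal, hedged setup (the bucket and prefixes are placeholders you can change):
import io
import boto3
import sagemaker
import sagemaker.amazon.common as smac
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sess.default_bucket()  # or the name of your own S3 bucket
train_prefix, train_key = 'fm-movielens/train', 'train.protobuf'
test_prefix, test_key = 'fm-movielens/test', 'test.protobuf'
output_prefix = 's3://{}/fm-movielens/output'.format(bucket)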
def writeDatasetToProtobuf(X, Y, bucket, prefix, key):
    # Serialize the sparse matrix and labels to RecordIO-protobuf in memory
    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, X, Y)
    buf.seek(0)
    # Upload the buffer to S3 and return its URI
    obj = '{}/{}'.format(prefix, key)
    boto3.resource('s3').Bucket(bucket).Object(obj).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket, obj)
train_data = writeDatasetToProtobuf(X_train_OH, Y_train_OH, bucket, train_prefix, train_key)
test_data = writeDatasetToProtobuf(X_test_OH, Y_test_OH, bucket, test_prefix, test_key)
print(train_data)
print(test_data)
print('Output: {}'.format(output_prefix))
4- Training the model
Now that we are done with all the setup we needed, we are ready to train our Factorization Machine. To begin, let’s create a sagemaker.estimator.Estimator object. This estimator will launch the training job. For the purpose of this demo we’re using a single training job, but we recommend using a hyperparameter tuning job to get better results in the end; see the sketch after the fit call below, or check how to do it in the tuning notebook (link).
from sagemaker.image_uris import retrieve
training_image = retrieve(region=boto3.Session().region_name, framework="factorization-machines", version='latest')
fm = sagemaker.estimator.Estimator(
    training_image,
    role,
    instance_count=1,
    instance_type='ml.c4.xlarge',
    volume_size=30,
    max_run=86400,
    output_path=output_prefix,
    sagemaker_session=sess,
)
fm.set_hyperparameters(
    feature_dim=columns,
    num_factors=64,
    predictor_type='binary_classifier',
    epochs=30,
    mini_batch_size=200
)
There are two kinds of parameters that need to be set for training. The first are the parameters for the training job. These include:
- image_uri: Container image for the algorithm
- Training instance count: This is the number of instances on which to run the training. When the number of instances is greater than one, then the Factorization Machine algorithm will run in distributed settings.
- Training instance type: This indicates the type of machine on which to run the training.
- Volume size: Size in GB of the EBS volume to use for storing input data during training. Must be large enough to store training data.
- Max run time: Timeout in seconds for training. After this amount of time Amazon SageMaker terminates the job regardless of its current status.
- Output path: This is the S3 folder in which the training output is stored.
Apart from the above set of parameters, there are hyperparameters that are specific to the algorithm. These are:
- feature_dim: The dimension of the input feature space. This could be very high with sparse input.
- num_factors: The dimensionality of factorization. As mentioned initially, factorization machines find a lower dimensional representation of the interactions for all features. Making this value smaller provides a more parsimonious model, closer to a linear model, but may sacrifice information about interactions. Making it larger provides a higher-dimensional representation of feature interactions, but adds computational complexity and can lead to overfitting. In a practical application, time should be invested to tune this parameter to the appropriate value.
- predictor_type: The type of predictor. binary_classifier: For binary classification tasks. regressor: For regression tasks.
- epochs: The number of training epochs to run.
- mini_batch_size: The size of mini-batch used for training. This value can be tuned for relatively minor improvements in fit and speed, but selecting a reasonable value relative to the dataset is appropriate in most cases.
You can check all the available hyperparameters at Factorization Machines Hyperparameters.
data_channels = {
    "train": train_data,
    "test": test_data
}
fm.fit(inputs=data_channels, logs=True)
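As mentioned above, a hyperparameter tuning job usually beats a single training run. A minimal sketch with sagemaker.tuner follows; the objective metric and the tunable hyperparameter names here are assumptions to verify against the Factorization Machines tuning documentation:
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter
# Assumed objective metric and tunable ranges; confirm them in the FM docs
tuner = HyperparameterTuner(
    estimator=fm,
    objective_metric_name='test:binary_classification_accuracy',
    objective_type='Maximize',
    hyperparameter_ranges={
        'bias_lr': ContinuousParameter(1e-4, 1e-1),
        'factors_lr': ContinuousParameter(1e-5, 1e-2),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)
tuner.fit(inputs=data_channels)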
5- Perform Real-Time Inference
Since Factorization Machines are often used with sparse data, performing inference requests in CSV format (as is done in other algorithm examples) can be hugely inefficient. Instead of wasting space and time generating all those zeros, a sparse JSON format can fill each row with the correct dimensionality much more efficiently.
from sagemaker.deserializers import JSONDeserializer
fm_predictor = fm.deploy(
    initial_instance_count=1,
    instance_type="ml.c4.xlarge",
    deserializer=JSONDeserializer()
)
endpoint_name = fm_predictor.endpoint_name
display(f"Endpoint name: {endpoint_name}")
# Re-create the predictor from the endpoint name (handy after a notebook restart)
fm_predictor = sagemaker.predictor.Predictor(endpoint_name,
    sagemaker_session=sagemaker.Session(),
    deserializer=JSONDeserializer()
)
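Before building a full recommendation function, a single request illustrates the sparse JSON format: keys holds the non-zero feature indices, shape the total feature dimension, and values the corresponding entries. The indices below are hypothetical:
import json
# Hypothetical indices: user at one-hot position 0, movie at position 950
payload = {'instances': [
    {"data": {"features": {"keys": [0, 950], "shape": [int(columns)], "values": [1, 1]}}}
]}
result = fm_predictor.predict(json.dumps(payload), initial_args={"ContentType": "application/json"})
print(result)  # {'predictions': [{'score': ..., 'predicted_label': ...}]}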
6- Recommend
Now we will write a recommend function that returns, for a given userId, the top-k recommendations among the movies that user has not rated yet.
import json
import numpy as np
def recommend(userId, data, enc, k=10):
    js = {'instances': []}
    n_users = len(enc.categories_[0])
    shape = n_users + len(enc.categories_[1])
    # Movies this user has not rated yet (sorted for a stable order)
    movie = data.movieId[data.userId == userId]
    movie_not = sorted(set(data.movieId.unique()) - set(movie))
    userId_index = np.where(enc.categories_[0] == userId)[0][0]
    # One sparse JSON instance per unrated movie; the movie index is offset by
    # the number of users to match the one-hot layout used in training
    for m in movie_not:
        movie_index = n_users + np.where(enc.categories_[1] == m)[0][0]
        js['instances'].append({"data": {"features": {"keys": [int(userId_index), int(movie_index)], "shape": [shape], "values": [1, 1]}}})
    result = fm_predictor.predict(json.dumps(js), initial_args={"ContentType": "application/json"})
    recs = pd.DataFrame.from_dict(result['predictions'])
    recs['userId'] = userId
    recs['movieId'] = movie_not
    # Keep movies predicted as liked, ranked by score, and return the top k
    return recs[['userId','movieId','score']].loc[recs.predicted_label == 1].sort_values(by='score', ascending=False).reset_index(drop=True).loc[:(k-1), :]
recommendations = recommend(1, ratings, enc)
recommendations
7- Clean up
Finally, to avoid incurring unnecessary costs, we delete the endpoint.
fm_predictor.delete_endpoint()
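Optionally, the underlying model resource can be removed as well; delete_model is available on Predictor objects in SageMaker SDK v2:
fm_predictor.delete_model()  # remove the model resource behind the endpoint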
Wrapping up
Recommendation systems are a great challenge nowadays: the amount of information we generate as users calls for techniques and algorithms that can narrow the user’s view with a personalized, fully automated pre-selection (recommendation).
If you want to start down this path, Edrans can help you gain the knowledge needed to understand and implement, together with AWS, the steps to get there.
We hope this article has been useful. For more information you can check the AWS documentation and the notebooks on GitHub.