Sentiment Analysis model deployed!

I’ve trained a sentiment analysis model on a simple dataset:

Amazon Reviews: Unlocked Mobile Phones

It is based on the Amazon phone-purchase reviews and uses a simple linear SVM classifier built with scikit-learn. The code is further down the page.

I’ve also successfully deployed the model on an AWS server! (original deployment page)

Model building

 

Supervised model: Sentiment Analysis

Tools used: Python; main library: scikit-learn (http://scikit-learn.org)

Code: attachments “sup_demo.html”, “sup_demo.ipynb”

Summary

Way of labeling the data: arbitrarily treat 1–3 star reviews as “negative” and 5-star reviews as “positive”. This labeling scheme is the highlight of the model and is what led to its success. By and large, star ratings are subjective, since there is no universal rating criterion. In my opinion, building a model that classifies reviews into five categories (1 to 5 stars) is not only unrealistic but also of little business value.
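
A minimal sketch of this labeling scheme, assuming the Kaggle CSV with its “Rating” and “Reviews” columns; 4-star reviews fall outside the stated scheme, so this sketch simply drops them:

    import pandas as pd

    # Load the Kaggle CSV; "Rating" and "Reviews" are the dataset's columns.
    df = pd.read_csv("Amazon_Unlocked_Mobile.csv").dropna(subset=["Reviews", "Rating"])

    # 1-3 stars -> negative (0), 5 stars -> positive (1).
    # 4-star reviews are outside the stated scheme, so they are dropped here.
    df = df[df["Rating"] != 4]
    df["label"] = (df["Rating"] == 5).astype(int)

    texts, labels = df["Reviews"].tolist(), df["label"].values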

From the perspective of technical implementation, this supervised sentiment-analysis model is fairly simple, mainly because the labeling scheme and the intrinsic attributes of the dataset already yield high accuracy (over 90%). So I did not implement more complex methods to improve accuracy, since it was already satisfactory.

From the perspective of deployment, a more complex model generally takes more storage space and more time to train and to score new data. Moreover, the data-preprocessing instances take extra storage and time: to score a new input, it has to pass through the preprocessing instances before reaching the trained model. Imagine having to wait five seconds before the model can return a prediction for a new review. More importantly, all program instances in deployment have to stay resident in RAM, so a more sophisticated model and preprocessing pipeline burden the deployment server and raise the probability of errors and disconnections. Since a simple model already achieves over 90% accuracy, it is not worth building a more complex one.

 

Methodology

  1. Stop words: prepared a text file (stop.txt) with a customized stop-word list (a loading sketch follows this item).
  • Important: keep the negative stop words! Many stop-word lists filter out words like “dont”, “isnt”, etc. Since we are building a sentiment-analysis model, negation words are very important, so the first thing to do is delete from the list any stop words that carry positive/negative emotion.

  • Customizing: add more stop words to stop.txt, or directly in the Python code.
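
A minimal sketch of that stop-word handling; the file name stop.txt comes from above, while the exact negation words and extra additions are illustrative:

    # Build a custom stop-word set from stop.txt, making sure negation
    # words survive, since they carry the sentiment we want to keep.
    NEGATIONS = {"not", "no", "nor", "never",
                 "dont", "don't", "isnt", "isn't", "wasnt", "wasn't"}

    with open("stop.txt") as f:
        stop_words = {line.strip().lower() for line in f if line.strip()}

    stop_words -= NEGATIONS                  # delete emotion-carrying stop words
    stop_words |= {"phone", "amazon"}        # illustrative extra stop words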

 

  2. Stemming and lemmatization: not used.
  • For this dataset, implementing stemming and lemmatization actually lowered the model’s accuracy.

 

  3. Tokenization: HashingVectorizer with over 1 million features (words & phrases); a sketch follows this item.
  • N-grams: I simulated n from 1 to 5; n = 3 (trigrams) works best (highest accuracy). https://en.wikipedia.org/wiki/N-gram

  • TfidfVectorizer: no sign of improved accuracy; in fact it was slightly less accurate.

  • CountVectorizer vs. HashingVectorizer:

  • Practical: in my implementation, a trained CountVectorizer can be as large as 500 MB, whereas a trained HashingVectorizer is about 7 KB; the difference is huge (up to ~100,000×). Since the vectorizer has to stay active in RAM, CountVectorizer would be far more costly and less stable.
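
A sketch of the vectorizer as described: 2**20 (~1.05 million) hashed feature slots, unigrams through trigrams, and the stop-word set from the earlier sketch:

    from sklearn.feature_extraction.text import HashingVectorizer

    # Stateless hashing vectorizer: no vocabulary to store, hence the tiny
    # on-disk footprint compared with CountVectorizer.
    token = HashingVectorizer(n_features=2**20,
                              ngram_range=(1, 3),
                              stop_words=list(stop_words))

    X = token.transform(texts)   # sparse document-term matrix; no fit needed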
  4. Filtering out tokens that appear only once or that appear most frequently: not implemented (a sketch of how it could be done follows this item).
  • If applied here, it significantly reduced accuracy. The method is useful on extremely large datasets; in my experience, text corpora over 1 GB can benefit from it.
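
Had this been applied, HashingVectorizer could not do it (it keeps no vocabulary), so it would need CountVectorizer’s document-frequency thresholds; the thresholds below are illustrative:

    from sklearn.feature_extraction.text import CountVectorizer

    cv = CountVectorizer(ngram_range=(1, 3),
                         min_df=2,     # drop tokens appearing in a single document
                         max_df=0.95)  # drop tokens appearing in >95% of documents
    X_counts = cv.fit_transform(texts)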
  5. Modeling (a minimal training sketch follows this item)
  • Splitting the data: train : validation = 0.75 : 0.25

  • Since we have over 1 million features and over 300 thousand reviews, I chose models built on stochastic and/or linear algorithms that handle data at this scale well.

  • Implemented models: LogisticRegression, SGDClassifier, PassiveAggressiveClassifier, LinearSVC, RidgeClassifier, etc.

  • Cross-validation: to ensure the models are unbiased after the train-test split.

  • 10-fold cross-validation (PassiveAggressiveClassifier takes the least time).

 

  • Best model: LinearSVC, with over 95% accuracy on the 25% validation set.
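
A minimal sketch of this modeling step, reusing X and labels from the tokenization sketch; the random seed is arbitrary:

    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.svm import LinearSVC

    # 75 : 25 train/validation split.
    X_train, X_val, y_train, y_val = train_test_split(
        X, labels, test_size=0.25, random_state=42)

    model = LinearSVC()
    model.fit(X_train, y_train)
    print("validation accuracy:", model.score(X_val, y_val))

    # 10-fold cross-validation as an unbiasedness check.
    scores = cross_val_score(LinearSVC(), X, labels, cv=10)
    print("10-fold mean accuracy:", scores.mean())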
  6. Extra methods: these should generally be implemented in order to build a really good machine-learning application (a sketch follows this item).

  • Dimensionality reduction: with over 1 million features, models such as kernel SVMs, neural networks, and trees are impossible to use without it.

  • Feature selection: e.g. K-best features

  • Decomposition: e.g. PCA, SVD, NMF, Isomap

  • Normalization: e.g. maxabs_scale, normalize, StandardScaler

  • I tried normalization; it did not improve results in our scenario.
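
A sketch of these extra methods; the k and component counts are illustrative, and chi2 needs non-negative inputs, which the default signed hashing does not guarantee, hence the re-vectorization:

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.decomposition import TruncatedSVD
    from sklearn.preprocessing import maxabs_scale

    # Non-negative hashed features so that chi2 applies.
    X_pos = HashingVectorizer(n_features=2**20, ngram_range=(1, 3),
                              alternate_sign=False).transform(texts)

    # Feature selection: keep the 50,000 highest-scoring features.
    X_best = SelectKBest(chi2, k=50000).fit_transform(X_pos, labels)

    # Decomposition: TruncatedSVD works on sparse matrices (PCA would need a dense array).
    X_svd = TruncatedSVD(n_components=200).fit_transform(X_best)

    # Normalization: maxabs_scale preserves sparsity.
    X_scaled = maxabs_scale(X_best)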
  7. Other machine-learning models: popular models that generally achieve good accuracy on categorical data, such as neural networks, trees & forests, and ensembles.
  • Time- and computation-expensive: with 1 million features and 300 thousand instances, an SVM may take hours or days to converge. Yet after reducing the features to, e.g., 50,000, a neural-network model can achieve 97% accuracy (though it takes more time and is harder to deploy).

 

Further

  1. Parameter tuning: most scikit-learn models perform satisfactorily with default parameters, but a good application always involves customized parameter tuning. Methods: GridSearchCV, or writing simulation code directly in Python (a sketch follows this item).
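
A GridSearchCV sketch for the best model above; the C grid is illustrative:

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import LinearSVC

    # Tune LinearSVC's regularization strength C over a small grid.
    grid = GridSearchCV(LinearSVC(),
                        param_grid={"C": [0.01, 0.1, 1, 10]},
                        cv=5, n_jobs=-1)
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.best_score_)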

  2. Stacking: combine different models, features & methods.

  • E.g. build topic models first and run a random forest on the topics (a sketch follows).
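
One way to sketch that idea in scikit-learn; the component and tree counts are illustrative, and since LDA expects non-negative counts this reuses the reduced X_best matrix from the earlier sketch:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import make_pipeline

    # Reduce the term matrix to 50 topic weights, then classify on them.
    stacked = make_pipeline(
        LatentDirichletAllocation(n_components=50),
        RandomForestClassifier(n_estimators=200),
    )
    stacked.fit(X_best, labels)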

  3. Other Python tools

  • XGBoost has a reputation as a winning tool, but it requires too much time on feature wrangling and parameter tuning; without that work, XGBoost performs worse than generic scikit-learn models.

  4. Deep learning

  • Similar to XGBoost, it requires time & energy to do well.


Product

The deployment page is live online: Click here


In Python, predicting new inputs works as follows:

  • token is the trained tokenizer, e.g. the HashingVectorizer;

  • model is the trained model.

 

Example code:
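
This is a minimal sketch rather than the attachment’s code; token and model are the fitted HashingVectorizer and LinearSVC from the sketches above, and the sample review is made up:

    # Score one new review with the trained pieces.
    new_review = "This phone is amazing and the battery lasts forever!"
    vec = token.transform([new_review])   # same preprocessing as at training time
    pred = model.predict(vec)[0]          # 1 = positive, 0 = negative
    print("positive" if pred == 1 else "negative")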