Starspace for NLP

Our latest addition to the R NLP ecosystem is the R package ruimtehol, which is open source and available at https://github.com/bnosac/ruimtehol. The package is a wrapper around Starspace, a neural embedding model that supports the following tasks on text:

  • Text classification

  • Learning word, sentence or document level embeddings

  • Finding sentence or document similarity

  • Ranking web documents

  • Content-based recommendation (e.g. recommend text/music based on the content)

  • Collaborative filtering based recommendation (e.g. recommend text/music based on interest)

  • Identification of entity relationships

If you are an R user interested in NLP techniques, feel free to test out the framework and provide feedback at https://github.com/bnosac/ruimtehol/issues. The package is not on CRAN yet, but can be installed easily with the command devtools::install_github("bnosac/ruimtehol", build_vignettes = TRUE).

Below is an example of how the package can be used for multi-label classification of questions asked in the Belgian parliament. Each question was labelled with several of the 1785 available categories.

    ## Each question in parliament was labelled with more than 1 category. There are 1785 categories in this dataset
    dekamer$question_themes <- strsplit(dekamer$question_theme, " +\\| +")
    ## Plain text of the question in parliament
    dekamer$text <- strsplit(dekamer$question, "\\W")
    dekamer$text <- sapply(dekamer$text, FUN = function(x) paste(x, collapse = " "))
    dekamer$text <- tolower(dekamer$text)
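To see what these preprocessing steps do, here is a minimal sketch that applies the same calls to a small, made-up data frame. The object toy and its two rows are assumptions for illustration only; they are not part of the dekamer dataset.

```r
# Toy stand-in for the dekamer data; both rows are invented for illustration.
toy <- data.frame(
  question_theme = c("POLITIE | OPENBARE VEILIGHEID", "VERVOERBELEID"),
  question       = c("Hoeveel patrouilles reden er in 2017?",
                     "De NMBS breidt het treinaanbod uit."),
  stringsAsFactors = FALSE
)

# Split the pipe-separated themes: one character vector of labels per question
toy$question_themes <- strsplit(toy$question_theme, " +\\| +")

# Split on non-word characters, glue the pieces back with spaces, then lowercase
toy$text <- strsplit(toy$question, "\\W")
toy$text <- sapply(toy$text, FUN = function(x) paste(x, collapse = " "))
toy$text <- tolower(toy$text)

print(toy$question_themes)
print(toy$text)
```

Note that the regular expressions need doubled backslashes inside R string literals, and that each element of question_themes is a vector, which is what makes the setup multi-label.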

    term1                       term2                       similarity rank
    federale politie patrouille __label__POLITIE             0.8480641    1
    federale politie patrouille __label__OPENBARE            0.6919607    2
    federale politie patrouille __label__BEROEPSMOBILITEIT   0.6907637    3
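The output above pairs a piece of text with labels, a similarity score and a rank. As a rough sketch of how such a ranking can be derived from embeddings, the base R code below ranks three labels by cosine similarity to a text vector. The 3-dimensional vectors are invented, and the use of cosine similarity is an assumption; the real scores come from a trained Starspace model and will differ.

```r
# Hypothetical embeddings: made-up numbers, not output of a trained model.
text_vec <- c(0.9, 0.1, 0.2)  # stands in for "federale politie patrouille"
labels <- rbind(
  "__label__POLITIE"           = c(0.8, 0.2, 0.1),
  "__label__OPENBARE"          = c(0.3, 0.7, 0.2),
  "__label__BEROEPSMOBILITEIT" = c(0.1, 0.2, 0.9)
)

# Cosine similarity between two numeric vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Score every label against the text, then sort from most to least similar
sims <- apply(labels, 1, cosine, b = text_vec)
ord  <- order(sims, decreasing = TRUE)

result <- data.frame(
  term1      = "federale politie patrouille",
  term2      = rownames(labels)[ord],
  similarity = unname(sims[ord]),
  rank       = seq_along(ord),
  stringsAsFactors = FALSE
)
print(result)
```

The resulting data frame has the same four columns as the output shown above, with the closest label at rank 1.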

The list of R packages for text mining provided by BNOSAC has been growing steadily. These are the R packages currently maintained by BNOSAC:

  • udpipe: tokenisation, lemmatisation, parts of speech tagging, dependency parsing, morphological feature extraction, sentiment scoring, keyword extraction, NLP flows

  • crfsuite: named entity recognition, text classification, chunking, sequence modelling

  • textrank: text summarisation

  • ruimtehol: text classification, word/sentence/document embeddings, document/label similarities, ranking documents, content-based recommendation, collaborative filtering-based recommendation

More details on ruimtehol are available at the development repository https://github.com/bnosac/ruimtehol, where you can also provide feedback.

Training on Text Mining 

If you are interested in how text mining techniques work, you might want to attend one of the following data science courses, which are held in the coming months.

  • 19-20/12/2018: Applied spatial modelling with R. Leuven (Belgium). Subscribe here

  • 21-22/02/2019: Advanced R programming. Leuven (Belgium). Subscribe here

  • 13-14/03/2019: Computer Vision with R and Python. Leuven (Belgium). Subscribe here

  • 15/03/2019: Image Recognition with R and Python. Subscribe here

  • 01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here
