Starspace for NLP

Our recent addition to the NLP R universe is the R package ruimtehol, which is open sourced at https://github.com/bnosac/ruimtehol. This R package is a wrapper around Starspace, which provides a neural embedding model for doing the following on text:

Text classification
Learning word, sentence or document level embeddings
Finding sentence or document similarity
Ranking web documents
Content-based recommendation (e.g. recommend text/music based on the content)
Collaborative filtering based recommendation (e.g. recommend text/music based on interest)
Identification of entity relationships

If you are an R user interested in NLP techniques, feel free to test out the framework and provide feedback at https://github.com/bnosac/ruimtehol/issues. The package is not on CRAN yet, but can be installed easily with the command devtools::install_github("bnosac/ruimtehol", build_vignettes = TRUE).

Below is an example of how the package can be used for multi-label classification of questions asked in the Belgian parliament. Each question in parliament was labelled with several of the 1785 categories.

    ## Each question in parliament was labelled with more than 1 category. There are 1785 categories in this dataset
    dekamer$question_themes <- strsplit(dekamer$question_theme, " +\\| +")
    ## Plain text of the question in parliament
    dekamer$text <- strsplit(dekamer$question, "\\W")
    dekamer$text <- sapply(dekamer$text, FUN = function(x) paste(x, collapse = " "))
    dekamer$text <- tolower(dekamer$text)
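To make the preprocessing steps concrete, here is a minimal sketch in base R on a made-up two-row data frame. The object toy, its two questions and its theme labels are illustrative assumptions, not rows from the dekamer data. Note that in R source code the backslashes in the regular expressions have to be doubled.

```r
# Toy stand-in for the dekamer data (hypothetical content, for illustration only)
toy <- data.frame(
  question_theme = c("POLITIE | OPENBARE VEILIGHEID", "SPOORWEGEN"),
  question = c("Hoeveel patrouilles reed de federale politie?",
               "Is het treinaanbod van de NMBS uitgebreid?"),
  stringsAsFactors = FALSE
)

# Split the pipe-separated themes into a list column: one character vector per question
toy$question_themes <- strsplit(toy$question_theme, " +\\| +")

# Split on every non-word character, glue the pieces back with spaces, lowercase
toy$text <- strsplit(toy$question, "\\W")
toy$text <- sapply(toy$text, FUN = function(x) paste(x, collapse = " "))
toy$text <- tolower(toy$text)

print(toy$question_themes)
print(toy$text)
```

Because "\\W" matches one character at a time, a comma followed by a space yields an empty piece and therefore a double space after pasting. If that matters for your data, splitting on "\\W+" is one possible alternative.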

The labels most similar to the text "federale politie patrouille", ranked by similarity:

    term1                        term2                         similarity  rank
    federale politie patrouille  __label__POLITIE              0.8480641   1
    federale politie patrouille  __label__OPENBARE             0.6919607   2
    federale politie patrouille  __label__BEROEPSMOBILITEIT    0.6907637   3
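A ranking of this shape (term1, term2, similarity, rank) can be produced by comparing one text embedding against a set of label embeddings. The following is a minimal base R sketch of that idea, assuming cosine similarity as the measure. It is not ruimtehol's own code: the helper names cosine and rank_labels are hypothetical, and the 3-dimensional vectors are made up rather than taken from a trained model, so the resulting numbers differ from the output above.

```r
# Hypothetical embeddings; a trained model would supply the real ones
text_emb <- matrix(c(1, 0, 1), nrow = 1,
                   dimnames = list("federale politie patrouille", NULL))
label_emb <- rbind(
  "__label__POLITIE"           = c(1, 0, 1),
  "__label__OPENBARE"          = c(1, 1, 0),
  "__label__BEROEPSMOBILITEIT" = c(0, 1, 0)
)

# Cosine similarity between a 1-row matrix x and every row of matrix m
cosine <- function(x, m) {
  num <- as.vector(m %*% t(x))
  den <- sqrt(sum(x^2)) * sqrt(rowSums(m^2))
  unname(num / den)
}

# Keep the top_n most similar labels and report them with their rank
rank_labels <- function(text_emb, label_emb, top_n = 3) {
  sim <- cosine(text_emb, label_emb)
  ord <- order(sim, decreasing = TRUE)[seq_len(min(top_n, length(sim)))]
  data.frame(term1 = rownames(text_emb),
             term2 = rownames(label_emb)[ord],
             similarity = sim[ord],
             rank = seq_along(ord),
             stringsAsFactors = FALSE)
}

result <- rank_labels(text_emb, label_emb, top_n = 3)
print(result)
```

With real embeddings, the same function would be called with the model's text embedding as the first argument and its label embedding matrix as the second.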

The list of R packages for text mining provided by BNOSAC has been steadily growing. These are the R packages currently maintained by BNOSAC:

udpipe: tokenisation, lemmatisation, parts of speech tagging, dependency parsing, morphological feature extraction, sentiment scoring, keyword extraction, NLP flows
crfsuite: named entity recognition, text classification, chunking, sequence modelling
textrank: text summarisation
ruimtehol: text classification, word/sentence/document embeddings, document/label similarities, ranking documents, content-based recommendation, collaborative filtering-based recommendation

More details on ruimtehol are available at the development repository https://github.com/bnosac/ruimtehol, where you can also provide feedback.

Training on Text Mining

If you are interested in how text mining techniques work, you might be interested in the following data science courses, which are held in the coming months.

19-20/12/2018: Applied spatial modelling with R. Leuven (Belgium). Subscribe here
21-22/02/2019: Advanced R programming. Leuven (Belgium). Subscribe here
13-14/03/2019: Computer Vision with R and Python. Leuven (Belgium). Subscribe here
15/03/2019: Image Recognition with R and Python: Subscribe here
01-02/04/2019: Text Mining with R. Leuven (Belgium). Subscribe here
