The 4th Annual TEXATA Summit is only 3 weeks away! Join us on Friday, October 19th in Austin, Texas to learn from and connect with fellow industry leaders discussing the latest trends and innovations in AI, advanced analytics, machine learning, and big data.
Modeling Multi-Category Outcomes With vtreat
vtreat is a powerful R package for preparing messy real-world data for machine learning. We have further extended the package with a number of features, including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).
Chromebook Data Science - a free online data science program for anyone with a web browser.
Jeff Leek, 2018/10/01
A Right to Reasonable Inferences
By Dr. Sandra Wachter, Lawyer and Research Fellow (Asst. Prof.), University of Oxford
Import AI 114: Synthetic images take a big leap forward with BigGANs; US lawmakers call for national AI strategy; researchers probe language reasoning via HotpotQA
Getting hip to multi-hop reasoning with HotpotQA: …New dataset and benchmark designed to test common sense reasoning capabilities…
Researchers with Carnegie Mellon University, Stanford University, the Montreal Institute for Learning Algorithms, and Google AI have created a new dataset and associated competition designed to test the capabilities of question answering systems. The new dataset, HotpotQA, is far larger than many prior datasets designed for such tasks, and has been designed to require ‘multi-hop’ reasoning, thereby testing the growing sophistication of newer NLP systems at performing increasingly complex cognitive tasks. HotpotQA consists of ~113,000 Wikipedia-based question-answer pairs. Answering these questions correctly is designed to test for ‘multi-hop’ reasoning – the ability of systems to look at multiple documents and perform basic iterative problem-solving to come up with correct answers. These questions were “collected by crowdsourcing based on Wikipedia articles, where crowd workers are shown multiple supporting context documents and asked explicitly to come up with questions requiring reasoning about all of the documents”. These workers also provide the supporting facts they use to answer these questions, providing a strong supervised training set.
**It’s the data, stupid:** To develop HotpotQA the researchers needed to themselves create a kind of multi-hop pipeline to figure out which documents to give crowd workers to compose questions from. To do this, they mapped the Wikipedia hyperlink graph and used this information to build a directed graph, then detected correspondences between pairs of documents. They also created a hand-made list of categories to use to compare things of similar categories (e.g., basketball players).
**Testing:** HotpotQA can be used to test models’ capabilities in different ways, ranging from information retrieval to question answering.
The researchers train a system to give a baseline, and the results show that this (relatively strong) baseline performs significantly below a competent human across all tasks (with the exception of certain ‘supporting fact’ evaluations, in which it obtains performance on par with an average human).
**Why it matters:** Natural language processing research is currently going through what some have called an ‘ImageNet moment’ following recent algorithmic developments relating to the usage of memory and attention-based systems, which have demonstrated significantly higher performance across a range of reasoning tasks compared to prior techniques, while also being typically much simpler. As with ImageNet and its associated supervised classification systems, these new types of NLP approaches require larger datasets to be trained on and evaluated against, and as with ImageNet it is likely that scaling up techniques to take on challenges defined by datasets like HotpotQA will further accelerate progress in this domain.
**Caveat:** As with all datasets with an associated competitive leaderboard, it is feasible that HotpotQA could prove relatively easy, with systems exceeding human performance against it in a relatively short amount of time – this happened over the past year with the Stanford SQuAD dataset. Hopefully the relatively higher sophistication of HotpotQA will protect against this.
**Read more:** HotpotQA website with leaderboard and data (HotpotQA GitHub).
**Read more:** HOTPOTQA: A Dataset for Diverse, Explainable Multi-hop Question Answering (arXiv).
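To make the kind of question-answering evaluation used on such leaderboards concrete, here is a minimal sketch of SQuAD-style token-level F1 between a predicted and a gold answer string. This is a simplified stand-in, not HotpotQA's official evaluation script (which additionally normalizes punctuation and articles, and also scores supporting facts); the function name and toy strings are illustrative assumptions.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-level F1 (simplified sketch: whitespace
    tokens, no punctuation or article stripping)."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    # Multiset intersection counts shared tokens (with multiplicity).
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the Apollo program", "Apollo program"))  # partial credit
```

Partial overlap yields partial credit, which is why F1 is reported alongside exact match: a near-miss answer still scores well above zero.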
Up your open source game with Hacktoberfest at Locke Data!
How awesome is open source software? Quite awesome, in our opinion! Locke Data maintains several open source repos on GitHub, in particular R packages, and we’d like you to join in the fun! This month, we’re taking part in Hacktoberfest and will do our best to mentor you through your first open source contributions if you wish!
A Review of the Neural History of Natural Language Processing
Disclaimer: This post tries to condense ~15 years’ worth of work into the eight milestones that are most relevant today, and thus omits many relevant and important developments. In particular, it is heavily skewed towards current neural approaches, which may give the false impression that no other methods were influential during this period. More importantly, many of the neural network models presented in this post build on non-neural milestones of the same era. In the final section of this post, we highlight such influential work that laid the foundations for later methods.
Reinforcement Learning: Super Mario, AlphaGo and beyond
Distilled News
Text Mining 101: A Stepwise Introduction to Topic Modeling using Latent Semantic Analysis (using Python)
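As a pointer to what the linked tutorial covers, here is a minimal LSA sketch: build a term-document matrix, take a truncated SVD, and read off document coordinates in the latent "topic" space. The toy corpus and the choice of k=2 are illustrative assumptions, and this uses plain NumPy rather than the tutorial's exact code.

```python
import numpy as np

# Toy corpus (illustrative assumption): two themes, space and baking.
docs = [
    "rocket launch orbit satellite",
    "satellite orbit rocket engine",
    "recipe bake oven flour",
    "flour oven bread butter",
]
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Term-document count matrix A (terms x documents).
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[idx[w], j] += 1

# LSA = truncated SVD of A, keeping k latent "topics".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_topics = (np.diag(s[:k]) @ Vt[:k]).T  # documents x k topic coordinates

# Documents about the same theme land close together in topic space.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(doc_topics[0], doc_topics[1]))  # high: both space documents
print(cosine(doc_topics[0], doc_topics[2]))  # near zero: different themes
```

A real pipeline would typically use TF-IDF weighting instead of raw counts and a sparse truncated SVD (e.g. scikit-learn's TruncatedSVD) rather than a dense full SVD.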
Bob Erikson on the 2018 Midterms
Donald Trump’s tumultuous presidency has sparked far more than the usual interest in the next midterm elections as a possible midcourse correction. Can the Democrats win back the House of Representatives and possibly even the Senate in 2018? This short essay presents some observations about midterm elections and congressional elections generally, followed by some considerations relevant to understanding the upcoming 2018 midterm verdict. Most of my [Erikson’s] remarks would be commonplace among seasoned congressional election scholars. Please note, however, that I tout a theory of ideological balancing in elections that remains controversial in some quarters.