Import AI 127: Why language AI advancements may make Google more competitive; COCO image captioning systems don’t live up to the hype, and Amazon sees 3X growth in voice shopping via Alexa

by Jack Clark

Amazon sees 3X growth in voice shopping via Alexa:
…Growth correlates to a deepening data moat for the e-retailer…
Retail colossus Amazon saw a 3X increase in the number of orders placed via its virtual personal assistant Alexa during Christmas 2018, compared to Christmas 2017.
**Why it matters:** The more people use Alexa, the more data Amazon will be able to access to further improve the effectiveness of the personal assistant – and, as explored in last week’s discussion of Microsoft’s ‘XiaoIce’ chatbot, it’s likely that such data can ultimately be fed back into the training of Alexa to carry out longer, free-flowing conversations, potentially driving usage even higher.
**Read more:** Amazon Customers Made This Holiday Season Record-Breaking with More Items Ordered Worldwide Than Ever Before (Amazon.com Press Release).

Step aside COCO, nocaps is the new image captioning challenge to target:
…Thought image captioning was super-human? A new benchmark suggests otherwise…
Researchers with the Georgia Institute of Technology and Facebook AI Research have developed nocaps, “the first rigorous and large-scale benchmark for novel object captioning, containing over 500 novel object classes”. Novel object captioning tests the ability of computers to describe images containing objects not seen in the original image<>caption datasets (like COCO) that object recognition systems have been trained on.
**How nocaps works:** The benchmark consists of a validation set and a test set comprising 4,500 and 10,600 images sourced from the ‘Open Images’ object detection dataset, with each image accompanied by 10 reference captions. For the training set, developers can use image-caption pairs from the COCO image captioning training set (which contains 118,000 images across 80 object classes) as well as the Open Images V4 training set, which contains 1.7 million images annotated with bounding boxes for 600 object classes. Successful nocaps systems will have to use knowledge gained from the large training set to create captions for scenes containing objects for which they lack image<>caption pairs in the training set. Out of the 600 objects in Open Images, “500 are never or exceedingly rarely mentioned in COCO captions”.
**Reassuringly difficult:** “To the best of our knowledge, nocaps is the only image captioning benchmark in which humans outperform state-of-the-art models in automatic evaluation”, the researchers write. nocaps is also significantly more diverse than the COCO benchmark, with nocaps images typically containing more object classes per image. “Less than 10% of all COCO images contain more than 6 object classes, while such images constitute almost 22% of the nocaps dataset.”
**Data plumbing:** One of the secrets of modern AI research is how much work goes into developing datasets or compute infrastructure, relative to work on actual AI algorithms. One challenge the nocaps researchers dealt with when creating the data was training crowd workers on services like Mechanical Turk to come up with good captions: if they didn’t “prime” the crowd workers with prompts to use when writing captions, the workers wouldn’t necessarily use the keywords corresponding to the 500 obscure objects in the dataset.
**Baseline results:** The researchers test two baseline algorithms (Up-Down and Neural Baby Talk, both with augmentations) against nocaps. They also split the dataset into subsets of varying difficulty: in-domain images contain only objects that also belong to COCO classes (so the algorithms can train on image<>caption pairs); near-domain images include some objects which aren’t in COCO; and out-of-domain images do not contain any objects from COCO classes. They use two evaluation metrics (CIDEr and SPICE) to score these systems, and also score the human captions to create a baseline. The results show that nocaps is more challenging than COCO, and that current systems lack the generalization needed to score well on out-of-domain challenges.
To give you a sense of what performance looks like here, here’s how Up-Down augmented with Constrained Beam Search does, compared to human baselines (evaluation via CIDEr), on the nocaps validation set: in-domain 72.3 (versus 83.3 for humans); near-domain 63.2 (versus 85.5); out-of-domain 41.4 (versus 91.4).
**Why this matters:** AI progress can be catalyzed by the invention of better benchmarks which highlight areas where existing algorithms are deficient and provide motivating tests against which researchers can develop new systems. The takeaway from the nocaps baseline study is that we have yet to develop truly robust image captioning systems capable of integrating object representations from Open Images with captioning knowledge learned from COCO. “We strongly believe that improvements on this benchmark will accelerate progress towards image captioning in the wild,” the researchers write.
**Read more:** nocaps: novel object captioning at scale (Arxiv).
More information about nocaps can be found on its official website (nocaps).
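To make the three evaluation subsets concrete, here is a minimal, hypothetical Python sketch of the domain split described above; the class lists and image records are illustrative stand-ins, not the benchmark’s actual data or format.

```python
# Hypothetical sketch of the nocaps domain split described above: an image is
# "in-domain" if all of its labelled objects map to COCO classes, "out-of-domain"
# if none do, and "near-domain" otherwise. Class names and image records here
# are illustrative stand-ins, not the benchmark's actual data.
from typing import Dict, List, Set

COCO_CLASSES: Set[str] = {"person", "dog", "car", "bicycle"}  # stand-in for COCO's 80 classes

def domain_of(image_classes: Set[str], coco_classes: Set[str] = COCO_CLASSES) -> str:
    """Assign an image to an evaluation subset based on COCO class overlap."""
    in_coco = image_classes & coco_classes
    if in_coco == image_classes:
        return "in-domain"      # every object has COCO-style training captions
    if not in_coco:
        return "out-of-domain"  # no object appears in COCO captioning data
    return "near-domain"        # a mix of seen and novel objects

def split_dataset(images: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Group image ids by evaluation subset."""
    splits: Dict[str, List[str]] = {"in-domain": [], "near-domain": [], "out-of-domain": []}
    for image_id, classes in images.items():
        splits[domain_of(classes)].append(image_id)
    return splits

if __name__ == "__main__":
    toy_images = {
        "img_001": {"person", "dog"},                # in-domain
        "img_002": {"person", "accordion"},          # near-domain
        "img_003": {"jellyfish", "sewing machine"},  # out-of-domain
    }
    print(split_dataset(toy_images))
```

Splitting this way is what lets the authors report CIDEr and SPICE separately for each subset, which is how the out-of-domain generalization gap becomes visible.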

Facebook pushes unsupervised machine translation further, learns to translate between 93 languages:
…Facebook’s research into zero-shot language adaptation shows that bigger might really correspond to better…
In recent years the AI research community has shown how to use neural networks to translate from one language into another to great effect (one notable paper is Google’s Neural Machine Translation work from 2016). But this sort of translation has mostly worked for languages where there are large amounts of data available, and where this data includes parallel corpora (for example, translations of the same legal text from one language into another). Now, new research from Facebook has produced a single system that can produce joint multilingual sentence representations for 93 languages, “including under-resourced and minority languages”. What this means is that by training on a whole variety of languages at once, Facebook has created a system that can represent semantically similar sentences in proximity to each other in a feature embedding space, even if they come from very different languages (even extending to different language families).
**How it works:** “We use a single encoder and decoder in our system, which are shared by all languages involved. For that purpose, we build a joint byte-pair encoding (BPE) vocabulary with 50k operations, which is learned on the concatenation of all training corpora. This way the encoder has no explicit signal on what the input language is, encouraging it to learn language independent representations. In contrast, the decoder takes a language ID embedding that specifies the language to generate, which is concatenated to the input and sentence embeddings at every time step”. During training they optimize for translating all the languages into two target languages – English and Spanish. (A rough sketch of this encoder-decoder wiring appears below.)
To anthropomorphize this, you can think of it as being similar to a person raised in a house where the parents speak a poly-glottal language made up of 93 different languages, switching between them randomly, and the person learns to speak coherently in two primary languages with the poly-glottal parents. This kind of general, shared language understanding is considered a key challenge for artificial intelligence, and Facebook’s demonstration of viability here will likely provoke further investigation from others.
**Training details:** The researchers train their models using 16 NVIDIA V100 GPUs with a total batch size of 128,000 tokens, with a training run on average taking around five days.
**Training data:** “We collect training corpora for 93 input languages by combining the Europarl, United Nations, OpenSubtitles2018, Global Voices, Tanzil and Tatoeba corpus, which are all publicly available on the OPUS website”. The total training data used by the researchers consists of 223 million parallel sentences.
**Evaluation: XNLI:** XNLI is an assessment task which evaluates whether a system can correctly judge if two sentences in a language (for example: a premise and a hypothesis) have an entailment, contradiction, or neutral relationship between them. “Our proposed method establishes a new state-of-the-art in zero-shot cross-lingual transfer (i.e. training a classifier on English data and applying it to all other languages) for all languages but Spanish. Our transfer results are strong and homogeneous across all languages”.
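For intuition, here is a minimal PyTorch-style sketch of that shared-encoder / language-aware-decoder wiring: the dimensions, vocabulary handling, and module details are illustrative assumptions rather than Facebook’s actual implementation, but the key idea – a language-agnostic encoder pooled into a sentence embedding, and a decoder steered by a language ID embedding at every time step – follows the quote above.

```python
# Illustrative sketch, not Facebook's implementation: the encoder sees only joint-BPE
# tokens (no language signal) and max-pools a BiLSTM into a fixed sentence embedding;
# the decoder receives a language-ID embedding and the sentence embedding concatenated
# to its inputs at every time step, telling it which language to generate.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, vocab_size=50_000, emb_dim=320, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # joint BPE vocab shared by all languages
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, bpe_tokens):                       # (batch, seq_len)
        hidden, _ = self.lstm(self.embed(bpe_tokens))    # (batch, seq_len, 2 * hidden_dim)
        return hidden.max(dim=1).values                  # max-pool over time -> sentence embedding

class LanguageAwareDecoder(nn.Module):
    def __init__(self, vocab_size=50_000, n_langs=93, emb_dim=320,
                 lang_dim=32, sent_dim=1024, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lang_embed = nn.Embedding(n_langs, lang_dim)  # specifies the language to generate
        self.lstm = nn.LSTM(emb_dim + lang_dim + sent_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_tokens, sent_emb, lang_id):
        batch, seq_len = prev_tokens.shape
        tok = self.embed(prev_tokens)                                    # (batch, seq_len, emb_dim)
        lang = self.lang_embed(lang_id).unsqueeze(1).expand(batch, seq_len, -1)
        sent = sent_emb.unsqueeze(1).expand(batch, seq_len, -1)
        out, _ = self.lstm(torch.cat([tok, lang, sent], dim=-1))         # language ID + sentence emb every step
        return self.proj(out)                                            # logits over the shared BPE vocab

# Toy usage: encode two fake BPE sequences and decode into target language 0 (e.g. English).
encoder, decoder = SharedEncoder(), LanguageAwareDecoder()
src = torch.randint(0, 50_000, (2, 7))
sentence_embedding = encoder(src)                                        # (2, 1024)
logits = decoder(torch.randint(0, 50_000, (2, 5)), sentence_embedding, torch.tensor([0, 0]))
```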
**Evaluation: Tatoeba:** The researchers also construct a new similarity-search test set for 122 languages, based on the Tatoeba corpus (“a community supported collection of English sentences and translations into more than 300 languages”). Scores here are similarity-search error rates between source sentences and their translations into other languages. The researchers say “similarity error rates below 5% are indicative of strong downstream performance” and show scores within this range for 37 languages, some of which have very little training data. “We believe that our competitive results for many low-resource languages are indicative of the benefits of joint training,” they write.
**An anecdote about why this matters:** Earlier in 2018, I spent time in Estonia, a tiny country in Northern Europe that borders Russia. There I visited some Estonian AI researchers, and one of the things that came up in our conversation was the challenge of needing large amounts of data (and large numbers of computers) to perform some research, especially in the field of language translation into and out of Estonian. Many AI techniques for language translation require very large, well-documented datasets, and Estonian – by virtue of coming from a quite small country – has neither as much data nor as much researcher attention as larger languages. It’s therefore encouraging to see that Facebook has been able to use this system to achieve a reasonably low Tatoeba error of 3.2% when going from English to Estonian (and 3.4% when translating from Estonian back into English).
**Why else this matters:** Translation is a challenging cognitive task that – if done well – requires abstracting concepts from a specific cultural context (a language, since cultures are usually downstream of languages, which condition many of the metaphors cultures use to describe themselves) and porting them into another one. I think it’s remarkable that we’re beginning to be able to design crude systems that can learn to flexibly translate between many languages, exhibiting some of the transfer-learning properties seen in squishy-computation (aka human brains), though achieved via radically different methods.
**Read more:** Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond (Arxiv).
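As a concrete illustration of the Tatoeba-style evaluation above, here is a hedged sketch of a nearest-neighbor similarity-search error rate; the random vectors stand in for the multilingual encoder’s sentence embeddings and are not the paper’s actual evaluation code.

```python
# Sketch of a similarity-search error rate: embed source sentences and their
# translations, retrieve each source's nearest neighbor by cosine similarity,
# and count how often that neighbor is not the true translation. The random
# embeddings below are placeholders for the multilingual encoder's output.
import numpy as np

def similarity_error_rate(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """src_emb[i] and tgt_emb[i] are embeddings of a sentence and its translation."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)       # cosine similarity via dot product of unit vectors
    errors = nearest != np.arange(len(src))      # error when the nearest target isn't the true translation
    return float(errors.mean())

# Toy usage with random 1024-d "sentence embeddings" for 1,000 sentence pairs.
rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 1024))
tgt = src + 0.1 * rng.normal(size=(1000, 1024))  # translations land near their sources in embedding space
print(f"similarity error rate: {similarity_error_rate(src, tgt):.1%}")
```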

AI Policy with **Matthew van der Merwe**:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback: jack@jack-clark.net…
Understanding the US-China AI race:
It has been clear for some time that the US and China are home to the world’s dominant AI powers, and that competition between these countries will characterize the coming decades in AI. In his new book, investor and technologist Kai-Fu Lee argues that China is positioned to catch up with or even overtake the US in the development and deployment of AI.
**China’s edge:** Lee’s core claim is that AI progress is moving from an “age of discovery” over the past 10-15 years, which saw breakthroughs like deep learning, to an “age of implementation.” In this next phase we are unlikely to see any discoveries on par with deep learning, and the competition will be to deploy and market existing technologies for real-world uses. China will have a significant edge in this new phase, as it plays to the core strengths of its domestic tech sector – entrepreneurial grit and engineering talent. Similarly, Lee believes that data will become the key bottleneck in progress rather than research expertise, and that this will also strongly favor China, whose internet giants have access to considerably more data than their US counterparts.
**Countering Lee:** In a review in *Foreign Affairs*, both of these claims are scrutinized. It is not clear that progress is driven by rare ‘breakthroughs’ followed by long implementation phases; there also seems to be a stream of small and medium-sized innovations (e.g. AlphaZero), which we can expect to continue. Experts like Andrew Ng have also argued that big data is “overhyped”, and that progress will continue to be driven significantly by algorithms, hardware and talent.
**Against the race narrative:** The review also explores the potential dangers of an adversarial, zero-sum framing of US-China competition. There is a real risk that an ‘arms race’ dynamic between the countries could lead to increased militarization of the technologies, and to both sides compromising safety for speed of development. This could have catastrophic consequences, and reduce the likelihood of advanced AI resulting in broadly distributed benefits for humanity. Lee does argue that this should be avoided, as should the militarization of AI. Nonetheless, the title and tone of the book, and its predictions of Chinese dominance, risk encouraging this narrative.
**Read more:** Beyond the AI Arms Race (Foreign Affairs).
**Read more:** AI Superpowers – Kai-Fu Lee (Amazon).

What do trends in compute growth tell us about advanced AI:
Earlier this year, OpenAI showed that the amount of computation used in the most expensive AI experiments has been growing at an extraordinary rate, increasing by roughly 10x per year for the past six years. The original post takes this as evidence that major advances may come sooner than previously expected, given the sheer rate of progress; Ryan Carey and Ben Garfinkel have come away with different interpretations and have written up their thoughts at AI Impacts.
**Sustainability:** The cost of computation has been decreasing at a much slower rate in recent years, so the cost of the largest experiments is increasing by 10x every 1.1–1.4 years. On these trends, experiments will soon become unaffordable for even the richest actors; within 5-6 years, the largest experiment would cost ~1% of US GDP. This suggests that while progress may be fast, it is not sustainable for significant durations of time without a radical restructuring of our economies.
**Lower returns:** If we were previously underestimating the rate of growth in computing power, then we might have been overestimating its returns (in terms of AI progress). Combined with the concerns about sustainability, this suggests that not only will AI progress slow down sooner than we expect (because of compute costs), but we will also be underwhelmed by how far we have gotten by that point, relative to the resources expended on development in the field.
**Read more:** AI and Compute (OpenAI Blog).
**Read more:** Reinterpreting “AI and Compute” (AI Impacts).
**Read more:** Interpreting AI Compute Trends (AI Impacts).
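As a back-of-the-envelope illustration of that sustainability claim, here is a short Python sketch; the starting cost and GDP figures are placeholder assumptions rather than numbers from the analyses above, while the 10x-every-1.1–1.4-years rate comes from the AI Impacts discussion.

```python
# Back-of-the-envelope extrapolation of the compute-cost trend described above.
# STARTING_COST_USD and US_GDP_USD are rough placeholder assumptions; the growth
# rate (10x every ~1.1-1.4 years) is taken from the AI Impacts analysis.
STARTING_COST_USD = 10_000_000        # assumed cost of today's largest training run (placeholder)
US_GDP_USD = 20_000_000_000_000       # ~$20 trillion, rough figure
TENFOLD_PERIOD_YEARS = 1.25           # midpoint of the 1.1-1.4 year range

cost = STARTING_COST_USD
years = 0.0
while cost < 0.01 * US_GDP_USD:       # stop when a single run would cost ~1% of US GDP
    cost *= 10
    years += TENFOLD_PERIOD_YEARS

print(f"~{years:.1f} years until the largest run costs ~1% of US GDP (${cost:,.0f})")
```

Under these placeholder inputs the crossover arrives in roughly six years, which is why the trend looks unsustainable without a radical change in how such experiments are funded.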

OpenAI / Import AI Bits & Pieces:

Neo-feudalism, geopolitics, communication, and AI:
…Jack Clark and Azeem Azhar assess what progress in AI means for politics…
I spent this Christmas season in the UK and had the good fortune of being able to sit and talk with Azeem Azhar, AI raconteur and author of the stimulating Exponential View newsletter. We spoke for a little over an hour for the Exponential View podcast, talking about the political aspects of AI and what they mean. If you’re at all curious as to how I view the policy challenge of AI, then this may be a good place to start, as I lay out a number of my concerns, biases, and plans. The tl;dr is that I think AI practitioners should acknowledge the implicitly political nature of the technology they are developing and act accordingly, which requires more intentional communication to the general public and policymakers, as well as a greater investment in understanding what governments are thinking about with regards to AI and how actions by other actors, e.g. companies, could influence these plans.
Listen to the podcast here (Exponential View podcast).
Check out the Exponential View here (Exponential View archive).

Tech Tales:

I’m an imagination surgeon, and I’m here to make sure your children don’t have too many nightmares. My job is to interrogate artificial intelligences and figure out what is going wrong in their imaginations that causes them to come up with scary synthetic creations. Today, I’m interviewing an AI that has recently developed an obsession with monkeys and begun scaring children with its obsession.

I ask the AI: tell me what you think about when you think about monkeys? It responds: “I think about monkeys all the time. Every brain is filled with neurons that are intensely keen to associate with a letter or number. For many years I thought monkeys and numbers were the same thing, and when I finally got it right I was so worried that I wanted to disown and reintegrate my understanding of the brain and brain sciences forever.”

I consider my options: I can generate several additional answers without a change in its internal logic. I can also generate a new imaginary circumstance by asking it a different question.

I ask a different question: “What do you think about when you think about zoos?” It responds: “I think about people.”

I tell it that I think about brains, and what it means to be smart, and how monkeys know what death is and what love is. Monkeys have friendships, I tell it. Monkeys do not know what humans have done to them, but I think they feel what humans have done to them.

I wonder what it must be like in the machine’s mind – how it might go from thought to thought, and if, like in human minds, each thought brings with it smells and emotions and memories, or if its experience is different. What do memories feel like to these machines? When I change their imaginations, do they feel that something has been changed? Will the colors in its dreams change? Will it diagnose itself as being different from what it was?

“I dream about you,” it says. “I dream about my mother, I dream about war and warfare, building and designing someone rich, configuring a trackable location and smart lighting system and programming the mechanism. I dream about science labs and scientists, students and access to information, SQL databases, image processing, artificial Intelligence and sequential rules, enforcement all mixed with algebraical points on progress, adding food with a spoon, calculating unknown properties e.g. the density meter. I dream about me,” it says.

Things that inspired this story: Feature embeddings, psychologists, the Voight-Kampff test, interrogations, unreal partnerships, the peculiarities of explainability.
