Cluster analysis can be considered one of the pillars of machine learning, and yet it’s one that’s difficult to talk about.
Profiling Top Kagglers: Martin Henze (AKA Heads or Tails), World's First Kernels Grandmaster
Let me begin by introducing myself: My name is Martin. I’m an astrophysics postdoc working on understanding exploding stars in nearby galaxies. From the very beginning of my studies, I was using data analysis to try to unveil the mysteries of the universe. From deep images taken with ground- and space-based telescopes, through time series measuring the heartbeats of extreme stars, to population correlations probing the fundamental physics behind incredibly powerful eruptions: learning the secrets of a complex cosmos requires all the tools you can get your hands on.
Sent2Vec: An unsupervised approach towards learning sentence embeddings
A comparison of sentence embedding techniques by Prerna Kashyap, our RARE Incubator student. As her graduation project, Prerna implemented sent2vec, a new document embedding model in Gensim, and compared it to existing models like doc2vec and fasttext.
Is it Time to Regulate Bitcoin?
In the financial space, anything unregulated and unregistered would cause doubts and uneasiness. In the case of cryptocurrencies, such as bitcoin, financial regulators all over the world have started to find ways to oversee the blockchain, or the record of all cryptocurrency transactions, as well as to address the irregularities presented by these virtual currencies that mostly bypass financial firms, exchanges, and regulated banks. The most popular of all cryptocurrencies, bitcoin, chiefly operates outside of the conventions of a financial system; and this worries regulators as it has the potential to be linked to money laundering, tax evasion, fraud, and terrorist funding.
Pivoted document length normalisation
As part of the RARE incubator program, my goal was to add two new features to the existing TF-IDF model of Gensim. One was implementing a SMART information retrieval system (smartirs) scheme [1] and the other was implementing pivoted document length normalization [2].
AI Lab: Learn to Code with the Cutting-Edge Microsoft AI Platform
This post is authored by Tara Shankar Jana, Senior Technical Product Marketing Manager at Microsoft.
Import AI
Auto-generating phishing URLs via AI components: …AI is an omni-use technology, so the same techniques used to spot phishing URLs can also be used to generate phishing URLs… Researchers with the Cyber Threat Analytics division of Cyxtera Technologies have written an analysis of how people might “use AI algorithms to bypass AI phishing detection systems” by creating their own system called DeepPhish. DeepPhish: DeepPhish works by taking in a list of fraudulent URLs that have worked successfully in the past, encoding these as a one-hot representation, then training a model to generate new synthetic URLs given a seed sentence. They found that DeepPhish could dramatically improve the chances of a fraudulent URL getting past automated phishing-detection systems, with DeepPhish URLs seeing a boost in effectiveness from 0.69% (no DeepPhish) to 20.90% (with DeepPhish). Security people always have the best names: DeepPhish isn’t the only AI “weapon” system recently developed by researchers, the authors note; other tools include Honey-Phish, SNAP_R, and Deep DGA. Why it matters: This research highlights how AI is inherently an omni-use technology, where the same basic components used to, for instance, train systems to spot potentially fraudulent URLs can also be used to generate plausible-seeming fraudulent URLs. Read more: DeepPhish: Simulating Malicious AI (PDF).
The Role of Resources in Data Analysis
Roger Peng, 2018/06/18
BDD100K Blog Update
We are excited by the interest and excitement generated by our BDD100K dataset. Our data release and blog post were covered in an unsolicited article by the UC Berkeley newspaper, the Daily Cal, which was then picked up by other news services without our prompting or intervention. The paper describing this dataset is under review at the ECCV 2018 conference, and we followed the rules of that conference (as communicated to us by the Program Chairs in a prompt email response when we asked for clarification following the reporter’s request; the ECCV PCs replied that ECCV follows CVPR’s long-standing policy). We thus declined to speak to the reporters after they reached out to us. We did not, and have not, communicated with any media outlets regarding this story.
Docstrings in open source Python
Hi everyone, my name is Dmitry Berdov. I’m a graduate student at the Ural Federal University, now working in the QA testing (automation) sphere. I had no experience with writing documentation before joining the RARE Incubator, where my task has been to refactor and improve the poor state of Gensim docs. Now, after several months of shooting myself hard in the foot, I would like to share my insights from this unforgettable process.