Data science, machine learning, and AI have clear applications for e-commerce, and given their relative ease of implementation, most online retailers are already deeply invested in strategies like recommendation engines, dynamic pricing, and supply chain optimization. But so far, aside from big players like Amazon, brick-and-mortar retail has lagged in the move to AI.
Part 2: Optimism corrected bootstrapping is definitely biased, further evidence
Some people are very fond of the technique known as ‘optimism corrected bootstrapping’; however, this method is biased, and the bias becomes apparent as the number of noise features grows large (as shown very clearly in my previous blog post). This needs exposing; I have neither the time nor the interest to write a publication on it, hence this article. I have now reproduced the bias with my own code.
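To make the effect concrete, here is a minimal sketch of the procedure on pure noise (in Python rather than the post's original R; the model, sample size, and bootstrap count are illustrative assumptions). Apparent performance is measured on the training data, optimism is estimated as the average gap between each bootstrap model's performance on its own resample and on the original data, and the corrected estimate is apparent minus mean optimism:

```python
# A minimal sketch of optimism-corrected bootstrapping on pure noise
# (illustrative assumptions: logistic regression, n=100 samples,
# p=200 noise features, 50 bootstrap iterations; not the post's R code).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.normal(size=(n, p))            # features carry no signal
y = rng.integers(0, 2, size=n)         # labels independent of X

# Apparent performance: train and evaluate on the same data.
apparent = roc_auc_score(y, LogisticRegression(max_iter=1000).fit(X, y)
                         .decision_function(X))

optimism = []
for _ in range(50):
    idx = rng.integers(0, n, size=n)   # bootstrap resample
    fit = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    boot = roc_auc_score(y[idx], fit.decision_function(X[idx]))
    test = roc_auc_score(y, fit.decision_function(X))
    optimism.append(boot - test)

corrected = apparent - np.mean(optimism)
print(f"apparent AUC:  {apparent:.3f}")   # close to 1.0 (overfitting)
print(f"corrected AUC: {corrected:.3f}")  # typically still well above 0.5
```

With many more features than samples, the model separates the training data almost perfectly, and the corrected AUC typically remains far above the 0.5 expected for pure noise, which is the bias described here.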
Following your gut, following the data
The Wall Street Journal highlighted a disagreement between the data side and the business side at Netflix. Ultimately, the business side “won.” However, maybe that’s the wrong framing. Roger Peng describes the difference between an analysis and the full truth:
If you did not already know
Apache Hadoop
Apache Hadoop is an open-source software framework for storage and large-scale processing of datasets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users. It is licensed under the Apache License 2.0.
The Apache Hadoop framework is composed of the following modules:
· Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
· Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
· Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users’ applications.
· Hadoop MapReduce – a programming model for large scale data processing.
Hadoop is regarded as one of the best platforms for storing and managing big data. It owes its success to its highly scalable data storage and processing, low price/performance ratio, high performance, high availability, high schema flexibility, and its ability to handle all types of data. …
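As a concrete illustration of the MapReduce model (a sketch, not part of the excerpt above): Hadoop Streaming lets any executable serve as the map or reduce step by reading stdin and writing tab-separated key/value lines to stdout. Here is the classic word count as a mapper:

```python
# mapper.py -- word-count mapper for Hadoop Streaming (illustrative).
# Emits "word<TAB>1" for every word on stdin; the framework then
# shuffles and sorts these lines by key before the reduce phase.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

Because the framework delivers the mapper output sorted by key, the reducer can sum counts over each contiguous run of identical words:

```python
# reducer.py -- word-count reducer for Hadoop Streaming (illustrative).
# Relies on the shuffle/sort guarantee: all lines for a key arrive
# contiguously, so a single running counter per key suffices.
import sys

current, count = None, 0
for line in sys.stdin:
    word, _, value = line.rstrip("\n").partition("\t")
    if word == current:
        count += int(value)
    else:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, int(value)
if current is not None:
    print(f"{current}\t{count}")
```

A typical (installation-dependent) invocation passes both scripts to the streaming jar, e.g. `hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input /in -output /out`.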
Statistical Assessments of AUC
In scorecard development, the area under the ROC curve, also known as AUC, has been widely used to measure the performance of a risk scorecard. All else being equal, the scorecard with a higher AUC is considered more predictive than the one with a lower AUC. However, little attention has been paid to the statistical analysis of AUC itself during scorecard development.
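The excerpt does not prescribe a method, but one simple statistical assessment is a bootstrap confidence interval for the AUC (DeLong's test is the usual analytic alternative for comparing two AUCs). A sketch, assuming numpy and scikit-learn, with purely illustrative data:

```python
# Percentile-bootstrap confidence interval for a scorecard's AUC
# (one illustrative approach; not from the excerpt above).
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_bootstrap_ci(y_true, scores, n_boot=2000, alpha=0.05, seed=0):
    """Return the point-estimate AUC and a (1 - alpha) percentile CI."""
    rng = np.random.default_rng(seed)
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue                           # resample had a single class
        aucs.append(roc_auc_score(y_true[idx], scores[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return roc_auc_score(y_true, scores), (lo, hi)

# Illustrative data: weakly informative scores for binary outcomes.
y = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0] * 20)
s = y * 0.6 + np.random.default_rng(1).normal(0, 0.5, size=len(y))
print(auc_bootstrap_ci(y, s))
```

When comparing two scorecards, overlapping intervals are a quick warning that the observed AUC gap may not be statistically meaningful.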
If you did not already know
Task Embedded Coordinate Update (TECU) In this paper we propose TECU, a realizable framework that embeds task-specific strategies into the update schemes of coordinate descent for optimizing multivariate non-convex problems with coupled objective functions. On one hand, TECU can improve algorithmic efficiency by embedding productive numerical algorithms for optimizing univariate sub-problems with nice properties. On the other hand, it increases the probability of obtaining desired results by embedding advanced techniques from realistic task optimization. Integrating both numerical algorithms and advanced techniques, TECU is proposed as a unified framework for solving a class of non-convex problems. Although the embedded task strategies introduce inaccuracies into the sub-problem optimizations, we provide a realizable criterion to control the errors while ensuring robust performance, backed by rigorous theoretical analysis. By embedding ADMM and a residual-type CNN, respectively, into our algorithmic framework, the experimental results verify both the efficiency and the effectiveness of embedding task-oriented strategies in coordinate descent for solving practical problems. …
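The abstract gives no implementation details, but the underlying pattern, block coordinate descent with a pluggable task-specific solver for each block, can be sketched generically. Everything below (the toy coupled objective, the gradient-step "solver", and all names) is illustrative; the paper embeds ADMM or a residual-type CNN instead:

```python
# Generic block coordinate descent with pluggable per-block solvers
# (an illustrative sketch of the pattern, not the paper's algorithm).
import numpy as np

def block_coordinate_descent(grad_blocks, x_blocks, solvers, n_iter=100):
    """Alternately update each block x_i while holding the others fixed.

    grad_blocks[i](x_blocks) -> gradient of the coupled objective w.r.t.
    block i; solvers[i](x_i, g_i) -> updated block (the embedded solver).
    """
    for _ in range(n_iter):
        for i, solve in enumerate(solvers):
            g = grad_blocks[i](x_blocks)
            x_blocks[i] = solve(x_blocks[i], g)
    return x_blocks

# Toy coupled objective: f(u, v) = ||u - a||^2 + ||v - b||^2 + (u . v)^2
a, b = np.ones(3), -np.ones(3)
grads = [
    lambda x: 2 * (x[0] - a) + 2 * (x[0] @ x[1]) * x[1],  # df/du
    lambda x: 2 * (x[1] - b) + 2 * (x[0] @ x[1]) * x[0],  # df/dv
]
step = lambda x_i, g: x_i - 0.1 * g   # embedded "solver": one gradient step
u, v = block_coordinate_descent(grads, [np.zeros(3), np.zeros(3)], [step, step])
print(u, v)
```

Per the abstract, TECU's contribution is a criterion for controlling the error introduced when these embedded sub-problem solvers are inexact, together with the accompanying convergence analysis.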
Will Julia Replace Python and R for Data Science?
For those of you who don’t know, Julia is a multi-paradigm (fully imperative, partially functional, and partially object-oriented) programming language designed for scientific and technical (read: numerical) computing. It offers significant performance gains over Python (when Python is used without optimizations such as Cython or vectorized NumPy). Development time is reduced by a factor of roughly 2x on average, and performance gains range from 10x to 30x over Python (R is even slower and was not built for speed, so we don’t include it). Industry reports in 2016 indicated that Julia was a language with high potential, with a chance of becoming the best option for data science if it received advocacy and adoption by the community. Well, two years on, Julia 1.0 was released in August 2018, and the language now has the advocacy of the programming community and the adoption by a number of companies (see https://www.juliacomputing.com) as the preferred language for many domains, including data science.
BERT: State of the Art NLP Model, Explained
By Rani Horev, Co-Founder & CTO at Snip
Miami University: Assistant Provost for Institutional Research and Effectiveness [Oxford, OH]
At: Miami University
Location: Oxford, OH
Web: www.miami.miamioh.edu
Position: Assistant Provost for Institutional Research and Effectiveness
Deep Learning in Satellite Imagery
By Damian Rodziewicz, Appsilon.