Putting Data Science In Production
A critical challenge of data science projects is getting everyone on the same page about project challenges, responsibilities, and methodologies. More often than not, there is a disconnect between the worlds of development and production. Some teams may choose to re-code everything in an entirely different language, while others may change core elements such as testing procedures, backup plans, and programming languages. Transitioning a data product into production can become a nightmare as different opinions and methods vie for supremacy, resulting in projects that needlessly drag on for months beyond promised deadlines. Successfully building a data product and then deploying it into production is not an easy task, and it becomes twice as hard when teams are isolated and playing by their own rules.
Magister Dixit
“Unlike a pure statistician, a data scientist is also expected to write code and understand business. Data science is a multi-disciplinary practice requiring a broad range of knowledge and insight. It’s not unusual for a data scientist to explore a fresh set of data in the morning, create a model before lunch, run a series of analytics in the afternoon and brief a team of digital marketers before heading home at night.” Neera Talbert (March 16, 2015)
Welcome to Dataiku University!
alivia.smith@dataiku.com (Alivia Smith)
It’s almost back-to-school time, and we’ve got you covered with our new free online data science course.
Bothered by non-monotonicity? Here’s ONE QUICK TRICK to make you happy.
We’re often modeling non-monotonic functions. For example, performance at just about any task increases with age (babies can’t do much!) and then eventually decreases (dead people can’t do much either!). Here’s an example from a few years ago:
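The excerpt doesn’t show the trick itself, but one common way to capture a rise-and-fall relationship like the age example is to fit a quadratic term. A minimal sketch on synthetic, hypothetical data (not the post’s actual example):

```python
import numpy as np

# Hypothetical age/performance data with a mid-life peak:
# performance rises with age, then falls -- a non-monotonic relationship.
rng = np.random.default_rng(0)
age = rng.uniform(5, 90, size=200)
performance = -0.02 * (age - 45) ** 2 + 40 + rng.normal(0, 2, size=200)

# A quadratic polynomial is one simple way to model non-monotonicity.
coefs = np.polyfit(age, performance, deg=2)  # highest-degree coefficient first

# A negative leading coefficient gives a concave curve with an interior peak.
peak_age = -coefs[1] / (2 * coefs[0])
print(round(peak_age))
```

The fitted curve peaks in mid-life, as the simulated data dictates; splines or monotone-by-piece models are the usual next step when a single parabola is too rigid.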
The Blessings of Multiple Causes: Causal Inference when you Can't Measure Confounders
Happy back-to-school time everyone!
Who wrote that anonymous NYT op-ed? Text similarity analyses with R
In US politics news, the New York Times took the unusual step this week of publishing an anonymous op-ed from a current member of the White House (assumed to be a cabinet member or senior staffer). Speculation about the identity of the author is, of course, rife. Much of the attention has focused on the use of specific words in the article, but can data science provide additional clues? In the last 48 hours, several R users have employed text similarity analysis to try to identify likely culprits.
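These analyses generally reduce to representing each candidate author’s known writing and the op-ed as term-frequency vectors and comparing them. A minimal sketch of cosine similarity over word counts, with toy strings standing in for the actual corpora:

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Identical texts score (approximately) 1.0; texts sharing no words score 0.0.
print(round(cosine_similarity("the quiet resistance", "the quiet resistance"), 3))
print(cosine_similarity("steady state", "lodestar"))
```

Real analyses of this kind typically add TF-IDF weighting and focus on function words, which are harder for an author to disguise than topical vocabulary.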
Mirroring an FTP Using lftp and cron
As my Developer Advocate role leads me to do more and more sysadmin/data-engineer work, I continually find myself looking for more efficient ways of copying data folders to where I need them. While there are a lot of great GUI ETL tools out there, for me the simplest and fastest way tends to be using Linux utilities. Here’s how to mirror an FTP using lftp, with a cron repeater every five minutes.
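The core of the approach can be sketched as a small script; the host, credentials, and paths below are placeholders, not the post’s actual setup:

```shell
#!/bin/sh
# mirror-ftp.sh -- hypothetical host, user, and paths; replace with your own.
# lftp's `mirror` command copies the remote tree into the local directory;
# --only-newer skips files that haven't changed, so repeated runs
# transfer only the deltas.
lftp -u "$FTP_USER","$FTP_PASS" ftp://ftp.example.com <<'EOF'
mirror --only-newer --verbose /remote/data /local/data
quit
EOF
```

A crontab entry such as `*/5 * * * * /path/to/mirror-ftp.sh` then reruns the mirror every five minutes.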
If you did not already know
Deep Collaborative Weight-Based Classification (DeepCWC)
One of the biggest problems in deep learning is the difficulty of retaining consistent robustness when transferring a model trained on one dataset to another. To address the problem, deep transfer learning has been used to perform various vision tasks with a deep model pre-trained on a different dataset. However, the resulting robustness is often far from state-of-the-art. We propose a collaborative weight-based classification method for deep transfer learning (DeepCWC). The method performs L2-norm-based collaborative representation on the original images, as well as on the deep features extracted by pre-trained deep models. Two distance vectors are obtained from the two representation coefficients and then fused via the collaborative weight. The two feature sets are complementary: the original images provide information compensating for what is missed in the transferred deep model. A series of experiments on both small and large vision datasets demonstrated the robustness of the proposed DeepCWC in both face recognition and object recognition tasks. …
“Dynamically Rescaled Hamiltonian Monte Carlo for Bayesian Hierarchical Models”
Aki points us to this paper by Tore Selland Kleppe, which begins:
Document worth reading: “Data learning from big data”
Technology is generating a huge and growing availability of observations of diverse nature. This big data is placing data learning as a central scientific discipline. It includes collection, storage, preprocessing, visualization and, essentially, statistical analysis of enormous batches of data. In this paper, we discuss the role of statistics regarding some of the issues raised by big data in this new paradigm, and propose the name of data learning to describe all the activities that allow us to obtain relevant knowledge from this new source of information.