Technology-focused discussions about genomics usually highlight the huge growth in DNA sequencing since the beginning of the century, growth that has outpaced Moore’s law and produced the $1000 genome. However, future growth is projected to be even more dramatic. In the paper “Big Data: Astronomical or Genomical?”, the authors estimate that “between 100 million and as many as 2 billion human genomes could be sequenced by 2025”, requiring between 2 and 40 exabytes of storage. (An exabyte is 1000 petabytes.)
Flipping a Coin on a Crazy Plane
I’m a big fan of brainteasers and happen to find quirky, counterintuitive mathematical truisms fascinating. I’d be lying if I said I hadn’t spent a weekend figuring out why once you have 23 people in a room, you’re more likely than not to have at least one shared birthday among them. At the same time, I don’t consider myself particularly “good” at brainteasers, and can’t say I’d perform as well as Bruce Willis in that classic scene from Die Hard 3, even if Samuel L. were there to help me.
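That birthday claim is quick to check numerically: the probability that n people all have distinct birthdays is a product of shrinking fractions, and it drops below one half exactly at n = 23. A minimal sketch (assuming 365 equally likely birthdays and ignoring leap years):

```python
from math import prod

def shared_birthday_prob(n):
    """Probability that at least two of n people share a birthday,
    assuming 365 equally likely birthdays."""
    p_all_distinct = prod((365 - i) / 365 for i in range(n))
    return 1 - p_all_distinct

print(round(shared_birthday_prob(22), 4))  # ~0.4757 — still less likely than not
print(round(shared_birthday_prob(23), 4))  # ~0.5073 — just past a coin flip
```

So 23 is the smallest group size at which a shared birthday becomes more likely than not.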
Hacking a Hackathon
After the AI hackathon in Minsk (where my team took 2nd prize) and Spacehack in Moscow (where I served on the tech committee and reviewed ~30 projects), I’d like to naively generalize some thoughts about preparing for hackathons. Of course, a hackathon means rapid decisions, more code and inspiration, and less planning. But some planning before the event is still useful.
Tutorial: Sentiment Analysis of Airlines Using the syuzhet Package and Twitter
In my last job, I was a frequent flyer. Each week I flew between 2 or 3 countries, briefly returning for 24 hours on the weekend to get a change of clothes. My favourite airlines were Cathay Pacific, Emirates and Singapore Air. Now, unless you have been living in a cave, you’d be well aware of the recent news story of how United Airlines removed David Dao from an aircraft. I wondered how that incident had affected United’s brand value, and being a data scientist I decided to do sentiment analysis of United versus my favourite airlines.
Announcement
I’ve decided to migrate this blog from Jekyll to Pelican. I did this largely because Pelican has a plugin that serves Jupyter notebooks automatically as blog posts. With Jekyll, I had to use nbconvert to convert my ipynb document to HTML every time I wanted to publish a new post, and I often had to manually edit that HTML to get it formatted correctly when embedded into the site. That was fine once, but if you want to make edits or updates, it becomes a real hassle. Now I can edit the notebook directly and the post will update automatically. There are a number of typos and code bugs in some of my old posts that I never got around to fixing because re-publishing was so annoying, but now I will work on them. Unfortunately, some formatting/display issues have arisen from the migration, but I will fix them all soon.
Use your favorite Python library on PySpark cluster with Cloudera Data Science Workbench
Cloudera Data Science Workbench provides freedom for data scientists. It gives them the flexibility to work with their favorite libraries using isolated environments with a container for each project.
XOR Revisited: Keras and TensorFlow
A few weeks ago, it was announced that Keras would be getting official Google support and would become part of the TensorFlow machine learning library. Keras is a high-level Python API for creating and training neural networks, using either Theano or TensorFlow as the underlying engine.
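XOR is the classic problem a single-layer perceptron cannot solve, since the four input points are not linearly separable; one hidden layer fixes that. A minimal sketch with hand-set weights in plain Python (an illustration of the idea, not the post’s Keras model):

```python
def step(x):
    """Threshold activation: 1 if the input is positive, else 0."""
    return 1 if x > 0 else 0

def xor_mlp(x1, x2):
    h_or  = step(x1 + x2 - 0.5)      # hidden unit firing on OR
    h_and = step(x1 + x2 - 1.5)      # hidden unit firing on AND
    return step(h_or - h_and - 0.5)  # output: OR and not AND = XOR

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_mlp(a, b))  # 0, 1, 1, 0
```

In practice the weights are learned by backpropagation rather than set by hand, which is exactly what the Keras/TensorFlow version does.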
Re-parameterising for non-negativity yields multiplicative updates
Suppose you have a model that depends on real-valued parameters, and that you would like to constrain these parameters to be non-negative. For simplicity, suppose the model has a single parameter θ. Let E denote the error function. To constrain θ to be non-negative, parameterise it as the square of a real-valued parameter β, i.e. θ = β².
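Writing θ for the parameter, E for the error function, λ for the learning rate, and β for the square-root parameterisation with θ = β², the chain rule plus one gradient step makes the multiplicative form explicit (a sketch of the derivation):

```latex
\frac{\partial E}{\partial \beta}
  = \frac{\partial E}{\partial \theta}\,\frac{\partial \theta}{\partial \beta}
  = 2\beta\,\frac{\partial E}{\partial \theta},
\qquad
\beta \leftarrow \beta - \lambda\,\frac{\partial E}{\partial \beta}
  = \beta\left(1 - 2\lambda\,\frac{\partial E}{\partial \theta}\right)
\;\Longrightarrow\;
\theta \leftarrow \theta\left(1 - 2\lambda\,\frac{\partial E}{\partial \theta}\right)^{2}.
```

Gradient descent on β thus multiplies θ by a squared, hence non-negative, factor, so θ can never leave the non-negative orthant.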
How to make the transition from academia to data science
Ever since I started writing about my transition from academia to industry (both my reasons for leaving and what I think of the transition in retrospect), I’ve been receiving a lot of requests for advice on making that move. These requests come from former peers, from professors looking to advise their students, or simply from people who have read one of my blog posts.
F-beta score for Keras
I’m a newbie in deep learning, so it makes sense to practice on simple problems. I’ve decided to join the Planet: Understanding the Amazon from Space contest on Kaggle. The contest looks promising and not too complicated compared to other deep learning competitions: not too much data, no image segmentation or localization required, and the data is properly preprocessed. The goal is to assign tags (e.g. forest, agriculture, roads) to satellite image patches. A nice problem for a beginner, isn’t it?
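The contest is scored with the F-beta metric with beta = 2, which weights recall more heavily than precision. A minimal sketch of the score for one binary label vector (the formula only; the post itself is about implementing this as a Keras metric):

```python
def fbeta_score(y_true, y_pred, beta=2.0):
    """F-beta score for binary label lists.

    beta > 1 weights recall higher than precision; beta = 1 gives F1.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# One true positive, one false positive, one false negative:
print(fbeta_score([1, 1, 0, 0], [1, 0, 1, 0]))  # precision = recall = 0.5 -> 0.5
```

On Kaggle this per-sample score is averaged over all images; a Keras version additionally has to be written in tensor operations so it can run on batches during training.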