Amazing consistency: Largest Dataset Analyzed / Data Mined – Poll Results and Trends

What was the largest dataset you analyzed / data mined?

This poll received 1108 votes, about 10% fewer than in 2016, but still a large enough sample. The results again show surprising stability, fitting a pattern that had already emerged in 2012: a majority of data scientists and analysts work with data in the Gigabyte range, and a small but notable segment works with web-scale data of over 100 Petabytes.

Note that the poll asks about the largest dataset ever analyzed, so a typical dataset is expected to be significantly smaller.

Highlights:

Gigabytes still rule: The majority of answers (56% in 2018, 57% in 2016, 56% in 2015, 54% in 2014, 53% in 2013) are in the Gigabyte range. The overall median response was between 11 and 100 GB (which comfortably fits on one laptop), as in every year since 2012.

Consistency: The shape of the curve is almost the same each year. In 2018 there were fewer responses in the under-10 MB range and more in the 1-10 GB range, but not significantly so.

Petabyte Big Data Scientists still stand apart: There is a small but significant gap, with almost no answers in the 1-10 PB range, which separates analysts who work with Terabyte-size commercial data warehouses from those who work with 100+ Petabyte web-scale data stores. See, for example, a recent story on Uber's current 100 PB data warehouse.

Academic researchers on par with Government, Industry: The estimated median for academic researchers is 90 GB, on par with Government (60 GB) and Industry analysts (50 GB). The estimated median answer increased a little for all segments in 2018.
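The article does not say how its estimated medians were computed. One common approach for binned poll answers, sketched here as an assumption rather than as the author's method, is to find the bin in which the cumulative count crosses half of the responses and then interpolate inside that bin on a log scale, since the bins span orders of magnitude. The function name `estimate_binned_median`, the bin edges, and the counts below are all hypothetical and are not poll data.

```python
def estimate_binned_median(edges, counts):
    """Estimate a median from binned responses.

    edges:  ascending, positive bin boundaries (one more than counts).
    counts: number of responses in each bin.
    Interpolates geometrically (log scale) inside the median bin.
    """
    if len(edges) != len(counts) + 1:
        raise ValueError("need exactly one more edge than counts")
    total = sum(counts)
    if total == 0:
        raise ValueError("no responses")

    half = total / 2
    cumulative = 0
    for i, count in enumerate(counts):
        if count > 0 and cumulative + count >= half:
            # Fraction of this bin needed to reach half of all responses.
            fraction = (half - cumulative) / count
            low, high = edges[i], edges[i + 1]
            return low * (high / low) ** fraction
        cumulative += count
    raise ValueError("median bin not found")


# Hypothetical bin edges in GB and made-up counts, for illustration only.
edges_gb = [1, 10, 100, 1000]
counts = [10, 20, 10]
print(f"{estimate_binned_median(edges_gb, counts):.1f} GB")
```

Interpolating on a log scale is a design choice: with bins such as 1-10 GB and 11-100 GB, linear interpolation would pull the estimate toward the upper edge of the bin.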

Fig. 1: KDnuggets Poll: Largest Dataset Analyzed, 2014-2018. The 2018 data is shown as columns, to stand apart from the lines for previous years.

This poll also asked about employment type, and the breakdown was:

Company or Self-Employed, 62% (was also 62% in 2016)
Student, 17% (was 20% in 2016)
Academia/University, 13% (was 10% in 2016)
Government/non-profit, 4.8% (was 5.1% in 2016)
Other, 3.2% (was 2.4% in 2016)

Fig. 2: KDnuggets Poll: Largest Dataset 2018, by Employment. The red line shows the estimated median. Circle size corresponds to the number of responses.

Regional trends show slightly more voters from Latin America, the Middle East, and Australia, and slightly fewer from the US. The numbers were:

Europe, 34.9% (was 35.1%)
US/Canada, 34.4% (was 36.9% in 2016)
Asia, 15.6% (was 17%)
Latin America, 6.9% (was 5.6%)
Africa/Middle East, 4.9% (was 3.2%)
Australia/NZ, 3.2% (was 2.3%)
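To make the regional shifts easier to compare, the shares above can be turned into percentage-point changes. This is a small sketch using only the numbers listed above. Treating every earlier value as the 2016 figure is an assumption, since the list states the year explicitly only for US/Canada. The variable names are illustrative.

```python
# (2018 share, earlier share) in percent, copied from the list above.
shares = {
    "Europe": (34.9, 35.1),
    "US/Canada": (34.4, 36.9),
    "Asia": (15.6, 17.0),
    "Latin America": (6.9, 5.6),
    "Africa/Middle East": (4.9, 3.2),
    "Australia/NZ": (3.2, 2.3),
}

# Change in percentage points, rounded to the precision of the inputs.
changes = {
    region: round(now - before, 1)
    for region, (now, before) in shares.items()
}

# Print from the largest drop to the largest gain.
for region, delta in sorted(changes.items(), key=lambda item: item[1]):
    print(f"{region:20s} {delta:+.1f} pp")
```

These are changes in share, not in vote counts: because the total number of votes also fell, a region's share can rise even if its absolute number of voters did not.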

Finally, we examine the largest dataset analyzed by both employment and region for the 3 largest regions.

Fig. 3: Largest Dataset Analyzed, by Employment for US/Canada, Europe, and Asia. Circle size corresponds to the number of responses.

We got more responses for 100 PB data from “Company” Data Scientists in Asia than from those in US/Canada or Europe. We see a similar situation with students in Asia.

Here are the results of past polls: