1. What is Big Data?
One of the most important, and most talked about, technological developments of the last couple of years is referred to as “Big Data”.
Big Data is the name given to the phenomenon of very fast-growing creation, collection, communication and use of digital data.
According to IBM, humanity produced 2.5 Exabytes (that’s 2,500,000,000,000,000,000 or 2.5×10^18 bytes) of data every day in 2012. In actual fact, the numbers are so vast that they start to lose meaning. Can we really understand the difference between an Exabyte (1000^6 bytes, or 10^18 bytes) and a Zettabyte (1000^7 bytes, or 10^21 bytes), or do we care an iota about the meaning of a Yottabyte (1000^8 bytes, or 10^24 bytes)? Yet, it is clear that these expressions will inevitably become household terms, just as Megabytes and Gigabytes are today.
Big Data is typically described through its characteristics of Volume, Velocity and Variety, the so-called “three V’s”.
1.1 Volume
Volume stands for the increase in the volume of data generated, communicated and used by humanity and its machines. The volumes are indeed enormous – and are certain to keep growing. CERN, the European centre for nuclear physics that gave us the World Wide Web, has completed its Large Hadron Collider (LHC) in Geneva, used to discover the Higgs boson. The LHC’s 150 million sensors could be delivering data 40 million times per second. If all of those data were recorded, they would exceed 500 Exabytes per day – 200 times the world’s entire daily data creation in 2012 according to the IBM figure referred to above. However, those data are not actually produced, recorded or processed – before that happens, massive filtering takes place. In reality, the LHC produces a mere 25 Petabytes of data per year (a Petabyte is 1×10^15 bytes, i.e. 1/1000th of an Exabyte).
But the volume of Big Data is not only a result of machines creating new data. A lot of data are created by humans, e.g. by making and sharing photos and other information through social media, or measuring their daily activities through consumer biometric sensors, and in general by any person using a machine connected to the Internet. It has been predicted that Facebook itself is at risk of collapsing under the sheer amount of data it causes its users to create – Mark Zuckerberg has indicated that the amount of data shared on social media doubles every year.
Yet, the key aspect of Volume in Big Data is not the current size of data – which is already enormous. Rather, it is the fact that the Volume keeps increasing at an exponential rate. 90% of all data in the world existing today was created in the last two years. In order to understand the impact of this, one needs to keep in mind that “all data in the world” includes everything humanity has learned and created since the beginning of our species, roughly 150,000 years ago. This means that, looking forward, we have to anticipate, in about five years’ time, a world in which the massive amount of data and information available today represents less than 10% of what will be available then, based on a doubling of information every 18 months.
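Under the article’s own assumption of a doubling every 18 months, the “less than 10%” claim can be checked with a few lines of arithmetic (a minimal sketch; the function name is mine):

```python
# Sketch: if the total amount of data doubles every 18 months, how much of
# the data that will exist in five years' time already exists today?
def growth_factor(months: float, doubling_months: float = 18) -> float:
    """Multiplicative growth after `months`, given a fixed doubling period."""
    return 2 ** (months / doubling_months)

factor = growth_factor(5 * 12)   # growth over five years: ~10x
share_today = 1 / factor         # today's data as a share of the future total
print(f"growth over 5 years: x{factor:.1f}")        # ~x10.1
print(f"today's share of the future total: {share_today:.1%}")  # ~9.9%
```

Five years spans 3⅓ doubling periods, i.e. roughly a tenfold increase – consistent with the “less than 10%” claim above.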
1.2 Velocity and Variety
Velocity refers to the fact that the capacity of the communication networks that carry all those data around doubles every nine months, at equal or diminishing cost. This is why multiple users today can watch video through the same private home Internet access point, sometimes even in HD, whereas, on the Internet of ten years ago, this was completely impossible. The increase in communication capacity enables and supports the growth of Volume, and in its turn causes the ever-increasing Variety in data. Data do not always come in pre-configured or defined categories – indeed, the number of different kinds of data increases in a similar, exponential way. This is what is meant when businesses discuss “unstructured data” – raw data that have not yet been categorized or contextualized.
But Variety is more than that, it also points to the way in which raw data can be connected and re-connected. Variety addresses the possibility of re-combining data, in ever more different ways.
It is the combination of these three characteristics (Volume, Velocity and Variety) that makes Big Data such an important phenomenon.
2. Significance of Big Data – the Emergence of a Darwinian Ecosystem of Open Data and Evolving Analytical Algorithms
Big Data is important because it allows organizations to increase their knowledge and understanding of their environment, and, as a result, become much more efficient in what they are trying to achieve.
Data is not the same as information or knowledge. As described by Denis Pombriant (“Data, Information and Knowledge: Transformation of data is key”, CRi 2013, p. 97), information is “data in context”, and knowledge is the “understanding or insight created by information”.
2.1 Higher growth of information
Yet, while, according to this classification, information can be viewed as a subset of Big Data, it certainly follows the laws of Big Data – as the amount of raw data doubles, so does, at least, the amount of information that can be extracted from those data. After all, if information is “data in context”, that “context” is, by necessity, other data. This means that, by combining three or four datasets, we will be able to derive more than three or four interactions or possible contextual interpretations – indeed, the increase will follow something closer to a factorial function: if “n” equals the number of datasets, the number of combinations can go up to “n!”. It is an application of the network effect, where the connections between the nodes, and the combinations of their use, massively increase the number of potential interactions. Therefore, the increase in Variety, or the potential interconnectedness between datasets, effectively means that the amount of information that can be gathered from data increases by at least the same measure as the data itself, and probably more.
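The factorial point can be illustrated directly (a minimal sketch; it treats each ordering of datasets as a distinct contextual combination, as the argument above does):

```python
# Sketch: the number of ordered ways to chain n datasets together grows
# as n! (factorially), far faster than the linear count of datasets itself.
from math import factorial

for n in range(2, 8):
    print(f"{n} datasets -> up to {factorial(n)} ordered combinations")
# 2 datasets yield 2 combinations, but 7 datasets already yield 5,040.
```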
Big Data is foremost a phenomenon of the information society. It occurs in relation to data created by humans and machines, and in relation to the way those data are subsequently used, communicated, processed and analyzed. However, Big Data acts on more than just “information technology”. By way of example, in 2013 the Chemistry for the Future Solvay Prize was awarded to Prof. Peter G. Schultz of the Scripps Research Institute in California, also Director of the California Institute for Biomedical Research. The prize was awarded for his work on the interface between chemistry and biology. In essence, thanks to Prof. Schultz’s work, it is now possible to create and synthesize new molecules not in the range of hundreds per year, but in the range of millions per day. An almost endless potential to create and change chemical and biological connections and possibilities arises. It can be compared to the difficulties governments confront when faced with so-called “legal highs” – synthetic recreational drugs are being invented, created and manufactured much faster than any government is able to enact effective prohibitions against them. Is it remotely realistic to think that more than an insignificant part of all those new molecules, regardless of utility, can ever be patented before they become Prior Art?
All chemical and biological elements can be presented as data; think of DNA as the “code of life”, not in a binary code of 1 and 0, but in a quaternary system using the GATC molecules of DNA. In effect, the explosion in binary data representing information is being mirrored by a similar explosion in chemical and biological data. To assume that the Big Data revolution is occurring only in narrowly defined Information Technology is to ignore how most of our reality can, and is, being converted into datasets. Datasets that are all growing exponentially, and will impact any existing technology and innovation process.
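As an illustration of the quaternary point, each DNA base carries exactly two bits of information, so any sequence over G, A, T and C maps directly onto binary data (a sketch with my own encoding choice, not a bioinformatics standard):

```python
# Sketch: DNA as a quaternary code. Each of the four bases maps to 2 bits,
# so a DNA sequence is simply data in base 4 rather than base 2.
BASE_TO_BITS = {"G": 0b00, "A": 0b01, "T": 0b10, "C": 0b11}

def encode(sequence: str) -> int:
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in sequence:
        value = (value << 2) | BASE_TO_BITS[base]
    return value

def bits_needed(sequence: str) -> int:
    """A sequence of n bases carries exactly 2n bits of information."""
    return 2 * len(sequence)

print(bin(encode("GATC")))   # 0b11011 (G=00, A=01, T=10, C=11)
print(bits_needed("GATC"))   # 8 bits for 4 bases
```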
Therefore, the impact of Big Data will be felt across all human and economic activity, as more and more aspects of reality become measurable, quantifiable and touched by software.
2.2 Harvesting via analytical algorithms
The way to access and harness the potential value of Big Data is through the operation of what is called “analytics”. Analytics consists of establishing rules that allow us to interpret and make sense of Big Data – they are the tools that enable us to understand and extract value from Big Data. Analytics is done through the operation of algorithms – a specific set of instructions for carrying out a procedure or solving a problem.
Algorithms are at the very heart of Big Data – it is algorithms that allow us to view, make sense of, and use the benefits of Big Data.
However, in order to have value, those algorithms require access to data. And here’s the golden rule about Big Data and its relationship with analytical algorithms:
The value of both the data and the algorithms increases much more rapidly when data access and availability are open. The more data are available, the more the algorithm will be able to add value and develop.
The main reason for this is the ever-changing nature of Big Data: through the continuous operation of Volume, Variety and Velocity, the algorithms necessarily have to change constantly in order to remain relevant or become better. It is like life in the ecosystem of the early Earth’s primordial soup: the ever-growing flows of data are like the sun’s energy, generating and allowing the emergence of algorithms, much like amino acids – the very building blocks of life. Those amino acids operate in an environment with incoming solar energy aplenty – but it is the ones that use that energy most efficiently, in combination with the other amino acids around them, that will flourish and evolve the fastest.
The same is true for Big Data analytical algorithms. Those algorithms that have the most access to the most data will be able to adapt and evolve faster. It is a Darwinian system of evolutionary logic, and it is based on, and requires, the open access to data. Just like amino acids need open access to solar energy.
As a side note, one of the unforeseen effects of analytical algorithms, and their rapid evolution, is that the core distinction of data protection law – the differentiation between personal data and non-personal data – becomes completely ineffective. Analytical algorithms are already so powerful that a limited amount of non-personal data suffices to identify individuals who are unaware of this – thereby rendering virtually all data “personal data”, i.e. data that allow someone to be identified. But that is outside the scope of this article.
2.3 Advantage for Open Data
The consequence of the existence of a Darwinian ecosystem of algorithms is that, in a world of Big Data, Open Data environments will outcompete closed data environments. The reason is simple: closed data environments have much less access to the very source that drives development and evolution: the ever-growing data feeds.
Already, the first signs of the superiority of the Open Data ecosystem are starting to become visible. McKinsey, the consultancy, estimates that Open Data would add between $3tn and $5tn to the world economy – an economy somewhere between the size of Germany’s and Japan’s.
Businesses, governments, and all other kinds of organizations will be pushed to open up their data. The incentives are clear: by opening up their data, the usefulness and value of those data will increase much more than by keeping them closed. And open data will yield superior algorithms, which, in turn, will increase the value of those data. For public organizations such as governments or quasi-governmental organizations, the motivation is obvious: data mining will allow governments to understand the actual consequences of policy choices, or to see correlations and causations that would otherwise be missed or misunderstood. As more and more data open up, the knowledge spillover effects will increase, leading to more pressure to open up more data. As a result, private organizations and businesses will also feel increasing pressure to open up their data – those that do will gain important economic benefits, and will outcompete those that do not. Because of the ubiquity and exponential growth of data, the business value of “being found” will outperform the business value of “charging for access/use”.
In addition, the algorithms used and adapted by businesses and other organizations involved in Big Data will outcompete the algorithms (less)-used and (less)-adapted by businesses and other organizations that remain more closed. Darwinian evolutionary logic is ruthless – and operates fast. Those businesses that do not open up their data streams, in order to allow more recombination, will cause themselves to be structurally handicapped in the competition against those who do.
It is something a business like Google seems to understand very well: the more its search algorithms get access to virtually any and every data available, the more they will be able to evolve, and become more valuable.
3. Impact on Patents
One key metric of Big Data is its exponential growth – the total amount of data and information doubles roughly every 18 months. Compared to the number of patents granted in the world, a striking difference appears: in the US, the number of (utility) patents granted has doubled slightly more than twice in 50 years – from 45,679 in 1963 to 253,155 in 2012. Similar statistics are available elsewhere: the number of patent applications with the EPO grew from 170,000 to 257,000 between 2003 and 2012. In Japan, the number went up from 122,511 in 2003 to 274,791 in 2012.
This means that the number of patents grows in a linear fashion (with annual growth numbers in single or low double digit percentages), whereas the amount of available data and information grows in an exponential fashion (doubling every 18 months).
3.1 Big Data as Prior Art
The relevance of the difference in both growth numbers becomes clear when it is acknowledged that Big Data is, in essence, equal to Prior Art. While there is no uniform definition of Prior Art across jurisdictions, the Wikipedia definition is useful for the purpose of the argument developed here. According to that definition, Prior Art is:
“… all information that has been made available to the public in any form before a given date that might be relevant to a patent’s claims of originality. If an invention has been described in the prior art, a patent on that invention is not valid.”
From a legal perspective, it would probably have been better if this definition had replaced the word “originality”, a concept used in copyright law, with “novelty”, a concept used in patent law. Perhaps the editor wanted a concept encapsulating both “novelty” and “inventive step”? In any event, the word “originality” may lead to confusion here; but given the cumulative character of the “novelty” and “inventive step/non-obviousness” requirements in patent law, it is possible to remain focused on novelty and ignore inventive step: since both are required for the granting of a patent, the absence of novelty is sufficient to block patentability. That being said, to the extent the amount of information (data in context) grows at least as fast as or faster than the amount of data, the inventive step/non-obviousness requirement is in the same problematic situation as the requirement of novelty.
As explained above in section 1, the amount of data, as well as at least a corresponding amount of information, grows exponentially in timeframes that are shorter than the typical approval period of a single patent. This exponentially growing body of information includes, as a subset, the amount of publicly available information that would invalidate a patent.
This means that, logically, the following conclusions can be drawn:
- The rejection rate of patents must grow at an almost exponential rate, to reach 100% within the foreseeable future;
- The relevance of patent databases as a data source to search for Prior Art declines by approximately one third every single year;
- A novelty-based patent system is unsustainable in a society where publicly available information grows on an exponential basis, whereas the number of patent applications or grants grows on a linear basis.
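The “one third” figure in the second conclusion can be reconstructed from the article’s own 18-month doubling assumption, treating linear patent-database growth as negligible next to the exponential growth of all information (a sketch of the arithmetic):

```python
# Sketch: if all information doubles every 18 months, it grows by a factor
# of 2^(12/18) per year. A patent database growing only linearly therefore
# loses a corresponding fraction of its relative coverage of Prior Art.
annual_factor = 2 ** (12 / 18)           # yearly growth of all information
annual_decline = 1 - 1 / annual_factor   # yearly loss of relative coverage

print(f"information grows x{annual_factor:.2f} per year")            # ~x1.59
print(f"relative coverage lost per year: {annual_decline:.0%}")      # ~37%
```

The result, roughly 37% per year, is the “approximately one third” stated above.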
However, the effects of these logical consequences of Big Data are not as clearly observable as one might expect, given the clear logical sequence of the facts outlined above. There are a number of possible explanations:
Limited Scope of Examination: Patent offices do not take into account all Prior Art when granting patents. Mostly, when searching for Prior Art, patent offices will look at their own databases, supplemented by searches in what are considered relevant industry publications or scientific journals. However, these searches must, by necessity, be incomplete, since they are done by human intelligence (the patent examiners). While this observation also points to the well-known problem of the poor quality of patents granted (and subsequently invalidated when challenged), the problem is much greater. As the amount of information available in the world doubles twice in the not uncommon time span of three years between a patent application and a grant, how is it possible for patent offices to “know” or “find” all Prior Art, let alone take it properly into consideration?
Duration of Litigation: It takes several years for patents to be filed and prosecuted or invalidated. While the observations above clearly show that we will see an ever-greater number of patents being invalidated once their owners try to enforce them through a court system that allows all Prior Art existing at grant to invalidate the patent, it will not be easy to quantify the effect, or to postulate a relevant time frame.
Non-Invalidation: Since patent litigation can be very expensive and cumbersome, many patents that should never have been granted, don’t get invalidated; and even where the litigation takes place, patent procedural and other rules may limit the effectiveness of invoking the existence of Prior Art.
Consequences for Novelty-Based Patent System
However, the conclusion must remain that the patent system, as it currently exists, is unable to cope with exponentially growing publicly available Prior Art.
A first and immediate consequence of this observation is that many patent portfolios are probably full of unenforceable patents; this will inevitably affect their potential value, for operating and non-operating entities alike.
The second consequence is that patents will likely have to be framed ever more narrowly in order to survive the patent review phase – while patent offices may not be able to access or use all that exponentially growing Prior Art, there will inevitably be knowledge spillovers from the public domain into the patent review system, rendering the process of obtaining patents much harder.
3.2 Comparable Nature of Patents and Algorithms
As established in Section 2.3 above, one of the key ways in which organizations can extract value from Big Data is through the operation of analytical algorithms.
Patents on Algorithms
Is it possible to obtain a patent on an algorithm? The answer will depend on the jurisdiction in which the question is asked, but it will never be easy to give a clear answer. It will typically not be possible to patent a “pure” algorithm. At this level, the algorithm really is an idea, a thought process written down, of the kind: “if the data is in format ‘dd/mm/yyyy’, add ‘date’ as metadata, and store in column A”.
Such clearly theoretical algorithms are not patentable.
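For illustration, the “pure” date-tagging algorithm quoted above can be written out as trivially runnable code – which makes clear just how much of a bare thought process it is (the dd/mm/yyyy pattern and the “column A” store are the article’s own illustrative example):

```python
# Sketch of the quoted rule: if a value looks like dd/mm/yyyy, tag it as a
# date and route it to "column A"; otherwise leave it untagged.
import re

DATE_PATTERN = re.compile(r"^\d{2}/\d{2}/\d{4}$")

def tag(value: str) -> dict:
    """Attach 'date' metadata and a target column to dd/mm/yyyy values."""
    if DATE_PATTERN.match(value):
        return {"value": value, "metadata": "date", "column": "A"}
    return {"value": value, "metadata": None, "column": None}

print(tag("24/12/2013"))   # tagged as a date, routed to column A
print(tag("hello"))        # left untagged
```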
However, as an algorithm becomes more complex, more specific and more contextual, the likelihood of patentability increases. When properly deconstructed, a software program is a complex algorithm – and, under the right conditions (e.g. subject to the “machine-or-transformation” test of Bilski in the US, or if it is incorporated in a technical invention in the EU), certain expressions of such algorithms are clearly patentable. The same is true for, e.g., production or manufacturing processes – these are really “ideas incorporated”: at heart, algorithms in a particular context, and they can be patented.
In essence, a patent itself can be described as an algorithm, placed in a particular context.
But it is not the grey zone between these two areas (patentable or non-patentable algorithms) that will be of interest here. What is more of interest is whether patents, as algorithms themselves, can remain of value in an age of Big Data.
I explained that, from a business perspective, the real value of analytical algorithms in Big Data depends on two basic characteristics:
(1) the access to data, and
(2) the ability to evolve continuously.
The argument developed here is that it is these two inherent characteristics of Big Data that will make patents, from a value creation point of view, superfluous in a Big Data environment. The reason why patents will become superfluous is the fixed nature of the algorithm captured in a patent; and that fixed nature is caused by the requirement of contextual specificity.
“Freezing” of Status Quo – Inflexibility
If a patent is an algorithm, it is also a “frozen” algorithm. Only the specific claims, subject to being described in sufficient detail, can be subject to the exclusive rights granted by the patent. By demanding that the patent is specific and contextual enough, patent law effectively diminishes, and arguably extinguishes, the Big Data value of the algorithm as described in the patent.
As an example, let’s return to Google’s search algorithms. These algorithms have the two characteristics of all Big Data analytical algorithms: a) they are dependent upon vast volumes of data, and increase, as analytical algorithms, in value together with the increase of Volume, Variety and Velocity of the data they are “fed”, and b) they need to be continuously adapted, as a result of their continuous interaction with ever more data.
It is the very requirement of contextuality and the fixation of claims in sufficient detail in a patent that makes a patent on such an algorithm an exercise in futility. The moment the algorithm is written down in the contextual and specific snapshot that is a patent application, it starts to lose value compared to its evolutionary siblings operating in the real world. It is as if, in an evolutionary environment, money were bet on a particular amino acid or cellular organism to dominate the ecosystem – while that amino acid or organism has to stay the same, and stop evolving. By placing the bet, by taking the patent, the algorithm is turned into a certain loser. Its competitors will continue to evolve, and outcompete it. Alternatively, if the patented algorithm is allowed to change, it has to adapt quickly to survive. And by adapting, it rapidly removes itself from the scope of protection granted by the patent.
In theory, it could be stated that it would be possible to patent not just one algorithm, but all its evolutionary siblings, and try to cover the whole market.
In reality, that would seem like a very efficient way to destroy value:
- As seen above, obtaining such patents, and enforcing them, will become much, much harder, as the amount of Prior Art (and existing algorithms) grows rapidly;
- The contextual and specific nature of patented analytical algorithms will make it very easy to circumvent or “invent around” such patents;
- The technology cycles are too short; and the Darwinian branching out of the ecosystem will make it impossible to achieve such wide coverage of the market, even if it were affordable.
To conclude, it is clear that Big Data will have the following two effects on the patent system:
- Gradually, it will become ever more difficult for anything to be patentable, as a result of exponentially growing Prior Art; this will also negatively impact the enforceability of a growing number of granted patents, an ever-increasing number of which will be found to be in breach of the novelty requirement;
- Since the value of analytical algorithms in Big Data is conditional upon their access to open data and their ability to evolve continuously, patents, by their nature of fixed and contextual specificity, are unable to capture this value in a significant way;
One factor remains uncertain and less clear – the timeline on which these developments will unfold. But, without changes in legislation, the outcome itself is not. In a world of exponentially growing publicly available knowledge, a novelty-based patent system is inherently unsustainable.