Basile, an engineer specializing in Big Data, joined the LINAGORA team in January. His mission is to contribute to the development of the Data Center, and in particular to its Big Data activity, while continuing the research work for his thesis.
So let’s start from the beginning: what is ‘Big Data’?
Here is the vision of an expert who, like all Linagorians, likes to share his knowledge and experience.
Today, Big Data and Machine Learning are part of our daily lives. We encounter them, for example, on Amazon in the blocks of products recommended to us. Another visible example is targeted advertising: perhaps you have noticed that, very often, the last product you looked at on a shopping site magically appears on the side of another website you visit, and keeps reappearing as you move from site to site. All this is a manifestation of Big Data and Machine Learning.
We define Big Data as an amount of data so large and complex that it cannot be processed by standard systems (relational databases, data warehouses, etc.). Machine Learning is a field of artificial intelligence that allows a machine to progressively improve at a task without having been explicitly programmed for it. To carry out Big Data processing, the operating mode of standard systems had to be rethought in order to handle these large volumes. This is what we call distributed processing. Performing distributed processing means making a group of computers work together to deliver a common result. For example, if you asked a group of computers to count the number of people present in a room, each of them would count the people in one corner of the room, and the final result would be the sum of what every machine counted.
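The room-counting example above can be sketched in a few lines of Python. This is only an illustration of the idea: each “computer” is simulated here by a worker process on a single machine, and the names and corners are invented for the example.

```python
from multiprocessing import Pool

def count_corner(corner):
    """Each worker counts the people in its own corner of the room."""
    return len(corner)

if __name__ == "__main__":
    # The room, split into 4 "corners" (lists of people present there).
    corners = [
        ["Alice", "Bob"],
        ["Carol"],
        ["Dan", "Erin", "Frank"],
        ["Grace"],
    ]
    # Each worker process counts one corner in parallel...
    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_corner, corners)
    # ...and the final result is the sum of what every worker counted.
    print(sum(partial_counts))  # 7
```

The important point is that no single worker ever sees the whole room: each one produces a partial result, and only those small results are brought back together.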
In the world of Big Data, Hadoop is a framework, an approach to performing distributed processing on a set of computers called a cluster. Hadoop relies on a file system (storage system) called HDFS (Hadoop Distributed File System). The peculiarity of this file system is that each file deposited there is cut into small pieces, and each piece is replicated on several computers of the cluster.
This scheme shows the distribution of a file on a set of 8 computers. The file has been cut into 5 pieces, and each piece is replicated several times on the cluster (the set of machines). Here the replication factor is 3: as long as a piece has not yet been replicated 3 times in the cluster, as is the case for pieces 1 and 3, the system continues its replication process.
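To make the replication idea concrete, here is a toy simulation of placing the 5 pieces of the scheme on 8 machines with a replication factor of 3. This round-robin placement is an assumption made for the example; it is not HDFS’s real placement policy, which is rack-aware and considerably more sophisticated.

```python
from itertools import cycle

def place_blocks(n_blocks, nodes, replication_factor=3):
    """Assign each block to `replication_factor` distinct nodes, round-robin."""
    placement = {}
    ring = cycle(nodes)
    for block in range(1, n_blocks + 1):
        replicas = set()
        while len(replicas) < replication_factor:
            replicas.add(next(ring))
        placement[block] = replicas
    return placement

# 5 pieces of a file spread over 8 machines, replication factor 3.
placement = place_blocks(5, [f"node{i}" for i in range(1, 9)])
for block, replicas in sorted(placement.items()):
    print(f"piece {block}: {sorted(replicas)}")
```

In real HDFS, a background process monitors the cluster and, exactly as described above, keeps creating new replicas of any piece that falls below the replication factor.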
To perform Big Data processing on top of the HDFS file system, Hadoop uses a distributed processing model called MapReduce. Each computer works on the pieces of the file it stores (the ‘map’ step), and the partial results are then combined (the ‘reduce’ step) to produce the overall, expected result. Today, there are several Hadoop distributions on the market, for example Hortonworks, Cloudera and MapR.
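The MapReduce idea can be sketched with the classic word-count example, here in plain Python on a single machine rather than on a real Hadoop cluster. The two chunks of text stand in for the pieces of a file stored on different nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    """Map: each worker emits a (word, 1) pair for every word in its chunk."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Two chunks of a file, each processed independently (in Hadoop, on
# the node that stores the chunk), then the partial outputs are merged.
chunks = ["big data big", "data big cluster"]
pairs = chain.from_iterable(map_phase(c) for c in chunks)
print(reduce_phase(pairs))  # {'big': 3, 'data': 2, 'cluster': 1}
```

In Hadoop, the framework takes care of running the map step on the node holding each piece and of shuffling the intermediate pairs to the reducers, which is what makes the approach scale to very large files.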
Today, LINAGORA supports its customers in their digital transformation by offering them Open Source solutions to regain or retain sovereignty over their data. Starting from the identification of each organization’s or company’s use cases, LINAGORA designs Big Data architectures that allow them to exploit their own data efficiently. LINAGORA’s experts can also handle the development, production deployment and monitoring of any type of Big Data application.