Data mining is the process of finding stories in large amounts of data. As in other newsgathering practices, the process depends on the data source, on experts who can explain the data to journalists and their audiences, on the working hypotheses, and on the journalist's interpretation of the data.
Large amounts of data, from public and private sources, are now available to journalists as a result of freedom of information laws and of a widely shared view that public transparency and public access matter. A dataset is a collection of information related to an environment or a process, gathered using a common structure and a common theme. It represents the raw, unanalysed data about a reality, such as how many people own a house in different European countries, or the gender of employed people in a given region. The information may be stored as numbers (such as a person's age) or as labels (such as a person's gender).
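The idea of a common structure holding both numbers and labels can be sketched in a few lines of Python. This is only an illustration: the `Person` record, its field names and the four rows are invented for the example and do not come from any real dataset.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Person:
    """One row of a hypothetical dataset: every row shares the same fields."""
    age: int      # stored as a number
    gender: str   # stored as a label
    region: str   # stored as a label


# Invented rows, for illustration only.
dataset = [
    Person(age=34, gender="female", region="North"),
    Person(age=51, gender="male", region="North"),
    Person(age=28, gender="female", region="South"),
    Person(age=47, gender="male", region="South"),
]

# Numbers can be averaged; labels can only be counted or grouped.
mean_age = sum(p.age for p in dataset) / len(dataset)
gender_counts = Counter(p.gender for p in dataset)

print(f"Mean age: {mean_age}")
print(f"Rows per gender: {dict(gender_counts)}")
```

The distinction matters in practice: an average age is meaningful, whereas an "average gender" is not, so the way a value is stored determines which questions can be asked of it.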
Datasets are organised in databases; good examples are the European statistics published by Eurostat and the Global Health Observatory (GHO) data of the World Health Organization. Databases may also include collections of texts, images, videos or sounds: parliamentary speeches, war photography, historical videos, or radio archives. For some of these audio-visual databases, specialists are actively looking for data mining tools that would make the collections easier to use and manage.
Databases are becoming more and more accessible thanks to rapid developments in computer science and in database-related solutions, and to the accompanying development of new tools for cleaning, comparing and finding patterns in large sets of data. Expert advice in this area may come from statisticians and computer science specialists interested in machine learning, data management, pattern recognition and the like. Journalists and the public may also search databases independently, with the help of tools such as Google Trends (which explores Google data on popular searches, by country and by time period) or Hoaxy (a tool that tracks the spread of false news online). Starting from a working hypothesis, journalists and other interested parties may approach a large database to search for and compare specific data: museum entries in a certain country over a period of time, public transportation facilities in a developing region, or public spending on health and life expectancy at birth. Because correlation does not equal causation (that is, more public spending on health may be necessary, but is not sufficient, to increase life expectancy), the effective use of data mining results depends on good, reliable sources and on journalistic professionalism.
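The comparison of health spending with life expectancy can be illustrated with a short Python sketch that computes a Pearson correlation coefficient. The country labels and all figures below are made up for the example and are not taken from Eurostat, the GHO or any other source; the `pearson` helper is likewise written for this illustration only.

```python
from math import sqrt

# Hypothetical figures, for illustration only.
# Each record: (country label, health spending per person, life expectancy at birth)
records = [
    ("A", 1000, 70.0),
    ("B", 2000, 74.0),
    ("C", 3000, 77.0),
    ("D", 4000, 79.0),
    ("E", 5000, 80.0),
]


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


spending = [row[1] for row in records]
life_expectancy = [row[2] for row in records]

r = pearson(spending, life_expectancy)
print(f"Pearson r = {r:.2f}")
```

With these invented figures the coefficient comes out close to 1, which indicates a strong association and nothing more. It cannot show that spending causes longer lives: income, diet, sanitation or education could drive both variables, which is why the result still has to be checked against reliable sources and expert advice.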