
September 21, 2017

A-Z Reference Guide for an Ongoing Data-Oriented World



Introduction

Data volumes are exploding. More data has been created in the past two years than in the entire previous history of humankind. Moreover, in 2017 alone, a wide and growing range of data coming from all kinds of industries (financial, IoT, healthcare, automotive, astronomy, biotech, cybersecurity, social media, entertainment, among several others) will push these impressive numbers even higher.

Nonetheless, research has found that less than 0.5 percent of that data is actually analyzed for operational and business decision making.

By the year 2020, about 1.7 megabytes of new information will be created every second for every human being on the planet.
By then, our accumulated digital universe of data will grow from 4.4 zettabytes today to around 44 zettabytes, or 44 trillion gigabytes.

According to IDC, the number of connected devices in the IoT universe will reach 30 billion by 2020, and the potential market is estimated at 7.3 billion dollars in 2017 alone.

This huge mass of data is what everyone is talking about, and, with proper handling, it can bring significant results and changes to all of those industries and to human life.

A-Z

Analytics has emerged as a catch-all term for a variety of different business intelligence (BI)- and application-related initiatives. For some, it is the process of analyzing information from a particular domain, such as website analytics. For others, it is applying the breadth of BI capabilities to a specific content area (for example, sales, service, supply chain and so on). In particular, BI vendors use the “analytics” moniker to differentiate their products from the competition. Increasingly, “analytics” is used to describe statistical and mathematical data analysis that clusters, segments, scores and predicts what scenarios are most likely to happen. Whatever the use cases, “analytics” has moved deeper into the business vernacular. Analytics has garnered a burgeoning interest from business and IT professionals looking to exploit huge mounds of internally generated and externally available data.

Artificial Intelligence is the science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable.

Big Data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. However, it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.

Business Analytics is one aspect of business intelligence, which is the sum of all your research tools and information infrastructure. Due to this close relationship, the terms business intelligence and business analytics are sometimes used interchangeably. Strictly speaking, business analytics focuses on statistical analysis of the information provided by business intelligence.

Business intelligence (BI) refers to the procedural and technical infrastructure that collects, stores and analyzes the data produced by a company’s activities. Business intelligence is a broad term that encompasses data mining, process analysis, performance benchmarking, descriptive analytics, and so on. Business intelligence is meant to take in all the data being generated by a business and present easy-to-digest performance measures and trends that will inform management decisions.

Clustering is the task of dividing a population of data points into a number of groups such that points in the same group are more similar to one another than to points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

A business intelligence Dashboard is a data visualization tool that displays the current status of metrics and key performance indicators (KPIs) for an enterprise. Dashboards consolidate and arrange numbers, metrics and sometimes performance scorecards on a single screen. They may be tailored for a specific role and display metrics targeted for a single point of view or department. The essential features of a BI dashboard product include a customizable interface and the ability to pull real-time data from multiple sources.

Data Grids are in-memory distributed databases designed for scalability and fast access to large volumes of data. More than just a distributed caching solution, data grids also offer additional functionality such as map/reduce, querying, processing for streaming data, and transaction capabilities.

Data Mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also known as Knowledge Discovery in Data (KDD).

Data Science is an interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining. It's a concept to unify statistics, data analysis and their related methods in order to understand and analyze actual phenomena with data. It employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information science, and computer science, in particular from the subdomains of machine learning, classification, cluster analysis, data mining, databases, and visualization.

A Data Warehouse is a storage architecture designed to hold data extracted from transaction systems, operational data stores and external sources. The warehouse then combines that data in an aggregate, summary form suitable for enterprise-wide data analysis and reporting for predefined business needs.

ETL tools perform three functions to move data from one place to another: Extract data from sources such as ERP or CRM applications; Transform that data into a common format that fits with other data in the warehouse; and Load the data into the data warehouse for analysis.
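
As a rough illustration of those three steps, here is a minimal ETL sketch in Python using pandas and SQLite; the file name, column names and transformation rules are hypothetical assumptions, not taken from any particular tool.

```python
# A minimal ETL sketch (hypothetical source file and columns).
import sqlite3
import pandas as pd

# Extract: read raw records exported from a source system.
raw = pd.read_csv("crm_export.csv")              # e.g. columns: customer, amount, date

# Transform: normalize the data into the common format used by the warehouse.
raw["date"] = pd.to_datetime(raw["date"])
raw["amount"] = raw["amount"].astype(float)
clean = raw.rename(columns={"customer": "customer_name"})

# Load: append the conformed rows into a (local, file-based) warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales_fact", conn, if_exists="append", index=False)
```
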
Microsoft Excel is a software program that allows users to organize, format and calculate data with formulas using a spreadsheet system.

Fact tables are the foundation of the data warehouse. They contain the fundamental measurements of the enterprise, and they are the ultimate target of most data warehouse queries. The real purpose of the fact table is to be the repository of the numeric facts that are observed during the measurement event.
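
To make the idea concrete, here is a small, made-up example in Python/pandas of a fact table holding numeric measurements with a foreign key into a dimension table, plus a typical aggregate query.

```python
# A sketch of a fact table joined to a dimension table and aggregated.
# All table contents are illustrative, made-up data.
import pandas as pd

# Dimension table: descriptive attributes of each product.
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "category":   ["books", "music"],
})

# Fact table: one row per measurement event (a sale), holding the numeric
# facts plus a foreign key into the product dimension.
fact_sales = pd.DataFrame({
    "product_id": [1, 1, 2],
    "quantity":   [3, 1, 5],
    "revenue":    [30.0, 10.0, 75.0],
})

# A typical warehouse-style query: join fact to dimension, then aggregate.
report = (fact_sales.merge(dim_product, on="product_id")
                    .groupby("category")[["quantity", "revenue"]].sum())
print(report)
```
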

Google Analytics is a Web analytics service that provides statistics and basic analytical tools for search engine optimization (SEO) and marketing purposes.

The Apache Hadoop project develops an open-source software framework for reliable, scalable, distributed computing that allows the processing of large data sets across clusters of computers using simple programming models. Other Hadoop-related projects at Apache include Ambari, Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop.
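
The classic introductory Hadoop example is a MapReduce word count. The sketch below shows the same map and reduce logic in plain Python, run locally in a single process purely for illustration; in a real Hadoop job the two phases would run distributed across the cluster.

```python
# A minimal word-count sketch in the MapReduce style (the canonical Hadoop
# example). Here both phases run locally; with Hadoop they would be
# distributed over many machines.
from collections import defaultdict

def map_phase(lines):
    # Emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Sum the counts emitted for each word.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

documents = ["big data is big", "data about data"]
print(reduce_phase(map_phase(documents)))   # {'big': 2, 'data': 3, ...}
```
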

The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

A histogram is a graphical representation, similar in structure to a bar chart, that organizes a group of data points into user-specified ranges. The histogram condenses a data series into an easily interpreted visual by taking many data points and grouping them into logical ranges or bins.
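
For example, a minimal Python/matplotlib sketch that groups 1,000 synthetic data points into 20 bins:

```python
# A quick histogram sketch: many data points grouped into user-specified bins.
import numpy as np
import matplotlib.pyplot as plt

values = np.random.normal(loc=50, scale=10, size=1_000)   # synthetic data
plt.hist(values, bins=20)                                  # 20 bins (ranges)
plt.xlabel("value")
plt.ylabel("frequency")
plt.title("Histogram of 1,000 synthetic measurements")
plt.show()
```
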

Julia is a high-level, high-performance dynamic programming language for numerical computing. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library.

K-means is one of the oldest and most commonly used clustering algorithms. It is a prototype-based clustering technique that defines the prototype as a centroid, taken to be the mean of a group of points, and it is applicable to objects in a continuous n-dimensional space.
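
A short sketch with scikit-learn, using made-up two-dimensional points, shows the idea: each point ends up assigned to the cluster whose centroid (the mean of that cluster's points) is closest.

```python
# K-means on three synthetic blobs of 2-D points.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three blobs of 50 points each, around different centers.
X = np.vstack([rng.normal(center, 0.5, size=(50, 2))
               for center in ([0, 0], [5, 5], [0, 5])])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # the three learned centroids (cluster means)
print(kmeans.labels_[:10])       # cluster assignment of the first 10 points
```
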

Machine Learning is a core subarea of artificial intelligence that studies computer algorithms for learning to do stuff automatically without human intervention or assistance. The learning that is being done is always based on some sort of observations or data, such as examples, direct experience, or instruction. So in general, machine learning is about learning to do better in the future based on what was experienced in the past. Although a subarea of AI, machine learning also intersects broadly with other fields, especially statistics, but also mathematics, physics, theoretical computer science and more.
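
As a minimal illustration of learning from past observations, the scikit-learn sketch below fits a classifier on part of the classic iris data set and then measures how well it predicts examples it has never seen; the choice of model and split is arbitrary, purely for the example.

```python
# Supervised learning in a few lines: learn from past examples, then be
# evaluated on unseen ones.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # learn from the past
print("accuracy on unseen data:", model.score(X_test, y_test))   # predict the future
```
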

MATLAB (matrix laboratory) is a multi-paradigm numerical computing environment and fourth-generation programming language developed by MathWorks, that allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages, including C, C++, C#, Java, Fortran and Python.

MongoDB is an open-source document-based database system. Its name derives from the word “humongous”, because of the database’s ability to scale up with ease and hold very large amounts of data. MongoDB stores documents in collections within databases.
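
A small sketch with the pymongo driver, assuming a MongoDB server running on localhost; the database, collection and document contents are made up for illustration.

```python
# Storing and querying JSON-like documents with pymongo (assumes a local server).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["shop"]["orders"]          # database "shop", collection "orders"

# Documents are schemaless dicts stored in collections.
collection.insert_one({"customer": "Ana", "items": ["book", "pen"], "total": 42.0})

# Query back every order above a threshold.
for order in collection.find({"total": {"$gt": 10}}):
    print(order["customer"], order["total"])
```
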

By Natural Language, we mean a language that is used for everyday communication by humans; languages like English, Hindi or Portuguese. We can understand Natural Language Processing — or NLP for short — in a wide sense to cover any kind of computer manipulation of natural language. Providing more natural human-machine interfaces, and more sophisticated access to stored information, technologies based on NLP are becoming increasingly widespread (e.g. phones and handheld computers support predictive text and handwriting recognition; machine translation allows us to retrieve texts written in Chinese and read them in Spanish; text analysis enables us to detect sentiment in tweets and blogs). 
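
As a tiny taste of computer manipulation of natural language, the sketch below uses the NLTK library to tokenize an English sentence and count word frequencies. It assumes NLTK is installed; the tokenizer model is downloaded on first use, and its package name can differ between NLTK versions.

```python
# Tokenization and word-frequency counting with NLTK.
import nltk
nltk.download("punkt", quiet=True)   # tokenizer model (name may vary by NLTK version)

text = "Natural language processing lets computers manipulate everyday language."
tokens = nltk.word_tokenize(text)                 # split the sentence into word tokens
freq = nltk.FreqDist(t.lower() for t in tokens)   # count how often each token occurs
print(tokens)
print(freq.most_common(3))
```
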

NoSQL (Not Only SQL) describes an approach to database design that implements a key-value store, document store, column store or graph format for data. NoSQL databases especially target large sets of distributed data. Nowadays there are more than 225 NoSQL databases, most of them addressing some of these points: being non-relational, distributed, open-source and horizontally scalable.
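
As one concrete example of the key-value style, the sketch below uses the redis-py client against a hypothetical local Redis server; keys, values and counters take the place of relational rows and queries.

```python
# Key-value operations with redis-py (assumes a Redis server on localhost).
import redis

r = redis.Redis(host="localhost", port=6379)

# Values are stored and retrieved by key rather than by relational queries.
r.set("session:42", "alice")
print(r.get("session:42"))         # b'alice'

# Simple counters are a common key-value pattern.
r.incr("page:home:views")
print(r.get("page:home:views"))
```
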

OLAP (Online Analytical Processing) is the technology behind many Business Intelligence (BI) applications. OLAP is a powerful technology for data discovery, including capabilities for limitless report viewing, complex analytical calculations, trend analysis, sophisticated data modeling and predictive “what if” scenario (budget, forecast) planning.
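
The flavour of an OLAP query, aggregating a measure along two dimensions of a small cube, can be imitated with a pandas pivot table; the sales figures below are made up for illustration.

```python
# An OLAP-flavoured aggregation: revenue summed along region x quarter.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2"],
    "revenue": [100.0, 120.0, 80.0, 90.0, 150.0],
})

cube = pd.pivot_table(sales, values="revenue", index="region",
                      columns="quarter", aggfunc="sum")
print(cube)
```
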

Power BI is a suite of business analytics tools to analyze data and share insights.

Predictive Analytics is the use of statistics and modeling to determine future performance based on current and historical data. Predictive analytics looks at patterns in data to determine whether those patterns are likely to emerge again, which allows businesses and investors to adjust where they use their resources in order to take advantage of possible future events.
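
A minimal sketch of the idea: fit a trend to twelve months of made-up historical sales and extrapolate one period ahead with scikit-learn.

```python
# Predictive analytics in miniature: learn a trend from history, forecast ahead.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(1, 13).reshape(-1, 1)                  # historical periods 1..12
sales = np.array([10, 12, 13, 15, 16, 18, 19, 21, 22, 24, 25, 27], dtype=float)

model = LinearRegression().fit(months, sales)             # learn the pattern
print("forecast for month 13:", model.predict([[13]])[0]) # extrapolate it forward
```
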

Python is a high-level interpreted programming language. It provides constructs intended to enable writing clear programs on both small and large scales. It features a dynamic type system and automatic memory management, and supports multiple programming paradigms, including object-oriented, imperative, functional and procedural styles. Interpreters are available for many operating systems, allowing Python code to run on a wide variety of systems.
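
A few illustrative lines showing the traits mentioned above: dynamic typing, with object-oriented, functional-style and plain procedural code side by side.

```python
# Several Python styles in one short, self-contained snippet.
class Greeter:                        # object-oriented style
    def __init__(self, name):
        self.name = name
    def hello(self):
        return f"Hello, {self.name}!"

squares = [n * n for n in range(5)]   # declarative/functional flavour
total = 0
for n in squares:                     # plain procedural loop
    total += n

x = 42                                # dynamic typing: the same name can later
x = "forty-two"                       # hold a value of a different type

print(Greeter("world").hello(), squares, total, x)
```
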

R is a language and environment for statistical computing and graphics. It is a GNU project and provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering) and graphical techniques, and is highly extensible.

SPSS Statistics is a software package used for logical batched and non-batched statistical analysis. The name originally stood for Statistical Package for the Social Sciences, reflecting the original market, although the software is now popular in other fields, being used by market researchers, health researchers, survey companies, government, education researchers, marketing organizations, data miners and others. Acquired by IBM in 2009, current versions are officially named IBM SPSS Statistics.

What-If Analysis is the process of changing the values in cells to see how those changes will affect the outcome of formulas on the worksheet. Three kinds of What-If Analysis tools come with MS Excel: Scenarios, Goal Seek, and Data Tables. Scenarios and Data Tables take sets of input values and determine possible results. A Data Table works with only one or two variables, but it can accept many different values for those variables. A Scenario can have multiple variables, but it can only accommodate up to 32 values. Goal Seek works differently from Scenarios and Data Tables in that it takes a result and determines possible input values that produce that result.
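
Outside Excel, the Goal Seek idea (given a target result, find the input that produces it) can be sketched with a numerical root finder; the profit formula below is a hypothetical stand-in for a worksheet formula.

```python
# Goal Seek in miniature: find the input that makes a formula hit a target.
from scipy.optimize import brentq

def profit(price):
    # Hypothetical worksheet formula: revenue minus fixed and per-unit costs.
    units_sold = 1000 - 20 * price
    return price * units_sold - 5000 - 2 * units_sold

target = 3000.0
# Which price makes profit(price) equal the target? Search between 5 and 25.
price = brentq(lambda p: profit(p) - target, 5.0, 25.0)
print(f"price needed: {price:.2f}")
```
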

XML for Analysis (XMLA) is a SOAP-based XML protocol, designed specifically for universal data access to any standard multidimensional data source that can be accessed over an HTTP connection.

Now, it’s up to you! Bookmark this page and come back whenever you need it!