
8 Machine Learning Keywords You Must Know

Thursday, February 1, 2018

You have probably seen Machine Learning mentioned a thousand times in all kinds of posts and articles, but do you know what it really is? In this article I cover the 8 must-know key terms most directly related to Machine Learning. Nothing fancy or complicated, so anyone interested in Machine Learning should be able to take away a few useful points from this post.


The 8 terms covered in the article are:

Natural language processing
Database
Computer vision
Supervised learning
Unsupervised learning
Reinforcement learning
Neural network
Overfitting


1. Natural language processing (NLP)

NLP is one of the most common application areas of machine learning. It makes it possible for a computer to read human language and incorporate it into all kinds of processes.


The most well-known applications for NLP include:

(a) Text classification and sorting

This deals with classifying texts into different categories, or sorting a list of texts by relevance. For example, it can be used to filter out spam email by deciding whether each message is spam, and in a business setting it can be used to identify and extract information related to your competitors.


(b) Sentiment analysis

With sentiment analysis, a computer can detect sentiments such as anger, sadness or delight by analyzing text. In other words, the computer can tell whether people were feeling happy, sad or angry from the words and sentences they typed. This is widely used in customer satisfaction surveys to analyze how customers feel about a product.
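As a rough illustration of the idea, here is a minimal lexicon-based sketch. The word lists and the `sentiment` function are hypothetical and chosen only for this example; production systems usually rely on trained models rather than a hand-written list.

```python
# Hypothetical word lists, chosen only for illustration.
POSITIVE = {"great", "love", "happy", "excellent", "good"}
NEGATIVE = {"bad", "angry", "terrible", "sad", "hate"}

def sentiment(text):
    """Return 'positive', 'negative' or 'neutral' by counting lexicon hits."""
    # Lowercase each word and strip simple punctuation from its ends.
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this product, it is great!"))
print(sentiment("Terrible support, I am angry."))
```

Counting words like this ignores negation and sarcasm, which is one reason real sentiment analysis is treated as a learning problem.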


(c) Information extraction

Information extraction pulls the key facts out of unstructured text. A closely related use is summarizing a long paragraph into a short text, much like creating an abstract.


(d) Named-entity recognition

Say that you have extracted a bunch of messy profile data, such as addresses, phone numbers, names and more, all mixed up with one another. Wouldn't you wish you could somehow clean this data so that every piece is identified and matched to the proper data type? This is exactly how named-entity recognition helps turn messy information into structured data.
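The sketch below shows the "messy text in, structured fields out" idea with plain regular expressions. It is a rule-based stand-in, not a trained named-entity recognizer, and the patterns, the `extract_fields` helper and the sample record are all assumptions made for illustration.

```python
import re

# Simplified, illustrative patterns; real-world formats vary far more than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def extract_fields(text):
    """Pull email addresses and phone numbers out of a free-form string."""
    return {"emails": EMAIL_RE.findall(text), "phones": PHONE_RE.findall(text)}

# Hypothetical messy profile record.
record = "Jane Doe 555-123-4567 jane.doe@example.com 12 Main St"
fields = extract_fields(record)
print(fields)
```

Names and street addresses do not follow such rigid patterns, which is where learned models become necessary.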


(e) Speech recognition

Speech recognition converts spoken language into text. A great example of this is Apple's Siri.


(f) Natural language understanding and generation

Natural language understanding (NLU) uses a computer to transform human expressions into a representation the computer can work with. Conversely, natural language generation (NLG) transforms the computer's representation back into human language. This technology is very commonly used when humans communicate with robots.


(g) Machine translation

Machine translation automatically translates text from one language into another language (or into several specified languages).


2. Database

Data is a necessary component of machine learning. If you want to build a machine learning system, you will need to either collect data from public sources or generate new data. All the datasets used for machine learning together form the database. Generally, data scientists divide the data into three sets:

Training dataset: used for training models. Through training, a machine learning model learns to recognize the important features of the data.

Validation dataset: used for tuning a model's settings (hyperparameters) and for comparing models to pick the best one. The validation dataset must be kept separate from the training dataset and must not be used during training; otherwise overfitting may go unnoticed and the model may perform worse on new data.

Test dataset: once the model has been chosen, the test dataset is used to measure its performance on data it has never seen.

A traditional ratio for these three datasets is 50/25/25. However, some models need little tuning, and with cross-validation the training data also serves as validation data, so a train/test ratio of 70/30 is also common.
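The 50/25/25 split described above can be sketched in a few lines. The `split_dataset` function, its parameters and the fixed seed are assumptions for this example; libraries such as scikit-learn offer their own splitting utilities.

```python
import random

def split_dataset(rows, train=0.5, validate=0.25, seed=0):
    """Shuffle rows and cut them into train / validation / test parts."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validate)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],   # whatever is left becomes the test set
    )

train_set, validate_set, test_set = split_dataset(range(100))
print(len(train_set), len(validate_set), len(test_set))
```

Shuffling before cutting matters: if the rows are sorted (by date or by label, say), an unshuffled split would give the three sets different distributions.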


3. Computer vision

Computer vision is an artificial intelligence field focused on analyzing and understanding image and video data. The problems we often see in computer vision include:

Image classification: a computer vision task that teaches a computer to recognize certain images, for example training a model to recognize particular objects that appear in specific places.

Object detection (also called target detection): the model is taught to find instances of a set of predefined categories in an image and to mark each one with a bounding rectangle. For instance, object detection can be used to build a face recognition system. The model detects every predefined object and highlights it.

Image segmentation: Image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as super-pixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze.

Significance test: Once sample data has been gathered through an observational study or experiment, statistical inference allows analysts to assess evidence in favor of some claim about the population from which the sample was drawn. The methods of inference used to support or reject claims based on sample data are known as tests of significance. This is a general statistical tool rather than a vision-specific task.
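To make the image segmentation item above concrete, here is a minimal sketch of the simplest form of segmentation, a fixed brightness threshold. The tiny grayscale grid and the threshold of 128 are made-up values; practical segmentation uses far more capable methods.

```python
def threshold_segment(image, threshold=128):
    """Label each pixel 1 (foreground) if brighter than threshold, else 0."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in image]

# Hypothetical 3x4 grayscale image with a bright region on the right.
image = [
    [10, 12, 200, 210],
    [11, 190, 220, 205],
    [9, 10, 15, 198],
]
mask = threshold_segment(image)
for row in mask:
    print(row)
```

The resulting mask partitions the pixels into two segments, which is the "simpler representation" the definition refers to.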


4. Supervised learning

Supervised learning is the machine learning task of inferring a function from labeled training data. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. In the optimal scenario, the algorithm correctly determines the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way.
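One of the simplest supervised learners is the 1-nearest-neighbor classifier, sketched below: it labels a new example with the label of the closest training example. The data and the `predict_1nn` name are hypothetical.

```python
def predict_1nn(train_points, query):
    """Return the label of the training point closest to query."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(train_points, key=lambda item: squared_distance(item[0], query))
    return nearest[1]

# Hypothetical labeled training data: (height_cm, weight_kg) -> label.
training = [
    ((150, 50), "small"),
    ((155, 55), "small"),
    ((180, 85), "large"),
    ((185, 90), "large"),
]

print(predict_1nn(training, (152, 52)))
print(predict_1nn(training, (182, 88)))
```

The labels in `training` are what make this supervised: the function being inferred maps measurements to those known answers.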


5. Unsupervised learning

Unsupervised learning is the machine learning task of inferring a function that describes hidden structure in "unlabeled" data (the observations do not include a classification or categorization). Since the examples given to the learner are unlabeled, there is no direct way to evaluate the accuracy of the structure the algorithm outputs, which is one way of distinguishing unsupervised learning from supervised learning and reinforcement learning.
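Clustering is a typical unsupervised task. The sketch below runs a bare-bones k-means on one-dimensional values; the data, the `kmeans_1d` name and the deterministic choice of starting centers are assumptions made to keep the example short (it expects k of at least 2).

```python
def kmeans_1d(values, k=2, iterations=10):
    """Group numbers into k clusters by repeatedly moving centers to cluster means."""
    ordered = sorted(values)
    # Start from k evenly spaced values of the sorted data (simple and deterministic).
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabeled measurements that happen to form two groups.
centers, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.1])
print(centers)
print(clusters)
```

No labels are supplied anywhere; the two groups are structure the algorithm finds on its own.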



6. Reinforcement learning

Reinforcement learning is different from the two approaches just discussed. It is like the process of playing games against a computer: the goal is to train the computer to take actions in an environment so as to maximize some kind of cumulative reward. Over a series of trials the computer learns a set of playing patterns, and during a game it can use the best pattern to maximize its reward.
A well-known example is AlphaGo, which beat top professional human players at the board game Go. Recently, reinforcement learning has also been applied to real-time bidding.
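The reward-maximizing loop can be sketched with tabular Q-learning on a toy problem. Everything here is an assumption made for illustration: a corridor of five cells, a reward of 1 for reaching the last cell, and purely random exploration during training. It has nothing to do with how AlphaGo itself works.

```python
import random

N_STATES = 5            # cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)      # index 0 = move left, index 1 = move right
ALPHA, GAMMA = 0.5, 0.9 # learning rate and discount factor

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = rng.randrange(2)  # explore uniformly at random
            next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
            reward = 1.0 if next_state == N_STATES - 1 else 0.0
            # Q-learning update: move the estimate toward reward + discounted best future value.
            target = reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q

q_table = train()
# Greedy policy for every non-goal cell.
policy = ["left" if left > right else "right" for left, right in q_table[:-1]]
print(policy)
```

After training, the greedy policy read from the table heads toward the goal from every cell, because moving right carries the highest estimated cumulative reward.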


7. Neural network

Neural networks are computing systems inspired by the biological neural networks that constitute animal brains. Just as many neurons in a brain interconnect to form a network, an artificial neural network (ANN) is made up of many layers, and every layer is a collection of neurons. An ANN processes data layer by layer: only the first layer is connected to the inputs, and each later layer works on the output of the one before it, so the network becomes more complex as layers are added. When the number of layers grows large, the model is called a deep learning model. There is no fixed number of layers at which an ANN becomes deep. Ten years ago an ANN with only 3 layers counted as deep; now 20 layers or more are common.


NNs have many variants. The ones in common use are:

  • Convolutional neural network: made great breakthroughs in computer vision.
  • Recurrent neural network: created to process sequential data, such as text and stock prices.
  • Fully connected network: the simplest model, used for processing static/tabular data.
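The layer-by-layer processing described in this section can be sketched as the forward pass of a tiny fully connected network. The weights below are hand-picked, hypothetical numbers; a real network learns them from data during training.

```python
def relu(x):
    """Common activation function: pass positives through, clip negatives to 0."""
    return max(0.0, x)

def dense(inputs, weights, biases, activation):
    """One fully connected layer: every output neuron sees every input."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs):
    # Hidden layer: 2 inputs -> 2 neurons (hypothetical weights).
    hidden = dense(inputs, [[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu)
    # Output layer: 2 hidden values -> 1 output, no activation.
    output = dense(hidden, [[2.0, 1.0]], [0.5], lambda x: x)
    return output[0]

print(forward([3.0, 1.0]))
print(forward([0.0, 1.0]))
```

Only the first call to `dense` touches the raw inputs; the second layer sees nothing but the first layer's output, and stacking more such layers is what makes a network deeper.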


8. Overfitting

Overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably". In other words, when a model fits a limited set of training data too closely, it picks up quirks of that data, which may adversely affect the model.

This is a common but critical problem.

When overfitting occurs, the model generally treats random noise in the training data as if it were an important signal and fits it. This is why the model may perform worse on new data, where the noise is different. It happens a lot with complex models such as neural networks or gradient boosting models.
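A small sketch of this effect, using made-up data: the true rule is "label 1 when x is at least 5", but two training labels are deliberately flipped to act as noise. A model that memorizes the training set is compared with a simple one-threshold model; all names and values are assumptions for this example.

```python
def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

# Training data with two noisy labels (x=2 and x=7 are flipped).
TRAIN = [(0, 0), (1, 0), (2, 1), (3, 0), (4, 0), (5, 1), (6, 1), (7, 0), (8, 1), (9, 1)]
# Clean test data following the true rule.
TEST = [(0.4, 0), (1.6, 0), (2.4, 0), (3.5, 0), (4.4, 0),
        (5.6, 1), (6.4, 1), (7.3, 1), (8.6, 1), (9.4, 1)]

def memorizer(x):
    """Flexible model: copy the label of the nearest training point, noise included."""
    nearest = min(TRAIN, key=lambda point: abs(point[0] - x))
    return nearest[1]

def fit_threshold(data):
    """Simple model: pick the single cut-off with the best training accuracy."""
    candidates = [x for x, _ in data]
    return max(candidates, key=lambda t: accuracy(lambda x: int(x >= t), data))

best_t = fit_threshold(TRAIN)

def simple(x):
    return int(x >= best_t)

print("memorizer train/test:", accuracy(memorizer, TRAIN), accuracy(memorizer, TEST))
print("simple    train/test:", accuracy(simple, TRAIN), accuracy(simple, TEST))
```

The memorizer scores perfectly on the training data because it has fitted the two noisy labels, and that is exactly what costs it accuracy on the clean test set. The simpler model cannot fit the noise, so it scores lower in training but generalizes better.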





Author: The Octoparse Team 


