Step-by-step tutorials for you to get started with web scraping


Scrape tweets from Twitter

Thursday, August 16, 2018

With Octoparse, you can scrape data such as top news, hot topics, and worldwide trends from a variety of social media websites. In this tutorial, we will show you how to extract data from Twitter. Any data visible on the web page can be scraped without coding. If you are interested in scraping data from social media websites like Twitter, this tutorial can help you get started.

After running your task, export the resulting data in one of the provided formats, such as Excel, CSV, or JSON, or to your database.

To illustrate, we will scrape news information from Twitter as an example: https://twitter.com/search?q=news&src=typd&lang=en 
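
The example URL is an ordinary search URL with three query parameters (`q`, `src`, and `lang`). As a minimal sketch, assuming that changing `q` is enough to search for a different keyword (the tutorial itself only shows `q=news`), the URL could be assembled in Python as follows. The helper name `build_search_url` is hypothetical and not part of Octoparse.

```python
from urllib.parse import urlencode

def build_search_url(keyword, lang="en"):
    """Build a search URL shaped like the one used in this tutorial."""
    # Parameter names and the "typd" value are copied from the tutorial's URL.
    # Assumption: only the value of "q" needs to change for another keyword.
    params = {"q": keyword, "src": "typd", "lang": lang}
    return "https://twitter.com/search?" + urlencode(params)

print(build_search_url("news"))
# prints https://twitter.com/search?q=news&src=typd&lang=en
```

The resulting string is what you would paste into Octoparse as the target URL.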

 

By scraping data from Twitter, you can:

      · Know more about the newest trends worldwide 

      · Find out your potential customers for business

      · Analyze the marketing value of hot topics

 

The main steps in this tutorial are listed below. [Download example task file]

1) "Go To Web Page" - to open the target website

2) Scroll down - to load more data from the listing page

3) Create a "Loop Item" - to extract each tweet in a loop

4) Use Regular Expression - to clean and reformat data if needed (optional)

5) Start data extraction - to run your task and export data

 

 

 

 

 

1) "Go To Web Page" - to open the target website

      · Paste the target URL into the "Extraction URL" box and save.

 

 

Tips!

      · Please note that this page is Twitter's news search page, which can be viewed without logging in. If you want to extract data behind a login, check out the corresponding tutorial.

 

 

 

 

 

2) Scroll down - to load more data from the listing page

      · Select the "Scroll Down" option under "Advanced Options".

      · Set the "Scroll times" and "Interval" values you need.

      · Select "Scroll down for one screen" as the "Scroll way" and click the "OK" button.

 

Tips!

      · Most social media websites load more data as you scroll down (infinite scrolling). Learn more about dealing with infinite scrolling.

      · We suggest setting a relatively high value for "Scroll times" if you need more data.
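
Conceptually, the "Scroll times" and "Interval" settings describe a simple loop: scroll, wait, repeat. The sketch below illustrates that idea only; it is not how Octoparse is implemented, and `scroll_page` and `scroll_once` are hypothetical names. A real script would pass a browser action as `scroll_once`.

```python
import time

def scroll_page(scroll_once, scroll_times, interval_seconds):
    """Call scroll_once() scroll_times times, waiting between scrolls."""
    for _ in range(scroll_times):
        scroll_once()                 # e.g. scroll down by one screen
        time.sleep(interval_seconds)  # let newly loaded tweets appear

# Stand-in for a real browser action, so the sketch runs on its own.
scrolls = []
scroll_page(lambda: scrolls.append("one screen"), scroll_times=3, interval_seconds=0)
print(len(scrolls))  # prints 3
```

A higher `scroll_times` loads more items, and a longer interval gives slow pages more time to load each batch.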

 

 

 

3) Create a "Loop Item" - to extract each tweet in a loop

      · Click the data you want on the web page; the selected area will be highlighted in green.

      · Click "Select all" and select "Extract text from the selected elements" in the "Action Tips" panel.

      · Rename the fields in the "Field name" column if necessary.

 

 

 

 

4) Use Regular Expression - to clean and reformat data if needed

Regular Expression reformats data after it has been extracted in Octoparse. For example, if you want to delete words like "Reply", "Retweet", and "Like" in this case, you can use Regular Expression to trim the strings and keep only the numeric value. If the result already meets your needs, you can skip this step.

      · Select the "Reply" row, click the "Customize data field" icon, select the "Refine extracted data" option, and click the "Add step" button.

      · Click "Replace" and paste "Reply ***" into the "Replace" box, copying the word together with all the spaces that follow it in the extracted data (for example, from "Reply      856").

      · Click the "OK" button.

 

Tips!

      · The value you enter into the "Replace" box must be copied with all of its original spaces. In this step, *** stands for those spaces.

      · You can reformat the values in the "Retweet" and "Like" rows in the same way if needed.

        Read more about 8 data re-format options  
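
If you prefer to clean the values after export instead, the same result can be sketched in Python. Note that this uses a different technique from the "Replace" step above: it searches for the digits rather than deleting the label. The function name `extract_count` is hypothetical, and the handling of thousands separators such as "1,204" is an assumption about how counts may appear; abbreviated counts are not handled.

```python
import re

def extract_count(raw):
    """Return the number in a string such as 'Reply      856', or None."""
    # Match a run of digits, allowing commas as thousands separators (assumed).
    match = re.search(r"\d[\d,]*", raw)
    if match is None:
        return None  # no number present, e.g. a tweet with no replies
    return int(match.group().replace(",", ""))

print(extract_count("Reply      856"))  # prints 856
```

Because the pattern ignores the label and the spaces, the same function works for the "Retweet" and "Like" values.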

 

 

 

 

5) Start data extraction - to run your task and export data

      · Select "Start Extraction" and then "Local Extraction".

      · Select "Export" to export the extracted data.
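
Once the data is exported as CSV, it can be read back with Python's standard `csv` module. This is a minimal sketch under stated assumptions: the column names "Tweet" and "Reply" are hypothetical (yours depend on the field names you chose), and an in-memory string stands in for the exported file so that the example runs on its own.

```python
import csv
import io

# Stand-in for an exported CSV file. The column names are hypothetical and
# the "Reply" value is the example string from step 4.
exported = io.StringIO('Tweet,Reply\n"Example tweet text","Reply      856"\n')

# DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(exported))
for row in rows:
    print(row["Tweet"], "|", row["Reply"])
```

For a real export, replace the `io.StringIO(...)` object with `open(path, newline="", encoding="utf-8")`, where `path` points to your exported file.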

 

 

Related Articles:

Extract behind a login 

Load with infinite scrolling 

Re-format data extracted 

 

