Step-by-step tutorials for you to get started with web scraping
Scrape tweets from Twitter
Thursday, August 16, 2018
With Octoparse, you can easily scrape data such as top news, hot topics and worldwide trends from a variety of social media websites. In this tutorial, we will show you how to extract data from Twitter. Any data visible on the web page can be scraped without coding. If you are interested in scraping data from social media websites like Twitter, this tutorial can help you get started.
After running your task, export the resulting data in one of the provided formats, such as Excel, CSV or JSON, or write it to your database.
To illustrate, we will scrape news information from Twitter as an example: https://twitter.com/search?q=news&src=typd&lang=en
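The example URL above is a base address followed by query parameters, so the same pattern can be reused for other search terms. Below is a minimal Python sketch of that idea. The helper name `build_search_url` is hypothetical, and the parameter names (`q`, `src`, `lang`) are simply copied from the example URL; whether Twitter still accepts them is an assumption.

```python
from urllib.parse import urlencode

def build_search_url(query, lang="en"):
    """Rebuild a search URL shaped like the example in this tutorial.

    The parameter names are copied from the example URL and are
    assumed, not confirmed, to remain valid.
    """
    params = {"q": query, "src": "typd", "lang": lang}
    # urlencode escapes spaces and special characters in the query.
    return "https://twitter.com/search?" + urlencode(params)

print(build_search_url("news"))
```

The printed URL can then be pasted into the "Extraction URL" box in step 1.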
By scraping data from Twitter, you can:
· Learn more about the newest worldwide trends
· Find potential customers for your business
· Analyze the marketing value of hot topics
Let's go through the main steps of this task.
1) "Go To Web Page" - to open the target website
· Paste the target URL into the "Extraction URL" box and save.
· Note that this URL is the Twitter news search page, which can be viewed without logging in. If you want to extract data behind a login, check out the corresponding tutorial.
2) Scroll down - to load more data from the list page
· Select the "Scroll Down" option under "Advanced Options".
· Set the "Scroll times" and "Interval" values you need.
· Select "Scroll down for one screen" as the "Scroll way" and click the "OK" button.
· Most social media websites load more data as you scroll down. See the tutorial on dealing with infinite scrolling for details.
· We suggest setting a relatively high value for "Scroll times" if you need more data.
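To make the roles of "Scroll times" and "Interval" concrete, here is a small Python sketch of a scroll-and-collect loop. It is an illustration of the general idea only, not Octoparse's implementation: `scroll_and_collect` and the `load_screen` callback are hypothetical names, and `fake_screen` stands in for a real page.

```python
import time

def scroll_and_collect(load_screen, scroll_times, interval=2.0):
    """Call load_screen(i) once per scroll and gather unique items in order.

    load_screen is a hypothetical callback that returns the items
    visible after the i-th scroll; interval is the wait in seconds.
    """
    seen = set()
    items = []
    for i in range(scroll_times):
        for item in load_screen(i):
            if item not in seen:  # consecutive screens usually overlap
                seen.add(item)
                items.append(item)
        time.sleep(interval)  # give the page time to load new content
    return items

# Stand-in for a real page: each screen overlaps the next by one item.
def fake_screen(i):
    return [f"tweet {i}", f"tweet {i + 1}"]

print(scroll_and_collect(fake_screen, scroll_times=3, interval=0))
```

More scrolls yield more items, and a longer interval gives slow pages time to load, which is the trade-off the two settings control.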
3) Create a "Loop Item" - to extract each tweet in a loop
· Click the data you want on the web page; the selected area will be highlighted in green.
· Click "Select all" and select "Extract text from the selected elements" in the "Action Tips" panel.
· Rename the "Field name" column if necessary.
4) Use Regular Expression - to clean and reformat data if needed
Regular Expression lets you reformat data after it has been extracted in Octoparse. For example, to delete words like "Reply", "Retweet" and "Like" in this case, you can trim those strings so that only the numeric value remains. If the result already satisfies your needs, you can skip this step.
· Select the "Reply" row, click the "Customize data field" icon, select the "Refine extracted data" option and click the "Add step" button.
· Click "Replace", then copy "Reply ***", including all of its whitespace, from the extracted value "Reply 856" and paste it into the "Replace" box.
· Click the "OK" button.
· The value you enter into the "Replace" box must be copied with all of its original whitespace. In this step, *** stands for the whitespace characters.
· You can also reformat the values in the "Retweet" and "Like" rows in the same way if needed.
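The same cleanup can be done outside Octoparse on exported data. The sketch below uses Python's `re` module to strip a label such as "Reply" and keep only the number. The helper name `extract_count` is hypothetical, and the handling of thousands separators is an assumption about how counts might appear.

```python
import re

def extract_count(raw, label):
    """Return the number that follows a label, or None if there is none.

    Matching any run of whitespace avoids having to copy the exact
    spacing between the label and the number.
    """
    pattern = rf"\s*{re.escape(label)}\s+(\d[\d,]*)"
    match = re.match(pattern, raw)
    if match is None:
        return None
    # Drop thousands separators such as "1,204" before converting.
    return int(match.group(1).replace(",", ""))

print(extract_count("Reply   856", "Reply"))
print(extract_count("Retweet 1,204", "Retweet"))
print(extract_count("Like", "Like"))
```

Because the pattern matches `\s+` rather than a literal run of spaces, it tolerates the varying whitespace that the "Replace" step requires you to copy exactly.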
5) Start data extraction - to run your task and get data
· Select "Start Extraction" and "Local Extraction".
· Select "Export" to get all data you want.
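Once exported, the data can be processed with ordinary tools. As a rough sketch, the Python below reads CSV text and converts it to JSON using only the standard library. The column names and the sample row are invented for illustration; a real export will use whatever field names you set in step 3.

```python
import csv
import io
import json

def csv_to_records(file_obj):
    """Read CSV rows into a list of dicts keyed by the header row."""
    return list(csv.DictReader(file_obj))

# Invented sample standing in for an exported file; with a real export,
# pass open("your_export.csv", newline="", encoding="utf-8") instead.
sample_export = io.StringIO(
    "Tweet,Reply,Retweet,Like\n"
    '"Example headline, with a comma",Reply 856,Retweet 120,Like 3400\n'
)

records = csv_to_records(sample_export)
print(json.dumps(records, indent=2))
```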