Step-by-step tutorials for you to get started with web scraping
The latest version of this tutorial is available here. Check it out now!
With Octoparse, you can easily scrape data from social media websites, top news, hot topics, worldwide trends, and much more. The video above shows how to scrape Yelp data into Excel with Octoparse 8.
In this tutorial, I will show you how to extract data from Twitter using Octoparse 7. As a rule of thumb, any data that is visible on the webpage can be scraped without coding. If you are interested in acquiring data from social media websites like Twitter, check this one out.
Let's start with the main steps in this tutorial. [Download example task file]
1) "Go To Web Page" - to open the target website
· Paste the target URL (https://twitter.com/search?q=news&src=typd&lang=en) into the "Extraction URL" box and save.
· Please note that this URL is the Twitter search page for "news", viewed without logging in. If you want to extract data behind a login, check out the corresponding tutorial.
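For readers curious about what the target URL encodes, here is a minimal Python sketch that splits it into its query parameters and rebuilds it for a different search term. The helper name `build_search_url` is hypothetical, and this tutorial does not document what `src` and `lang` mean, so the sketch simply carries them over unchanged.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# The target URL used in this tutorial
TARGET_URL = "https://twitter.com/search?q=news&src=typd&lang=en"


def build_search_url(term, template=TARGET_URL):
    """Return the template URL with its q parameter replaced by term.

    Hypothetical helper for illustration; all other parameters are kept as-is.
    """
    parts = urlsplit(template)
    params = dict(parse_qsl(parts.query))  # {'q': ..., 'src': ..., 'lang': ...}
    params["q"] = term
    return urlunsplit(parts._replace(query=urlencode(params)))


# Show the parameters of the tutorial URL, then a URL for another term
print(dict(parse_qsl(urlsplit(TARGET_URL).query)))
print(build_search_url("world cup"))
```

The resulting URL can be pasted into the "Extraction URL" box in the same way as the original one.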
2) Scroll down - to get more data from the list page
· Select the "Scroll Down" option under "Advanced Options".
· Set "Scroll times" and "Interval" to the values you need.
· Select "Scroll down for one screen" as the "Scroll way" and click the "OK" button.
· Most social media websites load more data as you scroll down. Learn more about dealing with infinite scrolling.
· We suggest setting a relatively high value for "Scroll times" if you need more data.
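To make the interplay of "Scroll times" and "Interval" concrete, the following Python sketch simulates a page that loads more items each time it is scrolled. It does not show how Octoparse works internally: `FakeFeed`, `scroll_and_collect`, and the batch size of 20 items per screen are all assumptions made for illustration.

```python
import time


class FakeFeed:
    """Hypothetical stand-in for a page that loads more items on each scroll."""

    def __init__(self, total_items, per_screen=20):
        self.total_items = total_items
        self.per_screen = per_screen  # assumed batch size, not a real Twitter value
        self.loaded = min(per_screen, total_items)  # items visible on first load

    def scroll_one_screen(self):
        # Each scroll reveals one more batch, capped at the total available
        self.loaded = min(self.loaded + self.per_screen, self.total_items)


def scroll_and_collect(feed, scroll_times, interval, sleep=time.sleep):
    """Scroll scroll_times times, pausing interval seconds after each scroll."""
    for _ in range(scroll_times):
        feed.scroll_one_screen()
        sleep(interval)  # give newly requested content time to load
    return feed.loaded


def no_wait(seconds):
    """No-op replacement for time.sleep so the demo finishes instantly."""


few = scroll_and_collect(FakeFeed(500), scroll_times=2, interval=2, sleep=no_wait)
many = scroll_and_collect(FakeFeed(500), scroll_times=10, interval=2, sleep=no_wait)
print(few, many)
```

Under these assumed numbers, a higher "Scroll times" value loads more items, at the cost of roughly scroll times multiplied by interval seconds of waiting.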
3) Create a "Loop Item" - to loop through and extract each tweet
· Click the data you want on the web page; the selected area will be highlighted in green.
· Click "Select all" and select "Extract text from the selected elements" in the "Action Tips" panel.
· Rename the "Field name" column if necessary.
4) Use Regular Expression - to clean and reformat data if needed
Regular Expression is used to reformat data after it has been extracted in Octoparse. For example, if you want to delete words like "Reply", "Retweet" and "Like" in this case, you can use Regular Expression to keep only the numeric value by trimming the surrounding text. If the result already satisfies your needs, you can skip this step.
· Select the "Reply" row, click the "Customize data field" icon, select the "Refine extracted data" option and click the "Add step" button.
· Click "Replace" and paste "Reply ***", copied together with all of its original spaces from the extracted data (for example, "Reply 856"), into the "Replace" box.
· Click the "OK" button.
· The value you enter into the "Replace" box must be copied with all of its original spaces. In this step, *** simply stands for those spaces.
· You can reformat the values in the "Retweet" and "Like" rows in the same way if needed.
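The same clean-up can be sketched outside Octoparse with an actual regular expression. The Python snippet below is a minimal illustration, not Octoparse's own implementation: the sample strings are made up on the pattern of "Reply 856", and `extract_count` is a hypothetical helper.

```python
import re


def extract_count(raw):
    """Return the first run of digits in raw as an int, or None if there is none."""
    match = re.search(r"\d+", raw)
    return int(match.group()) if match else None


# Made-up sample values; the number of spaces after the label may vary
samples = ["Reply   856", "Retweet 1204", "Like"]
print([extract_count(s) for s in samples])

# Closer to the "Replace" step: strip the label and the whitespace after it
cleaned = re.sub(r"^(Reply|Retweet|Like)\s*", "", "Reply   856")
print(cleaned)
```

Because `\s*` matches any amount of whitespace, this version does not depend on copying the exact spaces. It only handles plain digits; values with thousands separators or abbreviations would need a different pattern.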
5) Start data extraction - to run your task and get data
· Select "Start Extraction" and then "Local Extraction".
· Select "Export" to get all data you want.