Facebook Scraping Case Study | Scraping Facebook Groups
Friday, April 14, 2017 2:42 AM
Welcome to Octoparse web scraping case tutorial!
We will be offering a series of tutorials on Facebook scraping. In these tutorials, we will concentrate on helping users capture various types of information from Facebook and show you how to troubleshoot difficult situations along the way.
In this case tutorial, we will walk through the detailed steps to crawl groups from Facebook.
Some features covered in this case tutorial include:
- Scrolling down
- Building a list
- Expanding the selected area
- Extracting data
Now, let's get started!
Step 1. Set up basic information
- Click "Quick Start"
- Create a new task in the Advanced Mode
- Complete the basic information
Step 2. Navigate to the target URL
After completing the basic information, enter the target URL and click "Go" to open the main page to scrape from.
- Enter the target URL in the built-in browser (URL of the example: https://www.facebook.com/groups/?ref=bookmarks)
- Click the "Go" icon to open the webpage
Step 3. Create a list of items
Move your cursor over the group sections with a similar layout, where you will extract the group details.
- First, click anywhere on the first group section
Note: If the selection area is not identified properly at first, you will need to expand it - keep clicking the expansion button as below until the desired section is outlined.
- Then, click “Create a list of items” (sections with similar layout)
- Click “Add current item to the list”
Now that the first item has been added to the list, we need to finish adding all the items.
- Click “Continue to edit the list”
- Click a second section with similar layout
- Click “Add current item to the list” again
All of the similar sections should now have been added automatically. At this point, select "finish creating the list", then click "loop".
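Conceptually, "creating a list of items" means finding every element on the page that shares the same repeating layout. As a rough illustration (the markup below is invented for this example and is not Facebook's real HTML), a script might collect the repeated sections like this:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the repeated group sections
html = """
<div>
  <div class="group"><span class="name">The Garden of Thoughts</span></div>
  <div class="group"><span class="name">Web Scraping Tips</span></div>
  <div class="group"><span class="name">Octoparse Users</span></div>
</div>
"""

root = ET.fromstring(html)
# "Create a list of items": select every element with the same repeating layout
sections = root.findall(".//div[@class='group']")
print(len(sections))  # 3 similar sections found
```

Octoparse does this selection visually, but the underlying idea is the same: once one section is identified, all siblings with the same structure are added to the list automatically.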
Step 4. Scroll down to load complete content
As with other multi-page websites, we still need to set up pagination to move through multiple pages for scraping. However, instead of a "next page" button, Facebook uses a pagination technique called "infinite scrolling": additional page content is loaded dynamically as the user approaches the bottom of the page.
This may seem quite different from the usual pagination, but luckily Octoparse can easily accommodate infinite scrolling by loading the page completely before the extraction step.
- Navigate to the "Go to Web Page" action
- Go to "Advanced Options"
- Fill in "Scroll times", "Time Interval" and the "Scroll way"
- Click "Save" then "Next"
Here, I set the scroll times to 5, the time interval to 2 seconds, and select "Scrolling to bottom of the page" as the scroll way.
I chose 5 scrolls with a 2-second interval because I tested manually: scrolling down 5 times, with 2 seconds between scrolls, loads exactly the data I wanted to capture.
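The scroll settings above amount to a simple timed loop: scroll, wait for new content to load, repeat. A minimal sketch of that logic, where `scroll_to_bottom` is a placeholder for whatever actually moves the page (e.g. a browser-automation call):

```python
import time

def scroll_page(scroll_to_bottom, times=5, interval=2.0):
    """Repeat a scroll action a fixed number of times, pausing between
    scrolls so dynamically loaded content has time to appear."""
    for _ in range(times):
        scroll_to_bottom()    # placeholder for the real browser scroll call
        time.sleep(interval)  # the "Time Interval" between scrolls

# Demo with a stand-in scroll action that just counts its calls
calls = []
scroll_page(lambda: calls.append(1), times=5, interval=0)
print(len(calls))  # 5
```

The key design point, mirrored in Octoparse's settings, is the pause between scrolls: without it, the page may not finish loading new content before the next scroll fires.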
Step 5. Select the data to be extracted
By selecting "Loop" in Step 3, Octoparse will automatically select the first item of the list and open it. Note that the extraction actions we set up for this page will apply to the rest of the list. So, let's look through the page to spot the data we want to capture.
- Click the data field Group Name, such as "The Garden of Thoughts"
- Select "Extract text"
- Follow the same steps to extract data field "Members"
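The loop-and-extract pattern configured above can be sketched as code: for each section in the list, pull out the text of the "Group Name" and "Members" fields. The markup and values below are invented for illustration, not Facebook's real HTML:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for the repeated group sections
html = """
<div>
  <div class="group"><span class="name">The Garden of Thoughts</span><span class="members">1,536 members</span></div>
  <div class="group"><span class="name">Daily Gardening</span><span class="members">987 members</span></div>
</div>
"""

rows = []
# The "Loop": visit each item in the list and run the same extraction
for section in ET.fromstring(html).findall(".//div[@class='group']"):
    rows.append({
        "Group Name": section.find("span[@class='name']").text,   # "Extract text"
        "Members":    section.find("span[@class='members']").text,
    })
print(rows[0]["Group Name"])  # The Garden of Thoughts
```

Because the extraction runs inside the loop, defining it once for the first item applies it to every item in the list, which is exactly what the "Loop" action does in Octoparse.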
Step 6. Rename data fields
- Rename any field names if necessary
- Click "Save"
Step 7. Start running your task
Now that we are done configuring the task, it's time to run it and get the data we want.
- Select "Local Extraction"
- Click "OK" to start
There are two ways to run a task: Local Extraction and Cloud Extraction (premium plan). With Local Extraction, the task runs on your own machine. With Cloud Extraction, the task runs on the Octoparse Cloud platform; you can set it up, turn off your desktop or laptop, and the data will be automatically extracted and saved to the cloud. Features such as scheduled extraction, IP rotation, and API access are also supported in the Cloud. Find out more about Octoparse Cloud here.
Step 8. Check the data and export
- Check the data extracted
- Click the "Export" button to export the results to an Excel file, a database, or another format and save the file to your computer
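Exporting the extracted rows is, in principle, just writing them out in a tabular format. A sketch using Python's standard csv module (the rows here are made-up sample data; swap the in-memory buffer for `open("groups.csv", "w", newline="")` to write a real file):

```python
import csv
import io

# Made-up sample rows standing in for the extracted data
rows = [
    {"Group Name": "The Garden of Thoughts", "Members": "1,536 members"},
    {"Group Name": "Daily Gardening", "Members": "987 members"},
]

buf = io.StringIO()  # in-memory stand-in for a file on disk
writer = csv.DictWriter(buf, fieldnames=["Group Name", "Members"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue().splitlines()[0])  # Group Name,Members
```

A CSV file like this opens directly in Excel, which is why CSV is a common lowest-common-denominator export format alongside native Excel files and database writes.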
Good job completing this case tutorial. You can download and run this example on your own.
Now check out similar case studies:
- Web Scraping Case Study | Security System News
- How to Scrape WordPress Posts
- Web Scraping - Scraping Facebook That Required Login with Octoparse
Or learn more about related topics:
- Get Started with Octoparse in 2 Minutes
- Automatically Cycle Text List
- Use Regular Expressions in Octoparse