Facebook Scraping Case Study | Scraping Group Members Information
Wednesday, May 17, 2017 9:32 AM
Welcome to the Octoparse web scraping case tutorial!
In this series of tutorials for scraping Facebook, we will show you how to scrape Facebook easily with Octoparse and troubleshoot various difficult situations.
This tutorial is about scraping group member information.
List features covered
Now, let's get started!
Step 1. Set up basic information
- Click "Quick Start"
- Create a new task in the Advanced Mode
- Complete the basic information
Step 2. Navigate to the target website
- Enter the target URL in the built-in browser
- Click the "Go" icon to open the webpage
Step 3. Load more content
Unlike websites that paginate with a "Next page" button, Facebook loads more content dynamically on the same page as you scroll down. For the group member list, more members are loaded by clicking the "See More" button, so we set up a loop click on it. Note that all the content we are going to scrape must finish loading before we configure Octoparse to capture the data.
- Go to "See More" button and click it
- When prompted, select "Loop click the element"
Step 4. Set AJAX timeout
As Facebook uses AJAX to load more members’ profiles, we need to set an AJAX timeout for the "Click to Paginate" action.
- Navigate to the "Click to Paginate" action
- Go to “Advanced Option”
- Tick "AJAX Load" checkbox
- Set an AJAX timeout of 3 seconds
- Click "Save"
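For readers who think in code, the combination of "Loop click the element" and an AJAX timeout can be modelled roughly as below. This is only an illustrative sketch, not how Octoparse is implemented: `FakeMemberPage`, `click_see_more`, and `load_all` are hypothetical names, and the fake page loads each batch instantly instead of making a real AJAX request.

```python
import time


class FakeMemberPage:
    """Hypothetical stand-in for a page that loads members in batches."""

    def __init__(self, total, batch):
        self.total = total
        self.batch = batch
        self.loaded = min(batch, total)

    def has_see_more(self):
        # The "See More" button is shown while members remain unloaded.
        return self.loaded < self.total

    def click_see_more(self):
        # A real page would load asynchronously; here the batch appears at once.
        self.loaded = min(self.loaded + self.batch, self.total)


def load_all(page, ajax_timeout=3.0, poll=0.01):
    """Click "See More" until the button is gone or a click adds nothing new."""
    clicks = 0
    while page.has_see_more():
        before = page.loaded
        page.click_see_more()
        clicks += 1
        deadline = time.monotonic() + ajax_timeout
        # Wait up to ajax_timeout seconds for new items to appear.
        while page.loaded == before and time.monotonic() < deadline:
            time.sleep(poll)
        if page.loaded == before:
            break  # timed out with no new content, so stop paginating
    return clicks


page = FakeMemberPage(total=25, batch=10)
print(load_all(page), page.loaded)
```

The timeout matters in the failure case: if a click produces no new members within the allowed time, the loop gives up instead of waiting forever.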
Step 5. Create a list of items
Once pagination is configured, we can go back to the group members. All of the members are displayed in sections with the same layout, so we can build a loop list that contains these sections and then configure Octoparse to extract data from each one.
Move the cursor over the group member profiles that share a similar layout; these are the sections we will extract information from.
- Click anywhere on the first member on the web page (make sure the highlighted section covers all the information of the member)
- Then, click “Create a list of items”
- Click “Add current item to the list”
Note: If the selection area was not identified properly at first, expand it by clicking the expansion button repeatedly until the desired section is outlined.
Now that the first item has been added to the list, we need to add the remaining items.
- Click "Continue to edit the list"
- Click the second section with the same layout
- Click "Add current item to the list" again
Continue with the third item, and repeat until the remaining items with the same layout are added to the list automatically.
- Click "Finish Creating List"
- Click "loop"; this action tells Octoparse to click on each section in the list to extract the selected data
Step 6. Adjust relative execution sequence
In Step 5, the "Loop Item" action we created was automatically placed inside the "Cycle Pages" action. This is not what we want, because the data should be scraped only after all the needed content has finished loading.
So, drag the "Loop Item" action out of the "Cycle Pages" action.
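One plausible way to see why the order matters is the small Python simulation below. It assumes (this is an assumption for illustration, not something stated by Octoparse) that every "See More" click appends a batch to the page while earlier members stay visible, so extracting inside the paging loop would capture the same members again on each pass. The function names and sample data are hypothetical.

```python
def extract_inside_loop(batches):
    """Extract after every click: already-loaded members are captured again."""
    loaded, rows = [], []
    for batch in batches:
        loaded.extend(batch)   # one "See More" click loads one batch
        rows.extend(loaded)    # extraction sees everything loaded so far
    return rows


def extract_after_loop(batches):
    """Finish all the clicks first, then extract once."""
    loaded = []
    for batch in batches:
        loaded.extend(batch)
    return list(loaded)


batches = [["Ann", "Ben"], ["Cal", "Dee"], ["Eve"]]
print(len(extract_inside_loop(batches)))  # 11 rows, with duplicates
print(len(extract_after_loop(batches)))   # 5 rows, one per member
```

Under this assumption, moving the extraction loop after the paging loop yields one row per member.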
Step 7. Modify XPath to locate all members
The loop list we created in Step 5 includes only some of the group member items, because Octoparse did not detect an XPath that locates all of them.
However, we want every group member added to our scraping list, so we need to modify the XPath to locate more group members.
- First, open the target URL in Firefox and inspect the group member sections using Firepath
- Modify the XPath to include all of these group members
- Go back to Octoparse and navigate to the "Loop Item" action
- Copy the modified XPath .//div[@data-name='GroupProfileGridItem']/div from Firepath, and paste it in the "Variable list" text box.
- Click "Save"
Now the newly loaded member profile sections are also added to the loop list, no matter how many times you click the "See More" button to load more.
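To illustrate why an attribute-based XPath keeps matching as more members load, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`, which supports the subset of XPath used here. The markup produced by `build_page` is a hypothetical, heavily simplified stand-in; Facebook's real page structure is different and changes over time.

```python
import xml.etree.ElementTree as ET

# The XPath from this step, unchanged.
XPATH = ".//div[@data-name='GroupProfileGridItem']/div"


def build_page(names):
    """Build hypothetical, simplified member-grid markup for the given names."""
    root = ET.Element("div", {"id": "members"})
    for name in names:
        item = ET.SubElement(root, "div", {"data-name": "GroupProfileGridItem"})
        inner = ET.SubElement(item, "div")
        ET.SubElement(inner, "a").text = name
    # An unrelated block that the XPath should not match.
    other = ET.SubElement(root, "div", {"data-name": "Sidebar"})
    ET.SubElement(other, "div").text = "not a member"
    return root


def count_members(root):
    """Count the elements selected by the loop-list XPath."""
    return len(root.findall(XPATH))


print(count_members(build_page(["Ann", "Ben"])))
print(count_members(build_page(["Ann", "Ben", "Cal", "Dee"])))
```

Because the expression selects by attribute value instead of by position, it matches however many member items are on the page at the time it is evaluated.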
Step 8. Select the data to be extracted and rename data fields
Now that the XPath of the variable list has been modified, all the needed group member sections are in the loop list.
It is time to extract the data we want. Note that the extraction settings we set up for this section will also apply to the rest of the sections in the list.
- Navigate to the "Extract Data" action and click it.
- Click the data field of the member names
- Select "Extract text"
- Follow the same steps to extract the other data
- Rename the field names if necessary
- Click "Save"
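The idea that one extraction setup applies to every section in the loop can be sketched in Python as follows. This is an illustration only: the sample markup, the field names `member_name` and `member_info`, and the relative paths are all hypothetical, and Octoparse configures this through its interface, not through code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the second member lacks the extra info.
SAMPLE = """
<div id="members">
  <div data-name="GroupProfileGridItem"><div>
    <a>Ann Example</a><span>Joined about a week ago</span>
  </div></div>
  <div data-name="GroupProfileGridItem"><div>
    <a>Ben Example</a>
  </div></div>
</div>
""".strip()

# Field name -> path relative to one loop item (both hypothetical).
FIELDS = {"member_name": "./a", "member_info": "./span"}


def extract_rows(root):
    """Apply the same field rules to every item matched by the loop XPath."""
    rows = []
    for item in root.findall(".//div[@data-name='GroupProfileGridItem']/div"):
        row = {}
        for field, path in FIELDS.items():
            node = item.find(path)
            # Missing elements become empty strings instead of raising errors.
            row[field] = node.text.strip() if node is not None and node.text else ""
        rows.append(row)
    return rows


for row in extract_rows(ET.fromstring(SAMPLE)):
    print(row)
```

Defining each field relative to the current item is what lets a single configuration produce one row per member, with blanks where a member's section lacks that element.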
Step 9. Start running your task
The task is now configured, and it is time to run it to get the data.
- Click "Next"
- Click "Next"
- Select "Local Extraction"
- Click "OK" to run the task on your computer.
Octoparse will automatically extract all the data selected. Check the "Data Extracted" pane for the extraction progress.
There are two extraction modes: Local Extraction and Cloud Extraction (premium plan). With Local Extraction, the task runs on your own machine. With Cloud Extraction, the task runs on the Octoparse Cloud platform, so you can start it and then turn off your desktop or laptop; the data will be extracted automatically and saved to the cloud. Features such as scheduled extraction, IP rotation, and API access are also supported in the Cloud. Find out more about Octoparse Cloud here.
Step 10. Check the data and export
After the extraction process completes, we can either check the extracted data or click the "Export" button to export the results to an Excel file, a database, or another format and save the file to the computer.