Scrape Websites that Require Login and Load Web Content with AJAX
Monday, April 25, 2016 4:08 AM
Some websites require users to log in with an account and password to show content, which means our target data is behind authentication. Fortunately, it is still possible to access the data with Octoparse. We can either tell Octoparse to input the login information (username and password) for us or log in ourselves in the browse mode and use cookies to optimize the workflow.
Let's say we need to scrape data from Facebook.
1) Enter login information to sign in
- Create a new task in Octoparse with a Facebook-related URL
- Click on the username textbox (Email or phone number)
- Select "Enter text" from the Tips panel
- Type your username and click "Confirm". The username you entered is then filled into the username textbox on the web page
- Repeat the above steps for the "Password" textbox
- Click the "Log In" button and select "Click button" in the Tips panel
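For readers who think in code, the steps above boil down to filling in two form fields and submitting them. Octoparse does this inside its own browser view; the sketch below only illustrates the idea as a plain form POST. The login URL and the field names `email` and `pass` are hypothetical, real sites name these differently and often require extra hidden tokens, and the request is only built here, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values for illustration only; not taken from any real site.
LOGIN_URL = "https://example.com/login"


def build_login_request(username, password, url=LOGIN_URL):
    """Build (but do not send) a form POST carrying the login fields."""
    form = {"email": username, "pass": password}  # assumed field names
    body = urlencode(form).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_login_request("user@example.com", "s3cret")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

The form body is URL-encoded, so special characters such as `@` in the username are escaped before they would be submitted.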
2) Use cookies to optimize the workflow
Most of the time, you can optimize the workflow by saving the cookies in the task after login. This way, Octoparse will send the saved cookies to the website during loading, and there's a good chance the website will remember "you" and skip the login steps.
- Toggle on the browse mode on the top right
- Log in to the website just as you would in a regular browser.
- After login, go to the "Options" settings of the "Go to web page" action, tick "Use Cookie" and click "Use cookie from the current page".
- Click "Apply" to save the settings
- The website should now "remember" the login and skip the login steps the next time the crawler runs.
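What "sending the saved cookies" means can be sketched in a few lines of Python. This is not how Octoparse stores cookies internally (the source does not say); it only shows the general mechanism, where the name/value pairs a site set after login are sent back in a `Cookie` header on the next visit. The cookie names and values are made up.

```python
from http.cookies import SimpleCookie


def cookie_header(set_cookie_lines):
    """Turn Set-Cookie header values into one Cookie request header value."""
    jar = SimpleCookie()
    for line in set_cookie_lines:
        jar.load(line)  # parses "name=value; Path=/" and keeps the pair
    # Only the name=value pairs are sent back; attributes such as Path are not.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


# Hypothetical cookies a site might set after a successful login.
saved = ["session_id=abc123; Path=/", "lang=en; Path=/"]
print(cookie_header(saved))
```

If the site still accepts the session the cookie refers to, it can treat the visitor as logged in without showing the login form again.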
Problem solved! Now you can move on to build your task workflow.
Tips!
1. A saved cookie only works until it expires
Cookies come in many different forms. Some have a specific expiration time; others expire as soon as the browser is closed. In Octoparse, a saved cookie stops working once it expires. To resolve this, go through the login steps again in browse mode to obtain and save an updated cookie.
2. Your password is well-protected
3. Enter captchas manually when running local extraction
If you encounter a captcha, you can enter it manually when running the task locally. Cloud Extraction doesn't support captchas.
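The expiry rule from the first tip can be written as a one-line check. This is a minimal sketch for cookies that carry an explicit expiration time; the function name and the example dates are illustrative, and cookies without an expiration time are not covered.

```python
from datetime import datetime, timezone


def is_expired(expires_at, now=None):
    """Return True once the cookie's expiration time has been reached."""
    now = now or datetime.now(timezone.utc)
    return now >= expires_at


# Hypothetical expiration time read from a saved cookie.
expires_at = datetime(2016, 5, 25, 4, 8, tzinfo=timezone.utc)
check_time = datetime(2016, 6, 1, tzinfo=timezone.utc)

if is_expired(expires_at, now=check_time):
    print("Cookie expired: log in again in browse mode and save a new cookie")
```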
Octoparse can easily deal with pages with AJAX. In this article, I will show you how to handle AJAX in Octoparse.
1. What Is AJAX?
AJAX stands for "Asynchronous JavaScript and XML". It allows a web page to update information without reloading the entire page, and to request and receive data after the page has loaded.
When AJAX is used, only part of the page gets updated when you hit buttons like the "next page" button, or "show more" on the web page.
2. How do I know if a web page loads content using AJAX?
When you have a click action to load web data, it is rather straightforward to tell if AJAX is being used.
When AJAX is used, the web page loads the additional content without reloading the page.
Hence, the browser's reloading icon is a good indicator of whether AJAX is being used.
- When AJAX is involved, the page does not reload when additional content is loaded, so there should be NO reloading sign
- If no AJAX is involved, you should see the page reload, with the reloading icon running, when you click to load more information
3. How to handle AJAX in Octoparse?
Octoparse uses a page reload as the signal that a click action has finished. If the page reloads after an element is clicked, Octoparse executes the next action once the reload finishes.
But as pages with AJAX do not reload, Octoparse doesn't receive the signal to act and would get stuck.
So we need to set up an AJAX timeout for the "Click Item" or "Click to Paginate" action to tell Octoparse to go to the next action when the timeout is reached. There are two ways to handle AJAX in Octoparse.
(1) AJAX auto-detection
Octoparse sets up an AJAX timeout automatically when it detects AJAX on the page.
For example, Walmart's website uses AJAX to load the next page. So when we choose to click the next page button, Octoparse automatically sets up an AJAX timeout for the action.
If you need a longer or shorter timeout, simply click on the dropdown menu and choose the one you'd like.
(2) Set up AJAX manually
When a task is built manually or if Octoparse fails to detect AJAX, it is also possible to set it up manually by clicking the "Click Item" action or the "Click to Paginate" action.
Find the AJAX settings under "Options", tick "Load with AJAX", and select the timeout you want.
Tips! The AJAX timeout should be long enough for the page to load the information we need.
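The wait-or-timeout behavior described in this section can be modeled in a few lines. This is a simplified sketch, not Octoparse's actual implementation: the function name is made up, and a `threading.Event` stands in for the "page reload finished" signal.

```python
import threading


def wait_after_click(page_reloaded, ajax_timeout=None):
    """Block until the page reloads or the AJAX timeout passes.

    page_reloaded: threading.Event that is set when a full reload finishes.
    ajax_timeout: seconds to wait, or None to wait for a reload only.
    Returns "reloaded" or "timeout".
    """
    finished = page_reloaded.wait(timeout=ajax_timeout)
    return "reloaded" if finished else "timeout"


# AJAX page: no reload ever happens, so only the timeout lets us continue.
ajax_page = threading.Event()
print(wait_after_click(ajax_page, ajax_timeout=0.1))

# Regular page: the reload signal arrives shortly after the click.
regular_page = threading.Event()
threading.Timer(0.05, regular_page.set).start()
print(wait_after_click(regular_page, ajax_timeout=2))
```

In this model, calling `wait_after_click` on an AJAX page with `ajax_timeout=None` would block forever, which corresponds to the task getting stuck when no timeout is set.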
4. Consider using AJAX timeout for web pages without AJAX
Even for pages that do not use AJAX, an AJAX timeout can still be used to shorten long waits.
For example, if a page keeps loading long after the information you need has appeared, you may want to use an AJAX timeout to "force" Octoparse to move on to the next step instead of waiting until the page finishes loading.
Now you are good to go. If you have further issues with the task or have a suggestion that would make this a better resource for you, we’d love to hear about it. Submit a request to our new help center.
Happy Data Hunting!
Author: The Octoparse Team