Scrape ASPX Pages
Friday, October 07, 2016 9:23 PM
ASPX is the file extension of pages built with ASP.NET, Microsoft's open-source server-side framework for building dynamic web sites, web applications and web services. Scraping an ASPX page usually involves loading the page, locating the items you want, and handling pagination.
With many web scraping tools this can be a little complicated, but that is not the case with Octoparse.
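For context on why ASPX pages can be awkward for generic tools: ASP.NET Web Forms pages typically keep their state in hidden form fields such as __VIEWSTATE and __EVENTVALIDATION, and send them back to the server on every search or page change. The sketch below only shows how those hidden fields could be collected from a page with Python's standard library. The markup, the field values and the txtSearch name are made up for illustration and are not taken from the example site.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML string."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


# Hypothetical, heavily shortened markup; real pages carry much longer values.
SAMPLE_PAGE = """
<form method="post" action="./Legislation.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5" />
  <input type="hidden" name="__EVENTVALIDATION" value="wEWAgKc" />
  <input type="text" name="txtSearch" value="" />
</form>
"""

state = hidden_fields(SAMPLE_PAGE)
print(sorted(state))
```

A hand-written scraper would have to send these values back with each request, which is the bookkeeping a browser-based tool takes care of for you.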
In this tutorial, I will use the New York City Council website as an example to show you how to extract data from ASPX pages.
(Download my extraction task of this tutorial HERE just in case you need it.)
Step 1. Choose "Advanced Mode". ➜ Click "Start" ➜Complete basic information. ➜Click "Next".
Step 2. Enter the target URL of ASPX page in the built-in browser. ➜ Click "Go" icon to open the webpage.
( URL of the example: http://legistar.council.nyc.gov/Legislation.aspx )
Step 3. Now you can search for what you want. Let’s take "New York" as an example.
Click the search box. ➜ Click "Enter text value", then type "New York" in the text box. ➜ Click "Save".
Click "Search Legislation". ➜ Click "Click an item". The search results will then load.
Step 4. Now I will show you how to extract data. Before extracting anything, you need to add a page navigation action so that Octoparse can move through the result pages.
Drop a "Loop" item into the Workflow designer. Choose a "Loop Mode" under "Advanced Options". ➜ Select the "Single Element" option.
Since the pagination of this website doesn’t have a "Next" button, you need to write an XPath for the pagination link yourself. Make sure the XPath locates the right pagination link. ➜ Paste the XPath into the text box. ➜ Click "Save".
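To make the idea behind the pagination XPath more concrete, here is a minimal sketch of "pick the link that comes right after the current page number". The pager markup below is hypothetical (the class names and the __doPostBack arguments are assumptions, not copied from the example site), so the real XPath has to be worked out from the page itself. With such markup, a full XPath 1.0 engine could use //a[@class='current']/following-sibling::a[1]. Python's built-in ElementTree does not support that axis, so the sketch walks the links by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical pager without a "Next" button: the current page is marked
# with class="current", the other pages are plain postback links.
PAGER = """
<div class="pager">
  <a class="current" href="#">1</a>
  <a href="javascript:__doPostBack('grid','Page$2')">2</a>
  <a href="javascript:__doPostBack('grid','Page$3')">3</a>
</div>
"""


def next_page_link(pager):
    """Return the <a> element right after the current page, or None."""
    links = pager.findall("a")
    for index, link in enumerate(links):
        if link.get("class") == "current":
            if index + 1 < len(links):
                return links[index + 1]
            return None  # already on the last page
    return None  # no current-page marker found


pager = ET.fromstring(PAGER.strip())
link = next_page_link(pager)
print(link.text, link.get("href"))
```

Because the expression is relative to the current page marker, it keeps pointing at the following page each time the loop runs, and finds nothing once the last page is reached.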
Step 5. Drop a "Click Item" into the "Loop item" ➜ Choose "Click Loop items" under "Advanced Option" ➜ Click "Save". Now you’ve configured pagination crawling.
Step 6. Now you can extract the information from the lists on each page. Move your cursor over the section of similarly laid-out items that you want to extract.
Click the first highlighted link ➜ Click "Expand the selection area" until the whole block you want to scrape is inside the same pink frame. ➜ Click "Create a list of items". ➜ "Add current item to the list". The first item has now been added to the list. ➜ Click "Continue to edit the list".
Click the second highlighted link ➜ Click "Expand the selection area" until the whole block you want to scrape is inside the same pink frame. ➜ Click "Add current item to the list" again. Now we get all the links with a similar layout. ➜ Then click "Finish Creating List" ➜ Click "loop" to process the list and extract the elements on each page.
Step 7. Now you can extract the data. Start with the title of the first section. ➜ Click the title. ➜ Select "Extract text". Other content can be extracted in the same way.
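Steps 6 and 7 amount to "loop over the similar items, then pull the same fields out of each one". A rough sketch of that logic is shown below. The table markup, the column order, the field names (file, type, title, link) and the placeholder values are all assumptions made for illustration and do not reflect the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical result list with placeholder values.
RESULTS = """
<table>
  <tr><td><a href="Detail.aspx?ID=1">File A</a></td><td>Type A</td><td>Title A</td></tr>
  <tr><td><a href="Detail.aspx?ID=2">File B</a></td><td>Type B</td><td>Title B</td></tr>
</table>
"""


def cell_text(cell):
    """Return all text inside a cell, including text of nested tags."""
    return "".join(cell.itertext()).strip()


def extract_rows(table):
    """Turn every <tr> into a dict with one entry per extracted field."""
    rows = []
    for tr in table.findall("tr"):      # the "loop" over similar items
        cells = tr.findall("td")
        link = cells[0].find("a")
        rows.append({
            "file": cell_text(cells[0]),  # equivalent of "Extract text"
            "type": cell_text(cells[1]),
            "title": cell_text(cells[2]),
            "link": link.get("href") if link is not None else "",
        })
    return rows


rows = extract_rows(ET.fromstring(RESULTS.strip()))
for row in rows:
    print(row)
```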
Step 8. All the selected content will be listed under Data Fields. ➜ Click a "Field Name" to rename it.
Step 9. Click “Next” ➜ Click “Next” ➜ Click “Local Extraction” ➜ “OK” to run the task on your computer. Octoparse will automatically extract all the data selected.
Step 10. The extracted data will be shown in the “Data Extracted” pane. Click the “Export” button to export the results to an Excel file, a database or another format, and save the file to your computer.
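For readers who prefer to post-process the results in code, the export step can be approximated with Python's csv module, which produces a file Excel can open. This is only a sketch of the idea, not how Octoparse exports data internally. The rows_to_csv helper, the field names and the placeholder rows are assumptions for illustration.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")  # keep the csv module's own line endings
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder rows standing in for the extracted data.
sample_rows = [
    {"file": "File A", "title": "Title A, with a comma"},
    {"file": "File B", "title": "Title B"},
]

csv_text = rows_to_csv(sample_rows, ["file", "title"])
print(csv_text)
```

To save the result to disk, the same text can be written to a file opened with newline="" so that the line endings are not translated a second time.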
Author: The Octoparse Team