SCRAPING AND CRAWLING TECHNIQUES FOR EXTRACTING ONLINE HOTEL REVIEWS FROM THE TRAVELOKA WEBSITE (AJAX-BASED)
Abstract
The internet is a source of public data spread across many websites. Retrieving data from a
website requires specific techniques because the data on a web page is unstructured. The technique
for retrieving, or extracting, this data is known as scraping. A website also consists of many
interlinked web pages, so a further technique is needed to visit every page from which data will be
taken. The technique for accessing linked web pages is called crawling. Because downstream
processing of the extracted data requires structured data, a scraping and crawling system is needed
that can produce structured data from a website. This final project describes scraping and crawling
techniques for extracting data from a website. The extracted data is hotel review data from the
Traveloka website.
The use of JavaScript and AJAX means that data on a website can be accessed without
refreshing the entire web page, so the data can be displayed more interactively. Crawling websites
that use JavaScript and AJAX requires specific techniques so that the crawling system can interact
with AJAX and the scraping process can retrieve all the data on a web page. The scraping and
crawling techniques are developed by integrating several existing technologies. Scrapy, a scraping
and crawling framework, was chosen as the basis for developing the technique. Selenium and
ChromeDriver are used to interact with the AJAX-based website. Elasticsearch is used to store the
scraped data, which reaches it through the item pipeline.
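
As a rough illustration of the item pipeline step, the sketch below shows how a Scrapy-style pipeline class might forward each scraped review to a search index. It is not the project's actual code. The class name ReviewIndexPipeline, the index name hotel_reviews, and the InMemoryClient stand-in are hypothetical. The sketch assumes only that the injected client exposes an index(index=..., document=...) method, as recent versions of the official Elasticsearch Python client do.

```python
class ReviewIndexPipeline:
    """Hypothetical pipeline: forwards each scraped review to a search index.

    Scrapy item pipelines are plain classes that expose
    process_item(item, spider), so no Scrapy import is needed here.
    """

    def __init__(self, client, index_name="hotel_reviews"):
        # `client` is assumed to offer index(index=..., document=...).
        self.client = client
        self.index_name = index_name

    def process_item(self, item, spider):
        # Convert the item to a plain dict before handing it to the client.
        document = dict(item)
        self.client.index(index=self.index_name, document=document)
        # Return the item so that any later pipeline stage still receives it.
        return item


class InMemoryClient:
    """Stand-in for a real search client, used only to try the sketch."""

    def __init__(self):
        self.stored = []

    def index(self, index, document):
        # Record what would have been sent to the index.
        self.stored.append((index, document))
```

In a real deployment the stand-in would be replaced by an actual Elasticsearch client instance, and the pipeline would be enabled through Scrapy's ITEM_PIPELINES setting.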
The scraping and crawling techniques are developed in several stages. The first stage is to
evaluate the website that will serve as the data source in order to identify the elements that
contain the data. Elements are selected using XPath selectors. XPath is used in the scraping and
crawling processes implemented in the spider within the Scrapy framework. All of these techniques
were developed in the Python programming language. The result is a scraping and crawling system
that extracts hotel review data from the Traveloka website. The system runs stably while collecting
millions of hotel reviews. The review data can also be stored and displayed properly in
Elasticsearch.
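
To make the XPath-based element selection concrete, the following minimal sketch extracts review fields from a small markup fragment. The fragment, its class names, and the function extract_reviews are invented for illustration and do not reflect Traveloka's real page structure. The standard library's ElementTree is used here as a stand-in for Scrapy's selectors. It supports only a limited subset of XPath and requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a rendered review page.
SAMPLE_PAGE = """
<div class="reviews">
  <div class="review">
    <span class="author">Ani</span>
    <p class="text">Clean room and friendly staff.</p>
  </div>
  <div class="review">
    <span class="author">Budi</span>
    <p class="text">Close to the airport.</p>
  </div>
</div>
"""


def extract_reviews(markup):
    """Return one dict per review element found in the markup."""
    root = ET.fromstring(markup.strip())
    reviews = []
    # Select every <div class="review"> at any depth below the root.
    for node in root.findall(".//div[@class='review']"):
        reviews.append({
            # Paths here are relative to the current review element.
            "author": node.findtext("span[@class='author']"),
            "text": node.findtext("p[@class='text']"),
        })
    return reviews
```

Calling extract_reviews(SAMPLE_PAGE) yields a list of dictionaries, which is the kind of structured record that can then be passed on for storage.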