This document shows how to build an RPA service with Selenium running on AWS Lambda (Python 3.7 runtime).
If a REST API is available, see "Web Crawler 만들기 01 - Building RPA service using REST API on AWS" instead.
Most services are delivered over the web. Users reach them through a web browser, which interprets the HTML document rendered by the server and applies CSS to style the page.
An HTML document is made up of elements, and parsing those elements yields the information you need. This document explains how to put that process to use.
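To make the idea of element-level parsing concrete, here is a minimal sketch that uses only the Python standard library. The sample markup, the `HeadlineParser` class, and the `parse_headlines` function are illustrative assumptions and are not part of the original project; a real page would need selectors that match its own structure.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text of every <h3> element and the href of every <a> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.links = []
        self._in_h3 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
            self.titles.append("")
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current title.
        if self._in_h3:
            self.titles[-1] += data


def parse_headlines(html):
    """Return (titles, links) found in the given HTML string."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles], parser.links


# Hypothetical markup standing in for a rendered article list.
SAMPLE_HTML = """
<ul>
  <li><a href="/news/1"><h3>First article</h3></a></li>
  <li><a href="/news/2"><h3>Second article</h3></a></li>
</ul>
"""

titles, links = parse_headlines(SAMPLE_HTML)
print(titles)
print(links)
```

This only works on HTML that the server has already rendered. Pages that build their content with JavaScript need a real browser, which is where Selenium comes in.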
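RPA (Robotic Process Automation) is a business automation technology that helps automate repetitive tasks (such as data entry or banking) without human intervention.
Here we build an RPA that collects data by parsing, element by element, the HTML content of a web browser that cannot be collected through a REST API alone.
Selenium is a portable open-source framework for automating web application tests across browsers (Chrome, Firefox, Safari).
Being able to automate tests means that the various actions that take place in a web browser can be performed automatically.
Scripts can be written in several programming languages, including Java, Python, C#, and Ruby.
The goal of this document is to build a web crawler that collects data from a specific website.
It is divided into four main parts.
Installing Selenium IDE as an extension in the Chrome web browser lets you record the actions that take place in the browser and replay them.
The code below is sample source for checking, in the developer console that Chrome provides, whether the elements to collect exist on the target site. With jQuery you can check the path of the HTML elements.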
// jQuery init
var script = document.createElement('script');
script.src = 'https://code.jquery.com/jquery-3.4.1.min.js';
script.type = 'text/javascript';
document.getElementsByTagName('head')[0].appendChild(script);

// Article search (run this part after the jQuery script above has finished loading)
$(".left-side-search").each(function(){
  $(this).find('li').each(function(){
    var title = $(this).find('h3').html();
    var updated_date = $(this).find('.updated').html();
    var link = $(this).find('a').attr("href");
    var desc = $(this).find('p').prev().eq(2).html();
    console.log(title);
    console.log(updated_date);
    console.log(link);
    console.log(desc);
  });
});
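Once the selectors are confirmed in the console, the same lookup can be expressed with the Selenium 3 Python API that the Lambda function uses later. The following is a rough sketch: `collect_articles` and `first_or_none` are hypothetical helper names, the selectors are copied from the console snippet, and `.text` returns visible text rather than the inner HTML that jQuery's `.html()` returns.

```python
def first_or_none(parent, css):
    """Return the first element matching css under parent, or None if there is none."""
    found = parent.find_elements_by_css_selector(css)
    return found[0] if found else None


def collect_articles(driver):
    """Collect title, updated date and link for each <li> under .left-side-search."""
    articles = []
    for item in driver.find_elements_by_css_selector(".left-side-search li"):
        title = first_or_none(item, "h3")
        updated = first_or_none(item, ".updated")
        link = first_or_none(item, "a")
        articles.append({
            "title": title.text if title else None,
            "updated_date": updated.text if updated else None,
            "link": link.get_attribute("href") if link else None,
        })
    return articles
```

Using the plural `find_elements_by_css_selector` avoids a `NoSuchElementException` when one list item lacks a field, so a single incomplete entry does not abort the whole run.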
Example command for expanding the file system (run in a CLI environment)
curl -s https://gist.githubusercontent.com/wongcyrus/a4e726b961260395efa7811cab0b4516/raw/6a045f51acb2338bb2149024a28621db2abfcaab/resize.sh | bash /dev/stdin 20
Download the related files to reference in the project
git clone https://github.com/aws-samples/lambda-web-scraper-example.git
Create a folder and move into it, then copy in the main files from step 3. (If a file appears to be missing in the steps below, copy it from the code fetched in step 3.)
mkdir rpa1 && cd rpa1
Initialize CDK in the folder
cdk init app --language python
Activate the virtual environment
source .venv/bin/activate
Configure the Python packages (contents of requirements.txt)
aws-cdk.core
aws-cdk.aws_lambda
aws-cdk.aws_events_targets
aws-cdk.aws_events
aws-cdk.aws_dynamodb
aws-cdk.aws_iam
aws-cdk.aws_apigateway
aws-cdk.aws_s3
aws-cdk.aws_kinesisfirehose
aws-cdk.aws_s3_deployment
aws-cdk.aws_sqs
aws-cdk.aws_lambda_event_sources
aws-cdk.aws_kinesisfirehose_destinations
Install the Python packages (with the virtual environment activated)
pip install -r requirements.txt
Create the Dockerfile
FROM amazonlinux
RUN yum update -y
RUN yum install -y \
    gcc \
    openssl-devel \
    zlib-devel \
    libffi-devel \
    wget && \
    yum -y clean all
RUN yum -y groupinstall development
WORKDIR /usr/src

# Install Python 3.7
RUN yum install -y tar xz
RUN wget https://www.python.org/ftp/python/3.7.10/Python-3.7.10.tgz
RUN tar xzf Python-3.7.10.tgz
RUN cd Python-3.7.10 ; ./configure --enable-optimizations; make altinstall
RUN python3.7 -V

# Install pip
RUN wget https://bootstrap.pypa.io/pip/get-pip.py
RUN python3.7 get-pip.py
RUN rm get-pip.py
RUN pip -V

WORKDIR /opt/output/
RUN pip install selenium==3.141.0 -t /opt/output/python/lib/python3.7/site-packages
RUN wget https://chromedriver.storage.googleapis.com/2.43/chromedriver_linux64.zip
RUN unzip chromedriver_linux64.zip
RUN curl -SL https://github.com/adieuadieu/serverless-chrome/releases/download/v1.0.0-55/stable-headless-chromium-amazonlinux-2017-03.zip > headless-chromium.zip
RUN unzip headless-chromium.zip
RUN rm *.zip
COPY run.sh /opt/output/run.sh
ENTRYPOINT /opt/output/run.sh
Build the Docker image
docker build -t selenium_layer .
docker run -i -v `pwd`/python:/opt/ext -t selenium_layer
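Before deploying, it can help to confirm that the container actually wrote the layer contents into the local `python` folder. The sketch below assumes that `run.sh` (not shown in this document) copies everything under `/opt/output` into the mounted `/opt/ext` directory; the expected entries are derived from the Dockerfile paths, and `missing_layer_entries` is a hypothetical helper name.

```python
from pathlib import Path

# Entries the Lambda layer is expected to contain, relative to the layer root.
# Assumption: run.sh copies /opt/output as-is into the mounted folder.
EXPECTED_LAYER_ENTRIES = (
    "chromedriver",
    "headless-chromium",
    "python/lib/python3.7/site-packages/selenium",
)


def missing_layer_entries(layer_root):
    """Return the expected entries that do not exist under layer_root."""
    root = Path(layer_root)
    return [entry for entry in EXPECTED_LAYER_ENTRIES if not (root / entry).exists()]


if __name__ == "__main__":
    missing = missing_layer_entries("python")
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("layer looks complete")
```

Lambda extracts a layer under `/opt`, which is why the function code later refers to `/opt/chromedriver` and `/opt/headless-chromium`.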
(If no S3 bootstrap for CDK exists yet) Create the S3 bootstrap. (Substitute your own AWS account ID and AWS Region.)
cdk bootstrap aws://<your AWS account ID>/<region>
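A typo in the account ID or Region is an easy mistake at this step. As a small sketch, the hypothetical helper below builds the `aws://` target and rejects obviously malformed values; the account ID and Region in the example are placeholders, and the Region pattern is only a loose sanity check, not an authoritative list.

```python
import re


def bootstrap_target(account_id, region):
    """Build the aws://ACCOUNT/REGION string expected by `cdk bootstrap`."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account ID must be exactly 12 digits")
    # Loose shape check only (e.g. us-east-1, ap-northeast-2).
    if not re.fullmatch(r"[a-z]{2}(-[a-z]+)+-\d", region):
        raise ValueError("unexpected region format: " + region)
    return "aws://{}/{}".format(account_id, region)


# Placeholder values; substitute your own account ID and Region.
print("cdk bootstrap " + bootstrap_target("123456789012", "ap-northeast-2"))
```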
Modify the app.py file
#!/usr/bin/env python3
from aws_cdk import core

from RpaAutoNews.rpa_auto_news_stack import RpaAutoNewsStack

app = core.App()
RpaAutoNewsStack(app, "RpaAutoNews")
app.synth()
Modify the Stack file. The code below creates the CDK Stack portion of the architecture.
from aws_cdk import (
    aws_events as events,
    aws_lambda as lambdas,
    aws_dynamodb as dynamodb,
    aws_apigateway as apigateway,
    aws_events_targets as targets,
    aws_iam as iam,
    aws_s3 as s3,
    aws_s3_deployment as s3_deploy,
    aws_sqs as sqs,
    aws_kinesisfirehose as firehose,
    aws_kinesisfirehose_destinations as destinations,
    core
)
from aws_cdk.aws_lambda import LayerVersion, AssetCode
from aws_cdk.aws_lambda_event_sources import DynamoEventSource, SqsDlq
from constructs import Construct


class RpaAutoNewsStack(core.Stack):

    def __init__(self, scope: core.Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        # Execution role for the scraping bot Lambda
        role = iam.Role(
            self, 'BotRole',
            assumed_by=iam.ServicePrincipal('lambda.amazonaws.com'))
        role.add_to_policy(iam.PolicyStatement(
            effect=iam.Effect.ALLOW,
            resources=["*"],
            actions=['events:*']))
        role.add_to_policy(iam.PolicyStatement(
            effect=iam.Effect.ALLOW,
            resources=["arn:aws:iam::*:role/AWS_Events_Invoke_Targets"],
            actions=['iam:PassRole']))
        role.add_to_policy(iam.PolicyStatement(
            effect=iam.Effect.ALLOW,
            resources=["*"],
            actions=["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]))
        role.add_to_policy(iam.PolicyStatement(
            effect=iam.Effect.ALLOW,
            resources=["*"],
            actions=["s3:*"]))
        role.add_to_policy(iam.PolicyStatement(
            effect=iam.Effect.ALLOW,
            resources=["*"],
            actions=["lambda:*"]))
        role.add_to_policy(iam.PolicyStatement(
            effect=iam.Effect.ALLOW,
            resources=["*"],
            actions=["sns:*"]))
        role.add_to_policy(iam.PolicyStatement(
            effect=iam.Effect.ALLOW,
            resources=["*"],
            actions=['translate:TranslateText']))
        role.add_to_policy(iam.PolicyStatement(
            effect=iam.Effect.ALLOW,
            resources=["*"],
            actions=['comprehend:*']))
        role.add_to_policy(iam.PolicyStatement(
            effect=iam.Effect.ALLOW,
            resources=["*"],
            actions=['ses:*']))

        # DynamoDB tables
        search_table = dynamodb.Table(
            self, "Search",
            partition_key=dynamodb.Attribute(name="title", type=dynamodb.AttributeType.STRING),
            sort_key=dynamodb.Attribute(name="created_date", type=dynamodb.AttributeType.STRING),
            removal_policy=core.RemovalPolicy.DESTROY
        )
        news_table = dynamodb.Table(
            self, "News",
            partition_key=dynamodb.Attribute(name="title", type=dynamodb.AttributeType.STRING),
            sort_key=dynamodb.Attribute(name="created_date", type=dynamodb.AttributeType.STRING),
            stream=dynamodb.StreamViewType.NEW_AND_OLD_IMAGES,
            removal_policy=core.RemovalPolicy.DESTROY
        )

        # Scraping bot Lambda
        RpaAutoNewsLambda = lambdas.Function(
            self, "Bot",
            handler="index.lambda_handler",
            code=lambdas.Code.from_asset('lambda/rpanewsbot'),
            timeout=core.Duration.seconds(600),
            runtime=lambdas.Runtime.PYTHON_3_7,
            memory_size=2048,
            environment=dict(
                PATH="/opt",
                CONF_SET="RPA_CONFIG",
                OPER_EMAIL="김현수 <admin@studydev.com>",
                RECV_EMAIL="김현수 <admin@studydev.com>",
                SEND_EMAIL="김현수 <admin@studydev.com>"
            ),
            role=role
        )
        search_table.grant_read_write_data(RpaAutoNewsLambda)
        news_table.grant_read_write_data(RpaAutoNewsLambda)
        RpaAutoNewsLambda.add_environment("SEARCH_TABLE", search_table.table_name)
        RpaAutoNewsLambda.add_environment("NEWS_TABLE", news_table.table_name)

        # Run the bot once a day at 00:00 UTC
        rule = events.Rule(
            self, "CronRule",
            schedule=events.Schedule.cron(
                minute='0', hour='0', month='*', week_day='*', year='*'),
        )
        rule.add_target(targets.LambdaFunction(RpaAutoNewsLambda))

        # Lambda layer built with Docker (Selenium, chromedriver, headless Chromium)
        ac = AssetCode("./python")
        layer = LayerVersion(
            self, "selenium_layer",
            code=ac,
            description="selenium_layer layer",
            compatible_runtimes=[lambdas.Runtime.PYTHON_3_7],
            layer_version_name='selenium_layer')
        RpaAutoNewsLambda.add_layers(layer)

        # Role shared by the API Lambda and the stream-processing Lambda
        api_role = iam.Role(
            self, 'ApiRole',
            assumed_by=iam.ServicePrincipal('lambda.amazonaws.com'))
        api_role.add_to_policy(iam.PolicyStatement(
            effect=iam.Effect.ALLOW,
            resources=["*"],
            actions=["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"])
        )
        api_role.add_to_policy(iam.PolicyStatement(
            effect=iam.Effect.ALLOW,
            resources=["*"],
            actions=[
                "firehose:DeleteDeliveryStream",
                "firehose:PutRecord",
                "firehose:PutRecordBatch",
                "firehose:UpdateDestination"
            ])
        )

        # News list API
        newsListLambda = lambdas.Function(
            self, "List",
            handler="index.lambda_handler",
            code=lambdas.Code.from_asset("lambda/rpanewslist"),
            timeout=core.Duration.seconds(100),
            runtime=lambdas.Runtime.PYTHON_3_8,
            memory_size=256,
            role=api_role
        )
        newsListLambda.add_environment("NEWS_TABLE", news_table.table_name)
        news_table.grant_read_write_data(newsListLambda)
        apiNewsList = apigateway.LambdaRestApi(self, "RpaNewsListApi", handler=newsListLambda)

        # RPA Web Site
        rpa_auto_news_bucket = s3.Bucket(
            self, "RpaAutoNewsWeb",
            bucket_name=('rpa-auto-news-web'),
            website_index_document='index.html',
            website_error_document='error.html',
            public_read_access=True,
            removal_policy=core.RemovalPolicy.DESTROY,
            auto_delete_objects=True
        )

        # Deployment RPA Web Site
        deployment = s3_deploy.BucketDeployment(
            self, "DeployRpaAutoNewsSite",
            sources=[s3_deploy.Source.asset("website")],
            destination_bucket=rpa_auto_news_bucket
        )

        # RPA Analytics Data Lake
        rpa_analytics_bucket = s3.Bucket(
            self, "RpaAnalytics",
            bucket_name=('rpa-auto-news-analytics'),
            removal_policy=core.RemovalPolicy.DESTROY,
            auto_delete_objects=True
        )

        # Firehose
        firehose_stream = firehose.DeliveryStream(
            self, "RpaFirehoseStream",
            destinations=[destinations.S3Bucket(
                rpa_analytics_bucket,
                buffering_interval=core.Duration.seconds(60),
                buffering_size=core.Size.mebibytes(1),
            )]
        )

        # Trigger to Lambda from DDB Stream
        ddb_stream_lambda = lambdas.Function(
            self, "DDBStream",
            handler="index.lambda_handler",
            code=lambdas.Code.from_asset("lambda/processddbstream"),
            timeout=core.Duration.seconds(100),
            runtime=lambdas.Runtime.PYTHON_3_8,
            memory_size=256,
            role=api_role
        )
        dead_letter_queue = sqs.Queue(self, "NewsDLQ")
        ddb_stream_lambda.add_event_source(DynamoEventSource(
            news_table,
            starting_position=lambdas.StartingPosition.TRIM_HORIZON,
            batch_size=5,
            bisect_batch_on_error=True,
            on_failure=SqsDlq(dead_letter_queue),
            retry_attempts=2
        ))

        # Pass the generated Firehose delivery stream name to the stream-processing Lambda
        ddb_stream_lambda.add_environment("FIREHOSE_NAME", firehose_stream.delivery_stream_name)
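The stack references `lambda/processddbstream`, but its source is not included in this document. The following is a minimal sketch of what such a handler could look like, under the assumption that it forwards new and modified items from the DynamoDB stream to the Firehose delivery stream named in `FIREHOSE_NAME`. The helper names `_plain` and `records_to_firehose` are hypothetical.

```python
import json
import os


def _plain(attr):
    """Convert one DynamoDB-typed attribute (e.g. {"S": "x"}) into a plain value."""
    (type_tag, value), = attr.items()
    if type_tag == "N":
        return float(value) if "." in value else int(value)
    if type_tag == "M":
        return {key: _plain(val) for key, val in value.items()}
    if type_tag == "L":
        return [_plain(val) for val in value]
    if type_tag == "NULL":
        return None
    return value  # S, BOOL and other types are passed through unchanged


def records_to_firehose(event):
    """Turn INSERT/MODIFY stream records into newline-delimited JSON payloads."""
    batch = []
    for record in event.get("Records", []):
        image = record.get("dynamodb", {}).get("NewImage")
        if record.get("eventName") in ("INSERT", "MODIFY") and image:
            item = {key: _plain(val) for key, val in image.items()}
            line = json.dumps(item, ensure_ascii=False) + "\n"
            batch.append({"Data": line.encode("utf-8")})
    return batch


def lambda_handler(event, context):
    import boto3  # imported lazily so the helpers above run without the AWS SDK

    batch = records_to_firehose(event)
    if batch:
        boto3.client("firehose").put_record_batch(
            DeliveryStreamName=os.environ["FIREHOSE_NAME"],
            Records=batch,
        )
    return {"forwarded": len(batch)}
```

Writing one JSON object per line keeps the objects that Firehose delivers to S3 easy to query later. A production handler would also inspect `FailedPutCount` in the response and retry the failed records.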
Create index.py for the Lambda source
import time
import datetime
import boto3
from botocore.exceptions import ClientError
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
# from pybloom import ScalableBloomFilter
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import pickle

# Headless Chromium shipped in the Lambda layer (extracted under /opt)
options = Options()
options.headless = True
options.binary_location = '/opt/headless-chromium'
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--single-process')
options.add_argument('--disable-dev-shm-usage')


def lambda_handler(event, context):
    print('scraping logic goes here')
    driver = webdriver.Chrome('/opt/chromedriver', options=options)
    driver.get('https://www.amazon.com/')
    message = driver.title
    print(message)

    # Type a query into the search box and submit it
    search_bar = driver.find_element_by_name("field-keywords")
    search_bar.clear()
    search_bar.send_keys("amazon echo show")
    search_bar.send_keys(Keys.RETURN)
    print(driver.current_url)

    driver.close()
    driver.quit()

    response = {
        "statusCode": 200,
        "body": message
    }
    return response
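If the page fails to load or an element is missing, the handler above raises before it reaches `driver.quit()`, which can leave a Chromium process behind in a reused Lambda container. One way to guard against that is a `try`/`finally` wrapper, sketched below. `fetch_title` is a hypothetical helper, and the driver is created through a factory argument so the function can be exercised without a real browser.

```python
def fetch_title(create_driver, url):
    """Open url with a freshly created driver and always quit the driver afterwards."""
    driver = create_driver()
    try:
        driver.get(url)
        return driver.title
    finally:
        # Runs on success and on error, so the browser process is always released.
        driver.quit()


# Inside the Lambda handler this could be called as follows
# (assuming `webdriver` and `options` are set up as in index.py):
#
#   title = fetch_title(
#       lambda: webdriver.Chrome('/opt/chromedriver', options=options),
#       'https://www.amazon.com/')
```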
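Afterwards, replace the Lambda code with the automation code recorded in Selenium IDE.
(This source was AWS Lambda sample code written for company S; the data collection portion has been removed for security reasons.)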
import time
import datetime
import boto3
import os
import json
from dateutil import parser
from decimal import Decimal
from botocore.exceptions import ClientError
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException, ElementNotVisibleException
# from pybloom import ScalableBloomFilter
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import pickle

options = Options()
options.headless = True
options.binary_location = '/opt/headless-chromium'
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--single-process')
options.add_argument('--disable-dev-shm-usage')

translate = boto3.client(service_name='translate')
comprehend = boto3.client(service_name='comprehend')
dynamodb = boto3.resource('dynamodb')
search_table = dynamodb.Table(os.environ.get('SEARCH_TABLE'))
news_table = dynamodb.Table(os.environ.get('NEWS_TABLE'))


# check exist css element
def check_exists_by_css(driver, css_string):
    try:
        driver.find_element_by_css_selector(css_string)
    except NoSuchElementException:
        return False
    return True


def send_summary_email(SUBJECT, BODY_TEXT, BODY_HTML):
    recv_email_list = os.environ['RECV_EMAIL'].split(";")
    # The character encoding for the email.
    CHARSET = "UTF-8"
    # Create a new SES resource and specify a region.
    client = boto3.client('ses', region_name="us-east-1")
    # Try to send the email.
    try:
        # Provide the contents of the email.
        response = client.send_email(
            Destination={
                'ToAddresses': recv_email_list
            },
            Message={
                'Body': {
                    'Html': {
                        'Charset': CHARSET,
                        'Data': BODY_HTML,
                    },
                    'Text': {
                        'Charset': CHARSET,
                        'Data': BODY_TEXT,
                    },
                },
                'Subject': {
                    'Charset': CHARSET,
                    'Data': SUBJECT,
                }
            },
            Source=os.environ['SEND_EMAIL'],
            ConfigurationSetName=os.environ['CONF_SET']
        )
    # Display an error if something goes wrong.
    except ClientError as e:
        print(e.response['Error']['Message'])
        return e.response['Error']['Message']
    else:
        print("Email sent! Message ID:")
        print(response['MessageId'])
        return response['MessageId']


...
# Application Area
...


def make_email_body(summary, news_items):
    # The HTML body of the email.
    body_head = """
<!DOCTYPE html>
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <title>AWS Lambda RPA 데이터 수집 결과 발송</title>
  <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
</head>
<body style="margin: 0; padding: 0;">
<table role="presentation" border="0" cellpadding="0" cellspacing="0" width="100%" style="font-family: 'Malgun Gothic', '맑은 고딕', Arial, sans-serif, 'Open Sans'">
<tbody> <tr> <td>
  <table align="center" border="0" cellpadding="0" cellspacing="0" width="1024px" style="border-collapse: collapse; border: 0px solid #cccccc;">
  <tbody>
    <tr>
      <td align="left" style="padding: 10px 0 10px 20px;"> <h1>기사 스크랩 RPA 수행 보고서</h1> </td>
      <td align="right" > <img src="https://a0.awsstatic.com/libra-css/images/logos/aws_logo_smile_1200x630.png" width="120"> </td>
    </tr>
  </tbody>
  </table>
  <table align="center" border="0" cellpadding="0" cellspacing="0" width="1024px" style="border-collapse: collapse; border: 1px solid #cccccc;">
  <tbody>
    <tr>
      <td align="left" style="font-size: 14px; padding: 10px 0 0 20px;">
        <p>이 RPA 보고서는 <strong>Website-scraper-RPA-Function</strong> 함수에서 발송합니다.</p>
        <p><strong>{data[search_date]}</strong> 에 생성된 기사에 대한 자동 수집 보고서입니다.</p>
      </td>
    </tr>
    <tr> <td style="padding: 0px 10px 0px 10px;"><hr/></td> </tr>
    <tr>
      <td bgcolor="#ffffff" style="padding: 20px 30px 10px 30px;">
        <table border="0" cellpadding="0" cellspacing="0" width="100%" style="border-collapse: collapse;">
        <tbody>
          <tr> <td style="color: #153643; font-family: Arial, sans-serif;"> <h2 style="font-size: 20px; margin: 0;">스크랩 요약</h2> </td> </tr>
          <tr>
            <td style="color: #153643; font-family: Arial, sans-serif; font-size: 16px; line-height: 24px; padding: 20px 0 30px 0;">
              <table border="0" cellpadding="0" cellspacing="0" width="1024px" style="font-size: 14px; border-collapse: collapse;" >
              <tbody> <tr> <td width="1000" valign="top">
                <table border="1" cellpadding="0" cellspacing="0" width="100%" style="border-collapse: collapse; border: 1px solid #000000;">
                <tbody>
                  <tr >
                    <td align="center" width="200px" bgcolor="#dddddd"> <strong>검색 기사 개수</strong> </td>
                    <td align="center" width="280px" bgcolor="#ffffff"> {data[news_total_count]} 개 </td>
                    <td align="center" width="200px" bgcolor="#dddddd"> <strong>중립 감정 기사 개수</strong> </td>
                    <td align="center" width="280px" bgcolor="#ffffff"> {data[news_senti_neutral]} 개 </td>
                  </tr>
                  <tr >
                    <td align="center" width="200px" bgcolor="#dddddd"> <strong>저장 기사 개수</strong> </td>
                    <td align="center" width="280px" bgcolor="#ffffff"> {data[news_total_save]} 개 </td>
                    <td align="center" width="200px" bgcolor="#dddddd"> <strong>부정 감정 기사 개수</strong> </td>
                    <td align="center" width="280px" bgcolor="#ffffff"> {data[news_senti_negative]} 개 </td>
                  </tr>
                  <tr >
                    <td align="center" width="200px" bgcolor="#dddddd"> <strong>미저장 기사 개수</strong> </td>
                    <td align="center" width="280px" bgcolor="#ffffff"> {data[news_total_unsave]} 개 </td>
                    <td align="center" width="200px" bgcolor="#dddddd"> <strong>긍정 감정 기사 개수</strong> </td>
                    <td align="center" width="280px" bgcolor="#ffffff"> {data[news_senti_positive]} 개 </td>
                  </tr>
                  <tr >
                    <td align="center" width="200px" bgcolor="#dddddd"> <strong></strong> </td>
                    <td align="center" width="280px" bgcolor="#ffffff"> </td>
                    <td align="center" width="200px" bgcolor="#dddddd"> <strong>혼합 감정 기사 개수</strong> </td>
                    <td align="center" width="280px" bgcolor="#ffffff"> {data[news_senti_mixed]} 개 </td>
                  </tr>
                </tbody>
                </table>
              </td> </tr> </tbody>
              </table>
            </td>
          </tr>
        </tbody>
        </table>
      </td>
    </tr>
    <tr>
      <td bgcolor="#ffffff" style="padding: 0px 30px 10px 30px;">
        <table border="0" cellpadding="0" cellspacing="0" width="100%" style="border-collapse: collapse;">
        <tbody>
          <tr> <td style="color: #153643; font-family: Arial, sans-serif;"> <h2 style="font-size: 20px; margin: 0;">수집 기사 정보</h2> </td> </tr>
          <tr>
            <td style="color: #153643; font-family: Arial, sans-serif; font-size: 16px; line-height: 24px; padding: 20px 0 30px 0;">
              <table border="0" cellpadding="0" cellspacing="0" width="1024px" style="font-size: 14px; border-collapse: collapse;" >
              <tbody> <tr> <td width="1000" valign="top">
                <table border="1" cellpadding="0" cellspacing="0" width="100%" style="border-collapse: collapse; border: 1px solid #000000;">
                <tbody>
                  <tr bgcolor="#dddddd" align="center">
                    <th width="90px"> <strong>날짜</strong> </th>
                    <th width="290px"> <strong>제목</strong> </th>
                    <th width="420px"> <strong>부제</strong> </th>
                    <th width="100px"> <strong>연관 제조사</strong> </th>
                    <th width="50px"> <strong>감성</strong> </th>
                    <th width="50px"> <strong>스크랩</strong> </th>
                  </tr>""".format(data = summary)

    body_middle = ""
    for i in range( len(news_items) ):
        news_item = news_items[i]
        # Korean sentiment label, colored by sentiment
        if news_item['sentiment'] == "NEUTRAL":
            news_items[i]['sentiment_ko'] = '<span style="color:#2185d0">중립</span>'
        elif news_item['sentiment'] == "NEGATIVE":
            news_items[i]['sentiment_ko'] = '<span style="color:#db2828">부정</span>'
        elif news_item['sentiment'] == "POSITIVE":
            news_items[i]['sentiment_ko'] = '<span style="color:#21ba45">긍정</span>'
        elif news_item['sentiment'] == "MIXED":
            news_items[i]['sentiment_ko'] = '<span style="color:#000000">혼합</span>'
        else:
            news_items[i]['sentiment_ko'] = '<span style="color:#000000">-</span>'

        body_middle = body_middle + """
                  <tr bgcolor="#ffffff" align="center">
                    <td> {search_date} </td>
                    <td align="left" style="padding-left:10px;"> <a href="{data[link_url]}">{data[title]}</a><br>{data[ko_title]} </td>
                    <td align="left" style="padding-left:10px;"> {data[sub_title]} </td>
                    <td> {data[car_manufacturer]} </td>
                    <td> {data[sentiment_ko]} </td>
                    <td> {data[reg_flag]} </td>
                  </tr>""".format(data = news_item, search_date = summary['search_date'])

    body_tail = """
                </tbody>
                </table>
              </td> </tr> </tbody>
              </table>
            </td>
          </tr>
        </tbody>
        </table>
      </td>
    </tr>
    <tr>
      <td bgcolor="#1d2834" style="padding: 20px 30px;">
        <table border="0" cellpadding="0" cellspacing="0" width="100%" style="border-collapse: collapse;">
        <tbody>
          <tr>
            <td style="color: #dddddd; font-family: Arial, sans-serif; font-size: 12px;">
              <p style="margin: 0;">ALL CONTENTS Copyright ⓒ2022 AWS Korea LTD.ALL RIGHTS RESERVED<br></p>
            </td>
            <td align="right"> <img src="https://a0.awsstatic.com/libra-css/images/logos/aws_smile-header-mobile-en-white_48x29.png" width="48"> </td>
          </tr>
        </tbody>
        </table>
      </td>
    </tr>
  </tbody>
  </table>
</td> </tr> </tbody>
</table>
</body>
</html>
"""
    return body_head + body_middle + body_tail
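The removed application area is what fills in the `summary` dictionary and the per-article fields consumed by `make_email_body`. The sketch below shows one way those values could be produced; it is not the original code. `enrich_article` and `build_summary` are hypothetical names, English is assumed as the source language, and the saved-article count is passed in because the document does not show how `reg_flag` is decided. `translate_text` and `detect_sentiment` are the boto3 Translate and Comprehend calls that the IAM policy in the stack allows.

```python
from collections import Counter


def enrich_article(item, translate, comprehend):
    """Add a Korean title and a sentiment label to one article dictionary."""
    translated = translate.translate_text(
        Text=item["title"],
        SourceLanguageCode="en",  # assumption: the collected articles are in English
        TargetLanguageCode="ko",
    )
    item["ko_title"] = translated["TranslatedText"]
    sentiment = comprehend.detect_sentiment(Text=item["title"], LanguageCode="en")
    item["sentiment"] = sentiment["Sentiment"]  # POSITIVE, NEGATIVE, NEUTRAL or MIXED
    return item


def build_summary(search_date, news_items, saved_count):
    """Build the summary dictionary with the keys used by the email template."""
    counts = Counter(item.get("sentiment") for item in news_items)
    total = len(news_items)
    return {
        "search_date": search_date,
        "news_total_count": total,
        "news_total_save": saved_count,
        "news_total_unsave": total - saved_count,
        "news_senti_neutral": counts["NEUTRAL"],
        "news_senti_negative": counts["NEGATIVE"],
        "news_senti_positive": counts["POSITIVE"],
        "news_senti_mixed": counts["MIXED"],
    }
```

Passing the `translate` and `comprehend` clients as arguments keeps both helpers easy to test with stand-in objects before wiring them to the module-level boto3 clients.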
S3 HTML File
<html class='gr__semantic-ui_com'>
<head>
  <!-- Standard Meta -->
  <meta charset='UTF-8'>
  <meta http-equiv='X-UA-Compatible' content='IE=edge,chrome=1' />
  <meta name='viewport' content='width=device-width, initial-scale=1.0, maximum-scale=1.0'>
  <!-- Site Properties -->
  <title>AWS Serverless 서비스 기반 RPA 데모</title>
  <link rel='stylesheet' href='styles.css'>
  <link rel='stylesheet' href='https://cdn.jsdelivr.net/npm/semantic-ui@2.4.2/dist/semantic.min.css'>
</head>
<body data-gr-c-s-loaded='true' class='ui pushable'>
  <div class='ui large top fixed menu borderless'>
    <div class='ui container'>
      <a href='/' class='header item'>
        <img class='logo' src='favicon.ico'> AWS Serverless 서비스 기반 RPA 데모
      </a>
      <div class="ui right aligned item">
        <div class="ui toggle checkbox">
          <input id="ko_trans" type="checkbox" name="public">
          <label>한글 전환</label>
        </div>
      </div>
    </div>
  </div>
  <div class='ui pusher'>
    <div class='ui inverted dimmer'>
      <div class='ui loader'></div>
    </div>
    <div class='ui container'>
      <p><br><br><br></p>
      <div class='ui placeholder segment'>
        <img src="https://aws.studydev.com/rpa/rpa_architecture.svg" style="width:100%; margin:0px 0 0 0;">
      </div>
      <div class='ui form'>
        <div class='field'>
          <label>기사 검색: </label>
          <div class='ui action input'>
            <input type='text' id='title' value='*' placeholder='Search...'>
            <button id='searchButton' class='ui icon button'>
              <i class='search icon'></i>
            </button>
          </div>
        </div>
      </div>
      <table id='news_en' class='ui celled table'>
        <thead>
          <tr>
            <th class='two wide active'>Date</th>
            <th class='six wide active'>Title/Sub title</th>
            <th class='eleven wide active'>Article</th>
          </tr>
        </thead>
        <tbody>
          <tr><td colspan='3'>No data.</td></tr>
        </tbody>
      </table>
      <table id='news_ko' class='ui celled table' style="display: none;">
        <thead>
          <tr>
            <th class='two wide active'>날짜</th>
            <th class='six wide active'>제목/부제</th>
            <th class='eleven wide active'>기사</th>
          </tr>
        </thead>
        <tbody>
          <tr><td colspan='3'>검색된 데이터가 없습니다.</td></tr>
        </tbody>
      </table>
    </div>
    <p> </p>
  </div>
  <!-- Site javascript files -->
  <script src='https://semantic-ui.com/examples/assets/library/jquery.min.js'></script>
  <script src='https://cdn.jsdelivr.net/npm/semantic-ui@2.4.2/dist/semantic.min.js'></script>
  <script src='https://semantic-ui.com/examples/assets/library/iframe.js'></script>
  <script src='scripts.js'></script>
  <script>
    var news_data = [];

    // Turn line breaks into paragraph boundaries
    function nl2p(str){
      return str.replace(/(?:\r\n|\r|\n)/g, '</p><p>');
    }

    // Render a sentiment value as a colored button
    function semantics(str){
      var semantic = "";
      switch ( str ) {
        case "NEUTRAL" :
          semantic = ' <button class="ui mini primary button">NEUTRAL</button>';
          break;
        case "NEGATIVE" :
          semantic = ' <button class="ui mini negative button">NEGATIVE</button>';
          break;
        case "MIXED" :
          semantic = ' <button class="ui mini secondary button">MIXED</button>';
          break;
        case "POSITIVE" :
          semantic = ' <button class="ui mini positive button">POSITIVE</button>';
          break;
      }
      return semantic;
    }

    $("#ko_trans").change(function(){
      $("#news_en").toggle();
      $("#news_ko").toggle();
    });

    // Search Button
    document.getElementById('searchButton').onclick = function(){
      var title = $('#title').val();
      $.ajax({
        url: API_ENDPOINT + '?title=' + encodeURIComponent(title),
        type: 'GET',
        success: function (response) {
          news_data = response;
          $('#news_en tr').slice(1).remove();
          $('#news_ko tr').slice(1).remove();

          // English news table
          jQuery.each(response, function(i,data) {
            $('#news_en').append('<tr> \
              <td>' + data['created_date'].substring(0, 10) + '<br>' + data['created_date'].substring(11, 19) + '</td> \
              <td>' + '<div class="content"><h2 class="ui header"><a href="' + data['link_url'] + '" target="_blank"> \
              ' + data['title'] + '</a></h2><p>' + data['sub_title'] + '</p></div> \
              ' + '<div class="ui divider"></div><h4>Category</h4><p>' + data['category'] + '</p> \
              ' + '<h4>Entities ' + semantics(data['sentiment']) + '</h4><p>' + data['entities'] + '</p></td> \
              <td>' + '<div class="content"><p>' + nl2p(data['paragraph']) + '</p></div></td> \
              </tr>');
          });

          // Korean news table
          jQuery.each(response, function(i,data) {
            $('#news_ko').append('<tr> \
              <td>' + data['created_date'].substring(0, 10) + '<br>' + data['created_date'].substring(11, 19) + '</td> \
              <td>' + '<div class="content"><h2 class="ui header"><a href="' + data['link_url'] + '" target="_blank"> \
              ' + data['ko_title'] + '</a></h2><p>' + data['ko_sub_title'] + '</p></div> \
              ' + '<div class="ui divider"></div><h4>카테고리</h4><p>' + data['ko_category'] + '</p> \
              ' + '<h4>엔티티 ' + semantics(data['sentiment']) + '</h4><p>' + data['entities'] + '</p></td> \
              <td>' + '<div class="content"><p>' + nl2p(data['ko_paragraph']) + '</p></div></td> \
              </tr>');
          });
        },
        error: function () {
          alert('error');
        }
      });
    }
  </script>
</body>
</html>
S3 scripts.js
After the CDK deployment, set the API GW endpoint URL below to the endpoint URL of the API that the deployment generates automatically.
// Enter the API GW endpoint URL here.
var API_ENDPOINT = '';

if (API_ENDPOINT === '') {
  alert('Register the URL deployed to API Gateway at the top of scripts.js before running.');
}

// Show the loading dimmer while an AJAX request is in flight
$(document)
  .ajaxStart(function () {
    $('.pusher').dimmer('show');
  })
  .ajaxStop(function () {
    $('.pusher').dimmer('hide');
  });

$('.ui.dropdown').dropdown({});
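Editing the endpoint by hand after every deployment is easy to forget, so the replacement can be scripted. The sketch below rewrites the `API_ENDPOINT` assignment in the text of scripts.js; `set_api_endpoint` is a hypothetical helper, and the URL in the usage note is a made-up example of the shape that API Gateway prints after `cdk deploy`.

```python
import re

# Matches the single-quoted assignment at the top of scripts.js.
_ENDPOINT_LINE = re.compile(r"var API_ENDPOINT = '[^']*';")


def set_api_endpoint(script_text, endpoint_url):
    """Return script_text with the API_ENDPOINT assignment pointing at endpoint_url."""
    if not endpoint_url.startswith("https://"):
        raise ValueError("endpoint URL should start with https://")
    new_text, count = _ENDPOINT_LINE.subn(
        lambda match: "var API_ENDPOINT = '{}';".format(endpoint_url),
        script_text,
        count=1,
    )
    if count != 1:
        raise ValueError("API_ENDPOINT assignment not found")
    return new_text
```

For example, reading `website/scripts.js`, calling `set_api_endpoint(text, "https://abc123.execute-api.ap-northeast-2.amazonaws.com/prod/")`, and writing the result back before running `cdk deploy` again would publish the updated file through the bucket deployment.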
S3 styles.css
.buttons {
  border : solid 0px #e6b215;
  border-radius : 8px;
  -moz-border-radius : 8px;
  font-size : 16px;
  color : #ffffff;
  padding : 5px 18px;
  background-color : #FF9900;
  cursor:pointer;
}
.buttons:hover {
  background-color:#ffc477;
}
.buttons:active {
  position:relative;
  top:1px;
}
#newPost {
  margin: 0 auto;
  width: 90%;
}
#charCounter {
  float:right
}
textarea {
  width: 100%;
}
.img_thumb {
  width: 100px;
  height: 100px;
}
body > .ui.container {
  margin-top: 3em;
}
iframe {
  border: none;
  width: calc(100% + 2em);
  margin: 0em -1em;
  height: 300px;
}
iframe html {
  overflow: hidden;
}
iframe body {
  padding: 0em;
}
.ui.container > h1 {
  font-size: 3em;
  text-align: center;
  font-weight: normal;
}
.ui.container > h2.dividing.header {
  font-size: 2em;
  font-weight: normal;
  margin: 4em 0em 3em;
}
.ui.table {
  table-layout: fixed;
}
Dynamic REST API
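from __future__ import print_function
import boto3
import os
import json
import decimal
from boto3.dynamodb.conditions import Key, Attr


def decimal_default(value):
    # DynamoDB returns numbers as Decimal, which json.dumps cannot serialize by default
    if isinstance(value, decimal.Decimal):
        return float(value)
    raise TypeError("Object of type {} is not JSON serializable".format(type(value).__name__))


def lambda_handler(event, context):
    # queryStringParameters is missing or None when the request has no query string
    params = event.get('queryStringParameters') or {}
    print(params)
    title = params.get('title', '*')

    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table(os.environ['NEWS_TABLE'])

    if title == "*":
        # Wildcard: return up to 100 items
        items = table.scan(Limit=100)
    else:
        # Exact match on the partition key
        items = table.query(KeyConditionExpression=Key('title').eq(title))

    response = {
        'statusCode': 200,
        'body': json.dumps(items["Items"], default=decimal_default),
        'headers': {
            'Content-Type': 'application/json',
            'Access-Control-Allow-Origin': '*'
        }
    }
    return response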
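The API can also be called outside the browser, for example to check it right after deployment. The following sketch uses only the standard library; `build_search_url` and `fetch_news` are hypothetical helper names, and the endpoint passed in is whatever URL the CDK deployment prints for the `RpaNewsListApi` REST API.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def build_search_url(api_endpoint, title="*"):
    """Build the request URL, percent-encoding the title query parameter."""
    return api_endpoint + "?" + urlencode({"title": title}, quote_via=quote)


def fetch_news(api_endpoint, title="*"):
    """Call the news list API and return the decoded JSON list of articles."""
    with urlopen(build_search_url(api_endpoint, title), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Because the Lambda function queries DynamoDB by partition key, a title other than `*` has to match a stored title exactly, so percent-encoding spaces and other special characters matters.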
The dashboard inside the website is a board authored in Amazon QuickSight and embedded in the page. The visualization lets you quickly check recent trends.
https://aws.amazon.com/ko/blogs/architecture/serverless-architecture-for-a-web-scraping-solution/
https://github.com/aws-samples/lambda-web-scraper-example
https://github.com/studydev/lambda-web-scraper-example
https://github.com/aws-samples/serverless-ui-testing-using-selenium
https://github.com/studydev/serverless-ui-testing-using-selenium
https://book.coalastudy.com/data_crawling/week6/stage2
https://www.browserstack.com/guide/python-selenium-to-run-web-automation-test
https://aws.amazon.com/ko/blogs/opensource/run-selenium-tests-at-scale-using-aws-fargate/
https://aws.amazon.com/ko/blogs/startups/infinite-scaling-of-selenium-ui-tests-using-aws-lambda/
https://aws.amazon.com/ko/blogs/architecture/scaling-up-a-serverless-web-crawler-and-search-engine/
https://broadcast.amazon.com/videos/418113
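For actual sample code, we recommend explaining your business requirements to an AWS Solutions Architect and getting their help.
This solution is an example based on a sample solution provided to company S in 2022.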