stavrosian | Unsorted

Telegram channel stavrosian - Selv


Selv

https://youtube.com/shorts/BO3S0tPM2zc?feature=share


Selv

Jennifer Way <jennifer.way@vk.ai>
Dan Kasmierski <Daniel.Kasmierski@vk.ai>
Brian O'Keefe <brian.okeefe@vk.ai>
Aaron Smith <aaron.smith@vk.ai>
John Keyerleber <john.keyerleber@vk.ai>
Matt Carpenter <matt@vk.ai>


Selv

Day 2

Tuesday 23 August 2022

Managers meeting
- [ ] Jason
- [ ] Scott
- [ ] Jess
- [ ] Eric


Selv

Day 4

Thursday 25 August 2022

Apple spider in FW


Selv

Day 6

Monday 29 August 2022

SELECT parse_batch_id AS parse, COUNT(parse_batch_id) AS cnt, process_ended_at FROM collectionframework_linkstate WHERE process_class="AppleStoreSpider" GROUP BY parse_batch_id HAVING cnt>2;
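The query above looks for parse batches that show up more than twice for one spider. A minimal sketch of the same grouping, run against an in-memory SQLite table instead of the real database: the table layout and the sample rows are assumptions for illustration, and MAX(process_ended_at) stands in for the bare column so the result is well defined under strict GROUP BY rules.

```python
import sqlite3


def batches_over_limit(conn, process_class, limit=2):
    """Return (parse_batch_id, row count, latest end time) for batches with more than `limit` rows."""
    sql = """
        SELECT parse_batch_id AS parse,
               COUNT(parse_batch_id) AS cnt,
               MAX(process_ended_at) AS last_ended
        FROM collectionframework_linkstate
        WHERE process_class = ?
        GROUP BY parse_batch_id
        HAVING COUNT(parse_batch_id) > ?
        ORDER BY parse_batch_id
    """
    return conn.execute(sql, (process_class, limit)).fetchall()


def demo_connection():
    """Build a throwaway table with made-up rows (hypothetical data, not real batches)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE collectionframework_linkstate ("
        "parse_batch_id INTEGER, process_class TEXT, process_ended_at TEXT)"
    )
    rows = [
        (1, "AppleStoreSpider", "2022-08-29 10:00:00"),
        (1, "AppleStoreSpider", "2022-08-29 10:01:00"),
        (1, "AppleStoreSpider", "2022-08-29 10:02:00"),
        (2, "AppleStoreSpider", "2022-08-29 11:00:00"),
        (2, "AppleStoreSpider", "2022-08-29 11:01:00"),
        (3, "OtherSpider", "2022-08-29 12:00:00"),
        (3, "OtherSpider", "2022-08-29 12:01:00"),
        (3, "OtherSpider", "2022-08-29 12:02:00"),
    ]
    conn.executemany("INSERT INTO collectionframework_linkstate VALUES (?, ?, ?)", rows)
    return conn
```

Passing the spider name as a bound parameter also sidesteps the double-quoted string literal, which some SQL modes treat as an identifier.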


Selv

Day 8

Wednesday 31 August 2022

Installed Postman on Legion and got LifeStance to return data

PlanetFitness only works with ScrapingBee, not ScraperAPI

My first PR-duty day


Selv

Sept 12
Password

Setting up password from 7:45 - 10:45
…continuation from yesterday

Chris helped me with cf3
vk_django_something=0.5.0
added the “=0.5.0”

Helped Vallée for 2 hours with BW3


Selv

Sept 14
Pull requests
15m https://github.com/verticalknowledge/cf1-locationsystem/pull/512
10m GitLab
20m GitHub

RCTs

Onboarding: cf3 setup

Retail2 RCT with Eric
Retail data is big with big maintenance
Discovery (Sundays) & Details (Tuesdays)
Retail is no fun
Bill 45m to training

Guidepoint: Evolus
https://jira.vkportal.com/browse/COL-14892
2h research


Selv

Sept 15
LifeStance

There's a lot going on with this spider.

The spider runs in two (2) parts: first LifeStanceSpider runs, then it spawns DigitalLifeStanceSpider.

April
The first run, back in April, has almost 4000 locations with the attributes url, langs, areas, gender and type. The report did not include this data, though, because that information is reported by the second spider, which did not run for some reason. So the report contains only the default fields from the Location model.
May
The second run, in May, is the only one of the 18 total runs that actually spawned the second spider. It contains all the data and is in the correct csv format. From the 6700 link states it collected a little over 5000 locations.
June
The June run had almost 5200 locations, and we started to see the data in its current format, with the multiple locations in the attributes: “street_address_1”, “street_address_2”, “phone_1”, “phone_2”, etc.
July
July started the weekly runs, and the first two had about 5300 locations each. But then the spider encountered parsing errors and collected fewer than 200 locations each week.
August
In August the spider was collecting 1000 or more locations each week but still had the same invalid state parsing issue. Toward the end of the month it began hitting timeout limits and collected nothing.
September
With commit 8a684be1, the Sept 9th batch saw a return to data collection with almost 5500 locations, but it is still not generating the correct report because the second spider is still not spawning.

The data in the Sept 9th fetch looked like this:

name "Doylestown, PA – 4259 W Swamp Rd"
url "https://lifestance.com/l…town-pa-4259-w-swamp-rd/"
address "4259 W Swamp Rd, Suite 404\nDoylestown, PA 18902"
phone "6108923800"

The data in the Sept 13th fetch looked like this:
name "Doylestown, PA – 4259 W Swamp Rd"
url "https://lifestance.com/l…town-pa-4259-w-swamp-rd/"
address ""
suite "4259 W Swamp Rd, Suite 404"
city "Doylestown"
state "PA"
zip "18902"
phone "6108923800"
Blocked to Eric, arg!
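The two fetches above differ in shape: Sept 9th has one combined address string, Sept 13th has separate fields (with the street landing in suite). A minimal sketch of splitting the combined form into components, assuming the "street line, newline, City, ST ZIP" layout seen in the Sept 9th sample holds for other locations. The function name and the output keys are hypothetical, not the spider's real field names.

```python
import re

# Street on the first line; "City, ST 12345" (optionally ZIP+4) on the second.
ADDRESS_RE = re.compile(
    r"^(?P<street>.+)\n"
    r"(?P<city>[^,\n]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$"
)


def split_address(raw):
    """Split a two-line US address into street/city/state/zip, or return None if it does not fit."""
    match = ADDRESS_RE.match(raw.strip())
    if match is None:
        return None
    return match.groupdict()
```

Returning None for anything that does not fit keeps odd addresses visible instead of silently writing them into the wrong column.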

Local CF3

This is one for ERIC!

Setting up new spider to run:
- [ ] Go to UI Dashboard http://127.0.0.1:8000/
- [ ] Go to spider
- [ ] Go to Settings
- [ ] REFETCH_LIMIT cannot be None, make it 1

Doc

The entire “training” channel in Slack could be in documentation


Selv

Sept 19
Pull Requests

Dockerfile

FROM ubuntu:21.10
maybe we can use Debian instead of Ubuntu


Vroom fetching done, not parsing


Selv

Sept 29
Errors

Vroom

https://carsalesystem.cf1.vk.ai/system/batch-review/119922

Blank Parse Statuses Found 1 blank parse statuses, 0 allowed


Cars.com

https://jira.vkportal.com/browse/COL-16579

https://www.cars.com/dealers/14607/mccluskey-chevrolet/inventory/?page=1&stock_type=used

Regex error:
Found 16 errors, 0 allowed
File "/projects/carsalesystem/src/django-vehicleapp/vehicleapp/spiders/cars.py", line 258, in process_dealership_vehicles

num_vehicles = re.capture(r'\d+', matches, required=False) or '0'
TooManyCaptureException: 'Too many matches found (num_vehicles)'
1,118 matches
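One plausible reading of the error above: the page text is "1,118 matches", and the pattern \d+ finds two numbers in it ("1" and "118") because of the thousands separator. The re.capture call in the traceback is not part of Python's standard re module, so it is presumably a framework helper. The sketch below uses a hypothetical stand-in, capture_single, that likewise refuses more than one match, and a pattern that treats commas as part of the number.

```python
import re


def capture_single(pattern, text):
    """Return the only match of `pattern` in `text`, None if there is none, and raise if there are several."""
    found = re.findall(pattern, text)
    if len(found) > 1:
        raise ValueError(f"Too many matches found: {found!r}")
    return found[0] if found else None


def parse_count(text):
    """Read a count such as '1,118 matches' as an int; default to 0 when no number is present."""
    raw = capture_single(r"\d[\d,]*", text)
    return int(raw.replace(",", "")) if raw else 0
```

With the comma inside the character class, "1,118" is captured as one token and converted after the separator is stripped.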


Evolus

https://jira.vkportal.com/browse/COL-14892

File "/home/code/system/src/django-locationapp/locationapp/spiders/12WestCollection/evolus.py", line 220, in process_specialist
TypeError: expected string or buffer site_id=site_id,
https://locationsystem.cf1.vk.ai/system/group/3/source/991/fetch-batch/200281/parse-batch/205666/

Phone:
- [ ] ‚Äã‚Äã‚Äã(212) 570-2067
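The three characters repeated in front of the phone number above are what the UTF-8 bytes of a zero-width space (U+200B) look like when decoded as Mac Roman, so the source page most likely has three invisible characters before the number. Whether that is what triggers the TypeError is an assumption. A minimal sketch of cleaning such a value, with a hypothetical helper name:

```python
import re

ZERO_WIDTH_SPACE = "\u200b"
# U+200B encoded as UTF-8 (E2 80 8B) and then misread as Mac Roman.
MOJIBAKE_ZWSP = "\u201a\u00c4\u00e3"


def clean_phone(raw):
    """Drop zero-width spaces (real or mojibake) and keep only the digits."""
    text = raw.replace(MOJIBAKE_ZWSP, "").replace(ZERO_WIDTH_SPACE, "")
    return re.sub(r"\D", "", text)
```

Stripping to digits yields the same format as the LifeStance phone values elsewhere in these notes (for example 6108923800).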

Errors:
https://locationsystem.cf1.vk.ai/system/group/0/source/991/fetch-batch/200281/parse-batch/205666/filter/?process_status=3#

Recursion (mark these as bad in code?):
- [ ] https://www.evolus.com/specialists/wendie-do/
- [ ] http://indirect.me/?url=https%3A//www.evolus.com/specialists/taily-alonso-msn-aprn/
- [ ] http://indirect.me/?url=https%3A//www.evolus.com/specialists/tara-howell-md/
- [ ] https://www.evolus.com/specialists/shqipe-rn/
- [ ] http://indirect.me/?url=https%3A//www.evolus.com/specialists/melissa-austin-md/
- [ ] https://www.evolus.com/specialists/elena-ginsberg/
- [ ] https://www.evolus.com/specialists/david-weitzman-md/

Jpg?:
- [ ] https://www.evolus.com/specialists/laura-walsh-pa-c/
- [ ] Etc... maime moose?
- [ ] If resolves to jpg, ignore


Comp

Jason said to comp my monitor in NetSuite (see email)


Selv

Oct 5
Eric’s Team Meeting

Matt Daye - lead Docker dev w/ Spencer

S3 coming offline to save $50k/mo

CF-new-generation

Gemini is a VK product that provides egress points (VPN)

Revisions

Vroom

add missing link-state status
https://github.com/verticalknowledge/cf1-carsalesystem/pull/219
https://carsalesystem.cf1.vk.ai/system/group/0/source/137
https://jira.vkportal.com/browse/COL-16885


Selv

Oct 17
Week to talk to Eric

Things to work on:

- [ ] Async
- [ ] Documentation: Sphinx
- [ ] Logging ??? What is collectionframework.models.LogEntry
- [ ] Testing: method return types ??? What is test_these_links, test_random_limit
- [ ] Performance: bottlenecks / tuning
- [ ] Docker
- [ ] Code style discussions
- [ ] Slack channel for code snippets: shell, python, perl, rust, go
- [ ] VK University
- [ ] VK Blog (external / public)
- [ ] VK TV
- [ ] Cost analysis: $50K/mo S3 bill


Horizontal alignment

Nesting

ls = self.create_link_state(url=self.url, name='Video Gaming Reports')
self.process_reports(ls)

Vs.

self.process_reports(
    self.create_link_state(
        url=self.url,
        name='Video Gaming Reports',
    )
)


Selv

Oct 21
Talked to Eric

Give it 2 more weeks in the fires

Monster in prod

Accidental sql queries to cf1 instead of cf3

SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST

kill query 130461309;
kill query 130462218;
kill query 130462421;
kill query 130462461;
kill query 130462532;
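The kill statements above were typed one by one from the process list. A small sketch of generating them from PROCESSLIST rows instead, assuming the rows have already been fetched as dictionaries keyed by the standard column names (ID, USER, COMMAND, TIME). The threshold, the user filter, and the sample user names in the test are illustrative choices.

```python
def kill_statements(processlist, min_seconds=60, user=None):
    """Build KILL QUERY statements for running queries older than `min_seconds`."""
    statements = []
    for row in processlist:
        if row["COMMAND"] != "Query":
            continue  # sleeping or idle connections have nothing to kill
        if row["TIME"] < min_seconds:
            continue  # leave short-running queries alone
        if user is not None and row["USER"] != user:
            continue  # only touch our own queries when a user is given
        statements.append(f"KILL QUERY {int(row['ID'])};")
    return statements
```

Reviewing the generated statements before running them keeps a person in the loop, which matters on a shared production database.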


Selv

Nov 28
PR Duty

Helped Svetoslav with his MR

It would be good to be able to see for a spider:

- [ ] Is it part of RCS or manual?
- [ ] When is the next schedule? (RCS spiders don’t reveal this in the UI)


Selv

https://youtube.com/shorts/HONwms_ksvU?feature=share


Selv

Day 1

Monday 22 August 2022

Jamie greet
Veronica tour
Cla
Slack for Eric & dev 95% on Slack

Mac Welcome!
ID card: 14706 or 36987

Jess project manager

Training channel on Slack

I’m #17 + 2 more soon
New team, 12 been here less than a year

2 Training spiders: location data, done by thurs
Pyquery / regex
Pycharm
SQL workbench
Docker
Then integrate them

CF1 py2.7
CF2 There is no CF2
CF3 py3
CFx Next Generation, dream phase

Jira tickets are what I look at every day
Tempo tab
Log to training at first VK-10 everyday

Lunch around noon

Team stand up Jason 11:30 (scrum)

Confluence plugin for Jira does docs, calendars, vacations

Monthly meetings with Jason for 1on1 to talk about stuff

Open to improvements

LinkedIn

locati

We all get PR days for code review

External:
X5 outside pricy
Brightdata
ScraperAPI

Clients
Raven is #1


Selv

Day 3

Wednesday 24 August 2022

Framework walk-thru with Eric

Laptop setup


Selv

Day 5

Friday 26 August 2022

Eric help with Advance Auto Parts investigation

Improvements: parameterized

Lunch & Learn for Jira


Selv

Day 7

Tuesday 30 August 2022

Advance Auto Parts ready

Jess’ documentation meeting with collections

cf1-locationsystem is very active, so it is the only system where developers do not merge code ad hoc; instead, code is merged every day at 4:00 PM


Selv

Sept 6
Jason

RCT

Master/child

Update master if I have notes

Cyberduck for S3 buckets

[|] link

RCT / SD can have S3 links or upload

If not valid then spend only 15 mins to see what the problem is, if can’t find problem then block

Collection tickets are higher priority than RCTs

RCT have 16-18 hours / week / dev

Workflow: if the ticket has a ReadyForQA status, attach a report; otherwise, if its next status is Valid (Done), just validate it and it will go to QA with no report

@here don't use hn4 for your proxies....if you use hn4 you're gonna have a bad time. use hn1
class DigitalLifeStanceSpider(SpiderBase):
default_proxy_address = 'http://80146:EybWb4@hn1.nohodo.com:6505/'


Selv

Sept 13
Mentoring

Helped Vallee again for close to an hour with Git PR - 1hr

Still working on LifeStance

QA & Data overview - 1hr


Selv

Sept 9
Late

Office hours 1h Courtney says Jess’s My Tasks dashboard is locked down somehow and that’s why I can’t edit my copies of it
db access still denied: prob bc wrong vpn
time tempo timesheets still not working, bc 100+ employees

Keeper Stavros13!
40m

Installing cf3 to move Planet Fitness
2h

Vpn just disconnected and can’t reconnect

Can’t install cf3
make migrate
File "/home/retailcf3/.local/lib/python3.8/site-packages/debug_toolbar/apps.py", line 10, in <module>
from django.utils.translation import ugettext_lazy as _
ImportError: cannot import name 'ugettext_lazy' from 'django.utils.translation' (/home/retailcf3/.local/lib/python3.8/site-packages/django/utils/translation/__init__.py)
make: *** [sync_system] Error 1

Tempo timesheet still not working; it has been down since at least yesterday


Selv

Sept 16
Eric not Happy

and just for transparency, the reason I keep bringing up productivity and efficiency is because we brought you in as a SE III, that's about as high as we get on the team. Most of the team are SE I or SE II, but your experience in the field is greater than theirs. The goal is to get you on some of the harder stuff and so far, we haven't really gotten to that level yet. I'm confident we will get there, I know learning the framework takes time.

So there’s that

Eric finished LifeStance when I failed to finish the second spider

Ported PlanetFitness from cf1 to cf3


Selv

Sept 21
Dillon

Helped with retail2 RCT
https://jira.vkportal.com/browse/RCT-402784

For some reason it looks like PyCharm cf1-locationsystem has caches for the collection framework package

Worked til 11:00 PM finishing Planet Fitness report generator


Selv

Oct 1
Saturday

Deep Thoughts

Name

Framework needs a name: like “GrabberWorx”

Walk-thrus

Would be nice to have a walk-thru of component process of:

- [ ] framework - spiders: start, process-methods, reports etc… tasks
- [ ] Docker - Dockerfile, compose etc

Screen Time

Each person shares screen and team debugs for:

- [ ] PyCharm debugger
- [ ] Docker setup so IDE sees framework packages (good on cf3, not cf1)


Selv

Oct 14
Notes

Proxies are http (not https)

http://80301:twpfpb@connect.vk.ai:6501
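The note above says the proxy itself speaks plain http, even when the target site is https. A sketch of building a proxies mapping in the shape the requests library accepts, where both schemes point at the same http:// proxy URL. Whether the framework uses requests is an assumption, and the credentials and host in the usage note are placeholders.

```python
from urllib.parse import quote


def build_proxies(user, password, host, port):
    """Return a proxies mapping that routes http and https traffic through one http proxy."""
    # Percent-encode credentials so characters like '@' or ':' cannot break the URL.
    proxy_url = f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}
```

For example, build_proxies("USER", "PASSWORD", "proxy.example.com", 6501) yields a dict that could be passed as the proxies argument of a requests call. Reading the credentials from the environment or a secrets store is safer than pasting them into code or notes.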

The codebase is ugly, sorry to say

Eric

I'm booked this afternoon, but I want to chat next week. There are a few things I want to cover. We can't keep up this pace. Apple for example, we told the client it would be done 30/Sep/22, it's 14/Oct/22 and is not even on track to go out today.
1:37
On track to deliver 0 COL tickets this week
1:38
So I’d like to understand what you are doing with your time, because there are only ~20 hours logged between those 2 tickets, and like I said, one has been open for 2 full weeks.


Selv

Oct 24
Refinitiv

Training presentation by Scott Jewell

Increasing client

Finance sector

Running on super old data but it’s our job to deal with it

SharePoint / OneNote

Web-grabbing
Email-grabbing
Ftp-grabbing

2019

Spiders should use Scott’s RefinitiveBaseSPider

RIC : Reuters Instrument Codes (rows description)

FID : Field Identifiers (column heading)

Value: table cell with the data
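The three terms above describe a table: RICs label the rows, FIDs label the columns, and each value is a cell. A minimal sketch of pivoting (RIC, FID, value) records into that shape as CSV. The instrument codes in the test are made up, and the real report layout may differ.

```python
import csv
import io


def records_to_csv(records):
    """Pivot (ric, fid, value) tuples into CSV text with one row per RIC and one column per FID."""
    fids = sorted({fid for _, fid, _ in records})
    rows = {}
    for ric, fid, value in records:
        rows.setdefault(ric, {})[fid] = value

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["RIC"] + fids)
    for ric in sorted(rows):
        # A missing (RIC, FID) pair becomes an empty cell.
        writer.writerow([ric] + [rows[ric].get(fid, "") for fid in fids])
    return buffer.getvalue()
```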

CSV file

pre-formatted

JSON file

format data
uploaded to S3, inserted into their db, removed from bucket
doesn’t actually need to be strictly correct JSON; we can use "val": 7.5000
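Trailing zeros are permitted by the JSON number grammar, so a value like 7.5000 still parses. The catch is that most serializers drop them: Python's json.dumps writes 7.5. A sketch of rendering one key/value pair with a fixed number of decimals, assuming four places is the format the client expects. The function name is hypothetical.

```python
import json


def render_value(key, value, places=4):
    """Render one JSON member with a fixed number of decimals, e.g. "val": 7.5000."""
    # json.dumps handles quoting and escaping of the key; the number is formatted by hand.
    return f"{json.dumps(key)}: {value:.{places}f}"
```

Formatting the number by hand is the design choice here: it keeps the fixed-width decimals that a round trip through json.dumps would lose.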

We probably don’t need to write a generate_reports method, just use the one in base class

RefinitiveEmailBase
fake link_states
superseded by S3FetchBase

S3FetchBase
not real link-states
everything handled in start method with helpers
errors are hard to find as link-states don’t report them

sort_s3_messages task kicks off jobs, not schedule
do not create folders in S3! Everything is a flat file and Cyberduck will create some random file
add new collections to this file

Keep LAST_UPDATED_DATE spider setting

PROD_TEST_MAP


Selv

Nov 22
Jason 1on1

I need to do better with attention to detail: I gave an xls report when they wanted csv

Keep SD in the loop: don’t spend 3 hours on something that maybe SD doesn’t really need


Selv

Dec 2
Staff

Brian COO

John (JPK) CTO
