Sept 12
Password
Setting up password from 7:45 - 10:45
…continuation from yesterday
Chris helped me with cf3
vk_django_something=0.5.0
added the “=0.5.0”
Helped Vallée for 2 hours with BW3
Sept 14
Pull requests
15m https://github.com/verticalknowledge/cf1-locationsystem/pull/512
10m GitLab
20m GitHub
RCTs
Onboarding: cf3 setup
Retail2 RCT with Eric
Retail data is big with big maintenance
Discovery (Sundays) & Details (Tuesdays)
Retail is no fun
Bill 45m to training
Guidepoint: Evolus
https://jira.vkportal.com/browse/COL-14892
2h research
Sept 15
LifeStance
There's a lot going on with this spider.
The spider runs in two parts: LifeStanceSpider runs first, then spawns DigitalLifeStanceSpider.
April
The first run, back in April, collected almost 4000 locations with the attributes url, langs, areas, gender and type. The report did not include this data, though, because those fields are reported by the second spider, which did not run for some reason. So the report contains only the default fields from the Location model.
May
The second run in May is the only one of the 18 total runs that actually spawned the second spider. It contains all the data and is in the correct csv format. For the 6700 link states it collected a little over 5000 locations.
June
The June run had almost 5200 locations and we started to see the data in its current format with the multiple locations in the attributes: “street_address_1”, “street_address_2”, “phone_1”, “phone_2”, etc.
July
July started the weekly runs and the first two had about 5300 locations each. But then the spider encountered parsing errors and collected fewer than 200 locations each week.
August
In August the spider was collecting 1000 locations or more each week but still having the same invalid state parsing issue. Toward the end of the month it began hitting timeout limits and not collecting anything.
September
With commit 8a684be1 the Sept 9th batch saw a return to data collection with almost 5500 locations but still not generating the correct report because the second spider is still not spawning.
The data in the Sept 9th fetch looked like this:
name "Doylestown, PA – 4259 W Swamp Rd"
url "https://lifestance.com/l…town-pa-4259-w-swamp-rd/"
address "4259 W Swamp Rd, Suite 404\nDoylestown, PA 18902"
phone "6108923800"
The data in the Sept 13th fetch looked like this:
name "Doylestown, PA – 4259 W Swamp Rd"
url "https://lifestance.com/l…town-pa-4259-w-swamp-rd/"
address ""
suite "4259 W Swamp Rd, Suite 404"
city "Doylestown"
state "PA"
zip "18902"
phone "6108923800"
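The two fetch shapes above could be folded into one record before reporting. A minimal sketch, assuming the `\n` in the Sept 9th address is a real newline and that the last line always reads "City, ST ZIP"; `normalize_location` and the output key `street` are made-up names, not anything in the framework:

```python
import re

# Matches the trailing "City, ST 12345" line of the older single-string address.
CITY_STATE_ZIP = re.compile(
    r'^(?P<city>.+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$'
)


def normalize_location(record):
    """Return street/city/state/zip/phone whichever fetch shape came in."""
    address = record.get('address') or ''
    if address:
        # Sept 9th shape: "street\nCity, ST ZIP" packed into one field.
        street, _, last_line = address.partition('\n')
        match = CITY_STATE_ZIP.match(last_line.strip())
        parts = match.groupdict() if match else {'city': '', 'state': '', 'zip': ''}
    else:
        # Sept 13th shape: address is empty and the street sits in "suite".
        street = record.get('suite', '')
        parts = {key: record.get(key, '') for key in ('city', 'state', 'zip')}
    return {'street': street.strip(), 'phone': record.get('phone', ''), **parts}
```

Run on the Doylestown record, both shapes should come out as the same dictionary, which would make the two batches comparable.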
Blocked to Eric, arg!
Local CF3
This is one for ERIC!
Setting up new spider to run:
- [ ] Go to UI Dashboard http://127.0.0.1:8000/
- [ ] Go to spider
- [ ] Go to Settings
- [ ] REFETCH_LIMIT cannot be None, make it 1
Doc
The entire “training” channel in Slack could be in documentation
Sept 19
Pull Requests
Dockerfile
FROM ubuntu:21.10
maybe we can use Debian instead of Ubuntu
Vroom fetching done, not parsing
Sept 29
Errors
Vroom
https://carsalesystem.cf1.vk.ai/system/batch-review/119922
Blank Parse Statuses Found 1 blank parse statuses, 0 allowed
Cars.com
https://jira.vkportal.com/browse/COL-16579
https://www.cars.com/dealers/14607/mccluskey-chevrolet/inventory/?page=1&stock_type=used
Regex error:
Found 16 errors, 0 allowed
File "/projects/carsalesystem/src/django-vehicleapp/vehicleapp/spiders/cars.py", line 258, in process_dealership_vehicles
num_vehicles = re.capture(r'\d+', matches, required=False) or '0'
TooManyCaptureException: 'Too many matches found (num_vehicles)'
1,118 matches
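One possible reading of this error: if "1,118 matches" is the page text being parsed, a bare `\d+` finds two runs of digits ("1" and "118") because of the thousands separator. A sketch using only the standard-library `re` module, not the in-house `re.capture` helper (whose behavior is not shown in these notes); `parse_count` is a hypothetical name:

```python
import re

# Comma-grouped number first ("1,118"), plain digits as the fallback ("42").
COUNT_RE = re.compile(r'\d{1,3}(?:,\d{3})+|\d+')


def parse_count(text):
    """Return the first count in the text as an int, or 0 if none is found."""
    match = COUNT_RE.search(text)
    if match is None:
        return 0
    return int(match.group(0).replace(',', ''))


# A bare \d+ splits the comma-grouped count into two separate matches.
assert re.findall(r'\d+', '1,118 matches') == ['1', '118']
assert parse_count('1,118 matches') == 1118
```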
Evolus
https://jira.vkportal.com/browse/COL-14892
File "/home/code/system/src/django-locationapp/locationapp/spiders/12WestCollection/evolus.py", line 220, in process_specialist
site_id=site_id,
TypeError: expected string or buffer
https://locationsystem.cf1.vk.ai/system/group/3/source/991/fetch-batch/200281/parse-batch/205666/
Phone:
- [ ] (212) 570-2067
Errors:
https://locationsystem.cf1.vk.ai/system/group/0/source/991/fetch-batch/200281/parse-batch/205666/filter/?process_status=3#
Recursion (mark these as bad in code?):
- [ ] https://www.evolus.com/specialists/wendie-do/
- [ ] http://indirect.me/?url=https%3A//www.evolus.com/specialists/taily-alonso-msn-aprn/
- [ ] http://indirect.me/?url=https%3A//www.evolus.com/specialists/tara-howell-md/
- [ ] https://www.evolus.com/specialists/shqipe-rn/
- [ ] http://indirect.me/?url=https%3A//www.evolus.com/specialists/melissa-austin-md/
- [ ] https://www.evolus.com/specialists/elena-ginsberg/
- [ ] https://www.evolus.com/specialists/david-weitzman-md/
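Several of the recursing links above are the same specialist pages wrapped in an `indirect.me/?url=` redirect. A sketch of one way to guard against revisiting them, assuming the wrapper always carries the target in the `url` query parameter; `unwrap` and `should_visit` are hypothetical helpers, not framework methods:

```python
from urllib.parse import parse_qs, urlparse

WRAPPER_HOST = 'indirect.me'  # host seen wrapping the real URL in the list above


def unwrap(url):
    """Return the wrapped target if the URL is a ?url= redirect, else the URL."""
    parsed = urlparse(url)
    if parsed.hostname == WRAPPER_HOST:
        targets = parse_qs(parsed.query).get('url')
        if targets:
            return targets[0]
    return url


def should_visit(url, seen):
    """Record the unwrapped URL in `seen` and report whether it is new."""
    target = unwrap(url)
    if target in seen:
        return False
    seen.add(target)
    return True
```

With a shared `seen` set, the wrapped and direct forms of a page count as one visit, so a page that links back to itself through the wrapper would be skipped the second time.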
Jpg?:
- [ ] https://www.evolus.com/specialists/laura-walsh-pa-c/
- [ ] Etc... MIME type?
- [ ] If resolves to jpg, ignore
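The "ignore if it resolves to a jpg" idea might look something like this. A sketch only: it assumes the final URL and the response's Content-Type header are available after redirects, and `looks_like_image` is a made-up name:

```python
import mimetypes
from urllib.parse import urlparse


def looks_like_image(final_url, content_type=None):
    """Guess whether a resolved link is an image rather than a page."""
    if content_type:
        # Prefer the header: "image/jpeg; charset=binary" -> "image/jpeg".
        media_type = content_type.split(';', 1)[0].strip().lower()
        return media_type.startswith('image/')
    # No header available: fall back to the file extension in the URL path.
    guessed, _ = mimetypes.guess_type(urlparse(final_url).path)
    return bool(guessed) and guessed.startswith('image/')
```

Checking the header first covers links whose path has no extension but still serve an image.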
Comp
Jason said to comp my monitor in NetSuite (see email)
Oct 5
Eric’s Team Meeting
Matt Daye - lead Docker dev w/ Spencer
S3 coming offline to save $50k/mo
CF-new-generation
Gemini is a VK product of egress points (VPN)
Revisions
Vroom
add missing link-state status
https://github.com/verticalknowledge/cf1-carsalesystem/pull/219
https://carsalesystem.cf1.vk.ai/system/group/0/source/137
https://jira.vkportal.com/browse/COL-16885
Oct 17
Week to talk to Eric
Things to work on:
- [ ] Async
- [ ] Documentation: Sphinx,
- [ ] Logging ??? What is collectionframework.models.LogEntry
- [ ] Testing: method return types ??? What is test_these_links, test_random_limit
- [ ] Performance: bottlenecks / tuning
- [ ] Docker
- [ ] Code style discussions
- [ ] Slack channel for code snippets: shell, python, perl, rust, go
- [ ] VK University
- [ ] VK Blog (external / public)
- [ ] VK TV
- [ ] Cost analysis: $50K/mo S3 bill
Horizontal alignment
Nesting
ls = self.create_link_state(url=self.url, name='Video Gaming Reports')
self.process_reports(ls)
Vs.
self.process_reports(
    self.create_link_state(
        url=self.url,
        name='Video Gaming Reports',
    )
)
Oct 21
Talked to Eric
Give it 2 more weeks in the fires
Monster in prod
Accidental sql queries to cf1 instead of cf3
SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST
kill query 130461309;
kill query 130462218;
kill query 130462421;
kill query 130462461;
kill query 130462532;
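Rather than typing each kill by hand, the statements could be generated from the process list. A rough sketch, assuming MySQL (the `KILL QUERY` syntax above suggests it) and rows fetched as dictionaries keyed by the `INFORMATION_SCHEMA.PROCESSLIST` column names; `kill_statements`, the user name and the 60-second threshold are illustrative only:

```python
def kill_statements(rows, user, min_seconds=60):
    """Build KILL QUERY statements for one user's long-running queries."""
    statements = []
    for row in rows:
        is_mine = row['USER'] == user
        is_running_query = row['COMMAND'] == 'Query'
        if is_mine and is_running_query and row['TIME'] >= min_seconds:
            # int() guards against anything but a numeric id reaching the SQL.
            statements.append(f"KILL QUERY {int(row['ID'])};")
    return statements
```

Filtering on the user first keeps other people's queries on the same server out of the list.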
Nov 28
PR Duty
Helped Svetoslav with his MR
It would be good to be able to see for a spider:
- [ ] Is it part of RCS or manual?
- [ ] When is the next schedule? (RCS spiders don’t reveal this in the UI)
Jan 18
Jason 1-1
I need to communicate early on my tickets as I’m approaching ballparks
https://www.zenrows.com/blog/bypass-cloudflare#what-is-cloudflare-bot-management
Vertical Knowledge
Paylocity
Company ID: 45540
Username: stevenalmeroth
Password: Stavros13
IT Services / Consulting
Chagrin Falls, Ohio
150 employees
Founded: 2006
Enabling companies and organizations to harness the power of Publicly Available Information
https://vk.ai/
As the saying goes, when everyone's digging for oil, it's good to be in the drill business. Data is today's oil and VK builds the most sophisticated and powerful drilling technology in the market.
harness PAI (Publicly Available Information, also known as public data) - the fastest-growing type of alternative data in the world
Senior Software Engineer
https://jobs.lever.co/vk/6bbea41b-4bea-4832-ae23-47ea08903909
Full-time remote
As an experienced senior level professional, you should be highly results oriented and comfortable assuming a lead role to other junior level colleagues, and being on client facing calls and communications.
Required:
Expert Python
Front-end web (CSS, HTML5, JavaScript, React)
SQL, Elasticsearch and/or NoSQL
Cloud services, preferably AWS
Skills:
Agile
team lead
client calls
documentation
Thursday, July 14, 2022 10:13 AM
Gino Conte
Talent Acquisition Partner
gino.conte@vk.ai
M: 216.374.3319
https://www.vk.ai
finance
spider/bot scrape PAI then sell it
gov spec.ops DOD to stay anonymous "support the troops"
U.S. DoD "Top Secret" Clearance
Wednesday, Jul 20, 2022 11:00 AM Eastern Time
Dan Kasmierski - Project Manager
Interview with SWE III/Sr. SWE candidate Steve Almeroth
Join ZoomGov Meeting: https://vk-ai.zoomgov.com/j/1602320262
"kaz" engineering team 5-10, collection team 20, devops 15-20
tech interview
what are my questions for the team?
5-YEAR PLANS FOR COMPANY
how much time on phone
Dan sounds real
Fosters trust culture
soldiers lives on the line
VK in growth phase, they need the bodies, pipeline is 10x from a few years ago
retool unification collection strategies
stepping from startup to mature mentality
Next: jes & erik partial tech interview
Then: exec interview
great interview!
Wednesday, July 27, 2022 3:00 PM
called Gino to verify it will be a tech interview
Eric Mocny: Director of Collections, Sr software dev, at VK 11 years
collection is a gateway role, can move to other depts after 6-mos or so
eric came from web-design and started when VK had only 12 people (moving from perl to python)
many branches off the vk tree
VK is evolving right now into something new: emerging out into the world
looking to scale
CF1: Collection Framework 1 - in-house Django framework, amazon ec2
CF3: new, the future
CF? - something new new
pycharm
docker
wk1: training spiders
wk2: writing spiders
custom collections, not a data wholesaler
bot detection
1-wk sprints
my experience is rare: I have many tools
team lead role?
now fully remote
still a cool office in CFalls
Friday, August 5, 2022 10:00 AM
Aaron Smith, VP of Collections
been at VK since beginning with Matt Carpenter CEO
aaron putting together a team
gave backstory
clearance - people come home safe to their families
my careers - my path aligns well with vk's
mentorship
questions
Turnover
decided not to sell out the soldiers
loyalty
5/6 people turnover
Soldiers get to go home
"shoe spider"?
POI system
keep soldiers alive with video face recognition
my new challenges
r & d / exploratory
anti-bot measures
q4 2022 - what does the next 10 years of collections look like?
mentorship
frameworks
- based on Scrapy
-
-
remote
pre covid 80% worked in CF
family
won't miss dinners
John CTO
Friday, August 19, 2022 11:11 AM
cameron or clayton are tech guys: ask about using my desktop at home
Santa Cruz Blur TR Fox Performance 120mm/115mm SRAM GX $5900
Sept 13
Mentoring
Helped Vallee again for close to an hour with Git PR - 1hr
Still working on LifeStance
QA & Data overview - 1hr
Sept 9
Late
Office hours 1h: Courtney says Jess’s My Tasks dashboard is locked down somehow, and that’s why I can’t edit my copies of it
db access still denied: prob bc wrong vpn
time tempo timesheets still not working, bc 100+ employees
Keeper Stavros13!
40m
Installing cf3 to move Planet Fitness
2h
Vpn just disconnected and can’t reconnect
Can’t install cf3
make migrate
File "/home/retailcf3/.local/lib/python3.8/site-packages/debug_toolbar/apps.py", line 10, in <module>
from django.utils.translation import ugettext_lazy as _
ImportError: cannot import name 'ugettext_lazy' from 'django.utils.translation' (/home/retailcf3/.local/lib/python3.8/site-packages/django/utils/translation/__init__.py)
make: *** [sync_system] Error 1
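For context: `ugettext_lazy` was deprecated in Django 3.0 and removed in Django 4.0 in favor of `gettext_lazy`, so this traceback suggests the installed debug_toolbar predates that change while Django itself is 4.x. A small diagnostic sketch to confirm what is installed; the function names are made up, and the PyPI package names `Django` and `django-debug-toolbar` are the only ones assumed:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def ugettext_removed(django_version):
    """True when this Django version no longer ships ugettext_lazy (4.0+)."""
    major = int(django_version.split('.')[0])
    return major >= 4


if __name__ == '__main__':
    for name in ('Django', 'django-debug-toolbar'):
        print(name, installed_version(name))
```

If Django reports 4.x, the likely options are upgrading django-debug-toolbar to a release that imports `gettext_lazy`, or pinning Django below 4.0; which one fits cf3 is not something these notes settle.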
Time Tempo time sheet still not working, and has not been working since at least yesterday
Sept 16
Eric not Happy
and just for transparency, the reason I keep bringing up productivity and efficiency is because we brought you in as a SE III, thats about as high as we get on the team. Most of the team are SE I or SE II, but your experience in the field is greater than theirs. The goal is to get you on some of the harder stuff and so far, we havent really gotten to that level yet. Im confident we will get there, I know learning the framework takes time.
So there’s that
Eric finished LifeStance when I failed to finish the second spider
Ported PlanetFitness from cf1 to cf3
Sept 21
Dillon
Helped with retail2 RCT
https://jira.vkportal.com/browse/RCT-402784
For some reason it looks like PyCharm cf1-locationsystem has caches for the collection framework package
Worked til 11:00 PM finishing Planet Fitness report generator
Oct 1
Saturday
Deep Thoughts
Name
Framework needs a name: like “GrabberWorx”
Walk-thrus
Would be nice to have a walk-thru of component process of:
- [ ] framework - spiders: start, process-methods, reports etc… tasks
- [ ] Docker - Dockerfile, compose etc
Screen Time
Each person shares screen and team debugs for:
- [ ] PyCharm debugger
- [ ] Docker setup so IDE sees framework packages (good on cf3, not cf1)
Oct 14
Notes
Proxies are http (not https)
http://80301:twpfpb@connect.vk.ai:6501
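The "http, not https" point means the proxy URL itself keeps the `http://` scheme even when the request going through it is https. A sketch with placeholder credentials and host (nothing here is a real endpoint); `proxy_map` is a hypothetical helper:

```python
from urllib.parse import quote


def proxy_map(user, password, host, port):
    """Route both http and https traffic through one http:// proxy URL."""
    # Percent-encode the credentials so characters like "@" or ":" stay valid.
    credentials = f'{quote(user, safe="")}:{quote(password, safe="")}'
    proxy_url = f'http://{credentials}@{host}:{port}'
    return {'http': proxy_url, 'https': proxy_url}
```

The returned mapping has the shape that `urllib.request.ProxyHandler` accepts, and the same shape can be passed as `proxies=` to the requests library.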
The codebase is ugly, sorry to say
Eric
Im booked this afternoon, but I want to chat next week. There are a few things I want to cover. We cant keep up this pace. Apple for example, we told the client it would be done 30/Sep/22, its 14/Oct/22 and is not even on track to go out today.
1:37
On track to deliver 0 COL tickets this week
1:38
So I’d like to understand what you are doing with your time, because there are only ~20 hours logged between those 2 tickets, and like I said, one has been open for 2 full weeks.
Oct 24
Refinitiv
Training presentation by Scott Jewell
Increasing client
Finance sector
Running on super old data but it’s our job to deal with it
Sharepoint / one note
Web-grabbing
Email-grabbing
Ftp-grabbing
2019
Spiders should use Scott’s RefinitiveBaseSPider
RIC : Reuters Instrument Codes (rows description)
FID : Field Identifiers (column heading)
Value: table cell with the data
CSV file
pre-formatted
JSON file
format data
uploaded to S3, inserted into their db, removed from bucket
doesn’t actually need to be strictly correct JSON, we can use "val": 7.5000
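A sketch of how a table could be flattened into RIC / FID / value records with the fixed-decimal `"val": 7.5000` style. The key names `ric` and `fid`, both function names, and the four-decimal default are assumptions for illustration, not the format the base class actually writes:

```python
import json


def format_record(ric, fid, value, places=4):
    """Build one JSON object line, writing the value with fixed decimals."""
    return '{"ric": %s, "fid": %s, "val": %.*f}' % (
        json.dumps(ric), json.dumps(fid), places, value)


def table_to_records(fids, rows):
    """Flatten a table: `rows` maps each RIC to its cell values, one per FID."""
    return [
        format_record(ric, fid, value)
        for ric, values in rows.items()
        for fid, value in zip(fids, values)
    ]
```

The value is formatted by hand because `json.dumps` would write 7.5 rather than 7.5000; the resulting line still parses with `json.loads`.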
We probably don’t need to write a generate_reports method, just use the one in base class
RefinitiveEmailBase
fake link_states
superseded by S3FetchBase
S3FetchBase
not real link-states
everything handled in start method with helpers
errors are hard to find as link-states don’t report them
sort_s3_messages task kicks off jobs, not schedule
do not create folders in S3! Everything is a flat file and Cyberduck will create some random file
add new collections to this file
Keep LAST_UPDATED_DATE spider setting
PROD_TEST_MAP
Nov 22
Jason 1on1
I need to do better with Attention to detail: I gave an xls report when they wanted csv
Keep SD in the loop: don’t spend 3 hours on something that maybe SD doesn’t really need
Jan 30
Company Meeting
Brian O’Keefe announcing
John (JPK) moving from CTO to advisory role, Aaron (Sharrow?, not Smith) to fill his shoes
Food Storage
8 x Patriot Pantry 2-week buckets @ 2000 calories /day /person
3 x 88 servings protein: 32 meat, 56 beans
Trek Top Fuel 9.7 Fox Performance 120mm/130mm Shimano $4500
HD
JB weld
Supply turnouts (purple) 3
Turnout brace 2
Pex valves 5
Escutcheon 5
1/2" round KO box cable conns