Telegram channel bdscience - Big Data Science


Big Data Science channel gathers together all the interesting facts about Data Science. For cooperation: a.chernobrovov@gmail.com 💼 — https://t.me/bds_job — a channel about Data Science jobs and careers 💻 — https://t.me/bdscience_ru — Big Data Science [RU]


Big Data Science

💥Top sources of various datasets for data visualization
FiveThirtyEight is a journalism site that makes the datasets behind its stories available to the public. These provide well-researched data suitable for visualization, including sets on airline safety, election predictions, and U.S. weather history. The sets are easily searchable, and the site is continually updated.
Earth Data offers science-related datasets for researchers in open-access formats. The information comes from NASA data repositories, and users can explore everything from climate data to specific regions such as the oceans, to environmental challenges like wildfires. The site also includes tutorials and webinars, as well as articles. The rich data supports environmental visualizations and includes data from scientific partners as well.
The GDELT Project collects events at a global scale, offering one of the biggest data repositories on human civilization. Researchers can explore people, locations, themes, organizations, and other types of subjects. The data is free, and users can also download raw datasets for unique use cases. The site also offers a variety of tools for users with less experience building their own visualizations.
Singapore Public Data is another civic data source: the Singapore government makes these datasets available for research and exploration. Users can search by subject through the navigation bar or enter search terms themselves. Datasets cover subjects like the environment, education, infrastructure, and transport.


Big Data Science

📚Top Data Science Books 2022
Ethics and Data Science - in this book, the author introduces the principles of ethical work with data and what it takes to implement them today.
Data Science for Economics and Finance - this book covers data science, including machine learning, social network analysis, web analytics, time series analysis, and more, in relation to economics and finance.
Leveraging Data Science for Global Health - this book explores the use of information technology and machine learning to fight disease and promote health.
Understanding Statistics and Experimental Design - the book provides the foundations needed to properly understand and interpret statistics, covering the key concepts and discussing why experiments are often not reproducible.
Building Data Science Teams - the book covers the skills, perspectives, tools, and processes needed to grow data science teams.
Mathematics for Machine Learning - this book covers the basics of mathematics (linear algebra, geometry, vectors, etc.), as well as the main problems of machine learning.


Big Data Science

👑Working with Graphs in Python with NetworkX
NetworkX is a library designed to create, study, and manipulate graphs and other network structures. The library is free and distributed under the BSD license. NetworkX is used for teaching graph theory, as well as for scientific research and for solving applied problems. NetworkX has a number of strong benefits, including:
High performance - NetworkX can comfortably handle large networks containing up to 10 million vertices and 100 million edges between them. This is especially useful when analyzing Big Data - for example, dumps from social networks that unite millions of users.
Easy to use - since the NetworkX library is written in Python, working with it is easy for both professional programmers and amateurs. The graph visualization modules make the result easy to inspect and correct in real time. To create a full-fledged graph, you need only 4 lines of code (one of them is just an import):
import networkx as nx
G = nx.Graph()
G.add_edge(1, 2)
G.add_edge(2, 3, weight=0.9)
Efficiency - because the library is built on Python's low-level data structures, hardware and software resources are used efficiently. This improves the ability to scale graphs and also reduces dependence on the particulars of the hardware platform and operating system.
Ongoing support - detailed documentation has been developed for NetworkX, describing the functionality and limitations of the library. The repositories are constantly updated and contain ready-made standard solutions that greatly simplify a programmer's work.
Open source code - the user gets great opportunities for customizing and extending the functionality of the library, adapting it to specific tasks. If desired, users can develop their own additional software for working with the library.
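To tie the 4-line example above to the visualization modules mentioned earlier, here is a minimal sketch (assuming matplotlib is also installed) that builds the same small weighted graph and draws it:
import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_edge(1, 2)
G.add_edge(2, 3, weight=0.9)

# draw the graph with node labels; the layout is computed automatically
nx.draw(G, with_labels=True)
plt.show()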


Big Data Science

😎What is Pandas and why is it so good?
Pandas is a Python library for processing and analyzing structured data; its name comes from "panel data". Panel data refers to information obtained through research and structured in the form of tables. The Pandas library was created to work with such data arrays in Python.
The library is based on the DataFrame, a table-like data structure. Any tabular representation of data, such as a spreadsheet or a database table, can be loaded into a DataFrame. A DataFrame object is made up of Series objects: one-dimensional arrays that share a name and a data type. A Series can be thought of as a table column. Pandas has in its arsenal such advantages as:
Easy handling of messy data in an organized form - dataframes have indexes for easy access to any element.
Flexible reshaping: adding, deleting, and appending new or old data.
Intelligent indexing for manipulating and managing columns and rows.
Quick merging and joining of datasets by index - for example, combining two or more Series objects into one DataFrame.
Support for hierarchical indexing - the ability to combine columns under a common category (MultiIndex).
Open access - Pandas is an open source library, meaning its source code is in the public domain.
Detailed documentation - Pandas has its own official website, which contains detailed documentation with explanations. More details can be found at the following link: https://pandas.pydata.org/docs/
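A minimal sketch of these ideas (the column names and values are arbitrary examples):
import pandas as pd

# two Series objects combined into one DataFrame
name = pd.Series(["Alice", "Bob"])
weight = pd.Series([55, 80])
df = pd.DataFrame({"name": name, "weight": weight})

# label-based and positional indexing
print(df.loc[0, "name"])   # 'Alice'
print(df.iloc[1])          # the second row

# hierarchical indexing (MultiIndex)
idx = pd.MultiIndex.from_tuples([("a", 1), ("a", 2)], names=["grp", "n"])
s = pd.Series([10, 20], index=idx)
print(s["a"])              # both rows of group 'a'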


Big Data Science

💡Top 6 data sources for deep diving into Machine Learning
Chronic Disease Data is a source where you can find data on various chronic diseases in the United States.
IMF Data - the International Monetary Fund publishes data on international finance, debt indicators, foreign exchange reserves, investments, and so on.
Financial Times Market Data - contains information about financial markets around the world, including indicators such as commodities, currencies, and stock price indices.
ImageNet is an image dataset for new algorithms, organized according to the WordNet hierarchy, where hundreds to thousands of images represent each node of the hierarchy.
Stanford Dogs Dataset - contains a huge number of images of various dog breeds.
HotpotQA Dataset - question-answer data for building systems that answer questions in an explainable way.


Big Data Science

🤔Why is the NumPy library so popular in Python?
NumPy is a Python language library that adds support for large multi-dimensional arrays and matrices, as well as high-level (and very fast) math functions to operate on these arrays. This library has several important features that have made it a popular tool.
First, you can find its source code on GitHub, which is why NumPy is called an open-source module for Python: https://github.com/numpy/numpy/tree/main/numpy
Second, NumPy is written in C. C is a compiled language: constructions written according to its standards and rules are converted into machine code, a set of instructions for a particular type of processor. The conversion is performed by a special compiler program, which is why all calculations run fast.
Let’s compare the performance of NumPy arrays and standard Python lists with the code below:
import numpy
import time

# one million elements as plain Python lists and as NumPy arrays
list1 = range(1000000)
list2 = range(1000000)
array1 = numpy.arange(1000000)
array2 = numpy.arange(1000000)

# element-wise multiplication with Python lists
initialTime = time.time()
resultantList = [(a * b) for a, b in zip(list1, list2)]
print("Time taken by Python lists :", (time.time() - initialTime), "secs")

# the same operation, vectorized with NumPy
initialTime = time.time()
resultantArray = array1 * array2
print("Time taken by NumPy :", (time.time() - initialTime), "secs")
As a result of the test, we can see that NumPy arrays (about 0.002 sec) are much faster than standard Python lists (about 0.11 sec).
Performance differs across platforms due to software and hardware differences. The default bit generator has been chosen to perform well on 64-bit platforms. Performance on 32-bit operating systems is very different. You can see the details here: https://numpy.org/doc/stable/reference/random/performance.html#performance-on-different-operating-systems


Big Data Science

💥Not only Copilot: try Codex by OpenAI
OpenAI released Codex models based on GPT-3 that can interpret and generate code. Their training data contains both natural language and billions of lines of public code from GitHub. These neural network models are best versed in Python and speak over a dozen languages, including JavaScript, Go, Perl, PHP, Ruby, Swift, TypeScript, SQL, and even Shell. Codex is useful for the following tasks:
• Generate code from comments
• Complete a statement or function in context
• Find a useful library or API call
• Add comments
• Refactor code for efficiency
https://beta.openai.com/docs/guides/code
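A minimal sketch of calling a Codex model through the openai Python package (the model name, prompt, and parameters here are illustrative assumptions; see the guide above for the current API):
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: your own API key

# generate Python code from a natural-language comment
response = openai.Completion.create(
    model="code-davinci-002",    # a Codex model name at the time of writing
    prompt="# Return the factorial of n\ndef factorial(n):",
    max_tokens=64,
    temperature=0,
)
print(response["choices"][0]["text"])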


Big Data Science

🤔SQLite as an embedded database in Python
SQLite is an embedded, file-based relational database management system (RDBMS) that can be used in Python applications without installing additional software. It is enough to import the built-in sqlite3 Python library to use SQLite.
First, create a database connection by importing sqlite3 and then calling the .connect() method with the name of the database to be created, e.g. new_database.db.
import sqlite3
conn = sqlite3.connect('new_database.db')
Before creating a table, you need to create a cursor: an object used to execute SQL queries over the connection. To do this, call the .cursor() method on the connection you created.
c = conn.cursor()
You can then use the .execute() method to create a new table in the database. Inside the quotes is written the usual SQL query syntax used to create a table in most DBMS. For example, creating a table through the CREATE TABLE statement:
c.execute("""CREATE TABLE new_table (
name TEXT,
weight INTEGER)""")
After filling the table with data, you can execute standard SQL queries on it to select and change values.
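To round out the example, here is a minimal sketch of inserting and selecting rows (the values are arbitrary; note the commit and close at the end):
# insert rows; parameterized queries protect against SQL injection
c.execute("INSERT INTO new_table VALUES (?, ?)", ("Rex", 12))
c.execute("INSERT INTO new_table VALUES (?, ?)", ("Murka", 4))
conn.commit()

# select values back
for row in c.execute("SELECT name, weight FROM new_table WHERE weight > ?", (5,)):
    print(row)

conn.close()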
https://towardsdatascience.com/yes-python-has-a-built-in-database-heres-how-to-use-it-b3c033f172d3


Big Data Science

🤔PyPy vs CPython: Under the Hood of Python
Every Python developer knows about CPython, the most common implementation: a virtual machine that interprets your code. As an alternative to CPython, there is PyPy, built using the RPython language. Compared to CPython, PyPy is faster; it implements Python 2.7.18, 3.9.15, and 3.8.15 and supports most commonly used Python library modules. The x86 version of PyPy runs on several platforms, such as Linux (32/64 bit), macOS (64 bit), Windows (32 bit), OpenBSD, and FreeBSD. Non-x86 builds are available on Linux, and ARM64 on macOS.
However, PyPy doesn't speed up code in short-running processes that take less than a couple of seconds: the JIT compiler won't have enough time to "warm up". PyPy also gives no speed gain if all the execution time is spent in runtime libraries (i.e. in C functions) rather than in actually executing Python code. PyPy therefore works best for long-running programs where the bigger part of the time is spent executing Python code.
In terms of memory consumption, PyPy can also outperform CPython: Python programs with high RAM consumption (hundreds of MB or more) can end up taking up less space in PyPy than in CPython.
https://www.pypy.org/features.html


Big Data Science

🌲TOP-5 DS-events in December 2022:
1. Nov 28-Dec 9 • NeurIPS 2022 https://nips.cc/
2. Dec 5-6 • 7th Global Conference on Data Science and Machine Learning, Dubai,UAE https://datascience.pulsusconference.com/
3. Dec 7 • Data Science Salon NYC | AI and ML in Finance & Technology • New York, NY, USA https://www.datascience.salon/newyork/
4. Dec 12-16 • The 20th Australasian Data Mining Conference 2022 (AUSDM’22) • Sydney, Australia + Virtual https://ausdm22.ausdm.org/
5. Dec 17-18 • 3rd International Conference on Data Science and Cloud Computing (DSCC 2022), Dubai, UAE https://cse2022.org/dscc/index


Big Data Science

Yandex named the laureates of its annual scientific award

Scientists who are engaged in research in the field of computer science will receive one million rubles for the development of their projects. In 2022, six young scientists became laureates:

Maxim Velikanov — is engaged in the theory of deep learning, studies infinitely wide neural networks and statistical physics;

Petr Mokrov — studies Wasserstein gradient flows, nonlinear filtering and Bayesian logistic regression;

Maxim Kodryan — is engaged in deep learning, as well as optimization and generalization of neural network models;

Ruslan Rakhimov — works with neural visualization, CV and deep learning;

Sergey Samsonov — studies Monte Carlo algorithms with Markov chains, stochastic approximation and other topics;

Taras Hahulin — works in the field of computer vision.

It's also cool that scientific supervisors are singled out separately. This year, two people received grants — Dmitry Vetrov, Head of the HSE Center for Deep Learning and Bayesian Methods, and Alexey Naumov, Associate Professor at the HSE Faculty of Computer Science and Head of the International Laboratory of Stochastic Algorithms and Analysis of Multidimensional Data.

More information about the awards and laureates of 2022 can be found on the website


Big Data Science

🔥Data validation in a Python script with the pydantic library
This lightweight library allows you to validate data and manage settings using Python type annotations, enforcing type hints at runtime. The library reports errors when the data is invalid. It is enough to declare what the data should look like in pure, canonical Python and check it with pydantic.
To be fair, pydantic is primarily a parsing library, not a validation library: it guarantees the types and constraints of the output model, not of the input data. However, while validation is not pydantic's core purpose, it can be used for this.
The main way to define objects in pydantic is via models: classes derived from the BaseModel base class. You can think of models as the data types of a strongly typed language. Untrusted data can be passed into a model, and after parsing and validation pydantic guarantees that the fields of the resulting model instance match the field types defined on the model.
The library plays well with any IDE and is very fast: a large part of it is compiled with Cython. Pydantic validates complex data structures using recursive models and lets you define custom data type behavior with validator decorators that parse and validate input data.
It is noteworthy that this open-source project is used by many companies and products, including FastAPI, Jupyter, Microsoft, AWS, Uber, etc. You can try it right now: install it with the pip package manager, then import its BaseModel base class in your Python script:
pip install pydantic
from pydantic import BaseModel
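A minimal sketch of a pydantic model (the field names and values are arbitrary examples):
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    id: int
    name: str = "Jane Doe"   # a default value makes the field optional

user = User(id="123")        # the string is parsed (coerced) to int
print(user.id)               # 123

try:
    User(id="not a number")
except ValidationError as e:
    print(e)                 # explains which field failed and why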
https://pydantic-docs.helpmanual.io/


Big Data Science

🚀How to scale up Pandas with the Pandarallel library
Every Data Scientist knows that the Pandas Python library is quite slow and doesn't cope well with large amounts of data. However, every Data Scientist uses it. 🤷‍♀️ To make Pandas faster, you can include Pandarallel in your project: a simple and convenient tool for parallelizing Pandas operations across all available processors.
Pandas only uses a single CPU core, while Pandarallel lets you profit from all the cores of a multi-core computer. Pandarallel also offers progress bars, available both in notebooks and in the terminal, to get a rough idea of how much computation remains.
The library can be used on any computer running Linux or macOS, while on Windows there are small quirks: because of how multiprocessing works there, the function passed to Pandarallel must be self-contained and must not depend on external resources.
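A minimal usage sketch (the DataFrame and function are arbitrary examples):
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)  # one worker per available core

df = pd.DataFrame({"a": range(10000)})

# drop-in parallel replacement for df.apply
result = df.parallel_apply(lambda row: row.a ** 2, axis=1)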
https://nalepae.github.io/pandarallel/


Big Data Science

🐍Python instead of Cypher in Neo4j with Py2neo
Manipulations with data in the Neo4j graph database are performed using the SQL-like query language Cypher. It is graph-optimized: it identifies and exploits data relationships, exploring them in all directions to discover previously unseen connections and clusters. However, Python gives you a lot of flexibility in working with data, so many Data Scientists prefer it for various tasks, including automating the creation of nodes and relationships.
For this, use the Py2neo library, a Python package for working with Neo4j. It can be installed with the pip package manager via the well-known pip install py2neo command. Next, you can open your favorite Python editor and start working with the graph in Neo4j.
Py2neo includes a set of tools for working with Neo4j from Python applications and from the command line. The library supports both Bolt and HTTP and provides a high-level API, an OGM, administration tools, an interactive console, a Cypher lexer for Pygments, and many other features that come in handy for graph analysis. Starting with version 2021.1, Py2neo fully supports routing in a Neo4j cluster, which can be enabled using the neo4j://... URI or by passing the routing=True parameter to the Graph class constructor.
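A hedged sketch of connecting and creating nodes (the URI and credentials are placeholders for your own instance):
from py2neo import Graph, Node, Relationship

# connect over Bolt; use a neo4j://... URI or routing=True for a cluster
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

alice = Node("Person", name="Alice")
bob = Node("Person", name="Bob")
graph.create(Relationship(alice, "KNOWS", bob))

# Cypher is still available whenever you need it
print(graph.run("MATCH (p:Person) RETURN p.name").data())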
https://py2neo.org/2021.1/


Big Data Science

🚀Python 3.11.0: Major New Features for the Developer
On October 24, 2022, a new version of Python 3.11.0 was released. It provides the following new features and bug fixes:
• fixed multiplication of a list by an integer (list *= int), which could cause an integer overflow when the newly allocated length was close to the maximum size;
• changed the forkserver start method code: on Linux the multiprocessing module again uses filesystem-backed Unix domain sockets for the forkserver process instead of the Linux abstract socket namespace. Abstract sockets have no permissions and could be used by anyone in the same network namespace (frequently the entire system) to inject code into the forkserver process. This was a significant privilege escalation vulnerability and has been fixed in the new version of Python;
• fixed an issue where frame objects could end up sharing the same interpreter frame, resulting in memory corruption and hard interpreter crashes;
• fixed possible data races when accessing the f_back member of newly created generator or coroutine frames;
• fixed an interpreter crash when calling PyEval_GetFrame() while the topmost Python frame is in a partially initialized state;
• fixed command line parsing: an invalid -X int_max_str_digits option is now rejected even when the PYTHONINTMAXSTRDIGITS environment variable is set to a valid value;
• fixed undefined behavior in _testcapimodule.c;
• updated the bundled pip and setuptools to versions 22.3 and 65.5.0 respectively;
• the asyncio.Task.cancel("message") method, previously deprecated, now works again and is no longer deprecated;
• semaphores now work faster, which is important for concurrent programs;
• fixed Flag to use the CONFORM boundary, allowing flags to be combined with unknown values;
• on Windows, when running Python tests with the -jN option, the temporary stdout files now use ANSI encoding instead of UTF-8, avoiding decoding errors;
• fixed a bug due to which multiprocessing run from a virtual environment spawned unneeded child processes on Windows;
• fixed the py.exe launcher's handling of the -V:<company>/ parameter when defaults were set in environment variables or a configuration file;
• the macOS 13 SDK includes support for the mkfifoat and mknodat system calls. Previously, using the dir_fd option with os.mkfifo() or os.mknod() could result in a segfault if CPython was built with the macOS 13 SDK but run on an earlier version of macOS.

https://www.python.org/downloads/release/python-3110/
https://docs.python.org/release/3.11.0/whatsnew/changelog.html#python-3-11-0-final


Big Data Science

🌎TOP-10 DS-events all over the world in March:
Mar 6-7 • REINFORCE AI CONFERENCE: International AI and ML Hybrid Conference • Budapest, Hungary https://reinforceconf.com/2023
Mar 10-12 • International Conference on Machine Vision and Applications (ICMVA) • Singapore http://icmva.org/
Mar 13-16 • International Conference on Human-Robot Interaction (ACM/IEEE) • Stockholm, Sweden https://humanrobotinteraction.org/2023/
Mar 14 • Quant Strats • New York, USA https://www.alphaevents.com/events-quantstratsus
Mar 20-23 • Gartner Data & Analytics Summit • Orlando, USA https://www.gartner.com/en/conferences/na/data-analytics-us
Mar 20-23 • NVIDIA GTC • Online https://www.nvidia.com/gtc/
Mar 24-26 • 5th ICNLP Conference • Guangzhou, China http://www.icnlp.net/
Mar 27-28 • Data & Analytics in Healthcare • Melbourne, Australia https://datainhealthcare.coriniumintelligence.com/
Mar 27-31 • Annual Conference on Intelligent User Interfaces (IUI) • Sydney, Australia https://iui.acm.org/2023/
Mar 30 • MLCONF • New York, USA https://mlconf.com/event/mlconf-new-york-city/


Big Data Science

⛑Top 7 Medical DS Startups in 2022
SWORD Health is a physical therapy and rehabilitation service built around a range of wearable devices that read physiological indicators signaling pain, allowing large amounts of data to be analyzed to offer more effective treatment and to adjust movements so as to eliminate pain.
Cala Health offers what is currently the only prescription non-invasive treatment for essential tremor; it is based on tremor data measured by wearable devices, which can also deliver personalized peripheral nerve stimulation based on that data.
AppliedVR is a platform for treating chronic pain that builds a library of pain-related data, enabling immersive therapy through VR.
Digital Diagnostics offers the first FDA (Food and Drug Administration)-approved standalone AI that diagnoses eye diseases caused by diabetes from retinal imagery data without the participation of a doctor.
Iterative Health is a service for automating the conduct and analysis of endoscopy. The technology is based on interpreting endoscopic image data, helping clinicians better evaluate patients with potential gastrointestinal problems.
Viz.ai is a service for intelligent coordination of medical care in radiology. The platform analyzes data from CT scans of the brain in order to find blockages in large vessels. The system transmits the results to a specialist in neurovascular diseases to ensure therapeutic intervention at an early stage, and it produces these results in just a few minutes, providing a quick response.
Unlearn is a startup that offers a platform to accelerate clinical trials using artificial intelligence, digital twins, and various statistical methods. The service processes historical clinical trial datasets from patients to create "disease-specific" machine learning models, which in turn can be used to create digital twins with corresponding virtual medical records.


Big Data Science

🔥Processing data with Elastic Stack
Elastic Stack is a vast ecosystem of components used to search and process big data. This ecosystem is a JSON-based distributed system that combines the features of a NoSQL database. The Elastic Stack is built from components such as:
Elasticsearch is a large, fast, and highly scalable non-relational data store that has become a great tool for log search and analytics due to its power, simplicity, schemaless JSON documents, multilingual support, and geolocation. The system can quickly process large volumes of logs, index system logs as they arrive, and query them in real time. Performing operations in Elasticsearch, such as reading or writing data, usually takes less than a second, which makes it suitable for use cases where you need to react almost in real time, such as monitoring applications and detecting any anomalies.
Logstash is a utility that helps centralize event-related data, such as information from log files (logs), various metrics, or any other data in any format. It can process the data before forming the sample you need. It is a key component of the Elastic Stack used to collect and process your data. Logstash is a server-side component. Its main purpose is to collect data from a wide range of input sources in a scalable way, process the information, and send it on to the destination. By default, the converted information goes to Elasticsearch, but you can choose from many other output options. Logstash's architecture is plugin-based and easily extensible; three types of plugins are supported: input, filter, and output.
Kibana is an Elastic Stack visualization tool that helps visualize data in Elasticsearch. Kibana offers a variety of visualization options such as histogram, map, line graphs, time series, and more. Kibana allows you to create visualizations with just a couple of mouse clicks and explore your data in an interactive way. In addition, it is possible to create beautiful dashboards consisting of various visualizations, share them, and also receive high-quality reports.
Beats is an open source data delivery platform that complements Logstash. Unlike Logstash, which runs on the server side, Beats runs on the client side. At the heart of this platform is the libbeat library, which provides an API for passing data from a source, configuring input, and implementing data collection. Beats is installed on devices that are not part of the server-side components such as Elasticsearch, Logstash, or Kibana; they sit on non-clustered nodes, which are also sometimes referred to as edge nodes.
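As a small illustration of the near-real-time read/write path described above, here is a hedged sketch using the official elasticsearch Python client (the host, index name, and document are placeholder assumptions):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # a local single-node instance

# index a schemaless JSON document into the "logs" index
es.index(index="logs", document={"level": "ERROR", "msg": "disk full"})
es.indices.refresh(index="logs")              # make it searchable immediately

# full-text search, typically answered in well under a second
hits = es.search(index="logs", query={"match": {"msg": "disk"}})
print(hits["hits"]["total"])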
You can download the elements of the Elastic Stack from the following link: https://www.elastic.co/elastic-stack/


Big Data Science

💥Top 5 Reasons to Use Apache Spark for Big Data Processing
Apache Spark is a popular open source Big Data framework for processing large amounts of data in a distributed environment. It is part of the Apache Hadoop project ecosystem. The framework is good because it has the following elements in its arsenal:
Wide API - Spark provides the developer with a fairly extensive API that allows you to work with different programming languages: Python, R, Scala, and Java. Spark also offers a dataframe abstraction with object-oriented methods for transforming, combining, and filtering data, and many other useful features.
Pretty broad functionality - Spark has a wide range of functionality due to components such as:
1. Spark SQL - a module that serves for analytical data processing using SQL queries
2. Spark Streaming - a module that provides an add-on for processing streaming data online
3. MLLib - a module that provides a set of machine learning libraries in a distributed environment
Lazy evaluation - reduces the total amount of computation and improves program performance by reducing memory requirements. This kind of evaluation is very useful because it allows you to define a complex structure of transformations represented as objects, and to check the structure of the result without performing intermediate steps. Spark also automatically checks the query or program execution plan for errors, which lets you quickly catch and debug mistakes.
Open Source - Part of the Apache Software Foundation's line of projects, Spark continues to be actively developed through the developer community. In addition, despite the fact that Spark is a free tool, it has very detailed documentation: https://spark.apache.org/documentation.html
Distributed data processing - Apache Spark provides distributed data processing through the concept of the resilient distributed dataset (RDD): a distributed data structure that resides in RAM. Each such dataset contains a fragment of the data, distributed over the nodes of the cluster. This makes it fault-tolerant: if a partition is lost due to a node failure, it can be recomputed from its original sources. Spark itself spreads the code across all nodes of the cluster, breaks it into subtasks, creates an execution plan, and monitors the success of the execution.
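A minimal PySpark sketch of the DataFrame API and lazy evaluation described above (the column names and values are arbitrary examples):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)

# transformations are lazy: nothing runs yet, Spark only builds a plan
adults = df.filter(df.age > 30).select("name")

adults.explain()   # inspect the execution plan without running it
adults.show()      # an action finally triggers distributed execution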


Big Data Science

😎Top of 6 libraries for time series analysis
A time series is an ordered sequence of points or features measured at defined time intervals that characterizes some process. Here are some popular libraries for time series processing:
Statsmodels is an open source library based on NumPy and SciPy. Statsmodels allows you to build and analyze statistical models, including time series models. It also includes statistical tests, the ability to work with big data, etc.
Sktime is an open source machine learning library in Python designed specifically for time series analysis. Sktime includes specialized machine learning algorithms and is well suited for forecasting and time series classification tasks.
tslearn - a general-purpose library for time series analysis in Python. It is based on the scikit-learn, numpy, and scipy libraries. This library offers tools for preprocessing and feature extraction, as well as dedicated models for clustering, classification, and regression.
Tsfresh - this library is great for bringing time series data into classic tabular form in order to formulate and solve classification, forecasting, and similar problems. With Tsfresh you can quickly extract a large number of time series features and then select only the necessary ones.
Merlion is an open source library designed for working with time series, mainly for forecasting and detecting collective anomalies. There is a generic interface for most models and datasets, which lets you quickly develop a model for common time series problems and test it on various data sets.
PyOD (Python Outlier Detection) is a Python library that detects point anomalies, or outliers, in data. More than 30 algorithms are implemented in PyOD, ranging from classical algorithms such as Isolation Forest to methods recently presented in scientific articles, such as COPOD and others. PyOD also allows you to combine outlier detection models into ensembles to improve the quality of the results. The library is simple and straightforward, and the examples in the documentation detail how it can be used.
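For instance, a hedged statsmodels sketch that fits an ARIMA model to a synthetic series (the data and model order are arbitrary examples):
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# synthetic monthly series: a linear trend plus noise
idx = pd.date_range("2020-01-01", periods=48, freq="M")
y = pd.Series(np.arange(48) + np.random.normal(0, 2, 48), index=idx)

model = ARIMA(y, order=(1, 1, 1)).fit()
print(model.summary())           # coefficients and diagnostics
print(model.forecast(steps=6))   # forecast six months ahead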


Big Data Science

🌎TOP-25 DS-events all over the world:
Feb 9-11 • WAICF - World Artificial Intelligence Cannes Festival • Cannes, France https://worldaicannes.com/
Feb 15-16 • Deep Learning Summit • San Francisco, USA https://ai-west-dl.re-work.co/
Mar 30 • MLconf • New York City, USA https://mlconf.com/event/mlconf-new-york-city/
Apr 26-27 • Computer Vision Summit • San Jose, USA https://computervisionsummit.com/location/cvsanjose
Apr 27-29 • SIAM International Conference on Data Mining (SDM23) • Minneapolis, USA https://www.siam.org/conferences/cm/conference/sdm23
May 01-05 • ICLR - International Conference on Learning Representations • online https://iclr.cc/
May 17-19 • World Data Summit • Amsterdam, The Netherlands https://worlddatasummit.com/
May 25-26 • The Data Science Conference • Chicago, USA https://www.thedatascienceconference.com/
Jun 14-15 • The AI Summit London • London, UK https://london.theaisummit.com/
Jun 18-22 • Machine Learning Week • Las Vegas, USA https://www.predictiveanalyticsworld.com/machinelearningweek/
Jun 19-22 • The Event For Machine Learning Technologies & Innovations • Munich, Germany https://mlconference.ai/munich/
Jul 13-14 • DELTA - International Conference on Deep Learning Theory and Applications • Rome, Italy https://delta.scitevents.org/
Jul 23-29 • ICML - International Conference on Machine Learning • Honolulu, Hawai’i https://icml.cc/
Aug 06-10 • KDD - Knowledge Discovery and Data Mining • Long Beach, USA https://kdd.org/kdd2023/
Sep 18-22 • RecSys – ACM Conference on Recommender Systems • Singapore, Singapore https://recsys.acm.org/recsys23/
Oct 11-12 • Enterprise AI Summit • Berlin, Germany https://berlin-enterprise-ai.re-work.co/
Oct 16-20 • AI Everything 2023 Summit • Dubai, UAE https://ai-everything.com/home
Oct 18-19 • AI in Healthcare Summit • Boston, USA https://boston-ai-healthcare.re-work.co/
Oct 23-25 • Marketing Analytics & Data Science (MADS) Conference • Denver, USA https://informaconnect.com/marketing-analytics-data-science/
Oct 24-25 • Data2030 Summit 2023 • Stockholm, Sweden https://data2030summit.com/
Nov 01-02 • Deep Learning Summit • Montreal, Canada https://montreal-dl.re-work.co/
Dec 06-07 • The AI Summit New York • New York, USA https://newyork.theaisummit.com/
Nov • Data Science Conference • Belgrade, Serbia • https://datasciconference.com/
Dec • NeurIPS • https://nips.cc/
Dec • Data Science Summit • Warsaw, Poland • https://dssconf.pl/


Big Data Science

🤔TOP 4 risks of embeddings in ML
Embedding models are widely used in machine learning to translate raw input data into a low-dimensional vector that captures its semantic meaning and can be fed into various downstream models. Pretrained embeddings are used as feature extractors to build features from various input types (text, image, audio, video, multimodal) or from categorical features with high cardinality.
The main risks of embeddings are:
• Tight coupling - changing the output of the upstream embedding model affects the performance of the downstream model. Downstream models that rely on the output of the embedding model have to be retrained.
• Hidden feedback loops - pre-trained embedding features are often used as a black box, but knowing what raw input the embedding was trained on is very important for the quality of the model and its interpretability.
• High costs of real-time serving (storage and maintenance) - this directly affects the return on investment in ML. It is important to account for embedding-service outages, the cost of training, and the cost of serving each request.
• High cost of debugging - debugging and monitoring embedding features or finding the root causes of failures is very expensive. Embedding features that are not of great importance to the model should therefore be abandoned.
https://medium.com/better-ml/embeddings-the-high-interest-credit-card-of-feature-engineering-414c00cb82e1


Big Data Science

🚀Speed up the quality and runtime of your Python code with Refurb
Every Data Scientist knows that Python is an interpreted language. Interpreted code is always slower than code compiled to machine instructions, because each interpreted instruction takes much longer to execute. The way you write your Python code therefore greatly affects its speed: good code structure and idiomatic use of the language will make Python code faster. The Refurb library helps improve the quality of Python code; it can suggest how to upgrade or change your code with a single command. The library is inspired by clippy, Rust's built-in linter.
Just install it via the pip package manager:
pip install refurb
and then run it on your code to get hints for improving the quality and speed of your Python scripts.
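Typical usage is simply pointing Refurb at a file or package (the file name here is a placeholder):
refurb my_script.py
# Refurb prints suggestions tagged with FURB check codes
# and the file/line they apply to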
https://github.com/dosisod/refurb


Big Data Science

💥3 Useful Python Libraries for Data Science and More
Let’s introduce 3 libraries that can be useful in Data Science tasks:
• Fabric is a high-level Python library (2.7, 3.4+) for executing shell commands remotely over SSH to get useful Python objects. It is based on the Invoke API (execution of subprocess commands and command line functions) and Paramiko (an implementation of the SSH protocol). https://github.com/fabric/fabric
• TextDistance is a library for comparing the distance between two or more sequences using over 30 algorithms. Useful in NLP tasks to determine the distance and similarity between sequences. https://github.com/life4/textdistance
• Watchdog - Python API (3.6+) and shell utilities for monitoring file system events. Useful for monitoring directories specified as command line arguments, allows you to log generated events. https://github.com/gorakhargosh/watchdog
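For example, a quick TextDistance sketch (the sample strings are arbitrary):
import textdistance

# edit-based distance between two strings
print(textdistance.hamming("test", "text"))  # 1
# normalized similarity in the range [0, 1]
print(textdistance.hamming.normalized_similarity("test", "text"))  # 0.75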


Big Data Science

On December 3, Sberbank is holding a One Day Offer for Data Scientists, Data Analysts, and Data Engineers. Pass all the selection stages in one day and get an offer from the largest bank in the country!

👨‍🎓We are looking for specialists in the fields of AI, ML, RecSys, CV, and NLP.

Our team creates information products for decision-making based on data, analytics, machine learning, and artificial intelligence.

👉 You will have to:
- Solve classification, regression, and uplift-modeling tasks;
- Support the rollout of models to production (PROM);
- Analyze and monitor the quality of models;
- Calculate CLTV and unit economics;
- Interact with the validation and finance departments on issues of model quality and financial results.

Data on more than 1 billion transactions daily, 75PB of information, 100TB of memory and over 7,200 CPU cores in sandboxes will be available for work.

Become a part of the bank's AI community!

✍️ Send a request for participation


Big Data Science

👍🏻Need beautiful graphs in Jupyter Notebook? It’s easy with PivotTable.js!
PivotTable.js is an open source JavaScript implementation of pivot tables with drag-and-drop functionality. The project is distributed under the MIT license, and its Python wrapper is installed through the pip package manager:
pip install pivottablejs
The library allows you to conveniently and quickly visualize the statistics of a dataset by simply dragging and dropping the necessary fields.
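In a Jupyter Notebook the whole workflow is a single call (df here is assumed to be a pandas DataFrame you already have):
from pivottablejs import pivot_ui

# opens an interactive drag-and-drop pivot table right in the notebook
pivot_ui(df)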
https://pypi.org/project/pivottablejs/


Big Data Science

🤷‍♀️Not all missing data is equal
Missing data is a problem that often comes up in Data Science and Machine Learning. There are many reasons why data may be missing, depending on the type of data and the collection methods. But not all missing data is the same; it can be broken down into the following categories:
• Missing due to the impossibility of their collection or the high cost of this procedure
• Structurally missing data that cannot be obtained due to the characteristics of the research objects;
• Missing due to random failure;
• Missing due to non-obvious or unknown reasons.
Depending on the reason for the missing data in a dataset, you can smooth over or even eliminate the problem. For example, structurally missing data suggests changing the structure of the questions about the object of study in order to get answers to them. Instead of collecting rare or expensive data, you can choose proxy metrics that still allow you to make a decision. Constantly recurring failures need to be addressed, and more needs to be learned about the reasons why expected data is missing.
https://medium.com/@nahmed3536/types-of-missing-data-e718e6ac2a55


Big Data Science

👀TOP 5 Open Source Data lineage Tools
Data lineage allows a company to comply with regulatory requirements, better understand and trust its data, and save time on manually analyzing the effects of data changes. There are many tools for tracking data lineage on the market today, both open source and proprietary. Among the open source tools, the most notable are:
Tokern lets users capture column-level data lineage from databases and data warehouses hosted on Google BigQuery, AWS Redshift, and Snowflake. Tokern integrates with other tools as well: it works well with a number of open source data catalogs and ETL frameworks, and it can also build lineage from external sources or query history, providing an overview for popular BI and ETL tools. Tokern uses PostgreSQL as its local storage and NetworkX for quick analysis of lineage graphs, so column-level lineage data can be manipulated, visualized, and analyzed as a graph. Additionally, users can interact with lineage data through the Tokern SDK or API. Tokern can also discover PII (personally identifiable information) and PHI (personal health information) using PIICatcher, which combines regular expressions with the Spacy and Stanford NER NLP libraries.
Egeria is an open source metadata standard enabling seamless integration of data processing tools for a robust and unified representation of metadata. Egeria allows you to build better solutions for data lineage, data quality checks, PII identification, and more, in addition to cataloging and metadata retrieval. Egeria builds on the OpenLineage standard for collecting and storing lineage data, which gives users a more complete view of the data through horizontal and vertical lineage and tracing. To receive information about the origin of data, Egeria listens to Kafka events sent by other systems.
Pachyderm is a data lineage tool that empowers developers to create machine learning pipelines regardless of language and framework on top of cloud object storage. It uses a version control model similar to LakeFS or Git: it stores changes as commits, keeping a complete and unaltered audit trail. Pachyderm uses a central repository based on object storage with its custom Pachyderm File System to track data provenance and versions. Pachyderm tracks user data sources using global identifiers for lineage events and data objects. Pachyderm lets you view an immutable data lineage graph as a DAG in the user interface, which is especially useful when working with ML pipelines. Pachyderm integrates well with many databases, repositories, and data lakes, which is why many companies use it for MLOps, ETL on unstructured data, and NLP workflows.
OpenLineage is a Linux Foundation project that standardizes lineage collection across popular ETL platforms, data orchestration engines, metadata catalogs, data quality frameworks, and data lineage tools. OpenLineage uses JSONSchema as its API definition and supports multiple languages and platforms. The previously mentioned Egeria has a core metadata layer built on top of OpenLineage. WeWork's Marquez also builds on the OpenLineage architecture, providing a reference UI and metadata repository, as well as an API for collecting metadata, querying via GraphQL, and a REST API.
TrueDat is a comprehensive data management solution that allows you to efficiently classify, search and evaluate data, as well as visualize the entire data lifecycle. The tool was created in 2017 by BlueTab, which is part of IBM, and is still actively developed.
https://blog.devgenius.io/5-best-open-source-data-lineage-tools-in-2022-f8ef39a7d5f6


Big Data Science

🌞Easy prediction with Lazy Predict
Lazy Predict is a Python library for comparing the performance of different machine learning models on a dataset. In fact, it is a wrapper that lets you quickly fit many ML models to a dataset and compare their performance in just a couple of lines of code. Lazy Predict works great for regression and classification, comparing model results by the most important metrics: R-Squared, RMSE, Accuracy, F1 Score, and training time.
The library is open source and highly compatible with sklearn, numpy, and other Python libraries that are used in Data Science.
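A hedged sketch of the classification workflow (using a scikit-learn toy dataset as example input):
from lazypredict.Supervised import LazyClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# fits dozens of sklearn classifiers and ranks them by the metrics above
clf = LazyClassifier(verbose=0, ignore_warnings=True)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
print(models)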
https://lazypredict.readthedocs.io/en/stable/readme.html


Big Data Science

🌨🍂Cool DS-events in November 2022 all over the world
Nov 1 • StreamSets Roadshow: SF Bay Area • San Mateo, CA, USA https://streamsets.com/roadshow/san-francisco/
Nov 1-3 • ODSC West 2022 • San Francisco, CA, USA https://odsc.com/california
Nov 2 • Modern Data Stack Roadshow • Denver, CO, USA https://www.moderndatastackroadshow.com/
Nov 2 • AiX Summit 2022 • San Francisco, CA, USA + Virtual https://odsc.com/california/aix-west/
Nov 2-3 • The AI Summit Austin • Austin, TX, USA https://austin.appliedintelligence.live/welcome
Nov 3 • StreamSets Roadshow: New York • New York, NY, USA https://streamsets.com/roadshow/new-york/
Nov 8 • The Data Science Symposium 2022, Presented by the UC Center for Business Analytics • Cincinnati, OH, USA https://web.cvent.com/event/36b0c2f9-5cfa-4f4c-875d-66c7f1ee2ed5/summary
Nov 8 • iMerit ML DataOps Summit 2022 • Virtual https://techcrunch.com/events/imerit-dataops-summit-2022/
Nov 9-10 • RE•WORK MLOps Summit • London, UK • https://london-ml-ops.re-work.co/
Nov 9-10 • Deep Learning Summit • Toronto, ON, Canada https://toronto-dl.re-work.co/
Nov 10 • Unleashing Hybrid and Multi-Cloud Data Science at Scale • Virtual https://www.dominodatalab.com/resources/unleashing-hybrid-multi-cloud-mlops
Nov 14-16 • Marketing Analytics & Data Science (MADS) Conference • San Antonio, TX, USA • https://informaconnect.com/marketing-analytics-data-science/
Nov 15-16 • Snowflake Build ‘22 • Virtual https://www.snowflake.com/build/
Nov 16-17 • 3rd Edition of Future Data Centres and Cloud Infrastructures • Dubai, UAE https://www.futuredatacentre.com/
Nov 16-17 • Smart Data Summit Plus 2022 • Dubai, UAE https://www.bigdata-me.com/
Nov 17-18 • Data Science Summit 2022 • Warsaw, Poland + Virtual https://dssconf.pl/
Nov 22 • trade/off: The Decision Intelligence Summit • Berlin, Germany https://www.tradeoff.ai/
Nov 22-23 • Nordic Data Science and Machine Learning Summit • Stockholm, Sweden + Virtual https://ndsmlsummit.com/
