About the project

Posted by Ilya on July 27, 2024


Website
my-handicapped-pet.io
Source
FE | BE | CI
Status
Fetal (Proof of Concept)

History

A few years ago I realized that everyone who has eyes is now obsessed with data science, neural networks, machine learning, or some other flavour of this very matter. Once I had a gap in my high-paid and important job of dubbing an office chair, burning my eyes under fluorescent lights and listening to the mumbling of the bodies sitting nearby, I decided that it was finally time to get somewhat familiar with the subject.

I took a course on Google Developers, which was absolutely awful. It was positioned as a 101 for the whole AI/NN domain, but it was mostly dedicated to linear regression. It offered an analysis of the California real estate market, and the tasks were to tinker with code in little text areas spread across the page and sharing a common context. I dropped it when I realized that it was mostly about fighting their platform rather than working with data.

Some time passed. After playing with some data sources that were more interesting to me at the time than California housing, with the aid of Python libs such as numpy and plotly, I realized that if I want to learn something I should do my own project rather than chase endless courses, tutorials, articles, or books. So this project was (and is) intended as a playground for studying libs and models, trying ideas and so on.

Min/max goals

As I already mentioned, the first goal is learning for myself (and probably for other participants in the future): mastering the skills of working with data, from gathering to recognition to visualization to statistical processing, as well as supplementary technologies such as frontend and MongoDB querying.

So what is this project all about? It's meant to be Excel on steroids, where you can not only draw simple graphs, calculate a correlation or build a trend, but also do advanced things: automatically recognize data types regardless of format (e.g. I can write a number as 100k or 100.000), analyze distributions, and build graphs that, for example, calculate and show confidence intervals automatically without the need to write a formula for them, etcetera, etcetera, etcetera... So I want something more powerful than awesome-tables and easier to learn than holoviz or dash, which one can use out of the box without writing code, falling back to scripting only when there is no ready solution for their need.
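To make the confidence interval part a bit more concrete, here is a minimal sketch of the formula the app would have to hide from the user. It is not code from the project: the function name is made up for illustration, and it assumes a normal approximation for the mean (for small samples a t-distribution would be more appropriate).

```python
import math
import statistics


def mean_confidence_interval(values, confidence=0.95):
    """Return (low, high) bounds for the mean of `values`.

    Assumption: normal approximation, i.e. mean +/- z * standard error.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = statistics.fmean(values)
    # Standard error of the mean, based on the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * sem, mean + z * sem


low, high = mean_confidence_interval([10, 12, 14, 16, 18])
print(f"mean is within [{low:.2f}, {high:.2f}] at 95% confidence")
```

The point of the project is that a user would get the shaded band on the graph without ever seeing this function.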

Another point is about the data itself. There is a long-lasting movement around open source, but for now I think open data is much more important. And I am not talking about scientific data but about mundane data. I moaned about information asymmetry in my recent post. From online shops to job listings to dating, every platform, if not a direct scam, shows you what it wants to show, not what you want to find. For example, if you want to buy a new laptop you can filter listings by vendor or screen diagonal, but hardly by screen reflection rate or by the height of the blue spike compared to overall brightness. Having grabbed data from the shop's site into a separate data source, you can, on the other hand, run whatever queries you want. Having many such data sources, from the vendor's site and from different shop sites, we can merge different instances of this (not really big) data and get some meaningful information. So a decent number of open data sources can help a lot.
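A toy sketch of the laptop example, just to show the idea. Everything here is hypothetical: the records, the field names (`model`, `screen_reflectance_pct`, `price`) and the helper `merge_on` are invented for illustration and say nothing about how the project stores or merges data.

```python
# Hypothetical records grabbed from a vendor site and from two shop sites.
vendor_specs = [
    {"model": "A15", "screen_reflectance_pct": 4.5},
    {"model": "B17", "screen_reflectance_pct": 9.0},
]
shop_listings = [
    {"model": "A15", "shop": "shop-one", "price": 999},
    {"model": "B17", "shop": "shop-one", "price": 899},
    {"model": "A15", "shop": "shop-two", "price": 949},
]


def merge_on(key, left, right):
    """Inner-join two lists of dicts on a shared key."""
    index = {}
    for row in left:
        index.setdefault(row[key], []).append(row)
    # Each right row is combined with every left row that has the same key.
    return [{**l, **r} for r in right for l in index.get(r[key], [])]


merged = merge_on("model", vendor_specs, shop_listings)

# A query no shop offers: cheapest offers among low-reflectance screens.
low_glare = sorted(
    (row for row in merged if row["screen_reflectance_pct"] < 5),
    key=lambda row: row["price"],
)
for row in low_glare:
    print(row["model"], row["shop"], row["price"])
```

Neither source alone can answer this query: the vendor knows the screen, the shops know the price.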

For now I see the following short-term things to do:

  • Add authorization. Right now all data sources are public, but someone may want to export their private data and limit access to it. Later we can think about how to manage the use of public and private data and how to incentivise people to make data public.
  • Add crawlers. There should be a framework that makes it easy to write a crawler that defines which fields of the data source are saved.
  • Add data type recognition (numbers, currency, time, etc.). Since data from different sources can come in many different formats, the app should understand as many of them as possible. After that we can do visualization and other meaningful representations.
  • Improve the UI. Make an editor to construct queries for visualization and filtering as JSON, use D3 graphs, etc.
  • Improve crawlers. Add scheduling, debugging, video recording, etc.
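For the data type recognition item, a rough sketch of what a first version could look like. It is an assumption-heavy illustration, not the project's implementation: the function names, the suffix table, the currency signs and the date formats are all made up here. Note the built-in ambiguity: a separator followed by exactly three digits is treated as a thousands separator, so "1.234" is read as 1234, not 1.234.

```python
import re
from datetime import datetime

# Hypothetical tables; a real version would need many more entries.
_SUFFIXES = {"k": 1_000, "m": 1_000_000}
_CURRENCY_SIGNS = "$€£"
_DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y")


def parse_number(text):
    """Return a float for strings like '100k', '100.000' or '1,5', else None."""
    s = text.strip().lower().replace(" ", "")
    multiplier = 1
    if s and s[-1] in _SUFFIXES:
        multiplier = _SUFFIXES[s[-1]]
        s = s[:-1]
    # Assumption: separators followed by exactly three digits group thousands.
    if re.fullmatch(r"[+-]?\d{1,3}([.,]\d{3})+", s):
        s = re.sub(r"[.,]", "", s)
    # Otherwise a single separator is read as the decimal mark.
    elif re.fullmatch(r"[+-]?\d+[.,]\d+", s):
        s = s.replace(",", ".")
    elif not re.fullmatch(r"[+-]?\d+", s):
        return None
    return float(s) * multiplier


def recognize(text):
    """Return a (type_name, value) pair for one raw cell."""
    s = text.strip()
    if s and (s[0] in _CURRENCY_SIGNS or s[-1] in _CURRENCY_SIGNS):
        amount = parse_number(s.strip(_CURRENCY_SIGNS))
        if amount is not None:
            return "currency", amount
    number = parse_number(s)
    if number is not None:
        return "number", number
    for fmt in _DATE_FORMATS:
        try:
            return "time", datetime.strptime(s, fmt)
        except ValueError:
            continue
    # Nothing matched: keep the cell as plain text.
    return "text", s


for cell in ["100k", "100.000", "$1,200", "27.07.2024", "hello"]:
    print(cell, "->", recognize(cell))
```

In practice the type would probably be decided per column rather than per cell, so that one odd value does not flip the interpretation of the whole field.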

Collaboration

Highly welcomed. If you want to participate, just leave a comment on this post or ping me on reddit.


project Aquarius era
